A session processing method, device, electronic equipment and readable storage medium (HK40083865A)
Description
Technical Field
The present application relates to internet technologies, and in particular, to a session processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of internet technology, instant messaging has become more and more widely used. Instant messaging provides an internet-based real-time communication service that allows two or more people to exchange text, voice, video, and the like over a network in real time. As instant messaging applications have developed, they have penetrated people's daily lives, and more and more people use them to communicate.
During a conversation through an instant messaging application, when the user's environment is poor, for example in a public scene with heavy traffic, dim light, and much noise, it is difficult to send video information that suits the scene. In such a case, the user has to send messages as text or voice, and it is difficult for a text message or a voice message to convey information such as the real conversation scene, which results in low conversation efficiency.
Disclosure of Invention
Embodiments of the present application provide a session processing method and apparatus, an electronic device, and a computer-readable storage medium, which can fully and effectively display a session scene and improve session efficiency.
The technical solutions of the embodiments of the present application are implemented as follows:
an embodiment of the present application provides a session processing method, including:
presenting a session editing area;
responding to an input operation based on the session editing area, and acquiring session content formed by the input operation;
and presenting, in response to a sending operation for the conversation content, a special effect generated based on the avatar of the target object in the conversation area, wherein the special effect is used for representing the conversation content.
In the above technical solution, before the special effect generated based on the avatar of the target object is presented in the session area, the method further includes:
determining special effect data for representing the conversation content based on the virtual image of the target object and the conversation content;
determining voice data conforming to the target object sound based on the conversation content;
and synthesizing the special effect data and the voice data to obtain a special effect generated based on the virtual image of the target object.
In the above technical solution, before determining special effect data for characterizing the session content, the method further includes:
acquiring a real image of the target object;
and calling an avatar generation model based on the real image to obtain the avatar of the target object.
In the above technical solution, the invoking an avatar generation model based on the real image to obtain the avatar of the target object includes:
performing the following processing by the avatar generation model:
performing principal component analysis processing on the real image to obtain geometric information distribution and texture information distribution corresponding to the target object;
and carrying out deformation processing based on the geometric information distribution and the texture information distribution to obtain the virtual image of the target object.
In the above technical solution, the determining the voice data corresponding to the target object sound based on the session content includes:
performing text analysis processing on text information corresponding to the conversation content to obtain context characteristics of the conversation content;
performing voice parameter prediction processing on the context characteristics of the conversation content based on the target object sound to obtain a plurality of voice parameters which correspond to the conversation content and are in line with the target object sound;
and synthesizing the plurality of voice parameters to obtain voice data which accords with the target object sound.
In the above technical solution, when the conversation content is a text message, the determining special effect data for characterizing the conversation content based on the avatar of the target object and the conversation content includes:
performing text feature extraction processing on the text message to obtain text features of the text message;
matching a plurality of candidate special effect data contained in a database based on the text features, and taking the matched candidate special effect data as special effect data for representing the conversation content;
wherein the text features include at least one of: characters, symbols, expression pictures.
In the above technical solution, when the conversation content is a voice message, the determining special effect data for characterizing the conversation content based on the avatar of the target object and the conversation content includes:
carrying out format conversion processing on the voice message to obtain a text message corresponding to the voice message;
performing text feature extraction processing on the text message to obtain text features of the text message;
matching a plurality of candidate special effect data contained in a database based on the text features, and taking the matched candidate special effect data as special effect data for representing the conversation content.
An embodiment of the present application provides a session processing apparatus, including:
the first display module is used for presenting a session editing area;
the acquisition module is used for responding to the input operation based on the session editing area and acquiring the session content formed by the input operation;
and the second display module is used for responding to the sending operation aiming at the conversation content and presenting a special effect generated based on the virtual image of the target object in the conversation area, wherein the special effect is used for representing the conversation content.
An embodiment of the present application provides an electronic device for session processing, where the electronic device includes:
a memory for storing executable instructions;
and the processor is used for realizing the session processing method provided by the embodiment of the application when the executable instructions stored in the memory are executed.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions for causing a processor to implement the session processing method provided by the embodiment of the present application when executed.
The embodiment of the application has the following beneficial effects:
By presenting, in the conversation area, a special effect that is generated based on the avatar of the target object and represents the conversation content, the conversation content is displayed fully and effectively. Compared with a scheme that directly displays text messages or voice messages in the conversation area, conversation interaction efficiency and the conversation effect are improved, which in turn saves related communication resources and computing resources.
Drawings
Fig. 1 is a schematic architecture diagram of a session system provided in an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an electronic device for session processing provided in an embodiment of the present application;
Fig. 3A is a schematic flowchart of a session processing method provided in an embodiment of the present application;
Fig. 3B is a schematic diagram of a session editing area provided in an embodiment of the present application;
Figs. 3C-3D are schematic diagrams of the sending operation provided in embodiments of the present application;
Fig. 4A is a schematic diagram of avatar generation provided in an embodiment of the present application;
Fig. 4B is a schematic diagram of sound collection provided in an embodiment of the present application;
Figs. 5-6 are schematic diagrams of text publishing provided in embodiments of the present application;
Fig. 7 is a schematic diagram of voice publishing provided in an embodiment of the present application;
Figs. 8A-8D are schematic diagrams of a mobile terminal receiving a surprise expression provided in embodiments of the present application;
Fig. 9 is a schematic diagram of a personal computer (PC) receiving a surprise expression provided in an embodiment of the present application;
Fig. 10 is a schematic diagram of a mobile terminal receiving a video message provided in an embodiment of the present application;
Fig. 11 is a schematic diagram of a personal computer (PC) receiving a video message provided in an embodiment of the present application;
Figs. 12A-12B are schematic flowcharts of a session processing method provided in embodiments of the present application;
Fig. 13 is a schematic diagram of collecting user image features and sound features provided in an embodiment of the present application;
Fig. 14 is a flowchart of a speech synthesis method provided in an embodiment of the present application;
Fig. 15 is a schematic flowchart of converting speech into text provided in an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, the terms "first", "second", and the like are only used to distinguish similar objects and do not denote a particular order or importance. Where permitted, the specific order or sequence may be interchanged, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions referred to in the embodiments of the present application are explained as follows.
1) In response to: used to indicate the condition or state on which a performed operation depends. When the condition or state is satisfied, the one or more operations may be performed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which the operations are performed.
2) Expression: a form of popular culture that arose as social applications became active, used to express a specific emotion, such as an emotion shown on a user's face or in a user's posture. In practical applications, expressions can be divided into symbolic expressions, static picture expressions, dynamic picture expressions, video expressions, and the like. For example, an expression may be made from a face expressing one of a user's emotions, or from material such as popular stars, cartoons, or movie screenshots, with a series of matching text added to the material.
3) Surprise expression: also called an expression special effect. The user's voice can be synthesized from text information sent by the user, and synthesized expression pictures are obtained by combining the text content with user image data recorded in advance; alternatively, a voice message sent by the user is used as the voice, and synthesized expression pictures are obtained by combining the text content corresponding to the voice with the user image data recorded in advance.
4) Video message: also called a video special effect. Audio can be synthesized from text information sent by the user, and video frames are synthesized by combining the text content with user image data recorded in advance; alternatively, a voice message sent by the user is used as the audio, and video frames are synthesized by combining the text content corresponding to the voice with the user image data recorded in advance.
5) Avatar (virtual image): a figure of any person or object that can interact in a virtual scene, such as a virtual character, a virtual animal, or a cartoon character. An avatar may be a virtual figure that represents the user in the virtual scene.
The embodiment of the application provides a session processing method and device, electronic equipment and a computer readable storage medium, which can fully and effectively display a session scene and improve session efficiency.
The session processing method provided by the embodiments of the present application can be implemented independently by a terminal, or implemented cooperatively by a terminal and a server. For example, the terminal alone performs the session processing method described below; or the terminal of the session sender sends the input session content to the server, and the server generates, according to the received session content, a special effect that is based on the avatar of the target object (e.g., a real user or a virtual object) and represents the session content, and sends the special effect to the terminal of the session receiver, so that the special effect is displayed in the session area of the session receiver. This conveys information conveniently, makes the session more interesting, and improves session stickiness.
An exemplary application of the electronic device provided by the embodiment of the present application is described below, and the electronic device provided by the embodiment of the present application may be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, a smart television, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device, and an in-vehicle device). In the following, an exemplary application will be explained when the electronic device is implemented as a terminal.
Referring to fig. 1, fig. 1 is a schematic diagram of an architecture of a session system 100 provided in an embodiment of the present application, and terminals (illustratively, a terminal 200-1 and a terminal 200-2) are connected to a server 100 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of both.
In some embodiments, taking the electronic device as a terminal as an example, the session processing method provided in the embodiments of the present application may be implemented by the terminal. For example, the terminal 200-1 presents a conversation editing area, acquires, in response to an input operation in the conversation editing area, the conversation content formed by the input, presents, in response to a sending operation for the conversation content, a special effect generated based on the avatar of the target object and representing the conversation content in the conversation area, and transmits the special effect to the terminal 200-2 so that it is displayed in the conversation area of the terminal 200-2. This conveys information conveniently, makes the conversation more interesting, and improves conversation stickiness.
In some embodiments, the session processing method provided by the embodiments of the present application may also be implemented cooperatively by a server and a terminal. For example, the terminal 200-1 presents a session editing area, acquires, in response to an input operation in the session editing area, the session content formed by the input, and sends the session content to the server 100; the server 100 generates, according to the received session content, a special effect based on the avatar of the target object and representing the session content, and sends the special effect to the terminal 200-1 and the terminal 200-2, so that the special effect is displayed in the session areas of both terminals. This conveys information conveniently, makes the session more interesting, and improves session stickiness.
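As a rough illustration of this cooperative flow, the following Python sketch models a sender terminal handing session content to a server, which generates a special effect and delivers it to both conversation areas. All names (SpecialEffect, SessionServer, Terminal, display_in_session_area) are assumptions made for illustration, not part of the described implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SpecialEffect:
    frames: List[bytes]   # expression pictures or video frames based on the avatar
    audio: bytes = b""    # voice data conforming to the target object's sound

class Terminal:
    """Hypothetical terminal with a session area that records presented effects."""
    def __init__(self, name: str):
        self.name = name
        self.session_area: List[SpecialEffect] = []

    def display_in_session_area(self, effect: SpecialEffect) -> None:
        self.session_area.append(effect)
        print(f"{self.name}: presenting effect with {len(effect.frames)} frame(s)")

class SessionServer:
    """Hypothetical server: builds a special effect from the received content."""
    def __init__(self, effect_generator):
        # effect_generator(sender_id, content) -> SpecialEffect, built from the
        # sender's avatar and the session content (details omitted here).
        self.effect_generator = effect_generator

    def handle_content(self, sender_id: str, receiver_id: str,
                       content: str, terminals: Dict[str, Terminal]) -> None:
        effect = self.effect_generator(sender_id, content)
        # Deliver the effect to both conversation areas (sender and receiver).
        for tid in (sender_id, receiver_id):
            terminals[tid].display_in_session_area(effect)
```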
In some embodiments, the terminal or the server may implement the session processing method provided by the embodiments of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; a native application (APP), i.e., a program that needs to be installed in the operating system to run, such as an instant messaging application; a mini program, i.e., a program that only needs to be downloaded into a browser environment to run; or a mini program that can be embedded into any APP. In general, the computer program may be any form of application, module, or plug-in.
In some embodiments, the server 100 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform, where the cloud service may be a live broadcast processing service for a terminal to call.
In some embodiments, multiple servers may be grouped into a blockchain, and the server 100 is a node on the blockchain, and there may be an information connection between each node in the blockchain, and information transmission between the nodes may be performed through the information connection. Data (e.g., logic and special effects of session processing) related to the session processing method provided by the embodiment of the present application may be stored in the blockchain.
The structure of the electronic device for session processing provided in the embodiment of the present application is described below, and referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 for session processing provided in the embodiment of the present application. Taking the electronic device 500 as an example of a terminal, the electronic device 500 for session processing shown in fig. 2 includes: at least one processor 510, memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 2.
The processor 510 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating to other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;
in some embodiments, the session processing apparatus provided in the embodiments of the present application may be implemented in software, and the session processing apparatus provided in the embodiments of the present application may be provided in various software embodiments, including various forms of applications, software modules, scripts, or codes.
Fig. 2 shows a session processing apparatus 555 stored in the memory 550, which may be software in the form of a program, a plug-in, or the like, and includes a series of modules: a first display module 5551, an acquisition module 5552, a second display module 5553, and a processing module 5554. These modules are logical, and thus may be arbitrarily combined or further divided according to the functions to be implemented. The functions of the respective modules are described below.
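Purely for orientation, a minimal Python sketch of how these logical modules might map onto an interface is shown below; the class and method names are hypothetical and the bodies are placeholders.

```python
class SessionProcessingApparatus:
    """Hypothetical mapping of the logical modules 5551-5554 onto methods."""

    def present_session_editing_area(self):               # first display module 5551
        ...

    def acquire_session_content(self, input_operation):   # acquisition module 5552
        ...

    def present_special_effect(self, special_effect):     # second display module 5553
        ...

    def generate_special_effect(self, avatar, session_content):  # processing module 5554
        ...
```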
As described above, the session processing method provided by the embodiment of the present application can be implemented by various types of electronic devices. Referring to fig. 3A, fig. 3A is a schematic flowchart of a session processing method provided in an embodiment of the present application, and is described with reference to the steps shown in fig. 3A.
In step 101, a session editing area is presented.
As shown in Fig. 3B, during a conversation interaction, a conversation editing area 301 is presented in the human-computer interaction interface. The conversation editing area 301 is used for inputting conversation content and includes a text input box for displaying the input text, a virtual keyboard, and the like.
In step 102, in response to an input operation based on the session editing area, session content formed by the input operation is acquired.
As shown in Fig. 3B, when the conversation content is a text message, the user may input characters, expressions, and other text in the text input box, and after the input is completed, the input text message can be acquired. When the conversation content is a voice message, a voice message formed by a trigger operation may be obtained in response to the trigger operation on the recording entry in the session editing area. The trigger operation is not limited in this embodiment of the application and may be, for example, a click operation or a long-press operation.
In step 103, in response to the sending operation for the conversation content, a special effect generated based on the avatar of the target object is presented in the conversation area, wherein the special effect is used for representing the conversation content.
As shown in Fig. 4A, in response to a sending operation for the conversation content, sending of the special effect is performed. When sending succeeds, a special effect generated based on the avatar of the target object (the surprise expression 502 and the video message 504 shown in the figures) is presented in the conversation area, which is used for presenting the conversation message record. When sending of the special effect fails, prompt information indicating the failure is presented in the session interface, for example, an exclamation mark icon indicating that sending of the special effect failed.
In some embodiments, presenting a special effect generated based on the avatar of the target object in the conversation area includes: playing an expression special effect generated based on the avatar of the target object, wherein the expression special effect includes at least one expression picture that represents the conversation content based on the avatar, and voice that represents at least part of the text of the conversation content based on the sound of the target object.
For example, the special effect includes an expression special effect, where the expression special effect includes a plurality of static expression pictures that represent the conversation content based on the avatar, and the plurality of expression pictures are played in sequence to form a dynamic expression special effect. It should be noted that the expression special effect may include not only a plurality of expression pictures but also voice that represents at least part of the text of the conversation content based on the target object sound; that is, the voice conforms to the target object sound and can represent at least part of the text of the conversation content, for example, it can convey the keywords in the conversation content.
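A minimal sketch of how such a special effect could be represented as data is given below (it also covers the video special effect described later in this section); the dataclass and field names are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExpressionSpecialEffect:
    # Static expression pictures representing the conversation content based on
    # the avatar; played in sequence they form a dynamic expression special effect.
    expression_pictures: List[bytes] = field(default_factory=list)
    # Voice conforming to the target object's sound, conveying at least part of
    # the text of the conversation content (e.g., its keywords).
    voice: bytes = b""

@dataclass
class VideoSpecialEffect:
    video_frames: List[bytes] = field(default_factory=list)  # frames based on the avatar
    audio_frames: List[bytes] = field(default_factory=list)  # audio in the target object's voice
```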
In some embodiments, the expression special effect is used to represent the conversation content in at least one of the following dimensions: a keyword of the conversation content, emotion information carried by the conversation content, and the theme to which the conversation content belongs.
For example, the expression special effect may represent the conversation content by a keyword of the conversation content (e.g., a picture containing the keyword): if the keyword "baiye" appears in the conversation content, the expression special effect may represent the conversation content with the "baiye" text or with a "baiye" gesture. The expression special effect may represent the conversation content by the emotion information carried in the conversation content (such as a happy or sad emotion, i.e., a facial expression or body movement representing the emotion information): for example, a happy emotion carried in the conversation content may be represented by the avatar's facial expression or body movement. The expression special effect may also represent the conversation content by the theme to which the conversation content belongs: for example, if the theme is travel, the expression special effect represents the travel theme through a background matching the theme (e.g., scenic spots) behind the avatar, or through the avatar's facial expression and body movements that fit the theme.
In some embodiments, the expression special effect is set to a play mode, where the play mode includes: when the expression special effect is received, any one of the expression pictures is automatically displayed; when playback of the expression special effect is triggered, the plurality of expression pictures are played in sequence, and the voice is played synchronously.
As shown in Fig. 8A, after the session message receiver receives the expression special effect, any one of the expression pictures in the expression special effect 801 may be displayed in the session area of the receiver. When the play button 802 is triggered, playback of the expression special effect is triggered: the plurality of expression pictures are played in sequence, and the voice is played synchronously.
For example, after the receiving party of the session message receives the expression special effect, a plurality of expression pictures in the expression special effect can be directly switched and played in the session area of the receiving party of the session message, and voice is synchronously played.
In some embodiments, rendering a special effect in the conversation region that is generated based on the avatar of the target object includes: playing a video special effect generated based on the avatar of the target object, wherein the video special effect comprises a plurality of video frames representing the conversation content based on the avatar and a plurality of audio frames representing the text of the conversation content based on the target object sound.
For example, the special effect includes a video special effect, where the video special effect includes a plurality of video frames that represent the session content based on the avatar, and the plurality of video frames are played in sequence to form a dynamic video special effect. It should be noted that the video special effect may include not only a plurality of video frames but also a plurality of audio frames that represent at least part of the text of the conversation content based on the target object sound; that is, the audio frames conform to the target object sound and can represent at least part of the text of the conversation content, for example, they can convey the keywords in the conversation content.
It should be noted that the amount of information carried by the video special effect is greater than the amount of information carried by the expression special effect, so the dynamic effect of the video special effect is better than that of the expression special effect, but the storage space occupied by the video special effect is more than that occupied by the expression special effect.
In some embodiments, the video special effect is set to a play mode, where the play mode includes: when the video special effect is received, any one of the video frames is automatically displayed; when playback of the video special effect is triggered, the plurality of video frames are played, and the plurality of audio frames are played synchronously.
As shown in Fig. 10, after the session message receiver receives a video special effect, any one of the video frames in the video special effect 1001 may be displayed in the session area of the receiver. When the play button 1002 is triggered, playback of the video special effect is triggered: the plurality of video frames are played in sequence, and the plurality of audio frames are played synchronously.
For example, after the session message receiver receives the video special effect, the session area of the session message receiver can be directly switched to play a plurality of video frames in the video special effect, and synchronously switched to play a plurality of audio frames.
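A small Python sketch of this play mode (show a single frame on receipt, then play frames and audio in sync when playback is triggered) follows; the player interface and the assumption that the effect object exposes video_frames and audio_frames aligned one-to-one are illustrative only.

```python
class SpecialEffectPlayer:
    """Illustrative play-mode handler; display and audio backends are hypothetical."""

    def __init__(self, display, audio_out):
        self.display = display
        self.audio_out = audio_out

    def on_receive(self, effect):
        # On receipt, automatically show any single frame (here: the first), no audio.
        if effect.video_frames:
            self.display.show(effect.video_frames[0])

    def on_play_triggered(self, effect):
        # On trigger, switch-play all frames and play the audio frames in sync;
        # frames and audio are assumed to be aligned one-to-one for simplicity.
        for frame, audio in zip(effect.video_frames, effect.audio_frames):
            self.display.show(frame)
            self.audio_out.play(audio)
```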
In some embodiments, before a special effect generated based on the avatar of the target object is presented in the conversation area in response to a sending operation for the conversation content, the conversation content is presented in the conversation editing area, and a preview screen of the special effect representing the conversation content is played; a sending entry for the conversation content is presented; and in response to a trigger operation on the preview screen and completion of the preview screen playback, the special effect generated based on the avatar of the target object is presented in the conversation area. The conversation content may be a text message or a voice message.
The preview screen includes at least part of the expression pictures in the expression special effect (at least part of the voice may also be played synchronously) or at least part of the video frames in the video special effect (at least part of the audio frames may also be played synchronously). As shown in Fig. 3C, the conversation content is presented in the conversation editing area, and a preview screen 302 of the special effect representing the conversation content is played. In response to a trigger operation on the preview screen and completion of the preview playback (i.e., at least part of the expression pictures in the expression special effect, or at least part of the video frames in the video special effect, have finished playing), sending of the special effect is performed, the special effect 303 is presented in the conversation area, and the sending status is displayed synchronously (i.e., when sending of the special effect fails, prompt information indicating the failure is presented in the session interface). The trigger operation is not limited in this embodiment of the application and may be, for example, a click operation or a long-press operation.
In some embodiments, the types of special effects include an expression special effect and a video special effect. Presenting, in response to a sending operation for the conversation content, a special effect generated based on the avatar of the target object in the conversation area includes: presenting an expression special effect sending entry for the expression special effect, and in response to a trigger operation on the expression special effect sending entry, performing sending of the expression special effect and presenting the expression special effect generated based on the avatar of the target object in the conversation area; and presenting a video special effect sending entry for the video special effect, and in response to a trigger operation on the video special effect sending entry, performing sending of the video special effect and presenting the video special effect generated based on the avatar of the target object in the conversation area.
As shown in Fig. 3D, the conversation content is presented in the conversation editing area, and an expression special effect sending entry 304 for the expression special effect is presented in the conversation editing area; in response to a trigger operation on the expression special effect sending entry 304, sending of the expression special effect is performed, and the expression special effect 303 generated based on the avatar of the target object is presented in the conversation area. Similarly, a video special effect sending entry 305 for the video special effect is presented; in response to a trigger operation on the video special effect sending entry 305, sending of the video special effect is performed, and the video special effect generated based on the avatar of the target object is presented in the conversation area. In this way, the type of special effect to be sent can be selected quickly and intuitively through the secondary function entries for the special effects.
In some embodiments, presenting, in response to a sending operation for the conversation content, a special effect generated based on an avatar of the target object in the conversation area includes: presenting a sending entry for the session content; in response to a sending operation for triggering a sending entry, playing a preview screen for representing a special effect of the conversation content; in response to the preview screen playing being completed, the transmission of the special effect is performed, and the special effect generated based on the avatar of the target object is presented in the conversation area.
For example, the sending operation that triggers the sending entry is a first trigger operation, which is different from a second trigger operation used for sending the conversation content; for example, the first trigger operation is a long-press operation and the second trigger operation is a single-click operation. In response to a single-click operation on the sending entry, the conversation content is sent, the conversation content is presented in the conversation area, and the sending status is displayed synchronously; in response to a long-press operation on the sending entry, a preview screen of the special effect representing the conversation content is played, and when the preview finishes playing, sending of the special effect is performed, the special effect is presented in the conversation area, and the sending status is displayed synchronously. By multiplexing the sending entry to also play the special effect, the space occupied by function entries in the human-computer interaction interface is saved, and interaction efficiency is improved compared with searching for a secondary function entry. The specific forms of the first trigger operation and the second trigger operation are not limited in the embodiments of the present application.
In some embodiments, the types of special effects include an expression special effect and a video special effect. When the preview screen being played for the special effect representing the conversation content is from a first type of special effect, in response to a switching operation for the first type of special effect, the first type of special effect is switched to a second type of special effect, and a preview screen of the second type of special effect is played; the first type of special effect is either of the expression special effect and the video special effect, and the second type of special effect is the other one.
For example, a preview screen of an expression special effect is presented in the session area, and the expression special effect is switched to a preview screen of a video special effect in response to a switching operation, or the video special effect is switched to a preview screen of the expression special effect in response to a switching operation in the session area, so that an appropriate special effect is selected by the switching operation and transmitted.
In some embodiments, switching the first type of special effect to the second type of special effect in response to the switching operation for the first type of special effect comprises: presenting a switching entry of a preview screen for the first type of special effect; and switching the presented first type special effect to a second type special effect in response to the triggering operation aiming at the switching entrance.
As shown in Fig. 5, when the preview screen of the first type of special effect (i.e., the expression special effect) is presented, a switching entry 503 for the preview screen of the first type of special effect is also presented; in response to a trigger operation on the switching entry, the presented first type of special effect is switched to the second type of special effect (i.e., the video special effect), so that an appropriate special effect is selected through the switching entry and sent.
As an example, when the preview screen of the first type of special effect (i.e., the expression special effect) is presented, the presented first type of special effect may also be switched to the second type of special effect (i.e., the video special effect) in response to a trigger operation on the sending entry, so that an appropriate special effect is selected and sent through the sending entry. The trigger operation is not limited in this embodiment of the application and may be, for example, a click or long-press operation.
In some embodiments, the types of special effects include an expression special effect and a video special effect. Playing, in response to a sending operation that triggers the sending entry, a preview screen of the special effect representing the conversation content includes: in response to the trigger operation on the sending entry, determining a parameter corresponding to the trigger operation, and playing a preview screen of the type of special effect corresponding to that parameter; where different trigger operations include different parameters, different parameters correspond to different types of special effects, and the parameter included in the trigger operation includes at least one of: trigger time and trigger action mode.
For example, the trigger operation includes the trigger time parameter, and the types of special effects include the expression special effect and the video special effect. If the sending entry is pressed, held for a first set time, and then released, a preview screen of the expression special effect is played; if the sending entry is pressed, held for a second set time, and then released, a preview screen of the video special effect is played, where the first set time is different from the second set time.
For example, the trigger operation includes the trigger action mode parameter (e.g., sliding or double-clicking), and the types of special effects include the expression special effect and the video special effect: a preview screen of the expression special effect is played in response to a double-click operation on the sending entry, and a preview screen of the video special effect is played in response to a sliding operation on the sending entry.
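The two examples above amount to dispatching on the parameters of the trigger operation. A hedged Python sketch of such a dispatcher follows; the dictionary format, the threshold values, and the returned labels are illustrative assumptions only.

```python
def choose_preview(trigger: dict) -> str:
    """Map a trigger operation on the send entry to what should be played/sent.

    `trigger` is e.g. {"action": "press", "duration": 2.0} or {"action": "double_click"};
    the thresholds below are illustrative, not values from the application.
    """
    FIRST_SET_TIME = 2.0   # assumed hold time for the expression special effect preview
    SECOND_SET_TIME = 5.0  # assumed hold time for the video special effect preview

    action = trigger.get("action")
    if action == "double_click":
        return "expression_preview"
    if action == "slide":
        return "video_preview"
    if action == "press":
        held = trigger.get("duration", 0.0)
        if held >= SECOND_SET_TIME:
            return "video_preview"
        if held >= FIRST_SET_TIME:
            return "expression_preview"
    return "send_plain_content"  # e.g., a short tap sends the conversation content as-is
```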
In some embodiments, when playing a preview screen for characterizing the special effects of the conversation content, the method further comprises: in response to a stop transmission operation for the special effect, the preview screen of the special effect is stopped from being played, and the conversation content is presented in the conversation region.
For example, if during playback of the preview screen the user does not want to send the special effect, then in response to a stop-sending operation for the special effect, playback of the preview screen of the special effect is stopped, the conversation content is sent, the conversation content is presented in the conversation area, and the sending status is presented synchronously.
In some embodiments, in response to a stop sending operation for a special effect, stopping playing a preview screen of the special effect includes: presenting a play countdown control of a preview picture in a sending entry; and stopping playing the preview picture of the special effect in response to the triggering operation of the play countdown control.
As shown in Fig. 5, if during playback of the preview screen the user does not want to send the special effect, a play countdown control of the preview screen is presented in the sending entry; in response to a trigger operation on the play countdown control, the stop-sending operation is triggered, playback of the preview screen of the special effect is stopped, the conversation content is sent, the conversation content is presented in the conversation area, and the sending status is presented synchronously.
In some embodiments, in response to a stop sending operation for a special effect, stopping playing a preview screen of the special effect includes: and in response to the fact that the sending inlet is pressed and the pressing state is not released after the pressing state is kept for the set time, stopping playing the special preview picture.
For example, if the sending entry is kept pressed and the pressed state is still not released after a set time (for example, 5 seconds), the stop-sending operation is triggered, playback of the preview screen of the special effect is stopped, the conversation content is sent, the conversation content is presented in the conversation area, and the sending status is presented synchronously.
In some embodiments, when the conversation content is a voice message, in response to an input operation based on the conversation editing area, acquiring the conversation content formed by the input operation includes: presenting a recording entry in a session editing area; responding to the pressing of the recording entry and keeping the pressing state, and acquiring a voice message collected for voice input operation in the pressing state; presenting, in response to a transmission operation for the conversation content, a special effect generated based on an avatar of the target object in the conversation area, including: presenting a special effect sending entrance; and in response to the pressing state being released after moving to the special effect sending entrance, presenting a special effect generated based on the avatar of the target object in the conversation area.
As shown in Fig. 7, the types of special effects include the expression special effect and the video special effect, and the special effect sending entries include an expression special effect sending entry and a video special effect sending entry. In response to the pressed state being released after moving to the expression special effect sending entry 702, the expression special effect 703 generated based on the avatar of the target object is presented in the session area; in response to the pressed state being released after moving to the video special effect sending entry 704, the video special effect 705 generated based on the avatar of the target object is presented in the session area; and in response to the pressed state being released after moving to a blank position, the voice message is presented in the session area.
In some embodiments, before presenting a special effect generated based on the avatar of the target object in the conversation area, determining special effect data for characterizing the conversation content based on the avatar of the target object and the conversation content; determining voice data conforming to the target object sound based on the conversation content; and synthesizing the special effect data and the voice data to obtain a special effect generated based on the virtual image of the target object.
For example, the special effect data includes expression pictures or video frames, the voice data includes voice files or audio frames, a plurality of expression pictures for representing conversation contents are determined based on the virtual image of the target object and the conversation contents, the voice files conforming to the sound of the target object are determined based on the conversation contents, the plurality of expression pictures and the voice files are synthesized, and expression special effects generated based on the virtual image of the target object are obtained; determining a plurality of video frames for representing the conversation content based on the virtual image of the target object and the conversation content, determining a plurality of audio frames conforming to the sound of the target object based on the conversation content, and synthesizing the plurality of video frames and the plurality of audio frames to obtain a video special effect generated based on the virtual image of the target object. It should be noted that, when the session content is a voice message, the voice message can be directly used as voice data.
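The following Python sketch illustrates this synthesis step, pairing special effect data with voice data; the function names, the dictionary layout, and the truncation policy are assumptions for illustration, not the described implementation.

```python
def synthesize_expression_effect(expression_pictures, voice_data):
    """Combine expression pictures with voice data into an expression special effect.

    When the conversation content is a voice message, the voice message itself can
    be passed directly as `voice_data`, as noted above.
    """
    return {"type": "expression",
            "pictures": list(expression_pictures),
            "voice": voice_data}

def synthesize_video_effect(video_frames, audio_frames):
    """Pair video frames with audio frames into a video special effect."""
    n = min(len(video_frames), len(audio_frames))
    # Illustrative policy: truncate to the shorter sequence to keep them in sync.
    return {"type": "video",
            "frames": list(zip(video_frames[:n], audio_frames[:n]))}
```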
In some embodiments, before determining special effect data for characterizing the conversation content, a real image of the target object is acquired; and calling an avatar generation model based on the real image to obtain the avatar of the target object.
As shown in Fig. 4A, the user's personal image information (i.e., a real image) is entered in advance. The user clicks the avatar generation entry 401 to enter the product function module, which prompts that at least one photo needs to be uploaded to generate an avatar; clicks the "upload photo" button 402 to select at least one photo from an album, or takes at least one photo in real time; and clicks the "upload photo" button 403 to upload the at least one photo. The avatar generation model is then invoked based on the uploaded photo(s) to obtain the avatar 404 of the user (i.e., the avatar of the target object).
In some embodiments, invoking an avatar generation model based on the real image to obtain an avatar of the target object, comprises: performing the following processing by the avatar generation model: performing principal component analysis processing on the real image to obtain geometric information distribution and texture information distribution of the corresponding target object; and performing deformation processing based on the geometric information distribution and the texture information distribution to obtain the virtual image of the target object.
For example, a plurality of real images (including three-dimensional shape and color data) are aligned by the avatar generation model (e.g., a three-dimensional morphable model), and principal component analysis (PCA) is applied to the three-dimensional shape and color data of the aligned real images to obtain lower-dimensional subspaces (i.e., the geometric information distribution and the texture information distribution corresponding to the target object). These subspaces are then combined and deformed to generate a new avatar.
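A minimal sketch of this PCA step using scikit-learn is given below; the data layout (flattened shape and texture vectors per aligned image) and the simple linear recombination used for the deformation are assumptions for illustration and not the described model.

```python
import numpy as np
from sklearn.decomposition import PCA

def build_subspaces(aligned_shapes, aligned_textures, n_components=10):
    """Fit low-dimensional subspaces for geometry and texture with PCA.

    aligned_shapes:   array of shape (num_images, 3 * num_vertices), flattened 3D coordinates
    aligned_textures: array of shape (num_images, 3 * num_vertices), flattened per-vertex colors
    n_components must not exceed min(num_images, num_features).
    """
    shape_pca = PCA(n_components=n_components).fit(aligned_shapes)
    texture_pca = PCA(n_components=n_components).fit(aligned_textures)
    return shape_pca, texture_pca

def deform_avatar(shape_pca, texture_pca, shape_coeffs, texture_coeffs):
    """Generate a new avatar by linearly recombining the principal components."""
    shape = shape_pca.mean_ + np.asarray(shape_coeffs) @ shape_pca.components_
    texture = texture_pca.mean_ + np.asarray(texture_coeffs) @ texture_pca.components_
    return shape, texture
```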
It should be noted that before the avatar of the target object is generated, an avatar style (e.g., anime, traditional Chinese style, etc.) may be selected, and in response to the selection operation for the avatar style, an avatar conforming to the selected style is generated, where different avatar styles correspond to different avatar generation models.
In some embodiments, determining the voice data that conforms to the target object sound based on the conversation content includes: performing text analysis processing on the text information corresponding to the conversation content to obtain the context features of the conversation content; performing voice parameter prediction processing on the context features of the conversation content based on the target object sound, to obtain a plurality of voice parameters that correspond to the conversation content and conform to the target object sound; and synthesizing the plurality of voice parameters to obtain voice data that conforms to the target object sound.
As shown in Fig. 4B, the real sound of the user (i.e., the target object sound) is recorded in advance: the user clicks the "start recording" button 406 to record their real voice, and a hidden Markov model is trained with the target object sound. When the input conversation content is a text message, text analysis processing is performed on the text message of the conversation content to obtain its context features; voice parameter prediction processing is performed on the context features with the trained hidden Markov model to obtain a plurality of voice parameters that correspond to the conversation content and conform to the target object sound; and the plurality of voice parameters are synthesized to obtain voice data that conforms to the target object sound.
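A hedged Python sketch of this statistical parametric synthesis pipeline (text analysis, context features, parameter prediction with a trained voice model, waveform synthesis) is shown below; the voice_model and vocoder interfaces and the toy text analysis are assumptions, not a specific HMM toolkit.

```python
def analyze_text(text):
    """Toy text analysis: derive per-token context features (a real system would
    use phoneme, prosody, and positional contexts)."""
    return [{"token": tok, "position": i} for i, tok in enumerate(text.split())]

def text_to_speech(text, voice_model, vocoder):
    """Illustrative pipeline; `voice_model` stands for the model trained on the
    target object's voice, `vocoder` for the waveform synthesizer."""
    # 1) Text analysis: obtain the context features of the conversation content.
    context_features = analyze_text(text)
    # 2) Parameter prediction: predict acoustic parameters (e.g., spectrum, F0,
    #    duration) that conform to the target object's sound for each context.
    acoustic_params = [voice_model.predict_parameters(f) for f in context_features]
    # 3) Synthesis: turn the parameter sequence into voice data (a waveform).
    return vocoder.synthesize(acoustic_params)
```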
In some embodiments, when the conversation content is a text message, determining special effect data for characterizing the conversation content based on the avatar of the target object and the conversation content, includes: performing text feature extraction processing on the text message to obtain text features of the text message; matching a plurality of candidate special effect data contained in the database based on the text characteristics, and taking the matched candidate special effect data as special effect data for representing conversation contents; wherein the text features include at least one of: characters, symbols, expression pictures.
The database may be of multiple types, such as an action library and an expression library. The action library includes various action data, such as an expression picture containing "baiye" (i.e., a picture of the "baiye" gesture) and a video frame featuring "baiye" (i.e., a video frame containing the avatar's "baiye" action), and the expression library includes various expression data, such as an expression picture containing a smiling face and a video frame featuring a smiling face. For example, if the text features include the word "baiye", the matched expression picture containing "baiye" or the matched video frame featuring "baiye" is used as the special effect data representing the conversation content; if the text features include the ":)" symbol, the matched expression picture containing a smiling face or the matched video frame featuring a smiling face is used as the special effect data representing the conversation content; and if the text features include a smiling-face expression picture, the matched expression picture containing a smiling face or the matched video frame featuring a smiling face is likewise used as the special effect data representing the conversation content.
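A minimal Python sketch of this text-feature matching against a candidate database follows; the feature extraction, the tag-overlap scoring, and the (tags, effect_data) database layout are illustrative assumptions rather than the described matching algorithm.

```python
def extract_text_features(text_message: str) -> set:
    """Toy feature extraction: lower-cased words plus a few recognized symbols."""
    features = set(text_message.lower().split())
    if ":)" in text_message:
        features.add(":)")
    return features

def match_special_effect(text_features: set, candidate_db):
    """Return the candidate whose tags overlap the text features the most.

    `candidate_db` is assumed to be a list of (tags, effect_data) pairs drawn from
    the action library and the expression library mentioned above.
    """
    best, best_score = None, 0
    for tags, effect_data in candidate_db:
        score = len(text_features & set(tags))
        if score > best_score:
            best, best_score = effect_data, score
    return best
```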
In some embodiments, when the conversation content is a voice message, determining special effect data for characterizing the conversation content based on the avatar of the target object and the conversation content includes: carrying out format conversion processing on the voice message to obtain a text message corresponding to the voice message; performing text feature extraction processing on the text message to obtain text features of the text message; and matching a plurality of candidate special effect data contained in the database based on the text characteristics, and using the matched candidate special effect data as special effect data for representing conversation contents.
For example, when the conversation content is a voice message, the voice message needs to be converted into a text, feature extraction is performed on the basis of the converted text to obtain text features, a plurality of candidate special effect data included in the database are matched on the basis of the text features, and the matched candidate special effect data are used as special effect data for representing the conversation content.
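Under the same assumptions, the voice-message path only adds a format-conversion (speech-to-text) step before reusing the match_special_effect and extract_text_features sketches above; speech_to_text stands in for whatever recognizer the system actually uses.

```python
def match_effect_for_voice_message(voice_message, speech_to_text, candidate_db):
    # Format conversion: obtain the text corresponding to the voice message, then
    # reuse the text-feature matching sketched above.
    text = speech_to_text(voice_message)
    return match_special_effect(extract_text_features(text), candidate_db)
```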
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
In the related art, instant messaging sessions are implemented through functions such as text, expressions, pure voice, pure audio, and pure video.
However, compared with text input, the threshold for conveying information by shooting a well-produced video is higher. In particular, when the user's environment is poor, for example in a public scene with heavy traffic, dim light, and much noise, it is difficult to interact using audio and video information that meets the requirements. In such a case, the user may have to fall back to text output, and text has difficulty conveying information such as the real scene and the user's emotion.
In order to solve the above problem, an embodiment of the present application provides an avatar-based communication method (i.e., the session processing method), which enables a user to convey information through an avatar simply by inputting text or voice, without being limited by scene or time. For example, the user's image and voice data are recorded first, and the surprise expression and video message functions that fit the scene can be triggered during chat, helping the user conveniently convey emotional information, integrating the avatar into the user's social scene, helping the user present themselves, and better connecting the user with friends.
The following specifically describes the communication method based on the avatar provided in the embodiment of the present application in different scenarios:
scene 1, collecting user voice and image information
As shown in fig. 4A, personal image information of the user is entered in advance. Clicking the avatar generation entry 401 enters the product function module, which prompts that at least one photo needs to be uploaded to generate an avatar. The user clicks the "upload photo" button 402 to select at least one photo from the album, or takes at least one photo in real time, and then clicks the "upload photo" button 403 to upload the photo(s); the avatar 404 of the user is then generated automatically.
The voice information of the user is also entered in advance. As shown in fig. 4A, after the avatar 404 of the user is generated automatically, clicking the "next" button 405 presents the recording interface shown in fig. 4B. Clicking the "start recording" button 406 records the user's real voice, and clicking the "sound synthesis" button 407 produces a synthesized audio effect similar to the user's own voice, which can be auditioned; the generated avatar of the user is then presented at the avatar generation entry.
Scene 2, message publishing scene (text publishing)
As shown in fig. 5, in one publishing manner, when text editing is completed in the session editing area 501, pressing the "send" button (i.e., the send entry) for 2 seconds triggers the surprise expression function: the surprise expression 502 is previewed on the session interface, a 5-second function countdown starts, and the mobile phone vibrates in sequence; when the 5-second countdown finishes, the surprise expression 502 is sent. Before the 5-second countdown finishes, clicking the switching button 503 of the surprise expression 502 switches it to a video message: the video message 504 is previewed on the session interface, a 5-second function countdown likewise starts, and the video message 504 is sent when the countdown finishes. Before the 5-second countdown finishes, clicking the countdown control 505 cancels the avatar sending mode, and the text message 506 in the session editing area 501 is sent instead.
In another publishing manner, shown in fig. 6, when text editing is completed in the session editing area 501, pressing the "send" button for 2 seconds triggers the avatar surprise expression function; releasing the "send" button previews the surprise expression 502 in the session interface while the mobile phone vibrates in sequence, and when the preview animation of the surprise expression 502 finishes, the surprise expression 502 is sent. If the surprise expression function is triggered but the "send" button is not released, timing continues: after 2 more seconds (4 seconds accumulated) the surprise expression animation disappears, and when the accumulated time reaches 5 seconds the avatar video letter function is triggered. Releasing the "send" button then previews the video letter 504 on the session interface while the mobile phone vibrates in sequence, and when the preview animation of the video letter 504 finishes, the video letter 504 is sent. If, after the avatar video letter function is triggered, the "send" button is still not released until the preview animation of the video letter 504 finishes, the avatar sending mode is cancelled and the text information in the session editing area is sent.
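The press-duration logic of this second publishing manner can be summarized by a small sketch. The 2-second, 4-second, and 5-second thresholds come from the description above; the function name, the return values, and the behavior between 4 and 5 seconds are assumptions made for illustration only.

```python
# Illustrative press-duration logic for the publishing manner of fig. 6 (names are hypothetical).

def resolve_send_mode(press_duration_s: float) -> str:
    """Map how long the 'send' button was held before release to what gets sent."""
    if press_duration_s < 2.0:
        return "text_message"          # surprise expression not yet triggered
    if press_duration_s < 4.0:
        return "surprise_expression"   # triggered at 2 s; animation lasts about 2 s
    if press_duration_s < 5.0:
        return "text_message"          # assumption: no special effect active between 4 s and 5 s
    return "video_letter"              # video letter function triggered once 5 s is reached

for held in (1.0, 2.5, 4.5, 6.0):
    print(held, "->", resolve_send_mode(held))
```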
Scene 3, message publishing scene (Voice publishing)
As shown in fig. 7, long-pressing the voice button 701 (i.e., the recording entry) triggers the voice input function, and when voice input is completed, the user's finger can be moved to an icon on the left, with a function prompt displayed. For example, when the user's finger, while still pressed, moves to the surprise expression icon 702 (i.e., the expression special effect sending entry) and is released on it, a sending instruction for the surprise expression is triggered and the surprise expression 703 is displayed in the session interface; when the finger, while still pressed, moves to the video letter icon 704 (i.e., the video special effect sending entry) and is released on it, a sending instruction for the video letter is triggered and the video letter 705 is displayed in the session interface. When the finger, while still pressed, moves to the blank position 706 and is released there, the avatar sending mode is cancelled and the recorded voice message is sent.
Scene 4, mobile terminal message receiving scene (surprise expression)
As shown in fig. 8A, after the message sender triggers the sending of a surprise expression, when the message receiver on the mobile terminal opens the chat window, a static effect 801 of the surprise expression is displayed by default; clicking the play button 802 plays the sound of the message sender and triggers the surprise expression animation.
As shown in fig. 8B, when the input text message includes an action-type emoticon 803, the label 804 corresponding to the action is matched in the action library and an expression animation is displayed composited with the avatar; as shown in fig. 8C, when the input text message includes an object-type emoticon 805, the label 806 of the corresponding object is matched in the object library and an expression animation is displayed composited with the avatar; as shown in fig. 8D, when the input text message includes a weather-type emoticon 807, the label 808 corresponding to the weather is matched in the weather library and an expression animation is displayed composited with the avatar.
Scene 5, PC end message receiving scene (surprise expression)
As shown in fig. 9, after the message sender triggers the sending of the surprise emotion command, when the message receiver at the PC opens the chat window, the static effect 901 of the surprise emotion is displayed by default, and the play button 802 is clicked to play the sound of the message sender and trigger the surprise emotion.
Scene 6, mobile terminal message receiving scene (video message)
As shown in fig. 10, after the message sender triggers the video message sending command, when the message receiver of the mobile terminal opens the chat window, the video message 1001 is presented by default, and clicks the play button 1002, the video content 1003 can be played, and the sound of the message sender is played at the same time.
Scene 7, PC end message receiving scene (video message)
As shown in fig. 11, after the message sender triggers the video message sending command, when the message receiver at the PC opens the chat window, the video message 1101 is displayed by default, and the play button 1102 is clicked, so that the video content 1103 can be played, and the sound of the message sender can be played at the same time.
The following describes in detail a communication method based on an avatar according to an embodiment of the present application with reference to the flowcharts shown in fig. 12A to 12B:
as shown in fig. 12A, the front end uploads a text message and selects a sending type (i.e., surprise expression or video message). After receiving the text message, the background performs feature extraction processing to obtain text features and synthesizes audio information based on the text features. When the text features include emoticons, the label information (i.e., type) of each emoticon is identified and matched against the databases (such as the avatar expression library and the avatar action library) to obtain first special effect data; when the text features do not include emoticons, database matching (such as against the avatar expression library and the avatar action library) is performed based on the characters in the text features to obtain second special effect data. Video information of the avatar is synthesized based on the first special effect data and the second special effect data, and the final animation effect (i.e., the surprise expression or the video message) is synthesized from the audio information and the video information so as to present the final avatar effect at the front end.
As shown in fig. 12B, the front end uploads a voice message and selects a sending type (i.e., surprise expression or video message). The background receives the voice message, converts it into text, and performs feature extraction processing on the text to obtain text features. Database matching (e.g., against the avatar expression library and the avatar action library) is performed based on the characters in the text features, video information of the avatar is synthesized based on the matched special effect data, and the final animation effect (i.e., the surprise expression or the video message) is synthesized from the voice message and the video information so as to present the final avatar effect at the front end.
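The two flows of fig. 12A and fig. 12B can be pictured as one routing step. In the sketch below, every helper is a toy placeholder standing in for the background services described above; none of the names or return values are actual components of this application.

```python
# Hypothetical sketch of the backend flows in fig. 12A/12B; every helper is a toy placeholder.

def speech_to_text(voice):            # stands in for the ASR service of fig. 12B
    return voice["transcript"]

def extract_text_features(text):      # stands in for word segmentation / emoticon labeling
    return text.split()

def match_database(features):         # stands in for matching the expression/action libraries
    return [{"label": f} for f in features]

def synthesize_audio(text):           # stands in for user-voice speech synthesis
    return {"audio_for": text}

def synthesize_video(effect_data):    # stands in for binding matched data to the avatar
    return {"frames": effect_data}

def compose_effect(audio, video, send_type):
    return {"type": send_type, "audio": audio, "video": video}

def build_avatar_effect(message, message_type, send_type):
    """Route a text or voice message through the avatar special-effect pipeline."""
    if message_type == "voice":
        text = speech_to_text(message)     # fig. 12B: convert voice to text first
        audio = message                    # the original voice is reused as the audio track
    else:
        text = message                     # fig. 12A: the text message is used directly
        audio = synthesize_audio(text)     # audio synthesized in the user's voice

    features = extract_text_features(text)
    effect_data = match_database(features)
    video = synthesize_video(effect_data)
    return compose_effect(audio, video, send_type)

print(build_avatar_effect("baiye :)", "text", "surprise_expression"))
```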
It should be noted that the user's image features and sound features need to be obtained first, so that a user-customized avatar and sound sample can be generated from them. As shown in fig. 13, a personal photo of the user is uploaded at the front end (the photo may be taken in real time or obtained from the album); after receiving it, the back end identifies the image features from the photo and synthesizes the user's avatar based on them. A sound sample is likewise uploaded at the front end (the sample may be recorded in real time or obtained from a pre-stored recording file); after receiving it, the back end identifies the sound features from the sample and synthesizes the user's sound sample based on them.
The following describes a communication method based on an avatar provided in the embodiment of the present application based on a specific algorithm:
1) Generating avatars based on personal photos of a user
The embodiment of the application adopts the three-dimensional morphable model (3DMM, 3D Morphable Models) technology to generate a 3D avatar from a 2D photo. A 3D face model database is stored in the background to build a deformable face model; through face analysis, the input 2D picture is matched with the corresponding 3D deformable face model, and a 3D avatar is generated after appropriate adjustment.
The 3DMM can be applied to fields such as face analysis, model fitting, and image synthesis. For face synthesis, multiple sets of 3D face data (including three-dimensional shape and color data) are first scanned with high precision and aligned, and then lower-dimensional subspaces (i.e., the geometric information of the face and the texture information of the face) are extracted from the shape and color data using Principal Component Analysis (PCA). By combining and deforming within these PCA subspaces, the characteristics of one face can be transferred to another to generate a new virtual face.
The geometric information of the human face is represented by a shape vector, and the texture information of the human face is represented by a texture vector. The three-dimensional deformation face model established in the embodiment of the application is composed of a plurality of face models, and the plurality of face models in the data set are subjected to weighted combination to obtain a new face model.
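Written out, the weighted combination of face models corresponds to the standard linear 3DMM formulation; it is stated here as general background, and the symbols are conventional notation rather than reference numerals of this application:

S_{\text{new}} = \bar{S} + \sum_{i=1}^{m} \alpha_i \, s_i, \qquad T_{\text{new}} = \bar{T} + \sum_{i=1}^{m} \beta_i \, t_i

where \bar{S} and \bar{T} are the mean shape and texture vectors, s_i and t_i are the PCA basis vectors spanning the lower-dimensional geometric and texture subspaces, and \alpha_i, \beta_i are the combination weights fitted to the input photo.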
The matching between the 3DMM and the face image adopts an analysis-by-synthesis technique: the face is first coarsely reconstructed in three dimensions from the current model parameters, the reconstruction is projected to a two-dimensional image and continuously compared with the input image, and the parameters are updated based on the residual so that the generated two-dimensional image becomes as similar as possible to the input image. Some attributes of the face can then be adjusted on the final result, so that an avatar resembling the user's input image is output.
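As a drastically simplified stand-in for this fitting loop (no rendering, no texture, orthographic projection, synthetic data), the following numpy sketch recovers the shape coefficients of a linear model from observed 2D landmarks by linear least squares; it illustrates only the model-to-image fitting idea, not the method of this application.

```python
import numpy as np

# Toy linear shape model: mean shape (3N,) and basis (3N, m), both synthetic.
rng = np.random.default_rng(0)
n_points, n_components = 20, 5
mean_shape = rng.normal(size=3 * n_points)
basis = rng.normal(size=(3 * n_points, n_components))

def project_xy(shape_3d):
    """Orthographic projection: keep x and y, drop z (points stored as x0,y0,z0,x1,...)."""
    pts = shape_3d.reshape(-1, 3)
    return pts[:, :2].reshape(-1)

# "Observed" 2D landmarks generated from known coefficients (ground truth for the toy example).
true_alpha = rng.normal(size=n_components)
observed_2d = project_xy(mean_shape + basis @ true_alpha)

# Projection matrix picking x and y components, so the fit is a linear least-squares problem.
P = np.zeros((2 * n_points, 3 * n_points))
for i in range(n_points):
    P[2 * i, 3 * i] = 1.0        # x component of point i
    P[2 * i + 1, 3 * i + 1] = 1.0  # y component of point i

alpha_hat, *_ = np.linalg.lstsq(P @ basis, observed_2d - P @ mean_shape, rcond=None)
print("recovered coefficients match:", np.allclose(alpha_hat, true_alpha))
```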
2) Synthesizing user sound samples based on user sound
A speech synthesis technique learns the acoustic characteristics of the user's audio, so that synthesized speech matching the user's voice can be produced; this fits the scenario in which the user uploads text messages.
The embodiment of the application adopts a parameter-based speech synthesis method, which uses a statistical model to generate speech parameters over time and then converts these parameters into a sound waveform. The process abstracts the text into linguistic features, uses a statistical model to learn the correspondence between the linguistic features and the acoustic features, and then restores the predicted acoustic features into a waveform; a vocoder is used to perform this feature-to-waveform conversion.
As shown in fig. 14, a Hidden Markov Model (HMM) is trained on the user's voice in the speech library to obtain a context-dependent HMM model. Text analysis is performed on the input text message to obtain context features, a state sequence is generated by combining the context features with the context-dependent HMM model to obtain speech parameters, and a synthesized user sound sample (i.e., audio representing the input text message) is obtained through a parameter synthesizer. The embodiment of the application may store the synthesized user sound sample so that it can be called directly later.
It should be noted that, after a speech signal is obtained from the speech library, audio feature extraction is performed to obtain audio features such as the mel spectrum and the mel-frequency cepstral coefficients (MFCC, Mel-scale Frequency Cepstral Coefficients). A raw speech signal is a one-dimensional time-domain signal from which it is visually difficult to see how the frequency content changes; applying a Fourier transform yields the corresponding frequency-domain information but loses the time-domain information, so the change of the spectrum over time cannot be seen and the sound cannot be described well.
The short-time Fourier transform therefore operates on short-time signals obtained by framing the long signal: the long signal is divided into frames, a window is applied to each frame, a Fourier transform (FFT) is performed on each frame, and the per-frame results are stacked to obtain a two-dimensional signal (i.e., a spectrogram). The spectrogram is often a large image, so to obtain sound features of a suitable size it is transformed into a mel spectrum by a mel-scale filter bank. Cepstral analysis (i.e., taking the logarithm and performing a DCT) is then performed on the mel spectrum to obtain the mel cepstrum.
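The framing, windowing, FFT, mel filtering, and cepstral steps described above are available in common audio libraries. A minimal sketch using librosa follows; the choice of librosa, the file path, and the parameter values are assumptions for illustration and are not part of this application.

```python
# Minimal mel-spectrogram / MFCC extraction sketch; librosa and the file path are assumptions.
import librosa

y, sr = librosa.load("user_voice_sample.wav", sr=16000)   # hypothetical recording

# Frame + window + FFT + mel filter bank -> mel spectrogram (power), then log scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)

# Cepstral analysis (log + DCT) on the mel spectrum -> MFCCs.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)   # (n_mels, frames), (n_mfcc, frames)
```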
In the training process of the HMM, the output distribution of each HMM state is represented by a single Gaussian or a Gaussian mixture model (GMM), and the goal of the parameter generation algorithm is to compute the speech parameter sequence with the maximum likelihood given the Gaussian distribution sequence.
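For completeness, the maximum-likelihood parameter generation step used with such models has a well-known closed-form solution; it is stated here as standard background from HMM-based synthesis, not as a formula specific to this application. Assuming W is the window matrix that appends dynamic (delta) features to the static parameters c, and \mu, \Sigma are the means and covariances stacked over the selected Gaussian state sequence, maximizing the likelihood of o = Wc gives

c = \left( W^{\top} \Sigma^{-1} W \right)^{-1} W^{\top} \Sigma^{-1} \mu .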
3) Conversion of voice messages to text
If the information input by the user is audio (i.e. voice message), in order to extract emotion information of the user in the voice, text feature extraction needs to be performed on a text, so that the audio needs to be converted into the text, i.e. the audio needs to be uploaded to a server, and the server converts the audio into text information by using an Automatic Speech Recognition (ASR) technology.
As shown in fig. 15, feature extraction is performed on an input voice message to obtain voice features, and decoding processing is performed on the voice features based on an acoustic model and a language model to obtain text information corresponding to the voice message.
Wherein, the feature extraction is to extract important information reflecting the voice feature from the voice waveform, remove irrelevant information (such as background noise) and convert the information into a group of discrete parameter vectors.
If multiple voices are present in the voice message, the speakers' voices are separated and the voice with the highest proportion is selected. The voice message is then preprocessed: Voice Activity Detection (VAD) endpoint detection and framing are performed on the audio content to obtain a speech waveform, and the conversion from the time domain to the frequency domain is completed through the Fourier transform, i.e., a Fourier transform is performed on each frame and characteristic parameters (namely Mel-scale Frequency Cepstral Coefficients) are used for characterization, so as to obtain the spectrum of each frame and remove background noise, irrelevant voices, and the like from the audio.
After feature extraction is finished, the acoustic model processes the framed speech and handles pronunciation-related work, producing the basic phoneme states of the utterance and their probabilities; the smallest phonemes in the speech are recognized, several phonemes form a word, and words form a text sentence. The language model then combines the semantic scene and the context to generate coherent, correct text.
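The cooperation between the acoustic model and the language model during decoding can be pictured as a rescoring step that combines the two scores. In the toy sketch below, the candidate transcriptions, their scores, and the weighting factor are invented for illustration only.

```python
# Toy decoding sketch: combine acoustic and language-model log-probabilities (values invented).
candidates = {
    "send the surprise expression": {"acoustic": -42.0, "lm": -8.5},
    "sand the surprise expression": {"acoustic": -41.5, "lm": -15.2},  # acoustically close, unlikely text
}
LM_WEIGHT = 1.5  # language-model scale factor, an assumed tuning parameter

def total_score(scores):
    return scores["acoustic"] + LM_WEIGHT * scores["lm"]

best = max(candidates, key=lambda text: total_score(candidates[text]))
print("decoded text:", best)
```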
4) Text information extraction
If the information input by the user is a text message, text features are extracted directly. Text feature extraction includes automatic word segmentation, part-of-speech tagging, syntactic analysis, prosody prediction, and the like, where syntactic analysis is used to determine the syntactic structure of a sentence or the dependency between the words in the sentence.
The embodiment of the application adopts a statistics-based word segmentation method. A word is a stable combination of characters, so the more often adjacent characters occur together in a text, the more likely they are to constitute a word; the frequency or probability of adjacent characters co-occurring therefore reflects well how plausible a word is. The mutual information can be computed by counting the frequency of adjacent co-occurring characters in the text; for example, the mutual information of character X and character Y is M(X, Y) = log(P(X, Y) / (P(X)P(Y))), where P(X, Y) is the adjacent co-occurrence probability of X and Y, and P(X) and P(Y) are the frequencies of occurrence of X and Y in the text. The mutual information reflects how tightly the characters are bound; when it is higher than a certain threshold, the adjacent characters (i.e., the character pair) are considered to possibly constitute a word. This method only needs to count character-pair frequencies in the text and does not require a segmentation dictionary.
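The mutual information criterion above can be computed directly from bigram counts, as in the following sketch; the toy corpus and the threshold are invented values, not data from this application.

```python
import math
from collections import Counter

# Toy statistics-based segmentation sketch: score adjacent character pairs by mutual information.
corpus = "你好你好你好天气天气真好你好"   # toy corpus; a real system counts over large text collections
chars = list(corpus)

unigram = Counter(chars)
bigram = Counter(zip(chars, chars[1:]))
n_uni, n_bi = sum(unigram.values()), sum(bigram.values())

def mutual_information(x, y):
    """M(X, Y) = log( P(X, Y) / (P(X) * P(Y)) ), as in the formula above."""
    p_xy = bigram[(x, y)] / n_bi
    return math.log(p_xy / ((unigram[x] / n_uni) * (unigram[y] / n_uni)))

THRESHOLD = 1.0   # assumed closeness threshold
for pair, count in bigram.items():
    if count >= 2 and mutual_information(*pair) > THRESHOLD:
        print("".join(pair), round(mutual_information(*pair), 2))   # pairs treated as words
```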
Part-of-speech tagging is the procedure of labeling each word in the segmentation result with its correct part of speech, that is, determining whether each word is a noun, a verb, an adjective, or another part of speech, in order to assist syntactic analysis as preprocessing. The embodiment of the application adopts a statistical model based on a hidden Markov model, which can be trained with a large annotated corpus, where annotated data means text in which each word has been assigned its correct part-of-speech label.
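An HMM-based tagger scores tag sequences through transition and emission probabilities estimated from the annotated corpus and decodes them with the Viterbi algorithm. The sketch below uses invented toy probabilities and an English toy sentence purely for illustration; it is not the model of this application.

```python
# Toy HMM part-of-speech tagging sketch (probabilities invented); decoding uses Viterbi.
import math

TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.6, "VERB": 0.4}                               # P(tag | sentence start)
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}                     # P(tag_t | tag_{t-1})
EMIT = {"NOUN": {"dog": 0.5, "bites": 0.1, "man": 0.4},
        "VERB": {"dog": 0.1, "bites": 0.8, "man": 0.1}}          # P(word | tag)

def viterbi(words):
    """Return the most likely tag sequence for the word sequence."""
    V = [{t: math.log(START[t]) + math.log(EMIT[t][words[0]]) for t in TAGS}]
    back = []
    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: V[-1][p] + math.log(TRANS[p][t]))
            scores[t] = V[-1][best_prev] + math.log(TRANS[best_prev][t]) + math.log(EMIT[t][w])
            ptrs[t] = best_prev
        V.append(scores)
        back.append(ptrs)
    last = max(TAGS, key=lambda t: V[-1][t])
    tags = [last]
    for ptrs in reversed(back):
        tags.append(ptrs[tags[-1]])
    return list(reversed(tags))

print(viterbi(["dog", "bites", "man"]))   # expected: ['NOUN', 'VERB', 'NOUN']
```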
Meanwhile, the extracted text features are used for matching databases such as an action library, an expression library, an object library, a weather library and the like so as to synthesize a dynamic picture of the virtual image.
5) Label recognition of emoticons
Label identification of emoticons is used for synthesizing different avatar actions, and the following processing modes are included for different types of emoticons:
1. action class emoticons-look for approximate action labels (e.g., action emoticons or action video frames) in the action library and preferentially match such actions in conjunction with text features extracted from the text message;
2. emotion expression type emoticons-look for approximate emotion labels (e.g., emotional expressions or emotion video frames) in an expression library and preferentially match such expressions in conjunction with text features extracted from text messages;
3. object emoticons-related object labels (such as object expressions or object video frames) are established in a background library, and when special effects are synthesized, the object labels appear in an avatar background according to needs and can be displayed by combining matched action labels;
4. weather-like emoticons — related weather tags (e.g., weather emoticons or weather video frames) are built in the background library, such weather tags appearing in the avatar background as needed.
6) Motion matching
Text information in the text features can be obtained by analyzing the text message or voice message uploaded by the user, so that the matching avatar actions can be screened out of the corresponding action library. The action is transmitted to the background, confidence matching is performed through picture retrieval, the most suitable action is determined in combination with the label recognition result of the emoticons, and the action is bound to the user's avatar and synthesized to obtain the avatar action (i.e., an expression picture or video frame).
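The confidence matching step can be pictured as scoring each candidate action in the action library against the extracted features and keeping the best one. The scoring function, library contents, and labels below are purely illustrative assumptions.

```python
# Illustrative confidence matching between extracted features and action-library labels.

ACTION_LIBRARY = {
    "wave": {"labels": {"baiye", "goodbye", "wave"}},
    "smile": {"labels": {"smiley face", ":)", "happy"}},
}

def match_action(text_features, emoticon_labels):
    """Pick the action whose labels overlap most with the extracted features (a toy confidence)."""
    query = set(text_features) | set(emoticon_labels)
    best_action, best_conf = None, 0.0
    for action, entry in ACTION_LIBRARY.items():
        overlap = len(query & entry["labels"])
        confidence = overlap / max(len(entry["labels"]), 1)
        if confidence > best_conf:
            best_action, best_conf = action, confidence
    return best_action, best_conf

print(match_action(["baiye"], ["wave"]))   # ('wave', 0.666...)
```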
In conclusion, if the user inputs a text message, the synthesized audio and the virtual image actions are matched to synthesize a surprise expression or a video message; and if the voice message is input by the user, matching the voice message input by the user with the virtual image action to synthesize a surprise expression or video information.
Therefore, the method and the device can address situations in which the user cannot communicate by voice or video because, for example, the surrounding environment is noisy or the scene is not presentable. Using a low-cost text and voice input mode, the user obtains the expressive effect of a personalized avatar video letter together with the user's own voice, which makes it more convenient to strengthen the emotional connection between people.
The session processing method provided by the embodiment of the present application has been described so far with reference to the exemplary application and implementation of the electronic device provided by the embodiment of the present application, and a scheme for implementing session processing by cooperation of the modules (the first display module 5551, the obtaining module 5552, and the second display module 5553) in the session processing apparatus 555 provided by the embodiment of the present application is continuously described below.
A first display module 5551 for presenting a session edit area; an obtaining module 5552, configured to, in response to an input operation based on the session editing area, obtain session content formed by the input operation; a second display module 5553, configured to present, in response to the sending operation for the conversation content, a special effect generated based on the avatar of the target object in the conversation area, wherein the special effect is used to characterize the conversation content.
In some embodiments, the second display module 5553 is further configured to play an expressive special effect generated based on an avatar of a target object, wherein the expressive special effect comprises at least one expressive picture characterizing the conversation content based on the avatar, and speech characterizing text of the conversation content based on the target object sound.
In some embodiments, the emoticon is used to characterize the conversation content from at least one of the following dimensions: the key words of the conversation content, the emotion information carried by the conversation content and the theme to which the conversation content belongs.
In some embodiments, the expressive special effects are set to a play mode, the play mode comprising: and automatically playing any one expression picture when the expression special effect is received, switching to play a plurality of expression pictures when the expression special effect is triggered to play, and synchronously playing the voice.
In some embodiments, the second display module 5553 is further configured to play a video special effect generated based on an avatar of a target object, wherein the video special effect includes a plurality of video frames characterizing the conversation content based on the avatar and a plurality of audio frames characterizing text of the conversation content based on the target object sound.
In some embodiments, the video effects are set to a play mode, the play mode comprising: and when the video special effect is received, any one of the video frames is automatically played, and when the video special effect is triggered to be played, the plurality of video frames are played, and the plurality of audio frames are synchronously played.
In some embodiments, the second display module 5553 is further configured to present a send entry for the session content; playing a preview screen for representing a special effect of the session content in response to a sending operation for triggering the sending entry, wherein the first triggering operation is different from a second triggering operation for executing sending of the session content; in response to the preview screen playing being completed, transmitting the special effect, and presenting a special effect generated based on the avatar of the target object in a conversation area.
In some embodiments, the types of effects include expressive effects, video effects; the second display module 5553 is further configured to, when the played preview screen for representing the special effect of the session content is from a first type of special effect, switch the first type of special effect to a second type of special effect in response to a switching operation for the first type of special effect, and play a preview screen for the second type of special effect; wherein the first type of special effect is any one of the expression special effect and the video special effect, and the second type of special effect is the other one of the expression special effect and the video special effect.
In some embodiments, the second display module 5553 is further configured to present a switch entry for a preview screen for the first type of special effect; switching the presented first type of special effect to the second type of special effect in response to a triggering operation for the switching entry.
In some embodiments, the types of effects include expressive effects, video effects; the second display module 5553 is further configured to, in response to a trigger operation for triggering the sending entry, determine a parameter corresponding to the trigger operation, and play a preview screen of a special effect of a type corresponding to the parameter; the trigger operation comprises different parameters, the different parameters correspond to different types of special effects, and the parameters of the trigger operation comprise at least one of the following parameters: trigger time, trigger action mode.
In some embodiments, when the preview screen for representing the special effect of the conversation content is played, the second display module 5553 is further configured to present the conversation content in the conversation area.
In some embodiments, the second display module 5553 is further configured to present a play countdown control of the preview screen in the send entry; and stopping playing the special-effect preview picture in response to the triggering operation of the play countdown control.
In some embodiments, the second display module 5553 is further configured to stop playing the preview screen of the special effect in response to not releasing the pressed state after pressing the send entry and maintaining the pressed state for a set time period.
In some embodiments, when the conversation content is a voice message, the second display module 5553 is further configured to present a recording entry in the conversation editing area; in response to pressing the recording entry and maintaining a pressed state, acquiring the voice message acquired for a voice input operation in the pressed state; presenting a special effect sending entrance; and in response to the pressing state being released after the pressing state is moved to the special effect sending entrance, presenting a special effect generated based on the avatar of the target object in the conversation area.
In some embodiments, the types of the special effects include expression special effects and video special effects, and the special effect sending entries include expression special effect sending entries and video special effect sending entries; the second display module 5553 is further configured to release the pressed state after moving to the expression special effect sending entry in response to the pressed state, and present an expression special effect generated based on an avatar of a target object in a session area; and in response to the pressing state being released after the pressing state is moved to the video special effect sending entrance, presenting the video special effect generated based on the virtual image of the target object in the conversation area.
In some embodiments, before presenting the special effect generated based on the avatar of the target object in the conversation region, the apparatus further comprises: a processing module 5554, configured to determine special effect data for characterizing the session content based on the avatar of the target object and the session content; determining voice data conforming to the target object sound based on the conversation content; and synthesizing the special effect data and the voice data to obtain a special effect generated based on the virtual image of the target object.
In some embodiments, before determining special effects data for characterizing the session content, the processing module 5554 is further configured to obtain a real image of the target object; and calling an avatar generation model based on the real image to obtain the avatar of the target object.
In some embodiments, the processing module 5554 is further configured to perform the following processing by the avatar generation model: performing principal component analysis processing on the real image to obtain geometric information distribution and texture information distribution corresponding to the target object; and carrying out deformation processing based on the geometric information distribution and the texture information distribution to obtain the virtual image of the target object.
In some embodiments, the processing module 5554 is further configured to perform text analysis processing on text information corresponding to the session content, so as to obtain a context feature of the session content; performing voice parameter prediction processing on the context characteristics of the conversation content based on the target object sound, wherein the conversation content corresponds to a plurality of voice parameters which accord with the target object sound; and synthesizing the plurality of voice parameters to obtain voice data which accords with the target object sound.
In some embodiments, when the session content is a text message, the processing module 5554 is further configured to perform text feature extraction processing on the text message to obtain a text feature of the text message; matching a plurality of candidate special effect data contained in a database based on the text features, and taking the matched candidate special effect data as special effect data for representing the conversation content; wherein the text features include at least one of: characters, symbols, expression pictures.
In some embodiments, when the session content is a voice message, the processing module 5554 is further configured to perform format conversion processing on the voice message to obtain a text message corresponding to the voice message; performing text feature extraction processing on the text message to obtain text features of the text message; matching a plurality of candidate special effect data contained in a database based on the text features, and taking the matched candidate special effect data as special effect data for representing the conversation content.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the session processing method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, will cause the processor to execute a session processing method provided by embodiments of the present application, for example, the session processing method shown in fig. 3A.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.
Claims (20)
1. A method for session processing, the method comprising:
presenting a session editing area;
responding to an input operation based on the session editing area, and acquiring session content formed by the input operation;
and presenting a special effect generated based on the virtual image of the target object in the conversation area in response to the sending operation aiming at the conversation content, wherein the special effect is used for representing the conversation content.
2. The method of claim 1, wherein presenting the special effect generated based on the avatar of the target object in the conversation region comprises:
playing an expressive special effect generated based on an avatar of a target object, wherein the expressive special effect comprises at least one expressive picture representing the conversation content based on the avatar, and voice representing text of the conversation content based on the target object sound.
3. The method of claim 2,
the emoticon is used for representing the conversation content from at least one of the following dimensions: keywords of the session content and emotion information carried by the session content.
4. The method of claim 2, wherein the expressive special effects are set to a play mode comprising: and automatically playing any one expression picture when the expression special effect is received, switching to play a plurality of expression pictures when the expression special effect is triggered to play, and synchronously playing the voice.
5. The method of claim 1, wherein presenting the special effect generated based on the avatar of the target object in the conversation region comprises:
playing a video special effect generated based on an avatar of a target object, wherein the video special effect includes a plurality of video frames characterizing the conversational content based on the avatar and a plurality of audio frames characterizing text of the conversational content based on the target object sound.
6. The method of claim 5, wherein the video special effect is set to a play mode, the play mode comprising: and when the video special effect is received, any one of the video frames is automatically played, and when the video special effect is triggered to be played, the plurality of video frames are played, and the plurality of audio frames are synchronously played.
7. The method according to claim 1, wherein presenting a special effect generated based on an avatar of a target object in a conversation area in response to a transmission operation for the conversation content comprises:
presenting a sending entry for the session content;
responding to a sending operation for triggering the sending entrance, and playing a preview screen for representing a special effect of the conversation content;
in response to the preview screen playing being completed, transmitting the special effect, and presenting a special effect generated based on the avatar of the target object in a conversation area.
8. The method of claim 7,
the types of the special effects comprise expression special effects and video special effects;
the method further comprises the following steps:
when the played preview picture for representing the special effect of the conversation content is from a first type special effect, responding to the switching operation aiming at the first type special effect, switching the first type special effect into a second type special effect, and playing the preview picture of the second type special effect;
wherein the first type of special effect is any one of the expression special effect and the video special effect, and the second type of special effect is the other one of the expression special effect and the video special effect.
9. The method of claim 8, wherein switching the first type of special effect to a second type of special effect in response to the switching operation for the first type of special effect comprises:
presenting a switching entry of a preview screen for the first type of special effect;
switching the presented first type of special effect to the second type of special effect in response to a triggering operation for the switching entry.
10. The method of claim 7,
the types of the special effects comprise expression special effects and video special effects;
the playing a preview screen for characterizing a special effect of the conversation content in response to a sending operation for triggering the sending entry, comprising:
responding to a trigger operation for triggering the sending inlet, determining a parameter corresponding to the trigger operation, and playing a preview picture of a special effect of a type corresponding to the parameter;
the trigger operation comprises different parameters, the different parameters correspond to different types of special effects, and the parameters of the trigger operation comprise at least one of the following parameters: trigger time, trigger action mode.
11. The method of claim 7, wherein when playing a preview screen for characterizing a special effect of the session content, the method further comprises:
and in response to the stop sending operation aiming at the special effect, stopping playing the preview screen of the special effect, and presenting the conversation content in the conversation area.
12. The method according to claim 11, wherein the stopping of the playing of the preview screen of the special effect in response to the stop transmission operation for the special effect comprises:
presenting a play countdown control of the preview picture in the sending entry;
and stopping playing the special preview picture in response to the triggering operation of the play countdown control.
13. The method according to claim 11, wherein the stopping of the playing of the preview screen of the special effect in response to the stop transmission operation for the special effect comprises:
and in response to the fact that the pressing state is not released after the sending inlet is pressed and the pressing state is kept for a set time length, stopping playing the special preview picture.
14. The method of claim 1,
when the conversation content is a voice message, the obtaining of the conversation content formed by the input operation in response to the input operation based on the conversation editing area comprises:
presenting a recording entry in the session editing area;
in response to pressing the recording entry and maintaining a pressed state, acquiring the voice message acquired for a voice input operation in the pressed state;
the presenting, in response to the transmission operation for the conversation content, a special effect generated based on an avatar of a target object in a conversation area includes:
presenting a special effect sending entrance;
and responding to the pressing state and releasing the pressing state after moving to the special effect sending entrance, and presenting the special effect generated based on the virtual image of the target object in the conversation area.
15. The method of claim 14,
the types of the special effects comprise expression special effects and video special effects, and the special effect sending entries comprise expression special effect sending entries and video special effect sending entries;
the releasing the pressing state after responding to the pressing state and moving to the special effect sending entrance, and presenting the special effect generated based on the virtual image of the target object in the conversation area, wherein the presenting comprises:
responding to the pressing state, releasing the pressing state after moving to the expression special effect sending entrance, and presenting an expression special effect generated based on the virtual image of the target object in the conversation area;
and responding to the pressing state and releasing the pressing state after moving to the video special effect sending entrance, and presenting the video special effect generated based on the virtual image of the target object in the conversation area.
16. The method of claim 1, wherein prior to presenting the special effect generated based on the avatar of the target object in the conversation region, the method further comprises:
determining special effect data for representing the conversation content based on the virtual image of the target object and the conversation content;
determining voice data conforming to the target object sound based on the conversation content;
and synthesizing the special effect data and the voice data to obtain a special effect generated based on the virtual image of the target object.
17. The method of claim 16, wherein prior to determining special effects data characterizing the conversational content, the method further comprises:
acquiring a real image of the target object;
and calling an avatar generation model based on the real image to obtain the avatar of the target object.
18. A session processing apparatus, characterized in that the apparatus comprises:
the first display module is used for presenting a session editing area;
the acquisition module is used for responding to the input operation based on the session editing area and acquiring the session content formed by the input operation;
and the second display module is used for responding to the sending operation aiming at the conversation content and presenting a special effect generated based on the virtual image of the target object in the conversation area, wherein the special effect is used for representing the conversation content.
19. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor, configured to implement the session processing method of any one of claims 1 to 17 when executing the executable instructions stored in the memory.
20. A computer-readable storage medium storing executable instructions for implementing the session processing method of any one of claims 1 to 17 when executed by a processor.