TW202138992A

TW202138992A - Method and apparatus for driving interactive object, device and storage medium

Info

Publication number: TW202138992A
Application number: TW109144447A
Authority: TW
Inventors: 吳文岩; 吳潛溢; 錢晨; 白晨
Original assignee: 大陸商北京市商湯科技開發有限公司
Priority date: 2020-03-31
Filing date: 2020-12-16
Publication date: 2021-10-16
Also published as: KR20210124307A; JP2022530935A; CN111460785A; SG11202111909QA; WO2021196644A1; CN111460785B

Abstract

The present disclosure relates to a method and an apparatus for driving interactive object, a device and a storage medium. The method comprises: obtaining a phoneme sequence corresponding to text data; obtaining a value of control parameter of at least one local area of an interactive object matched with the phoneme sequence; and controlling a posture of the interactive object according to the obtained value of control parameter.

Description

Driving method, device, equipment and storage medium of interactive object

[相關申請案的交叉參考][Cross reference of related applications]

本申請基於申請號為2020102458024、申請日為2020年3月31日的中國專利申請提出，並要求該中國專利申請的優先權，該中國專利申請的全部內容在此引入本申請作為參考。This application is based on a Chinese patent application with an application number of 2020102458024 and an application date of March 31, 2020, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is hereby incorporated by reference into this application.

本公開涉及電腦技術領域，具體涉及一種互動物件的驅動方法、裝置、設備以及存儲介質。The present disclosure relates to the field of computer technology, and in particular to a driving method, device, equipment and storage medium of an interactive object.

人機互動的方式大多基於按鍵、觸摸、語音進行輸入，透過在顯示幕上呈現圖像、文本或虛擬人物進行回應。目前虛擬人物多是在語音助理的基礎上改進得到的。Human-computer interaction is mostly based on keystrokes, touches, and voice input, and responds by presenting images, text, or virtual characters on the display. At present, virtual characters are mostly improved on the basis of voice assistants.

本公開實施例提供一種互動物件的驅動方案。The embodiments of the present disclosure provide a driving solution for interactive objects.

根據本公開的一方面，提供一種互動物件的驅動方法，所述方法包括：獲取文本資料對應的音素序列；獲取與所述音素序列匹配的互動物件的至少一個局部區域的控制參數值；根據獲取的所述控制參數值控制所述互動物件的姿態。According to an aspect of the present disclosure, a method for driving an interactive object is provided, the method includes: obtaining a phoneme sequence corresponding to text data; obtaining a control parameter value of at least one partial area of an interactive object matching the phoneme sequence; The control parameter value controls the posture of the interactive object.

結合本公開提供的任一實施方式，所述方法更包括：根據所述文本資料控制展示所述互動物件的顯示裝置展示文本，和/或根據所述文本資料對應的音素序列控制所述顯示裝置輸出語音。With reference to any of the embodiments provided in the present disclosure, the method further includes: controlling the display device displaying the interactive object to display text according to the text data, and/or controlling the display device according to the phoneme sequence corresponding to the text data Output speech.

結合本公開提供的任一實施方式，所述互動物件的局部區域的控制參數包括所述局部區域的姿態控制向量，獲取與所述音素序列匹配的互動物件的至少一個局部區域的控制參數值，包括：對所述音素序列進行特徵編碼，獲得所述音素序列對應的第一編碼序列；根據所述第一編碼序列，獲取至少一個音素對應的特徵編碼；獲取所述特徵編碼對應的所述互動物件的至少一個局部區域的姿態控制向量。With reference to any one of the embodiments provided in the present disclosure, the control parameter of the local area of the interactive object includes the posture control vector of the local area, and the control parameter value of at least one local area of the interactive object matching the phoneme sequence is acquired, The method includes: performing feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; obtaining a feature code corresponding to at least one phoneme according to the first coding sequence; obtaining the interaction corresponding to the feature code The attitude control vector of at least one local area of the object.

結合本公開提供的任一實施方式，對所述音素序列進行特徵編碼，獲得所述音素序列對應的第一編碼序列，包括：針對所述音素序列包含的多種音素中的每種音素，生成所述音素對應的子編碼序列；根據所述多種音素分別對應的子編碼序列，獲得所述音素序列對應的第一編碼序列。In conjunction with any one of the embodiments provided in the present disclosure, performing feature encoding on the phoneme sequence to obtain the first encoding sequence corresponding to the phoneme sequence includes: generating all phonemes for each of the multiple phonemes included in the phoneme sequence. The sub-coding sequence corresponding to the phoneme; and the first coding sequence corresponding to the phoneme sequence is obtained according to the sub-coding sequences respectively corresponding to the multiple phonemes.

結合本公開提供的任一實施方式，針對所述音素序列包含的多種音素中的每種音素，生成所述音素對應的子編碼序列，包括：檢測各時間點上是否對應有所述音素；透過將有所述音素的時間點上的編碼值設置為第一數值，將沒有所述音素的時間點上的編碼值設置為第二數值，得到所述音素對應的所述子編碼序列。With reference to any of the embodiments provided in the present disclosure, for each of the multiple phonemes included in the phoneme sequence, generating a sub-coding sequence corresponding to the phoneme includes: detecting whether the phoneme corresponds to each time point; The coding value at the time point with the phoneme is set to a first value, and the coding value at the time point without the phoneme is set to a second value to obtain the sub-coding sequence corresponding to the phoneme.

結合本公開提供的任一實施方式，所述方法更包括：對於所述多種音素中的每種音素對應的所述子編碼序列，利用高斯濾波器對所述音素在時間上的連續值進行高斯卷積操作。With reference to any of the embodiments provided in the present disclosure, the method further includes: for the sub-coding sequence corresponding to each phoneme of the multiple phonemes, using a Gaussian filter to perform Gaussian on the continuous values of the phonemes in time. Convolution operation.

結合本公開提供的任一實施方式，根據獲取的所述控制參數值控制所述互動物件的姿態，包括：獲取與所述第二編碼序列對應的姿態控制向量的序列；根據所述姿態控制向量的序列控制所述互動物件的姿態。With reference to any one of the embodiments provided in the present disclosure, controlling the posture of the interactive object according to the obtained control parameter value includes: obtaining a sequence of posture control vectors corresponding to the second coding sequence; and according to the posture control vector The sequence of controls the gesture of the interactive object.

結合本公開提供的任一實施方式，所述方法更包括：在所述音素序列中的所述音素之間的時間間隔大於設定閾值的情況下，根據所述局部區域的設定控制參數值，控制所述互動物件的姿態。With reference to any of the embodiments provided in the present disclosure, the method further includes: in the case that the time interval between the phonemes in the phoneme sequence is greater than a set threshold, controlling according to the set control parameter value of the local area The posture of the interactive object.

結合本公開提供的任一實施方式，獲取所述特徵編碼對應的所述互動物件的至少一個局部區域的姿態控制向量，包括：將所述特徵編碼輸入至預先訓練的迴圈神經網路，獲得與所述特徵編碼對應的所述互動物件的至少一個局部區域的所述姿態控制向量。In conjunction with any one of the embodiments provided in the present disclosure, obtaining the attitude control vector of at least one local area of the interactive object corresponding to the feature code includes: inputting the feature code into a pre-trained loop neural network to obtain The attitude control vector of at least one partial area of the interactive object corresponding to the feature code.

結合本公開提供的任一實施方式，所述迴圈神經網路透過特徵編碼樣本訓練得到；所述方法更包括：獲取一角色發出語音的影片段，並根據所述影片段獲取多個包含所述角色的第一圖像幀；從所述影片段中提取相應的語音段，根據所述語音段獲取樣本音素序列，並對所述樣本音素序列進行特徵編碼；獲取與所述第一圖像幀對應的至少一個音素的特徵編碼；將所述第一圖像幀轉化為包含所述互動物件的第二圖像幀，獲取所述第二圖像幀對應的至少一個局部區域的姿態控制向量值；根據所述姿態控制向量值，對與所述第一圖像幀對應的所述特徵編碼進行標注，獲得所述特徵編碼樣本。With reference to any of the embodiments provided in the present disclosure, the loop neural network is obtained through feature coding sample training; the method further includes: obtaining a video segment of a character's voice, and obtaining a plurality of video segments containing all of them according to the video segment. The first image frame of the character; extract the corresponding speech segment from the film segment, obtain a sample phoneme sequence according to the speech segment, and perform feature encoding on the sample phoneme sequence; obtain the first image Feature encoding of at least one phoneme corresponding to the frame; transforming the first image frame into a second image frame containing the interactive object, and obtaining a posture control vector of at least one local area corresponding to the second image frame Value; according to the attitude control vector value, annotate the feature code corresponding to the first image frame to obtain the feature code sample.

結合本公開提供的任一實施方式，所述方法更包括：根據所述特徵編碼樣本對初始迴圈神經網路進行訓練，在網路損失的變化滿足收斂條件後訓練得到所述迴圈神經網路，其中，所述網路損失包括所述迴圈神經網路預測得到的所述至少一個局部區域的所述姿態控制向量值與標注的所述姿態控制向量值之間的差異。With reference to any one of the embodiments provided in the present disclosure, the method further includes: training the initial loop neural network according to the characteristic coding samples, and training to obtain the loop neural network after the change of the network loss satisfies the convergence condition Wherein, the network loss includes the difference between the attitude control vector value of the at least one local area predicted by the loop neural network and the marked attitude control vector value.

根據本公開的一方面，提供一種互動物件的驅動裝置，所述裝置包括：第一獲取單元，用於獲取文本資料對應的音素序列；第二獲取單元，用於獲取與所述音素序列匹配的互動物件的至少一個局部區域的控制參數值；驅動單元，用於根據獲取的所述控制參數值控制所述互動物件的姿態。According to an aspect of the present disclosure, there is provided a driving device for an interactive object. The device includes: a first acquiring unit for acquiring a phoneme sequence corresponding to a text data; a second acquiring unit for acquiring a phoneme sequence matching the phoneme sequence The control parameter value of at least one partial area of the interactive object; the driving unit is used to control the posture of the interactive object according to the acquired control parameter value.

根據本公開的一方面，提供一種電子設備，所述設備包括記憶體、處理器，所述記憶體用於存儲可在處理器上運行的電腦指令，所述處理器用於在執行所述電腦指令時實現本公開提供的任一實施方式所述的互動物件的驅動方法。According to one aspect of the present disclosure, an electronic device is provided, the device includes a memory and a processor, the memory is used to store computer instructions that can be run on the processor, and the processor is used to execute the computer instructions. The driving method of the interactive object described in any of the embodiments provided in the present disclosure is realized at a time.

根據本公開的一方面，提供一種電腦可讀存儲介質，其上存儲有電腦程式，所述電腦程式被處理器執行時實現本公開提供的任一實施方式所述的互動物件的驅動方法。According to one aspect of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon, and the computer program, when executed by a processor, implements the method for driving an interactive object according to any one of the embodiments provided in the present disclosure.

本公開一個或多個實施例的互動物件的驅動方法、裝置、設備及電腦可讀存儲介質，透過獲取文本資料對應的音素序列，並獲取與所述音素序列匹配的互動物件的至少一個局部區域的控制參數值，來控制所述互動物件的姿態，可以使互動物件做出與文本資料所對應的音素匹配的姿態，該姿態包括面部姿態和肢體姿態，從而使目標物件產生互動物件正在說出文本內容的感覺，提升了目標物件與互動物件的互動體驗。The driving method, device, device, and computer-readable storage medium of an interactive object according to one or more embodiments of the present disclosure obtain a phoneme sequence corresponding to text data, and obtain at least one partial area of an interactive object matching the phoneme sequence The control parameter value of to control the posture of the interactive object, so that the interactive object can make a posture that matches the phoneme corresponding to the text data. The posture includes facial posture and body posture, so that the target object generates the interactive object is speaking The sense of text content enhances the interactive experience between the target object and the interactive object.

這裡將詳細地對示例性實施例進行說明，其示例表示在附圖中。下面的描述涉及附圖時，除非另有表示，不同附圖中的相同數字表示相同或相似的要素。以下示例性實施例中所描述的實施方式並不代表與本公開相一致的所有實施方式。相反，它們僅是與如所附申請專利範圍中所詳述的、本公開的一些方面相一致的裝置和方法的例子。The exemplary embodiments will be described in detail here, and examples thereof are shown in the accompanying drawings. When the following description refers to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The implementation manners described in the following exemplary embodiments do not represent all implementation manners consistent with the present disclosure. On the contrary, they are only examples of devices and methods consistent with some aspects of the present disclosure as detailed in the scope of the appended application.

本文中術語“和/或”，僅僅是一種描述關聯物件的關聯關係，表示可以存在三種關係，例如，A和/或B，可以表示：單獨存在A，同時存在A和B，單獨存在B這三種情況。另外，本文中術語“至少一種”表示多種中的任意一種或多種中的至少兩種的任意組合，例如，包括A、B、C中的至少一種，可以表示包括從A、B和C構成的集合中選擇的任意一個或多個元素。The term "and/or" in this article is only an association relationship describing related objects, which means that there can be three relationships. For example, A and/or B can mean: A alone exists, A and B exist at the same time, and B exists alone. three conditions. In addition, the term "at least one" herein means any one or any combination of at least two of the multiple, for example, including at least one of A, B, and C, and may mean including those made from A, B, and C Any one or more elements selected in the set.

本公開至少一個實施例提供了一種互動物件的驅動方法，所述驅動方法可以由終端設備或伺服器等電子設備執行，所述終端設備可以是固定終端或行動終端，例如手機、平板電腦、遊戲機、桌上型電腦、廣告機、一體機、車載終端等等，所述伺服器包括本機伺服器或雲端伺服器等，所述方法還可以透過處理器調用記憶體中存儲的電腦可讀指令的方式來實現。At least one embodiment of the present disclosure provides a method for driving an interactive object. The driving method may be executed by an electronic device such as a terminal device or a server. The terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, or a game. Computer, desktop computer, advertising player, all-in-one, vehicle-mounted terminal, etc., the server includes a local server or a cloud server, etc. The method can also call a computer readable stored in the memory through the processor The way to implement instructions.

在本公開實施例中，互動物件可以是任意一種能夠與目標物件進行互動的虛擬形象。在一實施例中，互動物件可以是虛擬人物，還可以是虛擬動物、虛擬物品、卡通形象等等其他能夠實現互動功能的虛擬形象。互動物件的展現形式既可以是2D形式也可以是3D形式，本公開對此並不限定。所述目標物件可以是使用者，也可以是機器人，還可以是其他智慧設備。所述互動物件和所述目標物件之間的對話模式可以是主動對話模式，也可以是被動對話模式。一示例中，目標物件可以透過做出手勢或者肢體動作來發出需求，透過主動互動的方式來觸發互動物件與其互動。另一示例中，互動物件可以透過主動打招呼、提示目標物件做出動作等方式，使得目標物件採用被動方式與互動物件進行互動。In the embodiment of the present disclosure, the interactive object may be any virtual image capable of interacting with the target object. In an embodiment, the interactive object may be a virtual character, or may also be a virtual animal, a virtual item, a cartoon image, or other virtual images capable of realizing interactive functions. The display form of the interactive object may be 2D or 3D, which is not limited in the present disclosure. The target object can be a user, a robot, or other smart devices. The dialogue mode between the interactive object and the target object may be an active dialogue mode or a passive dialogue mode. In one example, the target object can make a demand by making gestures or body movements, and trigger the interactive object to interact with it through active interaction. In another example, the interactive object may actively greet, prompt the target object to make an action, etc., so that the target object interacts with the interactive object in a passive manner.

所述互動物件可以透過終端設備進行展示，所述終端設備可以是電視機、帶有顯示功能的一體機、投影儀、虛擬實境（Virtual Reality，VR）設備、增強實境（Augmented Reality，AR）設備等，本公開並不限定終端設備的具體形式。The interactive object may be displayed through terminal equipment, which may be a television, an all-in-one machine with a display function, a projector, a virtual reality (Virtual Reality, VR) device, and an augmented reality (Augmented Reality, AR) device. ) Device, etc., the present disclosure does not limit the specific form of the terminal device.

圖1示出本公開至少一個實施例提出的顯示裝置。如圖1所示，該顯示裝置具有透明顯示幕，在透明顯示幕上可以顯示立體畫面，以呈現出具有立體效果的虛擬場景以及互動物件。例如圖1中透明顯示幕顯示的互動物件包括虛擬卡通人物。在一些實施例中，本公開中所述的終端設備也可以為上述具有透明顯示幕的顯示裝置，顯示裝置中配置有記憶體和處理器，記憶體用於存儲可在處理器上運行的電腦指令，所述處理器用於在執行所述電腦指令時實現本公開提供的互動物件的驅動方法，以驅動透明顯示幕中顯示的互動物件對目標物件進行交流或回應。FIG. 1 shows a display device proposed by at least one embodiment of the present disclosure. As shown in FIG. 1, the display device has a transparent display screen, and a three-dimensional picture can be displayed on the transparent display screen to present a virtual scene and interactive objects with a three-dimensional effect. For example, the interactive objects displayed on the transparent display screen in FIG. 1 include virtual cartoon characters. In some embodiments, the terminal device described in the present disclosure may also be the above-mentioned display device with a transparent display screen. The display device is configured with a memory and a processor, and the memory is used to store a computer that can run on the processor. Instructions, the processor is used to implement the method for driving an interactive object provided by the present disclosure when executing the computer instruction, so as to drive the interactive object displayed on the transparent display screen to communicate or respond to the target object.

在一些實施例中，回應於用於驅動互動物件輸出語音的聲音驅動資料，互動物件可以對目標物件發出指定語音。終端設備可以根據終端設備周邊目標物件的動作、表情、身份、偏好等，生成聲音驅動資料，以驅動互動物件透過發出指定語音進行回應，從而為目標物件提供擬人化的服務。需要說明的是，聲音驅動資料也可以透過其他方式生成，比如，由伺服器生成併發送給終端設備。In some embodiments, in response to the sound-driven data used to drive the interactive object to output a voice, the interactive object can emit a specified voice to the target object. The terminal device can generate sound-driven data according to the actions, expressions, identities, preferences, etc. of the target object around the terminal device to drive the interactive object to respond by sending a specified voice, thereby providing anthropomorphic services for the target object. It should be noted that the sound-driven data can also be generated in other ways, for example, generated by a server and sent to the terminal device.

在互動物件與目標物件的互動過程中，根據該聲音驅動資料驅動互動物件發出指定語音時，可能無法驅動所述互動物件做出與該指定語音同步的面部動作，使得互動物件在發出語音時呆板、不自然，影響了目標物件與互動物件的互動體驗。基於此，本公開至少一個實施例提出一種互動物件的驅動方法，以提升目標物件與互動物件進行互動的體驗。During the interaction between the interactive object and the target object, when the interactive object is driven to emit a specified voice according to the sound-driven data, the interactive object may not be able to drive the interactive object to make facial movements synchronized with the specified voice, making the interactive object dull when the voice is emitted , Unnatural, affecting the interactive experience of the target object and the interactive object. Based on this, at least one embodiment of the present disclosure proposes a method for driving an interactive object, so as to improve the interactive experience between the target object and the interactive object.

圖2示出根據本公開至少一個實施例的互動物件的驅動方法的流程圖，如圖2所示，所述方法包括步驟201~步驟203。FIG. 2 shows a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 2, the method includes steps 201 to 203.

步驟201，獲取文本資料對應的音素序列。Step 201: Obtain the phoneme sequence corresponding to the text data.

所述文本資料可以是用於驅動所述互動物件的驅動資料。該驅動資料可以是伺服器或終端設備根據與互動物件進行互動的目標物件的動作、表情、身份、偏好等生成的驅動資料，也可以是終端設備從內部記憶體調用的驅動資料。本公開對於該文本資料的獲取方式不進行限制。The text data may be driving data used to drive the interactive object. The driving data can be the driving data generated by the server or the terminal device according to the action, expression, identity, preference, etc. of the target object interacting with the interactive object, or it can be the driving data called by the terminal device from the internal memory. This disclosure does not restrict the method of obtaining the textual information.

在本公開實施例中，可以根據文本所包含的語素，獲得所述語素所對應的音素，從而獲得文本對應的音素序列。其中，音素是根據語音的自然屬性劃分出來的最小語音單元，真實人物一個發音動作能夠形成一個音素。In the embodiment of the present disclosure, the phoneme corresponding to the morpheme can be obtained according to the morphemes contained in the text, so as to obtain the phoneme sequence corresponding to the text. Among them, the phoneme is the smallest phonetic unit divided according to the natural attributes of the speech, and a pronunciation action of a real person can form a phoneme.

在一實施例中，回應於所述文本為中文文本，可以透過將中文文本文字轉換成拼音，利用拼音生成音素序列，並生成每個音素的時間戳記。In one embodiment, in response to the text being a Chinese text, the Chinese text can be converted into pinyin, using pinyin to generate a phoneme sequence, and generating a time stamp for each phoneme.

步驟202，獲取與所述音素序列匹配的、互動物件的至少一個局部區域的控制參數值。Step 202: Obtain a control parameter value of at least one partial region of an interactive object that matches the phoneme sequence.

所述局部區域是對互動物件的整體（包括面部和/或身體）進行劃分而得到的。面部的一個或多個局部區域的控制可以對應於互動物件的一系列面部表情或動作，例如眼部區域的控制可以對應於互動物件睜眼、閉眼、眨眼、視角變換等面部動作；又例如嘴部區域的控制可以對應於互動物件閉嘴、不同程度的張嘴等面部動作。而身體的其中一個或多個局部區域的控制可以對應於互動物件的一系列肢體動作，例如腿部區域的控制可以對應於互動物件走路、跳躍、踢腿等動作。The partial area is obtained by dividing the entire interactive object (including the face and/or body). The control of one or more partial areas of the face can correspond to a series of facial expressions or actions of the interactive object. For example, the control of the eye area can correspond to the facial actions of the interactive object such as opening, closing, blinking, and changing the perspective; another example is the mouth. The control of the part area can correspond to facial actions such as closing the mouth of the interactive object and opening the mouth to different degrees. The control of one or more local areas of the body may correspond to a series of physical actions of the interactive object. For example, the control of the leg area may correspond to the actions of the interactive object such as walking, jumping, and kicking.

所述互動物件的局部區域的控制參數包括所述局部區域的姿態控制向量。每個局部區域的姿態控制向量用於驅動所述互動物件的所述局部區域進行動作。不同的姿態控制向量值對應於不同的動作或者動作幅度。例如，對於嘴部區域的姿態控制向量，其一組姿態控制向量值可以使所述互動物件的嘴部微張，而另一組姿態控制向量值可以使所述互動物件的嘴部大張。透過以不同的姿態控制向量值來驅動所述互動物件，可以使相應的局部區域做出不同動作或者不同幅度的動作。The control parameter of the local area of the interactive object includes the attitude control vector of the local area. The attitude control vector of each local area is used to drive the local area of the interactive object to move. Different posture control vector values correspond to different motions or motion ranges. For example, for the posture control vector of the mouth area, one set of posture control vector values can make the mouth of the interactive object slightly open, and another set of posture control vector values can make the mouth of the interactive object open wider. By using different posture control vector values to drive the interactive object, the corresponding local area can perform different actions or actions with different amplitudes.

局部區域可以根據需要控制的互動物件的動作進行選擇，例如在需要控制所述互動物件面部以及肢體同時進行動作時，可以獲取全部局部區域的姿態控制向量值；在需要控制所述互動物件的表情時，則可以獲取所述面部所對應的局部區域的姿態控制向量值。The local area can be selected according to the action of the interactive object that needs to be controlled. For example, when the face and limbs of the interactive object need to be controlled to move at the same time, the posture control vector value of all the local areas can be obtained; when the expression of the interactive object needs to be controlled At this time, the posture control vector value of the local area corresponding to the face can be obtained.

在本公開實施例中，可以透過對所述音素序列進行特徵編碼，確定特徵編碼所對應的控制參數值，從而確定所述音素序列對應的控制參數值。不同的編碼方式可以體現所述音素序列的不同特徵。對於具體的編碼方式本公開不進行限制。In the embodiment of the present disclosure, the control parameter value corresponding to the characteristic code can be determined by performing feature encoding on the phoneme sequence, thereby determining the control parameter value corresponding to the phoneme sequence. Different encoding methods can reflect different characteristics of the phoneme sequence. The present disclosure does not limit the specific encoding method.

在本公開實施例中，可以預先建立所述文本資料對應的音素序列的特徵編碼與互動物件的控制參數值的對應關係，從而透過文本資料，可獲得對應的控制參數值。獲取與所述文本資料的音素序列的特徵編碼匹配的控制參數值的具體方法容後詳述。In the embodiment of the present disclosure, the corresponding relationship between the feature code of the phoneme sequence corresponding to the text data and the control parameter value of the interactive object can be established in advance, so that the corresponding control parameter value can be obtained through the text data. The specific method for obtaining the control parameter value matching the feature code of the phoneme sequence of the text data will be described in detail later.

步驟203，根據獲取的所述控制參數值控制所述互動物件的姿態。Step 203: Control the posture of the interactive object according to the acquired control parameter value.

其中，所述控制參數值，例如姿態控制向量值，是與所述文本資料所包含的音素序列相匹配的。例如，在根據所述文本資料控制展示所述互動物件的顯示裝置展示文本，和/或根據所述文本資料對應的音素序列控制所述顯示裝置輸出語音時，互動物件所做出的姿態與所輸出的語音和/或所展示的文本是同步的，從而給目標物件一種所述互動物件正在說話的感覺。Wherein, the control parameter value, such as the attitude control vector value, matches the phoneme sequence contained in the text data. For example, when the display device displaying the interactive object is controlled to display text according to the text data, and/or the display device is controlled to output speech according to the phoneme sequence corresponding to the text data, the gesture and the gesture made by the interactive object are The output voice and/or the displayed text are synchronized, thereby giving the target object a feeling that the interactive object is speaking.

在本公開實施例中，透過獲取文本資料對應的音素序列，並獲取與所述音素序列匹配的互動物件的至少一個局部區域的控制參數值，來控制所述互動物件的姿態，可以使互動物件做出與文本資料所對應的音素匹配的姿態，該姿態包括面部姿態和肢體姿態，從而使目標物件產生互動物件正在說出文本內容的感覺，提升了目標物件的互動體驗。In the embodiment of the present disclosure, by obtaining the phoneme sequence corresponding to the text data, and obtaining the control parameter value of at least one partial area of the interactive object matching the phoneme sequence, the posture of the interactive object can be controlled, so that the interactive object A gesture that matches the phoneme corresponding to the text data is made. The gesture includes facial gestures and body gestures, so that the target object has the feeling that the interactive object is speaking the text content, and the interactive experience of the target object is improved.

在一些實施例中，所述方法應用於伺服器，包括本機伺服器或雲端伺服器等，所述伺服器對於文本資料進行處理，生成所述互動物件的控制參數值，並根據所述控制參數值利用三維渲染引擎進行渲染，得到所述互動物件的動畫。所述伺服器可以將所述動畫發送至終端進行展示來對目標物件進行交流或回應，還可以將所述動畫發送至雲端，以使終端能夠從雲端獲取所述動畫來對目標物件進行交流或回應。在伺服器生成所述互動物件的控制參數值後，還可以將所述控制參數值發送至終端，以使終端完成渲染、生成動畫、進行展示的過程。In some embodiments, the method is applied to a server, including a local server or a cloud server. The server processes text data to generate control parameter values of the interactive object, and according to the control The parameter value is rendered using a three-dimensional rendering engine to obtain an animation of the interactive object. The server may send the animation to the terminal for display to communicate or respond to the target object, and may also send the animation to the cloud, so that the terminal can obtain the animation from the cloud to communicate or communicate with the target object. Response. After the server generates the control parameter value of the interactive object, it can also send the control parameter value to the terminal, so that the terminal completes the process of rendering, generating animation, and displaying.

在一些實施例中，所述方法應用於終端，所述終端對於文本資料進行處理，生成所述互動物件的控制參數值，並根據所述控制參數值利用三維渲染引擎進行渲染，得到所述互動物件的動畫，所述終端可以展示所述動畫以對目標物件進行交流或回應。In some embodiments, the method is applied to a terminal, and the terminal processes text data, generates control parameter values of the interactive object, and renders the interactive object using a three-dimensional rendering engine according to the control parameter value to obtain the interactive The animation of the object, the terminal can display the animation to communicate or respond to the target object.

在一些實施例中，可以根據所述文本資料控制展示所述互動物件的顯示裝置展示文本，和/或根據所述文本資料對應的音素序列控制所述顯示裝置輸出語音。In some embodiments, the display device displaying the interactive object may be controlled to display text according to the text data, and/or the display device may be controlled to output speech according to the phoneme sequence corresponding to the text data.

在本公開實施例中，由於所述控制參數值與所述文本資料的音素序列相匹配，因此根據所述文本資料輸出的語音和/或文本，與根據所述控制參數值控制互動物件的姿態是同步進行的情況下，互動物件所做出的姿態與所輸出的語音和/或所展示的文本是同步的，給目標物件以所述互動物件正在說話的感覺。In the embodiment of the present disclosure, since the control parameter value matches the phoneme sequence of the text data, the voice and/or text outputted according to the text data is different from controlling the gesture of the interactive object according to the control parameter value. In the case of synchronization, the gesture made by the interactive object is synchronized with the output voice and/or the displayed text, giving the target object the feeling that the interactive object is speaking.

在一些實施例中，所述互動物件的至少一個局部區域的控制參數包括姿態控制向量，所述姿態控制向量可以透過以下方式獲得。In some embodiments, the control parameter of the at least one partial area of the interactive object includes an attitude control vector, and the attitude control vector can be obtained in the following manner.

首先，對所述音素序列進行特徵編碼，獲得所述音素序列對應的編碼序列。為了與後續提到的編碼序列進行區分，將所述文本資料的音素序列對應的編碼序列稱為第一編碼序列，即透過對所述音素序列進行特徵編碼，獲得第一編碼序列。First, feature encoding is performed on the phoneme sequence to obtain the encoding sequence corresponding to the phoneme sequence. In order to distinguish it from the coding sequence mentioned later, the coding sequence corresponding to the phoneme sequence of the text data is called the first coding sequence, that is, the first coding sequence is obtained by performing feature coding on the phoneme sequence.

針對所述音素序列包含的多種音素，生成每種音素對應的子編碼序列。For multiple phonemes included in the phoneme sequence, a sub-coding sequence corresponding to each phoneme is generated.

在一個示例中，檢測各時間點上是否對應有第一音素，所述第一音素為所述多種音素中的任一種；將有所述第一音素的時間點上的編碼值設置為第一數值，將沒有所述第一音素的時間點上的編碼值設置為第二數值，在對各個時間點上的編碼值進行賦值之後可得到第一音素對應的編碼序列。例如，可以將有所述第一音素的時間點上的編碼值設置為1，將沒有所述第一音素的時間點上的編碼值設置為0。即，針對所述音素序列包含的多種音素中的每種音素，檢測各時間點上是否對應有該音素；將有所述音素的時間點上的編碼值設置為第一數值，將沒有所述音素的時間點上的編碼值設置為第二數值，在對各個時間點上的編碼值進行賦值之後可得到該音素對應的編碼序列。本領域具有通常知識者應當理解，上述編碼值的設置僅為示例，也可以將編碼值設置為其他值，本公開對此不進行限制。In an example, it is detected whether there is a first phoneme corresponding to each time point, and the first phoneme is any one of the multiple phonemes; the encoding value at the time point where the first phoneme is present is set as the first phoneme. Numerical value, the coding value at the time point without the first phoneme is set to the second numerical value, and the coding sequence corresponding to the first phoneme can be obtained after assigning the coding value at each time point. For example, the code value at the time point when the first phoneme is present may be set to 1, and the code value at the time point when the first phoneme is not present may be set to 0. That is, for each phoneme of the multiple phonemes included in the phoneme sequence, it is detected whether the phoneme corresponds to the phoneme at each time point; the encoding value at the time point where the phoneme is present is set to the first value, and the phoneme is not The encoding value at the time point of the phoneme is set to the second value, and the encoding sequence corresponding to the phoneme can be obtained after assigning the encoding value at each time point. Those with ordinary knowledge in the art should understand that the setting of the encoding value described above is only an example, and the encoding value may also be set to other values, which is not limited in the present disclosure.

根據所述多種音素分別對應的子編碼序列，獲得所述音素序列對應的第一編碼序列。The first coding sequence corresponding to the phoneme sequence is obtained according to the respective sub-coding sequences corresponding to the multiple phonemes.

在一個示例中，對於第一音素對應的子編碼序列，可利用高斯濾波器對所述第一音素在時間上的連續值進行高斯卷積操作，以對特徵編碼所對應的矩陣進行濾波，平滑每一個音素轉換時，嘴部區域過渡的動作。In an example, for the sub-coding sequence corresponding to the first phoneme, a Gaussian filter may be used to perform a Gaussian convolution operation on the continuous values of the first phoneme in time, so as to filter and smooth the matrix corresponding to the feature encoding. The transition of the mouth area when each phoneme is converted.

圖3示出了本公開至少一個實施例提出的互動物件的驅動方法的示意圖。如圖3所示，音素序列310含音素j、i1、j、ie4（為簡潔起見，只示出部分音素），針對每種音素j、i1、ie4分別獲得對應的子編碼序列321、322、323。在各個子編碼序列中，將有所述音素的時間點上對應的編碼值設置為第一數值（例如為1），將沒有所述音素的時間點上對應的編碼值設置為第二數值（例如為0）。以子編碼序列321為例，在音素序列310中有音素j的時間點上，子編碼序列321的值為第一數值1，在沒有音素j的時間點上，子編碼序列321的值為第二數值0。所有子編碼序列構成第一編碼序列320。FIG. 3 shows a schematic diagram of a driving method of an interactive object proposed by at least one embodiment of the present disclosure. As shown in Figure 3, the phoneme sequence 310 contains phonemes j, i1, j, and ie4 (for brevity, only some phonemes are shown), and corresponding sub-coding sequences 321, 322 are obtained for each phoneme j, i1, and ie4. , 323. In each sub-coding sequence, the corresponding code value at the time point with the phoneme is set to the first value (for example, 1), and the corresponding code value at the time point without the phoneme is set to the second value ( For example, 0). Taking the sub-coding sequence 321 as an example, at the time point when there is phoneme j in the phoneme sequence 310, the value of the sub-coding sequence 321 is the first value 1, and at the time point when there is no phoneme j, the value of the sub-coding sequence 321 is the first value. The two value is 0. All the sub-coding sequences constitute the first coding sequence 320.

接下來，根據所述第一編碼序列，獲取至少一個音素對應的特徵編碼。Next, according to the first coding sequence, a feature code corresponding to at least one phoneme is obtained.

根據音素j、i1、ie4分別對應的子編碼序列321、322、323的編碼值，以及該三個子編碼序列中對應的音素的持續時間，也即在子編碼序列321中j的持續時間、在子編碼序列322中i1的持續時間、在子編碼序列323中ie4的持續時間，可以獲得子編碼序列321、322、323的特徵資訊。According to the encoding values of the sub-coding sequences 321, 322, and 323 corresponding to phonemes j, i1, and ie4, and the duration of the corresponding phonemes in the three sub-coding sequences, that is, the duration of j in the sub-coding sequence 321, From the duration of i1 in the sub-coding sequence 322 and the duration of ie4 in the sub-coding sequence 323, the characteristic information of the sub-coding sequences 321, 322, and 323 can be obtained.

在一個示例中，可以利用高斯濾波器分別對子編碼序列321、322、323中的音素j、i1、ie4在時間上的連續值進行高斯卷積操作，以對特徵編碼進行平滑，得到平滑後的第一編碼序列330。也即，透過高斯濾波器對於音素在時間上的連續值進行高斯卷積操作，使得各個編碼序列中編碼值從第二數值到第一數值或者從第一數值到第二數值的變化階段變得平滑。例如，編碼序列的值除了0和1也呈現出中間狀態的值，例如0.2、0.3等等，而根據這些中間狀態的值所獲取的姿態控制向量，使得互動人物的動作過度、表情變化更加平緩、自然，提高了目標物件的互動體驗。In an example, a Gaussian filter may be used to perform Gaussian convolution operations on the consecutive values of phonemes j, i1, and ie4 in the sub-encoding sequences 321, 322, and 323, respectively, to smooth the feature encoding to obtain the smoothed的第一coding sequence 330. That is, the Gaussian convolution operation is performed on the continuous value of the phoneme in time through the Gaussian filter, so that the code value in each code sequence changes from the second value to the first value or from the first value to the second value. smooth. For example, in addition to 0 and 1, the values of the coding sequence also present intermediate state values, such as 0.2, 0.3, etc., and the posture control vector obtained according to the values of these intermediate states makes the interactive characters excessively move and change their expressions more smoothly , Naturally, improve the interactive experience of the target object.

在一些實施例中，可以透過在所述第一編碼序列上進行滑窗的方式獲取至少一個音素對應的特徵編碼。其中，所述第一編碼序列可以是經過高斯卷積操作後的編碼序列。In some embodiments, the feature code corresponding to at least one phoneme can be obtained by performing a sliding window on the first code sequence. Wherein, the first coding sequence may be a coding sequence after a Gaussian convolution operation.

以設定長度的時間視窗和設定步長，對所述編碼序列進行滑窗，將所述時間視窗內的特徵編碼作為所對應的至少一個音素的特徵編碼，在完成滑窗後，根據得到的多個特徵編碼，可以獲得第二編碼序列。由於各音素的持續時間不同，且各音素的持續時間與時間窗口的長度所成比例不同，故時間視窗內的特徵編碼所對應的音素數量根據時間視窗的位置可能為1、2甚至更多。如圖3所示，透過在第一編碼序列320或者平滑後的第一編碼序列330上，滑動設定長度的時間視窗，分別獲得特徵編碼1、特徵編碼2、特徵編碼3，以此類推，在遍歷第一編碼序列後，獲得特徵編碼1、特徵編碼2、特徵編碼3、…、特徵編碼M，從而得到了第二編碼序列340。其中，M為正整數，其數值根據第一編碼序列的長度、時間視窗的長度以及時間視窗滑動的步長確定。A sliding window is performed on the coding sequence with a time window of a set length and a set step size, and the feature code in the time window is used as the feature code of the corresponding at least one phoneme. After the sliding window is completed, according to the obtained multiple A feature code can be used to obtain a second code sequence. Since the duration of each phoneme is different, and the duration of each phoneme is different in proportion to the length of the time window, the number of phonemes corresponding to the feature code in the time window may be 1, 2 or even more depending on the position of the time window. As shown in FIG. 3, by sliding a time window of a set length on the first coding sequence 320 or the smoothed first coding sequence 330, feature code 1, feature code 2, feature code 3 are obtained respectively, and so on, in After traversing the first coding sequence, feature code 1, feature code 2, feature code 3,..., Feature code M are obtained, thereby obtaining a second code sequence 340. Wherein, M is a positive integer, and its value is determined according to the length of the first coding sequence, the length of the time window, and the sliding step of the time window.

最後，獲取所述特徵編碼對應的所述互動物件的至少一個局部區域的姿態控制向量。Finally, the attitude control vector of at least one partial area of the interactive object corresponding to the feature code is acquired.

根據特徵編碼1、特徵編碼2、特徵編碼3、…、特徵編碼M，分別可以獲得相應的姿態控制向量1、姿態控制向量2、姿態控制向量3、…、姿態控制向量M，從而獲得姿態控制向量的序列350。According to feature code 1, feature code 2, feature code 3,..., feature code M, the corresponding attitude control vector 1, attitude control vector 2, attitude control vector 3,..., attitude control vector M can be obtained respectively, thereby obtaining attitude control 350 of the sequence of vectors.

姿態控制向量的序列350與第二編碼序列340在時間上是對齊的，由於所述第二編碼序列中的每個特徵編碼是根據音素序列中的至少一個音素獲得的，因此姿態控制向量的序列350中的每個控制向量同樣是根據音素序列中的至少一個音素獲得的。在播放文本資料所對應的音素序列的同時，根據所述姿態控制向量的序列驅動所述互動物件做出動作，即能夠實現驅動互動物件發出文本內容所對應的聲音的同時，做出與聲音同步的動作，給目標物件以所述互動物件正在說話的感覺，提升了目標物件的互動體驗。The sequence 350 of the attitude control vector and the second coding sequence 340 are aligned in time. Since each feature code in the second coding sequence is obtained according to at least one phoneme in the phoneme sequence, the sequence of the attitude control vector Each control vector in 350 is also obtained based on at least one phoneme in the phoneme sequence. While playing the phoneme sequence corresponding to the text data, the interactive object is driven to make an action according to the sequence of the attitude control vector, that is, the interactive object can be driven to emit the sound corresponding to the text content while being synchronized with the sound The action of, gives the target object the feeling that the interactive object is speaking, and enhances the interactive experience of the target object.

假設在第一個時間視窗的設定時刻開始輸出特徵編碼，可以將在所述設定時刻之前的姿態控制向量值設置為預設值，也即在剛開始播放音素序列時，使所述互動物件做出預設的動作，在所述設定時刻之後開始利用根據第一編碼序列所得到的姿態控制向量的序列驅動所述互動物件做出動作。以圖3為例，在t0時刻開始輸出特徵編碼1，在t0時刻之前對應的是默認姿態控制向量。Assuming that the feature code starts to be output at the set time of the first time window, the attitude control vector value before the set time can be set to a preset value, that is, when the phoneme sequence is just started to be played, the interactive object is made to do A preset action is generated, and after the set time, the sequence of the attitude control vector obtained according to the first coding sequence is used to drive the interactive object to make an action. Taking Figure 3 as an example, the feature code 1 starts to be output at time t0, and before time t0 corresponds to the default attitude control vector.

所述時間視窗的長度與所述特徵編碼所包含的信息量相關。在時間視窗所含的信息量較大的情況下，經所述迴圈神經網路處理會輸出較均勻的結果。若時間視窗的長度過大，可能導致互動物件說話時的表情無法與部分文字對應；若時間視窗的長度過小，可能導致互動對象說話時的表情顯得生硬。因此，時間視窗的時長需要根據文本資料所對應的音素持續的最小時間來確定，以使驅動所述互動物件所做出的動作與聲音具有更強的關聯性。The length of the time window is related to the amount of information contained in the feature code. In the case where the amount of information contained in the time window is relatively large, the processing of the loop neural network will output a more uniform result. If the length of the time window is too large, the expression of the interactive object may not correspond to part of the text; if the length of the time window is too small, the expression of the interactive object may appear rigid when speaking. Therefore, the duration of the time window needs to be determined according to the minimum duration of the phoneme corresponding to the text data, so that the actions and sounds made by driving the interactive objects have a stronger correlation.

時間視窗滑動的步長與獲取姿態控制向量的時間間隔（頻率）相關，也即與驅動互動物件做出動作的頻率相關。可以根據實際的互動場景來設置所述時間視窗的長度以及步長，以使互動物件做出的表情和動作與聲音的關聯性更強，並且更加生動、自然。The sliding step of the time window is related to the time interval (frequency) of obtaining the attitude control vector, that is, it is related to the frequency of driving the interactive object to make an action. The length and step length of the time window can be set according to the actual interactive scene, so that the expressions and actions made by the interactive object are more closely related to the sound, and are more vivid and natural.

在一些實施例中，在所述音素序列中音素之間的時間間隔大於設定閾值的情況下，根據所述局部區域的設定姿態控制向量，驅動所述互動物件做出動作。也即，在互動人物說話停頓較長的時候，則驅動所述互動物件做出設定的動作。例如，在輸出的語音停頓較長時，可以使互動物件做出微笑的表情，或者身體微微的擺動，以避免在停頓較長時互動物件面無表情地直立，從而使得互動物件說話的過程更加自然、流暢，提高了目標物件與互動物件的互動感受。In some embodiments, when the time interval between phonemes in the phoneme sequence is greater than a set threshold, the interactive object is driven to perform an action according to the set attitude control vector of the local area. That is, when the interactive character pauses for a long time, the interactive object is driven to make a set action. For example, when the output voice pauses for a long time, the interactive object can be made to make a smiling expression or slightly swing the body to avoid the interactive object standing upright without expression during the long pause, thereby making the interactive object speak more Natural and smooth, it improves the interactive experience between the target object and the interactive object.

在一些實施例中，可以透過將所述特徵編碼輸入至預先訓練的迴圈神經網路，所述迴圈神經網路根據所述第一編碼序列，輸出與所述特徵編碼對應的所述互動物件的至少一個局部區域的姿態控制向量。由於所述迴圈神經網路是一種時間遞迴神經網路，其可以學習所輸入的特徵編碼的歷史資訊，根據所述特徵編碼序列輸出所述至少一個局部區域的姿態控制向量。其中，所述特徵編碼序列包括第一編碼序列和第二編碼序列。所述迴圈神經網路例如可以是長短期記憶網路（Long Short-Term Memory，LSTM）。In some embodiments, the feature code can be input to a pre-trained loop neural network, and the loop neural network outputs the interaction corresponding to the feature code according to the first coding sequence. The attitude control vector of at least one local area of the object. Since the loop neural network is a time recurrent neural network, it can learn the historical information of the input feature code, and output the attitude control vector of the at least one local area according to the feature code sequence. Wherein, the characteristic coding sequence includes a first coding sequence and a second coding sequence. The loop neural network may be, for example, a long short-term memory network (Long Short-Term Memory, LSTM).

在本公開實施例中，利用預先訓練的迴圈神經網路獲取所述特徵編碼對應的所述互動物件的至少一個局部區域的姿態控制向量，將特徵編碼的歷史特徵資訊和當前特徵資訊進行融合，從而使得歷史姿態控制向量對當前姿態控制向量的變化產生影響，使得互動人物的表情變化和肢體動作更加平緩、自然。In the embodiment of the present disclosure, a pre-trained loop neural network is used to obtain the attitude control vector of at least one local area of the interactive object corresponding to the feature code, and the historical feature information of the feature code is merged with the current feature information , So that the historical attitude control vector has an impact on the change of the current attitude control vector, making the expression changes and body movements of the interactive characters more smooth and natural.

在一些實施例中，可以透過以下方式對所述迴圈神經網路進行訓練。In some embodiments, the loop neural network can be trained in the following manner.

首先，獲取特徵編碼樣本，所述特徵編碼樣本標注有真實值，所述真實值為所述互動物件的至少一個局部區域的姿態控制向量值。First, a feature code sample is obtained, the feature code sample is marked with a true value, and the true value is a posture control vector value of at least one partial area of the interactive object.

在獲得了特徵編碼樣本後，根據所述特徵編碼樣本對初始迴圈神經網路進行訓練，在網路損失的變化滿足收斂條件後訓練得到所述迴圈神經網路，其中，所述網路損失包括所述迴圈神經網路預測得到的所述至少一個局部區域的姿態控制向量值與所述真實值之間的差異。After obtaining the feature code samples, the initial loop neural network is trained according to the feature code samples, and the loop neural network is trained after the change of the network loss satisfies the convergence condition, wherein the network The loss includes the difference between the attitude control vector value of the at least one local area predicted by the loop neural network and the actual value.

在一些實施例中，可以透過以下方法獲取特徵編碼樣本。In some embodiments, feature code samples can be obtained through the following methods.

首先，獲取一角色發出語音的影片段，並根據所述影片段獲取多個包含所述角色的第一圖像幀。例如，可以獲取一真實人物正在說話的影片段。First, obtain a film segment in which a character speaks, and obtain a plurality of first image frames containing the character according to the film segment. For example, a video segment in which a real person is speaking can be obtained.

接下來，從所述影片段中提取相應的語音段，根據所述語音段獲取樣本音素序列，並對所述樣本音素序列進行特徵編碼。其中，對所述樣本音素序列進行編碼的方式與上述的文本資料對應的音素序列的編碼方式相同。Next, extract a corresponding speech segment from the film segment, obtain a sample phoneme sequence according to the speech segment, and perform feature encoding on the sample phoneme sequence. Wherein, the method of encoding the sample phoneme sequence is the same as the encoding method of the phoneme sequence corresponding to the text data described above.

根據對所述樣本音素序列進行特徵編碼所得到的樣本編碼序列，獲取與所述第一圖像幀對應的至少一個音素的特徵編碼。其中，所述至少一個音素可以是在所述第一圖像幀出現時間的設定範圍內的音素。According to the sample code sequence obtained by performing feature coding on the sample phoneme sequence, the feature code of at least one phoneme corresponding to the first image frame is obtained. Wherein, the at least one phoneme may be a phoneme within a set range of the appearance time of the first image frame.

接著，將所述第一圖像幀轉化為包含所述互動物件的第二圖像幀，獲取所述第二圖像幀對應的至少一個局部區域的姿態控制向量值。其中，該姿態控制向量值可以包括所有局部區域的姿態控制向量值，也可以包括其中部分的局部區域的姿態控制向量值。Then, the first image frame is converted into a second image frame containing the interactive object, and the attitude control vector value of at least one local area corresponding to the second image frame is obtained. Wherein, the attitude control vector value may include the attitude control vector value of all the local areas, and may also include the attitude control vector value of some of the local areas.

以所述第一圖像幀為包含真實人物的圖像幀為例，可以將該真實人物的圖像幀轉換為包含互動物件所表示的形象的第二圖像幀，並且所述真實人物的各個局部區域的姿態控制向量與所述互動物件的各個局部區域的姿態控制向量是對應的，從而可以獲取第二圖像幀中互動物件的各個局部區域的姿態控制向量。Taking the first image frame as an image frame containing a real person as an example, the image frame of the real person can be converted into a second image frame containing the image represented by the interactive object, and the image of the real person The posture control vector of each local area corresponds to the posture control vector of each local area of the interactive object, so that the posture control vector of each local area of the interactive object in the second image frame can be obtained.

最後，根據所述姿態控制向量值對上述所獲得的所述第一圖像幀對應的至少一個音素的特徵編碼進行標注，獲得特徵編碼樣本。Finally, the feature code of at least one phoneme corresponding to the first image frame obtained above is annotated according to the attitude control vector value to obtain feature code samples.

在本公開實施例中，透過將一角色的影片段，拆分為對應的多個第一圖像幀和語音段，並透過將包含真實人物的第一圖像幀轉化為包含互動物件的第二圖像幀來獲取音素的特徵編碼對應的姿態控制向量，使得特徵編碼與姿態控制向量的對應性較好，從而獲得高品質的特徵編碼樣本，使得互動物件的動作更接近於對應角色的真實動作。In the embodiment of the present disclosure, the movie segment of a character is split into a plurality of corresponding first image frames and voice segments, and the first image frame containing the real person is transformed into the first image frame containing the interactive object. Two image frames are used to obtain the attitude control vector corresponding to the feature code of the phoneme, so that the feature code and the attitude control vector correspond better, so as to obtain high-quality feature code samples, making the action of the interactive object closer to the reality of the corresponding character action.

圖4示出根據本公開至少一個實施例的互動物件的驅動裝置的結構示意圖，如圖4所示，該裝置可以包括：第一獲取單元401，用於獲取文本資料對應的音素序列；第二獲取單元402，用於獲取與所述音素序列匹配的互動物件的至少一個局部區域的控制參數值；驅動單元403，用於根據獲取的所述控制參數值控制所述互動物件的姿態。FIG. 4 shows a schematic structural diagram of a driving device for an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 4, the device may include: a first obtaining unit 401, configured to obtain a phoneme sequence corresponding to a text data; and second The obtaining unit 402 is configured to obtain the control parameter value of at least one partial area of the interactive object matching the phoneme sequence; the driving unit 403 is configured to control the posture of the interactive object according to the obtained control parameter value.

在一些實施例中，所述裝置更包括輸出單元，用於根據所述文本資料控制展示所述互動物件的顯示裝置展示文本，和/或根據所述文本資料對應的音素序列控制所述顯示裝置輸出語音。In some embodiments, the device further includes an output unit for controlling the display device displaying the interactive object to display text according to the text data, and/or controlling the display device according to the phoneme sequence corresponding to the text data Output speech.

在一些實施例中，所述第二獲取單元具體用於：對所述音素序列進行特徵編碼，獲得所述音素序列對應的第一編碼序列；根據所述第一編碼序列，獲取至少一個音素對應的特徵編碼；獲取所述特徵編碼對應的所述互動物件的至少一個局部區域的姿態控制向量。In some embodiments, the second obtaining unit is specifically configured to: perform feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; and obtain at least one phoneme corresponding to the phoneme sequence according to the first coding sequence. The feature code; obtaining the attitude control vector of at least one partial area of the interactive object corresponding to the feature code.

在一些實施例中，在對所述音素序列進行特徵編碼，獲得所述音素序列對應的第一編碼序列時，所述第二獲取單元具體用於：針對所述音素序列包含的多種音素，生成每種音素對應的子編碼序列；根據所述多種音素分別對應的子編碼序列，獲得所述音素序列對應的第一編碼序列。In some embodiments, when performing feature encoding on the phoneme sequence to obtain the first encoding sequence corresponding to the phoneme sequence, the second acquiring unit is specifically configured to: generate, for multiple phonemes included in the phoneme sequence, A sub-coding sequence corresponding to each phoneme; and obtaining the first coding sequence corresponding to the phoneme sequence according to the sub-coding sequences corresponding to the multiple phonemes respectively.

在一些實施例中，在針對所述音素序列包含的多種音素，生成每種音素對應的子編碼序列時，所述第二獲取單元具體用於：檢測各時間點上是否對應有第一音素，所述第一音素為所述多種音素中的任一種；透過將有所述第一音素的時間點上的編碼值設置為第一數值，將沒有所述第一音素的時間點上的編碼值設置為第二數值，得到所述第一音素對應的子編碼序列。In some embodiments, when generating the sub-coding sequence corresponding to each phoneme for the multiple phonemes included in the phoneme sequence, the second acquiring unit is specifically configured to: detect whether there is a first phoneme corresponding to each time point, The first phoneme is any one of the multiple phonemes; by setting the code value at the time point when the first phoneme is present to the first value, the code value at the time point without the first phoneme is set Set to the second value to obtain the sub-coding sequence corresponding to the first phoneme.

在一些實施例中，所述裝置更包括濾波單元，用於對於所述多種音素中的每種音素對應的所述子編碼序列，利用高斯濾波器對所述音素在時間上的連續值進行高斯卷積操作。在一實施例中，對於第一音素對應的子編碼序列，利用高斯濾波器對所述第一音素在時間上的連續值進行高斯卷積操作，所述第一音素為所述多種音素中的任一種。In some embodiments, the device further includes a filtering unit for performing a Gaussian filter on the continuous value of the phoneme in time by using a Gaussian filter for the sub-coding sequence corresponding to each phoneme of the multiple phonemes. Convolution operation. In an embodiment, for the sub-coding sequence corresponding to the first phoneme, a Gaussian filter is used to perform a Gaussian convolution operation on the continuous values of the first phoneme in time, and the first phoneme is one of the multiple phonemes. Any kind.

在一些實施例中，在根據所述第一編碼序列，獲取至少一個音素對應的特徵編碼時，所述第二獲取單元具體用於：以設定長度的時間視窗和設定步長，對所述編碼序列進行滑窗，將所述時間視窗內的特徵編碼作為所對應的至少一個音素的特徵編碼，並根據完成滑窗得到的多個特徵編碼，獲得第二編碼序列。In some embodiments, when acquiring the feature code corresponding to at least one phoneme according to the first encoding sequence, the second acquiring unit is specifically configured to: The sequence performs a sliding window, the feature code in the time window is used as the feature code of the corresponding at least one phoneme, and the second code sequence is obtained according to the multiple feature codes obtained by completing the sliding window.

在一些實施例中，所述驅動單元具體用於：獲取與所述第二編碼序列對應的姿態控制向量的序列；根據所述姿態控制向量的序列控制所述互動物件的姿態。In some embodiments, the driving unit is specifically configured to: obtain a sequence of a posture control vector corresponding to the second coding sequence; and control the posture of the interactive object according to the sequence of the posture control vector.

在一些實施例中，所述裝置更包括停頓驅動單元，用於在所述音素序列中音素之間的時間間隔大於設定閾值的情況下，根據所述局部區域的設定控制參數值，控制所述互動物件的姿態。In some embodiments, the device further includes a pause driving unit, which is used to control the set control parameter value of the local area when the time interval between phonemes in the phoneme sequence is greater than a set threshold. The posture of the interactive object.

在一些實施例中，在獲取所述特徵編碼對應的所述互動物件的至少一個局部區域的姿態控制向量時，所述第二獲取單元具體用於：將所述特徵編碼輸入至預先訓練的迴圈神經網路，獲得與所述特徵編碼對應的所述互動物件的至少一個局部區域的姿態控制向量。In some embodiments, when acquiring the attitude control vector of at least one partial region of the interactive object corresponding to the feature code, the second acquiring unit is specifically configured to: input the feature code into a pre-trained response. A loop neural network to obtain a posture control vector of at least one local area of the interactive object corresponding to the feature code.

在一些實施例中，所述神經網路透過音素序列樣本訓練得到；所述裝置更包括樣本獲取單元，用於：獲取一角色發出語音的影片段，並根據所述影片段獲取多個包含所述角色的第一圖像幀；從所述影片段中提取相應的語音段，根據所述語音段獲取樣本音素序列，並對所述樣本音素序列進行特徵編碼；獲取與所述第一圖像幀對應的至少一個音素的特徵編碼；將所述第一圖像幀轉化為包含所述互動物件的第二圖像幀，獲取所述第二圖像幀對應的至少一個局部區域的姿態控制向量值；根據所述姿態控制向量值，對所述第一圖像幀對應的特徵編碼進行標注，獲得特徵編碼樣本。In some embodiments, the neural network is obtained through phoneme sequence sample training; the device further includes a sample acquisition unit for: acquiring a video segment of a character's voice, and acquiring a plurality of video segments containing all of them according to the video segment. The first image frame of the character; extract the corresponding speech segment from the film segment, obtain a sample phoneme sequence according to the speech segment, and perform feature encoding on the sample phoneme sequence; obtain the first image Feature encoding of at least one phoneme corresponding to the frame; transforming the first image frame into a second image frame containing the interactive object, and obtaining a posture control vector of at least one local area corresponding to the second image frame Value; according to the attitude control vector value, annotate the feature code corresponding to the first image frame to obtain a feature code sample.

在一些實施例中，所述裝置更包括訓練單元，用於根據所述特徵編碼樣本對初始迴圈神經網路進行訓練，在網路損失的變化滿足收斂條件後訓練得到所述迴圈神經網路，其中，所述網路損失包括所述迴圈神經網路預測得到的所述至少一個局部區域的姿態控制向量值與標注的姿態控制向量值之間的差異。In some embodiments, the device further includes a training unit for training the initial loop neural network according to the characteristic coding samples, and trains to obtain the loop neural network after the change of the network loss satisfies the convergence condition Wherein, the network loss includes the difference between the attitude control vector value of the at least one local area and the labeled attitude control vector value predicted by the loop neural network.

本說明書至少一個實施例還提供了一種電子設備，如圖5所示，所述設備包括記憶體、處理器，記憶體用於存儲可在處理器上運行的電腦指令，處理器用於在執行所述電腦指令時實現本公開任一實施例所述的互動物件的驅動方法。At least one embodiment of this specification also provides an electronic device. As shown in FIG. 5, the device includes a memory and a processor. The memory is used to store computer instructions that can run on the processor. The computer instruction is used to implement the driving method of the interactive object described in any embodiment of the present disclosure.

本說明書至少一個實施例還提供了一種電腦可讀存儲介質，其上存儲有電腦程式，所述程式被處理器執行時實現本公開任一實施例所述的互動物件的驅動方法。At least one embodiment of this specification also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for driving an interactive object according to any embodiment of the present disclosure is realized.

本領域具有通常知識者應明白，本說明書一個或多個實施例可提供為方法、系統或電腦程式產品。因此，本說明書一個或多個實施例可採用完全硬體實施例、完全軟體實施例或結合軟體和硬體方面的實施例的形式。而且，本說明書一個或多個實施例可採用在一個或多個其中包含有電腦可用程式碼的電腦可用存儲介質（包括但不限於磁碟記憶體、CD-ROM、光學記憶體等）上實施的電腦程式產品的形式。Those with ordinary knowledge in the art should understand that one or more embodiments of this specification can be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of this specification may adopt the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification can be implemented on one or more computer-usable storage media (including but not limited to magnetic disk memory, CD-ROM, optical memory, etc.) containing computer-usable program codes. In the form of a computer program product.

本說明書中的各個實施例均採用遞進的方式描述，各個實施例之間相同相似的部分互相參見即可，每個實施例重點說明的都是與其他實施例的不同之處。尤其，對於資料處理設備實施例而言，由於其基本相似於方法實施例，所以描述的比較簡單，相關之處參見方法實施例的部分說明即可。The various embodiments in this specification are described in a progressive manner, and the same or similar parts between the various embodiments can be referred to each other, and each embodiment focuses on the difference from other embodiments. In particular, as for the data processing device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.

上述對本說明書特定實施例進行了描述。其它實施例在所附申請專利範圍的範圍內。在一些情況下，在申請專利範圍中記載的行為或步驟可以按照不同於實施例中的順序來執行並且仍然可以實現期望的結果。另外，在附圖中描繪的過程不一定要求示出的特定順序或者連續順序才能實現期望的結果。在某些實施方式中，多工處理和並行處理也是可以的或者可能是有利的。The foregoing describes specific embodiments of this specification. Other embodiments are within the scope of the attached patent application. In some cases, the actions or steps described in the scope of the patent application may be performed in a different order from the embodiment and still achieve desired results. In addition, the processes depicted in the drawings do not necessarily require the specific order or sequential order shown in order to achieve the desired results. In some embodiments, multiplexing and parallel processing are also possible or may be advantageous.

本說明書中描述的主題及功能操作的實施例可以在以下中實現：數位電子電路、有形體現的電腦軟體或硬體、包括本說明書中公開的結構及其結構性等同物的電腦硬體、或者它們中的一個或多個的組合。本說明書中描述的主題的實施例可以實現為一個或多個電腦程式，即編碼在有形非暫時性程式載體上以被資料處理裝置執行或控制資料處理裝置的操作的電腦程式指令中的一個或多個模組。可替代地或附加地，程式指令可以被編碼在人工生成的傳播信號上，例如機器生成的電、光或電磁信號，該信號被生成以將資訊編碼並傳輸到合適的接收機裝置以由資料處理裝置執行。電腦存儲介質可以是機器可讀存放裝置、機器可讀存儲基板、隨機或串列存取記憶體設備、或它們中的一個或多個的組合。The embodiments of the subject and functional operations described in this specification can be implemented in the following: digital electronic circuits, tangible computer software or hardware, computer hardware including the structure disclosed in this specification and its structural equivalents, or A combination of one or more of them. The embodiments of the subject described in this specification can be implemented as one or more computer programs, that is, one or one of the computer program instructions encoded on a tangible non-transitory program carrier to be executed by a data processing device or to control the operation of the data processing device Multiple modules. Alternatively or additionally, the program instructions may be encoded on artificially generated propagated signals, such as machine-generated electrical, optical or electromagnetic signals, which are generated to encode information and transmit it to a suitable receiver device for data transmission. The processing device executes. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

本說明書中描述的處理及邏輯流程可以由執行一個或多個電腦程式的一個或多個可程式設計電腦執行，以透過根據輸入資料進行操作並生成輸出來執行相應的功能。所述處理及邏輯流程還可以由專用邏輯電路—例如FPGA（現場可程式化閘陣列）或ASIC（專用積體電路）來執行，並且裝置也可以實現為專用邏輯電路。The processing and logic flow described in this manual can be executed by one or more programmable computers that execute one or more computer programs to perform corresponding functions by operating based on input data and generating output. The processing and logic flow can also be executed by a dedicated logic circuit, such as FPGA (Field Programmable Gate Array) or ASIC (Dedicated Integrated Circuit), and the device can also be implemented as a dedicated logic circuit.

適合用於執行電腦程式的電腦包括，例如通用和/或專用微處理器，或任何其他類型的中央處理單元。通常，中央處理單元將從唯讀記憶體和/或隨機存取記憶體接收指令和資料。電腦的基本元件包括用於實施或執行指令的中央處理單元以及用於存儲指令和資料的一個或多個記憶體設備。通常，電腦還將包括用於存儲資料的一個或多個大型存放區設備，例如磁片、磁光碟或光碟等，或者電腦將可操作地與此大型存放區設備耦接以從其接收資料或向其傳送資料，抑或兩種情況兼而有之。然而，電腦不是必須具有這樣的設備。此外，電腦可以嵌入在另一設備中，例如行動電話、個人數位助理（PDA）、行動音訊或影片播放機、遊戲操縱臺、全球定位系統（GPS）接收機、或例如通用序列匯流排（USB）快閃記憶體驅動器的可擕式存放裝置，僅舉幾例。Computers suitable for executing computer programs include, for example, general-purpose and/or special-purpose microprocessors, or any other types of central processing units. Generally, the central processing unit will receive commands and data from read-only memory and/or random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Usually, the computer will also include one or more large storage area devices for storing data, such as floppy disks, magneto-optical disks, or optical discs, or the computer will be operably coupled to this large storage area device to receive data or Send data to it, or both. However, the computer does not have to have such equipment. In addition, the computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a universal serial bus (USB) ) Portable storage devices for flash drives, to name a few.

適合於存儲電腦程式指令和資料的電腦可讀介質包括所有形式的非揮發性記憶體、媒介和記憶體設備，例如包括半導體記憶體設備（例如EPROM、EEPROM和快閃記憶體設備）、磁片（例如內部硬碟或隨身碟）、磁光碟以及CD-ROM和DVD-ROM。處理器和記憶體可由專用邏輯電路補充或併入專用邏輯電路中。Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, such as semiconductor memory devices (such as EPROM, EEPROM and flash memory devices), magnetic disks (Such as internal hard drives or flash drives), magneto-optical discs, and CD-ROM and DVD-ROM. The processor and memory can be supplemented by or incorporated into a dedicated logic circuit.

雖然本說明書包含許多具體實施細節，但是這些不應被解釋為限制任何發明的範圍或所要求保護的範圍，而是主要用於描述特定發明的具體實施例的特徵。本說明書內在多個實施例中描述的某些特徵也可以在單個實施例中被組合實施。另一方面，在單個實施例中描述的各種特徵也可以在多個實施例中分開實施或以任何合適的子組合來實施。此外，雖然特徵可以如上所述在某些組合中起作用並且甚至最初如此要求保護，但是來自所要求保護的組合中的一個或多個特徵在一些情況下可以從該組合中去除，並且所要求保護的組合可以指向子組合或子組合的變型。Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or the scope of the claimed protection, but are mainly used to describe the features of specific embodiments of a particular invention. Certain features described in multiple embodiments in this specification can also be implemented in combination in a single embodiment. On the other hand, various features described in a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. In addition, although features can function in certain combinations as described above and even initially claimed as such, one or more features from the claimed combination can in some cases be removed from the combination, and the claimed The combination of protection can be directed to a sub-combination or a variant of the sub-combination.

類似地，雖然在附圖中以特定順序描繪了操作，但是這不應被理解為要求這些操作以所示的特定循序執行或順次執行、或者要求所有例示的操作被執行，以實現期望的結果。在某些情況下，多工和並行處理可能是有利的。此外，上述實施例中的各種系統模組和元件的分離不應被理解為在所有實施例中均需要這樣的分離，並且應當理解，所描述的程式元件和系統通常可以一起集成在單個軟體產品中，或者封裝成多個軟體產品。Similarly, although operations are depicted in a specific order in the drawings, this should not be construed as requiring these operations to be performed in the specific order shown or sequentially, or requiring all illustrated operations to be performed to achieve desired results . In some cases, multiplexing and parallel processing may be advantageous. In addition, the separation of various system modules and components in the above embodiments should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can usually be integrated together in a single software product. Or packaged into multiple software products.

由此，主題的特定實施例已被描述。其他實施例在所附申請專利範圍的範圍以內。在某些情況下，申請專利範圍中記載的動作可以以不同的循序執行並且仍實現期望的結果。此外，附圖中描繪的處理並非必需所示的特定順序或順次順序，以實現期望的結果。在某些實現中，多工和並行處理可能是有利的。Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the attached patent application. In some cases, the actions described in the scope of the patent application can be executed in a different order and still achieve the desired result. In addition, the processes depicted in the drawings are not necessarily in the specific order or sequential order shown in order to achieve the desired result. In some implementations, multiplexing and parallel processing may be advantageous.

以上所述僅為本說明書一個或多個實施例的較佳實施例而已，並不用以限制本說明書一個或多個實施例，凡在本說明書一個或多個實施例的精神和原則之內，所做的任何修改、等同替換、改進等，均應包含在本說明書一個或多個實施例保護的範圍之內。The above descriptions are only preferred embodiments of one or more embodiments of this specification, and are not intended to limit one or more embodiments of this specification. All within the spirit and principle of one or more embodiments of this specification, Any modification, equivalent replacement, improvement, etc. made should be included in the protection scope of one or more embodiments of this specification.

201、202、203:步驟 401:第一獲取單元 402:第二獲取單元 403:驅動單元201, 202, 203: steps 401: The first acquisition unit 402: second acquisition unit 403: drive unit

為了更清楚地說明本說明書一個或多個實施例或現有技術中的技術方案，下面將對實施例或現有技術描述中所需要使用的附圖作簡單地介紹，顯而易見地，下面描述中的附圖僅僅是本說明書一個或多個實施例中記載的一些實施例，對於本領域具有通常知識者來講，在不付出進步性勞動的前提下，還可以根據這些附圖獲得其他的附圖。圖1是本公開至少一個實施例提出的互動物件的驅動方法中顯示裝置的示意圖。圖2是本公開至少一個實施例提出的互動物件的驅動方法的流程圖。圖3是本公開至少一個實施例提出的對音素序列進行特徵編碼的過程示意圖。圖4是本公開至少一個實施例提出的互動物件的驅動裝置的結構示意圖。圖5是本公開至少一個實施例提出的電子設備的結構示意圖。In order to more clearly explain one or more embodiments of this specification or the technical solutions in the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the appendix in the following description The figures are only some of the embodiments described in one or more embodiments of this specification. For those with ordinary knowledge in the art, other figures can be obtained based on these figures without progressive labor. FIG. 1 is a schematic diagram of a display device in a driving method of an interactive object provided by at least one embodiment of the present disclosure. FIG. 2 is a flowchart of a method for driving an interactive object according to at least one embodiment of the present disclosure. FIG. 3 is a schematic diagram of the process of feature encoding for a phoneme sequence proposed by at least one embodiment of the present disclosure. FIG. 4 is a schematic structural diagram of a driving device for an interactive object provided by at least one embodiment of the present disclosure. FIG. 5 is a schematic structural diagram of an electronic device proposed in at least one embodiment of the present disclosure.

201、202、203:步驟201, 202, 203: steps

Claims

A driving method of interactive objects includes: Obtain the phoneme sequence corresponding to the text data; Acquiring a control parameter value of at least one partial area of an interactive object matching the phoneme sequence; Control the posture of the interactive object according to the acquired control parameter value.

The method according to claim 1, further comprising: controlling the display device displaying the interactive object to display text according to the text data, and/or controlling the display device to output voice according to the phoneme sequence corresponding to the text data.

The method according to claim 1 or 2, wherein the control parameter of the local area of the interactive object includes the attitude control vector of the local area; Obtaining the control parameter value of at least one partial area of the interactive object matching the phoneme sequence includes: Performing feature encoding on the phoneme sequence to obtain a first encoding sequence corresponding to the phoneme sequence; Obtaining a feature code corresponding to at least one phoneme according to the first coding sequence; Obtain a posture control vector of at least one partial area of the interactive object corresponding to the feature code.

The method according to claim 3, wherein performing feature encoding on the phoneme sequence to obtain the first encoding sequence corresponding to the phoneme sequence includes: For each phoneme of the multiple phonemes included in the phoneme sequence, generating a sub-coding sequence corresponding to the phoneme; The first coding sequence corresponding to the phoneme sequence is obtained according to the respective sub-coding sequences corresponding to the multiple phonemes.

The method according to claim 4, wherein for each of the multiple phonemes included in the phoneme sequence, generating a sub-coding sequence corresponding to the phoneme includes: Detecting whether the phoneme corresponds to each time point; By setting the coding value at the time point with the phoneme to the first value, and setting the coding value at the time point without the phoneme to the second value, the sub-coding sequence corresponding to the phoneme is obtained.

The method described in claim 5 further includes: For the sub-coding sequence corresponding to each phoneme of the multiple phonemes, a Gaussian convolution operation is performed on the continuous values of the phoneme in time by using a Gaussian filter.

The method according to any one of claim items 3 to 6, wherein obtaining a feature code corresponding to at least one phoneme according to the first coding sequence includes: A sliding window is performed on the first coding sequence with a time window of a set length and a set step size, the feature code in the time window is used as the feature code of the corresponding at least one phoneme, and the sliding is performed according to the completion of the sliding window. A plurality of the feature codes obtained by the window to obtain a second coding sequence; Controlling the posture of the interactive object according to the acquired control parameter value includes: Acquiring a sequence of attitude control vectors corresponding to the second coding sequence; Control the posture of the interactive object according to the sequence of the posture control vector.

The method according to any one of claims 1 to 7, further comprising: In the case that the time interval between the phonemes in the phoneme sequence is greater than a set threshold, the posture of the interactive object is controlled according to the set control parameter value of the local area.

The method according to claim 3, wherein obtaining a posture control vector of at least one partial area of the interactive object corresponding to the feature code includes: The feature code is input to a pre-trained loop neural network to obtain the attitude control vector of at least one local area of the interactive object corresponding to the feature code.

The method according to claim 9, wherein the loop neural network is obtained through feature coding sample training, and the method further includes: Acquiring a film segment in which a character speaks, and acquiring a plurality of first image frames containing the character according to the film segment; Extracting a corresponding speech segment from the film segment, obtaining a sample phoneme sequence according to the speech segment, and performing feature encoding on the sample phoneme sequence; Acquiring a feature code of at least one phoneme corresponding to the first image frame; Transforming the first image frame into a second image frame containing the interactive object, and obtaining a posture control vector value of at least one local area corresponding to the second image frame; According to the attitude control vector value, annotate the feature code corresponding to the first image frame to obtain the feature code sample.

The method described in claim 10 further includes: The initial loop neural network is trained according to the feature code samples, and the loop neural network is trained after the change of the network loss satisfies the convergence condition, wherein the network loss includes the loop neural network The difference between the attitude control vector value of the at least one local area obtained by road prediction and the marked attitude control vector value.

A driving device for interactive objects includes: The first acquiring unit is used to acquire the phoneme sequence corresponding to the text data; The second acquiring unit is configured to acquire the control parameter value of at least one partial area of the interactive object matching the phoneme sequence; The driving unit is configured to control the posture of the interactive object according to the acquired control parameter value.

An electronic device comprising a memory and a processor, the memory is used to store computer instructions that can be run on the processor, and the processor is used to implement any one of request items 1 to 11 when the computer instructions are executed. The method described.

A computer-readable storage medium has a computer program stored thereon, and when the computer program is executed by a processor, the method described in any one of request items 1 to 11 is realized.