TWI760015B

TWI760015B - Method and apparatus for driving interactive object, device and storage medium

Info

Publication number: TWI760015B
Application number: TW109144967A
Authority: TW
Inventors: 張子隆; 吳文岩; 吳潛溢; 許親親
Original assignee: 大陸商北京市商湯科技開發有限公司
Priority date: 2020-03-31
Filing date: 2020-12-18
Publication date: 2022-04-01
Also published as: JP7227395B2; TW202138970A; CN111459452A; WO2021196645A1; JP2022531072A; KR20210129713A; CN111459452B; KR102707613B1

Abstract

The present disclosure relates to a method and an apparatus for driving interactive object, a device and a storage medium. The interactive object is displayed in a display device, and the method comprises: obtaining driving data of the interactive object and determining a driving mode of the driving data; in response to the driving mode, obtaining a value of control parameter of the interactive object according to the driving data; and controlling a posture of the interactive object according to the value of control parameter.

Description

Driving method, device, device and storage medium for interactive objects

本公開涉及計算機技術領域，具體涉及一種互動物件的驅動方法、裝置、設備以及儲存媒體。The present disclosure relates to the field of computer technology, and in particular, to a driving method, apparatus, device and storage medium for interactive objects.

人機互動的方式大多基於按鍵、觸控、語音進行輸入，通過在顯示螢幕上呈現圖像、文本或虛擬人物進行回應。目前虛擬人物多是在語音助理的基礎上改進得到的，其只是對設備的語音進行輸出。Most of the human-computer interaction methods are based on keystrokes, touch, and voice input, and respond by presenting images, texts or virtual characters on the display screen. At present, virtual characters are mostly improved on the basis of voice assistants, which only output the voice of the device.

本公開實施例提供一種互動物件的驅動方案。Embodiments of the present disclosure provide a driving solution for an interactive object.

根據本公開的一方面，提供一種互動物件的驅動方法，所述互動物件顯示在顯示設備中，所述方法包括：獲取所述互動物件的驅動數據，並確定所述驅動數據的驅動模式；響應於所述驅動模式，根據所述驅動數據獲取所述互動物件的控制參數值；根據所述控制參數值控制所述互動物件的姿態。According to an aspect of the present disclosure, there is provided a method for driving an interactive object, where the interactive object is displayed on a display device, the method comprising: acquiring driving data of the interactive object, and determining a driving mode of the driving data; responding to In the driving mode, the control parameter value of the interactive object is acquired according to the driving data; the posture of the interactive object is controlled according to the control parameter value.

結合本公開提供的任一實施方式，所述方法更包括：根據所述驅動數據控制所述顯示設備輸出語音和/或顯示文本。With reference to any of the embodiments provided in the present disclosure, the method further includes: controlling the display device to output voice and/or display text according to the driving data.

結合本公開提供的任一實施方式，所述確定所述驅動數據對應的驅動模式，包括：根據所述驅動數據的類型，獲取所述驅動數據對應的語音數據序列，所述語音數據序列包括多個語音數據單元；響應於檢測到所述語音數據單元中包括目標數據，則確定所述驅動數據的驅動模式為第一驅動模式，所述目標數據與所述互動物件的預設控制參數值相對應；所述響應於所述驅動模式，根據所述驅動數據獲取所述互動物件的控制參數值，包括：響應於所述第一驅動模式，將所述目標數據對應的所述預設控制參數值，作為所述互動物件的控制參數值。With reference to any implementation manner provided in the present disclosure, the determining of the driving mode corresponding to the driving data includes: acquiring, according to the type of the driving data, a voice data sequence corresponding to the driving data, where the voice data sequence includes multiple a voice data unit; in response to detecting that the voice data unit includes target data, determine that the driving mode of the driving data is the first driving mode, and the target data is consistent with the preset control parameter value of the interactive object Corresponding; the acquiring, in response to the driving mode, the control parameter value of the interactive object according to the driving data includes: in response to the first driving mode, converting the preset control parameter corresponding to the target data to value, as the control parameter value of the interactive object.

結合本公開提供的任一實施方式，所述目標數據包括關鍵詞或關鍵字，所述關鍵詞或所述關鍵字與所述互動物件的設定動作的預設控制參數值相對應；或者，所述目標數據包括音節，所述音節與所述互動物件的設定嘴型動作的預設控制參數值對應。With reference to any of the implementation manners provided in the present disclosure, the target data includes keywords or keywords, and the keywords or the keywords correspond to preset control parameter values of the setting action of the interactive object; or, the The target data includes syllables, and the syllables correspond to preset control parameter values of the interactive object for setting the mouth shape action.

結合本公開提供的任一實施方式，所述確定所述驅動數據對應的驅動模式，包括：根據所述驅動數據的類型，獲取所述驅動數據對應的語音數據序列，所述語音數據序列包括多個語音數據單元；若未檢測到所述語音數據單元中包括目標數據，則確定所述驅動數據的驅動模式為第二驅動模式，所述目標數據與所述互動物件的預設控制參數值相對應。響應於所述驅動模式，根據所述驅動數據獲取所述互動物件的控制參數值，包括：響應於所述第二驅動模式，獲取所述語音數據序列中的至少一個語音數據單元的特徵資訊；獲取與所述特徵資訊對應的所述互動物件的控制參數值。With reference to any implementation manner provided in the present disclosure, the determining of the driving mode corresponding to the driving data includes: acquiring, according to the type of the driving data, a voice data sequence corresponding to the driving data, where the voice data sequence includes multiple a voice data unit; if it is not detected that the voice data unit includes target data, the driving mode of the driving data is determined to be the second driving mode, and the target data is consistent with the preset control parameter value of the interactive object. correspond. In response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the second driving mode, acquiring feature information of at least one voice data unit in the voice data sequence; Obtain the control parameter value of the interactive object corresponding to the feature information.

結合本公開提供的任一實施方式，所述語音數據序列包括音素序列，所述獲取所述語音數據序列中的至少一個語音數據單元的特徵資訊，包括：對所述音素序列進行特徵編碼，獲得所述音素序列對應的第一編碼序列；根據所述第一編碼序列，獲取至少一個音素對應的特徵編碼；根據所述特徵編碼，獲得所述至少一個音素的特徵資訊。With reference to any of the embodiments provided in the present disclosure, the voice data sequence includes a phoneme sequence, and the acquiring feature information of at least one voice data unit in the voice data sequence includes: performing feature encoding on the phoneme sequence to obtain The first coding sequence corresponding to the phoneme sequence; according to the first coding sequence, a feature code corresponding to at least one phoneme is obtained; according to the feature code, feature information of the at least one phoneme is obtained.

結合本公開提供的任一實施方式，所述語音數據序列包括語音幀序列，所述獲取所述語音數據序列中的至少一個語音數據單元的特徵資訊，包括：獲取所述語音幀序列對應的第一聲學特徵序列，所述第一聲學特徵序列包括與所述語音幀序列中的每個語音幀對應的聲學特徵向量；根據所述第一聲學特徵序列，獲取至少一個語音幀對應的聲學特徵向量；根據所述聲學特徵向量，獲得所述至少一個語音幀對應的特徵資訊。With reference to any of the embodiments provided in the present disclosure, the voice data sequence includes a voice frame sequence, and the acquiring feature information of at least one voice data unit in the voice data sequence includes: acquiring the first voice data unit corresponding to the voice frame sequence. an acoustic feature sequence, the first acoustic feature sequence includes an acoustic feature vector corresponding to each speech frame in the speech frame sequence; according to the first acoustic feature sequence, acquire an acoustic feature vector corresponding to at least one speech frame ; According to the acoustic feature vector, obtain feature information corresponding to the at least one speech frame.

結合本公開提供的任一實施方式，所述互動物件的控制參數包括面部姿態參數，所述面部姿態參數包括面部肌肉控制係數，所述面部肌肉控制係數用於控制至少一個面部肌肉的運動狀態；所述根據所述驅動數據獲取所述互動物件的控制參數值，包括：根據所述驅動數據獲取所述互動物件的面部肌肉控制係數；所述根據所述控制參數值控制所述互動物件的姿態，包括：根據所獲取的面部肌肉控制係數，驅動所述互動物件做出與所述驅動數據匹配的面部動作。With reference to any of the embodiments provided in the present disclosure, the control parameter of the interactive object includes a facial posture parameter, the facial posture parameter includes a facial muscle control coefficient, and the facial muscle control coefficient is used to control the motion state of at least one facial muscle; The obtaining the control parameter value of the interactive object according to the driving data includes: obtaining the facial muscle control coefficient of the interactive object according to the driving data; controlling the posture of the interactive object according to the control parameter value , which includes: driving the interactive object to perform facial actions matching the driving data according to the acquired facial muscle control coefficients.

結合本公開提供的任一實施方式，所述方法更包括：獲取與所述面部姿態參數關聯的身體姿態的驅動數據；根據與所述面部姿態參數值關聯的身體姿態的驅動數據，驅動所述互動物件做出肢體動作。In conjunction with any of the embodiments provided in the present disclosure, the method further includes: acquiring driving data of a body posture associated with the facial posture parameter; driving the body posture according to the driving data of the body posture associated with the facial posture parameter value Interactive objects make body movements.

結合本公開提供的任一實施方式，所述互動物件的控制參數值包括所述互動物件的至少一個局部區域的控制向量；所述根據所述驅動數據獲取所述互動物件的控制參數值，包括：根據所述驅動數據獲取所述互動物件的至少一個局部區域的控制向量；所述根據所述控制參數值控制所述互動物件的姿態，包括：根據所獲取的所述至少一個局部區域的控制向量，控制所述互動物件的面部動作和/或肢體動作。With reference to any of the implementation manners provided in the present disclosure, the control parameter value of the interactive object includes a control vector of at least one local area of the interactive object; and the acquiring the control parameter value of the interactive object according to the driving data includes: : obtaining a control vector of at least one partial area of the interactive object according to the driving data; the controlling the posture of the interactive object according to the control parameter value includes: controlling the at least one partial area according to the obtained A vector that controls the facial and/or body movements of the interactive object.

結合本公開提供的任一實施方式，所述獲取與所述特徵資訊對應的所述互動物件的控制參數值，包括：將所述特徵資訊輸入至預先訓練的循環神經網路，獲得與所述特徵資訊對應的所述互動物件的控制參數值。With reference to any of the embodiments provided in the present disclosure, the obtaining the control parameter value of the interactive object corresponding to the feature information includes: inputting the feature information into a pre-trained recurrent neural network, and obtaining the value of the control parameter corresponding to the feature information. The control parameter value of the interactive object corresponding to the feature information.

根據本公開的一方面，提出一種互動物件的驅動裝置，所述互動物件顯示在顯示設備中，所述裝置包括：第一獲取單元，用於獲取所述互動物件的驅動數據，並確定所述驅動數據的驅動模式；第二獲取單元，用於響應於所述驅動模式，根據所述驅動數據獲取所述互動物件的控制參數值；驅動單元，用於根據所述控制參數值控制所述互動物件的姿態。According to an aspect of the present disclosure, an apparatus for driving an interactive object is provided, the interactive object is displayed in a display device, the apparatus includes: a first acquiring unit, configured to acquire driving data of the interactive object, and determine the a driving mode of the driving data; a second acquiring unit, configured to respond to the driving mode and acquire a control parameter value of the interactive object according to the driving data; a driving unit, configured to control the interaction according to the control parameter value The pose of the object.

根據本公開的一方面，提供一種電子設備，所述設備包括記憶體、處理器，所述記憶體用於儲存可在處理器上運行的計算機指令，所述處理器用於在執行所述計算機指令時實現本公開提供的任一實施方式所述的互動物件的驅動方法。According to an aspect of the present disclosure, an electronic device is provided, the device includes a memory and a processor, the memory is used to store computer instructions that can be executed on the processor, and the processor is used to execute the computer instructions. At the same time, the driving method of the interactive object described in any one of the implementation manners provided by the present disclosure is implemented.

根據本公開的一方面，提供一種計算機可讀儲存媒體，其上儲存有計算機程式，所述計算機程式被處理器執行時實現本公開提供的任一實施方式所述的互動物件的驅動方法。According to an aspect of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements the method for driving an interactive object according to any one of the implementation manners provided in the present disclosure.

本公開一個或多個實施例的互動物件的驅動方法、裝置、設備及計算機可讀儲存媒體，根據所述互動物件的驅動數據的驅動模式，來獲取所述互動物件的控制參數值，從而控制所述互動物件的姿態，其中，對於不同的驅動模式可以通過不同的方式來獲取相應的互動物件的控制參數值，使得互動物件顯示出與所述驅動數據的內容和/或對應的語音匹配的姿態，從而使目標物件產生與互動物件正在交流的感覺，提升了目標物件與互動物件的互動體驗。The method, apparatus, device, and computer-readable storage medium for driving an interactive object according to one or more embodiments of the present disclosure acquire control parameter values of the interactive object according to the driving mode of the driving data of the interactive object, so as to control the The gesture of the interactive object, wherein, for different driving modes, the control parameter values of the corresponding interactive object can be obtained in different ways, so that the interactive object displays the content of the driving data and/or the corresponding voice matching. gesture, so that the target object feels that it is communicating with the interactive object, which improves the interactive experience between the target object and the interactive object.

這裡將詳細地對範例性實施例進行說明，其範例表示在附圖中。下面的描述涉及附圖時，除非另有表示，不同附圖中的相同數位表示相同或相似的要素。以下範例性實施例中所描述的實施方式並不代表與本公開相一致的所有實施方式。相反，它們僅是與如所附請求項中所詳述的、本公開的一些方面相一致的裝置和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with this disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.

本文中術語“和/或”，僅僅是一種描述關聯物件的關聯關係，表示可以存在三種關係，例如，A和/或B，可以表示：單獨存在A，同時存在A和B，單獨存在B這三種情況。另外，本文中術語“至少一種”表示多種中的任意一種或多種中的至少兩種的任意組合，例如，包括A、B、C中的至少一種，可以表示包括從A、B和C構成的集合中選擇的任意一個或多個元素。The term "and/or" in this article is only a relationship to describe related objects, which means that there can be three relationships, for example, A and/or B, which can mean that A exists alone, A and B exist at the same time, and B exists alone. three situations. In addition, the term "at least one" herein refers to any combination of any one of a plurality or at least two of a plurality, for example, including at least one of A, B, and C, and may mean including those composed of A, B, and C. Any one or more elements selected in the collection.

本公開至少一個實施例提供了一種互動物件的驅動方法，所述驅動方法可以由終端設備或伺服器等電子設備執行，所述終端設備可以是固定終端或移動終端，例如手機、平板電腦、遊戲機、台式機、廣告機、一體機、車載終端等等，所述伺服器包括本地伺服器或雲端伺服器等，所述方法還可以通過處理器調用記憶體中儲存的計算機可讀指令的方式來實現。At least one embodiment of the present disclosure provides a method for driving an interactive object. The driving method can be executed by an electronic device such as a terminal device or a server, and the terminal device can be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet computer, a game computer, desktop computer, advertising machine, all-in-one computer, vehicle-mounted terminal, etc., the server includes a local server or a cloud server, etc., the method can also call the computer-readable instructions stored in the memory through the processor. to fulfill.

在本公開實施例中，互動物件可以是任意一種能夠與目標物件進行互動的虛擬形象。在一實施例中，互動物件可以是虛擬人物，還可以是虛擬動物、虛擬物品、卡通形象等等其他能夠實現互動功能的虛擬形象。互動物件的呈現形式即可以是2D形式也可以是3D形式，本公開對此並不限定。所述目標物件可以是使用者，也可以是機器人，還可以是其他智能設備。所述互動物件和所述目標物件之間的互動方式可以是主動互動方式，也可以是被動互動方式。一範例中，目標物件可以通過做出手勢或者肢體動作來發出需求，通過主動互動的方式來觸發互動物件與其互動。另一範例中，互動物件可以通過主動打招呼、提示目標物件做出動作等方式，使得目標物件採用被動方式與互動物件進行互動。In the embodiment of the present disclosure, the interactive object may be any virtual image capable of interacting with the target object. In one embodiment, the interactive object may be a virtual character, and may also be a virtual animal, a virtual item, a cartoon image, or other virtual images capable of realizing interactive functions. The presentation form of the interactive object can be either a 2D form or a 3D form, which is not limited in the present disclosure. The target object may be a user, a robot, or other smart devices. The interaction mode between the interactive object and the target object may be an active interaction mode or a passive interaction mode. In one example, the target object can issue a demand by making gestures or body movements, and trigger the interactive object to interact with it through active interaction. In another example, the interactive object may actively say hello, prompt the target object to make an action, etc., so that the target object interacts with the interactive object in a passive manner.

所述互動物件可以通過終端設備進行顯示，所述終端設備可以是電視機、帶有顯示功能的一體機、投影機、虛擬實境（Virtual Reality，VR）設備、擴增實境（Augmented Reality，AR）設備等，本公開並不限定終端設備的具體形式。The interactive object can be displayed by a terminal device, and the terminal device can be a TV, an all-in-one machine with a display function, a projector, a virtual reality (Virtual Reality, VR) device, an augmented reality (Augmented Reality, AR) equipment, etc., the present disclosure does not limit the specific form of the terminal equipment.

圖1繪示本公開至少一個實施例提出的顯示設備。如圖1所示，該顯示設備具有透明顯示螢幕，在透明顯示螢幕上可以顯示立體畫面，以呈現出具有立體效果的虛擬場景以及互動物件。例如圖1中透明顯示螢幕顯示的互動物件包括虛擬卡通人物。在一些實施例中，本公開中所述的終端設備也可以為上述具有透明顯示螢幕的顯示設備，顯示設備中配置有記憶體和處理器，記憶體用於儲存可在處理器上運行的計算機指令，所述處理器用於在執行所述計算機指令時實現本公開提供的互動物件的驅動方法，以驅動透明顯示螢幕中顯示的互動物件對目標物件進行交流或回應。FIG. 1 illustrates a display device according to at least one embodiment of the present disclosure. As shown in FIG. 1 , the display device has a transparent display screen, and a stereoscopic image can be displayed on the transparent display screen, so as to present a virtual scene with a stereoscopic effect and interactive objects. For example, the interactive objects displayed on the transparent display screen in FIG. 1 include virtual cartoon characters. In some embodiments, the terminal device described in the present disclosure may also be the above-mentioned display device with a transparent display screen. The display device is configured with a memory and a processor, and the memory is used to store a computer that can run on the processor. The processor is configured to implement the driving method of the interactive object provided by the present disclosure when executing the computer instruction, so as to drive the interactive object displayed on the transparent display screen to communicate or respond to the target object.

在一些實施例中，響應於用於驅動互動物件輸出語音的驅動數據，互動物件可以對目標物件發出指定語音。終端設備可以根據終端設備周邊目標物件的動作、表情、身份、偏好等，生成驅動數據，以驅動互動物件通過發出指定語音進行交流或回應，從而為目標物件提供擬人化的服務。需要說明的是，聲音驅動數據也可以通過其他方式生成，例如，由伺服器生成並發送給終端設備。In some embodiments, in response to the driving data for driving the interactive object to output the voice, the interactive object can emit the specified voice to the target object. The terminal device can generate driving data according to the actions, expressions, identities, preferences, etc. of the target objects around the terminal device, so as to drive the interactive objects to communicate or respond by issuing specified voices, thereby providing anthropomorphic services for the target objects. It should be noted that the sound-driven data can also be generated in other ways, for example, generated by a server and sent to a terminal device.

在互動物件與目標物件的互動過程中，根據該驅動數據驅動互動物件發出指定語音時，可能無法驅動所述互動物件做出與該指定語音同步的面部動作，使得互動物件在發出語音時呆板、不自然，影響了目標物件與互動物件的互動體驗。基於此，本公開至少一個實施例提出一種互動物件的驅動方法，以提升目標物件與互動物件進行互動的體驗。During the interaction between the interactive object and the target object, when the interactive object is driven to emit a specified voice according to the driving data, the interactive object may not be driven to make facial movements synchronized with the specified voice, so that the interactive object is rigid and rigid when it emits a voice. Unnatural, affecting the interactive experience between the target object and the interactive object. Based on this, at least one embodiment of the present disclosure provides a method for driving an interactive object, so as to improve the experience of interacting between a target object and an interactive object.

圖2繪示根據本公開至少一個實施例的互動物件的驅動方法的流程圖，所述互動物件顯示在顯示設備中，如圖2所示，所述方法包括步驟201~步驟203。FIG. 2 is a flow chart of a method for driving an interactive object according to at least one embodiment of the present disclosure. The interactive object is displayed on a display device. As shown in FIG. 2 , the method includes steps 201 to 203 .

在步驟201中，獲取所述互動物件的驅動數據，並確定所述驅動數據的驅動模式。In step 201, the driving data of the interactive object is acquired, and the driving mode of the driving data is determined.

在本公開實施例中，所述驅動數據可以包括音訊數據（語音數據）、文本等等。所述驅動數據可以是伺服器端或終端設備根據與互動物件進行互動的目標物件的動作、表情、身份、偏好等生成的驅動數據，也可以是終端設備直接獲取的，例如從內部記憶體調用的驅動數據等。本公開對於該驅動數據的獲取方式不進行限制。In an embodiment of the present disclosure, the driving data may include audio data (voice data), text, and the like. The driving data may be the driving data generated by the server or the terminal device according to the action, expression, identity, preference, etc. of the target object interacting with the interactive object, or it may be obtained directly by the terminal device, such as calling from the internal memory. drive data, etc. The present disclosure does not limit the manner of acquiring the driving data.

根據所述驅動數據的類型，以及所述驅動數據中所包含的資訊，可以確定所述驅動數據的驅動模式。According to the type of the driving data and the information contained in the driving data, the driving mode of the driving data can be determined.

在一個範例中，可以根據所述驅動數據的類型，獲取所述驅動數據對應的語音數據序列，其中，所述語音數據序列包括多個語音數據單元。其中，所述語音數據單元可以是以字或詞為單位構成的，也可以是以音素或音節為單位構成的。對應于文本類型的驅動數據，則可以獲得所述驅動數據對應的字序列、詞序列等等；對應於音訊類型的驅動數據，則可以獲得所述驅動數據對應的音素序列、音節序列、語音幀序列等等。在一實施例中，音訊數據和文本數據可以互相轉換。例如，將音訊數據轉換為文本數據之後再進行語音數據單元的劃分，或者，將文本數據轉換為音訊數據之後再進行語音數據單元的劃分，本公開對此並不限定。In an example, a voice data sequence corresponding to the driving data may be acquired according to the type of the driving data, wherein the voice data sequence includes a plurality of voice data units. Wherein, the voice data unit may be composed of words or words, or may be composed of phonemes or syllables. Corresponding to the driving data of the text type, the word sequence, word sequence, etc. corresponding to the driving data can be obtained; corresponding to the driving data of the audio type, the phoneme sequence, syllable sequence, and speech frame corresponding to the driving data can be obtained. sequence and so on. In one embodiment, audio data and text data can be converted to each other. For example, the division of voice data units is performed after converting audio data into text data, or the division of voice data units is performed after converting text data into audio data, which is not limited in the present disclosure.

在檢測到所述語音數據單元中包括目標數據的情況下，則可以確定所述驅動數據的驅動模式為第一驅動模式，所述目標數據與互動物件的預設控制參數值相對應。When it is detected that the voice data unit includes target data, it can be determined that the driving mode of the driving data is the first driving mode, and the target data corresponds to the preset control parameter value of the interactive object.

所述目標數據可以是設置的關鍵詞或關鍵字等等，所述關鍵詞或所述關鍵字與互動物件的設定動作的預設控制參數值相對應。The target data may be a set keyword or a keyword, etc., and the keyword or the keyword corresponds to a preset control parameter value of the setting action of the interactive object.

在本公開實施例中，預先為每一個目標數據匹配了設定動作，而每個設定動作通過相應的控制參數值進行控制而實現，因而每個目標數據與設定動作的控制參數值匹配。以關鍵詞為“揮手”為例，在所述語音數據單元包含了文本形式的“揮手”，和/或語音形式的“揮手”的情況下，則可以確定所述驅動數據中包含了目標數據。In the embodiment of the present disclosure, each target data is pre-matched with a set action, and each set action is controlled by a corresponding control parameter value, so each target data matches the control parameter value of the set action. Taking the keyword "waving hand" as an example, in the case where the voice data unit includes "waving hand" in text form and/or "waving hand" in voice form, it can be determined that the driving data includes target data. .

範例性的，所述目標數據包括音節，所述音節與所述互動物件的設定嘴型動作的預設控制參數值對應。Exemplarily, the target data includes syllables, and the syllables correspond to the preset control parameter values of the interactive object for setting the lip movement.

所述目標數據對應的音節屬於預先劃分好的一種音節類型，且所述一種音節類型與一種設定嘴型相匹配。其中，音節是由至少一個音素組合形成的語音單位，所述音節包括拼音語言的音節，和非拼音語言（例如，漢語）的音節。一種音節類型是指發音動作一致或者基本一致的音節，一種音節類型可與互動物件的一種動作對應。在一實施例中，一種音節類型可與互動物件說話時的一種設定的嘴型對應，即與一種發音動作對應。這樣，不同音節類型分別匹配了不同的設定嘴型的控制參數值，例如，拼音“ma”、“man”、“mang”這類型的音節，由於這類音節的發音動作基本一致，故可以視為同一類型，均可對應互動物件說話時“嘴巴張開”的嘴型的控制參數值。The syllable corresponding to the target data belongs to a pre-divided syllable type, and the one syllable type matches a preset mouth shape. Wherein, a syllable is a phonetic unit formed by a combination of at least one phoneme, and the syllable includes a syllable in a pinyin language and a syllable in a non-pinyin language (eg, Chinese). A syllable type refers to a syllable whose pronunciation action is consistent or basically consistent, and a syllable type can correspond to an action of an interactive object. In one embodiment, a syllable type may correspond to a set mouth shape when the interactive object speaks, that is, to a pronunciation action. In this way, different syllable types are matched with different control parameter values for setting the mouth shape. For example, the pinyin "ma", "man", "mang" type syllables, because the pronunciation actions of these syllables are basically the same, it can be regarded as are of the same type, and can correspond to the control parameter values of the mouth shape of the "mouth open" when the interactive object speaks.

在未檢測到所述語音數據單元中包括目標數據的情況下，則可以確定所述驅動數據的驅動模式為第二驅動模式，所述目標數據與互動物件的預設控制參數值相對應。If it is not detected that the voice data unit includes target data, it may be determined that the driving mode of the driving data is the second driving mode, and the target data corresponds to the preset control parameter value of the interactive object.

本領域技術人員應當理解，上述第一驅動模式和第二驅動模式僅用於範例，本公開實施例對於具體驅動模式不進行限定。Those skilled in the art should understand that the above-mentioned first driving mode and second driving mode are only used as examples, and the embodiment of the present disclosure does not limit the specific driving mode.

在步驟202中，響應於所述驅動模式，根據所述驅動數據獲取所述互動物件的控制參數值。In step 202, in response to the driving mode, a control parameter value of the interactive object is acquired according to the driving data.

對於驅動數據的各種驅動模式，可以採用相應的方式獲取所述互動物件的控制參數值。For various driving modes of the driving data, the control parameter value of the interactive object may be obtained in a corresponding manner.

在一個範例中，響應於步驟201中確定的第一驅動模式，可以將所述目標數據對應的所述預設控制參數值作為所述互動物件的控制參數值。例如，對於第一驅動模式，可以將所述語音數據序列中包含的目標數據（例如“揮手”）所對應的預設控制參數值作為所述互動物件的控制參數值。In an example, in response to the first driving mode determined in step 201, the preset control parameter value corresponding to the target data may be used as the control parameter value of the interactive object. For example, for the first driving mode, the preset control parameter value corresponding to the target data (for example, "waving hand") included in the voice data sequence may be used as the control parameter value of the interactive object.

在一個範例中，響應於步驟201中確定的第二驅動模式，可以獲取所述語音數據序列中的至少一個語音數據單元的特徵資訊；獲取與所述特徵資訊對應的所述互動物件的控制參數值。也即，在未檢測到語音數據序列中包含目標數據的情況下，則可以根據所述語音數據單元的特徵資訊來獲取對應的控制參數值。所述特徵資訊可以包括對所述語音數據序列進行特徵編碼所獲得的語音數據單元的特徵資訊、根據所述語音數據序列的聲學特徵資訊所獲得的語音數據單元的特徵資訊等等。In an example, in response to the second driving mode determined in step 201, feature information of at least one voice data unit in the voice data sequence may be acquired; control parameters of the interactive object corresponding to the feature information may be acquired value. That is, in the case where the target data is not detected in the speech data sequence, the corresponding control parameter value can be obtained according to the characteristic information of the speech data unit. The feature information may include feature information of the voice data unit obtained by feature encoding the voice data sequence, feature information of the voice data unit obtained according to the acoustic feature information of the voice data sequence, and the like.

在步驟203中，根據所述控制參數值控制所述互動物件的姿態。In step 203, the posture of the interactive object is controlled according to the control parameter value.

在一些實施例中，所述互動物件的控制參數包括面部姿態參數，所述面部姿態參數包括面部肌肉控制係數，該面部肌肉控制係數用於控制至少一個面部肌肉的運動狀態。在一實施例中，可以根據所述驅動數據獲取所述互動物件的面部肌肉控制係數；並根據所獲取的面部肌肉控制係數，驅動所述互動物件做出與所述驅動數據匹配的面部動作。In some embodiments, the control parameter of the interactive object includes a facial posture parameter, and the facial posture parameter includes a facial muscle control coefficient, and the facial muscle control coefficient is used to control the motion state of at least one facial muscle. In one embodiment, the facial muscle control coefficient of the interactive object may be acquired according to the driving data; and according to the acquired facial muscle control coefficient, the interactive object may be driven to perform a facial action matching the driving data.

在一些實施例中，所述互動物件的控制參數值包括所述互動物件的至少一個局部區域的控制向量。在一實施例中，根據所述驅動數據可以獲取所述互動物件的至少一個局部區域的控制向量；並根據所獲取的所述至少一個局部區域的控制向量可以控制所述互動物件的面部動作和/或肢體動作。In some embodiments, the control parameter value of the interactive object includes a control vector of at least one local area of the interactive object. In one embodiment, the control vector of at least one local area of the interactive object can be acquired according to the driving data; and the facial action and / or body movements.

根據所述互動物件的驅動數據的驅動模式，來獲取所述互動物件的控制參數值，從而控制所述互動物件的姿態，其中，對於不同的驅動模式可以通過不同的方式來獲取相應的互動物件的控制參數值，使得互動物件顯示出與所述驅動數據的內容和/或對應的語音匹配的姿態，從而使目標物件產生與互動物件正在交流的感覺，提升了目標物件與互動物件的互動體驗。The control parameter value of the interactive object is obtained according to the driving mode of the driving data of the interactive object, so as to control the posture of the interactive object, wherein for different driving modes, the corresponding interactive object can be obtained in different ways The control parameter value of , so that the interactive object displays a posture that matches the content of the driving data and/or the corresponding voice, so that the target object feels that it is communicating with the interactive object, and the interactive experience between the target object and the interactive object is improved. .

在一些實施例中，還可以根據所述驅動數據控制所述顯示設備輸出語音和/或顯示文本。並且可以在輸出語音和/或顯示文本的同時，根據所述控制參數值控制所述互動物件的姿態。In some embodiments, the display device may also be controlled to output voice and/or display text according to the driving data. And while outputting voice and/or displaying text, the gesture of the interactive object can be controlled according to the control parameter value.

在本公開實施例中，由於控制參數值與所述驅動數據相匹配，因此根據所述驅動數據輸出語音和/或顯示文本與根據所述控制參數值控制互動物件的姿態同步的情況下，互動物件所做出的姿態與所輸出的語音和/或所顯示的文本也是同步的，從而給目標物件一種所述互動物件正在與其進行交流的感覺。In the embodiment of the present disclosure, since the control parameter value matches the driving data, in the case where the output of the voice and/or the displayed text according to the driving data is synchronized with the gesture of controlling the interactive object according to the control parameter value, the interaction The gestures made by the object are also synchronized with the output speech and/or displayed text, thereby giving the target object a sense that the interactive object is communicating with it.

在一些實施例中，所述語音數據序列包括音素序列。響應於所述驅動數據包括音訊數據，可以通過將音訊數據拆分為多個音訊幀，根據音訊幀的狀態對音訊幀進行組合而形成音素；根據所述音訊數據所形成的各個音素則形成了音素序列。其中，音素是根據語音的自然屬性劃分出來的最小語音單元，真實人物一個發音動作能夠形成一個音素。響應於所述驅動數據為文本，可以根據所述文本中包含的語素，獲得所述語素所對應的音素，從而獲得相應的音素序列。In some embodiments, the sequence of speech data includes a sequence of phonemes. In response to the drive data including audio data, the audio data can be split into a plurality of audio frames, and the audio frames can be combined according to the state of the audio frames to form phonemes; each phoneme formed according to the audio data forms a phoneme. phoneme sequence. Among them, the phoneme is the smallest phonetic unit divided according to the natural attributes of the voice, and one pronunciation action of a real person can form a phoneme. In response to the drive data being text, the phoneme corresponding to the morpheme may be obtained according to the morpheme contained in the text, thereby obtaining a corresponding phoneme sequence.

在一些實施例中，可以通過以下方式獲取所述語音數據序列中的至少一個語音數據單元的特徵資訊：對所述音素序列進行特徵編碼，獲得所述音素序列對應的第一編碼序列；根據所述第一編碼序列，獲取至少一個音素對應的特徵編碼；根據所述特徵編碼，獲得所述至少一個音素的特徵資訊。In some embodiments, the feature information of at least one voice data unit in the voice data sequence may be obtained by: performing feature coding on the phoneme sequence to obtain a first coding sequence corresponding to the phoneme sequence; The first coding sequence is used to obtain the feature code corresponding to at least one phoneme; and the feature information of the at least one phoneme is obtained according to the feature code.

圖3繪示對音素序列進行特徵編碼的過程示意圖。如圖3所示，音素序列310含音素j、i1、j、ie4（為簡潔起見，只繪示部分音素），針對每種音素j、i1、ie4分別獲得對應的編碼序列321、322、323。在各個編碼序列中，在有所述音素的時間點上對應的編碼值設置為第一數值（例如為1），在沒有所述音素的時間點上對應的編碼值設置為第二數值（例如為0）。以編碼序列321為例，在音素序列310中有音素j的時間點上，編碼序列321的值設置為第一數值1；在沒有音素j的時間點上，編碼序列321的值設置為第二數值0。所有編碼序列321、322、323構成總編碼序列320。FIG. 3 is a schematic diagram illustrating a process of feature encoding for a phoneme sequence. As shown in FIG. 3 , the phoneme sequence 310 includes phonemes j, i1, j, and ie4 (for the sake of brevity, only some phonemes are shown), and corresponding coding sequences 321, 322, 322, 322, 323. In each encoding sequence, the corresponding encoding value at the time point with the phoneme is set to a first value (for example, 1), and the corresponding encoding value at the time point without the phoneme is set to the second value (for example, is 0). Taking the coding sequence 321 as an example, at the time point when the phoneme sequence 310 has phoneme j, the value of the coding sequence 321 is set to the first value 1; at the time point when there is no phoneme j, the value of the coding sequence 321 is set to the second value. The value is 0. All coding sequences 321 , 322 , 323 constitute the total coding sequence 320 .

根據音素j、i1、ie4分別對應的編碼序列321、322、323的編碼值，以及該三個編碼序列中對應的音素的持續時間，也即在編碼序列321中j的持續時間、在編碼序列322中i1的持續時間、在編碼序列323中ie4的持續時間，可以獲得編碼序列321、322、323的特徵資訊。According to the coding values of the coding sequences 321, 322 and 323 corresponding to the phonemes j, i1 and ie4 respectively, and the durations of the corresponding phonemes in the three coding sequences, that is, the duration of j in the coding sequence 321, the duration of the coding sequence in the coding sequence The duration of i1 in 322 and the duration of ie4 in the coding sequence 323 can obtain the characteristic information of the coding sequences 321 , 322 and 323 .

例如，可以利用高斯濾波器分別對所述編碼序列321、322、323中的音素j、i1、ie4在時間上的連續值進行高斯卷積操作，獲得所述編碼序列的特徵資訊。也即，通過高斯濾波器對音素在時間上的連續值進行高斯卷積操作，使得各個編碼序列中編碼值從第二數值到第一數值或者從第一數值到第二數值的變化階段變得平滑。對各個編碼序列321、322、323分別進行高斯卷積操作，從而獲得各個編碼序列的特徵值，其中，特徵值為構成特徵資訊中的參數，根據各個編碼序列的特徵資訊的集合，獲得了該音素序列310所對應的特徵資訊330。本領域技術人員應當理解，也可以對各個編碼序列進行其他的操作來獲得所述編碼序列的特徵資訊，本公開對此不進行限制。For example, Gaussian convolution operation can be performed on successive temporal values of phonemes j, i1, ie4 in the encoded sequences 321, 322, 323 using a Gaussian filter to obtain the feature information of the encoded sequences. That is, the Gaussian convolution operation is performed on the continuous values of the phoneme in time through the Gaussian filter, so that the change stage of the coding value in each coding sequence from the second value to the first value or from the first value to the second value becomes smooth. The Gaussian convolution operation is performed on each coding sequence 321, 322, 323, respectively, so as to obtain the feature value of each coding sequence, wherein the feature value is a parameter constituting the feature information, and the set of feature information of each coding sequence is obtained. The feature information 330 corresponding to the phoneme sequence 310 . Those skilled in the art should understand that other operations may also be performed on each coding sequence to obtain characteristic information of the coding sequence, which is not limited in the present disclosure.

在本公開實施例中，通過根據音素序列中每種音素的持續時間獲得所述編碼序列的特徵資訊，使得編碼序列的變化階段平滑，例如，編碼序列的值除了0和1也呈現出中間狀態的值，例如0.2、0.3等等，而根據這些中間狀態的值所獲取的姿態參數值，使得互動人物的姿態變化過度的更加平緩、自然，尤其是互動人物的表情變化更加平緩、自然，提高了目標物件的互動體驗。In the embodiment of the present disclosure, the characteristic information of the coding sequence is obtained according to the duration of each phoneme in the phoneme sequence, so that the change phase of the coding sequence is smooth. For example, the values of the coding sequence show an intermediate state except for 0 and 1. value, such as 0.2, 0.3, etc., and the attitude parameter values obtained according to the values of these intermediate states make the changes of the interactive characters' poses more gentle and natural, especially the changes of the interactive characters' expressions are more gentle and natural, and improve the The interactive experience of the target object.

在一些實施例中，所述面部姿態參數可以包括面部肌肉控制係數。In some embodiments, the facial pose parameters may include facial muscle control coefficients.

人臉的運動，從解剖學角度來看，是由面部各部分肌肉協同變形的結果。因此，通過對互動物件的面部肌肉進行劃分而獲得面部肌肉模型，並對劃分得到的每一塊肌肉（區域）通過對應的面部肌肉控制係數控制其運動，也即對其進行收縮/擴張控制，則能夠使互動人物的面部做出各種表情。對於所述面部肌肉模型的每一塊肌肉，可以根據肌肉所在的面部位置和肌肉自身的運動特徵，來設置不同的肌肉控制係數所對應的運動狀態。例如，對於上唇肌肉，其控制係數的數值範圍為0~1，在該範圍內的不同數值，對應於上唇肌肉不同的收縮／擴張狀態，通過改變該數值，可以實現嘴部的縱向開合；而對於左嘴角肌肉，其控制係數的數值範圍為0~1，在該範圍內的不同數值，對應于左嘴角肌肉的收縮／擴張狀態，通過改變該數值，可以實現嘴部的橫向變化。The movement of the human face, from an anatomical point of view, is the result of cooperative deformation of the muscles of various parts of the face. Therefore, the facial muscle model is obtained by dividing the facial muscles of the interactive object, and the movement of each muscle (region) obtained by the division is controlled by the corresponding facial muscle control coefficient, that is, the contraction/expansion control is performed on it, then It can make the faces of interactive characters make various expressions. For each muscle of the facial muscle model, motion states corresponding to different muscle control coefficients can be set according to the facial position where the muscle is located and the motion characteristics of the muscle itself. For example, for the upper lip muscle, the value range of its control coefficient is 0~1. Different values in this range correspond to different contraction/expansion states of the upper lip muscle. By changing the value, the longitudinal opening and closing of the mouth can be realized; For the muscle of the left corner of the mouth, the value of the control coefficient ranges from 0 to 1. Different values in this range correspond to the contraction/expansion state of the muscle at the left corner of the mouth. By changing the value, the lateral change of the mouth can be achieved.

在根據音素序列輸出聲音的同時，根據與所述音素序列對應的面部肌肉控制係數來驅動所述互動物件做出面部表情，則可以實現顯示設備在輸出聲音時，互動物件同步做出發出該聲音的表情，從而使目標物件產生該互動物件正在說話的感覺，提高了目標物件的互動體驗。While outputting the sound according to the phoneme sequence, the interactive object is driven to make a facial expression according to the facial muscle control coefficient corresponding to the phoneme sequence, so that when the display device outputs the sound, the interactive object can make the sound synchronously , so that the target object feels that the interactive object is talking, and the interactive experience of the target object is improved.

在一些實施例中，可以將所述互動物件的面部動作與身體姿態相關聯，也即將該面部動作所對應的面部姿態參數值與所述身體姿態相關聯，所述身體姿態可以包括肢體動作、手勢動作、走路姿態等等。In some embodiments, the facial action of the interactive object may be associated with a body posture, that is, the facial posture parameter value corresponding to the facial action may be associated with the body posture, and the body posture may include body movements, Gestures, walking gestures, etc.

在互動物件的驅動過程中，獲取與所述面部姿態參數值關聯的身體姿態的驅動數據；在根據所述音素序列輸出聲音的同時，根據與所述面部姿態參數值關聯的身體姿態的驅動數據，驅動所述互動物件做出肢體動作。也即，在根據所述互動物件的驅動數據驅動所述互動物件做出面部動作的同時，還根據該面部動作對應的面部姿態參數值獲取相關聯的身體姿態的驅動數據，從而在輸出聲音時，可以驅動互動物件同步做出相應的面部動作和肢體動作，使互動物件的說話狀態更加生動自然，提高了目標物件的互動體驗。During the driving process of the interactive object, obtain the driving data of the body posture associated with the facial posture parameter value; while outputting the sound according to the phoneme sequence, according to the driving data of the body posture associated with the facial posture parameter value , and drive the interactive object to make body movements. That is, while driving the interactive object to make a facial action according to the driving data of the interactive object, the driving data of the associated body posture is also obtained according to the facial posture parameter value corresponding to the facial action, so that when the sound is output, , which can drive the interactive objects to make corresponding facial and body movements synchronously, so that the speaking state of the interactive objects is more vivid and natural, and the interactive experience of the target objects is improved.

由於聲音的輸出需要保持連續性，因此，在一實施例中，在音素序列上移動時間視窗，並輸出在每次移動過程中時間視窗內的音素，其中，以設定時長作為每次移動時間視窗的步長。例如，可以將時間視窗的長度設置為1秒，將設定時長設置為0.1秒。在輸出時間視窗內的音素的同時，獲取時間視窗設定位置處的音素或音素的特徵資訊所對應的姿態參數值，利用所述姿態參數值控制所述互動物件的姿態；該設定位置為距離時間視窗起始位置設定時長的位置，例如在時間視窗的長度設置為1s時，該設定位置距離時間視窗的起始位置可以為0.5s。隨著時間視窗的每次移動，在輸出時間視窗內的音素同時，都以時間視窗設定位置處對應的姿態參數值控制互動物件的姿態，從而使互動物件的姿態與輸出的語音同步，給目標物件以所述互動物件正在說話的感覺。Since the output of sound needs to maintain continuity, in one embodiment, the time window is moved on the phoneme sequence, and the phonemes in the time window during each movement are output, wherein the set duration is used as the time of each movement The step size of the window. For example, the length of the time window can be set to 1 second, and the set duration can be set to 0.1 seconds. While outputting the phonemes in the time window, obtain the gesture parameter value corresponding to the phoneme or the feature information of the phoneme at the set position of the time window, and use the gesture parameter value to control the gesture of the interactive object; the set position is the distance time The starting position of the window is the position where the duration is set. For example, when the length of the time window is set to 1s, the distance between the set position and the starting position of the time window can be 0.5s. With each movement of the time window, while outputting the phonemes in the time window, the posture of the interactive object is controlled by the corresponding attitude parameter value at the set position of the time window, so that the posture of the interactive object is synchronized with the output voice, giving the target The object feels like the interactive object is talking.

通過改變設定時長，可以改變獲取姿態參數值的時間間隔（頻率），從而改變了互動物件做出姿態的頻率。可以根據實際的互動場景來設置該設定時長，以使互動物件的姿態變化更加自然。By changing the set duration, you can change the time interval (frequency) for obtaining the attitude parameter value, thereby changing the frequency of the interactive object making the gesture. The set duration can be set according to the actual interactive scene, so as to make the posture change of the interactive object more natural.

在一些實施例中，可以通過獲取互動物件的至少一個局部區域的控制向量控制所述互動物件的姿態。In some embodiments, the gesture of the interactive object can be controlled by obtaining a control vector of at least one local area of the interactive object.

所述局部區域是對互動物件的整體（包括面部和/或身體）進行劃分而得到的。面部的一個或多個局部區域的控制可以對應於互動物件的一系列面部表情或動作，例如眼部區域的控制可以對應於互動物件睜眼、閉眼、眨眼、視角變換等面部動作；又例如嘴部區域的控制可以對應於互動物件閉嘴、不同程度的張嘴等面部動作。而身體的一個或多個局部區域的控制可以對應於互動物件的一系列肢體動作，例如腿部區域的控制可以對應於互動物件走路、跳躍、踢腿等動作。The local area is obtained by dividing the whole (including face and/or body) of the interactive object. The control of one or more local areas of the face can correspond to a series of facial expressions or actions of the interactive object. For example, the control of the eye area can correspond to the facial actions of the interactive object such as opening, closing, blinking, and changing the perspective; another example is the mouth. The control of the facial area can correspond to the facial actions of the interactive object, such as closing the mouth and opening the mouth to different degrees. The control of one or more local regions of the body may correspond to a series of body actions of the interactive object, for example, the control of the leg region may correspond to the actions of the interactive object such as walking, jumping, and kicking.

所述互動物件的局部區域的控制參數，包括所述局部區域的控制向量。每個局部區域的姿態控制向量用於驅動所述互動物件的所述局部區域進行動作。不同的控制向量值對應於不同的動作或者動作幅度。例如，對於嘴部區域的控制向量，其一組控制向量值可以使所述互動物件的嘴部微張，而另一組控制向量值可以使所述互動物件的嘴部大張。通過以不同的控制向量值來驅動所述互動物件，可以使相應的局部區域做出不同動作或者不同幅度的動作。The control parameters of the local area of the interactive object include the control vector of the local area. The gesture control vector of each local area is used to drive the local area of the interactive object to act. Different control vector values correspond to different motions or motion magnitudes. For example, for the control vector of the mouth region, one set of control vector values can make the mouth of the interactive object slightly open, and another set of control vector values can make the mouth of the interactive object open wide. By driving the interactive objects with different control vector values, the corresponding local regions can be made to perform different actions or actions of different magnitudes.

局部區域可以根據需要控制的互動物件的動作進行選擇，例如在需要控制所述互動物件面部以及肢體同時進行動作時，可以獲取全部局部區域的控制向量；在需要控制所述互動物件的表情時，則可以獲取所述面部所對應的局部區域的控制向量。The local area can be selected according to the action of the interactive object that needs to be controlled. For example, when the face and limbs of the interactive object need to be controlled to move at the same time, the control vectors of all local areas can be obtained; when the expression of the interactive object needs to be controlled, Then the control vector of the local area corresponding to the face can be obtained.

在一些實施例中，可以通過在所述第一編碼序列上進行滑動視窗的方式獲取至少一個音素對應的特徵編碼。其中，所述第一編碼序列可以是經過高斯卷積操作後的編碼序列。In some embodiments, the feature code corresponding to at least one phoneme may be obtained by performing a sliding window on the first code sequence. Wherein, the first coding sequence may be a coding sequence after Gaussian convolution operation.

以設定長度的時間視窗和設定步長，對所述編碼序列進行滑動視窗，將所述時間視窗內的特徵編碼作為所對應的至少一個音素的特徵編碼，在完成滑動視窗後，根據得到的多個特徵編碼，可以獲得第二編序列。如圖4所示，通過在第一編碼序列420或者平滑後的第一編碼序列430上，滑動設定長度的時間視窗，分別獲得特徵編碼1、特徵編碼2、特徵編碼3，以此類推，在遍歷第一編碼序列後，獲得特徵編碼1、特徵編碼2、特徵編碼3、…、特徵編碼M，從而得到了第二編碼序列440。其中，M為正整數，其數值根據第一編碼序列的長度、時間視窗的長度以及時間視窗滑動的步長確定。With a time window of a set length and a set step size, a sliding window is performed on the coding sequence, and the feature coding in the time window is used as the feature coding of the corresponding at least one phoneme, and after the sliding window is completed, according to the obtained multiple A feature code can be obtained to obtain the second code sequence. As shown in FIG. 4 , by sliding a time window of a set length on the first encoding sequence 420 or the smoothed first encoding sequence 430, feature encoding 1, feature encoding 2, and feature encoding 3 are obtained respectively, and so on. After traversing the first coding sequence, feature coding 1, feature coding 2, feature coding 3, . . . , feature coding M are obtained, thereby obtaining the second coding sequence 440. Wherein, M is a positive integer, and its value is determined according to the length of the first coding sequence, the length of the time window, and the step size of the time window sliding.

根據特徵編碼1、特徵編碼2、特徵編碼3、…、特徵編碼M，可以獲得相應的控制向量1、控制向量2、控制向量3、…、控制向量M，從而獲得控制向量的序列450。Corresponding control vector 1, control vector 2, control vector 3, .

控制向量的序列450與第二編碼序列440在時間上是對齊的，由於所述第二編碼序列中的每個特徵編碼是根據音素序列中的至少一個音素獲得的，因此控制向量的序列450中的每個特徵向量同樣是根據音素序列中的至少一個音素獲得的。在播放文本數據所對應的音素序列的同時，根據所述控制向量的序列驅動所述互動物件做出動作，即能夠實現驅動互動物件發出文本內容所對應的聲音的同時，做出與聲音同步的動作，給目標物件以所述互動物件正在說話的感覺，提升了目標物件與互動物件的互動體驗。The sequence 450 of control vectors and the second coding sequence 440 are aligned in time. Since each feature code in the second coding sequence is obtained from at least one phoneme in the phoneme sequence, the sequence 450 of control vectors is Each feature vector of is also obtained from at least one phoneme in the phoneme sequence. While playing the phoneme sequence corresponding to the text data, the interactive object is driven to act according to the sequence of the control vector, that is, the interactive object can be driven to emit the sound corresponding to the text content, and at the same time, it can make a sound synchronized with the sound. The action gives the target object the feeling that the interactive object is talking, which improves the interactive experience between the target object and the interactive object.

假設在第一個時間視窗的設定時刻開始輸出特徵編碼，可以將在所述設定時刻之前的控制向量設置為默認值，也即在剛開始播放音素序列時，使所述互動物件做出默認的動作，在所述設定時刻之後開始利用根據第一編碼序列所得到的控制向量的序列驅動所述互動物件做出動作。以圖4為例，在t0時刻開始輸出特徵編碼1，在t0時刻之前輸出是默認控制向量。Assuming that the feature code is output at the set time of the first time window, the control vector before the set time can be set to the default value, that is, when the phoneme sequence is just started to be played, the interactive object can be set to the default value. Action, after the set time, the interactive object starts to use the sequence of control vectors obtained according to the first coding sequence to drive the interactive object to act. Taking Figure 4 as an example, the feature code 1 is output at time t0, and the default control vector is output before time t0.

所述時間視窗的長度與所述特徵編碼所包含的資訊量相關。在時間視窗所含的資訊量較大的情況下，經所述循環神經網路處理會輸出較均勻的結果。若時間視窗的長度過大，可能導致互動物件說話時的表情無法與部分文字對應；若時間視窗的長度過小，可能導致互動物件說話時的表情顯得生硬。因此，時間視窗的時長需要根據文本數據所對應的音素持續的最小時間來確定，以使驅動所述互動物件所做出的動作與聲音具有更強的關聯性。The length of the time window is related to the amount of information contained in the feature code. When the amount of information contained in the time window is relatively large, the RNN process will output a more uniform result. If the length of the time window is too large, the expressions of the interactive objects when they speak may not correspond to part of the text; if the length of the time window is too small, the expressions of the interactive objects may appear stiff when they speak. Therefore, the duration of the time window needs to be determined according to the minimum duration of the phoneme corresponding to the text data, so that the action made by driving the interactive object has a stronger correlation with the sound.

時間視窗滑動的步長與獲取控制向量的時間間隔（頻率）相關，也即與驅動互動物件做出動作的頻率相關。可以根據實際的互動場景來設置所述時間視窗的長度以及步長，以使互動物件做出的表情和動作與聲音的關聯性更強，並且更加生動、自然。The step size of the time window sliding is related to the time interval (frequency) of obtaining the control vector, which is related to the frequency of driving the interactive object to make an action. The length and step size of the time window can be set according to the actual interactive scene, so that the expressions and actions made by the interactive objects are more closely related to the sound, and more vivid and natural.

在一些實施例中，在所述音素序列中音素之間的時間間隔大於設定門檻值的情況下，根據所述局部區域的設定控制向量，驅動所述互動物件做出動作。也即，在互動人物說話停頓較長時，驅動互動物件做出設定的動作。例如，在輸出的語音停頓較長時，可以使互動物件做出微笑的表情，或者身體微微的擺動，以避免在停頓較長時互動物件面無表情地直立，從而使得互動物件說話的過程更加自然、流暢，提高了目標物件的互動感受。In some embodiments, when the time interval between phonemes in the phoneme sequence is greater than a set threshold value, the interactive object is driven to act according to the set control vector of the local area. That is, when the interactive character pauses for a long time, the interactive object is driven to perform the set action. For example, when the output speech pauses for a long time, the interactive object can be made to smile or slightly wiggle its body, so as to prevent the interactive object from standing upright with no expression during the long pause, so that the process of the interactive object speaking is more convenient. Natural and smooth, which improves the interactive experience of the target object.

在一些實施例中，所述語音數據序列包括語音幀序列，獲取所述語音數據序列中的至少一個語音數據單元的特徵資訊，包括：獲取所述語音幀序列對應的第一聲學特徵序列，所述第一聲學特徵序列包括與所述語音幀序列中的每個語音幀對應的聲學特徵向量；根據所述第一聲學特徵序列，獲取至少一個語音幀對應的聲學特徵向量；根據所述聲學特徵向量，獲得所述至少一個語音幀對應的特徵資訊。In some embodiments, the voice data sequence includes a voice frame sequence, and acquiring feature information of at least one voice data unit in the voice data sequence includes: acquiring a first acoustic feature sequence corresponding to the voice frame sequence, where the The first acoustic feature sequence includes an acoustic feature vector corresponding to each speech frame in the speech frame sequence; according to the first acoustic feature sequence, an acoustic feature vector corresponding to at least one speech frame is obtained; according to the acoustic feature vector to obtain feature information corresponding to the at least one speech frame.

在本公開實施例中，可以根據所述語音幀序列的聲學特徵，確定互動物件的至少一個局部區域的控制參數，也可以根據所述語音幀序列的其他特徵來確定控制參數。In the embodiment of the present disclosure, the control parameters of at least one local area of the interactive object may be determined according to the acoustic characteristics of the speech frame sequence, and the control parameters may also be determined according to other characteristics of the speech frame sequence.

首先，獲取所述語音幀序列對應的聲學特徵序列。此處，為了與後續提到的聲學特徵序列進行區分，將所述語音幀序列對應的聲學特徵序列稱為第一聲學特徵序列。First, acquire the acoustic feature sequence corresponding to the speech frame sequence. Here, in order to distinguish it from the acoustic feature sequence mentioned later, the acoustic feature sequence corresponding to the speech frame sequence is referred to as the first acoustic feature sequence.

在本公開實施例中，聲學特徵可以是與語音情感相關的特徵，例如基頻特徵、共峰特徵、梅爾頻率倒譜系數（Mel Frequency Cepstral Cofficient，MFCC）等等。In an embodiment of the present disclosure, the acoustic feature may be a feature related to speech emotion, such as a fundamental frequency feature, a common peak feature, a Mel Frequency Cepstral Cofficient (MFCC), and the like.

所述第一聲學特徵序列是對整體的語音幀序列進行處理所得到的，以MFCC特徵為例，可以通過對所述語音幀序列中的各個語音幀進行加窗、快速傅裡葉變換、濾波、對數處理、離散余弦處理，得到各個語音幀對應的MFCC係數。The first acoustic feature sequence is obtained by processing the entire sequence of speech frames. Taking the MFCC feature as an example, each speech frame in the sequence of speech frames can be subjected to windowing, fast Fourier transform, and filtering. , logarithmic processing and discrete cosine processing to obtain the MFCC coefficients corresponding to each speech frame.

所述第一聲學特徵序列是針對整體的語音幀序列進行處理所得到的，體現了語音數據序列的整體聲學特徵。The first acoustic feature sequence is obtained by processing the entire speech frame sequence, and reflects the overall acoustic feature of the speech data sequence.

在本公開實施例中，所述第一聲學特徵序列包含與所述語音幀序列中的每個語音幀對應的聲學特徵向量。以MFCC為例，所述第一聲學特徵序列包含了每個語音幀的MFCC係數。根據所述語音幀序列所獲得的第一聲學特徵序列如圖5所示。In an embodiment of the present disclosure, the first acoustic feature sequence includes an acoustic feature vector corresponding to each speech frame in the speech frame sequence. Taking MFCC as an example, the first acoustic feature sequence includes MFCC coefficients of each speech frame. The first acoustic feature sequence obtained according to the speech frame sequence is shown in FIG. 5 .

接下來，根據所述第一聲學特徵序列，獲取至少一個語音幀對應的聲學特徵。Next, according to the first acoustic feature sequence, the acoustic feature corresponding to at least one speech frame is acquired.

在所述第一聲學特徵序列包括了所述語音幀序列中的每個語音幀對應的聲學特徵向量的情況下，可以將所述至少一個語音幀對應的相同數目的特徵向量作為所述語音幀的聲學特徵。其中，上述相同數目的特徵向量可以形成一個特徵矩陣，該特徵矩陣即為所述至少一個語音幀的聲學特徵。When the first acoustic feature sequence includes an acoustic feature vector corresponding to each speech frame in the speech frame sequence, the same number of feature vectors corresponding to the at least one speech frame may be used as the speech frame acoustic characteristics. The above-mentioned same number of feature vectors may form a feature matrix, and the feature matrix is the acoustic feature of the at least one speech frame.

以圖5為例，所述第一聲學特徵序列中的N個特徵向量形成了所對應的N個語音幀的聲學特徵；其中，N為正整數。所述第一聲學特徵矩陣可以包括多個聲學特徵，各個所述聲學特徵所對應的語音幀之間可以是部分重疊的。Taking FIG. 5 as an example, the N feature vectors in the first acoustic feature sequence form the acoustic features of the corresponding N speech frames; wherein, N is a positive integer. The first acoustic feature matrix may include a plurality of acoustic features, and speech frames corresponding to each of the acoustic features may partially overlap.

最後，獲取所述聲學特徵對應的所述互動物件的至少一個局部區域的控制向量。Finally, a control vector of at least one local area of the interactive object corresponding to the acoustic feature is obtained.

對於所獲得的至少一個語音幀對應的聲學特徵，可以獲取至少一個局部區域的控制向量。局部區域可以根據需要控制的互動物件的動作進行選擇，例如在需要控制所述互動物件面部以及肢體同時進行動作時，可以獲取全部局部區域的控制向量；在需要控制所述互動物件的表情時，則可以獲取所述面部所對應的局部區域的控制向量。For the acquired acoustic feature corresponding to at least one speech frame, a control vector of at least one local area may be acquired. The local area can be selected according to the action of the interactive object that needs to be controlled. For example, when the face and limbs of the interactive object need to be controlled to move at the same time, the control vectors of all local areas can be obtained; when the expression of the interactive object needs to be controlled, Then the control vector of the local area corresponding to the face can be obtained.

在播放語音數據序列的同時，根據通過所述第一聲學特徵序列所獲得的各個聲學特徵對應的控制向量驅動所述互動物件做出動作，可以實現終端設備在輸出聲音的同時，互動物件能夠做出與所輸出的聲音相配合的動作，該動作包括面部動作、表情以及肢體動作等，從而使目標物件產生該互動物件正在說話的感覺。並且由於所述控制向量是與輸出聲音的聲學特徵相關的，根據所述控制向量進行驅動能夠使得互動物件的表情和肢體動作具有了情感因素，從而使得互動物件的說話過程更加自然、生動，從而提高了目標物件與互動物件的互動體驗。While playing the voice data sequence, the interactive object is driven to act according to the control vector corresponding to each acoustic feature obtained through the first acoustic feature sequence, so that while the terminal device outputs sound, the interactive object can Actions that match the output sound include facial movements, expressions, and body movements, so that the target object feels that the interactive object is speaking. And because the control vector is related to the acoustic characteristics of the output sound, driving according to the control vector can make the expressions and body movements of the interactive objects have emotional factors, so that the speaking process of the interactive objects is more natural and vivid, thereby Improved the interactive experience between target objects and interactive objects.

在一些實施例中，可以通過在所述第一聲學特徵序列上進行滑動視窗的方式獲取所述至少一個語音幀對應的聲學特徵。In some embodiments, the acoustic feature corresponding to the at least one speech frame may be acquired by performing a sliding window on the first acoustic feature sequence.

通過以設定長度的時間視窗和設定步長，對所述第一聲學特徵序列進行滑動視窗，將所述時間視窗內的聲學特徵向量作為對應的相同數目語音幀的聲學特徵，從而可以獲得這些語音幀共同對應的聲學特徵。在完成滑動視窗後，根據得到的多個聲學特徵，則可以獲得第二聲學特徵序列。These voices can be obtained by performing a sliding window on the first acoustic feature sequence with a time window of a set length and a set step, and taking the acoustic feature vectors in the time window as the acoustic features of the corresponding same number of speech frames. Acoustic features corresponding to the frames in common. After completing the sliding window, according to the obtained multiple acoustic features, a second acoustic feature sequence can be obtained.

以圖5所示的互動物件的驅動方法為例，所述語音幀序列每秒包括100個語音幀，所述時間視窗的長度為1s，步長為0.04s。由於所述第一聲學特徵序列中的每個特徵向量是與語音幀對應的，相應地，所述第一聲學特徵序列每秒同樣包括100個特徵向量。在所述第一聲學特徵序列上進行滑動視窗過程中，每次獲得所述時間視窗內的100個特徵向量，作為對應的100個語音幀的聲學特徵。通過在所述第一聲學特徵序列上以0.04s的步長移動所述時間視窗，分別獲得第1~100語音幀對應的聲學特徵1、第4~104語音幀所對應的聲學特徵2，以此類推，在遍歷第一聲學特徵後，得到聲學特徵1、聲學特徵2、…、聲學特徵M，從而獲得第二聲學特徵序列，其中，M為正整數，其數值根據語音幀序的幀數（第一聲學特徵序列中特徵向量的數目）、時間視窗的長度以及步長確定。Taking the driving method of the interactive object shown in FIG. 5 as an example, the speech frame sequence includes 100 speech frames per second, the length of the time window is 1s, and the step size is 0.04s. Since each feature vector in the first acoustic feature sequence corresponds to a speech frame, correspondingly, the first acoustic feature sequence also includes 100 feature vectors per second. During the sliding window process on the first acoustic feature sequence, 100 feature vectors in the time window are obtained each time as the acoustic features of the corresponding 100 speech frames. By moving the time window with a step size of 0.04s on the first acoustic feature sequence, the acoustic feature 1 corresponding to the 1st to 100th speech frames and the acoustic feature 2 corresponding to the 4th to 104th speech frames are obtained respectively, and By analogy, after traversing the first acoustic feature, acoustic feature 1, acoustic feature 2, ..., acoustic feature M are obtained, so as to obtain the second acoustic feature sequence, where M is a positive integer, and its value is based on the number of frames in the speech frame sequence (the number of feature vectors in the first acoustic feature sequence), the length of the time window and the step size are determined.

根據聲學特徵1、聲學特徵2、…、聲學特徵M，分別可以獲得相應的控制向量1、控制向量2、…、控制向量M，從而獲得控制向量的序列。According to the acoustic feature 1, the acoustic feature 2, ..., the acoustic feature M, the corresponding control vector 1, the control vector 2, ..., and the control vector M can be obtained respectively, thereby obtaining a sequence of control vectors.

如圖5所示，所述控制向量的序列與所述第二聲學特徵序列在時間上是對齊的，所述第二聲學特徵序列中的聲學特徵1、聲學特徵2、…、聲學特徵M，分別是根據所述第一聲學特徵序列中的N個特徵向量獲得的，因此，在播放所述語音幀的同時，可以根據所述控制向量的序列驅動所述互動物件做出動作。As shown in FIG. 5 , the sequence of control vectors and the second acoustic feature sequence are aligned in time, and the acoustic feature 1, acoustic feature 2, ..., acoustic feature M in the second acoustic feature sequence, are respectively obtained according to the N feature vectors in the first acoustic feature sequence. Therefore, while playing the speech frame, the interactive object can be driven to act according to the sequence of the control vectors.

假設在第一個時間視窗的設定時刻開始輸出聲學特徵，可以將在所述設定時刻之前的控制向量設置為默認值，也即在剛開始播放語音幀序列時，使所述互動物件做出默認的動作，在所述設定時刻之後開始利用根據第一聲學特徵序列所得到的控制向量的序列驅動所述互動物件做出動作。Assuming that the output of acoustic features starts at the set time of the first time window, the control vector before the set time can be set to the default value, that is, when the audio frame sequence is just started to be played, the interactive object can be set to the default value. After the set time, the sequence of control vectors obtained according to the first acoustic feature sequence is used to drive the interactive object to act.

以圖5為例，在t0時刻開始輸出聲學特徵1，並以步長對應的時間0.04s為間隔輸出聲學特徵，在t1時刻開始輸出聲學特徵2，t2時刻開始輸出聲學特徵3，直至在t（M-1）時刻輸出聲學特徵M。對應地，在t i~t（i+1）時間段內對應的是特徵向量（i+1），其中，i為小於（M-1）的整數，而在t0時刻之前，控制向量為默認控制向量。Taking Figure 5 as an example, the acoustic feature 1 is output at time t0, and the acoustic feature is output at an interval of 0.04s corresponding to the step size, the acoustic feature 2 is output at the time t1, and the acoustic feature 3 is output at the time t2. (M-1) The acoustic feature M is output at time. Correspondingly, the feature vector (i+1) corresponds to the time period t i~t(i+1), where i is an integer less than (M-1), and before time t0, the control vector is the default control vector.

在本公開實施例中，通過在播放所述語音數據序列的同時，根據所述控制向量的序列驅動所述互動物件做出動作，從而使互動物件的動作與所輸出的聲音同步，給目標物件以所述互動物件正在說話的感覺，提升了目標物件與互動物件的互動體驗。In the embodiment of the present disclosure, while playing the voice data sequence, the interactive object is driven to act according to the sequence of control vectors, so that the action of the interactive object is synchronized with the output sound, and the target object is sent to the target object. With the feeling that the interactive object is talking, the interactive experience between the target object and the interactive object is improved.

所述時間視窗的長度，與所述聲學特徵所包含的資訊量相關。時間視窗的長度越大，所包含的資訊量越多，驅動所述互動物件所做出的動作與聲音的關聯性越強。時間視窗滑動的步長與獲取控制向量的時間間隔（頻率）相關，也即與驅動互動物件做出動作的頻率相關。可以根據實際的互動場景來設置所述時間視窗的長度以及步長，以使互動物件做出的表情和動作與聲音的關聯性更強，並且更加生動、自然。The length of the time window is related to the amount of information contained in the acoustic feature. The longer the time window is, the more information it contains, and the stronger the correlation between the action and the sound that drives the interactive object. The step size of the time window sliding is related to the time interval (frequency) of obtaining the control vector, which is related to the frequency of driving the interactive object to make an action. The length and step size of the time window can be set according to the actual interactive scene, so that the expressions and actions made by the interactive objects are more closely related to the sound, and more vivid and natural.

在一些實施例中，所述聲學特徵包括L個維度的梅爾頻率倒譜系數MFCC，其中，L為正整數。MFCC表示語音訊號的能量在不同頻率範圍的分佈，可以通過將所述語音幀序列中的多個語音幀數據轉換至頻域，利用包括L個子帶的梅爾濾波器，獲得L個維度的MFCC。通過根據語音數據序列的MFCC來獲取控制向量，以根據所述控制向量驅動所述互動物件進行面部動作和肢體動作，使得互動物件的表情和肢體動作具有了情感因素，使得互動物件的說話過程更加自然、生動，從而提高了目標物件與互動物件的互動體驗。In some embodiments, the acoustic features include L-dimension Mel-frequency cepstral coefficients MFCC, where L is a positive integer. MFCC represents the distribution of the energy of the speech signal in different frequency ranges. The L-dimension MFCC can be obtained by converting multiple speech frame data in the speech frame sequence to the frequency domain and using a mel filter including L subbands. . The control vector is obtained according to the MFCC of the speech data sequence, so as to drive the interactive object to perform facial movements and body movements according to the control vector, so that the expressions and body movements of the interactive object have emotional factors, and the speaking process of the interactive object is more Natural and vivid, thus improving the interactive experience between the target object and the interactive object.

在一些實施例中，可以將所述語音數據單元的特徵資訊輸入至預先訓練的循環神經網路，獲得與所述特徵資訊對應的所述互動物件的控制參數值。由於所述循環神經網路是一種時間遞歸神經網路，其可以學習所輸入的特徵資訊的歷史資訊，根據語音單元序列輸出控制參數；例如該控制參數可以為面部姿態控制參數，或者至少一個局部區域的控制向量。In some embodiments, the feature information of the speech data unit may be input into a pre-trained recurrent neural network to obtain control parameter values of the interactive object corresponding to the feature information. Since the recurrent neural network is a time recursive neural network, it can learn the historical information of the input feature information, and output control parameters according to the sequence of speech units; for example, the control parameters can be facial gesture control parameters, or at least one local The control vector for the region.

在本公開實施例中，利用預先訓練的循環神經網路獲取所述語音數據單元的特徵資訊對應的控制參數，將具有關聯性的歷史特徵資訊和當前特徵資訊進行融合，從而使得歷史控制參數對當前控制參數的變化產生影響，使得互動人物的表情變化和肢體動作更加平緩、自然。In the embodiment of the present disclosure, a pre-trained recurrent neural network is used to obtain the control parameters corresponding to the feature information of the speech data unit, and the related historical feature information and the current feature information are fused, so that the historical control parameters are The changes of the current control parameters have an impact, making the expression changes and body movements of the interactive characters more gentle and natural.

在一些實施例中，可以通過以下方式對所述循環神經網路進行訓練。In some embodiments, the recurrent neural network can be trained in the following manner.

首先，獲取特徵資訊樣本。例如，可以通過以下方式獲取所述特徵資訊樣本。First, obtain a sample of feature information. For example, the feature information sample can be obtained in the following manner.

獲取一角色發出語音的視訊段，從所述視訊段中提取角色的相應語音段，例如，可以獲取一真實人物正在說話的視訊段；對所述視訊段進行採樣獲取多個包含所述角色的第一圖像幀；以及，對所述語音段進行採樣，獲得多個語音幀。Obtain a video segment of a character's voice, and extract the corresponding speech segment of the character from the video segment. For example, a video segment in which a real person is speaking can be obtained; the video segment is sampled to obtain a plurality of video segments containing the character. a first image frame; and sampling the speech segment to obtain a plurality of speech frames.

根據與所述第一圖像幀對應的所述語音幀所包含的語音數據單元，獲取所述語音幀對應的特徵資訊；Obtain feature information corresponding to the voice frame according to the voice data unit included in the voice frame corresponding to the first image frame;

將所述第一圖像幀轉化為包含所述互動物件的第二圖像幀，獲取所述第二圖像幀對應的所述互動物件的控制參數值。Converting the first image frame into a second image frame including the interactive object, and acquiring control parameter values of the interactive object corresponding to the second image frame.

根據所述控制參數值，對與所述第一圖像幀對應的特徵資訊進行標注，獲得特徵資訊樣本。According to the control parameter value, the feature information corresponding to the first image frame is marked to obtain a feature information sample.

在一些實施例中，所述特徵資訊包括音素的特徵編碼，所述控制參數包括面部肌肉控制係數。根據上述獲取特徵資訊樣本的方法，利用所獲得的面部肌肉控制係數，對與所述第一圖像幀對應的音素的特徵編碼進行標注，則獲得了音素的特徵編碼對應的特徵資訊樣本。In some embodiments, the feature information includes phoneme feature codes, and the control parameters include facial muscle control coefficients. According to the above method for obtaining feature information samples, the obtained facial muscle control coefficients are used to mark the feature codes of the phonemes corresponding to the first image frame, and then the feature information samples corresponding to the phoneme feature codes are obtained.

在一些實施例中，所述特徵資訊包括音素的特徵編碼，所述控制參數包括所述互動物件的至少一個局部的控制向量。根據上述獲取特徵資訊樣本的方法，利用所獲得的至少一個局部的控制向量，對與所述第一圖像幀對應的音素的特徵編碼進行標註，則獲得了音素的特徵編碼對應的特徵資訊樣本。In some embodiments, the feature information includes feature codes of phonemes, and the control parameters include at least one local control vector of the interactive object. According to the above method for obtaining feature information samples, the feature code of the phoneme corresponding to the first image frame is marked by using the obtained at least one local control vector, and then the feature information sample corresponding to the feature code of the phoneme is obtained. .

在一些實施例中，所述特徵資訊包括語音幀的聲學特徵，所述控制參數包括所述互動物件的至少一個局部的控制向量。根據上述獲取特徵資訊樣本的方法，利用所獲得的至少一個局部的控制向量，對與所述第一圖像幀對應的語音幀的聲學特徵進行標注，則獲得了語音幀的聲學特徵對應的特徵資訊樣本。In some embodiments, the feature information includes acoustic features of speech frames, and the control parameter includes at least one local control vector of the interactive object. According to the above method for obtaining feature information samples, using the obtained at least one local control vector to mark the acoustic features of the speech frame corresponding to the first image frame, the features corresponding to the acoustic features of the speech frame are obtained Information sample.

本領域技術人員應當理解，所述特徵資訊樣本不限於以上所述，對應於各個類型的語音數據單元的各種特徵，可以獲得相應的特徵資訊樣本。Those skilled in the art should understand that the feature information samples are not limited to the above, and corresponding feature information samples can be obtained corresponding to various features of various types of speech data units.

在獲得所述特徵資訊樣本後，根據所述特徵資訊樣本對初始循環神經網路進行訓練，在網路損失的變化滿足收斂條件後訓練得到所述循環神經網路，其中，所述網路損失包括所述循環神經網路預測得到的控制參數值與標注的控制參數值之間的差異。After obtaining the feature information samples, the initial recurrent neural network is trained according to the feature information samples, and the recurrent neural network is obtained by training after the change of the network loss satisfies the convergence condition, wherein the network loss Including the difference between the control parameter value predicted by the recurrent neural network and the marked control parameter value.

在本公開實施例中，通過將一角色的視訊段，拆分為對應的多個第一圖像幀和多個語音幀，通過將包含真實人物的第一圖像幀轉化為包含互動物件的第二圖像幀來獲取至少一個語音幀的特徵資訊對應的控制參數值，使得特徵資訊與控制參數值的對應性較好，從而獲得高質量的特徵資訊樣本，使得互動物件的姿態更接近於對應角色的真實姿態。In the embodiment of the present disclosure, by splitting a video segment of a character into a plurality of corresponding first image frames and a plurality of voice frames, by converting the first image frame containing the real person into the interactive object containing the first image frame The second image frame is used to obtain the control parameter value corresponding to the feature information of at least one speech frame, so that the correspondence between the feature information and the control parameter value is better, so as to obtain high-quality feature information samples, so that the posture of the interactive object is closer to Corresponds to the real pose of the character.

圖6繪示根據本公開至少一個實施例的互動物件的驅動裝置的結構示意圖，如圖6所示，該裝置可以包括：第一獲取單元601，用於獲取所述互動物件的驅動數據，並確定所述驅動數據的驅動模式；第二獲取單元602，用於響應於所述驅動模式，根據所述驅動數據獲取所述互動物件的控制參數值；驅動單元603，用於根據所述控制參數值控制所述互動物件的姿態。FIG. 6 is a schematic structural diagram of an apparatus for driving an interactive object according to at least one embodiment of the present disclosure. As shown in FIG. 6 , the apparatus may include: a first acquiring unit 601 for acquiring driving data of the interactive object, and determining the driving mode of the driving data; the second obtaining unit 602 is configured to, in response to the driving mode, obtain the control parameter value of the interactive object according to the driving data; the driving unit 603 is configured to obtain the control parameter value according to the control parameter The value controls the pose of the interactive object.

在一些實施例中，所述裝置更包括輸出單元，用於根據所述驅動數據控制所述顯示設備輸出語音和/或顯示文本。In some embodiments, the apparatus further includes an output unit for controlling the display device to output voice and/or display text according to the driving data.

在一些實施例中，在確定所述驅動數據對應的驅動模式時，所述第一獲取單元具體用於：根據所述驅動數據的類型，獲取所述驅動數據對應的語音數據序列，所述語音數據序列包括多個語音數據單元；若檢測到所述語音數據單元中包括目標數據，則確定所述驅動數據的驅動模式為第一驅動模式，所述目標數據與互動物件的預設控制參數值相對應；響應於所述驅動模式，根據所述驅動數據獲取所述互動物件的控制參數值，包括：響應於所述第一驅動模式，將所述目標數據對應的所述預設控制參數值作為所述互動物件的控制參數值。In some embodiments, when determining the driving mode corresponding to the driving data, the first obtaining unit is specifically configured to: according to the type of the driving data, obtain a voice data sequence corresponding to the driving data, the voice data The data sequence includes a plurality of voice data units; if it is detected that the voice data unit includes target data, the driving mode of the driving data is determined to be the first driving mode, and the target data and the preset control parameter value of the interactive object are determined. Correspondingly; in response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the first driving mode, converting the preset control parameter value corresponding to the target data As the control parameter value of the interactive object.

在一些實施例中，所述目標數據包括關鍵詞或關鍵字，所述關鍵詞或所述關鍵字與互動物件的設定動作的預設控制參數值相對應；或者，所述目標數據包括音節，所述音節與所述互動物件的設定嘴型動作的預設控制參數值對應。In some embodiments, the target data includes a keyword or a keyword, and the keyword or the keyword corresponds to a preset control parameter value of a setting action of an interactive object; or, the target data includes a syllable, The syllable corresponds to the preset control parameter value of the interactive object for setting the mouth shape action.

在一些實施例中，在識別所述驅動數據的驅動模式時，所述第一獲取單元具體用於：根據所述驅動數據的類型，獲取所述驅動數據對應的語音數據序列，所述語音數據序列包括多個語音數據單元；若未檢測到所述語音數據單元中包括目標數據，則確定所述驅動數據的驅動模式為第二驅動模式，所述目標數據與互動物件的預設控制參數值相對應；響應於所述驅動模式，根據所述驅動數據獲取所述互動物件的控制參數值，包括：響應於所述第二驅動模式，獲取所述語音數據序列中的至少一個語音數據單元的特徵資訊；獲取與所述特徵資訊對應的所述互動物件的控制參數值。In some embodiments, when identifying the driving mode of the driving data, the first obtaining unit is specifically configured to: according to the type of the driving data, obtain a voice data sequence corresponding to the driving data, the voice data The sequence includes a plurality of voice data units; if it is not detected that the voice data unit includes target data, the driving mode of the driving data is determined to be the second driving mode, and the target data and the preset control parameter value of the interactive object are determined. Correspondingly; in response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the second driving mode, acquiring the value of at least one voice data unit in the voice data sequence Feature information; obtain the control parameter value of the interactive object corresponding to the feature information.

在一些實施例中，所述語音數據序列包括音素序列，在取所述語音數據序列中的至少一個語音數據單元的特徵資訊時，所述第二獲取單元具體用於：對所述音素序列進行特徵編碼，獲得所述音素序列對應的第一編碼序列；根據所述第一編碼序列，獲取至少一個音素對應的特徵編碼；根據所述特徵編碼，獲得所述至少一個音素的特徵資訊。In some embodiments, the speech data sequence includes a phoneme sequence, and when the feature information of at least one speech data unit in the speech data sequence is obtained, the second obtaining unit is specifically configured to: perform an operation on the phoneme sequence. The feature coding is used to obtain a first coding sequence corresponding to the phoneme sequence; the feature coding corresponding to at least one phoneme is obtained according to the first coding sequence; the feature information of the at least one phoneme is obtained according to the feature coding.

在一些實施例中，所述語音數據序列包括語音幀序列，在獲取所述語音數據序列中的至少一個語音數據單元的特徵資訊時，所述第二獲取單元具體用於：獲取所述語音幀序列對應的第一聲學特徵序列，所述第一聲學特徵序列包括與所述語音幀序列中的每個語音幀對應的聲學特徵向量；根據所述第一聲學特徵序列，獲取至少一個語音幀對應的聲學特徵向量；根據所述聲學特徵向量，獲得所述至少一個語音幀對應的特徵資訊。In some embodiments, the voice data sequence includes a voice frame sequence, and when acquiring feature information of at least one voice data unit in the voice data sequence, the second acquiring unit is specifically configured to: acquire the voice frame The first acoustic feature sequence corresponding to the sequence, the first acoustic feature sequence includes an acoustic feature vector corresponding to each voice frame in the voice frame sequence; according to the first acoustic feature sequence, obtain at least one voice frame corresponding to The acoustic feature vector of ; obtain feature information corresponding to the at least one speech frame according to the acoustic feature vector.

在一些實施例中，所述互動物件的控制參數包括面部姿態參數，所述面部姿態參數包括面部肌肉控制係數，該面部肌肉控制係數用於控制至少一個面部肌肉的運動狀態；在根據所述驅動數據獲取所述互動物件的控制參數值時，所述第二獲取單元具體用於：根據所述驅動數據獲取所述互動物件的面部肌肉控制係數；所述驅動單元具體用於：根據所獲取的面部肌肉控制係數，驅動所述互動物件做出與所述驅動數據匹配的面部動作；所述裝置更包括肢體驅動單元，用於獲取與所述面部姿態參數關聯的身體姿態的驅動數據；根據與所述面部姿態參數值關聯的身體姿態的驅動數據，驅動所述互動物件做出肢體動作。In some embodiments, the control parameters of the interactive object include facial posture parameters, the facial posture parameters include facial muscle control coefficients, and the facial muscle control coefficients are used to control the motion state of at least one facial muscle; When the data obtains the control parameter value of the interactive object, the second obtaining unit is specifically used for: obtaining the facial muscle control coefficient of the interactive object according to the driving data; the driving unit is specifically used for: according to the obtained The facial muscle control coefficient drives the interactive object to make facial movements that match the driving data; the device further includes a limb driving unit for acquiring the driving data of the body posture associated with the facial posture parameters; The driving data of the body posture associated with the facial posture parameter value drives the interactive object to make body movements.

在一些實施例中，所述互動物件的控制參數包括所述互動物件的至少一個局部區域的控制向量；在根據所述驅動數據獲取所述互動物件的控制參數值時，所述第二獲取單元具體用於：根據所述驅動數據獲取所述互動物件的至少一個局部區域的控制向量；所述驅動單元具體用於：根據所獲取的所述至少一個局部區域的控制向量，控制所述互動物件的面部動作和/或肢體動作。In some embodiments, the control parameter of the interactive object includes a control vector of at least one local area of the interactive object; when acquiring the control parameter value of the interactive object according to the driving data, the second acquiring unit It is specifically used for: acquiring the control vector of at least one local area of the interactive object according to the driving data; the driving unit is specifically used for: controlling the interactive object according to the acquired control vector of the at least one local area facial and/or body movements.

根據本公開的一方面，提供一種計算機可讀儲存媒體，其上儲存有計算機程式，所述程式被處理器執行時實現本公開提供的任一實施方式所述的互動物件的驅動方法。According to an aspect of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method for driving an interactive object according to any one of the implementation manners provided in the present disclosure.

本說明書至少一個實施例還提供了一種電子設備，如圖7所示，所述設備包括記憶體、處理器，記憶體用於儲存可在處理器上運行的計算機指令，處理器用於在執行所述計算機指令時實現本公開任一實施例所述的互動物件的驅動方法。At least one embodiment of the present specification further provides an electronic device, as shown in FIG. 7 , the device includes a memory and a processor, where the memory is used to store computer instructions that can be executed on the processor, and the processor is used to execute all computer instructions. The method for driving an interactive object described in any embodiment of the present disclosure is implemented when the computer instruction is used.

本說明書至少一個實施例還提供了一種計算機可讀儲存媒體，其上儲存有計算機程式，所述程式被處理器執行時實現本公開任一實施例所述的互動物件的驅動方法。At least one embodiment of the present specification further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method for driving an interactive object described in any embodiment of the present disclosure.

本領域技術人員應明白，本說明書一個或多個實施例可提供為方法、系統或計算機程式產品。因此，本說明書一個或多個實施例可採用完全硬體實施例、完全軟體實施例或結合軟體和硬體方面的實施例的形式。而且，本說明書一個或多個實施例可採用在一個或多個其中包含有計算機可用程式代碼的計算機可用儲存媒體（包括但不限於磁碟記憶體、CD-ROM、光學記憶體等）上實施的計算機程式產品的形式。As will be appreciated by one skilled in the art, one or more embodiments of this specification may be provided as a method, system or computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may be implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) having computer-usable program code embodied therein in the form of a computer program product.

本說明書中的各個實施例均採用遞進的方式描述，各個實施例之間相同相似的部分互相參見即可，每個實施例重點說明的都是與其他實施例的不同之處。尤其，對於數據處理設備實施例而言，由於其基本相似於方法實施例，所以描述的比較簡單，相關之處參見方法實施例的部分說明即可。Each embodiment in this specification is described in a progressive manner, and the same and similar parts between the various embodiments may be referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the data processing device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, please refer to the partial description of the method embodiment.

上述對本說明書特定實施例進行了描述。其它實施例在所附請求項的範圍內。在一些情況下，在請求項中記載的行為或步驟可以按照不同於實施例中的順序來執行並且仍然可以實現期望的結果。另外，在附圖中描繪的過程不一定要求繪示的特定順序或者連續順序才能實現期望的結果。在某些實施方式中，多任務處理和並行處理也是可以的或者可能是有利的。The foregoing describes specific embodiments of the present specification. Other embodiments are within the scope of the appended claims. In some cases, the acts or steps recited in the claims may be performed in an order different from that of the embodiments and still achieve desirable results. Additionally, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

本說明書中描述的主題及功能操作的實施例可以在以下中實現：數位電子電路、有形體現的計算機軟體或韌體、包括本說明書中公開的結構及其結構性等同物的計算機硬體、或者它們中的一個或多個的組合。本說明書中描述的主題的實施例可以實現為一個或多個計算機程式，即編碼在有形非暫時性程式載體上以被數據處理裝置執行或控制數據處理裝置的操作的計算機程式指令中的一個或多個模組。可替代地或附加地，程式指令可以被編碼在人工生成的傳播訊號上，例如機器生成的電、光或電磁訊號，該訊號被生成以將資訊編碼並傳輸到合適的接收機裝置以由數據處理裝置執行。計算機儲存媒體可以是機器可讀儲存設備、機器可讀儲存基板、隨機或序列存取記憶體設備、或它們中的一個或多個的組合。Embodiments of the subject matter and functional operations described in this specification can be implemented in digital electronic circuits, in tangible embodiment of computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or A combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, ie, one or more of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Multiple mods. Alternatively or additionally, program instructions may be encoded on artificially generated propagating signals, such as machine-generated electrical, optical or electromagnetic signals, which are generated to encode and transmit information to suitable receiver devices for data retrieval. The processing device executes. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or sequential access memory device, or a combination of one or more of these.

本說明書中描述的處理及邏輯流程可以由執行一個或多個計算機程式的一個或多個可程式化計算機執行，以通過根據輸入數據進行操作並生成輸出來執行相應的功能。所述處理及邏輯流程還可以由專用邏輯電路—例如FPGA（現場可程式化門陣列）或ASIC（專用積體電路）來執行，並且裝置也可以實現為專用邏輯電路。The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, eg, an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).

適合用於執行計算機程式的計算機包括，例如通用和/或專用微處理器，或任何其他類型的中央處理單元。通常，中央處理單元將從唯讀記憶體和/或隨機存取記憶體接收指令和數據。計算機的基本組件包括用於實施或執行指令的中央處理單元以及用於儲存指令和數據的一個或多個記憶體設備。通常，計算機還將包括用於儲存數據的一個或多個大容量儲存設備，例如磁碟、光碟磁光碟或光碟等，或者計算機將可操作地與此大容量儲存設備耦接以從其接收數據或向其傳送數據，抑或兩種情況兼而有之。然而，計算機不是必須具有這樣的設備。此外，計算機可以嵌入在另一設備中，例如移動電話、個人數位助理（PDA）、移動音訊或視訊播放器、遊戲操縱臺、全球定位系統（GPS）接收機、或例如通用序列匯流排（USB）快閃記憶體驅動器的便攜式儲存設備，僅舉幾例。Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from read-only memory and/or random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operably coupled to, such mass storage devices to receive data therefrom, one or more mass storage devices, such as magnetic disks, optical disks, magneto-optical disks, or optical disks, etc., for storing data Or send data to it, or both. However, the computer does not have to have such a device. Additionally, a computer may be embedded in another device, such as a mobile phone, personal digital assistant (PDA), mobile audio or video player, game console, global positioning system (GPS) receiver, or a universal serial bus (USB ) flash memory drives for portable storage devices, to name a few.

適合於儲存計算機程式指令和數據的計算機可讀媒體包括所有形式的非揮發性記憶體、媒體和記憶體設備，例如包括半導體記憶體設備（例如EPROM、EEPROM和快閃記憶體設備）、磁碟（例如內部硬碟或可移動碟）、光碟磁光碟以及CD ROM和DVD-ROM。處理器和記憶體可由專用邏輯電路補充或併入專用邏輯電路中。Computer-readable media suitable for storage of computer program instructions and data include all forms of non-volatile memory, media, and memory devices including, for example, semiconductor memory devices (eg, EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), CD-ROMs, and CD-ROMs and DVD-ROMs. The processor and memory may be supplemented by or incorporated in special purpose logic circuitry.

雖然本說明書包含許多具體實施細節，但是這些不應被解釋為限制任何發明的範圍或所要求保護的範圍，而是主要用於描述特定發明的具體實施例的特徵。本說明書內在多個實施例中描述的某些特徵也可以在單個實施例中被組合實施。另一方面，在單個實施例中描述的各種特徵也可以在多個實施例中分開實施或以任何合適的子組合來實施。此外，雖然特徵可以如上所述在某些組合中起作用並且甚至最初如此要求保護，但是來自所要求保護的組合中的一個或多個特徵在一些情況下可以從該組合中去除，並且所要求保護的組合可以指向子組合或子組合的變型。While this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or what may be claimed, but rather are used primarily to describe features of specific embodiments of particular inventions. Certain features that are described in this specification in multiple embodiments can also be implemented in combination in a single embodiment. On the other hand, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may function as described above in certain combinations and even be originally claimed as such, one or more features from a claimed combination may in some cases be removed from the combination and the claimed A protected combination may point to a subcombination or a variation of a subcombination.

類似地，雖然在附圖中以特定順序描繪了操作，但是這不應被理解為要求這些操作以所示的特定順序執行或順次執行、或者要求所有例示的操作被執行，以實現期望的結果。在某些情況下，多任務和並行處理可能是有利的。此外，上述實施例中的各種系統模組和組件的分離不應被理解為在所有實施例中均需要這樣的分離，並且應當理解，所描述的程式組件和系統通常可以一起積體在單個軟體產品中，或者封裝成多個軟體產品。Similarly, although operations in the figures are depicted in a particular order, this should not be construed as requiring that the operations be performed in the particular order shown or sequentially, or that all illustrated operations be performed, to achieve the desired results . In some cases, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system modules and components in the above-described embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems may generally be integrated together in a single software package products, or packaged into multiple software products.

由此，主題的特定實施例已被描述。其他實施例在所附請求項的範圍以內。在某些情況下，請求項中記載的動作可以以不同的順序執行並且仍實現期望的結果。此外，附圖中描繪的處理並非必需所示的特定順序或順次順序，以實現期望的結果。在某些實現中，多任務和並行處理可能是有利的。Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claim may be performed in a different order and still achieve the desired result. Furthermore, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.

以上所述僅為本說明書一個或多個實施例的較佳實施例而已，並不用以限制本說明書一個或多個實施例，凡在本說明書一個或多個實施例的精神和原則之內，所做的任何修改、等同替換、改進等，均應包含在本說明書一個或多個實施例保護的範圍之內。The above descriptions are only preferred embodiments of one or more embodiments of this specification, and are not intended to limit one or more embodiments of this specification. All within the spirit and principles of one or more embodiments of this specification, Any modifications, equivalent replacements, improvements, etc. made should be included within the protection scope of one or more embodiments of this specification.

201:獲取所述互動物件的驅動數據，並確定所述驅動數據的驅動模式的步驟 202:響應於所述驅動模式，根據所述驅動數據獲取所述互動物件的控制參數值的步驟 203:根據所述控制參數值控制所述互動物件的姿態的步驟 601:第一獲取單元 602:第二獲取單元 603:驅動單元201: the step of acquiring the driving data of the interactive object and determining the driving mode of the driving data 202: in response to the driving mode, the step of acquiring the control parameter value of the interactive object according to the driving data 203: the step of controlling the posture of the interactive object according to the control parameter value 601: The first acquisition unit 602: Second acquisition unit 603: Drive unit

圖1是本公開至少一個實施例提出的互動物件的驅動方法中顯示設備的示意圖。圖2是本公開至少一個實施例提出的互動物件的驅動方法的流程圖。圖3是本公開至少一個實施例提出的對音素序列進行特徵編碼的過程示意圖。圖4是本公開至少一個實施例提出的根據音素序列獲得控制參數值的過程示意圖。圖5是本公開至少一個實施例提出的根據語音幀序列獲得控制參數值的過程示意圖。圖6是本公開至少一個實施例提出的互動物件的驅動裝置的結構示意圖。圖7是本公開至少一個實施例提出的電子設備的結構示意圖。FIG. 1 is a schematic diagram of a display device in a method for driving an interactive object provided by at least one embodiment of the present disclosure. FIG. 2 is a flowchart of a method for driving an interactive object provided by at least one embodiment of the present disclosure. FIG. 3 is a schematic diagram of a process of feature encoding for a phoneme sequence proposed by at least one embodiment of the present disclosure. FIG. 4 is a schematic diagram of a process of obtaining a control parameter value according to a phoneme sequence proposed by at least one embodiment of the present disclosure. FIG. 5 is a schematic diagram of a process of obtaining a control parameter value according to a sequence of speech frames according to at least one embodiment of the present disclosure. FIG. 6 is a schematic structural diagram of a driving device for an interactive object according to at least one embodiment of the present disclosure. FIG. 7 is a schematic structural diagram of an electronic device provided by at least one embodiment of the present disclosure.

201:獲取所述互動物件的驅動數據，並確定所述驅動數據的驅動模式的步驟201: the step of acquiring the driving data of the interactive object and determining the driving mode of the driving data

202:響應於所述驅動模式，根據所述驅動數據獲取所述互動物件的控制參數值的步驟202: in response to the driving mode, the step of acquiring the control parameter value of the interactive object according to the driving data

203:根據所述控制參數值控制所述互動物件的姿態的步驟203: the step of controlling the posture of the interactive object according to the control parameter value

Claims

A method for driving an interactive object, wherein the interactive object is displayed in a display device, the driving method includes: acquiring driving data of the interactive object, and determining a driving mode of the driving data; in response to the driving mode, according to Obtaining the control parameter value of the interactive object from the driving data; controlling the posture of the interactive object according to the control parameter value, wherein determining the driving mode of the driving data includes: obtaining the driving data according to the type of the driving data a voice data sequence corresponding to the driving data, the voice data sequence including a plurality of voice data units; in response to not detecting that the voice data unit includes target data, determine that the driving mode of the driving data is the second driving mode ; wherein, in response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the second driving mode, acquiring the data of at least one voice data unit in the voice data sequence Feature information; obtain the control parameter value of the interactive object corresponding to the feature information.

The driving method according to claim 1, further comprising: controlling the display device to output voice and/or display text according to the driving data.

The driving method according to claim 1 or 2, wherein determining the driving mode of the driving data comprises: in response to detecting that the target data is included in the voice data unit, determining that the driving mode of the driving data is: In the first driving mode, the target data corresponds to the preset control parameter value of the interactive object; in response to the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: responding to the In the first driving mode, the preset control parameter value corresponding to the target data is used as the control parameter value of the interactive object.

The driving method according to claim 3, wherein the target data includes a keyword or a keyword, and the keyword or the keyword corresponds to a preset control parameter value of the setting action of the interactive object; or , the target data includes syllables, and the syllables correspond to the preset control parameter values of the interactive object for setting the mouth shape action.

The driving method according to claim 1, wherein the voice data sequence includes a phoneme sequence, and acquiring feature information of at least one voice data unit in the voice data sequence includes: performing feature encoding on the phoneme sequence to obtain The first coding sequence corresponding to the phoneme sequence; according to the first coding sequence, obtain the feature code corresponding to at least one phoneme; according to the feature code, obtain the feature information of the at least one phoneme.

The driving method according to claim 1, wherein the voice data sequence includes a voice frame sequence, and acquiring feature information of at least one voice data unit in the voice data sequence includes: acquiring the first voice data unit corresponding to the voice frame sequence an acoustic feature sequence, the first acoustic feature sequence includes an acoustic feature vector corresponding to each speech frame in the speech frame sequence; according to the first acoustic feature sequence, acquire an acoustic feature vector corresponding to at least one speech frame ; According to the acoustic feature vector, obtain feature information corresponding to the at least one speech frame.

The driving method according to claim 1 or 2, wherein the control parameters of the interactive object include facial posture parameters, the facial posture parameters include facial muscle control coefficients, and the facial muscle control coefficients are used to control at least one facial muscle obtaining the control parameter value of the interactive object according to the driving data, including: obtaining the facial muscle control coefficient of the interactive object according to the driving data; controlling the posture of the interactive object according to the control parameter value , which includes: driving the interactive object to perform facial actions matching the driving data according to the acquired facial muscle control coefficients.

The driving method according to claim 7, further comprising: acquiring the driving data of the body posture associated with the facial posture parameter; driving the interactive object to do body movements.

The driving method according to claim 1 or 2, wherein the control parameter of the interactive object includes a control vector of at least one local area of the interactive object; the control parameter value of the interactive object is obtained according to the driving data, The method includes: obtaining a control vector of at least one local area of the interactive object according to the driving data; controlling the posture of the interactive object according to the control parameter value, including: obtaining the control vector of the at least one local area according to the obtained control vector , to control the facial movements and/or body movements of the interactive object.

The driving method according to claim 1, wherein acquiring the control parameter value of the interactive object corresponding to the feature information comprises: inputting the feature information into a pre-trained recurrent neural network, and obtaining the value of the control parameter of the interactive object corresponding to the feature information. The control parameter value of the interactive object corresponding to the feature information.

A driving device for an interactive object, the interactive object is displayed on a display device, the device comprises: a first acquiring unit, configured to acquire driving data of the interactive object, and determine a driving mode of the driving data; a second acquiring unit an acquiring unit, configured to acquire a control parameter value of the interactive object according to the driving data in response to the driving mode; a driving unit, configured to control the posture of the interactive object according to the control parameter value, wherein the determined The driving mode of the driving data includes: obtaining the language corresponding to the driving data according to the type of the driving data a voice data sequence, the voice data sequence includes a plurality of voice data units; in response to not detecting that the voice data unit includes target data, determining that the driving mode of the driving data is the second driving mode; wherein, in response to the In the driving mode, acquiring the control parameter value of the interactive object according to the driving data includes: in response to the second driving mode, acquiring feature information of at least one voice data unit in the voice data sequence; The control parameter value of the interactive object corresponding to the feature information.

An electronic device includes a memory and a processor, wherein the memory is used to store computer instructions that can be executed on the processor, and the processor is used to implement the driving method described in claim 1 when the computer instructions are executed.

A computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements the driving method described in claim 1.