WO2018006371A1 - Method, system and robot for synchronizing voice and virtual actions - Google Patents
Method, system and robot for synchronizing voice and virtual actions
- Publication number
- WO2018006371A1 (PCT/CN2016/089215)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- length
- robot
- voice
- time
- Prior art date
- 2016-07-07
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J13/00—Controls for manipulators
- B25J13/003—Controls for manipulators by means of an audio-responsive input
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/04—Time compression or expansion
- G10L21/055—Time compression or expansion for synchronising with other signals, e.g. video signals
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
Definitions
- the present invention relates to the field of robot interaction technologies, and in particular to a method, system and robot for synchronizing voice and virtual actions.
- as tools for interacting with humans, robots are used in more and more settings; for example, elderly people and children who are somewhat lonely can interact with robots, including through dialogue and entertainment.
- to make robots more anthropomorphic when interacting with humans, the inventor developed a display device and imaging system for a virtual robot that can form a 3D animated image. The virtual robot's host accepts human instructions, such as voice, to interact with humans, and the virtual 3D animated image then replies with sound and actions according to the host's instructions. This makes the robot more anthropomorphic: it can interact with humans not only through sound and expressions but also through actions, greatly improving the interactive experience.
- a method of synchronizing voice and virtual actions, including: acquiring multimodal information of a user; generating interactive content according to the user's multimodal information and a life timeline, where the interactive content includes at least voice information and action information; and adjusting the duration of the voice information and the duration of the action information to be the same.
- the specific steps of adjusting the duration of the voice information and the duration of the action information to be the same include:
- if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, then when the duration of the voice information is less than the duration of the action information, the playback speed of the action information is accelerated so that the duration of the action information equals the duration of the voice information.
- when the duration of the voice information is greater than the duration of the action information, the playback speed of the voice information is accelerated or/and the playback speed of the action information is slowed down, so that the duration of the action information equals the duration of the voice information.
- the specific steps of adjusting the duration of the voice information and the duration of the action information to be the same further include:
- if the difference between the duration of the voice information and the duration of the action information is greater than the threshold, then when the duration of the voice information is greater than the duration of the action information, at least two sets of action information are sorted and combined so that the duration of the combined action information equals the duration of the voice information.
- when the duration of the voice information is less than the duration of the action information, part of the actions in the action information are selected so that the duration of the selected actions equals the duration of the voice information.
- the method for generating the parameters of the robot's life timeline includes: expanding the robot's self-cognition; acquiring the parameters of the life timeline; and fitting the parameters of the robot's self-cognition to the parameters in the life timeline to generate the robot's life timeline.
- the step of expanding the robot's self-cognition specifically comprises: combining life scenes with the robot's self-cognition to form a self-cognition curve based on the life timeline.
- the step of fitting the parameters of the robot's self-cognition to the parameters in the life timeline comprises: using a probability algorithm, calculating the probability of each parameter of the robot on the life timeline changing after the timeline scene parameters change, to form a fitted curve.
- the life timeline refers to a timeline covering the 24 hours of a day; the parameters in the life timeline include at least the daily behaviors the user performs on the life timeline and the parameter values representing those behaviors.
- a system for synchronizing voice and virtual actions, including:
- an acquisition module, configured to acquire multimodal information of a user;
- an artificial intelligence module, configured to generate interactive content according to the user's multimodal information and a life timeline, where the interactive content includes at least voice information and action information;
- a control module, configured to adjust the duration of the voice information and the duration of the action information to be the same.
- the control module is specifically configured to: if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, then when the duration of the voice information is less than the duration of the action information, accelerate the playback speed of the action information so that the duration of the action information equals the duration of the voice information.
- when the duration of the voice information is greater than the duration of the action information, the playback speed of the voice information is accelerated or/and the playback speed of the action information is slowed down, so that the duration of the action information equals the duration of the voice information.
- the control module is further specifically configured to: if the difference between the duration of the voice information and the duration of the action information is greater than the threshold, then when the duration of the voice information is greater than the duration of the action information, combine at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
- when the duration of the voice information is less than the duration of the action information, part of the actions in the action information are selected so that the duration of the selected actions equals the duration of the voice information.
- the system comprises a processing module configured to: expand the robot's self-cognition; acquire the parameters of the life timeline; and fit the parameters of the robot's self-cognition to the parameters in the life timeline to generate the robot's life timeline.
- the processing module is specifically configured to combine life scenes with the robot's self-cognition to form a self-cognition curve based on the life timeline.
- the processing module is specifically configured to: use a probability algorithm to calculate the probability of each parameter of the robot on the life timeline changing after the timeline scene parameters change, to form a fitted curve.
- the life timeline refers to a timeline covering the 24 hours of a day; the parameters in the life timeline include at least the daily behaviors the user performs on the life timeline and the parameter values representing those behaviors.
- the invention further discloses a robot comprising a system for synchronizing voice and virtual actions as described above.
- compared with the prior art, the present invention has the following advantages: the method for synchronizing voice and virtual actions of the present invention includes acquiring the user's multimodal information; generating interactive content according to the user's multimodal information and the life timeline, where the interactive content includes at least voice information and action information; and adjusting the duration of the voice information and the duration of the action information to be the same.
- in this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions. The interactive content includes at least voice information and action information, and, so that the voice information and the action information can be synchronized, the duration of the voice information and the duration of the action information are adjusted to be the same. The robot can thus match sound and action synchronously when playing them, so that during interaction the robot not only expresses itself through voice but also through actions; the robot's forms of expression are more diverse, the robot is more anthropomorphic, and the user's experience when interacting with the robot is improved.
- FIG. 1 is a flowchart of a method for synchronizing voice and virtual actions according to Embodiment 1 of the present invention;
- FIG. 2 is a schematic diagram of a system for synchronizing voice and virtual actions according to Embodiment 2 of the present invention.
- Computer devices include user devices and network devices.
- the user equipment or the client includes but is not limited to a computer, a smart phone, a PDA, etc.;
- the network device includes but is not limited to a single network server, a server group composed of multiple network servers, or a cloud based on cloud computing and composed of a large number of computers or network servers.
- the computer device can operate alone to carry out the invention, or can access a network and carry out the invention through interoperation with other computer devices in the network.
- the network in which the computer device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN network, and the like.
- terms such as “first” and “second” may be used herein to describe various units, but the units should not be limited by these terms; these terms are used only to distinguish one unit from another.
- the term “and/or” used herein includes any and all combinations of one or more of the associated listed items. When a unit is referred to as being “connected” or “coupled” to another unit, it can be directly connected or coupled to the other unit, or intermediate units may be present.
- a method for synchronizing voice and virtual actions, including:
- S101: acquire multimodal information of a user;
- S102: generate interactive content according to the user's multimodal information and the life timeline 300, where the interactive content includes at least voice information and action information;
- S103: adjust the duration of the voice information and the duration of the action information to be the same.
- the method for synchronizing voice and virtual actions of the present invention comprises: acquiring the user's multimodal information; generating interactive content according to the user's multimodal information and the life timeline, where the interactive content includes at least voice information and action information; and adjusting the duration of the voice information and the duration of the action information to be the same.
- in this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions. The interactive content includes at least voice information and action information, and, so that the voice information and the action information can be synchronized, the two durations are adjusted to be the same. The robot can thus match sound and action synchronously when playing them, so that during interaction the robot not only expresses itself through voice but also through actions; the robot's forms of expression are more diverse, the robot is more anthropomorphic, and the user's experience when interacting with the robot is improved.
- the present invention adds the life timeline on which the robot lives to the robot's generation of interactive content, making the robot more humanized when interacting with people and giving the robot a human lifestyle within the life timeline; this method can enhance the anthropomorphism of the robot's generated interactive content and improve the human-machine interaction experience.
- the interactive content may be one or a combination of several of expressions, text, voice and actions.
- the robot's life timeline 300 is fitted and set in advance; specifically, it is a collection of parameters, and these parameters are transmitted to the system to generate the interactive content.
- the multimodal information in this embodiment may be one or more of user expressions, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, fingerprint information, and the like.
- being based on the life timeline specifically means: according to the timeline of human daily life, the robot's own self-cognition values on the daily-life timeline are fitted in a human-like way, and the robot behaves according to this fitting; that is, the robot's own behavior over a day is obtained, so that the robot carries out its own behavior based on the life timeline, for example generating interactive content and communicating with humans. If the robot stays awake, it acts according to the behaviors on this timeline, and the robot's self-cognition is changed accordingly along this timeline.
- the life timeline together with variable parameters can change attributes of the self-cognition, such as the mood value and the fatigue value, and can also automatically add new self-cognition information; for example, if there was previously no anger value, scenes based on the life timeline and the variable factors will automatically add to the robot's self-cognition according to scenes that previously simulated human self-cognition.
- the life timeline includes not only voice information but also information such as actions.
- for example, if the user says to the robot, “I'm so sleepy,” the robot understands that the user is sleepy and then combines this with the robot's life timeline; if the current time is 9 o'clock in the morning, the robot knows that the owner has just gotten up, so it should greet the owner with a morning greeting, for example replying with the voice “Good morning”; it can also sing a song accompanied by dance movements.
- if instead the current time on the robot's life timeline is 9 o'clock in the evening, the robot understands that the user is sleepy, knows that the owner needs to sleep, and replies with a voice such as “Good night, master, sleep well” together with appropriate good-night and sleeping movements. This approach is closer to human life than a simple voice and expression reply, and the accompanying actions are more anthropomorphic.
- the specific steps of adjusting the duration of the voice information and the duration of the action information to be the same include:
- if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, then when the duration of the voice information is less than the duration of the action information, the playback speed of the action information is accelerated so that the duration of the action information equals the duration of the voice information.
- when the duration of the voice information is greater than the duration of the action information, the playback speed of the voice information is accelerated or/and the playback speed of the action information is slowed down, so that the duration of the action information equals the duration of the voice information.
- here, adjusting may specifically mean compressing or stretching the duration of the voice information or/and the duration of the action information, or speeding up or slowing down playback; for example, multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8, and so on.
- for example, suppose the threshold on the difference between the duration of the voice information and the duration of the action information is one minute, and the duration of the voice information is 1 minute while the duration of the action information is 2 minutes.
- the playback speed of the action information can be accelerated to twice the original speed, so that the adjusted playback time of the action information is 1 minute, synchronizing it with the voice information.
- alternatively, the playback speed of the voice information can be slowed down to 0.5 times the original speed, so that the adjusted voice information is stretched to 2 minutes, synchronizing it with the action information.
- both the voice information and the action information can also be adjusted, for example slowing the voice information down while speeding the action information up so that both are adjusted to 1 minute 30 seconds, which likewise synchronizes voice and action.
- the specific steps of adjusting the duration of the voice information and the duration of the action information to be the same further include:
- if the difference between the duration of the voice information and the duration of the action information is greater than the threshold, then when the duration of the voice information is greater than the duration of the action information, at least two sets of action information are sorted and combined so that the duration of the combined action information equals the duration of the voice information.
- when the duration of the voice information is less than the duration of the action information, part of the actions in the action information are selected so that the duration of the selected actions equals the duration of the voice information.
- here, adjusting means adding or deleting part of the action information so that the duration of the action information matches the duration of the voice information.
- for example, suppose the threshold on the difference between the duration of the voice information and the duration of the action information is 30 seconds, and the duration of the voice information is 3 minutes while the duration of the action information is 1 minute.
- other action information then needs to be added to the original action information; for example, a piece of action information with a duration of 2 minutes is found, and the two sets of action information are sorted and combined so that the result matches the duration of the voice information.
- if no 2-minute action information is found but a 2.5-minute piece is, part of its actions (possibly part of its frames) can be selected so that the duration of the selected actions is 2 minutes, which again matches the duration of the voice information.
- the action information whose duration is closest to the duration of the voice information may be selected according to the duration of the voice information, or the closest voice information may be selected according to the duration of the action information.
- selecting according to the duration of the voice information in this way makes it convenient for the control module to adjust the durations of the voice information and the action information, makes it easier to bring them into agreement, and makes the adjusted playback more natural and smooth.
- the method further includes: outputting the adjusted voice information and action information to a virtual image for presentation.
- in this way, output takes place after the durations have been made consistent, and the output can be presented on the virtual image, making the virtual robot more anthropomorphic and improving the user experience.
- the method for generating the parameters of the robot's life timeline includes: expanding the robot's self-cognition; acquiring the parameters of the life timeline; and fitting the parameters of the robot's self-cognition to the parameters in the life timeline to generate the robot's life timeline.
- in this way the life timeline is added to the robot's own self-cognition, giving the robot an anthropomorphic life; for example, the cognition of eating lunch at noon is added to the robot.
- the step of expanding the robot's self-cognition specifically includes: combining life scenes with the robot's self-cognition to form a self-cognition curve based on the life timeline.
- in this way the life timeline can be concretely added to the robot's own parameters.
- the step of fitting the parameters of the robot's self-cognition to the parameters in the life timeline comprises: using a probability algorithm, calculating the probability of each parameter of the robot on the life timeline changing after the timeline scene parameters change, to form a fitted curve.
- the probability algorithm may be a Bayesian probability algorithm.
- for example, over the 24 hours of a day, the robot is made to have actions such as sleeping, exercising, eating, dancing, reading and putting on makeup. Each action affects the robot's own self-cognition, and the parameters on the life timeline are combined with the robot's own self-cognition.
- after fitting, the robot's self-cognition includes mood, fatigue value, intimacy, favorability, number of interactions, the robot's three-dimensional cognition, age, height, weight, intimacy, game scene values, game object values, place scene values, place object values and so on, so that the robot can itself identify the scene of the place it is in, such as a cafe or a bedroom.
- within the timeline of a day the machine performs different actions, such as sleeping at night, eating at noon and exercising during the day; all of these scenes on the life timeline affect its self-cognition. The changes in these values are fitted dynamically with a probability model, which fits the probability that each of these actions occurs on the timeline.
- scene recognition: this kind of place-scene recognition changes the geographic scene values in the self-cognition.
- a system for synchronizing voice and virtual actions, including:
- the acquisition module 201, configured to acquire multimodal information of a user;
- the artificial intelligence module 202, configured to generate interactive content according to the user's multimodal information and the life timeline, where the interactive content includes at least voice information and action information, and the life timeline is generated by the life timeline module 301;
- the control module 203, configured to adjust the duration of the voice information and the duration of the action information to be the same.
- in this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions. The interactive content includes at least voice information and action information, and, so that the voice information and the action information can be synchronized, the two durations are adjusted to be the same. The robot can thus match sound and action synchronously when playing them, so that during interaction the robot not only expresses itself through voice but also through actions; the robot's forms of expression are more diverse, the robot is more anthropomorphic, and the user's experience when interacting with the robot is improved.
- the present invention adds the life timeline on which the robot lives to the robot's generation of interactive content, making the robot more humanized when interacting with people and giving the robot a human lifestyle within the life timeline; this method can enhance the anthropomorphism of the robot's generated interactive content and improve the human-machine interaction experience.
- the interactive content may be one or a combination of several of expressions, text, voice and actions.
- the robot's life timeline 300 is fitted and set in advance; specifically, it is a collection of parameters, and these parameters are transmitted to the system to generate the interactive content.
- the multimodal information in this embodiment may be one or more of user expressions, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, fingerprint information, and the like.
- being based on the life timeline specifically means: according to the timeline of human daily life, the robot's own self-cognition values on the daily-life timeline are fitted in a human-like way, and the robot behaves according to this fitting; that is, the robot's own behavior over a day is obtained, so that the robot carries out its own behavior based on the life timeline, for example generating interactive content and communicating with humans. If the robot stays awake, it acts according to the behaviors on this timeline, and the robot's self-cognition is changed accordingly along this timeline.
- the life timeline together with variable parameters can change attributes of the self-cognition, such as the mood value and the fatigue value, and can also automatically add new self-cognition information; for example, if there was previously no anger value, scenes based on the life timeline and the variable factors will automatically add to the robot's self-cognition according to scenes that previously simulated human self-cognition.
- the life timeline includes not only voice information but also information such as actions.
- for example, if the user says to the robot, “I'm so sleepy,” the robot understands that the user is sleepy and then combines this with the robot's life timeline; if the current time is 9 o'clock in the morning, the robot knows that the owner has just gotten up, so it should greet the owner with a morning greeting, for example replying with the voice “Good morning”; it can also sing a song accompanied by dance movements.
- if instead the current time on the robot's life timeline is 9 o'clock in the evening, the robot understands that the user is sleepy, knows that the owner needs to sleep, and replies with a voice such as “Good night, master, sleep well” together with appropriate good-night and sleeping movements. This approach is closer to human life than a simple voice and expression reply, and the accompanying actions are more anthropomorphic.
- the control module is specifically configured to:
- if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, then when the duration of the voice information is less than the duration of the action information, accelerate the playback speed of the action information so that the duration of the action information equals the duration of the voice information.
- when the duration of the voice information is greater than the duration of the action information, accelerate the playback speed of the voice information or/and slow down the playback speed of the action information, so that the duration of the action information equals the duration of the voice information.
- here, adjusting may specifically mean compressing or stretching the duration of the voice information or/and the duration of the action information, or speeding up or slowing down playback; for example, multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8, and so on.
- for example, suppose the threshold on the difference between the duration of the voice information and the duration of the action information is one minute, and the duration of the voice information is 1 minute while the duration of the action information is 2 minutes.
- the playback speed of the action information can be accelerated to twice the original speed, so that the adjusted playback time of the action information is 1 minute, synchronizing it with the voice information.
- alternatively, the playback speed of the voice information can be slowed down to 0.5 times the original speed, so that the adjusted voice information is stretched to 2 minutes, synchronizing it with the action information.
- both the voice information and the action information can also be adjusted, for example slowing the voice information down while speeding the action information up so that both are adjusted to 1 minute 30 seconds, which likewise synchronizes voice and action.
- the control module is further specifically configured to:
- if the difference between the duration of the voice information and the duration of the action information is greater than the threshold, then when the duration of the voice information is greater than the duration of the action information, combine at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
- when the duration of the voice information is less than the duration of the action information, select part of the actions in the action information so that the duration of the selected actions equals the duration of the voice information.
- here, adjusting means adding or deleting part of the action information so that the duration of the action information matches the duration of the voice information.
- for example, suppose the threshold on the difference between the duration of the voice information and the duration of the action information is 30 seconds, and the duration of the voice information is 3 minutes while the duration of the action information is 1 minute.
- other action information then needs to be added to the original action information; for example, a piece of action information with a duration of 2 minutes is found, and the two sets of action information are sorted and combined so that the result matches the duration of the voice information.
- if no 2-minute action information is found but a 2.5-minute piece is, part of its actions (possibly part of its frames) can be selected so that the duration of the selected actions is 2 minutes, which again matches the duration of the voice information.
- the artificial intelligence module may be specifically configured to: according to the duration of the voice information, select the action information whose duration is closest to the duration of the voice information; the closest voice information may also be selected according to the duration of the action information.
- selecting according to the duration of the voice information in this way makes it convenient for the control module to adjust the durations of the voice information and the action information, makes it easier to bring them into agreement, and makes the adjusted playback more natural and smooth.
- the system further includes an output module 204 for outputting the adjusted voice information and action information to a virtual image for presentation.
- in this way, output takes place after the durations have been made consistent, and the output can be presented on the virtual image, making the virtual robot more anthropomorphic and improving the user experience.
- the system includes a timeline-based artificial intelligence cloud processing module configured to: expand the robot's self-cognition; acquire the parameters of the life timeline; and fit the parameters of the robot's self-cognition to the parameters in the life timeline to generate the robot's life timeline.
- in this way the life timeline is added to the robot's own self-cognition, giving the robot an anthropomorphic life; for example, the cognition of eating lunch at noon is added to the robot.
- the timeline-based artificial intelligence cloud processing module is specifically configured to combine life scenes with the robot's self-cognition to form a self-cognition curve based on the life timeline.
- in this way the life timeline can be concretely added to the robot's own parameters.
- the timeline-based artificial intelligence cloud processing module is specifically configured to: use a probability algorithm to calculate the probability of each parameter of the robot on the life timeline changing after the timeline scene parameters change, to form a fitted curve.
- the probability algorithm may be a Bayesian probability algorithm.
- for example, over the 24 hours of a day, the robot is made to have actions such as sleeping, exercising, eating, dancing, reading and putting on makeup. Each action affects the robot's own self-cognition, and the parameters on the life timeline are combined with the robot's own self-cognition.
- after fitting, the robot's self-cognition includes mood, fatigue value, intimacy, favorability, number of interactions, the robot's three-dimensional cognition, age, height, weight, intimacy, game scene values, game object values, place scene values, place object values and so on, so that the robot can itself identify the scene of the place it is in, such as a cafe or a bedroom.
- within the timeline of a day the machine performs different actions, such as sleeping at night, eating at noon and exercising during the day; all of these scenes on the life timeline affect its self-cognition. The changes in these values are fitted dynamically with a probability model, which fits the probability that each of these actions occurs on the timeline.
- scene recognition: this kind of place-scene recognition changes the geographic scene values in the self-cognition.
- the invention discloses a robot comprising a system for synchronizing voice and virtual actions as described above.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- General Physics & Mathematics (AREA)
- Manipulator (AREA)
- Toys (AREA)
Abstract
A method for synchronizing voice and virtual actions, comprising: acquiring multimodal information of a user (S101); generating interactive content according to the user's multimodal information and a life timeline (300), the interactive content comprising at least voice information and action information (S102); and adjusting the duration of the voice information and the duration of the action information to be the same (S103). Also disclosed is a system for synchronizing voice and virtual actions, which has an acquisition module (201), an artificial intelligence module (202), a control module (203) and an output module (204). In this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions. The interactive content comprises at least voice information and action information, and, so that the voice information and the action information can be synchronized, the duration of the voice information and the duration of the action information are adjusted to be the same. The robot can thus match sound and action synchronously when playing them, which makes the robot more anthropomorphic and improves the user's experience when interacting with the robot.
Description
The present invention relates to the field of robot interaction technologies, and in particular to a method, system and robot for synchronizing voice and virtual actions.
As tools for interacting with humans, robots are used in more and more settings; for example, elderly people and children who are somewhat lonely can interact with robots, including through dialogue and entertainment. To make robots more anthropomorphic when interacting with humans, the inventor developed a display device and imaging system for a virtual robot that can form a 3D animated image. The virtual robot's host accepts human instructions, such as voice, to interact with humans, and the virtual 3D animated image then replies with sound and actions according to the host's instructions, so that the robot is more anthropomorphic and can interact with humans not only through sound and expressions but also through actions, greatly improving the interactive experience.
However, how a virtual robot synchronizes the voice in its reply content with its virtual actions is a rather complex problem; if voice and actions do not match, the user's interactive experience is greatly impaired.
Therefore, how to provide a method, system and robot for synchronizing voice and virtual actions that improve the human-machine interaction experience has become a technical problem urgently needing a solution.
Summary of the invention
The objective of the present invention is to provide a method, system and robot for synchronizing voice and virtual actions that improve the human-machine interaction experience.
The objective of the present invention is achieved through the following technical solutions:
A method for synchronizing voice and virtual actions, comprising:
acquiring multimodal information of a user;
generating interactive content according to the user's multimodal information and a life timeline, the interactive content comprising at least voice information and action information;
adjusting the duration of the voice information and the duration of the action information to be the same.
Preferably, the specific steps of adjusting the duration of the voice information and the duration of the action information to be the same comprise:
if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, then when the duration of the voice information is less than the duration of the action information, accelerating the playback speed of the action information so that the duration of the action information equals the duration of the voice information.
Preferably, when the duration of the voice information is greater than the duration of the action information, the playback speed of the voice information is accelerated or/and the playback speed of the action information is slowed down, so that the duration of the action information equals the duration of the voice information.
Preferably, the specific steps of adjusting the duration of the voice information and the duration of the action information to be the same comprise:
if the difference between the duration of the voice information and the duration of the action information is greater than the threshold, then when the duration of the voice information is greater than the duration of the action information, sorting and combining at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
Preferably, when the duration of the voice information is less than the duration of the action information, part of the actions in the action information are selected so that the duration of the selected actions equals the duration of the voice information.
Preferably, the method for generating the parameters of the robot's life timeline comprises:
expanding the robot's self-cognition;
acquiring the parameters of the life timeline;
fitting the parameters of the robot's self-cognition to the parameters in the life timeline to generate the robot's life timeline.
Preferably, the step of expanding the robot's self-cognition specifically comprises: combining life scenes with the robot's self-cognition to form a self-cognition curve based on the life timeline.
Preferably, the step of fitting the parameters of the robot's self-cognition to the parameters in the life timeline specifically comprises: using a probability algorithm, calculating the probability of each parameter of the robot on the life timeline changing after the timeline scene parameters change, to form a fitted curve.
Preferably, the life timeline refers to a timeline covering the 24 hours of a day, and the parameters in the life timeline comprise at least the daily behaviors the user performs on the life timeline and the parameter values representing those behaviors.
A system for synchronizing voice and virtual actions, comprising:
an acquisition module, configured to acquire multimodal information of a user;
an artificial intelligence module, configured to generate interactive content according to the user's multimodal information and a life timeline, the interactive content comprising at least voice information and action information;
a control module, configured to adjust the duration of the voice information and the duration of the action information to be the same.
Preferably, the control module is specifically configured to:
if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, then when the duration of the voice information is less than the duration of the action information, accelerate the playback speed of the action information so that the duration of the action information equals the duration of the voice information.
Preferably, when the duration of the voice information is greater than the duration of the action information, the playback speed of the voice information is accelerated or/and the playback speed of the action information is slowed down, so that the duration of the action information equals the duration of the voice information.
Preferably, the control module is specifically configured to:
if the difference between the duration of the voice information and the duration of the action information is greater than the threshold, then when the duration of the voice information is greater than the duration of the action information, combine at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
Preferably, when the duration of the voice information is less than the duration of the action information, part of the actions in the action information are selected so that the duration of the selected actions equals the duration of the voice information.
Preferably, the system comprises a processing module configured to:
expand the robot's self-cognition;
acquire the parameters of the life timeline;
fit the parameters of the robot's self-cognition to the parameters in the life timeline to generate the robot's life timeline.
Preferably, the processing module is specifically configured to: combine life scenes with the robot's self-cognition to form a self-cognition curve based on the life timeline.
Preferably, the processing module is specifically configured to: use a probability algorithm to calculate the probability of each parameter of the robot on the life timeline changing after the timeline scene parameters change, to form a fitted curve.
Preferably, the life timeline refers to a timeline covering the 24 hours of a day, and the parameters in the life timeline comprise at least the daily behaviors the user performs on the life timeline and the parameter values representing those behaviors.
The present invention discloses a robot comprising a system for synchronizing voice and virtual actions as described in any of the above.
Compared with the prior art, the present invention has the following advantages: the method for synchronizing voice and virtual actions of the present invention comprises acquiring the user's multimodal information; generating interactive content according to the user's multimodal information and the life timeline, the interactive content comprising at least voice information and action information; and adjusting the duration of the voice information and the duration of the action information to be the same. In this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions. The interactive content comprises at least voice information and action information, and, so that the voice information and the action information can be synchronized, the duration of the voice information and the duration of the action information are adjusted to be the same. The robot can thus match sound and action synchronously when playing them, so that during interaction the robot not only expresses itself through voice but can also take on diverse forms of expression such as actions; the robot's forms of expression are more varied, the robot is more anthropomorphic, and the user's experience when interacting with the robot is improved.
FIG. 1 is a flowchart of a method for synchronizing voice and virtual actions according to Embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a system for synchronizing voice and virtual actions according to Embodiment 2 of the present invention.
Although the flowchart describes the operations as sequential processing, many of the operations can be performed in parallel, concurrently or simultaneously. The order of the operations can be rearranged. Processing can be terminated when its operations are completed, but there can also be additional steps not included in the drawings. Processing can correspond to methods, functions, procedures, subroutines, subprograms, and so on.
Computer devices include user devices and network devices. User devices or clients include but are not limited to computers, smartphones, PDAs, etc.; network devices include but are not limited to a single network server, a server group composed of multiple network servers, or a cloud based on cloud computing and composed of a large number of computers or network servers. A computer device can operate alone to carry out the present invention, or can access a network and carry out the present invention through interoperation with other computer devices in the network. The network in which the computer device is located includes but is not limited to the Internet, wide area networks, metropolitan area networks, local area networks, VPN networks, etc.
Terms such as “first” and “second” may be used herein to describe various units, but the units should not be limited by these terms; these terms are used only to distinguish one unit from another. The term “and/or” as used herein includes any and all combinations of one or more of the associated listed items. When a unit is referred to as being “connected” or “coupled” to another unit, it can be directly connected or coupled to the other unit, or intermediate units may be present.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments. Unless the context clearly indicates otherwise, the singular forms “a” and “an” as used herein are also intended to include the plural. It should also be understood that the terms “comprising” and/or “including” as used herein specify the presence of the stated features, integers, steps, operations, units and/or components, without excluding the presence or addition of one or more other features, integers, steps, operations, units, components and/or combinations thereof.
The present invention is further described below with reference to the drawings and preferred embodiments.
Embodiment 1
As shown in FIG. 1, this embodiment discloses a method for synchronizing voice and virtual actions, comprising:
S101: acquiring multimodal information of a user;
S102: generating interactive content according to the user's multimodal information and a life timeline 300, the interactive content comprising at least voice information and action information;
S103: adjusting the duration of the voice information and the duration of the action information to be the same.
The method for synchronizing voice and virtual actions of the present invention comprises: acquiring the user's multimodal information; generating interactive content according to the user's multimodal information and the life timeline, the interactive content comprising at least voice information and action information; and adjusting the duration of the voice information and the duration of the action information to be the same. In this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions. The interactive content comprises at least voice information and action information, and, so that the voice information and the action information can be synchronized, the two durations are adjusted to be the same. The robot can thus match sound and action synchronously when playing them, so that during interaction the robot not only expresses itself through voice but also through actions and other diverse forms; the robot's forms of expression are more varied, the robot is more anthropomorphic, and the user's experience when interacting with the robot is improved.
For a person, daily life has a certain regularity. To make the robot more anthropomorphic when communicating with people, over the 24 hours of a day the robot is also made to sleep, exercise, eat, dance, read, put on makeup and so on. The present invention therefore adds the life timeline on which the robot lives to the robot's generation of interactive content, making the robot more humanized when interacting with people and giving the robot a human lifestyle within the life timeline. This method can enhance the anthropomorphism of the robot's generated interactive content, improve the human-machine interaction experience, and increase intelligence. The interactive content may be one or a combination of several of expressions, text, voice and actions. The robot's life timeline 300 is fitted and set in advance; specifically, it is a collection of parameters, and these parameters are transmitted to the system to generate the interactive content.
The multimodal information in this embodiment may be one or more of user expressions, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, fingerprint information, and the like.
In this embodiment, being based on the life timeline specifically means: according to the timeline of human daily life, the robot's own self-cognition values on the daily-life timeline are fitted in a human-like way, and the robot behaves according to this fitting; that is, the robot's own behavior over a day is obtained, so that the robot carries out its own behavior based on the life timeline, for example generating interactive content and communicating with humans. If the robot stays awake, it acts according to the behaviors on this timeline, and the robot's self-cognition is changed accordingly along this timeline. The life timeline together with variable parameters can change attributes of the self-cognition, such as the mood value and the fatigue value, and can also automatically add new self-cognition information; for example, if there was previously no anger value, scenes based on the life timeline and the variable factors will automatically add to the robot's self-cognition according to scenes that previously simulated human self-cognition. The life timeline includes not only voice information but also information such as actions.
For example, if the user says to the robot, “I'm so sleepy,” the robot understands that the user is sleepy and then combines this with the robot's life timeline; if the current time is 9 o'clock in the morning, the robot knows that the owner has just gotten up, so it should greet the owner with a morning greeting, for example replying with the voice “Good morning”; it can also sing a song accompanied by corresponding dance movements. If instead the user says “I'm so sleepy” and the current time on the robot's life timeline is 9 o'clock in the evening, the robot knows that the owner needs to sleep, so it replies with a voice such as “Good night, master, sleep well” together with corresponding good-night and sleeping movements. This approach is closer to human life than a simple voice and expression reply, and the accompanying actions are more anthropomorphic.
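To make the timeline-conditioned example above concrete, here is a minimal illustrative sketch (the names LIFE_TIMELINE and reply_for are ours, not the patent's) of how a recognized intent plus the current hour on the life timeline could select a voice reply and a matching action:

```python
from datetime import datetime

# Hypothetical life-timeline entries: an hour range, the owner's assumed
# activity in that range, and a (voice reply, action) pair that fits it.
LIFE_TIMELINE = [
    (range(6, 12), "just_got_up", ("Good morning", "dance")),
    (range(20, 24), "going_to_sleep", ("Good night, master, sleep well", "sleep_gesture")),
]

def reply_for(intent: str, now: datetime) -> tuple[str, str]:
    """Pick a voice reply and an accompanying action for a 'sleepy' intent,
    conditioned on the current hour of the life timeline."""
    if intent == "sleepy":
        for hours, _activity, reply in LIFE_TIMELINE:
            if now.hour in hours:
                return reply
    return "OK", "idle"  # fallback when no timeline entry matches

print(reply_for("sleepy", datetime(2016, 7, 7, 9)))   # ('Good morning', 'dance')
print(reply_for("sleepy", datetime(2016, 7, 7, 21)))  # ('Good night, master, sleep well', 'sleep_gesture')
```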
In this embodiment, the specific steps of adjusting the duration of the voice information and the duration of the action information to be the same comprise:
if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, then when the duration of the voice information is less than the duration of the action information, the playback speed of the action information is accelerated so that the duration of the action information equals the duration of the voice information.
When the duration of the voice information is greater than the duration of the action information, the playback speed of the voice information is accelerated or/and the playback speed of the action information is slowed down, so that the duration of the action information equals the duration of the voice information.
Thus, when the difference between the duration of the voice information and the duration of the action information is not greater than the threshold, adjusting may specifically mean compressing or stretching the duration of the voice information or/and the duration of the action information, or speeding up or slowing down playback, for example multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8, and so on.
For example, suppose the threshold on the difference between the duration of the voice information and the duration of the action information is one minute, and in the interactive content the robot generates from the user's multimodal information, the duration of the voice information is 1 minute while the duration of the action information is 2 minutes. The playback speed of the action information can then be doubled, so that the adjusted playback time of the action information is 1 minute, synchronizing it with the voice information. Of course, the playback speed of the voice information can instead be slowed down to 0.5 times the original speed, so that the adjusted voice information is stretched to 2 minutes, synchronizing it with the action information. It is also possible to adjust both the voice information and the action information, for example slowing the voice information down while speeding the action information up so that both are adjusted to 1 minute 30 seconds, which likewise synchronizes voice and action.
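As a hedged illustration of the rate-based alignment just described, the sketch below computes playback-rate multipliers for the voice and action streams given a common target duration; the function and its default behavior are our assumptions rather than the patent's specified implementation:

```python
def align_by_rate(voice_len: float, action_len: float,
                  target: float | None = None) -> tuple[float, float]:
    """Return (voice_rate, action_rate) playback multipliers so that both
    streams play for the same wall-clock time.

    A rate r means: played duration = original duration / r, so r > 1
    plays faster (shorter) and r < 1 plays slower (longer).
    """
    if target is None:
        target = voice_len  # default: leave the voice as-is and fit the action to it
    return voice_len / target, action_len / target

# The patent's example: a 1-minute voice and a 2-minute action.
print(align_by_rate(60, 120))       # (1.0, 2.0) -> play the action at double speed
print(align_by_rate(60, 120, 120))  # (0.5, 1.0) -> slow the voice to half speed
print(align_by_rate(60, 120, 90))   # adjust both, meeting at 1 min 30 s
```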
Furthermore, in this embodiment, the specific steps of adjusting the duration of the voice information and the duration of the action information to be the same comprise:
if the difference between the duration of the voice information and the duration of the action information is greater than the threshold, then when the duration of the voice information is greater than the duration of the action information, at least two sets of action information are sorted and combined so that the duration of the combined action information equals the duration of the voice information.
When the duration of the voice information is less than the duration of the action information, part of the actions in the action information are selected so that the duration of the selected actions equals the duration of the voice information.
Thus, when the difference between the duration of the voice information and the duration of the action information is greater than the threshold, adjusting means adding or deleting part of the action information so that the duration of the action information matches the duration of the voice information.
For example, suppose the threshold on the difference between the duration of the voice information and the duration of the action information is 30 seconds, and in the interactive content the robot generates from the user's multimodal information, the duration of the voice information is 3 minutes while the duration of the action information is 1 minute. Other action information then needs to be added to the original action information; for example, a piece of action information with a duration of 2 minutes is found, and the two sets of action information are sorted and combined so that the result matches the duration of the voice information. Of course, if no action information with a duration of 2 minutes is found but a piece with a duration of 2.5 minutes is, part of the actions in that piece (possibly part of its frames) can be selected so that the duration of the selected actions is 2 minutes, which again matches the duration of the voice information.
In this embodiment, the action information whose duration is closest to the duration of the voice information may be selected according to the duration of the voice information, or the closest voice information may be selected according to the duration of the action information.
Selecting according to the duration of the voice information in this way makes it convenient for the control module to adjust the durations of the voice information and the action information, makes it easier to bring them into agreement, and makes the adjusted playback more natural and smooth.
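A rough sketch of the combine-and-trim strategy for the case where the difference exceeds the threshold is given below; the greedy closest-clip choice mirrors the closest-duration selection mentioned above, but the concrete procedure and names are our assumptions:

```python
def fit_actions_to_voice(voice_len: float, clip_lens: list[float],
                         fps: int = 30) -> list[tuple[int, float]]:
    """Greedy sketch: repeatedly pick the clip whose duration is closest to
    the remaining gap, trimming the final pick (dropping frames) so the total
    lands exactly on the voice duration. Returns (clip index, seconds used).
    """
    remaining = voice_len
    chosen: list[tuple[int, float]] = []
    available = set(range(len(clip_lens)))
    while remaining > 1.0 / fps and available:
        best = min(available, key=lambda i: abs(clip_lens[i] - remaining))
        use = min(clip_lens[best], remaining)  # trim = keep only part of the frames
        use = round(use * fps) / fps           # snap to a whole number of frames
        chosen.append((best, use))
        available.remove(best)
        remaining -= use
    return chosen

# The patent's example: a 3-minute voice, with a 1-minute clip and a
# 2.5-minute clip available in the action library.
print(fit_actions_to_voice(180.0, [60.0, 150.0]))  # [(1, 150.0), (0, 30.0)]
```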
According to one example, after the step of adjusting the duration of the voice information and the duration of the action information to be the same, the method further comprises: outputting the adjusted voice information and action information to a virtual image for presentation.
In this way, output takes place after the durations have been made consistent, and the output can be presented on the virtual image, making the virtual robot more anthropomorphic and improving the user experience.
According to one example, the method for generating the parameters of the robot's life timeline comprises:
expanding the robot's self-cognition;
acquiring the parameters of the life timeline;
fitting the parameters of the robot's self-cognition to the parameters in the life timeline to generate the robot's life timeline.
In this way the life timeline is added to the robot's own self-cognition, giving the robot an anthropomorphic life; for example, the cognition of eating lunch at noon is added to the robot.
According to another example, the step of expanding the robot's self-cognition specifically comprises: combining life scenes with the robot's self-cognition to form a self-cognition curve based on the life timeline.
In this way the life timeline can be concretely added to the robot's own parameters.
According to another example, the step of fitting the parameters of the robot's self-cognition to the parameters in the life timeline specifically comprises: using a probability algorithm, calculating the probability of each parameter of the robot on the life timeline changing after the timeline scene parameters change, to form a fitted curve. In this way the parameters of the robot's self-cognition can be concretely fitted to the parameters in the life timeline. The probability algorithm may be a Bayesian probability algorithm.
For example, over the 24 hours of a day, the robot is made to have actions such as sleeping, exercising, eating, dancing, reading and putting on makeup. Each action affects the robot's own self-cognition, and the parameters on the life timeline are combined with the robot's own self-cognition so that, after fitting, the robot's self-cognition includes mood, fatigue value, intimacy, favorability, number of interactions, the robot's three-dimensional cognition, age, height, weight, intimacy, game scene values, game object values, place scene values, place object values and so on. The robot can thus itself identify the scene of the place it is in, such as a cafe or a bedroom.
Within the timeline of a day the machine performs different actions, such as sleeping at night, eating at noon and exercising during the day; all of these scenes on the life timeline affect its self-cognition. The changes in these values are fitted dynamically with a probability model, which fits the probability that each of these actions occurs on the timeline. Scene recognition: this kind of place-scene recognition changes the geographic scene values in the self-cognition.
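The patent names a probability algorithm (possibly Bayesian) without spelling out the computation. As one plausible stand-in, the sketch below fits, for each behavior, a 24-point curve of the probability that the behavior occurs at each hour of the life timeline, with Laplace smoothing playing the role of the prior; all names here are illustrative assumptions:

```python
from collections import Counter

def fit_behavior_curves(observations: list[tuple[int, str]],
                        behaviors: list[str], alpha: float = 1.0):
    """Fit, for every behavior, a 24-point curve of P(behavior | hour).

    observations are (hour, behavior) pairs, e.g. logged timeline scenes;
    alpha is a Laplace/Dirichlet smoothing prior.
    """
    counts = {h: Counter() for h in range(24)}
    for hour, behavior in observations:
        counts[hour][behavior] += 1
    curves = {b: [] for b in behaviors}
    for h in range(24):
        total = sum(counts[h].values()) + alpha * len(behaviors)
        for b in behaviors:
            curves[b].append((counts[h][b] + alpha) / total)
    return curves

obs = [(23, "sleep"), (23, "sleep"), (12, "eat"), (12, "eat"), (9, "exercise")]
curves = fit_behavior_curves(obs, ["sleep", "eat", "exercise"])
print(curves["sleep"][23])  # 0.6 -> sleeping is likely at 23:00
```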
Embodiment 2
As shown in FIG. 2, this embodiment discloses a system for synchronizing voice and virtual actions, comprising:
an acquisition module 201, configured to acquire multimodal information of a user;
an artificial intelligence module 202, configured to generate interactive content according to the user's multimodal information and a life timeline, the interactive content comprising at least voice information and action information, wherein the life timeline is generated by a life timeline module 301;
a control module 203, configured to adjust the duration of the voice information and the duration of the action information to be the same.
In this way, interactive content can be generated from one or more kinds of the user's multimodal information, such as the user's voice, expressions and actions. The interactive content comprises at least voice information and action information, and, so that the voice information and the action information can be synchronized, the duration of the voice information and the duration of the action information are adjusted to be the same. The robot can thus match sound and action synchronously when playing them, so that during interaction the robot not only expresses itself through voice but also through actions and other diverse forms; the robot's forms of expression are more varied, the robot is more anthropomorphic, and the user's experience when interacting with the robot is improved.
For a person, daily life has a certain regularity. To make the robot more anthropomorphic when communicating with people, over the 24 hours of a day the robot is also made to sleep, exercise, eat, dance, read, put on makeup and so on. The present invention therefore adds the life timeline on which the robot lives to the robot's generation of interactive content, making the robot more humanized when interacting with people and giving the robot a human lifestyle within the life timeline. This method can enhance the anthropomorphism of the robot's generated interactive content, improve the human-machine interaction experience, and increase intelligence. The interactive content may be one or a combination of several of expressions, text, voice and actions. The robot's life timeline 300 is fitted and set in advance; specifically, it is a collection of parameters, and these parameters are transmitted to the system to generate the interactive content.
The multimodal information in this embodiment may be one or more of user expressions, voice information, gesture information, scene information, image information, video information, face information, pupil and iris information, light-sensing information, fingerprint information, and the like.
In this embodiment, being based on the life timeline specifically means: according to the timeline of human daily life, the robot's own self-cognition values on the daily-life timeline are fitted in a human-like way, and the robot behaves according to this fitting; that is, the robot's own behavior over a day is obtained, so that the robot carries out its own behavior based on the life timeline, for example generating interactive content and communicating with humans. If the robot stays awake, it acts according to the behaviors on this timeline, and the robot's self-cognition is changed accordingly along this timeline. The life timeline together with variable parameters can change attributes of the self-cognition, such as the mood value and the fatigue value, and can also automatically add new self-cognition information; for example, if there was previously no anger value, scenes based on the life timeline and the variable factors will automatically add to the robot's self-cognition according to scenes that previously simulated human self-cognition. The life timeline includes not only voice information but also information such as actions.
For example, if the user says to the robot, “I'm so sleepy,” the robot understands that the user is sleepy and then combines this with the robot's life timeline; if the current time is 9 o'clock in the morning, the robot knows that the owner has just gotten up, so it should greet the owner with a morning greeting, for example replying with the voice “Good morning”; it can also sing a song accompanied by corresponding dance movements. If instead the user says “I'm so sleepy” and the current time on the robot's life timeline is 9 o'clock in the evening, the robot knows that the owner needs to sleep, so it replies with a voice such as “Good night, master, sleep well” together with corresponding good-night and sleeping movements. This approach is closer to human life than a simple voice and expression reply, and the accompanying actions are more anthropomorphic.
In this embodiment, the control module is specifically configured to:
if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, then when the duration of the voice information is less than the duration of the action information, accelerate the playback speed of the action information so that the duration of the action information equals the duration of the voice information.
When the duration of the voice information is greater than the duration of the action information, the playback speed of the voice information is accelerated or/and the playback speed of the action information is slowed down, so that the duration of the action information equals the duration of the voice information.
Thus, when the difference between the duration of the voice information and the duration of the action information is not greater than the threshold, adjusting may specifically mean compressing or stretching the duration of the voice information or/and the duration of the action information, or speeding up or slowing down playback, for example multiplying the playback speed of the voice information by 2, or multiplying the playback time of the action information by 0.8, and so on.
For example, suppose the threshold on the difference between the duration of the voice information and the duration of the action information is one minute, and in the interactive content the robot generates from the user's multimodal information, the duration of the voice information is 1 minute while the duration of the action information is 2 minutes. The playback speed of the action information can then be doubled, so that the adjusted playback time of the action information is 1 minute, synchronizing it with the voice information. Of course, the playback speed of the voice information can instead be slowed down to 0.5 times the original speed, so that the adjusted voice information is stretched to 2 minutes, synchronizing it with the action information. It is also possible to adjust both the voice information and the action information, for example slowing the voice information down while speeding the action information up so that both are adjusted to 1 minute 30 seconds, which likewise synchronizes voice and action.
Furthermore, in this embodiment, the control module is specifically configured to:
if the difference between the duration of the voice information and the duration of the action information is greater than the threshold, then when the duration of the voice information is greater than the duration of the action information, combine at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
When the duration of the voice information is less than the duration of the action information, part of the actions in the action information are selected so that the duration of the selected actions equals the duration of the voice information.
Thus, when the difference between the duration of the voice information and the duration of the action information is greater than the threshold, adjusting means adding or deleting part of the action information so that the duration of the action information matches the duration of the voice information.
For example, suppose the threshold on the difference between the duration of the voice information and the duration of the action information is 30 seconds, and in the interactive content the robot generates from the user's multimodal information, the duration of the voice information is 3 minutes while the duration of the action information is 1 minute. Other action information then needs to be added to the original action information; for example, a piece of action information with a duration of 2 minutes is found, and the two sets of action information are sorted and combined so that the result matches the duration of the voice information. Of course, if no action information with a duration of 2 minutes is found but a piece with a duration of 2.5 minutes is, part of the actions in that piece (possibly part of its frames) can be selected so that the duration of the selected actions is 2 minutes, which again matches the duration of the voice information.
In this embodiment, the artificial intelligence module may be specifically configured to: according to the duration of the voice information, select the action information whose duration is closest to the duration of the voice information; the closest voice information may also be selected according to the duration of the action information.
Selecting according to the duration of the voice information in this way makes it convenient for the control module to adjust the durations of the voice information and the action information, makes it easier to bring them into agreement, and makes the adjusted playback more natural and smooth.
According to one example, the system further comprises an output module 204, configured to output the adjusted voice information and action information to a virtual image for presentation.
In this way, output takes place after the durations have been made consistent, and the output can be presented on the virtual image, making the virtual robot more anthropomorphic and improving the user experience.
According to one example, the system comprises a timeline-based artificial intelligence cloud processing module, configured to:
expand the robot's self-cognition;
acquire the parameters of the life timeline;
fit the parameters of the robot's self-cognition to the parameters in the life timeline to generate the robot's life timeline.
In this way the life timeline is added to the robot's own self-cognition, giving the robot an anthropomorphic life; for example, the cognition of eating lunch at noon is added to the robot.
According to another example, the timeline-based artificial intelligence cloud processing module is specifically configured to: combine life scenes with the robot's self-cognition to form a self-cognition curve based on the life timeline. In this way the life timeline can be concretely added to the robot's own parameters.
According to another example, the timeline-based artificial intelligence cloud processing module is specifically configured to: use a probability algorithm to calculate the probability of each parameter of the robot on the life timeline changing after the timeline scene parameters change, to form a fitted curve. In this way the parameters of the robot's self-cognition can be concretely fitted to the parameters in the life timeline. The probability algorithm may be a Bayesian probability algorithm.
For example, over the 24 hours of a day, the robot is made to have actions such as sleeping, exercising, eating, dancing, reading and putting on makeup. Each action affects the robot's own self-cognition, and the parameters on the life timeline are combined with the robot's own self-cognition so that, after fitting, the robot's self-cognition includes mood, fatigue value, intimacy, favorability, number of interactions, the robot's three-dimensional cognition, age, height, weight, intimacy, game scene values, game object values, place scene values, place object values and so on. The robot can thus itself identify the scene of the place it is in, such as a cafe or a bedroom.
Within the timeline of a day the machine performs different actions, such as sleeping at night, eating at noon and exercising during the day; all of these scenes on the life timeline affect its self-cognition. The changes in these values are fitted dynamically with a probability model, which fits the probability that each of these actions occurs on the timeline. Scene recognition: this kind of place-scene recognition changes the geographic scene values in the self-cognition.
The present invention discloses a robot comprising a system for synchronizing voice and virtual actions as described in any of the above.
The foregoing further describes the present invention in detail with reference to specific preferred embodiments, but the specific implementation of the present invention should not be considered limited to these descriptions. A person of ordinary skill in the technical field to which the present invention belongs can make several simple deductions or substitutions without departing from the concept of the present invention, and all of these should be regarded as falling within the scope of protection of the present invention.
Claims (19)
- A method for synchronizing voice and virtual actions, characterized by comprising: acquiring multimodal information of a user; generating interactive content according to the user's multimodal information and a life timeline, the interactive content comprising at least voice information and action information; and adjusting the duration of the voice information and the duration of the action information to be the same.
- The method according to claim 1, characterized in that the specific steps of adjusting the duration of the voice information and the duration of the action information to be the same comprise: if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, then when the duration of the voice information is less than the duration of the action information, accelerating the playback speed of the action information so that the duration of the action information equals the duration of the voice information.
- The method according to claim 2, characterized in that when the duration of the voice information is greater than the duration of the action information, the playback speed of the voice information is accelerated or/and the playback speed of the action information is slowed down, so that the duration of the action information equals the duration of the voice information.
- The method according to claim 1, characterized in that the specific steps of adjusting the duration of the voice information and the duration of the action information to be the same comprise: if the difference between the duration of the voice information and the duration of the action information is greater than a threshold, then when the duration of the voice information is greater than the duration of the action information, sorting and combining at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
- The method according to claim 4, characterized in that when the duration of the voice information is less than the duration of the action information, part of the actions in the action information are selected so that the duration of the selected actions equals the duration of the voice information.
- The method according to claim 1, characterized in that the method for generating the parameters of the robot's life timeline comprises: expanding the robot's self-cognition; acquiring the parameters of the life timeline; and fitting the parameters of the robot's self-cognition to the parameters in the life timeline to generate the robot's life timeline.
- The method according to claim 6, characterized in that the step of expanding the robot's self-cognition specifically comprises: combining life scenes with the robot's self-cognition to form a self-cognition curve based on the life timeline.
- The method according to claim 6, characterized in that the step of fitting the parameters of the robot's self-cognition to the parameters in the life timeline specifically comprises: using a probability algorithm, calculating the probability of each parameter of the robot on the life timeline changing after the timeline scene parameters change, to form a fitted curve.
- The method according to claim 1, characterized in that the life timeline refers to a timeline covering the 24 hours of a day, and the parameters in the life timeline comprise at least the daily behaviors the user performs on the life timeline and the parameter values representing those behaviors.
- A system for synchronizing voice and virtual actions, characterized by comprising: an acquisition module, configured to acquire multimodal information of a user; an artificial intelligence module, configured to generate interactive content according to the user's multimodal information and a life timeline, the interactive content comprising at least voice information and action information; and a control module, configured to adjust the duration of the voice information and the duration of the action information to be the same.
- The system according to claim 10, characterized in that the control module is specifically configured to: if the difference between the duration of the voice information and the duration of the action information is not greater than a threshold, then when the duration of the voice information is less than the duration of the action information, accelerate the playback speed of the action information so that the duration of the action information equals the duration of the voice information.
- The system according to claim 11, characterized in that when the duration of the voice information is greater than the duration of the action information, the playback speed of the voice information is accelerated or/and the playback speed of the action information is slowed down, so that the duration of the action information equals the duration of the voice information.
- The system according to claim 10, characterized in that the control module is specifically configured to: if the difference between the duration of the voice information and the duration of the action information is greater than a threshold, then when the duration of the voice information is greater than the duration of the action information, combine at least two sets of action information so that the duration of the combined action information equals the duration of the voice information.
- The system according to claim 13, characterized in that when the duration of the voice information is less than the duration of the action information, part of the actions in the action information are selected so that the duration of the selected actions equals the duration of the voice information.
- The system according to claim 10, characterized in that the system comprises a processing module configured to: expand the robot's self-cognition; acquire the parameters of the life timeline; and fit the parameters of the robot's self-cognition to the parameters in the life timeline to generate the robot's life timeline.
- The system according to claim 15, characterized in that the processing module is specifically configured to: combine life scenes with the robot's self-cognition to form a self-cognition curve based on the life timeline.
- The system according to claim 15, characterized in that the processing module is specifically configured to: use a probability algorithm to calculate the probability of each parameter of the robot on the life timeline changing after the timeline scene parameters change, to form a fitted curve.
- The system according to claim 10, characterized in that the life timeline refers to a timeline covering the 24 hours of a day, and the parameters in the life timeline comprise at least the daily behaviors the user performs on the life timeline and the parameter values representing those behaviors.
- A robot, characterized by comprising a system for synchronizing voice and virtual actions according to any one of claims 9 to 18.
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201680001731.5A CN106463118B (zh) | 2016-07-07 | 2016-07-07 | Method, system and robot for synchronizing voice and virtual actions |
| PCT/CN2016/089215 WO2018006371A1 (zh) | 2016-07-07 | 2016-07-07 | Method, system and robot for synchronizing voice and virtual actions |
| JP2017133168A JP6567610B2 (ja) | 2016-07-07 | 2017-07-06 | Method, system and robot body for synchronizing voice and virtual actions |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2016/089215 WO2018006371A1 (zh) | 2016-07-07 | 2016-07-07 | Method, system and robot for synchronizing voice and virtual actions |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2018006371A1 (zh) | 2018-01-11 |
Family
ID=58215741
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2016/089215 Ceased WO2018006371A1 (zh) | 2016-07-07 | 2016-07-07 | 一种同步语音及虚拟动作的方法、系统及机器人 |
Country Status (3)
| Country | Link |
|---|---|
| JP (1) | JP6567610B2 (zh) |
| CN (1) | CN106463118B (zh) |
| WO (1) | WO2018006371A1 (zh) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108650217A * | 2018-03-21 | 2018-10-12 | 腾讯科技(深圳)有限公司 | Method and apparatus for synchronizing motion states, storage medium and electronic apparatus |
| CN112528000A * | 2020-12-22 | 2021-03-19 | 北京百度网讯科技有限公司 | Virtual robot generation method and apparatus, and electronic device |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107992935A (zh) * | 2017-12-14 | 2018-05-04 | 深圳狗尾草智能科技有限公司 | Method, device and medium for setting a life cycle for a robot |
| CN109202925A (zh) * | 2018-09-03 | 2019-01-15 | 深圳狗尾草智能科技有限公司 | Method, system and device for synchronizing robot actions with voice |
| CN109521878A (zh) * | 2018-11-08 | 2019-03-26 | 歌尔科技有限公司 | Interaction method and apparatus, and computer-readable storage medium |
| CN115497499B (zh) * | 2022-08-30 | 2024-09-17 | 阿里巴巴(中国)有限公司 | Method for time-synchronizing voice and actions |
| CN116028655A (zh) * | 2022-12-01 | 2023-04-28 | 腾讯音乐娱乐科技(深圳)有限公司 | Action sequence generation method and related device |
| CN117058286B (zh) * | 2023-10-13 | 2024-01-23 | 北京蔚领时代科技有限公司 | Method and apparatus for generating video from text driving a digital human |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101290718A (zh) * | 2008-05-30 | 2008-10-22 | 梅敏 | Networked interactive voice toy assembly and implementation method therefor |
| CN101604204A (zh) * | 2009-07-09 | 2009-12-16 | 北京科技大学 | Distributed cognition technology for intelligent emotional robots |
| CN103037945A (zh) * | 2010-04-30 | 2013-04-10 | 方瑞麟 | Interactive device with sound-based action synchronization |
| US9147388B2 (en) * | 2012-06-26 | 2015-09-29 | Yamaha Corporation | Automatic performance technique using audio waveform data |
| CN105598972A (zh) * | 2016-02-04 | 2016-05-25 | 北京光年无限科技有限公司 | Robot system and interaction method |
Family Cites Families (22)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH10143351A (ja) * | 1996-11-13 | 1998-05-29 | Sharp Corp | インタフェース装置 |
| EP2175659B1 (en) * | 1996-12-04 | 2012-11-14 | Panasonic Corporation | Optical disk for high resolution and three-dimensional video recording, optical disk reproduction apparatus, and optical disk recording apparatus |
| JP3792882B2 (ja) * | 1998-03-17 | 2006-07-05 | 株式会社東芝 | 感情生成装置及び感情生成方法 |
| JP2001154681A (ja) * | 1999-11-30 | 2001-06-08 | Sony Corp | 音声処理装置および音声処理方法、並びに記録媒体 |
| JP2001215940A (ja) * | 2000-01-31 | 2001-08-10 | Toshiba Corp | 表情を有する知的ロボット |
| JP3930389B2 (ja) * | 2002-07-08 | 2007-06-13 | 三菱重工業株式会社 | ロボット発話中の動作プログラム生成装置及びロボット |
| JP2005003926A (ja) * | 2003-06-11 | 2005-01-06 | Sony Corp | 情報処理装置および方法、並びにプログラム |
| JP2005092675A (ja) * | 2003-09-19 | 2005-04-07 | Science Univ Of Tokyo | ロボット |
| EP1845724A1 (en) * | 2005-02-03 | 2007-10-17 | Matsushita Electric Industrial Co., Ltd. | Recording/reproduction device, recording/reproduction method, recording/reproduction apparatus and recording/reproduction method, and recording medium storing recording/reproduction program, and integrated circuit for use in recording/reproduction apparatus |
| JP2008040726A (ja) * | 2006-08-04 | 2008-02-21 | Univ Of Electro-Communications | ユーザ支援システム及びユーザ支援方法 |
| JP2009141555A (ja) * | 2007-12-05 | 2009-06-25 | Fujifilm Corp | 音声入力機能付き撮像装置及びその音声記録方法 |
| JP5045519B2 (ja) * | 2008-03-26 | 2012-10-10 | トヨタ自動車株式会社 | 動作生成装置、ロボット及び動作生成方法 |
| US8595156B2 (en) * | 2008-10-03 | 2013-11-26 | Bae Systems Plc | Assisting with updating a model for diagnosing failures in a system |
| JP2010094799A (ja) * | 2008-10-17 | 2010-04-30 | Littleisland Inc | 人型ロボット |
| JP2011054088A (ja) * | 2009-09-04 | 2011-03-17 | National Institute Of Information & Communication Technology | 情報処理装置、情報処理方法、プログラム及び対話システム |
| JP2012215645A (ja) * | 2011-03-31 | 2012-11-08 | Speakglobal Ltd | コンピュータを利用した外国語会話練習システム |
| CN103596051A (zh) * | 2012-08-14 | 2014-02-19 | 金运科技股份有限公司 | 电视装置及其虚拟主持人显示方法 |
| JP6126028B2 (ja) * | 2014-02-28 | 2017-05-10 | 三井不動産株式会社 | ロボット制御システム、ロボット制御サーバ及びロボット制御プログラム |
| JP6058053B2 (ja) * | 2014-06-05 | 2017-01-11 | Cocoro Sb株式会社 | 記録制御システム、システム及びプログラム |
| JP6305538B2 (ja) * | 2014-07-10 | 2018-04-04 | 株式会社東芝 | 電子機器及び方法及びプログラム |
| CN104574478A (zh) * | 2014-12-30 | 2015-04-29 | 北京像素软件科技股份有限公司 | 一种编辑动画人物口型的方法及装置 |
| CN105807933B (zh) * | 2016-03-18 | 2019-02-12 | 北京光年无限科技有限公司 | 一种用于智能机器人的人机交互方法及装置 |
- 2016
  - 2016-07-07: CN application CN201680001731.5A, patent CN106463118B (zh), not active: Expired - Fee Related
  - 2016-07-07: WO application PCT/CN2016/089215, publication WO2018006371A1 (zh), not active: Ceased
- 2017
  - 2017-07-06: JP application JP2017133168A, patent JP6567610B2 (ja), not active: Expired - Fee Related
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108650217A (zh) * | 2018-03-21 | 2018-10-12 | 腾讯科技(深圳)有限公司 | Method and apparatus for synchronizing motion states, storage medium and electronic apparatus |
| CN108650217B (zh) | 2018-03-21 | 2019-07-23 | 腾讯科技(深圳)有限公司 | Method and apparatus for synchronizing motion states, storage medium and electronic apparatus |
| CN112528000A (zh) | 2020-12-22 | 2021-03-19 | 北京百度网讯科技有限公司 | Virtual robot generation method and apparatus, and electronic device |
Also Published As
| Publication number | Publication date |
|---|---|
| CN106463118A (zh) | 2017-02-22 |
| CN106463118B (zh) | 2019-09-03 |
| JP2018001404A (ja) | 2018-01-11 |
| JP6567610B2 (ja) | 2019-08-28 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2018006371A1 (zh) | 一种同步语音及虚拟动作的方法、系统及机器人 | |
| TWI778477B (zh) | 互動方法、裝置、電子設備以及儲存媒體 | |
| CN106471572B (zh) | 一种同步语音及虚拟动作的方法、系统及机器人 | |
| KR101306221B1 (ko) | 3차원 사용자 아바타를 이용한 동영상 제작장치 및 방법 | |
| JP2024525119A (ja) | インタラクティブな同調離散アバターをリアルタイムで自動生成するシステムおよび方法 | |
| Dionisio et al. | 3D virtual worlds and the metaverse: Current status and future possibilities | |
| WO2018000259A1 (zh) | 一种机器人交互内容的生成方法、系统及机器人 | |
| Peng et al. | Robotic dance in social robotics—a taxonomy | |
| Leman | Musical gestures and embodied cognition | |
| WO2018000268A1 (zh) | 一种机器人交互内容的生成方法、系统及机器人 | |
| WO2018006370A1 (zh) | 一种虚拟3d机器人的交互方法、系统及机器人 | |
| US20090128567A1 (en) | Multi-instance, multi-user animation with coordinated chat | |
| CN108665492A (zh) | 一种基于虚拟人的舞蹈教学数据处理方法及系统 | |
| JP2019040516A (ja) | 故人憑依用ロボット | |
| WO2011143523A2 (en) | Electronic personal interactive device | |
| WO2018006372A1 (zh) | 一种基于意图识别控制家电的方法、系统及机器人 | |
| CN106462255A (zh) | 一种机器人交互内容的生成方法、系统及机器人 | |
| JP2018014094A (ja) | 仮想ロボットのインタラクション方法、システム及びロボット | |
| WO2018006374A1 (zh) | 一种基于主动唤醒的功能推荐方法、系统及机器人 | |
| CN117857892B (zh) | 基于人工智能的数据处理方法、装置、电子设备、计算机程序产品及计算机可读存储介质 | |
| WO2024012007A9 (zh) | 一种动画数据生成方法、装置及相关产品 | |
| De Simone et al. | Empowering human interaction: A socially assistive robot for support in trade shows | |
| CN106462804A (zh) | 一种机器人交互内容的生成方法、系统及机器人 | |
| WO2018000260A1 (zh) | 一种机器人交互内容的生成方法、系统及机器人 | |
| WO2018000261A1 (zh) | 一种机器人交互内容的生成方法、系统及机器人 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 16907876; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 16907876; Country of ref document: EP; Kind code of ref document: A1 |