
CN118692444A - Method, device, equipment, and medium for generating prompt audio - Google Patents

Method, device, equipment, and medium for generating prompt audio

Info

Publication number
CN118692444A
CN118692444A (application CN202410952680.0A)
Authority
CN
China
Prior art keywords
information
emotion
vehicle
voice data
subject
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410952680.0A
Other languages
Chinese (zh)
Inventor
周茂井
王红余
金飞
张超英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chery New Energy Automobile Co Ltd
Original Assignee
Chery New Energy Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chery New Energy Automobile Co Ltd filed Critical Chery New Energy Automobile Co Ltd
Priority to CN202410952680.0A priority Critical patent/CN118692444A/en
Publication of CN118692444A publication Critical patent/CN118692444A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Traffic Control Systems (AREA)

Abstract

The present application discloses a method, device, equipment, and medium for generating prompt audio, relating to the field of automobile technology. The method is executed by the vehicle-mounted terminal of a first vehicle and includes the following steps: collecting voice data of a first subject located in the first vehicle; analyzing the voice data to obtain emotional state information of the first subject, the emotional state information indicating the emotional state of the first subject when expressing the voice data; generating corresponding prompt information based on the emotional state information, the prompt information indicating that emotional feedback is to be provided to the first subject; and converting the prompt information into prompt audio and playing the prompt audio. By collecting the driver's voice data and analyzing the driver's current emotional state, the accuracy of the recognition result is improved. Providing corresponding emotional feedback helps the driver adjust their emotions in time, improving the driving experience and safety during driving.

Description

Method, device, equipment, and medium for generating prompt audio

Technical Field

The embodiments of the present application relate to the field of automobile technology, and in particular to a method, device, equipment, and medium for generating prompt audio.

Background Art

In the field of modern smart cars, in-vehicle systems are being developed in an increasingly personalized and intelligent direction. A smart car that promptly identifies the driver's emotions and provides corresponding comfort measures can strengthen the emotional connection between the driver and the vehicle and improve the overall driving experience and safety during driving.

In the related art, image processing and machine learning algorithms analyze the driver's facial expressions to identify emotions and provide personalized comfort measures accordingly.

However, changes in ambient light, the complexity of facial expressions, and the demand for real-time processing can all reduce the accuracy of emotion recognition and thereby weaken the vehicle's comforting effect on the driver.

Summary of the Invention

The embodiments of the present application provide a method, device, equipment, and medium for generating prompt audio, which can promptly identify the driver's emotional state and generate prompt audio that gives the driver emotional feedback. The technical solution is as follows:

In one aspect, a method for generating prompt audio is provided, executed by an onboard terminal of a first vehicle. The method includes:

collecting voice data of a first subject, the first subject being located in the first vehicle;

analyzing the voice data to obtain emotional state information of the first subject, the emotional state information indicating the emotional state of the first subject when expressing the voice data;

generating corresponding prompt information based on the emotional state information, the prompt information indicating that emotional feedback is to be provided to the first subject; and

converting the prompt information into prompt audio and playing the prompt audio.

In another aspect, a device for generating prompt audio is provided, the device comprising:

a collection module, configured to collect voice data of a first subject, the first subject being located in the first vehicle;

an analysis module, configured to analyze the voice data to obtain emotional state information of the first subject, the emotional state information indicating the emotional state of the first subject when expressing the voice data;

a generation module, configured to generate corresponding prompt information based on the emotional state information, the prompt information indicating that emotional feedback is to be provided to the first subject; and

a conversion module, configured to convert the prompt information into prompt audio and play the prompt audio.

In an optional embodiment, the analysis module is further configured to input the voice data into a pre-trained deep learning model, perform sentiment analysis on the voice data through the deep learning model, and obtain and output an analysis result containing at least one candidate emotional state; to collect driving information of the first vehicle, the driving information including vehicle operation information of the first vehicle and environmental information about the environment in which the first vehicle is located; and to determine the emotional state information of the first subject from the candidate emotional states based on the analysis result and the driving information.

In an optional embodiment, the analysis module is further configured to obtain an emotional state lookup table containing correspondences between the driving information of the first vehicle and at least one emotional state; to determine, based on the driving information, a first correspondence in the lookup table that matches the driving information; and to determine the emotional state information of the first subject from the candidate emotional states based on the emotional state indicated by the first correspondence.
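The lookup-table step described above can be sketched as follows. This is a minimal illustration only: the table structure, condition names, and emotion labels are assumptions for the example, not taken from the patent text.

```python
# Each entry maps a driving-information condition to emotions it makes plausible.
# All conditions and labels below are illustrative assumptions.
EMOTION_TABLE = [
    ({"traffic": "congested"}, {"anxious", "frustrated"}),
    ({"speed": "high"},        {"excited", "anxious"}),
    ({"traffic": "clear"},     {"calm", "happy"}),
]

def match_conditions(driving_info, conditions):
    """A table row matches when all of its conditions hold in driving_info."""
    return all(driving_info.get(k) == v for k, v in conditions.items())

def resolve_emotion(candidates, driving_info):
    """Pick the candidate emotion supported by the first matching table row;
    fall back to the model's top candidate when nothing matches."""
    for conditions, plausible in EMOTION_TABLE:
        if match_conditions(driving_info, conditions):
            supported = [c for c in candidates if c in plausible]
            if supported:
                return supported[0]
    return candidates[0]

print(resolve_emotion(["anxious", "happy"], {"traffic": "congested"}))  # anxious
```

The driving information here acts as a tiebreaker over the model's candidate emotional states, which mirrors the two-source decision the embodiment describes.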

In an optional embodiment, the device further includes:

a training module, configured to obtain sample speech data annotated with emotion tags, each emotion tag indicating the emotional state expressed by the sample speech data, and to train a pre-training model on the sample speech data to obtain the pre-trained deep learning model.

In an optional embodiment, the training module is further configured to model the time series of the sample speech data through the pre-training model to determine the context information and temporal dependencies of the sample speech data, the temporal dependencies referring to the characteristics of the sample speech data over time; to output a predicted emotional state based on the context information and temporal dependencies of the sample speech data; and to adjust the pre-training model based on the prediction results and the emotion tags to obtain the pre-trained deep learning model.

In an optional embodiment, the generation module is further configured to input the emotional state information into a pre-trained natural language model, identify and analyze the emotional state information through the natural language model, and determine the emotion type with which the onboard terminal should respond to the first subject; and to generate text prompt information matching that emotion type, the text prompt information containing emotional feedback content in text form.

In an optional embodiment, the generation module is further configured to obtain historical text prompt information generated by the natural language model within a historical time period, the historical text prompt information indicating the emotional feedback content provided by the onboard terminal to the first subject during that period; and to select, based on the emotion type, information matching the emotion type from the historical text prompt information as the text prompt information.

In another aspect, a computer device is provided, comprising a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by the processor to implement the method for generating prompt audio described in any of the above embodiments of the present application.

In another aspect, a computer-readable storage medium is provided, storing at least one instruction, at least one program, a code set, or an instruction set that is loaded and executed by a processor to implement the method for generating prompt audio described in any of the above embodiments of the present application.

In another aspect, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the storage medium and executes them, causing the computer device to perform the method for generating prompt audio described in any of the above embodiments.

The beneficial effects of the technical solution provided by the embodiments of the present application include at least the following:

By collecting the driver's voice data and analyzing the driver's current emotional state, the system can provide corresponding emotional feedback, helping the driver adjust their emotions in time and improving the driving experience and driving safety. Because speech is continuous, the ongoing variation in pitch, volume, and other features better reflects the driver's true emotions. Compared with the related art, which identifies the driver's emotions from facial expressions, emotion recognition based on voice data improves the accuracy of the recognition result and thus enables more suitable emotional feedback.

BRIEF DESCRIPTION OF THE DRAWINGS

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a system for generating prompt audio according to an exemplary embodiment of the present application;

FIG. 2 is a schematic diagram of the prompt audio generation stages according to an exemplary embodiment of the present application;

FIG. 3 is a flowchart of a method for generating prompt audio according to an exemplary embodiment of the present application;

FIG. 4 is a structural block diagram of a device for generating prompt audio according to an exemplary embodiment of the present application;

FIG. 5 is a structural block diagram of a device for generating prompt audio according to another exemplary embodiment of the present application;

FIG. 6 is a structural block diagram of a computer device according to an exemplary embodiment of the present application.

DETAILED DESCRIPTION

To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.

Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of devices and methods consistent with some aspects of the present application as detailed in the appended claims.

The terms used in this application are for the purpose of describing specific embodiments only and are not intended to limit the application. The singular forms "a", "said", and "the" used in this application and the appended claims are also intended to include plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be noted that the information and data involved in this application are authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data comply with the relevant laws, regulations, and standards of the relevant countries and regions.

It should be understood that although the terms first, second, and so on may be used in this application to describe various information, the information should not be limited by these terms; they are used only to distinguish information of the same type. For example, without departing from the scope of this application, a first parameter could also be called a second parameter and, similarly, a second parameter could be called a first parameter. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".

First, the terms involved in the embodiments of the present application are briefly introduced:

Recurrent Neural Network (RNN): an artificial neural network suited to processing sequence data. RNNs handle the dynamic features of a sequence: they remember previously input information and use it to influence the current output. For example, an RNN can convert speech signals into text by learning the temporal features of the signal and predicting the corresponding character sequence. By analyzing features such as pitch, volume, and speaking rate, an RNN can also identify the speaker's emotional state, such as happiness, sadness, or anger.
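The recurrence described above can be shown in a few lines of pure Python. This is a toy sketch with fixed weights, not a trained speech-emotion model; it only demonstrates the memory property that makes RNNs suitable for speech sequences.

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    """One Elman-style recurrence: h_t = tanh(w_h * h_{t-1} + w_x * x_t + b)."""
    return math.tanh(w_h * h_prev + w_x * x + b)

def run_rnn(sequence):
    h = 0.0
    for x in sequence:          # e.g. per-frame pitch or energy features
        h = rnn_step(h, x)
    return h                    # final state summarizes the whole sequence

# The same last input yields different final states when the history differs,
# which is exactly the "remember previous inputs" behavior described above.
a = run_rnn([0.1, 0.1, 0.9])
b = run_rnn([0.9, 0.9, 0.9])
print(a != b)  # True
```

In a real emotion recognizer, the final state would feed a classification layer that maps it to labels such as happy, sad, or angry.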

Loss function: also called a cost function or objective function, a function that measures the difference between the model's predictions and the actual values. In machine learning and deep learning, the loss function is a core part of training; it guides the direction of learning, in that the model parameters are optimized by minimizing the loss.

There are many types of loss functions, such as the cross-entropy loss; in practice, the loss function used to train a model is chosen according to the application scenario.

Optimizer: in machine learning, an algorithm for adjusting model parameters with the goal of minimizing the loss function. The optimizer iteratively updates the model's parameters so that the model performs best on the training data.
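The relationship between the two terms above can be illustrated with a one-parameter toy model (an assumption for the example): cross-entropy measures the gap between prediction and label, and one gradient-descent step reduces it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(p, y):
    """Binary cross-entropy for predicted probability p and true label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Toy model: p = sigmoid(w * x); one SGD step on a single example (x=1.0, y=1).
w, x, y, lr = 0.0, 1.0, 1, 0.1
p = sigmoid(w * x)
loss_before = cross_entropy(p, y)
grad = (p - y) * x              # d(loss)/dw for sigmoid + cross-entropy
w -= lr * grad                  # the optimizer update
loss_after = cross_entropy(sigmoid(w * x), y)
print(loss_after < loss_before)  # True
```

Real optimizers (SGD with momentum, Adam, and so on) repeat exactly this update pattern over many parameters and many batches.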

In the modern automotive industry, in-vehicle systems are steadily advancing toward personalization and intelligence. Using advanced technology, smart cars can identify the driver's emotional state and provide customized comfort measures accordingly, enhancing the driving experience and driving safety.

When a car identifies the driver's emotions, it usually relies on image processing and machine learning algorithms: it captures images of the driver's face and analyzes facial expressions to identify the emotional state. For example, if the system detects that the driver is tense or anxious, it may offer soothing measures such as adjusting the air-conditioning temperature or playing soft natural sounds.

However, this approach faces a series of challenges in practice. Changes in ambient light can reduce the accuracy of facial expression recognition: under strong or weak light, facial features may be hard to capture, leading to misjudgments by the emotion recognition system. Moreover, people express emotions in diverse ways, and subtle changes in expression can be difficult for an algorithm to capture accurately.

The present application provides a method for generating prompt audio that analyzes the driver's emotional state from the driver's voice data, improving the accuracy of the emotion recognition result, and generates corresponding prompt audio to give the driver emotional feedback, helping to avoid accidents caused by the driver's emotional state while controlling the vehicle.

Next, the prompt audio generation system involved in the embodiments of the present application is described. Schematically, referring to FIG. 1, the implementation environment involves a first vehicle 100 and a server 120, connected through a communication network 140.

The first vehicle 100 is equipped with an onboard terminal and a voice collection component. With the driver's authorization, the voice collection component collects the driver's voice data, which includes the driver's voice. After the voice data is collected, the onboard terminal sends it to the server 120 through the communication network 140. The server 120 analyzes the voice data to identify the driver's emotional state reflected in it and generates corresponding prompt audio, which is returned to the onboard terminal. The prompt audio provides emotional feedback to the driver, for example comforting the driver to help them calm down.

The server 120 contains a deep learning model: the voice data is input into the model, which outputs corresponding emotional state information indicating the driver's emotional state.

The server 120 generates corresponding prompt information based on the emotional state information, synthesizes the prompt information into prompt audio, and returns it to the onboard terminal.

The onboard terminal plays the prompt audio promptly to give the driver emotional feedback; for example, when the emotional state information indicates that the driver is sad, the prompt audio is used to comfort the driver.

In some embodiments, the above process can be viewed as multiple processing stages executed jointly by the onboard terminal and the server.

Optionally, as shown in FIG. 2, these stages include a voice data collection stage 210, an emotion recognition stage 220, an emotion analysis stage 230, an audio synthesis stage 240, and an audio output stage 250.

Voice data collection stage 210: collects the driver's voice data inside the car or while driving, which can be done through an in-car microphone or other voice collection equipment. In some embodiments, this stage also includes preprocessing of the voice data, such as removing environmental noise and determining valid voice segments, where a valid voice segment is one that contains the driver's voice.
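The "valid voice segment" idea in the collection stage can be sketched with a simple energy threshold: frames whose energy is too low are treated as silence or background noise. Real systems use far more robust voice-activity detection; the threshold and sample values below are illustrative assumptions.

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)

def valid_segments(frames, threshold=0.01):
    """Return indices of frames whose energy exceeds the noise threshold."""
    return [i for i, f in enumerate(frames) if frame_energy(f) > threshold]

frames = [
    [0.001, -0.002, 0.001],   # near-silence
    [0.5, -0.4, 0.3],         # speech-like
    [0.002, 0.001, -0.001],   # near-silence
]
print(valid_segments(frames))  # [1]
```

Only the frames that pass this check would be forwarded to the emotion recognition stage.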

Emotion recognition stage 220: recognizes and classifies the emotional state expressed in the driver's voice, usually using a deep learning model such as a recurrent neural network or a convolutional neural network. The input of this stage is voice data; the output is corresponding emotional state information, such as happy, frustrated, or anxious.

Emotion analysis stage 230: further analyzes and processes the driver's emotional state information identified in the emotion recognition stage 220. In some embodiments, this stage combines other information (such as vehicle speed and the traffic conditions around the vehicle) to comprehensively analyze the driver's emotional state.

Audio synthesis stage 240: based on the results of emotion recognition and analysis, selects appropriate comforting content and uses speech synthesis to generate the corresponding prompt audio. This stage converts text into natural, fluent speech to provide specific voice feedback such as comfort, encouragement, or reminders.

Audio output stage 250: outputs the generated synthesized speech to the driver, through the vehicle's audio system or headphones worn by the driver, to provide emotional comfort and guidance for the driver.
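The five stages above form a simple data-flow pipeline. The sketch below stubs each stage as a function; every function body is a placeholder assumption, and only the chaining (collect, recognize, analyze, synthesize, output) mirrors the description.

```python
def collect_voice():                            # stage 210
    return "raw-audio"

def recognize_emotion(audio):                   # stage 220
    return ["anxious", "frustrated"]            # candidate emotional states

def analyze_emotion(candidates, driving_info):  # stage 230
    # In practice this would combine speed, traffic conditions, etc.
    return candidates[0]

def synthesize_prompt(emotion):                 # stage 240
    return f"prompt audio for emotion '{emotion}'"

def output_audio(prompt_audio):                 # stage 250
    return f"playing: {prompt_audio}"

def pipeline(driving_info):
    audio = collect_voice()
    candidates = recognize_emotion(audio)
    emotion = analyze_emotion(candidates, driving_info)
    return output_audio(synthesize_prompt(emotion))

print(pipeline({"speed": 80}))  # playing: prompt audio for emotion 'anxious'
```

Splitting the flow this way also matches the deployment described earlier, where stages 220 through 240 may run on the server while 210 and 250 run on the onboard terminal.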

值得注意的是,上述服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(Content DeliveryNetwork,CDN)、以及大数据和人工智能平台等基础云计算服务的云服务器。It is worth noting that the above-mentioned servers can be independent physical servers, or they can be server clusters or distributed systems composed of multiple physical servers. They can also be cloud servers that provide basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), as well as big data and artificial intelligence platforms.

其中,云技术(Cloud technology)是指在广域网或局域网内将硬件、软件、网络等系列资源统一起来,实现数据的计算、储存、处理和共享的一种托管技术。云技术基于云计算商业模式应用的网络技术、信息技术、整合技术、管理平台技术、应用技术等的总称,可以组成资源池,按需所用,灵活便利。云计算技术将变成重要支撑。技术网络系统的后台服务需要大量的计算、存储资源,如视频网站、图片类网站和更多的门户网站。伴随着互联网行业的高度发展和应用,将来每个物品都有可能存在自己的识别标志,都需要传输到后台系统进行逻辑处理,不同程度级别的数据将会分开处理,各类行业数据皆需要强大的系统后盾支撑,只能通过云计算来实现。Among them, cloud technology refers to a hosting technology that unifies hardware, software, network and other resources in a wide area network or local area network to realize data computing, storage, processing and sharing. Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, etc. based on the cloud computing business model. It can form a resource pool, which is used on demand and flexible and convenient. Cloud computing technology will become an important support. The background services of the technical network system require a large amount of computing and storage resources, such as video websites, picture websites and more portal websites. With the high development and application of the Internet industry, in the future, each item may have its own identification mark, which needs to be transmitted to the background system for logical processing. Data of different levels will be processed separately. All kinds of industry data need strong system backing support, which can only be achieved through cloud computing.

In some embodiments, the above server may also be implemented as a node in a blockchain system.

In combination with the above introduction of terms and application scenarios, the method for generating prompt audio provided by the present application is described. The method may be executed by a server or a vehicle-mounted terminal, or jointly by a server and a vehicle-mounted terminal. In the embodiments of the present application, the method is described by taking execution by the vehicle-mounted terminal of a first vehicle as an example. FIG. 3 is a flowchart of the method for generating prompt audio provided by an exemplary embodiment of the present application. The method includes the following steps.

Step 310: collect voice data of a first subject.

The first subject is located in the first vehicle, and may be the driver or an occupant of the first vehicle.

Optionally, the first vehicle is equipped with a component capable of collecting audio data inside the vehicle; when a subject inside the first vehicle makes a sound, the audio data includes voice data. For example, the first vehicle is equipped with a microphone array composed of multiple microphones, which can capture sounds inside the first vehicle and determine the source of each sound based on sound localization technology.

The voice collection component may be turned on in ways including but not limited to: (1) the first subject operates the switch of the voice collection component; (2) the first subject turns it on through another terminal or device bound to the voice collection component; (3) the first subject turns it on through a voice command, for example when the voice collection component has a voice control function.

Voice data refers to the data obtained by the voice collection component by collecting the voice signals present inside the vehicle after a subject in the first vehicle has enabled the voice collection function. When the voice collection component is not turned on, the voice signals in the first vehicle cannot be collected.

In some embodiments, the voice collection component remains continuously on; when a subject is present in the first vehicle, the voice collection component actively plays a prompt voice to request voice collection permission from the subject inside the first vehicle.

For example, a sensor in the first vehicle can detect whether a first subject is present in the first vehicle. When it detects that a subject has entered the first vehicle, the vehicle-mounted terminal automatically controls the voice collection component to play the prompt voice "Enable voice collection permission?". If the subject in the vehicle agrees through a voice command or other means, the voice collection component remains on; if the subject refuses through a voice command or other means, the voice collection component is automatically turned off.

Optionally, the first subject is the driver. The voice data of the first subject includes the first subject's human voice, that is, the speech sounds produced by the first subject's vocal cords, such as speaking, reading aloud, or singing; the voice data reflects features such as intonation, volume, speaking speed, and continuity of the first subject's utterances.

Optionally, the voice data also includes other sounds made by the first subject, such as laughter, crying, sighs, or coughs, which can reflect the first subject's physical health and emotional state.

In some embodiments, in addition to sounds made by the first subject, the voice data collected by the voice collection component may also include sounds from other sources, such as ambient noise in the environment of the first vehicle or sounds emitted by electronic devices operating inside the first vehicle.

In speech analysis such as sentiment analysis, removing noise helps capture the key features of the first subject in the voice data more accurately. To highlight the sounds in the voice data that come directly from the first subject and to improve the accuracy of subsequent recognition and analysis, the vehicle-mounted terminal performs noise reduction on the voice data after collection, removing environmental noise and determining the effective speech segments in the voice data.
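As an illustration only, one common way to determine effective speech segments is an energy-based gate: frames whose short-term energy exceeds an estimated noise floor are kept as speech. The frame length, threshold factor, and function names below are assumptions for the sketch, not part of the described system.

```python
def frame_energy(samples, frame_len=160):
    """Split a sample sequence into frames and compute per-frame mean energy."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [sum(x * x for x in f) / max(len(f), 1) for f in frames]

def effective_segments(samples, frame_len=160, factor=3.0):
    """Return (start, end) sample indices of frames whose energy exceeds the
    estimated noise floor (mean energy of the quietest half of all frames)."""
    energies = frame_energy(samples, frame_len)
    quiet = sorted(energies)[: max(len(energies) // 2, 1)]
    noise_floor = sum(quiet) / len(quiet)
    segs = []
    for i, e in enumerate(energies):
        if e > factor * noise_floor + 1e-12:
            start, end = i * frame_len, min((i + 1) * frame_len, len(samples))
            if segs and segs[-1][1] == start:
                segs[-1] = (segs[-1][0], end)  # merge adjacent voiced frames
            else:
                segs.append((start, end))
    return segs
```

For instance, a signal consisting of quiet background noise around a loud burst yields a single effective segment covering the burst.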

Step 320: analyze the voice data to obtain emotional state information of the first subject.

The emotional state information indicates the emotional state of the first subject when expressing the voice data.

For example, if the voice data is collected while the first subject sings, the emotional state information obtained by analyzing the voice data indicates the first subject's emotional state while singing, such as agitation, sadness, excitement, or joy.

Optionally, the voice data is input into a pre-trained deep learning model, which performs sentiment analysis on the voice data and outputs an analysis result containing at least one candidate emotional state.

Exemplarily, the deep learning model is a recurrent neural network model, whose structure includes an input layer, a hidden layer, and an output layer.

The input layer is the first layer through which the neural network receives raw data. After the voice data is input into the recurrent neural network model, it first enters the input layer, which maps the voice data to corresponding numerical features for the network to process. The hidden layer extracts features from the input data, combines simple features into more complex ones, helps the network understand the input data and learn the importance of each feature, adjusts the network to optimize performance, and passes the processed data to the next layer. The output layer converts the information passed from the hidden layer into the model's final prediction result.

That is, after the deep learning model processes the voice data through at least the above three structural layers, the output prediction data constitutes the analysis result.
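The flow through these three layers can be sketched as a minimal recurrent forward pass in pure Python. The dimensions, weight values, and tanh/softmax choices are illustrative assumptions; a production model would be trained rather than hand-weighted.

```python
import math

def rnn_forward(sequence, w_xh, w_hh, w_hy, hidden_size):
    """Minimal RNN: the input layer maps each per-time-step feature vector,
    the hidden layer carries state across time steps, and the output layer
    scores each emotion class."""
    h = [0.0] * hidden_size
    for x in sequence:  # one feature vector per time step (input layer)
        h = [math.tanh(sum(w_xh[j][i] * x[i] for i in range(len(x)))
                       + sum(w_hh[j][k] * h[k] for k in range(hidden_size)))
             for j in range(hidden_size)]  # hidden-layer state update
    scores = [sum(w_hy[c][j] * h[j] for j in range(hidden_size))
              for c in range(len(w_hy))]  # output layer
    exps = [math.exp(s - max(scores)) for s in scores]
    return [e / sum(exps) for e in exps]  # softmax over emotion classes
```

The returned values form a probability distribution over candidate emotional states, one probability per class.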

Exemplarily, the voice data is data collected while the first subject makes a voice call.

The content of the first subject's call, collected after the first subject authorized the voice collection component, is as follows: "The traffic is too congested. I should have been home in 20 minutes, but now it looks like I will be at least two hours late."

The voice data reflects that the first subject's volume is high, the pitch is high, and the speaking speed is fast. The voice data is input into the pre-trained deep learning model, which converts the call content into text and applies natural language processing to analyze the emotional tendency of the text. The pitch reflected in the voice data is used to identify rises in the first subject's voice, and the volume is used to assess whether the first subject's emotions are heightened. Combined with the text content, the first subject's emotional state is analyzed to obtain at least one candidate emotion.

The above analysis shows that the first subject's volume has increased significantly, the speaking speed has accelerated, and the pitch is sharp with large intonation fluctuations, while the text content contains words expressing dissatisfaction and anxiety.

From the text content, the first subject is stuck in a traffic jam and is anxious about possibly returning home late. Exemplarily, the analysis result output by the deep learning model includes the following candidate emotional states: anxiety, anger, irritation, fatigue, and impatience.

Optionally, driving information of the first vehicle is collected; the driving information includes vehicle operation information of the first vehicle and environmental information of the environment in which the first vehicle is located.

Based on the analysis result and the driving information, the emotional state information of the first subject is determined from the candidate emotional states.

Exemplarily, the environmental information indicates that the road on which the first vehicle is located is congested, there are many vehicles on the road, and the following distance between the first vehicle and other vehicles is short. The vehicle operation information indicates that the first vehicle has been shut off for five minutes.

The above driving information thus indicates that the road section where the first vehicle is located has remained at a standstill for a short period, and that the first vehicle has been shut off to save energy. Accordingly, both the analysis result and the driving information indicate that the emotional state of the first subject is impatience and anger.

Exemplarily, an emotional state comparison table is obtained; the table contains correspondences between driving information of the first vehicle and at least one emotional state.

Based on the driving information, a first correspondence matching the driving information is determined from the emotional state comparison table.

Based on the emotional state indicated by the first correspondence, the emotional state information of the first subject is determined from the candidate emotional states.

Schematically, Table 1 below illustrates an emotional state comparison table representing the correspondence between driving information and emotional states, where the driving information indicates the situations or road conditions encountered by the driver while driving the first vehicle.

Table 1

Exemplarily, if the current driving information indicates that the first vehicle has remained stationary for five minutes, the road condition is "traffic stagnation"; if the analysis of the voice data determines that the first subject's pitch has risen, the first subject's emotional state information is very likely "outburst of anger". Therefore, the first correspondence matching the driving information determined from the emotional state comparison table is: traffic stagnation - outburst of anger.

The candidate emotional states include anxiety, anger, irritation, fatigue, and impatience; based on the first correspondence, "anger" is selected from the candidate emotional states and determined as the emotional state information of the first subject.
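In effect, the table lookup intersects the emotions mapped from the driving information with the candidate set produced by the speech model. A minimal sketch, in which the table contents and emotion labels are hypothetical:

```python
# Hypothetical emotional state comparison table: driving condition -> emotions.
EMOTION_TABLE = {
    "traffic_stagnation": {"anger", "impatience"},
    "smooth_road": {"calm", "joy"},
    "night_fatigue_drive": {"fatigue"},
}

def resolve_emotion(driving_condition, candidate_emotions):
    """Keep only the candidate emotions that the matched table entry also
    indicates for the current driving condition; fall back to all candidates
    when the condition has no table entry."""
    table_emotions = EMOTION_TABLE.get(driving_condition, set())
    matched = [e for e in candidate_emotions if e in table_emotions]
    return matched or list(candidate_emotions)
```

With candidates anxiety, anger, irritation, fatigue, and impatience under traffic stagnation, only anger and impatience survive the intersection.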

The types of deep learning models include but are not limited to recurrent neural networks and convolutional neural networks. Taking a recurrent neural network as an example, the training process is as follows: sample voice data is obtained, annotated with emotion tags that indicate the emotional states expressed by the sample voice data; a pre-training model is then trained on the sample voice data to obtain the pre-trained deep learning model.

For example, the sample voice data contains multiple pieces of sub-data, each corresponding to different emotion tags, including first sub-data.

Exemplarily, the first sub-data is the content of a call in which a user, trapped by an elevator failure, asks staff for help via the telephone in the elevator; the emotion tags annotated on the first sub-data are nervousness and fear.

It is worth noting that, since human emotional states are in most cases not singular, each piece of sub-data may correspond to at least one emotion tag, so that the deep learning model learns human emotions more comprehensively and the accuracy of analysis and recognition improves. When training the model on the first sub-data, multiple pieces of sub-data sharing the same emotion tag may be combined to specifically train the model's ability to recognize the emotion indicated by that tag; for example, if sub-data 1 carries tags A, B, and C and sub-data 2 carries tags A and D, sub-data 1 and sub-data 2 are used together as input data to train the model's ability to recognize emotion tag A. Alternatively, the same piece of sub-data may be annotated with different emotion tags, and the model's ability to recognize the emotion indicated by one of those tags is trained on that single piece of sub-data at a time; this embodiment does not limit this.

Optionally, the pre-training model models the time series of the sample voice data to determine its context information and temporal dependencies, where a temporal dependency refers to a characteristic of the sample voice data over the time series.

A predicted emotional state is output based on the context information and temporal dependencies of the sample voice data.

The pre-training model is adjusted based on the training results and the emotion tags to obtain the pre-trained deep learning model.

Exemplarily, sample voice data annotated with emotion tags, including voice samples of different emotional states, is prepared in advance, and a recurrent neural network model is constructed to capture the temporal dependencies in the sample voice data through its sequence-learning capability.

The temporal dependencies of voice data refer to the interrelation of voice signals over the time series, which appears at multiple levels, for example: (1) phoneme dependency: phonemes in speech (such as consonants and vowels) usually do not exist in isolation; the pronunciation of a phoneme is often affected by the preceding and following phonemes, a phenomenon known as liaison or assimilation; (2) phrase and sentence dependency: at the phrase or sentence level, features such as intonation, rhythm, and intensity are affected by sentence structure and semantic content, showing coherence and variability over time; (3) emotional expression: emotional states in speech, such as happiness, sadness, or anger, are expressed over a period of time through features such as volume, pitch, and speaking speed, showing temporal continuity; (4) long-term dependency: in longer speech passages, such as speeches or conversations, certain features of the speech may remain consistent or change gradually throughout the passage, showing long-term temporal dependency.

The structure of the recurrent neural network includes an input layer, a hidden layer, and an output layer. During training, the recurrent neural network models the time series of the voice data, helping the model capture context information and long-term dependencies in the sample voice data.

The recurrent neural network model is trained using the sample voice data annotated with emotion tags; an appropriate loss function (such as cross-entropy loss) and optimizer (such as the Adam optimizer) are used for parameter optimization, and the model parameters are continuously updated through the back-propagation algorithm to minimize the value of the loss function. Finally, the trained model is evaluated on a test set to determine its classification performance on the emotion classification task; the test set also contains multiple pieces of sample voice data whose emotional states/tags are known in advance.
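As a deliberately simplified stand-in for that optimization step, the sketch below trains a linear softmax classifier on precomputed utterance features with cross-entropy loss and plain per-sample gradient descent (no recurrence, no Adam); a constant 1.0 bias feature is appended to each sample, and all names and numbers are illustrative.

```python
import math

def train_softmax(samples, labels, n_classes, lr=0.5, epochs=200):
    """Gradient descent on cross-entropy loss for a linear softmax classifier.
    The cross-entropy gradient per class is (p_c - 1{c == y}) * x."""
    n_feats = len(samples[0])
    w = [[0.0] * n_feats for _ in range(n_classes)]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            scores = [sum(w[c][i] * x[i] for i in range(n_feats)) for c in range(n_classes)]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            probs = [e / sum(exps) for e in exps]
            for c in range(n_classes):
                g = probs[c] - (1.0 if c == y else 0.0)
                for i in range(n_feats):
                    w[c][i] -= lr * g * x[i]
    return w

def predict(w, x):
    """Return the class with the highest score under the learned weights."""
    scores = [sum(wc[i] * x[i] for i in range(len(x))) for wc in w]
    return scores.index(max(scores))
```

On toy two-feature samples (say, normalized pitch and volume plus the bias term), the classifier learns to separate low-arousal from high-arousal utterances.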

Step 330: generate corresponding prompt information based on the emotional state information.

The prompt information is used to instruct that emotional feedback be provided to the first subject.

Exemplarily, the types of prompt information include but are not limited to text prompt information, voice prompt information, animation prompt information, and picture prompt information; this embodiment takes text prompt information as an example.

The emotional state information is input into a pre-trained natural language model, which recognizes and analyzes the emotional state information to determine the emotion type with which the vehicle-mounted terminal provides feedback to the first subject.

Text prompt information matching the emotion type is then generated; it contains the emotional feedback content in text form.

A natural language model (NLM) is a computational model for processing, understanding, and generating human language (that is, natural language). Such models aim to simulate the way humans understand and use language, enabling computers to perform various language-related tasks. A natural language model can understand the structure and meaning of text, perform sentiment analysis on it, identify its emotional tendency (for example, positive, negative, or neutral), and generate natural, fluent language text as feedback.

In some embodiments, the text prompt information may be pre-prepared text content with correspondences to different pieces of emotional state information; for example, when the emotional state information indicates that the first subject is depressed or sad, the corresponding text prompt information is content that comforts and encourages the first subject.

Alternatively, the text prompt information may be determined from text prompt information generated by the natural language model within a historical time period; directly reusing, according to the emotional state information, prompt information already generated in the historical time period improves the efficiency with which the natural language model produces feedback content.

Optionally, historical text prompt information generated by the natural language model within the historical time period is obtained; it indicates the emotional feedback content provided by the vehicle-mounted terminal to the first subject within that period.

Based on the emotion type, information matching the emotion type is selected from the historical text prompt information as the text prompt information.

In some embodiments, the vehicle-mounted terminal has network connectivity and can access publicly available multimedia content on Internet platforms. The emotional state information is input into the pre-trained natural language model, which recognizes and analyzes it to determine the emotion type for the vehicle-mounted terminal's feedback to the first subject. The vehicle-mounted terminal then automatically searches the Internet platform based on the feedback emotion type, collects stories, poems, scripts, news, and other text content matching that emotion type, and obtains the text prompt information from this content. For example, the text content may be summarized to obtain the text prompt information; or keywords may be extracted from the text content and the text prompt information automatically generated from them, where the keywords may be words that appear frequently in the text content or words that summarize it.
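The keyword route mentioned above (pick frequent words, then build the prompt from them) can be sketched with a simple frequency count; the stop-word list and prompt template below are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "in"}

def extract_keywords(text, top_n=3):
    """Pick the most frequent non-stop-words as keywords."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(top_n)]

def build_prompt(keywords):
    """Assemble a text prompt from the extracted keywords (template assumed)."""
    return "Here is something that may lift your mood, about: " + ", ".join(keywords)
```

For example, in a short passage where one word dominates, that word becomes the leading keyword of the generated prompt.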

In some embodiments, a social account of the first subject is logged in on the vehicle-mounted terminal of the first vehicle. The social account is associated with the information system, services, or applications of the first vehicle, and by logging into it the first subject can invoke services and functions inside the first vehicle.

Optionally, the first subject's social account is associated with other social accounts; for example, another social account may be an account the first subject uses on an Internet application platform, or an account a second subject uses on an Internet application platform or in another vehicle, where the other social account and the first subject's social account are friend accounts of each other.

Taking the case where the second subject is logged in with a friend account as an example: after the deep learning model outputs the first subject's emotional state information, that information is sent to the friend account; the second subject edits prompt information based on it and sends it back to the first subject's social account through the friend account, and the vehicle-mounted terminal receives the prompt information fed back by the second subject.

For example, if the emotional state information indicates that the first subject is depressed, the prompt information fed back by the second subject through the social account may be an audio clip with comforting emotional characteristics that contains the second subject's voice.

In some embodiments, after the deep learning model outputs the first subject's emotional state information, the vehicle-mounted terminal generates adjustment suggestions for the first vehicle as prompt information based on the collected vehicle operation information. The vehicle operation information also includes the air temperature and humidity inside the first vehicle and the working states of various on-board devices, such as the brightness of the interior lighting, whether the air conditioner is on, and whether the windows are open.

Exemplarily, the emotional state information indicates that the first subject is angry due to traffic congestion, and the vehicle operation information indicates that the air temperature inside the first vehicle is 30 °C, the air conditioner is off, and the windows are open. The vehicle-mounted terminal then automatically generates the following suggestions: close the windows and turn on the air conditioner. Based on these suggestions, the prompt information reads: "The road is congested and the noise outside is loud, please close the windows; the temperature inside the vehicle is high, please turn on the air conditioner."
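Such rule-based suggestions could be derived from the vehicle operation information as sketched below; the thresholds and field names are assumptions for illustration.

```python
def cabin_suggestions(state):
    """Map vehicle operation info to adjustment suggestions via simple rules."""
    tips = []
    if state.get("road_congested") and state.get("windows_open"):
        tips.append("The road is congested and outside noise is loud; please close the windows.")
    if state.get("cabin_temp_c", 0) >= 28 and not state.get("ac_on"):
        tips.append("The cabin temperature is high; please turn on the air conditioner.")
    return " ".join(tips) if tips else "Cabin settings look comfortable."
```

For the scenario above (congested road, windows open, 30 °C, air conditioner off), both suggestions fire and are concatenated into one prompt.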

Step 340: convert the prompt information into prompt audio and play the prompt audio.

Taking text prompt information as an example, the process of converting text content into audio content, usually called text-to-speech (TTS) conversion, mainly includes the following steps: (1) text preprocessing: the text is segmented into words; (2) voice selection: a pre-recorded or synthesized voice is selected, which may be male or female and may have different age characteristics; (3) phoneme synthesis: the text is decomposed into phonemes or phonetic symbols; (4) prosody generation: the intonation, rhythm, and intensity of the sentence are determined; (5) speech synthesis: a TTS engine converts the phonemes into an audio signal;

(6) audio output: the processed audio is output in a common audio format.
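The six steps can be strung together end to end with a placeholder synthesizer: a real TTS engine replaces step (5) with a learned acoustic model, but emitting one short tone per word through the standard-library `wave` module is enough to show the pipeline shape. Everything here is an illustrative stand-in, not an actual TTS implementation.

```python
import math
import struct
import wave

def synthesize_to_wav(text, path, rate=16000):
    """Toy TTS pipeline: tokenize, then emit one short tone per token
    (placeholder for phoneme synthesis) and write the result as a WAV file.
    Returns the number of samples written."""
    tokens = text.split()  # (1) text preprocessing: crude word segmentation
    samples = []
    for n, _tok in enumerate(tokens):
        freq = 220.0 + 40.0 * n          # (4) crude "prosody": rising pitch per word
        for i in range(rate // 5):       # 0.2 s tone per token (placeholder synthesis)
            samples.append(0.3 * math.sin(2 * math.pi * freq * i / rate))
        samples.extend([0.0] * (rate // 20))  # short pause between words
    with wave.open(path, "wb") as wf:    # (6) output in a common audio format
        wf.setnchannels(1)
        wf.setsampwidth(2)               # 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))
    return len(samples)
```

Playing the resulting file through the in-car audio system would correspond to the final playback step of the method.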

In some embodiments, the emotional state of the first subject may change within a short period of time. Therefore, after the prompt audio is generated, collecting the first subject's voice data in real time through the voice collection component and performing sentiment analysis on it can reflect those emotional changes. If the first subject's emotional state information at the current moment still matches the generated prompt audio, the prompt audio can be played directly; if it has changed, the corresponding prompt audio can be regenerated by combining the changes in the first subject's emotional state information at the current and historical moments.

In some embodiments, when multiple subjects are present in the first vehicle, the voice data collected by the voice collection component is likely to contain conversation between them; depending on the conversation content, sentiment analysis of the voice data can also generate and play prompt audio suitable for providing emotional feedback to the multiple subjects jointly.

Exemplarily, if the pre-trained deep learning model analyzes the voice data and determines that two subjects in the first vehicle are arguing, the prompt audio played by the in-car audio system should be neutral, calm, and conducive to easing the tension. For example, a calming reminder: "Try taking a deep breath and calm down before continuing the conversation"; or a safety-oriented reminder: "Safe driving comes first; if your current emotions are affecting your driving, please consider pulling over for a rest."

It is worth noting that, when providing emotional feedback in different situations, appropriate audio should be selected for the specific scenario. Besides generating audio based on analysis of the collected voice data, locally stored music can also be played to comfort the subjects inside the vehicle, for example, music with a gentle rhythm to help them calm down and stay composed.

In summary, the method for generating prompt audio provided by the present application collects the driver's voice data and analyzes the driver's current emotional state, so that corresponding emotional feedback can be provided to the driver, helping the driver adjust their emotions in time and improving both the driving experience and safety on the road. Because speech is continuous, the changing pitch, volume, and similar cues in the voice data reflect the driver's true emotions more faithfully. Compared with the related-art approach of recognizing the driver's emotions from facial expressions, emotion recognition based on voice data improves the accuracy of the recognition result and therefore yields more appropriate emotional feedback.

FIG. 4 is a structural block diagram of an apparatus for generating prompt audio provided by an exemplary embodiment of the present application. As shown in FIG. 4, the apparatus includes the following parts.

A collection module 410, configured to collect voice data of a first subject, the first subject being located in the first vehicle;

An analysis module 420, configured to analyze the voice data to obtain emotional state information of the first subject, the emotional state information being used to indicate the emotional state of the first subject when expressing the voice data;

A generation module 430, configured to generate corresponding prompt information based on the emotional state information, the prompt information being used to indicate that emotional feedback is to be provided to the first subject;

A conversion module 440, configured to convert the prompt information into prompt audio and play the prompt audio.

In an optional embodiment, the analysis module 420 is further configured to input the voice data into a pre-trained deep learning model, perform emotion analysis on the voice data through the deep learning model, and obtain and output an analysis result containing at least one candidate emotional state; collect driving information of the first vehicle, the driving information including vehicle operation information of the first vehicle and environmental information of the environment in which the first vehicle is located; and determine the emotional state information of the first subject from the candidate emotional states based on the analysis result and the driving information.

In an optional embodiment, the analysis module 420 is further configured to obtain an emotional state comparison table containing correspondences between the driving information of the first vehicle and at least one emotional state; determine, based on the driving information, a first correspondence matching the driving information from the comparison table; and determine the emotional state information of the first subject from the candidate emotional states based on the emotional state indicated by the first correspondence.
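The comparison-table step above can be sketched as a simple lookup that uses driving information to pick among the model's candidate emotional states. The table entries, condition keys, and emotion labels below are illustrative assumptions for demonstration only.

```python
# Illustrative emotional state comparison table: each row maps a
# driving-information condition to the emotional states it supports.
EMOTION_TABLE = [
    ({"speeding": True},       ["angry", "anxious"]),
    ({"traffic_jam": True},    ["irritated", "anxious"]),
    ({"smooth_traffic": True}, ["calm", "happy"]),
]

def resolve_emotion(candidates, driving_info):
    """Pick the first candidate emotion supported by a matching table row."""
    for condition, emotions in EMOTION_TABLE:
        # A row matches when every key/value in its condition holds.
        if all(driving_info.get(k) == v for k, v in condition.items()):
            for emotion in candidates:
                if emotion in emotions:
                    return emotion
    # No supporting row: fall back to the model's top-ranked candidate.
    return candidates[0]
```

For instance, if the model's candidates are "anxious" and "calm" and the vehicle is in a traffic jam, the matching row selects "anxious".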

In an optional embodiment, as shown in FIG. 5, the apparatus further includes:

A training module 450, configured to obtain sample voice data annotated with emotion labels, each emotion label indicating the emotional state expressed by the sample voice data; and train a pre-training model based on the sample voice data to obtain the pre-trained deep learning model.

In an optional embodiment, the training module 450 is further configured to model the time series of the sample voice data through the pre-training model and determine the context information and time dependency of the sample voice data, the time dependency referring to the characteristics of the sample voice data over the time series; output a predicted emotional state based on the context information and the time dependency of the sample voice data; and adjust the pre-training model based on the training result and the emotion labels to obtain the pre-trained deep learning model.
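The idea of carrying context forward across the time series can be illustrated with a toy recurrence. A real embodiment would use a trained sequence model such as an LSTM; this pure-Python stand-in, with assumed feature names and thresholds, only demonstrates how a per-frame signal accumulates into a time-dependent prediction.

```python
def predict_emotion(frame_energies, smoothing=0.8, threshold=0.6):
    """Toy stand-in for the sequence model: carry context forward with a
    simple recurrence (a real embodiment would use e.g. an LSTM).

    frame_energies: per-frame loudness values normalised to [0, 1]
    """
    state = 0.0  # running context carried across the time series
    for energy in frame_energies:
        # Each step mixes the previous state with the new frame,
        # which is what gives the prediction its time dependency.
        state = smoothing * state + (1.0 - smoothing) * energy
    return "agitated" if state > threshold else "calm"
```

Sustained loud frames push the running state above the threshold, while brief spikes are smoothed away — the same qualitative behaviour the trained model's time dependency is meant to capture.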

In an optional embodiment, the generation module 430 is further configured to input the emotional state information into a pre-trained natural language model, identify and analyze the emotional state information through the natural language model, and determine the emotion type with which the in-vehicle terminal provides feedback to the first subject; and generate text prompt information conforming to that emotion type, the text prompt information containing emotional feedback content in text form.
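The two outputs of this step — a feedback emotion type and a matching text prompt — can be sketched with a lookup table. A real system would use the trained natural language model; the mapping, labels, and template sentences below are hypothetical.

```python
# Hypothetical mapping from detected emotional state to the feedback
# emotion type and a template sentence; a trained natural language model
# would replace this table in a real embodiment.
FEEDBACK_STYLES = {
    "angry":   ("soothing",   "Take a deep breath; a short break may help."),
    "anxious": ("reassuring", "You are doing fine; drive at your own pace."),
    "happy":   ("cheerful",   "Glad you are enjoying the drive!"),
}

def generate_text_prompt(emotion_state):
    """Return (feedback emotion type, text prompt) for an emotional state."""
    return FEEDBACK_STYLES.get(
        emotion_state,
        ("neutral", "Drive safely and stay comfortable."),
    )
```

The returned text prompt is what the conversion module would then synthesize into prompt audio.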

In an optional embodiment, the generation module 430 is further configured to obtain historical text prompt information generated by the natural language model within a historical time period, the historical text prompt information indicating the emotional feedback content provided by the in-vehicle terminal to the first subject during that period; and, based on the emotion type, screen information conforming to the emotion type from the historical text prompt information to serve as the text prompt information.
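The screening step above amounts to filtering the history by the required feedback emotion type. The record fields and the reuse-most-recent policy in this sketch are illustrative assumptions.

```python
def select_from_history(history, emotion_type):
    """Filter historical prompts down to those matching the required
    feedback emotion type; record field names are assumptions."""
    matches = [item["text"] for item in history
               if item["emotion_type"] == emotion_type]
    # Reuse the most recent matching prompt, if any; otherwise signal
    # that a fresh prompt must be generated by the language model.
    return matches[-1] if matches else None
```

Reusing historical prompts avoids a round trip through the language model when suitable feedback content already exists.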

In summary, the apparatus for generating prompt audio provided by the present application collects the driver's voice data and analyzes the driver's current emotional state, so that corresponding emotional feedback can be provided to the driver, helping the driver adjust their emotions in time and improving both the driving experience and safety on the road. Because speech is continuous, the changing pitch, volume, and similar cues in the voice data reflect the driver's true emotions more faithfully. Compared with the related-art approach of recognizing the driver's emotions from facial expressions, emotion recognition based on voice data improves the accuracy of the recognition result and therefore yields more appropriate emotional feedback.

It should be noted that the apparatus for generating prompt audio provided in the above embodiments is illustrated only by the division of functional modules described above. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments for generating prompt audio belong to the same concept; the specific implementation process of the apparatus is detailed in the method embodiments and is not repeated here.

FIG. 6 shows a structural block diagram of a computer device 600 provided by an exemplary embodiment of the present application. The computer device 600 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer. The computer device 600 may also be referred to as user equipment, a portable terminal, a laptop terminal, a desktop terminal, or by other names.

Typically, the computer device 600 includes a processor 601 and a memory 602.

The processor 601 may include one or more processing cores, such as a 4-core or 8-core processor. The processor 601 may be implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 601 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.

The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 602 stores at least one instruction to be executed by the processor 601 to implement the method for generating prompt audio provided by the method embodiments of the present application.

In some embodiments, the computer device 600 further includes other components 603, whose types and quantities may be selected according to the functional requirements of the computer device 600. Those skilled in the art will appreciate that the structure shown in FIG. 6 does not limit the computer device 600, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.

Optionally, the computer-readable storage medium may include read-only memory (ROM), random access memory (RAM), a solid-state drive (SSD), an optical disc, or the like. The random access memory may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM). The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.

An embodiment of the present application further provides a computer device including a processor and a memory. The memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the method for generating prompt audio described in any of the above embodiments of the present application.

An embodiment of the present application further provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for generating prompt audio described in any of the above embodiments of the present application.

An embodiment of the present application further provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method for generating prompt audio described in any of the above embodiments.

A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.

The above description covers only optional embodiments of the present application and is not intended to limit the present application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (10)

1. A method for generating prompt audio, performed by an in-vehicle terminal of a first vehicle, the method comprising:
collecting voice data of a first subject, wherein the first subject is located in the first vehicle;
analyzing the voice data to obtain emotional state information of the first subject, wherein the emotional state information is used to indicate the emotional state of the first subject when expressing the voice data;
generating corresponding prompt information based on the emotional state information, wherein the prompt information is used to indicate that emotional feedback is to be provided to the first subject; and
converting the prompt information into prompt audio and playing the prompt audio.
2. The method of claim 1, wherein analyzing the voice data to obtain the emotional state information of the first subject comprises:
inputting the voice data into a pre-trained deep learning model, and performing emotion analysis on the voice data through the deep learning model to obtain and output an analysis result, wherein the analysis result comprises at least one candidate emotional state;
collecting driving information of the first vehicle, wherein the driving information of the first vehicle comprises vehicle operation information of the first vehicle and environmental information of the environment in which the first vehicle is located; and
determining the emotional state information of the first subject from the candidate emotional states based on the analysis result and the driving information.
3. The method of claim 2, wherein determining the emotional state information of the first subject from the candidate emotional states based on the analysis result and the driving information comprises:
obtaining an emotional state comparison table, wherein the comparison table comprises correspondences between the driving information of the first vehicle and at least one emotional state;
determining, based on the driving information, a first correspondence matching the driving information from the comparison table; and
determining the emotional state information of the first subject from the candidate emotional states based on the emotional state indicated by the first correspondence.
4. The method of claim 2, further comprising, before inputting the voice data into the pre-trained deep learning model:
obtaining sample voice data, wherein the sample voice data is annotated with emotion labels, and the emotion labels are used to indicate the emotional state expressed by the sample voice data; and
training a pre-training model based on the sample voice data to obtain the pre-trained deep learning model.
5. The method of claim 4, wherein training the pre-training model based on the sample voice data to obtain the pre-trained deep learning model comprises:
modeling a time series of the sample voice data through the pre-training model, and determining context information and a time dependency of the sample voice data, wherein the time dependency refers to characteristics of the sample voice data over the time series;
outputting a predicted emotional state based on the context information and the time dependency of the sample voice data; and
adjusting the pre-training model based on the training result and the emotion labels to obtain the pre-trained deep learning model.
6. The method of claim 1, wherein generating the corresponding prompt information based on the emotional state information comprises:
inputting the emotional state information into a pre-trained natural language model, and identifying and analyzing the emotional state information through the natural language model to determine an emotion type with which the in-vehicle terminal provides feedback to the first subject; and
generating text prompt information conforming to the emotion type, wherein the text prompt information comprises emotional feedback content in text form.
7. The method of claim 6, wherein after determining the emotion type with which the in-vehicle terminal provides feedback to the first subject, the method further comprises:
obtaining historical text prompt information generated by the natural language model within a historical time period, wherein the historical text prompt information is used to indicate the emotional feedback content provided by the in-vehicle terminal to the first subject within the historical time period; and
screening, based on the emotion type, information conforming to the emotion type from the historical text prompt information to serve as the text prompt information.
8. An apparatus for generating prompt audio, the apparatus comprising:
a collection module, configured to collect voice data of a first subject, wherein the first subject is located in a first vehicle;
an analysis module, configured to analyze the voice data to obtain emotional state information of the first subject, wherein the emotional state information is used to indicate the emotional state of the first subject when expressing the voice data;
a generation module, configured to generate corresponding prompt information based on the emotional state information, wherein the prompt information is used to indicate that emotional feedback is to be provided to the first subject; and
a conversion module, configured to convert the prompt information into prompt audio and play the prompt audio.
9. A computer device comprising a processor and a memory, wherein the memory stores at least one program, and the at least one program is loaded and executed by the processor to implement the method for generating prompt audio according to any one of claims 1 to 7.
10. A computer-readable storage medium storing at least one program, wherein the at least one program is loaded and executed by a processor to implement the method for generating prompt audio according to any one of claims 1 to 7.
CN202410952680.0A 2024-07-16 2024-07-16 Method, device, equipment, and medium for generating prompt audio Pending CN118692444A (en)

Publications (1)

Publication Number Publication Date
CN118692444A true CN118692444A (en) 2024-09-24



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination