CN118430538A

CN118430538A - Error correction multi-mode model construction method, system, equipment and medium

Info

Publication number: CN118430538A
Application number: CN202410593278.8A
Authority: CN
Inventors: 冯晨
Original assignee: Pacific Insurance Technology Co Ltd
Current assignee: Pacific Insurance Technology Co Ltd
Priority date: 2024-05-13
Filing date: 2024-05-13
Publication date: 2024-08-02

Abstract

The application discloses an error correction multi-mode model construction method, a system, equipment and a medium, which are used for acquiring audio information; performing voice separation processing and transcription processing on the audio information to obtain a processing result; and combining the processing result with a corresponding real result to obtain a fine tuning sample, inputting a preparation model, adjusting the preparation model based on the fine tuning sample to obtain an error correction multi-modal model, wherein the preparation model comprises a linear layer and a large-scale language model in a frozen state, and the projection layer aligns a voice encoder for human voice separation with the large-scale voice model in the frozen state. Based on the multi-mode large model, two mode information are simultaneously utilized, the strong zero-shot capability is utilized, a more complex conference scene is adapted, two modules of a conference transcription system are optimized, and the accuracy of conference transcription is improved.

Description

A method, system, device and medium for constructing an error correction multimodal model

技术领域Technical Field

本申请涉及计算机技术领域，特别是涉及一种纠错多模态模型构建方法、系统、设备及介质。The present application relates to the field of computer technology, and in particular to a method, system, device and medium for constructing an error correction multimodal model.

背景技术Background technique

纠错多模态模型构建系统的主要任务是将会议中的语音内容转化为文字形式，方便参会者查阅和记录。当会议人数逐渐增多，环境变得愈发嘈杂时，已有的会议质检系统会遭受到前所未有的挑战。例如在一场大型的线上会议中，数十甚至上百位参会者同时发言，他们的语音交织在一起，背景中还夹杂着键盘敲击声、纸张翻动声以及偶尔传来的门铃声等。系统需要处理的信息量急剧增加，同时语音信号的清晰度也会大幅下降。这导致人声分离和转写两个模块在纠错时面临着巨大的困难。The main task of the error correction multimodal model construction system is to convert the voice content in the meeting into text form, so that participants can check and record it. When the number of people in the meeting gradually increases and the environment becomes more and more noisy, the existing meeting quality inspection system will face unprecedented challenges. For example, in a large online meeting, dozens or even hundreds of participants speak at the same time, their voices are intertwined, and the background is also mixed with the sound of keyboard tapping, paper shuffling, and occasional doorbells. The amount of information that the system needs to process increases dramatically, and the clarity of the voice signal also drops significantly. This causes the two modules of voice separation and transcription to face huge difficulties in error correction.

人声分离模块的主要任务是将不同人的语音信号分离开来，确保每个人的发言都能被准确识别。然而，在嘈杂的环境中，不同人的语音信号往往会发生重叠和干扰，使得分离变得异常困难。此外，由于每个人的语音特征都有所不同，分离模块还需要具备强大的识别能力，以应对各种复杂的语音情况。转写模块则负责将分离后的语音信号转化为文字。然而，由于语音信号的质量问题以及分离模块可能存在的误差，转写模块在转写过程中也容易出现错误。这些错误可能包括识别错误、漏字、多字等，严重影响了转录结果的准确性。The main task of the voice separation module is to separate the voice signals of different people to ensure that everyone's speech can be accurately recognized. However, in a noisy environment, the voice signals of different people often overlap and interfere, making separation extremely difficult. In addition, since everyone's voice characteristics are different, the separation module also needs to have strong recognition capabilities to cope with various complex voice situations. The transcription module is responsible for converting the separated voice signals into text. However, due to the quality of the voice signal and the possible errors in the separation module, the transcription module is also prone to errors in the transcription process. These errors may include recognition errors, missing words, extra words, etc., which seriously affect the accuracy of the transcription results.

由于人声分离和转写两个模块涉及两种不同的模态(语音和文本)，它们的纠错方案通常是割裂的。这意味着在优化这两个模块时，往往只能针对其中一个模块进行改进，而无法同时考虑到它们之间的相互影响。这导致了纠错多模态模型构建的整体性能的提升受限，进而影响了纠错多模态模型构建的准确性。Since the voice separation and transcription modules involve two different modalities (speech and text), their error correction schemes are usually separated. This means that when optimizing these two modules, only one of them can be improved, and the mutual influence between them cannot be considered at the same time. This leads to the limitation of the overall performance improvement of the error correction multimodal model construction, which in turn affects the accuracy of the error correction multimodal model construction.

发明内容Summary of the invention

基于上述问题，本申请提供了一种纠错多模态模型构建方法、系统、设备及介质，用以提高会议转录的准确性。Based on the above problems, the present application provides a method, system, device and medium for constructing an error-correcting multimodal model to improve the accuracy of conference transcription.

为解决上述问题，本申请实施例提供的技术方案如下：To solve the above problems, the technical solutions provided in the embodiments of the present application are as follows:

本申请第一方面提供了一种纠错多模态模型构建方法，包括：The first aspect of the present application provides a method for constructing an error correction multimodal model, comprising:

获取音频信息；Get audio information;

对所述音频信息进行人声分离处理和转写处理，得到处理结果；Performing voice separation and transcription processing on the audio information to obtain a processing result;

将所述处理结果与相对应的真实结果组合得到微调样本，输入预备模型，基于所述微调样本对所述预备模型进行调整，得到纠错多模态模型，所述预备模型包括线性层和处于冻结状态的大型语言模型，所述投影层将用于人声分离的语音编码器与处于冻结状态的语音大模型对齐。The processing result is combined with the corresponding real result to obtain a fine-tuning sample, which is input into a preparation model. The preparation model is adjusted based on the fine-tuning sample to obtain an error-correcting multimodal model, wherein the preparation model includes a linear layer and a large language model in a frozen state, and the projection layer aligns the speech encoder for voice separation with the large speech model in a frozen state.

在一种可能的实现方式中，所述对所述音频信息进行人声分离处理和转写处理，得到处理结果，包括：In a possible implementation, performing voice separation and transcription processing on the audio information to obtain a processing result includes:

将所述音频信息输入静音检测模块，并剔除所述音频信息中的静音部分；Inputting the audio information into a silence detection module, and removing the silence part in the audio information;

将剔除后得到的音频信息分别输入基于说话者的自动语音识别模块和人声分离模块，得到处理结果，所述自动语音识别模块用于利用声学模型和语言模型，将所述音频信息中非静音部分的音频帧转换为文本信息，所述人声分离模块用于根据说话者特征将音频帧分配给不同的说话者。The audio information obtained after the elimination is respectively input into a speaker-based automatic speech recognition module and a human voice separation module to obtain processing results. The automatic speech recognition module is used to convert the audio frames of the non-silent part of the audio information into text information using an acoustic model and a language model, and the human voice separation module is used to assign audio frames to different speakers according to speaker characteristics.

在一种可能的实现方式中，所述处理结果包括自动语音识别模块处理得到的第一结果，所述第一结果包括每一帧音频被分析并转换得到的文本单元，所述文本单元为音素或单词的一部分。In a possible implementation, the processing result includes a first result obtained by processing an automatic speech recognition module, and the first result includes a text unit obtained by analyzing and converting each frame of audio, and the text unit is a phoneme or a part of a word.

在一种可能的实现方式中，所述处理结果包括自动语音识别模块处理得到的第二结果，所述第二结果是基于说话者的音色特征和语速特征将音频帧进行分配得到的。In a possible implementation, the processing result includes a second result obtained by processing an automatic speech recognition module, where the second result is obtained by allocating audio frames based on the timbre characteristics and speech speed characteristics of the speaker.

在一种可能的实现方式中，所述将所述处理结果与相对应的真实结果组合得到微调样本，输入预备模型之前，还包括：In a possible implementation, the combining of the processing result with the corresponding real result to obtain a fine-tuning sample, before inputting the fine-tuning sample into a preparation model, further includes:

获取所述处理结果的特征标签；Obtaining a feature label of the processing result;

所述将所述处理结果与相对应的真实结果组合得到微调样本，包括：The step of combining the processing result with the corresponding real result to obtain a fine-tuning sample includes:

将所述特征标签，所述处理结果，以及与所述处理结果相对应的真实结果组合得到微调样本。The feature label, the processing result, and a true result corresponding to the processing result are combined to obtain a fine-tuning sample.

所述将所述处理结果与相对应的真实结果组合得到微调样本，输入预备模型，包括：The step of combining the processing result with the corresponding real result to obtain a fine-tuning sample and inputting the fine-tuning sample into a preparation model comprises:

将所述特征标签和所述处理结果组合得到微调样本，输入预备模型中的投影层；Combining the feature labels and the processing results to obtain fine-tuning samples, and inputting the samples into the projection layer in the preparation model;

将经过投影层梳理的微调样本中加入所述特征标签，输入预备模型中的处于冻结状态的语音大模型。The feature labels are added to the fine-tuned samples combed by the projection layer and input into the frozen large speech model in the preparation model.

本申请第二方面提供了一种纠错多模态模型构建系统，包括：The second aspect of the present application provides a system for constructing an error correction multimodal model, comprising:

第一获取单元，用于获取音频信息；A first acquisition unit, used to acquire audio information;

处理单元，用于对所述音频信息进行人声分离处理和转写处理，得到处理结果；A processing unit, used for performing voice separation and transcription processing on the audio information to obtain a processing result;

输入单元，用于将所述处理结果与相对应的真实结果组合得到微调样本，输入预备模型，基于所述微调样本对所述预备模型进行调整，得到纠错多模态模型，所述预备模型包括线性层和处于冻结状态的大型语言模型，所述投影层将用于人声分离的语音编码器与处于冻结状态的语音大模型对齐。The input unit is used to combine the processing result with the corresponding real result to obtain a fine-tuning sample, input the fine-tuning sample into a preparation model, adjust the preparation model based on the fine-tuning sample, and obtain an error-correcting multimodal model. The preparation model includes a linear layer and a large language model in a frozen state, and the projection layer aligns the speech encoder used for voice separation with the large speech model in a frozen state.

在一种可能的实现方式中，所述系统还包括：In a possible implementation, the system further includes:

第二获取单元，用于获取所述处理结果的特征标签；A second acquisition unit, used to acquire a feature label of the processing result;

本申请第三方面提供了一种电子设备，包括：存储器，处理器，及存储在所述存储器上并可在所述处理器上运行的计算机程序，所述处理器执行所述计算机程序时，实现前述第一方面所述的纠错多模态模型构建方法。The third aspect of the present application provides an electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the error correction multimodal model construction method described in the first aspect is implemented.

本申请第四方面提供了一种计算机可读存储介质，所述计算机可读存储介质中存储有指令，当所述指令在终端设备上运行时，使得所述终端设备执行如前述第一方面所述的纠错多模态模型构建方法。The fourth aspect of the present application provides a computer-readable storage medium, in which instructions are stored. When the instructions are executed on a terminal device, the terminal device executes the error correction multimodal model construction method as described in the first aspect above.

相较于现有技术，本申请具有以下有益效果：Compared with the prior art, this application has the following beneficial effects:

结合人声分离的语音编码器与语言大模型，实现从音频到文本的多模态处理。这种融合不仅能够提升对语音信号的理解深度，还能利用语言模型丰富的上下文知识进行更精准的转录和纠错。通过投影层将人声分离的语音编码器与冻结的LLM对齐，确保音频特征能够与LLM的文本处理能力有效结合。这种对齐方式不仅提高了模型处理音频数据的效率，还使得模型能够更准确地识别语音内容。相对于传统方案中纠错方法往往只针对单一模块进行，本申请基于多模态大模型同时利用两个模态信息，利用其强大的在未接受过特定任务或类别直接训练的情况下，能够对新任务或未见过的类别进行预测或分类的能力，适配更复杂的会议场景，提高会议转录的准确性。Combine the speech encoder for voice separation with the large language model to achieve multimodal processing from audio to text. This fusion can not only improve the depth of understanding of speech signals, but also use the rich contextual knowledge of the language model for more accurate transcription and error correction. The speech encoder for voice separation is aligned with the frozen LLM through the projection layer to ensure that the audio features can be effectively combined with the text processing capabilities of the LLM. This alignment method not only improves the efficiency of the model in processing audio data, but also enables the model to more accurately recognize the voice content. Compared with the traditional solution in which the error correction method is often only performed on a single module, this application is based on a multimodal large model that uses two modal information at the same time, and uses its powerful ability to predict or classify new tasks or unseen categories without direct training for specific tasks or categories, to adapt to more complex meeting scenarios and improve the accuracy of meeting transcription.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

为更清楚地说明本实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in this embodiment or the prior art, the drawings required for use in the embodiments or the description of the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application. For ordinary technicians in this field, other drawings can be obtained based on these drawings without creative work.

图1为本申请实施例提供的一种纠错多模态模型构建方法的流程图；FIG1 is a flow chart of a method for constructing an error correction multimodal model provided in an embodiment of the present application;

图2为本申请实施例提供的纠错多模态模型构建过程示意图；FIG2 is a schematic diagram of a process for constructing an error correction multimodal model provided in an embodiment of the present application;

图3为本申请实施例提供的混合加密流程；FIG3 is a hybrid encryption process provided by an embodiment of the present application;

图4为本申请实施例所提供的一种纠错多模态模型构建系统结构图。FIG4 is a structural diagram of an error correction multimodal model building system provided in an embodiment of the present application.

具体实施方式Detailed ways

为了使本技术领域的人员更好地理解本申请方案，下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of this application.

为便于理解本申请实施例提供的技术方案，下面将先对本申请实施例涉及的技术术语进行说明。To facilitate understanding of the technical solutions provided by the embodiments of the present application, the technical terms involved in the embodiments of the present application will be explained below.

Zero-shot能力，也被称为Zero-shot学习(Zero-shot Learning)，是指模型在训练阶段没有见过或训练过与测试阶段完全相同的类别或任务的情况下，仍然能够对这些新的类别或任务进行准确的预测或推断。这种能力是由先进的深度学习模型和迁移学习方法实现的。Zero-shot capability, also known as zero-shot learning, means that the model can still make accurate predictions or inferences about new categories or tasks even if it has not seen or trained the same categories or tasks as the test phase during the training phase. This capability is achieved by advanced deep learning models and transfer learning methods.

为便于理解本申请实施例提供的技术方案，下面将先对本申请实施例涉及的背景技术进行说明。To facilitate understanding of the technical solutions provided by the embodiments of the present application, the background technology involved in the embodiments of the present application will be described below.

参见图1，图1为本申请提供的串行输出训练方案流程示意图，其中，包括静音检测(Vad)模块、静音检测(Vad)模块、人声分离(Diarization)模块。Frame-level diarizationwith SOT(FD-SOT):这指的是基于帧级别的串行输出训练的人声分离方法。它处理音频帧以识别不同的说话者。Multi-talker:表示有多个说话者在同时进行语音交流。Speaker-attributedASR:这是基于说话者特征的自动语音识别系统，它能根据每个说话者的声音特性进行转录。W1,W2,W3/W4,W5:这些表示不同的音频通道或流，可以代表不同说话者或不同音频源的录音。Monaural transcription:单声道转录，意味着音频信号是从单一声道(而非立体声或多声道)进行处理的。Oracle VAD:理想化的语音活动检测器，用于准确检测语音的开始和结束。Long-term audio:指的是长时间录制的音频，可能是会议或其他持续活动的录音。Spkr 1&Spkr 2:分别代表两个说话者(或更多，但图中只标注了Spkr 1和Spkr2)。Frame-level:指处理和分析的单位是音频帧。Speaker Profile:说话者特征，用于区分不同的说话者。Diarization:人声分离技术，用于将混合的语音信号划分为不同说话者的语音段。speaker-attributed transcription指的是说话者属性转录，即将转录的文本与会议中的不同说话者进行关联或归属。在多人参与的会议中，这种转录方式能够区分并标识出不同说话者的发言内容，使得转录结果更加清晰、有条理。Refer to Figure 1, which is a flow chart of the serial output training scheme provided by the present application, which includes a silence detection (Vad) module, a silence detection (Vad) module, and a voice separation (Diarization) module. Frame-level diarization with SOT (FD-SOT): This refers to a voice separation method based on frame-level serial output training. It processes audio frames to identify different speakers. Multi-talker: It means that there are multiple speakers communicating at the same time. Speaker-attributed ASR: This is an automatic speech recognition system based on speaker features, which can transcribe according to the voice characteristics of each speaker. W1, W2, W3/W4, W5: These represent different audio channels or streams, which can represent recordings of different speakers or different audio sources. Monaural transcription: Monophonic transcription means that the audio signal is processed from a single channel (not stereo or multi-channel). Oracle VAD: An idealized voice activity detector for accurately detecting the beginning and end of speech. Long-term audio: refers to audio recorded for a long time, which may be recordings of meetings or other continuous activities. Spkr 1 & Spkr 2: represent two speakers respectively (or more, but only Spkr 1 and Spkr2 are marked in the figure). Frame-level: refers to the unit of processing and analysis is the audio frame. Speaker Profile: speaker characteristics, used to distinguish different speakers. Diarization: voice separation technology, used to divide mixed speech signals into speech segments of different speakers. Speaker-attributed transcription refers to speaker attribute transcription, that is, associating or attributing the transcribed text with different speakers in the meeting. In a meeting with multiple participants, this transcription method can distinguish and identify the speeches of different speakers, making the transcription results clearer and more organized.

第一个流程：基于帧的消歧(Frame-level Diarizationwith SOT，FD-SOT)The first process: frame-level diarization with SOT (FD-SOT)

在这个流程中，纠错多模态模型构建系统通过串行输出训练的方式处理多说话者的音频信号。首先，静音检测(Vad模块)接收原始的音频信号，并检测出其中的静音部分。去除静音部分后，音频信号被传递给基于说话者的自动语音识别(ASR)模块。ASR模块利用声学模型和语言模型，将非静音部分的音频帧转换为文本。这里，每一帧音频都被分析并转换为相应的文本单元，可能是音素或单词的一部分。随后，人声分离(Diarization模块)在帧级别上进行操作。它接收ASR模块输出的文本帧，并尝试根据说话者特征(如音色、语速等)将这些帧分配给不同的说话者。这个过程涉及对帧的声学特征进行分析，以区分不同说话者的语音。最终，系统输出的是已经转录为文本并标记了说话者信息的音频帧序列。In this process, the error correction multimodal model building system processes multi-speaker audio signals by serial output training. First, the silence detection (Vad module) receives the original audio signal and detects the silent part. After removing the silent part, the audio signal is passed to the speaker-based automatic speech recognition (ASR) module. The ASR module converts the audio frames of the non-silent part into text using the acoustic model and language model. Here, each frame of audio is analyzed and converted into the corresponding text unit, which may be a phoneme or part of a word. Subsequently, the human voice separation (Diarization module) operates at the frame level. It receives the text frames output by the ASR module and tries to assign these frames to different speakers based on speaker characteristics (such as timbre, speaking speed, etc.). This process involves analyzing the acoustic features of the frames to distinguish the speech of different speakers. Finally, the system outputs a sequence of audio frames that have been transcribed into text and labeled with speaker information.

第二个流程：基于词的消歧(Word-level Diarizationwith SOT，WD-SOT)The second process: Word-level Diarization with SOT (WD-SOT)

与第一个流程类似，第二个流程也包含静音检测(Vad模块)和基于说话者的自动语音识别(ASR)模块。然而，在人声分离(Diarization模块)阶段，第二个流程在更细的粒度上进行操作。它接收ASR模块输出的文本，并在词级别上进行消歧。这意味着Diarization模块不仅分析每个词的声学特征，还考虑词的上下文信息和说话者特征，以更准确地确定每个词所属的说话者。通过基于词的消歧，系统能够更精确地识别不同说话者的语音，并在转录结果中标记出每个词的说话者信息。这对于需要精确区分说话者的应用场景(如会议记录、多人对话分析等)尤为重要。Similar to the first process, the second process also contains silence detection (Vad module) and speaker-based automatic speech recognition (ASR) modules. However, in the voice separation (Diarization module) stage, the second process operates at a finer granularity. It receives the text output by the ASR module and performs disambiguation at the word level. This means that the Diarization module not only analyzes the acoustic features of each word, but also considers the contextual information and speaker characteristics of the word to more accurately determine the speaker to which each word belongs. Through word-based disambiguation, the system can more accurately identify the speech of different speakers and mark the speaker information of each word in the transcription results. This is especially important for application scenarios that require accurate distinction between speakers (such as meeting records, multi-person conversation analysis, etc.).

首先，线上会议的接入方式多样化意味着参会者可能通过不同的设备、平台或网络接入会议，这导致了音频信号的质量和格式可能存在差异。参会用户不固定意味着每次会议的参与者可能不同，他们的语音特征、口音、语速等也会有所不同。而所处的环境不固定则意味着会议可能发生在各种嘈杂或安静的环境中，背景噪音、回声等问题也可能对纠错多模态模型构建造成干扰。First, the diversity of online conference access methods means that participants may access the conference through different devices, platforms or networks, which may lead to differences in the quality and format of the audio signal. The non-fixed participants mean that the participants in each meeting may be different, and their voice characteristics, accents, speaking speed, etc. may also be different. The non-fixed environment means that the meeting may take place in a variety of noisy or quiet environments, and background noise, echo and other problems may also interfere with the construction of the error correction multimodal model.

由于这些纷杂的场景都集中在一场会议，中纠错多模态模型构建系统需要处理的信息量巨大且复杂。这导致纠错多模态模型构建结果并不理想，主要体现在以下几个方面：Since all these complicated scenes are concentrated in one meeting, the amount of information that the error correction multimodal model construction system needs to process is huge and complex. This leads to unsatisfactory results of error correction multimodal model construction, which is mainly reflected in the following aspects:

当会议人数变多、环境变得嘈杂时，会议质检系统确实会面临巨大的挑战。在这种复杂场景下，人声分离和转写两个模块的纠错工作变得尤为关键，而且这两个模块分别涉及语音和文本两种模态，使得它们的优化方案常常是独立的，缺乏整体性的考虑。When the number of participants increases and the environment becomes noisy, the conference quality inspection system will indeed face huge challenges. In such a complex scenario, the error correction work of the two modules of voice separation and transcription becomes particularly critical. Moreover, these two modules involve two modes, voice and text, respectively, which makes their optimization solutions often independent and lack holistic consideration.

传统的纠错方法往往只针对单一模块进行，例如单独优化人声分离模块以提升语音识别的准确性，或者单独优化转写模块以提高文本输出的质量。然而，这种方法忽略了两个模块之间的内在联系和相互影响，导致整体性能的提升有限。Traditional error correction methods often only target a single module, such as optimizing the voice separation module to improve the accuracy of speech recognition, or optimizing the transcription module to improve the quality of text output. However, this method ignores the intrinsic connection and mutual influence between the two modules, resulting in limited improvement in overall performance.

为了解决这个问题，本申请可以利用MLLM(文本、语音)强大的模态对齐能力，对两个模块进行联合纠正。MLLM能够同时处理文本和语音两种模态的数据，通过捕捉它们之间的内在关联，实现跨模态的信息融合和纠错。To solve this problem, this application can use the powerful modality alignment capability of MLLM (text, speech) to jointly correct the two modules. MLLM can process data in both text and speech modalities at the same time, and realize cross-modal information fusion and error correction by capturing the intrinsic relationship between them.

需要说明的是，本申请提供的纠错多模态模型构建方法、系统、设备及介质，可应用于计算机技术领域。上述仅为示例，并不对本申请提供的纠错多模态模型构建方法、系统、设备及介质的应用领域进行限定。另外，本申请实施例亦可不限定纠错多模态模型构建的执行主体，例如，本申请实施例的纠错多模态模型构建方法可以应用于终端设备或服务器等数据处理设备。其中，终端设备可以为智能手机、计算机、个人数字助理(PersonalDigital Assistant，PDA)、平板电脑等电子设备。服务器可以为独立服务器、云服务器或者由多台服务器组成的集群服务器。It should be noted that the error correction multimodal model construction method, system, device and medium provided in the present application can be applied to the field of computer technology. The above is only an example and does not limit the application field of the error correction multimodal model construction method, system, device and medium provided in the present application. In addition, the embodiments of the present application may not limit the execution subject of the error correction multimodal model construction. For example, the error correction multimodal model construction method of the embodiment of the present application can be applied to data processing devices such as terminal devices or servers. Among them, the terminal device can be an electronic device such as a smart phone, a computer, a personal digital assistant (PDA), a tablet computer, etc. The server can be a stand-alone server, a cloud server, or a cluster server composed of multiple servers.

为了使本申请实施例的目的、技术方案和优点更加清楚，下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。In order to make the purpose, technical solution and advantages of the embodiments of the present application clearer, the technical solution in the embodiments of the present application will be clearly and completely described below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by ordinary technicians in this field without creative work are within the scope of protection of this application.

以下通过一个实施例，对本申请提供的一种纠错多模态模型构建方法进行说明。参见图2，该图2为本申请实施例提供的一种纠错多模态模型构建方法的流程图。其中包括：The following is an example of an error correction multimodal model construction method provided by the present application. See Figure 2, which is a flow chart of an error correction multimodal model construction method provided by an embodiment of the present application. It includes:

S101、获取音频信息。S101. Acquire audio information.

在实际应用场景中，可以用一个多说话者、单声道音频的输入源用于提供包括声学特征的音频信息。这个输入源可能包括各种会议录音、访谈、讲座等。In actual application scenarios, a multi-speaker, mono audio input source can be used to provide audio information including acoustic features. This input source may include various conference recordings, interviews, lectures, etc.

S102、对所述音频信息进行人声分离处理和转写处理，得到处理结果。S102: performing voice separation processing and transcription processing on the audio information to obtain a processing result.

将所述音频信息输入静音检测模块，并剔除所述音频信息中的静音部分；将剔除后得到的音频信息分别输入基于说话者的自动语音识别模块和人声分离模块，得到处理结果，所述自动语音识别模块用于利用声学模型和语言模型，将所述音频信息中非静音部分的音频帧转换为文本信息，所述人声分离模块用于根据说话者特征将音频帧分配给不同的说话者。The audio information is input into a silence detection module, and the silence part in the audio information is removed; the audio information obtained after the removal is respectively input into a speaker-based automatic speech recognition module and a human voice separation module to obtain a processing result, wherein the automatic speech recognition module is used to convert the audio frames of the non-silent part in the audio information into text information by using an acoustic model and a language model, and the human voice separation module is used to assign the audio frames to different speakers according to speaker characteristics.

可以使用ASR(Automatic Speech Recognition，自动语音识别)技术来提取这些音频中的语音特征，并将这些特征转换为时间序列数据。ASR技术能够识别出音频中的语音内容，并将其初步转换为文本形式，这构成了预测结果(Predicted Transcriptions)。ASR (Automatic Speech Recognition) technology can be used to extract speech features from these audios and convert these features into time series data. ASR technology can recognize the speech content in the audio and convert it into text form, which constitutes the predicted results (Predicted Transcriptions).

即利用ASR模型(如深度学习模型)分析每一帧音频，并将其转换为对应的文本单元。这些文本单元可以是音素或单词的一部分，具体取决于ASR模型的训练目标和设计。That is, an ASR model (such as a deep learning model) is used to analyze each frame of audio and convert it into corresponding text units. These text units can be parts of phonemes or words, depending on the training objectives and design of the ASR model.

通过结合说话者音色和语速特征的音频帧分配与识别，最终得到第二结果。这个第二结果不仅包含了音频内容的识别结果(即文本)，还包含了每个文本对应的说话者信息。这使得会议转录系统能够更准确地识别出不同说话者的内容，并进行后续的分析和处理。By combining the allocation and recognition of audio frames with the speaker's timbre and speech rate characteristics, the second result is finally obtained. This second result not only contains the recognition result of the audio content (i.e., text), but also contains the speaker information corresponding to each text. This enables the conference transcription system to more accurately identify the content of different speakers and conduct subsequent analysis and processing.

S103、将所述处理结果与相对应的真实结果组合得到微调样本，输入预备模型，基于所述微调样本对所述预备模型进行调整，得到纠错多模态模型，所述预备模型包括线性层和处于冻结状态的大型语言模型，所述投影层将用于人声分离的语音编码器与处于冻结状态的语音大模型对齐。S103. Combining the processing result with the corresponding real result to obtain a fine-tuning sample, inputting the fine-tuning sample into a preparation model, adjusting the preparation model based on the fine-tuning sample to obtain an error-correcting multimodal model, wherein the preparation model includes a linear layer and a large language model in a frozen state, and the projection layer aligns the speech encoder for voice separation with the large speech model in a frozen state.

在一种可能的实现方式中，冻结的语音大模型LLM是一个预先训练好的大型语言模型，它具备强大的语言理解和生成能力。由于LLM已经被大量数据训练过，因此可以将其冻结，即不再更新其参数，而直接用于新的任务中。在纠错多模态模型构建中，可以利用LLM的语言理解能力来纠正转录文本中的错误。通过将编码后的语音特征输入到LLM中，可以得到与语音内容相对应的文本表示，从而实现对转录文本的纠错。In one possible implementation, the frozen speech model LLM is a pre-trained large language model with powerful language understanding and generation capabilities. Since the LLM has been trained with a large amount of data, it can be frozen, that is, its parameters are no longer updated, and it can be directly used in new tasks. In the construction of the error correction multimodal model, the language understanding ability of the LLM can be used to correct errors in the transcribed text. By inputting the encoded speech features into the LLM, a text representation corresponding to the speech content can be obtained, thereby realizing error correction of the transcribed text.

投影层的作用是将语音编码器的输出与LLM的输入进行对齐。这通常涉及到将语音编码器的输出转换为适合LLM处理的格式或维度。虽然LLM的参数被冻结，但投影层的参数是可训练的。通过训练投影层，可以使其能够将语音编码器的输出有效地传递给LLM。The role of the projection layer is to align the output of the speech encoder with the input of the LLM. This usually involves converting the output of the speech encoder into a format or dimension suitable for processing by the LLM. While the parameters of the LLM are frozen, the parameters of the projection layer are trainable. By training the projection layer, it can be enabled to effectively pass the output of the speech encoder to the LLM.

将训练好的投影层与语音编码器和LLM进行集成，形成一个完整的纠错多模态模型。该模型能够同时处理语音和文本信息。可以在模型中添加纠错机制，利用LLM的语言处理能力来纠正转录中的错误。这可以通过比较LLM生成的文本与原始转录文本，并识别并修正其中的差异来实现。The trained projection layer is integrated with the speech encoder and LLM to form a complete error-correcting multimodal model. This model is capable of processing both speech and text information. Error correction mechanisms can be added to the model to use the language processing capabilities of the LLM to correct errors in the transcription. This can be achieved by comparing the text generated by the LLM with the original transcription and identifying and correcting the differences.

由于ASR技术的局限性，这些预测结果中可能包含错误或不准确的部分。因此，可以对这些预测结果进行校验和修正。可以由通过人工方式进行的，即人工监听音频内容，并对照预测结果进行校对。经过校验和修正后的文本，得到真实结果(TrueTranscriptions)。Due to the limitations of ASR technology, these prediction results may contain errors or inaccuracies. Therefore, these prediction results can be verified and corrected. This can be done manually, that is, manually listening to the audio content and checking it against the prediction results. The verified and corrected text is the true result (TrueTranscriptions).

由此，形成了预测结果-真实结果样本对。这些样本对中的预测结果代表了ASR技术的当前性能，而真实结果则提供了准确的参考标准。随后，使用这些样本对来对多模态大模型进行微调(Fine-tuning)。微调过程是通过调整模型的参数，使其能够更好地识别并纠正转录中的错误，从而提升转录的准确性。微调后的模型将具备更强的泛化能力，能够更好地适应复杂的转录环境。Thus, prediction result-true result sample pairs are formed. The prediction results in these sample pairs represent the current performance of ASR technology, while the true results provide an accurate reference standard. Subsequently, these sample pairs are used to fine-tune the multimodal large model. The fine-tuning process is to improve the accuracy of transcription by adjusting the parameters of the model so that it can better identify and correct errors in transcription. The fine-tuned model will have stronger generalization capabilities and can better adapt to complex transcription environments.

最后，微调后的多模态大模型可以集成到纠错多模态模型构建系统中，用于实时接收会议音频并进行转录。系统通过利用微调后的模型的强大能力，能够准确识别语音内容，并纠正可能出现的错误，从而提供高质量的转录服务。Finally, the fine-tuned multimodal large model can be integrated into the error-correcting multimodal model building system to receive conference audio in real time and transcribe it. By leveraging the powerful capabilities of the fine-tuned model, the system can accurately recognize the speech content and correct possible errors, thereby providing high-quality transcription services.

在整个过程中，预测结果-真实结果样本对的获取是关键的一步。它不仅为模型的微调提供了有力的数据支持，还能够帮助评估和提升ASR技术的性能。通过不断优化这一过程，可以不断提高纠错多模态模型构建系统的准确性和可靠性。In the whole process, obtaining the predicted result-true result sample pair is a key step. It not only provides strong data support for fine-tuning the model, but also helps evaluate and improve the performance of ASR technology. By continuously optimizing this process, the accuracy and reliability of the error correction multimodal model construction system can be continuously improved.

区别于上述实现方式，在该实现方式中，将特征标签添加至所述处理结果，以及与所述处理结果组成的样本中，形成具备特征标签的样本，即微调样本。Different from the above implementation, in this implementation, the feature label is added to the processing result and the sample composed of the processing result to form a sample with the feature label, that is, a fine-tuning sample.

在大型语言模型的应用中，为了增强模型在不同语境和条件下的表现，常常会在输入数据(即prompt)中添加特定的标签或指示符。这些标签可以包括语种、方言、性别、情感、风格等，以帮助模型更好地理解和生成符合要求的文本。In the application of large language models, in order to enhance the performance of the model in different contexts and conditions, specific labels or indicators are often added to the input data (i.e. prompt). These labels can include language, dialect, gender, emotion, style, etc., to help the model better understand and generate text that meets the requirements.

对于语种和方言标签，它们的作用是告诉模型当前输入文本的语言类型。由于不同的语言有不同的语法规则和词汇表，因此正确的语种和方言标签对于确保模型生成准确和适当的响应至关重要。例如，如果模型被提示以中文生成文本，但输入的内容却是英文，那么模型可能会产生混乱的响应，因为它不知道应该遵循哪种语言的规则。As for language and dialect labels, their role is to tell the model the language type of the current input text. Since different languages have different grammatical rules and vocabularies, correct language and dialect labels are crucial to ensure that the model generates accurate and appropriate responses. For example, if the model is prompted to generate text in Chinese, but the input content is English, the model may produce confusing responses because it does not know which language's rules should be followed.

性别标签同样重要，尤其是在涉及人物对话或描述的文本中。性别标签可以帮助模型了解应该使用哪种代词(如“他”或“她”)以及其他与性别相关的词汇和表达方式。这对于生成自然和准确的文本至关重要。Gender labels are equally important, especially in text involving character dialogue or descriptions. Gender labels can help the model understand which pronouns should be used (such as "he" or "she") and other gender-related vocabulary and expressions. This is crucial for generating natural and accurate text.

此外，情感、风格等标签也可以帮助模型更好地适应不同的应用场景。例如，在生成广告文案时，使用“正式”或“幽默”的风格标签可以引导模型生成符合特定风格的文本。同样，在情感分析中，使用“积极”或“消极”的情感标签可以帮助模型更准确地识别文本中的情感倾向。In addition, labels such as sentiment and style can also help the model better adapt to different application scenarios. For example, when generating advertising copy, using style labels such as "formal" or "humorous" can guide the model to generate text that conforms to a specific style. Similarly, in sentiment analysis, using sentiment labels such as "positive" or "negative" can help the model more accurately identify the emotional tendencies in the text.

综上，通过在输入的内容中添加特定的标签或指示符，我们可以帮助大型语言模型更好地理解和生成符合要求的文本，从而在各种应用场景中提高模型的准确性和可靠性。需要说明的是，本申请的特征标签并不局限于上述举例说明的几种标签，In summary, by adding specific labels or indicators to the input content, we can help large language models better understand and generate text that meets the requirements, thereby improving the accuracy and reliability of the model in various application scenarios. It should be noted that the feature labels of this application are not limited to the several labels illustrated above.

在一种可能的实现方式中，可以开发必要的接口，使纠错多模态模型能够与纠错多模态模型构建系统进行无缝集成。在纠错多模态模型构建系统中，实时接收会议音频，并使用纠错多模态模型进行转录和纠错。纠错后的文本可以实时显示给会议参与者或保存为会议记录。In one possible implementation, necessary interfaces can be developed to enable the error correction multimodal model to be seamlessly integrated with the error correction multimodal model building system. In the error correction multimodal model building system, conference audio is received in real time and transcribed and corrected using the error correction multimodal model. The corrected text can be displayed to the conference participants in real time or saved as a conference record.

综上，本申请提供的实施例具备以下有益效果：In summary, the embodiments provided in this application have the following beneficial effects:

1、首先，该方案通过结合人声分离的语音编码器与一个强大的语言大模型(LLM)，实现了从音频到文本的多模态处理。这种融合不仅能够提升对语音信号的理解深度，还能利用语言模型丰富的上下文知识进行更精准的转录和纠错。1. First, the solution realizes multimodal processing from audio to text by combining a voice-separated speech encoder with a powerful large language model (LLM). This fusion not only improves the depth of understanding of speech signals, but also uses the rich contextual knowledge of the language model for more accurate transcription and error correction.

2、引入一个投影层来对语音编码器的输出进行变换，目的是使其与LLM的预期输入格式相匹配。这一步骤是关键，因为它确保了两个不同模态间的有效信息传递，使得语音特征可以直接“对话”语言模型。2. A projection layer is introduced to transform the output of the speech encoder so that it matches the expected input format of the LLM. This step is critical because it ensures effective information transfer between the two different modalities, allowing speech features to directly "talk" to the language model.

3、利用预测结果与真实转录结果的对比作为样本，直接对整个多模态模型进行微调，而不是仅仅依赖于ASR(自动语音识别)的损失函数。这种方法能够更直接地优化端到端的性能，特别是对于纠错任务，因为可以直接针对转录错误进行调整。3. Use the comparison between the predicted results and the actual transcription results as samples to directly fine-tune the entire multimodal model, rather than relying solely on the ASR (automatic speech recognition) loss function. This approach can more directly optimize end-to-end performance, especially for error correction tasks, because you can directly adjust for transcription errors.

4、利用LLM的泛化能力：预训练的LLM具有很强的zero-shot学习能力，这意味着即使在微调数据集中没有出现的错误类型，模型也有可能通过泛化能力正确识别并纠正。这对于应对多样化的转录场景尤为重要。4. Leverage the generalization ability of LLM: The pre-trained LLM has strong zero-shot learning ability, which means that even if the error type does not appear in the fine-tuning dataset, the model is likely to correctly identify and correct it through generalization ability. This is especially important for dealing with diverse transcription scenarios.

5、在向LLM输入信息前，通过在prompt中加入语种、方言、性别等元数据标签，为模型提供了额外的上下文信息，有助于模型在特定语言或方言环境中做出更加准确的推理。5. Before inputting information into LLM, by adding metadata tags such as language, dialect, gender, etc. in the prompt, additional contextual information is provided to the model, which helps the model make more accurate inferences in a specific language or dialect environment.

以上为本申请实施例所提供的纠错多模态模型构建方法的一些具体实现方式，基于此，本申请还提供了对应的用于纠错多模态模型构建的系统。下面将从功能模块化的角度对本申请实施例所提供的系统进行介绍。图4为本申请实施例所提供的一种纠错多模态模型构建系统结构图。The above are some specific implementations of the error correction multimodal model construction method provided in the embodiment of the present application. Based on this, the present application also provides a corresponding system for error correction multimodal model construction. The system provided in the embodiment of the present application will be introduced from the perspective of functional modularization. Figure 4 is a structural diagram of an error correction multimodal model construction system provided in the embodiment of the present application.

所述系统包括：The system comprises:

第一获取单元110，用于获取音频信息；A first acquisition unit 110, configured to acquire audio information;

处理单元111，用于对所述音频信息进行人声分离处理和转写处理，得到处理结果；The processing unit 111 is used to perform voice separation and transcription processing on the audio information to obtain a processing result;

输入单元112，用于将所述处理结果与相对应的真实结果组合得到微调样本，输入预备模型，基于所述微调样本对所述预备模型进行调整，得到纠错多模态模型，所述预备模型包括线性层和处于冻结状态的大型语言模型，所述投影层将用于人声分离的语音编码器与处于冻结状态的语音大模型对齐。The input unit 112 is used to combine the processing result with the corresponding real result to obtain a fine-tuning sample, input the fine-tuning sample into the preparation model, adjust the preparation model based on the fine-tuning sample to obtain an error-correcting multimodal model, the preparation model includes a linear layer and a large language model in a frozen state, and the projection layer aligns the speech encoder used for voice separation with the large speech model in a frozen state.

本申请实施例还提供了对应的设备以及计算机存储介质，用于实现本申请实施例所提供的纠错多模态模型构建方法方案。The embodiments of the present application also provide corresponding devices and computer storage media for implementing the error correction multimodal model construction method provided in the embodiments of the present application.

其中，所述设备包括存储器和处理器，所述存储器用于存储指令或代码，所述处理器用于执行所述指令或代码，以使所述设备执行本申请任一实施例所述的纠错多模态模型构建方法。Among them, the device includes a memory and a processor, the memory is used to store instructions or codes, and the processor is used to execute the instructions or codes, so that the device executes the error correction multimodal model construction method described in any embodiment of the present application.

所述计算机存储介质中存储有代码，当所述代码被运行时，运行所述代码的设备实现本申请任一实施例所述的纠错多模态模型构建方法。The computer storage medium stores codes, and when the codes are executed, the device executing the codes implements the error correction multimodal model construction method described in any embodiment of the present application.

需要说明的是，本说明书中各个实施例采用递进的方式描述，每个实施例重点说明的都是与其他实施例的不同之处，各个实施例之间相同相似部分互相参见即可。对于实施例公开的系统或装置而言，由于其与实施例公开的方法相对应，所以描述的比较简单，相关之处参见方法部分说明即可。It should be noted that the various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same or similar parts between the various embodiments can be referred to each other. For the system or device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant parts can be referred to the method part description.

应当理解，在本申请中，“至少一个(项)”是指一个或者多个，“多个”是指两个或两个以上。“和/或”，用于描述关联对象的关联关系，表示可以存在三种关系，例如，“A和/或B”可以表示：只存在A，只存在B以及同时存在A和B三种情况，其中A，B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达，是指这些项中的任意组合，包括单项(个)或复数项(个)的任意组合。例如，a，b或c中的至少一项(个)，可以表示：a，b，c，“a和b”，“a和c”，“b和c”，或“a和b和c”，其中a，b，c可以是单个，也可以是多个。It should be understood that in the present application, "at least one (item)" means one or more, and "plurality" means two or more. "And/or" is used to describe the association relationship of associated objects, indicating that three relationships may exist. For example, "A and/or B" can mean: only A exists, only B exists, and A and B exist at the same time, where A and B can be singular or plural. The character "/" generally indicates that the objects associated before and after are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items. For example, at least one of a, b or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, c can be single or multiple.

需要理解的是，术语“中心”、“纵向”、“横向”、“上”、“下”、“前”、“后”、“左”、“右”、“竖直”、“水平”、“顶”、“底”、“内”、“外”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本发明的限制。It should be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inside", "outside" and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the accompanying drawings, and are only for the convenience of describing the present invention and simplifying the description, and do not indicate or imply that the referred device or element must have a specific orientation, be constructed and operated in a specific orientation, and therefore should not be understood as a limitation on the present invention.

需要说明的是，除非另有明确的规定和限定，术语“安装”、“相连”、“连接”应做广义理解，例如，可以是固定连接，也可以是可拆卸连接，或一体地连接；可以是机械连接，也可以是电连接；可以是直接相连，也可以通过中间媒介间接相连，可以是两个元件内部的连通。对于本领域的普通技术人员而言，可以具体情况理解上述术语在本发明中的具体含义。It should be noted that, unless otherwise clearly specified and limited, the terms "installed", "connected", and "connected" should be understood in a broad sense, for example, it can be a fixed connection, a detachable connection, or an integral connection; it can be a mechanical connection or an electrical connection; it can be a direct connection or an indirect connection through an intermediate medium, or it can be the internal communication of two components. For ordinary technicians in this field, the specific meanings of the above terms in the present invention can be understood according to specific circumstances.

还需要说明的是，在本文中，诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。It should also be noted that, in this article, relational terms such as first and second, etc. are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, article or device. In the absence of further restrictions, the elements defined by the statement "comprise a ..." do not exclude the presence of other identical elements in the process, method, article or device including the elements.

结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件模块，或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。The steps of the method or algorithm described in conjunction with the embodiments disclosed herein may be implemented directly using hardware, a software module executed by a processor, or a combination of the two. The software module may be placed in a random access memory (RAM), a memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

对所公开的实施例的上述说明，使本领域专业技术人员能够实现或使用本申请。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的，本文中所定义的一般原理可以在不脱离本申请的精神或范围的情况下，在其它实施例中实现。因此，本申请将不会被限制于本文所示的这些实施例，而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application will not be limited to the embodiments shown herein, but will conform to the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. The method for constructing the error correction multi-mode model is characterized by comprising the following steps of:

Acquiring audio information;

Performing voice separation processing and transcription processing on the audio information to obtain a processing result;

And combining the processing result with a corresponding real result to obtain a fine adjustment sample, inputting a preparation model, adjusting the preparation model based on the fine adjustment sample to obtain an error correction multi-modal model, wherein the preparation model comprises a linear layer and a large-scale language model in a frozen state, and the projection layer aligns a voice encoder for separating human voice with the large-scale voice model in the frozen state.

2. The method according to claim 1, wherein the performing the human voice separation processing and the transcription processing on the audio information to obtain a processing result includes:

inputting the audio information into a silence detection module, and removing a silence part in the audio information;

The audio information obtained after the elimination is respectively input into an automatic speech recognition module and a voice separation module based on a speaker to obtain a processing result, wherein the automatic speech recognition module is used for converting an audio frame of a non-mute part in the audio information into text information by utilizing an acoustic model and a language model, and the voice separation module is used for distributing the audio frame to different speakers according to the characteristics of the speaker.

3. The method of claim 2, wherein the processing results include a first result processed by an automatic speech recognition module, the first result including text units that are part of a phoneme or word that each frame of audio was analyzed and converted.

4. The method of claim 2, wherein the processing results include a second result of processing by the automatic speech recognition module, the second result being obtained by assigning audio frames based on a timbre characteristic and a speech rate characteristic of the speaker.

5. The method of claim 1, wherein combining the processing result with the corresponding real result results yields a fine-tuned sample, and further comprising, prior to inputting into the preliminary model:

acquiring a characteristic label of the processing result;

Combining the processing result with the corresponding real result to obtain a fine tuning sample, including:

and combining the feature tag, the processing result and a real result corresponding to the processing result to obtain a fine tuning sample.

6. The method of claim 1, wherein combining the processing result with the corresponding real result results yields a fine-tuned sample, and further comprising, prior to inputting into the preliminary model:

acquiring a characteristic label of the processing result;

And combining the processing result with a corresponding real result to obtain a fine adjustment sample, and inputting the fine adjustment sample into a preparation model, wherein the fine adjustment sample comprises:

combining the feature tag and the processing result to obtain a fine adjustment sample, and inputting the fine adjustment sample into a projection layer in a preparation model;

And adding the characteristic labels into the fine tuning samples which are carded by the projection layer, and inputting the characteristic labels into a frozen large voice model in the preparation model.

7. An error correction multimodal model building system, the system comprising:

The first acquisition unit is used for acquiring the audio information;

The processing unit is used for performing voice separation processing and transcription processing on the audio information to obtain a processing result;

The input unit is used for combining the processing result with a corresponding real result to obtain a fine adjustment sample, inputting a preparation model, adjusting the preparation model based on the fine adjustment sample to obtain an error correction multi-modal model, wherein the preparation model comprises a linear layer and a large language model in a frozen state, and the projection layer aligns a voice encoder for separating human voice with the large voice model in the frozen state.

8. The system of claim 7, wherein the system further comprises:

the second acquisition unit is used for acquiring the characteristic label of the processing result;

9. An electronic device, comprising: memory, a processor, and a computer program stored on the memory and executable on the processor, which when executed, implements the error correction multi-modal model construction method according to any one of claims 1-6.

10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein instructions, which when run on a terminal device, cause the terminal device to perform the error correction multi-modality model construction method according to any of claims 1-6.