
WO2019227579A1 - Conference information recording method and apparatus, computer device, and storage medium - Google Patents


Info

Publication number
WO2019227579A1
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint
voice
target
participant
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2018/094370
Other languages
French (fr)
Chinese (zh)
Inventor
黄锦伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Publication of WO2019227579A1 publication Critical patent/WO2019227579A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G10L 17/00 Speaker identification or verification techniques
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/26 Speech to text systems
    • G10L 17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • H04L 12/1813 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast, for computer conferences, e.g. chat rooms
    • H04L 12/1831 Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/027 Syllables being the recognition units

Definitions

  • The present application relates to the field of computer technology, and in particular, to a conference information recording method and device, a computer device, and a storage medium.
  • A remote conference system is an example of a system for implementing multi-person collaborative work through a network.
  • A remote conference system transfers images and sound between conference rooms at distant locations, and a meeting can be held while viewing the participants' faces on a screen. Participants who are physically remote from each other can discuss matters as if they were meeting in the same conference room. In addition, since no participant needs to travel, time can be used effectively and expenses such as travel costs can be reduced.
  • A conference information recording method includes:
  • if a voice data transmission request from a collection channel is received, obtaining the voice data and the current time;
  • performing voiceprint feature extraction on the voice data to obtain a target voiceprint feature of the voice data;
  • performing voiceprint recognition on the target voiceprint feature according to a preset voiceprint recognition model to determine a target participant identification corresponding to the target voiceprint feature, where the voiceprint recognition model includes the basic voiceprint features of the participants, the participant identifications, and the correspondence between the basic voiceprint features and the participant identifications;
  • performing speech recognition on the voice data to generate real-time voice text;
  • recording the target participant identification, the current time, and the real-time voice text in association with one another to obtain the speech information of the target participant corresponding to the target participant identification.
  • A conference information recording device includes:
  • a data collection module, configured to obtain the voice data and the current time if a voice data transmission request from a collection channel is received;
  • a feature extraction module, configured to perform voiceprint feature extraction on the voice data to obtain a target voiceprint feature of the voice data;
  • a person determination module, configured to perform voiceprint recognition on the target voiceprint feature according to a preset voiceprint recognition model and determine a target participant identification corresponding to the target voiceprint feature, where the voiceprint recognition model includes the basic voiceprint features of the participants, the participant identifications, and the correspondence between the basic voiceprint features and the participant identifications;
  • a voice recognition module, configured to perform speech recognition on the voice data to generate real-time voice text;
  • a record generation module, configured to record the target participant identification, the current time, and the real-time voice text in association with one another to obtain the speech information of the target participant corresponding to the target participant identification.
  • A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when executing the computer-readable instructions, the processor implements the steps of the conference information recording method described above.
  • One or more non-volatile readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the conference information recording method described above.
  • FIG. 1 is a schematic diagram of an application environment of a conference information recording method according to an embodiment of the present application
  • FIG. 2 is an implementation flowchart of a conference information recording method according to an embodiment of the present application
  • FIG. 3 is a flowchart of the implementation of step S2 in the conference information recording method according to an embodiment of the present application;
  • FIG. 4 is a flowchart of the implementation of step S4 in the conference information recording method according to an embodiment of the present application;
  • FIG. 5 is an implementation flowchart of constructing a preset voiceprint recognition model in a conference information recording method according to an embodiment of the present application
  • FIG. 6 is a flowchart of real-time recording correction in a conference information recording method according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a conference information recording device according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
  • FIG. 1 illustrates an application environment of a conference information recording method according to an embodiment of the present application.
  • the conference information recording method is applied to a conference information recording scenario of a remote conference.
  • The recording scenario includes a server, clients, and a monitoring terminal, where the server and the clients, and the server and the monitoring terminal, are connected through a network. Participants send voice data to the server through the clients, and the server analyzes the voice data and converts it into the corresponding conference record information.
  • The conference information recording staff monitors, and as needed modifies, the conference record information on the server through the monitoring terminal.
  • A client may specifically be, but is not limited to, a microphone, a mobile phone, or a smart device with a recording function.
  • The server and the monitoring terminal may be implemented as an independent server or as a server cluster composed of multiple servers.
  • FIG. 2 illustrates a conference information recording method provided by an embodiment of the present application.
  • The method is described by taking its application to the server in FIG. 1 as an example, as follows:
  • S1: If a voice data transmission request from the collection channel is received, obtain the voice data and the current time.
  • Specifically, when the collection channel sends a data transmission request to the server, the server receives the voice data in the request and obtains the current time, where the collection channel is the client of this embodiment.
  • The collection channel collects the conference speech of the participants; it may specifically be a microphone or other communication equipment with a voice transmission function, which is not specifically limited here.
  • It is worth noting that the embodiment of the present application uses multiple collection channels to collect voice data. In general, a remote conference has many participants; when more than four participants share the same collection channel, the voice data of participants far from the channel is easily recognized as background sound and filtered out during voice signal optimization, lowering the accuracy of speech recognition.
  • Using multiple channels avoids the reduced recognition rate caused by mixed voices when too many participants share one collection channel. Each participant may be given an independent collection channel, or two to four adjacent participants may share the same collection channel; this can be set according to the actual situation and is not limited here.
  • S2: Perform voiceprint feature extraction on the voice data to obtain the target voiceprint feature of the voice data.
  • Specifically, voiceprint feature extraction is performed on the voice data as follows: the voice data is converted from an analog signal into a digital signal through speech encoding and preprocessing, and voiceprint features are then extracted from the digital signal to obtain the target voiceprint feature of the voice data.
  • The voiceprint features include, but are not limited to, acoustic features, morphological features, prosodic features, dialect accent information, and channel information.
  • Preferably, the voiceprint feature used in this application is an acoustic feature.
  • S3: According to a preset voiceprint recognition model, perform voiceprint recognition on the target voiceprint feature to determine the target participant identification corresponding to the target voiceprint feature, where the voiceprint recognition model includes the basic voiceprint features of the participants, the participant identifications, and the correspondence between the basic voiceprint features and the participant identifications.
  • Specifically, the target voiceprint feature is subjected to voiceprint recognition through the preset voiceprint recognition model, so that the target participant identification corresponding to the target voiceprint feature is obtained and the participant who is currently speaking is determined.
  • Voiceprint recognition is a type of biometric technology, also known as speaker recognition.
  • Speaker recognition comes in two types, namely speaker identification and speaker verification.
  • Voiceprint recognition converts acoustic signals into electrical signals, which are then identified by a computer.
  • Voiceprint recognition is used in the embodiment of the present application to identify the participant who is currently speaking, that is, speaker identification.
  • S4: Perform speech recognition on the voice data to generate real-time voice text.
  • Here, the real-time voice text is obtained by segmenting the voice data of a conference speech according to the pauses and silences during the speech, performing speech recognition on each segmented piece of voice data, and taking the corresponding recognized text as the real-time voice text.
  • For example, a piece of voice data acquired from 0 to 1.8 seconds is recorded as voice data E; the voice data acquired from 1.8 to 3 seconds is empty; and another piece of voice data acquired from 3 to 8 seconds is recorded as voice data F. Speech recognition on voice data E yields the real-time voice text "in this quarter", and speech recognition on voice data F yields the real-time voice text "our business goals have been exceeded".
  • The voice data may be recognized by a speech recognition algorithm or by a third-party tool with a speech recognition function, which is not specifically limited here.
  • Speech recognition algorithms include, but are not limited to, speech recognition algorithms based on channel models, speech template matching recognition algorithms, and artificial neural network speech recognition algorithms.
  • Preferably, the speech recognition algorithm used in the embodiment of the present application is a speech recognition algorithm based on a channel model.
  • S5: Record the target participant identification, the current time, and the real-time voice text in association with one another to obtain the speech information of the target participant corresponding to the target participant identification.
  • Specifically, the real-time voice text and the current time corresponding to the real-time voice text are associated with the target participant identification, and the speech information of the target participant corresponding to the target participant identification is obtained.
  • For example, the obtained real-time voice text is "Now I announce the official start of the meeting", the corresponding current time is "April 18, 2018 13:52:00", and the corresponding participant identification is "Work ID 003". Recording the real-time voice text together with its corresponding current time and participant identification yields a conference information record: "April 18, 2018 13:52:00, Work ID 003, now I announce the official start of the meeting".
  • Further, all conference information records can be sorted in chronological order, as sketched below.
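As a minimal illustration of this association-and-sorting step (the data structure and names below are assumptions for illustration, not part of the patent):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SpeechRecord:
    participant_id: str   # target participant identification, e.g. "Work ID 003"
    spoken_at: datetime   # current time obtained when the voice data arrived
    text: str             # real-time voice text

records: list[SpeechRecord] = []

def record_speech(participant_id: str, spoken_at: datetime, text: str) -> None:
    """Associate participant identification, current time, and real-time voice text."""
    records.append(SpeechRecord(participant_id, spoken_at, text))

record_speech("Work ID 003", datetime(2018, 4, 18, 13, 52, 0),
              "Now I announce the official start of the meeting")

# All conference information records, sorted in chronological order.
for r in sorted(records, key=lambda r: r.spoken_at):
    print(f"{r.spoken_at:%Y-%m-%d %H:%M:%S}, {r.participant_id}, {r.text}")
```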
  • In this embodiment, the voice data and the current time are obtained, and voiceprint feature extraction is performed on the voice data to obtain the target voiceprint feature of the voice data.
  • A preset voiceprint recognition model is then used to identify the target voiceprint feature and determine the corresponding target participant identification, so that even when a remote conference has many participants, voiceprint recognition can quickly and accurately identify the current speaker, improving the efficiency of identifying and recording the speaker in the conference information record as well as the accuracy of the record.
  • Meanwhile, the voice data is recognized in real time to generate real-time voice text.
  • The participant identification, the current time, and the real-time voice text are associated to obtain the speech information of the corresponding participant at the conference, avoiding the missed and wrong records caused by manual recording and improving the efficiency and accuracy of conference information recording.
  • Further, in step S2, voiceprint feature extraction is performed on the voice data to obtain the target voiceprint feature of the voice data, which specifically includes the following steps:
  • S21: Perform voiceprint analysis on the voice data to obtain an initial voiceprint.
  • Specifically, the voice data obtained by the device is an analog signal. Voiceprint analysis is performed on the voice data as follows: after the voice data is obtained, it is encoded using Pulse Code Modulation (PCM) so that the analog signal is converted into a digital signal.
  • The analog signal in the voice data is sampled at fixed intervals to discretize it, the sampled signal is then quantized, and the quantized digital signal is output as binary code groups.
  • Since the frequency range of speech is roughly 200-3400 Hz, the sampling rate can be set to 8 kHz with a quantization precision of 16 bits.
  • S22: Perform pre-emphasis on the initial voiceprint to generate a pre-emphasized voiceprint with a flat spectrum.
  • Because the high-frequency part of the speech spectrum is attenuated, pre-emphasis is performed during preprocessing. The purpose of pre-emphasis is to boost the high-frequency part so that the signal spectrum becomes flat and the same signal-to-noise ratio can be used over the whole band from low to high frequency, which facilitates spectrum analysis or vocal-tract parameter analysis. Pre-emphasis can be performed before the anti-aliasing filter when the voice signal is digitized.
  • Pre-emphasis can be implemented using a first-order digital filter, such as a Finite Impulse Response (FIR) filter.
  • After pre-emphasis, the high-frequency part of the spectrum is boosted and the spectrum becomes flat, producing a pre-emphasized voiceprint with a flat spectrum, which benefits subsequent voiceprint feature extraction.
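A minimal numpy sketch of such a first-order pre-emphasis filter; the coefficient 0.97 is a common choice and an assumption here, as the patent does not fix one:

```python
import numpy as np

def pre_emphasize(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

# Example: one second of 16-bit PCM speech sampled at 8 kHz, as described above.
sr = 8000
pcm = (np.random.randn(sr) * 1000).astype(np.int16)
y = pre_emphasize(pcm.astype(np.float64))  # high frequencies boosted, flatter spectrum
```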
  • S23: Perform framing and windowing on the pre-emphasized voiceprint to obtain first basic speech frames.
  • The voiceprint signal has the property of short-term stability, so after pre-emphasis it needs to be framed and windowed to maintain that short-term stability. Generally, each second contains between 33 and 100 frames. In order to maintain continuity between frames, so that two adjacent frames transition smoothly, the overlapping-frame method is adopted, that is, an overlap is kept between adjacent frames; this overlap is the frame shift, and the ratio of the frame shift to the frame length lies in the range (0, 0.5).
  • Let the pre-emphasized voiceprint signal be s'(n), the frame length be N sampling points, and the frame shift be M sampling points. The l-th frame is then x_l(n) = s'[(l-1)M + n], where 0 ≤ n ≤ N-1.
  • Each frame is multiplied by the corresponding window function w(n) to obtain the windowed speech signal S_w(n) = x_l(n) · w(n); this windowed voiceprint signal is used as a first basic speech frame.
  • The window functions include, but are not limited to, the rectangular window, the Hamming window, and the Hanning window.
  • The rectangular window expression is: w(n) = 1 for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise, where w(n) is the window function, N is the number of sampling points, and n is the n-th sampling point.
  • The Hamming window expression is: w(n) = 0.54 - 0.46·cos(2πn/(N-1)) for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise, where π is the circular constant; preferably, the value of π in the embodiment of the present application is 3.1416.
  • The Hanning window expression is: w(n) = 0.5·(1 - cos(2πn/(N-1))) for 0 ≤ n ≤ N-1, and w(n) = 0 otherwise.
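A sketch of the framing and windowing just described; the 25 ms frame length and 10 ms shift are illustrative assumptions (their ratio 0.4 lies in the stated range (0, 0.5)):

```python
import numpy as np

def frame_and_window(s: np.ndarray, N: int, M: int, window: str = "hamming") -> np.ndarray:
    """Split the pre-emphasized signal s'(n) into overlapping frames of length N
    with frame shift M, and multiply each frame by the window function w(n)."""
    if window == "hamming":
        w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    elif window == "hanning":
        w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / (N - 1)))
    else:                       # rectangular window
        w = np.ones(N)
    L = 1 + (len(s) - N) // M   # number of frames
    frames = np.stack([s[l * M: l * M + N] for l in range(L)])
    return frames * w           # each row is one windowed first basic speech frame

# 8 kHz signal: N = 200 samples (25 ms), M = 80 samples (10 ms shift).
frames = frame_and_window(np.random.randn(8000), N=200, M=80)
```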
  • In this way, the pre-emphasized voiceprint is framed and windowed, so that the voiceprint maintains continuity between frames and some abnormal signal points are eliminated, which improves the robustness of the voiceprint.
  • S24: Perform silent sound separation on the first basic speech frames to obtain first target speech frames.
  • S25: Perform voiceprint feature extraction on each first target speech frame to obtain the target voiceprint feature.
  • After the above processing, a stable voiceprint is obtained, and voiceprint features are extracted from it.
  • Voiceprint feature extraction means extracting and selecting, for the speaker's voiceprint, acoustic or linguistic features with strong separability and high stability.
  • Preferably, the voiceprint feature selected in this application is the linear cepstral feature among the acoustic features.
  • In this embodiment, voiceprint analysis is performed on the voice data of the participants to obtain the initial voiceprint, and the initial voiceprint is pre-emphasized to generate a pre-emphasized voiceprint with a flat spectrum.
  • Framing and windowing are then performed on the pre-emphasized voiceprint to obtain robust first basic speech frames, and silent sound separation is performed on the first basic speech frames to filter out silence and noise and obtain the first target speech frames, making the voiceprint continuous and robust.
  • The target voiceprint feature of the participant is then extracted from this voiceprint, improving the accuracy and robustness of the target voiceprint feature, which is conducive to improving the accuracy of subsequent voiceprint recognition and ensuring accurate identification of target participants.
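The patent prefers linear cepstral features; as an illustrative stand-in, the closely related mel-frequency cepstral coefficients (MFCCs, another cepstral acoustic feature) can be extracted with librosa:

```python
import numpy as np
import librosa

def extract_voiceprint_feature(y: np.ndarray, sr: int = 8000, n_coeffs: int = 13) -> np.ndarray:
    """Cepstral acoustic features per frame; here the target voiceprint feature
    is summarized as the mean coefficient vector over all target speech frames."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeffs)  # shape (n_coeffs, n_frames)
    return mfcc.mean(axis=1)

target_feature = extract_voiceprint_feature(np.random.randn(8000).astype(np.float32))
```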
  • Further, in step S4, speech recognition is performed on the voice data to generate real-time voice text, which specifically includes the following steps:
  • S41: Perform speech parsing on the voice data to obtain a frame set containing second basic speech frames.
  • Specifically, speech parsing is performed on the acquired voice data to obtain a frame set containing second basic speech frames.
  • The speech parsing includes, but is not limited to, speech encoding and preprocessing of the speech signal.
  • Speech encoding encodes the analog speech signal and converts the analog signal into a digital signal, thereby reducing the transmission code rate and enabling digital transmission.
  • The basic methods of speech encoding can be divided into waveform encoding, parametric encoding, and hybrid encoding.
  • Preferably, the speech encoding method used in this embodiment is waveform encoding.
  • Waveform encoding forms a digital voice signal by sampling, quantizing, and encoding the time-domain waveform of the analog voice.
  • Waveform encoding can provide high voice quality.
  • The preprocessing of a voice signal refers to preprocessing operations such as pre-emphasis, framing, and windowing performed on the voice signal before analysis and processing.
  • The purpose of speech signal preprocessing is to eliminate the effects of aliasing, higher-harmonic distortion, high frequency, and other factors caused by the human vocal organs themselves and by the equipment that collects the speech signal, so that the signal obtained by subsequent speech processing is more uniform and smooth, providing high-quality parameters for signal parameter extraction and improving the quality of speech processing.
  • S42: Perform silence detection on the second basic speech frames to obtain K continuous silent frames among the second basic speech frames, where K is a natural number.
  • Specifically, the voice signals in the voice data can be divided into two states: the active period and the silent period.
  • No voice signal is transmitted during the silent period, and the uplink and downlink active and silent periods are mutually independent.
  • When speaking, participants pause from time to time; such pauses produce silences in the voice signal, that is, silent periods.
  • Therefore, the silent periods need to be detected and separated from the active periods, so that continuous active periods are obtained and the voice signal of each remaining continuous active period is used as a target speech frame.
  • The methods for detecting the silent state include, but are not limited to, voice endpoint detection, audio mute detection algorithms, and voice activity detection (VAD) algorithms.
  • The specific implementation process used in the embodiment of the present application for performing silence detection on the second basic speech frames and obtaining K continuous silent frames includes steps A to D, detailed as follows:
  • Step A: Calculate the frame energy of each second basic speech frame.
  • The frame energy is the short-term energy of the voice signal and reflects the amount of voice information carried by the voice frame.
  • The frame energy can be used to determine whether a voice frame is a speech frame or a silent frame.
  • Step B: For each second basic speech frame, if the frame energy of the second basic speech frame is less than a preset frame energy threshold, mark the second basic speech frame as a silent frame.
  • The frame energy threshold is a preset parameter: if the calculated frame energy of a second basic speech frame is less than the preset frame energy threshold, the corresponding second basic speech frame is marked as a silent frame. The frame energy threshold may be set according to actual requirements, for example set to 0.5, or it may be set based on the calculated frame energies of the second basic speech frames; no limitation is imposed here.
  • For example, the frame energy threshold is set to 0.5, and the frame energies calculated for six second basic speech frames J1, J2, J3, J4, J5, and J6 are 1.6, 0.2, 0.4, 1.7, 1.1, and 0.8, respectively. From this result it is easy to see that the second basic speech frames J2 and J3 are silent frames.
  • Step C: If H consecutive silent frames are detected and H is greater than a preset continuity threshold I, the frame set composed of the H consecutive silent frames is regarded as one continuous silent frame.
  • The continuity threshold I can be preset according to actual needs. If the number of consecutive silent frames is H, and H is greater than the preset continuity threshold I, all the silent frames in the interval composed of the H consecutive silent frames are merged to obtain one continuous silent frame.
  • For example, the preset continuity threshold I is 5, and at a certain moment the status of the acquired silent frames is shown in Table 1, which shows a frame set composed of 50 second basic speech frames.
  • As shown in Table 1, the set contains two intervals of five or more consecutive silent frames: interval P, composed of the second basic speech frames corresponding to frame numbers 7 to 13, and interval Q, composed of the second basic speech frames corresponding to frame numbers 21 to 29.
  • Therefore, the 7 second basic speech frames corresponding to frame numbers 7 to 13 in interval P are combined to obtain a continuous silent frame P, whose duration is the sum of the durations of those 7 frames; the 9 second basic speech frames corresponding to frame numbers 21 to 29 in interval Q are combined to obtain another continuous silent frame Q, whose duration is the sum of the durations of those 9 frames.
  • Step D: Following the method of steps A to C, obtain a total of K continuous silent frames.
  • Continuing the example in step C, the continuous silent frames obtained are continuous silent frame P and continuous silent frame Q, so the value of K there is 2.
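A sketch of steps A to D, using the thresholds from the examples above (frame energy threshold 0.5, continuity threshold I = 5); the 0-based indexing and the (start, end) return format are choices of this sketch:

```python
import numpy as np

def detect_continuous_silence(frames: np.ndarray,
                              energy_threshold: float = 0.5,
                              continuity_threshold: int = 5) -> list[tuple[int, int]]:
    """Return the K continuous silent frames as (start, end) index pairs (end exclusive)."""
    energy = (frames ** 2).sum(axis=1)        # Step A: frame energy per frame
    is_silent = energy < energy_threshold     # Step B: mark silent frames
    runs, start = [], None
    for i, silent in enumerate(is_silent):
        if silent and start is None:
            start = i
        elif not silent and start is not None:
            if i - start > continuity_threshold:   # Step C: H > I
                runs.append((start, i))
            start = None
    if start is not None and len(is_silent) - start > continuity_threshold:
        runs.append((start, len(is_silent)))
    return runs                                # Step D: K continuous silent frames
```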
  • S43: Divide the second basic speech frames contained in the frame set into K + 1 second target speech frames according to the K continuous silent frames.
  • Specifically, the K continuous silent frames obtained in step S42 are used as demarcation points to divide the second basic speech frames contained in the frame set, yielding K + 1 intervals of second basic speech frames; all the second basic speech frames in each interval are combined and used as one target speech frame.
  • For example, the status of the acquired silent frames is as shown in Table 1 of step C in S42, which yields two continuous silent frames: the continuous silent frame P obtained by combining the 7 second basic speech frames corresponding to frame numbers 7 to 13, and the continuous silent frame Q obtained by combining the 9 second basic speech frames corresponding to frame numbers 21 to 29.
  • Using these two continuous silent frames as demarcation points, the frame set containing 50 second basic speech frames is divided into three intervals: interval M1, composed of the second basic speech frames corresponding to frame numbers 1 to 6; interval M2, composed of the second basic speech frames corresponding to frame numbers 14 to 20; and interval M3, composed of the second basic speech frames corresponding to frame numbers 30 to 50. Combining all the second basic speech frames in interval M1 yields one combined speech frame, which is used as the second target speech frame M1, and likewise for M2 and M3.
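Building on the detection sketch above, a minimal version of this division into K + 1 second target speech frames (concatenating overlapping frames is simplified to flattening here):

```python
import numpy as np

def split_by_silence(frames: np.ndarray,
                     silence_runs: list[tuple[int, int]]) -> list[np.ndarray]:
    """Use the K continuous silent frames as demarcation points and merge the
    remaining frames into K + 1 second target speech frames."""
    segments, prev_end = [], 0
    for start, end in silence_runs:
        if start > prev_end:
            segments.append(frames[prev_end:start].reshape(-1))
        prev_end = end
    if prev_end < len(frames):
        segments.append(frames[prev_end:].reshape(-1))
    return segments

# With continuous silent frames P (numbers 7-13) and Q (numbers 21-29), the 50-frame
# set splits into the three target speech frames M1, M2, M3 as in the example.
targets = split_by_silence(np.random.randn(50, 200), [(6, 13), (20, 29)])
assert len(targets) == 3
```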
  • S44: Perform text conversion on each second target speech frame to obtain the real-time voice text corresponding to that target speech frame.
  • The text conversion may use a tool that supports speech-to-text conversion, or a text conversion algorithm, which is not specifically limited here.
  • In this embodiment, the voice data is parsed to obtain a frame set containing second basic speech frames, silence detection is then performed on the second basic speech frames to obtain K continuous silent frames, the K continuous silent frames are used to divide the second basic speech frames contained in the frame set into K + 1 second target speech frames, and each target speech frame is converted into real-time voice text.
  • In this way, the received voice signals are converted into independent real-time voice texts in real time, which enables the speech information of participants to be recorded accurately in real time and improves the accuracy and efficiency of conference information recording.
  • Further, the conference information recording method also includes the following steps:
  • S61: Collect voiceprint features from each participant to obtain the basic voiceprint feature of each participant.
  • Specifically, a preset corpus is provided to the participants, each participant is recorded while reading the corpus aloud to obtain a recording sample, and voiceprint feature extraction is then performed on the voice signal in the recording sample to obtain the basic voiceprint feature of the participant.
  • The preset corpus can be an article, a piece of web news, or meeting minutes, among others; it can be set according to actual needs and is not specifically limited here.
  • For example, a professional paper is provided as the preset corpus; for each participant, a recording sample of the participant reading the preset corpus aloud is collected, and the participant's basic voiceprint feature is obtained from the recording sample.
  • For the process of extracting the voiceprint feature from the voice signal in the recording sample to obtain the participant's basic voiceprint feature, reference may be made to the description of steps S21 to S25 or of step S2; to avoid repetition, details are not repeated here.
  • It should be noted that the basic voiceprint feature is a collection containing multiple voiceprint features.
  • S62: Obtain the identification of each participant, and build an initial voiceprint recognition model based on the participants' basic voiceprint features, the participant identifications, and the correspondence between the participants' basic voiceprint features and the participant identifications.
  • Specifically, the participant identification of each participant is obtained, and based on the participant identification, the participant's basic voiceprint feature, and the correspondence between them, a map of the participants' voiceprint feature information is constructed and stored in the initial voiceprint recognition model.
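The constructed map can be pictured as a dictionary keyed by participant identification; the nearest-feature lookup below is only an illustrative stand-in for the trained recognizer (the patent trains it with a genetic algorithm, described next):

```python
import numpy as np

# Initial voiceprint recognition model: participant identification -> basic voiceprint feature.
voiceprint_model: dict[str, np.ndarray] = {}

def enroll(participant_id: str, basic_feature: np.ndarray) -> None:
    """Store the correspondence between the basic voiceprint feature and the identification."""
    voiceprint_model[participant_id] = basic_feature

def identify(target_feature: np.ndarray) -> str:
    """Return the participant identification whose basic feature is nearest the target."""
    return min(voiceprint_model,
               key=lambda pid: np.linalg.norm(voiceprint_model[pid] - target_feature))
```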
  • S63: Train the initial voiceprint recognition model using a genetic algorithm to generate the preset voiceprint recognition model.
  • Specifically, the initial voiceprint recognition model needs to be trained in order to improve the recognition rate of the voiceprint recognition model.
  • The training methods include, but are not limited to, the template matching method, the nearest neighbor method, the neural network method, the Hidden Markov Model (HMM), the Vector Quantization (VQ) method, and polynomial classifiers.
  • Preferably, the training method used in the embodiment of the present application is a genetic algorithm (GA) based on the neural network method, and the specific implementation process is as follows:
  • The basic voiceprint features of each participant are used as a population, and the genetic algorithm performs initial encoding, individual evaluation, selection, crossover, and mutation to obtain the optimal or approximate solution for that population, that is, the optimal solution for the participant's basic voiceprint features.
  • During recognition, the best match is found among the optimal solutions of the basic voiceprint features, and the target participant is determined according to the voiceprint feature corresponding to that optimal solution.
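A generic real-valued GA skeleton matching the steps just listed (encoding, evaluation, selection, crossover, mutation); the fitness function, selection scheme, and rates are assumptions of this sketch, as the patent does not specify them:

```python
import numpy as np

def genetic_optimize(population: np.ndarray, fitness, generations: int = 50,
                     mutation_rate: float = 0.05) -> np.ndarray:
    """population: array of shape (P, D), one encoded individual per row."""
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in population])   # individual evaluation
        keep = population[np.argsort(scores)[::-1][: len(population) // 2]]  # selection
        pairs = np.random.randint(len(keep), size=(len(population) - len(keep), 2))
        children = (keep[pairs[:, 0]] + keep[pairs[:, 1]]) / 2    # crossover: average parents
        population = np.vstack([keep, children])
        mask = np.random.rand(*population.shape) < mutation_rate  # mutation
        population = population + mask * np.random.randn(*population.shape) * 0.01
    return max(population, key=fitness)  # optimal or approximate solution
```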
  • In this embodiment, voiceprint feature collection is performed to obtain each participant's basic voiceprint feature, the participant identifications are obtained, the initial voiceprint recognition model is constructed based on the participants' basic voiceprint features, the participant identifications, and the correspondence between them, and the initial voiceprint recognition model is then trained with the genetic algorithm to generate the preset voiceprint recognition model. As a result, when the target voiceprint feature is identified, the participant identification corresponding to the target voiceprint feature can be determined quickly and accurately, so that the participant identification of the current speaker is recorded automatically and accurately, which helps improve the efficiency and accuracy of conference information recording.
  • Further, after step S4, the server sends the real-time voice text to the monitoring terminal and receives the modified text fed back by the monitoring terminal; that is, the conference information recording method also includes the following steps:
  • S71: Send the real-time voice text to the monitoring terminal of the conference information record.
  • Specifically, the real-time voice text is sent to the monitoring terminal, and the conference information recording staff at the monitoring terminal checks the real-time voice text. If the user finds that the real-time voice text is incorrect, the monitoring terminal feeds the modified text corresponding to the real-time voice text back to the server.
  • For example, the real-time voice text received by a monitoring terminal is "0068 - Next Year's Economic and Industrial Driving Analysis". After checking, the user of the monitoring terminal finds that the real-time voice text is incorrect, modifies it to "0068 - Analysis of Economic and Industrial Forms for the Next Year", and sends this to the server as the modified text.
  • S72: If the modified text fed back by the monitoring terminal is received, replace the corresponding real-time voice text with the modified text.
  • Specifically, after receiving the modified text sent by the monitoring terminal, the server obtains the real-time voice text corresponding to the modified text and uses the modified text to replace that real-time voice text.
  • The correspondence between the modified text and the real-time voice text can be determined by a serial number or other identifying characters in the real-time voice text, or can be set according to actual needs, which is not specifically limited here.
  • For example, according to the modified text "0068 - Analysis of Economic and Industrial Forms for the Next Year", it is determined that the corresponding real-time voice text is the one with serial number "0068"; the real-time voice text "0068 - Next Year's Economic and Industrial Driving Analysis" with that serial number is then found and replaced with the modified text.
  • In this embodiment, the real-time voice text is sent to the monitoring terminal of the conference information record, and the user of the monitoring terminal verifies it. If the monitoring-terminal user finds an error in the real-time voice text, the text is modified, the modified text is generated and sent to the server through the monitoring terminal, and after receiving it the server replaces the corresponding real-time voice text with the modified text, which improves the accuracy of the conference information record.
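A minimal sketch of the serial-number matching described above (the keying scheme and text format are assumptions):

```python
# Real-time voice texts keyed by their leading serial number, e.g. "0068".
realtime_texts: dict[str, str] = {
    "0068": "0068 - Next Year's Economic and Industrial Driving Analysis",
}

def apply_correction(modified_text: str) -> None:
    """Find the real-time voice text with the same serial number and replace it."""
    serial = modified_text.split("-", 1)[0].strip()
    if serial in realtime_texts:
        realtime_texts[serial] = modified_text

apply_correction("0068 - Analysis of Economic and Industrial Forms for the Next Year")
```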
  • Further, a conference information recording device is provided, and the conference information recording device corresponds one-to-one to the conference information recording method in the foregoing embodiment.
  • the conference information recording device includes a data collection module 10, a feature extraction module 20, a person determination module 30, a voice recognition module 40, and a record generation module 50.
  • the detailed description of each function module is as follows:
  • the data collection module 10 is configured to obtain the voice data and the current time if a voice data transmission request of the collection channel is received;
  • a feature extraction module 20 configured to extract voiceprint features from voice data to obtain target voiceprint features of the voice data
  • a person determination module 30, configured to perform voiceprint recognition on the target voiceprint feature according to a preset voiceprint recognition model and determine the target participant identification corresponding to the target voiceprint feature, where the voiceprint recognition model includes the basic voiceprint features of the participants, the participant identifications, and the correspondence between the basic voiceprint features and the participant identifications;
  • a voice recognition module 40 configured to perform voice recognition on voice data to generate real-time voice text
  • the record generating module 50 is configured to correlate the target participant identification, the current time, and the real-time voice text to obtain the speech information of the target participant corresponding to the target participant identification.
  • the feature extraction module 20 includes:
  • the voiceprint analysis unit 21 is configured to perform voiceprint analysis on the voice data to obtain an initial voiceprint
  • an emphasis processing unit 22, configured to perform pre-emphasis processing on the initial voiceprint to generate a pre-emphasized voiceprint with a flat spectrum;
  • a framing and windowing unit 23, configured to perform framing and windowing on the pre-emphasized voiceprint to obtain the first basic speech frames;
  • a silent sound separating unit 24 configured to perform silent sound separation on a first basic speech frame to obtain a first target speech frame
  • a feature extraction unit 25 is configured to extract voiceprint features for each first target voice frame to obtain target voiceprint features.
  • the speech recognition module 40 includes:
  • a speech parsing unit 41 configured to perform speech parsing on speech data to obtain a frame set including a second basic speech frame
  • the silence detection unit 42 is configured to perform silence detection on the second basic voice frame to obtain K consecutive silence frames in the second basic voice frame, where K is a natural number;
  • a voice segmentation unit 43 configured to divide the second basic voice frame included in the frame set into K + 1 second target voice frames according to the K silence frames;
  • the text conversion unit 44 is configured to convert each second target speech frame into a real-time speech text.
  • the conference information recording device further includes:
  • the voiceprint collection module 61 is configured to collect voiceprint characteristics for each participant to obtain basic voiceprint characteristics for each participant;
  • a model building module 62, configured to obtain the identification of each participant and construct an initial voiceprint recognition model based on the participants' basic voiceprint features, the participant identifications, and the correspondence between the participants' basic voiceprint features and the participant identifications;
  • the model training module 63 is configured to train an initial voiceprint recognition model by using a genetic algorithm to generate a preset voiceprint recognition model.
  • the conference information recording device further includes:
  • a text transmission module 71 configured to send real-time voice text to the monitoring end of the conference information record
  • the text replacing module 72 is configured to replace the real-time voice text with the modified text if the modified text is received from the monitoring end.
  • Each module in the conference information recording device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • The above modules may be embedded in hardware in, or independent of, the processor in the computer device, or may be stored in software form in the memory of the computer device, so that the processor can call them and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in a non-volatile storage medium.
  • the database of the computer equipment is used to store the basic voiceprint characteristics and conference information records of the participants in the conference information recording method.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by a processor to implement a conference information recording method.
  • In an embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor.
  • When the processor executes the computer-readable instructions, the steps of the conference information recording method of the foregoing embodiment are implemented, for example, steps S1 to S5 shown in FIG. 2.
  • Alternatively, when the processor executes the computer-readable instructions, the functions of the modules/units of the conference information recording device of the foregoing embodiment are implemented, for example, modules 10 to 50 shown in FIG. 7. To avoid repetition, details are not repeated here.
  • In an embodiment, one or more non-volatile readable storage media storing computer-readable instructions are provided.
  • When the computer-readable instructions are executed by one or more processors, the steps of the conference information recording method of the foregoing embodiment are implemented, or the functions of the modules/units of the conference information recording device of the foregoing embodiment are implemented. To avoid repetition, details are not repeated here.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A conference information recording method and apparatus, a computer device, and a storage medium. The method comprises: if it is detected that a collection channel generates voice data, obtaining the voice data and the current time; extracting a voiceprint feature of the voice data; recognizing the voiceprint feature by means of a preset voiceprint recognition model to determine a target participant identifier corresponding to the voiceprint feature, so as to quickly and accurately confirm the identity of the current speaker, thereby improving the efficiency and accuracy with which the speaker is recorded in the conference information record; recognizing the voice data in real time to generate real-time voice text; and associating the participant identifier, the current time, and the real-time voice text to obtain the speaking information, at the current conference, of the participant corresponding to the participant identifier, thereby avoiding missed and wrong records caused by manual recording and improving the efficiency and accuracy of conference information recording.

Description

Conference information recording method, device, computer equipment and storage medium

This application is based on the Chinese invention patent application No. 201810528516.1, filed on May 29, 2018 and entitled "Conference Information Recording Method, Device, Terminal Equipment and Storage Medium", and claims priority thereto.

Technical field

The present application relates to the field of computer technology, and in particular, to a conference information recording method and device, a computer device, and a storage medium.

Background

With the advancement of science and technology and the rapid development of computer network technology, the demand for network computing technology for interconnecting computers is growing. In a networked environment, the sharing of computer resources and collaborative work such as information sharing, circulation, distribution, and exchange can proceed smoothly.

For example, a remote conference system is a system for implementing multi-person collaborative work through a network. A remote conference system transfers images and sound between conference rooms at distant locations, and a meeting can be held while viewing the participants' faces on a screen. Participants who are physically remote from each other can discuss matters as if they were meeting in the same conference room. In addition, since no participant needs to travel, time can be used effectively and expenses such as travel costs can be reduced.

Remote conferences bring great convenience. However, when a remote conference has many participants, the speaker is often misattributed during conference information recording precisely because of the large number of participants; moreover, when the interval between speeches is short or a speech is long, important information is easily missed. This makes conference information recording inefficient, and its accuracy cannot meet requirements.

Summary of the Invention

Based on this, it is necessary to provide, in view of the above technical problems, a conference information recording method and device, a computer device, and a storage medium that can improve the efficiency and accuracy of conference information recording.

A conference information recording method includes:

if a voice data transmission request from a collection channel is received, obtaining the voice data and the current time;

performing voiceprint feature extraction on the voice data to obtain a target voiceprint feature of the voice data;

performing voiceprint recognition on the target voiceprint feature according to a preset voiceprint recognition model to determine a target participant identification corresponding to the target voiceprint feature, where the voiceprint recognition model includes the basic voiceprint features of the participants, the participant identifications, and the correspondence between the basic voiceprint features and the participant identifications;

performing speech recognition on the voice data to generate real-time voice text;

recording the target participant identification, the current time, and the real-time voice text in association with one another to obtain the speech information of the target participant corresponding to the target participant identification.

A conference information recording device includes:

a data collection module, configured to obtain the voice data and the current time if a voice data transmission request from a collection channel is received;

a feature extraction module, configured to perform voiceprint feature extraction on the voice data to obtain a target voiceprint feature of the voice data;

a person determination module, configured to perform voiceprint recognition on the target voiceprint feature according to a preset voiceprint recognition model and determine a target participant identification corresponding to the target voiceprint feature, where the voiceprint recognition model includes the basic voiceprint features of the participants, the participant identifications, and the correspondence between the basic voiceprint features and the participant identifications;

a voice recognition module, configured to perform speech recognition on the voice data to generate real-time voice text;

a record generation module, configured to record the target participant identification, the current time, and the real-time voice text in association with one another to obtain the speech information of the target participant corresponding to the target participant identification.

A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when executing the computer-readable instructions, the processor implements the steps of the conference information recording method described above.

One or more non-volatile readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the conference information recording method described above.

Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a schematic diagram of the application environment of a conference information recording method according to an embodiment of the present application;

FIG. 2 is an implementation flowchart of a conference information recording method according to an embodiment of the present application;

FIG. 3 is a flowchart of the implementation of step S2 in the conference information recording method according to an embodiment of the present application;

FIG. 4 is a flowchart of the implementation of step S4 in the conference information recording method according to an embodiment of the present application;

FIG. 5 is an implementation flowchart of constructing the preset voiceprint recognition model in the conference information recording method according to an embodiment of the present application;

FIG. 6 is an implementation flowchart of real-time record correction in the conference information recording method according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a conference information recording device according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a computer device according to an embodiment of the present application.

DETAILED DESCRIPTION

The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are merely some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.

Referring to FIG. 1, FIG. 1 illustrates an application environment of the conference information recording method according to an embodiment of the present application. The method is applied to a conference information recording scenario of a remote conference. The scenario includes a server, clients, and a monitoring terminal, where the server is connected to the clients and to the monitoring terminal through a network. Participants send voice data to the server through the clients; the server analyzes the voice data and converts it into corresponding conference record information; a conference recorder monitors the conference record information on the server through the monitoring terminal and modifies it as needed. A client may specifically be, but is not limited to, a microphone, a mobile phone, or another smart device with a recording function; the server and the monitoring terminal may each be implemented as an independent server or as a server cluster composed of multiple servers.

Referring to FIG. 2, FIG. 2 illustrates a conference information recording method according to an embodiment of the present application. The method is described below using its application to the server in FIG. 1 as an example, detailed as follows:

S1: If a voice data transmission request from a collection channel is received, obtain the voice data and the current time.

Specifically, when a collection channel sends a data transmission request to the server, the server receives the voice data in the request and obtains the current time, where the collection channel is the client of this embodiment.

The collection channel collects the conference speech of the participants; it may specifically be a microphone or another communication device with a voice transmission function, which is not specifically limited here.

It is worth noting that the embodiments of the present application use multiple collection channels to collect voice data. A remote conference usually has many participants. If more than four participants use the same collection channel, the voice of participants far from the channel is easily recognized as background sound and filtered out during voice signal optimization, lowering the accuracy of speech recognition. Using multiple channels avoids the reduced recognition rate caused by mixed voices when too many participants share one collection channel. Each participant may be given an independent collection channel, or two to four adjacent participants may share one; this can be set according to the actual situation and is not limited here.

S2: Perform voiceprint feature extraction on the voice data to obtain the target voiceprint feature of the voice data.

Specifically, voiceprint feature extraction is performed on the voice data as follows: the voice data is converted from an analog signal into a digital signal through speech encoding and speech signal preprocessing, and the voiceprint feature is then extracted from the digital signal to obtain the target voiceprint feature of the voice data.

Voiceprint features include, but are not limited to: acoustic features, lexical features, prosodic features, language and dialect accent information, and channel information.

Preferably, the voiceprint feature used in the present application is an acoustic feature.

S3: Perform voiceprint recognition on the target voiceprint feature according to a preset voiceprint recognition model, and determine the target participant identifier corresponding to the target voiceprint feature, where the voiceprint recognition model includes the basic voiceprint features of the participants, the participant identifiers, and the correspondence between the basic voiceprint features and the participant identifiers.

Specifically, the preset voiceprint recognition model includes: the basic voiceprint features of the participants, the participant identifiers, and the correspondence between the basic voiceprint features and the participant identifiers. After the target voiceprint feature is obtained, voiceprint recognition is performed on it through the preset voiceprint recognition model to obtain the target participant identifier corresponding to the target voiceprint feature, thereby determining the participant who is currently speaking.

Voiceprint recognition, a type of biometric technology also known as speaker recognition, has two categories: speaker identification and speaker verification. Voiceprint recognition converts an acoustic signal into an electrical signal and then uses a computer for recognition.

It is easy to understand that the embodiments of the present application use voiceprint recognition to identify the participant who is currently speaking, that is, speaker identification.
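
For illustration only, the following is a minimal Python sketch of the matching step described above. It assumes the basic voiceprint features are stored as fixed-length numeric vectors and that cosine similarity is used as the matching metric; the embodiment prescribes neither the feature dimension nor a specific distance measure, so both are hypothetical here.

```python
import numpy as np

def identify_speaker(target_feature, model):
    """Return the participant identifier whose basic voiceprint
    feature is most similar to the target feature.

    model: dict mapping participant identifier -> basic feature vector
    (an assumed storage layout, not one fixed by the embodiment).
    """
    target = np.asarray(target_feature, dtype=float)
    best_id, best_score = None, -1.0
    for participant_id, base_feature in model.items():
        base = np.asarray(base_feature, dtype=float)
        # Cosine similarity between the target and a basic voiceprint feature.
        score = np.dot(target, base) / (np.linalg.norm(target) * np.linalg.norm(base))
        if score > best_score:
            best_id, best_score = participant_id, score
    return best_id

# Usage: two enrolled participants and one target feature.
model = {"003": [0.9, 0.1, 0.3], "007": [0.2, 0.8, 0.5]}
print(identify_speaker([0.85, 0.15, 0.32], model))  # -> "003"
```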

S4: Perform speech recognition on the voice data to generate real-time speech text.

Specifically, real-time speech text of the participants' speech is generated by performing speech recognition on the voice data, so that speech content is recorded automatically and the efficiency of conference recording is improved.

Real-time speech text refers to text obtained by segmenting the voice data of a conference speech according to the pauses and silences during the speech, yielding segments of voice data, and performing speech recognition on each segment to obtain the corresponding recognized text.

For example, in a specific embodiment, a segment of voice data is acquired from second 0 to second 1.8 and recorded as voice data E; the voice data acquired from second 1.8 to second 3 is empty; another segment of voice data is acquired from second 3 to second 8 and recorded as voice data F. Speech recognition on voice data E yields the real-time speech text "in this quarter", and speech recognition on voice data F yields the real-time speech text "our business targets have been exceeded".

Speech recognition on the voice data may use a speech recognition algorithm or a third-party tool with a speech recognition function, which is not specifically limited. Speech recognition algorithms include, but are not limited to: vocal tract model based speech recognition algorithms, speech template matching algorithms, and artificial neural network based speech recognition algorithms.

Preferably, the speech recognition algorithm used in the embodiments of the present application is a vocal tract model based speech recognition algorithm.

S5: Record the target participant identifier, the current time, and the real-time speech text in association with one another to obtain the speech information of the target participant corresponding to the target participant identifier.

Specifically, for each real-time speech text, the real-time speech text, the current time corresponding to it, and the corresponding target participant identifier are recorded in association, thereby obtaining the speech information of the target participant corresponding to the target participant identifier.

For example, in a specific embodiment, the obtained real-time speech text is "I now declare the meeting officially open", the corresponding current time is "2018-04-18 13:52:00", and the corresponding participant identifier is "employee No. 003". The real-time speech text, its corresponding current time, and the participant identifier are recorded in association to obtain one conference information record: "2018-04-18 13:52:00, employee No. 003, I now declare the meeting officially open".

Preferably, after the conference ends, all conference information records may be sorted in chronological order.
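
As an illustration of step S5, the following minimal Python sketch associates the three items and sorts the records chronologically after the conference; the tuple layout of a record is an assumption made for this example.

```python
from datetime import datetime

# Each record associates (current time, target participant identifier,
# real-time speech text); this layout is illustrative only.
records = [
    ("2018-04-18 13:53:10", "005", "our business targets have been exceeded"),
    ("2018-04-18 13:52:00", "003", "I now declare the meeting officially open"),
]

# After the conference ends, sort all conference information records chronologically.
records.sort(key=lambda r: datetime.strptime(r[0], "%Y-%m-%d %H:%M:%S"))
for time_str, participant_id, text in records:
    print(f"{time_str}, employee No. {participant_id}, {text}")
```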

In this embodiment, on the one hand, if voice data from a collection channel is detected, the voice data and the current time are obtained, voiceprint feature extraction is performed on the voice data to obtain its target voiceprint feature, and the target voiceprint feature is recognized using the preset voiceprint recognition model to determine the corresponding target participant identifier. When a remote conference has many participants, voiceprint recognition thus identifies the current speaker quickly and accurately, improving both the efficiency of confirming and recording the speaker in the conference information record and the accuracy of the record. On the other hand, the voice data is recognized in real time to generate real-time speech text, and the participant identifier, the current time, and the real-time speech text are associated to obtain the speech information of the corresponding participant in this conference, avoiding the omissions and errors of manual recording and improving the efficiency and accuracy of conference information recording.

In an embodiment, as shown in FIG. 3, step S2 of performing voiceprint feature extraction on the voice data to obtain the target voiceprint feature of the voice data specifically includes the following steps:

S21: Perform voiceprint parsing on the voice data to obtain an initial voiceprint.

Specifically, the voice data acquired by the device is an analog signal. Voiceprint parsing means that, after the voice data is obtained, it is encoded with Pulse Code Modulation (PCM) to convert the analog signal into a digital signal: the analog signal in the voice data is sampled at one sampling point per preset interval so as to discretize it, the sampled signal is quantized, and the quantized digital signal is output as binary code groups. Given the speech spectrum range of 200-3400 Hz, the sampling rate may be set to 8 kHz with a quantization precision of 16 bits.

It should be understood that the sampling rate and quantization precision given here are the preferred values of the present application; they can be set according to the needs of practical applications and are not limited here.
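
The sampling and quantization described above can be sketched as follows; the "analog" input is simulated by a densely sampled sine wave, since the actual analog front end is hardware, and the 300 Hz tone is an arbitrary illustrative value.

```python
import numpy as np

FS = 8000   # sampling rate in Hz (the preferred value above)
BITS = 16   # quantization precision

# Simulate one second of an "analog" 300 Hz tone on a very fine time grid.
t_fine = np.linspace(0.0, 1.0, 80000, endpoint=False)
analog = 0.8 * np.sin(2 * np.pi * 300 * t_fine)

# Sampling: take one point every 1/FS seconds (every 10th fine-grid point here).
sampled = analog[::10]

# Quantization: map [-1, 1] onto signed 16-bit integers (binary code groups).
pcm = np.round(sampled * (2 ** (BITS - 1) - 1)).astype(np.int16)
print(pcm[:5])
```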

S22: Perform pre-emphasis processing on the initial voiceprint to generate an emphasized voiceprint with a flat spectrum.

Specifically, glottal excitation and oral-nasal radiation affect the average power spectrum of the speech signal, causing the spectrum above about 800 Hz to fall off at 6 dB per octave, so when the spectrum of a speech signal is computed, the higher the frequency, the smaller the corresponding component. Pre-emphasis is therefore performed during preprocessing. Its purpose is to boost the high-frequency part so that the spectrum of the signal becomes flat and stays flat over the whole band from low to high frequencies, allowing the spectrum to be computed with the same signal-to-noise ratio, which facilitates spectrum analysis or vocal tract parameter analysis. Pre-emphasis can be performed before the anti-aliasing filter when the speech signal is digitized; this not only performs pre-emphasis but also compresses the dynamic range of the signal and effectively improves the signal-to-noise ratio. Pre-emphasis can be implemented with a first-order digital filter, for example a Finite Impulse Response (FIR) filter.

It is worth noting that the speech signals acquired by the device are analog signals; before pre-emphasis, the analog information needs to be converted into a digital signal through sampling and quantization. Given the speech spectrum range of 200-3400 Hz, the sampling rate may be set to 8 kHz with a quantization precision of 16 bits.

It should be understood that the sampling rate and quantization precision given here are the preferred values of the present application; they can be set according to the needs of practical applications and are not limited here.

After pre-emphasis, the high-frequency part of the spectrum is boosted and the signal becomes flat, generating an emphasized voiceprint with a flat spectrum, which benefits subsequent voiceprint feature extraction.
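
A minimal sketch of first-order FIR pre-emphasis follows; the coefficient 0.97 is a conventional choice and is an assumption of this example, not a value fixed by the embodiment.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    Boosts the high-frequency part of the spectrum so that the
    emphasized voiceprint has a flatter spectrum.
    """
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# Usage on a short digitized voiceprint segment.
print(pre_emphasis([0.0, 0.2, 0.5, 0.4, 0.1]))
```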

S23: Perform framing and windowing on the emphasized voiceprint to obtain first basic speech frames.

Specifically, a voiceprint signal is short-time stationary. After pre-emphasis, the voiceprint signal needs framing and windowing to maintain the short-time stationarity of the signal; typically, the number of frames per second is between 33 and 100. To maintain continuity between frames so that adjacent frames transition smoothly, overlapping framing is used, that is, an overlapping part is kept between adjacent frames; this part is the frame shift.

Preferably, the ratio of the frame shift to the frame length is in the range (0, 0.5).

For example, in a specific embodiment, the pre-emphasized voiceprint signal is s'(n), the frame length is N sampling points, and the frame shift is M sampling points. When the sampling point corresponding to the l-th frame is the n-th, the correspondence between the original speech signal x_l(n) and these parameters is:

x_l(n) = x[(l-1)M + n]

where n = 0, 1, ..., N-1, and N = 256.

After framing, the corresponding window function w(n) is multiplied with the pre-emphasized voiceprint signal s'(n) to obtain the windowed speech signal S_w, which is taken as a first basic speech frame.

Window functions include, but are not limited to: the rectangular window, the Hamming window, and the Hanning window.

The rectangular window is expressed as:

$$w(n) = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$

where w(n) is the window function, N is the number of sampling points, and n is the n-th sampling point.

The Hamming window is expressed as:

$$w(n) = \begin{cases} 0.54 - 0.46\cos\dfrac{2\pi n}{N-1}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$

where $\pi$ is the ratio of a circle's circumference to its diameter; preferably, in the embodiments of the present application, $\pi$ takes the value 3.1416.

The Hanning window is expressed as:

$$w(n) = \begin{cases} 0.5\left(1 - \cos\dfrac{2\pi n}{N-1}\right), & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$$

Framing and windowing the pre-emphasized voiceprint keeps the voiceprint continuous from frame to frame and removes some abnormal signal points, improving the robustness of the voiceprint.
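
A minimal sketch of overlapping framing followed by Hamming windowing, matching the relation x_l(n) = x[(l-1)M + n] above: the frame length of 256 samples follows the example, while the frame shift of 100 samples is an illustrative value chosen so that the shift-to-length ratio stays inside (0, 0.5).

```python
import numpy as np

def frame_and_window(signal, frame_len=256, frame_shift=100):
    """Split the pre-emphasized signal into overlapping frames and apply
    a Hamming window to each frame; frame_shift/frame_len is kept below
    0.5 so that adjacent frames overlap, as described above."""
    signal = np.asarray(signal, dtype=float)
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)  # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.empty((num_frames, frame_len))
    for l in range(num_frames):
        start = l * frame_shift  # (l-1)*M with 1-based frame index l
        frames[l] = signal[start:start + frame_len] * window
    return frames

# Usage: 1024 samples -> 8 windowed frames of 256 samples each.
print(frame_and_window(np.random.randn(1024)).shape)  # (8, 256)
```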

S24: Perform silence separation on the first basic speech frames to obtain first target speech frames.

Specifically, for the implementation of performing silence separation on the first basic speech frames to obtain the first target speech frames, reference may be made to the description of steps S42 and S43; to avoid repetition, details are not repeated here.

S25: Perform voiceprint feature extraction on each first target speech frame to obtain the target voiceprint feature.

Specifically, after pre-emphasis, framing, windowing, and silence separation, a highly stable voiceprint is obtained, and the voiceprint features are extracted from it.

Voiceprint feature extraction extracts and selects acoustic or linguistic features that are highly separable and stable with respect to the speaker's voiceprint.

Preferably, the voiceprint feature extracted in the present application is the linear cepstral feature among the acoustic features.
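
The embodiment names the linear cepstral feature but does not spell out its computation. The sketch below assumes the common definition of the real cepstrum on a linear frequency scale (inverse FFT of the log magnitude spectrum), truncated to a small number of coefficients per frame; the coefficient count of 12 is likewise an illustrative assumption.

```python
import numpy as np

def linear_cepstrum(windowed_frame, num_coeffs=12):
    """Real cepstrum on a linear frequency scale:
    c = IFFT(log |FFT(frame)|), truncated to num_coeffs coefficients."""
    spectrum = np.fft.rfft(windowed_frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)  # avoid log(0)
    cepstrum = np.fft.irfft(log_magnitude)
    return cepstrum[:num_coeffs]

# Usage on one windowed frame of 256 samples.
print(linear_cepstrum(np.random.randn(256)).shape)  # (12,)
```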

In this embodiment, voiceprint parsing is performed on the voice data of a participant's speech to obtain an initial voiceprint; the initial voiceprint is pre-emphasized to generate an emphasized voiceprint with a flat spectrum; framing and windowing are applied to the emphasized voiceprint to obtain robust first basic speech frames; silence separation is then performed on the first basic speech frames to filter out interfering silence and obtain the first target speech frames, so that the voiceprint is continuous and robust. The participant's target voiceprint feature is then extracted from this voiceprint, improving the accuracy and robustness of the target voiceprint feature, which helps raise the accuracy of subsequent voiceprint recognition and thereby ensures accurate identification of the target participant.

In an embodiment, as shown in FIG. 4, step S4 of performing speech recognition on the voice data to generate real-time speech text specifically includes the following steps:

S41: Perform speech parsing on the voice data to obtain a frame set containing second basic speech frames.

Specifically, speech parsing is performed on the acquired voice data to obtain a frame set containing second basic speech frames. Speech parsing includes, but is not limited to: speech encoding and speech signal preprocessing.

Speech encoding encodes the analog speech signal to convert it into a digital signal, thereby reducing the transmission bit rate and enabling digital transmission. The basic methods of speech encoding are waveform coding, parametric coding, and hybrid coding.

Preferably, the speech encoding method used in the present application is waveform coding. Waveform coding samples, quantizes, and encodes the time-domain waveform of the analog speech to form a digital speech signal, and can provide high speech quality.

Speech signal preprocessing refers to preprocessing operations such as pre-emphasis, framing, and windowing performed on the speech signal before it is analyzed and processed. Its purpose is to eliminate the effects on speech signal quality of aliasing, higher harmonic distortion, high frequencies, and other factors introduced by the human vocal organs themselves and by the equipment collecting the speech signal, so that the signal obtained by subsequent speech processing is more uniform and smooth, providing high-quality parameters for signal parameter extraction and improving the quality of speech processing.

S42: Perform silence detection on the second basic speech frames to obtain K continuous silence frames in the second basic speech frames, where K is a natural number.

Specifically, while a participant speaks continuously, the speech signal in the voice data can be divided into two states, active periods and silent periods; no speech signal is transmitted during a silent period, and the active and silent periods of the uplink and downlink are independent of each other. A participant pauses while speaking, and each pause produces a pause in the speech signal, that is, a silent period. When performing speech recognition and converting to text, the silent-period state needs to be detected so that silent periods can be separated from active periods to obtain continuous active periods; the retained speech signal of each continuous active period is taken as a target speech frame.

Methods of detecting the silence state include, but are not limited to: speech endpoint detection, audio silence detection algorithms, and Voice Activity Detection (VAD) algorithms.

Preferably, the specific implementation used in the embodiments of the present application for performing silence detection on the second basic speech frames to obtain the K continuous silence frames includes steps A to D, detailed as follows:

Step A: Calculate the frame energy of each second basic speech frame.

Specifically, the frame energy is the short-time energy of the speech signal and reflects the amount of speech information carried by the speech frame; the frame energy can be used to judge whether a speech frame is a speech frame or a silence frame.

Step B: For each second basic speech frame, if its frame energy is less than a preset frame energy threshold, mark that second basic speech frame as a silence frame.

Specifically, the frame energy threshold is a preset parameter. If the calculated frame energy of a second basic speech frame is less than the preset frame energy threshold, the corresponding second basic speech frame is marked as a silence frame. The frame energy threshold can be set according to actual requirements, for example to 0.5, or it can be set based on analysis of the calculated frame energies of the second basic speech frames; it is not limited here.

For example, in a specific embodiment, the frame energy threshold is set to 0.5, and the frame energies of six second basic speech frames J_1, J_2, J_3, J_4, J_5, and J_6 are calculated to be 1.6, 0.2, 0.4, 1.7, 1.1, and 0.8 respectively. From this result it is easy to see that the second basic speech frames J_2 and J_3 are silence frames.

Step C: If H consecutive silence frames are detected and H is greater than a preset continuity threshold I, take the frame set composed of the H consecutive silence frames as one continuous silence frame.

Specifically, the continuity threshold I can be preset according to actual needs. If there are H consecutive silence frames and H is greater than the preset continuity threshold I, all silence frames in the interval composed of those H consecutive silence frames are merged into one continuous silence frame.

For example, in a specific embodiment, the preset continuity threshold I is 5. At a certain moment, the states of the acquired silence frames are as shown in Table 1, which shows a frame set composed of 50 second basic speech frames. As can be seen from Table 1, the intervals containing five or more consecutive silence frames are: interval P, composed of the second basic speech frames corresponding to frame numbers 7 to 13, and interval Q, composed of the second basic speech frames corresponding to frame numbers 21 to 29. Therefore, the 7 second basic speech frames of frame numbers 7 to 13 in interval P are combined into one continuous silence frame P, whose duration is the sum of the durations of those 7 frames; by the same method, the second basic speech frames of frame numbers 21 to 29 in interval Q are combined into another continuous silence frame Q, whose duration is the sum of the durations of those 9 frames.

Table 1

Frame No.       1   2   3   4   5   6   7   8   9   10
Silence frame?  No  No  Yes No  No  No  Yes Yes Yes Yes
Frame No.       11  12  13  14  15  16  17  18  19  20
Silence frame?  Yes Yes Yes No  No  No  No  No  No  No
Frame No.       21  22  23  24  25  26  27  28  29  30
Silence frame?  Yes Yes Yes Yes Yes Yes Yes Yes Yes No
Frame No.       31  32  33  34  35  36  37  38  39  40
Silence frame?  Yes Yes No  No  No  No  No  No  Yes Yes
Frame No.       41  42  43  44  45  46  47  48  49  50
Silence frame?  No  Yes Yes No  No  Yes No  No  No  No

Step D: Following the method of steps A to C, obtain the total number K of continuous silence frames.

Taking Table 1 listed in step C as an example, the continuous silence frames obtained are continuous silence frame P and continuous silence frame Q; thus, in the example corresponding to step C, the value of K is 2.
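
Steps A to D can be sketched as follows, taking the frame energy to be the sum of squared samples (a common definition of short-time energy, assumed here since the embodiment does not give a formula) and merging runs of more than I consecutive silence frames:

```python
import numpy as np

def find_continuous_silence(frames, energy_threshold=0.5, continuity_threshold=5):
    """Return a list of (start, end) frame-index ranges, each one a
    continuous silence frame made of H > continuity_threshold
    consecutive silence frames."""
    # Step A: frame energy = short-time energy of each frame (assumed definition).
    energies = [float(np.sum(np.square(f))) for f in frames]
    # Step B: mark frames whose energy falls below the threshold as silence.
    is_silent = [e < energy_threshold for e in energies]
    # Step C: merge runs of H > continuity_threshold silence frames.
    runs, start = [], None
    for i, silent in enumerate(is_silent + [False]):  # sentinel closes a final run
        if silent and start is None:
            start = i
        elif not silent and start is not None:
            if i - start > continuity_threshold:
                runs.append((start, i - 1))
            start = None
    # Step D: the total number of continuous silence frames is K = len(runs).
    return runs

# Usage: a toy frame sequence with one long quiet stretch (frames 6-12).
frames = [np.ones(4)] * 6 + [np.zeros(4)] * 7 + [np.ones(4)] * 3
print(find_continuous_silence(frames))  # [(6, 12)]
```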

S43: Divide the second basic speech frames contained in the frame set into K+1 second target speech frames according to the K silence frames.

Specifically, the K continuous silence frames obtained in step S42 are used as demarcation points to divide the second basic speech frames contained in the frame set, obtaining K+1 intervals of second basic speech frames; all second basic speech frames contained in each interval are taken together as one target speech frame.

For example, in a specific embodiment, the states of the acquired silence frames are as shown in Table 1 of step C in S42, which shows two continuous silence frames: continuous silence frame P, obtained by combining the 7 second basic speech frames of frame numbers 7 to 13, and continuous silence frame Q, obtained by combining the 9 second basic speech frames of frame numbers 21 to 29. Using these two continuous silence frames as demarcation points, the frame set containing 50 second basic speech frames is divided into three intervals: interval M_1, composed of the second basic speech frames of frame numbers 1 to 6; interval M_2, composed of the second basic speech frames of frame numbers 14 to 20; and interval M_3, composed of the second basic speech frames of frame numbers 30 to 50. All second basic speech frames in interval M_1 are combined into one combined speech frame, which serves as second target speech frame M_1.
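
Continuing the sketch, the K continuous silence frames serve as demarcation points that split the frame set into K+1 second target speech frames; the (start, end) index ranges below are the kind produced by the detection step above.

```python
def split_by_silence(num_frames, silence_runs):
    """Divide frame indices 0..num_frames-1 into K+1 target segments,
    using the K continuous silence frames as demarcation points."""
    segments, cursor = [], 0
    for start, end in silence_runs:
        if cursor < start:
            segments.append((cursor, start - 1))
        cursor = end + 1
    if cursor < num_frames:
        segments.append((cursor, num_frames - 1))
    return segments

# Usage with the Table 1 example (0-based indices): silence at frames
# 6-12 and 20-28 splits 50 frames into three target segments.
print(split_by_silence(50, [(6, 12), (20, 28)]))
# -> [(0, 5), (13, 19), (29, 49)]
```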

S44: Convert each second target speech frame into real-time speech text.

Specifically, text conversion is performed on each target speech frame to obtain the real-time speech text corresponding to that target speech frame.

Text conversion may use a tool that supports speech-to-text conversion or a text conversion algorithm; this is not specifically limited here.

In this embodiment, speech parsing is performed on the voice data to obtain a frame set containing second basic speech frames; silence detection is then performed on the second basic speech frames to obtain the K continuous silence frames among them; according to these K continuous silence frames, the second basic speech frames contained in the frame set are divided into K+1 second target speech frames, and each target speech frame is converted into a real-time speech text. The received speech signal is thus converted in real time into independent real-time speech texts, so that the participants' speech is recorded accurately in real time, improving the accuracy and efficiency of conference information recording.

In an embodiment, as shown in FIG. 5, before step S3, the conference information recording method further includes the following steps:

S61: Collect voiceprint features for each participant to obtain the basic voiceprint features of each participant.

Specifically, before the conference begins, a preset corpus is provided to the participants; a recording is made while a participant reads the corpus aloud to obtain a recording sample, and voiceprint feature extraction is then performed on the speech signal in the recording sample to obtain the basic voiceprint features of that participant.

The preset corpus may be an article, a piece of online news, the minutes of a meeting, or the like; it can be set according to actual needs and is not specifically limited here.

For example, in a specific embodiment, a professional paper is provided as the preset corpus; for each participant, a recording sample of the participant reading the preset corpus aloud is collected, and the participant's basic voiceprint features are obtained from the recording sample.

For the specific process of performing voiceprint feature extraction on the speech signal in the recording sample to obtain the participant's basic voiceprint features, reference may be made to the description of steps S21 to S25, or to the description of step S2; to avoid repetition, details are not repeated here.

It is worth noting that the basic voiceprint features form a set containing multiple voiceprint features.

S62: Obtain each participant identifier, and construct an initial voiceprint recognition model based on the participants' basic voiceprint features, the participant identifiers, and the correspondence between the participants' basic voiceprint features and the participant identifiers.

Specifically, for each participant, the participant identifier of that participant is obtained, and based on the participant identifier, the participant's basic voiceprint features, and the correspondence between them, a mapping of the participant's voiceprint feature information is constructed and stored in the initial voiceprint model.

S63: Train the initial voiceprint recognition model using a genetic algorithm to generate the preset voiceprint recognition model.

Specifically, after the initial voiceprint model is obtained, it needs to be trained to improve the recognition rate of the voiceprint recognition model. Training methods include, but are not limited to: template matching methods, nearest neighbor methods, neural network methods, the Hidden Markov Model (HMM), Vector Quantization (VQ) methods, and polynomial classifier methods.

Preferably, the training method used in the embodiments of the present application is a Genetic Algorithm (GA) based on the neural network method. The specific implementation process is as follows: the basic voiceprint features of each participant are taken as a population, and each voiceprint feature among the basic voiceprint features is taken as a chromosome; initialization encoding, individual evaluation, selection, crossover, and mutation are performed with the genetic algorithm to obtain the optimal or approximate solution for the population, that is, the optimal solution of that participant's basic voiceprint features. When recognition is subsequently performed with the voiceprint recognition model, the optimal solution closest to the target voiceprint feature is found among the optimal solutions of the basic voiceprint features, and the target participant is determined from the voiceprint feature corresponding to that optimal solution.
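
The embodiment describes the GA training only at the level of initialization encoding, individual evaluation, selection, crossover, and mutation. The following is therefore a heavily simplified, hypothetical Python illustration, assuming real-valued chromosomes and a stand-in fitness that rewards closeness to the mean of a participant's enrolled feature vectors; it is a sketch of the named operators, not the actual training procedure of this application.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_template(feature_vectors, generations=50, mutation_scale=0.05):
    """Evolve a population of candidate voiceprint templates toward a
    representative template (the 'optimal solution') for one participant.

    feature_vectors: array of shape (num_samples, dim); each enrolled
    basic voiceprint feature vector serves as one chromosome.
    """
    features = np.asarray(feature_vectors, dtype=float)
    population = features.copy()        # initialization encoding
    target = features.mean(axis=0)      # assumed fitness reference
    for _ in range(generations):
        # Individual evaluation: negative distance to the reference.
        fitness = -np.linalg.norm(population - target, axis=1)
        # Selection: keep the fitter half of the population.
        survivors = population[np.argsort(fitness)[len(population) // 2:]]
        # Crossover: average randomly chosen pairs of survivors.
        pairs = rng.integers(0, len(survivors), size=(len(population), 2))
        children = (survivors[pairs[:, 0]] + survivors[pairs[:, 1]]) / 2
        # Mutation: small Gaussian perturbation of each child.
        population = children + rng.normal(0, mutation_scale, children.shape)
    fitness = -np.linalg.norm(population - target, axis=1)
    return population[int(np.argmax(fitness))]

# Usage: 8 enrolled 12-dimensional feature vectors -> one 12-d template.
print(train_template(rng.normal(size=(8, 12))).shape)  # (12,)
```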

In this embodiment, voiceprint features are collected for each participant to obtain the participant's basic voiceprint features, the participant identifier is obtained, and an initial voiceprint recognition model is constructed based on the participants' basic voiceprint features, the participant identifiers, and the correspondence between them. The initial voiceprint recognition model is then trained with a genetic algorithm to generate the preset voiceprint recognition model, so that when a target voiceprint feature is subsequently recognized, the participant identifier corresponding to it can be determined quickly and accurately, automatically and accurately recording the identifier of the participant who is currently speaking, which helps improve the efficiency and accuracy of conference information recording.

In an embodiment, the server sends the real-time speech text to the monitoring terminal and receives the modified text fed back by the monitoring terminal. As shown in FIG. 6, after step S4 and before step S5, the conference information recording method further includes the following steps:

S71: Send the real-time speech text to the monitoring terminal of the conference information record.

Specifically, after speech recognition is performed on the voice data and the real-time speech text is generated, the real-time speech text is sent to the monitoring terminal, where the conference recorder checks it; if the user finds the real-time speech text incorrect, the monitoring terminal feeds the modified text corresponding to that real-time speech text back to the server.

For example, in a specific embodiment, the real-time speech text received by a monitoring terminal is "0068 - economic and industry driving analysis for the next year". After checking, the user of the monitoring terminal finds the real-time speech text incorrect, modifies it to "0068 - economic and industry situation analysis for the next year", and sends the modified text to the server.

S72: If the modified text fed back by the monitoring terminal is received, replace the real-time speech text with the modified text.

Specifically, after receiving the modified text sent by the monitoring terminal, the server obtains the real-time speech text corresponding to the modified text and replaces that real-time speech text with the modified text.

The correspondence between the modified text and the real-time speech text may be determined by the sequence number or other identifying characters in the real-time speech text, or set according to actual needs; it is not specifically limited here.

Taking the modified text in step S71 as an example, the modified text "0068 - economic and industry situation analysis for the next year" identifies its corresponding real-time speech text as the one with sequence number "0068"; the real-time speech text "0068 - economic and industry driving analysis for the next year" with sequence number "0068" is then found and replaced with the modified text.
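
A minimal sketch of the replacement step, assuming the sequence number at the head of each text (for example "0068") is the key linking a modified text to its real-time speech text; this keying scheme is only one of the options left open above.

```python
def apply_correction(records, modified_text):
    """Replace the real-time speech text whose leading sequence number
    matches that of the modified text fed back by the monitoring terminal."""
    seq = modified_text.split("-", 1)[0].strip()
    for i, text in enumerate(records):
        if text.split("-", 1)[0].strip() == seq:
            records[i] = modified_text
            return True
    return False  # no record carries this sequence number

records = ["0068 - economic and industry driving analysis for the next year"]
apply_correction(records, "0068 - economic and industry situation analysis for the next year")
print(records[0])
```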

In this embodiment, the real-time speech text is sent to the monitoring terminal of the conference information record, and the user of the monitoring terminal verifies it. If the user finds that the real-time speech text contains an error, the text is modified to generate the modified text, which the monitoring terminal sends to the server; after receiving the modified text, the server replaces the corresponding real-time speech text with it, improving the accuracy of conference information recording.

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic and should not constitute any limitation on the implementation of the embodiments of the present application.

In an embodiment, a conference information recording apparatus is provided, which corresponds one-to-one to the conference information recording method in the above embodiments. As shown in FIG. 7, the conference information recording apparatus includes a data collection module 10, a feature extraction module 20, a person determination module 30, a speech recognition module 40, and a record generation module 50. The functional modules are described in detail as follows:

The data collection module 10 is configured to obtain the voice data and the current time if a voice data transmission request from a collection channel is received;

The feature extraction module 20 is configured to perform voiceprint feature extraction on the voice data to obtain the target voiceprint feature of the voice data;

The person determination module 30 is configured to perform voiceprint recognition on the target voiceprint feature according to the preset voiceprint recognition model and determine the target participant identifier corresponding to the target voiceprint feature, where the voiceprint recognition model includes the basic voiceprint features of the participants, the participant identifiers, and the correspondence between the basic voiceprint features and the participant identifiers;

The speech recognition module 40 is configured to perform speech recognition on the voice data to generate real-time speech text;

The record generation module 50 is configured to record the target participant identifier, the current time, and the real-time speech text in association with one another to obtain the speech information of the target participant corresponding to the target participant identifier.

Further, the feature extraction module 20 includes:

a voiceprint parsing unit 21, configured to perform voiceprint parsing on the voice data to obtain an initial voiceprint;

an emphasis processing unit 22, configured to perform pre-emphasis processing on the initial voiceprint to generate an emphasized voiceprint with a flat spectrum;

a framing and windowing unit 23, configured to perform framing and windowing on the emphasized voiceprint to obtain first basic speech frames;

a silence separation unit 24, configured to perform silence separation on the first basic speech frames to obtain first target speech frames;

a feature extraction unit 25, configured to perform voiceprint feature extraction on each first target speech frame to obtain the target voiceprint feature.

Further, the speech recognition module 40 includes:

a speech parsing unit 41, configured to perform speech parsing on the voice data to obtain a frame set containing second basic speech frames;

a silence detection unit 42, configured to perform silence detection on the second basic speech frames to obtain K continuous silence frames in the second basic speech frames, where K is a natural number;

a speech segmentation unit 43, configured to divide the second basic speech frames contained in the frame set into K+1 second target speech frames according to the K silence frames;

a text conversion unit 44, configured to convert each second target speech frame into real-time speech text.

Further, the conference information recording apparatus further includes:

a voiceprint collection module 61, configured to collect voiceprint features for each participant to obtain the basic voiceprint features of each participant;

a model construction module 62, configured to obtain each participant identifier and construct an initial voiceprint recognition model based on the participants' basic voiceprint features, the participant identifiers, and the correspondence between the participants' basic voiceprint features and the participant identifiers;

a model training module 63, configured to train the initial voiceprint recognition model using a genetic algorithm to generate the preset voiceprint recognition model.

Further, the conference information recording apparatus further includes:

a text transmission module 71, configured to send the real-time speech text to the monitoring terminal of the conference information record;

a text replacement module 72, configured to replace the real-time speech text with the modified text if the modified text fed back by the monitoring terminal is received.

For specific limitations on the conference information recording apparatus, reference may be made to the above limitations on the conference information recording method, which are not repeated here. Each module in the above conference information recording apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded, in hardware form, in the processor of the computer device or be independent of it, or may be stored, in software form, in the memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.

In an embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device is used to store the participants' basic voiceprint features and the conference information records of the conference information recording method. The network interface of the computer device is used to communicate with external terminals through a network connection. The computer-readable instructions, when executed by the processor, implement a conference information recording method.

In an embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the steps of the conference information recording method of the above embodiments, for example steps S1 to S5 shown in FIG. 2. Alternatively, when executing the computer-readable instructions, the processor implements the functions of the modules/units of the conference information recording apparatus of the above embodiments, for example modules 10 to 50 shown in FIG. 7. To avoid repetition, details are not repeated here.

In an embodiment, one or more non-volatile readable storage media are provided, storing computer-readable instructions that, when executed by one or more processors, implement the steps of the conference information recording method of the above embodiments, or, when executed by one or more processors, implement the functions of the modules/units of the conference information recording apparatus of the above embodiments. To avoid repetition, details are not repeated here.

A person of ordinary skill in the art can understand that all or part of the processes of the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions can be stored in one or more non-volatile readable storage media and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that, for the convenience and brevity of the description, only the above-mentioned division of functional units and modules is used as an example. In practical applications, the above functions can be assigned by different functional units, Module completion, that is, dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above.

The embodiments described above are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application; they shall all fall within the protection scope of this application.
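The claimed method presupposes a preset voiceprint recognition model that maps each participant's basic voiceprint feature to a participant identifier. Purely as an illustration, assuming features are fixed-length vectors compared by cosine similarity (an assumption, not a detail taken from the application), enrollment and matching could look like the sketch below; the genetic-algorithm training recited in the claims is deliberately elided as a comment.

    import numpy as np

    def build_model(enrollments):
        # enrollments: participant identifier -> basic voiceprint feature vector.
        # A production system would refine this initial model, e.g. with the
        # genetic-algorithm training recited in the claims; that step is omitted.
        return {pid: vec / np.linalg.norm(vec)
                for pid, vec in enrollments.items()}

    def identify(target, model) -> str:
        # Return the participant identifier whose basic voiceprint feature is
        # closest to the target voiceprint feature under cosine similarity.
        target = target / np.linalg.norm(target)
        return max(model, key=lambda pid: float(np.dot(model[pid], target)))

For example, identify(feature, build_model({"alice": f1, "bob": f2})) returns whichever enrolled identifier is the better cosine match for the extracted feature.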

Claims (20)

1. A conference information recording method, comprising:
if a voice data transmission request from a collection channel is received, acquiring the voice data and the current time;
performing voiceprint feature extraction on the voice data to obtain a target voiceprint feature of the voice data;
performing voiceprint recognition on the target voiceprint feature according to a preset voiceprint recognition model to determine a target participant identifier corresponding to the target voiceprint feature, wherein the voiceprint recognition model comprises basic voiceprint features of the participants, participant identifiers, and the correspondences between the basic voiceprint features and the participant identifiers;
performing speech recognition on the voice data to generate real-time speech text; and
recording the target participant identifier, the current time, and the real-time speech text in association with one another, to obtain speech information of the target participant corresponding to the target participant identifier.

2. The conference information recording method according to claim 1, wherein performing voiceprint feature extraction on the voice data to obtain the target voiceprint feature of the voice data comprises:
performing voiceprint parsing on the voice data to obtain an initial voiceprint;
performing pre-emphasis processing on the initial voiceprint to generate a pre-emphasized voiceprint with a flat frequency spectrum;
performing framing and windowing on the pre-emphasized voiceprint to obtain first basic speech frames;
performing silence separation on the first basic speech frames to obtain first target speech frames; and
performing voiceprint feature extraction on each of the first target speech frames to obtain the target voiceprint feature.

3. The conference information recording method according to claim 1, wherein performing speech recognition on the voice data to generate the real-time speech text comprises:
performing speech parsing on the voice data to obtain a frame set containing second basic speech frames;
performing silence detection on the second basic speech frames to obtain K consecutive silent frames in the second basic speech frames, where K is a natural number;
dividing the second basic speech frames contained in the frame set into K+1 second target speech frames according to the K silent frames; and
converting each of the second target speech frames into the real-time speech text.
4. The conference information recording method according to claim 1, wherein before performing voiceprint recognition on the target voiceprint feature according to the preset voiceprint recognition model to determine the target participant identifier corresponding to the target voiceprint feature, the method further comprises:
collecting voiceprint features of each participant to obtain the basic voiceprint feature of each participant;
acquiring the identifier of each participant, and constructing an initial voiceprint recognition model based on the basic voiceprint features of the participants, the participant identifiers, and the correspondences between the basic voiceprint features of the participants and the participant identifiers; and
training the initial voiceprint recognition model with a genetic algorithm to generate the preset voiceprint recognition model.

5. The conference information recording method according to claim 1, wherein after performing speech recognition on the voice data to generate the real-time speech text, and before recording the target participant identifier, the current time, and the real-time speech text in association with one another to obtain the speech information of the target participant in the current conference, the method further comprises:
sending the real-time speech text to a monitoring terminal for conference information recording; and
if a modified text fed back by the monitoring terminal is received, replacing the real-time speech text with the modified text.
6. A conference information recording apparatus, comprising:
a data collection module, configured to acquire voice data and the current time if a voice data transmission request from a collection channel is received;
a feature extraction module, configured to perform voiceprint feature extraction on the voice data to obtain a target voiceprint feature of the voice data;
a participant determination module, configured to perform voiceprint recognition on the target voiceprint feature according to a preset voiceprint recognition model and determine a target participant identifier corresponding to the target voiceprint feature, wherein the voiceprint recognition model comprises basic voiceprint features of the participants, participant identifiers, and the correspondences between the basic voiceprint features and the participant identifiers;
a speech recognition module, configured to perform speech recognition on the voice data to generate real-time speech text; and
a record generation module, configured to record the target participant identifier, the current time, and the real-time speech text in association with one another, to obtain speech information of the target participant corresponding to the target participant identifier.

7. The conference information recording apparatus according to claim 6, wherein the feature extraction module comprises:
a voiceprint parsing unit, configured to perform voiceprint parsing on the voice data to obtain an initial voiceprint;
a pre-emphasis unit, configured to perform pre-emphasis processing on the initial voiceprint to generate a pre-emphasized voiceprint with a flat frequency spectrum;
a framing and windowing unit, configured to perform framing and windowing on the pre-emphasized voiceprint to obtain first basic speech frames;
a silence separation unit, configured to perform silence separation on the first basic speech frames to obtain first target speech frames; and
a feature extraction unit, configured to perform voiceprint feature extraction on each of the first target speech frames to obtain the target voiceprint feature.
8. The conference information recording apparatus according to claim 6, wherein the speech recognition module comprises:
a speech parsing unit, configured to perform speech parsing on the voice data to obtain a frame set containing second basic speech frames;
a silence detection unit, configured to perform silence detection on the second basic speech frames to obtain K consecutive silent frames in the second basic speech frames, where K is a natural number;
a speech segmentation unit, configured to divide the second basic speech frames contained in the frame set into K+1 second target speech frames according to the K silent frames; and
a text conversion unit, configured to convert each of the second target speech frames into the real-time speech text.

9. The conference information recording apparatus according to claim 6, further comprising:
a voiceprint collection module, configured to collect voiceprint features of each participant to obtain the basic voiceprint feature of each participant;
a model construction module, configured to acquire the identifier of each participant and construct an initial voiceprint recognition model based on the basic voiceprint features of the participants, the participant identifiers, and the correspondences between the basic voiceprint features of the participants and the participant identifiers; and
a model training module, configured to train the initial voiceprint recognition model with a genetic algorithm to generate the preset voiceprint recognition model.

10. The conference information recording apparatus according to claim 6, further comprising:
a text transmission module, configured to send the real-time speech text to a monitoring terminal for conference information recording; and
a text replacement module, configured to replace the real-time speech text with a modified text if the modified text fed back by the monitoring terminal is received.
11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
if a voice data transmission request from a collection channel is received, acquiring the voice data and the current time;
performing voiceprint feature extraction on the voice data to obtain a target voiceprint feature of the voice data;
performing voiceprint recognition on the target voiceprint feature according to a preset voiceprint recognition model to determine a target participant identifier corresponding to the target voiceprint feature, wherein the voiceprint recognition model comprises basic voiceprint features of the participants, participant identifiers, and the correspondences between the basic voiceprint features and the participant identifiers;
performing speech recognition on the voice data to generate real-time speech text; and
recording the target participant identifier, the current time, and the real-time speech text in association with one another, to obtain speech information of the target participant corresponding to the target participant identifier.

12. The computer device according to claim 11, wherein performing voiceprint feature extraction on the voice data to obtain the target voiceprint feature of the voice data comprises:
performing voiceprint parsing on the voice data to obtain an initial voiceprint;
performing pre-emphasis processing on the initial voiceprint to generate a pre-emphasized voiceprint with a flat frequency spectrum;
performing framing and windowing on the pre-emphasized voiceprint to obtain first basic speech frames;
performing silence separation on the first basic speech frames to obtain first target speech frames; and
performing voiceprint feature extraction on each of the first target speech frames to obtain the target voiceprint feature.

13. The computer device according to claim 11, wherein performing speech recognition on the voice data to generate the real-time speech text comprises:
performing speech parsing on the voice data to obtain a frame set containing second basic speech frames;
performing silence detection on the second basic speech frames to obtain K consecutive silent frames in the second basic speech frames, where K is a natural number;
dividing the second basic speech frames contained in the frame set into K+1 second target speech frames according to the K silent frames; and
converting each of the second target speech frames into the real-time speech text.
14. The computer device according to claim 11, wherein before performing voiceprint recognition on the target voiceprint feature according to the preset voiceprint recognition model to determine the target participant identifier corresponding to the target voiceprint feature, the processor, when executing the computer-readable instructions, further implements the following steps:
collecting voiceprint features of each participant to obtain the basic voiceprint feature of each participant;
acquiring the identifier of each participant, and constructing an initial voiceprint recognition model based on the basic voiceprint features of the participants, the participant identifiers, and the correspondences between the basic voiceprint features of the participants and the participant identifiers; and
training the initial voiceprint recognition model with a genetic algorithm to generate the preset voiceprint recognition model.

15. The computer device according to claim 11, wherein after performing speech recognition on the voice data to generate the real-time speech text, and before recording the target participant identifier, the current time, and the real-time speech text in association with one another to obtain the speech information of the target participant in the current conference, the processor, when executing the computer-readable instructions, further implements the following steps:
sending the real-time speech text to a monitoring terminal for conference information recording; and
if a modified text fed back by the monitoring terminal is received, replacing the real-time speech text with the modified text.
16. One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
if a voice data transmission request from a collection channel is received, acquiring the voice data and the current time;
performing voiceprint feature extraction on the voice data to obtain a target voiceprint feature of the voice data;
performing voiceprint recognition on the target voiceprint feature according to a preset voiceprint recognition model to determine a target participant identifier corresponding to the target voiceprint feature, wherein the voiceprint recognition model comprises basic voiceprint features of the participants, participant identifiers, and the correspondences between the basic voiceprint features and the participant identifiers;
performing speech recognition on the voice data to generate real-time speech text; and
recording the target participant identifier, the current time, and the real-time speech text in association with one another, to obtain speech information of the target participant corresponding to the target participant identifier.

17. The non-volatile readable storage media according to claim 16, wherein performing voiceprint feature extraction on the voice data to obtain the target voiceprint feature of the voice data comprises:
performing voiceprint parsing on the voice data to obtain an initial voiceprint;
performing pre-emphasis processing on the initial voiceprint to generate a pre-emphasized voiceprint with a flat frequency spectrum;
performing framing and windowing on the pre-emphasized voiceprint to obtain first basic speech frames;
performing silence separation on the first basic speech frames to obtain first target speech frames; and
performing voiceprint feature extraction on each of the first target speech frames to obtain the target voiceprint feature.

18. The non-volatile readable storage media according to claim 16, wherein performing speech recognition on the voice data to generate the real-time speech text comprises:
performing speech parsing on the voice data to obtain a frame set containing second basic speech frames;
performing silence detection on the second basic speech frames to obtain K consecutive silent frames in the second basic speech frames, where K is a natural number;
dividing the second basic speech frames contained in the frame set into K+1 second target speech frames according to the K silent frames; and
converting each of the second target speech frames into the real-time speech text.
19. The non-volatile readable storage media according to claim 16, wherein before performing voiceprint recognition on the target voiceprint feature according to the preset voiceprint recognition model to determine the target participant identifier corresponding to the target voiceprint feature, the computer-readable instructions, when executed by the one or more processors, further cause the one or more processors to perform the following steps:
collecting voiceprint features of each participant to obtain the basic voiceprint feature of each participant;
acquiring the identifier of each participant, and constructing an initial voiceprint recognition model based on the basic voiceprint features of the participants, the participant identifiers, and the correspondences between the basic voiceprint features of the participants and the participant identifiers; and
training the initial voiceprint recognition model with a genetic algorithm to generate the preset voiceprint recognition model.

20. The non-volatile readable storage media according to claim 16, wherein after performing speech recognition on the voice data to generate the real-time speech text, and before recording the target participant identifier, the current time, and the real-time speech text in association with one another to obtain the speech information of the target participant in the current conference, the computer-readable instructions, when executed by the one or more processors, further cause the one or more processors to perform the following steps:
sending the real-time speech text to a monitoring terminal for conference information recording; and
if a modified text fed back by the monitoring terminal is received, replacing the real-time speech text with the modified text.
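Claims 2 and 3 above (mirrored in claims 7-8, 12-13, and 17-18) describe a conventional speech front end: pre-emphasis, framing with a window, and silence-based segmentation into K+1 speech segments. The numpy sketch below illustrates those steps; the 0.97 pre-emphasis coefficient, the Hamming window, the frame sizes, and the energy threshold are all assumptions for illustration, not values taken from the application.

    import numpy as np

    def preemphasize(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
        # Flatten the spectrum: y[n] = x[n] - alpha * x[n-1].
        return np.append(signal[0], signal[1:] - alpha * signal[:-1])

    def frame_and_window(signal: np.ndarray, frame_len: int = 400,
                         hop: int = 160) -> np.ndarray:
        # Split into overlapping frames and apply a Hamming window to each.
        if len(signal) < frame_len:
            return np.empty((0, frame_len))
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop:i * hop + frame_len]
                           for i in range(n_frames)])
        return frames * np.hamming(frame_len)

    def split_on_silence(frames: np.ndarray, threshold: float = 1e-4) -> list:
        # Frames whose mean energy falls below the threshold count as silent;
        # each run of silent frames ends the current segment, so K silent runs
        # yield at most K + 1 voiced segments (each a stack of frames).
        energies = (frames ** 2).mean(axis=1)
        segments, current = [], []
        for frame, energy in zip(frames, energies):
            if energy < threshold:
                if current:
                    segments.append(np.stack(current))
                    current = []
            else:
                current.append(frame)
        if current:
            segments.append(np.stack(current))
        return segments

Chaining the three, split_on_silence(frame_and_window(preemphasize(x))) groups a signal x into voiced segments that a recognizer could transcribe one at a time.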
PCT/CN2018/094370 2018-05-29 2018-07-03 Conference information recording method and apparatus, computer device, and storage medium Ceased WO2019227579A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810528516.1 2018-05-29
CN201810528516.1A CN108922538B (en) 2018-05-29 2018-05-29 Conference information recording method, conference information recording device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2019227579A1 true WO2019227579A1 (en) 2019-12-05

Family

ID=64419574

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094370 Ceased WO2019227579A1 (en) 2018-05-29 2018-07-03 Conference information recording method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN108922538B (en)
WO (1) WO2019227579A1 (en)

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109741754A (en) * 2018-12-10 2019-05-10 上海思创华信信息技术有限公司 A kind of conference voice recognition methods and system, storage medium and terminal
CN109560941A (en) * 2018-12-12 2019-04-02 深圳市沃特沃德股份有限公司 Minutes method, apparatus, intelligent terminal and storage medium
CN109474763A (en) * 2018-12-21 2019-03-15 深圳市智搜信息技术有限公司 A kind of AI intelligent meeting system and its implementation based on voice, semanteme
CN109960743A (en) * 2019-01-16 2019-07-02 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for distinguishing conference content
CN109887508A (en) * 2019-01-25 2019-06-14 广州富港万嘉智能科技有限公司 A kind of meeting automatic record method, electronic equipment and storage medium based on vocal print
CN111667837A (en) * 2019-02-21 2020-09-15 奇酷互联网络科技(深圳)有限公司 Conference record acquisition method, intelligent terminal and device with storage function
CN111627463B (en) * 2019-02-28 2024-01-16 百度在线网络技术(北京)有限公司 Voice VAD tail point determination method and device, electronic equipment and computer readable medium
CN111768769A (en) * 2019-03-15 2020-10-13 阿里巴巴集团控股有限公司 Voice interaction method, device, equipment and storage medium
CN110335621A (en) * 2019-05-28 2019-10-15 深圳追一科技有限公司 Method, system and the relevant device of audio processing
CN110298252A (en) * 2019-05-30 2019-10-01 平安科技(深圳)有限公司 Meeting summary generation method, device, computer equipment and storage medium
CN110400566B (en) * 2019-06-27 2022-05-31 联想(北京)有限公司 Identification method and electronic equipment
CN110472097A (en) * 2019-07-03 2019-11-19 平安科技(深圳)有限公司 Melody automatic classification method, device, computer equipment and storage medium
CN112420030A (en) * 2019-08-22 2021-02-26 杭州海康威视数字技术股份有限公司 A patrol recording method, device and electronic device
CN110650250B (en) * 2019-09-25 2022-03-08 携程旅游信息技术(上海)有限公司 Method, system, device and storage medium for processing voice dialogue
CN110648553B (en) * 2019-09-26 2021-05-28 北京声智科技有限公司 Site reminding method, electronic equipment and computer readable storage medium
CN110931022B (en) * 2019-11-19 2023-09-15 天津大学 Voiceprint recognition method based on high-low frequency dynamic and static characteristics
CN110808062B (en) * 2019-11-26 2022-12-13 秒针信息技术有限公司 Hybrid speech separation method and device
CN114556469B (en) * 2019-12-20 2025-10-17 深圳市欢太科技有限公司 Data processing method, device, electronic equipment and storage medium
CN111177353B (en) * 2019-12-27 2023-06-09 赣州得辉达科技有限公司 Text record generation method, device, computer equipment and storage medium
CN111192587A (en) * 2019-12-27 2020-05-22 拉克诺德(深圳)科技有限公司 Voice data matching method and device, computer equipment and storage medium
WO2021134284A1 (en) * 2019-12-30 2021-07-08 深圳市欢太科技有限公司 Voice information processing method, hub device, control terminal and storage medium
WO2021134720A1 (en) * 2019-12-31 2021-07-08 华为技术有限公司 Method for processing conference data and related device
CN113139392B (en) * 2020-01-17 2023-08-15 青岛海信商用显示股份有限公司 Conference summary generation method, device and storage medium
CN111432157B (en) * 2020-02-18 2023-04-07 视联动力信息技术股份有限公司 Conference processing method, device, equipment and storage medium based on video networking
CN111429916B (en) * 2020-02-20 2023-06-09 西安声联科技有限公司 A sound signal recording system
CN111477235B (en) * 2020-04-15 2023-05-05 厦门快商通科技股份有限公司 Voiceprint acquisition method, voiceprint acquisition device and voiceprint acquisition equipment
CN111312260A (en) * 2020-04-16 2020-06-19 厦门快商通科技股份有限公司 Human voice separation method, device and equipment
CN111583953A (en) * 2020-04-30 2020-08-25 厦门快商通科技股份有限公司 Voiceprint feature-based voice separation method, device and equipment
CN111785275A (en) * 2020-06-30 2020-10-16 北京捷通华声科技股份有限公司 Speech recognition method and device
CN113963694B (en) * 2020-07-20 2024-11-15 中移(苏州)软件技术有限公司 A speech recognition method, speech recognition device, electronic device and storage medium
CN112017632A (en) * 2020-09-02 2020-12-01 浪潮云信息技术股份公司 Automatic conference record generation method
CN112099628A (en) * 2020-09-08 2020-12-18 平安科技(深圳)有限公司 VR interaction method and device based on artificial intelligence, computer equipment and medium
CN112216306B (en) * 2020-09-25 2024-08-02 广东电网有限责任公司佛山供电局 Voiceprint-based call management method and device, electronic equipment and storage medium
CN112201275B (en) * 2020-10-09 2024-05-07 深圳前海微众银行股份有限公司 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN112751837A (en) * 2020-12-25 2021-05-04 苏州星舟知识产权代理有限公司 Open type synchronous online conference system
CN112992174A (en) * 2021-02-03 2021-06-18 深圳壹秘科技有限公司 Voice analysis method and voice recording device thereof
CN113691382A (en) * 2021-08-25 2021-11-23 平安国际智慧城市科技股份有限公司 Conference recording method, device, computer equipment and medium
CN113870892B (en) * 2021-09-26 2024-10-11 平安科技(深圳)有限公司 Conference recording method, device, equipment and storage medium based on voice recognition
CN113920986B (en) * 2021-09-29 2025-10-17 中国平安人寿保险股份有限公司 Conference record generation method, device, equipment and storage medium
CN114640826B (en) * 2022-03-23 2023-11-03 北京有竹居网络技术有限公司 Data processing methods, devices, readable media and electronic equipment
CN115396627B (en) * 2022-08-24 2025-02-14 易讯科技股份有限公司 A positioning management method and system for screen recording video conference
CN115938368A (en) * 2022-11-29 2023-04-07 华迪计算机集团有限公司 Speech recognition management system and method
CN115713939B (en) * 2023-01-06 2023-04-21 阿里巴巴达摩院(杭州)科技有限公司 Voice recognition method and device and electronic equipment
CN117873628B (en) * 2024-01-12 2025-05-27 中移互联网有限公司 Method and device for displaying data in video conference and video conference system
CN118782073A (en) * 2024-07-09 2024-10-15 联通(山西)产业互联网有限公司 Conference transcription method, device and system for intelligent speech separation and recognition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030081751A1 (en) * 2001-10-31 2003-05-01 International Business Machines Corporation Apparatus and method for providing conference call roster information with speaker voice identification
CN103391201A (en) * 2013-08-05 2013-11-13 公安部第三研究所 System and method for realizing smart card authentication based on voiceprint recognition
CN105975569A (en) * 2016-05-03 2016-09-28 深圳市金立通信设备有限公司 Voice processing method and terminal
CN106782545A (en) * 2016-12-16 2017-05-31 广州视源电子科技股份有限公司 System and method for converting audio and video data into character records
CN107564531A (en) * 2017-08-25 2018-01-09 百度在线网络技术(北京)有限公司 Minutes method, apparatus and computer equipment based on vocal print feature

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004279897A (en) * 2003-03-18 2004-10-07 Nippon Telegr & Teleph Corp <Ntt> Voice communication record generation method, apparatus and program
CN102480671B (en) * 2010-11-26 2014-10-08 华为终端有限公司 Audio processing method and device in video communication
CN103390410A (en) * 2012-05-10 2013-11-13 宏碁股份有限公司 Telephone conference system and method
CN104427292A (en) * 2013-08-22 2015-03-18 中兴通讯股份有限公司 Method and device for extracting a conference summary
CN104575504A (en) * 2014-12-24 2015-04-29 上海师范大学 Method for personalized television voice wake-up by voiceprint and voice identification
CN104732972B (en) * 2015-03-12 2018-02-27 广东外语外贸大学 A kind of HMM Application on Voiceprint Recognition based on classified statistics is registered method and system
CN105096955B (en) * 2015-09-06 2019-02-01 广东外语外贸大学 A kind of speaker's method for quickly identifying and system based on model growth cluster
CN107333090B (en) * 2016-04-29 2020-04-07 中国电信股份有限公司 Video conference data processing method and platform

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032607A (en) * 2019-12-09 2021-06-25 深圳云天励飞技术有限公司 Key person analysis method and device, electronic equipment and storage medium
CN112053692A (en) * 2020-09-24 2020-12-08 上海明略人工智能(集团)有限公司 Speech recognition processing method, device and storage medium
CN112053692B (en) * 2020-09-24 2024-01-12 上海明略人工智能(集团)有限公司 Speech recognition processing method, device and storage medium
CN112201253A (en) * 2020-11-09 2021-01-08 平安普惠企业管理有限公司 Character marking method and device, electronic equipment and computer readable storage medium
CN112201253B (en) * 2020-11-09 2023-08-25 观华(广州)电子科技有限公司 Text marking method, text marking device, electronic equipment and computer readable storage medium
CN114792522A (en) * 2021-01-26 2022-07-26 阿里巴巴集团控股有限公司 Audio signal processing method, conference recording and presenting method, apparatus, system and medium
CN115100701A (en) * 2021-03-08 2022-09-23 福建福清核电有限公司 Conference speaker identity identification method based on artificial intelligence technology
CN113132542A (en) * 2021-05-06 2021-07-16 上海纽偲钽文化传媒有限公司 Internet-based online conference system and method
CN113628638B (en) * 2021-07-30 2024-05-28 深圳海翼智新科技有限公司 Audio processing method, device, equipment and storage medium
CN113628638A (en) * 2021-07-30 2021-11-09 深圳海翼智新科技有限公司 Audio processing method, apparatus, device and storage medium
CN114974304A (en) * 2022-05-18 2022-08-30 建信金融科技有限责任公司 Method, device, equipment and medium for marking voice endpoint detection data
CN115831118A (en) * 2022-11-29 2023-03-21 零犀(北京)科技有限公司 Method, device, storage medium and electronic equipment for speech separation and recognition
CN119495301A (en) * 2023-08-15 2025-02-21 广州视源电子科技股份有限公司 Audio processing method, device, system, storage medium and electronic device
CN119989322A (en) * 2025-04-16 2025-05-13 广东公信智能会议股份有限公司 A method and system for managing information of participants
CN120108402A (en) * 2025-05-07 2025-06-06 北京华建云鼎科技股份公司 Voice control method and device for conference
CN120388567A (en) * 2025-06-26 2025-07-29 杭州秋果计划科技有限公司 A method and device for processing voice information for smart glasses and smart glasses

Also Published As

Publication number Publication date
CN108922538B (en) 2023-04-07
CN108922538A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN108922538B (en) Conference information recording method, conference information recording device, computer equipment and storage medium
JP6792089B2 (en) Voiceprint recognition methods, devices, terminal devices and storage media
CN108833722B (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN110298252A (en) Meeting summary generation method, device, computer equipment and storage medium
US8731936B2 (en) Energy-efficient unobtrusive identification of a speaker
Nolan et al. A case for formant analysis in forensic speaker identification
Lu et al. Speakersense: Energy efficient unobtrusive speaker identification on mobile phones
CN111785275A (en) Speech recognition method and device
CN111312219B (en) Telephone recording annotation method, system, storage medium and electronic device
CN110300001A (en) Conference audio control method, system, equipment and computer readable storage medium
US20170270930A1 (en) Voice tallying system
CN108806696A (en) Establish method, apparatus, computer equipment and the storage medium of sound-groove model
CN103730112A (en) Multi-channel voice simulation and acquisition method
CN112151055B (en) Audio processing method and device
CN115831125A (en) Speech recognition method, device, equipment, storage medium and product
CN117457008A (en) Multi-person voiceprint recognition method and device based on telephone channel
CN117198324A (en) A method, device and system for bird sound recognition based on clustering model
CN117992597B (en) Information feedback method, device, computer equipment and computer storage medium
CN119314476A (en) Human voice recognition method and system based on multi-person conference records
CN114944145B (en) A method and system for sound reproduction
CN118645107B (en) Conference coordination method, device, system and electronic equipment based on intelligent audio
Feng et al. Multi-modal Speech Enhancement with Limited Electromyography Channels
JP7296214B2 (en) speech recognition system
US20250191592A1 (en) Systems and methods for improved automatic speech recognition accuracy
CN118430538A (en) Error correction multi-mode model construction method, system, equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18920745

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.03.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18920745

Country of ref document: EP

Kind code of ref document: A1