

CN112837688B - Voice transcription method, device, related system and equipment

Info

Publication number
CN112837688B
Authority
CN
China
Prior art keywords: information, voice, voice data, determining, text sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911159513.6A
Other languages
Chinese (zh)
Other versions
CN112837688A (en)
Inventor
陈梦喆
陈谦
李博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201911159513.6A
Priority to PCT/CN2020/128950
Publication of CN112837688A
Application granted
Publication of CN112837688B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice recognition method, a voice recognition device, a related system, and electronic equipment. The method comprises the following steps: determining a first text sequence corresponding to voice data to be recognized; determining acoustic feature information of the voice data; and determining, according to the first text sequence and the acoustic feature information, a second text sequence that corresponds to the voice data and includes punctuation information. With this processing mode, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; using the acoustic feature information better captures the speaker's intent and yields punctuation marks that conform more closely to the spoken language. The recognition accuracy of punctuation marks in voice text can therefore be effectively improved.

Description

Voice transcription method, device, related system and equipment
Technical Field
The application relates to the technical field of data processing, and in particular to a voice interaction system, method, and device; a voice transcription system, method, and device; a voice recognition method and device; a method and device for constructing a voice-text punctuation mark prediction model; a voice processing method; ordering equipment; an intelligent sound box; voice transcription equipment; and electronic equipment.
Background
A speech transcription system is a speech processing system capable of transcribing speech into text. Such a system can automatically produce a meeting summary, improving meeting efficiency, making meetings more effective, avoiding waste of manpower, material, and financial resources, and reducing meeting costs.
Real-time speech transcription systems typically output text without punctuation, which imposes a high reading cost on users. To ensure that the text recognized by an automatic speech recognition (ASR) system provides a good on-screen reading experience, after the ASR system obtains the decoding result of the voice data, punctuation marks are added to the ASR decoding result by a punctuation mark prediction model to facilitate reading. Punctuation prediction is the task of deciding the punctuation of the current text, and a typical punctuation prediction method adopts the following processing mode: predicting the punctuation marks that may appear in the spoken text obtained by ASR decoding, based on the text semantics of that spoken text.
However, in the process of implementing the present invention, the inventors found that this technical solution has at least the following problem: it considers only text semantics when predicting punctuation marks, yet spoken-language corpora are sometimes semantically incomplete, so labeling purely by semantics often yields unsatisfactory results. In summary, the existing scheme suffers from low punctuation recognition accuracy for voice text.
Disclosure of Invention
The application provides a voice transcription system to solve the problem in the prior art that punctuation marks in voice text cannot be recognized correctly. The application additionally provides a voice transcription method and device, a voice recognition method and device, a method and device for constructing a voice-text punctuation mark prediction model, a voice interaction system, method, and device, a voice processing method, ordering equipment, an intelligent sound box, voice transcription equipment, and electronic equipment.
The application provides a speech transcription system, comprising:
the server side is used for receiving the voice data to be transcribed sent by the client side; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; returning the second text sequence to the client;
the client is used for collecting the voice data and sending the voice data to the server; and receiving the second text sequence returned by the server side, and displaying the second text sequence.
The application also provides a voice transcription method, which comprises the following steps:
Receiving voice data to be transcribed sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic characteristic information of the voice data;
determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information;
and sending the second text sequence back to the client.
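For orientation, here is a minimal, hypothetical Python sketch of these four server-side steps wired together; the three helper functions are illustrative stubs, not components disclosed by the application:

```python
# A minimal sketch of the server-side flow, assuming hypothetical helpers.
# None of these function names come from the application itself.

def recognize_text(voice_data: bytes) -> str:
    """Stub for AM + LM decoding; returns an unpunctuated first text sequence."""
    return "hello world how are you"

def extract_acoustic_features(voice_data: bytes) -> list:
    """Stub for frame-level acoustic features (e.g. fbank) from the acoustic model."""
    return [[0.0] * 40]  # one hypothetical 40-dimensional frame

def predict_punctuation(first_text: str, feats: list) -> str:
    """Stub for the punctuation prediction model combining text and acoustics."""
    return first_text.capitalize() + "."

def transcribe(voice_data: bytes) -> str:
    first_text = recognize_text(voice_data)        # determine the first text sequence
    feats = extract_acoustic_features(voice_data)  # determine acoustic feature information
    return predict_punctuation(first_text, feats)  # second text sequence, returned to the client
```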
Optionally, the punctuation information includes punctuation information related to text semantic information and the acoustic feature information of the voice data.
Optionally, the acoustic characteristic information includes at least one of the following information:
a bottleneck feature, an fbank feature, word duration, post-word silence duration, a pitch feature.
Optionally, the determining a first text sequence corresponding to the voice data includes:
determining the first text sequence through an acoustic model and a language model;
the determining the acoustic characteristic information of the voice data comprises the following steps:
and acquiring the acoustic characteristic information output by the acoustic model.
Optionally, the determining, according to the first text sequence and the acoustic feature information, a second text sequence corresponding to the voice data and including punctuation information includes:
Determining first punctuation information related to text semantic information of the voice data according to the first text sequence through a first punctuation prediction sub-network included in a punctuation prediction model;
determining second punctuation information related to text semantic information and the acoustic feature information of the voice data according to the first punctuation information and the acoustic feature information through a second punctuation prediction sub-network included in the punctuation prediction model;
and determining the second text sequence according to the second punctuation information and the first text sequence.
Optionally, the first punctuation prediction sub-network includes at least one Transformer layer;
the second punctuation prediction sub-network includes at least one Transformer layer.
Optionally, the determining, by the second punctuation prediction sub-network included in the punctuation prediction model, the second punctuation information according to the first punctuation information and the acoustic feature information includes:
determining acoustic feature information of each word in the first text sequence;
and taking the word as a unit, taking paired data of the first punctuation information and the acoustic characteristic information corresponding to each word as input data of the second punctuation prediction sub-network, and determining second punctuation information of each word through the second punctuation prediction sub-network.
Optionally, the acoustic feature information of the voice data includes acoustic feature information of a plurality of data frames in units of voice data frames;
the determining the acoustic feature information of the word includes:
and determining the acoustic characteristic information of the word from the acoustic characteristic information of a plurality of data frames related to the word according to the time information of the plurality of data frames.
Optionally, the acoustic model includes a module with one of the following network structures: a deep feedforward sequential memory network (DFSMN), or a bidirectional long short-term memory network (BLSTM);
the determining the acoustic feature information of the word from the acoustic feature information of the plurality of data frames according to the time information of the plurality of data frames related to the word comprises the following steps:
and taking the acoustic feature information of the last data frame related to the word as the acoustic feature information of the word, where the acoustic feature information of the last data frame incorporates that of the plurality of data frames.
Optionally, the method further comprises:
and learning from the corresponding relation set between the voice data and the text sequence comprising punctuation mark marking information to obtain the punctuation mark prediction model.
The application also provides a voice transcription method, which comprises the following steps:
Collecting voice data to be transcribed;
sending the voice data to a server;
receiving a second text sequence which is returned by the server and corresponds to the voice data and comprises punctuation mark information;
displaying the second text sequence;
wherein the second text sequence is determined by: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
The application also provides a voice transcription device, comprising:
the voice data receiving unit is used for receiving voice data to be transcribed, which is sent by the client;
a first text sequence generating unit for determining a first text sequence corresponding to the voice data;
an acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information;
And the second text sequence returning unit is used for returning the second text sequence to the client.
The application also provides a voice transcription device, comprising:
the voice data acquisition unit is used for acquiring voice data to be transcribed;
a voice data transmitting unit, configured to transmit the voice data to a server;
the second text sequence receiving unit is used for receiving a second text sequence which is returned by the server and corresponds to the voice data and comprises punctuation information;
a second text sequence display unit, configured to display the second text sequence;
wherein the second text sequence is determined by: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
The application also provides an electronic device comprising:
a processor; and
and a memory for storing a program implementing the voice transcription method; after the device is powered on and the program of the voice transcription method is run by the processor, the following steps are executed: receiving voice data to be transcribed sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information; and sending the second text sequence back to the client.
The application also provides a voice transcription device, comprising:
a processor; and
and a memory for storing a program implementing the voice transcription method; after the device is powered on and the program of the voice transcription method is run by the processor, the following steps are executed: collecting voice data to be transcribed; sending the voice data to a server; receiving a second text sequence which corresponds to the voice data, includes punctuation mark information, and is returned by the server; displaying the second text sequence; wherein the second text sequence is determined as follows: the server receives the voice data; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information; and sends the second text sequence back to the client.
The application also provides a method for constructing a voice-text punctuation mark prediction model, comprising:
determining a corresponding relation set among words, word acoustic feature information related to voice data to which the words belong and word punctuation mark labeling information;
Constructing a network structure of a voice text punctuation mark prediction model;
and learning from the corresponding relation set to obtain the punctuation mark prediction model.
Optionally, the corresponding relation set is determined in the following manner:
and determining the corresponding relation set among the words, the acoustic feature information of the words related to the voice data to which the words belong and the punctuation mark information according to the corresponding relation set between the voice data and the text sequence comprising the punctuation mark information.
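As an illustration only, the sketch below builds such per-word training triples from a punctuated token sequence and per-word acoustic features; the tokenization convention and the label inventory are assumptions made here, not fixed by the application:

```python
from typing import List, Tuple

# Assumed label inventory for illustration; the application does not fix the classes.
PUNCT_LABELS = {"，": "COMMA", "。": "PERIOD", "？": "QUESTION"}

def build_correspondence_set(
        tokens: List[str],             # transcript tokens, punctuation as separate tokens
        word_feats: List[List[float]], # one acoustic feature vector per (non-punct) word
) -> List[Tuple[str, List[float], str]]:
    """Pair each word with its acoustic features and the punctuation label
    (if any) that immediately follows it; otherwise label it 'NONE'."""
    examples, feat_iter, i = [], iter(word_feats), 0
    while i < len(tokens):
        word, feats, label = tokens[i], next(feat_iter), "NONE"
        if i + 1 < len(tokens) and tokens[i + 1] in PUNCT_LABELS:
            label = PUNCT_LABELS[tokens[i + 1]]
            i += 1  # consume the punctuation token
        examples.append((word, feats, label))
        i += 1
    return examples

# build_correspondence_set(["你好", "，", "世界", "。"], [[0.1], [0.2]])
# -> [("你好", [0.1], "COMMA"), ("世界", [0.2], "PERIOD")]
```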
The application also provides a device for constructing a voice-text punctuation mark prediction model, comprising:
the data determining unit is used for determining a corresponding relation set among words, word acoustic feature information related to voice data to which the words belong and word punctuation mark marking information;
the network construction unit is used for constructing a network structure of the voice text punctuation mark prediction model;
and the model training unit is used for learning the punctuation mark prediction model from the corresponding relation set.
The application also provides an electronic device comprising:
a processor; and
and a memory for storing a program implementing the method for constructing a voice-text punctuation mark prediction model; after the device is powered on and the program is run by the processor, the following steps are executed: determining a correspondence set among words, word acoustic feature information related to the voice data to which the words belong, and word punctuation mark labeling information; constructing a network structure of the voice-text punctuation mark prediction model; and learning the punctuation mark prediction model from the correspondence set.
The application also provides a voice recognition method, which comprises the following steps:
determining a first text sequence corresponding to voice data to be recognized;
determining acoustic characteristic information of the voice data;
and determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information.
The application also provides a voice recognition device, comprising:
a first text sequence generating unit for determining a first text sequence corresponding to voice data to be recognized;
an acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
and the second text sequence generating unit is used for determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information.
The application also provides an electronic device comprising:
a processor; and
a memory for storing a program for implementing a voice recognition method, the apparatus being powered on and executing the program of the voice recognition method by the processor, and performing the steps of: determining a first text sequence corresponding to voice data to be recognized; determining acoustic characteristic information of the voice data; and determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information.
The application also provides a voice interaction system, comprising:
the server side is used for receiving a voice interaction request aiming at target voice data, which is sent by the client side; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; the voice reply information is returned to the client;
the client is used for determining the target voice data and sending the voice interaction request to the server; and receiving the voice reply information returned by the server and displaying the voice reply information.
The application also provides a voice interaction method, which comprises the following steps:
receiving a voice interaction request aiming at target voice data sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic characteristic information of the voice data;
determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information;
Determining voice reply information according to the second text sequence;
and sending the voice reply information back to the client.
The application also provides a voice interaction method, which comprises the following steps:
determining target voice data;
sending a voice interaction request aiming at the target voice data to a server;
receiving the voice reply information returned by the server;
displaying the voice reply information;
the voice reply information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and sending the voice reply information back to the client.
The application also provides a voice interaction device, comprising:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by the client;
a first text sequence generating unit for determining a first text sequence corresponding to the voice data;
An acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information;
the voice reply information determining unit is used for determining voice reply information according to the second text sequence;
and the voice reply information returning unit is used for returning the voice reply information to the client.
The application also provides a voice interaction device, comprising:
a voice data determining unit configured to determine target voice data;
a request sending unit, configured to send a voice interaction request for the target voice data to a server;
the voice reply information receiving unit is used for receiving the voice reply information returned by the server;
the voice reply information display unit is used for displaying the voice reply information;
the voice reply information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and sending the voice reply information back to the client.
The application also provides an electronic device comprising:
a processor; and
and a memory for storing a program implementing the voice interaction method; after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: receiving a voice interaction request aiming at target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information; determining voice reply information according to the second text sequence; and sending the voice reply information back to the client.
The application also provides an electronic device comprising:
a processor; and
and a memory for storing a program implementing the voice interaction method; after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: determining target voice data; sending a voice interaction request aiming at the target voice data to a server; receiving voice reply information returned by the server; displaying the voice reply information; the voice reply information being determined as follows: the server receives the voice interaction request; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information; determines voice reply information according to the second text sequence; and sends the voice reply information back to the client.
The application also provides a voice interaction system, comprising:
the server side is used for receiving a voice interaction request aiming at target voice data, which is sent by the client side; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; the voice instruction information is returned to the client;
the client is used for determining the voice data and sending the voice interaction request to the server; and receiving the voice instruction information returned by the server side and executing the voice instruction information.
The application also provides a voice interaction method, which comprises the following steps:
receiving a voice interaction request aiming at target voice data sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic characteristic information of the voice data;
determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information;
Determining voice instruction information according to the second text sequence;
and sending the voice instruction information back to the client.
The application also provides a voice interaction method, which comprises the following steps:
determining target voice data;
sending a voice interaction request aiming at the voice data to a server;
receiving voice instruction information returned by the server;
executing the voice instruction information;
the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and sending the voice instruction information back to the client.
The application also provides a voice interaction device, comprising:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by the client;
a first text sequence generating unit for determining a first text sequence corresponding to the voice data;
An acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information;
the voice instruction information determining unit is used for determining voice instruction information according to the second text sequence;
and the voice instruction information returning unit is used for returning the voice instruction information to the client.
The application also provides a voice interaction device, comprising:
a voice data determining unit configured to determine target voice data;
a request sending unit, configured to send a voice interaction request for the voice data to a server;
the voice instruction information receiving unit is used for receiving the voice instruction information returned by the server;
a voice instruction information execution unit configured to execute the voice instruction information;
the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and sending the voice instruction information back to the client.
The application also provides an electronic device comprising:
a processor; and
and a memory for storing a program implementing the voice interaction method; after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: receiving a voice interaction request aiming at target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information; determining voice instruction information according to the second text sequence; and sending the voice instruction information back to the client.
The application also provides an electronic device comprising:
a processor; and
and a memory for storing a program implementing the voice interaction method; after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: determining target voice data; sending a voice interaction request aiming at the voice data to a server; receiving voice instruction information returned by the server; executing the voice instruction information; the voice instruction information being determined as follows: the server receives the voice interaction request; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information; determines voice instruction information according to the second text sequence; and sends the voice instruction information back to the client.
Optionally, the apparatus includes: an intelligent sound box, a smart television, subway voice ticketing equipment, or ordering equipment.
The application also provides a voice processing method, which comprises the following steps:
collecting voice data to be transcribed;
determining a first text sequence corresponding to the voice data; and determining acoustic feature information of the voice data;
determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information;
and executing processing related to the second text sequence.
Optionally, if the speech processing condition is satisfied, executing the method;
the method further comprises the steps of:
if the voice processing condition is not satisfied, determining a first text sequence corresponding to the voice data; and determining a third text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence.
Optionally, the voice processing conditions include: the noise of the voice data acquisition environment is smaller than a noise threshold value, or the noise of the voice data acquisition environment is larger than the noise threshold value;
the method further comprises the steps of:
noise data of a speech data acquisition environment is determined.
Optionally, the method further comprises:
determining a noise threshold specified by a user;
the noise threshold is stored.
Optionally, the method further comprises: determining a target voice processing method designated by the user;
if the target voice processing method is the method, the voice processing condition is satisfied.
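A minimal sketch of this optional dispatch logic, with an assumed decibel threshold and stub pipelines standing in for the two text-generation paths:

```python
# Illustrative only: the threshold value and both stub pipelines are assumptions.

DEFAULT_NOISE_THRESHOLD_DB = 40.0  # assumed default; the application lets the user set it

def transcribe_with_acoustics(voice_data: bytes) -> str:
    return "second text sequence (semantics + acoustics)"   # hypothetical stub

def transcribe_text_only(voice_data: bytes) -> str:
    return "third text sequence (semantics only)"           # hypothetical stub

def process(voice_data: bytes, noise_db: float,
            threshold_db: float = DEFAULT_NOISE_THRESHOLD_DB) -> str:
    # Voice processing condition: acquisition-environment noise below the stored threshold.
    if noise_db < threshold_db:
        return transcribe_with_acoustics(voice_data)
    return transcribe_text_only(voice_data)
```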
Optionally, the method further comprises:
and displaying the voice processing progress information.
Optionally, the progress information includes at least one of the following information: the voice data acquisition is completed, the first text sequence determination is completed, the acoustic characteristic information determination is completed, and the second text sequence determination is completed.
The application also provides a ordering device, comprising:
a voice acquisition device;
a processor; and
and a memory for storing a program implementing the voice interaction method; after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: collecting voice ordering data of a first user; determining a first ordering text sequence corresponding to the voice ordering data; determining acoustic feature information of the voice ordering data; determining a second ordering text sequence which corresponds to the voice ordering data and includes punctuation mark information according to the first ordering text sequence and the acoustic feature information; and determining ordering information according to the second ordering text sequence, so that a second user can prepare meals according to the ordering information.
The application also provides an intelligent sound box, comprising:
a processor; and
and a memory for storing a program implementing the voice interaction method; after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: collecting voice data of a first user; determining a first text sequence corresponding to the voice data and acoustic feature information of the voice data; determining a second text sequence which corresponds to the voice data and includes punctuation information according to the first text sequence and the acoustic feature information; determining voice reply information and/or voice instruction information according to the second text sequence; and displaying the voice reply information and/or executing the voice instruction information.
The present application also provides a computer-readable storage medium having instructions stored therein that, when executed on a computer, cause the computer to perform the various methods described above.
The present application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the various methods described above.
Compared with the prior art, the application has the following advantages:
According to the voice recognition method provided by the embodiments of the application, a first text sequence corresponding to the voice data to be recognized is determined; acoustic feature information of the voice data is determined; and a second text sequence corresponding to the voice data and including punctuation information is determined according to the first text sequence and the acoustic feature information. With this processing mode, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; using the acoustic feature information better captures the speaker's intent and yields punctuation marks that conform more closely to the spoken language. The recognition accuracy of punctuation marks in voice text can therefore be effectively improved.
According to the voice interaction system provided by the embodiments of the application, the client determines target voice data and sends a voice interaction request for the voice data to the server; in response to the request, the server determines a first text sequence corresponding to the voice data, determines acoustic feature information of the voice data, determines a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information, determines voice reply information according to the second text sequence, and returns the voice reply information to the client, which receives and displays it. With this processing mode, punctuation information is predicted from both the text semantic information and the acoustic feature information of the voice data, better capturing the speaker's intent and yielding punctuation marks that conform more closely to the spoken language; the voice reply information is then determined based on a text sequence containing these more accurate punctuation marks. The accuracy of voice replies can therefore be effectively improved.
According to the voice interaction system provided by the embodiments of the application, the client determines target voice data and sends a voice interaction request for the voice data to the server; in response to the request, the server determines a first text sequence corresponding to the voice data, determines acoustic feature information of the voice data, determines a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information, determines voice instruction information according to the second text sequence, and returns the voice instruction information to the client, which executes it. With this processing mode, punctuation information is predicted from both the text semantic information and the acoustic feature information of the voice data, better capturing the speaker's intent and yielding punctuation marks that conform more closely to the spoken language; the voice instruction information is then determined based on a text sequence containing these more accurate punctuation marks. The accuracy of voice interaction can therefore be effectively improved.
According to the voice transcription system provided by the embodiments of the application, the client collects voice data and sends it to the server; the server determines a first text sequence corresponding to the voice data, determines acoustic feature information of the voice data, determines a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information, and sends the second text sequence back to the client, which receives and displays it. With this processing mode, punctuation information is predicted from both the text semantic information and the acoustic feature information of the voice data, better capturing the speaker's intent and yielding punctuation marks that conform more closely to the spoken language. The accuracy of voice transcription can therefore be effectively improved.
The method for constructing a voice-text punctuation mark prediction model provided by the embodiments of the application determines a correspondence set among words, word acoustic feature information related to the voice data to which the words belong, and word punctuation mark labeling information; constructs the network structure of the voice-text punctuation mark prediction model; and learns the punctuation mark prediction model from the correspondence set. With this processing mode, the model predicts punctuation information from both the text semantic information and the acoustic feature information of the voice data, better capturing the speaker's intent and yielding punctuation marks that conform more closely to the spoken language. The accuracy of the model can therefore be effectively improved.
According to the voice processing method provided by the embodiments of the application, voice data to be transcribed is collected; a first text sequence corresponding to the voice data and acoustic feature information of the voice data are determined; a second text sequence corresponding to the voice data and including punctuation information is determined according to the first text sequence and the acoustic feature information; and processing related to the second text sequence is executed. With this processing mode, punctuation information is predicted from both the text semantic information and the acoustic feature information of the voice data, better capturing the speaker's intent and yielding punctuation marks that conform more closely to the spoken language; the related processing is then executed based on a text sequence containing these more accurate punctuation marks. The accuracy of the related processing can therefore be effectively improved.
According to the ordering equipment provided by the embodiments of the application, voice ordering data of a first user is collected; a first ordering text sequence corresponding to the voice ordering data is determined; acoustic feature information of the voice ordering data is determined; a second ordering text sequence corresponding to the voice ordering data and including punctuation mark information is determined according to the first ordering text sequence and the acoustic feature information; and ordering information is determined according to the second ordering text sequence so that a second user can prepare meals accordingly. With this processing mode, punctuation information is predicted from both the text semantic information and the acoustic feature information of the voice ordering data, better capturing the orderer's intent and yielding punctuation marks that conform more closely to the spoken language; the ordering information (such as dish names and personal taste requirements) is then determined based on an ordering text containing these more accurate punctuation marks. Ordering accuracy can therefore be effectively improved, improving the user experience.
According to the intelligent sound box provided by the embodiments of the application, voice data of a first user is collected; a first text sequence corresponding to the voice data and acoustic feature information of the voice data are determined; a second text sequence corresponding to the voice data and including punctuation information is determined according to the first text sequence and the acoustic feature information; voice reply information and/or voice instruction information is determined according to the second text sequence; and the voice reply information is displayed and/or the voice instruction information is executed. With this processing mode, punctuation information is predicted from both the text semantic information and the acoustic feature information of the voice data, better capturing the speaker's intent and yielding punctuation marks that conform more closely to the spoken language; the voice reply and/or instruction information is then determined based on a second text sequence containing these more accurate punctuation marks. The accuracy of voice replies and voice instructions can therefore be effectively improved, improving the user experience.
Drawings
FIG. 1 is a flow chart of an embodiment of a method of speech recognition provided herein;
FIG. 2 is a schematic application scenario diagram of an embodiment of a speech recognition method provided in the present application;
FIG. 3 is a specific flow chart of an embodiment of a speech recognition method provided herein;
FIG. 4 is a diagram of a model network architecture of an embodiment of a speech recognition method provided herein;
FIG. 5 is a specific flow chart of an embodiment of a speech recognition method provided herein;
FIG. 6 is a schematic diagram of an embodiment of a speech recognition device provided herein;
FIG. 7 is a schematic diagram of an embodiment of an electronic device provided herein;
FIG. 8 is a schematic diagram of device interactions of an embodiment of a voice interaction system provided herein;
FIG. 9 is a schematic diagram of device interactions of an embodiment of a voice interaction system provided herein;
FIG. 10 is a schematic device interaction diagram of an embodiment of a speech transcription system provided in the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, this application can be embodied in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific embodiments disclosed below.
In the application, a voice transcription system, a method and a device, a voice recognition method and a device, a method and a device for constructing a voice text punctuation mark prediction model, a voice interaction system, a method and a device, a voice processing method, ordering equipment, an intelligent sound box, voice transcription equipment and electronic equipment are provided. The various schemes are described in detail one by one in the examples below.
First embodiment
Please refer to fig. 1, which is a flowchart illustrating an embodiment of a voice recognition method of the present application. The method is implemented by a voice recognition device, which is usually deployed at a server, but is not limited to the server, and may be any device capable of implementing the voice recognition method. The voice recognition method provided by the embodiment comprises the following steps:
step S101: a first text sequence corresponding to the speech data to be recognized is determined.
In this embodiment, the first text sequence may be determined by an acoustic model (AM) and a language model (LM, e.g., an N-gram language model). The acoustic model converts the input voice signal into posterior probability scores over acoustic modeling units (also called phonemes or pronunciation units); the language model predicts the prior probability that a given word sequence occurs.
The decoder then builds a decoding network combining the acoustic model score and the language model score, and the decoding result, namely the first text sequence, is obtained by searching for the best path.
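As a toy illustration of the score combination (a real decoder searches a decoding network rather than a short candidate list), one can rank candidate transcriptions by a weighted sum of acoustic and language model log-scores; the weight and scores below are made-up values:

```python
from typing import List, Tuple

def best_path(candidates: List[Tuple[str, float, float]],
              lm_weight: float = 0.8) -> str:
    """candidates: (text, am_log_score, lm_log_score); higher combined score wins."""
    return max(candidates, key=lambda c: c[1] + lm_weight * c[2])[0]

# The LM prior can rescue an acoustically slightly worse but more plausible path:
print(best_path([("recognize speech", -12.3, -4.1),
                 ("wreck a nice beach", -12.0, -9.7)]))  # -> "recognize speech"
```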
The first text sequence may be a text sequence that does not include punctuation information. For example: "drugs in short supply are unavailable to patients illicit market control drives prices up in recent years the domestic supply of scarce drugs has drawn continued attention recently … …".
In the process of implementing the invention, the inventors found that the prior art predicts punctuation marks from text semantics only and does not consider the input of the ASR system, namely the acoustic feature information. In fact, spoken-language corpora are sometimes semantically incomplete, and for such corpora a large amount of punctuation information is hidden in the speech itself beyond the text semantics: a pause is often the position of a punctuation mark, and the pause length helps distinguish a comma from a period; likewise, a change of tone often signals a question mark. Labeling purely by semantics therefore often yields unsatisfactory results. Based on these considerations, the inventors propose the technical concept of predicting punctuation information using the acoustic feature information of the voice data in addition to the text semantic information; using the acoustic feature information better captures the speaker's intent and yields punctuation marks that conform more closely to the spoken language, so the punctuation recognition accuracy of voice text can be improved.
Please refer to fig. 2, which is a schematic diagram of a usage scenario of an embodiment of the speech recognition method of the present application. In this embodiment, six microphone arrays are deployed at the conference site together with a data collection device. Each microphone array sends its target sound source signal to the data collection device, which forwards the target voice signal to the cloud, where speech transcription is performed by the voice recognition device deployed there; the data collection device then receives and displays the transcription result. The transcription result includes punctuation information related to both the text semantic information and the acoustic feature information of the speech data.
Step S103: acoustic feature information of the speech data is determined.
The acoustic feature information includes, but is not limited to, at least one of the following: a bottleneck feature, an fbank feature, word duration, post-word silence duration, a pitch feature, and the like.
In implementation, the acoustic feature information of the voice data may be determined with existing acoustic feature extraction methods, such as linear prediction coefficients (LPC), perceptual linear predictive (PLP) coefficients, tandem and bottleneck features, filterbank-based fbank features, linear predictive cepstral coefficients (LPCC), and Mel-frequency cepstral coefficients (MFCC).
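For illustration, several of these features could be extracted with the librosa library roughly as follows; the file name and parameter choices (16 kHz audio, 40 mel bands, a 25 ms window with a 10 ms hop) are common conventions assumed here, not values from the application:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)        # hypothetical input file, mono 16 kHz

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                     n_fft=400, hop_length=160)  # 25 ms / 10 ms frames
fbank = np.log(mel + 1e-6)                          # log-mel filterbank ("fbank") features
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # Mel-frequency cepstral coefficients
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)       # one way to obtain a pitch feature
```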
In one example, step S103 may be implemented as follows: acquiring the acoustic feature information output by the acoustic model in step S101.
Step S105: and determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information.
The second text sequence is the text sequence of the voice data that includes punctuation information. Continuing the example above, the second text sequence could be: "Drugs in short supply are unavailable to patients, and illicit market control drives prices up. In recent years, the domestic supply of scarce drugs has drawn continued attention. Recently … …".
The punctuation information includes, but is not limited to, punctuation information related to both the text semantic information and the acoustic feature information of the voice data. In particular implementations, the punctuation information may also include first punctuation information related only to the text semantic information of the voice data, and punctuation information related only to the acoustic feature information.
Please refer to fig. 3, which is a flowchart illustrating an embodiment of the voice recognition method of the present application. In this embodiment, step S105 may include the following sub-steps:
Step S1051, determining first punctuation information related to text semantic information of the voice data according to the first text sequence through a first punctuation prediction sub-network included in the punctuation prediction model.
According to the method provided by the embodiment of the application, first punctuation information, namely punctuation information related to text semantic information of the voice data, is determined according to the first text sequence through a first punctuation prediction sub-network included in a punctuation prediction model.
Step S1053, determining second punctuation information related to text semantic information and the acoustic feature information of the voice data according to the first punctuation information and the acoustic feature information through a second punctuation prediction sub-network included in the punctuation prediction model.
On the basis of determining the first punctuation information from the text semantic information of the voice data, the acoustic feature information of the voice data is also used to predict the second punctuation information; using the acoustic feature information better captures the speaker's intent, yielding punctuation marks that conform more closely to the spoken language.
The punctuation information output by the punctuation prediction model may include only part of the first punctuation information: after the acoustic feature information of the voice data is incorporated, first punctuation marks that fit the text semantics but not the spoken delivery may be removed. The output may also include punctuation marks beyond the first punctuation information; such newly added marks may be punctuation related to the acoustic feature information.
Please refer to fig. 4, which is a schematic diagram of the punctuation prediction model in an embodiment of the speech recognition method of the present application. The punctuation prediction model includes a first punctuation prediction sub-network and a second punctuation prediction sub-network. In this embodiment, the punctuation prediction model is built on a Transformer architecture and may include several Transformer layers. The input is a string of words (also called tokens); after passing through the Transformer layers, the model performs a classification task that predicts a punctuation class for each word, where the specific classes may be determined according to actual needs, such as comma, period, question mark, exclamation mark, and so on. The input data of the first punctuation prediction sub-network is the input of the punctuation prediction model, i.e., a string of words, such as the words forming a text segment; its output is the first punctuation information. The input data of the second punctuation prediction sub-network is, for each word, the pair of that word's first punctuation information and its acoustic feature information; its output is the second punctuation information.
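For concreteness, the two-stage structure described above might be sketched in PyTorch as follows. The dimensions, layer counts, label set, and fusion-by-concatenation are assumptions for illustration; the text fixes only the overall structure (a text-only first sub-network whose per-word output is paired with word-level acoustic features and fed to a second sub-network).

```python
import torch
import torch.nn as nn

PUNCT_CLASSES = ["none", "comma", "period", "question"]  # assumed label set

class PunctuationPredictor(nn.Module):
    def __init__(self, vocab_size, d_model=256, acoustic_dim=80,
                 n_classes=len(PUNCT_CLASSES)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # First sub-network: predicts punctuation from the words alone.
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.first_head = nn.Linear(d_model, n_classes)
        # Second sub-network: consumes (first punctuation, word acoustics) pairs.
        self.fuse_proj = nn.Linear(n_classes + acoustic_dim, d_model)
        self.fusion_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.second_head = nn.Linear(d_model, n_classes)

    def forward(self, tokens, word_acoustics):
        # tokens: (batch, n_words); word_acoustics: (batch, n_words, acoustic_dim)
        h = self.text_encoder(self.embed(tokens))
        first_logits = self.first_head(h)              # first punctuation info
        paired = torch.cat([first_logits, word_acoustics], dim=-1)
        h2 = self.fusion_encoder(self.fuse_proj(paired))
        return first_logits, self.second_head(h2)      # second punctuation info
```

Here the per-word argmax over first_logits plays the role of the first punctuation information, and the output of second_head that of the second punctuation information.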
It should be noted that the acoustic model of this embodiment uses the frame as its output unit and outputs acoustic feature information for each frame of voice data, whereas the punctuation prediction model uses the word as its output unit and outputs first punctuation information for each word. Since the input of the second punctuation prediction sub-network pairs each word's first punctuation information with that word's acoustic feature information, the first punctuation information and the acoustic feature information need to be aligned word by word.
In specific implementation, step S1053 may include the following sub-steps: 1) determining the acoustic feature information of each word in the first text sequence; 2) taking the word as the unit, using the paired first punctuation information and acoustic feature information of each word as the input data of the second punctuation prediction sub-network, and determining the second punctuation information of each word through that sub-network.
In this embodiment, the acoustic feature information of the voice data comprises acoustic feature information of a plurality of data frames, with the voice data frame as the unit; the step of determining the acoustic feature information of a word may be implemented as follows: determining the acoustic feature information of the word from the acoustic feature information of the data frames related to the word, according to the time information of those frames.
In this embodiment, the acoustic model may be a model with long-term memory capability, including one of the following network structure modules: the deep feedforward sequential memory network (DFSMN) or the bidirectional long short-term memory network (BLSTM). With such an acoustic model, the acoustic feature information of the last frame of each word in effect contains the acoustic information of the whole word. To obtain a better punctuation recognition effect, this embodiment therefore uses the acoustic feature information of the last frame of each word as the information spliced into the Transformer model; the start and end time points of each word can be obtained during acoustic model decoding.
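A sketch of this last-frame selection, assuming a 10 ms frame shift, frame features stored as a NumPy array, and word end times (in seconds) obtained from decoding:

```python
import numpy as np

def word_acoustic_features(frame_feats, word_end_times, frame_shift=0.01):
    # frame_feats: (n_frames, feat_dim) ndarray output by the acoustic model;
    # word_end_times: each word's end time in seconds, from decoding.
    # With a DFSMN/BLSTM acoustic model, the last frame of a word is assumed
    # to summarize the acoustics of the whole word.
    idx = np.clip((np.asarray(word_end_times) / frame_shift).astype(int) - 1,
                  0, len(frame_feats) - 1)
    return frame_feats[idx]
```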
It should be noted that the punctuation prediction model may also be built on models other than the Transformer, in which case the acoustic feature information can be utilized in a way that suits the characteristics of that network.
Step S1055: determining the second text sequence according to the second punctuation information and the first text sequence.
After the second punctuation information and the first text sequence are determined, the two can be spliced together to obtain the second text sequence.
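A simplified splice might look like the following; the label names and the symbol map are assumptions, and words are joined without spaces as would suit Chinese text:

```python
PUNCT_SYMBOL = {"none": "", "comma": ",", "period": ".", "question": "?"}

def merge_text_and_punct(words, punct_labels):
    # Append each word's predicted punctuation mark directly after the word.
    return "".join(w + PUNCT_SYMBOL[label]
                   for w, label in zip(words, punct_labels))

print(merge_text_and_punct(["hello", "world"], ["comma", "period"]))  # hello,world.
```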
Please refer to fig. 5, which is a flowchart illustrating an embodiment of the voice recognition method of the present application. In this embodiment, the method may further include the steps of:
Step S501: learning the punctuation prediction model from a set of correspondences between voice data and text sequences including punctuation annotation information.
According to the method provided in this embodiment of the application, the punctuation prediction model is learned from the set of correspondences by a supervised machine learning method. The voice data can first be converted into a text sequence by an existing speech recognition method, and punctuation annotation can then be performed manually on that text sequence, forming a text sequence including punctuation annotation information, i.e., the text sequence of the voice data annotated with punctuation marks. The set of correspondences serves as the training data.
In this embodiment, step S501 may include the following sub-steps:
Step S5011: determining a set of correspondences among words, the acoustic feature information of each word related to the voice data to which the word belongs, and the punctuation annotation information, according to the set of correspondences between voice data and text sequences including punctuation annotation information.
The acoustic feature information of a word related to the voice data to which the word belongs comprises the acoustic information carried by that word's portion of the voice data.
Table 1 shows such a set of correspondences among words, the acoustic feature information of the words related to the voice data to which they belong, and the punctuation annotation information in this embodiment.
TABLE 1 Correspondence set

Word     Acoustic feature information of the word        Punctuation annotation
happy    acoustic features of "happy" in voice data 1    comma
happy    acoustic features of "happy" in voice data 2    period
As can be seen from Table 1, the same word may have different acoustic feature information in different speech contexts, and therefore different punctuation classifications. For example, the word "happy" is annotated with a comma in voice data 1 and with a period in voice data 2.
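A sketch of flattening utterance-level annotations into Table 1 style triples; the dictionary keys are assumptions about how the annotated corpus is stored:

```python
def build_correspondence_set(utterances):
    # Each utterance is assumed to carry pre-aligned, per-word lists of
    # words, word-level acoustic features, and punctuation labels.
    for utt in utterances:
        yield from zip(utt["words"], utt["word_feats"], utt["punct_labels"])
```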
Step S5013: learning the punctuation prediction model from the set of correspondences among the words, the acoustic feature information of the words related to the voice data to which they belong, and the punctuation annotation information.
After the set of correspondences among the words, their acoustic feature information, and the punctuation annotation information is obtained, the punctuation prediction model can be learned from it. During model training, once the deviation between the predicted punctuation marks and the pre-annotated punctuation marks meets the optimization target, training is complete and the model parameters are saved for use in the prediction stage.
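A hypothetical training step for the PunctuationPredictor sketched earlier; the optimizer settings, the joint loss over both sub-networks, and the absence of padding masks are simplifying assumptions:

```python
import torch
import torch.nn as nn

model = PunctuationPredictor(vocab_size=30000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(tokens, word_feats, labels):
    # tokens: (batch, n_words); word_feats: (batch, n_words, acoustic_dim);
    # labels: (batch, n_words) annotated punctuation class indices.
    first_logits, second_logits = model(tokens, word_feats)
    # Supervise both sub-networks against the annotated punctuation classes.
    loss = (criterion(first_logits.flatten(0, 1), labels.flatten())
            + criterion(second_logits.flatten(0, 1), labels.flatten()))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```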
As can be seen from the foregoing embodiments, the speech recognition method provided in the embodiments of the present application determines a first text sequence corresponding to voice data to be recognized; determines acoustic feature information of the voice data; and determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information. With this processing, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; incorporating the acoustic feature information captures the speaker's intent more faithfully and yields punctuation that better matches the spoken language; therefore, the recognition accuracy of punctuation marks in speech text can be effectively improved.
Second embodiment
In the above embodiment, a voice recognition method is provided, and correspondingly, the present application further provides a voice recognition device. The device corresponds to the embodiment of the method described above.
Please refer to fig. 6, which is a schematic diagram of an embodiment of a voice recognition device provided in the present application, and portions of the embodiment, which have the same content as those of the first embodiment, are not described again, but refer to corresponding portions in the first embodiment. The voice recognition device provided by the application comprises:
a first text sequence generating unit 601, configured to determine a first text sequence corresponding to voice data to be recognized;
an acoustic feature information determining unit 603 configured to determine acoustic feature information of the voice data;
a second text sequence generating unit 605 is configured to determine a second text sequence corresponding to the speech data and including punctuation information according to the first text sequence and the acoustic feature information.
Third embodiment
Please refer to fig. 7, which is a schematic diagram of an embodiment of an electronic device of the present application. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor 701 and a memory 702; the memory is used for storing a program implementing the voice recognition method, and after the device is powered on and the program of the voice recognition method is run by the processor, the following steps are executed: determining a first text sequence corresponding to voice data to be recognized; determining acoustic feature information of the voice data; and determining, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information.
Fourth embodiment
In the foregoing embodiment, a voice recognition method is provided, and correspondingly, a voice interaction system is also provided.
Refer to FIG. 8, which is a schematic diagram of device interactions of an embodiment of a voice interaction system of the present application. Since the system embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference should be made to the description of the method embodiments for relevant points. The system embodiments described below are merely illustrative.
The present application additionally provides a voice interaction system comprising: the system comprises a server side and a client side.
The server side is used for receiving a voice interaction request for target voice data sent by the client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client. The client is used for determining the target voice data and sending the voice interaction request to the server, and for receiving and displaying the voice reply information returned by the server.
The voice reply information may be reply information in text form, in voice form, or in another form.
As can be seen from the foregoing embodiments, in the voice interaction system provided in the embodiments of the present application, the client determines the target voice data and sends a voice interaction request for the voice data to the server; in response to the request, the server determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determines voice reply information according to the second text sequence; and returns the voice reply information to the client, which receives and displays it. With this processing, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; incorporating the acoustic feature information captures the speaker's intent more faithfully and yields punctuation that better matches the spoken language, and the voice reply information is then determined from a text sequence containing this more accurate punctuation; therefore, the accuracy of voice replies can be effectively improved.
Fifth embodiment
Corresponding to the voice interaction system, the application also provides a voice interaction method; the execution subject of the method includes, but is not limited to, the server side, and may also be another client. The parts of the present embodiment that are the same as the first embodiment will not be described again; please refer to the corresponding parts of the first embodiment.
The voice interaction method provided by the application comprises the following steps:
step 1: receiving a voice interaction request aiming at target voice data sent by a client;
step 2: determining a first text sequence corresponding to the voice data;
step 3: determining acoustic characteristic information of the voice data;
step 4: determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information;
step 5: determining voice reply information according to the second text sequence;
step 6: and sending the voice reply information back to the client.
As can be seen from the above embodiments, the voice interaction method provided in the embodiments of the present application receives a voice interaction request for target voice data sent by a client; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determines voice reply information according to the second text sequence; and returns the voice reply information to the client. With this processing, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; incorporating the acoustic feature information captures the speaker's intent more faithfully and yields punctuation that better matches the spoken language, and the voice reply information is then determined from a text sequence containing this more accurate punctuation; therefore, the accuracy of voice replies can be effectively improved.
Sixth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, a voice interaction device is also provided. The device corresponds to the embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a voice interaction device, comprising:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by the client;
a first text sequence generating unit for determining a first text sequence corresponding to the voice data;
an acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information;
the voice reply information determining unit is used for determining voice reply information according to the second text sequence;
and the voice reply information returning unit is used for returning the voice reply information to the client.
Seventh embodiment
In the foregoing embodiment, a voice interaction method is provided, and corresponding electronic equipment is also provided. The device corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program implementing the voice interaction method, and after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: receiving a voice interaction request for target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determining voice reply information according to the second text sequence; and returning the voice reply information to the client.
Eighth embodiment
Corresponding to the voice interaction system, the application also provides a voice interaction method; the execution subject of the method includes, but is not limited to, clients such as a mobile communication device, a personal computer, a PAD, an iPad, or an RF gun. The parts of the present embodiment that are the same as the first embodiment will not be described again; please refer to the corresponding parts of the first embodiment. The voice interaction method provided by the application comprises the following steps:
step 1: determining target voice data;
step 2: sending a voice interaction request aiming at the target voice data to a server;
step 3: receiving voice reply information returned by the service end;
step 4: displaying the voice reply information;
the voice reply information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and sending the voice reply information back to the client.
As can be seen from the above embodiments, the voice interaction method provided in the embodiments of the present application determines target voice data; sends a voice interaction request for the target voice data to a server; receives the voice reply information returned by the server; and displays the voice reply information. The voice reply information is determined as follows: the server receives the voice interaction request; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determines voice reply information according to the second text sequence; and returns the voice reply information to the client. With this processing, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; incorporating the acoustic feature information captures the speaker's intent more faithfully and yields punctuation that better matches the spoken language, and the voice reply information is then determined from a text sequence containing this more accurate punctuation; therefore, the accuracy of voice replies can be effectively improved.
Ninth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, a voice interaction device is also provided. The device corresponds to the embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a voice interaction device, comprising:
a voice data determining unit configured to determine target voice data;
a request sending unit, configured to send a voice interaction request for the target voice data to a server;
the voice reply information receiving unit is used for receiving the voice reply information returned by the service end;
the voice reply information display unit is used for displaying the voice reply information;
the voice reply information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice reply information according to the second text sequence; and sending the voice reply information back to the client.
Tenth embodiment
In the foregoing embodiment, a voice interaction method is provided, and corresponding electronic equipment is also provided. The device corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program implementing the voice interaction method, and after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: determining target voice data; sending a voice interaction request for the target voice data to a server; receiving the voice reply information returned by the server; and displaying the voice reply information. The voice reply information is determined as follows: the server receives the voice interaction request; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determines voice reply information according to the second text sequence; and returns the voice reply information to the client.
Eleventh embodiment
In the foregoing embodiment, a voice recognition method is provided, and correspondingly, a voice interaction system is also provided.
Referring to fig. 9, a device interaction diagram of an embodiment of a voice interaction system of the present application is shown. Since the system embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference should be made to the description of the method embodiments for relevant points. The system embodiments described below are merely illustrative.
The present application additionally provides a voice interaction system comprising: the system comprises a server side and a client side.
The server side is used for receiving a voice interaction request aiming at target voice data, which is sent by the client side; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; the voice instruction information is returned to the client; the client is used for determining the voice data and sending the voice interaction request to the server; and receiving the voice instruction information returned by the server side and executing the voice instruction information.
In one example, the client is a smart speaker that collects user voice data such as "Heaven fairy, turn up the air conditioner temperature". From this voice data, the system can determine voice instruction information such as "air conditioner: set to 25 degrees", and the smart speaker executes the instruction to adjust the air conditioner to 25 degrees or above.
As can be seen from the foregoing embodiments, in the voice interaction system provided in the embodiments of the present application, the client determines the target voice data and sends a voice interaction request for the voice data to the server; in response to the request, the server determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determines voice instruction information according to the second text sequence; and returns the voice instruction information to the client, which executes it. With this processing, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; incorporating the acoustic feature information captures the speaker's intent more faithfully and yields punctuation that better matches the spoken language, and the voice instruction information is then determined from a text sequence containing this more accurate punctuation; therefore, the accuracy of voice interaction can be effectively improved.
Twelfth embodiment
Corresponding to the voice interaction system, the application also provides a voice interaction method; the execution subject of the method includes, but is not limited to, the server side, and may also be another client. The parts of the present embodiment that are the same as the first embodiment will not be described again; please refer to the corresponding parts of the first embodiment.
The voice interaction method provided by the application comprises the following steps:
step 1: receiving a voice interaction request aiming at target voice data sent by a client;
step 2: determining a first text sequence corresponding to the voice data;
step 3: determining acoustic feature information of the voice data;
step 4: determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information;
step 5: determining voice instruction information according to the second text sequence;
step 6: sending the voice instruction information back to the client.
As can be seen from the above embodiments, the voice interaction method provided in the embodiments of the present application receives a voice interaction request for target voice data sent by a client; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determines voice instruction information according to the second text sequence; and returns the voice instruction information to the client. With this processing, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; incorporating the acoustic feature information captures the speaker's intent more faithfully and yields punctuation that better matches the spoken language, and the voice instruction information is then determined from a text sequence containing this more accurate punctuation; therefore, the accuracy of voice interaction can be effectively improved.
Thirteenth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, a voice interaction device is also provided. The device corresponds to the embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a voice interaction device, comprising:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by the client;
a first text sequence generating unit for determining a first text sequence corresponding to the voice data;
an acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information;
the voice instruction information determining unit is used for determining voice instruction information according to the second text sequence;
and the voice instruction information returning unit is used for returning the voice instruction information to the client.
Fourteenth embodiment
In the foregoing embodiment, a voice interaction method is provided, and corresponding electronic equipment is also provided. The device corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program implementing the voice interaction method, and after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: receiving a voice interaction request for target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determining voice instruction information according to the second text sequence; and returning the voice instruction information to the client.
Fifteenth embodiment
Corresponding to the voice interaction system, the application also provides a voice interaction method; the execution subject of the method includes, but is not limited to, clients such as a mobile communication device, a personal computer, a PAD, an iPad, or an RF gun. The parts of the present embodiment that are the same as the first embodiment will not be described again; please refer to the corresponding parts of the first embodiment. The voice interaction method provided by the application comprises the following steps:
step 1: determining target voice data;
step 2: sending a voice interaction request aiming at the voice data to a server;
step 3: receiving voice instruction information returned by the server;
step 4: executing the voice instruction information;
the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and sending the voice instruction information back to the client.
As can be seen from the above embodiments, the voice interaction method provided in the embodiments of the present application determines target voice data; sends a voice interaction request for the voice data to a server; receives the voice instruction information returned by the server; and executes the voice instruction information. The voice instruction information is determined as follows: the server receives the voice interaction request; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determines voice instruction information according to the second text sequence; and returns the voice instruction information to the client. With this processing, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; incorporating the acoustic feature information captures the speaker's intent more faithfully and yields punctuation that better matches the spoken language, and the voice instruction information is then determined from a text sequence containing this more accurate punctuation; therefore, the accuracy of voice interaction can be effectively improved.
Sixteenth embodiment
In the foregoing embodiment, a voice interaction method is provided, and correspondingly, a voice interaction device is also provided. The device corresponds to the embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a voice interaction device, comprising:
a voice data determining unit configured to determine target voice data;
a request sending unit, configured to send a voice interaction request for the voice data to a server;
the voice instruction information receiving unit is used for receiving the voice instruction information returned by the server;
a voice instruction information execution unit configured to execute the voice instruction information;
the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; determining voice instruction information according to the second text sequence; and sending the voice instruction information back to the client.
Seventeenth embodiment
In the foregoing embodiment, a voice interaction method is provided, and corresponding electronic equipment is also provided. The device corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program implementing the voice interaction method, and after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: determining target voice data; sending a voice interaction request for the voice data to a server; receiving the voice instruction information returned by the server; and executing the voice instruction information. The voice instruction information is determined as follows: the server receives the voice interaction request; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; determines voice instruction information according to the second text sequence; and returns the voice instruction information to the client.
Such devices include, but are not limited to: a smart speaker, a smart TV, a subway voice ticketing device, a food-ordering device, and the like.
Eighteenth embodiment
In the above embodiment, a voice recognition method is provided, and correspondingly, the present application further provides a voice transcription system.
Referring to fig. 10, a schematic device interaction diagram of an embodiment of a speech transcription system of the present application is shown. Since the system embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference should be made to the description of the method embodiments for relevant points. The system embodiments described below are merely illustrative.
The present application additionally provides a speech transcription system comprising: the system comprises a server side and a client side.
The server side is used for receiving the voice data to be transcribed sent by the client side; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; returning the second text sequence to the client;
the client is used for collecting the voice data and sending the voice data to the server; and receiving the second text sequence returned by the server side, and displaying the second text sequence.
As shown in fig. 2, the client may be a voice capture device deployed at the conference site that connects multiple microphones.
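Purely as an illustration of the client's role, the send-and-display loop might resemble the following sketch; the HTTP transport, URL, field names, and response shape are assumptions, not interfaces defined by this application:

```python
import requests  # assumed transport; the application does not prescribe one

# Send captured conference audio to the server and display the returned
# second text sequence (the punctuated transcript).
with open("meeting.wav", "rb") as f:
    resp = requests.post("http://server.example/transcribe", files={"audio": f})
print(resp.json()["second_text"])
```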
As can be seen from the foregoing embodiments, in the voice transcription system provided in the embodiments of the present application, the client collects voice data and sends it to the server; the server determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; and returns the second text sequence to the client, which receives and displays it. With this processing, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; incorporating the acoustic feature information captures the speaker's intent more faithfully and yields punctuation that better matches the spoken language; therefore, the accuracy of voice transcription can be effectively improved.
Nineteenth embodiment
Corresponding to the above-mentioned voice transcription system, the present application also provides a voice transcription method; the execution subject of the method includes, but is not limited to, the server side, and may also be another client. The parts of the present embodiment that are the same as the first embodiment will not be described again; please refer to the corresponding parts of the first embodiment.
The voice transcription method provided by the application comprises the following steps:
step 1: receiving voice data to be transcribed sent by a client;
step 2: a first text sequence corresponding to the speech data is determined.
In this embodiment, step 2 may be implemented as follows: the first text sequence is determined by an acoustic model and a language model.
Step 3: acoustic feature information of the speech data is determined.
In this embodiment, the step 3 may be implemented as follows: and acquiring the acoustic characteristic information output by the acoustic model.
The acoustic feature information includes at least one of the following: a bottleneck feature, an fbank feature, a word duration, a post-word silence duration, and a pitch feature.
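Two of these cues, the word duration and the silence following each word, can be derived directly from the per-word time stamps obtained during decoding, as in this sketch (time stamps in seconds are an assumed input format):

```python
def prosodic_word_features(word_times):
    # word_times: list of (start, end) pairs, one per word.
    feats = []
    for i, (start, end) in enumerate(word_times):
        next_start = word_times[i + 1][0] if i + 1 < len(word_times) else end
        feats.append((end - start, next_start - end))  # (duration, silence after)
    return feats

print(prosodic_word_features([(0.0, 0.4), (0.9, 1.3)]))  # [(0.4, 0.5), (0.4, 0.0)]
```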
Step 4: and determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information.
The punctuation information, including but not limited to: punctuation information associated with text semantic information and the acoustic feature information of the speech data.
In this embodiment, step 4 may include the following sub-steps: 1) Determining first punctuation information related to text semantic information of the voice data according to the first text sequence through a first punctuation prediction sub-network included in a punctuation prediction model; 2) Determining second punctuation information related to text semantic information and the acoustic feature information of the voice data according to the first punctuation information and the acoustic feature information through a second punctuation prediction sub-network included in the punctuation prediction model; 3) And determining the second text sequence according to the second punctuation information and the first text sequence.
In this embodiment, the first punctuation prediction sub-network includes at least one Transformer layer, and the second punctuation prediction sub-network likewise includes at least one Transformer layer.
In a specific implementation, the step of determining, by the second punctuation prediction sub-network included in the punctuation prediction model, the second punctuation information according to the first punctuation information and the acoustic feature information may include the following sub-steps: 1) Determining acoustic feature information of each word in the first text sequence; 2) And taking the word as a unit, taking paired data of the first punctuation information and the acoustic characteristic information corresponding to each word as input data of the second punctuation prediction sub-network, and determining second punctuation information of each word through the second punctuation prediction sub-network.
In this embodiment, the acoustic feature information of the voice data includes acoustic feature information of a plurality of data frames in units of voice data frames; the step of determining the acoustic feature information of the word may be implemented in the following manner: and determining the acoustic characteristic information of the word from the acoustic characteristic information of a plurality of data frames related to the word according to the time information of the plurality of data frames.
In this embodiment, the acoustic model includes one of the following network structure modules: the deep feedforward sequential memory network (DFSMN) or the bidirectional long short-term memory network (BLSTM). The step of determining the acoustic feature information of a word from the acoustic feature information of the data frames related to the word, according to the time information of those frames, may be implemented as follows: taking the acoustic feature information of the last data frame related to the word as the acoustic feature information of the word, since with such memory-capable models the features of the last frame effectively subsume the acoustic feature information of all the frames of the word.
In this embodiment, the method may further include the steps of: and learning from the corresponding relation set between the voice data and the text sequence comprising punctuation mark marking information to obtain the punctuation mark prediction model.
Step 5: and sending the second text sequence back to the client.
As can be seen from the above embodiments, the voice transcription method provided in the embodiments of the present application receives voice data to be transcribed sent by a client; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; and returns the second text sequence to the client. With this processing, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; incorporating the acoustic feature information captures the speaker's intent more faithfully and yields punctuation that better matches the spoken language; therefore, the accuracy of voice transcription can be effectively improved.
Twentieth embodiment
In the above embodiment, a voice transcription method is provided, and correspondingly, the application also provides a voice transcription device. The device corresponds to the embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a speech transcription apparatus comprising:
the voice data receiving unit is used for receiving voice data to be transcribed, which is sent by the client;
a first text sequence generating unit for determining a first text sequence corresponding to the voice data;
an acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information;
and the second text sequence returning unit is used for returning the second text sequence to the client.
Twenty-first embodiment
In the foregoing embodiment, a voice transcription method is provided, and corresponding electronic equipment is also provided. The device corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program implementing the voice transcription method, and after the device is powered on and the program of the voice transcription method is run by the processor, the following steps are executed: receiving voice data to be transcribed sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; and returning the second text sequence to the client.
Twenty-second embodiment
Corresponding to the above-mentioned voice transcription system, the present application also provides a voice transcription method; the execution subject of the method includes, but is not limited to, clients such as a mobile communication device, a personal computer, a PAD, an iPad, or an RF gun. The parts of the present embodiment that are the same as the first embodiment will not be described again; please refer to the corresponding parts of the first embodiment. The voice transcription method provided by the application comprises the following steps:
step 1: collecting voice data to be transcribed;
step 2: sending the voice data to a server;
step 3: receiving a second text sequence which is returned by the server and corresponds to the voice data and comprises punctuation information;
step 4: displaying the second text sequence;
wherein the second text sequence is determined by: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
As can be seen from the above embodiments, the voice transcription method provided in the embodiments of the present application collects voice data to be transcribed; sends the voice data to a server; receives the second text sequence, corresponding to the voice data and including punctuation information, returned by the server; and displays the second text sequence. The second text sequence is determined as follows: the server receives the voice data; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; and returns the second text sequence to the client. With this processing, punctuation information is predicted not only from the text semantic information of the voice data but also from its acoustic feature information; incorporating the acoustic feature information captures the speaker's intent more faithfully and yields punctuation that better matches the spoken language; therefore, the accuracy of voice transcription can be effectively improved.
Twenty-third embodiment
In the above embodiment, a voice transcription method is provided, and correspondingly, the application also provides a voice transcription device. The device corresponds to the embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides a speech transcription apparatus comprising:
the voice data acquisition unit is used for acquiring voice data to be transcribed;
a voice data transmitting unit, configured to transmit the voice data to a server;
the second text sequence receiving unit is used for receiving a second text sequence which is returned by the server and corresponds to the voice data and comprises punctuation information;
a second text sequence display unit, configured to display the second text sequence;
wherein the second text sequence is determined by: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic characteristic information; and sending the second text sequence back to the client.
Twenty-fourth embodiment
In the foregoing embodiment, a voice transcription method is provided, and corresponding electronic equipment is also provided. The device corresponds to an embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program implementing the voice transcription method, and after the device is powered on and the program of the voice transcription method is run by the processor, the following steps are executed: collecting voice data to be transcribed; sending the voice data to a server; receiving the second text sequence, corresponding to the voice data and including punctuation information, returned by the server; and displaying the second text sequence. The second text sequence is determined as follows: the server receives the voice data; determines a first text sequence corresponding to the voice data; determines acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence which corresponds to the voice data and includes punctuation information; and returns the second text sequence to the client.
The electronic device includes, but is not limited to: the voice data gathering device shown in fig. 2 may also be a mobile communication device or the like.
Twenty-fifth embodiment
Corresponding to the above-mentioned voice recognition method, the present application also provides a method for constructing a voice text punctuation prediction model, where the implementation subject of the method includes, but is not limited to, a server, and any other device that can implement the method for constructing a voice text punctuation prediction model. The same parts of the present embodiment as those of the first embodiment will not be described again, please refer to the corresponding parts in the first embodiment.
The method for constructing the voice text punctuation predictive model comprises the following steps:
Step 1: determining a set of correspondences among words, acoustic feature information of the words related to the voice data to which the words belong, and punctuation mark annotation information for the words;

The correspondence set may be determined as follows: according to a set of correspondences between voice data and text sequences including punctuation mark annotation information, determining the correspondences among the words, the acoustic feature information of the words related to the voice data to which the words belong, and the punctuation mark annotation information.

Step 2: constructing a network structure of the voice text punctuation mark prediction model;

Step 3: learning the punctuation mark prediction model from the correspondence set.
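The following is a minimal PyTorch sketch of steps 1 to 3, assuming the correspondence set has already been flattened into per-word training examples; the acoustic feature dimension, punctuation label inventory, and network shape are illustrative assumptions rather than the disclosed model structure.

```python
# A minimal sketch of training a punctuation prediction model on
# (word, word-level acoustic features, punctuation label) correspondences.
import torch
import torch.nn as nn

PUNCT_LABELS = ["none", ",", ".", "?"]  # assumed label inventory


class PunctuationPredictor(nn.Module):
    def __init__(self, vocab_size: int, acoustic_dim: int = 40, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # fuse each word's embedding with its acoustic feature vector
        self.rnn = nn.LSTM(hidden + acoustic_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, len(PUNCT_LABELS))

    def forward(self, word_ids, acoustic_feats):
        x = torch.cat([self.embed(word_ids), acoustic_feats], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # per-word punctuation logits


def train_step(model, optimizer, word_ids, acoustic_feats, punct_labels):
    logits = model(word_ids, acoustic_feats)
    loss = nn.functional.cross_entropy(
        logits.view(-1, len(PUNCT_LABELS)), punct_labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the recurrent layer consumes each word embedding concatenated with the word-level acoustic vector, mirroring the pairing of text semantic information with acoustic feature information described above.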
As can be seen from the above, the method for constructing a voice text punctuation mark prediction model provided in the embodiments of the present application determines a set of correspondences among words, acoustic feature information of the words related to the voice data to which the words belong, and punctuation mark annotation information; constructs a network structure of the voice text punctuation mark prediction model; and learns the punctuation mark prediction model from the correspondence set. With this processing mode, on the basis of determining punctuation information from the text semantic information of the voice data, the acoustic feature information of the voice data is also used to predict punctuation; by exploiting the acoustic feature information, the speaker's intended meaning is captured more faithfully, and punctuation marks that better match the spoken language are obtained; therefore, the accuracy of the model can be effectively improved.
Twenty-sixth embodiment
In the foregoing embodiments, a method for constructing a voice text punctuation prediction model is provided, and in correspondence therewith, the present application also provides an apparatus for constructing a voice text punctuation prediction model. The device corresponds to the embodiment of the method described above. Since the apparatus embodiments are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points. The device embodiments described below are merely illustrative.
The present application additionally provides an apparatus for constructing a phonetic text punctuation prediction model, comprising:
the data determining unit is used for determining a corresponding relation set among words, word acoustic feature information related to voice data to which the words belong and word punctuation mark marking information;
the network construction unit is used for constructing a network structure of the voice text punctuation mark prediction model;
and the model training unit is used for learning the punctuation mark prediction model from the corresponding relation set.
Twenty-seventh embodiment
In the above embodiment, a method for constructing a voice text punctuation mark prediction model is provided, and a corresponding electronic device is also provided. The device corresponds to the embodiment of the method described above. Since the device embodiments are substantially similar to the method embodiments, the description is relatively simple; for relevant points, refer to the description of the method embodiments. The device embodiments described below are merely illustrative.
An electronic device of the present embodiment includes: a processor and a memory; the memory is used for storing a program for realizing a method for constructing a voice text punctuation predictive model, and after the device is powered on and the processor runs the program for constructing the voice text punctuation predictive model, the following steps are executed: determining a corresponding relation set among words, word acoustic feature information related to voice data to which the words belong and word punctuation mark labeling information; constructing a network structure of a voice text punctuation mark prediction model; and learning from the corresponding relation set to obtain the punctuation mark prediction model.
Twenty-eighth embodiment
Corresponding to the voice recognition method, the present application also provides a voice processing method. The execution subject of the method includes, but is not limited to, clients such as a mobile communication device, a personal computer, a tablet (e.g., an iPad), or an RF scanner gun, and may be any other device capable of implementing the voice processing method. The parts of this embodiment that are the same as the first embodiment are not described again; please refer to the corresponding parts in the first embodiment.
The voice processing method provided by the application comprises the following steps:
Step 1: collecting voice data to be transcribed;

Step 2: determining a first text sequence corresponding to the voice data, and determining acoustic feature information of the voice data;

Step 3: determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information;

Step 4: executing processing related to the second text sequence.
The processing related to the second text sequence may be displaying the second text sequence, determining voice reply information according to the second text sequence, determining voice instruction information according to the second text sequence, and the like.
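A minimal sketch of this dispatch, with hypothetical helper functions standing in for the reply-generation and command-parsing steps:

```python
# A sketch of "processing related to the second text sequence"; the helpers
# below are illustrative placeholders, not the disclosed implementation.
def generate_reply(text: str) -> str:
    # hypothetical dialogue step; a real system would invoke an NLU module
    return "You said: " + text


def execute_command(text: str) -> None:
    # hypothetical command parser; a real system would map text to actions
    if "light" in text.lower():
        print("Turning on the light")


def process(second_text_sequence: str, mode: str = "display") -> None:
    if mode == "display":
        print(second_text_sequence)
    elif mode == "reply":
        print(generate_reply(second_text_sequence))
    elif mode == "command":
        execute_command(second_text_sequence)
```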
In one example, if the voice processing condition is satisfied, steps 1 to 4 described above are performed; accordingly, the method may further comprise the following steps: if the voice processing condition is not satisfied, determining a first text sequence corresponding to the voice data, and determining, according to the first text sequence, a third text sequence corresponding to the voice data and including punctuation information. Punctuation in the third text sequence includes punctuation associated with text semantic information.
The voice processing conditions include, but are not limited to: the noise of the voice data acquisition environment being smaller than a noise threshold, or the noise of the voice data acquisition environment being greater than the noise threshold; other conditions are also possible, such as the computing resources currently available to the device being greater than a computing resource threshold, and so on.
In one example, the voice processing condition is: the noise of the voice data acquisition environment is smaller than a noise threshold; the method may then further comprise the step of determining noise data of the voice data acquisition environment. Relatively mature existing techniques may be used to determine the noise data, such as measuring whether the noise reaches x dB. With this processing mode, when the environmental noise is small, punctuation marks are predicted by combining the text semantic information and the acoustic feature information; when the environmental noise is too large, acoustic feature information of sufficient quality cannot be extracted from the voice data, so punctuation marks are predicted only according to the text semantic information; therefore, computing resources can be effectively saved.
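A minimal sketch of the noise check, assuming 16-bit PCM samples and a dBFS threshold; the threshold value is an illustrative assumption:

```python
# Decide whether acoustic features should be used, based on measured
# background noise relative to a user-configurable threshold.
import numpy as np

NOISE_THRESHOLD_DB = -30.0  # assumed user-set threshold, in dBFS


def noise_level_db(samples: np.ndarray) -> float:
    # RMS level of 16-bit PCM samples, expressed relative to full scale
    rms = np.sqrt(np.mean(np.square(samples.astype(np.float64))))
    return 20.0 * np.log10(max(rms, 1e-10) / 32768.0)


def should_use_acoustic_features(background: np.ndarray) -> bool:
    return noise_level_db(background) < NOISE_THRESHOLD_DB
```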
In particular, the method may further comprise the steps of: 1) determining a noise threshold specified by a user; 2) storing the noise threshold. For example, a user interface for setting the noise threshold may be provided, so that the user can set or adjust the noise threshold according to actual needs.
In another example, the voice processing condition is: the computing resources currently available to the voice processing device are greater than a computing resource threshold; the method may then further comprise the step of determining the computing resources currently available to the device. Mature techniques may be employed for this, such as determining the available memory, the CPU utilization, and so forth. With this processing mode, when the computing resources currently available to the voice processing device are plentiful, punctuation marks are predicted by combining the text semantic information and the acoustic feature information; when the currently available computing resources are scarce, punctuation marks are predicted only according to the text semantic information; therefore, the voice processing speed can be effectively improved, improving the user experience.
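A minimal sketch of the resource check using the psutil library; the memory and CPU thresholds are illustrative assumptions:

```python
# Gate the combined (text + acoustic) punctuation prediction on the
# computing resources currently available to the device.
import psutil

MIN_FREE_MEMORY = 512 * 1024 * 1024  # 512 MB, assumed threshold
MAX_CPU_PERCENT = 80.0               # assumed threshold


def resources_sufficient() -> bool:
    free_mem = psutil.virtual_memory().available
    cpu_load = psutil.cpu_percent(interval=0.1)
    return free_mem > MIN_FREE_MEMORY and cpu_load < MAX_CPU_PERCENT
```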
In yet another example, the method may further comprise the steps of: 1) determining a target voice processing method designated by a user; 2) if the target voice processing method is the method provided herein, determining that the voice processing condition is satisfied. With this processing mode, a user can select a suitable target voice processing method from several optional voice processing methods, such as the method provided by the embodiments of the present application or a method that performs punctuation prediction only according to text semantic information; if the method designated by the user is the method provided by the embodiments of the present application, the voice processing condition is satisfied.
In one example, the method may further comprise the step of displaying voice processing progress information. With this processing mode, the user can perceive the voice processing progress in real time, such as completion of voice data acquisition, completion of determining the first text sequence, completion of determining the acoustic feature information, and completion of determining the second text sequence; therefore, the user experience can be effectively improved.
As can be seen from the above embodiments, the voice processing method provided by the embodiments of the present application collects voice data to be transcribed; determines a first text sequence corresponding to the voice data and acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence corresponding to the voice data and including punctuation information; and executes processing related to the second text sequence. With this processing mode, on the basis of determining punctuation information from the text semantic information of the voice data, the acoustic feature information of the voice data is also used to predict punctuation; by exploiting the acoustic feature information, the speaker's intended meaning is captured more faithfully, punctuation marks that better match the spoken language are obtained, and the processing related to the second text sequence is then executed on a text sequence with more accurate punctuation; therefore, the accuracy of the related processing can be effectively improved.
Twenty-ninth embodiment
In the above embodiment, a voice interaction method is provided; corresponding to it, the present application also provides an ordering device. The device corresponds to the embodiment of the method described above. Since the device embodiments are substantially similar to the method embodiments, the description is relatively simple; for relevant points, refer to the description of the method embodiments. The device embodiments described below are merely illustrative.
The ordering device of this embodiment includes: a voice collection device, a processor, and a memory. The memory is used for storing a program implementing a voice interaction method; after the device is powered on and the program is run by the processor, the following steps are executed: collecting voice ordering data of a first user; determining a first ordering text sequence corresponding to the voice ordering data; determining acoustic feature information of the voice ordering data; determining, according to the first ordering text sequence and the acoustic feature information, a second ordering text sequence corresponding to the voice ordering data and including punctuation mark information; and determining ordering information according to the second ordering text sequence, so that a second user can prepare meals according to the ordering information.
As can be seen from the above embodiments, the ordering device provided in the embodiments of the present application collects voice ordering data of a first user; determines a first ordering text sequence corresponding to the voice ordering data; determines acoustic feature information of the voice ordering data; determines, according to the first ordering text sequence and the acoustic feature information, a second ordering text sequence corresponding to the voice ordering data and including punctuation mark information; and determines ordering information according to the second ordering text sequence, so that a second user can prepare meals according to the ordering information. With this processing mode, on the basis of determining punctuation information from the text semantic information of the voice ordering data, the acoustic feature information of the voice ordering data is also used to predict punctuation; by exploiting the acoustic feature information, the ordering person's intended meaning is captured more faithfully, punctuation marks that better match the spoken language are obtained, and the ordering information (such as dish names and personal taste requirements) is determined from an ordering text with more accurate punctuation; therefore, the ordering accuracy can be effectively improved, improving the user experience.
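As an illustration only, once punctuation has been restored, the ordering text can be split into dishes and taste notes along the predicted punctuation; the splitting rule and field names below are assumptions, not the disclosed parsing logic.

```python
# A sketch of extracting ordering information from the punctuated text.
import re


def extract_order(second_order_text: str) -> dict:
    # clauses are delimited by the predicted punctuation marks
    clauses = [c.strip()
               for c in re.split(r"[,.!?;]", second_order_text) if c.strip()]
    return {
        "dishes": clauses[:-1] if len(clauses) > 1 else clauses,
        "notes": clauses[-1] if len(clauses) > 1 else "",
    }


print(extract_order("One fish-flavored pork, one kung pao chicken, no chili please."))
# {'dishes': ['One fish-flavored pork', 'one kung pao chicken'],
#  'notes': 'no chili please'}
```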
Thirtieth embodiment
In the foregoing embodiment, a voice interaction method is provided; correspondingly, the present application further provides a smart speaker. The device corresponds to the embodiment of the method described above. Since the device embodiments are substantially similar to the method embodiments, the description is relatively simple; for relevant points, refer to the description of the method embodiments. The device embodiments described below are merely illustrative.
The smart speaker of this embodiment includes: a voice collection device, a processor, and a memory. The memory is used for storing a program implementing a voice interaction method; after the device is powered on and the program is run by the processor, the following steps are executed: collecting voice data of a first user; determining a first text sequence corresponding to the voice data, and determining acoustic feature information of the voice data; determining, according to the first text sequence and the acoustic feature information, a second text sequence corresponding to the voice data and including punctuation information; determining voice reply information and/or voice instruction information according to the second text sequence; and displaying the voice reply information and/or executing the voice instruction information.
As can be seen from the above embodiments, the smart speaker provided in the embodiments of the present application collects voice data of a first user; determines a first text sequence corresponding to the voice data and acoustic feature information of the voice data; determines, according to the first text sequence and the acoustic feature information, a second text sequence corresponding to the voice data and including punctuation information; determines voice reply information and/or voice instruction information according to the second text sequence; and displays the voice reply information and/or executes the voice instruction information. With this processing mode, on the basis of determining punctuation information from the text semantic information of the voice data, the acoustic feature information of the voice data is also used to predict punctuation; by exploiting the acoustic feature information, the speaker's intended meaning is captured more faithfully, punctuation marks that better match the spoken language are obtained, and the voice reply information and/or voice instruction information is determined from a second text sequence with more accurate punctuation; therefore, the accuracy of voice replies and voice instructions can be effectively improved, improving the user experience.
While preferred embodiments have been described above, they are not intended to limit the invention; any person skilled in the art may make variations and modifications without departing from the spirit and scope of the present invention, and therefore the scope of the present invention shall be defined by the claims of the present application.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (38)

1. A speech transcription system, comprising:
the server side is used for receiving the voice data to be transcribed sent by the client side; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; returning the second text sequence to the client;
the client is used for collecting the voice data and sending the voice data to the server; and receiving the second text sequence returned by the server side, and displaying the second text sequence.
2. A method of speech transcription, comprising:
receiving voice data to be transcribed sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic characteristic information of the voice data;
determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data;
and sending the second text sequence back to the client.
3. The method of claim 2, wherein the determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information comprises:
determining first punctuation information related to text semantic information of the voice data according to the first text sequence through a first punctuation prediction sub-network included in a punctuation prediction model;
determining second punctuation information related to text semantic information and the acoustic feature information of the voice data according to the first punctuation information and the acoustic feature information through a second punctuation prediction sub-network included in the punctuation prediction model;
And determining the second text sequence according to the second punctuation information and the first text sequence.
4. The method of claim 3, wherein the determining, by a second punctuation prediction sub-network included in the punctuation prediction model, the second punctuation information based on the first punctuation information and the acoustic feature information, comprises:
determining acoustic feature information of each word in the first text sequence;
and taking the word as a unit, taking paired data of the first punctuation information and the acoustic characteristic information corresponding to each word as input data of the second punctuation prediction sub-network, and determining second punctuation information of each word through the second punctuation prediction sub-network.
5. The method of claim 4, wherein:

the acoustic feature information of the voice data comprises acoustic feature information of a plurality of data frames, taking a voice data frame as a unit;
the determining the acoustic feature information of the word includes:
and determining the acoustic characteristic information of the word from the acoustic characteristic information of a plurality of data frames related to the word according to the time information of the plurality of data frames.
6. The method of claim 5, wherein:

the determining the acoustic feature information of the word from the acoustic feature information of the plurality of data frames related to the word according to the time information of the plurality of data frames comprises:
and taking the acoustic characteristic information of the last data frame related to the word as the acoustic characteristic information of the word, wherein the acoustic characteristic information of the last data frame comprises the acoustic characteristic information of the plurality of data frames.
7. A method according to claim 3, further comprising:
and learning from the corresponding relation set between the voice data and the text sequence comprising punctuation mark marking information to obtain the punctuation mark prediction model.
8. A method of speech transcription, comprising:
collecting voice data to be transcribed;
sending the voice data to a server;
receiving a second text sequence which is returned by the server and corresponds to the voice data and comprises punctuation mark information;
displaying the second text sequence;
wherein the second text sequence is determined by: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; and sending the second text sequence back to the client.
9. A speech transcription apparatus, comprising:
the voice data receiving unit is used for receiving voice data to be transcribed, which is sent by the client;
a first text sequence generating unit for determining a first text sequence corresponding to the voice data;
an acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information, where the punctuation information includes punctuation information related to text semantic information and the acoustic feature information of the voice data;
and the second text sequence returning unit is used for returning the second text sequence to the client.
10. A speech transcription apparatus, comprising:
the voice data acquisition unit is used for acquiring voice data to be transcribed;
a voice data transmitting unit, configured to transmit the voice data to a server;
the second text sequence receiving unit is used for receiving a second text sequence which is returned by the server and corresponds to the voice data and comprises punctuation information;
A second text sequence display unit, configured to display the second text sequence;
wherein the second text sequence is determined by: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; and sending the second text sequence back to the client.
11. An electronic device, comprising:
a processor; and
and the memory is used for storing a program for implementing a voice transcription method, wherein after the device is powered on and the program of the voice transcription method is run by the processor, the following steps are executed: receiving voice data to be transcribed sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; and sending the second text sequence back to the client.
12. A speech transcription apparatus, characterized by comprising:
a processor; and
and the memory is used for storing a program for implementing a voice transcription method, wherein after the device is powered on and the program of the voice transcription method is run by the processor, the following steps are executed: collecting voice data to be transcribed; sending the voice data to a server; receiving a second text sequence which is returned by the server and corresponds to the voice data and comprises punctuation mark information; displaying the second text sequence; wherein the second text sequence is determined by: the server receives the voice data; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; and sending the second text sequence back to the client.
13. A method of speech recognition, comprising:
Determining a first text sequence corresponding to voice data to be recognized;
determining acoustic characteristic information of the voice data;
and determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data.
14. A speech recognition apparatus, comprising:
a first text sequence generating unit for determining a first text sequence corresponding to voice data to be recognized;
an acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
and the second text sequence generating unit is used for determining a second text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence and the acoustic characteristic information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic characteristic information of the voice data.
15. An electronic device, comprising:
a processor; and
a memory for storing a program for implementing a voice recognition method, wherein after the device is powered on and the program of the voice recognition method is run by the processor, the following steps are performed: determining a first text sequence corresponding to voice data to be recognized; determining acoustic feature information of the voice data; and determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data.
16. A voice interactive system, comprising:
the server side is used for receiving a voice interaction request aiming at target voice data, which is sent by the client side; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice reply information according to the second text sequence; the voice reply information is returned to the client;
the client is used for determining the target voice data and sending the voice interaction request to the server; and receiving the voice reply information returned by the server and displaying the voice reply information.
17. A method of voice interaction, comprising:
receiving a voice interaction request aiming at target voice data sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic characteristic information of the voice data;
Determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data;
determining voice reply information according to the second text sequence;
and sending the voice reply information back to the client.
18. A method of voice interaction, comprising:
determining target voice data;
sending a voice interaction request aiming at the target voice data to a server;
receiving voice reply information returned by the server;
displaying the voice reply information;
the voice reply information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice reply information according to the second text sequence; and sending the voice reply information back to the client.
19. A voice interaction device, comprising:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by the client;
a first text sequence generating unit for determining a first text sequence corresponding to the voice data;
an acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information, where the punctuation information includes punctuation information related to text semantic information and the acoustic feature information of the voice data;
the voice reply information determining unit is used for determining voice reply information according to the second text sequence;
and the voice reply information returning unit is used for returning the voice reply information to the client.
20. A voice interaction device, comprising:
a voice data determining unit configured to determine target voice data;
a request sending unit, configured to send a voice interaction request for the target voice data to a server;
The voice reply information receiving unit is used for receiving the voice reply information returned by the server;
the voice reply information display unit is used for displaying the voice reply information;
the voice reply information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice reply information according to the second text sequence; and sending the voice reply information back to the client.
21. An electronic device, comprising:
a processor; and
and the memory is used for storing a program for implementing a voice interaction method, wherein after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: receiving a voice interaction request aiming at target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice reply information according to the second text sequence; and sending the voice reply information back to the client.
22. An electronic device, comprising:
a processor; and
and the memory is used for storing a program for implementing a voice interaction method, wherein after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: determining target voice data; sending a voice interaction request aiming at the target voice data to a server; receiving voice reply information returned by the server; displaying the voice reply information; the voice reply information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice reply information according to the second text sequence; and sending the voice reply information back to the client.
23. A voice interactive system, comprising:
The server side is used for receiving a voice interaction request aiming at target voice data, which is sent by the client side; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice instruction information according to the second text sequence; the voice instruction information is returned to the client;
the client is used for determining the voice data and sending the voice interaction request to the server; and receiving the voice instruction information returned by the server side and executing the voice instruction information.
24. A method of voice interaction, comprising:
receiving a voice interaction request aiming at target voice data sent by a client;
determining a first text sequence corresponding to the voice data;
determining acoustic characteristic information of the voice data;
determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data;
Determining voice instruction information according to the second text sequence;
and sending the voice instruction information back to the client.
25. A method of voice interaction, comprising:
determining target voice data;
sending a voice interaction request aiming at the voice data to a server;
receiving voice instruction information returned by the server;
executing the voice instruction information;
the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice instruction information according to the second text sequence; and sending the voice instruction information back to the client.
26. A voice interaction device, comprising:
the request receiving unit is used for receiving a voice interaction request aiming at target voice data sent by the client;
A first text sequence generating unit for determining a first text sequence corresponding to the voice data;
an acoustic feature information determining unit configured to determine acoustic feature information of the voice data;
a second text sequence generating unit, configured to determine a second text sequence corresponding to the voice data and including punctuation information according to the first text sequence and the acoustic feature information, where the punctuation information includes punctuation information related to text semantic information and the acoustic feature information of the voice data;
the voice instruction information determining unit is used for determining voice instruction information according to the second text sequence;
and the voice instruction information returning unit is used for returning the voice instruction information to the client.
27. A voice interaction device, comprising:
a voice data determining unit configured to determine target voice data;
a request sending unit, configured to send a voice interaction request for the voice data to a server;
the voice instruction information receiving unit is used for receiving the voice instruction information returned by the server;
a voice instruction information execution unit configured to execute the voice instruction information;
The voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic characteristic information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice instruction information according to the second text sequence; and sending the voice instruction information back to the client.
28. An electronic device, comprising:
a processor; and
and the memory is used for storing a program for implementing a voice interaction method, wherein after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: receiving a voice interaction request aiming at target voice data sent by a client; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice instruction information according to the second text sequence; and sending the voice instruction information back to the client.
29. An electronic device, comprising:
a processor; and
and the memory is used for storing a program for implementing a voice interaction method, wherein after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: determining target voice data; sending a voice interaction request aiming at the voice data to a server; receiving voice instruction information returned by the server; executing the voice instruction information; the voice instruction information is determined by the following steps: the server receives the voice interaction request; determining a first text sequence corresponding to the voice data; determining acoustic feature information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice instruction information according to the second text sequence; and sending the voice instruction information back to the client.
30. A method of speech processing, comprising:
Collecting voice data to be transcribed;
determining a first text sequence corresponding to the voice data; and determining acoustic feature information of the voice data;
determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data;
and executing processing related to the second text sequence.
31. The method of claim 30, wherein:
if the voice processing condition is satisfied, executing the method;
the method further comprises the steps of:
if the voice processing condition is not satisfied, determining a first text sequence corresponding to the voice data; and determining a third text sequence which corresponds to the voice data and comprises punctuation information according to the first text sequence.
32. The method of claim 31, wherein:
the speech processing conditions include: the noise of the voice data acquisition environment is smaller than a noise threshold value, or the noise of the voice data acquisition environment is larger than the noise threshold value;
The method further comprises the steps of:
noise data of a speech data acquisition environment is determined.
33. The method as recited in claim 32, further comprising:
determining a noise threshold specified by a user;
the noise threshold is stored.
34. The method of claim 31, further comprising:
determining a target voice processing method designated by a user;
if the target voice processing method is the method, the voice processing condition is satisfied.
35. The method as recited in claim 30, further comprising:
and displaying the voice processing progress information.
36. The method of claim 35, wherein:
the progress information includes at least one of the following information: the voice data acquisition is completed, the first text sequence determination is completed, the acoustic characteristic information determination is completed, and the second text sequence determination is completed.
37. A food ordering apparatus, comprising:
a voice acquisition device;
a processor; and
and the memory is used for storing a program for implementing a voice interaction method, wherein after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: collecting voice ordering data of a first user; determining a first ordering text sequence corresponding to the voice ordering data; determining acoustic feature information of the voice ordering data; determining a second ordering text sequence corresponding to the voice ordering data and comprising punctuation mark information according to the first ordering text sequence and the acoustic feature information, wherein the punctuation mark information comprises punctuation mark information related to text semantic information and the acoustic feature information of the voice ordering data; and determining ordering information according to the second ordering text sequence, so that a second user prepares meals according to the ordering information.
38. A smart speaker, characterized by comprising:
a processor; and
and the memory is used for storing a program for implementing a voice interaction method, wherein after the device is powered on and the program of the voice interaction method is run by the processor, the following steps are executed: collecting voice data of a first user; determining a first text sequence corresponding to the voice data, and determining acoustic feature information of the voice data; determining a second text sequence corresponding to the voice data and comprising punctuation information according to the first text sequence and the acoustic feature information, wherein the punctuation information comprises punctuation information related to text semantic information and the acoustic feature information of the voice data; determining voice reply information and/or voice instruction information according to the second text sequence; and displaying the voice reply information and/or executing the voice instruction information.
CN201911159513.6A 2019-11-22 2019-11-22 Voice transcription method, device, related system and equipment Active CN112837688B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911159513.6A CN112837688B (en) 2019-11-22 2019-11-22 Voice transcription method, device, related system and equipment
PCT/CN2020/128950 WO2021098637A1 (en) 2019-11-22 2020-11-16 Voice transliteration method and apparatus, and related system and device

Publications (2)

Publication Number Publication Date
CN112837688A CN112837688A (en) 2021-05-25
CN112837688B true CN112837688B (en) 2024-04-02

Family

ID=75922713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911159513.6A Active CN112837688B (en) 2019-11-22 2019-11-22 Voice transcription method, device, related system and equipment

Country Status (2)

Country Link
CN (1) CN112837688B (en)
WO (1) WO2021098637A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299987A (en) * 2021-12-08 2022-04-08 中国科学技术大学 Training method of event analysis model, event analysis method and device thereof
CN114048714A (en) * 2022-01-14 2022-02-15 阿里巴巴达摩院(杭州)科技有限公司 Method and device for standardizing reverse text

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
WO2009021183A1 (en) * 2007-08-08 2009-02-12 Lessac Technologies, Inc. System-effected text annotation for expressive prosody in speech synthesis and recognition
US9202465B2 (en) * 2011-03-25 2015-12-01 General Motors Llc Speech recognition dependent on text message content
US20190043486A1 (en) * 2017-08-04 2019-02-07 EMR.AI Inc. Method to aid transcribing a dictated to written structured report

Patent Citations (10)

Publication number Priority date Publication date Assignee Title
CN101334997A (en) * 2001-04-17 2008-12-31 诺基亚有限公司 Speaker-independent speech recognition device
CN102231278A (en) * 2011-06-10 2011-11-02 安徽科大讯飞信息科技股份有限公司 Method and system for realizing automatic addition of punctuation marks in speech recognition
CN104142915A (en) * 2013-05-24 2014-11-12 腾讯科技(深圳)有限公司 Punctuation adding method and system
WO2014187096A1 (en) * 2013-05-24 2014-11-27 Tencent Technology (Shenzhen) Company Limited Method and system for adding punctuation to voice files
CN105243056A (en) * 2015-09-07 2016-01-13 饶志刚 Punctuation mark processing based Chinese syntax analysis method and apparatus
CN107767870A (en) * 2017-09-29 2018-03-06 百度在线网络技术(北京)有限公司 Adding method, device and the computer equipment of punctuation mark
CN108597517A (en) * 2018-03-08 2018-09-28 深圳市声扬科技有限公司 Punctuation mark adding method, device, computer equipment and storage medium
CN108845979A (en) * 2018-05-25 2018-11-20 科大讯飞股份有限公司 A kind of speech transcription method, apparatus, equipment and readable storage medium storing program for executing
CN109558576A (en) * 2018-11-05 2019-04-02 中山大学 A kind of punctuation mark prediction technique based on from attention mechanism
CN109448704A (en) * 2018-11-20 2019-03-08 北京智能管家科技有限公司 Construction method, device, server and the storage medium of tone decoding figure

Also Published As

Publication number Publication date
CN112837688A (en) 2021-05-25
WO2021098637A1 (en) 2021-05-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant