CN108733649B

CN108733649B - Method, device and system for inserting voice recognition text into script document

Info

Publication number: CN108733649B
Application number: CN201810377108.0A
Authority: CN
Inventors: 卢闪明; 张亚鹏; 李行; 单衍景
Original assignee: BEIJING HUAXIA DENTSU TECHNOLOGY CO LTD
Current assignee: BEIJING HUAXIA DENTSU TECHNOLOGY CO LTD
Priority date: 2018-04-25
Filing date: 2018-04-25
Publication date: 2022-05-06
Anticipated expiration: 2038-04-25
Also published as: CN108733649A

Abstract

The embodiments of the application disclose a method, a device and a system for inserting speech recognition text into a transcript document. The method comprises: receiving current text recognition information of a target audio substream, the current text recognition information comprising text recognition content, a text recognition state identifier and a text length; and inserting the corresponding text recognition content into the corresponding position of the transcript document according to the text recognition state identifier of the current text recognition information. Under this scheme, the text recognition content returned by the recognition server is inserted into the transcript document promptly, whether or not it has been confirmed. This addresses the problem that differences in speakers' language habits cannot be corrected in a uniform way, as well as the problem that text recognition content is inserted into the transcript document slowly when confirmation of the recognized text is delayed by the network or the server, and greatly improves the user experience.

Description

Method, device and system for inserting voice recognition text into script document

Technical Field

The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, and a system for inserting a speech recognition text into a script document.

Background

With the development of speech recognition technology, it is increasingly widely used across industries. For example, if speech recognition is applied to a court trial or a meeting, speech is converted into text and inserted into the transcript document in real time, in different colors, so that the workload of the court or meeting recorder is greatly reduced, omissions and mishearings are avoided, and the recorder's work may even be replaced entirely, saving labor.

During speech recognition, the recognition server obtains the audio stream of the role currently speaking and generates recognized text for the current audio stream incrementally, by slicing the audio stream repeatedly and analyzing it in combination with the surrounding context and semantics. If the text recognition content in the text recognition information cannot be confirmed, the recognition server repeats the recognition processing on the current audio stream until that content is confirmed, and until then the text recognition content is not inserted into the transcript document. If the speaker talks too fast and pauses only briefly, the recognition server makes errors in automatic sentence segmentation (the audio streams of two sentences are treated as one sentence). Because the recognition server must then compare and analyze the current audio stream more times before it obtains recognized text in the final confirmed state, the user experience is poor.

Disclosure of Invention

The embodiments of the application aim to provide a method, a device and a system for inserting speech recognition text into a transcript document, so as to solve the technical problem that existing ways of inserting text into a transcript document give a poor user experience.

In order to achieve the above object, an embodiment of the present application provides a method for inserting a speech recognition text into a transcript document, including:

receiving current text recognition information of a target audio substream; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

and inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identification of the current text recognition information.

Preferably, the step of inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier of the current text recognition information includes:

if the text recognition state identifier in the current text recognition information is a non-confirmed identifier and the text recognition state identifier in the previous text recognition information is a non-confirmed identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the transcript document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information;

if the text recognition state identifier in the current text recognition information is a non-confirmed identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the transcript document;

if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a non-confirmed identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the transcript document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information;

and if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the transcript document.

Preferably, the step of inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the last text recognition information and the text length and the text recognition content in the current text recognition information comprises:

comparing the leading portion of the text recognition content of the current text recognition information, from its starting position up to the text length of the previous text recognition information, with the text recognition content of the previous text recognition information; if the two are the same, removing that leading portion from the text recognition content of the current text recognition information and inserting the remaining content after the text recognition content of the previous text recognition information in the transcript document; and if the two are different, deleting the text recognition content of the previous text recognition information from the transcript document and inserting the text recognition content of the current text recognition information at the position that content occupied.
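The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `plan_insertion` and the "append" / "overwrite" labels are hypothetical, and the text length is taken to be a character count.

```python
from __future__ import annotations


def plan_insertion(prev_text: str, cur_text: str) -> tuple[str, str]:
    """Decide how the current result updates the transcript (hypothetical helper).

    Returns ("append", tail) when cur_text begins with prev_text, so only the
    remaining tail is inserted behind the text already in the document.
    Returns ("overwrite", cur_text) otherwise, meaning the previous text is
    deleted and the current text is written in its place.
    """
    prev_len = len(prev_text)  # assumed: text length == number of characters
    if cur_text[:prev_len] == prev_text:
        return "append", cur_text[prev_len:]
    return "overwrite", cur_text


print(plan_insertion("the court", "the court is now"))  # ('append', ' is now')
print(plan_insertion("the cord", "the court is now"))   # ('overwrite', 'the court is now')
```

In the first call the earlier result is an exact prefix of the new one, so only the tail is added. In the second call the recognizer has revised an earlier word, so the whole earlier text is replaced.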

Preferably, the step of inserting the text recognition content of the current text recognition information into the corresponding position of the transcript document comprises:

if the text recognition state identifier in the previous text recognition information is a non-confirmed identifier and the text recognition state identifier in the current text recognition information is a non-confirmed identifier, obtaining the insertion position of the text recognition content in the current text recognition information from the bookmark used when the text recognition content in the previous text recognition information was inserted, inserting the text recognition content in the current text recognition information at that position, and updating the range covered by the bookmark;

and if the text recognition state identifier in the previous text recognition information is a confirmation identifier, obtaining the insertion position of the text recognition content in the current text recognition information through a positioning function, inserting the text recognition content in the current text recognition information at that position, removing the shading effect from the text covered by the bookmark used when the text recognition content in the previous text recognition information was inserted, and creating a new bookmark that covers the position range of the text recognition content in the current text recognition information.
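As a rough illustration of the bookmark handling, the sketch below models the document as a plain string and a bookmark as a pair of character offsets. Everything here is an assumption made for the example: the `Bookmark` class, the function name, the choice of "end of document" as the result of the positioning function, and the fact that only the append path is modeled (shading and overwriting are left out).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Bookmark:
    start: int  # offset of the first character covered
    end: int    # offset one past the last character covered


def insert_with_bookmark(
    doc: str, text: str, bookmark: Optional[Bookmark], prev_confirmed: bool
) -> tuple[str, Bookmark]:
    """Insert text and return the updated document and bookmark."""
    if bookmark is not None and not prev_confirmed:
        # Previous result unconfirmed: reuse its bookmark to find the
        # insertion position, then widen the bookmark over the new text.
        pos = bookmark.end
        new_doc = doc[:pos] + text + doc[pos:]
        return new_doc, Bookmark(bookmark.start, pos + len(text))
    # Previous result confirmed (or no previous result): the position comes
    # from a positioning function, assumed here to be the end of the document,
    # and a fresh bookmark is created over the newly inserted text.
    pos = len(doc)
    return doc + text, Bookmark(pos, pos + len(text))


doc, bm = insert_with_bookmark("", "hello", None, True)
doc, bm = insert_with_bookmark(doc, " world", bm, False)
print(doc, bm)  # hello world Bookmark(start=0, end=11)
doc, bm = insert_with_bookmark(doc, " Next", bm, True)
print(doc, bm)  # hello world Next Bookmark(start=11, end=16)
```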

In order to achieve the above object, an embodiment of the present application further provides a method for inserting a speech recognition text into a transcript document, including:

receiving an audio stream;

segmenting the audio stream to obtain audio substreams;

determining the target audio substream currently to be recognized according to the text recognition state identifier in the previous text recognition information;

recognizing the target audio substream to obtain current text recognition information; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

and sending the current text recognition information to a transcript-insertion terminal, so that the text recognition content in the current text recognition information is inserted into the transcript document.

Preferably, the step of determining the target audio substream currently to be identified comprises:

if the text recognition state identifier in the previous text recognition information is a non-confirmed identifier, the target audio substream to be recognized currently is the audio substream corresponding to the previous text recognition information;

and if the text recognition state identifier in the last text recognition information is the confirmation identifier, the target audio substream needing to be recognized currently is the next audio substream.
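The selection rule for the target audio substream can be sketched as follows. The function names and the use of integer indices for substreams are assumptions made for illustration; the state values 1 (confirmed) and 0 (unconfirmed) follow the embodiment described later in the document.

```python
from __future__ import annotations

CONFIRMED = 1  # state identifier value used in the embodiment


def next_target(prev_index: int, prev_state: int) -> int:
    """Index of the substream to recognize next (hypothetical helper)."""
    # Confirmed: move on to the next substream.
    # Unconfirmed: recognize the same substream again.
    return prev_index + 1 if prev_state == CONFIRMED else prev_index


def trace_targets(states: list[int]) -> list[int]:
    """For a sequence of returned states, list which substream each result came from."""
    index = 0
    targets = []
    for state in states:
        targets.append(index)
        index = next_target(index, state)
    return targets


print(trace_targets([0, 0, 1, 0, 1]))  # [0, 0, 0, 1, 1]
```

The first substream is recognized three times before it is confirmed, and only then does the server advance to the second one.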

In order to achieve the above object, an embodiment of the present application provides an apparatus for inserting a speech recognition text into a transcript document, including:

a receiving unit for receiving current text recognition information of a target audio substream; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

and a transcript insertion unit for inserting the corresponding text recognition content into the corresponding position of the transcript document according to the text recognition state identifier of the current text recognition information.

Preferably, the transcript insertion unit includes:

a first transcript insertion module, used for inserting the text recognition content of the current text recognition information into the corresponding position of the transcript document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information, if the text recognition state identifier in the current text recognition information is a non-confirmed identifier and the text recognition state identifier in the previous text recognition information is a non-confirmed identifier;

a second transcript insertion module, used for inserting the text recognition content of the current text recognition information into the corresponding position of the transcript document, if the text recognition state identifier in the current text recognition information is a non-confirmed identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier;

a third transcript insertion module, used for inserting the text recognition content of the current text recognition information into the corresponding position of the transcript document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information, if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a non-confirmed identifier;

and a fourth transcript insertion module, used for inserting the text recognition content of the current text recognition information into the corresponding position of the transcript document, if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier.

In order to achieve the above object, an apparatus for inserting a speech recognition text into a script document according to an embodiment of the present application includes:

a receiving unit for receiving an audio stream;

the segmentation unit is used for segmenting the audio stream to obtain audio substreams;

the target audio substream confirming unit is used for determining the target audio substream currently to be recognized according to the text recognition state identifier in the previous text recognition information;

the recognition unit is used for recognizing the target audio substream to obtain current text recognition information; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

and the sending unit is used for sending the current text recognition information to a transcript-insertion terminal, so that the text recognition content in the current text recognition information is inserted into the transcript document.

Preferably, the target audio substream confirmation unit includes:

the first confirmation module is used for judging that the current target audio substream to be recognized is the audio substream corresponding to the previous text recognition information if the text recognition state identifier in the previous text recognition information is a non-confirmation identifier;

and the second confirmation module is used for determining that the target audio substream needing to be identified currently is the next audio substream if the text identification state identifier in the previous text identification information is the confirmation identifier.

As can be seen from the above, recognized text is returned too slowly because of speakers' speaking habits, the network and recognition server configuration, and the practice of returning text recognition content only once recognition is confirmed, and this results in a poor user experience. Under the present scheme, the text recognition content returned by the recognition server is inserted into the transcript document promptly, whether or not it has been confirmed. This addresses the problem that differences in speakers' language habits cannot be corrected in a uniform way, as well as the problem that text recognition content is inserted into the transcript document slowly when confirmation of the recognized text is delayed by the network or the server, and greatly improves the user experience.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a system for inserting a speech recognition text into a transcript document according to an embodiment of the present application;

FIG. 2 is a flowchart of a method for inserting a speech recognition text into a transcript document according to an embodiment of the present application;

FIG. 3 is a second flowchart of a method for inserting a speech recognition text into a transcript document according to an embodiment of the present application;

FIG. 4 is a functional block diagram of an apparatus for inserting a speech recognition text into a transcript document according to an embodiment of the present application;

FIG. 5 is a second functional block diagram of an apparatus for inserting a speech recognition text into a transcript document according to an embodiment of the present application;

FIG. 6 is a schematic view of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application shall fall within the scope of protection of the present application.

Fig. 1 is a schematic diagram of a system for inserting a speech recognition text into a transcript document according to an embodiment of the present application. The system comprises a transcript-insertion terminal and a voice recognition server. The voice recognition server obtains an audio stream from the voice collector, applies noise processing to it, and segments it into a plurality of audio substreams. The voice recognition server performs recognition processing on each audio substream, builds text recognition information from the result, and sends the text recognition information to the transcript-insertion terminal whether the recognized content of the audio substream is in the confirmed state or the unconfirmed state. If the recognized content of the audio substream currently being recognized is confirmed, the voice recognition server moves on to recognize the next audio substream. If it is still unconfirmed, the voice recognition server continues to recognize the current audio substream. The transcript-insertion terminal inserts the text recognition content in the text recognition information returned by the voice recognition server into the corresponding position of the transcript document according to the text recognition state.

This technical solution applies to scenarios in which only one role speaks at a time. In this technical solution, the transcript-insertion terminal creates a storage unit and stores in it the text recognition information returned by the recognition server. The storage unit holds information such as the speech content and the recognition state identifier. Each time real-time recognized text is received, the transcript-insertion terminal calculates the text insertion position from the recognition state identifier held in the storage unit, so that the recognized text of a single speaking role is inserted into the corresponding position of the transcript document.

In this embodiment, the unconfirmed state of recognized content means the following: the voice recognition server performs slice analysis and other recognition operations on the acquired audio stream to generate text recognition content; this content is only part of the final text to be generated for the current audio substream, and individual fields in it still need to be corrected through further recognition processing. The confirmed state of recognized content means the following: the recognition server performs slice analysis and other recognition operations on the acquired audio stream to generate text recognition content, and that content has been finally confirmed, in combination with contextual semantic analysis, as text that does not need to be recognized again.

Based on the above description, an embodiment of the present application provides a method for inserting a speech recognition text into a transcript document, as shown in fig. 2. The method is applied to the transcript-insertion terminal, which may be, for example, a desktop computer, a tablet computer, a notebook computer, a smart phone, a digital assistant, an intelligent wearable device, a shopping guide terminal, a television or another device with a data processing function. Alternatively, the terminal may be software capable of running on such an electronic device. Consistent with the single-role scenario stated above, the method applies where only one role speaks at a time, and may comprise the following steps:

Step 201): receiving current text recognition information of a target audio substream; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length.

In this technical solution, the text recognition content is the speech content of the current target audio substream. The text recognition state identifier indicates whether the speech content recognized in the current target audio substream still needs to be recognized again. In this embodiment, a text recognition state identifier of 1 means that the speech content recognized in the current audio substream has been finally confirmed, in combination with contextual semantic analysis, as text that does not need to be recognized again. A text recognition state identifier of 0 means that the speech content recognized in the current audio substream is only part of the final text to be generated for that substream, and that individual fields in the text recognition content still need to be corrected through further recognition processing. The text length is the length of the speech content that the recognition server has recognized for the current target audio substream.
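One way to picture the three items of text recognition information is as a small record type. The sketch below is an assumption-laden illustration, not the format used by the recognition server: the class name, field names and constant names are all hypothetical, and only the state values 1 and 0 come from the embodiment.

```python
from dataclasses import dataclass

CONFIRMED = 1    # content needs no further recognition
UNCONFIRMED = 0  # content may still be corrected by re-recognition


@dataclass(frozen=True)
class TextRecognitionInfo:
    content: str  # text recognition content
    state: int    # text recognition state identifier (1 or 0)
    length: int   # text length

    def __post_init__(self) -> None:
        # Reject state identifiers other than the two used in the embodiment.
        if self.state not in (CONFIRMED, UNCONFIRMED):
            raise ValueError(f"unknown state identifier: {self.state}")

    @property
    def confirmed(self) -> bool:
        return self.state == CONFIRMED


first = TextRecognitionInfo(content="good morn", state=UNCONFIRMED, length=9)
final = TextRecognitionInfo(content="good morning", state=CONFIRMED, length=12)
print(first.confirmed, final.confirmed)  # False True
```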

In this embodiment, the transcript-insertion terminal provides a storage unit on its processor dedicated to storing the current text recognition information returned by the voice recognition server. The storage unit is divided into a plurality of storage areas, and the different items of the text recognition information are stored in different areas, for example one area dedicated to the text recognition state identifier and another dedicated to the text recognition content. In this technical solution, the storage unit holds the previous text recognition information. When the transcript-insertion terminal receives the current text recognition information returned by the voice recognition server, it inserts the text recognition content of the current text recognition information into the transcript document according to the text recognition state identifier in the current text recognition information and the text recognition state identifier in the previous text recognition information, then deletes the previous text recognition information from the storage unit and stores the current text recognition information in its place. The transcript-insertion terminal also has a memory that stores the result of the insertion into the transcript document. The previous text recognition information held in the storage unit serves to determine the insertion position accurately when text recognition content is inserted, so in this technical solution the memory and the storage unit hold different content.
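The storage unit's behavior, holding exactly one result and replacing it after each insertion, might be modeled as below. The class name, the area names and the `replace` method are hypothetical and chosen only for this illustration.

```python
from __future__ import annotations

from typing import Optional


class StorageUnit:
    """Hypothetical model: one stored result, split into named areas."""

    def __init__(self) -> None:
        self.areas: dict[str, Optional[object]] = {
            "content": None,  # area for the text recognition content
            "state": None,    # area for the text recognition state identifier
            "length": None,   # area for the text length
        }

    def replace(self, content: str, state: int, length: int) -> dict:
        """Delete the previous result, store the current one, return the previous."""
        previous = dict(self.areas)
        self.areas = {"content": content, "state": state, "length": length}
        return previous


unit = StorageUnit()
unit.replace("good", 0, 4)
previous = unit.replace("good morning", 1, 12)
print(previous["state"], unit.areas["state"])  # 0 1
```

Returning the previous result from `replace` is a convenience of the sketch: it lets the caller read the earlier state identifier while the unit already holds the current one.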

Step 202): and inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identification of the current text recognition information.

In this technical solution, the sub-steps of this insertion step are those set out in the Disclosure of Invention above: the four cases selected by the text recognition state identifiers of the current and previous text recognition information, the comparison based on the text lengths and text recognition contents of the two, and the bookmark-based determination of the insertion position.

Specifically, to describe in detail how text is inserted into the transcript document when a single role speaks, the processing flow of the transcript-insertion terminal is as follows:

1. Role A speaks for the first time. The audio collector captures role A's speech to obtain an audio stream, and the recognition server segments the audio stream into audio substreams and performs recognition processing on the first audio substream. The recognition server returns, for the first time, the text recognition content Sa1, the text recognition state identifier Ta1 and the text length L1. A storage unit corresponding to role A is created to store the text recognition content, the text recognition state identifier and the text length of the text recognition information. If Ta1 is 1, the currently returned text recognition content Sa1 is confirmed text; if Ta1 is 0, the currently returned text recognition content Sa1 is unconfirmed text. In either case, the text recognition content Sa1 and the text recognition state identifier Ta1 of the currently returned text recognition information are stored in the storage unit, the text recognition content Sa1 is inserted into the transcript document, and the terminal waits for the recognition server to return the next text recognition information.

2. The recognition server returns the next text recognition information to obtain the text recognition content Sa2, the text recognition state flag Ta2, and the text length L2. If the text recognition status identifier Ta1 of the last text recognition information is 1, the current text recognition information is obtained by the recognition server for the current audio substream recognition. If the text recognition status identifier Ta1 of the previous text recognition information is 0, the current text recognition information is obtained by the recognition server for the audio substream corresponding to the previous text recognition information.

2.1 If Ta2 is 0, the recognized text in the text recognition information is unconfirmed. The content of length L1 at the start of the text recognition content Sa2 is compared with the text recognition content Sa1. If they are the same, the trailing portion S21 of Sa2, of length L2-L1, is obtained by string truncation and appended to the end of the existing text in the transcript document. If they are not the same, Sa2 is not truncated and is inserted into the transcript document in overwrite mode (the text recognition content Sa1 is deleted and the text recognition content Sa2 is inserted). The storage unit is then updated: the text recognition content Sa1, the text recognition state identifier Ta1 and the text length L1 of the previous text recognition information are deleted, and the text recognition content Sa2, the text recognition state identifier Ta2 and the text length L2 of the current text recognition information are stored in the storage unit.

2.2 If Ta2 is 1, the recognized text in the text recognition information is confirmed; the text length is L2 and the text recognition content is Sa2. The content of length L1 at the start of Sa2 is compared with the text recognition content Sa1. If they are the same, the trailing portion of Sa2, of length L2-L1, is obtained by string truncation and appended to the end of the existing text in the transcript document. If they are different, Sa2 is not truncated and is inserted into the transcript document in overwrite mode (the Sa1 content in the document is deleted and Sa2 is inserted). At the same time, the text recognition content Sa1, the text recognition state identifier Ta1 and the text length L1 of the previous text recognition information are deleted, and the text recognition content Sa2, the text recognition state identifier Ta2 and the text length L2 of the current text recognition information are stored in the storage unit.

3. The recognition server returns the third text recognition information, and the transcript-insertion terminal receives it. Whether the text recognition state identifier Ta3 in the third text recognition information is 1 or 0, if the text recognition state identifier Ta2 in the previous text recognition information is 0, the insertion logic of step 2.1 of this processing flow is executed. If Ta2 is 1, text processing and insertion start again from step 1 of this processing flow. Finally, the text recognition content Sa2, the text recognition state identifier Ta2 and the text length L2 of the previous text recognition information are deleted, and the text recognition content Sa3, the text recognition state identifier Ta3 and the text length L3 of the current text recognition information are stored in the storage unit.
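The processing flow of steps 1 to 3 can be traced end to end with a small in-memory model. This is a sketch under several assumptions: the class `TranscriptTerminal` and its attributes are hypothetical, the transcript document is a plain string, new sentences are placed at the end of the document, and the text length is a character count. The sample phrases are invented for the trace.

```python
from __future__ import annotations


class TranscriptTerminal:
    """Hypothetical in-memory model of the transcript-insertion terminal."""

    def __init__(self) -> None:
        self.document = ""      # stands in for the transcript document
        self.prev = None        # storage unit: (content, state, length)
        self.segment_start = 0  # offset where the text under recognition begins

    def receive(self, content: str, state: int, length: int) -> None:
        if self.prev is None or self.prev[1] == 1:
            # Step 1: first result, or the previous result was confirmed.
            # Assumption: the new text goes at the end of the document.
            self.segment_start = len(self.document)
            self.document += content
        else:
            prev_content, _, prev_length = self.prev
            if content[:prev_length] == prev_content:
                # Steps 2.1 / 2.2, same leading content: append only the tail.
                self.document += content[prev_length:]
            else:
                # Steps 2.1 / 2.2, different leading content: overwrite.
                self.document = self.document[: self.segment_start] + content
        # Replace the previous result in the storage unit with the current one.
        self.prev = (content, state, length)


terminal = TranscriptTerminal()
results = [
    ("good", 0),
    ("good morning", 0),
    ("Good morning, everyone.", 1),
    (" Please sit", 0),
    (" Please sit down.", 1),
]
for content, state in results:
    terminal.receive(content, state, len(content))
print(terminal.document)  # Good morning, everyone. Please sit down.
```

The third result differs from the second in its first character, so the earlier text is overwritten. The fourth result arrives after a confirmed one, so it starts a new segment, as in step 3.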

When text recognition content is inserted into the record document, a shading effect is added in real time to the newly inserted text of each role. The shading of text whose recognition state was confirmed in the previously returned text recognition content is detected and removed, so that the shading always follows the most recently inserted recognition text. Specifically, the step of inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier of the current text recognition information further includes:

after the text recognition content in the current text recognition information is inserted into the corresponding position, checking the text recognition state identifier of the previous text recognition information; if it is a confirmation identifier, removing the shading effect from the text recognition content of the previous text recognition information and setting a shading effect on the inserted text recognition content of the current text recognition information; if it is a non-confirmation identifier, setting a shading effect on the inserted text recognition content of the current text recognition information.

On the basis of the above example, the logic for adding the shading effect to each piece of text recognition content while role A is speaking is as follows:

1. The record insertion terminal receives the first text recognition information of role A returned by the recognition server, creates a storage unit corresponding to role A, and stores the text recognition content Sa1, the text recognition state identifier Ta1 and the text length L1 of the current text recognition information in overwrite mode.

1.1 If Ta1 = 0, the insertion position of the text recognition content of the current text recognition information is calculated through the relevant positioning function provided by WordAPI, the text recognition content is inserted into the record document, a bookmark (Bookmark) B<a> that corresponds to role A and contains Sa1 is created, a shading effect is added to the inserted text content through the bookmark, and the flow proceeds to step 2.

1.2 If Ta1 = 1, the insertion position of the text recognition content of the current text recognition information is calculated through the relevant positioning function provided by WordAPI, the text recognition content is inserted into the record document, a bookmark (Bookmark) B<a> that corresponds to role A and contains Sa1 is created, and a shading effect is added to the inserted text content through the bookmark. The flow then returns to step 1.

2. The recognition server returns the second text recognition information to the record insertion terminal. The record insertion terminal stores the text recognition content Sa2, the text recognition state identifier Ta2 and the text length L2 of the current text recognition information in the storage unit in overwrite mode, and deletes the text recognition content Sa1, the text recognition state identifier Ta1 and the text length L1 of the first text recognition information.

2.1 If the text recognition state identifier Ta2 = 0, the insertion position of the text recognition content Sa2 is obtained through the bookmark B<a>, Sa2 is inserted into the record document, the range covered by the bookmark B<a> is updated, a shading effect is added to the updated bookmark B<a>, and the flow proceeds to step 3.

2.2 If the text recognition state identifier Ta2 = 1, the insertion position of the text recognition content Sa2 is obtained through the bookmark B<a>, Sa2 is inserted into the record document, the range covered by the bookmark B<a> is updated, a shading effect is added to the updated bookmark B<a>, and the flow proceeds to step 3.

3. The recognition server returns the third text recognition information to the record insertion terminal. The record insertion terminal stores the text recognition content Sa3, the text recognition state identifier Ta3 and the text length L3 of the current text recognition information in the storage unit in overwrite mode, and deletes the text recognition content Sa2, the text recognition state identifier Ta2 and the text length L2 of the second text recognition information.

3.1 If Ta2 = 0 and Ta3 = 0, perform step 2.1.

3.2 If Ta2 = 0 and Ta3 = 1, perform step 2.2.

3.3 If Ta2 = 1 and Ta3 = 0, or Ta2 = 1 and Ta3 = 1, the insertion logic flow is executed again from step 1, and the shading effect of the text contained in the bookmark B<a> is cleared.
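The way the bookmark range and the shading follow the newest text can be sketched as below. This is a toy model under stated assumptions rather than the application's Word-based implementation: `ShadedDocument` is a hypothetical class, the document is a plain string, new text is always placed at the end of the document instead of at a position computed by a positioning function, and pending text is replaced wholesale instead of being compared and appended.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ShadedDocument:
    """Toy stand-in for a record document with one bookmark for one role."""

    text: str = ""
    bookmark: Optional[Tuple[int, int]] = None  # (start, end) of the shaded span
    prev_confirmed: bool = True  # nothing is pending before the first result

    def insert(self, content: str, confirmed: bool) -> None:
        if self.prev_confirmed or self.bookmark is None:
            # Previous result was confirmed: its shading is dropped and a new
            # bookmark starts at the insertion position (here: document end).
            start = len(self.text)
        else:
            # Previous result is still pending: reuse the bookmark position
            # and replace the pending text with the newer result.
            start = self.bookmark[0]
            self.text = self.text[:start]
        self.text += content
        self.bookmark = (start, len(self.text))  # shading follows newest text
        self.prev_confirmed = confirmed

    def shaded_text(self) -> str:
        """Return the text currently covered by the bookmark."""
        if self.bookmark is None:
            return ""
        start, end = self.bookmark
        return self.text[start:end]
```

In this model the shading of a confirmed result is only cleared when the next result arrives, which mirrors the check on the previous state identifier described above.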

The embodiment of the application provides another method for inserting speech recognition text into a record document, as shown in Fig. 3. In this technical scheme, the method is applied to a speech recognition server. Specifically, the speech recognition server may be an electronic device with data computation, storage and network interaction functions, or software that runs in such an electronic device to support data processing, storage and network interaction. The number of servers is not particularly limited in this embodiment: the server may be one server, several servers, or a server cluster formed by several servers. The method for inserting speech recognition text into the record document may comprise the following steps:

Step 301): receiving an audio stream.

In this embodiment, the voice collector collects voice of a user in an application scene in real time, and performs noise reduction processing on the collected voice to obtain an audio stream.

Step 302): segmenting the audio stream to obtain audio substreams.

In this embodiment, to improve the accuracy of speech recognition, the audio stream fed back by the voice collector is segmented: a large audio stream is split into several small audio streams. Because the amount of audio data handled in each recognition pass is then not particularly large, recognition accuracy is greatly improved.

Step 303): determining the target audio substream that currently needs to be recognized according to the text recognition state identifier in the previous text recognition information.

In this technical scheme, if the recognition server cannot confirm the recognition result of the audio information currently being recognized, the result is still fed back to the record insertion terminal and the unconfirmed content is inserted into the record document. The recognition server then recognizes the same audio information again and, whether or not the new result is confirmed, feeds it back to the record insertion terminal, where it is inserted into the record document. The next audio information is not recognized until the text recognition information of the audio information being processed by the recognition server is confirmed. If the recognition result for the current audio information is already in the confirmed state, it is fed back to the record insertion terminal, the confirmed content is inserted into the record document, and the recognition server then recognizes the next audio information.

In a conventional technical scheme, when the recognition server cannot confirm the recognition result of the current audio information, the result is not fed back to the record insertion terminal; only after the recognition server confirms the result is it fed back to the record insertion terminal for insertion. Insertion therefore takes longer in the conventional scheme than in the present technical scheme, which greatly degrades the user experience. In the present technical scheme, recognition information is inserted into the record document in real time whether or not it is confirmed, which improves the user experience. Therefore, in the present technical scheme, the step of determining the target audio substream that currently needs to be recognized includes:

Step 304): recognizing the target audio substream to obtain current text recognition information; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

Step 305): sending the current text recognition information to the record insertion terminal, so that the text recognition content in the current text recognition information is inserted into the record document.

In this technical scheme, the text produced while the recognition server slices, compares, analyzes and computes on the audio stream is inserted into the document in real time whether or not it is confirmed, which solves the problem of poor user experience caused by the slow return of recognized text.
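The server-side flow of steps 301) to 305) can be sketched as follows. This is an illustrative sketch only: the fixed-size byte chunking in `segment`, the `recognize` callback and the `max_attempts` safety bound are all assumptions introduced here, since the text does not specify how the stream is segmented, how recognition is performed, or whether retries are bounded.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# (text recognition content, state identifier: 1 confirmed / 0 not, text length)
Result = Tuple[str, int, int]


def segment(stream: bytes, size: int) -> List[bytes]:
    """Split a large audio stream into substreams of at most `size` bytes."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]


def recognize_stream(
    substreams: Iterable[bytes],
    recognize: Callable[[bytes, int], Tuple[str, int]],
    max_attempts: int = 10,  # hypothetical bound so the sketch always terminates
) -> Iterator[Result]:
    """Yield every result, confirmed or not; advance only after confirmation."""
    for sub in substreams:
        for attempt in range(max_attempts):
            text, state = recognize(sub, attempt)
            # Each result is sent to the insertion terminal immediately.
            yield (text, state, len(text))
            if state == 1:
                break  # confirmed: the next substream becomes the target
```

The key design point is that the generator yields before checking the state identifier, so unconfirmed text reaches the terminal without waiting for confirmation.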

Fig. 4 is a functional block diagram of a device for inserting speech recognition text into a record document according to an embodiment of the present application. In practical application, the device is used in the record insertion terminal. The device comprises:

a receiving unit 401, configured to receive current text recognition information of a target audio substream; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

an insertion unit 402, configured to insert the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier of the current text recognition information.

In this embodiment, the insertion unit comprises:

a first insertion module, configured to insert the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information, if the text recognition state identifier in the current text recognition information is a non-confirmation identifier and the text recognition state identifier in the previous text recognition information is a non-confirmation identifier;

a second insertion module, configured to insert the text recognition content of the current text recognition information into the corresponding position of the record document, if the text recognition state identifier in the current text recognition information is a non-confirmation identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier;

a third insertion module, configured to insert the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information, if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a non-confirmation identifier;

and a fourth insertion module, configured to insert the text recognition content of the current text recognition information into the corresponding position of the record document, if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier.
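The dispatch over the previous and current state identifiers can be sketched as below; the returned names follow the first to fourth modules recited in claim 6. The function `select_module` and the `Info` tuple are hypothetical, and treating a missing previous result as confirmed is an assumption made for the first result of a role.

```python
from typing import Optional, Tuple

Info = Tuple[str, int, int]  # (content, state identifier, text length)


def select_module(prev: Optional[Info], cur: Info) -> str:
    """Name the insertion module that handles the (previous, current) pair."""
    # Assumption: with no previous result there is no pending text to compare.
    prev_confirmed = prev is None or prev[1] == 1
    cur_confirmed = cur[1] == 1
    if not prev_confirmed:
        # Previous text is still pending, so lengths and contents are compared.
        return "third" if cur_confirmed else "first"
    # Previous text was confirmed (or absent): insert at a fresh position.
    return "fourth" if cur_confirmed else "second"
```

Only the first and third modules need the stored text length and content of the previous result, which is why the previous state is tested first.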

Fig. 5 is a second functional block diagram of a device for inserting speech recognition text into a record document according to the embodiment of the present application. In practical application, the device is used in the speech recognition server. The device comprises:

a receiving unit 501 for receiving an audio stream;

a segmentation unit 502, configured to segment the audio stream to obtain an audio substream;

a target audio substream confirming unit 503, configured to determine the target audio substream that currently needs to be recognized according to the text recognition state identifier in the previous text recognition information;

a recognition unit 504, configured to recognize the target audio substream to obtain current text recognition information; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

a sending unit 505, configured to send the current text recognition information to the record insertion terminal, so that the text recognition content in the current text recognition information is inserted into the record document.

Fig. 6 is a schematic diagram of an electronic system according to an embodiment of the present application. The electronic device includes: a memory a and a processor b, wherein the memory a stores a computer program, and when the computer program is executed by the processor b, the computer program realizes the following functions:

In this embodiment, when the corresponding text recognition content is inserted into the corresponding position of the record document according to the text recognition status identifier of the current text recognition information, the computer program implements the following functions when executed by the processor b:

if the text recognition state identifier in the current text recognition information is a non-confirmation identifier and the text recognition state identifier in the previous text recognition information is a non-confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information;

if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a non-confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information;

In this embodiment, when the text recognition content of the current text recognition information is inserted into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information, the computer program, when executed by the processor b, implements the following functions:

An embodiment of the present application provides another electronic device, where the electronic device includes: a memory a and a processor b, wherein the memory a stores a computer program, and when the computer program is executed by the processor b, the computer program realizes the following functions:

receiving an audio stream;

segmenting the audio stream to obtain audio substreams;

In this embodiment, a target audio substream currently to be identified is determined, and the computer program, when executed by the processor b, implements the following functions:

In this embodiment, the Memory includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card).

In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.

The specific functions implemented by the memory and the processor of the electronic device provided in the embodiments of the present disclosure may be explained in comparison with the foregoing embodiments in the present disclosure, and can achieve the technical effects of the foregoing embodiments, and thus, no further description is provided herein.

In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor or a switch) or an improvement in software (an improvement to a method flow). However, as technology has advanced, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be realized by hardware physical modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Furthermore, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can readily be obtained merely by lightly programming the method flow into an integrated circuit using one of the hardware description languages described above.

Those skilled in the art will also appreciate that, in addition to implementing a client, server as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the client, server are in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, such a client and a server may be considered as a hardware component, and a device included therein for implementing various functions may be considered as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the client, reference may be made to the introduction of the embodiments of the method described above for a comparative explanation.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

Although the present application has been described in terms of embodiments, those of ordinary skill in the art will recognize that there are numerous variations and permutations of the present application without departing from the spirit of the application, and it is intended that the appended claims encompass such variations and permutations without departing from the spirit of the application.

Claims

1. A method for inserting a voice recognition text into a script document, applied to a script insertion terminal, comprising the following steps:

receiving current text identification information of a target audio substream; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

inserting corresponding text recognition content into a corresponding position of the record document according to the text recognition state identifier of the current text recognition information, wherein the method comprises the following steps:

if the text recognition state identifier in the current text recognition information is a non-confirmed identifier and the text recognition state identifier in the previous text recognition information is a non-confirmed identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information;

inserting the text recognition content of the current text recognition information into the corresponding position of the record document if the text recognition state identifier in the current text recognition information is a non-confirmation identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier;

if the text recognition state identifier in the current text recognition information is a confirmed identifier and the text recognition state identifier in the previous text recognition information is a non-confirmed identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information;

and if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the record document.

2. The method of claim 1, wherein the step of inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information comprises:

3. The method of claim 2, wherein the step of inserting the text recognition content of the current text recognition information into the corresponding position of the record document comprises:

if the text recognition state identifier in the previous text recognition information is a non-confirmed identifier and the text recognition state identifier in the current text recognition information is a non-confirmed identifier, obtaining the insertion position of the text recognition content in the current text recognition information through the bookmark used when the text recognition content in the previous text recognition information was inserted, inserting the text recognition content in the current text recognition information into the corresponding position, and updating the range covered by the bookmark;

and if the text recognition state identifier in the previous text recognition information is a confirmation identifier, acquiring the insertion position of the text recognition content in the current text recognition information through a positioning function, inserting the text recognition content in the current text recognition information into a corresponding position, removing the shading effect of the bookmark containing the text content used when the text recognition content in the previous text recognition information is inserted, and re-creating a corresponding bookmark, wherein the bookmark contains a position area of the text recognition content in the current text recognition information.

4. The method of claim 1, further comprising the following steps, applied to a speech recognition server:

receiving an audio stream;

segmenting the audio stream to obtain audio substreams;

identifying the target audio substream to obtain current text identification information; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

5. The method of claim 4, wherein the step of determining a target audio substream currently in need of identification comprises:

6. A device for inserting a voice recognition text into a record document, applied to a record insertion terminal, comprising:

a receiving unit for receiving current text identification information of a target audio substream; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

an insertion unit, configured to insert corresponding text recognition content into a corresponding position of the record document according to the text recognition state identifier of the current text recognition information, wherein the insertion unit comprises:

a first insertion module, configured to insert the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information, if the text recognition state identifier in the current text recognition information is a non-confirmed identifier and the text recognition state identifier in the previous text recognition information is a non-confirmed identifier;

a second insertion module, configured to insert the text recognition content of the current text recognition information into the corresponding position of the record document, if the text recognition state identifier in the current text recognition information is a non-confirmed identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier;

a third insertion module, configured to insert the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information, if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a non-confirmed identifier;

and a fourth insertion module, configured to insert the text recognition content of the current text recognition information into the corresponding position of the record document, if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier.

7. The apparatus of claim 6, applied to a speech recognition server, the receiving unit being adapted to receive an audio stream;

the device further comprises: the segmentation unit is used for segmenting the audio stream to obtain an audio substream;

the identification unit is used for identifying the target audio substream to obtain current text identification information; the current text recognition information comprises text recognition content, a text recognition state identifier and a text length;

8. The apparatus of claim 7, wherein the target audio substream confirmation unit comprises: