
HK40002006B - Speech annotation method, device, and equipment - Google Patents


Info

Publication number
HK40002006B
HK40002006B
Authority
HK
Hong Kong
Prior art keywords
sentence
pinyin
original
voice
data
Application number
HK19125250.1A
Other languages
Chinese (zh)
Other versions
HK40002006A (en)
Inventor
官砚楚
杨磊
陈力
韩喆
Original Assignee
创新先进技术有限公司
Application filed by 创新先进技术有限公司
Publication of HK40002006A
Publication of HK40002006B

Description

Speech annotation method, apparatus, and device
Technical Field
The present disclosure relates to the field of data processing, and in particular to a speech annotation method, apparatus, and device.
Background
Training a good acoustic model, whether for speech recognition or speech synthesis, requires a large amount of speech data together with the correct text corresponding to that speech; a unit of speech data and its correct text is referred to as a text-speech pair. The process of determining the correct text for speech data may be referred to as speech annotation, and the correct text may be referred to as the annotation data of the speech data. In the related art, speech data is usually transcribed into text by manual dictation, and the correct text corresponding to the speech data is determined by human judgment based on semantic context and other factors to obtain text-speech pairs. This manual approach to speech annotation is inefficient and incurs high labor costs.
Disclosure of Invention
To overcome the problems in the related art, this specification provides a speech annotation method, apparatus, and device.
According to a first aspect of embodiments of the present specification, there is provided a speech annotation method, the method including:
acquiring original text information and speech data, where the speech data includes recording data obtained by reading the original text information aloud;
segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
comparing the similarity between recognized sentence information, obtained by performing speech recognition on the speech sentence data, and original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
In one embodiment, segmenting the speech data into sentences to obtain at least one piece of speech sentence data includes:
determining the start and end positions of continuous speech in the speech data according to the relationship between the short-time energy of each frame and a preset energy threshold;
segmenting the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data whose frame count is greater than or equal to a preset frame-count threshold.
In one embodiment, the similarity comparison between the recognized sentence information and the original sentence information includes a comparison between the pinyin of the recognized sentence information and the pinyin of the original sentence information, where the pinyin is toned pinyin.
In one embodiment, comparing the similarity between the recognized sentence information obtained by performing speech recognition on the speech sentence data and the original sentence information in the original text information includes:
splitting the original text information into sentences at punctuation marks and ordering them to obtain an original text sequence;
performing speech recognition on the speech sentence data and ordering the results to obtain a recognized text sequence;
converting the original text sequence and the recognized text sequence into toned pinyin to obtain an original pinyin sequence composed of original pinyin sentences and a recognized pinyin sequence composed of recognized pinyin sentences;
for each original pinyin sentence in the original pinyin sequence, comparing its similarity against the recognized pinyin sentences at the current sequence number, and at the sequence numbers within a designated offset before and after it, in the recognized pinyin sequence, to obtain a comparison result.
In one embodiment, forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result includes:
selecting a recognized pinyin sentence, according to the comparison result and a preset screening condition, from the recognized pinyin sentences at the current sequence number and at the sequence numbers within the designated offset before and after it;
using the recognized sentence information corresponding to the selected recognized pinyin sentence to verify the original sentence information corresponding to the original pinyin sentence, the verification including deleting characters that were skipped or adding characters that were read extra;
taking the verified text information as the annotation data of the speech sentence data corresponding to the recognized sentence information, to form a text-speech pair.
In one embodiment, the preset screening condition includes one of the following:
if the maximum similarity in the comparison result is greater than a preset similarity threshold and only one maximum exists, selecting the recognized pinyin sentence corresponding to the maximum similarity;
if the maximum similarity in the comparison result is greater than the preset similarity threshold and at least two maxima exist, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maxima;
if both the maximum and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
According to a second aspect of embodiments of the present specification, there is provided a speech annotation apparatus, the apparatus including:
an information acquisition module, configured to acquire original text information and speech data, where the speech data includes recording data obtained by reading the original text information aloud;
a data segmentation module, configured to segment the speech data into sentences to obtain at least one piece of speech sentence data;
a speech pair forming module, configured to compare the similarity between recognized sentence information, obtained by performing speech recognition on the speech sentence data, and original sentence information in the original text information, and to form text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
In one embodiment, the data segmentation module is specifically configured to:
determine the start and end positions of continuous speech in the speech data according to the relationship between the short-time energy of each frame and a preset energy threshold;
segment the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data whose frame count is greater than or equal to a preset frame-count threshold.
In one embodiment, the similarity comparison between the recognized sentence information and the original sentence information includes a comparison between the pinyin of the recognized sentence information and the pinyin of the original sentence information, where the pinyin is toned pinyin.
In one embodiment, the speech pair forming module is specifically configured to:
split the original text information into sentences at punctuation marks and order them to obtain an original text sequence;
perform speech recognition on the speech sentence data and order the results to obtain a recognized text sequence;
convert the original text sequence and the recognized text sequence into toned pinyin to obtain an original pinyin sequence composed of original pinyin sentences and a recognized pinyin sequence composed of recognized pinyin sentences;
for each original pinyin sentence in the original pinyin sequence, compare its similarity against the recognized pinyin sentences at the current sequence number, and at the sequence numbers within a designated offset before and after it, in the recognized pinyin sequence, to obtain a comparison result.
In one embodiment, the speech pair forming module is further configured to:
select a recognized pinyin sentence, according to the comparison result and a preset screening condition, from the recognized pinyin sentences at the current sequence number and at the sequence numbers within the designated offset before and after it;
use the recognized sentence information corresponding to the selected recognized pinyin sentence to verify the original sentence information corresponding to the original pinyin sentence, the verification including deleting characters that were skipped or adding characters that were read extra;
take the verified text information as the annotation data of the speech sentence data corresponding to the recognized sentence information, to form a text-speech pair.
In one embodiment, the preset screening condition includes one of the following:
if the maximum similarity in the comparison result is greater than a preset similarity threshold and only one maximum exists, selecting the recognized pinyin sentence corresponding to the maximum similarity;
if the maximum similarity in the comparison result is greater than the preset similarity threshold and at least two maxima exist, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maxima;
if both the maximum and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
According to a third aspect of embodiments of the present specification, there is provided a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the following method:
acquiring original text information and speech data, where the speech data includes recording data obtained by reading the original text information aloud;
segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
comparing the similarity between recognized sentence information, obtained by performing speech recognition on the speech sentence data, and original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
The technical solutions provided by the embodiments of this specification can have the following beneficial effects:
In the embodiments of this specification, original text information and the speech data corresponding to it are acquired, and the speech data is segmented into sentences to obtain multiple pieces of speech sentence data that actually contain speech. Speech recognition is then performed on the speech sentence data, the resulting recognized sentence information is compared for similarity against the original sentence information in the original text information, and text-speech pairs are formed from the original sentence information and the speech sentence data according to the comparison result. Annotation is thus performed automatically, which improves the efficiency of obtaining text-speech pairs.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and together with the description, serve to explain the principles of the specification.
FIG. 1 is a flowchart of a speech annotation method according to an exemplary embodiment of this specification.
FIG. 2 is a flowchart of another speech annotation method according to an exemplary embodiment of this specification.
FIG. 3 is a hardware block diagram of a computer device in which a speech annotation apparatus according to an exemplary embodiment resides.
FIG. 4 is a block diagram of a speech annotation apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of apparatus and methods consistent with certain aspects of this specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms; the terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly second information may be referred to as first information, without departing from the scope of this specification. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
Speech recognition converts speech into text and involves acoustic models and language models during recognition. Speech synthesis maps text to audio, and an acoustic model may likewise be involved in synthesis. For example, an end-to-end speech recognition model may be a neural network running from the input (a speech waveform or feature sequence) to the output (a word or character sequence), with the conventional acoustic model, pronunciation dictionary, language model, and other traditional modules subsumed into the network for processing. An end-to-end speech synthesis model runs from the input (a word or character sequence) to the output (a speech waveform or feature sequence).
Building an acoustic model relies on a large amount of speech data and the correct text corresponding to it: the statistical relationship between speech and characters is learned by training the model on speech data paired with its correct text, yielding the acoustic model. The process of determining the correct text for speech data may be referred to as speech annotation, and the correct text, serving as the annotation result for the speech data, may also be referred to as annotation data.
Current speech annotation typically transcribes speech data into text by manual dictation to obtain text-speech pairs. However, the number of text-speech pairs required is large, and manual annotation is inefficient and incurs high labor costs.
In view of this, in the embodiments of this specification, original text information and the speech data corresponding to it are acquired, and the speech data is segmented into sentences to obtain multiple pieces of speech sentence data that contain speech. Speech recognition is then performed on the speech sentence data, the resulting recognized sentence information is compared for similarity against the original sentence information in the original text information, and text-speech pairs are formed from the original sentence information and the speech sentence data according to the comparison result, so that annotation is performed automatically and the efficiency of obtaining text-speech pairs is improved. The embodiments of this specification are described below with reference to the accompanying drawings.
FIG. 1 is a flowchart of a speech annotation method according to an exemplary embodiment of this specification; the method includes:
In step 102, original text information and speech data are acquired, where the speech data includes recording data obtained by reading the original text information aloud.
In step 104, the speech data is segmented into sentences to obtain at least one piece of speech sentence data.
In step 106, recognized sentence information obtained by performing speech recognition on the speech sentence data is compared for similarity against the original sentence information in the original text information, and text-speech pairs are formed from the original sentence information and the speech sentence data according to the comparison result.
In this embodiment, the original text information and the speech data are associated. The original text information may include text to be read aloud (also called recording text) for the voice recording; the speech data may include recording data obtained by reading the original text information aloud. For example, a professional speaker may be invited to read the original text, producing speech data corresponding to it. As another example, news text and the matching news audio may be obtained from a news platform.
The original text information may contain one or more sentences, and the speech data correspondingly contains one or more spoken sentences. For clarity, this embodiment prefixes anything derived from the original text information with "original" and anything derived from recognizing the speech data with "recognized". For example, a sentence in the original text information is called original sentence information, and the pieces of original sentence information, in order, form the original text sequence. Converting original sentence information to pinyin yields original pinyin sentences, which form the original pinyin sequence. Correspondingly, sentence-breaking the speech data yields speech sentence data; performing speech recognition on the speech sentence data yields recognized sentence information, which in order forms the recognized text sequence; and converting recognized sentence information to pinyin yields recognized pinyin sentences, which form the recognized pinyin sequence.
This embodiment obtains speech sentence data containing continuous speech by sentence-breaking the speech data; speech sentence data may also be called speech segment data. Different punctuation marks represent pauses of different lengths, with sentence-final punctuation representing a longer pause than punctuation inside a sentence, and the presence or absence of speech can be used to detect such pauses, so sentence-level segmentation of the speech data is feasible.
In one embodiment, the speech data may be segmented into sentences based on voice endpoint detection. Voice activity detection (VAD), also called speech activity detection or speech boundary detection, detects the presence or absence of speech in a noisy environment. Voice endpoint detection separates the speech signal from the non-speech signal in the raw speech data and locates the start and end points of the speech signal, referred to as endpoints. Accordingly, in the embodiments of this disclosure, endpoints may be detected with voice endpoint detection, and whether two speech segments belong to the same sentence may be decided from the interval between endpoints, thereby segmenting the speech data into sentences. This restricts processing to the speech signal, ignores the non-speech signal, and yields speech sentence data that actually contains speech. Furthermore, the pause duration can be incorporated into the sentence segmentation.
Because energy differs greatly between silence and speech, in one embodiment the short-time energy of each frame may be compared against an energy threshold to detect and segment speech, improving segmentation accuracy. For example, segmenting the speech data via voice endpoint detection to obtain at least one piece of speech sentence data containing speech may include:
determining the start and end positions of continuous speech in the speech data according to the relationship between the short-time energy of each frame and a preset energy threshold;
segmenting the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data.
Here, the speech data may be framed and windowed, and the short-time energy of each frame computed. For example, the speech data is divided into a number of short-time energy frames according to a preset rate, and the energy of a frame is determined from the amplitudes of the audio signal at the sampling points within it: the energy at each sampling point is computed from its amplitude, and the sum of these energies is taken as the frame energy. Comparing each frame's energy against the preset energy threshold locates the start and end points of continuous speech in the speech data; whether adjacent speech segments belong to the same sentence is then decided from the interval between the end point of the preceding segment and the start point of the current segment, yielding speech sentence data that contains spoken audio.
In practice, the speech data may contain speech unrelated to the original text information — for example, a speaker uttering interjections or spelling out pronunciations while reading — and such speech is typically short. Therefore, after initial speech sentence data is obtained with the energy threshold, only the pieces whose frame count is greater than or equal to a preset frame-count threshold may be retained as speech sentence data.
Here, the frame count may be the number of consecutive speech frames. In one example, a frame may span 20-50 ms and a word approximately 200-300 ms. The preset frame-count threshold may be derived from the character length to be excluded: for example, to exclude pieces of no more than 5 Chinese characters from the speech sentence data, the number of audio frames corresponding to 5 Chinese characters may be taken as the preset frame-count threshold.
By excluding speech sentence data whose frame count is below the preset frame-count threshold, this embodiment filters out speech unrelated to the original text information and obtains speech sentence data more strongly related to it, improving the accuracy of speech annotation.
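To make this concrete, below is a minimal sketch of short-time-energy segmentation in Python, assuming PCM samples in a NumPy array. The frame length, hop, energy threshold, silence gap, and minimum frame count are illustrative parameters, not values fixed by the specification:

```python
import numpy as np

def split_sentences(samples: np.ndarray, sr: int = 16000,
                    frame_ms: int = 25, hop_ms: int = 10,
                    energy_thresh: float = 1e-3, max_gap_frames: int = 30,
                    min_frames: int = 50) -> list:
    """Segment speech into sentence-level chunks by short-time energy.

    Frames whose energy exceeds `energy_thresh` count as speech; speech runs
    separated by fewer than `max_gap_frames` silent frames merge into one
    sentence; runs shorter than `min_frames` are dropped as likely noise or
    stray speech unrelated to the script.
    """
    frame_len = sr * frame_ms // 1000
    hop = sr * hop_ms // 1000
    n_frames = max(0, (len(samples) - frame_len) // hop + 1)
    # Short-time energy: sum of squared amplitudes within each frame.
    energy = np.array([np.sum(samples[i*hop : i*hop+frame_len] ** 2)
                       for i in range(n_frames)])
    voiced = energy > energy_thresh

    segments, start, last_voiced = [], None, None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                        # start point of continuous speech
        if v:
            last_voiced = i                  # current end-point candidate
        # Close the sentence once the silent gap exceeds the sentence pause.
        if start is not None and not v and i - last_voiced >= max_gap_frames:
            if last_voiced - start + 1 >= min_frames:
                segments.append(samples[start*hop : last_voiced*hop+frame_len])
            start = None
    if start is not None and last_voiced - start + 1 >= min_frames:
        segments.append(samples[start*hop : last_voiced*hop+frame_len])
    return segments
```

At a 10 ms hop, excluding pieces of no more than 5 Chinese characters at 200-300 ms per character would put `min_frames` on the order of 100-150; the specification leaves the concrete values to the implementer.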
After the speech sentence data is obtained, speech recognition may be performed on it to obtain recognized sentence information. Any speech recognition approach from the related art may be used; it is not limited here. On the text side, to improve the accuracy and efficiency of annotation, the original text information may be broken into sentences to obtain original sentence information. To distinguish the two kinds of text, text obtained by recognizing speech sentence data is called recognized sentence information, and text obtained by sentence-breaking the original text information is called original sentence information. In one example, sentence breaking may be performed by checking whether a character is a designated character, e.g., a period, exclamation point, question mark, ellipsis, or line break.
Because pauses in the speech data usually occur at sentence breaks, the speech data is sentence-broken and recognized into recognized sentence information, the original text information is sentence-broken into original sentence information, and the two are compared for similarity; comparing segment against segment improves efficiency. In particular, restricting the comparison to local text — using the position of the speech sentence data within the speech data and the position of the original sentence information within the original text — further improves comparison efficiency.
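On the text side, the sentence break at designated characters might look as follows — a sketch in which the character set beyond the examples named above (period, exclamation point, question mark, ellipsis, line break) is an assumption:

```python
import re

# Sentence-ending characters per the examples above (CJK and ASCII forms).
_SENT_END = r'[。！？!?…\n]+'

def split_original_text(text: str) -> list[str]:
    """Break the original text into ordered sentences at end punctuation."""
    parts = re.split(_SENT_END, text)
    return [p.strip() for p in parts if p.strip()]
```

For example, `split_original_text("今天天气很好。我们去公园吧！")` yields the ordered original text sequence `['今天天气很好', '我们去公园吧']`.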
In practice, the same pinyin may correspond to different characters — for example, the text corresponding to the speech "bǐjì" may be "笔记" (notes) or "笔迹" (handwriting) — so speech recognition may transcribe inaccurately. To keep such recognition errors from distorting the similarity comparison between texts, in one example the similarity comparison between the recognized sentence information and the original sentence information may include a comparison between the pinyin of the recognized sentence information and the pinyin of the original sentence information.
Comparing in pinyin thus mitigates inaccuracies of the speech recognition process. Further, the pinyin may be toned pinyin; carrying the tones improves the accuracy of the text similarity comparison.
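For illustration, the toned-pinyin conversion can be done with the third-party pypinyin package — an assumption, since the specification does not name a converter. Its `Style.TONE3` appends the tone as a digit, matching the numeric tone representation described later:

```python
from pypinyin import lazy_pinyin, Style  # assumed third-party converter

def to_toned_pinyin(sentence: str) -> list[str]:
    """Convert a Chinese sentence to toned pinyin, one syllable per element."""
    return lazy_pinyin(sentence, style=Style.TONE3)

# "笔记" (notes) and "笔迹" (handwriting) share the same toned pinyin, so
# comparing in pinyin space is robust to this recognition ambiguity:
assert to_toned_pinyin("笔记") == to_toned_pinyin("笔迹") == ["bi3", "ji4"]
```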
To improve comparison efficiency, the original text information and the recognized sentence information obtained by speech recognition may each be ordered and compared as sequences. For example, the similarity comparison between the recognized sentence information and the original sentence information may include comparing the recognized pinyin sequence, converted from the recognized sentence information, against the original pinyin sequence, converted from the original sentence information.
In practice, a speaker may misread and re-read a passage. To still obtain accurate text-speech pairs, the similarity comparison may be performed locally: the original pinyin sentence with sequence number i in the original pinyin sequence is compared against the recognized pinyin sentences with sequence numbers (i-k) through (i+k) in the recognized pinyin sequence. Specifically, comparing the similarity between the recognized sentence information obtained by speech recognition on the speech sentence data and the original sentence information in the original text information may include:
splitting the original text information into sentences at punctuation marks and ordering them to obtain an original text sequence;
performing speech recognition on the speech sentence data and ordering the results to obtain a recognized text sequence;
converting the original text sequence and the recognized text sequence into toned pinyin to obtain an original pinyin sequence composed of original pinyin sentences and a recognized pinyin sequence composed of recognized pinyin sentences;
for each original pinyin sentence in the original pinyin sequence, comparing its similarity against the recognized pinyin sentences at the current sequence number, and at the sequence numbers within a designated offset before and after it, in the recognized pinyin sequence, to obtain a comparison result.
Here, the original text information is split at punctuation into original sentence information, which is ordered by its position in the original text to form the original text sequence. The speech sentence data is recognized into recognized sentence information, which is ordered by the position of the speech sentence data in the speech data to form the recognized text sequence. Converting the original sentence information in the original text sequence into toned pinyin yields the original pinyin sequence, composed of original pinyin sentences; converting the recognized sentence information in the recognized text sequence into toned pinyin yields the recognized pinyin sequence, composed of recognized pinyin sentences. In one example, the four tones of a pinyin syllable may be represented by the digits 1 through 4. To distinguish the two pinyin sequences, they are named the original pinyin sequence and the recognized pinyin sequence.
The original pinyin sentences in the original pinyin sequence are traversed, and for each original pinyin sentence the following comparison is performed:
compare the similarity of the original pinyin sentence against the recognized pinyin sentences at the current sequence number, and at the sequence numbers within the designated offset before and after it, in the recognized pinyin sequence, to obtain a comparison result.
Assuming the current sequence number is i and the designated offset is k, the recognized pinyin sentences with sequence numbers (i-k) through (i+k) serve as comparison targets for the original pinyin sentence with sequence number i. This realizes fast local text-similarity matching based on pinyin and improves the accuracy of the text information taken as annotation for the speech sentence data — that is, the accuracy of the text-speech pairs.
Regarding the similarity, in one example it may be determined from the ratio of the number of matched pinyin syllables in the two pinyin sentences (the original pinyin sentence and the recognized pinyin sentence) to the total number of pinyin syllables in the two sentences. For example, the formula sim(i, j) = 2M/T may be used, where M is the number of pinyin syllables in which the original pinyin sentence with sequence number i matches the recognized pinyin sentence with sequence number j, T is the total number of pinyin syllables in the two sentences, and j ∈ [i-k, i+k].
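A sketch of the local comparison under this reading — T as the combined syllable count of the two sentences, i.e., a Dice-style score — with the window size k = 2 as an illustrative choice:

```python
from collections import Counter

def pinyin_similarity(a: list[str], b: list[str]) -> float:
    """sim = 2*M/T: M matched syllables, T total syllables in both sentences."""
    matched = sum((Counter(a) & Counter(b)).values())
    total = len(a) + len(b)
    return 2 * matched / total if total else 0.0

def compare_local(orig_seq: list[list[str]], rec_seq: list[list[str]],
                  k: int = 2) -> list[dict[int, float]]:
    """For original sentence i, score recognized sentences i-k .. i+k."""
    results = []
    for i, orig in enumerate(orig_seq):
        window = {j: pinyin_similarity(orig, rec_seq[j])
                  for j in range(max(0, i - k), min(len(rec_seq), i + k + 1))}
        results.append(window)
    return results
```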
In one embodiment, since the similarity between an original pinyin sentence and a recognized pinyin sentence can serve as the similarity between the original sentence information and the speech sentence data, the speech sentence data whose similarity satisfies a condition can be screened out according to the comparison result and paired with the original sentence information to form a text-speech pair.
The preset screening condition is a condition for screening out the appropriate speech sentence data.
In one example, the preset screening condition may be: if the maximum similarity in the comparison result is greater than a preset similarity threshold and only one maximum exists, select the recognized pinyin sentence corresponding to the maximum similarity.
In this case, the speech sentence data whose similarity is both above the preset similarity threshold and the maximum in the comparison result is screened out and paired with the original sentence information to form a text-speech pair.
In another example, the preset screening condition may be: among the recognized pinyin sentences whose similarity exceeds the preset similarity threshold, select the one with the largest sequence number.
In this case, if two or more similarities in the comparison result exceed the preset similarity threshold, the speech sentence data corresponding to the recognized pinyin sentence with the largest sequence number is selected and paired with the original sentence information, on the consideration that when a passage is re-read, the last reading is usually the accurate one.
In another example, the preset screening condition may be: if the maximum similarity in the comparison result is greater than the preset similarity threshold and at least two maxima exist, select the recognized pinyin sentence with the largest sequence number among those corresponding to the maxima.
In this case, at least two equal maximum similarities appear in the comparison result, so the recognized pinyin sentence with the largest sequence number is screened out from those corresponding to the maxima.
In another example, if both the maximum similarity and the second-largest similarity in the comparison result exceed the preset similarity threshold, the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities is selected.
In this case, the recognized pinyin sentence with the largest sequence number is screened out from those corresponding to the maximum and second-largest similarities. Here the maximum and second-largest similarities are the top two similarities, i.e., only one maximum and one second-largest exist in the comparison result. If both exceed the preset similarity threshold, the speech sentence data corresponding to the recognized pinyin sentence with the largest sequence number is selected and paired with the original sentence information, again on the consideration that the last reading of a re-read passage is usually the accurate one.
It is understood that other preset screening conditions may be used, as long as similarity is at least one of the screening factors.
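The screening conditions can be sketched as one selection function — an interpretation of the rules above in which both the tie case and the second-largest case resolve to the larger sequence number:

```python
def select_candidate(window: dict[int, float], sim_thresh: float):
    """Pick a recognized-sentence index from {seq_no: similarity}, or None."""
    if not window:
        return None
    ranked = sorted(window.items(), key=lambda kv: kv[1], reverse=True)
    best_sim = ranked[0][1]
    if best_sim <= sim_thresh:
        return None                            # nothing similar enough
    top = [j for j, s in window.items() if s == best_sim]
    if len(top) == 1 and len(ranked) > 1 and ranked[1][1] > sim_thresh:
        # Maximum and second-largest both exceed the threshold: consider
        # both candidates and take the later one.
        top.append(ranked[1][0])
    return max(top)                            # largest sequence number wins
```

The `max(top)` tie-break encodes the assumption, stated above, that a re-read sentence is read correctly the last time.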
In practice, the recording may contain skipped or extra readings, and modifying the speech itself is relatively difficult. In view of this, in one embodiment the recognized sentence information may also be used to verify the original sentence information, deleting characters that were skipped and adding characters that were read extra. The speech sentence data whose similarity satisfies the condition is then screened out according to the comparison result and paired with the verified text information to form a text-speech pair. For example, forming a text-speech pair from the original sentence information and the speech sentence data according to the comparison result may include:
selecting a recognized pinyin sentence, according to the comparison result and a preset screening condition, from the recognized pinyin sentences at the current sequence number and at the sequence numbers within the designated offset before and after it;
using the recognized sentence information corresponding to the selected recognized pinyin sentence to verify the original sentence information corresponding to the original pinyin sentence, the verification including deleting characters that were skipped or adding characters that were read extra;
taking the verified text information as the annotation data of the speech sentence data corresponding to the recognized sentence information, to form a text-speech pair.
Here, the preset screening condition includes one of the following:
if the maximum similarity in the comparison result is greater than a preset similarity threshold and only one maximum exists, selecting the recognized pinyin sentence corresponding to the maximum similarity; if the maximum similarity in the comparison result is greater than the preset similarity threshold and at least two maxima exist, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maxima;
if both the maximum and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
It is understood that the verification may include other processing as well, which is not elaborated here.
In this embodiment, verifying the original sentence information against the recognized sentence information yields correct text for the speech sentence data while avoiding the difficulty of modifying the speech data itself.
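A sketch of the verification step using the Python standard library's difflib; character-level alignment of the original against the recognized text is one plausible realization, not the specification's prescribed algorithm:

```python
import difflib

def verify_original(original: str, recognized: str) -> str:
    """Correct the original sentence against the recognized one: drop
    characters the speaker skipped, insert characters read extra."""
    sm = difflib.SequenceMatcher(a=original, b=recognized)
    out = []
    for op, a0, a1, b0, b1 in sm.get_opcodes():
        if op in ("equal", "replace"):
            # Keep the scripted wording; a substitution is more likely a
            # recognition error (e.g., a homophone) than a misread.
            out.append(original[a0:a1])
        elif op == "insert":
            out.append(recognized[b0:b1])  # speaker read extra characters
        # op == "delete": speaker skipped these characters, so drop them
    return "".join(out)
```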
The technical features of the above embodiments may be combined arbitrarily provided the combinations contain no conflict or contradiction; for brevity, not every combination is described, but all such combinations fall within the scope disclosed in this specification.
One such combination is illustrated below.
FIG. 2 is a flowchart of another speech annotation method according to an exemplary embodiment. After the original text is obtained (step 202), the speech produced by reading it aloud is collected into a speech file, corresponding to the speech data in FIG. 1 (step 204). On the text side, a Chinese-character-to-pinyin conversion algorithm converts the original sentence information (sentences) in the original text into pinyin, yielding original pinyin sentences (step 206). The speech file is segmented by short-time energy into multiple pieces of speech sentence data (step 208). The speech sentence data is recognized into Chinese characters, yielding recognized sentence information (step 210), which is then converted into toned pinyin, yielding recognized pinyin sentences (step 212). Fast local text-similarity matching is performed between the original pinyin sentences and the recognized pinyin sentences to obtain a comparison result (step 214). A candidate sentence set — the recognized pinyin sentences at the current sequence number and at the designated sequence numbers before and after it — is screened according to the comparison result; the recognized sentence information corresponding to the selected recognized pinyin sentence is used to verify the original sentence information corresponding to the original pinyin sentence, punctuation is corrected, and the result is taken as the annotation data of the corresponding speech sentence data, forming a text-speech pair (step 216).
This embodiment covers the whole flow of speech segmentation, speech recognition, and fast pinyin-based local text-similarity matching, so annotation for an end-to-end speech model can be completed automatically. Comparing text similarity in toned pinyin avoids inaccuracies of the speech recognition process; comparing locally, using the relative positions of the texts, and measuring the degree of pinyin match cleans the raw speech data effectively. Both the accuracy and the recall of speech annotation are improved.
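Putting the steps of FIG. 2 together, the end-to-end flow might read as follows — a sketch reusing the helpers from the earlier sketches, where `recognize` is a placeholder for any speech recognizer and the similarity threshold is illustrative:

```python
def annotate(text: str, samples, sr: int, recognize, k: int = 2,
             sim_thresh: float = 0.8) -> list:
    """Return (annotation text, speech segment) text-speech pairs."""
    orig_sents = split_original_text(text)                    # step 206
    segments = split_sentences(samples, sr)                   # step 208
    rec_sents = [recognize(seg) for seg in segments]          # step 210
    orig_py = [to_toned_pinyin(s) for s in orig_sents]
    rec_py = [to_toned_pinyin(s) for s in rec_sents]          # step 212
    pairs = []
    for i, window in enumerate(compare_local(orig_py, rec_py, k)):  # 214
        j = select_candidate(window, sim_thresh)
        if j is not None:                                     # step 216
            label = verify_original(orig_sents[i], rec_sents[j])
            pairs.append((label, segments[j]))
    return pairs
```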
Corresponding to the above embodiments of the speech annotation method, this specification also provides embodiments of a speech annotation apparatus and of the electronic device to which it is applied.
The embodiments of the speech annotation apparatus of this specification may be applied to a computer device. The apparatus embodiments may be implemented in software, in hardware, or in a combination of both. Taking software as an example, as a logical apparatus it is formed by the processor of the computer device in which it resides reading the corresponding computer program instructions from non-volatile storage into memory and running them. In hardware terms, FIG. 3 is a hardware structure diagram of the computer device in which the speech annotation apparatus of this specification resides; besides the processor 310, network interface 320, memory 330, and non-volatile storage 340 shown in FIG. 3, the computer device in which the apparatus 331 resides may include other hardware according to the actual functions of the device, which is not elaborated here.
FIG. 4 is a block diagram of a speech annotation apparatus according to an exemplary embodiment of this specification; the apparatus includes:
an information acquisition module 42, configured to acquire original text information and speech data, where the speech data includes recording data obtained by reading the original text information aloud;
a data segmentation module 44, configured to segment the speech data into sentences to obtain at least one piece of speech sentence data;
a speech pair forming module 46, configured to compare the similarity between recognized sentence information, obtained by performing speech recognition on the speech sentence data, and original sentence information in the original text information, and to form text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
In one embodiment, the data segmentation module is specifically configured to:
determine the start and end positions of continuous speech in the speech data according to the relationship between the short-time energy of each frame and a preset energy threshold;
segment the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data whose frame count is greater than or equal to a preset frame-count threshold.
In one embodiment, the similarity comparison between the recognized sentence information and the original sentence information includes a comparison between the pinyin of the recognized sentence information and the pinyin of the original sentence information, where the pinyin is toned pinyin.
In one embodiment, the speech pair forming module is specifically configured to:
split the original text information into sentences at punctuation marks and order them to obtain an original text sequence;
perform speech recognition on the speech sentence data and order the results to obtain a recognized text sequence;
convert the original text sequence and the recognized text sequence into toned pinyin to obtain an original pinyin sequence composed of original pinyin sentences and a recognized pinyin sequence composed of recognized pinyin sentences;
for each original pinyin sentence in the original pinyin sequence, compare its similarity against the recognized pinyin sentences at the current sequence number, and at the sequence numbers within a designated offset before and after it, in the recognized pinyin sequence, to obtain a comparison result.
In one embodiment, the speech pair forming module is further configured to:
select a recognized pinyin sentence, according to the comparison result and a preset screening condition, from the recognized pinyin sentences at the current sequence number and at the sequence numbers within the designated offset before and after it;
use the recognized sentence information corresponding to the selected recognized pinyin sentence to verify the original sentence information corresponding to the original pinyin sentence, the verification including deleting characters that were skipped or adding characters that were read extra;
take the verified text information as the annotation data of the speech sentence data corresponding to the recognized sentence information, to form a text-speech pair.
In one embodiment, the preset screening condition includes one of the following:
if the maximum similarity in the comparison result is greater than a preset similarity threshold and only one maximum exists, selecting the recognized pinyin sentence corresponding to the maximum similarity;
if the maximum similarity in the comparison result is greater than the preset similarity threshold and at least two maxima exist, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maxima;
if both the maximum and the second-largest similarity in the comparison result are greater than the preset similarity threshold, selecting the recognized pinyin sentence with the largest sequence number among those corresponding to the maximum and second-largest similarities.
Since the apparatus embodiments substantially correspond to the method embodiments, the description of the method embodiments may be consulted for the relevant details. The apparatus embodiments described above are merely illustrative: the modules described as separate parts may or may not be physically separate, and a part shown as a module may or may not be a physical module — it may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions in this specification, and one of ordinary skill in the art can understand and implement them without inventive effort.
Accordingly, embodiments of this specification further provide a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the following method:
acquiring original text information and speech data, where the speech data includes recording data obtained by reading the original text information aloud;
segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
comparing the similarity between recognized sentence information, obtained by performing speech recognition on the speech sentence data, and original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
The embodiments in this specification are described progressively; for identical or similar parts, the embodiments may be consulted against one another, and each embodiment focuses on its differences from the others. In particular, since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively simply, and the description of the method embodiment may be consulted for the relevant details.
Embodiments of this specification further provide a computer storage medium storing program instructions that implement:
acquiring original text information and speech data, where the speech data includes recording data obtained by reading the original text information aloud;
segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
comparing the similarity between recognized sentence information, obtained by performing speech recognition on the speech sentence data, and original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result.
Embodiments of the present description may take the form of a computer program product embodied on one or more storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having program code embodied therein. Computer-usable storage media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of the storage medium of the computer include, but are not limited to: phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technologies, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by a computing device.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It will be understood that the present description is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (13)

1. A speech annotation method, the method comprising:
acquiring original text information and speech data, wherein the speech data comprises recording data obtained by reading the original text information aloud;
segmenting the speech data into sentences to obtain at least one piece of speech sentence data;
comparing the similarity between recognized sentence information, obtained by performing speech recognition on the speech sentence data, and original sentence information in the original text information, and forming text-speech pairs from the original sentence information and the speech sentence data according to the comparison result;
wherein, in forming the text-speech pairs, the original sentence information is verified against the recognized sentence information so as to delete characters that were skipped or add characters that were read extra.
2. The method according to claim 1, wherein segmenting the speech data into sentences to obtain at least one piece of speech sentence data comprises:
determining the start and end positions of continuous speech in the speech data according to the relationship between the short-time energy of each frame and a preset energy threshold;
segmenting the speech data into sentences according to the determined start and end positions and the interval between an end position and the following start position, to obtain at least one piece of speech sentence data whose frame count is greater than or equal to a preset frame-count threshold.
3. The method of claim 1, wherein the similarity comparison between the recognition sentence information and the original sentence information comprises: a comparison between the pinyin of the recognition sentence information and the pinyin of the original sentence information, the pinyin being pinyin with tones.
4. The method according to claim 3, wherein comparing the similarity of the recognition sentence information obtained by performing voice recognition on the voice sentence data with the original sentence information in the original text information comprises:
dividing the original text information into sentences according to punctuation and numbering them in order to obtain an original text sequence;
performing voice recognition on the voice sentence data and numbering the results in order to obtain a recognition text sequence;
converting the original text sequence and the recognition text sequence into pinyin with tones respectively, to obtain an original pinyin sequence comprising original pinyin sentences and a recognition pinyin sequence comprising recognition pinyin sentences;
for each original pinyin sentence in the original pinyin sequence, comparing its similarity with the recognition pinyin sentence at the current sequence number in the recognition pinyin sequence and with the recognition pinyin sentences at sequence numbers within a designated offset before and after the current sequence number, to obtain a comparison result.
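(Illustrative note, not part of the claims.) The sketch below shows one way the pinyin conversion and windowed comparison of claims 3 and 4 might look in Python. It assumes the third-party pypinyin library for toned pinyin (Style.TONE3 appends the tone digit) and borrows difflib's SequenceMatcher ratio as the similarity measure; the punctuation set, the similarity metric, and the offset of 2 are assumptions of this illustration, not features fixed by the claims.

    import re
    from difflib import SequenceMatcher
    from pypinyin import lazy_pinyin, Style   # third-party: pip install pypinyin

    def split_original(text):
        # sentence-break the original text on punctuation, keeping order
        return [s for s in re.split(r"[，。！？；、,.!?;]", text) if s]

    def to_toned_pinyin(sentence):
        # e.g. "中心" -> "zhong1 xin1"
        return " ".join(lazy_pinyin(sentence, style=Style.TONE3))

    def windowed_similarities(orig_sentences, recog_sentences, offset=2):
        """For original sentence k, compare its toned pinyin with the
        recognized sentences at sequence numbers k-offset .. k+offset."""
        orig_py = [to_toned_pinyin(s) for s in orig_sentences]
        recog_py = [to_toned_pinyin(s) for s in recog_sentences]
        results = []
        for k, o in enumerate(orig_py):
            window = {j: SequenceMatcher(None, o, recog_py[j]).ratio()
                      for j in range(max(0, k - offset),
                                     min(len(recog_py), k + offset + 1))}
            results.append(window)
        return results

Comparing toned pinyin rather than characters makes the match tolerant of homophone substitutions by the recognizer, which is the point of comparing pinyin in claim 3.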
5. The method of claim 4, wherein forming a text-voice pair from the original sentence information and the voice sentence data according to the comparison result comprises:
selecting a recognition pinyin sentence, according to the comparison result and a preset screening condition, from the recognition pinyin sentences at the current sequence number and at the sequence numbers within the designated offset before and after it;
verifying the original sentence information corresponding to the original pinyin sentence against the recognition sentence information corresponding to the selected recognition pinyin sentence, the verification comprising deleting characters that were skipped during reading or adding characters that were read in addition;
taking the text information obtained by the verification as the annotation data of the voice sentence data corresponding to the recognition sentence information, so as to form a text-voice pair.
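(Illustrative note, not part of the claims.) One possible reading of the verification step in claim 5, sketched with difflib opcodes: characters present only in the original are treated as skipped during reading and deleted, characters present only in the recognition result are treated as extra readings and added, and the policy of keeping the original characters on "replace" spans (treating them as recognition errors rather than misreadings) is an assumption of this sketch.

    from difflib import SequenceMatcher

    def verify_label(original, recognized):
        """Correct the original sentence against the recognized one to
        produce label text that matches what was actually spoken."""
        out = []
        matcher = SequenceMatcher(None, original, recognized)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                out.append(original[i1:i2])
            elif tag == "insert":      # spoken but not in the script: add it
                out.append(recognized[j1:j2])
            elif tag == "replace":     # likely an ASR error: trust the script
                out.append(original[i1:i2])
            # tag == "delete": skipped during reading, so drop it
        return "".join(out)

    # verify_label("今天天气很好", "今天天气好")  ->  "今天天气好"  (deletes the skipped 很)
    # verify_label("天气好", "今天天气好")        ->  "今天天气好"  (adds the extra 今天)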
6. The method of claim 5, wherein the preset screening condition comprises one of the following:
if the maximum similarity in the comparison result is greater than a preset similarity threshold and only one maximum similarity exists, selecting the recognition pinyin sentence corresponding to the maximum similarity;
if the maximum similarity in the comparison result is greater than the preset similarity threshold and at least two equal maximum similarities exist, selecting, among the recognition pinyin sentences corresponding to the maximum similarities, the one with the largest sequence number;
if both the maximum similarity and the second-highest similarity in the comparison result are greater than the preset similarity threshold, selecting, among the recognition pinyin sentences corresponding to the maximum and second-highest similarities, the one with the largest sequence number.
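(Illustrative note, not part of the claims.) The three screening conditions of claim 6 can overlap, so the order in which a concrete implementation tests them is a design choice; the sketch below shows one such ordering over a window mapping recognition sequence numbers to similarity scores, and the 0.7 threshold is a hypothetical value.

    def select_match(window, sim_thresh=0.7):
        """Return the chosen recognition sequence number, or None when
        no similarity in the window clears the threshold."""
        ranked = sorted(window.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked or ranked[0][1] <= sim_thresh:
            return None
        top_sim = ranked[0][1]
        top_idx = [j for j, s in window.items() if s == top_sim]
        if len(top_idx) > 1:
            # at least two equal maxima: take the largest sequence number
            return max(top_idx)
        if len(ranked) > 1 and ranked[1][1] > sim_thresh:
            # the second-highest similarity also clears the threshold:
            # take the larger sequence number of the top two candidates
            return max(top_idx[0], ranked[1][0])
        # a single maximum above the threshold
        return top_idx[0]

Preferring the largest sequence number in the tie-breaking conditions biases the match forward in time, which can suit recordings where the reader occasionally repeats a sentence.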
7. A voice annotation apparatus, the apparatus comprising:
an information acquisition module, configured to acquire original text information and voice data, wherein the voice data comprises: recording data obtained by reading the original text information aloud;
a data segmentation module, configured to segment the voice data into sentence fragments to obtain at least one segment of voice sentence data;
a voice pair forming module, configured to compare the similarity of recognition sentence information, obtained by performing voice recognition on the voice sentence data, with original sentence information in the original text information, and to form a text-voice pair from the original sentence information and the voice sentence data according to the comparison result; wherein, in the process of forming the text-voice pair, the original sentence information is verified against the recognition sentence information so as to delete characters that were skipped during reading or add characters that were read in addition.
8. The apparatus of claim 7, wherein the data segmentation module is configured to:
determine the start position and end position of continuous voice in the voice data according to the relation between the short-time energy of each frame in the voice data and a preset energy threshold;
segment the voice data into sentence fragments according to the determined start and end positions and the interval between an end position and the next start position, to obtain at least one segment of voice sentence data, wherein the frame count of the voice sentence data is greater than or equal to a preset frame-count threshold.
9. The apparatus of claim 7, wherein the similarity comparison between the recognition sentence information and the original sentence information comprises: a comparison between the pinyin of the recognition sentence information and the pinyin of the original sentence information, the pinyin being pinyin with tones.
10. The apparatus of claim 9, wherein the voice pair forming module is specifically configured to:
divide the original text information into sentences according to punctuation and number them in order to obtain an original text sequence;
perform voice recognition on the voice sentence data and number the results in order to obtain a recognition text sequence;
convert the original text sequence and the recognition text sequence into pinyin with tones respectively, to obtain an original pinyin sequence comprising original pinyin sentences and a recognition pinyin sequence comprising recognition pinyin sentences;
for each original pinyin sentence in the original pinyin sequence, compare its similarity with the recognition pinyin sentence at the current sequence number in the recognition pinyin sequence and with the recognition pinyin sentences at sequence numbers within a designated offset before and after the current sequence number, to obtain a comparison result.
11. The apparatus of claim 10, wherein the voice pair forming module is further configured to:
select a recognition pinyin sentence, according to the comparison result and a preset screening condition, from the recognition pinyin sentences at the current sequence number and at the sequence numbers within the designated offset before and after it;
verify the original sentence information corresponding to the original pinyin sentence against the recognition sentence information corresponding to the selected recognition pinyin sentence, the verification comprising deleting characters that were skipped during reading or adding characters that were read in addition;
take the text information obtained by the verification as the annotation data of the voice sentence data corresponding to the recognition sentence information, so as to form a text-voice pair.
12. The apparatus of claim 11, wherein the preset screening condition comprises one of the following:
if the maximum similarity in the comparison result is greater than a preset similarity threshold and only one maximum similarity exists, selecting the recognition pinyin sentence corresponding to the maximum similarity;
if the maximum similarity in the comparison result is greater than the preset similarity threshold and at least two equal maximum similarities exist, selecting, among the recognition pinyin sentences corresponding to the maximum similarities, the one with the largest sequence number;
if both the maximum similarity and the second-highest similarity in the comparison result are greater than the preset similarity threshold, selecting, among the recognition pinyin sentences corresponding to the maximum and second-highest similarities, the one with the largest sequence number.
13. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the following method:
acquiring original text information and voice data, wherein the voice data comprises: recording data obtained by reading the original text information aloud;
segmenting the voice data into sentence fragments to obtain at least one segment of voice sentence data;
comparing the similarity of recognition sentence information, obtained by performing voice recognition on the voice sentence data, with original sentence information in the original text information, and forming a text-voice pair from the original sentence information and the voice sentence data according to the comparison result;
wherein, in the process of forming the text-voice pair, the original sentence information is verified against the recognition sentence information so as to delete characters that were skipped during reading or add characters that were read in addition.
HK19125250.1A 2019-06-14 Speech marking method and device, and equipment HK40002006B (en)

Publications (2)

Publication Number Publication Date
HK40002006A (en) 2020-03-13
HK40002006B (en) 2021-03-12
