
CN109949813A - Method, apparatus, and system for converting speech into text - Google Patents


Info

Publication number
CN109949813A
CN109949813A (application CN201711386363.3A)
Authority
CN
China
Prior art keywords
text
feature parameter
parameter
voice signal
matching degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711386363.3A
Other languages
Chinese (zh)
Inventor
王群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Junlin Intelligent Technology Co.,Ltd.
Original Assignee
Beijing Junlin Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Junlin Polytron Technologies Inc filed Critical Beijing Junlin Polytron Technologies Inc
Priority to CN201711386363.3A priority Critical patent/CN109949813A/en
Priority to PCT/CN2018/122344 priority patent/WO2019120248A1/en
Publication of CN109949813A publication Critical patent/CN109949813A/en
Pending legal-status Critical Current

Links

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/26 — Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the invention disclose a method and device for converting speech into text. The method comprises: extracting a first feature parameter of a target voice signal; matching a second feature parameter with the third feature parameters in a speech database to determine N target feature parameters, N ≥ 2, the N target feature parameters being the N third feature parameters with the highest matching degree to the second feature parameter, and the second feature parameter being a part of the first feature parameter; determining the text corresponding to the target feature parameter with the highest matching degree to the second feature parameter, and outputting the text; determining the accuracy of the text using the matching degrees of the N target feature parameters; and highlighting the text if the accuracy is below a preset threshold. The embodiments of the invention let proofreaders easily locate low-accuracy text and judge its correctness, making checking more convenient while also improving checking efficiency and guaranteeing the accuracy of the text.

Description

Method, apparatus, and system for converting speech into text
Technical field
The embodiments of the present invention relate to the technical field of speech recognition, and in particular to a method, apparatus, and system for converting speech into text.
Background art
With the development of intelligent speech-to-text conversion technology, the efficiency of converting speech into text has improved significantly. The technology can be applied to meeting minutes, training records, or interview records. When converting a voice signal into text, the feature parameters of the voice signal are first extracted and then matched against the feature parameters corresponding to text in a speech database, so as to obtain and output the text with the highest matching degree. For standard Mandarin speech in a quiet environment, the conversion accuracy is high. In real scenarios, however, speakers inevitably have some accent, and recording in a quiet environment cannot be guaranteed, so the accuracy of the speech-to-text conversion cannot be guaranteed either.
Text obtained with existing intelligent speech-to-text conversion technology cannot be guaranteed to be 100% accurate, so the converted text needs to be checked manually. A common checking method is for proofreaders to read the entire text to find conversion errors, but this is time-consuming. Moreover, manual checking is error-prone; some mistakes are easily missed, leading to a higher error rate in the text.
Summary of the invention
The embodiments of the invention provide a method and terminal for converting speech into text, which can improve text-checking efficiency and reduce the error rate of the text.
An embodiment of the invention provides a method for converting speech into text, comprising:
extracting a first feature parameter of a target voice signal;
matching a second feature parameter with the third feature parameters in a speech database to determine N target feature parameters, the N target feature parameters being the N third feature parameters with the highest matching degree to the second feature parameter, N ≥ 2, the second feature parameter being a part of the first feature parameter;
determining the text corresponding to the target feature parameter with the highest matching degree to the second feature parameter, and outputting the text;
determining the accuracy of the text using the matching degrees of the N target feature parameters;
highlighting the text if the accuracy is below a preset threshold.
Further, the accuracy rate of the text is determined using the matching degree of N number of target signature parameter, comprising:
Determine the sum of corresponding matching degree of N number of target signature parameter;
Determine that the corresponding matching degree of the text accounts for the specific gravity of the sum of described matching degree, the specific gravity is the standard of the text True rate.
Further, if the accuracy rate is lower than preset threshold, the text is carried out to highlight label, comprising:
If the accuracy rate is lower than preset threshold, color mark is carried out to the text.
Further, the method also includes:
Obtain voice signal;
If the perdurabgility of sentence halted signals is more than preset time in the voice signal, pauses and believe in the sentence Make pauses in reading unpunctuated ancient writings at number to the voice signal, forms speech signal segment;
Timestamp is marked to the speech signal segment, the speech signal segment is targeted voice signal.
Further, the method also includes:
The corresponding text section of speech signal segment described in timestamp label using the speech signal segment.
Further, the method also includes:
When detecting play instruction, text to be played is obtained;
The corresponding timestamp of text section where determining the text to be played;
Play the corresponding speech signal segment of the timestamp.
An embodiment of the invention also provides a device for converting speech into text, comprising:
an extraction unit for extracting a first feature parameter of a target voice signal;
a matching unit for matching a second feature parameter with the third feature parameters in a speech database to determine N target feature parameters, the N target feature parameters being the N third feature parameters with the highest matching degree to the second feature parameter, N ≥ 2, the second feature parameter being a part of the first feature parameter;
a first determination unit for determining the text corresponding to the target feature parameter with the highest matching degree to the second feature parameter, and for outputting the text;
a second determination unit for determining the accuracy of the text using the matching degrees of the N target feature parameters;
a first marking unit for highlighting the text if the accuracy is below a preset threshold.
Further, the device also comprises:
an obtaining unit for obtaining a voice signal;
a segmentation unit for segmenting the voice signal at a sentence-pause signal to form a speech-signal segment when the duration of the sentence-pause signal in the voice signal exceeds a preset time;
a second marking unit for marking the speech-signal segment with a timestamp, the speech-signal segment being the target voice signal, and for marking the corresponding text section with the timestamp of the speech-signal segment.
Further, the device also comprises:
a second obtaining unit for obtaining the text to be played when a play instruction is received;
a third determination unit for determining the timestamp of the text section containing the text to be played;
a playback unit for playing the speech-signal segment corresponding to the timestamp.
An embodiment of the invention also provides a system for converting speech into text, comprising a terminal and a cloud server connected to the terminal;
the terminal is used to collect a voice signal and send the collected voice signal to the cloud server;
the cloud server comprises:
a receiving unit for receiving the voice signal sent by the terminal;
an extraction unit for extracting a first feature parameter of the voice signal;
a matching unit for matching a second feature parameter with the third feature parameters in a speech database to determine N target feature parameters, the N target feature parameters being the N third feature parameters with the highest matching degree to the second feature parameter, N ≥ 2, the second feature parameter being a part of the first feature parameter;
a first determination unit for determining the text corresponding to the target feature parameter with the highest matching degree to the second feature parameter, and for outputting the text;
a second determination unit for determining the accuracy of the text using the matching degrees of the N target feature parameters;
a marking unit for highlighting the text when the accuracy is below a preset threshold.
The method and device provided by the embodiments of the invention let proofreaders easily locate low-accuracy text and judge its correctness, making checking more convenient while also improving checking efficiency and guaranteeing the accuracy of the converted text.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the disclosure.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed by the embodiments are briefly described below. It is apparent that a person of ordinary skill in the art could obtain other drawings from these drawings without creative effort.
Fig. 1 is a flow chart of a method for converting speech into text provided by an embodiment of the invention;
Fig. 2 is a structural block diagram of a device for converting speech into text provided by an embodiment of the invention;
Fig. 3 is a structural block diagram of a system for converting speech into text provided by an embodiment of the invention.
Detailed description
To make the above objectives, features, and advantages of the invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
The embodiments of the invention can be applied on a terminal, such as a mobile phone, computer, or tablet. One implementation is real-time text conversion: while the voice signal output by a speaker is collected on the spot, it is converted into text and saved. This implementation uses a terminal with a voice-signal collection function, such as a terminal with a microphone. Another implementation is non-real-time conversion: the voice signal output by the speaker is recorded in advance with a device that has a recording function, the complete recorded voice signal is then sent to the terminal, and the terminal converts the received voice signal into text.
The embodiments of the invention can also be applied to a terminal and a cloud server connected to the terminal. The terminal sends the voice signal to be converted to the cloud server; the cloud server converts the received voice signal into text and sends the converted text to the terminal. The voice signal to be converted can be recorded by the terminal itself in real time, or recorded by another recording device and then sent to the terminal.
Referring to Fig. 1, an embodiment of the invention provides a method for converting speech into text, which can be applied on a terminal. As shown in Fig. 1, the method may comprise the following steps.
Step 11: obtain a voice signal.
The voice signal is the voice signal to be converted into text. The conversion can be real-time or non-real-time. The voice signal can be recorded by the terminal itself or by another device.
Step 12: if the duration of a sentence-pause signal in the voice signal exceeds a preset time, segment the voice signal at the sentence-pause signal to form a speech-signal segment.
Step 13: mark the speech-signal segment with a timestamp; the speech-signal segment is the target voice signal.
While or after obtaining the voice signal, if the duration of a sentence-pause signal in the voice signal exceeds the preset time, the signal is segmented at the sentence pause to form a speech-signal segment, and the segment is then marked with a timestamp. After segmentation, the timestamp at each cut point can be calculated to determine the start and end times of each segment. In a concrete implementation, the start time of the first segment can be set to 0 seconds. A segment can be marked with one timestamp, either its start time or its end time, or with two timestamps, its start time and its end time.
For example, in real-time conversion, the terminal monitors the voice signal while collecting it; if the duration of a sentence-pause signal exceeds the preset time, the terminal cuts at the sentence pause, marks the resulting segment with a timestamp, and converts it into text. In non-real-time conversion, the whole voice signal can first be segmented according to the durations of its sentence-pause signals, each segment marked with its timestamp, and the segments then converted into text.
The duration of a sentence-pause signal can be determined from the waveform of the voice signal; details are omitted here.
This embodiment does not limit how the voice signal is segmented; for example, it could also be segmented at fixed time intervals.
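The pause-based segmentation of steps 12–13 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a mono NumPy signal and treats a frame whose mean energy falls below a fixed threshold as a "sentence-pause signal"; the 0.5 s preset time, 20 ms frame size, and energy threshold are all illustrative values.

```python
import numpy as np

def segment_by_pauses(signal, rate, pause_sec=0.5, energy_thresh=0.01, frame_ms=20):
    """Split a mono signal into (start_sec, end_sec) segments wherever the
    frame energy stays below energy_thresh for at least pause_sec seconds."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    silent = energy < energy_thresh
    min_pause = int(pause_sec * 1000 / frame_ms)   # pause length in frames

    segments, start, run = [], 0, 0
    for i, is_silent in enumerate(silent):
        run = run + 1 if is_silent else 0
        if run == min_pause:                       # pause long enough: cut here
            end = (i - min_pause + 1) * frame_len
            if end > start:
                segments.append((start / rate, end / rate))
            start = None                           # wait for the next voiced frame
        elif not is_silent and start is None:
            start = i * frame_len                  # next segment begins
    if start is not None and start < n_frames * frame_len:
        segments.append((start / rate, n_frames * frame_len / rate))
    return segments

# 1 s of tone, 0.6 s of silence, 1 s of tone, at 8 kHz
rate = 8000
t = np.arange(rate) / rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
sig = np.concatenate([tone, np.zeros(int(0.6 * rate)), tone])
segs = segment_by_pauses(sig, rate)   # two segments separated by the pause
```

The timestamps returned here are exactly the start/end times the description says each segment can be marked with.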
Step 14: extract a first feature parameter of the target voice signal.
In a concrete implementation, the signal can be divided into individual monosyllabic elements according to its energy values, and the feature parameters of each monosyllabic element extracted, yielding the first feature parameter of the target voice signal.
If the voice signal is an analog signal, it can first be converted into a digital signal before feature extraction, yielding the first feature parameter corresponding to the voice signal.
In this embodiment, feature extraction can be performed per speech-signal segment after segmentation.
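A minimal sketch of per-frame feature extraction for step 14, under stated assumptions: the patent does not say which features are used, so this stand-in takes the log magnitudes of the first few FFT bins of each windowed frame; a production system would typically use MFCCs or similar. The frame/hop sizes and bin count are illustrative.

```python
import numpy as np

def extract_features(signal, rate, frame_ms=25, hop_ms=10, n_bins=13):
    """Return one feature vector per frame: log magnitudes of the
    first n_bins real-FFT bins of a Hann-windowed frame."""
    frame_len = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    window = np.hanning(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        feats.append(np.log(spectrum[:n_bins] + 1e-10))  # avoid log(0)
    return np.array(feats)

rate = 8000
t = np.arange(rate) / rate
feats = extract_features(np.sin(2 * np.pi * 200 * t), rate)  # 1 s test tone
```

The resulting matrix (frames × features) plays the role of the "first feature parameter" of a segment; slices of it would serve as the "second feature parameters" matched in the next step.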
Step 15: match a second feature parameter with the third feature parameters in the speech database to determine N target feature parameters, the N target feature parameters being the N third feature parameters with the highest matching degree to the second feature parameter, N ≥ 2, the second feature parameter being a part of the first feature parameter.
Note that unless otherwise specified, "matching degree" below refers to the matching degree between a feature parameter among the third feature parameters and the second feature parameter.
When matching the first feature parameter of a speech-signal segment against the feature parameters in the speech database, parts of the first feature parameter (second feature parameters) can be selected and matched one after another, so that the corresponding texts are matched one at a time. A second feature parameter can be the feature parameter of a single monosyllabic element, or the feature parameter corresponding to a combination of two or four monosyllabic elements.
Step 16: determine the text corresponding to the target feature parameter with the highest matching degree to the second feature parameter, and output the text.
The speech database stores the feature parameters of the voice signals corresponding to single characters, two-character words, and four-character idioms, i.e., the third feature parameters. Normally, the text corresponding to the third feature parameter with the highest matching degree to the second feature parameter is taken as the text corresponding to the voice signal to be converted, i.e., the text to output.
In this embodiment, it is necessary to determine not only the third feature parameter with the highest matching degree to the second feature parameter, but also at least one further target feature parameter with the next-highest matching degrees.
For example, the second feature parameter corresponding to the voice signal for "hello" is matched against the third feature parameters in the speech database, and the three third feature parameters with the highest matching degrees are found: those corresponding to "hello", "Ni Hao", and "Ning Hao", with matching degrees to the second feature parameter of 60%, 20%, and 5% respectively. The third feature parameter corresponding to "hello" has the highest matching degree, so "hello" is determined as the text corresponding to the voice signal. The determined text can be output and saved in text format.
Note that this embodiment does not limit the value of N, i.e., the number of target feature parameters; it could be two, three, four, etc.
The matching degree between the first feature parameter and a third feature parameter can be determined with methods such as the dynamic time warping (DTW) algorithm, the hidden Markov model (HMM) algorithm, the vector quantization (VQ) algorithm, or neural networks; the details are omitted here.
Step 17, the accuracy rate of the text is determined using the matching degree of N number of target signature parameter.
In this step, it can first determine the sum of corresponding matching degree of N number of target signature parameter, then determine the text The corresponding matching degree of word accounts for the specific gravity of the sum of described matching degree, and the specific gravity is the accuracy rate of the text.
For example, three target signature parameters are obtained by step 13, the matching degree difference between second feature parameter It is 60%, 20% and 5%, the sum of these three corresponding matching degrees of target signature parameter are 85% (60%+20%+5%= 85%) the corresponding text of target signature parameter that, matching degree is 60% is text, and the corresponding matching degree of text accounts for the matching The ratio that the specific gravity of the sum of degree is 60% and 85%, i.e., 70.59%.So the accuracy rate of the text is 70.59%.
In the concrete realization, the sum of maximum matching degree and the second largest matching degree only can also be accounted for by calculating maximum matching degree Specific gravity, determine the accuracy rate of text.That is, in the example above, ratio that the accuracy rate of text is 60% and 80% Value, i.e., 75%.
Difference between maximum matching degree matching degree corresponding with other each target signature parameters can characterize maximum With the accuracy rate for spending corresponding text, the bigger accuracy rate of difference is relatively high, so in the concrete realization, the accuracy rate of text Can also by calculate the difference between maximum matching degree matching degree corresponding with other each target signature parameters carry out it is true It is fixed.For example, determining accuracy rate according to the difference between maximum matching degree and the second largest matching degree.Target signature parameter is corresponding Matching degree is respectively 60% and 20%, and the difference between maximum matching degree and the second largest matching degree is 40% (60%-20%= 40%), can be according to 40% difference and difference accuracy rate corresponding with the calculating of the mapping relations of accuracy rate, or it can also To be determined as accuracy rate for 40%.It should be noted that the present embodiment does not carry out the circular of the accuracy rate of text Limitation.
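The two proportion-based variants of step 17 reduce to a few lines; the function names are illustrative, but the arithmetic follows the worked example (60%, 20%, 5%).

```python
def text_accuracy(matching_degrees):
    """Accuracy of the best match as its share of the sum of all N
    matching degrees (the first variant described above)."""
    best = max(matching_degrees)
    return best / sum(matching_degrees)

def text_accuracy_top2(matching_degrees):
    """Variant: the best match's share of (best + second-best) only."""
    top = sorted(matching_degrees, reverse=True)[:2]
    return top[0] / (top[0] + top[1])

acc = text_accuracy([0.60, 0.20, 0.05])        # 0.60 / 0.85 ≈ 0.7059
acc2 = text_accuracy_top2([0.60, 0.20, 0.05])  # 0.60 / 0.80 = 0.75
```

These reproduce the 70.59% and 75% figures from the example.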
Step 18: highlight the text if the accuracy is below the preset threshold.
In this embodiment, text whose accuracy is below the preset threshold can be marked with a color, so that when the text is displayed, low-accuracy text is highlighted, which makes checking convenient for proofreaders. Different colors can also be used for different accuracy grades, each grade corresponding to a different preset range. For example, accuracies can be divided into three grades: text with an accuracy of at least 80% gets no color marking, text with an accuracy between 60% and 80% is marked yellow, and text with an accuracy below 60% is marked red.
When highlighting text, the preset range the text's accuracy falls into can first be determined, and the color identifier then determined from that range. When the text is displayed, the color identifier can indicate the color of the text's background or of the text itself.
For example, for a speech-signal segment of a greeting such as "Leaders and guests, good morning, everyone!", the converted text format might read "Leaders and [B: yellow]beer on draft[E: yellow], good morning, everyone!", where [B: yellow] marks where the yellow marking begins and [E: yellow] where it ends; when the text is displayed, the reader sees the mis-recognized words "beer on draft" marked in yellow.
Note that this embodiment does not limit the way text is highlighted; for example, underlining, italics, or bold can also be used as highlighting.
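The grade-based marking can be sketched directly from the description. The 80%/60% thresholds and the inline [B: color] … [E: color] convention come from the text above; the function itself is a hypothetical helper.

```python
def highlight(text, accuracy):
    """Wrap text in inline color markers by accuracy grade, mirroring
    the [B: color] ... [E: color] convention of the worked example."""
    if accuracy >= 0.8:
        return text                      # high accuracy: no marking
    color = "yellow" if accuracy >= 0.6 else "red"
    return f"[B: {color}]{text}[E: {color}]"

marked = highlight("beer on draft", 0.65)
```

A renderer would then turn each marked span into a colored background or font when displaying the text.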
All text corresponding to the target voice signal can be matched through the above steps; when the target voice signal is a speech-signal segment, the text corresponding to each speech-signal segment forms a text section.
Step 19: mark the text section corresponding to the speech-signal segment with the timestamp of the speech-signal segment.
For example, speech-signal segments and their timestamps might be: "[00:00] At the three major basic telecom companies, XXX learned in detail how the enterprises strengthen information-infrastructure construction, carry out technological innovation and application in depth, and implement speed-up and fee-reduction measures to serve enterprises and consumers. [00:28] He encouraged the enterprises to aim at the trends of the scientific and technological revolution and industrial transformation, strive to break through more core technologies, seize the commanding heights of international competition, and promote broader and deeper integrated applications."
After the speech-signal segments are converted into text, the first text section can be marked with the timestamp of the 0th second and the second text section with the timestamp of the 28th second.
A text section can have one timestamp, marked at its head or tail; the time it indicates can be the start or end time of the section's corresponding speech-signal segment. A text section can also have two timestamps, marked at its head and tail, indicating the start time and end time of the corresponding speech-signal segment respectively.
Note that when the voice signal is segmented at fixed time intervals and the segments are timestamped, if those timestamps are also used for the corresponding text sections, a text section may not be semantically complete. So after a speech-signal segment is converted into text, the text can be re-segmented according to its semantics and each resulting text section marked with a timestamp, determined from the start and end times of the corresponding speech-signal segment. The voice signal can also be re-segmented to match the re-segmented text sections, and the speech-signal segments marked with the corresponding timestamps.
In non-real-time conversion, the target voice signal may be the complete voice signal; after it is converted into text, both the text and the voice signal can be segmented according to the semantics of the text, and each resulting text section and speech-signal segment marked with a timestamp.
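Step 19's timestamp marking amounts to prefixing each text section with its segment's start time in the [mm:ss] format of the example transcript; the helpers below are illustrative.

```python
def stamp(seconds):
    """Format a start time as the [mm:ss] tag used in the example."""
    m, s = divmod(int(seconds), 60)
    return f"[{m:02d}:{s:02d}]"

def stamp_sections(sections):
    """sections: list of (start_sec, text) pairs, one per converted
    text section. Prefix each with its segment's start timestamp."""
    return [f"{stamp(start)} {text}" for start, text in sections]

lines = stamp_sections([(0, "first text section"), (28, "second text section")])
```

Storing the (start, end) pair alongside each section also supports the two-timestamp variant described above.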
Steps 11 to 19 above can be executed not only by the terminal but also by the cloud server. In real-time text conversion, the terminal collects the voice signal in real time and sends it to the cloud server; the cloud server converts the voice signal into text and sends the text back to the terminal, achieving real-time conversion.
In another embodiment, while collecting the voice signal, the terminal can segment it according to the durations of sentence-pause signals, mark the speech-signal segments with timestamps, and send the segments to the cloud server; the cloud server converts each segment into a text section, highlights the relevant text, and sends the converted text sections back to the terminal, which marks each text section with the timestamp of its speech-signal segment. In this embodiment, the timestamps of the speech-signal segments can also be marked by the cloud server.
The converted text forms a readable and editable document. When checking the text, proofreaders can easily locate low-accuracy text and judge its correctness, which makes checking convenient while also improving checking efficiency and guaranteeing the accuracy of the converted text.
When judging whether a conversion is correct, the speech-signal segment corresponding to the text can be played so the judgment can be made from context. A concrete implementation: when a play instruction is detected, obtain the text to be played, determine the timestamp of the text section containing it, and play the speech-signal segment corresponding to that timestamp. The text to be played can be the selected text.
In a concrete implementation, a proofreader can select the text to be checked and click a play button; after the click, the system detects the play instruction and plays the speech-signal segment corresponding to the timestamp of the text section containing the text, so the proofreader can judge from context whether the text is correct.
Note that this embodiment does not limit how the play button is displayed; it can be shown after the text is opened, after text is selected, or after text is selected and the mouse is right-clicked.
The timestamp marking a text section can be identical to the timestamp of the section's corresponding speech-signal segment, so after the text to be played is determined, the corresponding speech-signal segment can be found from the text's timestamp and played. This makes use more convenient; using the method provided by this embodiment for text checking can improve checking efficiency.
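The playback lookup — selected text to text section to speech-signal segment — is a simple search over the timestamped sections. This sketch assumes each section carries its segment's (start, end) times; the actual audio playback call is outside its scope.

```python
def find_segment(sections, selected_text):
    """sections: list of (start_sec, end_sec, text) triples.
    Return the (start, end) of the speech-signal segment whose text
    section contains the selection, so it can be replayed; None if
    the selection is not found."""
    for start, end, text in sections:
        if selected_text in text:
            return (start, end)
    return None

sections = [(0.0, 28.0, "visited three telecom companies"),
            (28.0, 55.0, "encouraged core technology breakthroughs")]
span = find_segment(sections, "core technology")
```

A player would then seek to span[0] and stop at span[1] in the recorded signal, letting the proofreader hear the context of the selected text.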
It referring to fig. 2, is a kind of structural block diagram for the device for converting speech into text provided in an embodiment of the present invention, the dress Setting can be only fitted in terminal or for terminal itself.As shown in Fig. 2, the device can specifically include acquiring unit 21, punctuate is single Member 22, the second marking unit 23, extraction unit 24, matching unit 25, the first determination unit 26, the second determination unit 27, first Marking unit 28.
The acquiring unit 21 is configured to acquire a speech signal.
The sentence segmentation unit 22 is configured to segment the speech signal at a sentence pause signal when the duration of the sentence pause signal in the speech signal exceeds a preset time, forming a speech signal segment.
The second marking unit 23 is configured to mark a timestamp on the speech signal segment; the speech signal segment is the target speech signal.
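The pause-based segmentation performed by the sentence segmentation unit can be sketched roughly as follows. The amplitude-threshold silence test, the parameter names, and the sample-list representation are all assumptions of this sketch, not details given by the embodiment:

```python
# Hypothetical sketch: split a sampled signal wherever a pause lasts longer
# than a preset time, and record a start timestamp for each segment.
def segment_speech(samples, sample_rate, pause_threshold=0.01, preset_time=0.5):
    """samples: list of float amplitudes; returns (start_time, segment) pairs."""
    min_pause = int(preset_time * sample_rate)   # pause length in samples
    segments, start, silent_run = [], 0, 0
    for i, s in enumerate(samples):
        if abs(s) < pause_threshold:             # low amplitude counts as pause
            silent_run += 1
            if silent_run == min_pause:          # pause exceeded the preset time
                end = i - min_pause + 1          # cut before the pause began
                if end > start:
                    segments.append((start / sample_rate, samples[start:end]))
                start = None
        else:
            if start is None:
                start = i                        # a new segment begins here
            silent_run = 0
    if start is not None and start < len(samples):
        segments.append((start / sample_rate, samples[start:]))
    return segments
```

Each returned start time plays the role of the timestamp marked by the second marking unit 23, tying the segment to the text section later transcribed from it.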
The extraction unit 24 is configured to extract first feature parameters of the target speech signal.
The matching unit 25 is configured to match second feature parameters against third feature parameters in a speech database and determine N target feature parameters; the N target feature parameters are the N parameters among the third feature parameters with the highest matching degree to the second feature parameters, N >= 2, and the second feature parameters are a part of the first feature parameters.
The first determination unit 26 is configured to determine the text corresponding to the target feature parameter with the highest matching degree to the second feature parameters, and output the text.
The second determination unit 27 is configured to determine the accuracy rate of the text using the matching degrees of the N target feature parameters.
The second determination unit 27 is specifically configured to determine the sum of the matching degrees of the N target feature parameters, and to determine the proportion of the matching degree corresponding to the text in that sum; the proportion is the accuracy rate of the text.
The first marking unit 28 is configured to highlight the text if the accuracy rate is lower than a preset threshold.
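A minimal sketch of the matching and accuracy computation carried out by units 25 through 28, under the assumption that the matching degree is a cosine similarity between feature vectors (the embodiment does not specify the measure; all names here are illustrative): the N best-matching database entries are selected, the accuracy rate of the output text is the share of its matching degree in the sum of the N matching degrees, and text below the preset threshold is flagged for highlighting.

```python
import math

def matching_degree(a, b):
    # Assumed matching measure: cosine similarity between feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recognize(second_feature, database, n=2, threshold=0.5):
    """database: list of (third_feature, text). Returns (text, accuracy, highlight)."""
    scored = sorted(((matching_degree(second_feature, feat), text)
                     for feat, text in database), reverse=True)
    top_n = scored[:n]                        # the N target feature parameters
    best_degree, best_text = top_n[0]
    total = sum(d for d, _ in top_n)          # sum of the N matching degrees
    accuracy = best_degree / total if total else 0.0
    return best_text, accuracy, accuracy < threshold  # highlight if below threshold
```

The intuition is that when several candidates match almost equally well, the best candidate's share of the total is small, so the text is ambiguous and gets highlighted for the verifier.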
The second marking unit 23 is further configured to mark, using the timestamp of the speech signal segment, the text section corresponding to the speech signal segment.
The device may further include a second acquiring unit, a third determination unit, and a playback unit.
The second acquiring unit is configured to obtain the text to be played when a play instruction is received.
The third determination unit is configured to determine the timestamp corresponding to the text section containing the text to be played.
The playback unit is configured to play the speech signal segment corresponding to the timestamp.
The device provided by the embodiment of the present invention enables verification personnel to easily find text with a lower accuracy rate, judge whether it is erroneous, and correct it. Besides facilitating verification, it can also improve verification efficiency and ensure the accuracy of the converted text.
Referring to Fig. 3, which is a structural block diagram of a system for converting speech into text provided by an embodiment of the present invention. The system may include a terminal 31 and a cloud server 32 connected to the terminal 31.
The terminal 31 is configured to acquire a speech signal and send the acquired speech signal to the cloud server 32.
The cloud server 32 is configured to convert the speech signal into text and send the text to the terminal 31. The cloud server 32 may specifically include the following units.
The receiving unit 321 is configured to receive the speech signal sent by the terminal.
The sentence segmentation unit 322 is configured to segment the speech signal at a sentence pause signal when the duration of the sentence pause signal in the speech signal exceeds a preset time, forming a speech signal segment.
The second marking unit 323 is configured to mark a timestamp on the speech signal segment.
The extraction unit 324 is configured to extract first feature parameters of the speech signal.
The extraction unit 325 is specifically configured to extract first feature parameters of the speech signal segment.
The matching unit 326 is configured to match second feature parameters against third feature parameters in a speech database and determine N target feature parameters; the N target feature parameters are the N parameters among the third feature parameters with the highest matching degree to the first feature parameters, N >= 2, and the second feature parameters are a part of the first feature parameters.
The first determination unit 327 is configured to determine the text corresponding to the target feature parameter with the highest matching degree to the second feature parameters, and output the text.
The second determination unit 328 is configured to determine the accuracy rate of the text using the matching degrees of the N target feature parameters.
The marking unit 329 is configured to highlight the text when the accuracy rate is lower than a preset threshold.
The second marking unit 323 is further configured to mark, using the timestamp of the speech signal segment, the text section corresponding to the speech signal segment.
The transmission unit 330 is configured to send the matched text to the terminal 31; the text includes the text corresponding to the speech signal.
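The split of responsibilities between terminal 31 and cloud server 32 can be sketched as a simple request/response exchange; the transport, the class names, and the recognizer callback below are assumptions of this sketch rather than details of the embodiment:

```python
# Hypothetical sketch of the terminal/cloud split: the terminal only acquires
# audio and displays the returned text; all recognition work happens on the server.
class CloudServer:
    def __init__(self, recognizer):
        self.recognizer = recognizer   # callable converting a speech signal to text

    def handle(self, speech_signal):
        # Receiving unit -> recognition pipeline -> transmission unit.
        return self.recognizer(speech_signal)

class Terminal:
    def __init__(self, server):
        self.server = server

    def transcribe(self, speech_signal):
        # Acquire the signal locally, send it to the cloud, display the reply.
        return self.server.handle(speech_signal)
```

In a real deployment the `handle` call would cross the network (for example over HTTP), but the unit boundaries mirror the ones listed above: acquisition on the terminal; receiving, segmentation, extraction, matching, accuracy determination, marking, and transmission on the server.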
The system provided by the embodiment of the present invention enables verification personnel to easily find text with a lower accuracy rate, judge whether it is erroneous, and correct it. Besides facilitating verification, it can also improve verification efficiency and ensure the accuracy of the converted text.
As the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, reference may be made to the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a device, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operation steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once aware of the basic inventive concept, can make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.
Finally, it should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between such entities or operations. Moreover, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The method, device, and system for converting speech into text provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the description of the above embodiments is merely intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for converting speech into text, characterized by comprising:
extracting first feature parameters of a target speech signal;
matching second feature parameters against third feature parameters in a speech database to determine N target feature parameters, wherein the N target feature parameters are the N parameters among the third feature parameters with the highest matching degree to the second feature parameters, N >= 2, and the second feature parameters are a part of the first feature parameters;
determining the text corresponding to the target feature parameter with the highest matching degree to the second feature parameters, and outputting the text;
determining an accuracy rate of the text using the matching degrees of the N target feature parameters;
highlighting the text if the accuracy rate is lower than a preset threshold.
2. The method according to claim 1, characterized in that determining the accuracy rate of the text using the matching degrees of the N target feature parameters comprises:
determining the sum of the matching degrees of the N target feature parameters;
determining the proportion of the matching degree corresponding to the text in the sum of the matching degrees, the proportion being the accuracy rate of the text.
3. The method according to claim 1, characterized in that highlighting the text if the accuracy rate is lower than the preset threshold comprises:
marking the text with color if the accuracy rate is lower than the preset threshold.
4. The method according to claim 1, characterized in that the method further comprises:
acquiring a speech signal;
if the duration of a sentence pause signal in the speech signal exceeds a preset time, segmenting the speech signal at the sentence pause signal to form a speech signal segment;
marking a timestamp on the speech signal segment, the speech signal segment being the target speech signal.
5. The method according to claim 4, characterized in that the method further comprises:
marking, using the timestamp of the speech signal segment, the text section corresponding to the speech signal segment.
6. The method according to claim 4 or 5, characterized in that the method further comprises:
obtaining text to be played when a play instruction is detected;
determining the timestamp corresponding to the text section containing the text to be played;
playing the speech signal segment corresponding to the timestamp.
7. A device for converting speech into text, characterized by comprising:
an extraction unit, configured to extract first feature parameters of a target speech signal;
a matching unit, configured to match second feature parameters against third feature parameters in a speech database to determine N target feature parameters, wherein the N target feature parameters are the N parameters among the third feature parameters with the highest matching degree to the second feature parameters, N >= 2, and the second feature parameters are a part of the first feature parameters;
a first determination unit, configured to determine the text corresponding to the target feature parameter with the highest matching degree to the second feature parameters, and output the text;
a second determination unit, configured to determine an accuracy rate of the text using the matching degrees of the N target feature parameters;
a first marking unit, configured to highlight the text if the accuracy rate is lower than a preset threshold.
8. The device according to claim 7, characterized by further comprising:
an acquiring unit, configured to acquire a speech signal;
a sentence segmentation unit, configured to segment the speech signal at a sentence pause signal when the duration of the sentence pause signal in the speech signal exceeds a preset time, forming a speech signal segment;
a second marking unit, configured to mark a timestamp on the speech signal segment, the speech signal segment being the target speech signal, and to mark, using the timestamp of the speech signal segment, the text section corresponding to the speech signal segment.
9. The device according to claim 8, characterized by further comprising:
a second acquiring unit, configured to obtain text to be played when a play instruction is received;
a third determination unit, configured to determine the timestamp corresponding to the text section containing the text to be played;
a playback unit, configured to play the speech signal segment corresponding to the timestamp.
10. A system for converting speech into text, characterized by comprising: a terminal and a cloud server connected to the terminal;
the terminal is configured to acquire a speech signal and send the acquired speech signal to the cloud server;
the cloud server comprises:
a receiving unit, configured to receive the speech signal sent by the terminal;
an extraction unit, configured to extract first feature parameters of the speech signal;
a matching unit, configured to match second feature parameters against third feature parameters in a speech database to determine N target feature parameters, wherein the N target feature parameters are the N parameters among the third feature parameters with the highest matching degree to the second feature parameters, N >= 2, and the second feature parameters are a part of the first feature parameters;
a first determination unit, configured to determine the text corresponding to the target feature parameter with the highest matching degree to the second feature parameters, and output the text;
a second determination unit, configured to determine an accuracy rate of the text using the matching degrees of the N target feature parameters;
a marking unit, configured to highlight the text when the accuracy rate is lower than a preset threshold.
CN201711386363.3A 2017-12-20 2017-12-20 A kind of method, apparatus and system converting speech into text Pending CN109949813A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201711386363.3A CN109949813A (en) 2017-12-20 2017-12-20 A kind of method, apparatus and system converting speech into text
PCT/CN2018/122344 WO2019120248A1 (en) 2017-12-20 2018-12-20 Method for converting speech into characters, device, and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711386363.3A CN109949813A (en) 2017-12-20 2017-12-20 A kind of method, apparatus and system converting speech into text

Publications (1)

Publication Number Publication Date
CN109949813A true CN109949813A (en) 2019-06-28

Family

ID=66992504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711386363.3A Pending CN109949813A (en) 2017-12-20 2017-12-20 A kind of method, apparatus and system converting speech into text

Country Status (2)

Country Link
CN (1) CN109949813A (en)
WO (1) WO2019120248A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036119A (en) * 2020-10-16 2020-12-04 深圳市欢太科技有限公司 Text display method and device and computer readable storage medium
CN114079695A (en) * 2020-08-18 2022-02-22 北京有限元科技有限公司 Method, device and storage medium for recording voice call content

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1341255A (en) * 1999-02-19 2002-03-20 Custom Speech USA, Inc. Automated transcription system and method using two speech converting instances and computer-assisted correction
CN101127210A (en) * 2007-09-20 2008-02-20 UTStarcom Telecom Co., Ltd. Method and device for implementing lyric synchronization when broadcasting song
CN101287029A (en) * 2007-04-13 2008-10-15 Huawei Technologies Co., Ltd. Method and apparatus for automatically respond to detection
CN101290766A (en) * 2007-04-20 2008-10-22 Northwest Minzu University A Method for Segmentation of Amdo Tibetan Speech and Syllables
US20090048832A1 (en) * 2005-11-08 2009-02-19 Nec Corporation Speech-to-text system, speech-to-text method, and speech-to-text program
CN102122506A (en) * 2011-03-08 2011-07-13 TVMining (Beijing) Media Technology Co., Ltd. Method for recognizing voice
CN102543063A (en) * 2011-12-07 2012-07-04 South China University of Technology Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
CN103165131A (en) * 2011-12-17 2013-06-19 Fu Tai Hua Industry (Shenzhen) Co., Ltd. Voice processing system and voice processing method
CN103943109A (en) * 2014-04-28 2014-07-23 Shenzhen Ruguo Technology Co., Ltd. Method and device for converting voice to characters
CN104516520A (en) * 2013-09-28 2015-04-15 Nanjing Zhuanchuang Intellectual Property Service Co., Ltd. Character input method based on voice recognition technology
CN105047198A (en) * 2015-08-24 2015-11-11 Baidu Online Network Technology (Beijing) Co., Ltd. Voice error correction processing method and apparatus
KR101590724B1 (en) * 2014-10-06 2016-02-02 POSTECH Academy-Industry Foundation Method for modifying error of speech recognition and apparatus for performing the method
CN105869634A (en) * 2016-03-31 2016-08-17 Chongqing University Field-based method and system for feeding back text error correction after speech recognition
WO2017125752A1 (en) * 2016-01-22 2017-07-27 Oxford Learning Solutions Limited Computer-implemented phoneme-grapheme matching
CN107068144A (en) * 2016-01-08 2017-08-18 Wang Daoping Method for facilitating manual correction of text in speech recognition
CN107123042A (en) * 2017-04-26 2017-09-01 Shandong Inspur Business System Co., Ltd. Intelligent voice tax-handling method, device and system
US20170270086A1 (en) * 2016-03-16 2017-09-21 Kabushiki Kaisha Toshiba Apparatus, method, and computer program product for correcting speech recognition error

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578465B (en) * 2013-10-18 2016-08-17 威盛电子股份有限公司 Speech recognition method and electronic device
CN105931641B (en) * 2016-05-25 2020-11-10 腾讯科技(深圳)有限公司 Subtitle data generation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Ru et al.: "Digital Content Security" (《数字内容安全》), Beijing University of Posts and Telecommunications Press, 30 September 2017 *


Also Published As

Publication number Publication date
WO2019120248A1 (en) 2019-06-27


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 100101 Room 101, 1st floor, block C, building 21, 2 Wanhong West Street, xibajianfang, dongzhimenwai, Chaoyang District, Beijing

Applicant after: Beijing Junlin Technology Co.,Ltd.

Address before: 100107 commercial building 03, floor 3, block C, tianlangyuan, Chaoyang District, Beijing (No. 1336, Fengshou incubator)

Applicant before: BEIJING JUNLIN TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210308

Address after: 215163 Room 201, building 17, No.158, Jinfeng Road, science and Technology City, Suzhou, Jiangsu Province

Applicant after: Suzhou Junlin Intelligent Technology Co.,Ltd.

Address before: 100101 Room 101, 1st floor, block C, building 21, 2 Wanhong West Street, xibajianfang, dongzhimenwai, Chaoyang District, Beijing

Applicant before: Beijing Junlin Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20190628