
CN109949813A - Method, apparatus, and system for converting speech into text - Google Patents


Info

Publication number
CN109949813A
CN109949813A (application CN201711386363.3A)
Authority
CN
China
Prior art keywords
text
feature parameter
parameter
voice signal
matching degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711386363.3A
Other languages
Chinese (zh)
Inventor
王群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Junlin Intelligent Technology Co.,Ltd.
Original Assignee
Beijing Junlin Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Junlin Polytron Technologies Inc filed Critical Beijing Junlin Polytron Technologies Inc
Priority to CN201711386363.3A priority Critical patent/CN109949813A/en
Priority to PCT/CN2018/122344 priority patent/WO2019120248A1/en
Publication of CN109949813A publication Critical patent/CN109949813A/en
Pending legal-status Critical Current

Links

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/26 — Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of the invention disclose a method and device for converting speech into text. The method comprises: extracting a first feature parameter of a target voice signal; matching a second feature parameter with the third feature parameters in a speech database to determine N target feature parameters, N ≥ 2, the N target feature parameters being the N third feature parameters with the highest matching degree to the second feature parameter, and the second feature parameter being a part of the first feature parameter; determining the text corresponding to the target feature parameter with the highest matching degree to the second feature parameter, and outputting the text; determining the accuracy of the text using the matching degrees of the N target feature parameters; and highlighting the text if the accuracy is below a preset threshold. The embodiments of the invention let proofreaders easily locate low-accuracy text and judge its correctness, making checking more convenient while also improving checking efficiency and guaranteeing the accuracy of the text.

Description

Method, apparatus, and system for converting speech into text
Technical field
The embodiments of the present invention relate to the technical field of speech recognition, and in particular to a method, apparatus, and system for converting speech into text.
Background art
With the development of intelligent speech-to-text conversion technology, the efficiency of converting speech into text has improved significantly. The technology can be applied to meeting minutes, training records, or interview records. When converting a voice signal into text, the feature parameters of the voice signal are first extracted and then matched against the feature parameters corresponding to text in a speech database, so as to obtain and output the text with the highest matching degree. For standard Mandarin speech in a quiet environment, the conversion accuracy is high. In real scenarios, however, speakers inevitably have some accent, and recording in a quiet environment cannot be guaranteed, so the accuracy of the speech-to-text conversion cannot be guaranteed either.
Text obtained with existing intelligent speech-to-text conversion technology cannot be guaranteed to be 100% accurate, so the converted text needs to be checked manually. A common checking method is for proofreaders to read the entire text to find conversion errors, but this is time-consuming. Moreover, manual checking is error-prone; some mistakes are easily missed, leading to a higher error rate in the text.
Summary of the invention
The embodiments of the invention provide a method and terminal for converting speech into text, which can improve text-checking efficiency and reduce the error rate of the text.
An embodiment of the invention provides a method for converting speech into text, comprising:
extracting a first feature parameter of a target voice signal;
matching a second feature parameter with the third feature parameters in a speech database to determine N target feature parameters, the N target feature parameters being the N third feature parameters with the highest matching degree to the second feature parameter, N ≥ 2, the second feature parameter being a part of the first feature parameter;
determining the text corresponding to the target feature parameter with the highest matching degree to the second feature parameter, and outputting the text;
determining the accuracy of the text using the matching degrees of the N target feature parameters;
highlighting the text if the accuracy is below a preset threshold.
Further, the accuracy rate of the text is determined using the matching degree of N number of target signature parameter, comprising:
Determine the sum of corresponding matching degree of N number of target signature parameter;
Determine that the corresponding matching degree of the text accounts for the specific gravity of the sum of described matching degree, the specific gravity is the standard of the text True rate.
Further, if the accuracy rate is lower than preset threshold, the text is carried out to highlight label, comprising:
If the accuracy rate is lower than preset threshold, color mark is carried out to the text.
Further, the method also includes:
Obtain voice signal;
If the perdurabgility of sentence halted signals is more than preset time in the voice signal, pauses and believe in the sentence Make pauses in reading unpunctuated ancient writings at number to the voice signal, forms speech signal segment;
Timestamp is marked to the speech signal segment, the speech signal segment is targeted voice signal.
Further, the method also includes:
The corresponding text section of speech signal segment described in timestamp label using the speech signal segment.
Further, the method also includes:
When detecting play instruction, text to be played is obtained;
The corresponding timestamp of text section where determining the text to be played;
Play the corresponding speech signal segment of the timestamp.
An embodiment of the invention also provides a device for converting speech into text, comprising:
an extraction unit for extracting a first feature parameter of a target voice signal;
a matching unit for matching a second feature parameter with the third feature parameters in a speech database to determine N target feature parameters, the N target feature parameters being the N third feature parameters with the highest matching degree to the second feature parameter, N ≥ 2, the second feature parameter being a part of the first feature parameter;
a first determination unit for determining the text corresponding to the target feature parameter with the highest matching degree to the second feature parameter, and for outputting the text;
a second determination unit for determining the accuracy of the text using the matching degrees of the N target feature parameters;
a first marking unit for highlighting the text if the accuracy is below a preset threshold.
Further, the device also comprises:
an obtaining unit for obtaining a voice signal;
a segmentation unit for segmenting the voice signal at a sentence-pause signal to form a speech-signal segment when the duration of the sentence-pause signal in the voice signal exceeds a preset time;
a second marking unit for marking the speech-signal segment with a timestamp, the speech-signal segment being the target voice signal, and for marking the corresponding text section with the timestamp of the speech-signal segment.
Further, the device also comprises:
a second obtaining unit for obtaining the text to be played when a play instruction is received;
a third determination unit for determining the timestamp of the text section containing the text to be played;
a playback unit for playing the speech-signal segment corresponding to the timestamp.
An embodiment of the invention also provides a system for converting speech into text, comprising a terminal and a cloud server connected to the terminal;
the terminal is used to collect a voice signal and send the collected voice signal to the cloud server;
the cloud server comprises:
a receiving unit for receiving the voice signal sent by the terminal;
an extraction unit for extracting a first feature parameter of the voice signal;
a matching unit for matching a second feature parameter with the third feature parameters in a speech database to determine N target feature parameters, the N target feature parameters being the N third feature parameters with the highest matching degree to the second feature parameter, N ≥ 2, the second feature parameter being a part of the first feature parameter;
a first determination unit for determining the text corresponding to the target feature parameter with the highest matching degree to the second feature parameter, and for outputting the text;
a second determination unit for determining the accuracy of the text using the matching degrees of the N target feature parameters;
a marking unit for highlighting the text when the accuracy is below a preset threshold.
The method and device provided by the embodiments of the invention let proofreaders easily locate low-accuracy text and judge its correctness, making checking more convenient while also improving checking efficiency and guaranteeing the accuracy of the converted text.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the disclosure.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings needed by the embodiments are briefly described below. It is apparent that a person of ordinary skill in the art could obtain other drawings from these drawings without creative effort.
Fig. 1 is a flow chart of a method for converting speech into text provided by an embodiment of the invention;
Fig. 2 is a structural block diagram of a device for converting speech into text provided by an embodiment of the invention;
Fig. 3 is a structural block diagram of a system for converting speech into text provided by an embodiment of the invention.
Detailed description
To make the above objectives, features, and advantages of the invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
The embodiments of the invention can be applied on a terminal, such as a mobile phone, computer, or tablet. One implementation is real-time text conversion: while the voice signal output by a speaker is collected on the spot, it is converted into text and saved. This implementation uses a terminal with a voice-signal collection function, such as a terminal with a microphone. Another implementation is non-real-time conversion: the voice signal output by the speaker is recorded in advance with a device that has a recording function, the complete recorded voice signal is then sent to the terminal, and the terminal converts the received voice signal into text.
The embodiments of the invention can also be applied to a terminal and a cloud server connected to the terminal. The terminal sends the voice signal to be converted to the cloud server; the cloud server converts the received voice signal into text and sends the converted text to the terminal. The voice signal to be converted can be recorded by the terminal itself in real time, or recorded by another recording device and then sent to the terminal.
Referring to Fig. 1, an embodiment of the invention provides a method for converting speech into text, which can be applied on a terminal. As shown in Fig. 1, the method may comprise the following steps.
Step 11: obtain a voice signal.
The voice signal is the voice signal to be converted into text. The conversion can be real-time or non-real-time. The voice signal can be recorded by the terminal itself or by another device.
Step 12: if the duration of a sentence-pause signal in the voice signal exceeds a preset time, segment the voice signal at the sentence-pause signal to form a speech-signal segment.
Step 13: mark the speech-signal segment with a timestamp; the speech-signal segment is the target voice signal.
While or after obtaining the voice signal, if the duration of a sentence-pause signal in the voice signal exceeds the preset time, the signal is segmented at the sentence pause to form a speech-signal segment, and the segment is then marked with a timestamp. After segmentation, the timestamp at each cut point can be calculated to determine the start and end times of each segment. In a concrete implementation, the start time of the first segment can be set to 0 seconds. A segment can be marked with one timestamp, either its start time or its end time, or with two timestamps, its start time and its end time.
For example, in real-time conversion, the terminal monitors the voice signal while collecting it; if the duration of a sentence-pause signal exceeds the preset time, the terminal cuts at the sentence pause, marks the resulting segment with a timestamp, and converts it into text. In non-real-time conversion, the whole voice signal can first be segmented according to the durations of its sentence-pause signals, each segment marked with its timestamp, and the segments then converted into text.
The duration of a sentence-pause signal can be determined from the waveform of the voice signal; details are omitted here.
This embodiment does not limit how the voice signal is segmented; for example, it could also be segmented at fixed time intervals.
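The pause-based segmentation of steps 12–13 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a mono NumPy signal and treats a frame whose mean energy falls below a fixed threshold as a "sentence-pause signal"; the 0.5 s preset time, 20 ms frame size, and energy threshold are all illustrative values.

```python
import numpy as np

def segment_by_pauses(signal, rate, pause_sec=0.5, energy_thresh=0.01, frame_ms=20):
    """Split a mono signal into (start_sec, end_sec) segments wherever the
    frame energy stays below energy_thresh for at least pause_sec seconds."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    silent = energy < energy_thresh
    min_pause = int(pause_sec * 1000 / frame_ms)   # pause length in frames

    segments, start, run = [], 0, 0
    for i, is_silent in enumerate(silent):
        run = run + 1 if is_silent else 0
        if run == min_pause:                       # pause long enough: cut here
            end = (i - min_pause + 1) * frame_len
            if end > start:
                segments.append((start / rate, end / rate))
            start = None                           # wait for the next voiced frame
        elif not is_silent and start is None:
            start = i * frame_len                  # next segment begins
    if start is not None and start < n_frames * frame_len:
        segments.append((start / rate, n_frames * frame_len / rate))
    return segments

# 1 s of tone, 0.6 s of silence, 1 s of tone, at 8 kHz
rate = 8000
t = np.arange(rate) / rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
sig = np.concatenate([tone, np.zeros(int(0.6 * rate)), tone])
segs = segment_by_pauses(sig, rate)   # two segments separated by the pause
```

The timestamps returned here are exactly the start/end times the description says each segment can be marked with.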
Step 14: extract a first feature parameter of the target voice signal.
In a concrete implementation, the signal can be divided into individual monosyllabic elements according to its energy values, and the feature parameters of each monosyllabic element extracted, yielding the first feature parameter of the target voice signal.
If the voice signal is an analog signal, it can first be converted into a digital signal before feature extraction, yielding the first feature parameter corresponding to the voice signal.
In this embodiment, feature extraction can be performed per speech-signal segment after segmentation.
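A minimal sketch of per-frame feature extraction for step 14, under stated assumptions: the patent does not say which features are used, so this stand-in takes the log magnitudes of the first few FFT bins of each windowed frame; a production system would typically use MFCCs or similar. The frame/hop sizes and bin count are illustrative.

```python
import numpy as np

def extract_features(signal, rate, frame_ms=25, hop_ms=10, n_bins=13):
    """Return one feature vector per frame: log magnitudes of the
    first n_bins real-FFT bins of a Hann-windowed frame."""
    frame_len = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    window = np.hanning(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        feats.append(np.log(spectrum[:n_bins] + 1e-10))  # avoid log(0)
    return np.array(feats)

rate = 8000
t = np.arange(rate) / rate
feats = extract_features(np.sin(2 * np.pi * 200 * t), rate)  # 1 s test tone
```

The resulting matrix (frames × features) plays the role of the "first feature parameter" of a segment; slices of it would serve as the "second feature parameters" matched in the next step.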
Step 15: match a second feature parameter with the third feature parameters in the speech database to determine N target feature parameters, the N target feature parameters being the N third feature parameters with the highest matching degree to the second feature parameter, N ≥ 2, the second feature parameter being a part of the first feature parameter.
Note that unless otherwise specified, "matching degree" below refers to the matching degree between a feature parameter among the third feature parameters and the second feature parameter.
When matching the first feature parameter of a speech-signal segment against the feature parameters in the speech database, parts of the first feature parameter (second feature parameters) can be selected and matched one after another, so that the corresponding texts are matched one at a time. A second feature parameter can be the feature parameter of a single monosyllabic element, or the feature parameter corresponding to a combination of two or four monosyllabic elements.
Step 16: determine the text corresponding to the target feature parameter with the highest matching degree to the second feature parameter, and output the text.
The speech database stores the feature parameters of the voice signals corresponding to single characters, two-character words, and four-character idioms, i.e., the third feature parameters. Normally, the text corresponding to the third feature parameter with the highest matching degree to the second feature parameter is taken as the text corresponding to the voice signal to be converted, i.e., the text to output.
In this embodiment, it is necessary to determine not only the third feature parameter with the highest matching degree to the second feature parameter, but also at least one further target feature parameter with the next-highest matching degrees.
For example, the second feature parameter corresponding to the voice signal for "hello" is matched against the third feature parameters in the speech database, and the three third feature parameters with the highest matching degrees are found: those corresponding to "hello", "Ni Hao", and "Ning Hao", with matching degrees to the second feature parameter of 60%, 20%, and 5% respectively. The third feature parameter corresponding to "hello" has the highest matching degree, so "hello" is determined as the text corresponding to the voice signal. The determined text can be output and saved in text format.
Note that this embodiment does not limit the value of N, i.e., the number of target feature parameters; it could be two, three, four, etc.
The matching degree between the first feature parameter and a third feature parameter can be determined with methods such as the dynamic time warping (DTW) algorithm, the hidden Markov model (HMM) algorithm, the vector quantization (VQ) algorithm, or neural networks; the details are omitted here.
Step 17, the accuracy rate of the text is determined using the matching degree of N number of target signature parameter.
In this step, it can first determine the sum of corresponding matching degree of N number of target signature parameter, then determine the text The corresponding matching degree of word accounts for the specific gravity of the sum of described matching degree, and the specific gravity is the accuracy rate of the text.
For example, three target signature parameters are obtained by step 13, the matching degree difference between second feature parameter It is 60%, 20% and 5%, the sum of these three corresponding matching degrees of target signature parameter are 85% (60%+20%+5%= 85%) the corresponding text of target signature parameter that, matching degree is 60% is text, and the corresponding matching degree of text accounts for the matching The ratio that the specific gravity of the sum of degree is 60% and 85%, i.e., 70.59%.So the accuracy rate of the text is 70.59%.
In the concrete realization, the sum of maximum matching degree and the second largest matching degree only can also be accounted for by calculating maximum matching degree Specific gravity, determine the accuracy rate of text.That is, in the example above, ratio that the accuracy rate of text is 60% and 80% Value, i.e., 75%.
Difference between maximum matching degree matching degree corresponding with other each target signature parameters can characterize maximum With the accuracy rate for spending corresponding text, the bigger accuracy rate of difference is relatively high, so in the concrete realization, the accuracy rate of text Can also by calculate the difference between maximum matching degree matching degree corresponding with other each target signature parameters carry out it is true It is fixed.For example, determining accuracy rate according to the difference between maximum matching degree and the second largest matching degree.Target signature parameter is corresponding Matching degree is respectively 60% and 20%, and the difference between maximum matching degree and the second largest matching degree is 40% (60%-20%= 40%), can be according to 40% difference and difference accuracy rate corresponding with the calculating of the mapping relations of accuracy rate, or it can also To be determined as accuracy rate for 40%.It should be noted that the present embodiment does not carry out the circular of the accuracy rate of text Limitation.
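The two proportion-based variants of step 17 reduce to a few lines; the function names are illustrative, but the arithmetic follows the worked example (60%, 20%, 5%).

```python
def text_accuracy(matching_degrees):
    """Accuracy of the best match as its share of the sum of all N
    matching degrees (the first variant described above)."""
    best = max(matching_degrees)
    return best / sum(matching_degrees)

def text_accuracy_top2(matching_degrees):
    """Variant: the best match's share of (best + second-best) only."""
    top = sorted(matching_degrees, reverse=True)[:2]
    return top[0] / (top[0] + top[1])

acc = text_accuracy([0.60, 0.20, 0.05])        # 0.60 / 0.85 ≈ 0.7059
acc2 = text_accuracy_top2([0.60, 0.20, 0.05])  # 0.60 / 0.80 = 0.75
```

These reproduce the 70.59% and 75% figures from the example.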
Step 18: highlight the text if the accuracy is below the preset threshold.
In this embodiment, text whose accuracy is below the preset threshold can be marked with a color, so that when the text is displayed, low-accuracy text is highlighted, which makes checking convenient for proofreaders. Different colors can also be used for different accuracy grades, each grade corresponding to a different preset range. For example, accuracies can be divided into three grades: text with an accuracy of at least 80% gets no color marking, text with an accuracy between 60% and 80% is marked yellow, and text with an accuracy below 60% is marked red.
When highlighting text, the preset range the text's accuracy falls into can first be determined, and the color identifier then determined from that range. When the text is displayed, the color identifier can indicate the color of the text's background or of the text itself.
For example, for a speech-signal segment of a greeting such as "Leaders and guests, good morning, everyone!", the converted text format might read "Leaders and [B: yellow]beer on draft[E: yellow], good morning, everyone!", where [B: yellow] marks where the yellow marking begins and [E: yellow] where it ends; when the text is displayed, the reader sees the mis-recognized words "beer on draft" marked in yellow.
Note that this embodiment does not limit the way text is highlighted; for example, underlining, italics, or bold can also be used as highlighting.
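The grade-based marking can be sketched directly from the description. The 80%/60% thresholds and the inline [B: color] … [E: color] convention come from the text above; the function itself is a hypothetical helper.

```python
def highlight(text, accuracy):
    """Wrap text in inline color markers by accuracy grade, mirroring
    the [B: color] ... [E: color] convention of the worked example."""
    if accuracy >= 0.8:
        return text                      # high accuracy: no marking
    color = "yellow" if accuracy >= 0.6 else "red"
    return f"[B: {color}]{text}[E: {color}]"

marked = highlight("beer on draft", 0.65)
```

A renderer would then turn each marked span into a colored background or font when displaying the text.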
All text corresponding to the target voice signal can be matched through the above steps; when the target voice signal is a speech-signal segment, the text corresponding to each speech-signal segment forms a text section.
Step 19: mark the text section corresponding to the speech-signal segment with the timestamp of the speech-signal segment.
For example, speech-signal segments and their timestamps might be: "[00:00] At the three major basic telecom companies, XXX learned in detail how the enterprises strengthen information-infrastructure construction, carry out technological innovation and application in depth, and implement speed-up and fee-reduction measures to serve enterprises and consumers. [00:28] He encouraged the enterprises to aim at the trends of the scientific and technological revolution and industrial transformation, strive to break through more core technologies, seize the commanding heights of international competition, and promote broader and deeper integrated applications."
After the speech-signal segments are converted into text, the first text section can be marked with the timestamp of the 0th second and the second text section with the timestamp of the 28th second.
A text section can have one timestamp, marked at its head or tail; the time it indicates can be the start or end time of the section's corresponding speech-signal segment. A text section can also have two timestamps, marked at its head and tail, indicating the start time and end time of the corresponding speech-signal segment respectively.
Note that when the voice signal is segmented at fixed time intervals and the segments are timestamped, if those timestamps are also used for the corresponding text sections, a text section may not be semantically complete. So after a speech-signal segment is converted into text, the text can be re-segmented according to its semantics and each resulting text section marked with a timestamp, determined from the start and end times of the corresponding speech-signal segment. The voice signal can also be re-segmented to match the re-segmented text sections, and the speech-signal segments marked with the corresponding timestamps.
In non-real-time conversion, the target voice signal may be the complete voice signal; after it is converted into text, both the text and the voice signal can be segmented according to the semantics of the text, and each resulting text section and speech-signal segment marked with a timestamp.
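Step 19's timestamp marking amounts to prefixing each text section with its segment's start time in the [mm:ss] format of the example transcript; the helpers below are illustrative.

```python
def stamp(seconds):
    """Format a start time as the [mm:ss] tag used in the example."""
    m, s = divmod(int(seconds), 60)
    return f"[{m:02d}:{s:02d}]"

def stamp_sections(sections):
    """sections: list of (start_sec, text) pairs, one per converted
    text section. Prefix each with its segment's start timestamp."""
    return [f"{stamp(start)} {text}" for start, text in sections]

lines = stamp_sections([(0, "first text section"), (28, "second text section")])
```

Storing the (start, end) pair alongside each section also supports the two-timestamp variant described above.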
Steps 11 to 19 above can be executed not only by the terminal but also by the cloud server. In real-time text conversion, the terminal collects the voice signal in real time and sends it to the cloud server; the cloud server converts the voice signal into text and sends the text back to the terminal, achieving real-time conversion.
In another embodiment, while collecting the voice signal, the terminal can segment it according to the durations of sentence-pause signals, mark the speech-signal segments with timestamps, and send the segments to the cloud server; the cloud server converts each segment into a text section, highlights the relevant text, and sends the converted text sections back to the terminal, which marks each text section with the timestamp of its speech-signal segment. In this embodiment, the timestamps of the speech-signal segments can also be marked by the cloud server.
The converted text forms a readable and editable document. When checking the text, proofreaders can easily locate low-accuracy text and judge its correctness, which makes checking convenient while also improving checking efficiency and guaranteeing the accuracy of the converted text.
When judging whether a conversion is correct, the speech-signal segment corresponding to the text can be played so the judgment can be made from context. A concrete implementation: when a play instruction is detected, obtain the text to be played, determine the timestamp of the text section containing it, and play the speech-signal segment corresponding to that timestamp. The text to be played can be the selected text.
In a concrete implementation, a proofreader can select the text to be checked and click a play button; after the click, the system detects the play instruction and plays the speech-signal segment corresponding to the timestamp of the text section containing the text, so the proofreader can judge from context whether the text is correct.
Note that this embodiment does not limit how the play button is displayed; it can be shown after the text is opened, after text is selected, or after text is selected and the mouse is right-clicked.
The timestamp marking a text section can be identical to the timestamp of the section's corresponding speech-signal segment, so after the text to be played is determined, the corresponding speech-signal segment can be found from the text's timestamp and played. This makes use more convenient; using the method provided by this embodiment for text checking can improve checking efficiency.
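The playback lookup — selected text to text section to speech-signal segment — is a simple search over the timestamped sections. This sketch assumes each section carries its segment's (start, end) times; the actual audio playback call is outside its scope.

```python
def find_segment(sections, selected_text):
    """sections: list of (start_sec, end_sec, text) triples.
    Return the (start, end) of the speech-signal segment whose text
    section contains the selection, so it can be replayed; None if
    the selection is not found."""
    for start, end, text in sections:
        if selected_text in text:
            return (start, end)
    return None

sections = [(0.0, 28.0, "visited three telecom companies"),
            (28.0, 55.0, "encouraged core technology breakthroughs")]
span = find_segment(sections, "core technology")
```

A player would then seek to span[0] and stop at span[1] in the recorded signal, letting the proofreader hear the context of the selected text.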
It referring to fig. 2, is a kind of structural block diagram for the device for converting speech into text provided in an embodiment of the present invention, the dress Setting can be only fitted in terminal or for terminal itself.As shown in Fig. 2, the device can specifically include acquiring unit 21, punctuate is single Member 22, the second marking unit 23, extraction unit 24, matching unit 25, the first determination unit 26, the second determination unit 27, first Marking unit 28.
The acquiring unit 21 is configured to acquire a speech signal.
The sentence segmentation unit 22 is configured to segment the speech signal at a sentence pause signal when the duration of the sentence pause signal in the speech signal exceeds a preset time, forming a speech signal segment.
The second marking unit 23 is configured to mark a timestamp on the speech signal segment; the speech signal segment is the target speech signal.
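The pause-based segmentation performed by the sentence segmentation unit can be sketched roughly as follows. The amplitude-threshold silence test, the parameter names, and the sample-list representation are all assumptions of this sketch, not details given by the embodiment:

```python
# Hypothetical sketch: split a sampled signal wherever a pause lasts longer
# than a preset time, and record a start timestamp for each segment.
def segment_speech(samples, sample_rate, pause_threshold=0.01, preset_time=0.5):
    """samples: list of float amplitudes; returns (start_time, segment) pairs."""
    min_pause = int(preset_time * sample_rate)   # pause length in samples
    segments, start, silent_run = [], 0, 0
    for i, s in enumerate(samples):
        if abs(s) < pause_threshold:             # low amplitude counts as pause
            silent_run += 1
            if silent_run == min_pause:          # pause exceeded the preset time
                end = i - min_pause + 1          # cut before the pause began
                if end > start:
                    segments.append((start / sample_rate, samples[start:end]))
                start = None
        else:
            if start is None:
                start = i                        # a new segment begins here
            silent_run = 0
    if start is not None and start < len(samples):
        segments.append((start / sample_rate, samples[start:]))
    return segments
```

Each returned start time plays the role of the timestamp marked by the second marking unit 23, tying the segment to the text section later transcribed from it.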
The extraction unit 24 is configured to extract first feature parameters of the target speech signal.
The matching unit 25 is configured to match second feature parameters against third feature parameters in a speech database and determine N target feature parameters; the N target feature parameters are the N parameters among the third feature parameters with the highest matching degree to the second feature parameters, N >= 2, and the second feature parameters are a part of the first feature parameters.
The first determination unit 26 is configured to determine the text corresponding to the target feature parameter with the highest matching degree to the second feature parameters, and output the text.
The second determination unit 27 is configured to determine the accuracy rate of the text using the matching degrees of the N target feature parameters.
The second determination unit 27 is specifically configured to determine the sum of the matching degrees of the N target feature parameters, and to determine the proportion of the matching degree corresponding to the text in that sum; the proportion is the accuracy rate of the text.
The first marking unit 28 is configured to highlight the text if the accuracy rate is lower than a preset threshold.
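A minimal sketch of the matching and accuracy computation carried out by units 25 through 28, under the assumption that the matching degree is a cosine similarity between feature vectors (the embodiment does not specify the measure; all names here are illustrative): the N best-matching database entries are selected, the accuracy rate of the output text is the share of its matching degree in the sum of the N matching degrees, and text below the preset threshold is flagged for highlighting.

```python
import math

def matching_degree(a, b):
    # Assumed matching measure: cosine similarity between feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recognize(second_feature, database, n=2, threshold=0.5):
    """database: list of (third_feature, text). Returns (text, accuracy, highlight)."""
    scored = sorted(((matching_degree(second_feature, feat), text)
                     for feat, text in database), reverse=True)
    top_n = scored[:n]                        # the N target feature parameters
    best_degree, best_text = top_n[0]
    total = sum(d for d, _ in top_n)          # sum of the N matching degrees
    accuracy = best_degree / total if total else 0.0
    return best_text, accuracy, accuracy < threshold  # highlight if below threshold
```

The intuition is that when several candidates match almost equally well, the best candidate's share of the total is small, so the text is ambiguous and gets highlighted for the verifier.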
The second marking unit 23 is further configured to mark, using the timestamp of the speech signal segment, the text section corresponding to the speech signal segment.
The device may further include a second acquiring unit, a third determination unit, and a playback unit.
The second acquiring unit is configured to obtain the text to be played when a play instruction is received.
The third determination unit is configured to determine the timestamp corresponding to the text section containing the text to be played.
The playback unit is configured to play the speech signal segment corresponding to the timestamp.
The device provided by the embodiment of the present invention enables verification personnel to easily find text with a lower accuracy rate, judge whether it is erroneous, and correct it. Besides facilitating verification, it can also improve verification efficiency and ensure the accuracy of the converted text.
Referring to Fig. 3, which is a structural block diagram of a system for converting speech into text provided by an embodiment of the present invention. The system may include a terminal 31 and a cloud server 32 connected to the terminal 31.
The terminal 31 is configured to acquire a speech signal and send the acquired speech signal to the cloud server 32.
The cloud server 32 is configured to convert the speech signal into text and send the text to the terminal 31. The cloud server 32 may specifically include the following units.
The receiving unit 321 is configured to receive the speech signal sent by the terminal.
The sentence segmentation unit 322 is configured to segment the speech signal at a sentence pause signal when the duration of the sentence pause signal in the speech signal exceeds a preset time, forming a speech signal segment.
The second marking unit 323 is configured to mark a timestamp on the speech signal segment.
The extraction unit 324 is configured to extract first feature parameters of the speech signal.
The extraction unit 325 is specifically configured to extract first feature parameters of the speech signal segment.
The matching unit 326 is configured to match second feature parameters against third feature parameters in a speech database and determine N target feature parameters; the N target feature parameters are the N parameters among the third feature parameters with the highest matching degree to the first feature parameters, N >= 2, and the second feature parameters are a part of the first feature parameters.
The first determination unit 327 is configured to determine the text corresponding to the target feature parameter with the highest matching degree to the second feature parameters, and output the text.
The second determination unit 328 is configured to determine the accuracy rate of the text using the matching degrees of the N target feature parameters.
The marking unit 329 is configured to highlight the text when the accuracy rate is lower than a preset threshold.
The second marking unit 323 is further configured to mark, using the timestamp of the speech signal segment, the text section corresponding to the speech signal segment.
The transmission unit 330 is configured to send the matched text to the terminal 31; the text includes the text corresponding to the speech signal.
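The split of responsibilities between terminal 31 and cloud server 32 can be sketched as a simple request/response exchange; the transport, the class names, and the recognizer callback below are assumptions of this sketch rather than details of the embodiment:

```python
# Hypothetical sketch of the terminal/cloud split: the terminal only acquires
# audio and displays the returned text; all recognition work happens on the server.
class CloudServer:
    def __init__(self, recognizer):
        self.recognizer = recognizer   # callable converting a speech signal to text

    def handle(self, speech_signal):
        # Receiving unit -> recognition pipeline -> transmission unit.
        return self.recognizer(speech_signal)

class Terminal:
    def __init__(self, server):
        self.server = server

    def transcribe(self, speech_signal):
        # Acquire the signal locally, send it to the cloud, display the reply.
        return self.server.handle(speech_signal)
```

In a real deployment the `handle` call would cross the network (for example over HTTP), but the unit boundaries mirror the ones listed above: acquisition on the terminal; receiving, segmentation, extraction, matching, accuracy determination, marking, and transmission on the server.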
The system provided by the embodiment of the present invention enables verification personnel to easily find text with a lower accuracy rate, judge whether it is erroneous, and correct it. Besides facilitating verification, it can also improve verification efficiency and ensure the accuracy of the converted text.
As the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, reference may be made to the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a device, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, terminal device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operation steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once aware of the basic inventive concept, can make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.
Finally, it should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between such entities or operations. Moreover, the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The method, device, and system for converting speech into text provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present invention, and the description of the above embodiments is merely intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for converting speech into text, characterized by comprising:
extracting first feature parameters of a target speech signal;
matching second feature parameters against third feature parameters in a speech database to determine N target feature parameters, wherein the N target feature parameters are the N parameters among the third feature parameters with the highest matching degree to the second feature parameters, N >= 2, and the second feature parameters are a part of the first feature parameters;
determining the text corresponding to the target feature parameter with the highest matching degree to the second feature parameters, and outputting the text;
determining an accuracy rate of the text using the matching degrees of the N target feature parameters;
highlighting the text if the accuracy rate is lower than a preset threshold.
2. The method according to claim 1, characterized in that determining the accuracy rate of the text using the matching degrees of the N target feature parameters comprises:
determining the sum of the matching degrees of the N target feature parameters;
determining the proportion of the matching degree corresponding to the text in the sum of the matching degrees, the proportion being the accuracy rate of the text.
3. The method according to claim 1, characterized in that highlighting the text if the accuracy rate is lower than the preset threshold comprises:
marking the text with color if the accuracy rate is lower than the preset threshold.
4. The method according to claim 1, characterized in that the method further comprises:
acquiring a speech signal;
if the duration of a sentence pause signal in the speech signal exceeds a preset time, segmenting the speech signal at the sentence pause signal to form a speech signal segment;
marking a timestamp on the speech signal segment, the speech signal segment being the target speech signal.
5. The method according to claim 4, characterized in that the method further comprises:
marking, using the timestamp of the speech signal segment, the text section corresponding to the speech signal segment.
6. The method according to claim 4 or 5, characterized in that the method further comprises:
obtaining text to be played when a play instruction is detected;
determining the timestamp corresponding to the text section containing the text to be played;
playing the speech signal segment corresponding to the timestamp.
7. A device for converting speech into text, characterized by comprising:
an extraction unit, configured to extract first feature parameters of a target speech signal;
a matching unit, configured to match second feature parameters against third feature parameters in a speech database to determine N target feature parameters, wherein the N target feature parameters are the N parameters among the third feature parameters with the highest matching degree to the second feature parameters, N >= 2, and the second feature parameters are a part of the first feature parameters;
a first determination unit, configured to determine the text corresponding to the target feature parameter with the highest matching degree to the second feature parameters, and output the text;
a second determination unit, configured to determine an accuracy rate of the text using the matching degrees of the N target feature parameters;
a first marking unit, configured to highlight the text if the accuracy rate is lower than a preset threshold.
8. The device according to claim 7, characterized by further comprising:
an acquiring unit, configured to acquire a speech signal;
a sentence segmentation unit, configured to segment the speech signal at a sentence pause signal when the duration of the sentence pause signal in the speech signal exceeds a preset time, forming a speech signal segment;
a second marking unit, configured to mark a timestamp on the speech signal segment, the speech signal segment being the target speech signal, and to mark, using the timestamp of the speech signal segment, the text section corresponding to the speech signal segment.
9. The device according to claim 8, characterized by further comprising:
a second acquiring unit, configured to obtain text to be played when a play instruction is received;
a third determination unit, configured to determine the timestamp corresponding to the text section containing the text to be played;
a playback unit, configured to play the speech signal segment corresponding to the timestamp.
10. A system for converting speech into text, characterized by comprising: a terminal and a cloud server connected to the terminal;
the terminal is configured to acquire a speech signal and send the acquired speech signal to the cloud server;
the cloud server comprises:
a receiving unit, configured to receive the speech signal sent by the terminal;
an extraction unit, configured to extract first feature parameters of the speech signal;
a matching unit, configured to match second feature parameters against third feature parameters in a speech database to determine N target feature parameters, wherein the N target feature parameters are the N parameters among the third feature parameters with the highest matching degree to the second feature parameters, N >= 2, and the second feature parameters are a part of the first feature parameters;
a first determination unit, configured to determine the text corresponding to the target feature parameter with the highest matching degree to the second feature parameters, and output the text;
a second determination unit, configured to determine an accuracy rate of the text using the matching degrees of the N target feature parameters;
a marking unit, configured to highlight the text when the accuracy rate is lower than a preset threshold.
CN201711386363.3A 2017-12-20 2017-12-20 A kind of method, apparatus and system converting speech into text Pending CN109949813A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201711386363.3A CN109949813A (en) 2017-12-20 2017-12-20 A kind of method, apparatus and system converting speech into text
PCT/CN2018/122344 WO2019120248A1 (en) 2017-12-20 2018-12-20 Method for converting speech into characters, device, and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711386363.3A CN109949813A (en) 2017-12-20 2017-12-20 A kind of method, apparatus and system converting speech into text

Publications (1)

Publication Number Publication Date
CN109949813A true CN109949813A (en) 2019-06-28

Family

ID=66992504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711386363.3A Pending CN109949813A (en) 2017-12-20 2017-12-20 A kind of method, apparatus and system converting speech into text

Country Status (2)

Country Link
CN (1) CN109949813A (en)
WO (1) WO2019120248A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036119A (en) * 2020-10-16 2020-12-04 深圳市欢太科技有限公司 Text display method and device and computer readable storage medium
CN114079695A (en) * 2020-08-18 2022-02-22 北京有限元科技有限公司 Method, device and storage medium for recording voice call content

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1341255A (en) * 1999-02-19 2002-03-20 Custom Speech USA, Inc. Automated transcription system and method using two speech converting instances and computer-assisted correction
CN101127210A (en) * 2007-09-20 2008-02-20 UTStarcom Telecom Co., Ltd. Method and device for implementing lyric synchronization when broadcasting song
CN101287029A (en) * 2007-04-13 2008-10-15 Huawei Technologies Co., Ltd. Method and apparatus for automatically respond to detection
CN101290766A (en) * 2007-04-20 2008-10-22 Northwest Minzu University A Method for Segmentation of Amdo Tibetan Speech and Syllables
US20090048832A1 (en) * 2005-11-08 2009-02-19 Nec Corporation Speech-to-text system, speech-to-text method, and speech-to-text program
CN102122506A (en) * 2011-03-08 2011-07-13 TVMining (Beijing) Media Technology Co., Ltd. Method for recognizing voice
CN102543063A (en) * 2011-12-07 2012-07-04 South China University of Technology Method for estimating speech speed of multiple speakers based on segmentation and clustering of speakers
CN103165131A (en) * 2011-12-17 2013-06-19 Fu Tai Hua Industry (Shenzhen) Co., Ltd. Voice processing system and voice processing method
CN103943109A (en) * 2014-04-28 2014-07-23 Shenzhen Ruguo Technology Co., Ltd. Method and device for converting voice to characters
CN104516520A (en) * 2013-09-28 2015-04-15 Nanjing Zhuanchuang Intellectual Property Service Co., Ltd. Character input method based on voice recognition technology
CN105047198A (en) * 2015-08-24 2015-11-11 Baidu Online Network Technology (Beijing) Co., Ltd. Voice error correction processing method and apparatus
KR101590724B1 (en) * 2014-10-06 2016-02-02 POSTECH Academy-Industry Foundation Method for modifying error of speech recognition and apparatus for performing the method
CN105869634A (en) * 2016-03-31 2016-08-17 Chongqing University Field-based method and system for feeding back text error correction after speech recognition
WO2017125752A1 (en) * 2016-01-22 2017-07-27 Oxford Learning Solutions Limited Computer-implemented phoneme-grapheme matching
CN107068144A (en) * 2016-01-08 2017-08-18 Wang Daoping Method for facilitating manual correction of text in speech recognition
CN107123042A (en) * 2017-04-26 2017-09-01 Shandong Inspur Business System Co., Ltd. Intelligent voice tax-handling method, device and system
US20170270086A1 (en) * 2016-03-16 2017-09-21 Kabushiki Kaisha Toshiba Apparatus, method, and computer program product for correcting speech recognition error

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578465B (en) * 2013-10-18 2016-08-17 威盛电子股份有限公司 Speech recognition method and electronic device
CN105931641B (en) * 2016-05-25 2020-11-10 腾讯科技(深圳)有限公司 Subtitle data generation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Ru et al.: "Digital Content Security" (《数字内容安全》), Beijing University of Posts and Telecommunications Press, 30 September 2017 *


Also Published As

Publication number Publication date
WO2019120248A1 (en) 2019-06-27


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 100101 Room 101, 1st floor, block C, building 21, 2 Wanhong West Street, xibajianfang, dongzhimenwai, Chaoyang District, Beijing

Applicant after: Beijing Junlin Technology Co.,Ltd.

Address before: 100107 commercial building 03, floor 3, block C, tianlangyuan, Chaoyang District, Beijing (No. 1336, Fengshou incubator)

Applicant before: BEIJING JUNLIN TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210308

Address after: 215163 Room 201, building 17, No.158, Jinfeng Road, science and Technology City, Suzhou, Jiangsu Province

Applicant after: Suzhou Junlin Intelligent Technology Co.,Ltd.

Address before: 100101 Room 101, 1st floor, block C, building 21, 2 Wanhong West Street, xibajianfang, dongzhimenwai, Chaoyang District, Beijing

Applicant before: Beijing Junlin Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20190628