WO2013083132A1 - Translation method and computer programme for assisting the same - Google Patents
Translation method and computer programme for assisting the same
- Publication number
- WO2013083132A1 (PCT/DK2012/050445)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- translation
- hypotheses
- text
- module
- word
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/44—Statistical methods, e.g. probability models
Definitions
- in the present method, computer program and system, a text is translated from a source language into a text in a target language in that a translator is presented a source text on a computer screen.
- An ASR-MT system runs in the background, preferably pre-processes the translation, and waits for the translator to speak his/her translation, while preferably an eye-tracker follows reading activities on the screen.
- the ASR-MT system or gaze-enabled ASR-MT system produces - preferably in real time - translation transcriptions in a target window. Due to its multi-modal input, the source text and the spoken translation, the ASR-MT system produces better translations than each of the systems in isolation, thus reducing post-editing effort and leaving the translator as the main actor in the process.
- translation proposals - which the MT system has internally generated - can optionally be plotted in a separate window on the screen so as to support the translator when s/he is in need of e.g. terminological translations, paraphrases etc.
- the translation technology is not (yet) mature, so that each project or translation job needs a lot of preparation and tuning of the computer system(s) and background resources, and/or produces output that is troublesome to correct.
- many companies have out-sourced their translation needs and many freelance translators are not willing to use, or do not see the benefit of using, the technology.
- although post-editing is, in the case of e.g. technical translation, quicker than from-scratch translation, the main obstacle to wider deployment of more advanced computer-assisted translation technology is currently the translators, who refuse to act as an appendix and post-editor of machine translation output.
- Fig. 1 a schematic flow diagram of the method
- Fig. 2 a second schematic view of the method
- Fig. 3 a schematic view of the method comprising the steps of the eye tracker
- Fig. 4 an alternative schematic view of the method known from fig. 2
- Fig. 5 a schematic view of a method including automatic speech synthesis.
- Fig 6 shows a schematic view of models for integration of ASR and MT systems
- Fig 7 shows a schematic view of integration of ASR, MT and gaze data
- Fig 8a and 8b show examples of the results from gaze data
Detailed description of the invention
- Fig. 1 shows a schematic flow diagram of the basic principles of the present invention.
- the translator reads a text in the source language and speaks out a translation in the target language (arrows A and B).
- the automatic speech recognition system processes the acoustic input and outputs a set of phone, diphone, syllable or word hypotheses (arrow C).
- Parallel to the spoken translation process, simultaneous or not, the source language text is sent to the machine translation system (arrow D) which processes the texts and outputs a set of translation hypotheses (arrow F) based on a word translation graph/translation lattice (arrow E).
- the word hypotheses and the translation hypotheses are combined in a search module, for example as described above, to produce the final translated target language text (arrow G).
- Fig. 2 shows how the method known from fig. 1 is performed when the translator reads the text on a computer screen.
- a machine translation system generates a graph of word translation proposals (a translation lattice). Parallel to this, simultaneous or not, the translator reads the source text and speaks his translation into a microphone.
- An ASR system produces phone, diphone, syllable and/or word hypothesis from the translator's spoken input and a search algorithm takes this information into account when retrieving the best translation from the word graph (translation lattice).
- the translation lattice functions thus as an informed language model for the speech recognition system.
- Optionally parts of the translation lattice are plotted in the form of partial translation options on the screen.
- Fig. 3 shows a diagram of the method and computer program known from fig. 1 and fig. 2, including the eye tracker option.
- the eye tracker registers where on the computer screen the translator is looking.
- the information on where the translator is looking is then processed to make a hypothesis about what part of the source text is being read by the translator at a given time.
- the information from the eye tracking may be outputted as hypotheses of reading progression which may be combined with the word hypotheses and translation hypotheses to e.g. synchronize the speech recognition output and the translation graph, and/or to trigger reactive translation assistance.
- the accepted target text can then be fed back into the automatic speech recognition system to incrementally learn/refine the speaker model and/or the system lexicon and to enhance the speech recognition rate. It can also be fed back into the machine translation system to enhance machine translation predictions.
- Fig. 4 shows a diagram of the system known from fig 2 in which the "verified text" is shown as a separate box to allow a clear illustration of the interactions with the human translator, the ASR module and the MT module.
- Fig. 5 shows a diagram of a method according to the present invention wherein an integrated ASR-TTS system and an integrated ASR-MT system interact to optimize the translation.
- the basic system is known from figures 1 - 4.
- the MT and ASR systems interact as indicated by the module "integrated Word and Phone Graph", and the output is integrated with gaze-to-word mapping based on information from the eye-tracker by an integration module, in this case a multi-modal integration module.
- the integrated ASR-TTS system creates a synthesized spoken version of the target text which may help the translator evaluate the translated text.
- the ASR- TTS system is an online trained shared resource.
- the translator's interaction with the computer system is also illustrated in this figure.
- the translator speaks a translation of the source text which is recognized by the ASR module.
- the translator can read suggestions for completions etc. on the screen and may revise the target text. Further, the translator can hear a synthesized spoken version of the target text in order to evaluate other aspects of the target text than what would be recognizable from the written target text.
- Fig 6 shows an exemplary model for tight integration of ASR and MT systems.
- ASR and MT systems and some possible combinations are known from general description and the previous figures.
- Some combinations of ASR and SMT systems according to the present invention integrate both components at translation time.
- synchronization of the SMT output and the spoken translation can be achieved either by guiding the translator to speak and edit the translation of a currently selected and/or highlighted segment in the GUI (similar to TMs) and/or by using a gaze capturing device as a cue to which segment is currently being spoken.
- synchronization of the SMT output and the spoken translation relies on the assumption that it is likely for a translator to speak out the translation of a source text sentence (or sentence fragment) which has just been or is currently being read.
- a reading model triggers activation of entries in a bilingual lexicon which increases the likelihood for these items to be recognized in the speech signal.
- Fig 7 shows a scheme of integrating ASR and MT systems with gaze data.
- the basic system combining ASR and MT can be as known from the previous figures.
- the gaze-enabled ASR-MT system captures gaze data from the eye tracker which may be instrumental in detecting particular translation problems. For instance, increased fixation durations on certain words, or refixations of segments, may reveal well-defined translation problems.
- the real-time integrated ASR-MT system with gaze-tracking support may react to the detected translation problem by plotting targeted assistance, before the translator becomes aware of the nature of the translation problem. Taking into account the automatically generated suggestions, the translator may then keep on producing translations without much delay, by re-speaking or accepting the proposed translations.
- it is possible for the ASR-MT system to produce and visualize translation proposals before the translation is being spoken, so as to provide targeted translation assistance at translation time.
- the translator need not look up translation proposals in third resources (lexica, and/or online search tools) but is encouraged to use the available translation proposals. This will on the one hand reduce the translator's search time for terminology and other translation problems, and on the other hand reduce the number of OOV words in the ASR system.
- MT supported sight translation puts translators in a more conducive mode for good translations.
- the integration of the ASR-MT may be based on phone level.
- a weighted sentence-based lexicon could be used, instead of the full-fledged SMT output graph, to be combined with the ASR phone lattice in the following way.
- the MT system retrieves phrase translations from a bilingual statistical lexicon.
- the set of most likely phrase translations can be plotted on the GUI to serve the translator as translation aids.
- the phrase translations are also converted into a phonetic transcription and integrated into the ASR system.
- the ASR system produces a phone lattice from the spoken signal.
- This phone lattice can be mapped on the lexicon models derived from the MT lexicon. For each source sentence, a sentence-based phrase dictionary may be generated and its phonetic forms computed. In one possible implementation, this lexicon can substitute the ASR lexicon or be combined in a joint decoding of the ASR phone lattice, the ASR internal lexicon, the phonetic phrase transcriptions (as obtained from the statistical lexicon), and a target language model.
- Fig 8a and 8b illustrate some aspects of the use of gaze data.
- the gaze data from the eye tracker are indicated by circles across the text. As seen, the distance between circles is not constant, and the vertical position drifts from start to end over a text passage.
- Various assumptions and theories can be applied depending on the detail and/or type of the needed gaze data.
- the assumed reading line is indicated by grey field 1, i.e. it is the main part of the first line of the text shown; a sketch of how such an assignment of gaze samples to a reading line might be implemented is given below.
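As an illustration of how such drifting gaze samples might be assigned to a reading line, the following minimal sketch (not taken from the patent; the data layout, drift heuristic and numbers are assumptions) tracks a slowly varying vertical offset and maps each fixation to the nearest text line:

```python
# Minimal sketch (illustrative): assign drifting gaze fixations to text lines.
# Fixation y-coordinates drift over a passage, so a simple running offset is
# estimated and subtracted before picking the nearest line.
from dataclasses import dataclass

@dataclass
class Fixation:
    x: float          # horizontal position in pixels
    y: float          # vertical position in pixels (drifts over time)
    duration_ms: int  # fixation duration

def assign_to_lines(fixations, line_ys, max_offset_step=2.0):
    """Map each fixation to the index of the nearest text line, tracking a
    slowly varying vertical drift offset."""
    offset = 0.0
    assignments = []
    for fix in fixations:
        corrected_y = fix.y - offset
        line_idx = min(range(len(line_ys)), key=lambda i: abs(line_ys[i] - corrected_y))
        residual = corrected_y - line_ys[line_idx]
        # nudge the drift estimate towards the residual, but only a little per fixation
        offset += max(-max_offset_step, min(max_offset_step, residual))
        assignments.append(line_idx)
    return assignments

if __name__ == "__main__":
    line_ys = [100.0, 130.0, 160.0]  # vertical centres of three text lines
    fixations = [Fixation(50, 101, 220), Fixation(120, 104, 180),
                 Fixation(200, 108, 250), Fixation(60, 137, 300)]
    print(assign_to_lines(fixations, line_ys))  # -> [0, 0, 0, 1]
```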
Abstract
The present invention relates to a translation method comprising the steps of: a translator speaking a translation of a written source text in a target language, an automatic speech recognition system converting the spoken translation into a set of phone and word hypotheses in the target language, a machine translation system translating the written source text into a set of translation hypotheses in the target language, and an integration module combining the set of spoken word hypotheses and the set of machine translation hypotheses to obtain a text in the target language. Thereby a significantly improved and faster translation is obtained compared to what is achieved by known translation methods.
Description
Translation method and computer programme for assisting the same.
Field of invention
The invention relates to a translation method and a computer programme for assisting the same.
Background of the Invention
Translators usually write their translations using a computer keyboard. The translation process is usually divided into a comprehension phase, i.e. source text (ST) reading, a transfer phase, and a text production phase, in which the target text (TT) is written. There is general agreement that the production and transfer process is - in most cases - much more difficult and time consuming than ST comprehension.
Most of the effort is actually spent on the typing activity, i.e. the mechanical part in translation production. To increase translation speed, numerous computer-based tools exist and are used in the translation industry, supporting translators with automated assistance. Most of these tools generate translations (translation memories and machine translation systems) which the translators then have to post-edit, thereby supporting them not only by easing the mechanical typing part (or largely avoiding it), but at the same time also depriving them of the creative transfer part of the translation process. Translators have to read the machine-generated texts, estimate segment by segment how well the machine-generated translations reflect the source text meaning and ameliorate them where necessary. Consequently, translators are unhappy with these technologies because they have to read and adjust crummy machine-generated texts, and are even paid less for this cognitively more challenging task. Desilets et al (Proceedings of IWSLT; Hawaii USA; 2008) describe a possible loose integration of ASR and MT. They use an off-the-shelf ASR system (Dragon NaturallySpeaking) which allows them to extrapolate productivity gains with various types of idealised ASR systems. In their experiments, NaturallySpeaking had a recognition error rate of 11%. Based on the time translators needed to speak the translation and the time to correct it, they calculated that an ASR with less than 4% recognition errors (and thus less time needed for error correction) could increase translation productivity by up to 45%. In a second part of this study, Desilets et al investigate to what extent a loose ASR-SMT integration could decrease the ASR recognition errors. To this end, various assumptions are made to train
NaturallySpeaking i) with a domain corpus and ii) with an SMT sentence lattice. The LM training possibilities of NaturallySpeaking were adapted with the output of the SMT system for each sentence before the spoken signal was recognised, thus rescoring n-best ASR output hypotheses from the recognizer or the SMT system. These experiments show that this kind of loose ASR-SMT integration (i.e. by adapting/training the ASR language model) falls short of reaching the ASR recognition rate required to substantially assist translators. Several studies have been carried out to investigate a tighter integration of ASR and MT. Reddy et al (IEEE Transactions on Audio, Speech and Language Processing, Vol. 18, No. 8, November 2010) investigate a tight integration for multi-pass ASR and SMT decoding. A word-to-phone transducer transforms the word lattice output of the SMT system into a phone lattice. A Levenshtein distance measure is used to match the most likely phone sequence produced by the ASR system on the SMT phone lattice. In this way, the SMT phone lattice serves as an informed sentence-based lexicon and a language model for the ASR system. To overcome reduced recognition rates due to OOV words, a named entity recognizer transliterates named entities from the source string into the target language. The transliterated named entities are then also converted into phones in the target lattice, which enables the ASR system to recognise words that would not have been known otherwise.
A slightly different integration is discussed in Luis Rodriguez et al (Efficient integration of translation and speech models in dictation based aided human translations; ICASS; 2012), who refactor the ASR LM with translation probabilities of the SMT system lexicon model output. Similar to the above experiments, these were off-line computations, where the speech recognizer and translation analysis run after the actual translation session. This offline
synchronization of the speech signal and the SMT output has a reported 6.8% error rate. Nevertheless, in both cases of tight SMT integration an increased ASR recognition rate of up to 30% can be obtained, compared to non-informed ASR.
Summary of the invention
Considering the prior art described above, it is an objective of the present invention to provide an improved translation method by combining automatic speech recognition and machine translation in sight translation of a written text. This and other advantageous features are obtained according to the present invention by a translation method comprising the steps of a translator speaking the translation of a written source text in the target language (sight translation), an automatic speech recognition system converting the spoken translation into a set of phone/diphone/syllable and/or word hypotheses of the target language, a machine translation system translating the written source text into a set of translation hypotheses in the target language, and combining the set of word hypotheses and the set of translation hypotheses to obtain a target text in the target language.
By this translation method it is possible to perform sight translation of a written text with improved speed and minimized risk of speech recognition and/or translation errors.
When the speech recognition is combined with a machine translation system, recognition errors of the spoken words may be corrected.
I.e. when the output of a speech recognition system is combined with the output of a machine translation system, recognition errors of the spoken words may be corrected.
Similarly, errors that are typical in machine translation may be avoided, as the machine translated text may be corrected by combination with the sight translated text or hypotheses of the sight translated text.
Thus by the present method a translator sight translates a written text (i.e. reads the source text and speaks the translation in the target language) and, based on the spoken translation and a machine translation of the source text, a translated text in the target language is produced.
Preferably the set of translation hypotheses and the set of word hypotheses are combined by a search algorithm. Exemplary search algorithms are further explained below.
In some advantageous embodiments an eye tracking device captures the reading behaviour of the translator. In these embodiments the eye tracking device monitors what part of the text the translator is reading at a given time, making it possible to perform different types of adjustments and synchronizations in and/or between the actions of the automated speech recognition system and the machine translation system.
For example it is possible to synchronize the spoken translations and the machine translation based on an analysis of the reading behaviour.
It is possible to combine hypotheses of the reading progression (obtained from the eye tracker and a gaze-to-word mapper), hypotheses of spoken target words, and/or the word translation hypotheses to predict and complete the translated text. The reading progression, the spoken word hypotheses and the machine translation hypotheses are used to propose and predict translation completions.
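As a rough illustration of such a completion proposal, the sketch below (an assumption for illustration, not the algorithm of the invention) matches the target-language words recognized so far against a small list of machine translation hypotheses and proposes the highest-scoring remainder:

```python
# Minimal sketch (illustrative): propose a translation completion by matching
# the words recognized so far against MT hypotheses and returning the most
# likely remainder.

def propose_completion(spoken_prefix, mt_hypotheses):
    """spoken_prefix: list of recognized target-language words.
    mt_hypotheses: list of (word list, score) pairs from the MT system.
    Returns the highest-scoring completion consistent with the prefix, or None."""
    candidates = [
        (words[len(spoken_prefix):], score)
        for words, score in mt_hypotheses
        if words[:len(spoken_prefix)] == spoken_prefix
    ]
    if not candidates:
        return None
    completion, _ = max(candidates, key=lambda c: c[1])
    return completion

if __name__ == "__main__":
    mt_hypotheses = [
        (["the", "contract", "was", "signed", "yesterday"], -2.1),
        (["the", "contract", "has", "been", "signed"], -2.9),
        (["a", "contract", "was", "signed", "yesterday"], -3.4),
    ]
    print(propose_completion(["the", "contract", "was"], mt_hypotheses))
    # -> ['signed', 'yesterday']
```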
Thus, the method allows various corrections of the suggested translations during the translation process as the different input modalities, e.g. speech recognition, machine translation and eye tracking, provide different types of information and hypotheses, each having strengths and weaknesses in different areas.
Generally, by an improved, better or optimized method is meant a method providing a faster translation and/or a translation with fewer errors compared to what is obtained by known speech recognition systems and machine translation systems.
Partial results from the eye-tracker, the speech recognition system and the MT system are integrated by means of a search algorithm. Different types of search algorithms may be used, but preferably the search is based on a log-linear combination of weighted feature models originating from the speech recognition system, the machine translation system and/or the eye-tracker. Log-linear combinations are discussed further below.
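A minimal sketch of such a log-linear combination is given below; the feature names, probabilities and weights are invented for illustration and are not taken from the patent:

```python
# Minimal sketch (illustrative feature names and weights) of a log-linear
# combination of weighted feature models from ASR, MT and the eye-tracker.
import math

def log_linear_score(features, weights):
    """features: model name -> probability assigned to a candidate.
    weights: model name -> lambda weight. Score = sum_m lambda_m * log p_m."""
    return sum(weights[m] * math.log(p) for m, p in features.items())

def best_candidate(candidates, weights):
    """candidates: candidate translation -> {model name: probability}."""
    return max(candidates, key=lambda c: log_linear_score(candidates[c], weights))

if __name__ == "__main__":
    weights = {"asr_acoustic": 1.0, "mt_translation": 0.8,
               "target_lm": 0.6, "gaze_sync": 0.3}
    candidates = {
        "the country is falling behind": {"asr_acoustic": 0.40, "mt_translation": 0.05,
                                          "target_lm": 0.20, "gaze_sync": 0.50},
        "the countries falling behind":  {"asr_acoustic": 0.35, "mt_translation": 0.30,
                                          "target_lm": 0.25, "gaze_sync": 0.50},
    }
    print(best_candidate(candidates, weights))  # -> "the countries falling behind"
```

Note how the MT and language model features can outweigh a slightly better acoustic score and thereby correct a homonym-style recognition error.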
Different models for machine translations may also be used, however in advantageous embodiments the machine translation system is a trainable statistical machine translation system. A specific machine translation system may be chosen based on different factors such as language pairs, on-line trainability, and the topology of the generated graph of translation hypotheses.
Similarly the automatic speech recognition system may be chosen to match specific factors such as the suitability for the target language to be recognized and the compatibility of the output with the machine translation output. In preferred
embodiments the automatic speech recognition system is a trainable speech
recognition system.
In advantageous embodiments at least some of the translation hypotheses are made available to the translator on a computer monitor when speaking the translation. For example a translator reading a text written in the source language and speaking the text in the target language may be presented with translation suggestions made by the machine translation system. The presentation can for example be a list of word or phrase translation options on a computer screen.
The display of the translation hypotheses made by the machine translation may be passive, i.e. may just be a list of words, parts of or complete sentences or text segments. Also the machine translation system may suggest target text completions and said suggestions may be accepted, ignored or modified by the translator. For example it may be possible for the translator to approve suggestions made by the machine translation system to quickly finish a sentence or text fragment. It may also be possible for the translator to reject or modify suggestions. It is also possible that the translator ignores suggestions, for example by simply continuing the spoken translation. The machine translation system may also learn and adapt the set of translation hypotheses based on the produced translations, hereby improving the efficiency of the suggested translations and the method overall.
Similarly the method may advantageously comprise a step wherein the automatic speech recognition system incrementally learns a speaker and a lexicon model based on the produced translations. Rejections, approvals and modifications of the machine translation and speech recognition proposals may be fed back into the speech recognition and the machine translation systems to incrementally improve their recognition and translation performance (learning). For example the speech recognition
system may learn the accent and intonation of one or more translators. This learning may be accelerated by the combination with the machine translation system.
The eye-tracking may be used in different ways for example in an advantageous setup where the eye tracker registers where on the computer screen the translator is looking. The translator's reading behaviour indicates whether the translator faces translation problems. Longer fixations on one word may indicate word-related translation problems while regressions to previously read words may reflect translation problems on a textual level. This information will be synchronized with the ASR and the MT output to produce targeted translation proposals, translation predictions and translation completions.
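The following sketch illustrates, under assumed thresholds and a simplified fixation format (neither taken from the patent), how long fixations and regressions might be flagged as word-level and text-level problems respectively:

```python
# Minimal sketch (illustrative thresholds): flag possible translation problems
# from eye-tracking data -- long fixations on a word (word-level problems) and
# regressions to previously read words (text-level problems).

def detect_problems(fixated_words, long_fixation_ms=600):
    """fixated_words: chronological list of (word_index, word, duration_ms).
    Returns a list of (kind, word_index) problem flags."""
    problems = []
    furthest = -1
    for idx, word, duration in fixated_words:
        if duration >= long_fixation_ms:
            problems.append(("long_fixation", idx))   # possible word-related problem
        if idx < furthest:
            problems.append(("regression", idx))      # possible textual-level problem
        furthest = max(furthest, idx)
    return problems

if __name__ == "__main__":
    trace = [(0, "The", 180), (1, "defendant", 750), (2, "waived", 300),
             (1, "defendant", 200), (3, "his", 150)]
    print(detect_problems(trace))  # -> [('long_fixation', 1), ('regression', 1)]
```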
Thus the eye-tracking may be used to enhance the combination of the ASR and MT as it can provide gaze data e.g. on which part of the text is being read (and thereby possibly translated) at a given time.
Real-time integrated ASR-MT system with gaze-tracking support may react to the detected translation problem by plotting targeted assistance (e.g. in form of
suggestions on phrases, grammar, corrections and/or completions), before the translator becomes aware of the nature of the translation problem. Taking into account the automatically generated suggestions (e.g. by interacting with touch screen, mouse, and/or by voice), the translator may then keep on producing translations without much delay, by re-speaking or accepting the proposed
translations.
It is possible for the ASR-MT system to produce and visualize translation proposals before the translation is being spoken, so as to provide targeted translation
assistance at translation time. By providing translation options on the screen at the time, the translator need not look up translation proposals in third resources (lexica, and/or online search tools) but is encouraged to use the available translation proposals. This will on the one hand reduce the translator's search time for
terminology and other translation problems, and on the other hand reduce the number of OOV words in the ASR system. MT supported sight translation puts translators in a more conducive mode for good translations.
The integration of the ASR-MT may be based on phone level. A weighted sentence-based lexicon could be used, instead of the full-fledged SMT output graph, to be combined with the ASR phone lattice in the following way. For each source sentence that is being translated, the MT system retrieves phrase translations from a bilingual statistical lexicon. The set of most likely phrase translations can be plotted on the GUI to serve the translator as translation aids. In one possible implementation, the phrase translations are also converted into a phonetic transcription and integrated into the ASR system. As the translator speaks the translation by possibly re-using the provided translation options on the screen, the ASR system produces a phone lattice from the spoken signal. This phone lattice can be mapped on the lexicon models derived from the MT lexicon. For each source sentence, a sentence-based phrase dictionary may be generated and its phonetic forms computed. In one possible implementation, this lexicon can substitute the ASR lexicon or be combined in a joint decoding of the ASR phone lattice, the ASR internal lexicon, the phonetic phrase transcriptions (as obtained from the statistical lexicon), and a target language model.
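A minimal sketch of this phone-level matching is given below; the tiny pronunciation lexicon stands in for a real grapheme-to-phoneme component, and the single-best Levenshtein match is a simplification of the lattice-based decoding described above:

```python
# Minimal sketch (illustrative): phrase translations proposed by the MT system
# are converted to phone sequences and the ASR phone output is matched against
# them with a Levenshtein (edit) distance.

def levenshtein(a, b):
    """Edit distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (pa != pb)))
        prev = cur
    return prev[-1]

PRONUNCIATIONS = {  # toy stand-in for a pronunciation lexicon / g2p component
    "sends": ["s", "eh", "n", "d", "z"],
    "transmits": ["t", "r", "ae", "n", "z", "m", "ih", "t", "s"],
    "dispatches": ["d", "ih", "s", "p", "ae", "ch", "ih", "z"],
}

def to_phones(phrase):
    return [p for word in phrase.split() for p in PRONUNCIATIONS[word]]

def best_phrase_match(asr_phones, phrase_translations):
    """Pick the MT-proposed phrase whose phone sequence is closest to the
    phone sequence recognized by the ASR system."""
    return min(phrase_translations,
               key=lambda phrase: levenshtein(asr_phones, to_phones(phrase)))

if __name__ == "__main__":
    asr_phones = ["s", "eh", "n", "s"]   # noisy recognition of the spoken word
    print(best_phrase_match(asr_phones, ["sends", "transmits", "dispatches"]))
    # -> 'sends'
```

Constraining the match to the MT-informed lexicon recovers "sends" even though the raw phone sequence is closer to "since", mirroring the homonym example discussed further below.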
A text to speech (TTS) synthesizer may read out a written text. In the present context the TTS synthesizer may read out the target text produced by the present method or may read out one or more parts of the target text.
In this way, the translator can verify the produced translation for comprehensibility and rhythm on an auditory channel. The ASR and TTS systems will share common resources, so that incremental learning/refining of the speaker model in the recognition phase will also lead to enhanced speech synthesis.
The present invention also relates to a computer program assisting a translator, wherein said computer programme comprises the following modules
- an audio reception module,
- an audio conversion module, converting audio input from the audio reception module to a set of phone and/or word hypotheses,
- a text conversion module, converting text input from the text reception module to a set of translation hypotheses,
- a combining module, combining the set of translation hypotheses and the set of phone and/or word hypotheses, and
- an output module, outputting a target text based on the combination of the set of translation hypotheses and the set of word hypotheses.
By using the present computer program it is possible to perform a fast and precise translation of a written text in a source language into a written text in the target language. Audio input in the form of a spoken translation is registered via a microphone. The audio conversion module provides sets of phone and word hypotheses from the audio input. Similarly the original text in the source language is processed by the text conversion module into sets of translation hypotheses in the target language. The computer system then combines the sets of hypotheses from the text conversion module and from the audio conversion module to produce a translation of the source text. The end product from the computer program is a translated written text in the target language. The computer programme thus comprises an automated speech recognition (ASR) module (the audio conversion module) which receives the input in the form of audio input, i.e. the spoken translation, and processes the input to obtain a complete or partial written target text. The ASR module can interact in different ways with other modules of the computer programme such as the text-to-text (MT) module (the text conversion module).
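The following structural sketch shows one way the modules named above could be wired together; the interfaces, names and the trivial stand-ins for the ASR and MT components are illustrative assumptions, not part of the invention as claimed:

```python
# Minimal structural sketch (illustrative interfaces): the audio conversion
# (ASR), text conversion (MT) and combining modules wired into a pipeline.
from typing import Callable, List

class TranslationAssistant:
    def __init__(self,
                 audio_to_hypotheses: Callable[[bytes], List[str]],
                 text_to_hypotheses: Callable[[str], List[str]],
                 combine: Callable[[List[str], List[str]], str]):
        self.audio_to_hypotheses = audio_to_hypotheses  # audio conversion module (ASR)
        self.text_to_hypotheses = text_to_hypotheses    # text conversion module (MT)
        self.combine = combine                          # combining module

    def translate(self, audio: bytes, source_text: str) -> str:
        word_hyps = self.audio_to_hypotheses(audio)           # spoken translation
        translation_hyps = self.text_to_hypotheses(source_text)
        return self.combine(word_hyps, translation_hyps)      # target text output

if __name__ == "__main__":
    assistant = TranslationAssistant(
        audio_to_hypotheses=lambda audio: ["the contract was signed"],
        text_to_hypotheses=lambda text: ["the contract was signed",
                                         "the agreement was signed"],
        combine=lambda words, trans: next((t for t in trans if t in words), trans[0]),
    )
    print(assistant.translate(b"...", "Kontrakten blev underskrevet."))
    # -> "the contract was signed"
```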
The computer programme may further comprise an eye tracker module which, by means of input from an eye tracker, follows the translator's reading progression and reading behaviour, thus enabling a processing and registration of where in the text the translator may be facing problems as well as what part of the text is being translated at a given time.
For example the eye tracker, which follows the translator's eye movements over the text, may register that the translator's gaze is fixed at a certain word for a prolonged time, indicating problems in the translation of that word. The eye tracker module, which receives this data, may process the input and suggest relevant translations which the translator may react to or not.
If the eye tracker detects that the translator's gaze has moved back to re-read one or more passages of the text, it may indicate that the translator is considering e.g. textual problems. In this case the eye tracker module may receive and process this input and suggest one or more sentences, alternative grammar etc. which the translator may react to or not.
The computer programme may also comprise a Text to Speech (TTS) module, which may synthesise a spoken version of the translated written text. The synthesized spoken text may help the translator to evaluate the translation in a different way than reading the written translation will. E.g. the synthesized spoken text, may help evaluate flow and rhythm of the translation.
The ASR and TTS modules will share common resources, so that incremental learning/refining of the speaker model in the recognition phase will also lead to enhanced speech synthesis. The eye tracker module may also take part in synchronizing the spoken (sight translated) translation with the machine translated text-to-text translation. This way the module performing the text-to-text translation may be influenced by the speech-to-text translation and vice versa.
An integration module (combining module) can receive input from the eye tracking module, the text-to-text translation, the ASR and the text-to-speech translation, evaluating and combining the different inputs when appropriate.
Thus the method according to the present invention may be a computer implemented method, implemented by the present computer programme.
The modules of the computer programme are preferably selected and programmed to assist in the method according to the invention. A module may be a specific algorithm, an independent programme or a programme dependent on or forming part of a larger/other programme.
As for the method, the program modules performing the machine translation and the voice recognition may be previously known or developed specifically for the present computer programme.
In the following, the above discussion of the method steps equally applies to the discussion of the computer program and vice versa.
Preferably the computer program also contains a translation prediction module, generating partial translation predictions and translation completions.
As known from the corresponding method the computer program preferably comprises an incremental learning module for speech recognition, to learn a speaker and/or lexicon model during translation production.
Also the computer program can comprise a gaze reception module (an eye tracker). The computer program may advantageously comprise a gaze-to-word mapping module, mapping gaze sample points onto the words and symbols looked at. The gaze mapping module may receive input from the gaze reception module and, based hereon, estimate what word, symbol or the like the translator is looking at at a given point in time.
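A minimal sketch of such a gaze-to-word mapper is shown below, assuming the on-screen words are available as bounding boxes; the layout, names and sample numbers are illustrative:

```python
# Minimal sketch (illustrative): map each gaze sample point to the on-screen
# word whose bounding box is nearest to it (distance 0 if the sample is inside).
from dataclasses import dataclass

@dataclass
class WordBox:
    word: str
    x0: float
    y0: float
    x1: float
    y1: float

    def distance(self, gx: float, gy: float) -> float:
        dx = max(self.x0 - gx, 0.0, gx - self.x1)
        dy = max(self.y0 - gy, 0.0, gy - self.y1)
        return (dx * dx + dy * dy) ** 0.5

def map_gaze_to_words(gaze_samples, word_boxes):
    """Return, for each (x, y) gaze sample, the word looked at."""
    return [min(word_boxes, key=lambda box: box.distance(gx, gy)).word
            for gx, gy in gaze_samples]

if __name__ == "__main__":
    boxes = [WordBox("The", 10, 100, 40, 115),
             WordBox("defendant", 45, 100, 120, 115),
             WordBox("waived", 125, 100, 175, 115)]
    print(map_gaze_to_words([(30, 104), (80, 112), (150, 99)], boxes))
    # -> ['The', 'defendant', 'waived']
```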
Preferably the computer program comprises a combining module, combining the set of machine translation hypotheses and the set of sight translated word hypotheses in order to arrive at a translation which is improved compared to the hypotheses obtained from each of the machine translation and automated speech recognition systems.
The computer program may comprise a synchronization module, synchronizing the set of spoken and machine translated word hypotheses with the input from the gaze-to- word mapping module and hereby help the synchronization of the speech recognition and the machine translations.
Also the computer program comprises a translation prediction module, generating partial translation predictions and translation completions, allowing the translator to get information on suggested translations of words, complete or partial sentences and/or text fragments.
Preferably the computer program comprises an incremental learning module for speech recognition, to learn a speaker model during translation production. The incremental learning module helps optimize the speech recognition to produce an improved recognition of the words spoken by the translator. The fact that it is possible to combine the speech recognition with the machine text to text translation makes it possible to significantly increase the level of learning compared to known systems.
Advantageously the computer programme comprises a translation hypotheses output module, outputting at least part of the translation hypotheses. This module outputs at least part of the translation hypotheses which makes it possible for e.g. the translator to take it into account when performing the spoken translation.
The computer programme may also comprise a word hypotheses output module, outputting at least part of the word hypotheses. It is also possible that at least part of the word hypotheses are outputted on e.g. the computer screen in order for e.g. the translator to follow how the speech recognition system recognizes the spoken words. Preferably the computer programme comprises one or more interaction modules allowing interaction between the computer programme and the user by at least keyboard, voice instructions and/or mouse. Other interaction modules are equally applicable, such as touch pads, track balls, electronic pens etc., as long as they provide an input relevant for the present method and/or computer programme. In some embodiments the one or more interaction modules enable the user to at least approve, modify and/or reject suggested translation hypotheses. When the translator is able to approve/reject suggested translations, the translation process may become faster and produce fewer errors, and the machine translation and/or speech recognition system may learn more than if the system only outputs the final translated text. The rejection/approval may be carried out by the one or more interaction modules.
The method and the computer programme implementing and assisting the method according to the invention may be executed by and in interaction with a system. Thus the present invention also relates to a system for assisting in the translation of a text, said system comprising
Processor, Computer screen, Interaction means, Microphone, and
The computer programme according to claims 17 - 29 as described above.
Where the interaction means are chosen from the group at least comprising keyboard, mouse, touch pad, voice recognition and/or eye tracker.
The method and the modules of the computer programme are described in further detail below. The underlying techniques of most mainstream statistical MT (SMT) systems are derived from and therefore similar to ASR systems: a graph of word proposals is combined with a (target) language model to find the most likely translation/sentence.
While in the case of ASR the graph consists of possible phones or words which match the acoustic input of the speech signal, in the case of SMT the graph consists of words which correspond to possible translations of the source text. In both cases a huge search space is generated, and heuristics are applied to find a likely path through the graph. In both cases, however, the available techniques cannot guarantee finding the optimum or best path, which leads to recognition and translation errors in ASR and SMT systems respectively. As a result, the output of MT and ASR systems needs to be post-edited, which leads to the undesirable situation of post-editing machine-generated texts. A tight integration of ASR and MT systems, as obtained by the present method and computer program, will however increase the speech recognition rate and the translation precision, reduce the post-editing effort and at the same time has the potential to lead to more satisfying working conditions for translators, as s/he will speak his/her translation instead of post-editing MT output.
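To make the shared search structure described above concrete, the following Python sketch shows, under simplified assumptions, how a graph of word proposals can be combined with a (target) language model score and searched with a small beam; the toy lattice, the bigram probabilities and the beam width are invented for illustration and do not reflect any particular ASR or SMT implementation.

```python
import math

# Toy word lattice: position -> list of (word, model_cost); lower cost is better.
lattice = [
    [("the", 0.1), ("a", 0.3)],
    [("cat", 0.2), ("cut", 0.4)],
    [("sleeps", 0.3), ("slips", 0.5)],
]

# Toy bigram language model probabilities (illustrative values).
bigram = {("the", "cat"): 0.6, ("a", "cat"): 0.3, ("cat", "sleeps"): 0.7}

def lm_cost(prev: str, word: str) -> float:
    # negative log probability, with a flat floor for unseen bigrams
    return -math.log(bigram.get((prev, word), 0.01))

def beam_search(lattice, beam_width=2):
    # each hypothesis is (accumulated cost, list of words)
    beam = [(0.0, ["<s>"])]
    for options in lattice:
        expanded = []
        for cost, words in beam:
            for word, word_cost in options:
                total = cost + word_cost + lm_cost(words[-1], word)
                expanded.append((total, words + [word]))
        # prune: keep only the most promising partial paths
        beam = sorted(expanded, key=lambda h: h[0])[:beam_width]
    best_cost, best_words = min(beam, key=lambda h: h[0])
    return best_words[1:], best_cost

if __name__ == "__main__":
    print(beam_search(lattice))  # -> (['the', 'cat', 'sleeps'], <cost>)
```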
ASR and MT systems have a complementary set of weaknesses and strengths: ASR systems often fail to correctly recognize homonyms ("have to go back" may be recognized as "have two go back", "the countries" as "the country is", "loss" as "laws" etc.), they may wrongly decode hesitations ("uuh happened" can be transcribed as "are happened") and pronunciation flaws ("a distrust" as "at this trust", "sends" as "since" etc.). MT systems produce completely different types of errors, such as wrong lexical choices (synonyms), false collocations, wrong semantic and syntactic interpretation, as well as erroneous re-ordering of target language words. ASR and MT systems are, however, strong where the respective other system is weak: ASR systems have no problems with synonyms, syntactic choices, or with word order, while for MT systems pronunciation and hesitation of the spoken language are not an issue.
Thus, according to the method and computer program of the present invention, it is possible to integrate the search spaces of the ASR and SMT systems into a computer-assisted sight translation system, so as to balance the advantages and/or cancel out the weaknesses of the two systems.
This means that even though standard ASR and MT systems are used, the collective translation is significantly improved by the combination according to the present invention compared to ASR and/or MT output taken individually.
There are various methods, such as decision trees, beam and/or A* search algorithms etc., which are used in SMT and ASR systems to traverse only a subset of the search space, to avoid the computation of all possible solutions and thus to make a computationally hard problem computable. The algorithm may traverse the search graph incrementally, producing an output string based on the costs and weights of the prefix already produced and possibly also the cost of the remaining suffix, thereby pruning most of the less promising paths. The weights of each state depend on a number of (presumably) independent models which may be trained on different types of data. In the case of SMT, the data and the derived models include translation models derived from bi-texts, language models derived from source and target language texts, term bases and dictionaries, as well as PoS taggers, parsers etc. In the case of ASR, the models are based on formant analyses, MFCC vectors, phone transcriptions of dictionary entries and language models of training sentences, as well as the mapping of phone sequences onto words and/or sentences. The parameters of these heterogeneous models may be integrated and balanced using, for each model m ∈ M, a weight λ_m. The sum over the weighted model scores is known as "log-linear" integration, as the logarithm of the model probabilities, h_m ≈ log(p_m), may be used, so that the combined score is: Σ_{m ∈ M} λ_m · h_m
This log-linear framework offers the possibility to integrate ASR models, SMT models, and/or reading models in a seamless manner into an ASR-MT system, which is likely to balance recognition errors and translation errors in sight translation.
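A minimal sketch of this log-linear integration is shown below; the three feature models (an ASR acoustic score, an SMT translation score and a reading/gaze score) and their weights are invented for illustration and are not the parameters of any actual system.

```python
import math

def log_linear_score(hypothesis, models, weights):
    """Combine heterogeneous model probabilities into one score:
    sum over m of lambda_m * log(p_m(hypothesis))."""
    return sum(weights[name] * math.log(models[name](hypothesis))
               for name in models)

# Illustrative feature models returning pseudo-probabilities for a hypothesis.
models = {
    "asr_acoustic": lambda h: h["p_acoustic"],
    "smt_translation": lambda h: h["p_translation"],
    "reading_model": lambda h: h["p_gaze"],
}
weights = {"asr_acoustic": 1.0, "smt_translation": 0.8, "reading_model": 0.5}

hyp_a = {"text": "the countries", "p_acoustic": 0.4, "p_translation": 0.7, "p_gaze": 0.6}
hyp_b = {"text": "the country is", "p_acoustic": 0.5, "p_translation": 0.2, "p_gaze": 0.3}

best = max([hyp_a, hyp_b], key=lambda h: log_linear_score(h, models, weights))
# the SMT and gaze evidence outweigh the slightly better acoustic score
print(best["text"])  # -> "the countries"
```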
Thus, according to the present method, computer program and system, a text is translated from a source language into a text in the target language in that a translator is presented with a source text on a computer screen. An ASR-MT system runs in the background, preferably pre-processes the translation, and waits for the translator to speak his/her translation, while preferably an eye tracker follows reading activities on the screen. As the translator speaks the translation into a microphone, the ASR-MT system or gaze-enabled ASR-MT system produces - preferably in real time - translation transcriptions in a target window. Due to its multi-modal input, the source text and the spoken translation, the ASR-MT system produces better translations than each of the systems in isolation, thus reducing post-editing effort and leaving the translator as the main actor in the process.
In addition, as the ASR-MT system may listen to the translator and knows which sentence is currently being worked on, translation proposals - which the MT system has internally generated - can optionally be plotted in a separate window on the screen so as to support the translator when s/he is in need of e.g. terminological translations, paraphrases etc.
Exemplification: The business situation
The translation industry is desperately looking for suitable methods to integrate more advanced computer technologies into the translation process. On the one hand, computer-assisted translation aids are available and used in many offices, and companies are pushing translators to use the technology in the hope of accelerating translation flow-through and thus producing higher added value. Many companies are willing to investigate and invest in new technologies even for an estimated 10% increase in translation production.
On the other hand, the translation technology is not (yet) ripe, so that each project or translation job needs a lot of preparation and tuning of the computer system(s) and background resources, and/or produces output that is troublesome to correct. In addition, many companies have out-sourced their translation needs, and many freelance translators are not willing or do not see the benefit of using the technology. Even if post-editing is, in the case of e.g. technical translation, quicker than from-scratch translation, the main obstacle for wider deployment of more advanced computer-assisted translation technology is currently the translators, who refuse to act as an appendix and post-editor of machine translation output.
Since it cannot be expected that translators will accept their role as machine-translation post-editors in the near future, a major break-through in the translation industry can only be expected if a better integration of the translator and the technology can be achieved, if better computer interfaces are produced and/or if machine translation systems produce better output. The ASR-MT technology, and the gaze-enabled ASR-MT technology, is a completely new and unexplored approach to computer-assisted translation, which promises better human-computer integration in translation and the potential to produce much faster translation. However, it also requires a completely new translation technique, speaking instead of typing, which, perhaps, can even be more fun for some translators.
Description of the drawings

The invention will in the following be described in greater detail with reference to the accompanying drawings:
Fig. 1 a schematic flow diagram of the method
Fig. 2 a second schematic view of the method
Fig. 3 a schematic view of the method comprising the eye tracker steps
Fig. 4 an alternative schematic view of the method known from fig. 2
Fig. 5 a schematic view of a method including automatic speech synthesis.
Fig 6 shows a schematic view of models for integration of ASR and MT systems
Fig 7 shows a schematic view of integration of ASR, MT and gaze data
Fig 8a and 8b show examples of the results from gaze data
Detailed description of the invention
The figures are included to exemplify the present invention and are not to be construed as limiting to the invention.
Fig. 1 shows a schematic flow diagram of the basic principles of the present invention. The translator reads a text in the source language and speaks out a translation in the target language (arrows A and B). The automatic speech recognition system processes the acoustic input and outputs a set of phone, diphone, syllable or word hypotheses (arrow C). Parallel to the spoken translation process, simultaneous or not, the source language text is sent to the machine translation system (arrow D), which processes the text and outputs a set of translation hypotheses (arrow F) based on a word translation graph/translation lattice (arrow E). The word hypotheses and the translation hypotheses are combined in a search module, for example as described above, to produce the final translated target language text (arrow G).
Fig. 2 shows how the method known from fig. 1 is performed when the translator reads the text on a computer screen.
A machine translation system generates a graph of word translation proposals (a translation lattice). Parallel to this, simultaneous or not, the translator reads the source text and speaks his translation into a microphone. An ASR system produces phone, diphone, syllable and/or word hypotheses from the translator's spoken input, and a search algorithm takes this information into account when retrieving the best translation from the word graph (translation lattice).
The translation lattice functions thus as an informed language model for the speech recognition system.
Optionally parts of the translation lattice are plotted in the form of partial translation options on the screen.
If the translation lattice or parts of the translation lattice are plotted on a screen it is possible for the translator to consult the automatically generated translation proposals where necessary.
It is also possible that the search module can predict the most likely continuation(s) of the translation, which the translator may accept e.g. by speaking them out, or through keyboard shortcuts.

Fig. 3 shows a diagram of the method and computer program known from fig. 1 and fig. 2, including the eye tracker option.
The eye tracker registers where on the computer screen the translator is looking. The information on where the translator is looking is then processed to make a hypothesis about what part of the source text is being read by the translator at a given time. The information from the eye tracking may be output as hypotheses of reading progression, which may be combined with the word hypotheses and translation hypotheses to e.g. synchronize the speech recognition output and the translation graph, and/or to trigger reactive translation assistance.
The accepted target text can then be fed back into the automatic speech recognition system to incrementally learn/refine the speaker model and/or the system lexicon and to enhance the speech recognition rate. It can also be fed back into the machine translation system to enhance machine translation predictions.
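The feedback loop can be pictured roughly as in the Python sketch below, where an accepted target sentence is used to update word counts in a simple lexicon/speaker model; real ASR adaptation works on acoustic and language model parameters, so this is only a schematic illustration with invented structures.

```python
from collections import Counter

class IncrementalLexicon:
    """Very simplified stand-in for a speaker/lexicon model that is
    refined incrementally from accepted (verified) target text."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, accepted_target_text: str) -> None:
        # every accepted sentence increases the evidence for its words
        self.counts.update(accepted_target_text.lower().split())

    def probability(self, word: str) -> float:
        total = sum(self.counts.values())
        # add-one smoothing so unseen words keep a small probability
        return (self.counts[word] + 1) / (total + len(self.counts) + 1)

lexicon = IncrementalLexicon()
lexicon.learn("the translator accepted this translation")
lexicon.learn("the translation was accepted again")
print(lexicon.probability("translation"))  # grows as more text is accepted
print(lexicon.probability("unseen"))       # stays small
```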
Fig. 4 shows a diagram of the system known from fig. 2 in which the "verified text" is shown as a separate box to allow a clear illustration of the interactions with the human translator, the ASR module and the MT module.
Fig. 5 shows a diagram of a method according to the present invention wherein an integrated ASR-TTS system and an integrated ASR-MT system interact to optimize the translation. The basic system is known from figures 1 - 4.
The MT and ASR systems interact as indicated by the module "integrated Word and Phone Graph", and the output is integrated with the gaze-to-word mapping, based on information from the eye tracker, by an integration module, in this case a multi-modal integration module.
The integrated ASR-TTS system creates a synthesized spoken version of the target text which may help the translator evaluate the translated text. Preferably the ASR-TTS system is an online trained shared resource.
The translator's interaction with the computer system is also illustrated in this figure. The translator speaks a translation of the source text, which is recognized by the ASR module. The translator can read suggestions for completions etc. on the screen and may revise the target text. Further, the translator can hear a synthesized spoken version of the target text in order to evaluate aspects of the target text other than what would be recognizable from the written target text.
Fig 6 shows an exemplary model for tight integration of ASR and MT systems. ASR and MT systems and some possible combinations are known from the general description and the previous figures. Some combinations of ASR and SMT systems according to the present invention integrate both components at translation time. In such a tightly runtime-integrated ASR-SMT system, synchronization of the SMT output and the spoken translation can be achieved either by guiding the translator to speak and edit the translation of a currently selected and/or highlighted segment in the GUI (similar to TMs) and/or by using a gaze capturing device as a cue to which segment is currently being spoken. In the latter case, synchronization of the SMT output and the spoken translation relies on the assumption that it is likely for a translator to speak out the translation of a source text sentence (or sentence fragment) which has just been or is currently being read. A reading model triggers activation of entries in a bilingual lexicon, which increases the likelihood for these items to be recognized in the speech signal.
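One way to picture the reading-model activation described for fig. 6 is sketched below in Python: fixated source words activate their entries in a small bilingual lexicon, and the activated target words receive a boosted prior in the recognizer. The lexicon contents, the boost factor and the scoring scheme are purely illustrative assumptions.

```python
# Illustrative sketch: fixated source words activate bilingual lexicon entries,
# boosting the recognition prior of their candidate translations.

bilingual_lexicon = {          # source word -> possible target words (toy data)
    "hus": ["house", "building"],
    "oversaettelse": ["translation"],
}

base_prior = 0.01              # default prior for any target word
activation_boost = 5.0         # factor applied to gaze-activated entries

def activated_priors(fixated_source_words):
    """Return a dict of target-word priors, boosted for words whose
    source counterpart has just been read (fixated)."""
    priors = {}
    for src in fixated_source_words:
        for tgt in bilingual_lexicon.get(src, []):
            priors[tgt] = base_prior * activation_boost
    return priors

def score_word(word, priors):
    # the recognizer would multiply its acoustic score by this prior
    return priors.get(word, base_prior)

fixations = ["hus"]                     # the translator has just read "hus"
priors = activated_priors(fixations)
print(score_word("house", priors))      # boosted: 0.05
print(score_word("horse", priors))      # not activated: 0.01
```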
Fig 7 shows a scheme for integrating ASR and MT systems with gaze data. The basic system combining ASR and MT can be as known from the previous figures. The gaze-enabled ASR-MT system captures gaze data from the eye tracker, which may be instrumental in detecting particular translation problems. For instance, increased fixation durations on certain words, or refixations of segments, may reveal well-defined translation problems. The real-time integrated ASR-MT system with gaze-tracking support may react to the detected translation problem by plotting targeted assistance, before the translator becomes aware of the nature of the translation problem. Taking into account the automatically generated suggestions, the translator may then keep on producing translations without much delay, by re-speaking or accepting the proposed translations.
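A rough illustration of how increased fixation durations or refixations could flag a translation problem is given in the Python sketch below; the thresholds and the fixation format are assumptions chosen for the example, not values taken from the invention.

```python
from collections import Counter

# Each fixation: (word, duration in milliseconds); toy data for illustration.
fixations = [
    ("the", 180), ("contract", 650), ("is", 150),
    ("terminated", 420), ("contract", 500),
]

LONG_FIXATION_MS = 400    # assumed threshold for an unusually long fixation
REFIXATION_LIMIT = 2      # assumed threshold for repeated visits to a word

def detect_problem_words(fixations):
    """Return words that probably cause translation problems, based on
    long fixation durations and/or refixations."""
    visits = Counter(word for word, _ in fixations)
    long_fix = {word for word, dur in fixations if dur >= LONG_FIXATION_MS}
    refixated = {word for word, n in visits.items() if n >= REFIXATION_LIMIT}
    return long_fix | refixated

print(detect_problem_words(fixations))  # -> {'contract', 'terminated'}
```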
It is possible for the ASR-MT system to produce and visualize translation proposals before the translation is being spoken, so as to provide targeted translation assistance at translation time. By providing translation options on the screen at that time, the translator need not look up translation proposals in third-party resources (lexica and/or online search tools) but is encouraged to use the available translation proposals. This will on the one hand reduce the translator's search time for terminology and other translation problems, and on the other hand reduce the number of OOV (out-of-vocabulary) words in the ASR system. MT-supported sight translation puts translators in a more conducive mode for good translations.
The integration of the ASR-MT may be based on the phone level. A weighted sentence-based lexicon could be used, instead of the full-fledged SMT output graph, to be combined with the ASR phone lattice in the following way. For each source sentence that is being translated, the MT system retrieves phrase translations from a bilingual statistical lexicon. The set of most likely phrase translations can be plotted on the GUI to serve the translator as translation aids. In one possible implementation, the phrase translations are also converted into a phonetic transcription and integrated into the ASR system. As the translator speaks the translation, possibly re-using the provided translation options on the screen, the ASR system produces a phone lattice from the spoken signal. This phone lattice can be mapped onto the lexicon models derived from the MT lexicon. For each source sentence, a sentence-based phrase dictionary may be generated and its phonetic forms computed. In one possible implementation, this lexicon can substitute the ASR lexicon or be combined through a joint decoding of the ASR phone lattice, the ASR internal lexicon, the phonetic phrase transcriptions (as obtained from the statistical lexicon), and a target language model.
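The phone-level matching described above can be illustrated with the Python sketch below: phonetic transcriptions of the MT system's phrase translations (here simply hard-coded toy data standing in for a real grapheme-to-phoneme step) are compared against the recognizer's phone sequence using a Levenshtein distance, and the closest phrase is selected. The transcriptions, the phone inventory and the use of a plain edit distance are assumptions for the example, not the actual MT or ASR components.

```python
def levenshtein(a, b):
    """Standard edit distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Toy phonetic transcriptions of phrase translations retrieved by the MT system
# for one source sentence (invented stand-in for a real G2P module).
phrase_phones = {
    "the house": "DH AH HH AW S".split(),
    "the horse": "DH AH HH AO R S".split(),
}

def best_phrase(asr_phones):
    """Map the recognizer's phone sequence onto the closest MT phrase."""
    return min(phrase_phones, key=lambda p: levenshtein(phrase_phones[p], asr_phones))

recognised = "DH AH HH AW S".split()     # phones produced by the ASR system
print(best_phrase(recognised))           # -> "the house"
```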
Fig 8a and 8b illustrate some aspects of the use of gaze data. The gaze data from the eye tracker are indicated by circles across the text. As can be seen, the distance between the circles is not constant, and the vertical position drifts from start to end over a text passage. Various assumptions and theories can be applied depending on the detail and/or type of the needed gaze data. In the present example the assumed reading line is indicated by grey field 1, i.e. it covers the main part of the first line of the text shown.
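The drift handling illustrated by fig. 8 can be approximated in code as in the sketch below, where each gaze sample is assigned to the text line with the nearest vertical centre after subtracting a linearly growing drift estimate; the line positions, drift rate and sample data are illustrative assumptions only, not the reading model of the invention.

```python
# Illustrative sketch: assign gaze samples to text lines while compensating
# for a vertical drift that grows linearly over the reading of a passage.

line_centres_y = [100, 130, 160]      # assumed vertical centres of three text lines
drift_per_sample = 0.5                # assumed drift in pixels per gaze sample

def assign_lines(gaze_samples_y):
    """Return, for each gaze sample, the index of the most likely text line."""
    assigned = []
    for i, y in enumerate(gaze_samples_y):
        corrected = y - drift_per_sample * i          # undo the estimated drift
        line = min(range(len(line_centres_y)),
                   key=lambda k: abs(line_centres_y[k] - corrected))
        assigned.append(line)
    return assigned

samples = [101, 103, 104, 106, 108, 109]   # raw gaze y-coordinates drifting downwards
print(assign_lines(samples))                # -> [0, 0, 0, 0, 0, 0] despite the drift
```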
Claims
1. Translation method comprising the steps of a translator speaking a translation of a written source text in a target language, an automatic speech recognition system converting the spoken translated text into a set of phone/diphone/syllable and/or word hypotheses in the target language, a machine translation system translating the written source text into a set of translation hypotheses in the target language, and combining the set of word hypotheses and the set of translation hypotheses obtaining a target text in the target language.
2. Method according to claim 1, wherein the set of translation hypotheses and the set of word hypotheses are combined by a search.
3. Method according to any of the preceding claims, wherein an eye tracking device captures the reading progression and/or behaviour of the translator.
4. Method according to any of the preceding claims, comprising synchronizing the spoken translations and the machine translation based on an analysis of the reading progression and/or behaviour.
5. Method according to any of the preceding claims, comprising combining hypotheses of the reading progression and/or behaviour, the set of spoken target word hypotheses, and the word translation graph to predict and complete the translated segment.

6. Method according to any of the preceding claims, wherein the reading progression, the spoken and/or the machine translation hypotheses are used to propose and predict translation completions.
7. Method according to any of the preceding claims wherein the search is based on log-linear combination of weighted feature models.
8. Method according to any of the preceding claims wherein the machine translation system is a trainable statistical machine translation system.
9. Method according to any of the preceding claims wherein the automatic speech recognition system is a trainable speech recognition system.
10. Method according to any of the preceding claims wherein at least some of the translation hypotheses are made available to the person when speaking the translation.
11. Method according to any of the preceding claims comprising a step wherein the machine translation system suggests at least parts of a target text and wherein said suggested target text may be accepted, ignored or modified by the person.
12. Method according to any of the preceding claims comprising a step wherein the machine translation system learns and adapts the set of translation hypotheses based on the produced translations.
13. Method according to any of the preceding claims comprising a step wherein the automatic speech recognition system incrementally learns a speaker model based on the produced translations.
14. Method according to any of the preceding claims comprising a step wherein the machine translation system adapts the set of translation hypotheses based on the set of word hypotheses.
15. Method according to any of the preceding claims comprising a step wherein the automatic speech recognition system adapts the set of word hypotheses based on the set of translation hypotheses.
16. Method according to any of the preceding claims comprising a step wherein a Text to Speech (TTS) module synthesises a spoken version of at least a part of the target text.
17. Computer program assisting a translation, wherein said computer programme comprises the following modules
- an audio reception module,
- an audio conversion module, converting audio input from the audio reception module to a set of word hypotheses,
- a text conversion module, converting text input from the text reception module to a set of translation hypotheses,
- a combining module, combining the set of translation hypotheses and the set of word hypotheses, and
- an output module, outputting a target text based on the combination of the set of translation hypotheses and the set of word hypotheses.
18. Computer program according to claim 17 comprising a translation prediction module, generating partial translation predictions and translation completions.
19. Computer program according to claims 17 - 18 comprising an incremental
learning module for speech recognition, to learn a speaker and/or a lexicon model during translation production.
20. Computer program according to claims 17 - 19 comprising a gaze reception module (an eye tracker).
21. Computer program according to claims 17 - 20 comprising a gaze-to-word
mapping module, mapping gaze sample points onto the words and symbols looked at.
22. Computer program according to claims 17 - 21 comprising a combining module, combining the set of machine translation word hypotheses and the set of sight translated word hypotheses.
23. Computer program according to claims 17 - 22 comprising a synchronization module, synchronizing the set of spoken and machine translated word hypotheses.
24. Computer program according to claims 17 - 23 comprising a Text to Speech (TTS) module for synthesising a spoken version of at least a part of the target text.
25. Computer programme according to claims 17 - 24 comprising a translation
hypotheses output module, outputting at least part of the translation hypotheses.
26. Computer programme according to claims 17 - 25 comprising a word hypotheses output module, outputting at least part of the word hypotheses.
27. Computer programme according to claims 17 - 26 comprising one or more interaction modules allowing interaction between computer programme and user by at least keyboard, voice instructions, touch screen and/or mouse.
28. Computer programme according to claims 17 - 27 wherein the one or more interaction modules enables the user to at least approve, modify and/or reject suggested translation hypotheses.
29. Computer programme according to claims 17 - 28 programmed to assist in the method according to claims 1 - 16.
30. System for assisting in the production of a translation of a text, said system comprising
processor, storage means, computer screen, interaction means, microphone, and the computer programme according to claims 17 - 29.
31. System according to claim 30 wherein the interaction means are chosen from the group of keyboard, mouse, touch pad, touch screen, voice recognition and/or eye tracker.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DKPA201170669 | 2011-12-05 | ||
DKPA201170669 | 2011-12-05 | ||
DKPA201270351 | 2012-06-21 | ||
DKPA201270351 | 2012-06-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013083132A1 true WO2013083132A1 (en) | 2013-06-13 |
Family
ID=47471445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/DK2012/050445 WO2013083132A1 (en) | 2011-12-05 | 2012-12-05 | Translation method and computer programme for assisting the same |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2013083132A1 (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9412365B2 (en) | 2014-03-24 | 2016-08-09 | Google Inc. | Enhanced maximum entropy models |
DE102015006662A1 (en) | 2015-05-22 | 2016-11-24 | Audi Ag | Method for configuring a voice control device |
US9842592B2 (en) | 2014-02-12 | 2017-12-12 | Google Inc. | Language models using non-linguistic context |
US10134394B2 (en) | 2015-03-20 | 2018-11-20 | Google Llc | Speech recognition using log-linear model |
CN110263348A (en) * | 2019-03-06 | 2019-09-20 | 腾讯科技(深圳)有限公司 | Interpretation method, device, computer equipment and storage medium |
US10832664B2 (en) | 2016-08-19 | 2020-11-10 | Google Llc | Automated speech recognition using language models that selectively use domain-specific model components |
WO2021225728A1 (en) * | 2020-05-08 | 2021-11-11 | Zoom Video Communications, Inc. | Incremental post-editing and learning in speech transcription and translation services |
CN114239613A (en) * | 2022-02-23 | 2022-03-25 | 阿里巴巴达摩院(杭州)科技有限公司 | Real-time speech translation method, device, device and storage medium |
- 2012-12-05: WO PCT/DK2012/050445 patent/WO2013083132A1/en active Application Filing
Non-Patent Citations (7)
Title |
---|
ALAIN DÉSILETS ET AL: "Evaluating Productivity Gains of Hybrid ASR-MT Systems for Translation Dictation", 2008 INTERNATIONAL WORKSHOP ON SPOKEN LANGUAGE TRANSLATION, 1 January 2008 (2008-01-01), Hawaii, US, pages 158 - 165, XP055037788 * |
BARBARA DRAGSTED ET AL: "Speaking your translation: students' first encounter with speech recognition technology", TRANSLATION & INTERPRETING; THE INTERNATIONAL JOURNAL FOR TRANSLATION & INTERPRETING RESEARCH, vol. 3, no. 1, 1 January 2011 (2011-01-01), pages 10 - 43, XP055037783 * |
BROWN P F ET AL: "AUTOMATIC SPEECH RECOGNITION IN MACHINE-AIDED TRANSLATION", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 8, no. 3, 1 July 1994 (1994-07-01), pages 177 - 187, XP000505679, ISSN: 0885-2308, DOI: 10.1006/CSLA.1994.1008 * |
DESILETS ET AL., PROCEEDINGS OF IWSLT, 2008 |
E. VIDAL ET AL: "Computer-assisted translation using speech recognition", IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, vol. 14, no. 3, 1 May 2006 (2006-05-01), pages 941 - 951, XP055037791, ISSN: 1558-7916, DOI: 10.1109/TSA.2005.857788 * |
MT. REDDY ET AL., IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, vol. 18, no. 8, November 2010 (2010-11-01) |
REDDY A ET AL: "Integration of Statistical Models for Dictation of Document Translations in a Machine-Aided Human Translation Task", IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, USA, vol. 18, no. 8, 1 November 2010 (2010-11-01), pages 2015 - 2027, XP011300062, ISSN: 1558-7916, DOI: 10.1109/TASL.2010.2040793 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9842592B2 (en) | 2014-02-12 | 2017-12-12 | Google Inc. | Language models using non-linguistic context |
US9412365B2 (en) | 2014-03-24 | 2016-08-09 | Google Inc. | Enhanced maximum entropy models |
US10134394B2 (en) | 2015-03-20 | 2018-11-20 | Google Llc | Speech recognition using log-linear model |
DE102015006662A1 (en) | 2015-05-22 | 2016-11-24 | Audi Ag | Method for configuring a voice control device |
US10832664B2 (en) | 2016-08-19 | 2020-11-10 | Google Llc | Automated speech recognition using language models that selectively use domain-specific model components |
US11557289B2 (en) | 2016-08-19 | 2023-01-17 | Google Llc | Language models using domain-specific model components |
US11875789B2 (en) | 2016-08-19 | 2024-01-16 | Google Llc | Language models using domain-specific model components |
CN110263348A (en) * | 2019-03-06 | 2019-09-20 | 腾讯科技(深圳)有限公司 | Interpretation method, device, computer equipment and storage medium |
WO2021225728A1 (en) * | 2020-05-08 | 2021-11-11 | Zoom Video Communications, Inc. | Incremental post-editing and learning in speech transcription and translation services |
CN114239613A (en) * | 2022-02-23 | 2022-03-25 | 阿里巴巴达摩院(杭州)科技有限公司 | Real-time speech translation method, device, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4481972B2 (en) | Speech translation device, speech translation method, and speech translation program | |
US11093110B1 (en) | Messaging feedback mechanism | |
WO2013083132A1 (en) | Translation method and computer programme for assisting the same | |
Watanabe et al. | The 2020 espnet update: new features, broadened applications, performance improvements, and future plans | |
WO2020146873A1 (en) | System and method for direct speech translation system | |
JP2001100781A (en) | Method and device for voice processing and recording medium | |
US11900072B1 (en) | Quick lookup for speech translation | |
Ciobanu et al. | Speech recognition and synthesis technologies in the translation workflow | |
Kandhari et al. | A voice controlled e-commerce web application | |
AU2022203531B1 (en) | Real-time speech-to-speech generation (rssg) apparatus, method and a system therefore | |
US9218807B2 (en) | Calibration of a speech recognition engine using validated text | |
Sobti et al. | Comprehensive literature review on children automatic speech recognition system, acoustic linguistic mismatch approaches and challenges | |
Palivela et al. | Code-Switching ASR for Low-Resource Indic Languages: A Hindi-Marathi Case Study | |
Hashimoto et al. | Impacts of machine translation and speech synthesis on speech-to-speech translation | |
Zhou et al. | Optimization of cross-lingual voice conversion with linguistics losses to reduce foreign accents | |
Koo et al. | KEBAP: Korean error explainable benchmark dataset for ASR and post-processing | |
Hashimoto et al. | An analysis of machine translation and speech synthesis in speech-to-speech translation system | |
Bansal et al. | On Improving Code Mixed Speech Synthesis with Mixlingual Grapheme-to-Phoneme Model. | |
Behre et al. | Smart speech segmentation using acousto-linguistic features with look-ahead | |
Rudzionis et al. | Web services based hybrid recognizer of Lithuanian voice commands | |
Guo et al. | The hw-tsc’s speech to speech translation system for IWSLT 2022 evaluation | |
Mujadia et al. | Disfluency processing for cascaded speech translation involving English and Indian languages | |
Koo et al. | Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline | |
Kraljevski et al. | Preserving language heritage through speech technology: The case of Upper Sorbian | |
Dungavath et al. | Self-Training and Error Correction using Large Language Models for Medical Speech Recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 12809097 Country of ref document: EP Kind code of ref document: A1 |
122 | Ep: pct application non-entry in european phase |
Ref document number: 12809097 Country of ref document: EP Kind code of ref document: A1 |