HK1018647A - Method for translating cultural subtleties in machine translation - Google Patents
- Publication number: HK1018647A (application HK99103567.9A)
- Authority: HK (Hong Kong)
- Prior art keywords: language, sentence, source language, target language, source
Description
The present invention relates to computer translation from a source language to a target language, and more particularly to machine translation that takes cultural nuances into account when translating.
U.S. Patent No. 5224040 (hereinafter sometimes referred to as the "first patent"), by the present inventor, discloses a machine that translates a source language (e.g., Chinese) into a target language (e.g., English). Because Chinese sentences are written as character strings, and different words can be formed by grouping the characters in different ways, the method disclosed in the first patent comprises the steps of: inputting a string of Chinese characters; segmenting the character string to find the character groups that form words and idioms; and translating these words into the target language, resulting in the original translation.
In a second-generation machine, disclosed in the inventor's U.S. Patent No. 5384702 (hereinafter sometimes referred to as the "second patent"), grammatical rules and self-correcting rules are applied to polish the original translation.
However, the syntactic structure and sentence composition of the source language reflect the cultural influences on that language over the many centuries of its development. The same is true of the target language. Thus, there is typically a considerable difference in syntactic structure and sentence composition between any given source and target languages. The machines and methods disclosed in the first and second patents do not take this difference into account. As a result, even after the original sentence is formed by the machine of the first patent, and even after grammatical identifiers are found and the grammatical and self-correcting rules of the second patent are applied to polish it, the resulting finished sentence may lack the syntactic structure and sentence components of the target language. In other words, the grammatical rules and self-correcting rules alone may not be sufficient to bridge the differences in syntactic structure and sentence components between languages that lack a common cultural background.
What is needed, then, is a third generation of machines that takes into account the differing syntactic structures and sentence components of the source and target languages when modifying the original translation, so that the finished translation uses the syntactic structure and sentence components of the target language instead of those of the source language.
Existing translation computers cannot convert sentences having the syntax and structure of a source language into sentences having the syntax and structure of a target language. Conventional wisdom has held that such a capability is possible only if every possible sentence of the source language is matched in computer memory with a predetermined corresponding sentence of the target language. In other words, linguists would be required to match all possible sentences between the two languages, and the computer would simply output the target-language sentence that matches the source-language sentence. Such a scheme has clearly been considered impractical, because the number of sentences that can occur in any one language is unlimited.
The general purpose translation machines known so far are not capable of translating an unlimited number of grammatically correct source language sentences into an unlimited number of grammatically correct target language sentences, where the two languages have different respective syntaxes and sentence structures.
Moreover, at the time the present invention was made, it was not obvious to one of ordinary skill in the art, in view of the prior art as a whole, how to provide such a general-purpose translation machine.
Until now, no method had been implemented that takes cultural nuances into account in machine translation; that need is now met by a new, useful, and nonobvious invention. The present invention includes a method for analyzing, interpreting, and converting source-language syntactic structures and sentence components into the related target-language syntactic structures and sentence components. As a result, the world's first translation machine has been created that is capable of generating target-language sentences with correct syntax and sentence structure from a source language having a different syntax and sentence structure. Thus, the translation machine provided by the present invention compares favorably with a human translator fluent in both the source and target languages. Unlike a human translator, however, the machine of the present invention is not limited in the number of languages it can translate.
The new method uses Language Canonical Forms (LCFs) and Information Patterns (IPs) to translate cultural nuances by converting the thinking process of the source language into the thinking process of the target language.
Some new steps are performed after the original translation is provided by the translation machine of the first patent and before the steps performed by the translation machine of the second patent. Through the steps of the present invention, the original sentence produced by the machine of the first patent is made to reflect the syntax and sentence structure of the target language before the grammar and self-correction rules disclosed in the second patent are applied. As a result, the sentence in the target language is polished to a higher level, incorporating the cultural nuances of the target language rather than those of the source language; i.e., the quality of the sentence reaches a level that until now only a human translator could reach.
The LCF of the source language relative to the target language is a representation that uses words of the source language; however, those words are arranged according to the thinking process and sentence structure of the target language. Thus, the LCF reflects the cultural background of the target language (which may be any natural language). This ensures that the subsequent translation by the invention of the second patent will produce the highly polished sentence described above.
The first step of the new method, where the source language is Chinese, is to divide the character string into recognizable Chinese words and phrases according to the method taught in the first patent. The result is an original sentence having the syntactic structure and sentence components of the source language. The present invention then arranges the Chinese words in a predetermined English sentence structure, resulting in a Language Canonical Form (LCF).
For example, suppose the source language is Chinese and the target language is English. The first level of translation, performed in the machine disclosed in the first patent, renders "Zhe ben shu shi ta xie de." literally as "This ben book is he write de." Here "ben" is a measure word (MW), similar to "sheet" in the English expression "one sheet of paper," and "de" has no English translation; it indicates past tense and passive voice. The information pattern of the source language is pronoun (P) + MW + noun (N) + verb (V) + P + V + de. The corresponding predetermined information pattern of the target language is P + N + be + V + by + P. Thus, the LCF is "Zhe shu shi xie bei ta," the original translation of which is "This book is write by he," and the polished version of which is "This book was written by him."
Note that "Zhe shu shi xie bei ta" is a representation in which the Chinese words are arranged in an English sentence structure. Thus, the words can now be translated and the resulting sentence will be an English sentence, which only needs to be polished with the method disclosed in the second patent.
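The rearrangement in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slot encoding (`TARGET_SLOTS`), the helper names `build_lcf` and `gloss`, and the toy `GLOSSARY` are hypothetical and are not taken from the patent, which does not say how a target pattern is stored.

```python
# Segmented, tagged source sentence from the example above.
source = [("Zhe", "P"), ("ben", "MW"), ("shu", "N"), ("shi", "V"),
          ("ta", "P"), ("xie", "V"), ("de", "de")]

# Hypothetical encoding of the target pattern P + N + be + V + by + P.
# An int picks the source word at that position; a str is a source-language
# function word inserted for a slot the source sentence does not fill.
# "ben" (index 1) and "de" (index 6) are never referenced, so they drop out.
TARGET_SLOTS = [0, 2, 3, 5, "bei", 4]

# Toy word-for-word glossary (assumed; stands in for the dictionary).
GLOSSARY = {"Zhe": "This", "shu": "book", "shi": "is",
            "xie": "write", "bei": "by", "ta": "he"}


def build_lcf(tagged_words, slots):
    """Arrange source words in target order according to the slot list."""
    words = [word for word, _tag in tagged_words]
    return [words[s] if isinstance(s, int) else s for s in slots]


def gloss(lcf_words, glossary):
    """Translate the LCF word for word, giving the original translation."""
    return " ".join(glossary[w] for w in lcf_words)


lcf = build_lcf(source, TARGET_SLOTS)
print(" ".join(lcf))         # Zhe shu shi xie bei ta
print(gloss(lcf, GLOSSARY))  # This book is write by he
```

The word-for-word output is deliberately ungrammatical; polishing it is the job of the second patent's rules, which are not modeled here.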
The English expression "This book is he write" can be considered a Chinese-style English expression because it reflects the Chinese thinking process and sentence structure; i.e., as described above, the words are the translations of the Chinese words before they are arranged into the English sentence structure. For purposes of illustration, this Chinese-style English representation is referred to as LCF1; it is useful in performing automatic translation from English to Chinese, since it already has the syntax and sentence structure of Chinese. Its original translation ("Zhe shu shi ta xie") is then edited, according to the method taught in the second patent, to make it more readable.
The LCF of "This book was written by him" is "Zhe shu shi xie bei ta." This is so-called English-style Chinese: a Chinese representation that uses the English thinking process and sentence structure. It is referred to as the language canonical form LCF2. Direct translation of LCF2 yields "This book is write by he." LCF2 is useful in automatic translation from Chinese to English. This original translation is then grammatically corrected, according to the method taught in the second patent, to make it more readable.
Thus, the main idea of the new method is to generate an LCF corresponding to the input source language sentence. From the LCF, the machine proceeds to translate to the target language sentence.
Because it is impossible to store all possible sentences, and impractical even to attempt to store millions of matched sentence pairs in the source and target languages, the present invention embodies a breakthrough insight: all sentences can be represented by a limited number of information patterns.
There are, however, some similarities between Chinese and English. When the syntactic structure and sentence composition of a Chinese sentence are similar to those of the English sentence, the translation is straightforward and it is not necessary to generate an LCF. For example, the Chinese sentence "Ta du shu kuai" is translated word by word into "He read book quick" and then into "He read the book quickly." In this example, there is no need to rearrange the Chinese words into an English sentence structure because the two languages have the same sentence structure.
On the other hand, a large number of Chinese sentences differ completely in syntax and sentence components from their English translations. In such cases, as described above, applying the LCF gives the machine an improved way to produce the correct translation.
As another example, consider the Chinese sentence "Ta ba shu fang zai shu zhuo shang". Here "ba" has no English translation; translating the words one by one gives "He ba book put at desk on". "Zai shu zhuo shang" is a phrase that functions like an adverb and means "on the desk". In the source language, the information pattern (IP) of the sentence is P + ba + N + V + adverb (Adv). The corresponding predetermined information pattern of the target language is P + V + N + Adv. Inserting the Chinese words into the IP of the target language gives "Ta fang shu zai shu zhuo shang", which is the LCF of the sentence, i.e., a representation of the source-language words arranged in the order of the target language. Thus, the syntax and sentence structure of the target language have replaced those of the source language, so that the original translation of the LCF already follows the target language. In this example, the original translation of the LCF is "He put book on desk", which is then processed as disclosed in the second patent, resulting in "He put the book on the desk".
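The lookup-and-fill step in this example might look as follows in Python. This is a minimal sketch, not the patented implementation: the `KNOWLEDGE_BASE` layout, the `to_lcf` helper, and the rule "fill each target slot with the next unused source word carrying that tag" are assumptions made for illustration.

```python
# Toy knowledge base: source-language IP -> predetermined target-language IP.
KNOWLEDGE_BASE = {
    ("P", "ba", "N", "V", "Adv"): ("P", "V", "N", "Adv"),
}

# Toy glossary (assumed); the adverbial phrase is treated as one unit.
GLOSSARY = {"Ta": "He", "shu": "book", "fang": "put",
            "zai shu zhuo shang": "on desk"}


def to_lcf(tagged_words, knowledge_base):
    """Look up the target IP and fill each slot with the next unused
    source word carrying that tag; unmatched source words (here "ba")
    are left out."""
    source_ip = tuple(tag for _word, tag in tagged_words)
    target_ip = knowledge_base[source_ip]
    remaining = list(tagged_words)
    lcf = []
    for slot in target_ip:
        match = next(pair for pair in remaining if pair[1] == slot)
        remaining.remove(match)
        lcf.append(match[0])
    return lcf


sentence = [("Ta", "P"), ("ba", "ba"), ("shu", "N"),
            ("fang", "V"), ("zai shu zhuo shang", "Adv")]
lcf = to_lcf(sentence, KNOWLEDGE_BASE)
print(" ".join(lcf))                       # Ta fang shu zai shu zhuo shang
print(" ".join(GLOSSARY[w] for w in lcf))  # He put book on desk
```

Keying the knowledge base on the whole source pattern means a sentence with an unknown pattern raises `KeyError`; a fuller system would need a fallback, which the text does not describe.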
The English sentence "She read the book quickly" follows the pattern P + V + article + N + Adv. Numerous English sentences fit this information pattern, for example "He threw the ball slowly". Thus, in accordance with the teachings of the present invention, this single information pattern is stored in a knowledge base, where it represents a myriad of English sentences.
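To make the "one pattern, many sentences" point concrete, the short Python sketch below groups sentences by their information pattern. The tiny `POS` lexicon and the third sample sentence are assumptions introduced only for this illustration.

```python
from collections import defaultdict

# Toy part-of-speech lexicon (assumed for this illustration only).
POS = {"she": "P", "he": "P", "read": "V", "threw": "V", "the": "article",
       "book": "N", "ball": "N", "quickly": "Adv", "slowly": "Adv"}


def information_pattern(sentence):
    """Return the sentence's IP: its part-of-speech tags in order."""
    return " + ".join(POS[word.lower()] for word in sentence.split())


sentences = ["She read the book quickly", "He threw the ball slowly",
             "He read the ball slowly"]
by_pattern = defaultdict(list)
for s in sentences:
    by_pattern[information_pattern(s)].append(s)

for pattern, members in by_pattern.items():
    print(pattern, "<-", len(members), "sentences")
# P + V + article + N + Adv <- 3 sentences
```

All three sentences collapse onto a single stored pattern, which is why the knowledge base can stay small while the set of covered sentences grows with the vocabulary.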
The inventors have identified about 7000 pairs of Chinese and English information patterns. Therefore, when the target language is English, the memory of the third-generation machine needs to store only about 7000 pairs of information patterns instead of millions of English sentences.
After the first-generation translation machine provides the original translation, the machine of the present invention finds the information pattern of the source language, consults the knowledge base, and selects the predetermined target-language information pattern corresponding to the source-language information pattern. The words of the source language are then arranged in the order of the target-language information pattern to form an LCF, which is used to generate the original translation; this completes the work of the third-generation machine (the present invention). The machine disclosed in the second patent is then used to perform the final polishing based on grammar rules, thereby completing the translation process.
With this approach, numerous sentences can be translated, even though the knowledge base contains only thousands of information patterns in both the source and target languages.
It is therefore a primary object of the present invention to provide a translation method that converts the syntax and sentence structure of a source language into the syntax and sentence structure of a target language.
A closely related object is to advance the art of sentence translation by providing tools and methods for discovering information patterns, building a knowledge base of information patterns, and generating LCFs.
These and other important objects, features and advantages of the present invention will become apparent as the description proceeds.
The scope of the invention is set forth in the claims.
For a fuller understanding of the nature and objects of the present invention, reference should be made to the following detailed description taken together with the accompanying figures.
Fig. 1 is a flow chart of the new method, covering the first-, second-, and third-generation translation steps.
Fig. 2 is a more detailed flow chart of the third-generation method.
The new method represents a group of sentences with an Information Pattern (IP). An IP is a unified representation of a set of sentences, expressed as the arrangement of the parts of speech (POS) shared by those sentences. With respect to a given target language, the Chinese language has a number of IPs, and each Chinese sentence has one IP and one LCF.
The word "Ta" (meaning "he", "she", or "it") is frequently used in composing Chinese sentences. Thus, "Ta" yields many IPs. A typical IP is: Ta + N + V + de + Adv ("de" is an untranslated word with no English equivalent).
The sentence "Ta Ying Yu jiang de hao" can be translated word for word into "He English speak de well". The Chinese IP of this sentence is: P + N + V + de + Adv. The corresponding English IP of the English translation is: P + V + N + Adv. For each IP in the source language, there is a corresponding IP in the target language.
The LCF of the Chinese sentence is found by arranging the Chinese words in the sequence of the English IP. Thus, arranging the Chinese words in the order "P + V + N + Adv", the LCF of the Chinese sentence is "Ta jiang Ying Yu hao" ("Ta" is a pronoun, "jiang" is a verb, "Ying Yu" is a noun, and "hao" is an adverb), so the original English translation is "He speak English well"; the method of the second patent is then used to generate "He speaks English well".
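Deriving an IP from a tagged sentence is mechanical, as the following Python sketch suggests. It assumes, as the patterns written out in this description appear to do, that untranslated function words such as "de" and "ba" appear literally in the pattern; the `FUNCTION_WORDS` set and the helper name are hypothetical.

```python
# Assumed convention: these untranslated function words appear literally
# in an IP instead of being replaced by a part-of-speech tag.
FUNCTION_WORDS = {"de", "ba"}


def information_pattern(tagged_words):
    """Join the POS tags in sentence order, keeping function words as-is."""
    return " + ".join(word if word in FUNCTION_WORDS else tag
                      for word, tag in tagged_words)


examples = {
    "Ta Ying Yu jiang de hao": [
        ("Ta", "P"), ("Ying Yu", "N"), ("jiang", "V"),
        ("de", None), ("hao", "Adv")],
    "Zhe ben shu shi ta xie de": [
        ("Zhe", "P"), ("ben", "MW"), ("shu", "N"), ("shi", "V"),
        ("ta", "P"), ("xie", "V"), ("de", None)],
    "Ta ba shu fang zai shu zhuo shang": [
        ("Ta", "P"), ("ba", None), ("shu", "N"), ("fang", "V"),
        ("zai shu zhuo shang", "Adv")],
}

for text, tagged in examples.items():
    print(f"{text}: {information_pattern(tagged)}")
# Ta Ying Yu jiang de hao: P + N + V + de + Adv
# Zhe ben shu shi ta xie de: P + MW + N + V + P + V + de
# Ta ba shu fang zai shu zhuo shang: P + ba + N + V + Adv
```

Note that the multi-syllable word "Ying Yu" is a single token here; the pattern is computed after segmentation, not on raw characters.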
Referring now to FIG. 1, it can be seen that the numeral 10 generally designates one embodiment of the present invention.
A Chinese sentence in the form of a Chinese character string is input to the machine at block 12. Using the method steps described in the first patent, U.S. Patent No. 5224040, which is incorporated herein by reference in its entirety, the character string is divided into segments at block 14, each segment representing a word; the dictionary 16 may be consulted during the segmentation. Up to this point, the sentence in the source language is processed according to the method steps of U.S. Patent No. 5224040.
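The segmentation algorithm itself belongs to the first patent and is not reproduced in this description. As a stand-in, the Python sketch below uses greedy forward maximum matching, a common dictionary-based technique that is not claimed to be the first patent's method. The toy dictionary and the character rendering 他英语讲得好 of "Ta Ying Yu jiang de hao" are assumptions for illustration.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary entry; fall back to a single character if none matches."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary (assumed); "英语" (Ying Yu, "English") is a two-character word.
DICTIONARY = {"他", "英语", "讲", "得", "好"}
print(segment("他英语讲得好", DICTIONARY))
# ['他', '英语', '讲', '得', '好']
```

Greedy matching can pick the wrong grouping when several segmentations are possible, which is exactly the ambiguity that the first patent is said to address.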
The steps of the method of the invention are then carried out. More specifically, by referring again to the dictionary 16, the part of speech (POS) of each word of the segmented sentence is found at block 18, and the Information Pattern (IP) of the sentence (still in the source language) is found at block 20. The knowledge base 22 is then consulted, as indicated by arrow 21, in an attempt to match the IP of the source language with the corresponding predetermined IP of the target language. Given the comprehensiveness of the knowledge base 22, this attempt succeeds. The predetermined target-language IP that matches the source-language IP flows to function block 20 and then to function block 24, as indicated by arrow 23. At block 24, the LCF of the sentence is generated by referring to the dictionary 16. If the source language is Chinese and the target language is English, the LCF will be an English-style Chinese sentence, i.e., a representation of Chinese words arranged in the correct English syntax and sentence structure. The LCF is then translated into the original English sentence by looking up the dictionary at function block 26.
The final steps of the process are carried out using the second-generation machine, as described in the second patent, U.S. Patent No. 5384702, which is incorporated herein by reference in its entirety. Thus, the grammatical rules and self-correcting rules of the second patent are executed at block 28 to produce a polished, finished translation.
As disclosed in more detail in Fig. 2, an Information Pattern (IP) is generated at function block 20 from the parts of speech received from function block 18; the IP is a prerequisite for generating the LCF of the sentence. Because it is generated from the POS of the source-language sentence, it is the IP of the source language. This IP is fed to the knowledge base 22, as indicated by arrow 21, and when a match is found, the corresponding IP in the target language is fed to the LCF generator 24, as indicated by arrow 23. At this point the cultural nuances of the source language have been replaced by those of the target language, which greatly facilitates the final translation step.
At block 26, the dictionary 16 is consulted to provide the original translation, but it should be remembered that the words of the original translation are now arranged in the order of the information pattern in the target language corresponding to the information pattern in the source language. Thus, at block 28, the rules disclosed in the second patent are applied to this original translation to produce a processed and finished sentence in the target language.
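The flow through blocks 18 to 28 can be summarized as a chain of small functions. The Python sketch below is a toy walk-through under stated assumptions, not the patented implementation: the function names, the dictionary and knowledge-base layouts, and the pass-through `polish` step (the second patent's rules are not reproduced here) are all hypothetical. It starts from an already segmented sentence.

```python
# Toy dictionary (16): word -> (part of speech, English gloss). Assumed data.
DICTIONARY = {"Ta": ("P", "He"), "Ying Yu": ("N", "English"),
              "jiang": ("V", "speak"), "de": ("de", ""), "hao": ("Adv", "well")}

# Toy knowledge base (22): source IP -> predetermined target IP. Assumed data.
KNOWLEDGE_BASE = {"P + N + V + de + Adv": "P + V + N + Adv"}


def find_pos(words):                       # block 18
    return [(w, DICTIONARY[w][0]) for w in words]


def find_ip(tagged):                       # block 20
    return " + ".join(tag for _w, tag in tagged)


def generate_lcf(tagged, target_ip):       # block 24
    by_tag = {tag: w for w, tag in tagged}  # assumes one word per tag
    return [by_tag[tag] for tag in target_ip.split(" + ")]


def raw_translate(lcf):                    # block 26
    return " ".join(DICTIONARY[w][1] for w in lcf)


def polish(raw):                           # block 28 (placeholder only)
    return raw


def translate(segmented_words):
    tagged = find_pos(segmented_words)
    target_ip = KNOWLEDGE_BASE[find_ip(tagged)]  # arrows 21 and 23
    return polish(raw_translate(generate_lcf(tagged, target_ip)))


print(translate(["Ta", "Ying Yu", "jiang", "de", "hao"]))
# He speak English well
```

Keeping each block as its own function mirrors the flow chart and makes it clear where the third-generation steps (blocks 18 to 26) sit between segmentation and the final grammar pass.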
The recognition that any one language has only a limited number of information patterns, a number that a computer can manage, has led to a pioneering, breakthrough invention. Likewise a breakthrough is the method of using a computer to retrieve from a knowledge base the predetermined target-language information pattern that corresponds to the information pattern identified in the source language, then inserting the words of the source language into that predetermined target-language information pattern to produce an LCF, and then translating and polishing. In view of the pioneering nature of the present invention, the following claims should be studied to determine the true scope and content of this invention.
It will thus be seen that the objects set forth above, among those made apparent from the preceding description, are efficiently attained and, since certain changes may be made in the above construction without departing from the scope of the invention, it is intended that all matter contained in the above construction or shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.
It is also to be understood that the following claims are intended to cover all of the generic and specific features of the invention herein described and all statements of the scope of the invention which, as a matter of language, might be said to fall therebetween.
Now, the present invention has been described.
Claims (7)
1. A method of translating a sentence in a source language into a sentence in a target language, characterized in that the method converts the thinking process and sentence structure of the source language into the thinking process and sentence structure of the target language during translation, and comprises the following steps:
inputting a sentence in a source language;
finding out the parts of speech of the source language sentence and the sequence of the parts of speech appearing in the source language;
finding out the information pattern of the source language, wherein the information pattern comprises the arrangement of the parts of speech according to the sequence;
finding a predetermined information pattern in the target language, the information pattern in the target language corresponding to the information pattern found in the source language;
generating an LCF by arranging words of the source language sentence in a word order represented by the predetermined information pattern of the target language;
and translating the words.
2. The method of claim 1, further comprising the step of providing a knowledge base containing information patterns in a plurality of source languages.
3. The method of claim 2, further comprising the step of providing said knowledge base with information patterns in a plurality of target languages.
4. The method of claim 3, further comprising the step of associating each information pattern in the source language with a predetermined information pattern in the target language.
5. The method of claim 4, further comprising the step of finding the corresponding information pattern in the target language in the information pattern knowledge base and placing the words of the source language into the predetermined information pattern of the target language.
6. The method of claim 5, further comprising the steps of generating a language canonical form based on the located target-language information pattern, translating the source-language words of the language canonical form, and providing an original translation of the source-language sentence.
7. The method of claim 6, further comprising the step of applying a set of self-correcting grammar rules to process said original translation.
Publications (1)
Publication Number | Publication Date |
---|---|
HK1018647A (en) | 1999-12-30 |