RU2546064C1 - Distributed system and method of language translation

Info

Publication number: RU2546064C1
Application number: RU2013150294/08A
Authority: RU
Inventors: Иван Валерьевич Смольников; Владимир Владиславович Гусаков; Артем Владимирович Украинец
Original assignee: Общество с ограниченной ответственностью "Технологии управления переводом"
Priority date: 2013-11-12
Filing date: 2013-11-12
Publication date: 2015-04-10

Abstract

FIELD: information technologies.

SUBSTANCE: a distributed network translation system provides access to a distributed network of professional translators and machine translation (MT) systems that together perform translation in real time. The system consists of cloud servers; a user interface; a segmentation module that splits the source file into a set of segments; a pre-translation module; a translation memory database; morphological dictionaries; a glossary module; a module that creates accompanying data, forming a reference data set for performers and for individually tuning the MT systems for the translation of the given document; a word-by-word alignment module; an MT processing module that selects the best translation variant; a module that identifies the set of performers recommended for work on the document translation; a module that collects data on all actions executed by performers in the web interface; and a module that compiles the final translations of the individual segments into the final translated file.

EFFECT: increased speed, efficiency and accuracy of translation.

13 cl, 3 dwg

Description

FIELD OF THE INVENTION

[001] The present invention relates to a system and method for language translation of text in general, and to a system for collaborative translation in particular.

BACKGROUND

[002] The collection and exchange of information for any scientific, commercial, political or social purpose often requires fast and efficient translation of text, so that a wealth of knowledge and ideas can become useful on a global scale. Computer programs that translate automatically from one language to another (“machine translation programs”) can, in principle, satisfy this need, and such programs have been developed and continue to be developed for many languages. For a formal style of writing in well-studied languages (as opposed to an informal, idiomatic or colloquial style), such machine translation programs deliver reasonably adequate translation quality.

[003] For more difficult or less studied languages (for example, Arabic), however, existing machine translation programs do not work well even for formal communication (for example, Modern Standard Arabic), and they are especially weak for informal, colloquial and idiomatic communication. Similarly, where a high-quality, accurate translation is required, machine translation alone is insufficient even for well-studied languages (for example, English, French, Spanish, German and others).

[004] Professional translators can, in principle, provide high-quality translations for difficult languages and informal communications, but Internet applications require constant availability and prompt response, which cannot be guaranteed under existing approaches to organizing the work of professional translators.

[005] In light of the foregoing, there is a need for a method and system that can ensure the efficient use of translation memory databases when large teams of professional translators work on the translation of a text simultaneously.

SUMMARY OF THE INVENTION

[006] The invention consists in the following:

[007] The present invention provides a system and method for translating from the language of a source file. The system consists of a web server that receives and processes the data of the source file to be translated; a database for storing translated text, processed source files and glossary terms; a segmentation module designed to split the source file into a plurality of segments; a processing module that searches existing data for matches to these segments, in order to find fully and/or partially matching segments from previously translated texts; a machine translation module for generating a machine translation of a segment; a terminology search module for finding the glossary terms used in a segment; and a user interface, available to multiple users simultaneously, that displays the machine translation, full and partial matches from the translation memory, and glossary terms, and that makes it possible to perform a professional translation of the source file. The system and method can be embodied as executable code (software), hardware, or a combination thereof.

[008] In another aspect of the present invention, for each segment stored in the database, the processing module searches for exact or partial matches with previously translated sentences, and looks up glossary terms and machine translations of the sentence. In an implementation of the system, multiple users can access the system's user interface simultaneously, and the translations made by each user are sent from the user interface to the server and stored in the database. The user interface can also display segments previously translated by other users that fully or partially match the segment being translated.

BRIEF DESCRIPTION OF THE DRAWINGS

[009] Implementations of the invention are described below with reference to the accompanying drawings, which are presented to illustrate the essence of the invention and in no way limit its scope. The following drawings are attached to the application:

[010] Fig. 1 is a data flow diagram illustrating a distributed language translation system implemented in accordance with the present invention.

[011] Fig. 2 is a data flow diagram illustrating the automatic pre-translation method used in the pre-translation module 118, which in turn is part of the distributed language translation system implemented in accordance with the present invention.

[012] Fig. 3 schematically illustrates the interaction between the integration layer for external information systems and the distributed language translation system implemented in accordance with the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[013] In the following detailed description of an implementation of the invention, numerous implementation details are set forth to provide a clear understanding of the present invention. However, it will be apparent to a person skilled in the art how the present invention can be used both with and without these implementation details. In other instances, well-known methods, procedures and components have not been described in detail so as not to unnecessarily obscure the features of the present invention.

[014] Furthermore, it will be clear from the description that the invention is not limited to the implementation described. Numerous possible modifications, changes, variations and substitutions that preserve the essence and form of the present invention will be apparent to those skilled in the art.

[015] The present invention is directed to providing a system and method for fast, efficient and more reliable language translation by means of a distributed network language translation system.

[016] A distributed network language translation system is a distributed network of professional translators and machine translation systems that interact through the system's programming and user interfaces and jointly perform, in real time, the translation of texts for which machine translation alone or traditionally organized professional translation is insufficient, including the translation of dynamic communications and other texts created in various information environments.

[017] In an implementation of the present invention, the system is a cloud platform accessible via the Internet. Professional translators access the system through a user interface opened in a web browser; the interface provides separate windows for managing a translation project and for simultaneous translation and editing of text by several performers in real time. The distributed network language translation system provides tools for aggregating the resources of a large number of translators with different availability modes and different professional skills, both professional human translators and computer machine translation systems, in order to efficiently produce high-quality translations in real time.

[018] In an implementation of the present invention, translation is performed by splitting the source text into segments, performing a terminology search and a search for matches in the translation memory database for each segment, and then sending each segment simultaneously to several machine translation systems for translation. Each data source (glossary, translation memory, machine translation system) has its own quality rating, match metric and/or its own confidence in the result it returns, calculated individually for each segment. The results of the terminology and translation memory searches are used for additional individual tuning of the machine translation systems, and partial matches with the source segment found in the translation memory database are used to substitute the matching part of the segment into the final machine translation. Then, taking into account the rating of each machine translation system and the metric values for each segment, both those internal to each machine translation system and external ones (accounting for the fluency of the text as a whole as well as internal factors such as the number of fragments/phrases from which the translation was assembled, whether glossary terminology occurs in the segment, etc.), the machine translation variant with the best automatic metric values is selected.
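
The selection step described in [018] can be illustrated with a minimal sketch. The description does not specify how the ratings and metrics are combined, so the weighted sum, the weights, the fragment penalty and all names below (`MTCandidate`, `score`, `select_best`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MTCandidate:
    system: str                # name of the MT system (hypothetical)
    text: str                  # proposed translation of the segment
    system_rating: float       # overall quality rating of the system, 0..1
    confidence: float          # the system's own per-segment confidence, 0..1
    fluency: float             # external fluency metric for the output, 0..1
    fragment_count: int        # number of fragments the output was assembled from
    glossary_terms_used: bool  # whether glossary terminology appears in the output


def score(c: MTCandidate) -> float:
    """Combine the metrics into one number (assumed weights, not from the source)."""
    base = 0.3 * c.system_rating + 0.3 * c.confidence + 0.3 * c.fluency
    if c.glossary_terms_used:
        base += 0.1
    # Output stitched together from many fragments is penalised slightly.
    return base - 0.02 * max(c.fragment_count - 1, 0)


def select_best(candidates: List[MTCandidate]) -> MTCandidate:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("no machine translation candidates for this segment")
    return max(candidates, key=score)
```

In this sketch a system with a high overall rating can still lose to a lower-rated one whose output for the specific segment is more fluent and uses the glossary terms, which is the per-segment behaviour the paragraph describes.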

[019] The distributed network translation system is based on translation memory technology, which is intended to increase translation efficiency and consists of a store of parallel segments and a search subsystem that supports both searches with conditions imposed only on the source language and searches with conditions imposed simultaneously on the source language and the target language. The translation memory database stores previously translated segments so that they do not have to be translated again from scratch. Thus, one of the main functions of the translation memory database is to search for previously translated segments similar to the segment being translated, and to compare the segments and their translations, as well as individual phrases and words within these segments.
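
The two search modes mentioned in [019] can be sketched as follows. This is only an illustration under simplifying assumptions: the store is an in-memory list of (source, target) pairs, the conditions are plain case-insensitive substring tests, and the function name `concordance` is hypothetical.

```python
from typing import List, Optional, Tuple

ParallelSegment = Tuple[str, str]  # (source text, target text)


def concordance(
    memory: List[ParallelSegment],
    source_contains: str,
    target_contains: Optional[str] = None,
) -> List[ParallelSegment]:
    """Search parallel segments by a source-side condition and, optionally,
    a simultaneous target-side condition."""
    hits = []
    for source, target in memory:
        if source_contains.lower() not in source.lower():
            continue
        if target_contains is not None and target_contains.lower() not in target.lower():
            continue
        hits.append((source, target))
    return hits
```

Calling `concordance(memory, "button")` applies a condition to the source language only, while `concordance(memory, "button", "bouton")` restricts the results to pairs whose translation also contains the given target-language string.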

[020] Fig. 1 is a diagram of a distributed network translation system that is an example implementation in accordance with the present invention. As shown in Fig. 1, the distributed network translation system 100 serves a plurality of translation customers 102 who wish to obtain translations of source files 104. The translation customers 102 connect to a remote web server 112 over the Internet 110 using the user interface 106 and a web browser. A source file 104 is uploaded to the server 112 over the Internet 110. After the source file 104 has been uploaded to the web server 112, the segmentation module 114 processes the source file 104 and splits its text into a plurality of segments. The segments, each of which contains part of the source text of the file 104, are then processed by the pre-translation module 118, which finds for each source segment the corresponding resources in the linguistic resource database 116: it finds fully matching segments in the translation memory, finds partially matching segments in the translation memory, finds the glossary terms that occur in the segment, and determines for each word of the segment the corresponding entries in the morphological frequency dictionary. The linguistic resource database 116 consists of a morphological frequency dictionary, glossaries, and a translation memory database containing previously translated segments supplemented with metadata of the documents in which these segments occurred, as well as data on the history of work on each segment, i.e. which performer completed which workflow stage for the segment, what changes that performer made to the translation text, and records of all actions the performer carried out in the system interface while working on the segment. Thus, for each segment found in the translation memory, the pre-translation module 118 determines the document to which it belongs and who translated the segment (which machine translation system was used, who was the translator, who the editor, who the proofreader, etc.), as well as the quantitative quality scores received by each of the listed performers for that document (if an expert analysis of translation quality was carried out). Then, based on preconfigured pre-translation rules, which are applied at the level of each segment and can be further modified for each specific translation project, the pre-translation module 118 generates a preliminary automatic translation of the file containing, for each segment, the fully and partially matching segments from the translation memory, the glossary terms, and the machine translation variants produced by the various machine translation systems. The pre-translation rules are applied at the level of each individual segment and determine which of the translation variants will be used by default, as well as which stages of work on the segment must be carried out by professional human performers, depending on the translation variant selected by default, the automatic assessment of its quality, and the amount of rework required.
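
A per-segment pre-translation rule of the kind described in [020] might look like the sketch below. The description does not give concrete rules, so the thresholds, the stage names and the function `apply_pretranslation_rules` are assumptions chosen only to show how a default variant and a list of remaining human stages could be derived together.

```python
from typing import List, Optional, Tuple


def apply_pretranslation_rules(
    tm_match: float,                # best translation memory match, 0.0..1.0
    mt_quality: Optional[float],    # automatic quality estimate of the best MT variant, or None
    fuzzy_threshold: float = 0.75,  # assumed cut-off for a usable partial match
    mt_threshold: float = 0.8,      # assumed cut-off for MT that only needs editing
) -> Tuple[str, List[str]]:
    """Return (default translation source, human workflow stages still required)."""
    if tm_match >= 1.0:
        # Full match: reuse the stored translation, skip translation and editing.
        return "tm_exact", ["proofreading"]
    if tm_match >= fuzzy_threshold:
        # Partial match: reuse as a draft that an editor adapts.
        return "tm_fuzzy", ["editing", "proofreading"]
    if mt_quality is not None and mt_quality >= mt_threshold:
        # Good machine translation: treat as a draft for post-editing.
        return "mt", ["editing", "proofreading"]
    # Nothing reliable: the segment goes through the full workflow.
    default = "mt" if mt_quality is not None else "empty"
    return default, ["translation", "editing", "proofreading"]
```

Because the thresholds are parameters, they can be overridden per project, which mirrors the statement that the preconfigured rules can be modified for each specific translation project.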

[021] In an example implementation of the present invention, the translation memory is a system for storing and searching parallel segments (sentences, phrases or sentence fragments), each of which is a pair consisting of a source text and its translation. The translation memory is used to assist the translator in translating text; it accumulates previously translated segments so that they need not be translated again from scratch. This function is performed by searching the translation memory database for previously translated segments that fully or partially match the segment being translated. To determine how closely previously translated segments match the segment being translated, a match metric is used that reflects the degree to which the text of the new segment coincides with the source text of a previously translated segment stored in the database.
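
The description does not define the match metric itself. As one possible illustration, the sketch below uses the token-level similarity ratio from Python's standard `difflib.SequenceMatcher`; the choice of this metric, the 0.7 threshold and the function names `match_score` and `search_tm` are assumptions, not the method of the invention.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def match_score(new_source: str, stored_source: str) -> float:
    """Similarity of two source texts, 0.0..1.0, computed over word tokens."""
    a = new_source.lower().split()
    b = stored_source.lower().split()
    return SequenceMatcher(None, a, b).ratio()


def search_tm(
    segment: str,
    memory: List[Tuple[str, str]],  # (source text, target text) pairs
    threshold: float = 0.7,         # assumed minimum score for a partial match
) -> List[Tuple[float, str, str]]:
    """Return (score, source, target) hits at or above the threshold, best first."""
    hits = [(match_score(segment, src), src, tgt) for src, tgt in memory]
    hits = [h for h in hits if h[0] >= threshold]
    return sorted(hits, key=lambda h: h[0], reverse=True)
```

A score of 1.0 corresponds to a full match on the source text; lower scores above the threshold correspond to the partial matches that are offered to the translator as a starting point.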

[022] A plurality of performers (translators) 108 connect to the platform 100 through the web interface 106. The segments produced by the text segmentation module 114 are passed for processing to the pre-translation module 118, which generates a preliminary translation of the source file 104 (which may include the best machine translation variant among those obtained from the available machine translation systems), the set of professional operations that must be performed for each segment, and a set of recommended professional translators/editors/proofreaders, indicating for each of them a rating and a metric of how preferable it is to involve that person in work on this particular document. The web user interface 106 contains various interface windows for managing the project and for working directly on translating and editing the translation text; several translators/editors/proofreaders/reviewers, etc. can edit one document simultaneously. Using the corresponding web interface 108, a translator can view the translation of the source file 104, taking into account the preliminary translation produced by the pre-translation module 118 as well as translations made or edited by other translators/editors/proofreaders/reviewers. The translation memory database is updated automatically as translators enter translations of new segments and edit earlier automatic or professional translations, and the changes and new additions appear automatically, in real time, in the search results.

[023] Segment translations that are entered or edited by professional translators 108 after the preliminary translation module 118 has produced their preliminary translation (different segments may pass through different stages of professional work, which is also determined by module 118) are checked automatically once the performer saves them. The checks verify that terminology is translated in accordance with the glossary and that the segment passes the other automatic quality control rules configured for the file. Based on the results, a special flag may be set on the segment to signal that a potential error has been detected automatically, together with a description of the error. In addition, for each segment the performer sees in the web interface the history of work on it: who made changes, at which stages, and what the changes were. Comments left by previous performers on the segment are also visible; in them performers can justify the choice of a translation variant that does not pass automatic verification. Each segment of the document must go through all stages of the workflow defined at the document level, except where the preliminary translation module 118 has determined that some stages may be skipped for individual segments.
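The automatic terminology check described above can be illustrated with a minimal sketch. The function name `check_terminology`, the dictionary-shaped glossary and the plain substring comparison are assumptions made for illustration only; the description itself matches words in any of their valid forms rather than by substring.

```python
def check_terminology(source, target, glossary):
    """Return descriptions of potential terminology errors for one segment.

    glossary maps a source-language term to its required translation.
    Substring matching is a simplification (assumption); the described
    system compares base forms from morphological dictionaries.
    """
    issues = []
    for source_term, required in glossary.items():
        term_in_source = source_term.lower() in source.lower()
        required_in_target = required.lower() in target.lower()
        if term_in_source and not required_in_target:
            issues.append(
                f"'{source_term}' should be translated as '{required}'"
            )
    return issues


GLOSSARY = {"translation memory": "память переводов"}
flagged = check_terminology(
    "Open the translation memory.", "Откройте базу переводов.", GLOSSARY
)
print(flagged)
```

A non-empty result would correspond to setting the flag on the segment, with the returned strings serving as the error description shown to the performer.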

[024] Figure 2 is a diagram illustrating the operation of the preliminary translation module that is part of the distributed network translation system implemented in accordance with the present invention. Block 202 receives as input the source file 104, either uploaded through the web user interface of the distributed network system or submitted through the system's application programming interfaces (APIs). The source file 104 may have either a textual or a binary representation; it is processed by a software filter matching the file format to extract the file's contents as text. In block 204, the text content of the source file 104 is split into segments by a software parser compatible with the SRX (Segmentation Rule Exchange) rule language. The resulting segments are then stored in the database together with a specially generated XML file with binary inserts that contains the information needed to later assemble the translated file while preserving the formatting of the source file 104.

[025] The process of splitting text into segments can be pictured as follows: a cursor moves through the text one character at a time. At each cursor position the rules are checked; each rule consists of a pattern for the preceding text and a pattern for the following text. The rules are checked in their specified order: first the text before the current cursor position is tested against a rule's pattern for the preceding text, and then the text after the cursor position is tested against the pattern for the following text associated with it. If the text matches both patterns, then either the cursor moves to the next position without inserting a segment boundary, if the patterns belong to an exception rule, or a segment boundary is inserted, if the patterns belong to a break rule.
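The cursor procedure above can be sketched as follows. This is a simplified illustration and not a full SRX parser: the `Rule` class, the `segment` function and the two sample rules are hypothetical, and the rules are given directly as regular expressions instead of being read from an SRX file.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    before: str     # pattern the text before the cursor must end with
    after: str      # pattern the text after the cursor must start with
    is_break: bool  # True for a break rule, False for an exception rule


def segment(text, rules):
    """Split text by checking the ordered rules at every cursor position."""
    compiled = [
        (re.compile(f"(?:{r.before})$"), re.compile(r.after), r.is_break)
        for r in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for before, after, is_break in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # the first rule matching both patterns decides
    segments.append(text[start:])
    return segments


# Sample rules (assumptions): the exception rule is listed first so that
# abbreviations such as "Dr." do not end a segment.
RULES = [
    Rule(before=r"\b(?:Mr|Dr)\.", after=r"\s", is_break=False),
    Rule(before=r"[.?!]", after=r"\s", is_break=True),
]
print(segment("Dr. Smith arrived. He left.", RULES))
# prints ['Dr. Smith arrived.', ' He left.']
```

Because the rules are checked in order and the first full match wins, placing exception rules ahead of break rules is what suppresses the boundary after the abbreviation.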

[026] In block 206, for each segment of the source document 104 stored in the database, a search is performed for full and partial matches among the previously translated segments stored in the translation memory database 116. Glossary terms occurring in the segment are also searched for. In addition, each word of the segment is linked to its corresponding entries in the morphological frequency dictionary. For each fully or partially matching segment found in the translation memory 116, information is also loaded about the document it belongs to and about the linguistic resources used in translating that document (glossaries, translation memory databases, machine translation systems). This also includes information about the performers who worked on the document and on each specific segment, the professional quality ratings they received, and metrics of their productivity and of the volume of corrections made at each workflow stage for each segment of the document. Both the metrics obtained for the specific segment found and the averaged metric values for the document as a whole are used.

[027] When each new translated (parallel) segment is added to the index of the translation memory database 116, the segment is split into individual words, and each word is looked up in morphological frequency dictionaries. These dictionaries contain all valid forms of each word, an indication of which form is the base form, and various additional metadata describing both the concept as a whole (for example, part of speech or a frequency-of-use metric) and each of the forms (gender, case, etc.). Each word is thus assigned a reference to its corresponding dictionary entry. A word may correspond to several possible entries, in which case all of them are assigned to it. This information is included in a multi-parameter index, which later makes it possible to find, for example, all segments in which a given word is used in any of its valid forms. The translation text of the segment is analyzed in the same way, and the resulting information is also included in the index. This makes it possible to find segments in which a given word occurs in the original in any of its valid forms while some other word occurs in the translation in any of its valid forms, or, conversely, in which a given word does not occur in the translation in any of its valid forms. Such queries do not require enumerating all possible forms of a word; it is enough to specify a reference to the base form and the required values of the other metadata. The same analysis is used to build a word-by-word alignment between the original text and the translation, i.e. for each word of the original the corresponding word of the translation is found (individual words in either the original or the translation may have no counterpart, i.e. they correspond to the empty set). Each word is also looked up in all glossaries of the client and in the public glossaries 208, and references to the corresponding glossary entries are also stored in the search index. If a word matches a multi-word glossary term, the segment is checked for the remaining words of that term (this check is likewise performed at the level of the word occurring in any of its valid forms). In addition, words whose frequency of use, as given in the dictionary, is below a fixed threshold are marked as low-frequency. The threshold is chosen empirically for each language individually.
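A minimal sketch of such an index is given below. The toy dictionary `FORMS`, the class `LemmaIndex` and its methods are hypothetical names introduced for illustration; a real morphological frequency dictionary would also carry part of speech, frequency, gender, case and other metadata, which this sketch omits.

```python
from collections import defaultdict

# Toy morphological dictionary (assumption): surface form -> base forms.
FORMS = {
    "file": ["file"], "files": ["file"],
    "translated": ["translate"], "translates": ["translate"],
    "файл": ["файл"], "файлы": ["файл"],
}


def base_forms(word):
    """Return every base form a word may belong to (itself if unknown)."""
    word = word.lower().strip(".,;:!?")
    return FORMS.get(word, [word])


class LemmaIndex:
    """Index of segment ids keyed by base form, kept separately per side."""

    def __init__(self):
        self.source = defaultdict(set)
        self.target = defaultdict(set)

    def add(self, segment_id, source_text, target_text):
        for side, text in ((self.source, source_text), (self.target, target_text)):
            for word in text.split():
                for lemma in base_forms(word):
                    side[lemma].add(segment_id)

    def find(self, source_lemma, target_lemma, target_present=True):
        """Segments with source_lemma in the original and target_lemma
        present in (or absent from) the translation, in any word form."""
        in_source = self.source.get(source_lemma, set())
        in_target = self.target.get(target_lemma, set())
        return in_source & in_target if target_present else in_source - in_target


index = LemmaIndex()
index.add(1, "The files were translated.", "Файлы были переведены.")
index.add(2, "She translates the file.", "Она переводит документ.")
print(index.find("file", "файл"))
print(index.find("file", "файл", target_present=False))
```

The queries refer only to base forms, so no enumeration of word forms is needed at search time; all form handling happens once, when a segment is indexed.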

[028] In block 206, a set of parallel segments (translation variants) found in the translation memory is created for each segment, and each of them is assigned a metric of the degree of similarity between the parallel segment from the translation memory and the source segment. As described above, when a new parallel segment is placed in the index of the translation memory database, each word of the segment is analyzed against the morphological dictionaries. When searching for parallel segments similar to the source segment, a word used in different forms in the source segment and in the found parallel segment is treated as the same word; accordingly, when the segment similarity metric is computed, differences in word form reduce it only minimally. The metric decreases much more strongly when the source segment contains words that are absent from the parallel segment and, conversely, when the parallel segment contains words absent from the source segment.
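The text does not give the formula for this similarity metric, so the following is only a sketch of the stated behavior: a small penalty for a difference in word form and a much larger penalty for a word present on one side only. The penalty values, the toy `LEMMAS` table and the function names are assumptions.

```python
# Toy base-form table (assumption); unknown words are their own base form.
LEMMAS = {"files": "file", "translated": "translate", "translates": "translate"}


def lemma(word):
    return LEMMAS.get(word, word)


def group_by_lemma(text):
    """Map each base form to the set of surface forms used in the text."""
    groups = {}
    for word in text.lower().split():
        groups.setdefault(lemma(word), set()).add(word)
    return groups


def similarity(source, candidate, form_penalty=0.02, missing_penalty=0.25):
    """Score in [0, 1]; the penalty weights are illustrative assumptions."""
    src = group_by_lemma(source)
    cand = group_by_lemma(candidate)
    shared = src.keys() & cand.keys()
    one_sided = src.keys() ^ cand.keys()  # words missing on either side
    form_diffs = sum(1 for key in shared if src[key] != cand[key])
    score = 1.0 - form_penalty * form_diffs - missing_penalty * len(one_sided)
    return max(0.0, score)


same = similarity("translate the file", "translate the file")
forms = similarity("translate the file", "translate the files")
missing = similarity("translate the file", "translate the document")
print(same, forms, missing)
```

With these weights an exact match scores highest, a segment differing only in word form scores slightly lower, and a segment with a substituted word scores much lower, which is the ordering the paragraph describes.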

[029] In block 208, the text of the source segment is split into individual words, and each word is looked up in the morphological frequency dictionary. As a result, the word is assigned references to its possible base forms and the corresponding metadata, both those common to the base form (for example, part of speech) and those specific to the form in which the word is used in the source segment (case, declension, etc.). For each base form, the dictionary contains data on the frequency of use of the word. Words of the source segment whose frequency of use (determined on a general language corpus) is below the threshold are marked as low-frequency. Each word is also looked up in all glossaries of the client and in the public glossaries 208, and references to the corresponding glossary entries are also stored in the search index. If a word matches a multi-word glossary term, the segment is checked for the remaining words of that term (this check is likewise performed at the level of the word occurring in any of its valid forms); if the whole term occurs in the segment, this information is also stored in the index, including which words belong to the term. Then, based on terminology extraction rules and on aggregated data collected across all segments of the document, a list of terms that are candidates for inclusion in the project glossary is determined. Two types of criteria are used to select candidate terms: linguistic and statistical. Linguistic criteria comprise rules defining permitted and forbidden word combinations (by part of speech, for example) and a set of stop words that are not included in terms. For example, a linguistic rule may specify that a phrase consisting of two nouns is eligible for extraction. Statistical criteria set the minimum number of occurrences in the text at which a term is extracted. Different linguistic rules may have different minimum frequencies, and the project manager may also adjust the frequency individually depending on the project stage, how well existing glossaries already cover the terminology, and so on. The result is a set of candidate terms for inclusion in the glossary, which is saved to the database in block 218. The project manager can then assign a performer responsible for translating these terms; the translator reviews the selected terms and translates them in block 220 through the web interface 106. In this interface the translator sees translation variants for the terms found in the client's translation memory (if the terms occur there), variants offered by other public glossaries and public translation memory databases, and the context in which the terms are used in the source text. Translated and verified terms are passed to block 208 and are ultimately included in the aggregated sets of linguistic resources formed in block 214.
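The combination of linguistic and statistical criteria can be sketched as below. The part-of-speech patterns, the per-pattern minimum counts, the stop words and the function name `candidate_terms` are illustrative assumptions; the input is assumed to be already split into (base form, part of speech) pairs by the dictionary lookup described above.

```python
from collections import Counter

# Linguistic rules (assumption): permitted part-of-speech pattern mapped to
# the minimum number of occurrences required for that pattern.
ALLOWED = {("NOUN", "NOUN"): 2, ("ADJ", "NOUN"): 3}
STOP_WORDS = {"other", "such"}


def candidate_terms(segments, allowed=ALLOWED, stop_words=STOP_WORDS):
    """segments: list of segments, each a list of (base_form, pos) pairs."""
    counts = Counter()
    for seg in segments:
        for size in {len(pattern) for pattern in allowed}:
            for i in range(len(seg) - size + 1):
                gram = seg[i:i + size]
                lemmas = tuple(base for base, _ in gram)
                tags = tuple(pos for _, pos in gram)
                # Linguistic criteria: permitted pattern and no stop words.
                if tags in allowed and not stop_words.intersection(lemmas):
                    counts[(lemmas, tags)] += 1
    # Statistical criterion: minimum count depends on the matched rule.
    return sorted(
        " ".join(lemmas)
        for (lemmas, tags), count in counts.items()
        if count >= allowed[tags]
    )


DOCUMENT = [
    [("translation", "NOUN"), ("memory", "NOUN"), ("store", "VERB"),
     ("parallel", "ADJ"), ("segment", "NOUN")],
    [("translation", "NOUN"), ("memory", "NOUN"), ("index", "NOUN")],
    [("other", "ADJ"), ("segment", "NOUN")],
]
print(candidate_terms(DOCUMENT))
# prints ['translation memory']
```

Lowering the minimum counts, as a project manager might do when existing glossaries cover little of the terminology, admits more candidates without changing the linguistic rules.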

[030] As described earlier, when a parallel segment is added to the translation memory database 116, information about the glossary terms and low-frequency words found in the segment is added to the index. In block 210, the index of the translation memory 116 is searched for parallel segments containing the low-frequency words and glossary terms found in the source segment; this yields a set of similar parallel segments in which the same low-frequency words and glossary terms occur. For each parallel segment in this set, a terminology match metric is computed from the point of view of machine translation:

Where index i denotes the source segment, index j denotes a similar segment from the constructed set, and f is the frequency of use of the corresponding word according to the morphological frequency dictionary. The constants C₁, C₂, C₃, f₀ are chosen empirically for each language so as to maximize the correlation with professional assessments of segment similarity and to maximize string-based quality metrics of machine translation built on texts filtered in accordance with the above algorithm for computing MT_terminology_match_i,j. MT_terminology_index_i,j is computed by the same formula as MT_terminology_index_i, except that only the words and terms occurring in the original of both segments are included. In addition, for each pair of segments i and j the set of words they share is stored, including references to glossary terms. Words are considered matching if their base forms coincide; that is, words used in different forms are considered matching and are included in the calculation. The glossary terms glossary_entries_i and glossary_entries_j are counted on the basis of their occurrence in the original text; whether the translation of a term in the parallel segment matches the translation given in the glossary is irrelevant to this metric.

For each pair of segments selected in this way, a metric of overall segment match is also computed, using the same formula as the metric used to assess the degree of match of segments from the translation memory.

To select the best professional performers (translators, editors, proofreaders, etc.) to work on the text, a terminology match metric is also computed from the point of view of professional translation:

Where matching_glossary_entries_i are the glossary terms found in the parallel segment whose translation in the segment text matches the translation given in the glossary, and different_glossary_entries_i are the glossary terms found in the parallel segment whose translation in the segment text differs from the translation given in the glossary. Terms of the latter kind are included in the metric only if at least one of the following conditions holds: the glossary term contains more than one word, or the glossary term is a low-frequency word.
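The inclusion rule for the two groups of glossary terms can be illustrated as follows. Only the classification step is shown, not the metric itself; the `GlossaryHit` record and the `split_entries` function are hypothetical names, and the flags are assumed to have been computed by the earlier dictionary and glossary lookups.

```python
from dataclasses import dataclass


@dataclass
class GlossaryHit:
    term: str                  # glossary term found in the parallel segment
    translation_matches: bool  # segment translation agrees with the glossary
    low_frequency: bool        # term is marked as a low-frequency word


def split_entries(hits):
    """Return (matching_glossary_entries, different_glossary_entries)."""
    matching = [h for h in hits if h.translation_matches]
    different = [
        h for h in hits
        if not h.translation_matches
        # Counted only for multi-word terms or low-frequency words.
        and (len(h.term.split()) > 1 or h.low_frequency)
    ]
    return matching, different


HITS = [
    GlossaryHit("translation memory", True, False),
    GlossaryHit("source file", False, False),  # multi-word: counted
    GlossaryHit("glossary", False, True),      # low-frequency: counted
    GlossaryHit("file", False, False),         # common single word: ignored
]
matching, different = split_entries(HITS)
print([h.term for h in matching], [h.term for h in different])
```

Under this rule a common single-word term translated differently from the glossary does not affect the metric, presumably because such variation is less indicative of a terminology mismatch.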

[031] For each segment similar to the source segment selected in block 210, and for each partially matching segment from the translation memory found in block 206, we determine the documents they belong to and the workflow stages used for those documents: the machine translation systems, the identities of the performers who worked on the document (translators, editors, proofreaders, etc.), their productivity, and string metrics of the modifications made at each workflow stage. Primary data on performer productivity is collected in real time in the web user interface 106. The interface collects, stores and then sends to the server data on all user actions: all keyboard input, all mouse clicks on interface elements, and compound events such as entering the segment translation edit field, leaving it, and inserting text from the translation memory, a glossary or machine translation. Metrics characterizing the volume of changes made to the text are also computed. Two types of metrics are computed: metrics based solely on comparing the initial and final strings (for example, Levenshtein distance), and metrics that take into account the actions users perform while editing the text (keyboard input, mouse clicks on interface elements and compound events). When computing the time spent translating a segment or editing its translation, periods of inactivity are also taken into account, namely when the web interface lost focus or when the interval between two consecutive actions exceeds a given threshold.
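Both kinds of measurement can be sketched briefly. The Levenshtein distance is the standard dynamic-programming algorithm. The time calculation rests on an assumption: the text says only that inactivity periods are taken into account, and this sketch interprets that as excluding gaps longer than a threshold from the working time. The function names and the 30-second default are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def active_seconds(timestamps, idle_threshold=30.0):
    """Sum the gaps between consecutive user actions, skipping gaps longer
    than idle_threshold (assumed to be periods of inactivity)."""
    total = 0.0
    for earlier, later in zip(timestamps, timestamps[1:]):
        gap = later - earlier
        if gap <= idle_threshold:
            total += gap
    return total


print(levenshtein("kitten", "sitting"))      # prints 3
print(active_seconds([0, 5, 10, 100, 104]))  # prints 14.0
```

The string metric sees only the initial and final text, while the time metric depends on the recorded action log, which is why the description treats them as two separate types of metric.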

[032] Block 210 is repeated for each segment of the source text, producing a set of similar parallel segments from the translation memory 116. A single parallel segment from the translation memory 116 may be included among the similar segments for several segments of the source text.

Where MT_terminology_document_match_j is the metric of similarity between the parallel segment j from the translation memory 116 and the source document, i is the index of source text segments that have a positive value of the similarity metric MT_terminology_match_i,j with the parallel segment j, and N is the number of distinct source document segments with a positive value of this metric (two segments with identical text located at different places in the document count as distinct segments here).

A similar formula is used to compute Human_terminology_match_j.

[033] In block 212, for each document for which we found similar segments in blocks 210 and 206, we can thus compute a metric of the similarity of that document as a whole to our source document.

Where TM_match_metric_j is the inter-segment similarity metric used to find and rank matches against the translation memory database, described in paragraph [028] and normalized to take values in the interval (0, 1]; m is a previously translated document; words_total_m is the number of words in document m; and C₄, C₅ are constants determined empirically in advance.

[034] For each previously translated document 106, we can determine the glossaries and translation memory databases that were explicitly selected for translating that document by the translation customer or the project manager. In block 214, we form four sets of linguistic resources for the source document:

(1) The glossaries and translation memory databases explicitly selected for the source document by the customer or the project manager;

(2) A priority-ordered set of resources for customizing the translation models of the machine translation systems:

1. the set of terms from the glossaries explicitly selected for the source document;

2. the set of parallel segments from the translation memory databases explicitly selected for the source document;

3. for each previously translated document whose MT_document_similarity_m value exceeds an empirically set threshold (resources are sorted by the value of MT_document_similarity_m, and resources of documents with a higher metric value receive a higher priority), we add the following data to the set:

a. the set of terms from the glossaries explicitly selected for that document; only terms containing at least two words are included;

b. the parallel segments from that document;

c. the set of parallel segments from the translation memory databases explicitly selected for that document;

(3) A priority-ordered set of resources for customizing the language models (for the target language) of the machine translation systems:

1. the set of parallel segments from the translation memory databases explicitly selected for the source document;

2. for all documents whose Human_document_similarity_m value exceeds an empirically set threshold (resources are sorted by the value of Human_document_similarity_m, and resources of documents with a higher metric value receive a higher priority), we add the following data to the set:

a. the parallel segments from that document;

b. the set of parallel segments from the translation memory databases explicitly selected for that document;

(4) A set of source-document segments, each linked to an ordered set of the following data:

1. translation memory segments that partially match the source segment; only segments that belong to data set (2) are included here, ordered by the value of the segment matching metric used to search the translation memory databases;

2. similar segments with a positive value of the Human_terminology_match_strict_i,j metric; only segments that belong to data set (2) are included here, ordered by the value of the segment matching metric used to search the translation memory databases;

for each parallel segment included in this set, the word-by-word alignment between the segment's original and its translation is also included.
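
The assembly of data set (2) can be illustrated with a minimal sketch. It assumes the similarity values MT_document_similarity_m have already been computed; the class `PriorDocument`, the function `build_mt_resource_set` and the label strings are hypothetical names introduced only for illustration and do not come from the described system.

```python
from dataclasses import dataclass, field


@dataclass
class PriorDocument:
    """A previously translated document (hypothetical container)."""
    name: str
    similarity: float  # assumed precomputed MT_document_similarity_m
    glossary_terms: list = field(default_factory=list)
    parallel_segments: list = field(default_factory=list)
    tm_segments: list = field(default_factory=list)


def build_mt_resource_set(explicit_terms, explicit_tm_segments,
                          prior_documents, threshold):
    """Return (label, items) pairs, highest priority first."""
    # Items 1 and 2: resources explicitly selected for the source document.
    resources = [
        ("explicit_glossary_terms", list(explicit_terms)),
        ("explicit_tm_segments", list(explicit_tm_segments)),
    ]
    # Item 3: documents above the threshold, most similar first.
    eligible = [d for d in prior_documents if d.similarity > threshold]
    for doc in sorted(eligible, key=lambda d: d.similarity, reverse=True):
        # 3a: only glossary terms of at least two words are kept.
        multiword = [t for t in doc.glossary_terms if len(t.split()) >= 2]
        resources.append((f"{doc.name}:glossary_terms", multiword))
        resources.append((f"{doc.name}:parallel_segments",
                          list(doc.parallel_segments)))
        resources.append((f"{doc.name}:tm_segments", list(doc.tm_segments)))
    return resources
```

Data set (3) could be built the same way, using Human_document_similarity_m and omitting the glossary terms.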

[035] In block 226, data set (1) can be exported to a separate file, which can be downloaded from the system through the web interface and used to translate the source file in any external environment. If further work on the translation is done by performers through the system's web interface, this data is stored in the database 106 and later displayed as hints to the performers (translators, editors, proofreaders, etc.) as they work on each corresponding segment.

[036] In block 216, data sets (2) and (3) are added to the statistical and language-model-based machine translation systems to customize machine translation. They are added with a higher priority in the data hierarchy than the general data used to train the systems; in addition, each of the subsets listed above is assigned an individual priority based on the sort order described above. If necessary, additional training or retraining of the machine translation system is performed automatically after these resources are added.

[037] In block 222, a preliminary machine translation of each source-text segment is performed by all available machine translation systems preselected for the source document. Each machine translation system that supports customization is customized in block 216 using data sets (2) and (3) created in block 214. Each segment sent to each machine translation system is accompanied by the data set (4) created for that segment in block 214. Data set (4) is used as follows: we take each parallel segment in set (4) that partially matches the source segment and determine the sets of words (substrings) it shares with the source segment; for each shared set of words we determine its translation in that parallel segment, using the word-by-word alignment built when the parallel segment was added to the translation memory. We then consider possible combinations of different substrings in order to cover as much of the source segment text as possible; only substrings that consist of several words or contain low-frequency words are added to the base substrings (those obtained from a single parallel segment). The result is a set of alternative coverings of the source segment by substrings whose translations come from partially matching parallel segments in the translation memory.

We then take the segments with a positive value of the Human_terminology_match_strict_i,j metric from data set (4) and, using the word-by-word alignment built when each parallel segment was added to the translation memory, extract from them translation variants for low-frequency words and glossary terms (several variants with different admissible word forms may be extracted for each word or term). These translation variants are then added, as translated substrings, to those coverings in which the words and terms in question do not yet belong to any translated substring.

Thus, for each source segment we obtain a set of alternative coverings of the segment by substrings, each carrying its own translation. For each covering we determine a match percentage, calculated as the ratio of the number of segment characters covered by substrings to the total number of characters in the segment.
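
The match percentage of a covering can be computed as in the following minimal sketch. It assumes each substring is given as a half-open character range (start, end) within the segment; the function name `coverage_ratio` is hypothetical. Overlapping substrings are counted once.

```python
def coverage_ratio(segment, spans):
    """Share of segment characters covered by at least one substring.

    spans: iterable of (start, end) half-open character ranges (assumed format).
    """
    if not segment:
        return 0.0
    covered = [False] * len(segment)
    for start, end in spans:
        # Clamp each range to the segment so stray offsets cannot fail.
        for k in range(max(start, 0), min(end, len(segment))):
            covered[k] = True
    return sum(covered) / len(segment)
```

For example, covering "open" and "valve" in the 14-character segment "open the valve" gives 9/14, about 64%.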

Each such variant of segment markup with translated substrings is then sent to each available machine translation system. For each markup variant received, each machine translation system produces a translation of the segment together with its own confidence metric for the translation quality, if the system supports such a metric. For each translation variant, regardless of which system produced it, a common metric is also calculated that characterizes the fluency of the translated text in the target language (for example, a metric based on perplexity, the degree of uncertainty in a probabilistic language model); this uses the statistical target-language model customized in block 216. We then discard machine translation variants that contain incorrect terminology (terminology is checked against the glossaries explicitly selected for the project). Next, for each machine translation system we select the translation with the highest value of the system's own confidence metric, if the system produces one, or otherwise the translation with the best value of our own perplexity-based fluency metric. The metric values are stored in the database together with the machine translation variants.
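
The per-system selection described above might look like the following sketch. Several simplifications are assumptions of this example and not part of the described system: the glossary check is a plain case-insensitive substring test (a real check would need morphology), the language model is represented only by per-token natural-log probabilities supplied with each candidate, and the names `perplexity`, `violates_glossary` and `pick_best_for_system` are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities (lower is more fluent)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def violates_glossary(source, translation, glossary):
    """True if a glossary source term occurs but its target term is missing."""
    s, t = source.lower(), translation.lower()
    return any(src in s and tgt not in t for src, tgt in glossary.items())


def pick_best_for_system(source, candidates, glossary):
    """Choose one translation among the candidates of a single MT system.

    candidates: dicts with 'text', 'token_logprobs' and optional 'confidence'.
    """
    valid = [c for c in candidates
             if not violates_glossary(source, c["text"], glossary)]
    if not valid:
        return None
    # Prefer the system's own confidence when every candidate carries one.
    if all(c.get("confidence") is not None for c in valid):
        return max(valid, key=lambda c: c["confidence"])
    # Otherwise fall back to the external perplexity-based fluency metric.
    return min(valid, key=lambda c: perplexity(c["token_logprobs"]))
```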

[038] In an implementation of the present invention, while a translator (or an editor, proofreader, etc.) translates or edits a segment through the web user interface 106, all of his or her actions in the web interface, including keystrokes, mouse clicks on interface elements, use of keyboard shortcuts, etc., are recorded and sent to the server, where they are stored in the database. From this raw data, individual productivity and quality metrics are calculated asynchronously, such as the time spent translating (or editing or proofreading) a segment and the number of substantial changes made to the segment's translation (for example, replacing one word with another, especially for low-frequency words) and insubstantial changes (changes to word endings for agreement, etc.). Changes are recorded at each workflow stage relative to the translation as it stood after the previous stage (for example, changes made by the editor after the translator, or by the translator after the machine translation system). In addition, a professional translation quality assessment may be carried out selectively for individual segments of the translated document. This assessment is based on an error classification methodology: a linguistic quality control specialist thoroughly analyzes the translation of each segment in the sample (usually a small one) and records, for each segment, the types and severity of the errors found, in accordance with the approved error classification rules. Both types of metrics are used together to assign a reputation score to the performer (translator, editor or proofreader).

For each segment in which machine translation was edited, we count the following events: (1) substantial changes in the translation of terminology (rephrasing), when one or more words are replaced with other words; (2) changes in word order; (3) word agreement (changes to word endings, especially in languages with complex agreement rules and rich morphology, etc.). We then calculate, for each machine translation system, the expected number of changes as a tabular function of the segment length and of either the system's own confidence metric for the translation or the external fluency metric based on the perplexity of the text under the corresponding statistical target-language model.
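
The three event types could be counted roughly as in the sketch below. This is a crude heuristic and an assumption of this example, not the method of the described system: a reordering is detected when the edited text contains the same words in a different order, and an agreement change is approximated by two words sharing a common prefix of `stem_len` characters, where a real system would rely on the morphological dictionaries. The function name `classify_edits` is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def classify_edits(mt_text, edited_text, stem_len=4):
    """Count rephrase / reorder / agreement events between two texts."""
    before = mt_text.lower().split()
    after = edited_text.lower().split()
    counts = {"rephrase": 0, "reorder": 0, "agreement": 0}
    # Same multiset of words in a different order: one reordering event.
    if before != after and Counter(before) == Counter(after):
        counts["reorder"] = 1
        return counts
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = before[i1:i2], after[j1:j2]
        for k in range(max(len(old), len(new))):
            o = old[k] if k < len(old) else ""
            n = new[k] if k < len(new) else ""
            # Shared prefix is a rough stand-in for "same lemma, new ending".
            if o and n and o[:stem_len] == n[:stem_len]:
                counts["agreement"] += 1
            else:
                counts["rephrase"] += 1
    return counts
```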

For each performer we also have data on all the segments he or she has worked on before, and in particular on all segments in which he or she edited machine translation. We can calculate the number of changes of each type made and the total time spent editing each segment. We can then calculate, for each performer and each machine translation system, a set of constants t₁, t₂, t₃ that give the best linear interpolation on the collected primary data.
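
One way to obtain such constants is an ordinary least-squares fit. The sketch below assumes, as an illustration only, that editing time is modeled as t₁·n₁ + t₂·n₂ + t₃·n₃, where n₁, n₂, n₃ are the counts of rephrasing, reordering and agreement changes in a segment; this model form and the function name `fit_edit_time_constants` are assumptions of this example, since the formula is not available in this text.

```python
def fit_edit_time_constants(rows):
    """Least-squares fit of [t1, t2, t3].

    rows: (n_rephrase, n_reorder, n_agreement, seconds) per edited segment.
    Assumed model: seconds ~ t1*n1 + t2*n2 + t3*n3.
    """
    # Normal equations (X^T X) t = X^T y for the three unknowns.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * r[3] for r in rows) for i in range(3)]
    m = [a[i] + [b[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("not enough varied data to fit all three constants")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [x - factor * y for x, y in zip(m[row], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]
```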


[039] Based on the machine translation metric values calculated for each segment translation in block 222, the length of the source segment, and the tabular functions for the expected number of changes of each type determined earlier for each machine translation system, we calculate the expected number of changes for the machine translation of each segment. We then select the machine translation variant with the lowest expected number of changes (changes of different types are weighted by the values of the constants t₁, t₂, t₃ averaged over the entire database). If several variants obtain close values of this metric, the one with the highest percentage of segment text covered by translated substrings is selected. The selected variant, together with the expected number of changes of each type, is stored in the database and included in the aggregated data set formed in block 226.
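
The selection rule of this paragraph can be sketched as follows. The text does not say how "close" values are defined, so the relative `tolerance` parameter is an assumption of this example, as are the dictionary keys and the function name `select_mt_variant`.

```python
def select_mt_variant(variants, weights, tolerance=0.05):
    """Pick the variant with the lowest weighted expected number of changes.

    variants: dicts with 'expected_changes' (n1, n2, n3) and 'coverage'.
    weights:  (t1, t2, t3) averaged over the whole database.
    tolerance: assumed relative margin within which costs count as close.
    """
    def cost(variant):
        return sum(w * n for w, n in zip(weights, variant["expected_changes"]))

    best_cost = min(cost(v) for v in variants)
    margin = tolerance * max(best_cost, 1e-9)
    close = [v for v in variants if cost(v) - best_cost <= margin]
    # Among near-equal variants, prefer the highest substring coverage.
    return max(close, key=lambda v: v["coverage"])
```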

[040] For each parallel segment from the translation memory included in set (3), which is created in block 214 for customizing the statistical language model, we have complete data on the workflow that took place when the document was translated: who translated it (or edited the machine translation), who edited and proofread it, and so on. We can also identify the documents containing these segments, the professional quality scores assigned by linguistic quality control specialists to samples of the document text, the automatically collected counts of changes of different types (change in the translation of a term, change in word order, word agreement) made to the segment translations by editors and proofreaders after the translator (or machine translation editor), and the time they spent changing each segment. In block 230, we then calculate a weighted translation quality metric for each performer (translator, editor, proofreader, etc.) who took part in translating the documents included in set (3).

Where LQA_metric_m is the normalized professional translation quality score calculated for the subset of the sample from document m that includes only the professionally evaluated segments on which the given performer worked.

Based on the calculated value of the LQA_total metric, we exclude from further consideration the performers for whom the metric falls below an empirically determined threshold. This threshold depends mainly on the translation quality requirements set for the project by the project manager when the project is created and initially configured. The LQA_total values are then clustered, and the performers within each cluster are ordered by the value of the Weight_total metric. Thus, in block 230 a sorted list of preferred performers for translating the source document is formed.
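
The filtering and ordering step might be sketched as below. The clustering method is not specified in the text, so this example substitutes fixed-width bins of LQA_total as a stand-in; it also assumes that higher LQA_total and higher Weight_total are both preferable. The names `rank_performers` and `cluster_width` are hypothetical.

```python
def rank_performers(performers, lqa_threshold, cluster_width=0.1):
    """Return performers above the threshold, best cluster first.

    performers: dicts with 'name', 'lqa_total' and 'weight_total'.
    cluster_width: assumed bin width replacing the unspecified clustering.
    """
    kept = [p for p in performers if p["lqa_total"] >= lqa_threshold]

    def cluster(p):
        # Fixed-width binning of the normalized LQA_total score.
        return int(p["lqa_total"] / cluster_width)

    # Higher cluster first, then higher Weight_total inside each cluster.
    return sorted(kept, key=lambda p: (-cluster(p), -p["weight_total"]))
```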

For each translator we also determine the expected time for editing the machine translation of the source document. It is calculated from the values of the constants t₁, t₂, t₃ for each translator and each machine translation system, the machine translation variants selected for each segment in block 224, and the expected number of corrections of each type assigned to the machine translation of each segment.
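
Under the assumption that editing time is a weighted sum of the expected corrections of each type, the document-level estimate for one translator could be computed as in this sketch. The data layout and the name `expected_editing_time` are hypothetical.

```python
def expected_editing_time(segments, constants):
    """Expected time for one translator to post-edit a whole document.

    segments:  (mt_system, (n1, n2, n3)) per segment, where the tuple holds
               the expected corrections of each type for the chosen variant.
    constants: {mt_system: (t1, t2, t3)} fitted for this translator.
    """
    total = 0.0
    for system, counts in segments:
        t = constants[system]
        # Assumed linear model: time = t1*n1 + t2*n2 + t3*n3 per segment.
        total += sum(ti * ni for ti, ni in zip(t, counts))
    return total
```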

[041] The project manager reviews the set of performers recommended for the project (translators, editors, proofreaders, etc.), the expected time and effort of editing the machine translation, and the project statistics, which include the number of translation memory matches of various kinds, the number of occurrences of glossary terms, and the data volumes and similarity metrics calculated for the data sets formed in block 214. In the system's web interface the project manager also has data on the availability of the recommended performers, loaded from the translation project management system. The project manager makes the final decision on the team and then, through the web interface, initiates the sending of invitations to participate in the project to the selected performers. This decision can also be made automatically, based on real-time data on the availability of performers, the projected workload of each performer on other projects he or she has accepted, and the required delivery date for the translation.

[042] After receiving a notification inviting them to take part in the translation project, each performer (translator, editor, proofreader, etc.) accepts or declines the invitation. After confirming participation, the assigned performer logs in to the system in block 232 and starts working on the translation of the document through the web user interface 106. The system supports two types of projects: sequential, in which each workflow stage begins only after the previous stage has been fully completed for the entire document, and parallel, in which the workflow stages are carried out sequentially at the level of each individual segment, so that the document may simultaneously contain segments at different workflow stages.
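
The difference between the two project types can be expressed as a gating rule for when a stage may start on a segment. The sketch below is illustrative only; the stage names in `STAGES`, the representation of completed work as (stage, segment) pairs and the function name `can_start` are assumptions of this example.

```python
STAGES = ["translation", "editing", "proofreading"]  # assumed stage names


def can_start(stage, segment_id, all_segments, done, mode):
    """Decide whether `stage` may begin on `segment_id`.

    done: set of (stage, segment_id) pairs already completed.
    mode: 'sequential' or 'parallel' project type.
    """
    index = STAGES.index(stage)
    if index == 0:
        return True
    previous = STAGES[index - 1]
    if mode == "parallel":
        # Only this segment must have passed the previous stage.
        return (previous, segment_id) in done
    if mode == "sequential":
        # The whole document must have passed the previous stage.
        return all((previous, s) in done for s in all_segments)
    raise ValueError(f"unknown project mode: {mode}")
```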

[043] After all segments of the document have passed all the specified workflow stages and the translation of the document is complete, the translation is passed to block 128, where the final translation file is generated. The translated document 130 has the same text or binary file format as the source file. The translated document is assembled from the translated text of each segment and the metadata file created when the source file was parsed in the segmentation module 114.
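One way to picture this assembly step is a skeleton of the source file in which each segment has been replaced by a placeholder. The sketch below assumes such a skeleton plays the role of the metadata file; the `{{seg:N}}` placeholder syntax and the `assemble` function are hypothetical, and the actual metadata format is not specified in the description.

```python
import re

# Hypothetical placeholder left in the skeleton where segment N was extracted.
PLACEHOLDER = re.compile(r"\{\{seg:(\d+)\}\}")


def assemble(skeleton: str, translations: dict) -> str:
    """Put the final translation of every segment back into the skeleton."""
    def substitute(match):
        seg_id = int(match.group(1))
        if seg_id not in translations:
            raise KeyError(f"segment {seg_id} has no final translation")
        return translations[seg_id]

    return PLACEHOLDER.sub(substitute, skeleton)


skeleton = "<p>{{seg:1}}</p><p>{{seg:2}}</p>"
translated = assemble(skeleton, {1: "Hello.", 2: "World."})
```

Because the formatting lives in the skeleton rather than in the segments, the output keeps the layout of the source file, and a missing segment translation is reported instead of being silently skipped.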

[044] The translated document 130 is then automatically delivered to the translation customer through the web user interface, from which the customer downloads the translated file. If the source files came from an external information system, the translated files can be placed back into that system through the programmatic APIs contained in the integration layer 302.

[045] Fig. 3 is a schematic illustration of a platform that implements the distributed network translation system described in the present invention. As shown in Fig. 3, the platform consists of three layers: the integration layer 302, the translation platform 304, and the additional modules 306. The integration layer 302 loads the source files for translation 102 into the system and converts them into a set of text segments. The integration layer 302 is present in an implementation of the system when external information systems, such as content management systems, document management systems, or general-purpose portals, serve as the sources of the source files. The translation platform layer 304 includes the web user interface, the server-side data processing modules, and the data storage layer. The web user interface gives access to the system to translators, project managers, editors, proofreaders, linguistic quality control specialists, terminology specialists, representatives of the translation customer, etc. It contains separate windows for setting up and managing a project and for working directly on translating and editing a document (with the option for multiple performers to work on the translation simultaneously). Server-side data processing includes converting source files to various formats, validating the translation, pre-translating documents, etc. The layer of additional modules 306 includes ABBYY Lingvo™ dictionaries and other dictionaries, machine translation systems used for pre-translating documents with subsequent editing, an aligner that creates, from parallel documents of arbitrary format and layout, an XML file suitable for export to the translation memory, and a spelling and punctuation checking module. Other required subsystems can also be integrated with the system through the integration bus with programmatic APIs.

Claims

1. A distributed network language translation system for translating source files, consisting of:
cloud servers accessible over the network and simultaneously available to multiple performers and multiple translation customers connecting to the servers via the Internet;
a user interface that allows multiple translation customers to upload source files to the language translation system for translation, to receive information on the terminology translation options offered by the performers, and to view and approve glossaries for the project;
a segmentation module that splits the source file into a set of logical segments and sends these segments for translation to the preliminary translation module;
a translation memory database that stores previously translated logical segments and supports searching for segments similar to a given segment;
morphological dictionaries that contain information about the acceptable forms of words and metadata pertaining both to the concept as a whole and to individual word forms, including word frequency, and that make it possible to match words in the source and translated text of a segment with the corresponding dictionary entries;
a glossary module that stores glossary terms and finds their occurrences in the source and translated text of segments, including multi-word terms;
a module for creating accompanying data that searches the translation memory for segments matching the source segment, calculates their degree of similarity and the degree of similarity of the terminology used in them, and generates a data set that serves as a reference for the performers and for individually tuning the machine translation systems to translate this document;
a word-by-word alignment module that establishes a correspondence between the words of the original and the translation of a parallel segment;
a machine translation processing module that selects the best machine translation of a segment from the many options generated by each machine translation system and aggregates these translations for all document segments;
a module for determining the set of performers recommended for work on the translation of the document, based on the history of the work they performed earlier and the professional quality assessments given for documents similar to the source file for translation;
a module that collects data on all actions performed by the performers in the web interface while working on the translation of documents;
a module that assembles the final translations of the individual segments into the final translated file.

2. The system of claim 1, wherein the indexed logical segments and their translation are stored in the translation memory together with the metadata corresponding to the segment.

3. The system according to claim 2, in which the metadata stored together with the indexed logical segments includes data on the time spent working on the segment, the performers who carried out the translation, the number of changes of various types made to the translation by the various performers, and the professional quality assessments assigned to this logical segment by linguistic quality control specialists.

4. The system of claim 1, wherein the logical segments can be phrases, sentences, parts of sentences, or idiomatic expressions.

5. The system according to claim 1, in which parallel logical segments from the translation memory are matched to the source logical segment in accordance with a segment similarity metric, which is a metric of the match between two strings that takes into account the possibility of several different valid word forms for the same word.

6. The system according to claim 1, in which the user interface displays in real time the translations of logical segments produced by the multiple performers currently working on the translation of the document.

7. The system of claim 1, wherein the translation can be performed by both professional translators and machine translation systems.

8. The system according to claim 1, in which the performers are selected from the group including translators, editors, proofreaders.

9. A method of translating a source file from the source language into the target language, consisting of the following steps:
receiving a request to translate the source file in a distributed network system;
splitting the source file into a set of logical segments;
searching for similar segments and segments with similar terminology in the database of linguistic resources, which consists of previously translated parallel logical segments, dictionaries and glossaries, and calculating a quality metric for the logical segments found in the translation memory;
sending multiple requests for the translation of each segment to several machine translation systems located in the distributed network, passing to each machine translation system the data set accompanying the source document for individually tuning that system, and also passing, for each segment, the set of translation options for parts of its text, so that the machine translation system translates the parts for which no translation option exists and assembles the final machine translation;
creating a set of translation options based on segments similar to the source segment, in which parts of the text of the source segment are marked up with translation options from the similar segments, and parts that have no translation option are translated using the glossaries, dictionaries and phrases generated by the machine translation systems from the translation memory databases;
collecting the set of segment translation options from the various machine translation systems;
choosing the best machine translation option from among the options received;
selecting the most suitable professional performers registered in the system, based on the predicted level of quality of their work, which is calculated on the basis of previously translated documents similar to this source document;
sending invitations to participate in the project to the selected performers;
providing the multiple performers with a web user interface for simultaneous multi-user translation work in real time;
assembling the translations of the individual segments into the resulting translated file.

10. The method according to claim 9, in which the source file can have either a text or a binary format.

11. The method according to claim 9, in which the professional assessment of translation quality is calculated by recording the type and severity of errors in a sample of segments of the translated document, and the translation of the document can be performed either from scratch by a translator or by editing a preliminary machine translation.

12. The method according to claim 9, in which the best machine translation option is selected from among both the basic translation options and the translation options generated by an advanced translation memory technique, in which a set of translation options is created by marking up the text of the source segment with parts of parallel segments from the translation memory, indicating a translation option for each such part, and the machine translation system fills in the unmarked parts from the glossary, the dictionary or the set of phrases that it generated from the translation memory.

13. The method of claim 9, wherein the translated segments of the source document are stored in the translation memory for future use.