JP7732704B2

JP7732704B2 - Deep learning-based optical character recognition method and system

Info

Publication number: JP7732704B2
Application number: JP2024034245A
Authority: JP
Inventors: キル，テホ; ソ，ソクミン; キム，ドンヒョン; キム，ソンヒョン; イ，バド; キム，ユンシク; キム，デヒ; ソン，カヨン
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2023-03-06
Filing date: 2024-03-06
Publication date: 2025-09-02
Anticipated expiration: 2044-03-06
Also published as: JP2024126021A

Description

Article 30, Paragraph 2 of the Patent Act applies. [2203.05122] DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting (arxiv.org). Publication date: March 10, 2022.

Article 30, Paragraph 2 of the Patent Act applies. Oral presentation at a research meeting, NAVER DEVIEW 2023 (TRACK D), at COEX Grand Ballroom, Seoul, Korea. https://deview.kr/2023/sessions/560. Publication date: February 27, 2023.

Article 30, Paragraph 2 of the Patent Act applies. Oral presentation at a research meeting, ECCV 2022 TiE workshop (David Intercontinental, Gallery Hall). https://sites.google.com/view/tie-eccv2022/schedule https://sites.google.com/view/tie-eccv2022/home Publication date: October 24, 2022. Kil Tae-ho, Kim Sung-hyun, and Seo Sook-min, the above-named researchers affiliated with Naver Corporation, presented "Out of Vocabulary Scene Text Understanding" at the ECCV 2022 TiE workshop at the request of Naver Corporation.

This disclosure relates to an optical character recognition method and system, and more particularly to a deep learning-based optical character recognition method and system that generates reference points for referring to text regions contained in an image and extracts text from the image based on the image and the reference points.

Optical character recognition (OCR) is a technology that detects text in an image and recognizes what the detected text says. Conventional optical character recognition technology has been used to recognize characters in documents. However, recognizing text that is commonly seen in everyday life, such as on roadside signs, is technically difficult because of the wide variety of possible cases. For example, when text appears slanted or curved in an image, the text regions may take various non-rectangular shapes, which can make the text difficult to recognize. Text displayed in this way in the surrounding environment or on objects is called scene text.

To recognize such scene text, conventional technologies consist of a first stage, in which a detector detects text regions in an image and the detected regions are edited (for example, resized) and passed to a recognizer, and a second stage, in which the recognizer recognizes the text. However, with such conventional technologies, if a text region cannot be detected, text recognition fails along with it, and the content of the text may be lost when the detected region is resized. In addition, the artificial neural networks executed at each stage are trained redundantly, which can reduce the efficiency of computational resources. Moreover, when the detector is updated, the recognizer must also be retrained, which can be a disadvantage in terms of management and maintenance of related services.

Korean Laid-open Patent Publication No. 10-2015-0125376

This disclosure provides a deep learning-based optical character recognition method and system (apparatus) to solve the problems described above.

The present disclosure may be implemented in various ways, including as a method, an apparatus (system), or a computer-readable computer program.

According to one embodiment of the present disclosure, a deep learning-based optical character recognition method may include detecting at least one text region from an image, generating at least one reference point associated with the at least one text region, and extracting at least one piece of text from the image based on the image and the at least one reference point.

A computer-readable computer program may be provided for executing a method according to one embodiment of the present disclosure on a computer.

An optical character recognition system according to one embodiment of the present disclosure may include a detector that detects at least one text region from an image and generates at least one reference point associated with the at least one text region, and a recognizer that extracts at least one piece of text from the image based on the image and the at least one reference point.
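As a non-limiting illustration of this detector/recognizer arrangement, the following minimal Python sketch shows one way the two components could be wired together. The names `Detector`, `Recognizer`, `run_ocr`, `toy_detector`, and `toy_recognizer` are hypothetical and do not appear in this disclosure, and the toy functions merely stand in for trained networks. The point of the sketch is that the recognizer receives the whole image together with a reference point, not a cropped region.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]   # (x, y) reference point in pixel coordinates
Image = List[List[int]]       # placeholder image type: 2-D grid of pixel values

# Hypothetical interfaces: the detector yields reference points,
# the recognizer reads text from the full image given one reference point.
Detector = Callable[[Image], List[Point]]
Recognizer = Callable[[Image, Point], str]


def run_ocr(image: Image, detector: Detector,
            recognizer: Recognizer) -> List[Tuple[Point, str]]:
    """Detect reference points, then recognize text around each one."""
    refs = detector(image)
    # The recognizer is always handed the entire image, never a crop.
    return [(ref, recognizer(image, ref)) for ref in refs]


# Toy stand-ins so the sketch runs; a real system would use trained models.
def toy_detector(image: Image) -> List[Point]:
    return [(1.0, 0.0)]


def toy_recognizer(image: Image, ref: Point) -> str:
    x, y = int(ref[0]), int(ref[1])
    return f"pixel={image[y][x]}"


result = run_ocr([[7, 8, 9]], toy_detector, toy_recognizer)
```

Because the only value passed from detector to recognizer is a point, either component could in principle be replaced without changing the other's input format.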

According to various embodiments of the present disclosure, even if an error occurs in the process of detecting a text region in an image, text can be correctly extracted from the image by using the entire image and the reference points. In other words, even if a text region is detected incorrectly, loss of text content can be prevented. Furthermore, since the final text recognition result does not depend exclusively on the detected text region, the method can also be robust when extracting rotated text within an image.

According to various embodiments of the present disclosure, the detector and the recognizer are trained together in one stage instead of two, which can improve training efficiency. In addition, by using the image and the reference points without resizing the detected text regions, it is possible to prevent recognition errors that occur when long text is distorted or truncated. Moreover, by sharing the multi-scale features extracted from the backbone between the detector and the recognizer, model performance and inference speed can be improved.

According to various embodiments of the present disclosure, text in an image can be recognized and decoded by using reference points and the features of the entire image, without pooling or masking regions of interest within the image. This allows text to be extracted smoothly even if a region of interest is detected incorrectly.

According to various embodiments of the present disclosure, complex scene text, such as intersecting text within an image, text within text, and text in various fonts and sizes, can be extracted more accurately. In addition, since the final text recognition result does not depend heavily on the detection of text regions within the image, the method can be robust to text region detection errors. Moreover, text can be extracted correctly even if the text within the image is rotated.

According to various embodiments of the present disclosure, by matching each word-level piece of text to the text region that contains it, the detected text can be linked together more organically. As a result, when a translation service is provided after text is extracted from a document by optical character recognition, a higher quality translation service can be provided.

The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the claims by a person having ordinary skill in the art to which the present disclosure pertains (referred to as an "ordinary artisan").

Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, in which like reference numerals indicate like elements, but the embodiments are not limited thereto.

FIG. 1 illustrates an example of an optical character recognition method according to one embodiment of the present disclosure.
FIG. 2 is a schematic diagram illustrating a configuration in which an information processing system is communicatively coupled to multiple user terminals for optical character recognition according to one embodiment of the present disclosure.
FIG. 3 is a block diagram illustrating the internal configuration of a user terminal and an information processing system according to one embodiment of the present disclosure.
FIG. 4 is a schematic diagram illustrating an optical character recognition system according to one embodiment of the present disclosure.
FIG. 5 is a diagram illustrating an example of extracting text from an image according to one embodiment of the present disclosure.
FIG. 6 is a diagram illustrating an example process of an optical character recognition system according to one embodiment of the present disclosure.
FIG. 7 is a diagram illustrating an example of an optical character recognition result according to one embodiment of the present disclosure.
FIG. 8 is a diagram illustrating an example in which character-level text is detected according to one embodiment of the present disclosure.
FIG. 9 is a diagram illustrating an example of detecting line-level and paragraph-level text regions in a document according to one embodiment of the present disclosure.
FIG. 10 is a flowchart illustrating a method according to one embodiment of the present disclosure.

Specific details for implementing this disclosure are described below with reference to the accompanying drawings. However, in the following description, detailed descriptions of well-known functions and configurations are omitted where they might unnecessarily obscure the gist of this disclosure.

In the accompanying drawings, identical or corresponding components are given the same reference numerals. In the following description of the embodiments, duplicate descriptions of identical or corresponding components may be omitted. However, the omission of a description of a component is not intended to mean that the component is not included in an embodiment.

Advantages and features of the disclosed embodiments, and methods for achieving them, will become apparent from the embodiments described below taken in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be embodied in a variety of different forms. These embodiments are provided solely so that this disclosure will be complete and will adequately convey the scope of the invention to an ordinary artisan.

The terms used in this specification are briefly explained, followed by a detailed description of the disclosed embodiments. The terms used in this specification have been selected, as far as possible, from general terms that are currently in wide use, taking into account their function in this disclosure. However, these may change depending on the intentions of those working in the relevant field, legal precedents, the emergence of new technology, and other factors. In certain cases, the applicant has selected terms arbitrarily, and in such cases their meanings are described in detail in the corresponding part of the description. Therefore, the terms used in this disclosure should be defined based on their meanings and the overall content of this disclosure, rather than simply by their names.

In this specification, the singular includes the plural unless the context clearly specifies the singular. Likewise, the plural includes the singular unless the context clearly specifies the plural. Throughout the specification, when a part is said to include a component, this means that other components may also be included, not that they are excluded, unless specifically stated otherwise.

Furthermore, as used herein, the term "module" or "unit" refers to a software or hardware component, and a "module" or "unit" performs some role. However, a "module" or "unit" is not limited to software or hardware. A "module" or "unit" may be configured to reside on an addressable storage medium or may be configured to run on one or more processors. Thus, by way of example, a "module" or "unit" may include components such as software components, object-oriented software components, class components, and task components, as well as at least one of a process, function, attribute, procedure, subroutine, program code segment, driver, firmware, microcode, circuit, data, database, data structure, table, array, or variable. The functionality provided within the components and "modules" or "units" may be combined into a smaller number of components and "modules" or "units," or further separated into additional components and "modules" or "units."

According to one embodiment of the present disclosure, a "module" or "unit" may be implemented with a processor and memory. "Processor" should be broadly interpreted to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, and the like. In some environments, "processor" may also refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field programmable gate array (FPGA), and the like. "Processor" may also refer to a combination of processing devices, such as a combination of a DSP and a microprocessor, a combination of multiple microprocessors, a combination of one or more microprocessors in conjunction with a DSP core, or any other such combination. Additionally, "memory" should be broadly interpreted to include any electronic component capable of storing electronic information. "Memory" may refer to various types of processor-readable media, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic or optical data storage devices, registers, and the like. Memory is said to be in electronic communication with a processor if the processor can read information from and/or write information to the memory. Memory that is integrated into a processor is in electronic communication with the processor.

In this disclosure, a "system" may include at least one of a server device and a cloud device, but is not limited to these. For example, a system may consist of one or more server devices. As another example, a system may consist of one or more cloud devices. As yet another example, a system may consist of a server device and a cloud device configured and operating together.

In this disclosure, "display" may refer to any display device associated with a computing device, for example, any display device that is controlled by the computing device or that can display any information or data provided by the computing device.

In this disclosure, "each of a plurality of A" may refer to each of all the components included in the plurality of A, or to each of some of the components included in the plurality of A.

End-to-end scene text spotting techniques that recognize text instances of arbitrary shape have improved significantly. A common approach in scene text detection is to pool a region of interest or apply segmentation masking in order to restrict the features to a single text instance. However, if the detection is inaccurate (for example, if one or more characters are cut off), the recognizer in such approaches may have difficulty decoding or generating the correct character image sequence.

Considering that, for example, in scene text spotting it is difficult to determine word boundaries accurately with a detector alone, this disclosure provides an optical character recognition method that includes a detection-agnostic end-to-end recognizer. Instead of using detected text regions, the disclosed method links the detector and the recognizer through a single reference point for each piece of text, which can reduce the tight dependency between the detector and the recognizer. With the disclosed method, the recognizer can recognize the text indicated by a reference point using the features of the entire image. In other words, because only a single point is required to recognize text, text can be extracted from an image without an arbitrarily shaped detector or bounding polygons. Additionally, this disclosure can provide an optical character recognition method and system that is competitive on regular and arbitrarily shaped text spotting benchmarks and robust to text detection errors.

End-to-end scene text spotting techniques are used in a variety of fields, including information extraction, image retrieval, and visual question answering. In general, an end-to-end scene text spotting pipeline consists of a detector and a recognizer. The detector localizes text instances in an image in the form of boxes or polygons, and the recognizer receives each localized text region as input and decodes the characters in each image patch.

Conventional scene text spotting pipelines used a rather tightly coupled framework between the detector and the recognizer. In particular, if the detector makes an error on a text region, the recognizer may be supplied with an image in which part of the target text has been cut off. In this case, the recognition performance of the pipeline can depend heavily on the quality of the image patches output by the detector and recognizer. More recently, end-to-end scene text spotting methods use a more loosely coupled framework by extracting features with region of interest (ROI) pooling or masking and limiting the recognizer's input region to a single word. Using localized features in the recognizer can reduce the recognizer's dependency on regions cut off by the detector, but detector errors still accumulate and recognition failures can occur. Furthermore, feature pooling and masking require data annotated with bounding boxes or bounding polygons to train an end-to-end scene text spotting model, even when the final application does not need accurate boundary information.

This disclosure provides a new end-to-end recognizer that greatly reduces the dependency on the accuracy of detection results. Instead of relying on the detector to extract precise text regions, the detector can generate a reference point for each text instance. The recognizer can then comprehensively recognize the text around that reference point. Specifically, given a reference point, the recognizer can be trained to attend to the region of a specific text instance while decoding the text sequence. In addition, because only a single reference point is required from the detector, a wider variety of detection algorithms and annotations can be applied. Moreover, rotated or curved text instances can be handled naturally without pooling operations or polygon-type annotations.

In this disclosure, a "text region" may refer to a region of an image that contains text. A text region may be represented as a rectangular bounding box, but is not limited to this, and may also be represented as a bounding polygon. A text region may also be defined by vertex coordinates. For example, if a text region is a bounding box, it may be defined by top-left, top-right, bottom-right, and bottom-left coordinate values.
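As a non-limiting illustration, a text region defined by vertex coordinates could be held in a small container such as the one sketched below. The class name `TextRegion`, its fields, and the example coordinates are hypothetical and are not part of this disclosure. The same container covers both the four-vertex bounding box and a bounding polygon with more vertices.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]  # (x, y) in image pixel coordinates


@dataclass(frozen=True)
class TextRegion:
    """Hypothetical container for one text region, stored as ordered vertices."""
    vertices: Tuple[Point, ...]

    @classmethod
    def from_box(cls, top_left: Point, top_right: Point,
                 bottom_right: Point, bottom_left: Point) -> "TextRegion":
        # Bounding-box form: four vertices in clockwise order.
        return cls((top_left, top_right, bottom_right, bottom_left))

    @property
    def is_box(self) -> bool:
        # Four vertices are treated as a bounding box; more form a polygon.
        return len(self.vertices) == 4


# Example values chosen only for illustration.
box = TextRegion.from_box((10.0, 20.0), (110.0, 20.0),
                          (110.0, 60.0), (10.0, 60.0))
polygon = TextRegion(((0.0, 0.0), (50.0, 5.0), (100.0, 0.0),
                      (100.0, 30.0), (50.0, 35.0), (0.0, 30.0)))
```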

In this disclosure, a "recognizer" may refer to a module in an optical character recognition system that recognizes text regions within an image. The recognizer may include a text decoder. That is, the recognizer may recognize a text region to generate a character image sequence and use the text decoder to convert the character image sequence into text.

FIG. 1 illustrates an optical character recognition method according to one embodiment of the present disclosure. A first example 110 shows a conventional optical character recognition method. The conventional method includes a first stage, in which a detector detects a text region 112 in an image and the detected text region 112 is resized and passed to a recognizer, and a second stage, in which the recognizer recognizes the text in the detected text region 112. However, in this conventional method, if the detector detects the text region 112 incorrectly, text recognition may fail along with it. In addition, the content of the text may be lost when the detected text region 112 is resized. Because the artificial neural networks executed at each stage must be trained redundantly, the efficiency of computing resources may decrease. Moreover, when the detector is updated, the recognizer must also be retrained, which may increase the time and cost of managing or maintaining related services.

A second example 120 shows the optical character recognition method of the present disclosure. In one embodiment, at least one text region may be detected from an image. Specifically, features related to word-level text may be extracted from the image, and position information for the at least one text region may be generated based on the features. The position information of a text region may be the vertex coordinates of its bounding box or bounding polygon.

In one embodiment, a reference point 122 associated with a text region may be generated. The reference point 122 may include the center point of the text region. In this case, the center point of the text region may be determined based on the position information of the text region. As shown in the figure, if the bounding box of the text region contains the word "sesame," the reference point 122 may be generated at the center of "sesame." An example of how reference points are displayed is described in detail below with reference to FIG. 7.
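As a non-limiting illustration of determining a center point from position information, the sketch below takes the mean of the vertex coordinates of a bounding box or bounding polygon. The function name `reference_point` is hypothetical, and the vertex mean is an assumption made for this example; the disclosure states only that the center point may be determined from the position information, not how.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in image pixel coordinates


def reference_point(vertices: Sequence[Point]) -> Point:
    """Return the mean of the vertices as a stand-in for the region center.

    Assumption: the vertex mean is used here for simplicity; other
    definitions of "center" are equally compatible with the description.
    """
    if not vertices:
        raise ValueError("a text region needs at least one vertex")
    n = len(vertices)
    cx = sum(x for x, _ in vertices) / n
    cy = sum(y for _, y in vertices) / n
    return (cx, cy)


# Bounding box given as top-left, top-right, bottom-right, bottom-left.
box = [(10.0, 20.0), (110.0, 20.0), (110.0, 60.0), (10.0, 60.0)]
center = reference_point(box)  # (60.0, 40.0)
```

For an axis-aligned box the vertex mean coincides with the geometric center. For an irregular polygon it is only an approximation, which is acceptable in this sketch because the recognizer is described as tolerant of imprecise detection.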

In one embodiment, text may be recognized or extracted from the image based on the image and the reference point 122. Specifically, multiple offset points 124_1 to 124_4 adjacent to the reference point 122 may be generated. The offset points 124_1 to 124_4 may guide the recognizer to attend to the areas adjacent to them when extracting text. At least one piece of text may then be extracted from the image based on the image and the offset points 124_1 to 124_4. For example, the recognizer may extract "sesame" from the image by attending to the offset points 124_1 to 124_4 in the image. Although FIG. 1 shows four offset points, the number of offset points is not limited to four.
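As a non-limiting illustration, offset points adjacent to a reference point could be obtained by adding displacement vectors to the reference point and keeping the results inside the image, as sketched below. The function name `offset_points`, the fixed displacement values, and the clamping step are assumptions made for this example. The disclosure does not specify how the offsets are obtained, and in a trained model they would more plausibly be predicted by the network than fixed in advance.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in image pixel coordinates


def offset_points(ref: Point, offsets: Sequence[Point],
                  width: int, height: int) -> List[Point]:
    """Shift the reference point by each (dx, dy) and clamp to the image."""
    points = []
    for dx, dy in offsets:
        # Clamp so that every sampling location stays inside the image.
        x = min(max(ref[0] + dx, 0.0), width - 1)
        y = min(max(ref[1] + dy, 0.0), height - 1)
        points.append((x, y))
    return points


# Four hypothetical displacements spread along a horizontal word.
example_offsets = [(-30.0, 0.0), (-10.0, 0.0), (10.0, 0.0), (30.0, 0.0)]
pts = offset_points((60.0, 40.0), example_offsets, width=200, height=100)
# pts == [(30.0, 40.0), (50.0, 40.0), (70.0, 40.0), (90.0, 40.0)]
```

The recognizer would then sample image features at or near these locations, so the number and spread of the offsets determine how much of the surrounding text it can attend to.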

In one embodiment, multiple text extractions may be performed at least partially in parallel. For example, a region containing "sesame" may be detected as a first text region, and a region containing "stick" may be detected as a second text region. In this case, a first reference point may be generated for the first text region and a second reference point for the second text region. "Sesame" may then be extracted from the first text region based on the image and the first reference point. At least partially in parallel with the extraction of "sesame," "stick" may be extracted from the second text region based on the image and the second reference point.
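As a non-limiting illustration of at least partially parallel extraction, the sketch below submits one recognition task per reference point to a thread pool. The names `FAKE_IMAGE`, `recognize`, and `extract_all` are hypothetical, and the lookup table stands in for a trained recognizer operating on real image features. A practical implementation would more likely batch the reference points through the model on an accelerator; the thread pool is used here only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) reference point

# Stand-in "image": maps a reference point to the text located around it.
FAKE_IMAGE: Dict[Point, str] = {
    (60.0, 40.0): "sesame",   # first text region
    (160.0, 40.0): "stick",   # second text region
}


def recognize(image: Dict[Point, str], ref: Point) -> str:
    """Placeholder recognizer: takes the whole image plus one reference point."""
    return image[ref]


def extract_all(image: Dict[Point, str], refs: List[Point]) -> List[str]:
    """Run one recognition per reference point; tasks may overlap in time."""
    with ThreadPoolExecutor() as pool:
        # map() returns results in the same order as refs.
        return list(pool.map(lambda r: recognize(image, r), refs))


texts = extract_all(FAKE_IMAGE, [(60.0, 40.0), (160.0, 40.0)])
# texts == ["sesame", "stick"]
```

Each task depends only on the shared image and its own reference point, which is what allows the extractions to proceed independently of one another.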

このような構成により、イメージのテキスト領域を検出する過程にミスがあっても、イメージ全体とリファレンスポイントを利用することにより、イメージからテキストを正しく抽出し得る。すなわち、テキスト領域の検出ミスが発生しても、テキスト内容の損失を防止し得る。また、最終的なテキスト認識結果が検出されたテキスト領域にもっぱら依存しないので、イメージ内の回転しているテキスト抽出にも強みを有し得る。 With this configuration, even if there is an error in the process of detecting the text region of the image, the text can be correctly extracted from the image by using the entire image and reference points. In other words, even if an error occurs in detecting the text region, the loss of text content can be prevented. Furthermore, since the final text recognition result does not solely depend on the detected text region, it can also have the advantage of extracting rotated text within an image.

図２は、本開示の一実施例による光学文字認識のために、情報処理システム２３０が複数のユーザ端末２１０＿１、２１０＿２、２１０＿３と通信可能に連結された構成を示す概要図である。図に示すように、複数のユーザ端末２１０＿１、２１０＿２、２１０＿３は、ネットワーク２２０を介して光学文字認識サービスを提供し得る情報処理システム２３０と連結され得る。なお、複数のユーザ端末２１０＿１、２１０＿２、２１０＿３は、光学文字認識サービスの提供を受けるユーザの端末を含み得る。 FIG. 2 is a schematic diagram illustrating a configuration in which an information processing system 230 is communicatively coupled to multiple user terminals 210_1, 210_2, and 210_3 for optical character recognition according to one embodiment of the present disclosure. As shown in the figure, multiple user terminals 210_1, 210_2, and 210_3 may be coupled to information processing system 230, which may provide optical character recognition services, via network 220. Note that the multiple user terminals 210_1, 210_2, and 210_3 may include terminals of users who receive the optical character recognition service.

一実施例において、情報処理システム２３０は、光学文字認識サービス提供などに関連するコンピュータ実行可能なプログラム（例えばダウンロード可能アプリケーション）およびデータを保存、提供、および実行し得る１つ以上のサーバー装置および／またはデータベース、またはクラウドコンピューティングサービスベースの１つ以上の分散コンピューティング装置および／または分散データベースを含み得る。 In one embodiment, information processing system 230 may include one or more server devices and/or databases, or one or more cloud computing service-based distributed computing devices and/or distributed databases, that may store, provide, and execute computer-executable programs (e.g., downloadable applications) and data related to providing optical character recognition services, etc.

情報処理システム２３０によって提供される光学文字認識サービスは、複数のユーザ端末２１０＿１、２１０＿２、２１０＿３のそれぞれに設けられた光学文字認識サービスアプリケーションのウェブブラウザまたはウェブブラウザ拡張プログラムなどによりユーザに提供され得る。例えば、情報処理システム２３０は、光学文字認識サービスアプリケーションなどにより、ユーザ端末２１０＿１、２１０＿２、２１０＿３から受信するイメージ内のテキスト抽出要請に対する情報を提供したり、対応したりする処理を行い得る。 The optical character recognition service provided by the information processing system 230 may be provided to users through a web browser or web browser extension program of an optical character recognition service application installed in each of the user terminals 210_1, 210_2, and 210_3. For example, the information processing system 230 may provide information or respond to a request for text extraction from an image received from the user terminals 210_1, 210_2, and 210_3 through an optical character recognition service application.

複数のユーザ端末２１０＿１、２１０＿２、２１０＿３は、ネットワーク２２０を介して情報処理システム２３０と通信し得る。ネットワーク２２０は、複数のユーザ端末２１０＿１、２１０＿２、２１０＿３と情報処理システム２３０との間の通信が可能となるように構成され得る。ネットワーク２２０は、設置環境に応じて、例えば、イーサネット（Ethernet、登録商標）、有線ホームネットワーク（Power Line Communication）、電話回線通信装置およびＲＳシリアル通信等の有線ネットワーク、移動体通信ネットワーク、ＷＬＡＮ（Wireless LAN）、Ｗｉ－Ｆｉ（登録商標）、Ｂｌｕｅｔｏｏｔｈ（登録商標）、およびＺｉｇＢｅｅ（登録商標）等のような無線ネットワークまたはその組み合わせで構成され得る。通信方式は制限されず、ネットワーク２２０が含み得る通信網（一例として、移動体通信ネットワーク、有線インターネット、無線インターネット、放送通信網、衛星通信網など）を活用する通信方式だけでなく、ユーザ端末２１０＿１、２１０＿２、２１０＿３間の近距離無線通信も含まれ得る。 Multiple user terminals 210_1, 210_2, and 210_3 may communicate with information processing system 230 via network 220. Network 220 may be configured to enable communication between multiple user terminals 210_1, 210_2, and 210_3 and information processing system 230. Depending on the installation environment, network 220 may be configured as a wired network such as Ethernet (registered trademark), a wired home network (Power Line Communication), a telephone line communication device, or RS serial communication, a mobile communication network, a wireless network such as WLAN (Wireless LAN), Wi-Fi (registered trademark), Bluetooth (registered trademark), and ZigBee (registered trademark), or a combination thereof. The communication method is not limited, and may include not only communication methods that utilize communication networks that may be included in network 220 (for example, mobile communication networks, wired internet, wireless internet, broadcast communication networks, satellite communication networks, etc.), but also short-range wireless communication between user terminals 210_1, 210_2, and 210_3.

図２において、携帯電話端末２１０＿１、タブレット端末２１０＿２およびＰＣ端末２１０＿３が、ユーザ端末の例として示されているが、これに限定されず、ユーザ端末２１０＿１、２１０＿２、２１０＿３は、有線および／または無線通信が可能で、光学文字認識サービスアプリケーションまたはウェブブラウザなどが設置され実行され得る任意のコンピューティング装置であり得る。例えば、ユーザ端末は、ＡＩスピーカ、スマートフォン、携帯電話、ナビゲーション、コンピュータ、ノートブック、デジタル放送用端末、ＰＤＡ（Personal Digital Assistants）、ＰＭＰ（Portable Multimedia Player）、タブレットＰＣ、ゲームコンソール（game console）、ウェアラブルデバイス（wearable device）、ＩｏＴ（internet of things）デバイス、ＶＲ（virtual reality）デバイス、ＡＲ（augmented reality）デバイス、セットトップボックスなどを含み得る。また、図２には、３つのユーザ端末２１０＿１、２１０＿２、２１０＿３が、ネットワーク２２０を介して情報処理システム２３０と通信するものと示されているが、これに限定されず、異なる数のユーザ端末がネットワーク２２０を介して情報処理システム２３０と通信するように構成されても良い。 In FIG. 2, a mobile phone terminal 210_1, a tablet terminal 210_2, and a PC terminal 210_3 are shown as examples of user terminals, but the user terminals are not limited thereto. The user terminals 210_1, 210_2, and 210_3 may be any computing device capable of wired and/or wireless communication on which an optical character recognition service application, a web browser, or the like can be installed and executed. For example, user terminals may include AI speakers, smartphones, mobile phones, navigation systems, computers, notebooks, digital broadcasting terminals, PDAs (Personal Digital Assistants), PMPs (Portable Multimedia Players), tablet PCs, game consoles, wearable devices, IoT (Internet of Things) devices, VR (Virtual Reality) devices, AR (Augmented Reality) devices, set-top boxes, etc. Also, while FIG. 2 shows three user terminals 210_1, 210_2, and 210_3 communicating with the information processing system 230 via the network 220, this is not limiting, and a different number of user terminals may be configured to communicate with the information processing system 230 via the network 220.

図２には、ユーザ端末２１０＿１、２１０＿２、２１０＿３が、情報処理システム２３０から光学文字認識サービスの提供を受けるものと示されているが、これに限定されない。例えば、情報処理システム２３０との通信なしで、ユーザ端末２１０＿１、２１０＿２、２１０＿３に設けられた光学文字認識プログラム／アプリケーションにより、光学文字認識サービスが提供され得る。また、情報処理システム２３０が単一装置として示されているが、これに限定されず、情報処理システム２３０は複数の装置で構成され得る。 While FIG. 2 shows user terminals 210_1, 210_2, and 210_3 receiving optical character recognition services from information processing system 230, this is not intended to be limiting. For example, optical character recognition services may be provided by optical character recognition programs/applications installed on user terminals 210_1, 210_2, and 210_3 without communication with information processing system 230. Also, while information processing system 230 is shown as a single device, this is not intended to be limiting and information processing system 230 may be comprised of multiple devices.

図３は、本開示の一実施例によるユーザ端末２１０および情報処理システム２３０の内部構成を示すブロック図である。ユーザ端末２１０は、アプリケーション、ウェブブラウザなどが実行可能で、有／無線通信が可能な任意のコンピューティング装置のことを指し、例えば、図２の携帯電話端末２１０＿１、タブレット端末２１０＿２、ＰＣ端末２１０＿３などを含み得る。図に示すように、ユーザ端末２１０は、メモリ３１２、プロセッサ３１４、通信モジュール３１６、および入出力インターフェース３１８を含み得る。同様に、情報処理システム２３０は、メモリ３３２、プロセッサ３３４、通信モジュール３３６、および入出力インターフェース３３８を含み得る。図３に示すように、ユーザ端末２１０および情報処理システム２３０は、それぞれの通信モジュール３１６、３３６を用いて、ネットワーク２２０を介して情報および／またはデータの通信ができるように構成され得る。また、入出力装置３２０は、入出力インターフェース３１８を介して、ユーザ端末２１０に情報および／またはデータを入力するか、またはユーザ端末２１０から生成された情報および／またはデータを出力するように構成され得る。 FIG. 3 is a block diagram showing the internal configuration of a user terminal 210 and an information processing system 230 according to one embodiment of the present disclosure. The user terminal 210 refers to any computing device capable of executing applications, a web browser, etc., and capable of wired/wireless communication, and may include, for example, the mobile phone terminal 210_1, tablet terminal 210_2, and PC terminal 210_3 of FIG. 2. As shown in the figure, the user terminal 210 may include a memory 312, a processor 314, a communication module 316, and an input/output interface 318. Similarly, the information processing system 230 may include a memory 332, a processor 334, a communication module 336, and an input/output interface 338. As shown in FIG. 3, the user terminal 210 and the information processing system 230 may be configured to communicate information and/or data over the network 220 using their respective communication modules 316, 336. Additionally, the input/output device 320 may be configured to input information and/or data to the user terminal 210, or to output information and/or data generated by the user terminal 210, via the input/output interface 318.

メモリ３１２、３３２は、非一時的な任意のコンピュータ読み取り可能な記録媒体を含み得る。一実施例によると、メモリ３１２、３３２は、ＲＯＭ（read only memory）、ディスクドライブ、ＳＳＤ（solid state drive）、フラッシュメモリ（flash memory）などのような非消滅性大容量記憶装置（permanent mass storage device）を含み得る。他の例として、ＲＯＭ、ＳＳＤ、フラッシュメモリ、ディスクドライブなどのような非消滅性大容量記憶装置は、メモリとは区別される別途の永久記憶装置としてユーザ端末２１０または情報処理システム２３０に含まれ得る。また、メモリ３１２、３３２には、オペレーティングシステムと少なくとも１つのプログラムコードが保存され得る。 Memory 312, 332 may include any non-transitory computer-readable recording medium. According to one embodiment, memory 312, 332 may include a permanent mass storage device such as a read only memory (ROM), a disk drive, a solid state drive (SSD), a flash memory, etc. As another example, a permanent mass storage device such as a ROM, an SSD, a flash memory, a disk drive, etc. may be included in user terminal 210 or information processing system 230 as a separate permanent storage device distinct from memory. In addition, memory 312, 332 may store an operating system and at least one program code.

このようなソフトウェア構成要素は、メモリ３１２、３３２とは別のコンピュータで読み取り可能な記録媒体からロードされ得る。このような別のコンピュータで読み取り可能な記録媒体は、そのようなユーザ端末２１０および情報処理システム２３０に直接連結可能な記録媒体を含み得るが、例えば、フロッピードライブ、ディスク、テープ、ＤＶＤ／ＣＤ－ＲＯＭドライブ、メモリカードなどのコンピュータで読み取り可能な記録媒体を含み得る。他の例として、ソフトウェア構成要素は、コンピュータで読み取り可能な記録媒体ではなく、通信モジュール３１６、３３６を介してメモリ３１２、３３２にロードされてもよい。例えば、少なくとも１つのプログラムは、開発者またはアプリケーションの設置ファイルを配布するファイル配布システムが、ネットワーク２２０を介して提供するファイルによって設置されるコンピュータプログラムに基づいてメモリ３１２、３３２にロードされ得る。 Such software components may be loaded from a computer-readable recording medium separate from memory 312, 332. Such separate computer-readable recording medium may include a recording medium directly connectable to the user terminal 210 and information processing system 230, but may also include computer-readable recording media such as a floppy drive, disk, tape, DVD/CD-ROM drive, or memory card. As another example, the software components may be loaded into memory 312, 332 via communication modules 316, 336 rather than a computer-readable recording medium. For example, at least one program may be loaded into memory 312, 332 based on a computer program installed by a file provided over network 220 by a developer or a file distribution system that distributes application installation files.

プロセッサ３１４、３３４は、基本的な算術、ロジック、および入出力演算を実行することによって、コンピュータプログラムの命令を処理するように構成され得る。命令は、メモリ３１２、３３２または通信モジュール３１６、３３６によってプロセッサ３１４、３３４に提供され得る。例えば、プロセッサ３１４、３３４は、メモリ３１２、３３２のような記録装置に保存されたプログラムコードにより、受信される命令を実行するように構成され得る。 Processors 314, 334 may be configured to process computer program instructions by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to processors 314, 334 by memory 312, 332 or communication modules 316, 336. For example, processors 314, 334 may be configured to execute instructions received from program code stored in a storage device such as memory 312, 332.

通信モジュール３１６、３３６は、ネットワーク２２０を介してユーザ端末２１０と情報処理システム２３０とが互いに通信するための構成または機能を提供することができ、ユーザ端末２１０および／または情報処理システム２３０が、他のユーザ端末または他のシステム（一例として、別途のクラウドシステムなど）と通信するための構成または機能を提供し得る。一例として、ユーザ端末２１０のプロセッサ３１４が、メモリ３１２などの記録装置に保存されたプログラムコードにより生成した要請またはデータ（例えば、イメージ内のテキスト抽出要請など）は、通信モジュール３１６の制御により、ネットワーク２２０を介して情報処理システム２３０に伝達され得る。逆に、情報処理システム２３０のプロセッサ３３４の制御により提供される制御信号や命令が、通信モジュール３３６とネットワーク２２０を経て、ユーザ端末２１０の通信モジュール３１６を介してユーザ端末２１０に受信され得る。 The communication modules 316, 336 may provide configurations or functions for the user terminal 210 and the information processing system 230 to communicate with each other via the network 220, and may provide configurations or functions for the user terminal 210 and/or the information processing system 230 to communicate with other user terminals or other systems (e.g., a separate cloud system). For example, a request or data (e.g., a request to extract text from an image) generated by the processor 314 of the user terminal 210 using program code stored in a storage device such as the memory 312 may be transmitted to the information processing system 230 via the network 220 under the control of the communication module 316. Conversely, a control signal or command provided under the control of the processor 334 of the information processing system 230 may be received by the user terminal 210 via the communication module 316 of the user terminal 210 via the communication module 336 and the network 220.

入出力インターフェース３１８は、入出力装置３２０とのインターフェースのための手段であり得る。一例として、入力装置は、オーディオセンサおよび／またはイメージセンサを含むカメラ、キーボード、マイクロフォン、マウスなどの装置を含み、出力装置は、ディスプレイ、スピーカ、ハプティック（触覚）フィードバックデバイス（haptic feedback device）などのような装置を含み得る。他の例として、入出力インターフェース３１８は、タッチスクリーンなどのように、入力および出力を実行するための構成または機能が１つに統合された装置とのインターフェースのための手段であり得る。例えば、ユーザ端末２１０のプロセッサ３１４が、メモリ３１２にロードされたコンピュータプログラムの命令を処理することにおいて、情報処理システム２３０や他のユーザ端末が提供する情報および／またはデータを用いて構成されるサービス画面などが、入出力インターフェース３１８を介してディスプレイに表示され得る。図３では、入出力装置３２０がユーザ端末２１０に含まれないように示されているが、これに限定されず、ユーザ端末２１０と１つの装置で構成され得る。また、情報処理システム２３０の入出力インターフェース３３８は、情報処理システム２３０と連結されるか、または情報処理システム２３０が含み得る入力または出力のための装置（図示せず）とのインターフェースのための手段であり得る。図３では、入出力インターフェース３１８、３３８が、プロセッサ３１４、３３４とは別に構成された要素として示されているが、これに限定されず、入出力インターフェース３１８、３３８がプロセッサ３１４、３３４に含まれるようにも構成され得る。 The input/output interface 318 may be a means for interfacing with the input/output device 320. For example, input devices may include devices such as a camera including an audio sensor and/or an image sensor, a keyboard, a microphone, and a mouse, while output devices may include devices such as a display, a speaker, and a haptic feedback device. As another example, the input/output interface 318 may be a means for interfacing with a device that integrates input and output functions, such as a touchscreen. For example, when the processor 314 of the user terminal 210 processes instructions from a computer program loaded into the memory 312, a service screen constructed using information and/or data provided by the information processing system 230 or another user terminal may be displayed on the display via the input/output interface 318. While FIG. 3 illustrates the input/output device 320 as not being included in the user terminal 210, this is not limiting and the input/output device 320 may be configured as a single device together with the user terminal 210. 
Furthermore, the input/output interface 338 of the information processing system 230 may be a means for interfacing with an input or output device (not shown) that is coupled to the information processing system 230 or that may be included in the information processing system 230. In FIG. 3, the input/output interfaces 318, 338 are shown as elements configured separately from the processors 314, 334, but are not limited to this, and the input/output interfaces 318, 338 may also be configured to be included in the processors 314, 334.

ユーザ端末２１０および情報処理システム２３０は、図３の構成要素よりもさらに多い構成要素を含み得る。しかし、大部分の従来技術的な構成要素を明確に示す必要はない。一実施例において、ユーザ端末２１０は、前述の入出力装置３２０のうち少なくとも一部を含むように実現され得る。また、ユーザ端末２１０は、トランシーバ（transceiver）、ＧＰＳ（Global Positioning system）モジュール、カメラ、各種センサ、データベースなどのような他の構成要素をさらに含み得る。例えば、ユーザ端末２１０がスマートフォンである場合、一般にスマートフォンが含んでいる構成要素を含むことができ、例えば、加速度センサ、ジャイロセンサ、マイクモジュール、カメラモジュール、各種物理的ボタン、タッチパネルを用いたボタン、入出力ポート、振動のためのバイブレータなどの様々な構成要素が、ユーザ端末２１０にさらに含まれるように実現され得る。 The user terminal 210 and the information processing system 230 may include more components than those shown in FIG. 3. However, it is not necessary to explicitly show most of the conventional components. In one embodiment, the user terminal 210 may be implemented to include at least some of the input/output devices 320 described above. The user terminal 210 may also include other components such as a transceiver, a GPS (Global Positioning System) module, a camera, various sensors, a database, etc. For example, if the user terminal 210 is a smartphone, it may include components typically included in smartphones, such as an acceleration sensor, a gyro sensor, a microphone module, a camera module, various physical buttons, buttons using a touch panel, input/output ports, a vibrator for vibration, etc.

光学文字認識サービスアプリケーションなどのためのプログラムが動作している間、プロセッサ３１４は、入出力インターフェース３１８に接続されたタッチスクリーン、キーボード、オーディオセンサおよび／またはイメージセンサを含むカメラ、マイクロフォンなどの入力装置を介して入力または選択されたテキスト、イメージ、映像、音声および／または動作などを受信することができ、受信されたテキスト、イメージ、映像、音声および／または動作などをメモリ３１２に保存するか、または通信モジュール３１６およびネットワーク２２０を介して情報処理システム２３０に提供し得る。 While a program for an optical character recognition service application or the like is running, the processor 314 can receive text, images, videos, sounds, and/or actions, etc., entered or selected via input devices such as a touch screen, keyboard, camera including an audio sensor and/or image sensor, microphone, etc. connected to the input/output interface 318, and can store the received text, images, videos, sounds, and/or actions, etc. in memory 312 or provide them to the information processing system 230 via the communication module 316 and the network 220.

ユーザ端末２１０のプロセッサ３１４は、入出力装置３２０、他のユーザ端末、情報処理システム２３０、および／または複数の外部システムから受信した情報および／またはデータを管理、処理および／または保存するように構成され得る。プロセッサ３１４によって処理された情報および／またはデータは、通信モジュール３１６およびネットワーク２２０を介して情報処理システム２３０に提供され得る。ユーザ端末２１０のプロセッサ３１４は、入出力インターフェース３１８を介して入出力装置３２０に情報および／またはデータを転送して出力し得る。例えば、プロセッサ３１４は、受信した情報および／またはデータをユーザ端末２１０の画面に表示し得る。 The processor 314 of the user terminal 210 may be configured to manage, process, and/or store information and/or data received from the input/output device 320, other user terminals, the information processing system 230, and/or multiple external systems. The information and/or data processed by the processor 314 may be provided to the information processing system 230 via the communication module 316 and the network 220. The processor 314 of the user terminal 210 may transfer and output information and/or data to the input/output device 320 via the input/output interface 318. For example, the processor 314 may display the received information and/or data on the screen of the user terminal 210.

情報処理システム２３０のプロセッサ３３４は、複数のユーザ端末２１０および／または複数の外部システムから受信した情報および／またはデータを管理、処理、および／または保存するように構成され得る。プロセッサ３３４によって処理された情報および／またはデータは、通信モジュール３３６およびネットワーク２２０を介してユーザ端末２１０に提供し得る。 The processor 334 of the information processing system 230 may be configured to manage, process, and/or store information and/or data received from multiple user terminals 210 and/or multiple external systems. The information and/or data processed by the processor 334 may be provided to the user terminal 210 via the communication module 336 and the network 220.

図４は、本開示の一実施例による光学文字認識システムの一例を示す概要図である。一実施例において、光学文字認識システムは、バックボーン（backbone、図示せず）、トランスフォーマーエンコーダ（transformer encoder）４２０、検出器（detector）（または、ロケーションヘッド（location head））４３０、および認識器（recognizer）４４０を含み得る。なお、認識器４４０は、テキストデコーダ（text decoder）を含み得る。 Figure 4 is a schematic diagram illustrating an example of an optical character recognition system according to one embodiment of the present disclosure. In one embodiment, the optical character recognition system may include a backbone (not shown), a transformer encoder 420, a detector (or location head) 430, and a recognizer 440. Note that the recognizer 440 may include a text decoder.

一実施例において、トランスフォーマーエンコーダ４２０は、バックボーンで生成されたマルチスケールフィーチャマップ（multi-scale feature maps）を結合し得る。また、検出器４３０は、テキストインスタンスおよび境界ボックスのリファレンスポイント４３２を設定し得る。さらに、認識器４４０は、入力されたイメージ４１０のマルチスケールフィーチャおよびリファレンスポイント４３２に基づいてテキスト領域を認識することによって、文字イメージシーケンスを生成し、テキストデコーダを用いて生成された文字イメージシーケンスをテキストに変換し得る。 In one embodiment, the transformer encoder 420 may combine multi-scale feature maps generated by the backbone. The detector 430 may also set reference points 432 for text instances and bounding boxes. The recognizer 440 may then generate a character image sequence by recognizing text regions based on the multi-scale features and reference points 432 of the input image 410, and convert the generated character image sequence into text using a text decoder.

一実施例において、入力されたイメージ４１０がバックボーンに提供されることによって、入力イメージ４１０からフィーチャマップ（例えば、Ｃ_２、Ｃ_３、Ｃ_４、Ｃ_５）が抽出され得る。なお、抽出されたフィーチャマップの解像度は、それぞれ入力されたイメージ４１０の解像度の１／４、１／８、１／１６、１／３２であり得る。また、フィーチャマップは、ＦＣ（fully-connected）レイヤーおよびグループ正規化（group normalization）を適用することによって、複数のチャネル（例えば、２５６チャネル）に投影され得る。その後、投影されたフィーチャマップはマージされ（Ｌ_２＋Ｌ_３＋Ｌ_４＋Ｌ_５）×２５６サイズのフィーチャトークンに平坦化され連結され得る。ここで、Ｌ_ｉは、Ｃ_ｉ（またはＨ／２^ｉ×Ｗ／２^ｉ）の平坦化された長さ（flattened length）を表し得る。この場合、トランスフォーマーエンコーダ４２０は、それを入力として精製されたフィーチャ（refined feature）を出力し得る。精製されたフィーチャが検出器４３０を経てリファレンスポイント４３２とともに使用されることによって、認識器４４０は、テキストインスタンス内で自己回帰的に（autoregressively）テキストシーケンスを生成し得る。 In one embodiment, the input image 410 may be provided to the backbone, which extracts feature maps (e.g., C_2, C_3, C_4, and C_5) from the input image 410. The resolutions of the extracted feature maps may be 1/4, 1/8, 1/16, and 1/32 of the resolution of the input image 410, respectively. Each feature map may be projected to a common number of channels (e.g., 256 channels) by applying a fully-connected (FC) layer and group normalization. The projected feature maps may then be flattened and concatenated into feature tokens of size (L_2 + L_3 + L_4 + L_5) × 256, where L_i may represent the flattened length of C_i (that is, H/2^i × W/2^i). In this case, the transformer encoder 420 may take these feature tokens as input and output refined features. The refined features pass through the detector 430 and are used together with the reference points 432, so that the recognizer 440 may autoregressively generate a text sequence within each text instance.
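
As a non-limiting illustration, the size of the concatenated feature tokens may be computed as follows. The 640 × 640 input size and the use of floor division for sizes not divisible by 32 are assumptions made only for this sketch.

```python
def feature_token_lengths(height, width, levels=(2, 3, 4, 5)):
    """Flattened length L_i of each feature map C_i, whose stride is 2**i."""
    # Floor division is an assumption; real backbones may round differently.
    return {i: (height // 2**i) * (width // 2**i) for i in levels}


lengths = feature_token_lengths(640, 640)   # assumed input size H = W = 640
total_tokens = sum(lengths.values())        # L_2 + L_3 + L_4 + L_5
token_shape = (total_tokens, 256)           # 256 channels, as in the example above
```

With these assumed sizes, C_2 alone contributes 160 × 160 = 25,600 of the 34,000 tokens, which suggests why an attention mechanism whose cost grows quadratically with the token count would be expensive here.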

一実施例において、トランスフォーマーエンコーダ４２０において、入力長に応じて線形に拡張する変形可能なアテンション（deformable attention）が使用され得る。従来のセルフアテンション（self-attention）の学習ないし推論に要されるコストは、入力長に応じて２次的に（quadratically）増加するため、マルチスケールフィーチャの連結にトランスフォーマーを使用することは非効率的であり得る。それに対し、変形可能なアテンションは、より高い効率性および正確な位置認識結果をエンコーダおよびデコーダに提供し得る。変形可能なアテンションは、下記の式１のように計算され得る。 In one embodiment, the Transformer Encoder 420 may use deformable attention, which scales linearly with input length. Because the cost of learning or inference for conventional self-attention increases quadratically with input length, using a Transformer to concatenate multi-scale features may be inefficient. In contrast, deformable attention may provide the encoder and decoder with more efficient and accurate localization results. Deformable attention may be calculated as shown in Equation 1 below.
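
As a non-limiting reconstruction, assuming the commonly used multi-head deformable attention form and the symbols defined for Equation 1, the computation may be written as follows. The function name DeformAttn and the argument order are assumptions.

```latex
\mathrm{DeformAttn}(q, v, P_{ref})
  = \sum_{h} W_h^{o} \left[ \sum_{k=1}^{K} A_{hqk} \cdot W_h^{k}\,
      x\!\left(v,\; P_{ref} + \Delta P_{hqk}\right) \right]
```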

ここで、ｘ（ｖ、ｐ）は、位置ｐのフィーチャ値ｖからフィーチャを抽出する双線形補間法（bilinear-interpolation）を表す。また、Ｋはサンプリングされたキーポイントの数であり、ｋはキーインデックスを表す。ｈはアテンションヘッドのインデックスであり、Ｗ_ｈ ^ｏ∈Ｒ^Ｃ×ＣｍおよびＷ_ｈ ^ｋ∈Ｒ^Ｃｍ×Ｃは線形投影（linear projection）であり得る。そして、Ｐ_ｒｅｆ、ΔＰ_ｈｑｋおよびＡ_ｈｑｋそれぞれは、リファレンスポイント、サンプリングオフセットおよびアテンション重みであり得る。さらに、Ｐ_ｒｅｆ、ΔＰ_ｈｑｋおよびＡ_ｈｑｋは、クエリ（query）フィーチャに対する線形投影を適用することによって計算され得る。この場合、Ａ_ｈｑｋにソフトマックス（softmax）が適用され得る。また、エンコーダ４２０で［０，１］×［０，１］に正規化された座標のある固定された基準点が使用され得る。 Here, x(v, p) denotes bilinear interpolation, which extracts a feature from the feature values v at position p. K is the number of sampled key points, and k is the key index. h is the index of the attention head, and W_h^o ∈ R^(C×C_m) and W_h^k ∈ R^(C_m×C) may be linear projections. P_ref, ΔP_hqk, and A_hqk may be the reference point, the sampling offset, and the attention weight, respectively. P_ref, ΔP_hqk, and A_hqk may each be computed by applying a linear projection to the query feature. In this case, softmax may be applied to A_hqk. In the encoder 420, fixed reference points with coordinates normalized to [0, 1] × [0, 1] may be used.
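
As a non-limiting illustration, the sampling step of deformable attention may be sketched for one query and one attention head as follows. The sketch is simplified by assumption: the feature map holds a single scalar channel, positions are in pixel rather than normalized coordinates, and the linear projections W_h^k and W_h^o are omitted. All function names are hypothetical.

```python
import math


def bilinear_sample(feature, x, y):
    """x(v, p): bilinearly interpolate the grid feature[row][col] at pixel (x, y)."""
    h, w = len(feature), len(feature[0])
    # Clamp to the grid so that out-of-range offsets read the border value.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = feature[y0][x0] * (1 - fx) + feature[y0][x1] * fx
    bottom = feature[y1][x0] * (1 - fx) + feature[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention_head(feature, p_ref, offsets, logits):
    """Sum over K sampled points of A_k * x(v, p_ref + dp_k) for one head."""
    weights = softmax(logits)  # attention weights A_k, normalized over the K points
    samples = [bilinear_sample(feature, p_ref[0] + dx, p_ref[1] + dy)
               for dx, dy in offsets]
    return sum(a * s for a, s in zip(weights, samples))


grid = [[0.0, 1.0], [2.0, 3.0]]
center_value = bilinear_sample(grid, 0.5, 0.5)
attended = deformable_attention_head(grid, (0.0, 0.0), [(0, 0), (1, 0)], [0.0, 0.0])
```

Because only K points are read per query, the cost of this step does not depend on the total number of feature tokens.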

一実施例において、位置情報を用いてテキストオブジェクトを認識するために、マルチスケールフィーチャからテキストを検出する検出器４３０を用いて、リファレンスポイント（または、テキストインスタンスの中心位置）４３２が予測され得る。また、セグメンテーション（segmentation）マップにより、テキストインスタンスの境界ポリゴンが抽出され得る。例えば、フィーチャマップＣ_２に対応するフィーチャからＬ_２サイズのフィーチャトークンが抽出され（Ｈ／４、Ｗ／４）のサイズに再構成され得る。また、転置畳み込み（transposed convolution）、グループ正規化およびＲｅＬＵ（Rectified Linear Unit）からなるセグメンテーションヘッドを使用して、バイナリおよびしきい値マップ（binary and threshold map）が取得され得る。加えて、感知された境界ポリゴンにおいて、テキストインスタンスの中心座標がリファレンスポイント４３２として決定され得る。ここでは、セグメンテーションベースの検出方法を用いてテキストが検出されるものとして説明したが、これに限定されず、様々な形態の検出方法が適用され得る。 In one embodiment, to recognize a text object using position information, a reference point (or center position of a text instance) 432 may be predicted using a detector 430 that detects text from the multi-scale features. A bounding polygon of the text instance may be extracted from a segmentation map. For example, feature tokens of length L_2 may be extracted from the features corresponding to feature map C_2 and reshaped to a size of (H/4, W/4). A segmentation head consisting of transposed convolution, group normalization, and ReLU (Rectified Linear Unit) may then be used to obtain a binary map and a threshold map. Additionally, the center coordinates of the text instance within the detected bounding polygon may be determined as the reference point 432. Although text is described here as being detected with a segmentation-based detection method, this is not limiting, and various other detection methods may be applied.
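
As a non-limiting illustration, a reference point may be derived from a detected bounding polygon as follows. Interpreting the "center coordinates" as the area-weighted centroid of the polygon is an assumption of this sketch; the center of an enclosing box would be another possible reading.

```python
def polygon_centroid(points):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    twice_area = 0.0
    cx = cy = 0.0
    # Walk each edge (current vertex to next vertex, wrapping around at the end).
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        # Degenerate polygon (zero area): fall back to the mean of the vertices.
        n = len(points)
        return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))


reference_point = polygon_centroid([(10, 20), (110, 20), (110, 60), (10, 60)])
```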

一実施例において、例えば、トランスフォーマーデコーダからなるテキストデコーダを含む認識器４４０は、変形可能なアテンション４４４によりイメージ４１０およびリファレンスポイント４３２を参照しながら、テキスト領域に関連するテキストインスタンスからテキストシーケンスを自己回帰的に予測し得る。この場合、テキストデコーダのためのクエリＱは、テキスト埋め込み（embedding）、位置埋め込み、およびリファレンスポイントｑ_ｒｅｆを含み得る。また、テキストデコーダのキー（key）Ｋおよび値（value）Ｖは、トランスフォーマーエンコーダ４２０のフィーチャトークンであり得る。クエリＱは、セルフアテンション４４２、変形可能なアテンション４４４およびフィードフォワードレイヤ（feed-forward layer）４４６を介して伝達され得る。 In one embodiment, a recognizer 440 including a text decoder composed of, for example, a transformer decoder may autoregressively predict a text sequence from a text instance associated with a text region while referring to the image 410 and the reference point 432 through deformable attention 444. In this case, a query Q for the text decoder may include a text embedding, a positional embedding, and a reference point q_ref. The key K and value V of the text decoder may be the feature tokens of the transformer encoder 420. The query Q may be passed through self-attention 442, deformable attention 444, and a feed-forward layer 446.
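
As a non-limiting illustration, autoregressive prediction of a text sequence may be sketched as a greedy loop in which each predicted character is fed back as input. The callable `step_fn` is a hypothetical stand-in for one pass through the text decoder; the start and end markers, the length limit, and the greedy choice are likewise assumptions.

```python
def greedy_decode(step_fn, reference_point, max_len=25, bos="<s>", eos="</s>"):
    """Generate characters one at a time until the end marker or max_len."""
    prefix = [bos]
    while len(prefix) <= max_len:
        # The decoder sees everything generated so far plus the reference point.
        nxt = step_fn(prefix, reference_point)
        if nxt == eos:
            break
        prefix.append(nxt)
    return "".join(prefix[1:])  # drop the start marker


def toy_step(prefix, reference_point):
    """Toy decoder step that spells a fixed word; a real step would use attention."""
    target = "sesame"
    i = len(prefix) - 1
    return target[i] if i < len(target) else "</s>"


decoded = greedy_decode(toy_step, (0.4, 0.3))
```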

一実施例において、認識器４４０の学習段階において、テキスト領域の境界ボックスがイメージからサンプリングされ、境界ボックスの中心座標がテキストデコーダのリファレンスポイント４３２として使用され得る。これにより、検出器４３０および認識器４４０それぞれは、独立して学習され得る。また、モデルの予測と実際の値との間の座標差を減らすために、学習段階で中心座標が下記の式２を使用してランダムに変更され得る。 In one embodiment, during the training phase of the recognizer 440, the bounding box of the text region is sampled from the image, and the center coordinate of the bounding box can be used as the reference point 432 for the text decoder. This allows the detector 430 and the recognizer 440 to be trained independently. Furthermore, to reduce the coordinate difference between the model's prediction and the actual value, the center coordinate can be randomly varied during the training phase using Equation 2 below.

ここで、ｐ_ｃは正解ポリゴンの中心点を表し、ｐ_ｔｌ、ｐ_ｔｒ、ｐ_ｂｌはそれぞれ左上端座標、右上端座標、および左下端座標を表す。また、推論段階において、検出段階で抽出したテキスト領域の中心点がリファレンスポイントとして使用され得る。 Here, p_c represents the center point of the ground-truth polygon, and p_tl, p_tr, and p_bl represent the top-left, top-right, and bottom-left coordinates, respectively. In the inference phase, the center point of the text region extracted in the detection phase may be used as the reference point.
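
As a non-limiting illustration, a random perturbation of the center point during training may be sketched as follows. The exact form of Equation 2 is not reproduced in this text, so the rule used here is an assumption: p_ref = p_c + (η/2) · min(‖p_tl − p_tr‖, ‖p_tl − p_bl‖), with η drawn uniformly from [−1, 1] independently for each axis.

```python
import math
import random


def jitter_reference_point(p_c, p_tl, p_tr, p_bl, rng=random):
    """Shift the center by at most half of the shorter side of the box."""
    # Shorter side: min(|p_tl - p_tr|, |p_tl - p_bl|).
    short_side = min(math.dist(p_tl, p_tr), math.dist(p_tl, p_bl))
    # Assumed noise model: one uniform draw in [-1, 1] per coordinate.
    return tuple(c + 0.5 * rng.uniform(-1.0, 1.0) * short_side for c in p_c)


rng = random.Random(0)  # seeded only so the sketch is repeatable
jittered = jitter_reference_point((50.0, 10.0), (0, 0), (100, 0), (0, 20), rng)
```

Bounding the shift by the shorter side keeps the perturbed point close to the text instance even for long, thin boxes.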

一実施例において、学習に使用される損失関数Ｌは、下記の式３のように表され得る。 In one embodiment, the loss function L used for training can be expressed as Equation 3 below:
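
As a non-limiting reconstruction, a loss combining the four terms defined for Equation 3 may be written as a weighted sum. The weights λ_s, λ_b, and λ_t are assumptions; setting all of them to 1 gives a plain sum.

```latex
L = L_r + \lambda_s L_s + \lambda_b L_b + \lambda_t L_t
```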

ここで、Ｌ_ｒは自己回帰テキスト認識損失を表し、Ｌ_ｓ、Ｌ_ｂおよびＬ_ｔのそれぞれは、微分可能な２進化（differentiable binarization）の損失であって、確率マップ、バイナリマップおよびしきい値マップに対する損失を表す。また、文字列シーケンスの予測確率と該当テキスト境界ボックスに対応する正解テキストラベル間のソフトマックス交差エントロピー（softmax cross entropy）が、Ｌ_ｒとして計算され得る。微分可能な２進化の実行後、Ｌ_ｓに対してハードネガティブマイニング（hard negative mining）とともに二値交差エントロピーが適用され、Ｌ_ｂに対してダイス損失（dice loss）が適用され、Ｌ_ｔに対してＬ_１距離損失が適用され得る。 Here, L_r represents the autoregressive text recognition loss, and L_s, L_b, and L_t are the differentiable binarization losses for the probability map, the binary map, and the threshold map, respectively. The softmax cross entropy between the predicted probabilities of the character sequence and the ground-truth text label corresponding to the text bounding box may be calculated as L_r. After differentiable binarization is performed, binary cross entropy with hard negative mining may be applied for L_s, a dice loss may be applied for L_b, and an L1 distance loss may be applied for L_t.
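
As a non-limiting illustration, the three detection loss terms and their combination with the recognition loss may be sketched on flattened maps as follows. The positive-to-negative ratio of 1:3 used for hard negative mining, the smoothing constant, and the unit weights are assumptions of this sketch.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """1 - 2*overlap / (sum(pred) + sum(target)) for maps with values in [0, 1]."""
    inter = sum(p * g for p, g in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def l1_loss(pred, target):
    """Mean absolute difference between two maps."""
    return sum(abs(p - g) for p, g in zip(pred, target)) / len(pred)


def bce_hard_negative(pred, target, neg_ratio=3, eps=1e-6):
    """Binary cross entropy over all positives and only the hardest negatives."""
    losses = [-(g * math.log(p + eps) + (1 - g) * math.log(1 - p + eps))
              for p, g in zip(pred, target)]
    pos = [l for l, g in zip(losses, target) if g == 1]
    # Sort negatives by loss, largest first, and keep at most neg_ratio per positive.
    neg = sorted((l for l, g in zip(losses, target) if g == 0), reverse=True)
    neg = neg[: max(1, neg_ratio * len(pos))]
    kept = pos + neg
    return sum(kept) / len(kept)


def total_loss(l_r, l_s, l_b, l_t, w_s=1.0, w_b=1.0, w_t=1.0):
    """L = L_r + w_s*L_s + w_b*L_b + w_t*L_t (weights are assumptions)."""
    return l_r + w_s * l_s + w_b * l_b + w_t * l_t


l_s = bce_hard_negative([0.9, 0.1, 0.2, 0.8], [1, 0, 0, 0])
l_b = dice_loss([1.0, 1.0, 0.0, 0.0], [1, 1, 0, 0])
l_t = l1_loss([0.5, 0.5], [0.0, 1.0])
```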

一実施例において、推論段階では、検出器４３０の確率マップ（probability map）が使用され得る。この場合、確率マップは指定されたしきい値に２進化され、連結された構成要素はバイナリマップで領域として抽出され得る。なお、抽出された領域の大きさは、実際のテキスト領域よりも小さくあり得る。これにより、抽出された領域は、下記の式４のように表されるＶａｔｔｉクリッピングアルゴリズムのオフセットＤを用いて拡張され得る。 In one embodiment, the inference stage may use the probability map of the detector 430. In this case, the probability map is binarized to a specified threshold, and connected components may be extracted as regions in the binary map. Note that the size of the extracted region may be smaller than the actual text region. Therefore, the extracted region may be expanded using an offset D in the Vatti clipping algorithm, as expressed in Equation 4 below.
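
As a non-limiting reconstruction, assuming the offset used in differentiable-binarization text detectors and the symbols defined for Equation 4, the expansion offset may be written as:

```latex
D = \frac{A \times r}{L}
```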

ここで、Ａは、ポリゴン領域の面積であり、Ｌは、ポリゴンの周囲値（perimeter）であり、ｒは、予め決定された拡張因子（dilation factor）を表し得る。拡張領域でポリゴンが抽出された後、計算された中心座標は、認識器４４０にリファレンスポイント４３２として提供され得る。結果的に、認識器４４０は、リファレンスポイント４３２を参照して、該当テキスト領域において文字シーケンスを予測し得る。 Here, A is the area of the polygon region, L is the perimeter of the polygon, and r may represent a predetermined dilation factor. After the polygon is extracted in the dilated region, the calculated center coordinates may be provided to the recognizer 440 as a reference point 432. As a result, the recognizer 440 may predict a character sequence in the corresponding text region by referring to the reference point 432.
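
As a non-limiting illustration, the offset D may be computed from a polygon as follows, assuming the form D = A × r / L. The value r = 1.5 is only an example and is not taken from this disclosure. Applying the offset to the polygon outline would additionally need a polygon-clipping routine, which is left out of this sketch.

```python
import math


def polygon_area_perimeter(points):
    """Return (area A, perimeter L) of a simple polygon given as [(x, y), ...]."""
    twice_area = 0.0
    perimeter = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        twice_area += x0 * y1 - x1 * y0          # shoelace term for this edge
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return abs(twice_area) / 2.0, perimeter


def dilation_offset(points, r=1.5):
    """Offset D = A * r / L by which the extracted region is expanded."""
    area, perimeter = polygon_area_perimeter(points)
    return area * r / perimeter


# A 100 x 20 region: A = 2000, L = 240.
offset = dilation_offset([(0, 0), (100, 0), (100, 20), (0, 20)], r=1.5)
```

Because A/L scales with the thickness of the region, thin text lines receive a smaller expansion than tall ones.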

このような構成により、２つの段階ではなく１つの段階で検出器および認識器が一度に学習されるので、学習の効率性が向上され得る。また、検出されたテキスト領域の大きさを調整しないで、イメージおよびリファレンスポイントを利用することにより、長い文字が歪んだり切られたりした状態で認識エラーが発生する現象を防止し得る。さらには、バックボーンから抽出されたマルチスケールフィーチャを検出器および認識器に共有することによって、モデルの性能および推論速度が向上され得る。 This configuration allows the detector and recognizer to be trained in one step instead of two, improving training efficiency. Furthermore, by using images and reference points without adjusting the size of the detected text region, it is possible to prevent recognition errors caused by distorted or truncated long characters. Furthermore, by sharing multi-scale features extracted from the backbone with the detector and recognizer, model performance and inference speed can be improved.

図５は、本開示の一実施例により、イメージ５１０からテキストを抽出する一例を示す図である。一実施例において、リファレンスポイント５１２およびオフセットポイント５１４＿１～５１４＿４を用いて、イメージ５１０からテキストが認識され得る。具体的には、変形可能なアテンションを含む認識器５２０によりイメージ５１０のフィーチャおよびリファレンスポイント５１２を参照することによって、各文字が順次に、そして自己回帰的に復号され得る。なお、変形可能なアテンションは、イメージ５１０のフィーチャ全体に対してアテンションを実行せず、いくつかのポイントのみをサンプリングしてアテンションを実行し得る。すなわち、リファレンスポイント５１２およびオフセットポイント５１４＿１～５１４＿４を用いてアテンションが行われ得る。 Figure 5 illustrates an example of extracting text from an image 510 according to an embodiment of the present disclosure. In one embodiment, text can be recognized from the image 510 using a reference point 512 and offset points 514_1 to 514_4. Specifically, each character can be sequentially and autoregressively decoded by a recognizer 520 including deformable attention by referring to the features of the image 510 and the reference point 512. Note that deformable attention may not perform attention on the entire features of the image 510, but may instead sample only a few points. That is, attention may be performed using the reference point 512 and offset points 514_1 to 514_4.

一実施例において、オフセットポイント５１４＿１～５１４＿４は、リファレンスポイント５１２に隣接する位置に生成され得る。具体的には、複数のオフセットポイント５１４＿１～５１４＿４それぞれの位置は、リファレンスポイント５１２の座標からモデルが予測したオフセットを加算することによって算出され得る。この場合、認識器５２０は、オフセットポイント５１４＿１～５１４＿４のキーおよび値によってアテンションを実行する位置を決定し得る。これにより、認識器５２０は、リファレンスポイント５１２を中心として、周囲のオフセットポイント５１４＿１～５１４＿４とともにアテンションして、自己回帰的に文字を生成する方法でテキストを復号し得る。例えば、認識器５２０は、オフセットポイント５１４＿１～５１４＿４を参照することによって、「ゴマ」の各文字を順次、そして自己回帰的に復号し得る。図５では、複数のオフセットポイント５１４＿１～５１４＿４が４つのものと示されているが、これに限定されない。 In one embodiment, offset points 514_1 to 514_4 may be generated at positions adjacent to the reference point 512. Specifically, the position of each of the offset points 514_1 to 514_4 may be calculated by adding an offset predicted by the model to the coordinates of the reference point 512. In this case, the recognizer 520 may determine the positions at which attention is performed based on the keys and values of the offset points 514_1 to 514_4. As a result, the recognizer 520 may decode the text by generating characters autoregressively while attending to the reference point 512 at the center together with the surrounding offset points 514_1 to 514_4. For example, the recognizer 520 may sequentially and autoregressively decode each character of "sesame" by referring to the offset points 514_1 to 514_4. While FIG. 5 shows four offset points 514_1 to 514_4, this is not limiting.

With this configuration, text in an image may be recognized and decoded using the reference point and the features of the entire image, without pooling or masking regions of interest within the image. As a result, text may be extracted smoothly even if a region of interest is detected incorrectly.

FIG. 6 illustrates an example process of an optical character recognition system according to an embodiment of the present disclosure. In one embodiment, a backbone 620 may generate image features (or multi-scale feature maps) from an image 610. In addition, a transformer encoder 630 may tune and encode the image features.

In one embodiment, a word detector 640 may detect word-level text positions from the encoded features of the image 610. Specifically, the word detector 640 may extract features related to word-level text. The word detector 640 may also generate position information for at least one text region based on the features. Furthermore, the word detector 640 may generate a reference point based on the position information of the text region. The reference point may be the center point of the detected text region.
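
As a minimal sketch of how a reference point might be derived from a detected region, the following Python assumes that each region is given as a list of (x, y) vertices and that the center is the midpoint of the axis-aligned bounding box around them. Both the representation and the choice of center are assumptions made for illustration; the disclosure states only that the reference point may be the center point of the detected text region.

```python
def region_center(polygon):
    """Return the center of a region given as a list of (x, y) vertices.

    Assumption: the center is the midpoint of the axis-aligned bounding box.
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)

def reference_points(regions):
    """One reference point per detected text region (hypothetical helper)."""
    return [region_center(region) for region in regions]
```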

In one embodiment, recognizers (recognizer 1 to N) 650_1 to 650_N may recognize and extract text in the image 610 based on the reference points generated by the word detector 640 and the encoded features of the image 610. Specifically, the recognizers 650_1 to 650_N may generate a plurality of offset points adjacent to each reference point. The recognizers 650_1 to 650_N may also recognize the text in the image 610 by attending to the reference point and the plurality of offset points, thereby generating a character image sequence. Furthermore, the recognizers 650_1 to 650_N may convert the generated character image sequence into text by decoding it.

In one embodiment, the word detector 640 may detect a plurality of text regions from the encoded features of the image 610 at least partially in parallel. In this case, a reference point may be generated for each text region. The recognizers 650_1 to 650_N may thus extract text in parallel, one for each of the N text regions.
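
Because each recognizer needs only the shared encoded features and its own reference point, the N recognitions are independent of one another. The sketch below illustrates that independence with Python's standard thread pool; `recognize` is a hypothetical stand-in for a recognizer, and a real system would more likely batch the reference points on an accelerator than use threads.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_all(encoded_features, reference_points, recognize):
    """Run `recognize` once per reference point, in parallel.

    `recognize(encoded_features, point)` is a hypothetical callable that
    returns the text found around `point`.
    """
    if not reference_points:
        return []
    with ThreadPoolExecutor(max_workers=len(reference_points)) as pool:
        # map() preserves input order, so result i belongs to reference point i.
        return list(
            pool.map(lambda point: recognize(encoded_features, point), reference_points)
        )
```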

In one embodiment, a line detector 660 may detect line-level text positions in the encoded features of the image 610. Specifically, the line detector 660 may extract features related to line-level text. The line detector 660 may also generate position information for at least one text region based on the features. An example of detecting line-level text positions is described in detail below with reference to FIG. 9.

In one embodiment, a paragraph detector 670 may detect paragraph-level text positions in the encoded features of the image 610. Specifically, the paragraph detector 670 may extract features related to paragraph-level text. The paragraph detector 670 may also generate position information for at least one text region based on the features. An example of detecting paragraph-level text positions is described in detail below with reference to FIG. 9.

In one embodiment, the word detector 640, the line detector 660, and the paragraph detector 670 may be connected to the transformer encoder 630 in parallel. That is, the word detector 640, the line detector 660, and the paragraph detector 670 may detect word-level, line-level, and paragraph-level text regions, respectively, at least partially in parallel.

In one embodiment, the text regions detected at each level may be associated with one another in a post-processing stage 680. Specifically, the word positions detected by the word detector 640, the line positions detected by the line detector 660, and the paragraph positions detected by the paragraph detector 670 may be matched with one another. In this way, position information may be detected for the line-level text region and/or the paragraph-level text region that contains the text extracted by the recognizers 650_1 to 650_N.
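
One simple way to perform such matching is a containment test: a word is assigned to the line and the paragraph whose region contains the word's center. The sketch below assumes axis-aligned boxes of the form (x_min, y_min, x_max, y_max) and first-match assignment. The disclosure does not specify the matching rule, so this is an illustrative assumption only.

```python
def contains(box, point):
    """True if point (x, y) lies inside box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max

def center(box):
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def first_containing(boxes, point):
    """Index of the first box that contains `point`, or None if there is none."""
    for index, box in enumerate(boxes):
        if contains(box, point):
            return index
    return None

def match_levels(word_boxes, line_boxes, paragraph_boxes):
    """For each word, find the line and paragraph that contain its center."""
    matches = []
    for word_box in word_boxes:
        point = center(word_box)
        matches.append({
            "line": first_containing(line_boxes, point),
            "paragraph": first_containing(paragraph_boxes, point),
        })
    return matches
```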

FIG. 7 illustrates an example of optical character recognition results according to an embodiment of the present disclosure. A first example 710 shows a conventional optical character recognition result. Conventional optical character recognition methods extract text in an image by pooling or masking regions of interest. In this case, extraction may fail for complex, dense text or for text containing geometric shapes. For example, for the ambiguous text "HOME", which contains a geometric shape, a conventional optical character recognition method may determine only "ME" to be the region of interest because of the geometrically shaped "O". Such a text detection failure may in turn cause recognition of the text in the image to fail.

A second example 720 shows an optical character recognition result of the present disclosure. The optical character recognition method of the present disclosure may detect a text region and generate a reference point (shown as a "+" in FIG. 7) without pooling or masking regions of interest. In this case, the recognizer need not rely heavily on the detected text region. In other words, even if the text region detection result is inaccurate, the recognizer may correctly extract the text in the image by considering the entire extracted image and focusing on the reference point.

With this configuration, complex scene text, such as intersecting text, text within text, and text in various fonts and sizes, may be extracted more accurately. In addition, because the final text recognition result does not depend heavily on the detection of text regions in the image, the method may be robust to text region detection errors. Furthermore, text may be extracted correctly even if the text in the image is rotated.

FIG. 8 illustrates an example of detecting character-level text according to an embodiment of the present disclosure. In one embodiment, character-level text may be detected after a recognizer 810 recognizes word-level text. Specifically, an auto-regressive decoder 820 may be added to the recognizer 810 so that the optical character recognition model does not become heavy. The auto-regressive decoder 820 may autoregressively recognize each character included in the extracted text using a classification score for the extracted text.

In one embodiment, the auto-regressive decoder 820 may detect each character at the same time as it recognizes the character, by predicting the location and angle of each recognized character. In addition, through training, the auto-regressive decoder 820 may correctly recognize and detect characters in foreign languages and in all character orientations. As shown in FIG. 8, the auto-regressive decoder 820 may detect each of "c, h, o, c, o...".
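
The decoding loop described above can be sketched as a greedy autoregressive decoder. In the sketch, `step` is a hypothetical callable standing in for one pass of the decoder: given the characters decoded so far, it returns classification scores for the next token together with a predicted location and angle. Greedy selection of the highest score and the `<eos>` end token are assumptions; the disclosure does not specify the decoding strategy.

```python
def greedy_decode(step, max_length=32, end_token="<eos>"):
    """Decode characters one at a time until `end_token` or `max_length`.

    `step(prefix)` is a hypothetical callable returning
    (scores, location, angle), where `scores` maps each candidate token
    to its classification score.
    """
    prefix = []
    characters = []
    for _ in range(max_length):
        scores, location, angle = step(prefix)
        token = max(scores, key=scores.get)  # token with the highest score
        if token == end_token:
            break
        prefix.append(token)  # the next step is conditioned on this character
        characters.append({"char": token, "location": location, "angle": angle})
    return characters
```

Each returned entry pairs a recognized character with its predicted location and angle, which corresponds to recognizing and detecting the character in the same step.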

In one embodiment, pseudo-labeling may be applied to train the auto-regressive decoder 820. Specifically, a teacher model may be trained on labeled data. The trained teacher model may then generate pseudo-labeled data for weakly labeled data. Here, the weakly labeled data may include data in which word-level text regions have been detected in an image, and the pseudo-labeled data may include data in which the teacher model has detected characters in the weakly labeled data. The auto-regressive decoder 820 may then be trained, as a student model, on both the pseudo-labeled data and the labeled data.
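
The data flow of this teacher-student scheme can be summarized in a short sketch. The names `train_teacher` and `build_student_training_set` and the data layout are hypothetical; the sketch shows only how the student's training set is assembled, not how either model is trained.

```python
def build_student_training_set(labeled, weakly_labeled, train_teacher):
    """Combine labeled data with teacher-generated pseudo-labels.

    labeled:        list of (sample, character_labels) pairs
    weakly_labeled: list of samples that have word-level regions only
    train_teacher:  hypothetical callable; takes the labeled data and returns
                    a teacher that maps a sample to character labels
    """
    teacher = train_teacher(labeled)
    # The teacher fills in the missing character-level labels.
    pseudo_labeled = [(sample, teacher(sample)) for sample in weakly_labeled]
    # The student is trained on both the labeled and the pseudo-labeled data.
    return labeled + pseudo_labeled
```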

FIG. 9 illustrates an example of detecting line-level and paragraph-level text regions in a document according to an embodiment of the present disclosure. A first example 910 shows word-level text detected in a document by the optical character recognition method of the present disclosure. A second example 920 shows line-level text detected in a document by the optical character recognition method of the present disclosure. A third example 930 shows paragraph-level text detected in a document by the optical character recognition method of the present disclosure.

In one embodiment, features related to text may be extracted from a document using the features extracted by the transformer encoder. Based on these features, the word detector, the line detector, and the paragraph detector may perform optical character recognition at the word, line, and paragraph levels, respectively. In this case, the word detector, the line detector, and the paragraph detector may detect text at their respective levels at least partially in parallel. In addition, in the post-processing stage, each item of word-level text may be matched to the line-level text region and the paragraph-level text region that contain it.

With this configuration, the detected text may be linked more organically by matching each item of word-level text to the text regions that contain it. As a result, when a translation service is provided after text is extracted from a document by optical character recognition, a higher-quality translation service may be provided.

FIG. 10 is a flowchart illustrating an example of a method 1000 according to an embodiment of the present disclosure. In one embodiment, the method 1000 may be performed by at least one processor. The method 1000 may begin with the processor detecting at least one text region from an image (S1010). Specifically, the processor may extract features related to word-level text from the image. The processor may also generate position information for the at least one text region based on the features.

The processor may then generate at least one reference point associated with the at least one text region (S1020). Each of the at least one reference point may include the center point of the corresponding text region.

The processor may then extract at least one text from the image based on the image and the at least one reference point (S1030). Specifically, the processor may generate a plurality of offset points adjacent to the at least one reference point. The plurality of offset points may guide the recognizer to attend to regions adjacent to the offset points and extract the text. The processor may also extract the at least one text from the image based on the image and the plurality of offset points.
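
The three steps S1010 to S1030 can be read as a simple pipeline. The sketch below wires them together with hypothetical callables (`detect_regions`, `make_reference_point`, and `extract_text`) that stand in for the detector, the reference point generation, and the recognizer. It illustrates only the order of the steps and the data passed between them.

```python
def run_ocr(image, detect_regions, make_reference_point, extract_text):
    """Run the three steps of method 1000 with caller-supplied stand-ins."""
    # S1010: detect at least one text region from the image.
    regions = detect_regions(image)
    # S1020: generate one reference point per detected text region.
    points = [make_reference_point(region) for region in regions]
    # S1030: extract text using the whole image and each reference point,
    # without cropping the image to the detected region.
    return [extract_text(image, point) for point in points]
```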

In one embodiment, the at least one text region may include a first text region and a second text region. The at least one reference point may include a first reference point associated with the first text region and a second reference point associated with the second text region. In this case, the processor may extract a first text from the first text region based on the image and the first reference point. Furthermore, the processor may extract a second text from the second text region based on the image and the second reference point, in parallel with at least part of the step of extracting the first text.

In one embodiment, the extracted text may be word-level text. In this case, the processor may autoregressively detect each of at least one character included in the extracted text using a classification score for the extracted text. The processor may also predict the position and angle of each of the at least one character within the image.

In one embodiment, the processor may extract features related to line-level text from the image. Based on the features, the processor may generate position information for at least one text region.

In one embodiment, the processor may extract features related to paragraph-level text from the image. Based on the features, the processor may generate position information for at least one text region.

In one embodiment, the processor may detect at least one word-level text region from the image. The processor may also detect at least one line-level text region from the image, and at least one paragraph-level text region from the image. In this case, detecting the word-level text region, detecting the line-level text region, and detecting the paragraph-level text region may be performed at least partially in parallel. Furthermore, the processor may detect position information for at least one of the line-level text region or the paragraph-level text region that contains the extracted text.

The method described above may be provided as a computer program stored on a computer-readable recording medium for execution by a computer. The medium may continuously store the computer-executable program, or may temporarily store it for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or a combination of several pieces of hardware, and is not limited to a medium directly connected to a particular computer system; it may also be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and other media configured to store program instructions. Other examples of the medium include recording or storage media managed by app stores that distribute applications, or by sites or servers that supply or distribute various other software.

The methods, operations, or techniques of the present disclosure may be implemented by various means. For example, such techniques may be implemented in hardware, firmware, software, or a combination thereof. Those of ordinary skill in the art will understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the present disclosure may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design requirements imposed on the overall system. Those of ordinary skill in the art may implement the described functionality in various ways for each particular application, but such implementations should not be interpreted as departing from the scope of the present disclosure.

In a hardware implementation, the processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in the present disclosure, computers, or combinations thereof.

Accordingly, the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure may be implemented or performed by a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but alternatively the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example, a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

In a firmware and/or software implementation, the techniques may be implemented as instructions stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), nonvolatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, a compact disc (CD), or a magnetic or optical data storage device. The instructions may be executable by one or more processors and may cause the processors to perform certain aspects of the functionality described in the present disclosure.

If implemented in software, the techniques may be stored on, or transmitted via, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of non-limiting example, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. In addition, any connection is properly termed a computer-readable medium.

For example, if software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include CDs, laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media.

A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

Although the embodiments described above have been described as utilizing aspects of the presently disclosed subject matter in one or more stand-alone computer systems, the present disclosure is not limited thereto and may be implemented in connection with any computing environment, such as a network or distributed computing environment. Furthermore, aspects of the subject matter of the present disclosure may be implemented in a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices may include PCs, network servers, and portable devices.

Although the present disclosure has been described herein in connection with some embodiments, various modifications and changes may be made without departing from the scope of the present disclosure as understood by those of ordinary skill in the art to which the invention of the present disclosure pertains. Such modifications and changes should be considered to fall within the scope of the claims appended hereto.

110: First example
112: Text region
120: Second example
122: Reference point
124_1 to 124_4: Plurality of offset points

Claims

1. A deep learning-based optical character recognition (OCR) method executed by at least one processor, the method comprising:
detecting at least one text region from an image based on model data obtained by deep learning;
inferring and generating at least one reference point associated with the at least one detected text region so as to reduce a difference between the model data and the at least one detected text region; and
predicting and extracting at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein detecting the at least one text region from the image comprises:
extracting features related to word-level text from the image; and
generating position information for the at least one text region based on the features.

2. The deep learning-based optical character recognition method of claim 1, wherein each of the at least one reference point includes a center point of each of the at least one text region.

3. A deep learning-based optical character recognition (OCR) method executed by at least one processor, the method comprising:
detecting at least one text region from an image based on model data obtained by deep learning;
inferring and generating at least one reference point associated with the at least one detected text region so as to reduce a difference between the model data and the at least one detected text region; and
predicting and extracting at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein the at least one text region includes a first text region and a second text region,
the at least one reference point includes a first reference point associated with the first text region and a second reference point associated with the second text region, and
extracting the at least one text from the image comprises:
extracting a first text from the first text region based on the image and the first reference point; and
extracting a second text from the second text region based on the image and the second reference point, at least partially in parallel with extracting the first text.

4. A deep learning-based optical character recognition (OCR) method executed by at least one processor, the method comprising:
detecting at least one text region from an image based on model data obtained by deep learning;
inferring and generating at least one reference point associated with the at least one detected text region so as to reduce a difference between the model data and the at least one detected text region; and
predicting and extracting at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein extracting the at least one text from the image comprises:
generating a plurality of offset points adjacent to the at least one reference point; and
extracting the at least one text from the image based on the image and the plurality of offset points.

5. The deep learning-based optical character recognition method of claim 4, wherein the offset points guide text extraction by attention to regions adjacent to the offset points.

A deep learning-based optical character recognition (OCR) method executed by at least one processor, comprising:
detecting at least one text region from an image based on model data obtained by deep learning;
inferring and generating at least one reference point associated with the at least one detected text region so as to reduce a difference between the model data and the at least one detected text region; and
predicting and extracting at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein the extracted text is word-based text, and
the method further comprises autoregressively detecting at least one character included in the extracted text using a classification score for the extracted text.
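
As a non-limiting illustration of autoregressive character detection using classification scores, the sketch below performs greedy decoding: at each step the candidate with the highest score is appended, and the decoded prefix is fed back for the next step. The scoring function `toy_scores`, the end-of-sequence marker `EOS`, and the greedy strategy are assumptions made for this example; they stand in for a trained classifier.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker


def toy_scores(prefix, target="CAFE"):
    """Hypothetical classifier: scores each candidate symbol given the decoded prefix."""
    vocab = list("ABCDEF") + [EOS]
    next_symbol = target[len(prefix)] if len(prefix) < len(target) else EOS
    return {symbol: (1.0 if symbol == next_symbol else 0.0) for symbol in vocab}


def greedy_decode(score_fn, max_len=16):
    """Decode one character per step, conditioning each step on the previous output."""
    decoded = []
    for _ in range(max_len):
        scores = score_fn(decoded)
        best = max(scores, key=scores.get)  # highest classification score wins
        if best == EOS:
            break
        decoded.append(best)
    return "".join(decoded)


word = greedy_decode(toy_scores)
print(word)
```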

The deep learning-based optical character recognition method of claim 6, further comprising predicting a position and an angle of each of the at least one character within the image.

A deep learning-based optical character recognition (OCR) method executed by at least one processor, comprising:
detecting at least one text region from an image based on model data obtained by deep learning;
inferring and generating at least one reference point associated with the at least one detected text region so as to reduce a difference between the model data and the at least one detected text region; and
predicting and extracting at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein the step of detecting at least one text region from the image comprises:
extracting text-related features from the image in line units; and
generating position information for the at least one text region based on the features.

A deep learning-based optical character recognition (OCR) method executed by at least one processor, comprising:
detecting at least one text region from an image based on model data obtained by deep learning;
inferring and generating at least one reference point associated with the at least one detected text region so as to reduce a difference between the model data and the at least one detected text region; and
predicting and extracting at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein the step of detecting at least one text region from the image comprises:
extracting text-related features from the image in paragraph units; and
generating position information for the at least one text region based on the features.

A deep learning-based optical character recognition (OCR) method executed by at least one processor, comprising:
detecting at least one text region from an image based on model data obtained by deep learning;
inferring and generating at least one reference point associated with the at least one detected text region so as to reduce a difference between the model data and the at least one detected text region; and
predicting and extracting at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein the step of detecting at least one text region from the image comprises:
detecting at least one word-based text region from the image;
detecting at least one line-based text region from the image; and
detecting at least one paragraph-based text region from the image, and
wherein the steps of detecting the word-based text region, detecting the line-based text region, and detecting the paragraph-based text region are performed at least partially in parallel.
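
As a non-limiting illustration of detecting word-based, line-based, and paragraph-based regions at least partially in parallel, the sketch below runs three toy detectors concurrently. For simplicity the "page" is a plain string rather than an image, and the detectors `detect_words`, `detect_lines`, and `detect_paragraphs` are hypothetical stand-ins that only show the three granularities; they are not the detectors of the claims.

```python
from concurrent.futures import ThreadPoolExecutor


def detect_words(page):
    """Toy word-level detector: whitespace-separated tokens."""
    return page.split()


def detect_lines(page):
    """Toy line-level detector: non-empty lines."""
    return [line for line in page.splitlines() if line.strip()]


def detect_paragraphs(page):
    """Toy paragraph-level detector: blocks separated by a blank line."""
    return [block.strip() for block in page.split("\n\n") if block.strip()]


def detect_all_levels(page):
    """Submit the three detectors to a thread pool so they may run concurrently."""
    detectors = {"word": detect_words, "line": detect_lines, "paragraph": detect_paragraphs}
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        futures = {level: pool.submit(fn, page) for level, fn in detectors.items()}
        return {level: future.result() for level, future in futures.items()}


page = "SALE today\nall items\n\nOPEN 24H"
regions = detect_all_levels(page)
print(regions)
```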

The deep learning-based optical character recognition method of claim 10, further comprising detecting position information of at least one of a line-based text region or a paragraph-based text region that includes the extracted text.
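
As a non-limiting illustration of locating the line-based or paragraph-based region that includes an extracted text, the sketch below tests whether the reference point of a word lies inside an axis-aligned box. The box format `(x_min, y_min, x_max, y_max)`, the containment test, and the names `contains` and `find_enclosing_region` are assumptions made for this example; actual regions may be arbitrary polygons.

```python
def contains(box, point):
    """Return True if the point lies inside the axis-aligned box (edges included)."""
    x_min, y_min, x_max, y_max = box
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def find_enclosing_region(word_point, regions):
    """Return the index of the first region containing the point, or None."""
    for index, box in enumerate(regions):
        if contains(box, word_point):
            return index
    return None


line_boxes = [(0, 0, 100, 20), (0, 30, 100, 50)]
line_index = find_enclosing_region((40, 38), line_boxes)
print(line_index)
```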

A computer-readable computer program for causing a computer to execute the method according to any one of claims 1 to 11.

A deep learning-based optical character recognition (OCR) system executed by at least one processor, comprising:
a detector that detects at least one text region from an image based on model data obtained by deep learning, and infers and generates at least one reference point associated with the at least one text region so as to reduce a difference between the model data and the detected at least one text region; and
a recognizer that predicts and extracts at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
the optical character recognition system further comprising:
a backbone for extracting at least one text-related feature from the image; and
a transformer encoder for encoding the at least one feature.

A deep learning-based optical character recognition (OCR) system executed by at least one processor, comprising:
a detector that detects at least one text region from an image based on model data obtained by deep learning, and infers and generates at least one reference point associated with the at least one text region so as to reduce a difference between the model data and the detected at least one text region; and
a recognizer that predicts and extracts at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein the detector:
extracts a bounding polygon of a text region from the image using a segmentation map;
extracts center coordinates of the text region from the bounding polygon; and
determines the center coordinates as the reference point.
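
As a non-limiting illustration of deriving center coordinates from a bounding polygon, the sketch below computes the area centroid of a simple polygon with the shoelace formula and falls back to the vertex mean for degenerate polygons. The choice of the area centroid as the "center" and the name `polygon_centroid` are assumptions made for this example; the claim does not specify how the center coordinates are computed.

```python
def polygon_centroid(vertices):
    """Area centroid of a simple polygon given as a list of (x, y) vertices."""
    n = len(vertices)
    twice_area = 0.0
    cx = cy = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        # Degenerate polygon (zero area): fall back to the mean of the vertices.
        return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)
    # Centroid = sum / (6 * area), and 6 * area == 3 * twice_area.
    return (cx / (3.0 * twice_area), cy / (3.0 * twice_area))


bounding_polygon = [(0, 0), (4, 0), (4, 2), (0, 2)]
reference_point = polygon_centroid(bounding_polygon)
print(reference_point)
```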

A deep learning-based optical character recognition (OCR) system executed by at least one processor, comprising:
a detector that detects at least one text region from an image based on model data obtained by deep learning, and infers and generates at least one reference point associated with the at least one text region so as to reduce a difference between the model data and the detected at least one text region; and
a recognizer that predicts and extracts at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein the recognizer autoregressively predicts at least one text from a text instance associated with the at least one text region using deformable attention while referring to the image and the at least one reference point.

A deep learning-based optical character recognition (OCR) system executed by at least one processor, comprising:
a detector that detects at least one text region from an image based on model data obtained by deep learning, and infers and generates at least one reference point associated with the at least one text region so as to reduce a difference between the model data and the detected at least one text region; and
a recognizer that predicts and extracts at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein the recognizer comprises:
a first recognizer for extracting a first text from the image based on the image and a first reference point associated with a first text region; and
a second recognizer for extracting a second text from the image based on the image and a second reference point associated with a second text region, and
wherein the first recognizer and the second recognizer perform text extraction at least partially in parallel.

A deep learning-based optical character recognition (OCR) system executed by at least one processor, comprising:
a detector that detects at least one text region from an image based on model data obtained by deep learning, and infers and generates at least one reference point associated with the at least one text region so as to reduce a difference between the model data and the detected at least one text region; and
a recognizer that predicts and extracts at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein the recognizer uses a classification score for the extracted text to autoregressively detect at least one character included in the extracted text, and predicts a position and an angle of each of the at least one character within the image.

A deep learning-based optical character recognition (OCR) system executed by at least one processor, comprising:
a detector that detects at least one text region from an image based on model data obtained by deep learning, and infers and generates at least one reference point associated with the at least one text region so as to reduce a difference between the model data and the detected at least one text region; and
a recognizer that predicts and extracts at least one text from the image based on the image and the at least one reference point so as to minimize a loss function,
wherein the detector comprises:
a first detector for detecting word-based text regions;
a second detector for detecting line-based text regions; and
a third detector for detecting paragraph-based text regions, and
wherein the first detector, the second detector, and the third detector perform text region detection at least partially in parallel.