TWI854101B

TWI854101B - Instant-messaging bot for robotic process automation and robotic textual-content extraction from images

Info

Publication number: TWI854101B
Application number: TW110105935A
Authority: TW
Inventors: 曾柄元; 傅楸善; 沈立健; 熊暉
Original assignee: 八維智能股份有限公司
Priority date: 2020-10-12
Filing date: 2021-02-20
Publication date: 2024-09-01
Also published as: TW202215286A; CN114419611A

Abstract

An enterprise instant-messaging (IM) platform containing an enterprise chatbot is connected to one or more public IM platforms over the Internet. A robotic-process-automation (RPA) manager contains multiple modules of enterprise workflows and receives instructions from the enterprise chatbot for executing individual workflows. The system allows enterprise users connected to the enterprise IM platform, and external users connected to the public IM platforms, to use instant messaging to initiate enterprise internal or customer-facing workflows that are automated with the help of the enterprise chatbot. Furthermore, textual-content extraction from digital images can be incorporated in the RPA manager as an enterprise workflow, and improved convolutional neural network (CNN) methods for textual-content extraction are provided.

Description

Using instant messaging robots to automate robotic processes and extract text from images Robotic processes for content

本發明涉及機器人流程自動化、文字抽取、文字偵測、文字識別、電腦視覺、卷積神經網路、對話機器人、自然語言處理、對話式使用者介面、即時簡訊(簡訊應用程式)和人機對話。本揭露描述用於機器人流程自動化的即時簡訊機器人的系統和方法，以及從影像中抽取文字內容的機器人流程方法。 The invention relates to robotic process automation, text extraction, text detection, text recognition, computer vision, convolutional neural networks, conversational robots, natural language processing, conversational user interfaces, instant messaging (SMS applications), and human-computer dialogue. This disclosure describes systems and methods for instant messaging robots for robotic process automation, and robotic process methods for extracting text content from images.

本發明欲解決與機器人流程自動化(robotic process automation，RPA)有關的三個問題：(1)結合了人工智能和對話式使用者介面的RPA，(2)透過即時簡訊和移動互聯網傳遞的機器人工作流程，以及(3)機器人從數位影像中抽取文字內容的方法。這些問題將在下面逐一描述。 The present invention aims to solve three problems related to robotic process automation (RPA): (1) RPA that combines artificial intelligence and a conversational user interface, (2) robotic workflows delivered via instant messaging and mobile Internet, and (3) methods for robotic extraction of text content from digital images. These problems will be described one by one below.

機器人流程自動化是指用軟體解決方案完成的商業流程自動化；與提高生產和物流效率的實體機器人類似，軟體機器人可以提高商業流程的效率，因此RPA對企業實現數位化轉型至關重要。對於企業而言，使用RPA自動化的商業工作流程可包括其內部流程以及面對客戶的流程。傳統的RPA是由電腦軟體基於規則的演算法來執行，但為了讓RPA更有智慧且便於使用，現代RPA需要結合機器學習和深度學習演算法，甚至用對話機器人作為其對話式使用者介面。 Robotic process automation refers to the automation of business processes using software solutions; similar to physical robots that improve production and logistics efficiency, software robots can improve the efficiency of business processes, so RPA is crucial for enterprises to achieve digital transformation. For enterprises, business workflows automated using RPA can include their internal processes as well as customer-facing processes. Traditional RPA is performed by computer software based on rule-based algorithms, but in order to make RPA more intelligent and easy to use, modern RPA needs to combine machine learning and deep learning algorithms, and even use conversational robots as its conversational user interface.

越來越多商業流程透過互聯網、尤其是移動互聯網來進行。針對個別企業的特定商業流程所開發的手機應用程式成本既高，而企業外部的使用者對這種專用的應用程式又興趣缺缺，不願意下載及使用。因此，市場上對於不需下載特定應用程式而又能透過移動互聯網服務客戶的RPA，就有相當的需求。此外，即時簡訊已經取代了電話、電子郵件，成為我們日常生活中最普遍的通訊方式；公用的即時簡訊程式已是隨手可得，幾個普及率高的簡訊平台，如WhatsApp、FacebookMessenger和WeChat，都各自擁有超過十億個活躍用戶。利用公用即時簡訊程式的普及性，將其作為企業外部客戶的主要使用者介面，本發明藉即時簡訊機器人透過移動互聯網提供機器人流程，而且無需外部客戶下載任何特定的手機應用程式。 More and more business processes are being conducted over the Internet, especially mobile Internet. Mobile applications developed for specific business processes of individual companies are expensive, and users outside the company are not interested in such dedicated applications and are unwilling to download and use them. Therefore, there is a considerable demand in the market for RPA that can serve customers through mobile Internet without downloading specific applications. In addition, instant messaging has replaced telephones and emails and has become the most common way of communication in our daily lives; public instant messaging programs are readily available, and several popular messaging platforms, such as WhatsApp, Facebook Messenger and WeChat, each have more than one billion active users. Taking advantage of the popularity of public instant messaging programs as the primary user interface for external customers of an enterprise, the present invention provides robotic processes via mobile Internet using instant messaging robots without requiring external customers to download any specific mobile phone applications.

近年來，從線上傳遞的數位影像中即時抽取文字訊息的需求已大幅增加。例如，使用者在線上將支票存入銀行帳戶的流程：透過手機應用程式內的相機拍攝支票的照片，上傳支票影像以即時抽取文字內容，確認結果；整個流程只需在手機面板上幾個簡單的觸控動作就完成了。類似地，使用者可以透過應用程式上傳收據照片或其他損失證明的文件影本，來提出保險索賠。為了使上述流程自動化，從數位影像中即時抽取文字內容是必不可缺的步驟。 In recent years, the demand for real-time extraction of text messages from digital images transmitted online has increased significantly. For example, the process of a user depositing a check into a bank account online: taking a photo of the check through the camera in the mobile app, uploading the check image to extract the text content in real time, and confirming the result; the entire process can be completed with just a few simple touch actions on the mobile phone panel. Similarly, users can upload photos of receipts or other proof of loss documents through the app to make insurance claims. In order to automate the above process, real-time extraction of text content from digital images is an indispensable step.

過去數十年來，文字內容的抽取已可透過傳統的光學文字辨識(OCR)技術達成；然而，儘管傳統的OCR對於在乾淨的背景上且文檔格式明確的影像具有良好的文字抽取效果，但是對於未定格式或複雜背景上的影像則效果不佳。近年來，卷積神經網路(Convolutional Neural Network,CNN)技術已被應用於文字內容的抽取，對於具有多元格式及背景的影像有較佳的泛用性；但是，基於CNN的文字內容抽取的實用方法仍需要進一步開發和改良。 In the past few decades, text content extraction has been achieved through traditional optical character recognition (OCR) technology; however, although traditional OCR has good text extraction effects on images with clear document formats on clean backgrounds, it is not as effective for images with unformatted or complex backgrounds. In recent years, Convolutional Neural Network (CNN) technology has been applied to text content extraction, and has better versatility for images with multiple formats and backgrounds; however, practical methods for text content extraction based on CNN still need further development and improvement.

本發明的目的是透過結合了機器人流程自動化的即時簡訊機器人來解決上述問題，其中一類機器人流程的實施例包含了從數位影像中抽取文字內容的新的CNN方法。 The purpose of the present invention is to solve the above problems through an instant messaging robot combined with robotic process automation, one implementation of which includes a new CNN method for extracting text content from digital images.

本發明揭露以即時簡訊機器人實現機器人流程自動化(RPA)的系統和方法，以及基於卷積神經網路(CNN)從數位影像中抽取文字內容的新方法。這種系統包括為企業(或機構)建置的對話機器人應用程式、軟體RPA管理器和即時簡訊平台。該RPA管理器包含複數個企業工作流程模組，並能接收企業對話機器人的指令以執行各工作流程。該企業即時簡訊平台還通過網際網路連接到一個或多個公用即時簡訊平台。 The present invention discloses a system and method for implementing robotic process automation (RPA) with an instant messaging robot, and a new method for extracting text content from digital images based on a convolutional neural network (CNN). The system includes a conversational robot application built for an enterprise (or organization), a software RPA manager, and an instant messaging platform. The RPA manager includes a plurality of enterprise workflow modules and can receive instructions from the enterprise conversational robot to execute each workflow. The enterprise instant messaging platform is also connected to one or more public instant messaging platforms via the Internet.

透過即時簡訊，該系統讓連接到企業即時簡訊平台的企業用戶和連接到公用即時簡訊平台的外部用戶能夠與企業對話機器人以一對一或群組模式進行溝通。更進一步，企業用戶和外部用戶可利用該系統啟動企業內部或面對客戶的工作流程，並透過企業對話機器人的協助來實現工作流程自動化。 Through instant messaging, the system enables enterprise users connected to the enterprise instant messaging platform and external users connected to the public instant messaging platform to communicate with the enterprise conversation robot in one-on-one or group mode. Furthermore, enterprise users and external users can use the system to activate internal or customer-facing workflows and automate workflows with the assistance of the enterprise conversation robot.

此外，本發明RPA管理器中的機器人流程的實施例還包含從數位影像中抽取文字內容，並為此類流程提供了改良的卷積神經網路(CNN)方法。 In addition, the implementation of the robot process in the RPA manager of the present invention also includes extracting text content from digital images, and provides an improved convolutional neural network (CNN) method for such processes.

100:企業即時簡訊機器人 100:Enterprise instant messaging robot

102:企業對話機器人 102: Enterprise dialogue robot

104:企業即時簡訊平台 104: Enterprise instant messaging platform

106:RPA管理器 106:RPA Manager

108:文字抽取流程 108: Text extraction process

108a~108n:企業工作流程 108a~108n: Enterprise workflow

108tp:第三方RPA工作流程 108tp: Third-party RPA workflow

110:公用即時簡訊平台 110: Public instant messaging platform

112:企業用戶 112: Enterprise users

114:外部用戶 114: External user

116:企業資料庫 116: Enterprise database

200:CT-SSD架構 200: CT-SSD architecture

204:ResNet-18卷積骨幹 204:ResNet-18 Convolutional Core

206:Conv4卷積層 206: Conv4 convolutional layer

208:文字切片偵測層 208: Text slice detection layer

210:非極大值抑制演算法 210: Non-maximum suppression algorithm

212:輸出：文字行外框 212: Output: Text line border

300:分水嶺U網分割架構 300: Watershed U-net segmentation architecture

304~322:U網分割模型 304~322: U network segmentation model

324:文字分布圖 324: Text distribution map

326:侵蝕文字分布圖 326: Distribution map of eroded text

328:分水嶺演算法 328: Watershed Algorithm

330:輸出：最佳化文字分布圖 330: Output: Optimized text distribution map

400:CTC-CNN架構 400:CTC-CNN architecture

402:輸入文字行影像 402: Input text line image

404:ResNet-18卷積骨幹 404:ResNet-18 Convolutional Core

406:Conv5特徵圖 406: Conv5 feature map

408、410:一維卷積運算 408, 410: One-dimensional convolution operation

412:Softmax運算 412: Softmax operation

414:CTC損失函數 414:CTC loss function

416:輸出：文字行內容 416: Output: Text line content

圖1是實施例用於提供機器人流程自動化的即時簡訊機器人的方塊示意圖。 FIG1 is a block diagram of an instant messaging robot for providing robot process automation according to an embodiment.

圖2是實施例用於數位影像的文字偵測的CT-SSD架構圖。 Figure 2 is a diagram of the CT-SSD architecture used in an embodiment for text detection in digital images.

圖3是實施例用於數位影像的文字偵測的分水嶺U網分割架構圖。 Figure 3 is a diagram of the watershed U-net segmentation architecture used in an embodiment for text detection in digital images.

圖4是實施例用於數位文字行影像的文字辨識的CTC-CNN架構圖；其中輸入端文字行影像的解析度及其卷積層特徵圖的解析度分別標注於括號中。 Figure 4 is a diagram of the CTC-CNN architecture used for text recognition of digital text line images in an embodiment; the resolution of the input text line image and the resolution of its convolutional layer feature map are marked in brackets respectively.

圖5是用於影像識別的VGG-16架構圖。 Figure 5 is a diagram of the VGG-16 architecture used for image recognition.

圖6是用於影像識別的ResNet-18架構圖；其中卷積骨幹包含了繞過特定卷積運算的快捷連接(實線箭頭代表恆等式連接；虛線箭頭代表投射式連接)。 Figure 6 is a diagram of the ResNet-18 architecture for image recognition; the convolution backbone contains shortcut connections that bypass specific convolution operations (solid arrows represent constant connections; dashed arrows represent projection connections).

圖7是用於數位影像的一般物件偵測的SSD架構圖。 Figure 7 is a diagram of the SSD architecture for general object detection in digital images.

圖8是SSD物件偵測機制的示意圖(來自W.Liu,D.Anguelov,D.Erhan,C.Szegedy,S.Reed,C.Y.Fu,A.C.Berg,<SSD：單發多框偵測器>，arXiv：1512.02325(2016))。(a)輸入影像，並顯示SSD偵測到貓和狗兩個物件的外框。(b)8x8卷積特徵圖；每個單元格都配置了不同高寬比的偵測框，其中兩個偵測到貓的偵測框用粗虛線表示。(c)4x4卷積特徵圖；每個單元格都配置了不同高寬比的偵測框，其中一個偵測到狗的偵測框用粗虛線表示。 Figure 8 is a schematic diagram of the SSD object detection mechanism (from W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, A. C. Berg, <SSD: Single Shot Multi-Frame Detector>, arXiv: 1512.02325 (2016)). (a) Input image, and display of the outer frames of two objects detected by SSD: cat and dog. (b) 8x8 convolution feature map; each cell is configured with detection frames of different aspect ratios, and the two detection frames of cats are represented by thick dashed lines. (c) 4x4 convolution feature map; each cell is configured with detection boxes of different aspect ratios, and one of the detection boxes that detected a dog is indicated by a thick dashed line.

圖9是用於數位影像的文字偵測的CTPN架構圖。 Figure 9 is a diagram of the CTPN architecture for text detection in digital images.

圖10是CTPN文字偵測機制的示意圖。在Conv5特徵圖的水平方向上逐格滑動的3x3偵測框先產生一序列的文字切片提議，以供後續處理。VGG-16的Conv5特徵圖的空間解析度(WxH)視輸入影像的解析度而定；若輸入影像為512x512像素，則W=32，即每個特徵單元格的寬度相當於原影像中的 16像素。 Figure 10 is a schematic diagram of the CTPN text detection mechanism. The 3x3 detection box that slides horizontally in the Conv5 feature map first generates a sequence of text slice proposals for subsequent processing. The spatial resolution (WxH) of the Conv5 feature map of VGG-16 depends on the resolution of the input image; if the input image is 512x512 pixels, then W=32, that is, the width of each feature unit cell is equivalent to 16 pixels in the original image.

圖11是實施例對ResNet-18的Conv4特徵圖進行採樣的CT-SSD文字切片偵測器的示意圖。若輸入影像為512x512像素，則偵測器的寬度固定為16像素(即特徵單元格的寬度)。 Figure 11 is a schematic diagram of the CT-SSD text slice detector for sampling the Conv4 feature map of ResNet-18 in an embodiment. If the input image is 512x512 pixels, the width of the detector is fixed to 16 pixels (i.e., the width of the feature unit cell).

圖12是使用本發明CT-SSD文字偵測的示例，顯示(a)偵測到的候選文字切片，以及(b)文字切片經過組合演算後產生的文字行。 Figure 12 is an example of using the CT-SSD text detection of the present invention, showing (a) the detected candidate text slices, and (b) the text lines generated after the text slices are combined and calculated.

圖13是用於數位影像語義分割的U網架構圖。 Figure 13 is a U-grid structure used for digital image semantic segmentation.

圖14是使用本發明分水嶺U網分割執行文字偵測的示例，顯示(a)輸入影像，(b)經U網分割之後的文字分布圖，以及(c)經U網分割之後的侵蝕文字分布圖。將圖14(b)和(c)的兩幀半完成的文字分佈圖再經過分水嶺演算法處理，即可產生輸出端的最佳化文字分布圖。 FIG14 is an example of using the watershed U-net segmentation of the present invention to perform text detection, showing (a) input image, (b) text distribution map after U-net segmentation, and (c) eroded text distribution map after U-net segmentation. The two and a half frames of completed text distribution map of FIG14 (b) and (c) are further processed by the watershed algorithm to generate the optimized text distribution map at the output.

圖15是實施例在文字偵測之前，使用重疊方塊法作影像預處理的示例。在此示例中，輸入影像(陰影矩形)的解析度為1920x1080像素，每個方塊(虛線矩形)的解析度為512x512像素，相鄰方塊之間的重疊為32像素，位於邊緣的方塊視需要作補空填充。 FIG15 is an example of an embodiment using overlapping block method for image preprocessing before text detection. In this example, the resolution of the input image (shaded rectangle) is 1920x1080 pixels, the resolution of each block (dashed rectangle) is 512x512 pixels, the overlap between adjacent blocks is 32 pixels, and the blocks at the edge are padded as needed.

圖1說明用於機器人流程自動化(robotic process automation,RPA)的企業即時簡訊機器人100的實施例。系統包括企業(或機構)專屬的對話機器人軟體102、即時簡訊平台104、RPA管理器軟體106。RPA管理器106包含複數個工作流程模組108-108n，每個模組可為企業提供一種自動化的工作流程。企業即時簡訊平台104還通過互聯網連接到公用即時簡訊平台110。系統讓連接到企業即時簡訊平台104的企業用戶112、連接到公用即時簡訊平台110的外部用戶114、企業對話機器人102三方能使用即時簡訊以一對一或群組模式進行溝通。 FIG1 illustrates an embodiment of an enterprise instant messaging robot 100 for robotic process automation (RPA). The system includes an enterprise (or organization)-specific conversation robot software 102, an instant messaging platform 104, and an RPA manager software 106. The RPA manager 106 includes a plurality of workflow modules 108-108n, each of which can provide an automated workflow for the enterprise. The enterprise instant messaging platform 104 is also connected to a public instant messaging platform 110 via the Internet. The system allows enterprise users 112 connected to the enterprise instant messaging platform 104, external users 114 connected to the public instant messaging platform 110, and the enterprise conversation robot 102 to communicate using instant messaging in one-to-one or group mode.

企業對話機器人102包含以模擬真人方式接收、處理、分析和回應人類訊息的軟體。它由三個主要部分組成：(1)自然語言處理及理解(NLP/NLU)模組，用於分析來自企業用戶112或外部用戶114傳入訊息的意圖；(2)對話管理(DM)模組，用於解釋NLP/NLU模組的輸出(意圖)及分析進行中的人機對話情境，包括先前的訊息或與該對話相關的其他資訊，並據以提供回應指令作為輸出；以及(3)自然語言生成(NLG)模組，用於從DM模組接收回應指令並產生對企業用戶112或外部用戶114的回應訊息。 The enterprise conversation robot 102 includes software for receiving, processing, analyzing and responding to human messages in a simulated real-person manner. It consists of three main parts: (1) a natural language processing and understanding (NLP/NLU) module for analyzing the intent of incoming messages from enterprise users 112 or external users 114; (2) a conversation management (DM) module for interpreting the output (intent) of the NLP/NLU module and analyzing the ongoing human-computer conversation context, including previous messages or other information related to the conversation, and providing response instructions as output; and (3) a natural language generation (NLG) module for receiving response instructions from the DM module and generating response messages to enterprise users 112 or external users 114.

企業即時簡訊平台104包括管理企業用戶112、外部用戶114、企業對話機器人102三方之間的即時簡訊交換的軟體。在圖1中，實線箭頭、點線箭頭、虛線箭頭分別表示企業用戶112與對話機器人102之間、外部用戶114與對話機器人102之間、以及企業用戶112與外部用戶114之間的訊息交換。交換的訊息可以包含文字、影像、視頻、超連結或其他數位內容。有關本發明的企業對話機器人102和企業即時簡訊平台104的更多細節，請參見2019年11月7日的美國專利申請號16/677,645。 The enterprise instant messaging platform 104 includes software for managing instant messaging exchanges between enterprise users 112, external users 114, and enterprise conversation robot 102. In FIG1, solid arrows, dotted arrows, and dashed arrows represent message exchanges between enterprise users 112 and conversation robot 102, between external users 114 and conversation robot 102, and between enterprise users 112 and external users 114, respectively. The exchanged messages may include text, images, videos, hyperlinks, or other digital content. For more details about the enterprise conversation robot 102 and enterprise instant messaging platform 104 of the present invention, please refer to U.S. Patent Application No. 16/677,645, dated November 7, 2019.

RPA管理器106包含用於配置和執行內建工作流程108-108n的軟體。由第三方開發者提供的一個或多個RPA工作流程108tp也可選擇性地透過應用程式介面(application programming interface,API)連接到RPA管理器106並由其所控制。為啟動企業內部或面對客戶的某個機器人工作流程，企業用戶112或外部用戶114可以向企業對話機器人102傳送訊息表達這樣的意圖，而企業對話機器人102即會回應並指示RPA管理器106執行指定的工作流程。 The RPA manager 106 includes software for configuring and executing built-in workflows 108-108n. One or more RPA workflows 108tp provided by third-party developers can also be optionally connected to and controlled by the RPA manager 106 through an application programming interface (API). To activate a robotic workflow within the enterprise or facing customers, an enterprise user 112 or an external user 114 can send a message to the enterprise dialogue robot 102 to express such intention, and the enterprise dialogue robot 102 will respond and instruct the RPA manager 106 to execute the specified workflow.

某些工作流程是端對端的，即只需接收輸入就可以直接產生輸出；其他工作流程則是互動式的，即用戶112或114、對話機器人102、RPA管理器106需要在過程中進行來回互動才會產生輸出。在某些情況下，企業對話機器人102和RPA管理器106還需要連接到企業資料庫116，以便取得與進行中的即時簡訊對話或機器人工作流程相關的必要資訊。 Some workflows are end-to-end, i.e., they can directly generate outputs by simply receiving inputs; other workflows are interactive, i.e., users 112 or 114, dialogue robots 102, and RPA managers 106 need to interact back and forth in the process to generate outputs. In some cases, the enterprise dialogue robot 102 and RPA manager 106 also need to connect to the enterprise database 116 to obtain necessary information related to the ongoing instant messaging conversation or robot workflow.

由於對話機器人無法理解人類語言的細微差異，或者由於缺乏足夠的資訊，它回應用戶的查詢而提供錯誤的答案、或根本沒有答案的情況並不少見；這可能被認為是不良的用戶體驗，導致對企業的負面印象。本發明提供機制讓企業用戶112(例如，客戶服務人員)得以即時介入外部用戶114(例如，客戶)與企業對話機器人102之間進行中的簡訊對話；如此，外部用戶在與對話機器人對話中可能遇到的摩擦或挫折就能得到即時補救，這可以幫助企業提升機器人流程的用戶體驗。 It is not uncommon for a conversational robot to respond to a user's query with an incorrect answer or no answer at all, due to its inability to understand the nuances of human language or due to a lack of sufficient information; this may be considered a poor user experience, leading to a negative impression of the enterprise. The present invention provides a mechanism for an enterprise user 112 (e.g., a customer service person) to immediately intervene in an ongoing text message conversation between an external user 114 (e.g., a customer) and the enterprise conversational robot 102; in this way, friction or frustration that an external user may encounter in a conversation with the conversational robot can be immediately remedied, which can help enterprises improve the user experience of robot processes.

本發明提供了包括日常辦公室流程的機器人工作流程，如企業內部的自動化會議安排。會議發起人，即企業用戶112，僅需向企業對話機器人102發出會議要求，包括會議主題、參與人、預期的時間及場地；在RPA管理器106的協助下，企業對話機器人102會先在幕後比對每個參與人、場地的時程表，然後向會議發起人提出最佳時間及場地建議，並取得確認；一旦確認，企業對話機器人102即向各參與人(也是企業用戶)發出會議邀請簡訊，還會在會議前發送提醒簡訊。 The present invention provides a robot workflow including daily office processes, such as automated meeting arrangements within an enterprise. The meeting initiator, that is, the enterprise user 112, only needs to send a meeting request to the enterprise dialogue robot 102, including the meeting theme, participants, expected time and venue; with the assistance of the RPA manager 106, the enterprise dialogue robot 102 will first compare the schedule of each participant and venue behind the scenes, and then propose the best time and venue to the meeting initiator and obtain confirmation; once confirmed, the enterprise dialogue robot 102 will send a meeting invitation SMS to each participant (also an enterprise user), and will also send a reminder SMS before the meeting.

辦公室工作流程的另一個例子是自動化請假、准假流程。在這個例子中，一個企業用戶112經由企業對話機器人102向其上級主管(另一個企業用戶)提交請假申請；在RPA管理器106的協助下，企業對話機器人102會引導主管執行簽核，以確保企業的請假規定是被嚴格遵循的，也會適時發送提醒簡訊給主管，以確保簽核能及時完成。 Another example of an office workflow is the automated leave application and leave approval process. In this example, an enterprise user 112 submits a leave application to his supervisor (another enterprise user) through the enterprise dialogue robot 102; with the assistance of the RPA manager 106, the enterprise dialogue robot 102 will guide the supervisor to sign and approve to ensure that the enterprise's leave regulations are strictly followed, and will also send reminder text messages to the supervisor in a timely manner to ensure that the approval can be completed in time.

本發明還提供了從數位影像中抽取文字內容的機器人工作流程。這種工作流程中最基本的是端對端的，其中用戶112或114將數位影像傳送給企業對話機器人102，後者將影像轉發給RPA管理器106進行文字內容抽取；其結果會再經過對話機器人102回傳給用戶112或114。 The present invention also provides a robotic workflow for extracting text content from digital images. The most basic of such workflows is end-to-end, in which a user 112 or 114 transmits a digital image to an enterprise dialogue robot 102, which forwards the image to an RPA manager 106 for text content extraction; the result is then transmitted back to the user 112 or 114 via the dialogue robot 102.

涉及文字內容抽取的互動式機器人工作流程，可以用以下機器人退貨授權(return merchandise authorization,RMA)的實施例來說明。RMA對於企業的售後服務至關重要；儘管大多數RMA索賠是例行性和重複性的，但它們佔據了客戶服務人員的寶貴時間，因此市場上對機器人RMA的需求很強。 The interactive robotic workflow involving text content extraction can be illustrated by the following implementation of a robotic return merchandise authorization (RMA). RMA is crucial for a company's after-sales service; although most RMA claims are routine and repetitive, they take up the valuable time of customer service personnel, so there is a strong demand for robotic RMA in the market.

以下的機械人RMA流程代表了本發明的一個實施例：(1)因某件損壞的商品，外部用戶114(客戶)透過公用即時簡訊平台110向企業對話機器人102發送訊息要求退貨授權RMA，並附有該商品標籤的照片；(2)利用其NLP/NLU功能，企業對話機器人102得以瞭解該外部用戶的意圖；它將商品標籤影像轉傳給RPA管理器106，後者即執行文字內容抽取108，並將結果傳回企業對話機器人102；(3)企業對話機器人102將商品標籤抽取的文字內容(如型號、序列碼)與企業資料庫116中的銷售記錄進行比對，以確認該商品是否在保固期內； (4)根據商品保固狀態，企業對話機器人102向外部用戶114發送RMA授權碼或發送拒絕訊息；及(5)過程中如果外部用戶114的回應訊息含有負面情緒，則對話機器人102立即將問題升級至企業用戶112(客戶服務人員)，之後由企業用戶112接手與外部用戶114進行後續對話。 The following robotic RMA process represents an embodiment of the present invention: (1) Due to a damaged product, an external user 114 (customer) sends a message to the enterprise dialogue robot 102 through the public instant messaging platform 110 to request a return authorization RMA, and attaches a photo of the product label; (2) Using its NLP/NLU function, the enterprise dialogue robot 102 is able to understand the external user's intention; it forwards the product label image to the RPA manager 106, which then performs text content extraction 108 and returns the result to the enterprise dialogue robot 102; (3) The enterprise dialogue The robot 102 compares the text content (such as model number, serial number) extracted from the product label with the sales record in the enterprise database 116 to confirm whether the product is within the warranty period; (4) Based on the warranty status of the product, the enterprise dialogue robot 102 sends an RMA authorization code or a rejection message to the external user 114; and (5) If the response message of the external user 114 contains negative emotions during the process, the dialogue robot 102 immediately escalates the problem to the enterprise user 112 (customer service personnel), and then the enterprise user 112 takes over the subsequent dialogue with the external user 114.

文字內容抽取方法(Textual-Content Extraction Methods) Textual-Content Extraction Methods

本發明所提供的文字內容抽取方法，皆包含兩個主要步驟：(1)對輸入影像進行文字偵測(text detection)，用以提取影像中文字行(text line)的位置和大小，以及(2)針對步驟(1)中所偵測到的個別文字行影像進行文字識別(text recognition)，用以抽取其文字內容。圖2、圖3、圖4分別顯示兩種獨立的文字偵測方法和一種文字識別方法；這些方法皆由基於卷積神經網路(CNN)的深度學習模型衍生而來。為便於解釋本發明，以下先將基本CNN的概念和方法論、以及相關的深度學習模型，作一個說明。 The text content extraction methods provided by the present invention include two main steps: (1) performing text detection on the input image to extract the position and size of the text line in the image, and (2) performing text recognition on the individual text line images detected in step (1) to extract their text content. Figures 2, 3, and 4 respectively show two independent text detection methods and a text recognition method; these methods are all derived from a deep learning model based on a convolutional neural network (CNN). To facilitate the explanation of the present invention, the basic concept and methodology of CNN, as well as the related deep learning model, are first explained below.

基本卷積神經網路(Basic Convolutional Neural Network) Basic Convolutional Neural Network

近年來，用於自動偵測、分類、識別數位影像中的物件的CNN已被廣泛地應用。圖5及圖6分別顯示兩個常用的CNN模型架構：VGG-16(K.Simonyan,A.Zissermen，<用於大型影像識別的深層卷積網路>(Very deep convolutional networks for large-scale image recognition),arXiv：1409.1556(2014))和ResNet-18(K.He,X.Zhang,S.Ren,J.Sun，<用於影像識別的深度殘差學習>(Deep residual learning for image recognition),arXiv：1512.03385(2015))。一個完整的CNN流水線包含用於抽取影像空間特徵的卷積骨幹，接著是用於偵測、分類或識別影像中物件的解碼路徑。 In recent years, CNNs have been widely used to automatically detect, classify, and identify objects in digital images. Figures 5 and 6 show two commonly used CNN model architectures: VGG-16 (K. Simonyan, A. Zissermen, Very deep convolutional networks for large-scale image recognition, arXiv: 1409.1556 (2014)) and ResNet-18 (K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, arXiv: 1512.03385 (2015)). A complete CNN pipeline consists of a convolutional skeleton for extracting spatial features of an image, followed by a decoding path for detecting, classifying, or identifying objects in the image.

通常，卷積骨幹包含以特定順序排列的複數個卷積(convolutional)運算、線性整流函數(Rectified-Linear-Unit,ReLU)運算及池化(pooling)運算。如圖5及圖6所示，輸入影像首先通過卷積骨幹，產生出一系列的特徵圖(feature map)，每個特徵圖含有WxH個單元格(cell)；這些特徵圖的空間解析度逐步降低(如圖中所示遞減的單元格數目)，表示它們分別抽取了原影像中愈來愈大的局部特徵。每個卷積層(分別表示為Conv1至Conv5)中特徵圖的解析度取決於輸入影像的解析度；圖5、圖6假設輸入解析度為512×512像素。在卷積骨幹內，每個卷積運算包含以複數個空間濾波器(稱為卷積核)掃描輸入影像或個別特徵圖的程序，而卷積核的數目(稱為通道數)提供了該特徵圖的深度。每個池化運算也是透過空間濾波器(池化濾波器)掃描來執行，運算輸出值可採用池化濾波器所涵蓋的特徵單元值(feature-cell values)中的最大值(最大值池)或平均值(均值池)。ReLU運算通常跟在每個卷積運算之後，作用是將所有負數的特徵單元值設為零、正數的特徵單元值則維持不變。在圖5和圖6中，一個卷積或池化運算係以其濾波器大小、通道數、步幅(每個掃描步的大小)標示，例如：「3x3 conv,256,/2」代表「3x3卷積核，256個通道，步幅為2單元格」。為簡單起見，以下說明中所有特徵圖邊緣均假設做了填充(padding)，使得步幅為1的卷積和池化運算不會改變特徵圖的尺寸(WxH)、步幅為2的相同運算則將尺寸減半(W/2xH/2)。此外，圖5和圖6省略了ReLU運算而未顯示。VGG-16或ResNet-18的解碼路徑包含一個或多個完全連接的(fully-connected，fc)神經層，以及一次歸一化指數函數(Softmax)運算；Softmax是將輸出數值正規化的一種標準的激活函數(activation function)。 Typically, the convolutional backbone contains a number of convolutional operations, Rectified-Linear-Unit (ReLU) operations, and pooling operations arranged in a specific order. As shown in Figures 5 and 6, the input image first passes through the convolutional backbone to generate a series of feature maps, each of which contains WxH cells; the spatial resolution of these feature maps gradually decreases (as shown in the decreasing number of cells in the figure), indicating that they extract larger and larger local features in the original image. The resolution of the feature map in each convolutional layer (represented as Conv1 to Conv5) depends on the resolution of the input image; Figures 5 and 6 assume that the input resolution is 512×512 pixels. In the convolutional skeleton, each convolution operation involves scanning the input image or individual feature maps with multiple spatial filters (called convolution kernels), and the number of convolution kernels (called the number of channels) provides the depth of the feature map. Each pooling operation is also performed by scanning through a spatial filter (pooling filter), and the output value of the operation can be the maximum value (maximum pooling) or the average value (mean pooling) of the feature-cell values covered by the pooling filter. The ReLU operation usually follows each convolution operation, which sets all negative feature-cell values to zero and keeps the positive feature-cell values unchanged. In Figures 5 and 6, a convolution or pooling operation is labeled with its filter size, number of channels, and stride (size of each scan step), for example: "3x3 conv,256,/2" means "3x3 convolution kernel, 256 channels, stride 2 cells". For simplicity, all feature map edges in the following description are assumed to be padded, so that convolution and pooling operations with stride 1 do not change the size of the feature map (WxH), and the same operation with stride 2 will halve the size (W/2xH/2). In addition, ReLU operations are omitted in Figures 5 and 6 and are not shown. The decoding path of VGG-16 or ResNet-18 consists of one or more fully-connected (fc) neural layers and a normalized exponential function (Softmax) operation; Softmax is a standard activation function that normalizes the output value.

在CNN中，每個卷積運算或全連接層都含有可訓練的加權參數；實際被應用的CNN模型可能包含數千萬個可訓練的參數(例如，VGG-16含1.384億個參數，ResNet-18含1,150萬個參數)。因此，為了訓練一個完整的CNN模型，會需要很大量的(例如，數十萬至數百萬幀)、標註過的訓練影像。這對於僅能提出有限數量(例如，幾百至幾萬幀)訓練影像的大多數應用，是不切實際的。幸運的是，開源平台上已有針對某些領域且預先訓練好的一些CNN模型，應用開發者可以採用這些模型的全部或局部來作有效的運用；在適當的條件下，可能使用一組較小量的、相似領域的訓練影像集，就成功地訓練出CNN或其衍生模型作為新的應用。本發明的一些實施例就採用了預訓練過的VGG-16、ResNet-18或其各自變體(variants)的卷積骨幹，並加以衍生成為新的CNN方法。一般而言，與VGG變體相比，ResNet變體需要較少的可訓練參數且較容易訓練。 In CNN, each convolution operation or fully connected layer contains trainable weighted parameters; the actual applied CNN model may contain tens of millions of trainable parameters (e.g., VGG-16 contains 138.4 million parameters and ResNet-18 contains 11.5 million parameters). Therefore, in order to train a complete CNN model, a large amount (e.g., hundreds of thousands to millions of frames) of annotated training images is required. This is impractical for most applications that can only provide a limited number (e.g., hundreds to tens of thousands of frames) of training images. Fortunately, there are already some pre-trained CNN models for certain domains on the open source platform, and application developers can use all or part of these models for effective use; under appropriate conditions, it is possible to use a small set of training images in similar domains to successfully train CNN or its derivative models for new applications. Some embodiments of the present invention use the pre-trained convolutional backbone of VGG-16, ResNet-18 or their respective variants and derive them into new CNN methods. Generally speaking, compared with VGG variants, ResNet variants require fewer trainable parameters and are easier to train.

單發多框偵測器(Single Shot Multi-Box Detector,SSD) Single Shot Multi-Box Detector (SSD)

SSD是用於從數位影像中偵測一般物件的CNN方法(W.Liu,D.Anguelov,D.Erhan,C.Szegedy,S.Reed,C.Y.Fu,A.C.Berg,<SSD：單發多框偵測器>(Single Shot Multi-Box Detector),arXiv：1510.22325(2016))，它兼具準確度和速度，是一種效率較高的物件偵測模型。SSD架構顯示於圖7，它除了採用VGG-16的Conv1至Conv5卷積骨幹，又附加了幾個卷積層，以產生逐步遞減的特徵圖尺寸(即較小的WxH)。 SSD is a CNN method for detecting general objects from digital images (W.Liu, D.Anguelov, D.Erhan, C.Szegedy, S.Reed, C.Y.Fu, A.C.Berg, <SSD: Single Shot Multi-Box Detector>, arXiv: 1510.22325 (2016)). It combines accuracy and speed and is a more efficient object detection model. The SSD architecture is shown in Figure 7. In addition to using the Conv1 to Conv5 convolutional backbone of VGG-16, it also adds several convolutional layers to produce a gradually decreasing feature map size (i.e., smaller WxH).

SSD的偵測層將Conv4特徵圖及Conv6之後的附加卷積特徵圖的所有特徵單元值作為其輸入(如圖7中的粗線箭頭所示)，並於每個特徵單元格上配置不同高寬比的複數個偵測框(如圖8(b)、(c)所示)。通常，在較小的特徵圖中(具有較大的單元格)，較可能偵測到較大的物件，反之亦然。此機制可一次性地(one-shot)偵測到不同尺寸物件的類別和位置。圖8顯示輸入影像中貓和狗的同時偵測：在圖8(b)中，藉由8×8特徵圖中兩個偵測框，偵測到較小尺寸的貓；在圖8(c)中，藉由4x4特徵圖中單個偵測框，偵測到較大尺寸的狗。 The detection layer of SSD takes all the feature unit values of the Conv4 feature map and the additional convolution feature map after Conv6 as its input (as shown by the thick arrows in Figure 7), and configures multiple detection frames of different aspect ratios on each feature cell (as shown in Figure 8 (b) and (c)). Generally, in a smaller feature map (with a larger cell), it is more likely to detect a larger object, and vice versa. This mechanism can detect the category and location of objects of different sizes in one shot. Figure 8 shows the simultaneous detection of cats and dogs in the input image: in Figure 8(b), the smaller cat is detected by two detection frames in the 8×8 feature map; in Figure 8(c), the larger dog is detected by a single detection frame in the 4x4 feature map.

偵測層的輸出提供了每個偵測到的物件的類別分數、物件外框的坐標及大小。物件外框不必與偵測到物件的偵測框重合；通常，兩者之間的坐標、高度、寬度都會有偏差。此外，同一物件可能被幾個偵測框偵測到，因而產生出幾個候選的物件外框。SSD的最後一個步驟是非極大值抑制(Non-Maximum Suppression)演算法，這是一種基於規則的演算法，目的是從幾個候選的物件外框中選出最佳的一個。 The output of the detection layer provides the category score of each detected object, the coordinates and size of the object bounding box. The object bounding box does not have to coincide with the detection box of the detected object; usually, there will be deviations in coordinates, height, and width between the two. In addition, the same object may be detected by several detection boxes, thus generating several candidate object bounding boxes. The last step of SSD is the Non-Maximum Suppression algorithm, which is a rule-based algorithm that aims to select the best one from several candidate object bounding boxes.

連結文字提議網路(Connectionist Text Proposal Network,CTPN) Connectionist Text Proposal Network (CTPN)

文字偵測和一般物件偵測有兩個主要的不同點：第一，一般物件通常具有明確的封閉性邊界，而文字是由離散的元素(例如，英文子母、漢字、標點符號、空格)所組成，因此文字的邊界較不明確；第二，文字偵測通常比一般物件偵測需要更高的準度，這是因為一條文字行若只被偵測到一部份，將會導致後續的文字識別產生嚴重錯誤。因此，用於偵測一般物件的方法(例如SSD) 對於偵測文字的效果就不佳。 There are two main differences between text detection and general object detection: first, general objects usually have clear closed boundaries, while text is composed of discrete elements (e.g., English letters, Chinese characters, punctuation marks, spaces), so the boundaries of text are less clear; second, text detection usually requires higher accuracy than general object detection, because if a line of text is only partially detected, it will cause serious errors in subsequent text recognition. Therefore, methods used to detect general objects (such as SSD) are not very effective for detecting text.

連結文字提議網路(CTPN)是專為文字偵測而設計的(Z.Tian,W.Huang,T.He,P.He,Y.Qiao，<用連結文字提議網路偵測自然影像中的文字>(Detecting text in natural image with Connectionist Text Proposal Network),arXiv：1609.03605(2016))。CTPN架構顯示於圖9；與SSD類似，它用VGG-16的Conv1到Conv5層來作為卷積骨幹，但是二者間相似處僅此而已。如圖10所示，緊接卷積骨幹之後，CTPN使用一個小型偵測框(3x3窗口)在Conv5特徵圖的水平方向上逐格滑動、逐行掃描；從每行掃描過程中，CTPN產生出一序列精細的「文字切片提議」(text-slice proposals)；每個文字切片提議的寬度是固定的，僅代表一條文字行的一小部分(例如，若輸入影像為512x512像素，則文字切片寬度僅為16像素)。此外，CTPN提供了一種垂直定錨(vertical anchor)機制，可同時預測每個文字切片提議的垂直位置、高度、文字/非文字分數。以下為CTPN從這些文字切片提議序列中進一步偵測文字行的方法。 The Connectionist Text Proposal Network (CTPN) is designed specifically for text detection (Z. Tian, W. Huang, T. He, P. He, Y. Qiao, Detecting text in natural image with Connectionist Text Proposal Network, arXiv: 1609.03605 (2016)). The CTPN architecture is shown in Figure 9; similar to SSD, it uses the Conv1 to Conv5 layers of VGG-16 as the convolutional backbone, but the similarities end there. As shown in Figure 10, immediately after the convolution, CTPN uses a small detection box (3x3 window) to slide horizontally and scan line by line in the Conv5 feature map; from each line scanning process, CTPN generates a sequence of fine "text-slice proposals"; the width of each text slice proposal is fixed and only represents a small part of a text line (for example, if the input image is 512x512 pixels, the text slice width is only 16 pixels). In addition, CTPN provides a vertical anchor mechanism that can simultaneously predict the vertical position, height, and text/non-text score of each text slice proposal. The following is the method by which CTPN further detects text lines from these text slice proposal sequences.

如圖9所示，每個文字切片提議序列被饋送到一個具有雙向長短期記憶(long short-term memory,LSTM)的循環神經網路(recurrent neural network,RNN)架構，用LSTM編碼器來抽取文字序列的特徵；換句話說，用循環神經網路用來解析每個文字切片提議是否為文字的一部份。最後CTPN再將其中文字/非文字分數超過閾值(例如0.7)的文字切片提議組合起來，成為單一或多條文字行；如果相鄰文字切片提議之間的水平距離小於50像素，且垂直重疊超過0.7，就將它們組合到同一條文字行中。 As shown in Figure 9, each text slice proposal sequence is fed into a recurrent neural network (RNN) architecture with bidirectional long short-term memory (LSTM), and the LSTM encoder is used to extract the features of the text sequence; in other words, the recurrent neural network is used to analyze whether each text slice proposal is part of a text. Finally, the CTPN combines the text slice proposals whose text/non-text scores exceed the threshold (for example, 0.7) into a single or multiple text lines; if the horizontal distance between adjacent text slice proposals is less than 50 pixels and the vertical overlap exceeds 0.7, they are combined into the same text line.

儘管CTPN在偵測文字方面是準確的，但是它的速度(例如，每秒7幀)遠不及SSD的速度(例如，每秒59幀)。因此，實用上還需要一種兼具準確性及速度的文字偵測方法。 Although CTPN is accurate in detecting text, its speed (e.g., 7 frames per second) is far slower than that of SSD (e.g., 59 frames per second). Therefore, a text detection method that is both accurate and fast is needed in practice.

本發明的文字偵測方法1：連結文字單發多切片偵測器(Connectionist Text Single Shot Multi-Slice Detector,CT-SSD) The text detection method 1 of the present invention: Connectionist Text Single Shot Multi-Slice Detector (CT-SSD)

本發明提供一種混合CTPN和SSD的文字偵測方法，稱為連結文字單發多切片偵測器。CT-SSD的目的是要同時達到CTPN的文字偵測準確度和SSD的速度。圖2顯示CT-SSD的架構200；與SSD、CTPN類似，CT-SSD使用了一個預訓練的CNN模型的卷積骨幹204(例如，ResNet-18或VGG-16的卷積骨幹)，但是CT-SSD既不包含SSD的附加卷積層，也不包含CTPN龐大的LSTM編碼器網路。 The present invention provides a text detection method that mixes CTPN and SSD, called a concatenated text single shot multi-slice detector. The purpose of CT-SSD is to achieve the text detection accuracy of CTPN and the speed of SSD at the same time. Figure 2 shows the architecture 200 of CT-SSD; similar to SSD and CTPN, CT-SSD uses a convolutional backbone 204 of a pre-trained CNN model (e.g., the convolutional backbone of ResNet-18 or VGG-16), but CT-SSD does not include the additional convolutional layers of SSD nor the large LSTM encoder network of CTPN.

CT-SSD採用SSD的多框偵測機制，但如同CTPN，它僅採樣單一個卷積層206(例如，ResNet-18的Conv4層)來偵測精細的文字切片。雖然圖5、圖6中每個CNN卷積層包含了數個具有相同解析度(WxH)和深度(通道數)的特徵圖，但CT-SSD的文字偵測僅需採樣卷積層206的最後一個特徵圖就足夠了。此外，CT-SSD配置於每個特徵單元格的偵測框都有固定的寬度和數個不同的高度(即複數個高寬比)，如圖11所示，而CT-SSD的較佳實施例採用的高寬比範圍為[1,8]；這些精細的偵測框稱為「文字切片偵測器」。文字切片偵測的空間解析度由文字切片偵測器的寬度決定，而此寬度又由採樣的特徵單元格的寬度決定。如圖11所示的實施例，若輸入影像202為512×512像素，作為採樣的 ResNet-18的Conv4層特徵圖解析度就是32×32，而文字切片偵測解析度則是16像素；若要使用其他偵測解析度，只需對不同解析度的特徵圖進行採樣即可實現。 CT-SSD adopts the multi-frame detection mechanism of SSD, but like CTPN, it only samples a single convolutional layer 206 (e.g., the Conv4 layer of ResNet-18) to detect fine text slices. Although each CNN convolutional layer in Figures 5 and 6 contains several feature maps with the same resolution (WxH) and depth (number of channels), CT-SSD's text detection only needs to sample the last feature map of the convolutional layer 206. In addition, the detection box configured by CT-SSD on each feature cell has a fixed width and several different heights (i.e., multiple aspect ratios), as shown in Figure 11, and the preferred embodiment of CT-SSD adopts an aspect ratio range of [1,8]; these fine detection boxes are called "text slice detectors". The spatial resolution of text slice detection is determined by the width of the text slice detector, which in turn is determined by the width of the sampled feature cell. As shown in the embodiment of FIG. 11 , if the input image 202 is 512×512 pixels, the resolution of the Conv4 layer feature map of ResNet-18 used for sampling is 32×32, and the text slice detection resolution is 16 pixels; if other detection resolutions are to be used, it can be achieved by sampling feature maps of different resolutions.

文字切片偵測層208的輸出提供了初步(preliminary)候選文字切片的坐標、高度、文字/非文字分數；它們再經過非極大值抑制演算法210的過濾，選擇出最可能(most likely)候選文字切片。圖12(a)顯示了一個使用CT-SSD文字切片偵測的例子，其中被偵測到的候選文字切片各以高、寬不等的切片框所標注。與CTPN中固定寬度的文字切片提議不同，CT-SSD的候選文字切片的寬度是可變的。有了候選文字切片，輸出層212就可使用與CTPN相似的文字組合規則將它們連結成文字行：若相鄰的候選文字切片之間的水平距離小於預設值(例如50個像素)並且它們的垂直重疊超過預設值(例如0.7)，就將它們組合到同一條文字行中，如圖12(b)所示。經過實驗比較CT-SSD和CTPN的文字偵測效率，結果顯示前者不僅可達成更好的準確度，速度更為後者的10倍。 The output of the text slice detection layer 208 provides the coordinates, height, and text/non-text score of preliminary candidate text slices; they are then filtered by the non-maximum suppression algorithm 210 to select the most likely candidate text slices. Figure 12(a) shows an example of text slice detection using CT-SSD, in which the detected candidate text slices are marked with slice boxes of different heights and widths. Unlike the fixed-width text slice proposals in CTPN, the width of the candidate text slices in CT-SSD is variable. With the candidate text slices, the output layer 212 can use text combination rules similar to CTPN to connect them into text lines: if the horizontal distance between adjacent candidate text slices is less than a preset value (e.g., 50 pixels) and their vertical overlap exceeds a preset value (e.g., 0.7), they are combined into the same text line, as shown in Figure 12(b). The text detection efficiency of CT-SSD and CTPN is experimentally compared, and the results show that the former can not only achieve better accuracy, but also is 10 times faster than the latter.

U網(U-Net) U-Net

語義影像分割(semantic image segmentation)是從數位影像中偵測物件的另一類方法；在這類方法中，影像的每個像素(pixel)都依偵測物件的類別進行分類。U網是專為語義影像分割而設計的CNN方法(O.Ronneberger,P.Fischer,T.Brox，<U網：用於生物醫學影像分割的卷積網路>(U-Net：Convolutional networks for biomedical image segmentation),arXiv：1505.04597(2015))。本發明提供一種基於U網語義分割的文字偵測方法。 Semantic image segmentation is another method for detecting objects from digital images; in this method, each pixel of the image is classified according to the type of object to be detected. U-Net is a CNN method designed specifically for semantic image segmentation (O. Ronneberger, P. Fischer, T. Brox, <U-Net: Convolutional networks for biomedical image segmentation> (U-Net: Convolutional networks for biomedical image segmentation), arXiv: 1505.04597 (2015)). The present invention provides a text detection method based on U-Net semantic segmentation.

如圖13所示，U網架構包括一個典型的卷積路徑(Conv1至Conv5)，用於從輸入影像中抽取不同尺寸的特徵，以及與卷積路徑相對稱的轉置卷積路徑(transposed-convolutional path,T-Conv1至T-Conv4)，用於實現精確的像素級(pixel-level)定位。轉置卷積運算(在圖13中標示為「up conv」)將特徵圖中一個單元格的特徵值，投射到較高解析度特徵圖中多個單元格的加權特徵值；因此，轉置卷積路徑中的特徵圖尺寸(WxH)是逐步增大的。此外，每個轉置卷積層的上採樣(up-sampled)特徵圖與卷積路徑中對應的同尺寸特徵圖用序連運算(concatenation)連接起來(如圖13中的粗線箭頭所示)，目的是加強物件的內容和位置的同步偵測。輸出層(表示為「1x1 conv,C」，其中C是欲偵測物件的類別數)提供了輸入影像中每個像素的類別分數。 As shown in Figure 13, the U-Net architecture includes a typical convolutional path (Conv1 to Conv5) for extracting features of different sizes from the input image, and a transposed-convolutional path (T-Conv1 to T-Conv4) symmetrical to the convolutional path for accurate pixel-level positioning. The transposed convolution operation (labeled as "up conv" in Figure 13) projects the feature value of a cell in the feature map to the weighted feature values of multiple cells in the higher resolution feature map; therefore, the feature map size (WxH) in the transposed convolutional path is gradually increased. In addition, the up-sampled feature map of each transposed convolution layer is concatenated with the corresponding feature map of the same size in the convolution path (as shown by the thick arrows in Figure 13) to enhance the simultaneous detection of the content and position of the object. The output layer (denoted as "1x1 conv,C", where C is the number of categories of the object to be detected) provides the category score of each pixel in the input image.

本發明的文字偵測方法2：分水嶺U網分割(Watershed U-Net Segmentation) Text detection method 2 of the present invention: Watershed U-Net Segmentation

如前所述，由於文字沒有明確的邊界，文字行外框的偵測有時會有歧義(ambiguity)，當相鄰文字行的間距較小時尤其如此；單獨使用像素分割法作為文字偵測可能無法解決這類問題。因此，本發明提供了一種文字偵測方法，將U網分割與分水嶺(Watershed)演算法結合在一起(後者常被用於分割相互接觸的物件)，這種新方法稱為分水嶺U網分割。 As mentioned above, since text has no clear boundaries, the detection of the text line frame is sometimes ambiguous, especially when the distance between adjacent text lines is small; using pixel segmentation alone as text detection may not be able to solve this type of problem. Therefore, the present invention provides a text detection method that combines U-net segmentation with the watershed algorithm (the latter is often used to segment objects that touch each other). This new method is called watershed U-net segmentation.

圖3是分水嶺U網分割架構圖300，其中包含了完整的U網模型304-322。除了正常的文字分布圖324(即像素級文字分布)輸出通道外，另外增加了一個輔助輸出通道：侵蝕(eroded)文字分布圖326，如圖14(b)和圖14(c)所示。為了訓練U網模型以同時輸出文字分布圖和侵蝕文字分布圖，本發明對原訓練影像集進行擴充：將原影像中每個標註的文字行的邊緣都裁切掉15%而產生出新的訓練影像。經U網分割之後，再使用分水嶺演算法328對兩幀半完成的文字分布圖進行處理，產生輸出層330的最佳化文字分布圖。 FIG3 is a diagram of the watershed U-net segmentation architecture 300, which includes a complete U-net model 304-322. In addition to the normal text distribution map 324 (i.e., pixel-level text distribution) output channel, an additional auxiliary output channel is added: an eroded text distribution map 326, as shown in FIG14(b) and FIG14(c). In order to train the U-net model to output the text distribution map and the eroded text distribution map at the same time, the present invention expands the original training image set: the edge of each annotated text line in the original image is cut off by 15% to generate a new training image. After U-net segmentation, the watershed algorithm 328 is used to process the two and a half frames of the completed text distribution map to generate an optimized text distribution map of the output layer 330.

在實際應用中，輸入影像的解析度可能有很大範圍；為了做到全自動文字偵測，本發明實施例在前述文字偵測步驟之前，先用重疊方塊法(overlapping-tiles)做影像預處理。如圖15的示例，輸入影像(陰影矩形)的解析度為1920x1080像素，每個方塊(虛線矩形)的解析度固定為512x512像素，相鄰方塊之間的重疊部分為32像素，位於邊緣的方塊視需要加上填充補空(padding)。對每個方塊中的影像，先用CT-SSD或分水嶺U網分割來偵測文字行，再將結果合併，得出各文字行相對於原輸入影像中的位置和大小。 In practical applications, the resolution of the input image may have a wide range; in order to achieve fully automatic text detection, the embodiment of the present invention uses overlapping-tiles to pre-process the image before the aforementioned text detection step. As shown in the example of Figure 15, the resolution of the input image (shaded rectangle) is 1920x1080 pixels, the resolution of each tile (dashed rectangle) is fixed at 512x512 pixels, the overlapping part between adjacent tiles is 32 pixels, and the tiles at the edge are filled with padding as needed. For the image in each tile, CT-SSD or watershed U-net segmentation is first used to detect the text line, and then the results are merged to obtain the position and size of each text line relative to the original input image.

本發明的文字識別方法：連結時間分類卷積神經網路(Connectionist Temporal Classification CNN,CTC-CNN) The text recognition method of the present invention: Connectionist Temporal Classification CNN (CTC-CNN)

本發明的兩種文字偵測方法提供了輸入影像中文字行外框的位置和尺寸(wxh)；接著是文字識別過程，用於識別每條文字行的文字內容，這是一種序列到序列(sequence-to-sequence)流程，可以使用基於連結時間分類(CTC)損失函數的CNN方法來執行。(F.Borisyuk,A.Gordo,V.Sivakumar，<羅塞塔：用於影像中文字偵測和識別的大規模系統>(Rosetta：Large scale system for text detection and recognition in images),arXiv：1910.05085(2019))。 The two text detection methods of the present invention provide the position and size (wxh) of the text line frame in the input image; followed by a text recognition process for identifying the text content of each text line, which is a sequence-to-sequence process that can be performed using a CNN method based on a connection-time classification (CTC) loss function. (F.Borisyuk, A.Gordo, V.Sivakumar, <Rosetta: Large scale system for text detection and recognition in images>, arXiv: 1910.05085 (2019)).

圖4顯示CTC-CNN的架構圖400，它包括一個預訓練過的CNN模型的卷積骨幹404(例如，ResNet-18的Conv1-Conv5卷積骨幹)。在模型的訓練及測試中，所有文字行影像402(text-line images)的尺寸均被等比例調整為wx32(即影像高度固定為32像素，高寬比與原尺寸相同)。因此，卷積骨幹末端(Conv5)的特徵圖406的解析度為w/32x1；這是個含有w/32單元格的一維特徵序列(feature sequence)，每個單元格代表了輸入文字行影像402中的一個位置。接在Conv5層之後，是連續兩個使用3x1卷積核的一維卷積運算408和410，其輸出端的特徵序列410含有C個通道，對應於要辨識的文字元素的總數量。在本發明的一個實施例中，C=4,593，對應於52個英文本母、10個數字、40個標點符號，和4,491個漢字。結合了CTC損失函數414的Softmax運算412，為序列中的每個單元格提供了所有文字元素的概率分布；將其中空白和重複的文字元素刪除後，即可得到輸入文字行影像402的文字內容416。 FIG4 shows the architecture diagram 400 of CTC-CNN, which includes a convolutional backbone 404 of a pre-trained CNN model (e.g., the convolutional backbone of ResNet-18 Conv1-Conv5). During the training and testing of the model, the size of all text-line images 402 is proportionally adjusted to wx32 (i.e., the image height is fixed to 32 pixels, and the aspect ratio is the same as the original size). Therefore, the resolution of the feature map 406 at the end of the convolutional backbone (Conv5) is w/32x1; this is a one-dimensional feature sequence containing w/32 cells, each cell representing a position in the input text-line image 402. Following the Conv5 layer, there are two consecutive one-dimensional convolution operations 408 and 410 using 3x1 convolution kernels, and the feature sequence 410 at its output contains C channels, corresponding to the total number of text elements to be recognized. In one embodiment of the present invention, C=4,593, corresponding to 52 English letters, 10 numbers, 40 punctuation marks, and 4,491 Chinese characters. The Softmax operation 412 combined with the CTC loss function 414 provides the probability distribution of all text elements for each cell in the sequence; after removing the blank and repeated text elements, the text content 416 of the input text line image 402 can be obtained.

本發明揭露以即時簡訊機器人實現機器人流程自動化(RPA)的系統和方法，以及基於卷積神經網路(CNN)從數位影像中抽取文字內容的新方法。這種系統包括為企業建置的對話機器人、軟體RPA管理器和即時簡訊平台。該RPA管理器包含複數個企業工作流程模組，並能接收企業對話機器人的指令以執行各工作流程。該企業即時簡訊平台還通過網際網路連接到一個或多個公用即時簡訊平台。 The present invention discloses a system and method for implementing Robotic Process Automation (RPA) with an instant messaging robot, and a new method for extracting text content from digital images based on a convolutional neural network (CNN). The system includes a conversational robot built for an enterprise, a software RPA manager, and an instant messaging platform. The RPA manager includes a plurality of enterprise workflow modules and can receive instructions from the enterprise conversational robot to execute each workflow. The enterprise instant messaging platform is also connected to one or more public instant messaging platforms via the Internet.

此外，本發明RPA管理器中的機器人流程的實施例還包含從數位影像中抽取文字內容，並為此類流程提供了改良的卷積神經網路(CNN)方法。以上所述僅為本發明之較佳實施例，凡依本發明申請專利範圍所做之均等變化與修飾，皆應屬本發明之涵蓋範圍。 In addition, the implementation example of the robot process in the RPA manager of the present invention also includes extracting text content from digital images, and provides an improved convolutional neural network (CNN) method for such processes. The above is only a preferred embodiment of the present invention, and all equivalent changes and modifications made according to the scope of the patent application of the present invention shall fall within the scope of the present invention.

100:企業即時簡訊機器人 100:Enterprise instant messaging robot

102:企業對話機器人 102: Enterprise dialogue robot

104:企業即時簡訊平台 104: Enterprise instant messaging platform

106:RPA管理器 106:RPA Manager

108:文字抽取流程 108: Text extraction process

108a~108n:企業工作流程 108a~108n: Enterprise workflow

108tp:第三方RPA工作流程 108tp: Third-party RPA workflow

110:公用即時簡訊平台 110: Public instant messaging platform

112:企業用戶 112: Enterprise users

114:外部用戶 114: External user

116:企業資料庫 116: Enterprise database

Claims

A method for automatically detecting text lines in a digital image comprises: a convolution neural network backbone extracting the spatial features of the digital image; a text slice detection layer detecting a plurality of candidate text slices in the digital image; a non-maximum value based A filter layer of a Suppression algorithm selects a group of best candidate text slices from the candidate text slices; and a text construction layer combines the group of best candidate text slices into one or more text lines based on the following rule: if the horizontal distance between two adjacent candidate text slices is less than a preset value and the vertical overlap is greater than another preset value, then the two belong to the same text line; wherein the convolution neural network backbone is composed of a plurality of convolution layers in an orderly manner, including a plurality of convolution operations, linear rectification function operations, and pooling operations, and is used to generate a series of multi-channel features from the digital image. A convolutional layer comprises a plurality of feature maps, wherein each convolutional layer comprises a plurality of feature maps, and the feature maps have the same spatial resolution and the same number of channels, and the feature map contained in the next convolutional layer of the convolutional layer has the same or lower spatial resolution and a larger number of channels as the feature map of the previous layer; and the text slice detection layer samples a single feature map of a single convolutional layer in the convolutional bone trunk, and a set of a plurality of text slice detectors is arranged in each cell of the feature map, and the width of the set of text slice detectors is equal to the width of the cell, and the heights are different, and the height-to-width ratio is between 1 and 8.

A method for automatically detecting Chinese characters in pixelated digital images, comprising: a convolutional neural network generating two and a half frames of pixel-level text distribution maps from the digital image, the convolutional neural network comprising a convolutional path followed by a transposed convolutional path; and an adjustment layer generating an optimal pixel-level text distribution map; wherein the convolutional path The method comprises a plurality of convolution layers, which include a plurality of convolution operations, linear rectifier operations, and pooling operations, and is used to generate a series of multi-channel feature maps from the digital image; wherein each convolution layer includes a plurality of feature maps, and the feature maps all have the same spatial resolution and the same number of channels, and the feature map contained in the next convolution layer of the convolution layer has the same or lower spatial resolution as the feature map of the previous layer. resolution and a larger number of channels; wherein the transposed convolution path is composed of a plurality of transposed convolution layers in order, including a plurality of transposed convolution operations, concatenation operations, convolution operations, and linear rectifier function operations, for generating a series of multi-channel feature maps with gradually increasing resolutions; wherein each transposed convolution layer contains a plurality of feature maps with the same spatial resolution , wherein the first feature map is generated by the following two steps: 1. performing a transposed convolution operation on the last feature map of the previous layer to improve the resolution; 2. performing a sequential operation on the feature map generated by the operation and the feature map with the same resolution in the convolution path; wherein the two and a half frames of text distribution maps generated by the convolution neural network are one original text and the other eroded text; and wherein the adjustment layer applies the watershed algorithm to generate an optimized text distribution map with non-overlapping text lines from the two and a half frames of text distribution maps.