
CN111832403B - Document structure recognition method, document structure recognition model training method and device - Google Patents


Info

Publication number
CN111832403B
CN111832403B
Authority
CN
China
Prior art keywords
image
document structure
candidate region
document
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010501381.7A
Other languages
Chinese (zh)
Other versions
CN111832403A (en)
Inventor
彭艺宇
曾凯
路华
陈永锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010501381.7A
Publication of CN111832403A
Application granted
Publication of CN111832403B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G06V 30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G06V 30/413 Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Character Input (AREA)

Abstract


The present application discloses a document structure recognition method, and a model training method and device for document structure recognition, which relate to the fields of natural language processing and deep learning and are used for document layout analysis. The specific implementation scheme is: acquiring a document image, selecting a candidate region from the document image, performing image feature extraction on the candidate region to obtain image features, performing semantic recognition on the characters contained in the candidate region to obtain semantic features, and classifying according to the image features and the semantic features to determine the document structure type to which the candidate region belongs. When recognizing the document structure type, the present application adds semantic features on top of image features, fully accounting for the importance of semantic information in document structure recognition and improving the accuracy of document structure type determination.

Description

Document structure recognition method, document structure recognition model training method and device

Technical Field

The present application relates to the field of computer technology, and in particular to the fields of natural language processing and deep learning.

Background

In practice, paper documents are commonly converted into electronic documents in image form to simplify their preservation; electronic documents have clear advantages in transmission and storage.

After a document has been converted into an image, analyzing the document layout, that is, recognizing the document structure, is an indispensable step in understanding the document content. Many downstream tasks, such as information extraction and text classification, must be built on accurate recognition of document structure types.

Summary of the Invention

The present application provides a document structure recognition method, and a model training method and device for document structure recognition. When recognizing the document structure type, semantic features are added on top of image features, fully accounting for the importance of semantic information in document structure recognition and improving the accuracy of document structure type determination.

In a first aspect, the present application provides a document structure recognition method, the method comprising:

acquiring a document image;

selecting a candidate region from the document image;

performing image feature extraction on the candidate region to obtain image features;

performing semantic recognition on the characters contained in the candidate region to obtain semantic features; and

classifying according to the image features and the semantic features to determine the document structure type to which the candidate region belongs.

In a second aspect, the present application provides a model training method for document structure recognition, the method comprising:

acquiring a training sample set; and

training a target detection model with the training sample set, wherein the target detection model is used to select a candidate region from a document image, perform image feature extraction on the candidate region to obtain image features, perform semantic recognition on the characters contained in the candidate region to obtain semantic features, and perform target detection according to the image features and the semantic features to determine the document structure type to which the candidate region belongs.

In a third aspect, the present application provides a document structure recognition device, the device comprising:

an image acquisition module, configured to acquire a document image;

a selection module, configured to select a candidate region from the document image;

an extraction module, configured to perform image feature extraction on the candidate region to obtain image features;

a recognition module, configured to perform semantic recognition on the characters contained in the candidate region to obtain semantic features; and

a detection module, configured to perform target detection according to the image features and the semantic features to determine the document structure type to which the candidate region belongs.

In a fourth aspect, the present application provides a model training device for document structure recognition, the device comprising:

a sample acquisition module, configured to acquire a training sample set; and

a training module, configured to train a target detection model with the training sample set, wherein the target detection model is used to select a candidate region from a document image, perform image feature extraction on the candidate region to obtain image features, perform semantic recognition on the characters contained in the candidate region to obtain semantic features, and perform target detection according to the image features and the semantic features to determine the document structure type to which the candidate region belongs.

In a fifth aspect, the present application provides an electronic device, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor, wherein

the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the document structure recognition method of the first aspect, or the model training method for document structure recognition of the second aspect.

In a sixth aspect, the present application provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the document structure recognition method of the first aspect, or the model training method for document structure recognition of the second aspect.

The technical solution provided by the embodiments of the present application has the following beneficial effects:

A document image is acquired, a candidate region is selected from the document image, image feature extraction is performed on the candidate region to obtain image features, semantic recognition is performed on the characters contained in the candidate region to obtain semantic features, and classification is performed according to the image features and the semantic features to determine the document structure type to which the candidate region belongs. When recognizing the document structure type, the present application adds semantic features on top of image features, fully accounting for the importance of semantic information in document structure recognition and improving the accuracy of document structure type determination.

It should be understood that the content described in this section is not intended to identify key or essential features of the embodiments of the present application, nor is it intended to limit the scope of the present application. Other features of the present application will become readily apparent from the following description.

Brief Description of the Drawings

The accompanying drawings are provided for a better understanding of the present solution and do not constitute a limitation of the present application. In the drawings:

FIG. 1 is a schematic flow chart of a document structure recognition method provided in an embodiment of the present application;

FIG. 2 is a schematic flow chart of another document structure recognition method provided in an embodiment of the present application;

FIG. 3 is a schematic flow chart of yet another document structure recognition method provided in an embodiment of the present application;

FIG. 4 is a schematic flow chart of still another document structure recognition method provided in an embodiment of the present application;

FIG. 5 is a first schematic diagram of the candidate-region closing operation of this embodiment;

FIG. 6 is a second schematic diagram of the candidate-region closing operation of this embodiment;

FIG. 7 is a schematic flow chart of still another document structure recognition method provided in an embodiment of the present application;

FIG. 8 is a schematic flow chart of a model training method for document structure recognition provided in an embodiment of the present application;

FIG. 9 is a schematic diagram of text content annotation in a training sample of the present application;

FIG. 10 is a schematic structural diagram of a document structure recognition device provided in an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a model training device for document structure recognition provided in an embodiment of the present application;

FIG. 12 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application are described below with reference to the accompanying drawings, including various details of the embodiments to facilitate understanding; these should be regarded as merely exemplary. Those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

The document structure recognition method and the model training method and device for document structure recognition of the embodiments of the present application are described below with reference to the accompanying drawings.

FIG. 1 is a schematic flow chart of a document structure recognition method provided in an embodiment of the present application.

As shown in FIG. 1, the method comprises the following steps.

Step 101: acquire a document image.

The method of this embodiment is executed by a processor running a trained document structure recognition model; the trained model can be used to implement the document structure recognition method of the present application.

Here, the document image is an image of a document on which document structure recognition is to be performed, for example a scanned document in image format, such as a resume or an instruction manual in image format.

Step 102: select a candidate region from the document image.

A candidate region is a region that may contain any of various document structure types. Document structure types include tables, text, images, paragraphs, titles, footnotes (which supplement the text), and so on; they are not enumerated exhaustively in this embodiment.

As a first possible implementation, the fast object detection model Faster R-CNN is used to identify the region of each object in the image, and each such region is taken as a candidate region.

As a second possible implementation, a region-proposal algorithm, for example selective search, is used to select candidate regions from the document image.

As a third possible implementation, a sliding detection window is used to select multiple candidate regions from the document image. Specifically, a detection window of preset size is moved across the document image from left to right and top to bottom with a preset stride, and the area framed by the window at each step is taken as a candidate region, yielding multiple candidate regions.
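The third approach above (a sliding detection window) can be sketched as follows; the window size, stride, and (left, top, right, bottom) coordinate convention are illustrative assumptions, not values from the application.

```python
def sliding_window_regions(img_w, img_h, win_w, win_h, stride):
    """Slide a fixed-size detection window over the document image
    from left to right and top to bottom, collecting the area framed
    at each step as one candidate region."""
    regions = []
    for top in range(0, img_h - win_h + 1, stride):
        for left in range(0, img_w - win_w + 1, stride):
            regions.append((left, top, left + win_w, top + win_h))
    return regions

# A 200x100 page scanned with a 100x50 window and a stride of 50
# yields a 3x2 grid of overlapping candidate regions.
boxes = sliding_window_regions(img_w=200, img_h=100, win_w=100, win_h=50, stride=50)
print(len(boxes))  # 6
```

In practice each of these windows would then be scored by the classifier described in step 105, so the stride trades recall against computation.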

Step 103: perform image feature extraction on the candidate region to obtain image features.

The image features include the color, texture, shape, and spatial-relationship features of the image, as well as the granularity features of the text in the candidate region. The spatial-relationship features further indicate whether different parts of the candidate region belong to the foreground or the background, and the text granularity includes at least one of word, sentence, or paragraph.

As a possible implementation, a neural network model is used to extract image features from the candidate region.

Step 104: perform semantic recognition on the characters contained in the candidate region to obtain semantic features.

In one embodiment, each character contained in the candidate region is converted into a corresponding character vector, the character vectors are concatenated to obtain a concatenated vector, and semantic recognition is performed on the concatenated vector to obtain the corresponding semantic features.

For example, if a candidate region covers the education-experience section of a resume, semantic recognition of the characters it contains yields semantic features corresponding to education experience. Likewise, if a candidate region covers the work-experience section of a resume, semantic recognition of its characters yields semantic features corresponding to work experience.
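A minimal sketch of the character-vector step above: each character is looked up in an embedding table and the vectors are concatenated into one representation for the downstream semantic model. The toy vocabulary, embedding dimension, and random weights are illustrative assumptions.

```python
import numpy as np

def chars_to_concat_vector(text, char_to_id, embedding):
    """Convert each character into its character vector and concatenate
    the vectors into a single concatenated vector for semantic recognition."""
    ids = [char_to_id.get(ch, 0) for ch in text]  # id 0 plays the role of [UNK]
    return np.concatenate([embedding[i] for i in ids])

rng = np.random.default_rng(0)
char_to_id = {ch: i for i, ch in enumerate("abc", start=1)}
embedding = rng.normal(size=(4, 8))  # 4 rows ([UNK] + 3 chars), 8-dim vectors
vec = chars_to_concat_vector("cab", char_to_id, embedding)
print(vec.shape)  # (24,)
```

A real system would feed this representation (or the raw id sequence) into a trained language model rather than a random table.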

Step 105: classify according to the image features and the semantic features to determine the document structure type to which the candidate region belongs.

Document structure types include text, tables, titles, paragraphs, footnotes, images, and so on.

In the present application, classification is performed on both image features and semantic features. The candidate region can be identified from the shape, texture, and content-density cues in the image features to determine whether it is a table, text, an image, etc. Meanwhile, candidate regions of different document structure types contain different semantic information, and that semantic information in turn indicates the corresponding structure type. For example, the semantics recognized from the characters in a table differ from those recognized from the characters in running text, and different text paragraphs also carry different semantics: a resume contains many text regions with different semantics, such as a region for education experience, a region for work experience, and a region for personal information. Recognizing the document structure type from the extracted semantic features of the candidate region combined with its image features therefore improves the accuracy of document structure recognition for the candidate region.

In the document structure recognition method of the embodiments of the present application, a document image is acquired, a candidate region is selected from the document image, image feature extraction is performed on the candidate region to obtain image features, semantic recognition is performed on the characters contained in the candidate region to obtain semantic features, and classification is performed according to the image features and the semantic features to determine the document structure type to which the candidate region belongs. When recognizing the document structure type, the present application adds semantic features on top of image features, fully accounting for the importance of semantic information in document structure recognition and improving the accuracy of document structure type determination.

Based on the above embodiment, an embodiment of the present application provides another document structure recognition method. FIG. 2 is a schematic flow chart of this method; as shown in FIG. 2, the method comprises the following steps.

Step 201: acquire a document image.

Step 202: select a candidate region from the document image.

Specifically, reference may be made to steps 101-102 of the previous embodiment; the principle is the same and is not repeated here.

Step 203: identify content attribute features for each pixel unit in the candidate region.

The content attribute features indicate attribute-related information for each pixel unit, for example whether the pixel unit belongs to the foreground or the background, and the target text granularity corresponding to the pixel unit, where the target text granularity includes one or more of word, sentence, and paragraph.

Step 204: generate an input image according to the pixel value and the content attribute features of each pixel in the candidate region.

In one embodiment, each pixel in the regenerated input image has multiple channels, and each channel indicates either the pixel value or a content attribute feature of the corresponding pixel in the candidate region, increasing the feature information carried by each pixel of the input image.

The feature information contained in each pixel can thus be expressed as [pixel value, content attribute features].
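The [pixel value, content attribute features] representation can be realized as a multi-channel array; here a grayscale value, a foreground/background flag, and a text-granularity id are stacked as three channels. The channel ordering and the granularity encoding are illustrative assumptions.

```python
import numpy as np

def build_input_image(gray, fg_mask, granularity):
    """Stack the raw pixel value with the per-pixel content attribute
    channels into one (H, W, C) input image, so each pixel carries
    [pixel value, foreground flag, text-granularity id]."""
    return np.stack([gray, fg_mask, granularity], axis=-1)

gray = np.zeros((4, 4), dtype=np.float32)      # raw pixel values
fg = np.ones((4, 4), dtype=np.float32)         # 1 = foreground, 0 = background
gran = np.full((4, 4), 2.0, dtype=np.float32)  # e.g. 0 = word, 1 = sentence, 2 = paragraph
x = build_input_image(gray, fg, gran)
print(x.shape)  # (4, 4, 3)
```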

Step 205: perform image feature extraction on the input image to obtain image features.

Specifically, reference may be made to step 103 of the previous embodiment; the principle is the same and is not repeated here.

The regenerated input image in this embodiment contains both pixel values and content attribute features. Compared with the prior art, in which only pixel-value information is available, this increases the amount of information in the extracted image features, so that when the document structure type is subsequently identified from the image features, a finer-grained document structure type can be recognized; for example, it can be determined whether the text in the candidate region is a sentence or a paragraph.

Step 206: obtain the document content corresponding to the document image.

As one possible implementation, the document content corresponding to the document image can be recognized by optical character recognition (OCR).

As another possible implementation, the source document corresponding to the document image is obtained, and the corresponding document content is obtained from the source document.

Step 207: query the corresponding relative position in the document content according to the relative position of the candidate region in the document image, to obtain the characters contained in the candidate region.

As a possible implementation, the coordinates of each pixel in the document image are identified, the pixels corresponding to the border of the candidate region are determined, and the relative position of the candidate region in the document image is determined from the coordinates of those border pixels. Since the document image and the corresponding document content correspond one-to-one in position and content, the same relative position is queried in the document content, and the characters at the relative position corresponding to the candidate region are taken as the characters contained in the candidate region.
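One way to realize the position lookup in step 207 is to normalize the candidate region's bounding box by the page size, so the same relative coordinates can be looked up in the source document. The (left, top, right, bottom) convention is an illustrative assumption.

```python
def region_relative_position(box, img_w, img_h):
    """Convert a candidate region's pixel bounding box into coordinates
    relative to the document image, which can then be matched against
    the same relative position in the document content."""
    left, top, right, bottom = box
    return (left / img_w, top / img_h, right / img_w, bottom / img_h)

rel = region_relative_position((100, 50, 300, 150), img_w=400, img_h=200)
print(rel)  # (0.25, 0.25, 0.75, 0.75)
```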

Step 208: perform semantic recognition on the characters contained in the candidate region to obtain semantic features.

Semantic recognition is then performed on the characters contained in the candidate region to obtain semantic features. Since the characters are obtained from the document content corresponding to the document image, the accuracy of the characters attributed to the candidate region is improved, and in turn the accuracy of the recognized semantic features.

Step 209: concatenate the image features of the candidate region with the semantic features of the candidate region to obtain a synthetic feature of the candidate region.

Step 210: classify according to the synthetic feature of the candidate region to determine the document structure type to which the candidate region belongs.

In one embodiment of the present application, concatenating the image features of the candidate region with its semantic features yields a synthetic feature that carries more information, so that, under the supervision of the semantic information, a classification model can determine the document structure type to which the candidate region belongs. This improves the accuracy of document structure recognition for the candidate region; compared with the prior art, in which classification is based on image features alone, the accuracy of document structure type determination is improved.
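The fusion in steps 209-210 amounts to concatenating the two feature vectors before the classifier head; the feature dimensions (256 and 128) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
image_feat = rng.random(256)     # pooled image features of the candidate region
semantic_feat = rng.random(128)  # semantic features of its characters

# The synthetic feature handed to the classifier is simply the
# concatenation of the image features and the semantic features.
synthetic_feat = np.concatenate([image_feat, semantic_feat])
print(synthetic_feat.shape)  # (384,)
```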

In the document structure recognition method of the embodiments of the present application, a document image is acquired, a candidate region is selected from the document image, image feature extraction is performed on the candidate region to obtain image features, semantic recognition is performed on the characters contained in the candidate region to obtain semantic features, and classification is performed according to the image features and the semantic features to determine the document structure type to which the candidate region belongs. When recognizing the document structure type, the present application adds semantic features on top of image features, fully accounting for the importance of semantic information in document structure recognition and improving the accuracy of document structure type determination.

The above embodiments explain that the content attribute features may indicate whether the corresponding pixel unit belongs to the foreground or the background, and may further include the target text granularity corresponding to the pixel unit; these cases are described separately in the embodiments below.

Based on the above embodiments, this embodiment addresses the case where the content attribute features of a pixel unit indicate whether the pixel unit belongs to the foreground or the background. FIG. 3 is a schematic flow chart of yet another document structure recognition method provided in an embodiment of the present application; as shown in FIG. 3, as a possible implementation, the above step 203 may comprise the following steps.

步骤301,对候选区域进行二值化,以从候选区域中确定前景部分和背景部分。Step 301, binarize the candidate region to determine the foreground part and the background part from the candidate region.

在本申请的一个实施例中,对候选区域先进行灰度处理,得到灰度图像,再对灰度图像进行二值化处理。其中,灰度图像即通过使色彩的三种颜色分量红R、绿G、蓝B的值相同,由于颜色值的取值范围是[0,255],所以灰度的级别有256种,也就是说灰度图像能表现256种灰度颜色。本申请中可通过以下三种方式生成灰度图。In one embodiment of the present application, the candidate area is first grayscale processed to obtain a grayscale image, and then the grayscale image is binarized. The grayscale image is obtained by making the values of the three color components of red R, green G, and blue B the same. Since the range of color values is [0, 255], there are 256 grayscale levels, that is, the grayscale image can express 256 grayscale colors. In the present application, a grayscale image can be generated in the following three ways.

作为第一种可能的实现方式,最大值法,即R=B=G=Max(R,G,B),这种方法处理后灰度图像的亮度偏高。As a first possible implementation method, the maximum value method, that is, R=B=G=Max(R,G,B), the brightness of the grayscale image after processing by this method is relatively high.

作为第二种可能的实现方式,平均值法:R=B=G=(R+G+B)/3,这种方法处理后灰度图像颜色较柔和。As a second possible implementation method, the average value method: R=B=G=(R+G+B)/3. After being processed by this method, the grayscale image has a softer color.

作为第三种可能的实现方式,加权平均值法:R=G=B=R*权重1+G*权重2+B*权重3,其中,权重1,权重2和权重3分别为R,G,B的权重。当权重取值不同时,能够形成不同灰度的灰度图像。As a third possible implementation, the weighted average method is: R = G = B = R * weight 1 + G * weight 2 + B * weight 3, where weight 1, weight 2 and weight 3 are the weights of R, G and B. When the weight values are different, grayscale images with different grayscales can be formed.
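As an illustrative sketch, the three conversion schemes above can be written directly in NumPy. The default weights shown for the weighted-average method are the common luma coefficients and are only an assumption, since the embodiment does not fix specific weight values:

```python
import numpy as np

def to_gray(rgb, method="weighted", weights=(0.299, 0.587, 0.114)):
    """Convert an H x W x 3 RGB image to grayscale using one of the
    three schemes described in the text: maximum value, arithmetic
    mean, or weighted mean (default weights are illustrative)."""
    rgb = rgb.astype(np.float64)
    if method == "max":        # R=G=B=Max(R,G,B): brighter result
        gray = rgb.max(axis=2)
    elif method == "mean":     # R=G=B=(R+G+B)/3: softer result
        gray = rgb.mean(axis=2)
    else:                      # weighted: R*weight1 + G*weight2 + B*weight3
        gray = rgb @ np.asarray(weights)
    return np.clip(gray, 0, 255).astype(np.uint8)
```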

在上述将候选区域转化为灰度图像后,对候选区域进行二值化处理,例如,采用自适应阈值的二值化算法,将候选区域对应的图像中的前景部分的像素单元的灰度值设置为255,将背景部分的像素单元的灰度值设置为0,也就是说二值化后的候选区域呈现的即为前景部分是白色,背景部分是黑色的视觉效果,以实现从候选区域中确定前景部分和背景部分。After converting the candidate area into a grayscale image as described above, the candidate area is binarized. For example, an adaptive threshold binarization algorithm is used to set the grayscale value of the pixel unit of the foreground part of the image corresponding to the candidate area to 255, and the grayscale value of the pixel unit of the background part is set to 0. That is to say, the candidate area after binarization presents a visual effect that the foreground part is white and the background part is black, so as to determine the foreground part and the background part from the candidate area.
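A minimal pure-NumPy stand-in for the adaptive-threshold binarization described above (the embodiment does not specify the exact algorithm; the block size and offset here are illustrative). A pixel darker than the mean of its local neighborhood is marked as foreground (255), everything else as background (0):

```python
import numpy as np

def adaptive_binarize(gray, block=15, c=5):
    """Adaptive-threshold binarization: a pixel becomes foreground
    (255) if it is darker than the mean of its block x block
    neighborhood minus c (dark text on light paper)."""
    g = gray.astype(np.float64)
    pad = block // 2
    padded = np.pad(g, pad, mode="edge")
    # local mean via 2-D cumulative sums (box filter)
    s = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    s = np.pad(s, ((1, 0), (1, 0)))
    h, w = g.shape
    mean = (s[block:block + h, block:block + w]
            - s[:h, block:block + w]
            - s[block:block + h, :w]
            + s[:h, :w]) / (block * block)
    return np.where(g < mean - c, 255, 0).astype(np.uint8)
```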

步骤302,根据各像素单元属于前景部分或背景部分,生成各像素单元的内容属性特征。Step 302: Generate content attribute features of each pixel unit according to whether each pixel unit belongs to the foreground part or the background part.

其中，内容属性特征对应多个维度，每个维度对应每一像素单元的像素值，或属于前景或是背景的特征。The content attribute feature corresponds to multiple dimensions; each dimension carries, for each pixel unit, either the pixel value or the feature indicating whether the pixel unit belongs to the foreground or the background.

在本实施例中,识别出属于前景部分的各像素单元,以及属于背景部分的各像素单元,从而生成各像素单元的内容属性特征,也就是说在将像素单元是属于前景还是背景的特征添加至内容属性特征中,提高了各像素单元内容属性特征包含的特征信息。In this embodiment, each pixel unit belonging to the foreground and each pixel unit belonging to the background are identified, thereby generating content attribute features of each pixel unit. That is to say, by adding the feature of whether the pixel unit belongs to the foreground or the background to the content attribute feature, the feature information contained in the content attribute feature of each pixel unit is improved.

本申请的文档结构识别方法中,对候选区域进行二值化,以从候选区域中确定前景部分和背景部分,根据各像素单元属于前景部分或背景部分,生成各像素单元的内容属性特征,实现了增加各像素单元的内容属性特征包含的信息。In the document structure recognition method of the present application, the candidate area is binarized to determine the foreground part and the background part from the candidate area, and the content attribute features of each pixel unit are generated according to whether each pixel unit belongs to the foreground part or the background part, thereby increasing the information contained in the content attribute features of each pixel unit.

上一实施例中,说明了内容属性特征包含前景或背景,实际应用中还可以进一步扩展内容属性特征,即本实施例中内容属性特征还包含像素单元所属的目标文本的粒度,以增加内容属性特征中包含的信息。为此,本实施例还提供了一种文档结构识别方法,图4为本申请实施例提供的再一种文档结构识别方法的流程示意图,如图4所示,作为另一种可能的实现方式,上述的步骤203可以包含以下步骤:In the previous embodiment, it is explained that the content attribute feature includes the foreground or background. In practical applications, the content attribute feature can be further expanded, that is, the content attribute feature in this embodiment also includes the granularity of the target text to which the pixel unit belongs, so as to increase the information contained in the content attribute feature. To this end, this embodiment also provides a document structure recognition method. FIG4 is a flow chart of another document structure recognition method provided by the embodiment of the present application. As shown in FIG4, as another possible implementation method, the above step 203 can include the following steps:

步骤401,对候选区域进行二值化,以从候选区域中确定前景部分和背景部分。Step 401, binarize the candidate region to determine the foreground part and the background part from the candidate region.

具体地,可参照上一实施例中的步骤301,原理相同,此处不再赘述。Specifically, reference may be made to step 301 in the previous embodiment, the principle is the same and will not be repeated here.

步骤402,采用目标文本粒度对应的结构元尺寸,对二值化的候选区域进行闭运算,以扩展前景部分。Step 402 , using the structural element size corresponding to the target text granularity, performs a closing operation on the binary candidate region to expand the foreground portion.

其中,目标文本粒度包括字、句子和段落中的至少一个。结构元尺寸,和目标文本粒度对应,结构元尺寸可用矩阵表示,为可覆盖对应文本粒度的尺寸。The target text granularity includes at least one of a word, a sentence and a paragraph. The structural element size corresponds to the target text granularity. The structural element size can be represented by a matrix and is a size that can cover the corresponding text granularity.

在一个实施例中,由于目标文本粒度不同,不同的目标文本粒度对应不同的空间尺度,即结构元尺寸,结构元尺寸过大或过小都无法准确确定目标文本粒度,因此本实施例中根据不同的目标文本粒度,对应了不同的结构元尺寸,也就是说字粒度,具有对应的字的结构元尺寸,句子粒度,具有对应的句子的结构元尺寸,段落,具有对应的段落的结构元尺寸。采用目标文本粒度对应的结构元尺寸,对二值化的候选区域进行闭运算,其中,闭运算包含膨胀操作和腐蚀操作,具体来说,先采用对应的结构元尺寸,对二值化的候选区域中前景部分的目标文本粒度,采用膨胀操作进行处理,以扩展前景中对应的目标文本粒度,再采用腐蚀操作,以消除扩展得到的前景中的噪音,实现了扩展前景部分的同时,降低了前景部分的噪点。In one embodiment, due to different target text granularities, different target text granularities correspond to different spatial scales, i.e., structural element sizes. If the structural element size is too large or too small, the target text granularity cannot be accurately determined. Therefore, in this embodiment, different structural element sizes are corresponding to different target text granularities, i.e., the character granularity has the structural element size corresponding to the character, the sentence granularity has the structural element size corresponding to the sentence, and the paragraph has the structural element size corresponding to the paragraph. The structural element size corresponding to the target text granularity is used to perform a closing operation on the binarized candidate area, wherein the closing operation includes an expansion operation and an erosion operation. Specifically, the corresponding structural element size is first used to process the target text granularity of the foreground part of the binarized candidate area using an expansion operation to expand the corresponding target text granularity in the foreground, and then an erosion operation is used to eliminate the noise in the expanded foreground, thereby achieving the expansion of the foreground part while reducing the noise points in the foreground part.

例如,以目标文本粒度为句子进行说明。For example, the target text granularity is described as a sentence.

图5中示出了对二值化的一个候选区域,采用句子粒度对应的结构元尺寸,例如为5*5的矩阵,进行第一次闭运算后的结果,候选区域中的句子因膨胀使得句子对应的前景部分的区域扩张,进而,采用腐蚀操作,使得黑色的背景部分扩张,以消除部分噪点。为了增强句子间的连通效果,可以多次进行闭运算,以增强前景部分的扩张效果。图6为经过第二次闭运算后得到的扩张后的前景部分,实现了句子间的充分连通,以提高后续进行轮廓检测的效果。FIG5 shows the result of the first closing operation on a candidate region of binarization, using a structural element size corresponding to the sentence granularity, such as a 5*5 matrix. The sentences in the candidate region are expanded, so that the foreground area corresponding to the sentences is expanded. Then, the erosion operation is used to expand the black background part to eliminate some noise. In order to enhance the connectivity between sentences, the closing operation can be performed multiple times to enhance the expansion effect of the foreground part. FIG6 shows the expanded foreground part obtained after the second closing operation, which achieves full connectivity between sentences to improve the effect of subsequent contour detection.

需要说明的是,结构元尺寸,可以根据具体的文本粒度和精度进行调整,本实施例中不进行限定。It should be noted that the size of the structural element can be adjusted according to the specific text granularity and precision, and is not limited in this embodiment.
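Under the assumption of a square structuring element, the closing operation (dilation followed by erosion) and its repeated application can be sketched with NumPy sliding windows; the 5×5 element and two rounds mirror the sentence-granularity example above:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _dilate(binary, k):
    # a pixel becomes 255 if any pixel in its k x k window is 255
    pad = k // 2
    p = np.pad(binary, pad, mode="constant", constant_values=0)
    return sliding_window_view(p, (k, k)).max(axis=(2, 3))

def _erode(binary, k):
    # a pixel stays 255 only if its whole k x k window is 255
    pad = k // 2
    p = np.pad(binary, pad, mode="constant", constant_values=0)
    return sliding_window_view(p, (k, k)).min(axis=(2, 3))

def close_foreground(binary, k=5, rounds=2):
    """Morphological closing (dilate, then erode) with a k x k
    structuring element; k=5 matches the 5*5 sentence-granularity
    matrix in the example, rounds=2 the repeated closing."""
    out = binary
    for _ in range(rounds):
        out = _erode(_dilate(out, k), k)
    return out
```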

步骤403,对前景部分进行轮廓检测,以得到目标文本粒度的包围框。Step 403: perform contour detection on the foreground portion to obtain a bounding box of the target text granularity.

在本申请实施例中，采用轮廓检测算法，对前景部分进行轮廓检测，以得到目标文本粒度的包围框。以句子粒度为例，对图6中连通的句子部分进行轮廓检测，可以确定各句子的包围框，例如图6中的包围框1，其中，图6中的每个句子都可以确定一个包围框，图6中未一一标识，本实施例中也不一一列举。In an embodiment of the present application, a contour detection algorithm is used to perform contour detection on the foreground part to obtain a bounding box of the target text granularity. Taking sentence granularity as an example, contour detection is performed on the connected sentence parts in Figure 6 to determine the bounding box of each sentence, such as bounding box 1 in Figure 6. Each sentence in Figure 6 determines a bounding box; they are not marked one by one in Figure 6, nor listed one by one in this embodiment.

步骤404,将处于包围框内的像素单元确定为属于目标文本粒度,将未处于包围框内的像素单元确定为未属于目标文本粒度。Step 404: determine the pixel units in the bounding box as belonging to the target text granularity, and determine the pixel units not in the bounding box as not belonging to the target text granularity.

如图6所示,图中每一个白色区域对应的包围框即为目标文本粒度的包围框,从而将处于包围框内的像素单元确定为属于目标文本粒度,将未处于包围框内的像素单元确定为未属于目标文本粒度。例如,包围框1内的句子为句子1,在白色包围框1内的各像素单元则属于句子1,而包围框1周围的黑色区域,由于不处于包围框1,因此确定为不属于句子1。As shown in Figure 6, the bounding box corresponding to each white area in the figure is the bounding box of the target text granularity, so that the pixel units in the bounding box are determined to belong to the target text granularity, and the pixel units not in the bounding box are determined to not belong to the target text granularity. For example, the sentence in the bounding box 1 is sentence 1, and the pixel units in the white bounding box 1 belong to sentence 1, while the black area around the bounding box 1 is not in the bounding box 1, so it is determined not to belong to sentence 1.
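Steps 403 and 404 can be approximated as follows; connected-component bounding boxes are used here as a simple stand-in for the unspecified contour detection algorithm:

```python
import numpy as np
from collections import deque

def bounding_boxes(binary):
    """Return one bounding box (x, y, w, h) per 4-connected
    foreground component of the closed foreground mask."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y, x] and not seen[y, x]:
                q = deque([(y, x)])
                seen[y, x] = True
                y0, y1, x0, x1 = y, y, x, x
                while q:  # breadth-first flood fill of one component
                    cy, cx = q.popleft()
                    y0, y1 = min(y0, cy), max(y1, cy)
                    x0, x1 = min(x0, cx), max(x1, cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            q.append((ny, nx))
                boxes.append((x0, y0, x1 - x0 + 1, y1 - y0 + 1))
    return boxes

def in_box_mask(shape, boxes):
    """Step 404: mark every pixel inside any bounding box as
    belonging to the target text granularity."""
    mask = np.zeros(shape, dtype=bool)
    for x, y, bw, bh in boxes:
        mask[y:y + bh, x:x + bw] = True
    return mask
```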

步骤405,根据各像素单元属于的目标文本粒度,生成各像素单元的内容属性特征。Step 405: Generate content attribute features of each pixel unit according to the target text granularity to which each pixel unit belongs.

其中，内容属性特征对应多个维度，每个维度对应每一像素单元对应的是前景或是背景，以及是字、句子还是段落的属性特征中的一个。也就是说，每个像素点包含的特征信息可表示为[像素值，前景/背景，目标文本粒度]。Among them, the content attribute features correspond to multiple dimensions, and each dimension carries one attribute of each pixel unit: whether it corresponds to the foreground or the background, and whether it belongs to a word, a sentence or a paragraph. In other words, the feature information contained in each pixel can be expressed as [pixel value, foreground/background, target text granularity].
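The per-pixel feature layout above maps naturally onto a multi-channel input image with one channel per attribute; the channel order shown is an assumption, since the embodiment only describes the layout abstractly:

```python
import numpy as np

def build_input_image(gray, fg_mask, granularity_mask):
    """Stack the per-pixel information [pixel value, foreground/
    background, target text granularity] into a multi-channel
    input image, one channel per attribute."""
    return np.stack(
        [gray.astype(np.float32),
         fg_mask.astype(np.float32),           # 1 = foreground, 0 = background
         granularity_mask.astype(np.float32)], # 1 = inside a granularity box
        axis=-1)
```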

本实施例中,根据确定的各像素单元属于的目标文本粒度,生成各像素单元的内容属性特征,即识别各像素单元的内容属性特征,增加了内容属性特征包含的信息量,从而可以增加后续提取得到的图像特征包含的信息量,以提高文档结构类型确定的准确性。In this embodiment, content attribute features of each pixel unit are generated based on the target text granularity to which each pixel unit belongs, that is, the content attribute features of each pixel unit are identified, thereby increasing the amount of information contained in the content attribute features, thereby increasing the amount of information contained in the image features extracted subsequently, thereby improving the accuracy of determining the document structure type.

本实施例的文档结构识别方法中,在对候选区域进行二值化处理后,采用目标文本粒度对应的结构元尺寸,对二值化的候选区域进行闭运算,以扩展前景部分,并通过对前景部分进行轮廓检测,以得到目标文本粒度的包围框,并将处于包围框内的像素单元确定为属于目标文本粒度,将未处于包围框内的像素单元确定为未属于目标文本粒度,根据各像素单元属于的目标文本粒度,生成各像素单元的内容属性特征,提高了各像素单元内容属性特征包含的信息量。In the document structure recognition method of the present embodiment, after the candidate area is binarized, a closing operation is performed on the binarized candidate area using the structural element size corresponding to the target text granularity to expand the foreground part, and a bounding box of the target text granularity is obtained by performing contour detection on the foreground part, and the pixel units within the bounding box are determined to belong to the target text granularity, and the pixel units not within the bounding box are determined to not belong to the target text granularity, and the content attribute features of each pixel unit are generated according to the target text granularity to which each pixel unit belongs, thereby increasing the amount of information contained in the content attribute features of each pixel unit.

基于上述实施例,本实施例还提供了一种文档结构识别方法,在从文档图像中选取候选区域时,还可以采用滑动检测框的方式,图7为本申请实施例提供的再一种文档结构识别方法的流程示意图,如图7所示,该方法包含以下步骤:Based on the above embodiment, this embodiment further provides a document structure recognition method. When selecting a candidate area from a document image, a sliding detection frame method can also be used. FIG. 7 is a flow chart of another document structure recognition method provided by an embodiment of the present application. As shown in FIG. 7, the method includes the following steps:

步骤701,获取文档图像。Step 701, obtaining a document image.

步骤702,采用滑动检测框从文档图像中选取多个候选区域。Step 702: Select multiple candidate regions from the document image using a sliding detection frame.

在一个实施例中,根据预设的滑动检测框,在文档图像中,按照预设的步长从左至右,从上至下顺序移动,将滑动检测框每次移动所框住的区域作为候选区域,得到多个候选区域。本实施例中,为了提高候选区域获取的精度,滑动检测框移动的步长设置较小,从而相邻的候选区域之间具有重合的部分。In one embodiment, according to a preset sliding detection frame, in the document image, the sliding detection frame is sequentially moved from left to right and from top to bottom according to a preset step length, and the area framed by the sliding detection frame each time is used as a candidate area to obtain multiple candidate areas. In this embodiment, in order to improve the accuracy of obtaining the candidate area, the step length of the sliding detection frame movement is set to be small, so that adjacent candidate areas have overlapping parts.
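The sliding detection frame described above can be sketched as a simple enumerator of overlapping candidate boxes (the window size and stride are illustrative; the embodiment only requires the stride to be small enough that neighboring candidates overlap):

```python
def sliding_candidates(img_h, img_w, win_h, win_w, stride):
    """Enumerate candidate regions (x, y, win_w, win_h) by sliding a
    detection box left-to-right, top-to-bottom over the document
    image; a stride smaller than the window makes neighbors overlap."""
    boxes = []
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            boxes.append((x, y, win_w, win_h))
    return boxes
```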

步骤703,对候选区域中每一像素单元识别内容属性特征。Step 703: Identify content attribute features for each pixel unit in the candidate area.

步骤704,根据候选区域中的每一像素的像素值和内容属性特征,生成输入图像。Step 704: Generate an input image based on the pixel value and content attribute characteristics of each pixel in the candidate area.

其中,输入图像中的各像素点均具有多个通道,每一通道用于指示候选区域中对应像素的像素值或内容特征。Each pixel in the input image has multiple channels, and each channel is used to indicate the pixel value or content feature of the corresponding pixel in the candidate area.

步骤705,对输入图像进行图像特征提取,以得到图像特征。Step 705: extract image features from the input image to obtain image features.

步骤706,获取文档图像对应的文档内容。Step 706, obtaining the document content corresponding to the document image.

步骤707,根据候选区域在文档图像中的相对位置,查询文档内容中的相对位置,以得到候选区域中包含的字符。Step 707: query the relative position in the document content according to the relative position of the candidate area in the document image to obtain the characters contained in the candidate area.

步骤708,对候选区域中包含的字符进行语义识别,得到语义特征。Step 708: Perform semantic recognition on the characters contained in the candidate area to obtain semantic features.

步骤709,将候选区域的图像特征与候选区域的语义特征拼接,得到候选区域的合成特征。Step 709: concatenate the image features of the candidate region with the semantic features of the candidate region to obtain synthetic features of the candidate region.

步骤710,根据候选区域的合成特征进行分类,以确定候选区域所属的文档结构类型。Step 710: Classify the candidate region according to its synthetic features to determine the document structure type to which the candidate region belongs.
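Steps 709 and 710 can be illustrated with a toy fusion-and-classification routine; the linear-plus-softmax classifier is a hypothetical stand-in for whatever classification model the embodiment actually trains:

```python
import numpy as np

def fuse_and_classify(image_feat, semantic_feat, weight, bias):
    """Concatenate the image feature vector with the semantic feature
    vector (step 709) to obtain the synthetic feature, then classify
    it (step 710) with an illustrative linear layer + softmax."""
    fused = np.concatenate([image_feat, semantic_feat])  # synthetic feature
    logits = weight @ fused + bias   # one score per document structure type
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return int(np.argmax(probs)), probs
```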

其中,步骤703-步骤710,具体可以参照上述实施例中相关的解释说明,本实施例中不再赘述。Among them, steps 703 to 710 may be specifically referred to the relevant explanations in the above embodiments, and will not be described in detail in this embodiment.

步骤711,若在文档图像中连续分布的至少两候选区域属于同一文档结构类型,则对至少两候选区域合并,得到合并区域,以及合并区域的文档结构类型。Step 711: If at least two candidate regions continuously distributed in the document image belong to the same document structure type, the at least two candidate regions are merged to obtain a merged region and the document structure type of the merged region.

具体地,通过滑动检测框确定的候选区域中若存在连续分布的重叠的候选区域,也就是说连续分布的多个候选区域由于包含相同的内容,被分类识别为属于同一文档结构类型,为了降低属于同一文档结构类型的候选区域的数量,可采用将至少两候选区域进行合并,并将至少两候选区域对应的文档结构类型,作为合并区域的文档结构类型,以降低模型输出信息的冗杂。Specifically, if there are continuously distributed overlapping candidate areas among the candidate areas determined by the sliding detection box, that is, multiple continuously distributed candidate areas are classified and identified as belonging to the same document structure type because they contain the same content, in order to reduce the number of candidate areas belonging to the same document structure type, at least two candidate areas can be merged, and the document structure type corresponding to the at least two candidate areas can be used as the document structure type of the merged area to reduce the redundancy of the model output information.
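A minimal sketch of step 711, merging overlapping candidate regions that were classified into the same document structure type (a single greedy pass; the embodiment does not specify the exact merging procedure):

```python
def merge_same_type(regions):
    """Merge overlapping candidate regions of the same document
    structure type. regions is a list of ((x, y, w, h), type) pairs;
    overlapping boxes of one type are united into a single enclosing
    box carrying that type."""
    merged = []
    for (x, y, w, h), t in regions:
        for i, ((mx, my, mw, mh), mt) in enumerate(merged):
            overlaps = (t == mt and x < mx + mw and mx < x + w
                        and y < my + mh and my < y + h)
            if overlaps:  # grow the existing box to enclose both
                nx, ny = min(x, mx), min(y, my)
                nw = max(x + w, mx + mw) - nx
                nh = max(y + h, my + mh) - ny
                merged[i] = ((nx, ny, nw, nh), t)
                break
        else:
            merged.append(((x, y, w, h), t))
    return merged
```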

本申请实施例的文档结构识别方法中,采用滑动检测框从文档图像中选取多个候选区域,并在根据候选区域的合成特征进行分类,以确定候选区域所属的文档结构类型后,若在文档图像中连续分布的至少两候选区域属于同一文档结构类型,则对至少两候选区域合并,得到合并区域,以及合并区域的文档结构类型,以降低模型输出信息的冗杂。In the document structure recognition method of the embodiment of the present application, a sliding detection frame is used to select multiple candidate areas from a document image, and after classifying the candidate areas according to their synthetic features to determine the document structure type to which the candidate areas belong, if at least two candidate areas continuously distributed in the document image belong to the same document structure type, the at least two candidate areas are merged to obtain a merged area and the document structure type of the merged area to reduce the redundancy of the model output information.

上述实施例中是采用训练好的文档结构识别的模型对文本结构进行识别,为了实现上述实施例,本申请还提供了一种文档结构识别的模型训练方法,图8为本申请实施例提供的一种文档结构识别的模型训练方法的流程示意图,如图8所示,该方法包含以下步骤:In the above embodiment, a trained document structure recognition model is used to recognize the text structure. In order to implement the above embodiment, the present application also provides a model training method for document structure recognition. FIG8 is a flow chart of a model training method for document structure recognition provided in an embodiment of the present application. As shown in FIG8, the method comprises the following steps:

步骤801,获取训练样本集。Step 801, obtaining a training sample set.

在本申请实施例的一种可能的实现方式中，获取页面，例如，采用爬虫技术，爬取大量网页文档，提取页面中的文本内容，对页面的文档结构树进行解析，以得到各文档结构类型对应的文本区域，根据各文档结构类型对应的文本区域，对文本内容进行标注，其中，标注包含对表格区域标注文档结构类型为表格，对图像区域标注文档结构类型为图像，对标题区域标注文档结构类型为标题，对段落区域标注文档结构类型为段落，以及对脚注区域标注文档结构类型为脚注，以得到训练样本集中的训练样本，实现了可基于现有的页面数据，生成大规模的训练样本集，提高了生成的效率。In a possible implementation method of an embodiment of the present application, a page is obtained, for example, by using crawler technology to crawl a large number of web documents, extract text content in the page, and parse the document structure tree of the page to obtain text areas corresponding to each document structure type, and the text content is annotated according to the text areas corresponding to each document structure type, wherein the annotation includes annotating the table area with the document structure type as a table, annotating the image area with the document structure type as an image, annotating the title area with the document structure type as a title, annotating the paragraph area with the document structure type as a paragraph, and annotating the footnote area with the document structure type as a footnote, so as to obtain training samples in the training sample set, thereby achieving the generation of a large-scale training sample set based on the existing page data, thereby improving the generation efficiency.
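The crawl-and-annotate pipeline above can be sketched with Python's standard `html.parser`; the tag-to-structure-type mapping is purely illustrative, since the embodiment does not fix how the document structure tree maps to the five types:

```python
from html.parser import HTMLParser

class StructureAnnotator(HTMLParser):
    """Walk a page's document structure tree and label each text run
    with a document structure type (table/image/title/paragraph/
    footnote), producing (text, type) training samples."""
    TAG_TYPES = {"table": "table", "img": "image", "h1": "title",
                 "h2": "title", "p": "paragraph", "footer": "footnote"}

    def __init__(self):
        super().__init__()
        self.stack, self.samples = [], []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "img":                      # images carry no text
            self.samples.append(("", "image"))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for tag in reversed(self.stack):      # nearest enclosing known tag
            if tag in self.TAG_TYPES:
                self.samples.append((text, self.TAG_TYPES[tag]))
                return

annot = StructureAnnotator()
annot.feed("<h1>Resume</h1><p>Worked at X.</p>"
           "<table><tr><td>A</td></tr></table>")
```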

例如,如图9所示,以一个简历文本为例,进行说明,对简历文本进行文本内容提取,并对简历页面的文档结构树进行解析,得到表格的文档结构类型对应的文本区域91、图片的文档结构类型对应的文本区域93、标题的文档结构类型对应的文本区域92和文本的文档结构类型对应的文本区域94,进而,对简历文本中各文本区域对应的文本内容进行标注,即区域92标注标题,区域93标注图像、区域91标注表格和区域94标注文本。For example, as shown in Figure 9, taking a resume text as an example, the text content of the resume text is extracted, and the document structure tree of the resume page is parsed to obtain a text area 91 corresponding to the document structure type of the table, a text area 93 corresponding to the document structure type of the picture, a text area 92 corresponding to the document structure type of the title, and a text area 94 corresponding to the document structure type of the text. Then, the text content corresponding to each text area in the resume text is annotated, that is, area 92 is annotated with the title, area 93 is annotated with the image, area 91 is annotated with the table, and area 94 is annotated with the text.

在本申请实施例的另一种可能的实现方式中,随机生成布局信息,根据布局信息,生成训练文档,根据布局信息,在训练文档中标注各文档结构类型对应的文本区域,以得到训练样本集中的训练样本,实现了可基于需求,预设相应的布局信息,满足了不同场景下的样本生成需求。In another possible implementation of the embodiment of the present application, layout information is randomly generated, and a training document is generated based on the layout information. According to the layout information, text areas corresponding to each document structure type are annotated in the training document to obtain training samples in a training sample set. This enables corresponding layout information to be preset based on demand, thereby meeting sample generation requirements in different scenarios.

例如,根据预设的布局信息和填充内容随机生成布局信息,例如布局信息中包含图片、文字和标题等,进而根据布局信息填充对应的内容后生成训练文档,由于训练文档中各文本区域的文档结构类型和内容是已知的,从而可以对各文本区域包含的内容进行标注,以生成训练样本集中的训练样本,实现了可生成大规模的训练样本集。For example, layout information is randomly generated according to preset layout information and filling content. For example, the layout information includes pictures, texts, and titles, and then a training document is generated after filling in the corresponding content according to the layout information. Since the document structure type and content of each text area in the training document are known, the content contained in each text area can be annotated to generate training samples in the training sample set, thereby realizing the generation of a large-scale training sample set.

在本申请实施例的又一种可能的实现方式中,可将现有已生成的大规模的英文版本的训练样本集,除保留对应的英文版本的训练样本,还通过文本翻译替换得到对应的中文版本,以生成大规模的训练样本集。In another possible implementation of the embodiment of the present application, the existing large-scale English version training sample set can be generated by not only retaining the corresponding English version training samples, but also obtaining the corresponding Chinese version through text translation to generate a large-scale training sample set.

步骤802,采用训练样本集,对目标检测模型进行训练,其中,目标检测模型,用于从文档图像中选取候选区域,对候选区域进行图像特征提取,得到图像特征,对候选区域包含的字符进行语义识别,得到语义特征,以及根据图像特征和语义特征进行目标检测,以确定候选区域所属的文档结构类型。Step 802, using a training sample set to train a target detection model, wherein the target detection model is used to select a candidate area from the document image, perform image feature extraction on the candidate area to obtain image features, perform semantic recognition on characters contained in the candidate area to obtain semantic features, and perform target detection based on the image features and semantic features to determine the document structure type to which the candidate area belongs.

在本申请的一个实施例中，采用训练样本集，对目标检测模型进行训练，具体可以通过深度学习的方式对目标检测模型进行训练，相对于其他机器学习方法，深度学习在大数据集上的表现更好，可提高目标检测模型的训练效果。对目标检测模型训练优化的目标是最小化目标函数，即分类得到的训练样本中各文本区域的文档结构类型与标注的各文本区域的文档结构类型的误差最小，则目标检测模型训练完成，以使训练完成的目标检测模型，用于从文档图像中选取候选区域，对候选区域进行图像特征提取，得到图像特征，对候选区域包含的字符进行语义识别，得到语义特征，以及根据图像特征和语义特征进行目标检测，以确定候选区域所属的文档结构类型。In one embodiment of the present application, a training sample set is used to train the target detection model. Specifically, the target detection model can be trained by deep learning. Compared with other machine learning methods, deep learning performs better on large data sets and can improve the training effect of the target detection model. The goal of optimizing the target detection model training is to minimize the objective function, that is, the error between the document structure type of each text area in the classified training sample and the document structure type of each labeled text area is minimized, and then the target detection model training is completed, so that the trained target detection model is used to select candidate areas from document images, extract image features of the candidate areas to obtain image features, perform semantic recognition on the characters contained in the candidate areas to obtain semantic features, and perform target detection based on image features and semantic features to determine the document structure type to which the candidate areas belong.

需要说明的是,利用训练完成的目标检测模型进行文档结构类型识别的方法,可参照图1-图7对应实施例中的说明,本实施例中不再赘述。It should be noted that the method of using the trained target detection model to identify the document structure type can be referred to the description in the corresponding embodiments of Figures 1 to 7, and will not be repeated in this embodiment.

本申请实施例的文档结构识别的模型训练方法中,通过构建大规模的训练语料对目标检测模型进行训练,以使得训练得到的目标检测模型可用于获取文档图像,从文档图像中选取候选区域,对候选区域进行图像特征提取,得到图像特征,对候选区域中包含的字符进行语义识别,得到语义特征,根据图像特征和语义特征进行分类,以确定候选区域所属的文档结构类型。本申请中在进行文档结构类型识别时,在图像特征的基础上增加了语义特征,充分考虑了文档结构识别时语义信息的重要性,提高了文档结构识别的模型进行文档结构类型判断的准确性。In the model training method for document structure recognition in the embodiment of the present application, a target detection model is trained by constructing a large-scale training corpus, so that the trained target detection model can be used to obtain a document image, select a candidate area from the document image, extract image features from the candidate area to obtain image features, perform semantic recognition on the characters contained in the candidate area to obtain semantic features, and classify according to the image features and semantic features to determine the document structure type to which the candidate area belongs. In the present application, when performing document structure type recognition, semantic features are added on the basis of image features, the importance of semantic information in document structure recognition is fully considered, and the accuracy of the document structure type judgment made by the document structure recognition model is improved.

为了实现上述实施例,本申请还提出一种文档结构识别装置。In order to implement the above embodiment, the present application also proposes a document structure recognition device.

图10为本申请实施例提供的一种文档结构识别装置的结构示意图。FIG. 10 is a schematic diagram of the structure of a document structure recognition device provided in an embodiment of the present application.

如图10所示,该装置包括:图像获取模块101、选取模块102、提取模块103、识别模块104和检测模块105。As shown in FIG. 10 , the device includes: an image acquisition module 101 , a selection module 102 , an extraction module 103 , a recognition module 104 and a detection module 105 .

图像获取模块101,用于获取文档图像。The image acquisition module 101 is used to acquire a document image.

选取模块102,用于从文档图像中选取候选区域。The selection module 102 is used to select a candidate region from the document image.

提取模块103,用于对候选区域进行图像特征提取,得到图像特征。The extraction module 103 is used to extract image features from the candidate area to obtain image features.

识别模块104,用于对候选区域中包含的字符进行语义识别,得到语义特征。The recognition module 104 is used to perform semantic recognition on the characters contained in the candidate area to obtain semantic features.

检测模块105,用于根据图像特征和语义特征进行目标检测,以确定候选区域所属的文档结构类型。The detection module 105 is used to perform object detection based on image features and semantic features to determine the document structure type to which the candidate region belongs.

在本申请实施例的一种可能的实现方式中，该装置还包括：In a possible implementation of the embodiment of the present application, the device further includes:

获取模块,用于获取所述文档图像对应的文档内容。The acquisition module is used to acquire the document content corresponding to the document image.

查询模块,用于根据所述候选区域在所述文档图像中的相对位置,查询所述文档内容中的所述相对位置,以得到所述候选区域中包含的字符。The query module is used to query the relative position in the document content according to the relative position of the candidate area in the document image, so as to obtain the characters contained in the candidate area.

作为一种可能的实现方式,上述检测模块105,具体用于:As a possible implementation, the detection module 105 is specifically used to:

将所述候选区域的所述图像特征与所述候选区域的所述语义特征拼接,得到所述候选区域的合成特征,根据所述候选区域的合成特征进行分类,以确定所述候选区域所属的文档结构类型。The image features of the candidate region are concatenated with the semantic features of the candidate region to obtain synthetic features of the candidate region, and classification is performed according to the synthetic features of the candidate region to determine the document structure type to which the candidate region belongs.

在本申请实施例的一种可能的实现方式中,上述提取模块103,包括:In a possible implementation of the embodiment of the present application, the extraction module 103 includes:

识别单元,用于对所述候选区域中每一像素单元识别内容属性特征;An identification unit, used for identifying content attribute features of each pixel unit in the candidate area;

生成单元,用于根据所述候选区域中的每一像素的像素值和所述内容属性特征,生成输入图像,其中,所述输入图像中的各像素点均具有多个通道,每一通道用于指示所述候选区域中对应像素的所述像素值或所述内容特征;A generating unit, configured to generate an input image according to a pixel value of each pixel in the candidate area and the content attribute feature, wherein each pixel in the input image has a plurality of channels, and each channel is used to indicate the pixel value or the content feature of a corresponding pixel in the candidate area;

提取单元,用于对所述输入图像进行图像特征提取,以得到所述图像特征。The extraction unit is used to extract image features from the input image to obtain the image features.

作为一种可能的实现方式,上述识别单元,具体用于:As a possible implementation manner, the above identification unit is specifically used to:

对所述候选区域进行二值化,以从所述候选区域中确定前景部分和背景部分;根据各像素单元属于所述前景部分或所述背景部分,生成各像素单元的所述内容属性特征。The candidate area is binarized to determine a foreground part and a background part from the candidate area; and the content attribute feature of each pixel unit is generated according to whether each pixel unit belongs to the foreground part or the background part.

作为另一种可能的实现方式,上述识别单元,具体还用于:As another possible implementation, the identification unit is further configured to:

对所述候选区域进行二值化,以从所述候选区域中确定前景部分和背景部分;采用目标文本粒度对应的结构元尺寸,对二值化的候选区域进行闭运算,以扩展所述前景部分;其中,所述目标文本粒度包括字、句子和段落中的至少一个;对所述前景部分进行轮廓检测,以得到所述目标文本粒度的包围框;将处于所述包围框内的像素单元确定为属于所述目标文本粒度,将未处于所述包围框内的像素单元确定为未属于所述目标文本粒度;根据各像素单元属于的目标文本粒度,生成各像素单元的所述内容属性特征。Binarize the candidate area to determine the foreground part and the background part from the candidate area; use the structural element size corresponding to the target text granularity to perform a closing operation on the binarized candidate area to expand the foreground part; wherein the target text granularity includes at least one of a word, a sentence and a paragraph; perform contour detection on the foreground part to obtain a bounding box of the target text granularity; determine the pixel units in the bounding box as belonging to the target text granularity, and determine the pixel units not in the bounding box as not belonging to the target text granularity; generate the content attribute feature of each pixel unit according to the target text granularity to which each pixel unit belongs.

在本申请实施例的一种可能的实现方式中,上述选取模块102,具体用于:In a possible implementation of the embodiment of the present application, the selection module 102 is specifically used to:

采用滑动检测框从所述文档图像中选取多个所述候选区域;Selecting a plurality of candidate regions from the document image using a sliding detection frame;

对应地,该装置,还包括:Correspondingly, the device further includes:

合并模块,用于若在所述文档图像中连续分布的至少两所述候选区域属于同一所述文档结构类型,则对所述至少两候选区域合并,得到合并区域,以及所述合并区域的文档结构类型。The merging module is used for merging at least two candidate areas distributed continuously in the document image to obtain a merged area and the document structure type of the merged area if the at least two candidate areas belong to the same document structure type.

需要说明的是,前述对文档结构识别方法实施例的解释说明也适用于该实施例的文档结构识别装置,原理相同,此处不再赘述。It should be noted that the above explanation of the document structure recognition method embodiment is also applicable to the document structure recognition device of this embodiment, and the principles are the same, which will not be repeated here.

本申请实施例的文档结构识别方法中,获取文档图像,从文档图像中选取候选区域,对候选区域进行图像特征提取,得到图像特征,对候选区域中包含的字符进行语义识别,得到语义特征,根据图像特征和语义特征进行分类,以确定候选区域所属的文档结构类型。本申请中在进行文档结构类型识别时,在图像特征的基础上增加了语义特征,充分考虑了文档结构识别时语义信息的重要性,提高了文档结构类型判断的准确性。In the document structure recognition method of the embodiment of the present application, a document image is obtained, a candidate area is selected from the document image, image features are extracted for the candidate area to obtain image features, semantic recognition is performed on the characters contained in the candidate area to obtain semantic features, and classification is performed based on the image features and semantic features to determine the document structure type to which the candidate area belongs. In the present application, when performing document structure type recognition, semantic features are added on the basis of image features, the importance of semantic information in document structure recognition is fully considered, and the accuracy of document structure type judgment is improved.
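To make the feature-fusion step concrete, here is a minimal sketch: image and semantic feature vectors are concatenated and scored by a linear classification head. The feature dimensions and the toy weight matrix are assumptions; the patented model's actual head is not specified here.

```python
import numpy as np

types = ["text", "table", "title", "paragraph", "footnote", "image"]

def classify_region(image_feat, semantic_feat, W, b):
    # Concatenate image and semantic features, then score each document
    # structure type with a linear head (a stand-in for the classifier).
    fused = np.concatenate([image_feat, semantic_feat])
    logits = W @ fused + b
    return types[int(np.argmax(logits))]

# Toy head: 6 types x (4 image + 4 semantic) fused dimensions; the
# "paragraph" row is keyed to one semantic dimension for the demo.
W = np.zeros((6, 8))
W[3, 5] = 1.0
print(classify_region(np.zeros(4), np.array([0, 1.0, 0, 0]), W, np.zeros(6)))
# → paragraph
```

The point of the sketch is only the fusion: because the semantic vector participates in the logits, text whose appearance is ambiguous can still be classified by its meaning.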

为了实现上述实施例,本申请还提出一种用于文档结构识别的模型训练装置。In order to implement the above embodiment, the present application also proposes a model training device for document structure recognition.

图11为本申请实施例提供的一种用于文档结构识别的模型训练装置的结构示意图。FIG11 is a schematic diagram of the structure of a model training device for document structure recognition provided in an embodiment of the present application.

如图11所示,该装置包含:样本获取模块111和训练模块112。As shown in FIG. 11 , the device includes: a sample acquisition module 111 and a training module 112 .

样本获取模块111,用于获取训练样本集。The sample acquisition module 111 is used to acquire a training sample set.

训练模块112，用于采用训练样本集，对目标检测模型进行训练，其中，目标检测模型，用于从文档图像中选取候选区域，对候选区域进行图像特征提取，得到图像特征，对候选区域包含的字符进行语义识别，得到语义特征，以及根据图像特征和语义特征进行目标检测，以确定候选区域所属的文档结构类型。The training module 112 is used to train the target detection model using a training sample set, wherein the target detection model is used to select a candidate area from a document image, perform image feature extraction on the candidate area to obtain image features, perform semantic recognition on characters contained in the candidate area to obtain semantic features, and perform target detection based on the image features and semantic features to determine the document structure type to which the candidate area belongs.
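As a toy stand-in for the training step, the sketch below fits a softmax-regression head on fused features with labelled structure types. This is not the patented detection model; the optimizer, loss, and dimensions are all illustrative assumptions.

```python
import numpy as np

def train_head(X, y, n_types, lr=0.5, epochs=200):
    # Softmax regression on fused (image + semantic) feature vectors X
    # with integer structure-type labels y, trained by gradient descent.
    W = np.zeros((n_types, X.shape[1]))
    for _ in range(epochs):
        logits = X @ W.T
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        onehot = np.eye(n_types)[y]
        W -= lr * (p - onehot).T @ X / len(X)
    return W

# Two separable training samples: type 0 vs type 1.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
W = train_head(X, y, 2)
print((X @ W.T).argmax(axis=1))  # → [0 1]
```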

在本申请实施例的一种可能的实现方式中,上述样本获取模块111,包括:In a possible implementation of the embodiment of the present application, the sample acquisition module 111 includes:

获取单元,用于获取页面。The acquisition unit is used to acquire the page.

提取单元，用于提取所述页面中的文本内容。The extraction unit is used to extract the text content in the page.

解析单元,用于对所述页面的文档结构树进行解析,以得到各文档结构类型对应的文本区域。The parsing unit is used to parse the document structure tree of the page to obtain the text area corresponding to each document structure type.

标注单元,用于根据各所述文档结构类型对应的文本区域,对所述文本内容进行标注,以得到所述训练样本集中的训练样本。The marking unit is used to mark the text content according to the text area corresponding to each of the document structure types to obtain training samples in the training sample set.
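A sketch of mining labelled samples from a page's document structure tree, using Python's `html.parser`. The tag-to-type mapping and the flat (non-nested) handling are assumptions for illustration; a real labeler would also record the text regions' coordinates.

```python
from html.parser import HTMLParser

# Map structure-tree tags to document structure types (illustrative).
TAG2TYPE = {"h1": "title", "p": "paragraph", "table": "table"}

class StructureLabeler(HTMLParser):
    # Walk the page's structure tree and emit (structure_type, text)
    # training labels for each recognised region.
    def __init__(self):
        super().__init__()
        self.stack, self.samples = [], []
    def handle_starttag(self, tag, attrs):
        if tag in TAG2TYPE:
            self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
    def handle_data(self, data):
        if self.stack and data.strip():
            self.samples.append((TAG2TYPE[self.stack[-1]], data.strip()))

p = StructureLabeler()
p.feed("<h1>Results</h1><p>Accuracy improved.</p>")
print(p.samples)
# → [('title', 'Results'), ('paragraph', 'Accuracy improved.')]
```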

在本申请实施例的另一种可能的实现方式中,上述样本获取模块111,还包括:In another possible implementation of the embodiment of the present application, the sample acquisition module 111 further includes:

生成单元,用于随机生成布局信息,根据所述布局信息,生成训练文档。The generating unit is used to randomly generate layout information, and generate a training document according to the layout information.

上述标注单元,用于根据所述布局信息,在所述训练文档中标注各文档结构类型对应的文本区域,以得到所述训练样本集中的训练样本。The above-mentioned marking unit is used to mark the text area corresponding to each document structure type in the training document according to the layout information, so as to obtain the training samples in the training sample set.
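The random layout generation can be sketched as stacking typed blocks down a page; the page size, block-height range, and top-to-bottom stacking are assumptions, since the patent does not fix a particular layout scheme.

```python
import random

def random_layout(seed, page_w=600, page_h=800, n_blocks=4):
    # Randomly generate layout information: (structure_type, (x, y, w, h))
    # blocks stacked top to bottom, which can then drive rendering of a
    # synthetic training document with known labels.
    rng = random.Random(seed)
    types = ["title", "paragraph", "table", "image", "footnote"]
    y, layout = 10, []
    for _ in range(n_blocks):
        h = rng.randint(30, 120)
        if y + h > page_h:
            break
        layout.append((rng.choice(types), (20, y, page_w - 40, h)))
        y += h + rng.randint(5, 20)
    return layout

layout = random_layout(seed=42)
print(len(layout), all(y + h <= 800 for _, (x, y, w, h) in layout))  # → 4 True
```

Because the layout is generated, every block's type and text region are known exactly, so the labels for the training samples come for free.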

需要说明的是,前述对文档结构识别的模型训练方法实施例的解释说明也适用于该实施例的文档结构识别的模型训练装置,原理相同,此处不再赘述。It should be noted that the aforementioned explanation of the embodiment of the model training method for document structure recognition is also applicable to the model training device for document structure recognition in this embodiment. The principles are the same and will not be repeated here.

本申请实施例的文档结构识别的模型训练装置中,通过构建大规模的训练语料对目标检测模型进行训练,以使得训练得到的目标检测模型可用于获取文档图像,从文档图像中选取候选区域,对候选区域进行图像特征提取,得到图像特征,对候选区域中包含的字符进行语义识别,得到语义特征,根据图像特征和语义特征进行分类,以确定候选区域所属的文档结构类型。本申请中在进行文档结构类型识别时,在图像特征的基础上增加了语义特征,充分考虑了文档结构识别时语义信息的重要性,提高了文档结构识别的模型进行文档结构类型判断的准确性。In the model training device for document structure recognition of the embodiment of the present application, a target detection model is trained by constructing a large-scale training corpus, so that the trained target detection model can be used to obtain a document image, select a candidate area from the document image, extract image features from the candidate area to obtain image features, perform semantic recognition on the characters contained in the candidate area to obtain semantic features, and classify according to the image features and semantic features to determine the document structure type to which the candidate area belongs. In the present application, when performing document structure type recognition, semantic features are added on the basis of image features, the importance of semantic information in document structure recognition is fully considered, and the accuracy of the document structure type judgment made by the document structure recognition model is improved.

为了实现上述实施例,本申请实施例还提供了一种电子设备,包括:In order to implement the above embodiment, the embodiment of the present application further provides an electronic device, including:

至少一个处理器;以及at least one processor; and

与所述至少一个处理器通信连接的存储器;其中,a memory communicatively connected to the at least one processor; wherein,

所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行前述方法实施例中所述的文档结构识别方法或前述方法实施例中所述的文档结构识别的模型训练方法。The memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the document structure recognition method described in the aforementioned method embodiment or the model training method for document structure recognition described in the aforementioned method embodiment.

为了实现上述实施例,本申请实施例提出了一种存储有计算机指令的非瞬时计算机可读存储介质,所述计算机指令用于使所述计算机执行如前述方法实施例所述的文档结构识别方法或前述方法实施例中所述的文档结构识别的模型训练方法。In order to implement the above-mentioned embodiments, the embodiments of the present application propose a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to enable the computer to execute the document structure recognition method described in the above-mentioned method embodiments or the model training method for document structure recognition described in the above-mentioned method embodiments.

根据本申请的实施例,本申请还提供了一种电子设备和一种可读存储介质。According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.

如图12所示,是根据本申请实施例的文档结构识别方法或文档结构识别的模型训练方法的电子设备的框图。电子设备旨在表示各种形式的数字计算机,诸如,膝上型计算机、台式计算机、工作台、个人数字助理、服务器、刀片式服务器、大型计算机、和其它适合的计算机。电子设备还可以表示各种形式的移动装置,诸如,个人数字处理、蜂窝电话、智能电话、可穿戴设备和其它类似的计算装置。本文所示的部件、它们的连接和关系、以及它们的功能仅仅作为示例,并且不意在限制本文中描述的和/或者要求的本申请的实现。As shown in Figure 12, it is a block diagram of an electronic device according to the document structure recognition method or the model training method of document structure recognition of an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workbenches, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices and other similar computing devices. The components shown herein, their connections and relationships, and their functions are only examples, and are not intended to limit the implementation of the present application described and/or required herein.

如图12所示，该电子设备包括：一个或多个处理器1201、存储器1202，以及用于连接各部件的接口，包括高速接口和低速接口。各个部件利用不同的总线互相连接，并且可以被安装在公共主板上或者根据需要以其它方式安装。处理器可以对在电子设备内执行的指令进行处理，包括存储在存储器中或者存储器上以在外部输入/输出装置(诸如，耦合至接口的显示设备)上显示GUI的图形信息的指令。在其它实施方式中，若需要，可以将多个处理器和/或多条总线与多个存储器一起使用。同样，可以连接多个电子设备，各个设备提供部分必要的操作(例如，作为服务器阵列、一组刀片式服务器、或者多处理器系统)。图12中以一个处理器1201为例。As shown in Figure 12, the electronic device includes: one or more processors 1201, memory 1202, and interfaces for connecting various components, including high-speed interfaces and low-speed interfaces. The various components are connected to each other using different buses, and can be installed on a common mainboard or installed in other ways as needed. The processor can process the instructions executed in the electronic device, including instructions stored in or on the memory to display the graphical information of the GUI on an external input/output device (such as a display device coupled to the interface). In other embodiments, if necessary, multiple processors and/or multiple buses can be used together with multiple memories. Similarly, multiple electronic devices can be connected, and each device provides some necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). In Figure 12, a processor 1201 is taken as an example.

存储器1202即为本申请所提供的非瞬时计算机可读存储介质。其中,所述存储器存储有可由至少一个处理器执行的指令,以使所述至少一个处理器执行本申请所提供的文档结构识别方法,或执行文档结构识别的模型训练方法的方法。本申请的非瞬时计算机可读存储介质存储计算机指令,该计算机指令用于使计算机执行本申请所提供的文档结构识别方法,或者执行文档结构识别的模型训练方法。The memory 1202 is the non-transitory computer-readable storage medium provided in the present application. The memory stores instructions executable by at least one processor to enable the at least one processor to execute the document structure recognition method provided in the present application, or to execute the model training method for document structure recognition. The non-transitory computer-readable storage medium of the present application stores computer instructions, which are used to enable a computer to execute the document structure recognition method provided in the present application, or to execute the model training method for document structure recognition.

存储器1202作为一种非瞬时计算机可读存储介质,可用于存储非瞬时软件程序、非瞬时计算机可执行程序以及模块,如本申请实施例中的文档结构识别方法对应的程序指令/模块(例如,附图10所示的图像获取模块101、选取模块102、提取模块103、识别模块104和检测模块105)。处理器1201通过运行存储在存储器1202中的非瞬时软件程序、指令以及模块,从而执行服务器的各种功能应用以及数据处理,即实现上述方法实施例中的文档结构识别方法。同理,可实现上述方法实施例中的文档结构识别的模型训练方法,原理相同,此处不再赘述。The memory 1202, as a non-transient computer-readable storage medium, can be used to store non-transient software programs, non-transient computer executable programs and modules, such as the program instructions/modules corresponding to the document structure recognition method in the embodiment of the present application (for example, the image acquisition module 101, the selection module 102, the extraction module 103, the recognition module 104 and the detection module 105 shown in FIG. 10). The processor 1201 executes various functional applications and data processing of the server by running the non-transient software programs, instructions and modules stored in the memory 1202, that is, implements the document structure recognition method in the above method embodiment. Similarly, the model training method for document structure recognition in the above method embodiment can be implemented, and the principle is the same, which will not be repeated here.

存储器1202可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储根据文档结构识别方法或文档结构识别的模型训练方法的电子设备的使用所创建的数据等。此外,存储器1202可以包括高速随机存取存储器,还可以包括非瞬时存储器,例如至少一个磁盘存储器件、闪存器件、或其他非瞬时固态存储器件。在一些实施例中,存储器1202可选包括相对于处理器1201远程设置的存储器,这些远程存储器可以通过网络连接至文档结构识别方法或文档结构识别的模型训练方法的电子设备。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The memory 1202 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required by at least one function; the data storage area may store data created by the use of an electronic device according to a document structure recognition method or a model training method for document structure recognition, etc. In addition, the memory 1202 may include a high-speed random access memory, and may also include a non-transient memory, such as at least one disk storage device, a flash memory device, or other non-transient solid-state storage device. In some embodiments, the memory 1202 may optionally include a memory remotely arranged relative to the processor 1201, and these remote memories may be connected to the electronic device of the document structure recognition method or the model training method for document structure recognition via a network. Examples of the above-mentioned network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

文档结构识别方法或文档结构识别的模型训练方法的电子设备还可以包括:输入装置1203和输出装置1204。处理器1201、存储器1202、输入装置1203和输出装置1204可以通过总线或者其他方式连接,图12中以通过总线连接为例。The electronic device of the document structure recognition method or the document structure recognition model training method may further include: an input device 1203 and an output device 1204. The processor 1201, the memory 1202, the input device 1203 and the output device 1204 may be connected via a bus or other means, and FIG12 takes the bus connection as an example.

输入装置1203可接收输入的数字或字符信息，以及产生与文档结构识别方法或文档结构识别的模型训练方法的电子设备的用户设置以及功能控制有关的键信号输入，例如触摸屏、小键盘、鼠标、轨迹板、触摸板、指示杆、一个或者多个鼠标按钮、轨迹球、操纵杆等输入装置。输出装置1204可以包括显示设备、辅助照明装置(例如，LED)和触觉反馈装置(例如，振动电机)等。该显示设备可以包括但不限于，液晶显示器(LCD)、发光二极管(LED)显示器和等离子体显示器。在一些实施方式中，显示设备可以是触摸屏。The input device 1203 can receive input digital or character information, and generate key signal input related to user settings and function control of the electronic device of the document structure recognition method or the model training method of document structure recognition, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, an indicator rod, one or more mouse buttons, a trackball, a joystick and other input devices. The output device 1204 may include a display device, an auxiliary lighting device (e.g., an LED) and a tactile feedback device (e.g., a vibration motor), etc. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display and a plasma display. In some embodiments, the display device may be a touch screen.

此处描述的系统和技术的各种实施方式可以在数字电子电路系统、集成电路系统、专用ASIC(专用集成电路)、计算机硬件、固件、软件、和/或它们的组合中实现。这些各种实施方式可以包括:实施在一个或者多个计算机程序中,该一个或者多个计算机程序可在包括至少一个可编程处理器的可编程系统上执行和/或解释,该可编程处理器可以是专用或者通用可编程处理器,可以从存储系统、至少一个输入装置、和至少一个输出装置接收数据和指令,并且将数据和指令传输至该存储系统、该至少一个输入装置、和该至少一个输出装置。Various implementations of the systems and techniques described herein can be realized in digital electronic circuit systems, integrated circuit systems, dedicated ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include: being implemented in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which can be a special purpose or general purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

这些计算程序(也称作程序、软件、软件应用、或者代码)包括可编程处理器的机器指令,并且可以利用高级过程和/或面向对象的编程语言、和/或汇编/机器语言来实施这些计算程序。如本文使用的,术语“机器可读介质”和“计算机可读介质”指的是用于将机器指令和/或数据提供给可编程处理器的任何计算机程序产品、设备、和/或装置(例如,磁盘、光盘、存储器、可编程逻辑装置(PLD)),包括,接收作为机器可读信号的机器指令的机器可读介质。术语“机器可读信号”指的是用于将机器指令和/或数据提供给可编程处理器的任何信号。These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for programmable processors and can be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or means (e.g., disk, optical disk, memory, programmable logic device (PLD)) for providing machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal for providing machine instructions and/or data to a programmable processor.

为了提供与用户的交互，可以在计算机上实施此处描述的系统和技术，该计算机具有：用于向用户显示信息的显示装置(例如，CRT(阴极射线管)或者LCD(液晶显示器)监视器)；以及键盘和指向装置(例如，鼠标或者轨迹球)，用户可以通过该键盘和该指向装置来将输入提供给计算机。其它种类的装置还可以用于提供与用户的交互；例如，提供给用户的反馈可以是任何形式的传感反馈(例如，视觉反馈、听觉反馈、或者触觉反馈)；并且可以用任何形式(包括声输入、语音输入或者触觉输入)来接收来自用户的输入。To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other types of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including acoustic input, voice input, or tactile input).

可以将此处描述的系统和技术实施在包括后台部件的计算系统(例如,作为数据服务器)、或者包括中间件部件的计算系统(例如,应用服务器)、或者包括前端部件的计算系统(例如,具有图形用户界面或者网络浏览器的用户计算机,用户可以通过该图形用户界面或者该网络浏览器来与此处描述的系统和技术的实施方式交互)、或者包括这种后台部件、中间件部件、或者前端部件的任何组合的计算系统中。可以通过任何形式或者介质的数字数据通信(例如,通信网络)来将系统的部件相互连接。通信网络的示例包括:局域网(LAN)、广域网(WAN)和互联网。The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer with a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communications networks include: a local area network (LAN), a wide area network (WAN), and the Internet.

计算机系统可以包括客户端和服务器。客户端和服务器一般远离彼此并且通常通过通信网络进行交互。通过在相应的计算机上运行并且彼此具有客户端-服务器关系的计算机程序来产生客户端和服务器的关系。A computer system may include clients and servers. Clients and servers are generally remote from each other and usually interact through a communication network. The relationship of client and server is generated by computer programs running on respective computers and having a client-server relationship to each other.

根据本申请实施例的技术方案,涉及自然语言处理和深度学习技术领域,获取文档图像,从文档图像中选取候选区域,对候选区域进行图像特征提取,得到图像特征,对候选区域中包含的字符进行语义识别,得到语义特征,根据图像特征和语义特征进行分类,以确定候选区域所属的文档结构类型。本申请中在进行文档结构类型识别时,在图像特征的基础上增加了语义特征,充分考虑了文档结构识别时语义信息的重要性,提高了文档结构类型判断的准确性。According to the technical solution of the embodiment of the present application, which involves the field of natural language processing and deep learning technology, a document image is obtained, a candidate area is selected from the document image, image features are extracted from the candidate area to obtain image features, semantic recognition is performed on the characters contained in the candidate area to obtain semantic features, and classification is performed based on the image features and semantic features to determine the document structure type to which the candidate area belongs. In the present application, when identifying the document structure type, semantic features are added on the basis of image features, the importance of semantic information in document structure identification is fully considered, and the accuracy of document structure type judgment is improved.

应该理解，可以使用上面所示的各种形式的流程，重新排序、增加或删除步骤。例如，本申请中记载的各步骤可以并行地执行，也可以顺序地执行，也可以以不同的次序执行，只要能够实现本申请公开的技术方案所期望的结果，本文在此不进行限制。It should be understood that the various forms of processes shown above can be used to reorder, add or delete steps. For example, the steps recorded in this application can be executed in parallel, sequentially or in different orders, as long as the expected results of the technical solution disclosed in this application can be achieved, and this document is not limited here.

上述具体实施方式,并不构成对本申请保护范围的限制。本领域技术人员应该明白的是,根据设计要求和其他因素,可以进行各种修改、组合、子组合和替代。任何在本申请的精神和原则之内所作的修改、等同替换和改进等,均应包含在本申请保护范围之内。The above specific implementations do not constitute a limitation on the protection scope of this application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of this application should be included in the protection scope of this application.

Claims (21)

1. A document structure identification method, the method comprising:
acquiring a document image;
selecting a candidate region from the document image;
extracting image features of the candidate region to obtain image features;
carrying out semantic recognition on characters contained in the candidate region to obtain semantic features;
classifying according to the image features and the semantic features to determine a document structure type to which the candidate region belongs, wherein the document structure type comprises texts, tables, titles, paragraphs, footnotes and images;
wherein the extracting image features of the candidate region to obtain the image features comprises:
identifying a content attribute feature for each pixel unit in the candidate region;
generating an input image according to the pixel value of each pixel in the candidate region and the content attribute features, wherein each pixel point in the input image is provided with a plurality of channels, and each channel is used for indicating the pixel value and the content attribute feature of the corresponding pixel in the candidate region;
and extracting image features of the input image to obtain the image features.
2. The document structure recognition method of claim 1, wherein the classifying according to the image feature and the semantic feature to determine a document structure type to which the candidate region belongs comprises:
splicing the image features of the candidate region with the semantic features of the candidate region to obtain synthetic features of the candidate region;
and classifying according to the synthetic features of the candidate region to determine the document structure type of the candidate region.
3. The document structure recognition method of claim 1, wherein the identifying content attribute features for each pixel unit in the candidate region comprises:
binarizing the candidate region to determine a foreground portion and a background portion from the candidate region;
and generating the content attribute feature of each pixel unit according to whether each pixel unit belongs to the foreground portion or the background portion.
4. The document structure recognition method of claim 1, wherein the identifying content attribute features for each pixel unit in the candidate region comprises:
binarizing the candidate region to determine a foreground portion and a background portion from the candidate region;
performing a closing operation on the binarized candidate region by adopting a structural element size corresponding to a target text granularity so as to expand the foreground portion; wherein the target text granularity comprises at least one of words, sentences, and paragraphs;
performing contour detection on the foreground portion to obtain a bounding box of the target text granularity;
determining that the pixel units in the bounding box belong to the target text granularity, and determining that the pixel units not in the bounding box do not belong to the target text granularity;
and generating the content attribute feature of each pixel unit according to the target text granularity to which each pixel unit belongs.
5. The document structure recognition method according to any one of claims 1 to 4, wherein the selecting a candidate region from the document image includes:
selecting a plurality of candidate areas from the document image by adopting a sliding detection frame;
Correspondingly, after classifying according to the image features and the semantic features of the candidate region to determine the document structure type to which the candidate region belongs, the method further comprises:
and if at least two candidate areas continuously distributed in the document image belong to the same document structure type, merging the at least two candidate areas to obtain a merging area and the document structure type of the merging area.
6. The document structure recognition method according to any one of claims 1 to 4, wherein, before said semantically recognizing the characters contained in each of the candidate regions, further comprising:
acquiring document content corresponding to the document image;
and according to the relative position of the candidate region in the document image, querying the relative position in the document content to obtain the characters contained in the candidate region.
7. A model training method for document structure recognition, the method comprising:
acquiring a training sample set;
training a target detection model by adopting the training sample set, wherein the target detection model is used for selecting a candidate region from a document image, extracting image features of the candidate region to obtain image features, carrying out semantic recognition on characters contained in the candidate region to obtain semantic features, and carrying out target detection according to the image features and the semantic features to determine the document structure type of the candidate region, wherein the document structure type comprises texts, tables, titles, paragraphs, footnotes and images;
When training the target detection model by adopting the training sample set, selecting a candidate region from a document image by the target detection model, and extracting image features of the candidate region to obtain image features, wherein the method comprises the following steps:
identifying a content attribute feature for each pixel unit in the candidate region;
generating an input image according to the pixel value of each pixel in the candidate region and the content attribute features, wherein each pixel point in the input image is provided with a plurality of channels, and each channel is used for indicating the pixel value and the content attribute feature of the corresponding pixel in the candidate region;
and extracting image features of the input image to obtain the image features.
8. The model training method of claim 7, wherein the acquiring a training sample set comprises:
acquiring a page;
extracting text content in the page;
analyzing a document structure tree of the page to obtain text regions corresponding to the document structure types;
and labeling the text content according to the text region corresponding to each document structure type to obtain training samples in the training sample set.
9. The model training method of claim 7, wherein the acquiring a training sample set comprises:
randomly generating layout information;
generating a training document according to the layout information;
and marking text regions corresponding to the document structure types in the training documents according to the layout information to obtain training samples in the training sample set.
10. A document structure identification apparatus, the apparatus comprising:
the image acquisition module is used for acquiring a document image;
the selecting module is used for selecting candidate areas from the document image;
the extraction module is used for extracting the image characteristics of the candidate areas to obtain the image characteristics;
the identification module is used for carrying out semantic identification on the characters contained in the candidate region to obtain semantic features;
the detection module is used for carrying out target detection according to the image characteristics and the semantic characteristics so as to determine the document structure type of the candidate region, wherein the document structure type comprises a text, a table, a title, a paragraph, a footnote and an image;
Wherein, the extraction module includes:
an identifying unit configured to identify a content attribute feature for each pixel unit in the candidate region;
A generating unit, configured to generate an input image according to a pixel value of each pixel in the candidate region and the content attribute feature, where each pixel point in the input image has a plurality of channels, and each channel is configured to indicate the pixel value and the content attribute feature of a corresponding pixel in the candidate region;
and the extraction unit is used for extracting the image characteristics of the input image so as to obtain the image characteristics.
11. The document structure identification device of claim 10, wherein the detection module is specifically configured to:
splicing the image features of the candidate region with the semantic features of the candidate region to obtain synthetic features of the candidate region;
and classifying according to the synthetic features of the candidate region to determine the document structure type of the candidate region.
12. The document structure identification device according to claim 10, wherein the identification unit is specifically configured to:
binarizing the candidate region to determine a foreground portion and a background portion from the candidate region;
and generating the content attribute feature of each pixel unit according to whether each pixel unit belongs to the foreground portion or the background portion.
13. The document structure identification device according to claim 10, wherein the identification unit is further specifically configured to:
binarizing the candidate region to determine a foreground portion and a background portion from the candidate region;
performing a closing operation on the binarized candidate region by adopting a structural element size corresponding to a target text granularity so as to expand the foreground portion; wherein the target text granularity comprises at least one of words, sentences, and paragraphs;
performing contour detection on the foreground portion to obtain a bounding box of the target text granularity;
determining that the pixel units in the bounding box belong to the target text granularity, and determining that the pixel units not in the bounding box do not belong to the target text granularity;
and generating the content attribute feature of each pixel unit according to the target text granularity to which each pixel unit belongs.
14. The document structure identification apparatus according to any one of claims 10-13, wherein the selecting module is specifically configured to:
selecting a plurality of candidate areas from the document image by adopting a sliding detection frame;
correspondingly, the device further comprises:
a merging module, used for merging the at least two candidate areas if at least two candidate areas continuously distributed in the document image belong to the same document structure type, so as to obtain a merged area and the document structure type of the merged area.
15. The document structure identification apparatus according to any one of claims 10-13, wherein the apparatus further comprises:
an acquisition module configured to acquire the document content corresponding to the document image; and
a query module configured to query, according to the relative position of the candidate region in the document image, the corresponding relative position in the document content, so as to obtain the characters contained in the candidate region.
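The query module of claim 15 maps a region's relative position in the image onto the extracted document content. The crude proportional mapping below is purely an illustrative assumption (the claim does not specify the mapping; a practical system would align on OCR line coordinates rather than raw character counts):

```python
def characters_in_region(document_content: str, region_top: int,
                         region_bottom: int, image_height: int) -> str:
    """Return the characters of the document content that fall,
    proportionally, inside the candidate region's vertical span.
    Assumes characters are distributed uniformly down the page --
    a simplification for illustration only."""
    n = len(document_content)
    start = (n * region_top) // image_height
    end = (n * region_bottom) // image_height
    return document_content[start:end]

# A region covering the top half of a 100-pixel-tall page maps to
# the first half of the extracted content.
snippet = characters_in_region("0123456789", 0, 50, 100)
# snippet == "01234"
```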
16. A model training apparatus for document structure recognition, the apparatus comprising:
a sample acquisition module configured to acquire a training sample set; and
a training module configured to train a target detection model using the training sample set, wherein the target detection model is configured to select a candidate region from a document image, perform image feature extraction on the candidate region to obtain image features, perform semantic recognition on characters contained in the candidate region to obtain semantic features, and perform target detection according to the image features and the semantic features to determine a document structure type of the candidate region, the document structure type comprising text, table, title, paragraph, footnote, and image;
wherein, when training the target detection model with the training sample set, the training module is specifically configured to:
identify a content attribute feature of each pixel unit in the candidate region;
generate an input image according to the pixel value and the content attribute feature of each pixel in the candidate region, wherein each pixel point in the input image has a plurality of channels, each channel indicating the pixel value or a content attribute feature of the corresponding pixel in the candidate region; and
perform image feature extraction on the input image to obtain the image features.
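The multi-channel input image of claim 16 can be sketched as a simple channel-stacking step: one channel holds the grayscale pixel values and each additional channel holds one content attribute map. The function name and two-argument interface are illustrative assumptions:

```python
import numpy as np

def build_input_image(pixel_values: np.ndarray,
                      content_attributes: np.ndarray) -> np.ndarray:
    """Stack grayscale pixel values and per-pixel content attribute
    features into a multi-channel input image: each pixel point of the
    result carries one channel for its pixel value plus one channel per
    content attribute feature."""
    if content_attributes.ndim == 2:          # single attribute map
        content_attributes = content_attributes[..., np.newaxis]
    return np.concatenate(
        [pixel_values[..., np.newaxis], content_attributes], axis=-1)

# A 4x3 region with one attribute map yields a 4x3x2 input image:
# channel 0 = pixel values, channel 1 = content attribute features.
px = np.zeros((4, 3), dtype=np.uint8)
attr = np.ones((4, 3), dtype=np.uint8)
img = build_input_image(px, attr)   # img.shape == (4, 3, 2)
```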
17. The model training apparatus of claim 16, wherein the sample acquisition module comprises:
an acquisition unit configured to acquire a page;
an extraction unit configured to extract text content from the page;
a parsing unit configured to parse the document structure tree of the page to obtain the text region corresponding to each document structure type; and
a labeling unit configured to label the text content according to the text region corresponding to each document structure type, so as to obtain training samples of the training sample set.
18. The model training apparatus of claim 16, wherein the sample acquisition module comprises:
a generating unit configured to randomly generate layout information and generate a training document according to the layout information; and
a labeling unit configured to label, according to the layout information, the text region corresponding to each document structure type in the training document, so as to obtain training samples of the training sample set.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the document structure identification method of any one of claims 1-6 or the model training method for document structure identification of any one of claims 7-9.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the document structure identification method of any one of claims 1-6 or the model training method for document structure identification of any one of claims 7-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the document structure identification method of any one of claims 1 to 6 or the model training method of document structure identification of any one of claims 7 to 9.
CN202010501381.7A 2020-06-04 2020-06-04 Document structure recognition method, document structure recognition model training method and device Active CN111832403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010501381.7A CN111832403B (en) 2020-06-04 2020-06-04 Document structure recognition method, document structure recognition model training method and device

Publications (2)

Publication Number Publication Date
CN111832403A 2020-10-27
CN111832403B 2024-07-26

Family

ID=72897598

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569846A (en) * 2019-09-16 2019-12-13 北京百度网讯科技有限公司 Image character recognition method, device, equipment and storage medium
CN110751146A (en) * 2019-10-23 2020-02-04 北京印刷学院 Text area detection method, device, electronic terminal and computer-readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06214983A (en) * 1993-01-20 1994-08-05 Kokusai Denshin Denwa Co Ltd <Kdd> Method and device for converting document picture to logical structuring document
JP2009110500A (en) * 2007-10-29 2009-05-21 Toshiba Corp Document processing apparatus, document processing method, and document processing apparatus program
CN106326193A (en) * 2015-06-18 2017-01-11 北京大学 Footnote identification method and footnote and footnote citation association method in fixed-layout document
CN111008293A (en) * 2018-10-06 2020-04-14 上海交通大学 Visual question-answering method based on structured semantic representation
CN109344815B (en) * 2018-12-13 2021-08-13 深源恒际科技有限公司 Document image classification method
CN110390269B (en) * 2019-06-26 2023-08-01 平安科技(深圳)有限公司 PDF document table extraction method, device, equipment and computer readable storage medium
CN110427819B (en) * 2019-06-26 2022-11-29 深圳职业技术学院 Method for identifying PPT frame in image and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant