CN105574530B

CN105574530B - Method and device for extracting text lines in a document

Info

Publication number: CN105574530B
Application number: CN201410525023.4A
Authority: CN
Inventors: 张明明; 许亮; 范伟; 孙俊
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2014-10-08
Filing date: 2014-10-08
Publication date: 2019-11-22
Anticipated expiration: 2034-10-08
Also published as: CN105574530A

Abstract

The invention relates to a method and a device for extracting text lines in a document. According to one aspect of the invention, a method for extracting text lines in a document is provided, comprising: performing coarse clustering on a plurality of text blocks in the document to form a plurality of classes; calculating the features of each class; determining the direction of the document according to the features of each class; and performing fine clustering on the plurality of text blocks according to the direction of the document, so as to extract text lines.

Description

Method and device for extracting text lines in a document

Technical field

The invention relates to the field of document processing, and in particular to a method and a device for extracting text lines in a document.

Background

With the development of computer and communication technology, the volume of information and data has grown dramatically. Faced with such large amounts of data, users increasingly demand automated information processing, and document processing in particular. Document processing requires extracting and recognizing the text in document images. To extract the text in a document image, the text lines must be extracted first; each text line is then segmented and recognized according to the character features it exhibits.

A common text line extraction algorithm projects the image in the horizontal and vertical directions and extracts text lines by exploiting the relatively large blank gaps between them. However, when the document image has a complex layout or contains considerable noise, extracting text lines becomes very difficult.

Various solutions to this problem have been proposed in the prior art, including methods based on graph theory, Gaussian convolution, K-means clustering, and morphology. However, some of these methods have a high error rate, and others require human supervision or intervention during operation, which makes them inconvenient to use.

Summary of the invention

In view of this, the present invention proposes a method and a device for extracting text lines in a document, so that text lines can be extracted accurately and efficiently.

According to one aspect of the present invention, a method for extracting text lines in a document is provided, comprising: performing coarse clustering on a plurality of text blocks in the document to form a plurality of classes; calculating the features of each class; determining the direction of the document according to the features of each class; and performing fine clustering on the plurality of text blocks according to the direction of the document, so as to extract text lines.

According to another aspect of the present invention, a device for extracting text lines in a document is provided, comprising: a coarse clustering unit, which performs coarse clustering on a plurality of text blocks in the document to form a plurality of classes; a class feature calculation unit, which calculates the features of the classes; a document direction determination unit, which determines the direction of the document according to the class features calculated by the class feature calculation unit; and a fine clustering unit, which performs fine clustering on the plurality of text blocks according to the direction of the document determined by the document direction determination unit, so as to extract text lines.

With the technical solution provided by the present invention, text lines in a document can be extracted with relatively high accuracy.

Brief description of the drawings

Other features and advantages of the present invention will be more easily understood by reading its embodiments with reference to the accompanying drawings. The drawings described here serve only to illustrate embodiments of the invention schematically; they do not show all possible implementations and are not intended to limit the scope of the invention. In the drawings:

Fig. 1 shows a flowchart of a method for extracting text lines in a document according to one embodiment of the present invention.

Fig. 2 shows a flowchart of determining the direction of a document according to one embodiment of the present invention.

Fig. 3 shows a flowchart of fine clustering according to one embodiment of the present invention.

Fig. 4 shows a flowchart of updating the class to which each text block belongs according to one embodiment of the present invention.

Fig. 5 shows a flowchart of coarse clustering according to one embodiment of the present invention.

Fig. 6 shows a flowchart of merging each text block and its closest text block into the same class according to one embodiment of the present invention.

Fig. 7 shows a block diagram of a device for extracting text lines in a document according to one embodiment of the present invention.

Fig. 8 shows a block diagram of a document direction determination unit according to one embodiment of the present invention.

Fig. 9 shows a block diagram of a fine clustering unit according to one embodiment of the present invention.

Fig. 10 shows a block diagram of a clustering subunit according to one embodiment of the present invention.

Fig. 11 shows a block diagram of a coarse clustering unit according to one embodiment of the present invention.

Fig. 12 shows a block diagram of the clustering subunit of the coarse clustering unit according to one embodiment of the present invention.

Fig. 13 shows a schematic block diagram of a computer that can be used to implement the method and device according to embodiments of the present invention.

Detailed description

Embodiments of the present invention will now be described in detail with reference to the accompanying drawings. The following description is exemplary only and is not intended to limit the invention. In the description, the same reference numerals designate the same or similar components in different drawings. Different features of the different embodiments described below can be combined with one another to form further embodiments within the scope of the invention.

In the description of this application, a "text line" is a line formed by text characters in a document. The term "line" here does not imply a direction: it can denote either a horizontal line or a vertical line.

In the description of this application, the "direction of the document" is the overall reading direction of the document, that is, the direction in which the text lines are arranged relative to one another. For example, the direction of the present document is vertical, because its text lines are arranged one after another in the vertical direction. If the characters within each text line are arranged vertically, so that the text lines are arranged side by side horizontally, the direction of the document is horizontal.

In the description of this application, "text blocks" are the blocks into which the whole document is divided during preprocessing and which are used in subsequent processing. The division can be based on various methods in the prior art, for example connected component analysis of the document or histogram-based segmentation. Depending on the segmentation method chosen, each resulting text block may contain part of a character, one or more characters, or a combination of these. Those skilled in the art can divide the document into text blocks in a suitable manner according to actual needs and the actual condition of the document (for example the character size, character spacing, and line spacing).

Fig. 1 shows a flowchart of a method for extracting text lines in a document according to one embodiment of the present invention. As shown in Fig. 1, the method 100 for extracting text lines in a document comprises steps S110 to S140. In step S110, coarse clustering, that is, preliminary clustering, is performed on the text blocks in the document to form a plurality of classes. Each class so formed is a set of text blocks and may contain one or more text blocks. In step S120, the features of each class formed in step S110 are calculated. In this application, the features of a class characterize the position information of the class, as detailed below. In step S130, the direction of the document is determined from the class features calculated in step S120. Once the direction is determined, in step S140 the text blocks in the document are finely clustered according to the direction determined in step S130, and each class obtained by the fine clustering is output as one text line; the text lines are thereby extracted.

According to this embodiment of the present invention, text lines are extracted from the document through coarse clustering followed by fine clustering, for use in subsequent operations such as character extraction and/or recognition. The method extracts text lines from the document correctly and with a low error rate, and the whole process requires no human supervision or intervention, which makes it easy to operate.

According to another embodiment of the present invention, each text block in the document has position information, and in step S120 the features of each class can be calculated from the position information of the text blocks it contains. According to one embodiment, the position information of each text block may include horizontal position information and vertical position information. Correspondingly, the features of each class may include a horizontal feature and a vertical feature, which characterize the horizontal and vertical position information of the class respectively. In step S120, the horizontal feature and the vertical feature of each class are calculated from the horizontal position information and the vertical position information, respectively, of the text blocks contained in the class.

The horizontal feature of each class may include the mean and standard deviation of the horizontal position information of all text blocks in the class, and the vertical feature of each class may include the mean and standard deviation of the vertical position information of all text blocks in the class. According to one embodiment, the horizontal position information of each text block may include the uppermost position, the lowermost position, and/or the central horizontal position of the text block, and indicates where the text block lies in the document in the horizontal direction. For example, when the horizontal position information of each text block includes its uppermost and lowermost positions, the horizontal feature of each class may include the mean and standard deviation of the uppermost positions of all text blocks in the class, together with the mean and standard deviation of their lowermost positions. Similarly, the vertical position information of each text block may include the leftmost position, the rightmost position, and/or the central vertical position of the text block, and indicates where the text block lies in the document in the vertical direction.
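The class features described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Block` type, the `edge_stats` and `class_features` helpers, and the use of the population standard deviation are assumptions. The sketch follows the text's naming, in which the horizontal feature is built from the uppermost and lowermost positions and the vertical feature from the leftmost and rightmost positions.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Block:
    """Axis-aligned bounding box of one text block (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def edge_stats(values):
    """Return (mean, standard deviation) of a list of edge coordinates.

    The population standard deviation is an assumption; the text does not
    say which estimator is used.
    """
    return mean(values), pstdev(values)


def class_features(blocks):
    """Compute the horizontal and vertical features of one class.

    Following the text's naming, the horizontal feature uses the top and
    bottom edges, and the vertical feature uses the left and right edges.
    """
    return {
        "horizontal": {
            "top": edge_stats([b.top for b in blocks]),
            "bottom": edge_stats([b.bottom for b in blocks]),
        },
        "vertical": {
            "left": edge_stats([b.left for b in blocks]),
            "right": edge_stats([b.right for b in blocks]),
        },
    }


# Three blocks that sit side by side on the same row.
row = [Block(0, 10, 8, 20), Block(10, 10, 18, 20), Block(20, 10, 28, 20)]
features = class_features(row)
print(features["horizontal"]["top"])
print(features["vertical"]["left"])
```

For blocks on one row, the top and bottom edges barely vary while the left and right edges spread out, which is the contrast the direction test relies on.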

Fig. 2 shows a flowchart of determining the direction of a document according to one embodiment of the present invention. As shown in Fig. 2, step S130 comprises sub-steps S210 to S230.

In sub-step S210, the average over all classes of the standard deviation in the horizontal feature is compared with the average over all classes of the standard deviation in the vertical feature. If the standard deviations of the horizontal features are large relative to those of the vertical features, the vertical positions of the text blocks within each class are relatively close to one another. Conversely, if the standard deviations of the horizontal features are small relative to those of the vertical features, the horizontal positions of the text blocks within each class are relatively close to one another.

For example, when the horizontal position information of each text block includes its uppermost and lowermost positions and the vertical position information includes its leftmost and rightmost positions, the horizontal feature of each class includes the mean and standard deviation of the uppermost positions and the mean and standard deviation of the lowermost positions of all text blocks in the class, and the vertical feature includes the mean and standard deviation of the leftmost positions and the mean and standard deviation of the rightmost positions of all text blocks in the class. In sub-step S210, the standard deviation in the horizontal feature of each class takes into account both the standard deviation of the uppermost positions and the standard deviation of the lowermost positions; for example, it may be the sum or the average of the two. Similarly, the standard deviation in the vertical feature of each class may be the sum or the average of the standard deviation of the leftmost positions and the standard deviation of the rightmost positions.

Therefore, when the average over all classes of the standard deviation in the horizontal feature is greater than the average of the standard deviation in the vertical feature, the direction of the document is determined to be horizontal in sub-step S220; when it is smaller, the direction of the document is determined to be vertical in sub-step S230.
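The comparison in sub-steps S210 to S230 can be sketched as below. The `Block` type and function names are hypothetical, the population standard deviation is an assumption, and the per-class standard deviation is taken as the average of the two edge deviations (the text allows the sum or the average). The text does not say what happens when the two averages are equal, so the sketch returns `None` in that case.

```python
from statistics import mean, pstdev
from typing import NamedTuple


class Block(NamedTuple):
    """Bounding box of one text block (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def horizontal_std(blocks):
    """Standard deviation in the horizontal feature (top and bottom edges)."""
    return (pstdev([b.top for b in blocks]) + pstdev([b.bottom for b in blocks])) / 2


def vertical_std(blocks):
    """Standard deviation in the vertical feature (left and right edges)."""
    return (pstdev([b.left for b in blocks]) + pstdev([b.right for b in blocks])) / 2


def document_direction(classes):
    """Compare the class-averaged deviations as in sub-steps S210 to S230."""
    h = mean([horizontal_std(c) for c in classes])
    v = mean([vertical_std(c) for c in classes])
    if h > v:
        return "horizontal"  # S220
    if h < v:
        return "vertical"    # S230
    return None              # tie: not specified in the text


# Two classes whose blocks sit side by side (ordinary horizontal writing).
rows = [
    [Block(0, 10, 8, 20), Block(10, 10, 18, 20)],
    [Block(0, 30, 8, 40), Block(10, 30, 18, 40)],
]
# Two classes whose blocks are stacked (vertical writing).
columns = [
    [Block(10, 0, 20, 8), Block(10, 10, 20, 18)],
    [Block(30, 0, 40, 8), Block(30, 10, 40, 18)],
]
print(document_direction(rows), document_direction(columns))
```

With the definition of document direction used in this description, ordinary horizontally written text yields a vertical document direction, because its text lines are stacked one below another.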

Fig. 3 shows a flowchart of fine clustering according to one embodiment of the present invention. As shown in Fig. 3, step S140 comprises sub-steps S310 to S350.

In sub-step S310, the association value between each text block and each class is calculated according to the direction of the document determined in step S130. In this application, the association value between a text block and a class represents how close the text block is to the class in position.

According to one embodiment, the association value between a text block and a class is calculated from the position information of the text block and the features of the class. For example, when the direction of the document is horizontal, the difference between the vertical position information of the text block and the mean in the vertical feature of the class is taken as the association value between the text block and the class. Conversely, when the direction of the document is vertical, the difference between the horizontal position information of the text block and the mean in the horizontal feature of the class is taken as the association value.

For example, as described above, when the horizontal position information of each text block includes its uppermost and lowermost positions and the vertical position information includes its leftmost and rightmost positions, the horizontal feature of each class includes the mean and standard deviation of the uppermost positions and the mean and standard deviation of the lowermost positions of all text blocks in the class, and the vertical feature includes the mean and standard deviation of the leftmost positions and the mean and standard deviation of the rightmost positions. In sub-step S310, when the direction of the document is horizontal, the association value between a text block and a class is the sum, the average, or the larger of two differences: the difference between the leftmost position of the text block and the mean of the leftmost positions of all text blocks in the class, and the difference between the rightmost position of the text block and the mean of the rightmost positions of all text blocks in the class. Similarly, when the direction of the document is vertical, the association value is the sum, the average, or the larger of the difference between the uppermost position of the text block and the mean of the uppermost positions of all text blocks in the class, and the difference between the lowermost position of the text block and the mean of the lowermost positions of all text blocks in the class.
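A minimal sketch of the association value of sub-step S310 follows. The `Block` type, the `association` function, and the `combine` parameter are hypothetical. Taking the absolute value of each difference is an assumption, made so that a smaller value always means a closer class; the text only says "difference".

```python
from statistics import mean
from typing import NamedTuple


class Block(NamedTuple):
    """Bounding box of one text block (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def association(block, members, direction, combine=max):
    """Association value between a block and a class given by its members.

    combine may be sum, statistics.mean, or max, matching the three options
    in the text. Absolute differences are an assumption.
    """
    if direction == "horizontal":
        # Compare against the means in the vertical feature (left/right edges).
        diffs = [
            abs(block.left - mean([m.left for m in members])),
            abs(block.right - mean([m.right for m in members])),
        ]
    elif direction == "vertical":
        # Compare against the means in the horizontal feature (top/bottom edges).
        diffs = [
            abs(block.top - mean([m.top for m in members])),
            abs(block.bottom - mean([m.bottom for m in members])),
        ]
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return combine(diffs)


line = [Block(0, 10, 8, 20), Block(10, 10, 18, 20)]
near = Block(20, 12, 28, 22)   # slightly lower than the line
far = Block(20, 30, 28, 40)    # one line further down
print(association(near, line, "vertical"), association(far, line, "vertical"))
```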

Subsequently, in sub-step S320, the class to which each text block belongs is updated according to the calculated association values between the text blocks and the classes. The result of the coarse clustering in step S110 is often not accurate enough; for example, text blocks that belong to two different text lines may have been clustered into one class. Here, the association values between each text block and the classes indicate how close the text block is to each class in position, so that the class to which the text block should belong can be updated.

In sub-step S330, the features of each class are updated, and classes that no longer contain any text block are deleted. After the class of each text block has been updated in sub-step S320, the class to which a text block belongs may have changed; that is, the text blocks contained in one or more classes may have changed, and some class may no longer contain any text block. The features of each class are therefore recalculated, and classes that no longer contain any text block are deleted. The number of classes thus changes dynamically during fine clustering, which helps to reach the final clustering result accurately and quickly.

In sub-step S340, it is determined whether the class of every text block has stopped changing. If so, each resulting class is extracted as one text line (S350); if not, the process returns to sub-step S310 and fine clustering continues.
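The loop formed by sub-steps S310 to S350 can be sketched in simplified form. Everything here is an assumption made for illustration: each block is reduced to a single coordinate (the one that matters for the detected direction), the association value is the absolute distance to the class mean, the threshold is a coefficient times the average class standard deviation, a newly created class becomes visible to later blocks in the same pass, and convergence is tested on the grouping rather than on class ids. The names `fine_cluster`, `partition`, `coeff`, and `max_iter` are hypothetical.

```python
from statistics import mean, pstdev


def partition(labels):
    """Group block indices by label so groupings can be compared across passes."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    return sorted(groups.values())


def fine_cluster(positions, labels, coeff=2.0, max_iter=50):
    """Iteratively reassign blocks to classes until the grouping is stable."""
    labels = list(labels)
    for _ in range(max_iter):
        # S330: recompute features; empty classes vanish because groups are
        # rebuilt from the current labels.
        groups = {}
        for pos, label in zip(positions, labels):
            groups.setdefault(label, []).append(pos)
        means = {label: mean(vals) for label, vals in groups.items()}
        threshold = coeff * mean([pstdev(vals) for vals in groups.values()])
        next_id = max(groups) + 1

        new_labels = []
        for pos in positions:
            # S310: association value = distance to each class mean.
            best = min(means, key=lambda label: abs(pos - means[label]))
            if abs(pos - means[best]) < threshold:
                new_labels.append(best)       # join the closest class
            else:
                means[next_id] = pos          # new class, visible to later blocks
                new_labels.append(next_id)
                next_id += 1

        # S340: stop once no block changes its grouping.
        if partition(new_labels) == partition(labels):
            break
        labels = new_labels
    return labels


# Four text lines; coarse clustering wrongly merged the first two into class 0.
positions = [10, 11, 9, 30, 31, 29, 50, 51, 49, 70, 71, 69]
coarse_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
result = fine_cluster(positions, coarse_labels)
print(result)
```

In this example the merged class has a much larger spread than the others, so its blocks fall outside the threshold and are regrouped into two separate classes, while the correctly clustered lines are left unchanged.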

Fig. 4 shows a flowchart of updating the class to which each text block belongs according to one embodiment of the present invention. As shown in Fig. 4, sub-step S320 comprises sub-steps S410 to S450.

In sub-step S410, a threshold is calculated from the determined direction of the document and the features of all classes. According to one embodiment, when the direction of the document is horizontal, the average over all classes of the standard deviation in the vertical feature is multiplied by a preset coefficient, and the product is used as the threshold. When the direction of the document is vertical, the average over all classes of the standard deviation in the horizontal feature is multiplied by the preset coefficient, and the product is used as the threshold.

For example, as described above, when the horizontal position information of each text block includes its uppermost and lowermost positions and the vertical position information includes its leftmost and rightmost positions, the horizontal feature of each class includes the mean and standard deviation of the uppermost positions and the mean and standard deviation of the lowermost positions of all text blocks in the class, and the vertical feature includes the mean and standard deviation of the leftmost positions and the mean and standard deviation of the rightmost positions. In sub-step S410, when the direction of the document is horizontal, two averages are formed over all classes: the average of the standard deviations of the leftmost positions and the average of the standard deviations of the rightmost positions. Their sum, their average, or the larger of the two is multiplied by the preset coefficient, and the product is used as the threshold. Similarly, when the direction of the document is vertical, the sum, the average, or the larger of the average standard deviation of the uppermost positions and the average standard deviation of the lowermost positions over all classes is multiplied by the preset coefficient, and the product is used as the threshold.
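The threshold of sub-step S410 can be sketched as follows. The `Block` type, the `clustering_threshold` function, and its parameters are hypothetical; the population standard deviation and the default coefficient of 2 are assumptions chosen for the example.

```python
from statistics import mean, pstdev
from typing import NamedTuple


class Block(NamedTuple):
    """Bounding box of one text block (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def clustering_threshold(classes, direction, coeff=2.0, combine=max):
    """Threshold = coeff * combine(two class-averaged standard deviations).

    combine may be sum, statistics.mean, or max, matching the text's options.
    """
    if direction == "horizontal":
        # Vertical feature: left and right edges.
        first = mean([pstdev([b.left for b in c]) for c in classes])
        second = mean([pstdev([b.right for b in c]) for c in classes])
    elif direction == "vertical":
        # Horizontal feature: top and bottom edges.
        first = mean([pstdev([b.top for b in c]) for c in classes])
        second = mean([pstdev([b.bottom for b in c]) for c in classes])
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return coeff * combine([first, second])


classes = [
    [Block(0, 10, 8, 20), Block(10, 10, 18, 20)],   # left edges 0 and 10
    [Block(0, 30, 8, 40), Block(20, 30, 28, 40)],   # left edges 0 and 20
]
threshold_h = clustering_threshold(classes, "horizontal")
threshold_v = clustering_threshold(classes, "vertical")
print(threshold_h, threshold_v)
```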

A text block has one association value with each class. In sub-step S420, the minimum association value of the text block is determined, together with the class that has this minimum association value with the text block.

In sub-step S430, the minimum association value determined in sub-step S420 is compared with the threshold calculated in sub-step S410. When the minimum association value is smaller than the threshold, the text block is clustered, in sub-step S440, into the class that has the minimum association value with it. When the minimum association value is greater than or equal to the threshold, a new class is created in sub-step S450 and the text block is clustered into the new class. The class to which a text block belongs is thus affected by the threshold, while the value of the threshold in turn depends on the current assignment of text blocks to classes. The method according to the invention is therefore a dynamically adjusted iterative process, and the number of classes also changes dynamically while the class of each text block is updated. Those skilled in the art can select a suitable preset coefficient according to actual needs and the actual condition of the documents being processed. According to one embodiment, the preset coefficient is a real number in the range 1 to 3, for example 2.
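The decision made for a single text block in sub-steps S420 to S450 can be sketched as below. The function name `assign_block`, the dictionary representation of the association values, and the `next_class_id` argument are hypothetical.

```python
def assign_block(associations, threshold, next_class_id):
    """Return the class id for one block.

    associations maps each existing class id to the block's association
    value with that class. If the smallest value is below the threshold the
    block joins that class (S440); otherwise it gets a new class (S450).
    """
    best = min(associations, key=associations.get)   # S420
    if associations[best] < threshold:                # S430
        return best
    return next_class_id


# The block is 3.0 away from class 1, which is inside a threshold of 5.0.
print(assign_block({0: 12.0, 1: 3.0}, threshold=5.0, next_class_id=2))
# No class is closer than the threshold, so a new class id is used.
print(assign_block({0: 12.0, 1: 7.0}, threshold=5.0, next_class_id=2))
```

Because the comparison is strict, a block whose minimum association value equals the threshold is placed in a new class, as the text specifies.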

Fig. 5 shows a flowchart of coarse clustering according to one embodiment of the present invention. As shown in Fig. 5, step S110 comprises sub-steps S510 and S520. In sub-step S510, the distance between every two text blocks is calculated. Those skilled in the art will understand that the distance between two text blocks can be calculated by any suitable method in the prior art, for example as the distance between the two closest points of the two text blocks, or as the distance between their center points. Subsequently, in sub-step S520, each text block and the text block closest to it are merged into the same class according to the distances calculated in sub-step S510.
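The two example distances mentioned for sub-step S510 can be sketched for axis-aligned bounding boxes. The `Block` type and function names are hypothetical, and the sketch assumes image coordinates in which `top` is smaller than `bottom`.

```python
import math
from typing import NamedTuple


class Block(NamedTuple):
    """Axis-aligned bounding box of one text block (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def center_distance(a, b):
    """Distance between the center points of two blocks."""
    ax, ay = (a.left + a.right) / 2, (a.top + a.bottom) / 2
    bx, by = (b.left + b.right) / 2, (b.top + b.bottom) / 2
    return math.hypot(ax - bx, ay - by)


def closest_point_distance(a, b):
    """Distance between the two closest points of two boxes (0 if they overlap)."""
    dx = max(a.left - b.right, b.left - a.right, 0)
    dy = max(a.top - b.bottom, b.top - a.bottom, 0)
    return math.hypot(dx, dy)


a = Block(0, 0, 10, 10)
b = Block(13, 4, 20, 14)    # to the right of a, overlapping it vertically
c = Block(13, 14, 20, 20)   # to the right of and below a
print(closest_point_distance(a, b), closest_point_distance(a, c))
print(center_distance(a, b))
```

The closest-point distance ignores block size, which can matter when blocks of very different widths sit on the same line; which distance suits a given document is left open by the text.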

Fig. 6 shows a flowchart of merging each text block and its closest text block into the same class according to one embodiment of the present invention. As shown in Fig. 6, sub-step S520 comprises sub-steps S610 to S630. In sub-step S610, for a text block that has not yet been clustered into any class, it is determined whether its closest text block has already been clustered into a class. If so, in sub-step S620 the text block is clustered into the class of its closest text block. If not, in sub-step S630 a new class is created, and the text block and its closest text block are both clustered into the new class. The number of classes thus also changes dynamically during coarse clustering.
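Sub-steps S610 to S630 can be sketched as a single pass over the text blocks. This is an illustration under assumptions: each block is represented by its center point, the distance is Euclidean, blocks are visited in input order, and a lone block with no neighbor gets its own class. The name `coarse_cluster` is hypothetical.

```python
import math


def coarse_cluster(points):
    """Assign a class id to each point by merging it with its nearest neighbor."""
    labels = [None] * len(points)
    next_id = 0
    for i, point in enumerate(points):
        if labels[i] is not None:
            continue  # already clustered together with an earlier block
        others = [j for j in range(len(points)) if j != i]
        nearest = min(others, key=lambda j: math.dist(point, points[j]), default=None)
        if nearest is not None and labels[nearest] is not None:
            labels[i] = labels[nearest]          # S620: join the neighbor's class
        else:
            labels[i] = next_id                  # S630: create a new class
            if nearest is not None:
                labels[nearest] = next_id
            next_id += 1
    return labels


# Centers of five blocks: three on one line, two on a line far below.
centers = [(0, 0), (1, 0), (2, 0), (0, 50), (1, 50)]
labels = coarse_cluster(centers)
print(labels)
```

Because each block only looks at its single nearest neighbor, blocks from adjacent lines can end up in one class when lines are tightly spaced, which is the kind of error the later fine clustering is meant to correct.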

图7示出了根据本发明一个实施方式的提取文档中的文本行的装置的框图。如图7所示，该装置700包括粗聚类单元710、类特征计算单元720、文档方向确定单元730和精细聚类单元740。粗聚类单元710可对文档中的多个文本块进行粗聚类，以形成多个类。类特征计算单元720可计算类的特征。文档方向确定单元730可根据类特征计算单元720所计算出的类的特征，确定出文档的方向。精细聚类单元740可根据文档方向确定单元730所确定的文档的方向，对多个文本块进行精细聚类，以提取出文本行。Fig. 7 shows a block diagram of an apparatus for extracting text lines in a document according to an embodiment of the present invention. As shown in FIG. 7 , the device 700 includes a rough clustering unit 710 , a class feature calculation unit 720 , a document direction determination unit 730 and a fine clustering unit 740 . The rough clustering unit 710 can perform rough clustering on multiple text blocks in the document to form multiple clusters. The class feature calculation unit 720 may calculate features of a class. The document orientation determination unit 730 can determine the orientation of the document according to the class features calculated by the class feature calculation unit 720 . The fine clustering unit 740 may perform fine clustering on multiple text blocks according to the direction of the document determined by the document direction determining unit 730 to extract text lines.

根据本发明的另一实施方式，文档中的每个文本块均具有位置信息，类特征计算单元720可根据类中所包含的文本块的位置信息，计算该类的特征。According to another embodiment of the present invention, each text block in the document has position information, and the class feature calculation unit 720 can calculate the feature of the class according to the position information of the text blocks included in the class.

According to one example, the position information of a text block includes horizontal position information and vertical position information. The horizontal position information of a text block includes the uppermost position, the lowermost position and/or the central horizontal position of the text block, and the vertical position information of a text block includes the leftmost position, the rightmost position and/or the central vertical position of the text block. Correspondingly, the features of a class include a horizontal feature and a vertical feature. The horizontal feature of a class includes the mean and standard deviation of the horizontal position information of all text blocks contained in the class, and the vertical feature of a class includes the mean and standard deviation of the vertical position information of all text blocks contained in the class.
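The class features described above amount to a mean and a standard deviation per axis. The sketch below assumes each text block has already been reduced to one horizontal position value `h` and one vertical position value `v` (how these are derived from the block's edges follows the definitions in the text). The function name, the dictionary keys, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def class_features(blocks):
    """Compute the horizontal and vertical features of one class.

    blocks: list of (h, v) pairs, one per text block in the class, where h is
    the block's horizontal position value and v its vertical position value.
    """
    hs = [h for h, _ in blocks]
    vs = [v for _, v in blocks]
    return {
        "h_mean": mean(hs),   # horizontal feature: mean ...
        "h_std": pstdev(hs),  # ... and standard deviation (population form assumed)
        "v_mean": mean(vs),   # vertical feature: mean ...
        "v_std": pstdev(vs),  # ... and standard deviation
    }
```

A class containing a single block gets a standard deviation of zero on both axes under this choice.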

Fig. 8 shows a block diagram of the document direction determination unit according to an embodiment of the present invention. As shown in Fig. 8, the document direction determination unit 730 includes a comparison subunit 810 and a determination subunit 820. The comparison subunit 810 may compare the average of the standard deviations in the horizontal features of all classes with the average of the standard deviations in the vertical features of all classes. The determination subunit 820 may determine, according to the comparison result of the comparison subunit 810, whether the direction of the document is horizontal or vertical. Specifically, when the average of the standard deviations in the horizontal features of all classes is greater than the average of the standard deviations in the vertical features, the determination subunit 820 determines that the direction of the document is horizontal; when the average of the standard deviations in the horizontal features of all classes is smaller than the average of the standard deviations in the vertical features, the determination subunit 820 determines that the direction of the document is vertical.
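The comparison performed by subunits 810 and 820 can be illustrated with a short Python sketch. The function name and the string return values are hypothetical. The text does not say what happens when the two averages are equal, so the sketch returns `None` in that case rather than guessing.

```python
from statistics import mean


def document_direction(h_stds, v_stds):
    """Decide the document direction from per-class standard deviations.

    h_stds: standard deviations from the horizontal features of all classes.
    v_stds: standard deviations from the vertical features of all classes.
    """
    mean_h = mean(h_stds)
    mean_v = mean(v_stds)
    if mean_h > mean_v:
        return "horizontal"
    if mean_h < mean_v:
        return "vertical"
    return None  # equal averages: behaviour not specified in the text
```

For example, `document_direction([30.0, 28.0], [1.5, 2.0])` yields `"horizontal"`, since the classes spread far more in their horizontal position values than in their vertical ones.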

Fig. 9 shows a block diagram of the fine clustering unit according to an embodiment of the present invention. As shown in Fig. 9, the fine clustering unit 740 includes an association value calculation subunit 910, a clustering subunit 920 and a class update subunit 930. The association value calculation subunit 910 may calculate the association value between a text block and each class according to the direction of the document determined by the document direction determination unit 730. The clustering subunit 920 may update the class to which the text block belongs according to the association values calculated by the association value calculation subunit 910. Because the class to which a text block belongs may change, the features of at least one of the classes may change as well. The class update subunit 930 may update the features of each class and delete any class that no longer contains any text block.

According to one embodiment, the association value calculation subunit 910 may calculate the association value between a text block and a class according to the position information of the text block and the features of the class.

Optionally, when the document direction determination unit 730 determines that the direction of the document is horizontal, the association value calculation subunit 910 calculates the difference between the vertical position information of the text block and the mean in the vertical feature of the class as the association value between the text block and the class. Conversely, when the document direction determination unit 730 determines that the direction of the document is vertical, the association value calculation subunit 910 calculates the difference between the horizontal position information of the text block and the mean in the horizontal feature of the class as the association value between the text block and the class.
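As a rough illustration, the association value can be written as a small Python function. The names `association`, `h`, `v`, `h_mean` and `v_mean` are hypothetical. The text only says "difference"; taking the absolute value is an assumption made here so that a smaller value always means a closer match.

```python
def association(block, cls, direction):
    """Association value between one text block and one class.

    block: {"h": ..., "v": ...} position values of the text block.
    cls:   {"h_mean": ..., "v_mean": ...} means from the class features.
    direction: "horizontal" or "vertical" (direction of the document).
    """
    if direction == "horizontal":
        # horizontal document: compare along the vertical position
        return abs(block["v"] - cls["v_mean"])   # abs() is an assumption
    if direction == "vertical":
        # vertical document: compare along the horizontal position
        return abs(block["h"] - cls["h_mean"])   # abs() is an assumption
    raise ValueError("direction must be 'horizontal' or 'vertical'")
```

With a block at `{"h": 120.0, "v": 42.0}` and a class whose means are `{"h_mean": 300.0, "v_mean": 40.0}`, the value is 2.0 for a horizontal document and 180.0 for a vertical one.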

Fig. 10 shows a block diagram of the clustering subunit according to an embodiment of the present invention. As shown in Fig. 10, the clustering subunit 920 includes a threshold calculation module 1010, a minimum association value determination module 1020, a class creation module 1030 and a clustering module 1040. The threshold calculation module 1010 may calculate a threshold according to the direction of the document and the features of all classes. The class creation module 1030 may be used to create classes. For a given text block, the minimum association value determination module 1020 may determine the minimum association value of the text block and the class with which the text block has that minimum association value. If the minimum association value is smaller than the threshold, the clustering module 1040 may cluster the text block into the class determined by the minimum association value determination module 1020; if the minimum association value is greater than or equal to the threshold, the clustering module 1040 may cluster the text block into a class created by the class creation module 1030.

According to one embodiment, when the document direction determination unit 730 determines that the direction of the document is horizontal, the threshold calculation module 1010 takes as the threshold the average of the standard deviations in the vertical features of all classes multiplied by a preset coefficient. Conversely, when the document direction determination unit 730 determines that the direction of the document is vertical, the threshold calculation module 1010 takes as the threshold the average of the standard deviations in the horizontal features of all classes multiplied by a preset coefficient.
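The threshold computation and the resulting join-or-create decision can be sketched as follows. All names are illustrative, the value of the preset coefficient is not given in the text and is therefore left as a required parameter, and association values are assumed to be absolute differences along the axis selected by the document direction.

```python
from statistics import mean


def threshold(classes, direction, coefficient):
    """Threshold = coefficient * average standard deviation on the relevant axis.

    classes: list of dicts with "h_std" and "v_std" entries (class features).
    """
    key = "v_std" if direction == "horizontal" else "h_std"
    return coefficient * mean(c[key] for c in classes)


def assign(position, class_means, thr):
    """Return the index of the class the block should join, or None.

    position: the block's position value on the relevant axis.
    class_means: the corresponding mean of each existing class.
    None signals that a new class should be created for the block.
    """
    values = [abs(position - m) for m in class_means]  # association values
    best = min(range(len(values)), key=values.__getitem__)
    return best if values[best] < thr else None
```

Note the strict comparison: a block whose minimum association value equals the threshold is placed in a newly created class, as the text specifies.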

Fig. 11 shows a block diagram of the coarse clustering unit according to an embodiment of the present invention. As shown in Fig. 11, the coarse clustering unit 710 includes a distance calculation subunit 1110 and a clustering subunit 1120. The distance calculation subunit 1110 may calculate the distance between every two text blocks. The clustering subunit 1120 may, according to the distances calculated by the distance calculation subunit 1110, merge each text block and the text block closest to it into the same class.
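The role of the distance calculation subunit can be illustrated with a pairwise distance table. This passage does not state which distance is used, so the sketch below assumes the Euclidean distance between block centres; the function name and the `(x, y)` centre representation are likewise hypothetical. It expects at least two text blocks.

```python
from math import hypot


def nearest_blocks(centers):
    """Compute all pairwise distances and each block's closest neighbour.

    centers: list of (x, y) block centres (assumed representation).
    Returns (dist, nearest): dist[i][k] is the distance between blocks i and k,
    and nearest[i] is the index of the block closest to block i.
    """
    dist = [[hypot(ax - bx, ay - by) for (bx, by) in centers]
            for (ax, ay) in centers]
    nearest = []
    for i in range(len(centers)):
        others = [k for k in range(len(centers)) if k != i]
        nearest.append(min(others, key=lambda k: dist[i][k]))
    return dist, nearest
```

The `nearest` list is the only information the merging step needs: each block is merged into the same class as the block at `nearest[i]`.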

Fig. 12 shows a block diagram of the clustering subunit of the coarse clustering unit according to an embodiment of the present invention. As shown in Fig. 12, the clustering subunit 1120 includes a class creation module 1210 and a clustering module 1220. The class creation module 1210 may be used to create classes. For a text block that has not been clustered into any class, when the text block closest to it has already been clustered into a class, the clustering module 1220 clusters the text block into that class as well. When the text block closest to it has not been clustered into any class either, the clustering module 1220 clusters the text block and its closest text block into a class created by the class creation module 1210.

It should also be noted that each component of the above apparatus may be configured by software, firmware, hardware or a combination thereof. The specific means or manners available for such configuration are well known to those skilled in the art and are not described again here. In the case of implementation by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 1300 shown in Fig. 13), and the computer, when various programs are installed on it, is capable of performing various functions.

In Fig. 13, a central processing unit (CPU) 1301 executes various processes according to a program stored in a read-only memory (ROM) 1302 or a program loaded from a storage section 1308 into a random access memory (RAM) 1303. The RAM 1303 also stores, as needed, data required when the CPU 1301 executes the various processes. The CPU 1301, the ROM 1302 and the RAM 1303 are connected to one another via a bus 1304. An input/output interface 1305 is also connected to the bus 1304.

The following components are connected to the input/output interface 1305: an input section 1306 (including a keyboard, a mouse and the like), an output section 1307 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker and the like), a storage section 1308 (including a hard disk and the like), and a communication section 1309 (including a network interface card such as a LAN card, a modem and the like). The communication section 1309 performs communication processing via a network such as the Internet. A drive 1310 may also be connected to the input/output interface 1305 as needed. A removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 1310 as needed, so that a computer program read from it is installed into the storage section 1308 as needed.

In the case where the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1311.

Those skilled in the art should understand that such a storage medium is not limited to the removable medium 1311 shown in Fig. 13, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1311 include magnetic disks (including floppy disks (registered trademark)), optical disks (including compact disc read-only memory (CD-ROM) and digital versatile discs (DVD)), magneto-optical disks (including MiniDisc (MD) (registered trademark)) and semiconductor memories. Alternatively, the storage medium may be the ROM 1302, a hard disk contained in the storage section 1308, or the like, in which the program is stored and which is distributed to the user together with the device containing it.

The present invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above method according to the embodiments of the present invention can be performed.

Correspondingly, a storage medium carrying the above program product storing machine-readable instruction code is also included in the scope of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick and the like.

It should be noted that the method of the present invention is not limited to being performed in the chronological order described in the specification, and may also be performed in another chronological order, in parallel, or independently. Therefore, the order of execution of the method described in this specification does not limit the technical scope of the present invention.

The above description of the embodiments of the present invention is provided for a better understanding of the present invention; it is merely exemplary and is not intended to limit the present invention. It should be noted that, in the above description, features described and/or illustrated for one embodiment may be used in the same or a similar manner in one or more other embodiments, may be combined with features of other embodiments, or may replace features of other embodiments. Those skilled in the art will understand that various changes and modifications made to the embodiments described above without departing from the inventive concept of the present invention fall within the scope of the present invention.

In summary, the embodiments according to the present invention provide the following technical schemes.

Scheme 1. A method of extracting text lines in a document, comprising:

performing coarse clustering on a plurality of text blocks in the document to form a plurality of classes;

calculating the features of each class;

determining the direction of the document according to the features of each class; and

performing fine clustering on the plurality of text blocks according to the direction of the document to extract text lines.

Scheme 2. The method according to Scheme 1, wherein calculating the features of each class comprises: calculating the features of each class according to the position information of the text blocks contained in that class.

Scheme 3. The method according to Scheme 2, wherein

the position information of each text block includes horizontal position information and vertical position information, the horizontal position information of each text block includes the uppermost position, the lowermost position and/or the central horizontal position of the text block, and the vertical position information of each text block includes the leftmost position, the rightmost position and/or the central vertical position of the text block; and

the features of each class include a horizontal feature and a vertical feature, the horizontal feature of each class includes the mean and standard deviation of the horizontal position information of all text blocks contained in the class, and the vertical feature of each class includes the mean and standard deviation of the vertical position information of all text blocks contained in the class.

Scheme 4. The method according to Scheme 3, wherein determining the direction of the document comprises:

comparing the average of the standard deviations in the horizontal features of all classes in the plurality of classes with the average of the standard deviations in the vertical features;

determining that the direction of the document is horizontal when the average of the standard deviations in the horizontal features of all classes in the plurality of classes is greater than the average of the standard deviations in the vertical features; and

determining that the direction of the document is vertical when the average of the standard deviations in the horizontal features of all classes in the plurality of classes is smaller than the average of the standard deviations in the vertical features.

Scheme 5. The method according to any one of Schemes 1 to 4, wherein the fine clustering comprises:

a) for each text block, calculating, according to the direction of the document, an association value between the text block and each class;

b) updating, according to the calculated association values, the class to which each text block belongs among the plurality of classes;

c) updating the features of each class and deleting any class that no longer contains any text block; and

d) repeating a) to c) until the class of every text block no longer changes.

Scheme 6. The method according to Scheme 5, wherein the association value between each text block and each class is calculated according to the position information of the text block and the features of the class.

Scheme 7. The method according to Scheme 6, wherein

the position information of each text block includes horizontal position information and vertical position information, the horizontal position information of each text block includes the uppermost position, the lowermost position and/or the central horizontal position of the text block, and the vertical position information of each text block includes the leftmost position, the rightmost position and/or the central vertical position of the text block;

the features of each class include a horizontal feature and a vertical feature, the horizontal feature of each class includes the mean and standard deviation of the horizontal position information of all text blocks contained in the class, and the vertical feature of each class includes the mean and standard deviation of the vertical position information of all text blocks contained in the class;

when the direction of the document is determined to be horizontal, the association value between each text block and each class is the difference between the vertical position information of the text block and the mean in the vertical feature of the class; and

when the direction of the document is determined to be vertical, the association value between each text block and each class is the difference between the horizontal position information of the text block and the mean in the horizontal feature of the class.

Scheme 8. The method according to Scheme 7, wherein updating the class to which each text block belongs among the plurality of classes comprises:

calculating a threshold according to the direction of the document and the features of the plurality of classes;

determining the minimum association value of the text block and the class with which the text block has the minimum association value;

when the minimum association value is smaller than the threshold, clustering the text block into the class with which the text block has the minimum association value; and

when the minimum association value is greater than or equal to the threshold, creating a class and clustering the text block into the created class.

Scheme 9. The method according to Scheme 8, wherein

when the direction of the document is determined to be horizontal, the threshold is the average of the standard deviations in the vertical features of all classes multiplied by a preset coefficient, and

when the direction of the document is determined to be vertical, the threshold is the average of the standard deviations in the horizontal features of all classes multiplied by a preset coefficient.

Scheme 10. The method according to any one of Schemes 1 to 9, wherein the coarse clustering comprises:

calculating the distance between every two text blocks of the plurality of text blocks; and

merging, according to the calculated distances, each text block and the text block closest to it into the same class.

Scheme 11. The method according to Scheme 10, wherein merging each text block and the text block closest to it into the same class comprises:

for a text block that has not been clustered into any class,

when the text block closest to the text block has already been clustered into one of the plurality of classes, clustering the text block into that class as well; and

when the text block closest to the text block has not been clustered into any class, creating a class and clustering the text block and its closest text block into the created class.

Scheme 12. An apparatus for extracting text lines in a document, comprising:

a coarse clustering unit that performs coarse clustering on a plurality of text blocks in the document to form a plurality of classes;

a class feature calculation unit that calculates the features of the classes;

a document direction determination unit that determines the direction of the document according to the class features calculated by the class feature calculation unit; and

a fine clustering unit that performs fine clustering on the plurality of text blocks according to the direction of the document determined by the document direction determination unit, so as to extract text lines.

Scheme 13. The apparatus according to Scheme 12, wherein the class feature calculation unit calculates the features of a class according to the position information of the text blocks contained in the class.

Scheme 14. The apparatus according to Scheme 13, wherein

the position information of a text block includes horizontal position information and vertical position information, the horizontal position information of a text block includes the uppermost position, the lowermost position and/or the central horizontal position of the text block, and the vertical position information of a text block includes the leftmost position, the rightmost position and/or the central vertical position of the text block; and

the features of a class include a horizontal feature and a vertical feature, the horizontal feature of a class includes the mean and standard deviation of the horizontal position information of all text blocks contained in the class, and the vertical feature of a class includes the mean and standard deviation of the vertical position information of all text blocks contained in the class.

Scheme 15. The apparatus according to Scheme 14, wherein the document direction determination unit comprises:

a comparison subunit that compares the average of the standard deviations in the horizontal features of all classes in the plurality of classes with the average of the standard deviations in the vertical features; and

a determination subunit that determines, according to the comparison result of the comparison subunit, whether the direction of the document is horizontal or vertical.

Scheme 16. The apparatus according to any one of Schemes 12 to 15, wherein the fine clustering unit comprises:

an association value calculation subunit that calculates the association value between a text block and each class according to the direction of the document determined by the document direction determination unit;

a clustering subunit that updates, according to the association values calculated by the association value calculation subunit, the class to which the text block belongs among the plurality of classes; and

a class update subunit that updates the features of each class and deletes any class that no longer contains any text block.

Scheme 17. The apparatus according to Scheme 16, wherein the association value calculation subunit calculates the association value between a text block and a class according to the position information of the text block and the features of the class.

Scheme 18. The apparatus according to Scheme 17, wherein

the position information of a text block includes horizontal position information and vertical position information, the horizontal position information of a text block includes the uppermost position, the lowermost position and/or the central horizontal position of the text block, and the vertical position information of a text block includes the leftmost position, the rightmost position and/or the central vertical position of the text block;

the features of a class include a horizontal feature and a vertical feature, the horizontal feature of a class includes the mean and standard deviation of the horizontal position information of all text blocks contained in the class, and the vertical feature of a class includes the mean and standard deviation of the vertical position information of all text blocks contained in the class;

when the document direction determination unit determines that the direction of the document is horizontal, the association value calculation subunit calculates the difference between the vertical position information of the text block and the mean in the vertical feature of the class as the association value between the text block and the class; and

when the document direction determination unit determines that the direction of the document is vertical, the association value calculation subunit calculates the difference between the horizontal position information of the text block and the mean in the horizontal feature of the class as the association value between the text block and the class.

Scheme 19. The apparatus according to Scheme 18, wherein the clustering subunit comprises:

a threshold calculation module that calculates a threshold according to the direction of the document and the features of the plurality of classes;

a minimum association value determination module that determines the minimum association value of a text block and the class with which the text block has the minimum association value;

a class creation module for creating classes; and

a clustering module that clusters a text block whose minimum association value is smaller than the threshold into the class determined by the minimum association value determination module, and clusters a text block whose minimum association value is greater than or equal to the threshold into a class created by the class creation module.

Scheme 20. The apparatus according to Scheme 19, wherein

when the document direction determination unit determines that the direction of the document is horizontal, the threshold calculation module takes as the threshold the average of the standard deviations in the vertical features of all classes multiplied by a preset coefficient, and

when the document direction determination unit determines that the direction of the document is vertical, the threshold calculation module takes as the threshold the average of the standard deviations in the horizontal features of all classes multiplied by a preset coefficient.

Claims

1. A method of extracting text lines in a document, comprising:

Roughly cluster multiple text blocks in a document to form multiple classes;

Calculate the features of each class;

determining the orientation of the document based on the characteristics of each class; and

Carrying out fine clustering on the plurality of text blocks according to the direction of the document to extract text lines, wherein the fine clustering includes:

a) for each text block, calculating, according to the orientation of the document, an association value between the text block and each class;

b) updating, according to the calculated association values, the class to which each text block belongs among the plurality of classes;

c) updating the features of each class and removing classes that do not contain any text block; and

d) repeating a) to c) until the class of each text block no longer changes,

wherein the association value between each text block and each class is calculated according to the position information of the text block and the features of the class, and wherein:

the position information of each text block includes horizontal position information and vertical position information, the horizontal position information of each text block includes the uppermost position, the lowermost position and/or the central horizontal position of the text block, and the vertical position information of each text block includes the leftmost position, the rightmost position and/or the central vertical position of the text block;

the features of each class include horizontal features and vertical features, the horizontal features of each class include the average value and the standard deviation of the horizontal position information of all text blocks contained in the class, and the vertical features of each class include the average value and the standard deviation of the vertical position information of all text blocks contained in the class;

when it is determined that the orientation of the document is horizontal, the association value between each text block and each class is the difference between the vertical position information of the text block and the average value in the vertical features of the class; and

when it is determined that the orientation of the document is vertical, the association value between each text block and each class is the difference between the horizontal position information of the text block and the average value in the horizontal features of the class.
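Steps a) to d) above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `fine_cluster`, the use of a single scalar position per text block (taken along whichever axis the document orientation selects), the use of the absolute difference as the association value, and the `max_iter` safeguard are all assumptions introduced for this example.

```python
from statistics import fmean


def fine_cluster(positions, labels, max_iter=100):
    """Iteratively reassign text blocks to the nearest class mean.

    positions: one scalar per text block along the axis selected by the
               document orientation (assumption: one value per block).
    labels:    initial class index per block, e.g. from rough clustering.
    Returns the final list of class labels.
    """
    labels = list(labels)
    for _ in range(max_iter):  # max_iter is a safeguard, not part of the claim
        # c) class features: only labels still in use get a mean, so
        #    classes without any text block are dropped implicitly.
        members = {}
        for pos, lab in zip(positions, labels):
            members.setdefault(lab, []).append(pos)
        means = {lab: fmean(ps) for lab, ps in members.items()}

        # a) association value = |position - class mean| (assumed absolute)
        # b) each block moves to the class with the smallest association value
        new_labels = [
            min(means, key=lambda lab: abs(pos - means[lab]))
            for pos in positions
        ]

        # d) stop once no block changes class
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

For example, with positions `[10, 11, 12, 50, 51, 52]` and rough labels `[0, 0, 1, 1, 1, 1]`, the block at position 12 moves to class 0 after one pass and the labels then stop changing.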

2. The method according to claim 1, wherein calculating the features of each class comprises: calculating the features of each class according to the position information of the text blocks contained in that class.
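The class features named in claims 1 and 2 (an average value and a standard deviation per axis) could be computed along the following lines. The names `ClassFeatures` and `class_features` are hypothetical, each text block is assumed to contribute one horizontal and one vertical position value, and the population standard deviation is an assumption, since the claims do not say which form is used.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass
class ClassFeatures:
    h_mean: float  # average of the horizontal position information
    h_std: float   # standard deviation of the horizontal position information
    v_mean: float  # average of the vertical position information
    v_std: float   # standard deviation of the vertical position information


def class_features(blocks):
    """blocks: non-empty list of (h_pos, v_pos) tuples, one per text block."""
    h = [b[0] for b in blocks]
    v = [b[1] for b in blocks]
    # pstdev is the population standard deviation (an assumption here).
    return ClassFeatures(fmean(h), pstdev(h), fmean(v), pstdev(v))
```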

3. The method of claim 1 or 2, wherein determining the orientation of the document comprises:

comparing, for all classes in the plurality of classes, the average of the standard deviations in the horizontal features with the average of the standard deviations in the vertical features;

determining that the orientation of the document is horizontal when the average of the standard deviations in the horizontal features of all classes in the plurality of classes is greater than the average of the standard deviations in the vertical features; and

determining that the orientation of the document is vertical when the average of the standard deviations in the horizontal features of all classes in the plurality of classes is smaller than the average of the standard deviations in the vertical features.
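The comparison in claim 3 can be illustrated with a short sketch. The function name `document_orientation` and the string return values are hypothetical, and returning `None` when the two averages are equal is an assumption, since the claim does not address that case.

```python
from statistics import fmean


def document_orientation(h_stds, v_stds):
    """h_stds / v_stds: per-class standard deviations of the horizontal
    and vertical features, one entry per class."""
    mean_h = fmean(h_stds)
    mean_v = fmean(v_stds)
    if mean_h > mean_v:
        return "horizontal"
    if mean_h < mean_v:
        return "vertical"
    return None  # equal averages: not covered by the claim (assumption)
```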

4. The method of claim 1, wherein updating the class to which each text block belongs among the plurality of classes comprises:

calculating a threshold according to the orientation of the document and the features of the plurality of classes;

determining the minimum association value of the text block and the class having the minimum association value with the text block;

when the minimum association value is less than the threshold, clustering the text block into the class having the minimum association value with the text block; and

when the minimum association value is greater than or equal to the threshold, creating a class and clustering the text block into the created class.
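The threshold test of claim 4 might look like the sketch below for a single text block. The function name `assign_block`, the representation of each class by its mean position only, the absolute difference as the association value, and seeding a newly created class with the block's own position are assumptions made for illustration.

```python
def assign_block(block_pos, class_means, threshold):
    """Return the index of the class the block is clustered into.

    block_pos:   scalar position of the block along the relevant axis.
    class_means: non-empty list of class mean positions; a new entry is
                 appended when a class has to be created.
    """
    # Association value with every class (assumed absolute difference).
    values = [abs(block_pos - m) for m in class_means]
    best = min(range(len(values)), key=values.__getitem__)

    if values[best] < threshold:
        return best  # join the class with the minimum association value

    # Minimum association value >= threshold: create a class for the block.
    class_means.append(block_pos)
    return len(class_means) - 1
```

Note that a block whose minimum association value equals the threshold exactly starts a new class, matching the "greater than or equal to" wording of the claim.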

5. The method of claim 4, wherein

when it is determined that the orientation of the document is horizontal, the threshold is the result obtained by multiplying the average of the standard deviations in the vertical features of all classes by a preset coefficient; and

when it is determined that the orientation of the document is vertical, the threshold is the result obtained by multiplying the average of the standard deviations in the horizontal features of all classes by a preset coefficient.
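The threshold of claim 5 reduces to one multiplication per orientation. In the sketch below, the function name `fine_clustering_threshold` and the string orientation labels are hypothetical, and the value of the preset coefficient is left to the caller because the claim does not give one.

```python
from statistics import fmean


def fine_clustering_threshold(orientation, h_stds, v_stds, coefficient):
    """h_stds / v_stds: per-class standard deviations of the horizontal
    and vertical features; coefficient: the preset coefficient."""
    if orientation == "horizontal":
        # Horizontal document: use the vertical-feature standard deviations.
        return coefficient * fmean(v_stds)
    if orientation == "vertical":
        # Vertical document: use the horizontal-feature standard deviations.
        return coefficient * fmean(h_stds)
    raise ValueError(f"unknown orientation: {orientation!r}")
```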

6. An apparatus for extracting lines of text in a document, comprising:

a rough clustering unit, which roughly clusters a plurality of text blocks in the document to form a plurality of classes;

a class feature calculation unit, which calculates the features of each class;

a document orientation determination unit, which determines the orientation of the document according to the class features calculated by the class feature calculation unit; and

a fine clustering unit, which finely clusters the plurality of text blocks according to the orientation of the document determined by the document orientation determination unit to extract text lines, wherein the fine clustering comprises:

d) repeating a) to c) until the class of each text block no longer changes,

when it is determined that the orientation of the document is vertical, the association value between each text block and each class is the difference between the horizontal position information of the text block and the average value in the horizontal features of the class.