
CN1588431A - Character extraction method from complex background color images based on run-length adjacency graphs


Info

Publication number
CN1588431A
CN1588431A (application CN 200410062261 A)
Authority
CN
China
Prior art keywords
color
image
connected domain
width
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200410062261
Other languages
Chinese (zh)
Other versions
CN1312625C (en)
Inventor
刘长松
丁晓青
陈又新
彭良瑞
方驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CNB2004100622612A priority Critical patent/CN1312625C/en
Publication of CN1588431A publication Critical patent/CN1588431A/en
Application granted granted Critical
Publication of CN1312625C publication Critical patent/CN1312625C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for extracting characters from color images with complex backgrounds based on run-length adjacency graphs, belonging to the field of character extraction in the preprocessing stage of color-image character recognition. After the digital color image is obtained, a CRAG (color run-length adjacency graph) region-growing algorithm first finds all color connected domains of the image. Color clustering of the mean colors of these connected domains then yields several color centers, from which distinct color layers are formed, and the connected domains that satisfy the discrimination rules are assigned to these layers. Finally, feature analysis and a size-consistency criterion select the text-character layer from among the color layers, yielding the character images on that layer. The algorithm handles the extraction of character strokes whose color varies gradually, achieves high extraction speed and accuracy, and preserves the original colors of both text and background, which facilitates later image restoration.

Description

Character Extraction Method in Color Images with Complex Backgrounds Based on Run-Length Adjacency Graphs

Technical Field

The method of extracting characters from color images with complex backgrounds based on run-length adjacency graphs belongs both to the field of image segmentation and to the field of preprocessing for character recognition.

Background Art

Extracting text characters from complex color images has become a difficult and critical step in color printed-document recognition systems. Color printed text images and photographic images often contain a large amount of text, and these characters carry much useful information. To exploit this information, the character images must first be extracted automatically and accurately from the complex color image before they can be recognized. Current mainstream OCR systems cannot yet solve this extraction problem for text in complex color images.

Methods for extracting text characters from color documents fall roughly into two categories. The first ignores the color information peculiar to color printed documents, scanning them directly to grayscale and then applying binarization. Because it discards the color information of the document image, this approach is no longer suitable for extracting character foregrounds from complex color images. The second category first uses color information to obtain the connected domains of the image and then analyzes them to obtain the character layer. Since it exploits the color information of color printed document images more fully than the first category, it has clear advantages on color text images with complex backgrounds, and has therefore gradually become a research hotspot.

The second category can in turn be roughly divided into three sub-approaches:

1) Edge analysis: edges are extracted at abrupt color changes in the image, and different color layers are obtained by analyzing these edges.

For complex phenomena such as background-stripe interference, edge analysis produces a large number of broken and intersecting edges, which greatly complicates the segmentation of the color layers.

2) Region growing: regions are grown and merged according to a color-consistency criterion to segment the different color layers.

3) Cluster analysis: a color feature vector is extracted for every pixel, these features are clustered in a chosen color space, and the color layers are segmented according to the clustering result. Analysis shows that direct clustering produces too many cluster centers for images with strongly varying backgrounds. If fuzzy C-means clustering is used instead, centers occupying few pixels are lost during smoothing, which causes small characters to disappear; the loss of edge-transition color information also causes excessive stroke breakage.

Neither edge analysis nor cluster analysis fully exploits the correlated color and position information specific to color images, so neither extracts text characters from color images well.

The growth criterion used by the traditional region-growing algorithm incurs excessive computational cost, yet region growing is precisely the approach that considers the correlated color and position information in a color image, effectively avoiding the color-clustering method's neglect of position information; the computational load can be reduced by improving the growth criterion.

The present invention uses a new region-growing algorithm, CRAG (Color Run-length Adjacency Graph), to search the image for color connected domains, clusters the average colors of these domains, and generates different color layers from the resulting color centers. Finally, the likely text layers are obtained according to specific discrimination criteria. This approach has the following advantages:

1) The algorithm is simple and fast;

2) Color clustering at the level of connected domains makes text easier to separate;

3) Reversed (light-on-dark) text is handled automatically;

4) Characters whose color varies gradually, whether due to the characters themselves or to illumination, can be extracted;

5) Character color information is preserved.

By combining the color and position information of neighboring pixels with color clustering as the main breakthrough, the present invention achieves a character extraction algorithm of high speed, high accuracy, and high performance, which is at the same time an image segmentation algorithm. This approach has not been used in any other literature to date.

Summary of the Invention

The purpose of the present invention is to realize a method for extracting text characters from complex color images based on the CRAG region-growing algorithm; the method can also be applied to color image segmentation. Building on the BAG structure, a new CRAG structure in color space is proposed, and on this basis a new region-growing algorithm. Finally, with this growing algorithm at its core, a method for extracting text characters from color document images is established (referred to below as the CRAG method).

Note that the method of the present invention applies to any other color space: the three color components r (red), g (green), b (blue) used below simply correspond to the three basic components of the other space, and the thresholds involved in the method vary with the chosen color space. The clustering method used in the present invention is not limited to the initial clustering method described; other clustering methods may also be used.

The present invention consists of four parts: color image segmentation, color clustering of connected-domain centers, image layer generation, and character layer selection.

1. Color Image Segmentation

A color connected-domain search algorithm based on the CRAG structure, which belongs to the family of region-growing algorithms, is used; it is referred to here as the CRAG algorithm for short.

The idea is similar to the BAG (block adjacency graph) algorithm for extracting connected-domain contours from binary images. The CRAG algorithm can be understood as two steps: first obtain the color runs in the horizontal direction, then repeatedly merge adjacent runs of similar color to obtain the color connected domains. RGB space is used below as the running example:

A color run is represented as R_p{(r_p, g_p, b_p), (x_p, y_p), f_p}, where (r_p, g_p, b_p) are the averages of the r, g, b color components of the run's pixels in RGB space, (x_p, y_p) is the starting coordinate of the run, and f_p is the length of the run.

Runs are generated as follows: starting from the first pixel of each row, treat that pixel as the start of a new run, and compute the Euclidean distance o_pq in RGB space between this starting point and the pixel immediately following it in the same row,

o_pq = sqrt((r_q − r_p)² + (g_q − g_p)² + (b_q − b_p)²).

If (o_pq < TD)
    { r_p = (r_p × f_p + r_q) / (f_p + 1);  g_p = (g_p × f_p + g_q) / (f_p + 1);  b_p = (b_p × f_p + b_q) / (f_p + 1);  f_p = f_p + 1; }
Else
    { p = p + 1;  r_p = r_q;  g_p = g_q;  b_p = b_q; }                            (1-1)

Rule (1-1) states: if o_pq is smaller than the threshold TD, the two pixels are merged into one run and the run's average r, g, b values r_p, g_p, b_p are recomputed; otherwise, the second pixel becomes the start of a new run. The Euclidean distance to the next neighboring pixel is then computed in the same way: if it is still below TD, the pixel is added to the run and its r, g, b values are recomputed; otherwise, that pixel becomes the start of the next new run. Traversing all pixels of every image row under this rule yields a set of color runs.
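Rule (1-1) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the dict representation of a run and the concrete threshold TD = 14 (picked from the 12-16 range stated later) are our assumptions.

```python
import math

TD = 14  # run-merging threshold; the text allows 12-16 (assumption: midpoint)

def color_distance(p, q):
    """Euclidean distance between two (r, g, b) colors in RGB space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def extract_runs(row, y, td=TD):
    """Split one image row into color runs per rule (1-1).

    row: list of (r, g, b) pixels; y: the row index.
    Each run is a dict {color: running (r, g, b) average,
    start: (x, y) coordinate, length: run length f_p}.
    """
    runs = []
    for x, pix in enumerate(row):
        if runs and color_distance(runs[-1]["color"], pix) < td:
            run = runs[-1]
            f = run["length"]
            # incremental update of the run's average r, g, b values
            run["color"] = tuple((c * f + v) / (f + 1)
                                 for c, v in zip(run["color"], pix))
            run["length"] = f + 1
        else:
            # pixel too different: it starts a new run
            runs.append({"color": tuple(map(float, pix)),
                         "start": (x, y), "length": 1})
    return runs
```

With a row of two near-white pixels followed by two near-black ones, the sketch produces exactly two runs of length 2 each.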

In addition, starting from the second row of the image, after a color run is obtained, compute the Euclidean distance o_pp' in RGB space between that run and each color run of the previous adjacent row that is 4-neighbor connected to it in position:

o_pp' = sqrt((r_p' − r_p)² + (g_p' − g_p)² + (b_p' − b_p)²)

If this distance is smaller than TV, the runs are merged into the same connected domain, i.e. the two runs are linked; otherwise, the run becomes the initial run of a new connected domain. TD and TV take values between 12 and 16.

As shown in Fig. 6, each square in the figure represents a pixel; for pixel "5", the four adjacent pixels "2, 4, 6, 8" occupy its 4-neighbor positions. For two runs on adjacent rows, if any of their pixels stand in the 4-neighbor relation shown in Fig. 6, the two runs are said to be 4-neighbor connected.

Following these rules, after the entire image has been traversed, the connection relations between runs yield the set {C_n | n = 1, 2, ..., K} of all connected domains composing the image.
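The merging of 4-neighbor-connected, similar-color runs into connected domains can be sketched with a union-find pass over the rows. This is an illustrative reconstruction, not the patent's exact procedure: the run dicts, helper names, and TV = 14 (from the stated 12-16 range) are assumptions.

```python
import math

TV = 14  # vertical merging threshold; the text allows 12-16 (assumption)

def color_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def runs_touch(a, b):
    """True if two runs on adjacent rows share at least one column
    (the 4-neighbor relation of Fig. 6 for horizontal runs)."""
    ax, bx = a["start"][0], b["start"][0]
    return ax < bx + b["length"] and bx < ax + a["length"]

def link_runs(rows_of_runs, tv=TV):
    """Group runs into connected domains with union-find.

    rows_of_runs: one list of run dicts per image row, as produced by
    a run-extraction pass. Returns a list of domains (lists of runs).
    """
    all_runs, parent = [], []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    index = []  # per-row indices into all_runs
    for y, row in enumerate(rows_of_runs):
        ids = []
        for run in row:
            i = len(all_runs)
            all_runs.append(run)
            parent.append(i)
            if y > 0:  # compare with 4-neighbor runs on the previous row
                for j in index[y - 1]:
                    if (runs_touch(run, all_runs[j]) and
                            color_distance(run["color"],
                                           all_runs[j]["color"]) < tv):
                        union(i, j)
            ids.append(i)
        index.append(ids)

    domains = {}
    for i, _ in enumerate(all_runs):
        domains.setdefault(find(i), []).append(all_runs[i])
    return list(domains.values())
```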

The structure of a connected domain is defined as follows:

C_n{(r_n, g_n, b_n), X_n, (v_n, h_n)}, where (r_n, g_n, b_n) is the average color (r, g, b values) of connected domain C_n:

r_n = Σ_{u=1}^{m_n} (r_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}        (1-2)

g_n = Σ_{u=1}^{m_n} (g_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}        (1-3)

b_n = Σ_{u=1}^{m_n} (b_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}        (1-4)

X_n = {R_{p_u} | u = 1, 2, ..., m_n} denotes the set of all color runs contained in the connected domain. The height h_n and width v_n of the connected domain are easily obtained by simple calculation. Thus, an image can be described by all of its connected domains.

2. Analysis of the Connected-Domain Color Clustering Step

The color of an arbitrarily chosen connected domain serves as the initial center; the Euclidean distance o_cn in RGB color space between each other connected domain and the center is computed:

o_cn = sqrt((r_n − r_c)² + (g_n − g_c)² + (b_n − b_c)²)

If o_cn is below the threshold TC, the domain joins the cluster and the means of r, g, b are recomputed as the cluster's center color; if it exceeds TC, a second, new center is created. All samples are processed by this method. Because the center positions keep shifting, centers whose mutual distance falls below TC must also be merged; an appropriate number of color cluster centers is eventually obtained.
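The sequential clustering described above might look like this in Python. This is a sketch under the stated thresholds: the leader-style center update and the single final center-merging pass are reconstructions, not necessarily the patent's exact procedure, and TC = 45 follows the value recommended later in the text.

```python
import math

TC = 45  # clustering threshold; the text recommends 45 within 20-50

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_colors(colors, tc=TC):
    """Sequential (leader) clustering of connected-domain mean colors.

    Each sample joins the nearest existing center closer than tc
    (whose mean is then updated); otherwise it founds a new center.
    Finally, centers closer than tc to each other are merged.
    Returns the list of (r, g, b) cluster centers.
    """
    centers = []  # each center: [sum_r, sum_g, sum_b, count]
    for c in colors:
        best, best_d = None, tc
        for ctr in centers:
            mean = tuple(s / ctr[3] for s in ctr[:3])
            d = dist(mean, c)
            if d < best_d:
                best, best_d = ctr, d
        if best is None:
            centers.append([c[0], c[1], c[2], 1])
        else:
            for k in range(3):
                best[k] += c[k]
            best[3] += 1
    # merge centers whose means are closer than tc
    merged = []
    for ctr in centers:
        mean = tuple(s / ctr[3] for s in ctr[:3])
        for m in merged:
            mmean = tuple(s / m[3] for s in m[:3])
            if dist(mean, mmean) < tc:
                for k in range(3):
                    m[k] += ctr[k]
                m[3] += ctr[3]
                break
        else:
            merged.append(list(ctr))
    return [tuple(s / m[3] for s in m[:3]) for m in merged]
```

Two dark and two light samples, for example, collapse into two centers.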

Some special connected domains cannot possibly be text blocks, so a pre-screening is applied; the selection criteria for the connected-domain samples that participate in clustering are as follows:

1)Hmin<hn<Hmax,Vmin<vn<Vmax;1) Hmin<h n <Hmax, Vmin<v n <Vmax;

2) H_Vmin < h_n / v_n < H_Vmax, or V_Hmin < v_n / h_n < V_Hmax;

3) Q_2 > (Σ_{u=1}^{m_n} f_{p_u}) / (h_n × v_n) > Q_1, where (Σ_{u=1}^{m_n} f_{p_u}) / (h_n × v_n) denotes the pixel density of the connected domain.

In the above, h_n and v_n denote the height and width of the resulting color connected domain, m_n is the number of color runs in the n-th connected domain, and f_{p_u} is the run length of the p_u-th run.

In criterion 1), the stroke height and width of characters in the test images are almost always smaller than the image height H and width V (height meaning the number of vertical pixels, width the number of horizontal pixels). The maximum height and width of candidate connected domains are set to Hmax = min(H, 400) and Vmax = min(V, 400): the font size of text in current color printed documents is mostly below 120 pt, so in color images scanned at 300 dpi the maximum stroke height and width are both under 400 pixels, while the actual height and width of the text-region image are also taken into account. Hmin and Vmin are the minimum height and width of connected-domain samples participating in color clustering; experiments show that too large a value lowers the recall of small fonts, so for broad applicability both are set to 3 here, which removes the interference of a large number of noise points while still preserving punctuation marks.

In criterion 2), H_Vmin and H_Vmax are the minimum and maximum of the connected domain's height-to-width ratio; likewise, V_Hmin and V_Hmax are the minimum and maximum of the width-to-height ratio. Given the characteristics of strokes, a minimum of 1 and a maximum of 50 suffice.

In criterion 3), with Q_1 = 0.3 and Q_2 = 0.8, part of the influence of image borders and other long, narrow line edges is excluded. Note that Q_1 and Q_2 may still vary by about ±0.2 around these values, i.e. Q_1 may range over 0.1 to 0.5 and Q_2 over 0.6 to 1.

The threshold TC mentioned above may take values between 20 and 50, but a small TC produces too many layers; TC = 45 is therefore adopted, which reduces the number of generated image layers and the computational cost. This is a good choice for extracting text characters from color images and effectively suppresses interfering noise points.
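Criteria 1)-3) with the recommended threshold values can be collected into a single predicate. This is a hypothetical helper; the function and parameter names are ours, and the thresholds follow the values given in the text (Hmin = Vmin = 3, Hmax/Vmax capped at 400, ratio in [1, 50], density between Q1 = 0.3 and Q2 = 0.8).

```python
def domain_is_candidate(h, v, pixel_count, H, V,
                        q1=0.3, q2=0.8, ratio_max=50.0):
    """Pre-filter for connected domains before clustering.

    h, v: domain height and width; pixel_count: total run length
    (sum of f_{p_u}); H, V: image height and width in pixels.
    """
    hmax, vmax = min(H, 400), min(V, 400)
    if not (3 < h < hmax and 3 < v < vmax):
        return False                       # criterion 1: size bounds
    ratio = max(h / v, v / h)
    if not (1 <= ratio < ratio_max):
        return False                       # criterion 2: aspect-ratio bounds
    density = pixel_count / (h * v)
    return q1 < density < q2               # criterion 3: pixel density
```

A 20x10 domain with 100 pixels (density 0.5) passes; a 2-pixel-high domain or a near-solid block (density 0.95) is rejected.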

Different settings of these parameters change the number of connected domains used for clustering and the number of color centers generated. If the ranges are too narrow, the computation drops and the speed rises, but backgrounds and foregrounds of very similar color may stick together; if too wide, too many color centers are generated, increasing the computation. Experiments show that values chosen within the ranges given above yield good character extraction results. Moreover, these constraints further reduce the cost of the initial clustering and, to some extent, remove noise color centers.

3. Image Layer Generation

Every connected domain whose height or width is smaller than that of the text-region image is compared with the color centers; if the Euclidean distance between the domain's average color and a center is below TC, the domain is placed on that center's image layer. Multiple layers are thus obtained, and the text-character images may lie on one or more of them. In addition, if a connected domain exists whose height and width equal those of the text-region image, the layer containing it is designated the background layer. (To ease subsequent segmentation and recognition, all generated layers are converted to black-on-white images.) Some non-text layers are then excluded by the following criteria:

1) Each text layer must contain more than 200 pixels; otherwise it is classified as a noise layer;

2) If the height and width of a connected domain C are roughly equal to the test image size, the center color of C is taken as the background color and its layer as the background layer.

If, after the screening of 1) and 2), more than L layers remain, then, assuming there are no more than L foreground colors, the L + 2 layers with the largest total number of black pixels are kept. The foreground is the set of text-character images contained in the whole image, the foreground colors are their approximate colors, and everything in the image other than the text-character images is called the background.

Here L can be chosen according to the situation; the present invention generally takes L = 4. Values in this range effectively reduce noise and background layers among the candidate character layers while avoiding loss of the character layer. After removing noise and background layers by the above criteria, the remaining layers are considered image layers that may contain text characters.
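The layer-generation and layer-screening rules of this section can be sketched as follows. This is an illustrative reconstruction: the domain dictionaries, the assignment of a domain to every center within TC, and ranking layers purely by pixel count are our assumptions.

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_layers(domains, centers, H, V, tc=45, min_pixels=200, L=4):
    """Assign domains to color layers, then drop noise/background layers.

    domains: dicts {color, h, v, pixels}; centers: cluster centers.
    A domain as tall and wide as the image marks its layer as
    background; layers under min_pixels are noise; at most L + 2
    layers (largest pixel totals first) are kept.
    """
    layers = [[] for _ in centers]
    background = set()
    for d in domains:
        for i, c in enumerate(centers):
            if dist(d["color"], c) >= tc:
                continue
            if d["h"] < H or d["v"] < V:
                layers[i].append(d)          # ordinary domain -> layer i
            else:
                background.add(i)            # full-size domain -> background
    keep = [(sum(x["pixels"] for x in layer), i)
            for i, layer in enumerate(layers)
            if i not in background
            and sum(x["pixels"] for x in layer) > min_pixels]
    keep.sort(reverse=True)
    return [layers[i] for _, i in keep[:L + 2]]
```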

4. Character Layer Selection

Let the image height in the vertical direction be H and the width in the horizontal direction be V. Color layering yields K layers. Each layer i (1 ≤ i ≤ K) is projected horizontally and vertically, giving horizontal projection widths u_il (0 ≤ l < N_i) and vertical projection widths w_ij (0 ≤ j < M_i), where i is the layer index, l indexes the horizontal projection widths, and j the vertical ones. To eliminate the interference of small noise, the number of projected black pixels at each coordinate position must exceed 5. Moreover, only projections wider than 10 pixels are counted, giving the counts N_i and M_i, the totals of qualifying projection widths in the two directions. The distance between two adjacent projection widths in the horizontal direction is the horizontal projection interval width e_is (0 ≤ s < Z_i), and in the vertical direction the vertical projection interval width d_it (0 ≤ t < Y_i), where Z_i and Y_i are the totals of interval widths obtained in the two directions. From these results, the average projection widths on layer i can be computed:

Average horizontal projection width: AvgH_i = (1 / N_i) Σ_{l=0}^{N_i−1} u_il; average vertical projection width: AvgW_i = (1 / M_i) Σ_{j=0}^{M_i−1} w_ij.

The averages of the projection interval widths on layer i:

Average horizontal projection interval width: AvgE_i = (1 / Z_i) Σ_{s=0}^{Z_i−1} e_is; average vertical projection interval width: AvgD_i = (1 / Y_i) Σ_{t=0}^{Y_i−1} d_it.

The variance of the horizontal projection widths of this layer is VarH_i = Σ_{l=0}^{N_i−1} (u_il − AvgH_i)² / N_i, and the variance of the vertical projection widths is VarW_i = Σ_{j=0}^{M_i−1} (w_ij − AvgW_i)² / M_i;

the variance of the horizontal projection interval widths is VarE_i = Σ_{s=0}^{Z_i−1} (e_is − AvgE_i)² / Z_i, and the variance of the vertical projection interval widths is VarD_i = Σ_{t=0}^{Y_i−1} (d_it − AvgD_i)² / Y_i.

Analysis of the connected domains of text characters shows that they are basically uniform in size and fairly evenly distributed. From these physical properties, the size-consistency criterion P_i of a layer can be defined as follows (1 ≤ i ≤ K):

P_i = [min(AvgH_i / AvgW_i, AvgW_i / AvgH_i) × H × V] / [(1 + |max(N_i, M_i) − max(H/V, V/H)| / 2) × (1 + max(VarE_i, VarD_i)) × (1 + max(VarH_i, VarW_i))]

max() and min() denote the larger and smaller of the two values in parentheses, respectively.

P_i is computed for every layer and the layers are sorted by its value; the largest value corresponds to the most likely text-character layer. Experimental results show that the size-consistency criterion satisfies, within a certain range, the system's requirement for automatic identification of the text layer, and also supplies an ordering of candidate layers for subsequent processing.
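The size-consistency criterion P_i can be computed directly from the projection statistics. A sketch: the input lists are assumed to be pre-filtered as described above, i.e. they contain only projections wider than 10 pixels.

```python
def size_consistency(widths_h, widths_v, gaps_h, gaps_v, H, V):
    """Size-consistency criterion P_i for one layer.

    widths_h, widths_v: horizontal/vertical projection widths
    u_il and w_ij; gaps_h, gaps_v: interval widths e_is and d_it;
    H, V: image height and width. Larger P_i means a more uniform,
    text-like layer.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    avg_h, avg_w = mean(widths_h), mean(widths_v)
    num = min(avg_h / avg_w, avg_w / avg_h) * H * V
    den = ((1 + abs(max(len(widths_h), len(widths_v))
                    - max(H / V, V / H)) / 2)
           * (1 + max(var(gaps_h), var(gaps_v)))
           * (1 + max(var(widths_h), var(widths_v))))
    return num / den
```

A layer with uniform widths and gaps scores far higher than one with scattered widths, matching the intended ranking behavior.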

The present invention is characterized in that it comprises the following steps in sequence:

(1) A color printed document or photographic image is scanned into an image processor by an image acquisition device;

(2) The following are set in the above image processor:

图像的高和宽分别用符号H和V表示;The height and width of the image are represented by symbols H and V, respectively;

图像中每一行象素与同一行和它紧邻的彩色游程再RGB空间内的欧氏距离opq的阈值为TD;The threshold of the Euclidean distance o pq between each row of pixels in the image and the same row and its adjacent color run in RGB space is TD;

从图像的第二行开始算起,该彩色游程与上一相邻行在位置上是4邻域相连的彩色游程在RGB空间的欧氏距离opp’的阈值是TV,选取TD=TV=12~16。Counting from the second row of the image, the color run and the last adjacent row are in position 4 neighbors. The threshold of the Euclidean distance o pp' of the color run in RGB space is TV, select TD=TV= 12-16.

连通域的初始中心与组成图像所有连通域的集合中的其他连通域在RGB彩色空间的欧氏距离ocn的阈值TC,选取TC=20~50;The threshold TC of the Euclidean distance o cn between the initial center of the connected domain and other connected domains in the set of all connected domains forming the image in RGB color space, select TC=20~50;

Maximum height of candidate connected domains: Hmax = min(H, 400) pixels;

Maximum width of candidate connected domains: Vmax = min(V, 400) pixels;

Minimum height of candidate connected domains: Hmin = 3 pixels;

Minimum width of candidate connected domains: Vmin = 3 pixels;

The height-to-width or width-to-height ratio of a candidate connected domain has a minimum of 1 and a maximum of 50;

The pixel density of each connected domain is denoted (Σ_{u=1}^{m_n} f_{p_u}) / (h_n × v_n), where h_n and v_n are the height and width of the obtained color connected domain, m_n is the number of color runs in the n-th connected domain, and f_{p_u} is the run length of the p_u-th run. Set Q_2 > (Σ_{u=1}^{m_n} f_{p_u}) / (h_n × v_n) > Q_1, with Q_1 = 0.1-0.5 and Q_2 = 0.6-1;

Threshold TC = 20-50 in the connected-domain color clustering process;

The number of candidate color layers obtained after selection is K ≤ L + 2, with L = 4.

(3) Segment the color image and obtain the color connected domains; that is, the image is described by a set of connected domains.

(3.1) Starting from the first pixel of each row, treat that pixel as the starting point of a new run, and compute the Euclidean distance o_pq in RGB space between this starting point and the pixel immediately adjacent to it in the same row. A color run is expressed as R_p{(r_p, g_p, b_p), (x_p, y_p), f_p}, where r_p, g_p, b_p are the averages of the r, g, b color components of the points on the run in RGB color space, (x_p, y_p) is the starting coordinate of the run, and f_p is the length of the run:

o_pq = √((r_q − r_p)² + (g_q − g_p)² + (b_q − b_p)²).

If o_pq < TD, merge the two pixels into one run and recompute the run's average r, g, b values, namely r_p, g_p, b_p:

r_p = (r_p × f_p + r_q) / (f_p + 1); g_p = (g_p × f_p + g_q) / (f_p + 1); b_p = (b_p × f_p + b_q) / (f_p + 1);

The length of the run is incremented by 1: f_p = f_p + 1;

Otherwise, the second pixel becomes the starting point of a new run. Continue computing its Euclidean distance to the next adjacent pixel; if it is still less than TD, add that pixel to the run and recompute its r, g, b values; otherwise, take that pixel as the starting point of the next new run. Following these rules, traversing all pixels in each row of the image yields a number of color runs.
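The per-row run construction in step (3.1) can be sketched in a few lines of Python (an illustrative sketch, not the patent's implementation; names such as `row_to_runs` and the default `td` are our own, with TD in the patent's suggested 12-16 range):

```python
import math

def row_to_runs(row, td=14):
    """Split one row of (r, g, b) pixels into color runs.

    Each run is represented as [avg_color, start_x, length]; a pixel joins
    the current run when its Euclidean RGB distance to the run's average
    color is below the threshold TD, and the average is updated incrementally.
    """
    runs = []
    for x, (r, g, b) in enumerate(row):
        if runs:
            (rp, gp, bp), start, f = runs[-1]
            d = math.sqrt((r - rp) ** 2 + (g - gp) ** 2 + (b - bp) ** 2)
            if d < td:
                # merge the pixel into the run and update the running average
                runs[-1][0] = ((rp * f + r) / (f + 1),
                               (gp * f + g) / (f + 1),
                               (bp * f + b) / (f + 1))
                runs[-1][2] = f + 1
                continue
        runs.append([(float(r), float(g), float(b)), x, 1])
    return runs
```

For example, a row of three dark pixels followed by two bright pixels yields two runs of lengths 3 and 2.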

(3.2) After the color runs of a row are obtained, starting from the second row of the image, compute the Euclidean distance o_pp' in RGB space between each run and any color run of the previous row that is 4-neighbor connected to it in position:

o_pp' = √((r_p' − r_p)² + (g_p' − g_p)² + (b_p' − b_p)²)

If this distance is less than TV, merge the runs into the same connected domain, i.e., connect the two runs; otherwise, treat the run as the initial run of a new connected domain. After the entire image has been traversed in this way, the set {C_n | n = 1, 2, ..., K} of all connected domains composing the image is obtained from the connection relations between runs.

A connected domain is represented by the following structure:

C_n{(r_n, g_n, b_n), X_n, (v_n, h_n)}, where (r_n, g_n, b_n) is the average color (r, g, b) of connected domain C_n:

r_n = Σ_{u=1}^{m_n} (r_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}    (1-2)

g_n = Σ_{u=1}^{m_n} (g_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}    (1-3)

b_n = Σ_{u=1}^{m_n} (b_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}    (1-4)

X_n = {R_{p_u} | u = 1, 2, ..., m_n} denotes the set of all color runs contained in the connected domain. The height h_n and width v_n of the connected domain are easily obtained by a simple calculation.

(4) Perform color clustering on the connected domains to obtain an appropriate number of color cluster centers.

The connected-domain samples participating in color clustering are selected according to the following three criteria:

1) Hmin < h_n < Hmax and Vmin < v_n < Vmax; that is, the height and width of a connected domain participating in color clustering must lie within the ranges set above;

2) H_Vmin < h_n / v_n < H_Vmax, or V_Hmin < v_n / h_n < V_Hmax, where H_Vmin and H_Vmax are the minimum and maximum of the height-to-width ratio of the connected domain; similarly, V_Hmin and V_Hmax are the minimum and maximum of the width-to-height ratio.

3) Q_2 > (Σ_{u=1}^{m_n} f_{p_u}) / (h_n × v_n) > Q_1; that is, the pixel density of the connected domain lies between Q_1 and Q_2.
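The three selection criteria can be combined into a single predicate (an illustrative sketch; the function name is our own, and the parameter defaults follow the ranges given in step (2), with Q_1 = 0.3 and Q_2 = 0.8 as suggested later in the text):

```python
def is_cluster_candidate(h, w, n_pixels,
                         hmin=3, hmax=400, vmin=3, vmax=400,
                         ratio_min=1, ratio_max=50,
                         q1=0.3, q2=0.8):
    """Decide whether a connected domain takes part in color clustering.

    h, w: height and width of the domain; n_pixels: sum of its run lengths.
    Criteria: size bounds, aspect-ratio bounds, and pixel density in (q1, q2).
    """
    if not (hmin < h < hmax and vmin < w < vmax):
        return False
    # max(h/w, w/h) >= 1, so one test covers both ratio conditions
    ratio = max(h / w, w / h)
    if not (ratio_min <= ratio < ratio_max):
        return False
    density = n_pixels / (h * w)
    return q1 < density < q2
```

A 20 × 10 domain with 100 foreground pixels (density 0.5) qualifies; one with 190 pixels (density 0.95) is rejected as a likely solid block.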

(5) Form the image layers, delete the noise layer and obvious background layers from them, and obtain the image layers that may contain text.

(5.1) Forming the image layers

Compare every connected domain whose height or width is smaller than the height or width of the text-region image, respectively, against the color centers. If the Euclidean distance between a connected domain's average color and a color center is less than TC, place the connected domain on the image layer of that center. Multiple layers are obtained in this way, and all of them are converted into images of black characters on a white background;

(5.2) Exclude non-text-character layers in turn according to the following criteria

1) A layer containing fewer than 200 pixels is classified as a noise layer and excluded;

2) If the height and width of a connected domain are comparable to the size of the test image, its center color is taken as the background color and the layer it belongs to as the background layer;

(5.3) Under the condition that there are no more than L foreground colors, if more than L image layers remain, select the layers ranking in the top L + 2 by total number of black pixels as the layers that may contain text-character images, and process them according to the following steps. The foreground refers to the text-character images contained in the whole image, and the foreground colors to the approximate colors of these character images; everything in the image other than the character images is called the background.

(6) Compute the consistency criterion value P_i (1 ≤ i ≤ K), where K is the number of layers obtained in step (5.3), for each possible text-character image layer according to the consistency criterion formula, and sort; the layer with the largest P_i is the most likely text-character layer.

(6.1) Project each of the K layers in the horizontal and vertical directions to obtain the horizontal projection widths u_il (0 ≤ l < N_i) and the vertical projection widths w_ij (0 ≤ j < M_i), where i is the index of the image layer, l the index of a horizontal projection width, and j the index of a vertical projection width. To eliminate the interference of small noise, the number of projected black pixels at a coordinate position must exceed 5 for the position to count. Furthermore, only projections wider than 10 pixels are counted in the two directions, giving N_i and M_i, the total numbers of qualifying projection widths in each direction. The distance between two adjacent projection widths in the horizontal direction is the horizontal projection gap width e_is (0 ≤ s < Z_i), and in the vertical direction the vertical projection gap width d_it (0 ≤ t < Y_i); Z_i and Y_i are the total numbers of projection gap widths obtained in the two directions.
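Extracting the projection widths and gap widths from a one-dimensional projection histogram, as described above, can be sketched as follows (an illustrative sketch; the noise thresholds 5 and 10 follow the text, and all names are our own):

```python
def projection_segments(hist, min_count=5, min_width=10):
    """Return (widths, gaps) of foreground segments in a projection histogram.

    Positions with more than min_count black pixels count as foreground;
    only segments wider than min_width pixels are kept, and gaps are the
    distances between consecutive kept segments.
    """
    mask = [c > min_count for c in hist]
    widths, gaps = [], []
    start = None       # start of the current foreground segment
    last_end = None    # end position of the previous kept segment
    for pos, fg in enumerate(mask + [False]):  # sentinel flushes the last segment
        if fg and start is None:
            start = pos
        elif not fg and start is not None:
            width = pos - start
            if width > min_width:
                if last_end is not None:
                    gaps.append(start - last_end)
                widths.append(width)
                last_end = pos
            start = None
    return widths, gaps
```

For a histogram with two text lines of widths 12 and 15 separated by a 7-pixel gap, this returns `([12, 15], [7])`.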

(6.2) Compute the following values:

Average horizontal projection width: AvgH_i = (1/N_i) Σ_{l=0}^{N_i−1} u_il,

Average vertical projection width: AvgW_i = (1/M_i) Σ_{j=0}^{M_i−1} w_ij,

Average horizontal projection gap width: AvgE_i = (1/Z_i) Σ_{s=0}^{Z_i−1} e_is,

Average vertical projection gap width: AvgD_i = (1/Y_i) Σ_{t=0}^{Y_i−1} d_it,

Variance of the horizontal projection widths: VarH_i = Σ_{l=0}^{N_i−1} (u_il − AvgH_i)² / N_i,

Variance of the vertical projection widths: VarW_i = Σ_{j=0}^{M_i−1} (w_ij − AvgW_i)² / M_i,

Variance of the horizontal projection gap widths: VarE_i = Σ_{s=0}^{Z_i−1} (e_is − AvgE_i)² / Z_i,

Variance of the vertical projection gap widths: VarD_i = Σ_{t=0}^{Y_i−1} (d_it − AvgD_i)² / Y_i;

(6.3) When the text color within the original text-region image is uniform, the total number of text rows or columns is less than three, and the text along the row or column direction lies approximately on a straight line, compute the consistency criterion value P_i as:

P_i = [min(AvgH_i / AvgW_i, AvgW_i / AvgH_i) × H × V] / [(1 + |max(N_i, M_i) − max(H/V, V/H)| / 2) × (1 + max(VarE_i, VarD_i)) × (1 + max(VarH_i, VarW_i))]

where i is the layer index, i = 1, ..., K;

Sort the obtained P_i by value and take the text layer with the largest value for character segmentation and recognition.
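Given the projection-width and gap-width lists of one layer, P_i can be computed directly (a sketch assuming the mean and variance formulas of step (6.2); the function and variable names are ours, not the patent's):

```python
def consistency_score(u, w, e, d, H, V):
    """Compute the size-consistency criterion P for one layer.

    u, w: horizontal / vertical projection widths; e, d: the corresponding
    gap widths; H, V: image height and width.  A larger P indicates a more
    likely text-character layer.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    num = min(mean(u) / mean(w), mean(w) / mean(u)) * H * V
    den = ((1 + abs(max(len(u), len(w)) - max(H / V, V / H)) / 2)
           * (1 + max(var(e), var(d)))
           * (1 + max(var(u), var(w))))
    return num / den
```

Layers are then ranked by this score and the highest-scoring one is passed on to segmentation and recognition.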

(7) The present invention can operate in any other color space: it suffices to map the three color components r, g, b used above to the three basic components of the other color space. The thresholds involved in the method differ depending on the chosen color space.

Experiments show that processing color images containing text with the present invention yields high correct character extraction rates: 94.4% for title characters in color magazines, 90.7% for text characters in color newspapers, and 95% for text characters in color photographs, all higher than the correct extraction rates of other existing methods.

Description of the drawings

Figure 1 shows the hardware configuration of a typical character extraction system.

Figure 2 is a flow chart of the CRAG-based text-character extraction method.

Figure 3 is a schematic diagram of the CRAG structure: 3a, 3b, 3c, 3d, 3e, 3f, 3g.

Figure 4 is an example of multi-layer generation: 4a is the original color image; 4b, 4c, 4d, 4e, 4f, 4g, and 4h are the generated image layers.

Figure 5 shows layer projections: 5a is the vertical projection histogram, 5b the vertical projection widths, 5c the horizontal projection histogram, and 5d the horizontal projection widths.

Figure 6 is a schematic diagram of 4-neighborhood connection.

Detailed description of the embodiments

As shown in Figure 1, a system for extracting characters from color images consists of two hardware parts: an image acquisition device and a processor. The image acquisition device is generally a scanner, digital video camera, or digital camera, used to acquire digital images containing characters. The processor is generally a computer or a terminal with computing capability, used to process the digital images and extract the text characters.

Figure 2 shows the flow chart of the CRAG-based text-character extraction method. First, a color printed document is scanned in with a scanner, or a color image obtained by a digital camera or video camera is input into the processor (a computer or other terminal processing device), yielding a color image containing text characters. A region-growing algorithm is then applied to this image to obtain the color connected domains described by the CRAG structure. Connected-domain screening criteria are applied, and simple color clustering of the average colors of the screened connected domains produces a set of distinct color centers. From these color centers, different color image layers are generated; finally, the candidate text-character image layer is selected by the size-consistency criterion, i.e., converted into the required binary text-character image, which is sent to the subsequent character segmentation and recognition modules for processing.

Segmenting the image to obtain connected domains

After the color image containing text characters has been converted into a digital image and input into the computer, the CRAG algorithm decomposes the image into multiple connected domains. The algorithm can be understood as two steps: first obtain the color runs in the horizontal direction, then repeatedly merge adjacent runs of similar color to obtain the color connected domains.

A color run is expressed as R_p{(r_p, g_p, b_p), (x_p, y_p), f_p}, where (r_p, g_p, b_p) are the averages of the r, g, b color components of the points on the run in RGB color space, (x_p, y_p) is the starting coordinate of the run, and f_p is the length of the run.

Runs are generated as follows: starting from the first pixel of each row, treat that pixel as the starting point of a new run, and compute the Euclidean distance o_pq in RGB space between this starting point and the pixel immediately adjacent to it in the same row,

o_pq = √((r_q − r_p)² + (g_q − g_p)² + (b_q − b_p)²).

If (o_pq < TD)

{ r_p = (r_p × f_p + r_q) / (f_p + 1); g_p = (g_p × f_p + g_q) / (f_p + 1); b_p = (b_p × f_p + b_q) / (f_p + 1); f_p = f_p + 1; }

Else { p = p + 1; r_p = r_q; g_p = g_q; b_p = b_q; }    (1-1)

According to (1-1), if o_pq is less than the threshold TD, the two pixels are merged into one run and the run's average r, g, b values r_p, g_p, b_p are recomputed; otherwise, the second pixel becomes the starting point of a new run. Continue computing its Euclidean distance to the next adjacent pixel; if it is still less than TD, add that pixel to the run and recompute its r, g, b values; otherwise, take that pixel as the starting point of the next new run. Following these rules, traversing all pixels in each row of the image yields a number of color runs.

In addition, starting from the second row of the image, after a color run has been obtained, compute the Euclidean distance o_pp' in RGB space between this run and any color run of the previous row that is 4-neighbor connected to it in position:

o_pp' = √((r_p' − r_p)² + (g_p' − g_p)² + (b_p' − b_p)²)

If this distance is less than TV, merge the runs into the same connected domain, i.e., connect the two runs; otherwise, treat the run as the initial run of a new connected domain.

As shown in Figure 6, each square in the figure represents a pixel. For pixel "5", the positions of the four adjacent pixels "2, 4, 6, 8" are 4-neighbor connected to it. For two runs in adjacent rows, if the relative positions of the pixels they contain include a 4-neighbor-connected configuration as shown in Figure 6, the two runs are said to be 4-neighbor connected.
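For horizontal runs in adjacent rows, this 4-neighbor connection condition reduces to a column-interval overlap test, which can be sketched as (an illustrative sketch; the function name is our own):

```python
def runs_4_connected(x1, f1, x2, f2):
    """True if a run starting at column x1 with length f1 and a run in the
    adjacent row starting at column x2 with length f2 share at least one
    column, i.e. some pixel of one lies directly above a pixel of the other."""
    return x1 < x2 + f2 and x2 < x1 + f1
```

For example, a run over columns 0-2 is 4-neighbor connected to a run over columns 2-5 (shared column 2), but not to one over columns 3-4.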

Following the above rules, after the entire image has been traversed, the set {C_n | n = 1, 2, ..., K} of all connected domains composing the image is obtained from the connection relations between runs.

The structure of a connected domain is defined as follows:

C_n{(r_n, g_n, b_n), X_n, (v_n, h_n)}, where (r_n, g_n, b_n) is the average color (r, g, b) of connected domain C_n:

r_n = Σ_{u=1}^{m_n} (r_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}    (1-2)

g_n = Σ_{u=1}^{m_n} (g_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}    (1-3)

b_n = Σ_{u=1}^{m_n} (b_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}    (1-4)

X_n = {R_{p_u} | u = 1, 2, ..., m_n} denotes the set of all color runs contained in the connected domain. The height h_n and width v_n of the connected domain are easily obtained by a simple calculation. Thus, an image can be described by all the connected domains obtained.

The thresholds TD and TV are important parameters that determine whether the algorithm succeeds. If they are chosen too small, characters are fragmented and the point of region growing is lost; in essence the positional information of the pixels is lost, the foreground-consistency extraction rule is violated, and the computation required for the subsequent connected-domain color clustering increases. If they are chosen too large, character connected domains stick to other objects. Empirical parameters are used here; experiments have verified that good results are obtained for TD = TV = 12-16. Outside this range, many characters tend to stick to the background, i.e., characters cannot be extracted from a similarly colored background.

As shown in Figure 3: color image A, whose background is yellow and green and whose foreground character is rendered in a gradient color, can be regarded as composed of image B1 of the foreground character R and background image B2. Diagram C1 depicts the CRAG structure of the connected domain C_1 that makes up the letter R; the rectangular blocks in the figure represent the color runs contained in this connected domain, the width of each run being one pixel, and the polylines between the runs represent the connections that exist between these similarly colored runs within the connected domain. Likewise, background image B2 can be jointly represented by connected domains C2, C3, and C4. Assuming the edge effects of the figure are ignored, the CRAG algorithm yields the set of connected domains {C_n | n = 1, 2, ..., K} composing image A, with K = 3; h_1 and v_1 are the height and width of C1, and h_2 and v_2 the height and width of C2. To better illustrate the characteristics of the algorithm, the character foreground here uses a gradient color. H and V denote the height and width of the original image, respectively.

Color clustering

Color is an important criterion for distinguishing character foreground from background. For the human eye to read it, the color of a character generally differs considerably from that of the background. Separating regions of different colors onto different image layers facilitates extraction of the text-character regions, and the color clustering step achieves this goal.

After the connected domains have been obtained, specific screening criteria based on the characteristics of the foreground select the qualifying connected domains, whose average colors are clustered to yield a set of cluster centers; each cluster center represents and constitutes one color layer. Each connected domain is then assigned to the layer of the cluster center its color is closest to.

General clustering algorithms require the number of cluster centers to be known in advance, but in the application of the present invention this number cannot be determined beforehand. Moreover, connected domains whose color difference exceeds a predetermined value must be assigned to different layers, so that the background and foreground of the text are separated. Therefore, a method of selecting initial cluster centers is adopted here; the clustering method is as follows:

Arbitrarily select the color of one connected domain as the initial center and compute the Euclidean distance in RGB color space between the other connected domains and this center. If the distance is less than the threshold TC, add the domain to the cluster and recompute the mean r, g, b values as the cluster's center color; if it is greater than TC, create a second, new center. All samples are processed in this way; since the positions of the color centers keep changing, centers whose mutual distance is less than TC must also be merged, finally yielding an appropriate number of color cluster centers.
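This center-selection scheme is essentially sequential leader clustering with a final merge pass; a minimal sketch follows (an assumed simplification, not the patent's exact implementation; TC = 45 follows the value recommended later in the text, and all names are ours):

```python
import math

def cluster_colors(colors, tc=45):
    """Cluster the average RGB colors of connected domains.

    A color joins the nearest existing center when the Euclidean distance
    is below TC, updating that center's running mean; otherwise it starts
    a new center.  A final pass merges centers closer than TC to each other.
    """
    centers = []  # each entry: [r, g, b, count]
    for r, g, b in colors:
        best, best_d = None, tc
        for c in centers:
            d = math.dist((r, g, b), c[:3])
            if d < best_d:
                best, best_d = c, d
        if best is None:
            centers.append([float(r), float(g), float(b), 1])
        else:
            n = best[3]
            best[0] = (best[0] * n + r) / (n + 1)
            best[1] = (best[1] * n + g) / (n + 1)
            best[2] = (best[2] * n + b) / (n + 1)
            best[3] = n + 1
    # merge centers whose distance fell below TC as the means drifted
    merged = []
    for c in centers:
        for m in merged:
            if math.dist(c[:3], m[:3]) < tc:
                n, k = m[3], c[3]
                for i in range(3):
                    m[i] = (m[i] * n + c[i] * k) / (n + k)
                m[3] = n + k
                break
        else:
            merged.append(c)
    return [tuple(m[:3]) for m in merged]
```

Two near-black domains and one near-white domain, for instance, collapse into two centers.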

Some special connected domains cannot be text blocks, so a screening is performed in advance. The criteria for selecting the connected-domain samples participating in clustering are as follows:

1) Hmin < h_n < Hmax, Vmin < v_n < Vmax;

2) H_Vmin < h_n / v_n < H_Vmax, or V_Hmin < v_n / h_n < V_Hmax;

3) Q_2 > (Σ_{u=1}^{m_n} f_{p_u}) / (h_n × v_n) > Q_1, where (Σ_{u=1}^{m_n} f_{p_u}) / (h_n × v_n) denotes the pixel density of the connected domain.

In the above formula, h_n and v_n denote the height and width of the obtained color connected domain, respectively.

In 1), the stroke heights and widths of the characters in the test images are mostly smaller than the image height H and width V, respectively; here height refers to the number of vertical pixels of the image and width to the number of horizontal pixels. The maximum height and width of candidate connected domains are set to Hmax = min(H, 400) and Vmax = min(V, 400): the font size of text characters in current color printed documents is mostly below 120 pt, and in color images scanned at 300 dpi the maximum height and width of such character strokes are both below 400 pixels; the actual height and width of the text-region image are also taken into account. Hmin and Vmin are the minimum height and width of the connected-domain samples participating in color clustering. Experiments show that too large a value lowers the recall rate for small fonts, so for broad applicability the value 3 is used here; this removes the interference of a large number of noise points while preserving the images of punctuation marks well.

In 2), H_Vmin and H_Vmax are the minimum and maximum of the height-to-width ratio of the connected domain; similarly, V_Hmin and V_Hmax are the minimum and maximum of the width-to-height ratio. Given the characteristics of strokes, a minimum of 1 and a maximum of 50 suffice.

In 3), with Q_1 = 0.3 and Q_2 = 0.8, the influence of image borders and other long, narrow thin-line edges is partly excluded. Note that Q_1 and Q_2 may still vary by about ±0.2 around these values, i.e., Q_1 may take values in the range 0.1-0.5 and Q_2 in the range 0.6-1.

In addition, the threshold TC mentioned above may take values between 20 and 50, but a small TC produces too many layers. TC = 45 is therefore used, reducing the number of generated image layers and the computational cost; this is a good choice for extracting text characters from color images and effectively suppresses noise points.

Different settings of the above parameters change the number of connected domains used for clustering and also the number of generated color centers. Too narrow a setting reduces computation and increases speed, but causes sticking in individual cases where background and foreground colors are too close; too wide a setting generates too many color centers and increases computation. Experiments show that choosing values within the parameter ranges mentioned above yields good text-character extraction results. Moreover, these constraints further reduce the computation of the initial clustering and at the same time remove some of the noise color centers. Compared with directly applying C-means clustering, the reduced number of cluster samples lowers the clustering workload, while avoiding the problem of fuzzy C-means, whose smoothing process loses text characters that occupy few pixels.

Image layering

After the connected-domain color clustering, compute the Euclidean distance between each connected domain and the cluster centers. If the distance is less than TC, the domain is assigned to that center's layer; connected domains of similar color thus end up on the same layer, and the different image layers are generated.
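This layer assignment is a nearest-center lookup bounded by TC, which can be sketched as (an illustrative sketch; names are ours, TC = 45 as recommended above):

```python
import math

def assign_layers(domains, centers, tc=45):
    """Map each connected domain (given by its average RGB color) to the
    index of the closest color center, or None if no center is within TC.

    Domains mapped to the same index end up on the same image layer."""
    layers = []
    for color in domains:
        best_i, best_d = None, tc
        for i, center in enumerate(centers):
            d = math.dist(color, center)
            if d < best_d:
                best_i, best_d = i, d
        layers.append(best_i)
    return layers
```

With centers black and white, a dark domain maps to layer 0, a bright one to layer 1, and a mid-gray domain to no layer at all.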

Some connected-domain screening criteria are also needed when generating the text-character layers. However, printed text is mostly set between 10pt and 12pt, and because of the point-spread effect in color images the stroke connected domains of text characters are relatively small; punctuation marks must also be taken into account. Therefore, to avoid stroke breaks caused by the loss of small connected domains, the screening criteria used when generating the character layers differ from those used for color clustering.

In this step, every connected domain whose height or width is smaller than the height or width of the text-region image is compared with the color centers; if the Euclidean distance between the average color of the connected domain and a color center is less than TC, the connected domains satisfying this condition are placed on one image layer. Multiple layers are obtained in this way, so the text-character image may exist on one or more layers. In addition, if there is a connected domain whose height and width equal the height and width of the text-region image, the layer containing it is designated the background layer. (To facilitate the subsequent segmentation and recognition work, all generated layers are converted to images of black characters on a white background.) Some non-text-character layers are then excluded first, using the following criteria:

1) the number of pixels in each text layer must exceed 200; otherwise it is designated a noise layer;

2) if the height and width of a connected domain C are roughly equal to the size of the test image, the center color of C is taken as the background color, and the layer containing it is the background layer;

3) if, after screening by 1) and 2), more than L layers remain, then, assuming there are no more than L foreground colors, the L+2 layers with the largest total numbers of black pixels are kept. The foreground refers to the text-character images contained in the whole image, and the foreground color to the approximate color of those images; everything in the image other than the text-character images is called background.

Here L can be chosen according to the actual situation; the present invention generally takes L = 4. Values in this range effectively reduce the noise and background layers among the candidate character layers while avoiding the loss of the character layer. After the noise layer, the background layer, and so on are deleted by the above criteria, the remaining layers are considered image layers that may contain text characters.
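The layer-screening rules above can be sketched as one filtering pass. The dictionary field names (`pixels`, `max_cc`) are hypothetical, chosen only to make the sketch self-contained; `pixels` is the layer's black-pixel count and `max_cc` the (height, width) of its largest connected domain.

```python
def select_candidate_layers(layers, img_h, img_w, L=4):
    """Filter image layers down to candidate text-character layers.

    layers: list of dicts with 'pixels' (black-pixel count) and
    'max_cc' (height, width of the largest connected domain).
    Applies the three rules from the text: noise layers (< 200 pixels),
    background layers (domain as large as the image), then keep at
    most the L+2 layers with the most black pixels.
    """
    kept = []
    for i, layer in enumerate(layers):
        if layer['pixels'] < 200:              # rule 1: noise layer
            continue
        h, w = layer['max_cc']
        if h >= img_h and w >= img_w:          # rule 2: background layer
            continue
        kept.append(i)
    if len(kept) > L:                          # rule 3: keep top L+2 by pixel count
        kept.sort(key=lambda i: layers[i]['pixels'], reverse=True)
        kept = sorted(kept[:L + 2])
    return kept
```

The returned indices are the layers passed on to the size-consistency judgment.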

As shown in Figure 4, a is the original text-region image, and b, c, d, e, f, g, h are the 7 image layers generated from the 7 color centers obtained by clustering the average colors of the connected domains; for ease of processing, each layer has been converted to a black-and-white image. According to the above criteria, the six layers b, c, d, e, f, g, whose pixel counts rank in the top six, are selected. Since there are still too many candidate layers, further character-layer judgment criteria for common situations are given below.

Character layer selection

Since the present invention does not involve character segmentation and recognition, and the system generally requires that no segmentation information be introduced at the character-image extraction stage, a simple and practical method is needed for automatic character-layer judgment. Analysis of text characters in printed documents shows two obvious characteristics:

●  the text characters within a text-region image are basically the same size;

●  the text characters are arranged fairly regularly.

The present invention uses these characteristics to define a size-consistency criterion for character-layer selection.

The size-consistency criterion provided by the present invention mainly uses the sizes of the two-directional pixel projections of an image layer; it is designed for the projection of a single line of text, or for multiple lines with no vertical overlap, and does not consider more complex cases. More complex cases would require more elaborate segmentation steps to obtain the sizes of the character blocks, whereas the present invention makes only a preliminary judgment before passing the character layer to subsequent segmentation and recognition. The original text-region image is therefore also required to satisfy the following conditions:

●  the text color within the original text-region image is uniform;

●  the total number of text rows or columns does not exceed three, and the rows and columns are regular, i.e. lie approximately on a straight line.

Under these conditions, automatic character-layer judgment by the size-consistency principle defined in the present invention achieves a high character-layer discrimination accuracy.

For ease of explanation, take layer c in Figure 4 as an example. Referring to Figure 5, let the height of the image in the vertical direction be H and its width in the horizontal direction be V. Color layering yields K layers. For layer i (1 ≤ i ≤ K), projections are made in the horizontal and vertical directions, giving the horizontal projection widths u_il (0 ≤ l < N_i) and the vertical projection widths w_ij (0 ≤ j < M_i), where i is the index of the image layer, l the index of a horizontal projection width, and j the index of a vertical projection width. To suppress small noise, the number of projected black pixels at each coordinate position must exceed 5. At the same time, only projections wider than 10 pixels are counted in each direction, so N_i and M_i are the total numbers of qualifying projection widths obtained in the two directions. The distance between two adjacent projection widths in the horizontal direction is the horizontal projection gap width e_is (0 ≤ s < Z_i); the distance between two adjacent projection widths in the vertical direction is the vertical projection gap width d_it (0 ≤ t < Y_i); Z_i and Y_i are the total numbers of projection gap widths obtained in the two directions. From these results, the average projection widths on layer i can be computed:

average width of the horizontal projections: AvgH_i = (1/N_i) Σ_{l=0}^{N_i−1} u_il; average width of the vertical projections: AvgW_i = (1/M_i) Σ_{j=0}^{M_i−1} w_ij.

The average projection gap widths on layer i:

average width of the horizontal projection gaps: AvgE_i = (1/Z_i) Σ_{s=0}^{Z_i−1} e_is; average width of the vertical projection gaps: AvgD_i = (1/Y_i) Σ_{t=0}^{Y_i−1} d_it.

The variance of the horizontal projection widths on this layer is VarH_i = Σ_{l=0}^{N_i−1} (u_il − AvgH_i)² / N_i, and the variance of the vertical projection widths is VarW_i = Σ_{j=0}^{M_i−1} (w_ij − AvgW_i)² / M_i;

the variance of the horizontal projection gap widths is VarE_i = Σ_{s=0}^{Z_i−1} (e_is − AvgE_i)² / Z_i, and the variance of the vertical projection gap widths is VarD_i = Σ_{t=0}^{Y_i−1} (d_it − AvgD_i)² / Y_i.
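The projection widths and gap widths defined above can be extracted from a one-dimensional projection profile as follows. The function name and return layout are illustrative; the two thresholds match the text (more than 5 black pixels per position, runs wider than 10 pixels).

```python
def projection_runs(profile, min_count=5, min_width=10):
    """Turn a 1-D projection profile into projection widths and gap widths.

    profile: black-pixel count at each coordinate in one direction.
    A position counts as ink only if it holds more than min_count pixels,
    and only runs wider than min_width pixels are kept as projection
    widths (u_il / w_ij); distances between kept runs are the gap
    widths (e_is / d_it).
    """
    ink = [count > min_count for count in profile]
    widths, gaps = [], []
    run, prev_end = 0, None
    for x, on in enumerate(ink + [False]):   # trailing sentinel flushes the last run
        if on:
            run += 1
        elif run:
            if run > min_width:
                start = x - run
                if prev_end is not None:
                    gaps.append(start - prev_end)   # distance to previous kept run
                widths.append(run)
                prev_end = x
            run = 0
    return widths, gaps
```

Applying it to the horizontal and vertical profiles of layer i yields (u_il, e_is) and (w_ij, d_it) respectively.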

Analysis of the connected domains of text characters shows that their sizes are basically uniform and their distribution fairly even. Based on these physical characteristics, the size-consistency criterion P_i of a layer can be defined as follows (1 ≤ i ≤ K):

P_i = [min(AvgH_i/AvgW_i, AvgW_i/AvgH_i) × H × V] / [(1 + |max(N_i, M_i) − max(H/V, V/H)| / 2) × (1 + max(VarE_i, VarD_i)) × (1 + max(VarH_i, VarW_i))]

max() and min() denote the maximum and minimum of the two values in parentheses, respectively.

The size-consistency criterion P_i of each layer is computed and the layers are sorted by its value; the largest corresponds to the most likely text-character layer. The experimental results also show that, within a certain range, the size-consistency criterion meets the system's requirement for automatic character-layer discrimination, and at the same time provides the system with a ranking of the candidate layers that facilitates subsequent processing.
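The P_i computation for one layer can be sketched directly from the formula; the population variance (Σ(x − mean)²/n) matches the variance definitions in the text. The function signature is an illustrative assumption.

```python
from statistics import pvariance

def consistency_score(H, V, widths_h, widths_v, gaps_h, gaps_v):
    """Size-consistency criterion P_i for one candidate layer.

    widths_h / widths_v: qualifying projection widths (u_il, w_ij);
    gaps_h / gaps_v: gap widths between them (e_is, d_it);
    H, V: image height and width. Follows the P_i formula in the text.
    """
    avg_h = sum(widths_h) / len(widths_h)
    avg_w = sum(widths_v) / len(widths_v)
    var_h, var_w = pvariance(widths_h), pvariance(widths_v)
    var_e = pvariance(gaps_h) if gaps_h else 0.0
    var_d = pvariance(gaps_v) if gaps_v else 0.0
    num = min(avg_h / avg_w, avg_w / avg_h) * H * V
    den = ((1 + abs(max(len(widths_h), len(widths_v)) - max(H / V, V / H)) / 2)
           * (1 + max(var_e, var_d))
           * (1 + max(var_h, var_w)))
    return num / den
```

Computing this score for every candidate layer and sorting in descending order gives the ranking used to pick the text-character layer: uniform, evenly spaced projections score high, while noisy layers are penalized by the variance factors.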

Table 1 gives the consistency criterion for the six candidate character layers b, c, d, e, f, g of the original text-region image a in Figure 4; according to P_i, layer c is the generated text-character layer. Comparing c and e in Figure 4, it is easy to see that e mostly contains the outline edges of the text characters, so its consistency criterion ranks second. It follows that the candidate layers can be sorted by the value of P_i.

Table 1. Consistency criterion of each layer of image a

Layer   b        c        d        e      f        g
P_i     11.394   82.948   21.704   47.1   10.289   4.819

Since segmentation and recognition fall outside the scope of the present invention, they are not elaborated here.

Sample libraries

To verify the advantages of the method, several sample libraries were built from common color printed document images, as shown in Table 2.

Table 2. Sample library statistics

Name                                         Text-region image blocks   Characters
Color magazine sample library (title set)    47                         1224
Color magazine sample library (body set)     30                         5420
Color newspaper sample library               39                         551
Color photo image library                    52                         664

Experimental results

Table 3 gives the comparison results of several methods.

Table 3. Comparison of the numbers of correctly extracted characters

Sample (total characters)              CRAG method   Direct color clustering   Connectivity analysis   Local adaptive dynamic thresholding
Color magazine title library (1224)    1156          732                       905                     847
Color newspaper sample library (551)   500           457                       318                     143
Color photo sample library (664)       631           578                       357                     277

In summary, the CRAG method has the following advantages:

●  the algorithm is simple and effectively overcomes the influence of background noise variation;

●  color clustering in units of connected domains makes the text easier to separate and reduces the computation;

●  reverse-video (light-on-dark) text and multi-colored text are handled automatically;

●  character images whose foreground color varies widely can be extracted, including characters whose color changes gradually, whether due to the characters themselves or to illumination;

●  it is little affected by edge-transition effects, avoiding the loss of small characters;

●  the character color information is retained;

●  a wide range of material can be processed, such as color magazines, newspapers, and photographic images.

The invention achieved excellent recognition results in experiments and has very broad application prospects.

Claims (1)

1. A method for extracting characters from a color image with a complex background based on a run-length adjacency graph, characterized in that it comprises the following steps in sequence:
(1) scanning a color printed document or photographic image into an image processor through an image capture device;
(2) setting, in the image processor:
the height and width of the image, denoted by the symbols H and V respectively;
the threshold TD on the Euclidean distance o_pq in RGB space between each pixel in a row of the image and the color run immediately adjacent to it in the same row;
counting from the second row of the image, the threshold TV on the Euclidean distance o_pp' in RGB space between the current color run and the 4-neighborhood-connected color run in the preceding adjacent row; choose TD = TV = 12~16;
the threshold TC on the Euclidean distance o_cn in RGB color space between the initial center of a connected domain and the other connected domains in the set of all connected domains composing the image; choose TC = 20~50;
the maximum height of candidate connected domains, Hmax = min(H, 400), in pixels;
the maximum width of candidate connected domains, Vmax = min(V, 400), in pixels;
the minimum height of candidate connected domains, Hmin = 3, in pixels;
the minimum width of candidate connected domains, Vmin = 3, in pixels;
the minimum height-to-width or width-to-height ratio of candidate connected domains, equal to 1, and the maximum, equal to 50;
the pixel density of each connected domain, expressed as (Σ_{u=1}^{m_n} f_{p_u}) / (h_n × v_n), where h_n and v_n are the height and width of the color connected domain, m_n is the number of color runs in the n-th connected domain, and f_{p_u} is the run length of the p_u-th run; set Q_2 > (Σ_{u=1}^{m_n} f_{p_u}) / (h_n × v_n) > Q_1, with Q_1 = 0.1~0.5 and Q_2 = 0.6~1;
the threshold TC = 20~50 in the connected-domain color clustering process;
the number K of candidate color layers obtained, K ≤ L + 2, with L = 4;
(3) segmenting the color image to obtain color connected domains, i.e. describing an image as a set of connected domains;
(3.1) starting from the first pixel of each row, treating that pixel as the start of a new run, and computing the Euclidean distance o_pq in RGB space between this starting point and the pixel immediately adjacent to it in the same row, where a color run is expressed as R_p{(r_p, g_p, b_p), (x_p, y_p), f_p}: r_p, g_p, b_p are the mean values of the r, g, b color components in RGB color space of the points on the run, (x_p, y_p) is the starting coordinate of the run, and f_p is the run length:
o_pq = sqrt((r_q − r_p)² + (g_q − g_p)² + (b_q − b_p)²).
If o_pq < TD, the two pixels are merged into one run, and the mean r, g, b values of the run, i.e. r_p, g_p, b_p, are computed:
r_p = (r_p × f_p + r_q) / (f_p + 1); g_p = (g_p × f_p + g_q) / (f_p + 1); b_p = (b_p × f_p + b_q) / (f_p + 1);
and the run length increases by 1: f_p = f_p + 1.
Otherwise, the second pixel becomes the start of a new run; the Euclidean distance between the run and the next adjacent pixel is then computed, and if it is still less than TD that pixel is added to the run and its r, g, b values are recomputed; otherwise, that pixel serves as the starting point of the next new run. Traversing all pixels of each row of the image according to this rule yields a number of color runs;
(3.2) after obtaining a color run starting from the second row of the image, computing the Euclidean distance o_pp' in RGB space between this run and the 4-neighborhood-connected color run in the preceding adjacent row:
o_pp' = sqrt((r_p' − r_p)² + (g_p' − g_p)² + (b_p' − b_p)²).
Whether this distance is less than TV is judged; if so, the runs are merged into the same connected domain, i.e. the two runs are connected; otherwise, the run becomes the initial run of a new connected domain. After the whole image has been traversed in this way, the set {C_n | n = 1, 2, ..., K} of all connected domains composing the image is obtained from the connection relations between the runs.
A connected domain is represented by the following structure:
C_n{(r_n, g_n, b_n), X_n, (v_n, h_n)}, where (r_n, g_n, b_n) are the average r, g, b color values of connected domain C_n:
r_n = Σ_{u=1}^{m_n} (r_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}    (1-2)
g_n = Σ_{u=1}^{m_n} (g_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}    (1-3)
b_n = Σ_{u=1}^{m_n} (b_{p_u} × f_{p_u}) / Σ_{u=1}^{m_n} f_{p_u}    (1-4)
X_n = {R_{p_u} | u = 1, 2, ..., m_n} represents the set of color runs contained in the connected domain, and the height h_n and width v_n of the connected domain are easily obtained by simple computation;
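The row-wise run growing of step (3.1) can be sketched as follows. This is an illustrative sketch, not the claimed implementation: the function name and data layout are assumptions, and TD = 14 is one pick from the 12~16 range given in the claim.

```python
import math

def row_to_runs(row, td=14.0):
    """Step (3.1) sketch: grow color runs along one image row.

    row: list of (r, g, b) pixels. A pixel joins the current run when
    its RGB Euclidean distance to the run's mean color is below TD;
    otherwise it starts a new run. Returns (mean_color, start_x, length)
    per run, mirroring the R_p{(r_p, g_p, b_p), (x_p, y_p), f_p} structure.
    """
    runs = []
    for x, (r, g, b) in enumerate(row):
        if runs:
            (rp, gp, bp), x0, f = runs[-1]
            if math.dist((r, g, b), (rp, gp, bp)) < td:
                # running-mean update from the claim: r_p = (r_p*f + r_q)/(f+1)
                runs[-1] = (((rp * f + r) / (f + 1),
                             (gp * f + g) / (f + 1),
                             (bp * f + b) / (f + 1)), x0, f + 1)
                continue
        runs.append(((float(r), float(g), float(b)), x, 1))
    return runs
```

Step (3.2) would then merge 4-neighborhood-adjacent runs of consecutive rows into connected domains with the same distance test against TV.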
(4) performing color clustering on the connected domains to obtain a suitable number of color cluster centers;
the connected-domain samples that participate in color clustering are selected by the following three criteria simultaneously:
1) Hmin < h_n < Hmax and Vmin < v_n < Vmax, i.e. the height and width of a connected domain participating in color clustering must both lie in the ranges set above;
2) H_Vmin < h_n / v_n < H_Vmax, or V_Hmin < v_n / h_n < V_Hmax, where H_Vmin and H_Vmax are the minimum and maximum values of the height-to-width ratio of a connected domain, and likewise V_Hmin and V_Hmax are the minimum and maximum values of the width-to-height ratio;
3) Q_2 > (Σ_{u=1}^{m_n} f_{p_u}) / (h_n × v_n) > Q_1, i.e. the pixel density of the connected domain lies between Q_1 and Q_2;
(5) forming image layers, deleting from them the noise layers and obvious background layers, and obtaining the image layers containing text;
(5.1) forming the image layers:
all connected domains whose height or width is respectively smaller than the height or width of the text-region image are compared with the color centers; if the Euclidean distance between the average color of a connected domain and a color center is less than TC, the connected domains satisfying this condition are placed on one image layer, so that several layers are obtained; at the same time all of them are converted to images of black characters on a white background;
(5.2) excluding non-text character layers in turn according to the following criteria:
1) when the number of pixels of a character layer is less than 200, it is designated a noise layer and excluded;
2) if the height and width of a connected domain are comparable to the size of the test image, the center color of this connected domain is taken as the background color, and the layer it belongs to is the background layer;
(5.3) under the condition that there are no more than L foreground colors, if the number of remaining image layers is greater than L, selecting the L + 2 layers with the largest total numbers of black pixels as the layers that may contain text-character images, to be processed in the following steps; the foreground refers to the text-character images contained in the whole image, the foreground color refers to the approximate color of these text-character images, and the parts of the image other than the text-character images are all called background;
(6) computing, according to the consistency criterion formula, the consistency decision value P_i (1 ≤ i ≤ K) of each possible text-character image layer obtained in step (5.3), K being the number of layers above, and sorting them; the layer with the largest P_i value is the most probable text-character layer;
(6.1) projecting each of the K layers in the horizontal and vertical directions to obtain the horizontal projection widths u_il (0 ≤ l < N_i) and the vertical projection widths w_ij (0 ≤ j < M_i), where i is the index of the image layer, l is the index of a horizontal projection width, and j is the index of a vertical projection width; to eliminate small noise, the number of projected black pixels at each coordinate position must exceed 5; at the same time, only projections wider than 10 pixels are counted in the two directions, giving N_i and M_i, i.e. N_i and M_i are the total numbers of qualifying projection widths obtained in the two directions; the distance between two adjacent projection widths in the horizontal direction is the horizontal projection gap width e_is (0 ≤ s < Z_i), and the distance between two adjacent projection widths in the vertical direction is the vertical projection gap width d_it (0 ≤ t < Y_i), Z_i and Y_i being the total numbers of projection gap widths obtained in the two directions;
(6.2) computing the following values:
the average width of the horizontal projections, AvgH_i = (1/N_i) Σ_{l=0}^{N_i−1} u_il,
the average width of the vertical projections, AvgW_i = (1/M_i) Σ_{j=0}^{M_i−1} w_ij,
the average width of the horizontal projection gaps, AvgE_i = (1/Z_i) Σ_{s=0}^{Z_i−1} e_is,
the average width of the vertical projection gaps, AvgD_i = (1/Y_i) Σ_{t=0}^{Y_i−1} d_it,
the variance of the horizontal projection widths, VarH_i = Σ_{l=0}^{N_i−1} (u_il − AvgH_i)² / N_i,
the variance of the vertical projection widths, VarW_i = Σ_{j=0}^{M_i−1} (w_ij − AvgW_i)² / M_i,
the variance of the horizontal projection gap widths, VarE_i = Σ_{s=0}^{Z_i−1} (e_is − AvgE_i)² / Z_i,
the variance of the vertical projection gap widths, VarD_i = Σ_{t=0}^{Y_i−1} (d_it − AvgD_i)² / Y_i;
(6.3) when the text color in the original text-region image is uniform, the total number of text rows or columns is no more than three, and the text in the row or column direction lies approximately on a straight line, computing the consistency criterion value P_i as follows:
P_i = [min(AvgH_i/AvgW_i, AvgW_i/AvgH_i) × H × V] / [(1 + |max(N_i, M_i) − max(H/V, V/H)| / 2) × (1 + max(VarE_i, VarD_i)) × (1 + max(VarH_i, VarW_i))]
where i is the layer index, i = 1, ..., K;
sorting the obtained P_i by magnitude and taking the layer with the largest value as the text layer for use in character segmentation and recognition.
CNB2004100622612A 2004-07-02 2004-07-02 Character extracting method from complecate background color image based on run-length adjacent map Expired - Fee Related CN1312625C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100622612A CN1312625C (en) 2004-07-02 2004-07-02 Character extracting method from complecate background color image based on run-length adjacent map


Publications (2)

Publication Number Publication Date
CN1588431A true CN1588431A (en) 2005-03-02
CN1312625C CN1312625C (en) 2007-04-25

Family

ID=34603673

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100622612A Expired - Fee Related CN1312625C (en) 2004-07-02 2004-07-02 Character extracting method from complecate background color image based on run-length adjacent map

Country Status (1)

Country Link
CN (1) CN1312625C (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100442287C (en) * 2005-04-20 2008-12-10 国际商业机器公司 Method and device for processing data flow
CN100487722C (en) * 2005-08-26 2009-05-13 欧姆龙株式会社 Method for determining connection sequence of cascade classifiers with different features and specific threshold
CN100501762C (en) * 2007-04-17 2009-06-17 华东师范大学 A Method for Fast Labeling of Connected Components in Images
CN100555323C (en) * 2005-04-13 2009-10-28 佳能株式会社 The method and apparatus that is used for smoothed binary image
CN101599124B (en) * 2008-06-03 2011-06-22 汉王科技股份有限公司 Method and device for segmenting characters from video image
CN102332097A (en) * 2011-10-21 2012-01-25 中国科学院自动化研究所 Method for segmenting complex background text images based on image segmentation
CN102446274A (en) * 2010-09-30 2012-05-09 汉王科技股份有限公司 Underlined text image preprocessing method and device
CN101753764B (en) * 2008-12-17 2012-09-26 夏普株式会社 Image processing apparatus and method, image reading apparatus, and image sending method
CN101436248B (en) * 2007-11-14 2012-10-24 佳能株式会社 Method and equipment for generating text character string according to image
CN101436254B (en) * 2007-11-14 2013-07-24 佳能株式会社 Image processing method and image processing equipment
CN103281474A (en) * 2013-05-02 2013-09-04 武汉大学 Image and text separation method for scanned image of multifunctional integrated printer
US8768017B2 (en) 2008-11-07 2014-07-01 Olympus Corporation Image processing apparatus, computer readable recording medium storing therein image processing program, and image processing method
CN104463138A (en) * 2014-12-19 2015-03-25 深圳大学 Text positioning method and system based on visual structure attribute
CN104899586A (en) * 2014-03-03 2015-09-09 阿里巴巴集团控股有限公司 Method for recognizing character contents included in image and device thereof
CN104951741A (en) * 2014-03-31 2015-09-30 阿里巴巴集团控股有限公司 Character recognition method and device thereof
CN102725773B (en) * 2009-12-02 2015-12-02 惠普发展公司,有限责任合伙企业 System and method for foreground and background segmentation of digitized images
CN107220949A (en) * 2017-05-27 2017-09-29 安徽大学 The self adaptive elimination method of moving vehicle shade in highway monitoring video
CN110619331A (en) * 2019-09-20 2019-12-27 江苏鸿信系统集成有限公司 Color distance-based color image field positioning method
CN112906469A (en) * 2021-01-15 2021-06-04 上海至冕伟业科技有限公司 Fire-fighting sensor and alarm equipment identification method based on building plan
CN114327341A (en) * 2021-12-31 2022-04-12 江苏龙冠影视文化科技有限公司 Remote interactive virtual display system
CN114913248A (en) * 2022-07-18 2022-08-16 南通金丝楠膜材料有限公司 Self-adaptive control method of corona machine in film production process
CN118506348A (en) * 2024-06-19 2024-08-16 北京蓝湾博阅科技有限公司 A method and system for identifying short play content based on computer vision recognition
CN118628752A (en) * 2024-08-12 2024-09-10 武汉同创万智数字科技有限公司 Garden maintenance information processing system based on image processing

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102136064A (en) * 2011-03-24 2011-07-27 成都四方信息技术有限公司 System for recognizing characters from image
CN102332088B (en) * 2011-06-22 2014-10-29 浙江工业大学 Vote symbolic machine visual identification method based on run length feature

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002185782A (en) * 2000-12-14 2002-06-28 Pfu Ltd Character extraction device, character extraction method, and recording medium

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100555323C (en) * 2005-04-13 2009-10-28 佳能株式会社 The method and apparatus that is used for smoothed binary image
CN100442287C (en) * 2005-04-20 2008-12-10 国际商业机器公司 Method and device for processing data flow
CN100487722C (en) * 2005-08-26 2009-05-13 欧姆龙株式会社 Method for determining connection sequence of cascade classifiers with different features and specific threshold
CN100501762C (en) * 2007-04-17 2009-06-17 华东师范大学 A Method for Fast Labeling of Connected Components in Images
CN101436254B (en) * 2007-11-14 2013-07-24 佳能株式会社 Image processing method and image processing equipment
CN101436248B (en) * 2007-11-14 2012-10-24 佳能株式会社 Method and equipment for generating text character string according to image
CN101599124B (en) * 2008-06-03 2011-06-22 汉王科技股份有限公司 Method and device for segmenting characters from video image
CN101794434B (en) * 2008-11-07 2016-03-16 奥林巴斯株式会社 Image processing apparatus and image processing method
US8768017B2 (en) 2008-11-07 2014-07-01 Olympus Corporation Image processing apparatus, computer readable recording medium storing therein image processing program, and image processing method
CN101753764B (en) * 2008-12-17 2012-09-26 夏普株式会社 Image processing apparatus and method, image reading apparatus, and image sending method
CN102725773B (en) * 2009-12-02 2015-12-02 惠普发展公司,有限责任合伙企业 System and method for foreground and background segmentation of digitized images
CN102446274A (en) * 2010-09-30 2012-05-09 汉王科技股份有限公司 Underlined text image preprocessing method and device
CN102446274B (en) * 2010-09-30 2014-04-16 汉王科技股份有限公司 Underlined text image preprocessing method and device
CN102332097A (en) * 2011-10-21 2012-01-25 中国科学院自动化研究所 Method for segmenting complex background text images based on image segmentation
CN102332097B (en) * 2011-10-21 2013-06-26 中国科学院自动化研究所 Method for segmenting complex background text images based on image segmentation
CN103281474A (en) * 2013-05-02 2013-09-04 武汉大学 Image and text separation method for scanned image of multifunctional integrated printer
CN103281474B (en) * 2013-05-02 2015-04-15 武汉大学 Image and text separation method for scanned image of multifunctional integrated printer
CN104899586A (en) * 2014-03-03 2015-09-09 阿里巴巴集团控股有限公司 Method for recognizing character contents included in image and device thereof
CN104899586B (en) * 2014-03-03 2018-10-12 阿里巴巴集团控股有限公司 Method and device for recognizing character content included in an image
CN104951741A (en) * 2014-03-31 2015-09-30 阿里巴巴集团控股有限公司 Character recognition method and device thereof
CN104463138A (en) * 2014-12-19 2015-03-25 深圳大学 Text positioning method and system based on visual structure attribute
CN104463138B (en) * 2014-12-19 2018-08-28 深圳大学 Text positioning method and system based on visual structure attributes
CN107220949A (en) * 2017-05-27 2017-09-29 安徽大学 Self-adaptive elimination method for moving vehicle shadows in highway monitoring video
CN110619331A (en) * 2019-09-20 2019-12-27 江苏鸿信系统集成有限公司 Color distance-based color image field positioning method
CN112906469A (en) * 2021-01-15 2021-06-04 上海至冕伟业科技有限公司 Fire-fighting sensor and alarm equipment identification method based on building plan
CN114327341A (en) * 2021-12-31 2022-04-12 江苏龙冠影视文化科技有限公司 Remote interactive virtual display system
CN114913248A (en) * 2022-07-18 2022-08-16 南通金丝楠膜材料有限公司 Self-adaptive control method of corona machine in film production process
CN118506348A (en) * 2024-06-19 2024-08-16 北京蓝湾博阅科技有限公司 A method and system for identifying short play content based on computer vision recognition
CN118628752A (en) * 2024-08-12 2024-09-10 武汉同创万智数字科技有限公司 Garden maintenance information processing system based on image processing

Also Published As

Publication number Publication date
CN1312625C (en) 2007-04-25

Similar Documents

Publication Publication Date Title
CN1588431A (en) Character extracting method from complecate background color image based on run-length adjacent map
CN1184796C (en) Image processing method and equipment, image processing system and storage medium
CN1119767C (en) Character string extraction apparatus and pattern extraction apparatus
CN1201266C (en) Image processing device and image processing method
CN1253010C (en) Picture compression method and device, and picture coding device and method
CN1290311C (en) Image processing device and method for aligning position between pages of book image data
CN1313963C (en) Character recognition method
CN1516073A (en) Color image processing device and pattern extraction device
CN1269068C (en) Header extracting device and method for extracting header from file image
CN1525733A (en) Boundary detection methods between regions with different characteristics in image data
CN1882035A (en) Image processing apparatus, image processing method, computer program, and storage medium
CN1620094A (en) Image processing apparatus and method for converting image data to predetermined format
CN1696959A (en) Specific subject detection device
CN1882036A (en) Image processing apparatus and method
CN101030257A (en) Document Image Segmentation Method Based on Chinese Character Features
CN1041773C (en) Character recognition method and apparatus based on 0-1 pattern representation of histogram of character image
CN1310182C (en) Method, device and storage medium for enhancing document, image and character recognition
CN1969314A (en) Image processing device and method, storage medium and program thereof
CN1882026A (en) Method of generating information embedded halftone screen code
CN1859541A (en) Image processing apparatus and its control method
CN1099541A (en) image processing device
CN1773501A (en) Image retrieval and formation device, method and program, and recording medium
CN1664859A (en) Image file generation
CN1790377A (en) Reverse character recognition method, quick and accurate block sorting method and text line generation method
CN1841407A (en) Image processing apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2007-04-25

Termination date: 2013-07-02