CN104376300A - Identification method used for intelligent matching of incomplete Chinese characters on basis of grid characteristics

Info

Publication number: CN104376300A
Application number: CN201410607290.6A
Authority: CN
Inventors: 陈旭; 李耘书; 杨翰典; 王越亚; 白维珊
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2014-11-03
Filing date: 2014-11-03
Publication date: 2015-02-25
Anticipated expiration: 2034-11-03
Also published as: CN104376300B

Abstract

The invention discloses a recognition method for intelligently matching incomplete Chinese characters based on grid features. The method comprises the following steps: S1, a restored shredded-paper image is converted into a 0-1 matrix; S2, the image positions of the Chinese characters are located by scanning row by row and column by column with a sub-matrix the size of a complete character; S3, each incomplete Chinese character obtained in step S2 is partitioned by a grid into a plurality of sub-matrices, and features are extracted; S4, the features of each grid sub-matrix of the partitioned incomplete Chinese character are intelligently matched against a standard lexicon for recognition. The method solves the problem that, although shredded-paper restoration performs recognition and matching by machine, splicing errors in rows and columns ultimately prevent incomplete Chinese characters from being recognized.

Description

A Recognition Method for Intelligently Matching Incomplete Chinese Characters Based on Grid Features

Technical field

The invention relates to a recognition method for intelligently matching incomplete Chinese characters based on grid features.

Background

Shredded-paper restoration plays a major role in important fields such as the recovery of judicial evidence, the restoration of historical documents, and the acquisition of military intelligence. It must also be taken into account when handling private information.

As shown in Figures 1 and 2, current shredded-paper restoration mainly uses a splicing algorithm: Chinese characters are stored pixel by pixel in matrix form, and the shredded paper is reassembled according to the paper margins and the degree to which the characters match. Although this method is sound and easy to implement, recognition and matching are performed by machine, and errors occur in both row and column splicing, which ultimately makes some Chinese characters impossible to recognize.

Summary of the invention

The purpose of the present invention is to overcome the deficiencies of the prior art by providing a recognition method for intelligently matching incomplete Chinese characters based on grid features. It addresses the problem that, although shredded-paper restoration performs recognition and matching by machine, errors in row and column splicing ultimately prevent incomplete Chinese characters from being recognized.

The purpose of the present invention is achieved through the following technical solution: a recognition method for intelligently matching incomplete Chinese characters based on grid features, comprising the following steps:

S1: convert the restored shredded-paper image into a 0-1 matrix;

S2: according to the image position locating rules, locate the image positions of the Chinese characters by scanning row by row and column by column with a sub-matrix of full-character size (the size depends on the average character size in the image);

S3: partition each incomplete Chinese character obtained in step S2 by a grid into sub-matrices, and extract features;

S4: intelligently match the features of each grid sub-matrix of the partitioned incomplete Chinese character against a standard lexicon for recognition.

Step S1 uses MATLAB software to convert the restored shredded-paper image.
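The conversion in step S1 can be illustrated with a minimal sketch. The patent performs this step in MATLAB and does not give a threshold or state which pixel value denotes ink; the Python function `to_binary_matrix`, the threshold of 128, and the convention that ink maps to 1 are all assumptions made here for illustration only.

```python
def to_binary_matrix(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 integers) to a 0-1 matrix.

    Assumption: dark pixels (ink) become 1 and light pixels (paper) become 0.
    The patent states neither the polarity nor the threshold.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


# Tiny illustrative image: 255 is white paper, low values are ink.
sample = [
    [255, 255, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 40],
]
binary = to_binary_matrix(sample)
```

A fixed global threshold is the simplest choice; scanned fragments with uneven lighting would likely need an adaptive threshold instead.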

The image position locating rules described in step S2 include:

(1) If the sub-matrix of full-character size contains a region whose width/length equals one character size, one incomplete character is determined and its position is recorded;

(2) If the sub-matrix of full-character size contains a region whose width/length is greater than one character size, one incomplete character is determined and its position is recorded; the scan is then repeated from the two opposite directions (left and right, or top and bottom) to determine a further incomplete character, whose position is also recorded;

(3) If the sub-matrix of full-character size contains a region whose width/length is less than one character size, one incomplete character is determined and its position is recorded.
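The three rules leave some room for interpretation, so the following is only one possible reading, sketched for a single horizontal band of text. It assumes that a "region" is a run of consecutive columns containing ink, that the run width is compared with the character size, and that rule (2) yields one window anchored at the left edge of the run and one anchored at the right edge. The helper names `ink_runs` and `locate_candidates` are hypothetical and do not come from the patent.

```python
def ink_runs(band):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(band[0])
    has_ink = [any(row[c] for row in band) for c in range(width)]
    runs, start = [], None
    for c, flag in enumerate(has_ink):
        if flag and start is None:
            start = c                      # a new run of ink columns begins
        elif not flag and start is not None:
            runs.append((start, c))        # the run ended at column c
            start = None
    if start is not None:
        runs.append((start, width))        # run reaches the right border
    return runs


def locate_candidates(band, char_size):
    """Apply the three width rules to every ink run in one band of text."""
    candidates = []
    for start, end in ink_runs(band):
        if end - start > char_size:
            # Rule (2): wider than one character. Record one window found
            # scanning from the left and one found scanning from the right.
            candidates.append((start, start + char_size))
            candidates.append((end - char_size, end))
        else:
            # Rules (1) and (3): equal to or narrower than one character.
            candidates.append((start, end))
    return candidates


# One-row band with runs of width 3, 5 and 2; the character size is 3.
band = [[1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]]
positions = locate_candidates(band, 3)
```

The same logic could be applied to rows to cover the vertical case. Characters with internal vertical gaps would be split by this simple run detection, which a practical implementation would need to handle.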

Step S3 comprises the following sub-steps:

S31: divide the incomplete Chinese character into multiple sub-matrices according to its size;

S32: analyze each sub-matrix with a wavelet function to extract multiple parameter matrices of the sub-matrix images, and use these parameter matrices together as the feature of the incomplete character.

The recognition method for intelligently matching incomplete Chinese characters based on grid features further includes a sub-step of establishing a standard lexicon: each font size of each complete Chinese character is decomposed by a grid to obtain the multiple sub-rectangles of the standard feature and their multiple parameter matrices, which determine the feature values of a complete Chinese character.

The sub-matrix is a sub-matrix of size 2*2.

The parameter matrices comprise three parameter matrices: a vertical attribute, a horizontal attribute, and a diagonal attribute.
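Grid partitioning and feature extraction can be sketched as follows. The patent does not name the wavelet, so a one-level Haar transform is assumed here because it naturally yields horizontal, vertical, and diagonal detail matrices. Reading "2*2" as a grid of two rows by two columns of blocks is also an assumption, as are the function names `split_grid` and `haar_details` and the sign and scaling conventions of the coefficients.

```python
def split_grid(matrix, rows=2, cols=2):
    """Split a 0-1 matrix into rows*cols blocks, listed row by row.

    Trailing rows or columns that do not divide evenly are dropped.
    """
    block_h = len(matrix) // rows
    block_w = len(matrix[0]) // cols
    return [
        [row[c * block_w:(c + 1) * block_w]
         for row in matrix[r * block_h:(r + 1) * block_h]]
        for r in range(rows)
        for c in range(cols)
    ]


def haar_details(block):
    """One-level 2D Haar detail coefficients of a block.

    Returns (horizontal, vertical, diagonal) matrices. A trailing odd row
    or column is ignored. The naming convention is an assumption.
    """
    horizontal, vertical, diagonal = [], [], []
    for i in range(0, len(block) - 1, 2):
        h_row, v_row, d_row = [], [], []
        for j in range(0, len(block[0]) - 1, 2):
            a, b = block[i][j], block[i][j + 1]
            c, d = block[i + 1][j], block[i + 1][j + 1]
            h_row.append((a + b - c - d) / 2)  # top row minus bottom row
            v_row.append((a - b + c - d) / 2)  # left column minus right column
            d_row.append((a - b - c + d) / 2)  # diagonal difference
        horizontal.append(h_row)
        vertical.append(v_row)
        diagonal.append(d_row)
    return horizontal, vertical, diagonal


# A 4x4 illustrative character image split into four 2x2 blocks.
char = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 1],
]
features = [haar_details(block) for block in split_grid(char)]
```

In this toy example the top-left block contains a horizontal stroke and produces a nonzero horizontal coefficient, while the bottom-left block contains a vertical stroke and produces a nonzero vertical coefficient.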

The multiple font sizes are eight font sizes between size 10 and size 22.

Step S4 comprises the following sub-steps:

S41: compare the multiple grid sub-matrices obtained in step S3 with the standard feature matrix of each complete Chinese character in the standard lexicon;

S42: if the similarity is greater than a certain ratio, judge the incomplete character to be that complete character in the lexicon.

The ratio in step S42 is fifty percent.
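The matching in steps S41 and S42 can be sketched under stated assumptions. The patent does not define how similarity is computed; here it is assumed to be the fraction of grid blocks whose feature vectors agree with the lexicon entry within a tolerance. Returning the highest-scoring entry above the threshold is likewise an assumption, and the feature values in the example lexicon are invented placeholders, not real wavelet output. The names `block_similarity` and `match_character` are hypothetical.

```python
def block_similarity(features, reference, tol=1e-6):
    """Fraction of grid blocks whose feature vectors agree within tol."""
    matched = sum(
        1
        for f, r in zip(features, reference)
        if len(f) == len(r) and all(abs(x - y) <= tol for x, y in zip(f, r))
    )
    return matched / len(reference)


def match_character(features, lexicon, threshold=0.5):
    """Return the best lexicon entry whose similarity exceeds threshold.

    Returns None when no entry is strictly above the threshold.
    """
    best_char, best_score = None, threshold
    for char, reference in lexicon.items():
        score = block_similarity(features, reference)
        if score > best_score:
            best_char, best_score = char, score
    return best_char


# Hypothetical lexicon: one feature vector per grid block (four blocks).
lexicon = {
    "口": [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]],
    "十": [[0, 1, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]],
}

# An incomplete character whose last block was lost to a splicing error.
damaged = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]]
result = match_character(damaged, lexicon)

# Only half of the blocks survive: 0.5 is not greater than the threshold.
half = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]]
no_match = match_character(half, lexicon)
```

With three of four blocks intact the similarity is 0.75, which exceeds the fifty percent ratio; with exactly two of four blocks intact the similarity equals the ratio and, under the strict "greater than" wording of S42, no match is reported.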

The beneficial effects of the present invention are as follows. The invention first converts the restored shredded-paper image into a 0-1 matrix. Then, according to the image position locating rules, it scans row by row and column by column with a sub-matrix of full-character size to locate the image positions of the Chinese characters, judges whether each located region may be an incomplete character, and saves it if so. Finally, it extracts feature vectors of the Chinese characters using a wavelet function and recognizes them against the characters in the lexicon. The invention thus addresses the problem that, although shredded-paper restoration performs recognition and matching by machine, errors in row and column splicing ultimately prevent incomplete Chinese characters from being recognized, and it provides a method for recognizing incomplete Chinese characters.

Description of drawings

Figure 1 is a sample of business correspondence;

Figure 2 shows the restoration result for the shredded sample;

Figure 3 is a flow chart of the method of the present invention.

Detailed description

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings. As shown in Figure 3, a recognition method for intelligently matching incomplete Chinese characters based on grid features comprises the following steps:

S2: locate the image positions of the Chinese characters by scanning row by row and column by column with a sub-matrix of full-character size (the size depends on the average character size in the image);

The rules for locating the image positions of the Chinese characters in step S2 comprise the following sub-steps:

S21: if the sub-matrix of full-character size contains a region whose width/length equals one character size, one incomplete character is determined and its position is recorded;

S22: if the sub-matrix of full-character size contains a region whose width/length is greater than one character size, one incomplete character is determined and its position is recorded; the scan is then repeated from the two opposite directions (left and right, or top and bottom) to determine a further incomplete character, whose position is also recorded;

S23: if the sub-matrix of full-character size contains a region whose width/length is less than one character size, one incomplete character is determined and its position is recorded.

The ratio in step S42 is fifty percent.

Claims

1. A recognition method for intelligently matching incomplete Chinese characters based on grid features, characterized in that it comprises the following steps:

S1: converting the restored shredded-paper image into a 0-1 matrix;

S2: according to the image position locating rules, locating the image positions of the Chinese characters by scanning row by row and column by column with a sub-matrix of full-character size;

S3: partitioning each incomplete Chinese character obtained in step S2 by a grid into sub-matrices, and extracting features;

S4: intelligently matching the features of each grid sub-matrix of the partitioned incomplete Chinese character against a standard lexicon for recognition.

2. The recognition method for intelligently matching incomplete Chinese characters based on grid features according to claim 1, characterized in that step S1 uses MATLAB software to convert the restored shredded-paper image.

3. The recognition method for intelligently matching incomplete Chinese characters based on grid features according to claim 1, characterized in that the image position locating rules described in step S2 comprise:

(1) if the sub-matrix of full-character size contains a region whose width/length equals one character size, one incomplete character is determined and its position is recorded;

(2) if the sub-matrix of full-character size contains a region whose width/length is greater than one character size, one incomplete character is determined and its position is recorded; the scan is then repeated from the two opposite directions (left and right, or top and bottom) to determine a further incomplete character, whose position is also recorded;

(3) if the sub-matrix of full-character size contains a region whose width/length is less than one character size, one incomplete character is determined and its position is recorded.

4. The recognition method for intelligently matching incomplete Chinese characters based on grid features according to claim 1, characterized in that step S3 comprises the following sub-steps:

S31: dividing the incomplete Chinese character into multiple sub-matrices according to its size;

S32: analyzing each sub-matrix with a wavelet function to extract multiple parameter matrices of the sub-matrix images, and using these parameter matrices together as the feature of the incomplete character.

5. The recognition method for intelligently matching incomplete Chinese characters based on grid features according to claim 1, characterized in that it further comprises a sub-step of establishing a standard lexicon: each of multiple font sizes of each complete Chinese character is decomposed by a grid to obtain multiple sub-rectangles of the standard feature, multiple parameter matrices of these sub-matrices are extracted with a wavelet function, and the feature values of a complete Chinese character are determined.

6. The recognition method for intelligently matching incomplete Chinese characters based on grid features according to claim 4 or 5, characterized in that the sub-matrix is a sub-matrix of size 2*2.

7. The recognition method for intelligently matching incomplete Chinese characters based on grid features according to claim 4 or 5, characterized in that the parameter matrices comprise three parameter matrices: a vertical attribute, a horizontal attribute, and a diagonal attribute.

8. The recognition method for intelligently matching incomplete Chinese characters based on grid features according to claim 5, characterized in that the multiple font sizes are eight font sizes between size 10 and size 22.

9. The recognition method for intelligently matching incomplete Chinese characters based on grid features according to claim 1, characterized in that step S4 comprises the following sub-steps:

S41: comparing the multiple grid sub-matrices obtained in step S3 with the standard feature matrix of each complete Chinese character in the standard lexicon;

S42: if the similarity is greater than a certain ratio, judging the incomplete character to be that complete character in the lexicon.

10. The recognition method for intelligently matching incomplete Chinese characters based on grid features according to claim 9, characterized in that the ratio described in step S42 is fifty percent.