CN106203496B

CN106203496B - Hydrographic curve extracting method based on machine learning

Info

Publication number: CN106203496B
Application number: CN201610520993.4A
Authority: CN
Inventors: 李士进; 郑展; 朱跃龙; 郝立; 余宇峰; 胡金龙; 高祥涛; 冯钧; 万定生
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2016-07-01
Filing date: 2016-07-01
Publication date: 2019-07-12
Anticipated expiration: 2036-07-01
Also published as: CN106203496A

Abstract

The invention discloses a hydrological curve extraction method based on machine learning. To extract a curve from a hydrological data image, the method selects and extracts discriminative image features, samples the pixels within a local region using a variable-scale sampling window to form sample data, and uses machine learning to separate image components with different characteristics. New training samples are added incrementally according to the classification results, and chain code tracking is used as post-processing to effectively remove the noise that remains after classification. Compared with the prior art, the invention solves the problem that a thin target curve breaks during extraction, which existing hydrological curve extraction methods handle poorly.

Description

Extraction method of hydrological curve based on machine learning

Technical field

The invention relates to an image extraction method, in particular to a method for extracting hydrological curves from hydrological data images, and belongs to the field of image segmentation.

Background

In today's era of informatization and digitization, with the spread of computers and the rapid development of storage media, every research field pays increasing attention to digitizing its data. For historical reasons, fields such as hydrology and water conservancy have mostly recorded observation data on grid paper. Paper, however, is easily damaged or contaminated when stored improperly, which can destroy the information it carries. Paper records also take up space, are hard to exchange and transmit, and are likely to bury knowledge that is hidden in the mass of information and still waiting to be discovered. It is therefore necessary to digitize these paper records. Collecting this information by image processing and building a database from it avoids a great deal of repetitive manual work and allows the information to be entered efficiently and accurately, which gives the approach strong practical value.

Paper hydrological records are usually blue-purple hydrological curves drawn on orange-red coordinate grid paper. During digitization, obtaining the information in a drawing requires finding each intersection of the hydrological curve with the grid lines, which gives the observed value at each moment. This requires segmenting the image, which involves both grid line segmentation and hydrological curve segmentation.

Image segmentation is the technique and process of dividing an image, according to some criterion, into several specific regions with distinct properties and extracting the objects of interest from them. Image segmentation is a key prerequisite of image analysis, and its quality largely determines the quality of the subsequent analysis. Image segmentation can be divided into grayscale image segmentation and color image segmentation. Compared with grayscale images, color images contain not only brightness information but also color information; they can be segmented in more ways, but segmentation is correspondingly harder. Researchers at home and abroad have carried out a great deal of work on color image segmentation and have proposed many segmentation algorithms, as well as segmentation strategies for specific kinds of images. These mainly include histogram thresholding methods, region-based methods, edge detection methods, fuzzy clustering methods and neural network methods.

In previous work, hydrological data images were usually segmented with a threshold analysis method based on color histograms, with gradient information and color information also fused. Such methods complete the segmentation adaptively in ordinary cases and reduce the effect of uneven illumination when the image is photographed. In practice, however, the extracted hydrological curve was found to break easily in certain special cases, and the breaks are often so severe that they are difficult to repair by dilation.

Summary of the invention

Purpose of the invention: to provide, for the digitization of paper hydrological records, a hydrological curve extraction method that extracts the hydrological curve accurately and effectively avoids broken curves.

In the hydrological curve extraction method of the invention, the hydrological data images are obtained by photographing paper hydrological records.

The invention adopts the following technical solution to solve the above problem.

A hydrological curve extraction method based on machine learning, comprising the following steps:

Step A: select the scale of the sampling window and the target features to be sampled, and collect a representative training sample set accordingly; the scale of the window is adjustable. The choice of window scale determines the amount of data used for classification and directly affects the amount of computation.

Step B: use a machine learning method to train a classification prediction model from the training samples.

Step C: for each pixel in the image to be processed, collect the target features according to the sampling window to form a sample to be classified, and classify it with the classification prediction model trained in step B.

Step D: judge whether the classification result for the pixels of the image to be processed is good, that is, whether the curve is extracted completely and there are no other regions with obvious classification errors. If so, go to step F; otherwise, go to step E.

Step E: select representative sample points from the regions where the curve is broken and the regions with obvious classification errors, sample them, add them to the training sample set, and repeat steps B to D.

Step F: post-process the processed image to remove any noise that may remain.

Preferably, the training sample set collected in step A contains samples of at least three categories: "hydrological curve", "grid line" and "other background".

Preferably, the image post-processing in step F combines chain code tracking with dilation. Before the chain code tracking of the hydrological curve, the grid lines are tracked first to determine the plotting area that they enclose; this reduces the processing load when the hydrological curve is tracked.

Preferably, after chain code tracking, any connected domain whose size is below a specific threshold is regarded as noise and removed from the image. The threshold is 10000.

Preferably, the size of a connected domain is expressed as the area of its minimum bounding rectangle.

Compared with the prior art, the invention has the following beneficial effects:

1. The invention better solves the problem of broken lines that easily arises when thin lines are extracted.

2. The invention is based on classifying sample patterns, so as long as sufficient training samples are selected, the influence of illumination and similar factors need not be considered.

3. The invention uses offline learning, so samples do not have to be collected and a model trained anew for each image.

Description of the drawings

Figure 1, Figure 2 and Figure 3 are three photographed hydrological data images.

Figure 4a and Figure 4b are the results of extracting the hydrological curves of Figure 1 and Figure 2 with the existing method.

Figure 5a and Figure 5b are the classification prediction results for Figure 2 at different stages of training the classification model in the method of the invention.

Figure 6a and Figure 6b are the classification prediction results for Figure 3 at different stages of training the classification model in the method of the invention.

Figure 7a and Figure 7b are the results of extracting the hydrological curves of Figure 2 and Figure 3 with the method of the invention.

Figure 8 is a flow chart of the invention.

Detailed description

The technical solution of the invention is described in detail below with reference to the drawings. Figure 1 and Figure 2 each show a hydrological data image; the key to digitizing them is extracting the hydrological curve (blue-purple) and the coordinate grid lines (orange-red). The figures show that, because the drawings were stored for a long time under less than ideal conditions, the images suffer not only from wear, damage and aged paper but also from color bleeding and fading, and the color density even varies between different regions of the same drawing. In addition, because of the illumination at the time of photographing, the color information in some regions is weakened or has lost its original characteristics, which makes the extraction problem more complicated.

The color-histogram threshold analysis method used in previous work, which fuses gradient information and color information, handles the above problems fairly well; its results for Figure 1 and Figure 2 are shown in Figure 4a and Figure 4b. The extraction results are good on the whole, and the target is largely identified and extracted, but in certain special cases the extracted hydrological curve is broken, as at the breaks in Figure 4b. Close inspection of these breaks shows that the main cause is that the hydrological curve in the drawing is too thin. Where the curve overlaps the grid lines heavily (usually where the curve runs approximately parallel to a grid line), the ink of the curve does not completely cover the color of the grid line, and the color finally displayed is a superposition of the two. This superposition shifts the color information: the pixels there no longer match the usual color characteristics of the hydrological curve, so the extracted curve tends to break at that point. Worse, when a segment of the curve is approximately parallel to a grid line, the overlap is extensive, the break is often severe, and it is difficult to fill in by dilation or similar methods. Because the threshold characteristics are no longer satisfied, the original method no longer applies, and adjusting the threshold greatly increases the number of noise points and makes the extraction unstable. A new approach to extracting the hydrological curve was therefore considered.

Because the shifted color information in the overlapping regions matches neither the existing curve features nor the existing grid line features, the goal is to have these altered features still recognized as the target curve. The invention proposes a hydrological curve extraction method based on machine learning. Pixels in the image are sampled with multiple fused features to obtain a number of labeled training samples, and a classification prediction model is trained from them with a machine learning method. The model classifies the pixels of the image, and the pixels predicted to be hydrological curve are extracted. Resampling the regions of misclassified pixels makes the training sample set more complete, so that training yields a more robust classification prediction model. To keep noise produced during processing from affecting the extraction result, the extracted hydrological curve image is also post-processed by chain code tracking.

Specifically, the invention comprises the following steps:

Step A: determine the window scale and the combination of sampled features, and collect training samples.

Before training samples are collected, the scale of the sampling window and the features to be sampled must be determined. The sampling window is centered on the current pixel and also takes in the information of the other pixels in the window that surround it, that is, local information. Because a single pixel holds very little information, taking the surrounding local information into account raises the dimension of the sample, which favors finer classification. The window can use various scales, such as 3*3, 5*5 or 7*7. The larger the scale, the more information a sample contains, which helps finer and more accurate classification but lengthens computation; the smaller the scale, the less information a sample contains and the shorter the computation. The specific scale should be chosen according to the needs of the application.

The local information sampled in the window must also be chosen. It consists of feature values, including color features (RGB, Lab, HSI and so on), gradient features, texture features (LBP, Gabor and so on) and SIFT features. The choice of feature combination should take into account how each feature affects the extraction of the hydrological curve, and a moderate number of feature groups should be combined. Too many features bring redundant information and computational load, while too few may degrade classification. The choice of feature combination thus governs both the classification quality and the computational load.

When a pixel is sampled in this way, the feature values are obtained in a fixed order according to the window size and feature combination agreed in advance, and are assembled into an ordered sample vector. Color features are collected from the pixels in the window in order from left to right and top to bottom. For training samples, one additional dimension is appended to hold the class label of the sample.
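The sampling order described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function names `sample_window` and `make_training_sample`, the nested-list image layout, and the choice to reject pixels whose window does not fit inside the image are all assumptions made for the example.

```python
def sample_window(features, row, col, size=7):
    """Flatten the size x size window centred on (row, col) into one vector.

    `features` is a 2-D grid (list of rows) in which each cell is a tuple of
    per-pixel feature values, e.g. (R, G, B). Pixels are visited left to
    right, top to bottom. (Layout and names are assumptions of this sketch.)
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd so it has a centre pixel")
    half = size // 2
    height, width = len(features), len(features[0])
    if not (half <= row < height - half and half <= col < width - half):
        raise ValueError("window does not fit inside the image at this pixel")
    vector = []
    for r in range(row - half, row + half + 1):
        for c in range(col - half, col + half + 1):
            vector.extend(features[r][c])
    return vector


def make_training_sample(features, row, col, label, size=7):
    """Append the class label as one extra dimension after the features."""
    return sample_window(features, row, col, size) + [label]
```

With a 7*7 window and nine feature values per pixel, `sample_window` would return a vector of 7*7*9 = 441 values, which matches the sample dimension used in the experiment described later.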

The choice of sampling points is especially important when training samples are collected. Note that: 1. sampling should cover every target category, and each category should receive enough sample points; 2. among pixels of the same target category, pixels with different local features should be covered as far as possible; 3. among pixels of the same category with similar local features, a few typical pixels should be chosen for sampling. The collected training sample set should contain samples of at least three categories: "hydrological curve", "grid line" and "other background".

Step B: train a classification prediction model with a machine learning method.

Machine learning methods include supervised and unsupervised learning. Because the classification target in this problem is clear (only the hydrological curve is to be extracted), a supervised learning method is used, and the classification prediction model is obtained from the collected training samples with class labels. Supervised methods include decision trees, Bayesian classifiers, K-nearest neighbors, BP neural networks, perceptrons and support vector machines (SVM). Different machine learning methods have different characteristics and should be chosen according to actual needs. The choice of machine learning method governs both the quality and the efficiency of curve extraction.

Analysis of different hydrological data images shows that their color and structural features are very similar. In terms of feature space, even different images can be roughly classified and extracted with the same set of decision boundaries. Offline learning is therefore adopted: the goal is to train a single well-performing classification prediction model that classifies and extracts the pixels of all images to be processed, rather than training a separate model for each image.

Step C: classify the image to be processed, and supplement the training sample set.

Following the conventions agreed in step A, the corresponding feature sample is extracted for each pixel of the image to be processed and fed to the classification prediction model for classification. The pixels predicted to be "hydrological curve" are extracted as the result of this round of curve extraction.

If the extracted curve is complete and the result is satisfactory, the process moves on to the next step. Usually, however, a satisfactory result is not obtained at once: the extracted curve tends to be rough and broken, and many noise points are extracted as well. The remedy is to keep adding new samples to the training sample set incrementally, so that training yields an increasingly complete and robust classification prediction model. Each increment of new training samples comes from the previous classification result: the breaks in the curve and the other regions with a high classification error rate are located, and pixels with typical local features in those regions are sampled. This approach learns from past prediction errors by resampling near the class boundaries, which fills the gaps and omissions in the sample space and gives a more complete training set, and hence finer, more accurate decision boundaries and a more robust classification prediction model.

The classification prediction model is retrained on the enlarged training sample set and the image is classified again. If the curve extraction is satisfactory, the image passes the current stage and moves on to the next step; otherwise the incremental addition of training samples described above is repeated.
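The retrain-and-reclassify cycle can be summarized as a loop. The sketch below is a hedged outline only: the four callables are placeholders for steps that the description leaves to the chosen learning method or to a human operator, and `max_rounds` is a safety limit added for this example that the source does not mention.

```python
def incremental_training(initial_samples, train, classify_image,
                         is_satisfactory, pick_new_samples, max_rounds=10):
    """Retrain until the extraction is judged satisfactory.

    train(samples) -> model
    classify_image(model) -> classification result for the image
    is_satisfactory(result) -> bool (a human judgement in the description)
    pick_new_samples(result) -> extra labelled samples taken from broken or
        misclassified regions (chosen manually in the description)
    All four callables and max_rounds are assumptions of this sketch.
    """
    samples = list(initial_samples)
    for round_number in range(1, max_rounds + 1):
        model = train(samples)                    # train on current sample set
        result = classify_image(model)            # classify every pixel
        if is_satisfactory(result):               # curve complete, few errors
            return model, samples, round_number
        samples.extend(pick_new_samples(result))  # add the increment
    raise RuntimeError("extraction still unsatisfactory after max_rounds")
```

Because the sample set only grows, the model returned for one image can be reused as the starting point for the next image, in line with the offline learning approach described in step B.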

The classification prediction model is not trained in a single pass; new samples must be added and the model retrained several times. Nor is training confined to one clearly delimited "training stage": retraining is triggered only when the classification of some image is poor. In other words, there is no explicit, bounded training stage. Adding to the training sample set also requires manual intervention, since the new sampling points are selected by hand. However, the initial accumulation of training samples can usually be completed quickly and yields a model that performs well, and an existing model needs retraining only in special cases, so the manual workload is in practice small.

Step D: post-process by chain code tracking.

Because the original photographed image often contains many noise points, the curve extraction result from the steps above still contains a good deal of environmental noise that is hard to eliminate, most of it introduced by classification errors caused by color deviation. The invention mainly targets images with thin hydrological curves, and removing the noise with the usual erosion or filtering methods tends to thin the curve further or even break it badly again, so the result is unsatisfactory. The ideal is to remove the noise points while leaving the hydrological curve unchanged; chain code tracking can be used as post-processing to achieve this.

Chain code tracking traces and records information about each connected domain in the image: it labels each pixel with the connected domain it belongs to, and records the size and bounding-box position of each connected domain. The size of a connected domain is measured not by the number of pixels it contains but by the area of its minimum bounding rectangle.

In the result image produced by classification, the pixels predicted to be "target curve" are tracked to obtain their connected domain information; that is, the image is binarized into "target curve / non-target curve" and the chain code tracking described above is applied to it. The tracking result includes both the true hydrological curve regions and the noise regions. The connected domains of the former are usually large, while those of the latter are comparatively small. A threshold can therefore be set on connected domain size to exclude the smaller connected domains that contain the noise.
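The size filter can be illustrated with a short sketch. Note that it swaps in a breadth-first flood fill for the chain code tracking named in the text, since both yield the connected domains and their bounding boxes; it also assumes 8-connectivity and an axis-aligned bounding rectangle, neither of which the source specifies. The name `remove_small_components` is hypothetical.

```python
from collections import deque


def remove_small_components(mask, min_box_area):
    """Clear every 8-connected component whose bounding-box area is too small.

    `mask` is a list of rows of 0/1 values (1 = predicted target curve).
    A new mask is returned; the input is left unchanged.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    cleaned = [[0] * width for _ in range(height)]
    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Flood-fill one component, collecting its pixels.
            seen[r0][c0] = True
            queue, pixels = deque([(r0, c0)]), []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < height and 0 <= nc < width
                                and mask[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            rows = [p[0] for p in pixels]
            cols = [p[1] for p in pixels]
            box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
            if box_area >= min_box_area:
                for r, c in pixels:
                    cleaned[r][c] = 1
    return cleaned
```

On a full-size photograph the summary's threshold of 10000 would be passed as `min_box_area`; measuring size by bounding-box area rather than pixel count means a long, thin curve segment still counts as large.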

The largest connected domain is not simply taken on its own because the curve in the image obtained after classification may still contain breaks, in which case the curve spans more than one connected domain. Such breaks are usually minor and easy to fix: after the noise points are removed, a few dilation operations can be applied to the target curve.
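The dilation step mentioned above might look like the following minimal sketch. The 3x3 square structuring element and the function name `dilate` are assumptions; the source only says that the curve is dilated a few times.

```python
def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element.

    Each pass sets a pixel to 1 if it or any of its 8 neighbours is 1.
    `mask` is a list of rows of 0/1 values; a new mask is returned.
    """
    height, width = len(mask), len(mask[0])
    current = [row[:] for row in mask]
    for _ in range(iterations):
        grown = [[0] * width for _ in range(height)]
        for r in range(height):
            for c in range(width):
                if current[r][c]:
                    # Spread this pixel to its neighbours, clipped to the image.
                    for nr in range(max(0, r - 1), min(height, r + 2)):
                        for nc in range(max(0, c - 1), min(width, c + 2)):
                            grown[nr][nc] = 1
        current = grown
    return current
```

Each pass closes gaps of up to two pixels between curve fragments, at the cost of thickening the curve by one pixel on every side.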

To improve the result, a "grid line tracking" pass can be run on a copy of the image before the "curve tracking" described above, to determine the region occupied by the grid lines; the curve tracking is then carried out within that region. That is, the image copy is binarized into "grid line / non-grid line", chain code tracking is applied to it, and the bounding-box position of the largest connected domain is taken as the outer edge of the grid. The purpose of this operation is to discard all pixels outside the grid, which are irrelevant to curve extraction, and so reduce the complexity of curve tracking.
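Locating the plotting area from the grid line mask can be sketched as follows. As before, a flood fill stands in for chain code tracking, and 8-connectivity, the axis-aligned bounding box and the names `largest_component_box` and `crop` are assumptions made for this example.

```python
from collections import deque


def largest_component_box(mask):
    """Return (top, left, bottom, right) of the 8-connected component
    with the largest bounding-box area, or None if the mask is empty."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    best_box, best_area = None, 0
    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            top, left, bottom, right = r0, c0, r0, c0
            while queue:
                r, c = queue.popleft()
                top, bottom = min(top, r), max(bottom, r)
                left, right = min(left, c), max(right, c)
                for nr in range(max(0, r - 1), min(height, r + 2)):
                    for nc in range(max(0, c - 1), min(width, c + 2)):
                        if mask[nr][nc] and not seen[nr][nc]:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            area = (bottom - top + 1) * (right - left + 1)
            if area > best_area:
                best_box, best_area = (top, left, bottom, right), area
    return best_box


def crop(image, box):
    """Keep only the rows and columns inside an inclusive bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

The box returned for the "grid line / non-grid line" mask can then be applied with `crop` to the "target curve / non-target curve" mask before the noise filter runs, so that marks outside the grid are never tracked.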

To verify the effect of the invention, several color hydrological data images were selected for experiment and put through the classification and extraction process described above. The window scale was fixed at 7*7, and the feature combination for each sampled point was the RGB, HSI and Lab color values of each pixel, nine feature values in all; the dimension of every sample is therefore 7*7*9 = 441. The machine learning method chosen was the support vector machine (SVM). Taking Figure 2 as an example, the training sample set is initially empty. Figure 2 is first sampled; once enough training samples have been obtained, an SVM classifier is trained and used to classify Figure 2. The result is shown in Figure 5a, where black points are those predicted as "hydrological curve" and gray points are "grid line". The classification of Figure 2 at this stage is clearly unsatisfactory: there are many breaks, including two large ones. After these breaks were resampled several times and the SVM classifier retrained over several rounds, the classifier improved, and all the breaks in the classification result for Figure 2 were resolved, as shown in Figure 5b.

The current SVM classifier was then used to classify Figure 3, with the result shown in Figure 6a. The curve obtained has no breaks, but there is still too much noise, so the classification is not good. The noise points were sampled again and a new round of the classifier was trained; the result of classifying Figure 3 again is shown in Figure 6b. The classification is now good, and the current classifier is considered able to meet the classification requirements of both images. If necessary, the same procedure can be repeated for other images.

The classification results of Figure 5b and Figure 6b were then post-processed: the unwanted gray "grid line" points were removed, the noise points were eliminated by chain code tracking, and the curve was dilated a few times. The resulting curve extractions are shown in Figure 7a and Figure 7b. The invention thus successfully extracts the thin hydrological curves in Figure 2 and Figure 3.

The machine-learning-based hydrological curve extraction method of the invention is based on classifying sample patterns, so as long as sufficient training samples are selected, the influence of illumination and similar factors need not be considered. It uses offline learning, so samples do not have to be collected and a model trained anew for each image. Training samples are selected and added incrementally, which lets the method adapt to new classification requirements as they arise. The invention better solves the problem of broken lines that easily arises when thin lines are extracted, and has good research value.

Claims

1. A hydrological curve extraction method based on machine learning, characterized in that it comprises the following steps:

Step A, selecting the scale of the sampling window and the target features to be sampled, and collecting a representative training sample set accordingly;

Step B, using a machine learning method to train a classification prediction model from the training samples;

Step C, for each pixel in the image to be processed, collecting the target features according to the sampling window to form a sample to be classified, and classifying it with the classification prediction model;

Step D, judging whether the classification result for the pixels of the image to be processed meets expectations, that is, whether the curve is extracted completely and whether there are other regions with obvious classification errors; if the classification result meets expectations, going to step F; otherwise, going to step E;

Step E, selecting representative sample points from the regions where the curve is broken and the regions with obvious classification errors, sampling them and adding them to the training sample set, and repeating steps B to D;

Step F, post-processing the processed image.

2. The method for extracting hydrological curves based on machine learning according to claim 1, wherein in step A, the scale of the window is scalable.

3. The method for extracting hydrological curves based on machine learning according to claim 1, wherein in step A, the local features are selected as a combination of different types of features, the features including color features, gradient features, texture features and SIFT features.

4. The method for extracting hydrological curves based on machine learning according to claim 1, wherein in step B, the machine learning method comprises support vector machines, neural network methods, and combinations thereof.

5. The method for extracting hydrological curves based on machine learning according to claim 1, wherein in step C, all pixels at which the current sampling window can extract features are traversed, and the corresponding local feature values are obtained according to the window to form the sample vector to be classified.

6. The method for extracting hydrological curves based on machine learning according to claim 1, wherein step E further comprises: adding the selected sample points to the original training set incrementally as training samples, and retraining to generate the prediction model, so as to make directed adjustments to the original model.

7. The method for extracting hydrological curves based on machine learning according to claim 1, wherein in step F, the post-processing method is a combination of chain code tracking and dilation processing.

8. The method for extracting hydrological curves based on machine learning according to claim 1, wherein a sample category "grid line" needs to be added, so as to locate the drawing area on the graph when processing the image.

9. The method for extracting hydrological curves based on machine learning according to claim 7, wherein the chain code tracking is used to track the connected domains present in the image and to calculate and record their sizes, thereby excluding the smaller connected domains that consist of noise.