Background
Optical flow estimation aims to calculate pixel-level two-dimensional motion between successive frames, describing the motion field in two-dimensional space. Conventional optical flow algorithms typically rely on the photometric consistency assumption and derive motion information through optimization. The core of such methods is manually designed features, whereas the introduction of deep learning significantly changed how optical flow is estimated. Early deep-learning methods (e.g., FlowNet) estimate optical flow directly from image pairs via convolutional neural networks, eliminating the manual feature extraction step of conventional methods and verifying the feasibility of learning optical flow directly from images.
With the development of deep learning, the design of optical flow estimation networks has become increasingly diverse. For example, PWC-Net adopts a spatial pyramid structure to estimate optical flow progressively across multiple levels, and introduces feature pyramids and cost-volume-based matching, improving the accuracy of optical flow estimation while maintaining computational efficiency. RAFT iteratively updates the optical flow with a novel network structure that looks up a multi-scale 4D correlation volume and refines the estimate with a GRU, while GMA enhances global motion matching during feature extraction and encoding through global motion aggregation and a local attention mechanism, thereby improving optical flow performance.
However, existing optical flow methods are designed primarily for image data acquired under normal environmental conditions. In complex environments, image data is often affected by information loss. For example, under low-light conditions, increased noise and attenuated texture features can significantly reduce image quality, breaking the photometric consistency assumption that optical flow estimation relies on. The increased noise and lack of texture detail reduce inter-frame texture consistency, degrading the performance of the subsequent optical flow estimation network.
For optical flow estimation in challenging environments such as severe weather or low light, corresponding solutions have also emerged in the prior art.
For example, in rainy scenes, RobustNet estimates optical flow using a rain-free residual channel, while RainFlow reduces the effects of rain streaks and haze on image features by generating features invariant to rain streaks and rain mist. In hazy scenes, some methods perform image dehazing and optical flow estimation simultaneously by synthesizing haze data and applying style transfer techniques.
In low-light environments, Zheng et al. propose a synthetic dark-noise optical flow benchmark named the FlyingChairs Dark & Noise (FCDN) dataset, which addresses the lack of training data by adding dark-image noise to a normal-illumination dataset. They also introduce the Various Brightness Optical Flow (VBOF) dataset, containing images at different exposure levels together with optical flow pseudo-labels. In addition, CEDFlow introduces high/low-frequency adaptive enhancement and edge enhancement in the feature dimension, with a structure specially designed for extracting image features in low-light environments to improve optical flow estimation performance.
However, extracting image features solely from the limited information in low-light images limits optical flow estimation performance. The reduced light intake under low-light conditions makes it difficult to extract the necessary information, preventing the extraction of the high-quality features that are critical for subsequent optical flow calculation. Furthermore, since the network is trained using only low-quality images, the performance typically achievable under normal lighting conditions is also reduced.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an implicit image enhancement and optical flow estimation method based on multi-modal collaborative optimization. The method combines RGB and depth image data and remarkably improves the robustness and precision of optical flow estimation by introducing multi-modal information and a multi-modal collaborative optimization framework, so as to overcome the limited optical flow estimation performance of the prior art in difficult scenes such as low illumination, high noise, or other complex dynamic scenes.
An implicit image enhancement and optical flow estimation method based on multi-modal collaborative optimization specifically comprises the following steps:
Step 1, acquire an RGB image under a normal illumination scene and a depth map of the corresponding view angle, and calculate the three-dimensional point cloud corresponding to each pixel according to the depth map and the camera intrinsics.
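The pixel-to-point-cloud calculation in step 1 can be sketched with the standard pinhole back-projection; the intrinsics fx, fy, cx, cy and the dense depth map are assumed inputs, and any real implementation would also mask invalid depth values:

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map to a 3D point cloud using the pinhole model.

    Each pixel (u, v) with depth d maps to
    X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d.
    Returns an (H*W, 3) array of camera-frame points.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```

This is the only place where the depth modality enters the pipeline, so the point cloud inherits the depth map's resolution one point per pixel.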
Add simulated low-light noise to the RGB image, adjust the image brightness, and synthesize low-light image data.
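A minimal sketch of such low-light synthesis follows; the brightness factor and the shot/read noise parameters are illustrative assumptions, not the embodiment's values, and the embodiment additionally applies an uncorrected white balance effect that is omitted here:

```python
import numpy as np

def synthesize_low_light(rgb, brightness=0.2, shot_noise=0.01,
                         read_noise=0.02, seed=0):
    """Darken an image in [0, 1], then add signal-dependent (shot) and
    signal-independent (read) Gaussian noise. All parameters are illustrative."""
    rng = np.random.default_rng(seed)
    dark = rgb * brightness                           # reduced light intake
    noise_std = np.sqrt(shot_noise * dark + read_noise ** 2)
    noisy = dark + rng.normal(0.0, 1.0, rgb.shape) * noise_std
    return np.clip(noisy, 0.0, 1.0)
```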
Step 2, construct a high/low-frequency feature enhancement network and decompose the low-light image data obtained in step 1 into high-frequency features and low-frequency features. The high-frequency features contain detail information of the image such as edges and textures; the low-frequency features carry the background and the overall outline of the image.
Input the high-frequency features into a dense convolutional network for enhancement, highlighting texture details; process the low-frequency features through a multi-scale feature enhancement network, capturing global background information with an attention mechanism. Then perform weighted fusion through a channel attention mechanism and residual connection to generate the enhanced image features F_en.
Extract the image features F_f and context features F_c of the low-light image data from the enhanced image features via an encoder.
Step 3, extract two-dimensional image features F_2d from the RGB image and three-dimensional point cloud features F_3d from the point cloud of the normal illumination scene through encoders; after feature alignment and fusion, obtain the fused normal-illumination image features F_f^N and context features F_c^N.
Supervise the feature extraction process of step 2 by a prior feature loss function L_prior:

L_prior = ||F_f − F_f^N||_2 + ||F_c − F_c^N||_2

where ||·||_2 represents the 2-norm.
Step 4, based on the low-light image features F_f and context features F_c, calculate a 4D correlation volume C. Iteratively optimize an initial optical flow field f_0 using the multi-scale correlation volume and the recurrent update operator of a GRU (gated recurrent unit), progressively refining the optical flow estimate, and calculate the optical flow estimation loss L_flow.
Step 5, set the total loss function L to the sum of the prior feature loss L_prior and the optical flow estimation loss L_flow to complete model training:

L = L_flow + λ · L_prior

where λ represents the loss weight.
Input the image pair requiring optical flow estimation into the trained high/low-frequency feature enhancement network to extract the image features F_f and context features F_c, and output the optical flow estimation result by the method of step 4.
The invention has the following beneficial effects:
1. The multi-modal data supplements the information lost in the original low-quality image, breaking through the performance bottleneck of single-modal optical flow estimation methods in complex scenes. Through high/low-frequency feature decomposition and enhancement, the extraction of image detail and global information is optimized separately, significantly improving optical flow accuracy. An implicit feature supervision mechanism is introduced: the enhancement network is guided during training by multi-modal features, and RGBD fusion with 3D-to-2D projection ensures that the enhanced features remain geometrically interpretable, realizing deep collaborative optimization of the task and the enhancement.
2. The method is applicable to a wide range of computer vision tasks, including robot visual navigation, autonomous driving environment perception, and moving object detection and tracking, and performs particularly well in difficult scenes such as low-light and high-noise environments. In addition, the technique has potential value in image restoration, video analysis, and other applications requiring high-precision dynamic scene estimation. It provides a new solution for optical flow estimation in complex environments, improving performance in difficult scenes through implicit multi-modal knowledge guidance and feature enhancement, and further advancing the field of optical flow computation.
Detailed Description
The invention is further explained below with reference to the drawings;
As shown in FIG. 1, the implicit image enhancement and optical flow estimation method based on multi-modal collaborative optimization mainly comprises high/low-frequency feature enhancement, 2D-3D feature fusion, and iterative optical flow estimation. The specific steps are as follows:
Step 1, this embodiment uses the FlyingThings3D dataset and the VBOF (Various Brightness Optical Flow) dataset as data sources. The FlyingThings3D dataset is a synthetic dataset designed specifically for optical flow estimation in 3D scenes, containing image pairs with dynamic 3D objects and their corresponding ground-truth optical flow. The images of the dynamic 3D objects come from flying objects rendered under different simulated camera perspectives and lighting conditions. The VBOF dataset contains sets of images of the same scene under multiple exposure conditions, taken by four different cameras, and provides real optical flow data for each exposure condition.
Based on the normal-illumination RGB images provided by the FlyingThings3D dataset and the corresponding depth maps, calculate the three-dimensional point cloud corresponding to the pixels according to the depth map and the camera intrinsics. Then add simulated low-light noise to the RGB images, introduce an uncorrected white balance effect and a noise model, and adjust the image brightness to generate synthetic low-light image data with low-light noise characteristics.
Step 2, decompose the low-light image data into high-frequency and low-frequency features through the high/low-frequency feature enhancement network, and enhance the original image using the dual-frequency-domain features. The specific steps are as follows:
S2.1, first input the low-light image pair of the same scene into a convolutional layer to obtain the low-light image features F. Then obtain the low-frequency features F_low of the low-light image from F through an average pooling operation.
Upsample the low-frequency features F_low by bilinear interpolation and subtract the upsampled result from F to obtain the high-frequency features F_high of the low-light image.
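The decomposition of S2.1 can be sketched as follows; nearest-neighbour upsampling stands in for the bilinear interpolation to keep the sketch short, and a single-channel feature map is assumed:

```python
import numpy as np

def split_high_low(feat, pool=2):
    """Decompose a (H, W) feature map: average-pool for F_low, upsample and
    subtract from the input for F_high (so F = upsample(F_low) + F_high)."""
    h, w = feat.shape
    low = feat.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))
    up = np.repeat(np.repeat(low, pool, axis=0), pool, axis=1)  # upsample
    high = feat - up                                            # residual detail
    return low, high
```

By construction the high-frequency component is zero-mean within each pooling block, so edges and textures end up in F_high while the smooth background stays in F_low.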
S2.2, since the high-frequency information mainly represents image details, a smaller receptive field focuses better on local image information and enhances details more accurately. The high-frequency features F_high are therefore input into a dense convolutional network:

F'_high = Dense(F_high)

where Dense(·) represents the dense convolutional network. It comprises multiple small convolution kernels with a residual connection structure, allowing it to focus on detail regions and helping to explore high-frequency information.
The low-frequency information contains the background and contours of the image, from which long-range dependencies can be captured. First, downsample the low-frequency features F_low twice in succession, obtaining F_low1 and F_low2. Input the low-frequency features at all three scales into channel self-attention to capture global background information, obtaining the enhanced multi-scale low-frequency features F'_low, F'_low1, F'_low2, and finally fuse them through a wavelet fusion network (Wavelet Fusion Network):

F''_low = WF(F'_low, F'_low1, F'_low2)

where WF(·) represents the wavelet fusion operation.
S2.3, concatenate the processed high-frequency features F'_high and low-frequency features F''_low along the channel dimension; after a 1×1 convolution and a channel attention mechanism, apply a residual connection with the low-light image features F to generate the enhanced image features F_en. The high/low-frequency feature enhancement network can thus separately process the damaged texture details and edge contours of the image, improving feature extraction capability.
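The fusion step can be sketched as below; the 1×1 convolution is represented by a plain weight matrix and a squeeze-style global-average gate stands in for the method's channel attention, so the operators and shapes are illustrative assumptions rather than the embodiment's exact modules:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_features(f_high, f_low, f_orig, w):
    """Concatenate enhanced high/low-frequency features on the channel axis,
    mix with a 1x1 convolution (matrix w of shape (C, 2C)), reweight channels
    with a squeeze-style gate, then add a residual connection to the original
    features. All tensors are (C, H, W)."""
    cat = np.concatenate([f_high, f_low], axis=0)           # (2C, H, W)
    mixed = np.tensordot(w, cat, axes=([1], [0]))           # 1x1 conv -> (C, H, W)
    gate = sigmoid(mixed.mean(axis=(1, 2)))[:, None, None]  # per-channel weight
    return f_orig + gate * mixed                            # residual connection
```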
S2.4, input the enhanced image features F_en into a CNN-based encoder to extract the image features F_f and context features F_c of the low-light image.
Step 3, through a 2D-3D feature fusion network, align and fuse the point cloud features and RGB features using a pre-trained encoder and a learnable-interpolation projection, implicitly supervising the feature extraction from low-quality data.
S3.1, first extract two-dimensional features F_2d ∈ R^(H×W×C2) from the RGB image pair of the same scene through a pretrained CNN, where H and W represent the height and width of the two-dimensional features and C2 the number of channels. Extract three-dimensional features F_3d ∈ R^(M×C3) from the point cloud data, while obtaining the position information of the point cloud from the depth map; M represents the number of points and C3 the number of channels of the three-dimensional point cloud features.
S3.2, since the image features are dense and the point cloud features are sparse, to achieve feature alignment a learnable interpolation method converts the point cloud features F_3d into dense point cloud features F_p of the same spatial size as the image features. For each image pixel, the neighborhood point features are weighted by ScoreNet according to the coordinate offset, with the projection points of the point cloud features found by K-nearest-neighbor search:

F_p(i, j) = Σ_{k ∈ KNN(i, j)} ScoreNet(Δ_ijk) ⊙ F_3d(k)

where Δ_ijk represents the offset of pixel (i, j) with respect to the 2D projection point of point k, ScoreNet(·) represents the ScoreNet network, KNN(i, j) denotes the K nearest neighbors of pixel (i, j), and F_p(i, j) represents the dense point cloud feature at pixel (i, j), i = 1, 2, …, H, j = 1, 2, …, W.
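The densification of S3.2 can be sketched as follows; a softmax over negative offset distances stands in for the learned ScoreNet scoring, so the weighting function (and the brute-force neighbor search) are illustrative assumptions:

```python
import numpy as np

def densify(points_2d, feats, h, w, k=2):
    """Scatter sparse point features onto a dense (h, w, c) grid. For every
    pixel, find the k nearest projected points and blend their features with
    softmax(-distance) weights (a stand-in for ScoreNet)."""
    dense = np.zeros((h, w, feats.shape[1]))
    for i in range(h):
        for j in range(w):
            offsets = points_2d - np.array([i, j])      # per-point 2D offset
            dist = np.linalg.norm(offsets, axis=1)
            nn = np.argsort(dist)[:k]                   # k nearest neighbours
            wts = np.exp(-dist[nn])
            wts /= wts.sum()                            # normalized weights
            dense[i, j] = wts @ feats[nn]
    return dense
```

A real implementation would use a spatial index or grid hashing for the neighbor search; the O(H·W·M) loop here is only for clarity.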
S3.3, fuse the dense point cloud features F_p and the image features F_2d in the channel dimension through convolution:

F^N = Conv(Concat(F_p, F_2d))
The fused normal image features F_f^N and normal context features F_c^N are used to implicitly provide guidance to the high/low-frequency feature enhancement network, optimizing its feature extraction capability. The feature extraction process of the enhancement network is supervised by the prior feature loss function L_prior:

L_prior = ||F_f − F_f^N||_2 + ||F_c − F_c^N||_2

where ||·||_2 represents the 2-norm.
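The prior feature loss is a pair of 2-norm distances, which can be computed directly; the feature tensors are assumed to already share a common shape:

```python
import numpy as np

def prior_feature_loss(f_low, c_low, f_norm, c_norm):
    """Sum of 2-norm distances between low-light features (image, context)
    and the fused normal-light features they are supervised against."""
    return (np.linalg.norm(f_low - f_norm) +
            np.linalg.norm(c_low - c_norm))
```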
Step 4, construct the 4D correlation volume and infer the optical flow with the GRU.
S4.1, based on the image features F_f and context features F_c of the low-light image, calculate the 4D correlation volume C, used to establish feature correspondences between pixels.
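The 4D correlation volume is the dot product between every pair of feature vectors from the two frames; the 1/√C scaling below follows RAFT-style implementations and is an assumption about this embodiment:

```python
import numpy as np

def correlation_volume(f1, f2):
    """All-pairs correlation between (H1, W1, C) and (H2, W2, C) feature maps,
    giving a 4D volume of shape (H1, W1, H2, W2)."""
    c = f1.shape[-1]
    corr = np.einsum('ijc,klc->ijkl', f1, f2)  # dot product per pixel pair
    return corr / np.sqrt(c)                   # RAFT-style scaling (assumed)
```

In practice the volume is then average-pooled over the last two dimensions to build the multi-scale pyramid that the GRU update operator looks up.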
S4.2, using the multi-scale correlation volume, progressively optimize the optical flow estimate by iterating the GRU-based update operator. Assume an initial optical flow field f_0; in the q-th iteration, the optical flow estimate is updated as:

f_q = f_{q−1} + GRU(x_q, f_{q−1})

where x_q is the correlated feature retrieved from the correlation volume C, and GRU(·) represents the gated recurrent unit.
S4.3, supervise with the L1 distance between the predicted optical flow f_q and the true optical flow f_gt, applying exponentially decaying weights γ over the prediction sequence {f_1, …, f_Q} to calculate the optical flow estimation loss L_flow:

L_flow = Σ_{q=1}^{Q} γ^{Q−q} ||f_gt − f_q||_1

where γ is the decay weight and Q represents the number of iterative updates of the GRU.
The exponential decay weight γ in this embodiment is set to 0.9, and 12 optical flow prediction iterations are performed to achieve a better coarse-to-fine optical flow update.
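The weighted sequence loss of S4.3 can be sketched directly from the formula; later iterations receive weights closer to 1, so the final prediction dominates the loss:

```python
import numpy as np

def sequence_loss(flow_preds, flow_gt, gamma=0.9):
    """Exponentially weighted L1 loss over a sequence of Q flow predictions:
    prediction q (1-indexed) is weighted by gamma**(Q - q)."""
    q_total = len(flow_preds)
    loss = 0.0
    for q, pred in enumerate(flow_preds):           # q is 0-indexed here
        weight = gamma ** (q_total - q - 1)
        loss += weight * np.abs(pred - flow_gt).mean()
    return loss
```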
The final loss function L is set to the sum of the prior feature loss L_prior and the optical flow estimation loss L_flow:

L = L_flow + λ · L_prior

where λ denotes the loss weight, set to 0.2 in this embodiment. The model is implemented in PyTorch and trained with the Adam optimizer for a total of 200,000 iterations. The initial learning rate is gradually reduced during training through a cosine annealing strategy.
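The cosine annealing schedule can be sketched as follows; the initial and minimum learning rates are illustrative placeholders, since the embodiment's exact values are not stated above:

```python
import numpy as np

def cosine_lr(step, total_steps, lr_init=4e-4, lr_min=0.0):
    """Cosine annealing from lr_init down to lr_min over total_steps.
    lr_init and lr_min are assumed placeholder values."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + np.cos(np.pi * t))
```

In a PyTorch training loop this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` attached to the Adam optimizer.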
The original images from the FlyingThings3D dataset are denoted as C, and the processed images with low-light noise characteristics are denoted as CN.
First, part of the original images C are selected for model training, and the rest are used as a test set for performance evaluation. The endpoint error (EPE) and 1-pixel accuracy (ACC 1px) are selected as evaluation indexes, and the performance of the method is compared with prior-art methods. The test results are shown in Table 1:
TABLE 1
As can be seen from the data in Table 1, the EPE of the present method is 2.91 and the ACC 1px is 86.54%, superior to all other methods. The present method improves the EPE by about 9.3% compared with the second-ranked GMFlow and by about 19.3% compared with RAFT. The experimental results show that introducing implicit feature supervision during training indirectly guides the model to learn effective feature extraction, significantly improving the subsequent optical flow estimation.
Then, model training is performed using the original images C and the low-light-noise images CN simultaneously, and the performance of different methods on the original normal images and the noise-injected images is compared, as shown in Table 2:
TABLE 2
The data in Table 2 show that the method achieves the best performance on all criteria. Compared with GMFlow, the EPE of the method improves by 7.1% and 7.8% on the normal images and the noise-injected images respectively, showing that, supported by the high/low-frequency feature enhancement network and prior feature guidance, the method is superior to models trained only on RGB images in addressing the optical flow estimation challenge.
To compare the performance of different models in low-light optical flow estimation, tests were performed on VBOF datasets, as shown in table 3:
TABLE 3
On the VBOF dataset and its SONY subset, the method achieves the best results: on the SONY subset, the EPE is 20.05, an improvement of about 1.5% over the second-ranked GMFlow. Furthermore, the method also achieves the best result over the entire VBOF dataset, about 1.1% better than RAFT, indicating that training with synthetic low-light images enables the method to cope effectively with real-world low-light noise scenes.
FIG. 2 illustrates a visual comparison of optical flow estimation by different methods. The first row shows the outputs of different methods on clear RGB images: the method effectively tracks the contours of small objects under implicit feature guidance and performs well in estimating the optical flow of moving objects. The second row shows the estimation results on noise-injected RGB images: despite the added noise, the method retains strong noise immunity, accurately recovering information in the damaged images and maintaining high optical flow estimation precision. The third row shows images in a real low-light environment: trained on the synthetic data, the method handles optical flow estimation efficiently in the face of real ambient noise. In contrast, other approaches perform poorly on noisy regions and elongated object contours, highlighting the limitations of noise-affected image features in the absence of additional prior guidance.
Finally, in order to verify the effectiveness of the high-low frequency characteristic enhancement network and the prior characteristic loss in the method, a series of ablation experiments are performed, and the results are shown in table 4:
TABLE 4
It can be seen that the prior feature loss improves the performance of the RAFT framework by about 7.9%, while the high/low-frequency feature enhancement network improves performance by about 4.1%. Applying both simultaneously, i.e. the full method, improves performance by about 12.7%.
In summary, the present application proposes a novel multi-modal collaborative implicit image enhancement method for improving optical flow estimation performance in challenging environments. By introducing multi-modal (RGBD) collaborative training, the quality of the input image is effectively improved, enabling more accurate optical flow estimation under complex conditions. Collaborative learning of RGB and depth images at the feature level enables the enhancement network to implicitly acquire multi-modal knowledge with geometric consistency, thereby improving the feature extraction capability for optical flow computation. Extensive experiments on synthetic and real datasets show that the method improves optical flow estimation performance.