
CN119478648B - A method for underwater robot vision clarity based on multimodal fusion network - Google Patents

A method for underwater robot vision clarity based on multimodal fusion network

Info

Publication number
CN119478648B
CN119478648B (Application CN202411488920.2A)
Authority
CN
China
Prior art keywords
underwater
polarization
network
features
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411488920.2A
Other languages
Chinese (zh)
Other versions
CN119478648A (en)
Inventor
齐晓志
赵青竹
秦国庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Qizhi Intelligent Technology Co ltd
Original Assignee
Nantong Qizhi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong Qizhi Intelligent Technology Co ltd filed Critical Nantong Qizhi Intelligent Technology Co ltd
Priority to CN202411488920.2A priority Critical patent/CN119478648B/en
Publication of CN119478648A publication Critical patent/CN119478648A/en
Application granted granted Critical
Publication of CN119478648B publication Critical patent/CN119478648B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/05Underwater scenes
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an underwater robot vision sharpening method based on a multimodal fusion network, belonging to the technical field of underwater image processing. The method mainly comprises: collecting turbid underwater images and corresponding clear underwater images and constructing an underwater polarization image dataset comprising underwater polarization images at different angles, degree-of-polarization images and angle-of-polarization images; constructing an underwater robot vision sharpening model based on the multimodal fusion network, comprising the multimodal fusion network and an image enhancement network; training the multimodal fusion network on the underwater polarization image dataset, during which pixel-level multi-scale fusion is used to update the RGB information and the polarization information and to generate fused features; training the image enhancement network with the underwater polarization image dataset to obtain an image enhancement model; and obtaining the turbid fused features to be processed and feeding them into the network-based image enhancement model, thereby obtaining a clear underwater image.

Description

Underwater robot vision sharpening method based on a multimodal fusion network
Technical Field
The invention relates to the technical fields of marine environment perception, digital image processing and image enhancement, and in particular to an underwater robot vision sharpening method based on a multimodal fusion network.
Background
With the growth of human exploration of the ocean, underwater robots are becoming an important tool for acquiring information about the seabed. However, the water medium strongly absorbs and scatters light and the underwater environment is complex: light loses energy as it propagates underwater, and impurities and suspended particles in the water scatter it along the way, so images collected underwater are blurrier than those collected on land. These effects give the captured images color deviation and low sharpness, which seriously degrades visual quality and the performance of underwater vision tasks. Underwater optical imaging is currently the core means of underwater environment perception and detection, and it plays an indispensable role in scientific fields such as underwater robotics, marine surveys and many downstream vision tasks (e.g. underwater target recognition or tracking). Because the underwater environment is complex and unstable, underwater images often suffer from color cast, low contrast and blur. Specifically, as light propagates in water it is first affected by depth: light of different wavelengths is attenuated progressively as depth increases. Red light disappears first, so underwater images tend to appear blue or green, producing a color shift. Meanwhile, unlike the air medium on land, water contains a large number of suspended particles that scatter light, which further challenges underwater optical imaging. In addition, because the imaging equipment moves unpredictably underwater, the acquired images suffer from blurred detail and low contrast, degrading visual perception and posing serious challenges for subsequent high-level vision tasks. At the same time, owing to the scarcity of underwater scenes and high-quality images, underwater image enhancement (UIE) faces challenges such as over-enhancement and blurred detail features. These problems limit the performance of UIE methods and lead to poor downstream task performance.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides an underwater robot vision sharpening method based on a multimodal fusion network. Inspired by multimodal learning, it introduces polarization information as an additional modality to strengthen the original underwater image and proposes a novel detail-focused, polarization-guided multimodal fusion network that integrates the RGB modality and the polarization modality to enhance underwater images. Detail-focused difference convolutions are used to capture more detail and edge information, and the degree-of-polarization and angle-of-polarization information is used to enhance the contrast and texture detail of different regions of the image, so that the true colors of the image are restored more accurately, the interference of degraded images with subsequent computer vision tasks is reduced, and the quality of underwater images is markedly improved. The technical solution is as follows:
An underwater robot vision sharpening method based on a multimodal fusion network is characterized by comprising the following steps:
S1, acquiring turbid underwater images and corresponding clear underwater images, and constructing an underwater polarization image dataset, wherein the underwater polarization image dataset comprises underwater polarization images at different angles, degree-of-polarization images and angle-of-polarization images;
S2, constructing an underwater robot vision sharpening model based on the multimodal fusion network, comprising the multimodal fusion network and an image enhancement U-Net network;
S3, training the multimodal fusion network on the underwater polarization image dataset, wherein pixel-level multi-scale fusion is used during training to update the RGB information and the polarization information and to generate fused features;
S4, training the image enhancement U-Net network with the underwater polarization image dataset to obtain an image enhancement model based on the U-Net network;
S5, acquiring the turbid fused features to be processed and feeding them into the image enhancement model based on the U-Net network, thereby obtaining a clear underwater image.
Further, the underwater polarization image dataset in step S1 is constructed as follows: water bodies of different colors and different turbidity levels are prepared in a water scene; a polarization camera is used to collect turbid underwater images of objects in the water bodies of different colors and turbidity levels; clear underwater images of the same objects in purified water are collected and used as label images; and a training set and a test set are constructed from the turbid underwater images and the label images.
Further, in step S2 the network architecture of the underwater robot vision sharpening model based on the multimodal fusion network is end-to-end and generates feature maps of different sizes at each level, so that the network can capture features at different scales. Before fusion, the turbid underwater polarization images in the training set are preprocessed to obtain RGB modal information and polarization modal information, where the polarization modal information comprises the degree-of-polarization information DoLP and the angle-of-polarization information AoLP; the resulting RGB modal information and polarization modal information are used as the inputs of the multimodal fusion network.
Further, the multimodal fusion network comprises two modules, a feature fusion module and a polarization-guided fusion module.
The feature fusion module robustly fuses the DoLP and AoLP features from the polarization-mode input domain using global and local information: two spatial attention maps are generated from the two token embedding sequences provided by two Conformers for the two input features DoLP and AoLP, and the extracted convolutional features are then weighted by the spatial attention maps and fused to obtain the polarization-mode feature.
The polarization-guided fusion module handles the modality bias: the RGB-mode input feature X is enhanced by an attention operation and updated under the guidance of the polarization-mode feature M to produce the fused feature X*; the polarization-mode feature and the RGB-mode feature are concatenated and projected by a multilayer perceptron to produce the key (k_x), query (q_x) and value (v_x); the embedding height H and width W of the query and the key are reduced along the spatial dimension and the channel statistics S_q, S_k of the query and the key are learned, giving the guided-update channel relation M*.
Further, the polarization-mode feature and the RGB-mode feature are concatenated and projected by a multilayer perceptron to produce the key (k_x), query (q_x) and value (v_x) as learnable parameters:
[X*] = FC(softmax(FC([q_x; k_x])) ⊙ v_x)
where ⊙ denotes element-wise multiplication and FC denotes a fully connected layer with filtering.
Further, the channel statistics S_q, S_k of the query and the key are learned by reducing the embedding height H and width W of the query and the key along the spatial dimension, giving the guided-update channel relation M* as:
K_m, Q_m, V_m = X, M, k_x
M* = M_x + FC((S_q·Q_m + S_k·K_m) ⊙ V_m).
Further, the image enhancement model based on the U-Net network consists of three parts: an encoder part, a feature transformation part and a decoder part. Feature extraction blocks are deployed from the first layer to the third layer of the image enhancement U-Net network, i.e. different blocks are used at different layers to extract the corresponding features, and the third layer uses a detail enhancement attention block (DEAB) to capture more detail and edge features.
The detail enhancement attention block (DEAB) comprises a detail-focused convolution block and a content-guided attention block. The detail-focused convolution block uses difference convolutions to integrate prior information that supplements the convolution layers of the parallel processing branch, enhancing the representation capability.
The content-guided attention block uses dynamic fusion: by assigning a unique spatial importance map to each channel, it extracts more of the useful information encoded in the features. It fuses the low-level features from the encoder part with the corresponding high-level features from the decoder part and modulates them with learned spatial weights, so that the encoder low-level features and the corresponding decoder high-level features are fused adaptively; input features are also added through a skip connection to alleviate the vanishing-gradient problem and simplify the learning process, and the fused features are mapped by a 3×3 convolution layer to obtain the final sharpened result.
Further, two downsampling and two upsampling operations are used between the different layers to keep the dimensions consistent. The downsampling operation halves the spatial dimensions and doubles the number of channels; it is implemented by a convolution layer with the stride set to 2 and the number of output channels set to twice the number of input channels. The upsampling operation is the inverse of the downsampling operation and is implemented by a deconvolution layer. The sizes of the first, second and third layers are C×H×W, 2C×(H/2)×(W/2) and 4C×(H/4)×(W/4), respectively.
Compared with the prior art, the invention has the following advantages:
The invention provides an underwater robot vision sharpening method based on a multimodal fusion network. The method effectively avoids the imaging defects of existing underwater image sharpening methods under high turbidity, removes turbidity while improving the quality of the underwater image, and is effective and robust, laying a theoretical and technical foundation for subsequent vision tasks such as seabed panoramic observation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort to a person skilled in the art.
Fig. 1 is a schematic flow chart of the underwater robot vision sharpening method based on a multimodal fusion network in an embodiment of the invention.
Fig. 2 is a network architecture diagram of the underwater robot vision sharpening model based on the multimodal fusion network in an embodiment of the invention.
Fig. 3 shows a turbid image, a polarization angle image and a degree-of-polarization image obtained by the underwater robot vision sharpening network based on the multimodal fusion network in an embodiment of the invention.
Fig. 4 shows a clear underwater image output by the underwater robot vision sharpening network based on the multimodal fusion network in an embodiment of the invention.
Fig. 5 is a table comparing the underwater robot vision sharpening network based on the multimodal fusion network with other existing networks on five common underwater image quality evaluation indices in an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms in the description of the present invention and the claims and the above-described drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in Fig. 1, the invention provides an underwater robot vision sharpening method based on a multimodal fusion network, which mainly comprises the following steps:
S1, constructing the underwater polarization image dataset. The training dataset comprises underwater polarization images at four different angles, a degree-of-polarization image and an angle-of-polarization image. Specifically, water bodies of different colors and different turbidity levels are prepared in an artificially constructed indoor water scene; turbid underwater images of objects in these water bodies are collected, and clear underwater images of the same objects in purified water are collected and used as label images.
For example, an indoor turbid underwater polarization image acquisition platform is built, comprising a glass water tank, a polarization camera, a computer, an illumination system and a camera tripod. The glass tank measures 150 cm × 35 cm × 50 cm, and the illumination system consists of three lamps of different colors: blue, green and white. The turbid underwater images are acquired as follows:
First, object images are acquired in water bodies of different turbidity levels under the different color scenes: for each group, object images (i.e. turbid underwater images) are acquired at 5 turbidity levels in 3 color scenes, and clear label images are acquired as well. During shooting, the polarization camera and the objects in the tank are physically fixed so that their relative positions do not change. Second, a variety of objects such as corals, starfish, conches and shells are prepared and divided into two groups during data collection: in one group, sand and gravel are laid on the bottom of the tank to simulate the seabed or a riverbed and the objects are fixed on this bottom, giving underwater images of a complex scene; in the other group, the objects are fixed on a pure-white background board, giving underwater images of a simple scene. In total, 1600 turbid underwater images and their corresponding label images are acquired with the platform, 800 each for the simple and complex scenes, at a resolution of 1024 × 1224.
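For illustration, the paired turbid/label images described above can be organized as a standard paired-image dataset. The sketch below is a minimal Python/PyTorch example; the directory layout, file naming and the PairedUnderwaterDataset class are assumptions made for this sketch and are not specified in the patent.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class PairedUnderwaterDataset(Dataset):
    """Pairs each turbid capture with its clear label image.

    Assumed (hypothetical) layout: <root>/turbid/NNNN.png and
    <root>/labels/NNNN.png share the same file name per scene.
    """
    def __init__(self, root, transform=None):
        self.turbid_dir = os.path.join(root, "turbid")
        self.label_dir = os.path.join(root, "labels")
        self.names = sorted(os.listdir(self.turbid_dir))
        self.transform = transform

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        turbid = Image.open(os.path.join(self.turbid_dir, name)).convert("RGB")
        label = Image.open(os.path.join(self.label_dir, name)).convert("RGB")
        if self.transform is not None:
            turbid, label = self.transform(turbid), self.transform(label)
        return turbid, label
```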
S2, constructing the underwater robot vision sharpening model based on the multimodal fusion network, comprising the multimodal fusion network and the image enhancement U-Net network. The model has an end-to-end architecture and generates feature maps of different sizes at each level, so that the network can capture features at different scales.
Before the fusion network is applied, the turbid underwater polarization images in the training set are preprocessed to obtain RGB modal information and polarization modal information. Through division-of-focal-plane splitting, the polarization camera measures the light intensity I_pol transmitted through a linear polarizer at polarization angle φ_pol, calculated as:
I_pol = I_un · (1 + ρ · cos(2φ − 2φ_pol))
S_0 = I_0° + I_90° = I_45° + I_135°
S_1 = I_0° − I_90°
S_2 = I_45° − I_135°
where I_un is the total incident light entering the camera (generally unpolarized light), ρ is the degree of linear polarization, φ is the linear polarization angle, and S_0, S_1, S_2 are the Stokes parameters; S_0 is also used as the RGB modal information. The polarization modal information comprises DoLP and AoLP, where DoLP denotes the degree-of-polarization information and AoLP the angle-of-polarization information. I_0°, I_45°, I_90° and I_135° are the images captured by the polarization camera for light in the four linear polarization states at 0°, 45°, 90° and 135°. The resulting RGB modal information and polarization modal information are then used as the inputs of the multimodal fusion network.
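The preprocessing above amounts to a few per-pixel array operations. The following NumPy sketch computes the Stokes parameters, DoLP and AoLP from the four polarizer-angle captures; it assumes registered floating-point images and uses the standard definitions DoLP = sqrt(S_1² + S_2²)/S_0 and AoLP = ½·arctan2(S_2, S_1), which the text above states only in words.

```python
import numpy as np

def polarization_modalities(i0, i45, i90, i135, eps=1e-6):
    """Stokes parameters, degree and angle of linear polarization.

    i0, i45, i90, i135: float arrays captured behind linear polarizers
    at 0°, 45°, 90° and 135° (assumed registered and demosaicked).
    """
    s0 = i0 + i90                                   # total intensity, also the RGB-mode input
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1**2 + s2**2) / (s0 + eps)      # degree of linear polarization
    aolp = 0.5 * np.arctan2(s2, s1)                 # angle of linear polarization
    return s0, dolp, aolp
```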
S3, the multimodal fusion network comprises a feature fusion module and a polarization-guided fusion module. The multimodal fusion network is trained on the underwater polarization image dataset; during training, the feature fusion module and the polarization-guided fusion module update the RGB modal information and the polarization modal information and generate the fused features.
S301, a feature fusion module robustly fuses the DoLP and AoLP features from the polarization-mode input domain using global and local information. Two spatial attention maps are generated from the two token embedding sequences provided by two Conformers for the two input features DoLP and AoLP, and the extracted convolutional features are then weighted by these attention maps and fused together:
M_φ, M_ρ = softmax(Ω(T_φ), Ω(T_ρ))
where C and T are the convolutional features and the token embeddings produced by the conv and trans branches of the Conformer, respectively, ⊙ denotes element-wise multiplication (used to weight the convolutional features C with the attention maps), M_φ and M_ρ are the attention maps generated for φ (AoLP) and ρ (DoLP), and Ω is a function that first reduces each token embedding to dimension 1 through a fully connected layer and then reshapes the resulting embeddings into a two-dimensional map. Features extracted from Conformers at different layers are used, together with the DoLP and AoLP, to capture more detail and edge information and to enhance the contrast and texture detail of different regions of the image, so that the true colors of the image are restored more accurately.
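A minimal sketch of this weighting-and-fusion step is given below in PyTorch. The token sequence is assumed to tile the spatial grid (no class token), and the final combination of the two weighted feature maps is taken to be a simple sum, since the exact fusion formula is not reproduced in the text; both choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PolarizationFeatureFusion(nn.Module):
    """Sketch of the feature fusion module: token sequences from two
    Conformer branches (DoLP and AoLP) are turned into spatial attention
    maps that weight the corresponding convolutional features."""
    def __init__(self, embed_dim):
        super().__init__()
        self.omega_d = nn.Linear(embed_dim, 1)   # Ω for DoLP tokens
        self.omega_a = nn.Linear(embed_dim, 1)   # Ω for AoLP tokens

    def forward(self, c_d, c_a, t_d, t_a):
        # c_*: conv features (B, C, H, W); t_*: token embeddings (B, H*W, D)
        b, c, h, w = c_d.shape
        m_d = self.omega_d(t_d).reshape(b, 1, h, w)   # each token reduced to one scalar
        m_a = self.omega_a(t_a).reshape(b, 1, h, w)
        m = torch.softmax(torch.cat([m_d, m_a], dim=1), dim=1)  # per-pixel competition
        m_d, m_a = m[:, :1], m[:, 1:]
        # weighted fusion of the two convolutional feature maps (assumed: simple sum)
        return m_d * c_d + m_a * c_a                  # polarization-mode feature M
```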
S302, because the polarization mode and the RGB mode deviate from each other considerably, the importance of the cues gathered from the RGB mode and the polarization mode is scene-dependent; simply concatenating them can dilute strong cues with weak signals and even amplify the adverse effect of mixed cues. To handle this modality bias, a polarization-guided fusion module is designed. The RGB-mode input feature X is enhanced by an attention operation and updated under the guidance of the polarization-mode input feature M to produce the fused feature X*. The polarization-mode feature and the RGB-mode feature are concatenated and projected by a multilayer perceptron to produce the key (k_x), query (q_x) and value (v_x) as learnable parameters:
[X*] = FC(softmax(FC([q_x; k_x])) ⊙ v_x)
where ⊙ denotes element-wise multiplication and FC denotes a fully connected layer with filtering.
The embedding height H and width W of the query and the key are reduced along the spatial dimension and the channel statistics S_q, S_k of the query and the key are learned, giving the channel relation that guides the update:
K_m, Q_m, V_m = X, M, k_x
M* = M_x + FC((S_q·Q_m + S_k·K_m) ⊙ V_m)
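The following sketch turns the two formulas above into runnable PyTorch code. Channel-last tensors of shape (B, N, C), mean pooling over space for the channel statistics S_q and S_k, and plain linear layers standing in for the "FC with filtering" are all assumptions; the sketch illustrates the structure of the module rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PolarizationGuidedFusion(nn.Module):
    """Sketch of the polarization-guided fusion module:
    X* = FC(softmax(FC([q_x; k_x])) ⊙ v_x) and
    M* = M + FC((S_q·Q_m + S_k·K_m) ⊙ V_m) with K_m, Q_m, V_m = X, M, k_x."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(2 * dim, 3 * dim)   # MLP on the concatenated [M; X]
        self.fc_attn = nn.Linear(2 * dim, dim)      # inner FC before the softmax
        self.fc_x = nn.Linear(dim, dim)             # outer FC producing X*
        self.fc_m = nn.Linear(dim, dim)             # FC producing the update of M

    def forward(self, x, m):
        # x: RGB-mode feature, m: polarization-mode feature, both (B, N, C)
        q, k, v = self.to_qkv(torch.cat([m, x], dim=-1)).chunk(3, dim=-1)
        attn = torch.softmax(self.fc_attn(torch.cat([q, k], dim=-1)), dim=-1)
        x_star = self.fc_x(attn * v)                     # fused feature X*
        s_q = q.mean(dim=1, keepdim=True)                # channel statistics over space (assumed: mean)
        s_k = k.mean(dim=1, keepdim=True)
        m_star = m + self.fc_m((s_q * m + s_k * x) * k)  # guided channel update M*
        return x_star, m_star
```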
S4, the enhancement network is trained with the underwater polarization image dataset. The output of the multimodal fusion network, i.e. the turbid fused feature to be processed, is fed into the image enhancement network based on the U-Net, which produces the clear underwater image. The image enhancement network based on the U-Net consists of three parts: an encoder part, a feature transformation part and a decoder part.
For the turbid fused features to be processed, the goal of the U-Net is to restore the corresponding sharp image. For a detail-sensitive task such as turbidity removal, however, performing feature transformation only in a low-resolution space leads to information loss. Feature extraction blocks are therefore deployed from the first layer to the third layer of the U-Net, with different blocks at different layers: the first and second layers use conventional feature extraction blocks (DEBs), while the third layer uses detail enhancement attention blocks (DEABs) to capture more detail and edge features. The U-Net uses two downsampling and two upsampling operations. The downsampling operation halves the spatial dimensions and doubles the number of channels and is implemented by an ordinary convolution layer with the stride set to 2 and the number of output channels set to twice the number of input channels; the upsampling operation can be regarded as the inverse of the downsampling operation and is implemented by a deconvolution layer. The sizes of the first, second and third layers are C×H×W, 2C×(H/2)×(W/2) and 4C×(H/4)×(W/4), respectively.
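The down- and upsampling rules stated above map directly onto strided convolution and transposed convolution layers. A minimal PyTorch sketch follows; the kernel sizes are an assumption.

```python
import torch.nn as nn

def downsample(channels):
    # halve the spatial size, double the channel count (stride-2 convolution)
    return nn.Conv2d(channels, channels * 2, kernel_size=3, stride=2, padding=1)

def upsample(channels):
    # inverse of the downsampling step: double spatial size, halve the channels
    return nn.ConvTranspose2d(channels, channels // 2, kernel_size=4, stride=2, padding=1)
```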
Further, the detail enhancement attention block consists of a detail focusing convolution block and a content guiding attention block, which are used for enhancing feature learning so as to improve the haze removal performance.
The detail-focused convolution block uses difference convolutions to integrate prior information that supplements the ordinary convolution and enhances the representation capability; by re-parameterization, it can be equivalently converted into an ordinary convolution, reducing parameters and computational cost.
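As an illustration of the idea, the sketch below pairs an ordinary 3×3 convolution with a single difference-convolution branch whose kernel is re-centred to sum to zero, so that it responds to edges rather than absolute intensity. The exact set of difference convolutions used in the patent is not specified, so this single branch is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class DetailFocusedConv(nn.Module):
    """Sketch of a detail-focused convolution: plain conv + difference conv."""
    def __init__(self, channels):
        super().__init__()
        self.vanilla = nn.Conv2d(channels, channels, 3, padding=1)
        self.diff = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        w = self.diff.weight
        w_cd = w - w.mean(dim=(2, 3), keepdim=True)       # kernel elements sum to zero per window
        diff_out = F.conv2d(x, w_cd, self.diff.bias, padding=1)
        # both branches are 3x3 convs on the same input, so at inference their
        # kernels and biases can be summed into one conv (re-parameterization)
        return self.vanilla(x) + diff_out
```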
The content-guided attention block consists of channel attention and spatial attention, which compute attention weights along the channel and spatial dimensions, respectively. Channel attention computes a channel vector W_c to recalibrate the features, while spatial attention computes a spatial importance map W_s to adaptively indicate the informative regions. Because the content-guided attention block treats different channels and pixels non-uniformly, it improves the dehazing performance.
Here max(0, x) denotes the ReLU activation function, Conv_k×k denotes a convolution layer with a k×k kernel, and [·] denotes channel concatenation. The pooled inputs are, respectively, the feature after global average pooling over the spatial dimensions, the feature after global average pooling over the channel dimension, and the feature after global max pooling over the channel dimension. To reduce the number of parameters and limit the model's complexity, the first 1×1 convolution reduces the channel dimension from C to a smaller intermediate dimension, and the second 1×1 convolution expands it back to C.
W_coa = W_c + W_s
The content-guided attention block obtains a dedicated spatial importance map for each individual channel of the input features in a coarse-to-fine manner, while fully mixing the channel attention weights and spatial attention weights to ensure information interaction. Following the broadcasting rule, W_c and W_s are fused by a simple addition to give a coarse spatial importance map W_coa; since W_c is channel-wise, W_coa has the same channels as X. To obtain the final refined spatial importance map W, each channel of W_coa is adjusted according to the corresponding input feature, i.e. the content of the input features is used as guidance to generate the final channel-specific spatial importance map W. Specifically, the channels of W_coa and X are rearranged in an alternating fashion by a channel shuffle operation; combined with the subsequent group convolution layer, this greatly reduces the number of parameters.
Here σ denotes the sigmoid operation, CS(·) denotes the channel shuffle operation and GConv_k×k denotes a group convolution layer with a k×k kernel, with the number of groups set to C in the implementation. The content-guided attention mechanism assigns each channel its own spatial importance map and guides the model to focus on the important regions of each channel, so that more of the useful information encoded in the features is emphasized, effectively improving the dehazing performance.
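A hedged sketch of such a content-guided attention block is given below. The reduction ratio, kernel sizes, pooling choices and the exact shuffle pattern are assumptions; the block follows the structure described above (channel attention W_c, spatial attention W_s, coarse map W_coa, channel shuffle, group convolution, sigmoid).

```python
import torch
import torch.nn as nn

class ContentGuidedAttention(nn.Module):
    """Sketch: per-channel spatial importance map W from W_c, W_s and the input X."""
    def __init__(self, channels, reduction=4, kernel_size=7):
        super().__init__()
        self.channel_att = nn.Sequential(                 # W_c: (B, C, 1, 1)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial_att = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)  # W_s: (B, 1, H, W)
        self.refine = nn.Conv2d(2 * channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)          # groups set to C
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        w_c = self.channel_att(x)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        w_s = self.spatial_att(pooled)
        w_coa = w_c + w_s                                  # broadcast to (B, C, H, W)
        # channel shuffle: interleave each channel of W_coa with the same channel of X
        b, c, h, w = x.shape
        mixed = torch.stack([w_coa, x], dim=2).reshape(b, 2 * c, h, w)
        return self.sigmoid(self.refine(mixed))            # per-channel spatial importance map W
```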
Second, the dynamic fusion scheme based on the content-guided attention block effectively fuses features and helps gradient flow: in an encoder-decoder-like architecture, the features after the downsampling operations are fused with the corresponding features before the upsampling operations. Fusing the feature F_low from the encoder part with the feature F_high from the decoder part to obtain F_fuse is an effective technique in dehazing and other low-level vision tasks. Low-level features (such as edges and contours) have a non-negligible effect on restoring a sharp image, but gradually lose their influence after passing through many intermediate layers; feature fusion strengthens the information flow from shallow to deep layers and benefits feature preservation and gradient back-propagation. In the dynamic fusion scheme the features are modulated by learned spatial weights, so that the low-level features of the encoder part are fused adaptively with the corresponding high-level features, and the spatial weights used for this modulation are computed by the content-guided attention mechanism. The encoder low-level features and the corresponding high-level features are fed to the content-guided attention mechanism to compute the weights and are then combined by weighted summation; the input features are added through a skip connection to alleviate the vanishing-gradient problem and simplify the learning process. Finally, the fused features are mapped by a 3×3 convolution layer to obtain the final sharpened result:
F_fuse = Conv_1×1(F_low · W + F_high · (1 − W) + F_low + F_high)
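The fusion formula above can be sketched as follows, reusing the ContentGuidedAttention class from the previous sketch. Computing the weight W from the sum of the two features is an assumption, since the text only says the feature pair is fed to the content-guided attention mechanism.

```python
import torch.nn as nn

class CGAFusion(nn.Module):
    """Sketch of the content-guided dynamic fusion:
    F_fuse = Conv1x1(F_low*W + F_high*(1-W) + F_low + F_high)."""
    def __init__(self, channels):
        super().__init__()
        self.cga = ContentGuidedAttention(channels)   # class from the previous sketch
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, f_low, f_high):
        w = self.cga(f_low + f_high)                  # spatial weights in [0, 1] (assumed input: sum)
        fused = f_low * w + f_high * (1 - w) + f_low + f_high
        return self.proj(fused)
```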
It should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that the technical solution described in the above embodiments may be modified or some or all of the technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the scope of the technical solution of the embodiments of the present invention.

Claims (8)

1. An underwater robot vision sharpening method based on a multimodal fusion network, characterized by comprising the following steps:
S1, collecting turbid underwater images and corresponding clear underwater images, and constructing an underwater polarization image dataset, wherein the underwater polarization image dataset comprises underwater polarization images at different angles, degree-of-polarization images and angle-of-polarization images;
S2, constructing an underwater robot vision sharpening model based on the multimodal fusion network, comprising the multimodal fusion network and an image enhancement U-Net network;
S3, training the multimodal fusion network on the underwater polarization image dataset, wherein pixel-level multi-scale fusion is used during training to update the RGB information and the polarization information and to generate fused features;
S4, training the image enhancement U-Net network with the underwater polarization image dataset to obtain an image enhancement model based on the U-Net network;
S5, obtaining the turbid fused features output by the multimodal fusion network for the image to be processed and feeding them into the image enhancement model based on the U-Net network, thereby obtaining a clear underwater image.

2. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 1, characterized in that the underwater polarization image dataset in step S1 is constructed as follows: water bodies of different colors and different turbidity levels are prepared in a water scene; a polarization camera is used to collect turbid underwater images of objects in the water bodies of different colors and turbidity levels; clear underwater images of the same objects in purified water are collected and used as label images; and a training set and a test set are constructed from the turbid underwater images and the label images.

3. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 1, characterized in that in step S2 the network architecture of the underwater robot vision sharpening model based on the multimodal fusion network is end-to-end and generates feature maps of different sizes at each level, so that the network can capture features at different scales; before fusion, the turbid underwater polarization images in the training set are preprocessed to obtain RGB modal information and polarization modal information, the polarization modal information comprising the degree-of-polarization information DoLP and the angle-of-polarization information AoLP; and the obtained RGB modal information and polarization modal information are used as the input of the multimodal fusion network.

4. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 1, characterized in that the multimodal fusion network comprises two modules, a feature fusion module and a polarization-guided fusion module, wherein:
the feature fusion module robustly fuses the DoLP and AoLP features from the polarization-mode input domain by exploiting global and local information; two spatial attention maps are generated from the two token embedding sequences provided by two Conformers for the two input features DoLP and AoLP, and the extracted convolutional features are then weighted by the spatial attention maps and fused to obtain the polarization-mode feature;
the polarization-guided fusion module handles the modality bias: the RGB-mode input feature X is enhanced by an attention operation and updated under the guidance of the polarization-mode feature M to produce the fused feature X*; the polarization-mode feature and the RGB-mode feature are concatenated and projected by a multilayer perceptron to produce the key (k_x), query (q_x) and value (v_x); the embedding height H and width W of the query and the key are reduced along the spatial dimension and the channel statistics S_q, S_k of the query and the key are learned, giving the guided-update channel relation M*.

5. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 4, characterized in that the polarization-mode feature and the RGB-mode feature are concatenated and projected by a multilayer perceptron to produce the key (k_x), query (q_x) and value (v_x) as learnable parameters:
[X*] = FC(softmax(FC([q_x; k_x])) ⊙ v_x)
where ⊙ denotes element-wise multiplication and FC denotes a fully connected layer with filtering.

6. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 5, characterized in that the channel statistics S_q, S_k of the query and the key are learned by reducing the embedding height H and width W of the query and the key along the spatial dimension, giving the guided-update channel relation M* as:
K_m, Q_m, V_m = X, M, k_x
M* = M_x + FC((S_q·Q_m + S_k·K_m) ⊙ V_m).

7. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 1, characterized in that the image enhancement model based on the U-Net network consists of three parts: an encoder part, a feature transformation part and a decoder part; feature extraction blocks are deployed from the first layer to the third layer of the image enhancement U-Net network, i.e. different blocks are used at different layers to extract the corresponding features, and the third layer uses a detail enhancement attention block DEAB to capture more detail and edge features;
the detail enhancement attention block DEAB comprises a detail-focused convolution block and a content-guided attention block; the detail-focused convolution block uses difference convolutions to integrate prior information that supplements the convolution layers of the parallel processing branch, enhancing the representation capability; by re-parameterization, the detail-focused convolution is equivalently converted into an ordinary convolution operation, reducing parameters and computational cost;
the content-guided attention block uses dynamic fusion and obtains more of the useful information encoded in the features by assigning a unique spatial importance map to each channel: the low-level features from the encoder part are fused with the high-level features from the decoder part and modulated by learned spatial weights, so that the encoder low-level features and the corresponding decoder high-level features are fused adaptively; the input features are also added through a skip connection to alleviate the vanishing-gradient problem and simplify the learning process; and the fused features are mapped by a 3×3 convolution layer to obtain the final sharpened result.

8. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 7, characterized in that two downsampling and two upsampling operations are used between the different layers to keep the dimensions consistent; the downsampling operation halves the spatial dimensions and doubles the number of channels and is implemented by a convolution layer with the stride set to 2 and the number of output channels set to twice the number of input channels; the upsampling operation is regarded as the inverse of the downsampling operation and is implemented by a deconvolution layer; and the sizes of the first, second and third layers are C×H×W, 2C×(H/2)×(W/2) and 4C×(H/4)×(W/4), respectively.
CN202411488920.2A 2024-10-24 2024-10-24 A method for underwater robot vision clarity based on multimodal fusion network Active CN119478648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411488920.2A CN119478648B (en) 2024-10-24 2024-10-24 A method for underwater robot vision clarity based on multimodal fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411488920.2A CN119478648B (en) 2024-10-24 2024-10-24 A method for underwater robot vision clarity based on multimodal fusion network

Publications (2)

Publication Number Publication Date
CN119478648A CN119478648A (en) 2025-02-18
CN119478648B true CN119478648B (en) 2025-07-18

Family

ID=94596171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411488920.2A Active CN119478648B (en) 2024-10-24 2024-10-24 A method for underwater robot vision clarity based on multimodal fusion network

Country Status (1)

Country Link
CN (1) CN119478648B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120355594A (en) * 2025-06-24 2025-07-22 苏州城市学院 Sparse aperture optical system polarization image fusion method based on deep learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740515A (en) * 2023-05-19 2023-09-12 中北大学 CNN-based intensity image and polarization image fusion enhancement method
CN117048814A (en) * 2023-09-15 2023-11-14 南通奇致智能科技有限公司 High-flexibility underwater intelligent robot

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7420675B2 (en) * 2003-06-25 2008-09-02 The University Of Akron Multi-wavelength imaging system
CN114549548B (en) * 2022-01-28 2024-09-13 大连理工大学 Glass image segmentation method based on polarization clues
CN117291832A (en) * 2023-08-25 2023-12-26 天津市天开海洋科技有限公司 Underwater polarized image polarization information restoration method based on deep neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740515A (en) * 2023-05-19 2023-09-12 中北大学 CNN-based intensity image and polarization image fusion enhancement method
CN117048814A (en) * 2023-09-15 2023-11-14 南通奇致智能科技有限公司 High-flexibility underwater intelligent robot

Also Published As

Publication number Publication date
CN119478648A (en) 2025-02-18

Similar Documents

Publication Publication Date Title
Ikoma et al. Depth from defocus with learned optics for imaging and occlusion-aware depth estimation
CN101422035B (en) Light source estimation device, light source estimation system, and light source estimation method, and image high resolution device and image high resolution method
CN119478648B (en) A method for underwater robot vision clarity based on multimodal fusion network
CN106997581A (en) A kind of method that utilization deep learning rebuilds high spectrum image
Agrafiotis et al. Underwater photogrammetry in very shallow waters: main challenges and caustics effect removal
Singh et al. Low-light image enhancement for UAVs with multi-feature fusion deep neural networks
CN113160053B (en) An underwater video image restoration and stitching method based on pose information
CN113160085B (en) A method for collecting water splash occlusion image dataset based on generative adversarial network
CN115035010A (en) Underwater image enhancement method based on convolutional network guided model mapping
CN112906675A (en) Unsupervised human body key point detection method and system in fixed scene
CN118469842B (en) A remote sensing image dehazing method based on generative adversarial network
CN113592755B (en) Image reflection elimination method based on panoramic camera
CN119784943A (en) An underwater 3D measurement method based on the fusion of vision and line structured light
CN112950481A (en) Water bloom shielding image data collection method based on image mosaic network
CN117094895B (en) Image panorama stitching method and system
Vijayalakshmi et al. Variants of generative adversarial networks for underwater image enhancement
CN115439376B (en) Compound eye camera multi-focal-length image fusion model, method and device
Huang et al. AFNet: Asymmetric fusion network for monocular panorama depth estimation
CN117115038A (en) An image glare removal system and method based on glare degree estimation
Xu et al. Real-time panoramic map modeling method based on multisource image fusion and three-dimensional rendering
CN115471397A (en) Multimodal Image Registration Method Based on Disparity Estimation
CN115034974A (en) Method, device and storage medium for natural color restoration of visible light and infrared fusion images
Tandekar et al. Underwater Image Enhancement through Deep Learning and Advanced Convolutional Encoders
Li et al. Context convolution dehazing network with channel attention
CN107038706A (en) Infrared image confidence level estimation device and method based on adaptive mesh

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant