
CN119478648B - A method for underwater robot vision clarity based on multimodal fusion network - Google Patents

A method for underwater robot vision clarity based on multimodal fusion network

Info

Publication number
CN119478648B
CN119478648B (Application CN202411488920.2A)
Authority
CN
China
Prior art keywords
underwater
polarization
network
features
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411488920.2A
Other languages
Chinese (zh)
Other versions
CN119478648A (en)
Inventor
齐晓志
赵青竹
秦国庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Qizhi Intelligent Technology Co ltd
Original Assignee
Nantong Qizhi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong Qizhi Intelligent Technology Co ltd filed Critical Nantong Qizhi Intelligent Technology Co ltd
Priority to CN202411488920.2A priority Critical patent/CN119478648B/en
Publication of CN119478648A publication Critical patent/CN119478648A/en
Application granted granted Critical
Publication of CN119478648B publication Critical patent/CN119478648B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/05Underwater scenes
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an underwater robot vision sharpening method based on a multimodal fusion network, belonging to the technical field of underwater image processing. The method mainly comprises: collecting turbid underwater images and corresponding clear underwater images and constructing an underwater polarization image dataset comprising underwater polarization images at different angles, degree-of-polarization images and angle-of-polarization images; constructing an underwater robot vision sharpening model based on the multimodal fusion network, comprising the multimodal fusion network and an image enhancement network; training the multimodal fusion network on the underwater polarization image dataset, during which pixel-level multi-scale fusion is used to update the RGB information and the polarization information and to generate fused features; training the image enhancement network with the underwater polarization image dataset to obtain an image enhancement model; and obtaining the turbid fused features to be processed and feeding them into the network-based image enhancement model, thereby obtaining a clear underwater image.

Description

Underwater robot vision sharpening method based on a multimodal fusion network
Technical Field
The invention relates to the technical fields of marine environment perception, digital image processing and image enhancement, and in particular to an underwater robot vision sharpening method based on a multimodal fusion network.
Background
With the growth of human exploration of the ocean, underwater robots are becoming an important tool for acquiring information about the seabed. However, the water medium strongly absorbs and scatters light and the underwater environment is complex: light loses energy as it propagates underwater, and impurities and suspended particles in the water scatter it along the way, so images collected underwater are blurrier than those collected on land. These effects give the captured images color deviation and low sharpness, which seriously degrades visual quality and the performance of underwater vision tasks. Underwater optical imaging is currently the core means of underwater environment perception and detection, and it plays an indispensable role in scientific fields such as underwater robotics, marine surveys and many downstream vision tasks (e.g. underwater target recognition or tracking). Because the underwater environment is complex and unstable, underwater images often suffer from color cast, low contrast and blur. Specifically, as light propagates in water it is first affected by depth: light of different wavelengths is attenuated progressively as depth increases. Red light disappears first, so underwater images tend to appear blue or green, producing a color shift. Meanwhile, unlike the air medium on land, water contains a large number of suspended particles that scatter light, which further challenges underwater optical imaging. In addition, because the imaging equipment moves unpredictably underwater, the acquired images suffer from blurred detail and low contrast, degrading visual perception and posing serious challenges for subsequent high-level vision tasks. At the same time, owing to the scarcity of underwater scenes and high-quality images, underwater image enhancement (UIE) faces challenges such as over-enhancement and blurred detail features. These problems limit the performance of UIE methods and lead to poor downstream task performance.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides an underwater robot vision sharpening method based on a multimodal fusion network. Inspired by multimodal learning, it introduces polarization information as an additional modality to strengthen the original underwater image and proposes a novel detail-focused, polarization-guided multimodal fusion network that integrates the RGB modality and the polarization modality to enhance underwater images. Detail-focused difference convolutions are used to capture more detail and edge information, and the degree-of-polarization and angle-of-polarization information is used to enhance the contrast and texture detail of different regions of the image, so that the true colors of the image are restored more accurately, the interference of degraded images with subsequent computer vision tasks is reduced, and the quality of underwater images is markedly improved. The technical solution is as follows:
An underwater robot vision sharpening method based on a multimodal fusion network is characterized by comprising the following steps:
S1, acquiring turbid underwater images and corresponding clear underwater images, and constructing an underwater polarization image dataset, wherein the underwater polarization image dataset comprises underwater polarization images at different angles, degree-of-polarization images and angle-of-polarization images;
S2, constructing an underwater robot vision sharpening model based on the multimodal fusion network, comprising the multimodal fusion network and an image enhancement U-Net network;
S3, training the multimodal fusion network on the underwater polarization image dataset, wherein pixel-level multi-scale fusion is used during training to update the RGB information and the polarization information and to generate fused features;
S4, training the image enhancement U-Net network with the underwater polarization image dataset to obtain an image enhancement model based on the U-Net network;
S5, acquiring the turbid fused features to be processed and feeding them into the image enhancement model based on the U-Net network, thereby obtaining a clear underwater image.
Further, the underwater polarization image dataset in step S1 is constructed as follows: water bodies of different colors and different turbidity levels are prepared in a water scene; a polarization camera is used to collect turbid underwater images of objects in the water bodies of different colors and turbidity levels; clear underwater images of the same objects in purified water are collected and used as label images; and a training set and a test set are constructed from the turbid underwater images and the label images.
Further, in step S2 the network architecture of the underwater robot vision sharpening model based on the multimodal fusion network is end-to-end and generates feature maps of different sizes at each level, so that the network can capture features at different scales. Before fusion, the turbid underwater polarization images in the training set are preprocessed to obtain RGB modal information and polarization modal information, where the polarization modal information comprises the degree-of-polarization information DoLP and the angle-of-polarization information AoLP; the resulting RGB modal information and polarization modal information are used as the inputs of the multimodal fusion network.
Further, the multimodal fusion network comprises two modules, a feature fusion module and a polarization-guided fusion module.
The feature fusion module robustly fuses the DoLP and AoLP features from the polarization-mode input domain using global and local information: two spatial attention maps are generated from the two token embedding sequences provided by two Conformers for the two input features DoLP and AoLP, and the extracted convolutional features are then weighted by the spatial attention maps and fused to obtain the polarization-mode feature.
The polarization-guided fusion module handles the modality bias: the RGB-mode input feature X is enhanced by an attention operation and updated under the guidance of the polarization-mode feature M to produce the fused feature X*; the polarization-mode feature and the RGB-mode feature are concatenated and projected by a multilayer perceptron to produce the key (k_x), query (q_x) and value (v_x); the embedding height H and width W of the query and the key are reduced along the spatial dimension and the channel statistics S_q, S_k of the query and the key are learned, giving the guided-update channel relation M*.
Further, the polarization-mode feature and the RGB-mode feature are concatenated and projected by a multilayer perceptron to produce the key (k_x), query (q_x) and value (v_x) as learnable parameters:
[X*] = FC(softmax(FC([q_x; k_x])) ⊙ v_x)
where ⊙ denotes element-wise multiplication and FC denotes a fully connected layer with filtering.
Further, the channel statistics S_q, S_k of the query and the key are learned by reducing the embedding height H and width W of the query and the key along the spatial dimension, giving the guided-update channel relation M* as:
K_m, Q_m, V_m = X, M, k_x
M* = M_x + FC((S_q·Q_m + S_k·K_m) ⊙ V_m).
Further, the image enhancement model based on the U-Net network consists of three parts: an encoder part, a feature transformation part and a decoder part. Feature extraction blocks are deployed from the first layer to the third layer of the image enhancement U-Net network, i.e. different blocks are used at different layers to extract the corresponding features, and the third layer uses a detail enhancement attention block (DEAB) to capture more detail and edge features.
The detail enhancement attention block (DEAB) comprises a detail-focused convolution block and a content-guided attention block. The detail-focused convolution block uses difference convolutions to integrate prior information that supplements the convolution layers of the parallel processing branch, enhancing the representation capability.
The content-guided attention block uses dynamic fusion: by assigning a unique spatial importance map to each channel, it extracts more of the useful information encoded in the features. It fuses the low-level features from the encoder part with the corresponding high-level features from the decoder part and modulates them with learned spatial weights, so that the encoder low-level features and the corresponding decoder high-level features are fused adaptively; input features are also added through a skip connection to alleviate the vanishing-gradient problem and simplify the learning process, and the fused features are mapped by a 3×3 convolution layer to obtain the final sharpened result.
Further, two downsampling and two upsampling operations are used between the different layers to keep the dimensions consistent. The downsampling operation halves the spatial dimensions and doubles the number of channels; it is implemented by a convolution layer with the stride set to 2 and the number of output channels set to twice the number of input channels. The upsampling operation is the inverse of the downsampling operation and is implemented by a deconvolution layer. The sizes of the first, second and third layers are C×H×W, 2C×(H/2)×(W/2) and 4C×(H/4)×(W/4), respectively.
Compared with the prior art, the invention has the following advantages:
The invention provides an underwater robot vision sharpening method based on a multimodal fusion network. The method effectively avoids the imaging defects of existing underwater image sharpening methods under high turbidity, removes turbidity while improving the quality of the underwater image, and is effective and robust, laying a theoretical and technical foundation for subsequent vision tasks such as seabed panoramic observation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort to a person skilled in the art.
Fig. 1 is a schematic flow chart of the underwater robot vision sharpening method based on a multimodal fusion network in an embodiment of the invention.
Fig. 2 is a network architecture diagram of the underwater robot vision sharpening model based on the multimodal fusion network in an embodiment of the invention.
Fig. 3 shows a turbid image, a polarization angle image and a degree-of-polarization image obtained by the underwater robot vision sharpening network based on the multimodal fusion network in an embodiment of the invention.
Fig. 4 shows a clear underwater image output by the underwater robot vision sharpening network based on the multimodal fusion network in an embodiment of the invention.
Fig. 5 is a table comparing the underwater robot vision sharpening network based on the multimodal fusion network with other existing networks on five common underwater image quality evaluation indices in an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms in the description of the present invention and the claims and the above-described drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As shown in Fig. 1, the invention provides an underwater robot vision sharpening method based on a multimodal fusion network, which mainly comprises the following steps:
S1, constructing the underwater polarization image dataset. The training dataset comprises underwater polarization images at four different angles, a degree-of-polarization image and an angle-of-polarization image. Specifically, water bodies of different colors and different turbidity levels are prepared in an artificially constructed indoor water scene; turbid underwater images of objects in these water bodies are collected, and clear underwater images of the same objects in purified water are collected and used as label images.
For example, an indoor turbid underwater polarization image acquisition platform is built, comprising a glass water tank, a polarization camera, a computer, an illumination system and a camera tripod. The glass tank measures 150 cm × 35 cm × 50 cm, and the illumination system consists of three lamps of different colors: blue, green and white. The turbid underwater images are acquired as follows:
First, object images are acquired in water bodies of different turbidity levels under the different color scenes: for each group, object images (i.e. turbid underwater images) are acquired at 5 turbidity levels in 3 color scenes, and clear label images are acquired as well. During shooting, the polarization camera and the objects in the tank are physically fixed so that their relative positions do not change. Second, a variety of objects such as corals, starfish, conches and shells are prepared and divided into two groups during data collection: in one group, sand and gravel are laid on the bottom of the tank to simulate the seabed or a riverbed and the objects are fixed on this bottom, giving underwater images of a complex scene; in the other group, the objects are fixed on a pure-white background board, giving underwater images of a simple scene. In total, 1600 turbid underwater images and their corresponding label images are acquired with the platform, 800 each for the simple and complex scenes, at a resolution of 1024 × 1224.
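For illustration, the paired turbid/label images described above can be organized as a standard paired-image dataset. The sketch below is a minimal Python/PyTorch example; the directory layout, file naming and the PairedUnderwaterDataset class are assumptions made for this sketch and are not specified in the patent.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class PairedUnderwaterDataset(Dataset):
    """Pairs each turbid capture with its clear label image.

    Assumed (hypothetical) layout: <root>/turbid/NNNN.png and
    <root>/labels/NNNN.png share the same file name per scene.
    """
    def __init__(self, root, transform=None):
        self.turbid_dir = os.path.join(root, "turbid")
        self.label_dir = os.path.join(root, "labels")
        self.names = sorted(os.listdir(self.turbid_dir))
        self.transform = transform

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        turbid = Image.open(os.path.join(self.turbid_dir, name)).convert("RGB")
        label = Image.open(os.path.join(self.label_dir, name)).convert("RGB")
        if self.transform is not None:
            turbid, label = self.transform(turbid), self.transform(label)
        return turbid, label
```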
S2, constructing the underwater robot vision sharpening model based on the multimodal fusion network, comprising the multimodal fusion network and the image enhancement U-Net network. The model has an end-to-end architecture and generates feature maps of different sizes at each level, so that the network can capture features at different scales.
Before the fusion network is applied, the turbid underwater polarization images in the training set are preprocessed to obtain RGB modal information and polarization modal information. Through division-of-focal-plane splitting, the polarization camera measures the light intensity I_pol transmitted through a linear polarizer at polarization angle φ_pol, calculated as:
I_pol = I_un · (1 + ρ · cos(2φ − 2φ_pol))
S_0 = I_0° + I_90° = I_45° + I_135°
S_1 = I_0° − I_90°
S_2 = I_45° − I_135°
where I_un is the total incident light entering the camera (generally unpolarized light), ρ is the degree of linear polarization, φ is the linear polarization angle, and S_0, S_1, S_2 are the Stokes parameters; S_0 is also used as the RGB modal information. The polarization modal information comprises DoLP and AoLP, where DoLP denotes the degree-of-polarization information and AoLP the angle-of-polarization information. I_0°, I_45°, I_90° and I_135° are the images captured by the polarization camera for light in the four linear polarization states at 0°, 45°, 90° and 135°. The resulting RGB modal information and polarization modal information are then used as the inputs of the multimodal fusion network.
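The preprocessing above amounts to a few per-pixel array operations. The following NumPy sketch computes the Stokes parameters, DoLP and AoLP from the four polarizer-angle captures; it assumes registered floating-point images and uses the standard definitions DoLP = sqrt(S_1² + S_2²)/S_0 and AoLP = ½·arctan2(S_2, S_1), which the text above states only in words.

```python
import numpy as np

def polarization_modalities(i0, i45, i90, i135, eps=1e-6):
    """Stokes parameters, degree and angle of linear polarization.

    i0, i45, i90, i135: float arrays captured behind linear polarizers
    at 0°, 45°, 90° and 135° (assumed registered and demosaicked).
    """
    s0 = i0 + i90                                   # total intensity, also the RGB-mode input
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1**2 + s2**2) / (s0 + eps)      # degree of linear polarization
    aolp = 0.5 * np.arctan2(s2, s1)                 # angle of linear polarization
    return s0, dolp, aolp
```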
S3, the multimodal fusion network comprises a feature fusion module and a polarization-guided fusion module. The multimodal fusion network is trained on the underwater polarization image dataset; during training, the feature fusion module and the polarization-guided fusion module update the RGB modal information and the polarization modal information and generate the fused features.
S301, a feature fusion module robustly fuses the DoLP and AoLP features from the polarization-mode input domain using global and local information. Two spatial attention maps are generated from the two token embedding sequences provided by two Conformers for the two input features DoLP and AoLP, and the extracted convolutional features are then weighted by these attention maps and fused together:
M_φ, M_ρ = softmax(Ω(T_φ), Ω(T_ρ))
where C and T are the convolutional features and the token embeddings produced by the conv and trans branches of the Conformer, respectively, ⊙ denotes element-wise multiplication (used to weight the convolutional features C with the attention maps), M_φ and M_ρ are the attention maps generated for φ (AoLP) and ρ (DoLP), and Ω is a function that first reduces each token embedding to dimension 1 through a fully connected layer and then reshapes the resulting embeddings into a two-dimensional map. Features extracted from Conformers at different layers are used, together with the DoLP and AoLP, to capture more detail and edge information and to enhance the contrast and texture detail of different regions of the image, so that the true colors of the image are restored more accurately.
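A minimal sketch of this weighting-and-fusion step is given below in PyTorch. The token sequence is assumed to tile the spatial grid (no class token), and the final combination of the two weighted feature maps is taken to be a simple sum, since the exact fusion formula is not reproduced in the text; both choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PolarizationFeatureFusion(nn.Module):
    """Sketch of the feature fusion module: token sequences from two
    Conformer branches (DoLP and AoLP) are turned into spatial attention
    maps that weight the corresponding convolutional features."""
    def __init__(self, embed_dim):
        super().__init__()
        self.omega_d = nn.Linear(embed_dim, 1)   # Ω for DoLP tokens
        self.omega_a = nn.Linear(embed_dim, 1)   # Ω for AoLP tokens

    def forward(self, c_d, c_a, t_d, t_a):
        # c_*: conv features (B, C, H, W); t_*: token embeddings (B, H*W, D)
        b, c, h, w = c_d.shape
        m_d = self.omega_d(t_d).reshape(b, 1, h, w)   # each token reduced to one scalar
        m_a = self.omega_a(t_a).reshape(b, 1, h, w)
        m = torch.softmax(torch.cat([m_d, m_a], dim=1), dim=1)  # per-pixel competition
        m_d, m_a = m[:, :1], m[:, 1:]
        # weighted fusion of the two convolutional feature maps (assumed: simple sum)
        return m_d * c_d + m_a * c_a                  # polarization-mode feature M
```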
S302, because the polarization mode and the RGB mode deviate from each other considerably, the importance of the cues gathered from the RGB mode and the polarization mode is scene-dependent; simply concatenating them can dilute strong cues with weak signals and even amplify the adverse effect of mixed cues. To handle this modality bias, a polarization-guided fusion module is designed. The RGB-mode input feature X is enhanced by an attention operation and updated under the guidance of the polarization-mode input feature M to produce the fused feature X*. The polarization-mode feature and the RGB-mode feature are concatenated and projected by a multilayer perceptron to produce the key (k_x), query (q_x) and value (v_x) as learnable parameters:
[X*] = FC(softmax(FC([q_x; k_x])) ⊙ v_x)
where ⊙ denotes element-wise multiplication and FC denotes a fully connected layer with filtering.
The embedding height H and width W of the query and the key are reduced along the spatial dimension and the channel statistics S_q, S_k of the query and the key are learned, giving the channel relation that guides the update:
K_m, Q_m, V_m = X, M, k_x
M* = M_x + FC((S_q·Q_m + S_k·K_m) ⊙ V_m)
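The following sketch turns the two formulas above into runnable PyTorch code. Channel-last tensors of shape (B, N, C), mean pooling over space for the channel statistics S_q and S_k, and plain linear layers standing in for the "FC with filtering" are all assumptions; the sketch illustrates the structure of the module rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PolarizationGuidedFusion(nn.Module):
    """Sketch of the polarization-guided fusion module:
    X* = FC(softmax(FC([q_x; k_x])) ⊙ v_x) and
    M* = M + FC((S_q·Q_m + S_k·K_m) ⊙ V_m) with K_m, Q_m, V_m = X, M, k_x."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(2 * dim, 3 * dim)   # MLP on the concatenated [M; X]
        self.fc_attn = nn.Linear(2 * dim, dim)      # inner FC before the softmax
        self.fc_x = nn.Linear(dim, dim)             # outer FC producing X*
        self.fc_m = nn.Linear(dim, dim)             # FC producing the update of M

    def forward(self, x, m):
        # x: RGB-mode feature, m: polarization-mode feature, both (B, N, C)
        q, k, v = self.to_qkv(torch.cat([m, x], dim=-1)).chunk(3, dim=-1)
        attn = torch.softmax(self.fc_attn(torch.cat([q, k], dim=-1)), dim=-1)
        x_star = self.fc_x(attn * v)                     # fused feature X*
        s_q = q.mean(dim=1, keepdim=True)                # channel statistics over space (assumed: mean)
        s_k = k.mean(dim=1, keepdim=True)
        m_star = m + self.fc_m((s_q * m + s_k * x) * k)  # guided channel update M*
        return x_star, m_star
```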
S4, the enhancement network is trained with the underwater polarization image dataset. The output of the multimodal fusion network, i.e. the turbid fused feature to be processed, is fed into the image enhancement network based on the U-Net, which produces the clear underwater image. The image enhancement network based on the U-Net consists of three parts: an encoder part, a feature transformation part and a decoder part.
For the turbid fused features to be processed, the goal of the U-Net is to restore the corresponding sharp image. For a detail-sensitive task such as turbidity removal, however, performing feature transformation only in a low-resolution space leads to information loss. Feature extraction blocks are therefore deployed from the first layer to the third layer of the U-Net, with different blocks at different layers: the first and second layers use conventional feature extraction blocks (DEBs), while the third layer uses detail enhancement attention blocks (DEABs) to capture more detail and edge features. The U-Net uses two downsampling and two upsampling operations. The downsampling operation halves the spatial dimensions and doubles the number of channels and is implemented by an ordinary convolution layer with the stride set to 2 and the number of output channels set to twice the number of input channels; the upsampling operation can be regarded as the inverse of the downsampling operation and is implemented by a deconvolution layer. The sizes of the first, second and third layers are C×H×W, 2C×(H/2)×(W/2) and 4C×(H/4)×(W/4), respectively.
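The down- and upsampling rules stated above map directly onto strided convolution and transposed convolution layers. A minimal PyTorch sketch follows; the kernel sizes are an assumption.

```python
import torch.nn as nn

def downsample(channels):
    # halve the spatial size, double the channel count (stride-2 convolution)
    return nn.Conv2d(channels, channels * 2, kernel_size=3, stride=2, padding=1)

def upsample(channels):
    # inverse of the downsampling step: double spatial size, halve the channels
    return nn.ConvTranspose2d(channels, channels // 2, kernel_size=4, stride=2, padding=1)
```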
Further, the detail enhancement attention block consists of a detail focusing convolution block and a content guiding attention block, which are used for enhancing feature learning so as to improve the haze removal performance.
The detail-focused convolution block uses difference convolutions to integrate prior information that supplements the ordinary convolution and enhances the representation capability; by re-parameterization, it can be equivalently converted into an ordinary convolution, reducing parameters and computational cost.
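As an illustration of the idea, the sketch below pairs an ordinary 3×3 convolution with a single difference-convolution branch whose kernel is re-centred to sum to zero, so that it responds to edges rather than absolute intensity. The exact set of difference convolutions used in the patent is not specified, so this single branch is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class DetailFocusedConv(nn.Module):
    """Sketch of a detail-focused convolution: plain conv + difference conv."""
    def __init__(self, channels):
        super().__init__()
        self.vanilla = nn.Conv2d(channels, channels, 3, padding=1)
        self.diff = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        w = self.diff.weight
        w_cd = w - w.mean(dim=(2, 3), keepdim=True)       # kernel elements sum to zero per window
        diff_out = F.conv2d(x, w_cd, self.diff.bias, padding=1)
        # both branches are 3x3 convs on the same input, so at inference their
        # kernels and biases can be summed into one conv (re-parameterization)
        return self.vanilla(x) + diff_out
```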
The content-guided attention block consists of channel attention and spatial attention, which compute attention weights along the channel and spatial dimensions, respectively. Channel attention computes a channel vector W_c to recalibrate the features, while spatial attention computes a spatial importance map W_s to adaptively indicate the informative regions. Because the content-guided attention block treats different channels and pixels non-uniformly, it improves the dehazing performance.
Here max(0, x) denotes the ReLU activation function, Conv_k×k denotes a convolution layer with a k×k kernel, and [·] denotes channel concatenation. The pooled inputs are, respectively, the feature after global average pooling over the spatial dimensions, the feature after global average pooling over the channel dimension, and the feature after global max pooling over the channel dimension. To reduce the number of parameters and limit the model's complexity, the first 1×1 convolution reduces the channel dimension from C to a smaller intermediate dimension, and the second 1×1 convolution expands it back to C.
W_coa = W_c + W_s
The content-guided attention block obtains a dedicated spatial importance map for each individual channel of the input features in a coarse-to-fine manner, while fully mixing the channel attention weights and spatial attention weights to ensure information interaction. Following the broadcasting rule, W_c and W_s are fused by a simple addition to give a coarse spatial importance map W_coa; since W_c is channel-wise, W_coa has the same channels as X. To obtain the final refined spatial importance map W, each channel of W_coa is adjusted according to the corresponding input feature, i.e. the content of the input features is used as guidance to generate the final channel-specific spatial importance map W. Specifically, the channels of W_coa and X are rearranged in an alternating fashion by a channel shuffle operation; combined with the subsequent group convolution layer, this greatly reduces the number of parameters.
Here σ denotes the sigmoid operation, CS(·) denotes the channel shuffle operation and GConv_k×k denotes a group convolution layer with a k×k kernel, with the number of groups set to C in the implementation. The content-guided attention mechanism assigns each channel its own spatial importance map and guides the model to focus on the important regions of each channel, so that more of the useful information encoded in the features is emphasized, effectively improving the dehazing performance.
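A hedged sketch of such a content-guided attention block is given below. The reduction ratio, kernel sizes, pooling choices and the exact shuffle pattern are assumptions; the block follows the structure described above (channel attention W_c, spatial attention W_s, coarse map W_coa, channel shuffle, group convolution, sigmoid).

```python
import torch
import torch.nn as nn

class ContentGuidedAttention(nn.Module):
    """Sketch: per-channel spatial importance map W from W_c, W_s and the input X."""
    def __init__(self, channels, reduction=4, kernel_size=7):
        super().__init__()
        self.channel_att = nn.Sequential(                 # W_c: (B, C, 1, 1)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial_att = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)  # W_s: (B, 1, H, W)
        self.refine = nn.Conv2d(2 * channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)          # groups set to C
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        w_c = self.channel_att(x)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        w_s = self.spatial_att(pooled)
        w_coa = w_c + w_s                                  # broadcast to (B, C, H, W)
        # channel shuffle: interleave each channel of W_coa with the same channel of X
        b, c, h, w = x.shape
        mixed = torch.stack([w_coa, x], dim=2).reshape(b, 2 * c, h, w)
        return self.sigmoid(self.refine(mixed))            # per-channel spatial importance map W
```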
Second, the dynamic fusion scheme based on the content-guided attention block effectively fuses features and helps gradient flow: in an encoder-decoder-like architecture, the features after the downsampling operations are fused with the corresponding features before the upsampling operations. Fusing the feature F_low from the encoder part with the feature F_high from the decoder part to obtain F_fuse is an effective technique in dehazing and other low-level vision tasks. Low-level features (such as edges and contours) have a non-negligible effect on restoring a sharp image, but gradually lose their influence after passing through many intermediate layers; feature fusion strengthens the information flow from shallow to deep layers and benefits feature preservation and gradient back-propagation. In the dynamic fusion scheme the features are modulated by learned spatial weights, so that the low-level features of the encoder part are fused adaptively with the corresponding high-level features, and the spatial weights used for this modulation are computed by the content-guided attention mechanism. The encoder low-level features and the corresponding high-level features are fed to the content-guided attention mechanism to compute the weights and are then combined by weighted summation; the input features are added through a skip connection to alleviate the vanishing-gradient problem and simplify the learning process. Finally, the fused features are mapped by a 3×3 convolution layer to obtain the final sharpened result:
F_fuse = Conv_1×1(F_low · W + F_high · (1 − W) + F_low + F_high)
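The fusion formula above can be sketched as follows, reusing the ContentGuidedAttention class from the previous sketch. Computing the weight W from the sum of the two features is an assumption, since the text only says the feature pair is fed to the content-guided attention mechanism.

```python
import torch.nn as nn

class CGAFusion(nn.Module):
    """Sketch of the content-guided dynamic fusion:
    F_fuse = Conv1x1(F_low*W + F_high*(1-W) + F_low + F_high)."""
    def __init__(self, channels):
        super().__init__()
        self.cga = ContentGuidedAttention(channels)   # class from the previous sketch
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, f_low, f_high):
        w = self.cga(f_low + f_high)                  # spatial weights in [0, 1] (assumed input: sum)
        fused = f_low * w + f_high * (1 - w) + f_low + f_high
        return self.proj(fused)
```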
It should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that the technical solution described in the above embodiments may be modified or some or all of the technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the scope of the technical solution of the embodiments of the present invention.

Claims (8)

1. An underwater robot vision sharpening method based on a multimodal fusion network, characterized by comprising the following steps:
S1, collecting turbid underwater images and corresponding clear underwater images, and constructing an underwater polarization image dataset, wherein the underwater polarization image dataset comprises underwater polarization images at different angles, degree-of-polarization images and angle-of-polarization images;
S2, constructing an underwater robot vision sharpening model based on the multimodal fusion network, comprising the multimodal fusion network and an image enhancement U-Net network;
S3, training the multimodal fusion network on the underwater polarization image dataset, wherein pixel-level multi-scale fusion is used during training to update the RGB information and the polarization information and to generate fused features;
S4, training the image enhancement U-Net network with the underwater polarization image dataset to obtain an image enhancement model based on the U-Net network;
S5, obtaining the turbid fused features output by the multimodal fusion network for the image to be processed and feeding them into the image enhancement model based on the U-Net network, thereby obtaining a clear underwater image.

2. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 1, characterized in that the underwater polarization image dataset in step S1 is constructed as follows: water bodies of different colors and different turbidity levels are prepared in a water scene; a polarization camera is used to collect turbid underwater images of objects in the water bodies of different colors and turbidity levels; clear underwater images of the same objects in purified water are collected and used as label images; and a training set and a test set are constructed from the turbid underwater images and the label images.

3. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 1, characterized in that in step S2 the network architecture of the underwater robot vision sharpening model based on the multimodal fusion network is end-to-end and generates feature maps of different sizes at each level, so that the network can capture features at different scales; before fusion, the turbid underwater polarization images in the training set are preprocessed to obtain RGB modal information and polarization modal information, the polarization modal information comprising the degree-of-polarization information DoLP and the angle-of-polarization information AoLP; and the obtained RGB modal information and polarization modal information are used as the input of the multimodal fusion network.

4. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 1, characterized in that the multimodal fusion network comprises two modules, a feature fusion module and a polarization-guided fusion module, wherein:
the feature fusion module robustly fuses the DoLP and AoLP features from the polarization-mode input domain by exploiting global and local information; two spatial attention maps are generated from the two token embedding sequences provided by two Conformers for the two input features DoLP and AoLP, and the extracted convolutional features are then weighted by the spatial attention maps and fused to obtain the polarization-mode feature;
the polarization-guided fusion module handles the modality bias: the RGB-mode input feature X is enhanced by an attention operation and updated under the guidance of the polarization-mode feature M to produce the fused feature X*; the polarization-mode feature and the RGB-mode feature are concatenated and projected by a multilayer perceptron to produce the key (k_x), query (q_x) and value (v_x); the embedding height H and width W of the query and the key are reduced along the spatial dimension and the channel statistics S_q, S_k of the query and the key are learned, giving the guided-update channel relation M*.

5. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 4, characterized in that the polarization-mode feature and the RGB-mode feature are concatenated and projected by a multilayer perceptron to produce the key (k_x), query (q_x) and value (v_x) as learnable parameters:
[X*] = FC(softmax(FC([q_x; k_x])) ⊙ v_x)
where ⊙ denotes element-wise multiplication and FC denotes a fully connected layer with filtering.

6. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 5, characterized in that the channel statistics S_q, S_k of the query and the key are learned by reducing the embedding height H and width W of the query and the key along the spatial dimension, giving the guided-update channel relation M* as:
K_m, Q_m, V_m = X, M, k_x
M* = M_x + FC((S_q·Q_m + S_k·K_m) ⊙ V_m).

7. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 1, characterized in that the image enhancement model based on the U-Net network consists of three parts: an encoder part, a feature transformation part and a decoder part; feature extraction blocks are deployed from the first layer to the third layer of the image enhancement U-Net network, i.e. different blocks are used at different layers to extract the corresponding features, and the third layer uses a detail enhancement attention block DEAB to capture more detail and edge features;
the detail enhancement attention block DEAB comprises a detail-focused convolution block and a content-guided attention block; the detail-focused convolution block uses difference convolutions to integrate prior information that supplements the convolution layers of the parallel processing branch, enhancing the representation capability; by re-parameterization, the detail-focused convolution is equivalently converted into an ordinary convolution operation, reducing parameters and computational cost;
the content-guided attention block uses dynamic fusion and obtains more of the useful information encoded in the features by assigning a unique spatial importance map to each channel: the low-level features from the encoder part are fused with the high-level features from the decoder part and modulated by learned spatial weights, so that the encoder low-level features and the corresponding decoder high-level features are fused adaptively; the input features are also added through a skip connection to alleviate the vanishing-gradient problem and simplify the learning process; and the fused features are mapped by a 3×3 convolution layer to obtain the final sharpened result.

8. The underwater robot vision sharpening method based on a multimodal fusion network according to claim 7, characterized in that two downsampling and two upsampling operations are used between the different layers to keep the dimensions consistent; the downsampling operation halves the spatial dimensions and doubles the number of channels and is implemented by a convolution layer with the stride set to 2 and the number of output channels set to twice the number of input channels; the upsampling operation is regarded as the inverse of the downsampling operation and is implemented by a deconvolution layer; and the sizes of the first, second and third layers are C×H×W, 2C×(H/2)×(W/2) and 4C×(H/4)×(W/4), respectively.
CN202411488920.2A 2024-10-24 2024-10-24 A method for underwater robot vision clarity based on multimodal fusion network Active CN119478648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411488920.2A CN119478648B (en) 2024-10-24 2024-10-24 A method for underwater robot vision clarity based on multimodal fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411488920.2A CN119478648B (en) 2024-10-24 2024-10-24 A method for underwater robot vision clarity based on multimodal fusion network

Publications (2)

Publication Number Publication Date
CN119478648A CN119478648A (en) 2025-02-18
CN119478648B true CN119478648B (en) 2025-07-18

Family

ID=94596171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411488920.2A Active CN119478648B (en) 2024-10-24 2024-10-24 A method for underwater robot vision clarity based on multimodal fusion network

Country Status (1)

Country Link
CN (1) CN119478648B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120355594A (en) * 2025-06-24 2025-07-22 苏州城市学院 Sparse aperture optical system polarization image fusion method based on deep learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740515A (en) * 2023-05-19 2023-09-12 中北大学 CNN-based intensity image and polarization image fusion enhancement method
CN117048814A (en) * 2023-09-15 2023-11-14 南通奇致智能科技有限公司 High-flexibility underwater intelligent robot

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7420675B2 (en) * 2003-06-25 2008-09-02 The University Of Akron Multi-wavelength imaging system
CN114549548B (en) * 2022-01-28 2024-09-13 大连理工大学 Glass image segmentation method based on polarization clues
CN117291832A (en) * 2023-08-25 2023-12-26 天津市天开海洋科技有限公司 Underwater polarized image polarization information restoration method based on deep neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740515A (en) * 2023-05-19 2023-09-12 中北大学 CNN-based intensity image and polarization image fusion enhancement method
CN117048814A (en) * 2023-09-15 2023-11-14 南通奇致智能科技有限公司 High-flexibility underwater intelligent robot

Also Published As

Publication number Publication date
CN119478648A (en) 2025-02-18

Similar Documents

Publication Publication Date Title
Ikoma et al. Depth from defocus with learned optics for imaging and occlusion-aware depth estimation
CN101422035B (en) Light source estimation device, light source estimation system, and light source estimation method, and image high resolution device and image high resolution method
CN119478648B (en) A method for underwater robot vision clarity based on multimodal fusion network
CN106997581A (en) A kind of method that utilization deep learning rebuilds high spectrum image
Agrafiotis et al. Underwater photogrammetry in very shallow waters: main challenges and caustics effect removal
Singh et al. Low-light image enhancement for UAVs with multi-feature fusion deep neural networks
CN113160053B (en) An underwater video image restoration and stitching method based on pose information
CN113160085B (en) A method for collecting water splash occlusion image dataset based on generative adversarial network
CN115035010A (en) Underwater image enhancement method based on convolutional network guided model mapping
CN112906675A (en) Unsupervised human body key point detection method and system in fixed scene
CN118469842B (en) A remote sensing image dehazing method based on generative adversarial network
CN113592755B (en) Image reflection elimination method based on panoramic camera
CN119784943A (en) An underwater 3D measurement method based on the fusion of vision and line structured light
CN112950481A (en) Water bloom shielding image data collection method based on image mosaic network
CN117094895B (en) Image panorama stitching method and system
Vijayalakshmi et al. Variants of generative adversarial networks for underwater image enhancement
CN115439376B (en) Compound eye camera multi-focal-length image fusion model, method and device
Huang et al. AFNet: Asymmetric fusion network for monocular panorama depth estimation
CN117115038A (en) An image glare removal system and method based on glare degree estimation
Xu et al. Real-time panoramic map modeling method based on multisource image fusion and three-dimensional rendering
CN115471397A (en) Multimodal Image Registration Method Based on Disparity Estimation
CN115034974A (en) Method, device and storage medium for natural color restoration of visible light and infrared fusion images
Tandekar et al. Underwater Image Enhancement through Deep Learning and Advanced Convolutional Encoders
Li et al. Context convolution dehazing network with channel attention
CN107038706A (en) Infrared image confidence level estimation device and method based on adaptive mesh

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant