
CN116127408A - Cross-modal enhancement-based multi-modal self-adaptive fusion method and system - Google Patents


Info

Publication number
CN116127408A
Authority
CN
China
Prior art keywords
modality
modal
weight
mode
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310162234.5A
Other languages
Chinese (zh)
Inventor
李成龙
殷策
刘磊
汤进
毛军军
吴涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202310162234.5A priority Critical patent/CN116127408A/en
Publication of CN116127408A publication Critical patent/CN116127408A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a multi-modal adaptive fusion method and system based on cross-modal enhancement, wherein the method comprises the following steps: N binary classifiers are used to train a learning-based network as a classification-based weight prediction module that predicts the reliability weights of the different modalities in place of the original regressor, each binary classifier consisting of a global average pooling layer and two fully connected layers with ReLU and Sigmoid activation functions, respectively; the modal features extracted from the backbone network are taken as input and then modulated through two parallel structures composed of 1×1 convolution layers, which respectively apply scaling and displacement to the features and guide the other modality to learn more discriminative features; and three branch networks, namely an A-modality branch, a B-modality branch and an A-B mixed-modality branch, are trained offline on the training data of the two modalities. The method solves the technical problems of poor fusion performance, low accuracy of weight learning during fusion, and unreliable estimation of multi-modal feature adaptive fusion weights.

Description

Cross-modal enhancement-based multi-modal self-adaptive fusion method and system
Technical Field
The invention relates to the technical field of multi-modal fusion in deep learning, and in particular to a multi-modal adaptive fusion method and system based on cross-modal enhancement.
Background
In the data field, multi-modal data refers to data obtained for the same described object through different descriptive domains or different modes of description; each such domain or mode is called a modality. Multi-modal data fusion means combining the characteristics of data from different modalities with existing computational methods so that the advantages of each are exploited: the data of the different modalities are processed jointly, and the information of each modality is fused to perform a target prediction task.
In recent years, deep learning has been applied to data fusion; it can process data from multiple sources and is characterized by high accuracy and strong real-time performance, and deep-learning-based multi-modal fusion methods have been applied in many fields owing to their strong feature extraction and expression capability. In the direction of fluid physics, Li Yang et al., "An Iterative Neural Operator to Predict the Thermo-Fluid Information in Internal Cooling Channels" [J], Journal of Turbomachinery, 2022, vol. 144, predict the thermo-fluid information of internal cooling channels with a deep-learning-based multi-modal fusion method: the velocity, hidden state and boundary-condition indicator variables of the thermo-fluid at the current moment are fused as input to predict the velocity and hidden state at the next moment; combining the deep-learning-based multi-modal fusion method with the original system of partial differential equations reduces the root-mean-square prediction error of temperature and pressure by about 20%. Luca Guastoni et al., "Convolutional-network models to predict wall-bounded turbulence from wall quantities" [J], Journal of Fluid Mechanics, 2021, vol. 928, adaptively fuse, through deep learning, the wall shear-stress field and the wall pressure field sampled from DNS as inputs to predict the streamwise, wall-normal and spanwise velocity fluctuations; the produced results agree closely with the provided reference data. In the direction of image restoration, Jingai Xu et al., "Deep Fully Interpretable Network for Multi-Modal Image Restoration" [C], International Conference on Image Processing, 2021, introduce a cyclic scheme to extract common and unique features, ensuring the unification of deep-learning-based multi-modal adaptive fusion features, and experimental results on two multi-modal image restoration tasks verify the significance of the proposed method. In the direction of image super-resolution, Deng X. et al., "Deep coupled feedback network for joint exposure fusion and image super-resolution" [J], IEEE Transactions on Image Processing, 2021, vol. 30, introduce a coupled structure into a feedback mechanism so that the coupled feedback module can simultaneously acquire and fuse the features of overexposed and underexposed images, promoting the effect of the super-resolution task.
Although multi-modal fusion based on deep neural networks has succeeded in all of the various tasks described above, and its effectiveness and versatility are well documented by many state-of-the-art works and wide practical application across industries, there is often an underlying requirement to learn the fusion weights from the data of each modality, and accurately measuring the "representativeness" of the features of different modality data is itself a challenging problem. For the fusion of complex multi-modal data, existing practice usually introduces an attention mechanism to generate reliability weights for the adaptive fusion of multi-modal features and to suppress the weak features of the different modalities. For example, the prior invention patent document with publication number CN110363707A discloses a multi-view three-dimensional point cloud stitching method based on virtual features of constraints, which comprises the following steps: step one, optimizing the chain-shaped multi-view stitching strategy into a ring-shaped multi-view stitching strategy, thereby forming a stitching ring; step two, introducing constraint objects into the measurement scene, and obtaining the measured-object point clouds and the local constraint-object point clouds of each field of view of the stitching ring through the measurement equipment; step three, supplementing the missing point cloud of the constraint-object surface by least-squares fitting according to the local constraint-object point cloud in each field of view, thereby reconstructing the complete virtual surface point cloud of the constraint object; step four, registering the local features of the corresponding constraints in adjacent fields of view by singular value decomposition or the quaternion method to realize coarse point-cloud stitching of adjacent fields of view of the stitching ring; step five, constructing a virtual overlapping area of adjacent fields of view of the stitching ring from the measured constraint-object point cloud and the virtual surface point cloud of the constraint object, and calculating the corresponding point pairs of the measured-object point cloud and the measured constraint-object point cloud from the virtual overlapping area; step six, constructing a weight-factor radiation model according to the spatial distribution of the measured-object point cloud and the constraint-object point cloud, and calculating the weight factors of the corresponding point pairs of the object point clouds in adjacent fields of view; step seven, constructing a full-scene weighted data fusion model from the measured-object point cloud and the virtual surface point cloud of the constraint object, taking the sum of the squares of the products of the projection distances and the weights of all corresponding point pairs in adjacent fields of view as the objective function, minimizing the objective function, and iteratively optimizing to obtain the transformation matrix of each pair of adjacent fields of view, completing fine point-cloud stitching of each pair of adjacent fields of view of the stitching ring; step eight, indirectly obtaining the transformation matrix between the first and last point clouds of the stitching ring from the transformation matrices of the adjacent fields of view through matrix operations, the transformation parameters carrying accumulated errors; and step nine, according to the accumulated errors of the transformation parameters calculated from the transformation matrices, re-optimizing the stitching of each field of view in the stitching ring, giving priority to the transformation parameters with large accumulated errors, thereby completing the multi-view three-dimensional point cloud stitching. However, there is no explicit supervision for generating these weights, so in complex scenes the fusion effect still leaves room for improvement.
As another example, the prior invention patent document with publication number CN102844766A, "multi-feature fusion identity recognition method based on human eye images", includes registration and recognition. Registration comprises: for a given registered human eye image, obtaining a normalized human eye image and a normalized iris image; extracting multi-modal features from the human eye image of the user to be registered, and storing the obtained multi-modal features of the human eye image as registration information in a registration database. Recognition comprises: for a given human eye image to be identified, obtaining a normalized human eye image and a normalized iris image; extracting multi-modal features from the human eye image of the user to be identified; comparing the extracted multi-modal features with the multi-modal features in the database to obtain comparison scores, and obtaining a fusion score through score-level fusion; and carrying out multi-feature fusion identity recognition on the human eye image with a classifier. However, in the prior art, the reliability of the data of different modalities generated by the same object under different sensors differs, and in complex scenes the estimation of the adaptive fusion weights of multi-modal features is often unreliable. Both the weighted residual error guiding module and the classification-based weight prediction module require quality labels for training, but existing two-modality datasets are not annotated with quality labels, and manually acquiring quality labels from the data is laborious and difficult.
In summary, the prior art suffers from the technical problems of poor fusion performance, low accuracy of the weights learned during fusion, and unreliable estimation of multi-modal feature adaptive fusion weights.
Disclosure of Invention
The invention solves the technical problems in the prior art of poor fusion performance, low accuracy of weight learning during fusion, and unreliable estimation of multi-modal feature adaptive fusion weights.
The invention adopts the following technical scheme to solve the above technical problems: a multi-modal adaptive fusion method based on cross-modal enhancement, comprising the following steps:
S1, training a preset number of binary classifiers to obtain a weight learning network, and inputting the fused features of preset distinct modalities into the weight learning network to predict and distinguish the reliability weights of the preset distinct modalities, wherein the distinct modalities comprise: a first modality and a second modality;
S2, extracting modal features from a preset backbone network, passing the modal features of the first modality through a parallel structure to modulate and obtain the relative scaling factor and relative displacement factor of the first modality with respect to the second modality, processing with the weight learning network to obtain classification-based weights, and guiding the other modality to learn more discriminative features according to the classification-based weights, thereby obtaining multi-modality training data;
and S3, performing offline training on the branch networks of the first modality, the second modality and the mixed modality according to the multi-modality training data so as to generate reliable pseudo weight labels, and estimating the reliability weights accordingly.
The invention learns the weight of each modality in a supervised manner, then performs weighted residual error guidance, and finally mines and utilizes the useful features of both modalities. Reliable pseudo weight labels are generated by a three-branch network, and a simple and efficient classification scheme is used to estimate high-accuracy reliability weights. The invention designs a weighted residual error guiding module, based on the estimated weights and residual connections, to propagate useful features between modalities while mitigating the influence of noisy modality features on the transferred information.
In a more specific technical scheme, in step S1, each binary classifier comprises a global average pooling layer and two fully connected layers, the first fully connected layer being followed by a ReLU activation function and the second by a Sigmoid activation function.
In a more specific technical solution, step S1 includes: setting the prediction probability of the i-th binary classifier to p_i, the preset distinct-modality reliability weight W is predicted using the following logic:

W = (1/N) · Σ_{i=1}^{N} p_i

wherein N is the number of binary classifiers.
The invention addresses the problem that weight labels usually contain considerable noise, making the fusion result unreliable. Through classification-based weight prediction, the method can estimate high-precision reliability weights even under label noise, improving robustness and the multi-modal fusion effect.
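As an illustration, the following is a minimal PyTorch sketch of such a classification-based weight prediction module. The channel count, the hidden width of the fully connected layers, and the name WeightPredictor are assumptions made for the example, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class WeightPredictor(nn.Module):
    def __init__(self, channels: int = 512, n_classifiers: int = 16):
        super().__init__()
        # each binary classifier: global average pooling followed by two
        # fully connected layers with ReLU and Sigmoid activations
        self.classifiers = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),           # global average pooling
                nn.Flatten(),
                nn.Linear(channels, channels // 4),
                nn.ReLU(inplace=True),
                nn.Linear(channels // 4, 1),
                nn.Sigmoid(),
            )
            for _ in range(n_classifiers)
        ])

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # W = (1/N) * sum_i p_i : average the N per-classifier probabilities
        probs = [clf(fused) for clf in self.classifiers]
        return torch.stack(probs, dim=0).mean(dim=0)  # shape (B, 1)

# toy usage: fused two-modality features of shape (batch, channels, H, W)
w_a = WeightPredictor()(torch.randn(2, 512, 7, 7))
w_b = 1.0 - w_a  # the weights of the two modalities sum to one
```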
In a more specific technical solution, step S2 includes:
s21, enabling the modal characteristics of the first modality to sequentially pass through a first 1 multiplied by 1 convolution layer, a ReLU layer, a second 1 multiplied by 1 convolution layer and a Sigmoid layer to obtain a relative scaling factor;
s22, enabling the modal characteristics of the first mode to sequentially pass through a first 1 multiplied by 1 convolution, a ReLU layer, a second 1 multiplied by 1 convolution and the ReLU layer to obtain a relative displacement factor;
s23, the modal characteristics of the first modality are sent into a weight learning network, learning weight is given to the processing, and accordingly the first modality enhancement characteristics and the second modality enhancement characteristics are obtained.
The invention addresses the fact that quality differences between modalities lead to different contributions: it explicitly uses the useful information of one modality to enhance the features of the other, taking the features of one modality as conditioning information and adaptively modulating the features of the other through an affine transformation. The complementarity of the different modalities is fully mined, the robustness of fusion is improved, and the high-quality modality is used to improve the discriminability of the low-quality modality.
In a more specific technical solution, in step S21, the relative scaling factor of the first modality with respect to the second modality is obtained through the first 1×1 convolution, the ReLU layer, the second 1×1 convolution and the Sigmoid layer using the following logic:

Scale_A = Sigmoid(conv(ReLU(conv(F_A))))

wherein Scale_* is the scaling factor, Sigmoid(·) is the Sigmoid function, conv(·) is a 1×1 convolution, ReLU(·) is the ReLU function, and F_* is the feature.
In a more specific technical scheme, in step S22, the relative displacement factor of the first modality with respect to the second modality is obtained using the following logic:

Shift_A = ReLU(conv(ReLU(conv(F_A))))

wherein Shift_* is the displacement factor.
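For illustration, a minimal PyTorch sketch of the two parallel 1×1 convolution structures of steps S21 and S22; the channel count and the module name ScaleShift are assumptions of the example:

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    def __init__(self, channels: int = 512):
        super().__init__()
        # scaling branch: 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid
        self.scale = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # displacement branch: 1x1 conv -> ReLU -> 1x1 conv -> ReLU
        self.shift = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor):
        # the two branches differ only in their last activation layer
        return self.scale(feat), self.shift(feat)

scale_a, shift_a = ScaleShift()(torch.randn(2, 512, 7, 7))
```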
In a more specific technical solution, in step S23, the classification-based weights are processed using the following logic to obtain the first modality enhancement features and the second modality enhancement features:

F̂_A = F_A + W_B · (Scale_B ⊙ F_A + Shift_B)
F̂_B = F_B + W_A · (Scale_A ⊙ F_B + Shift_A)

wherein F̂_A is the feature of modality A enhanced by the features of modality B; W_B is the weight output by the classification-based weight prediction module and satisfies W_A + W_B = 1; Scale_B is the scaling factor of modality B relative to modality A obtained in step S21; Shift_B is the displacement factor of modality B relative to modality A obtained in step S22; F_A is the original feature of modality A; and ⊙ denotes element-wise multiplication.
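A minimal sketch of the weighted residual guidance, under the assumption, reflected in the reconstructed formulas above, that the enhanced feature is the original feature plus the reliability-weighted, affine-modulated term; tensor shapes and the function name are illustrative:

```python
import torch

def weighted_residual_guidance(f_a, f_b, scale_a, shift_a, scale_b, shift_b, w_a, w_b):
    """Fuse two modality features with weighted residual guidance.

    scale_b / shift_b are produced from modality B's features (e.g. by the
    ScaleShift sketch above) and modulate modality A, and vice versa; w_a
    and w_b are the classification-based reliability weights, w_a + w_b = 1.
    """
    # residual connection keeps the original feature; the affine term is
    # down-weighted by the reliability weight of the guiding modality
    f_a_hat = f_a + w_b.view(-1, 1, 1, 1) * (scale_b * f_a + shift_b)
    f_b_hat = f_b + w_a.view(-1, 1, 1, 1) * (scale_a * f_b + shift_a)
    return f_a_hat, f_b_hat

# toy shapes: batch 2, 512 channels, 7x7 feature maps
f_a, f_b = torch.randn(2, 512, 7, 7), torch.randn(2, 512, 7, 7)
scale = torch.sigmoid(torch.randn(2, 512, 7, 7))
shift = torch.relu(torch.randn(2, 512, 7, 7))
w_a = torch.full((2, 1), 0.6)
out_a, out_b = weighted_residual_guidance(f_a, f_b, scale, shift, scale, shift, w_a, 1 - w_a)
```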
The invention learns the accurate reliability weight of each modality in a supervised manner, then performs weighted residual error guidance to mine and utilize the useful features of the two modalities, generates reliable weights, and improves the reliability of multi-modal feature adaptive fusion weight estimation.
In a more specific technical solution, step S3 includes:
s31, in a sample acquisition stage, a three-branch network is operated on different training video sequences, and fusion results of all frames are recorded;
s32, comparing the fusion results of the first branch and the second branch with real data of each frame, and respectively calculating score values;
and S33, normalizing the first branch and the second branch by using a normalization function, and processing according to the score value to obtain the reliable pseudo weight label of the current image pair.
Aiming at the problem of difficult acquisition of the quality label, the invention designs a three-branch network to generate a reliable pseudo quality label, thereby improving the accuracy of weight learning during fusion.
In a more specific technical scheme, in step S33, the first branch and the second branch are normalized using the following logic to obtain the reliable pseudo weight label:

L = score(A, GT) / (score(A, GT) + score(B, GT))

wherein A is the output of the A branch, B is the output of the B branch, GT is the true value, score(·) may be chosen as the IOU, and L is the result of the calculation, representing the label.
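A minimal sketch of this pseudo weight label computation, assuming (as in the reconstructed formula above) that the normalization function is a simple ratio of the two branch scores; the function name is illustrative:

```python
def pseudo_weight_labels(score_a: float, score_b: float, eps: float = 1e-8):
    # normalize the two per-frame branch scores (e.g. IOU against the
    # ground truth) so that the two pseudo weight labels sum to one
    total = score_a + score_b + eps
    return score_a / total, score_b / total

# e.g. A-branch IOU 0.8, B-branch IOU 0.4 -> labels roughly (0.667, 0.333)
l_a, l_b = pseudo_weight_labels(0.8, 0.4)
```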
In a more specific technical scheme, the multi-modal adaptive fusion system based on cross-modal enhancement comprises the following modules:
the classification-based weight prediction module, configured to train a preset number of binary classifiers to obtain a weight learning network, input the fused features of preset distinct modalities into the weight learning network, and predict and distinguish the reliability weights of the preset distinct modalities accordingly, wherein the distinct modalities comprise: a first modality and a second modality;
the weighted residual error guiding module, configured to extract modal features from a preset backbone network, pass the modal features of the first modality through a parallel structure to modulate and obtain the relative scaling factor and relative displacement factor of the first modality with respect to the second modality, process them with the weight learning network to obtain classification-based weights, and guide the other modality to learn discriminative features, thereby obtaining multi-modality training data, the weighted residual error guiding module being connected with the classification-based weight prediction module;
the pseudo weight label generating module, configured to perform offline training on the branch networks of the first modality, the second modality and the mixed modality according to the multi-modality training data so as to generate reliable pseudo weight labels, whereby the reliability weights are estimated, the pseudo weight label generating module being connected with the weighted residual error guiding module.
Compared with the prior art, the invention has the following advantages: the invention learns the weight of each modality in a supervised manner, then performs weighted residual error guidance, and finally mines and utilizes the useful features of both modalities. Reliable pseudo weight labels are generated by a three-branch network, and a simple and efficient classification scheme is used to estimate high-accuracy reliability weights. The invention designs a weighted residual error guiding module, based on the estimated weights and residual connections, to propagate useful features between modalities while mitigating the influence of noisy modality features on the transferred information.
The invention addresses the problem that weight labels usually contain considerable noise, making the fusion result unreliable. Through classification-based weight prediction, the method can estimate high-precision reliability weights even under label noise, improving robustness and the multi-modal fusion effect.
The invention addresses the fact that quality differences between modalities lead to different contributions: it explicitly uses the useful information of one modality to enhance the features of the other, taking the features of one modality as conditioning information and adaptively modulating the features of the other through an affine transformation. The complementarity of the different modalities is fully mined, the robustness of fusion is improved, and the high-quality modality is used to improve the discriminability of the low-quality modality.
The invention learns the accurate reliability weight of each modality in a supervised manner, then performs weighted residual error guidance to mine and utilize the useful features of the two modalities, generates reliable weights, and improves the reliability of multi-modal feature adaptive fusion weight estimation.
Aiming at the difficulty of acquiring quality labels, the invention designs a three-branch network to generate reliable pseudo quality labels, thereby improving the accuracy of weight learning during fusion.
The invention solves the technical problems in the prior art of poor fusion performance, low accuracy of weight learning during fusion, and unreliable estimation of multi-modal feature adaptive fusion weights.
Drawings
FIG. 1 is a schematic diagram of data flow processing based on a cross-modal enhanced multi-modal adaptive fusion method according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a basic module of a multi-modal adaptive fusion data stream processing system based on cross-modal enhancement according to embodiment 1 of the present invention;
FIG. 3 is a schematic diagram illustrating basic steps of a method for processing a multi-modal adaptive fusion data stream based on cross-modal enhancement according to embodiment 1 of the present invention;
FIG. 4 is a schematic diagram of a classification-based weight prediction module according to embodiment 1 of the present invention;
FIG. 5 is a schematic diagram of a weighted residual error guiding module according to embodiment 1 of the present invention;
FIG. 6 is a diagram showing the specific steps of the weighted residual instruction of embodiment 1 of the present invention;
FIG. 7 is a schematic diagram of pseudo weight tag generation according to embodiment 1 of the present invention;
FIG. 8 is a schematic view of an IOU according to example 1 of the invention;
FIG. 9 is a schematic diagram of the experimental result of example 2 of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, the multi-mode self-adaptive fusion system based on cross-mode enhancement provided by the invention comprises the following basic modules: a classification-based weight prediction module 1, a weighted residual error guiding module 2 and a pseudo weight tag generating module 3.
As shown in fig. 2 and fig. 3, the multi-mode self-adaptive fusion method based on cross-mode enhancement provided by the invention comprises the following basic steps:
s1, learning the accurate reliability weight of each mode in a supervised mode by using a weight prediction module 1 based on classification;
as shown in fig. 4, in the present embodiment, the classification-based weight prediction module 1 includes a plurality of bi-classifiers, each of which has its decision boundary, and thus has a stronger robustness to noise pseudo weight labels.
In this embodiment, a learning-based network is trained with N bi-classifiers as a classification-based weight prediction module for predicting the reliability weight of different modalities instead of the original regressor. In this embodiment, the number of the two classifiers is 16. Each two-classifier consists of a global averaging pooling layer and two fully connected layers with a ReLU activation function and a Sigmoid activation function, respectively. In each frame, each classifier takes the fusion characteristics of two modes as input, learns and distinguishes the weights of the two modes, and judges whether the weights are greater than a certain threshold delta i Here delta i =(i-1)/N,i=1,2,…,N
Assuming that the prediction probability of the ith classifier is p i The final prediction weight is the average of all the prediction probabilities, and the formula is as follows:
Figure BDA0004094694580000081
in this embodiment, data of two modalities is taken as an example: guiding the generation of the weight by using the weight prediction module based on classification, if the weight of the mode A is W A The weight of the mode B is W B The method comprises the steps of carrying out a first treatment on the surface of the Then there is W A +W B =1 holds.
S2, using the weighted residual error guiding module 2, taking the modal features extracted from the backbone network as input, modulating them through two parallel structures composed of 1×1 convolution layers, which respectively apply scaling and displacement to the features, and guiding the other modality to learn more discriminative features;
As shown in fig. 5, in this embodiment, quality differences between modalities cause different contributions; in order to fully mine the complementarity of the different modalities and improve the robustness of fusion, the high-quality modality is used to improve the discriminability of the low-quality modality. Thus, useful information of one modality is explicitly used to enhance the features of the other modality, improving the feature fusion effect.
In this embodiment, the modal features extracted from the backbone network are taken as input and then modulated through the two parallel structures composed of 1×1 convolution layers, which respectively apply scaling and displacement to the features and guide the other modality to learn more discriminative features.
As shown in fig. 6, in this embodiment, step S2 further includes the following specific steps:
s21, enabling the features of the A mode to sequentially pass through a 1×1 convolution layer, a ReLU layer, a 1×1 convolution layer and a Sigmoid layer; the scaling factor of the mode A relative to the mode B is obtained, and the specific formula is as follows:
Scale A =Sigmoid(conv(ReLU(conv(F A ))))。
wherein Scale is * To scale factors, sigmoid (·) is a Sigmoid function, conv (·) is a 1×1 convolution, reLU (·) is a ReLU function, F * Is characterized in that;
s22, the characteristics of the A mode are subjected to 1×1 convolution, a ReLU layer, 1×1 convolution and a ReLU layer, and at the moment, the displacement factor of the A mode relative to the B mode is obtained, wherein the specific formula is as follows:
Shift A =ReLU(conv(ReLU(conv(F A ))))。
wherein Shift (·) is a Shift factor;
in this embodiment, the ReLU layer differs from the Sigmoid layer in the last layer;
s23, sending the features of the A mode into a weight prediction module based on classification to obtain weight based on classification for subsequent calculation.
Through the operations of the foregoing steps S21 to S23, the final formula is obtained as follows:
Figure BDA0004094694580000091
Figure BDA0004094694580000092
wherein ,
Figure BDA0004094694580000093
is the characteristic of the A mode enhanced by the B characteristic, W B Means that the weight after the classification-based weight prediction module meets the requirement of W A +W B =1。Scale B Refers to the scaling factor of the B mode relative to the A mode obtained through the step (1), shift B Refers to the displacement factor of the mode B relative to the mode A obtained by the step (2), F A Is the original feature of the A mode; a and B are vice versa.
Under different application scenes, the reliability of the mode A is better, or the reliability of the mode B is better. Thus, in both cases, the structure of the weighted residual guide module is the same, but the parameters are different.
S3, using the pseudo weight label generation module 3, performing offline training on the three-branch network for the training data of the two modalities, namely an A-modality branch, a B-modality branch and an A-B mixed-modality branch, so as to generate reliable pseudo quality labels;
As shown in fig. 7, in the present embodiment, the three-branch network is trained offline on the training data of the two modalities, namely the A-modality branch, the B-modality branch and the A-B mixed-modality branch. In the sample acquisition stage, the method runs the three-branch network on different training video sequences and records the fusion results of all frames.
The results of the A and B branches are compared with the ground truth of each frame, and score values are calculated respectively.
The two branches are normalized by a normalization function, and the result of one branch is taken as the pseudo quality label of the current image pair, with the reference formula:

L = score(A, GT) / (score(A, GT) + score(B, GT))

wherein A is the output of the A branch, B is the output of the B branch, GT is the true value, score(·) may be chosen as the IOU, and L is the result of the calculation, representing the label.
As shown in fig. 8, in this embodiment, the specific formula of the IOU is:
IOU = area(A ∩ GT) / area(A ∪ GT)

wherein A is box A in fig. 8 and GT is the GT box in fig. 8; A ∩ GT denotes the intersection of box A and the GT box, and A ∪ GT denotes their union. In this embodiment, the IOU is the ratio of the intersection area of A and GT to their union area: the greater the ratio, the higher the branch score; the smaller the ratio, the lower the branch score.
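A minimal sketch of this IOU computation, assuming axis-aligned boxes given as (x1, y1, x2, y2) tuples:

```python
def iou(box_a, box_gt):
    # IOU = area(A ∩ GT) / area(A ∪ GT) for axis-aligned boxes
    ix1, iy1 = max(box_a[0], box_gt[0]), max(box_a[1], box_gt[1])
    ix2, iy2 = min(box_a[2], box_gt[2]), min(box_a[3], box_gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_a + area_gt - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```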
Example 2
As shown in fig. 9, in the present embodiment, in an application scenario of computer vision tracking, visible light data and near infrared data are selected as the two modalities of the present application, and score(·) is chosen as the IOU.
The algorithm of the present application was evaluated by comparing its tracking performance with several state-of-the-art trackers on four tracking datasets (GTOT, RGBT210, RGBT234 and LasHeR) to verify the validity of the proposed method; see Table 1:
table 1 method results comparison list
The produced results show that, given the respective characteristics of the two modalities, the method can effectively extract features from one modality even when a great amount of noise is present in the other, thereby enhancing the result.
As can be seen from the above table, the present application adopts the precision rate (PR) and success rate (SR) under one-pass evaluation (OPE) as the evaluation indexes for quantitative performance evaluation. The precision rate is the percentage of frames in which the distance between the predicted position and the true value is less than a threshold; the present application sets the threshold to 5 pixels for the GTOT dataset and 20 pixels for the other datasets, since the targets in the GTOT dataset are typically small. The success rate is the percentage of successfully tracked frames whose overlap with the true value exceeds a threshold, and the success rate score is calculated as the area under the success rate curve. Considering that precision is very sensitive to target size, the normalized precision rate (NPR), obtained by normalizing against the true value, is also used to evaluate tracking performance on the LasHeR dataset.
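For illustration, a sketch of how PR and SR can be computed from per-frame center errors and overlaps; the 21-point threshold grid for the success curve is an assumption of the example:

```python
import numpy as np

def precision_rate(center_errors, threshold=20.0):
    # PR: fraction of frames whose predicted center is within `threshold`
    # pixels of the ground truth (5 px for GTOT, 20 px for other datasets)
    errs = np.asarray(center_errors)
    return float((errs <= threshold).mean())

def success_rate(overlaps):
    # SR: area under the success curve, approximated as the mean fraction
    # of frames whose IOU exceeds each overlap threshold in [0, 1]
    ious = np.asarray(overlaps)
    thresholds = np.linspace(0.0, 1.0, 21)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

print(precision_rate([3.0, 12.0, 25.0]))       # 2 of 3 frames within 20 px
print(success_rate([0.9, 0.5, 0.1]))           # AUC of the success curve
```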
In summary, the invention learns the weight of each modality in a supervised manner, then performs weighted residual error guidance, and finally mines and utilizes the useful features of both modalities. Reliable pseudo weight labels are generated by a three-branch network, and a simple and efficient classification scheme is used to estimate high-accuracy reliability weights. The invention designs a weighted residual error guiding module, based on the estimated weights and residual connections, to propagate useful features between modalities while mitigating the influence of noisy modality features on the transferred information.
The invention addresses the problem that weight labels usually contain considerable noise, making the fusion result unreliable. Through classification-based weight prediction, the method can estimate high-precision reliability weights even under label noise, improving robustness and the multi-modal fusion effect.
The invention addresses the fact that quality differences between modalities lead to different contributions: it explicitly uses the useful information of one modality to enhance the features of the other, taking the features of one modality as conditioning information and adaptively modulating the features of the other through an affine transformation. The complementarity of the different modalities is fully mined, the robustness of fusion is improved, and the high-quality modality is used to improve the discriminability of the low-quality modality.
The invention learns the accurate reliability weight of each modality in a supervised manner, then performs weighted residual error guidance to mine and utilize the useful features of the two modalities, generates reliable weights, and improves the reliability of multi-modal feature adaptive fusion weight estimation.
Aiming at the difficulty of acquiring quality labels, the invention designs a three-branch network to generate reliable pseudo quality labels, thereby improving the accuracy of weight learning during fusion.
The invention solves the technical problems in the prior art of poor fusion performance, low accuracy of weight learning during fusion, and unreliable estimation of multi-modal feature adaptive fusion weights.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A multi-modal adaptive fusion method based on cross-modal enhancement, characterized by comprising the following steps:
S1, training a preset number of binary classifiers to obtain a weight learning network, and inputting the fused features of preset distinct modalities into the weight learning network so as to predict and distinguish the reliability weights of the preset distinct modalities, wherein the distinct modalities comprise: a first modality and a second modality;
S2, extracting modal features from a preset backbone network, passing the modal features of the first modality through a parallel structure so as to modulate and obtain the relative scaling factor and relative displacement factor of the first modality with respect to the second modality, thereby obtaining classification-based weights through processing by the weight learning network, and guiding the other modality to learn discriminative features to obtain multi-modality training data;
and S3, performing offline training on the branch networks of the first modality, the second modality and the mixed modality according to the multi-modality training data so as to generate reliable pseudo weight labels, and estimating the reliability weights accordingly.
2. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 1, wherein in step S1 each binary classifier comprises a global average pooling layer and two fully connected layers, the first fully connected layer being followed by a ReLU activation function and the second by a Sigmoid activation function.
3. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 1, wherein step S1 comprises: setting the prediction probability of the i-th binary classifier to p_i, and predicting the preset distinct-modality reliability weight W according to the following logic:

W = (1/N) · Σ_{i=1}^{N} p_i

wherein N is the number of binary classifiers.
4. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 1, wherein the step S2 includes:
s21, enabling the modal characteristics of the first modality to sequentially pass through a first 1 multiplied by 1 convolution layer, a ReLU layer, a second 1 multiplied by 1 convolution layer and a Sigmoid layer to obtain the relative scaling factor;
s22, sequentially passing the modal characteristics of the first modality through the first 1 multiplied by 1 convolution, the ReLU layer, the second 1 multiplied by 1 convolution and the ReLU layer to obtain the relative displacement factor;
s23, the modal characteristics of the first modality are sent into the weight learning network, learning weight is given after processing, and accordingly the first modality enhancement characteristics and the second modality enhancement characteristics are obtained.
5. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 4, wherein in step S21 the relative scaling factor of the first modality with respect to the second modality is obtained through the first 1×1 convolution, the ReLU layer, the second 1×1 convolution and the Sigmoid layer using the following logic:

Scale_A = Sigmoid(conv(ReLU(conv(F_A))))

wherein Scale_* is the scaling factor, Sigmoid(·) is the Sigmoid function, conv(·) is a 1×1 convolution, ReLU(·) is the ReLU function, and F_* is the feature.
6. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 4, wherein in step S22 the relative displacement factor of the first modality with respect to the second modality is obtained using the following logic:

Shift_A = ReLU(conv(ReLU(conv(F_A))))

wherein Shift_* is the displacement factor.
7. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 1, wherein in step S23 the classification-based weights are processed using the following logic to obtain the first modality enhancement features and the second modality enhancement features:

F̂_A = F_A + W_B · (Scale_B ⊙ F_A + Shift_B)
F̂_B = F_B + W_A · (Scale_A ⊙ F_B + Shift_A)

wherein F̂_A is the feature of modality A enhanced by the features of modality B; W_B is the weight output by the classification-based weight prediction module and satisfies W_A + W_B = 1; Scale_B is the scaling factor of modality B relative to modality A obtained in step S21; Shift_B is the displacement factor of modality B relative to modality A obtained in step S22; F_A is the original feature of modality A; and ⊙ denotes element-wise multiplication.
8. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 1, wherein the step S3 includes:
s31, in a sample acquisition stage, a three-branch network is operated on different training video sequences, and fusion results of all frames are recorded;
s32, comparing the fusion results of the first branch and the second branch with real data of each frame, and respectively calculating score values;
and S33, normalizing the first branch and the second branch by using a normalization function, and processing according to the score value to obtain the reliable pseudo weight label of the current image pair.
9. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 8, wherein in step S33 the first branch and the second branch are normalized using the following logic to obtain the reliable pseudo weight label:

L = score(A, GT) / (score(A, GT) + score(B, GT))

wherein A is the output of the A branch, B is the output of the B branch, GT is the true value, score(·) may be chosen as the IOU, and L is the result of the calculation, representing the label.
10. A multi-modal adaptive fusion system based on cross-modal enhancement, characterized in that the system comprises:
the classification-based weight prediction module, configured to train a preset number of binary classifiers to obtain a weight learning network, input the fused features of preset distinct modalities into the weight learning network, and predict and distinguish the reliability weights of the preset distinct modalities accordingly, wherein the distinct modalities comprise: a first modality and a second modality;
the weighted residual error guiding module, configured to extract modal features from a preset backbone network, pass the modal features of the first modality through a parallel structure to modulate and obtain the relative scaling factor and relative displacement factor of the first modality with respect to the second modality, process them with the weight learning network to obtain classification-based weights, and guide the other modality to learn discriminative features, thereby obtaining multi-modality training data, the weighted residual error guiding module being connected with the classification-based weight prediction module;
the pseudo weight label generating module, configured to perform offline training on the branch networks of the first modality, the second modality and the mixed modality according to the multi-modality training data so as to generate reliable pseudo weight labels, whereby the reliability weights are estimated, the pseudo weight label generating module being connected with the weighted residual error guiding module.
CN202310162234.5A 2023-02-21 2023-02-21 Cross-modal enhancement-based multi-modal self-adaptive fusion method and system Pending CN116127408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310162234.5A CN116127408A (en) 2023-02-21 2023-02-21 Cross-modal enhancement-based multi-modal self-adaptive fusion method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310162234.5A CN116127408A (en) 2023-02-21 2023-02-21 Cross-modal enhancement-based multi-modal self-adaptive fusion method and system

Publications (1)

Publication Number Publication Date
CN116127408A true CN116127408A (en) 2023-05-16

Family

ID=86299118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310162234.5A Pending CN116127408A (en) 2023-02-21 2023-02-21 Cross-modal enhancement-based multi-modal self-adaptive fusion method and system

Country Status (1)

Country Link
CN (1) CN116127408A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
US20180189572A1 (en) * 2016-12-30 2018-07-05 Mitsubishi Electric Research Laboratories, Inc. Method and System for Multi-Modal Fusion Model
CN113963200A (en) * 2021-10-18 2022-01-21 郑州大学 Modal data fusion processing method, device, device and storage medium
CN114966696A (en) * 2021-12-23 2022-08-30 昆明理工大学 Transformer-based cross-modal fusion target detection method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
US20180189572A1 (en) * 2016-12-30 2018-07-05 Mitsubishi Electric Research Laboratories, Inc. Method and System for Multi-Modal Fusion Model
CN113963200A (en) * 2021-10-18 2022-01-21 郑州大学 Modal data fusion processing method, device, device and storage medium
CN114966696A (en) * 2021-12-23 2022-08-30 昆明理工大学 Transformer-based cross-modal fusion target detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Tingting; Zhang Jianwu; Guo Chunsheng; Chen Huahua; Zhou Di; Wang Yansong; Xu Aihua: "A Survey of Image Object Detection Algorithms Based on Deep Learning", Telecommunications Science, No. 07, 20 July 2020 (2020-07-20), pages 96-110 *
Wang Futian, Zhang Shuyun, Li Chenglong, Luo Bin: "RGBT Tracking with Dynamic Modality Interaction and Adaptive Feature Fusion", Journal of Image and Graphics, Vol. 27, No. 10, 20 October 2022 (2022-10-20), pages 3010-3021 *

Similar Documents

Publication Publication Date Title
CN111160297B (en) Pedestrian Re-identification Method and Device Based on Residual Attention Mechanism Spatio-temporal Joint Model
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
US20230134967A1 (en) Method for recognizing activities using separate spatial and temporal attention weights
CN101701818B (en) Detection methods for distant obstacles
CN114387496A (en) A target detection method and electronic device
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN114419323B (en) Cross-modal learning and domain self-adaptive RGBD image semantic segmentation method
CN113361475A (en) Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing
CN112308921A (en) A dynamic SLAM method for joint optimization based on semantics and geometry
CN112949451A (en) Cross-modal target tracking method and system through modal perception feature learning
CN118865494B (en) Human body action recognition method based on space-time interest point and space-time diagram convolution
Ge et al. Vipose: Real-time visual-inertial 6d object pose tracking
Shen et al. A self‐supervised monocular depth estimation model with scale recovery and transfer learning for construction scene analysis
CN104700105B (en) unstructured outdoor terrain global detection method
He et al. A generative feature-to-image robotic vision framework for 6D pose measurement of metal parts
Hirner et al. FC-DCNN: A densely connected neural network for stereo estimation
CN120708004A (en) Intelligent sorting method for waste plastic bottles based on hyperspectral point cloud based on cross-modal image fusion
Jiang et al. Triangulate geometric constraint combined with visual-flow fusion network for accurate 6DoF pose estimation
CN119068016B (en) An RGBT target tracking method based on modality-aware feature learning
Spurr et al. Adversarial motion modelling helps semi-supervised hand pose estimation
CN113610058A (en) Facial pose enhancement interaction method for facial feature migration
CN114419529A (en) A cross-modal pedestrian re-identification method and system based on distribution space alignment
CN117994594B (en) Power operation risk identification method based on deep learning
CN116127408A (en) Cross-modal enhancement-based multi-modal self-adaptive fusion method and system
Yang et al. TMU-GAN: a compliance detection algorithm for protective equipment in power operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination