
CN116127408A - Cross-modal enhancement-based multi-modal self-adaptive fusion method and system - Google Patents


Info

Publication number
CN116127408A
Authority
CN
China
Prior art keywords
modality
modal
weight
mode
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310162234.5A
Other languages
Chinese (zh)
Inventor
李成龙
殷策
刘磊
汤进
毛军军
吴涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202310162234.5A priority Critical patent/CN116127408A/en
Publication of CN116127408A publication Critical patent/CN116127408A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a multi-modal adaptive fusion method and system based on cross-modal enhancement, wherein the method comprises the following steps: N binary classifiers are used to train a learning-based network as a classification-based weight prediction module that predicts the reliability weights of the different modalities in place of the original regressor, each binary classifier consisting of a global average pooling layer and two fully connected layers with ReLU and Sigmoid activation functions, respectively; the modal features extracted from the backbone network are taken as input and then modulated through two parallel structures composed of 1×1 convolution layers, which respectively apply scaling and displacement to the features and guide the other modality to learn more discriminative features; and three branch networks, namely an A-modality branch, a B-modality branch and an A-B mixed-modality branch, are trained offline on the training data of the two modalities. The method solves the technical problems of poor fusion performance, low accuracy of weight learning during fusion, and unreliable estimation of multi-modal feature adaptive fusion weights.

Description

Cross-modal enhancement-based multi-modal self-adaptive fusion method and system
Technical Field
The invention relates to the technical field of multi-modal fusion in deep learning, and in particular to a multi-modal adaptive fusion method and system based on cross-modal enhancement.
Background
In the data field, multi-modal data refers to data obtained for the same described object through different descriptive domains or different modes of description; each such domain or mode is called a modality. Multi-modal data fusion means combining the characteristics of data from different modalities with existing computational methods so that the advantages of each are exploited: the data of the different modalities are processed jointly, and the information of each modality is fused to perform a target prediction task.
In recent years, deep learning has been applied to data fusion; it can process data from multiple sources and is characterized by high accuracy and strong real-time performance, and deep-learning-based multi-modal fusion methods have been applied in many fields owing to their strong feature extraction and expression capability. In the direction of fluid physics, Li Yang et al., "An Iterative Neural Operator to Predict the Thermo-Fluid Information in Internal Cooling Channels" [J], Journal of Turbomachinery, 2022, vol. 144, predict the thermo-fluid information of internal cooling channels with a deep-learning-based multi-modal fusion method: the velocity, hidden state and boundary-condition indicator variables of the thermo-fluid at the current moment are fused as input to predict the velocity and hidden state at the next moment; combining the deep-learning-based multi-modal fusion method with the original system of partial differential equations reduces the root-mean-square prediction error of temperature and pressure by about 20%. Luca Guastoni et al., "Convolutional-network models to predict wall-bounded turbulence from wall quantities" [J], Journal of Fluid Mechanics, 2021, vol. 928, adaptively fuse, through deep learning, the wall shear-stress field and the wall pressure field sampled from DNS as inputs to predict the streamwise, wall-normal and spanwise velocity fluctuations; the produced results agree closely with the provided reference data. In the direction of image restoration, Jingai Xu et al., "Deep Fully Interpretable Network for Multi-Modal Image Restoration" [C], International Conference on Image Processing, 2021, introduce a cyclic scheme to extract common and unique features, ensuring the unification of deep-learning-based multi-modal adaptive fusion features, and experimental results on two multi-modal image restoration tasks verify the significance of the proposed method. In the direction of image super-resolution, Deng X. et al., "Deep coupled feedback network for joint exposure fusion and image super-resolution" [J], IEEE Transactions on Image Processing, 2021, vol. 30, introduce a coupled structure into a feedback mechanism so that the coupled feedback module can simultaneously acquire and fuse the features of overexposed and underexposed images, promoting the effect of the super-resolution task.
Although multi-modal fusion based on deep neural networks has succeeded in all of the various tasks described above, and its effectiveness and versatility are well documented by many state-of-the-art works and wide practical application across industries, there is often an underlying requirement to learn the fusion weights from the data of each modality, and accurately measuring the "representativeness" of the features of different modality data is itself a challenging problem. For the fusion of complex multi-modal data, existing practice usually introduces an attention mechanism to generate reliability weights for the adaptive fusion of multi-modal features and to suppress the weak features of the different modalities. For example, the prior invention patent document with publication number CN110363707A discloses a multi-view three-dimensional point cloud stitching method based on virtual features of constraints, which comprises the following steps: step one, optimizing the chain-shaped multi-view stitching strategy into a ring-shaped multi-view stitching strategy, thereby forming a stitching ring; step two, introducing constraint objects into the measurement scene, and obtaining the measured-object point clouds and the local constraint-object point clouds of each field of view of the stitching ring through the measurement equipment; step three, supplementing the missing point cloud of the constraint-object surface by least-squares fitting according to the local constraint-object point cloud in each field of view, thereby reconstructing the complete virtual surface point cloud of the constraint object; step four, registering the local features of the corresponding constraints in adjacent fields of view by singular value decomposition or the quaternion method to realize coarse point-cloud stitching of adjacent fields of view of the stitching ring; step five, constructing a virtual overlapping area of adjacent fields of view of the stitching ring from the measured constraint-object point cloud and the virtual surface point cloud of the constraint object, and calculating the corresponding point pairs of the measured-object point cloud and the measured constraint-object point cloud from the virtual overlapping area; step six, constructing a weight-factor radiation model according to the spatial distribution of the measured-object point cloud and the constraint-object point cloud, and calculating the weight factors of the corresponding point pairs of the object point clouds in adjacent fields of view; step seven, constructing a full-scene weighted data fusion model from the measured-object point cloud and the virtual surface point cloud of the constraint object, taking the sum of the squares of the products of the projection distances and the weights of all corresponding point pairs in adjacent fields of view as the objective function, minimizing the objective function, and iteratively optimizing to obtain the transformation matrix of each pair of adjacent fields of view, completing fine point-cloud stitching of each pair of adjacent fields of view of the stitching ring; step eight, indirectly obtaining the transformation matrix between the first and last point clouds of the stitching ring from the transformation matrices of the adjacent fields of view through matrix operations, the transformation parameters carrying accumulated errors; and step nine, according to the accumulated errors of the transformation parameters calculated from the transformation matrices, re-optimizing the stitching of each field of view in the stitching ring, giving priority to the transformation parameters with large accumulated errors, thereby completing the multi-view three-dimensional point cloud stitching. However, there is no explicit supervision for generating these weights, so in complex scenes the fusion effect still leaves room for improvement.
As another example, the prior invention patent document with publication number CN102844766A, "multi-feature fusion identity recognition method based on human eye images", includes registration and recognition. Registration comprises: for a given registered human eye image, obtaining a normalized human eye image and a normalized iris image; extracting multi-modal features from the human eye image of the user to be registered, and storing the obtained multi-modal features of the human eye image as registration information in a registration database. Recognition comprises: for a given human eye image to be identified, obtaining a normalized human eye image and a normalized iris image; extracting multi-modal features from the human eye image of the user to be identified; comparing the extracted multi-modal features with the multi-modal features in the database to obtain comparison scores, and obtaining a fusion score through score-level fusion; and carrying out multi-feature fusion identity recognition on the human eye image with a classifier. However, in the prior art, the reliability of the data of different modalities generated by the same object under different sensors differs, and in complex scenes the estimation of the adaptive fusion weights of multi-modal features is often unreliable. Both the weighted residual error guiding module and the classification-based weight prediction module require quality labels for training, but existing two-modality datasets are not annotated with quality labels, and manually acquiring quality labels from the data is laborious and difficult.
In summary, the prior art suffers from the technical problems of poor fusion performance, low accuracy of the weights learned during fusion, and unreliable estimation of multi-modal feature adaptive fusion weights.
Disclosure of Invention
The invention solves the technical problems in the prior art of poor fusion performance, low accuracy of weight learning during fusion, and unreliable estimation of multi-modal feature adaptive fusion weights.
The invention adopts the following technical scheme to solve the above technical problems: a multi-modal adaptive fusion method based on cross-modal enhancement, comprising the following steps:
S1, training a preset number of binary classifiers to obtain a weight learning network, and inputting the fused features of preset distinct modalities into the weight learning network to predict and distinguish the reliability weights of the preset distinct modalities, wherein the distinct modalities comprise: a first modality and a second modality;
S2, extracting modal features from a preset backbone network, passing the modal features of the first modality through a parallel structure to modulate and obtain the relative scaling factor and relative displacement factor of the first modality with respect to the second modality, processing with the weight learning network to obtain classification-based weights, and guiding the other modality to learn more discriminative features according to the classification-based weights, thereby obtaining multi-modality training data;
and S3, performing offline training on the branch networks of the first modality, the second modality and the mixed modality according to the multi-modality training data so as to generate reliable pseudo weight labels, and estimating the reliability weights accordingly.
The invention learns the weight of each modality in a supervised manner, then performs weighted residual error guidance, and finally mines and utilizes the useful features of both modalities. Reliable pseudo weight labels are generated by a three-branch network, and a simple and efficient classification scheme is used to estimate high-accuracy reliability weights. The invention designs a weighted residual error guiding module, based on the estimated weights and residual connections, to propagate useful features between modalities while mitigating the influence of noisy modality features on the transferred information.
In a more specific technical scheme, in step S1, each binary classifier comprises a global average pooling layer and two fully connected layers, the first fully connected layer being followed by a ReLU activation function and the second by a Sigmoid activation function.
In a more specific technical solution, step S1 includes: setting the prediction probability of the i-th binary classifier to p_i, the preset distinct-modality reliability weight W is predicted using the following logic:

W = (1/N) · Σ_{i=1}^{N} p_i

wherein N is the number of binary classifiers.
The invention addresses the problem that weight labels usually contain considerable noise, making the fusion result unreliable. Through classification-based weight prediction, the method can estimate high-precision reliability weights even under label noise, improving robustness and the multi-modal fusion effect.
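As an illustration, the following is a minimal PyTorch sketch of such a classification-based weight prediction module. The channel count, the hidden width of the fully connected layers, and the name WeightPredictor are assumptions made for the example, not values fixed by the invention.

```python
import torch
import torch.nn as nn

class WeightPredictor(nn.Module):
    def __init__(self, channels: int = 512, n_classifiers: int = 16):
        super().__init__()
        # each binary classifier: global average pooling followed by two
        # fully connected layers with ReLU and Sigmoid activations
        self.classifiers = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),           # global average pooling
                nn.Flatten(),
                nn.Linear(channels, channels // 4),
                nn.ReLU(inplace=True),
                nn.Linear(channels // 4, 1),
                nn.Sigmoid(),
            )
            for _ in range(n_classifiers)
        ])

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # W = (1/N) * sum_i p_i : average the N per-classifier probabilities
        probs = [clf(fused) for clf in self.classifiers]
        return torch.stack(probs, dim=0).mean(dim=0)  # shape (B, 1)

# toy usage: fused two-modality features of shape (batch, channels, H, W)
w_a = WeightPredictor()(torch.randn(2, 512, 7, 7))
w_b = 1.0 - w_a  # the weights of the two modalities sum to one
```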
In a more specific technical solution, step S2 includes:
s21, enabling the modal characteristics of the first modality to sequentially pass through a first 1 multiplied by 1 convolution layer, a ReLU layer, a second 1 multiplied by 1 convolution layer and a Sigmoid layer to obtain a relative scaling factor;
s22, enabling the modal characteristics of the first mode to sequentially pass through a first 1 multiplied by 1 convolution, a ReLU layer, a second 1 multiplied by 1 convolution and the ReLU layer to obtain a relative displacement factor;
s23, the modal characteristics of the first modality are sent into a weight learning network, learning weight is given to the processing, and accordingly the first modality enhancement characteristics and the second modality enhancement characteristics are obtained.
The invention addresses the fact that quality differences between modalities lead to different contributions: it explicitly uses the useful information of one modality to enhance the features of the other, taking the features of one modality as conditioning information and adaptively modulating the features of the other through an affine transformation. The complementarity of the different modalities is fully mined, the robustness of fusion is improved, and the high-quality modality is used to improve the discriminability of the low-quality modality.
In a more specific technical solution, in step S21, the relative scaling factor of the first modality with respect to the second modality is obtained through the first 1×1 convolution, the ReLU layer, the second 1×1 convolution and the Sigmoid layer using the following logic:

Scale_A = Sigmoid(conv(ReLU(conv(F_A))))

wherein Scale_* is the scaling factor, Sigmoid(·) is the Sigmoid function, conv(·) is a 1×1 convolution, ReLU(·) is the ReLU function, and F_* is the feature.
In a more specific technical scheme, in step S22, the relative displacement factor of the first modality with respect to the second modality is obtained using the following logic:

Shift_A = ReLU(conv(ReLU(conv(F_A))))

wherein Shift_* is the displacement factor.
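For illustration, a minimal PyTorch sketch of the two parallel 1×1 convolution structures of steps S21 and S22; the channel count and the module name ScaleShift are assumptions of the example:

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    def __init__(self, channels: int = 512):
        super().__init__()
        # scaling branch: 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid
        self.scale = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # displacement branch: 1x1 conv -> ReLU -> 1x1 conv -> ReLU
        self.shift = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor):
        # the two branches differ only in their last activation layer
        return self.scale(feat), self.shift(feat)

scale_a, shift_a = ScaleShift()(torch.randn(2, 512, 7, 7))
```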
In a more specific technical solution, in step S23, the classification-based weights are processed using the following logic to obtain the first modality enhancement features and the second modality enhancement features:

F̂_A = F_A + W_B · (Scale_B ⊙ F_A + Shift_B)
F̂_B = F_B + W_A · (Scale_A ⊙ F_B + Shift_A)

wherein F̂_A is the feature of modality A enhanced by the features of modality B; W_B is the weight output by the classification-based weight prediction module and satisfies W_A + W_B = 1; Scale_B is the scaling factor of modality B relative to modality A obtained in step S21; Shift_B is the displacement factor of modality B relative to modality A obtained in step S22; F_A is the original feature of modality A; and ⊙ denotes element-wise multiplication.
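A minimal sketch of the weighted residual guidance, under the assumption, reflected in the reconstructed formulas above, that the enhanced feature is the original feature plus the reliability-weighted, affine-modulated term; tensor shapes and the function name are illustrative:

```python
import torch

def weighted_residual_guidance(f_a, f_b, scale_a, shift_a, scale_b, shift_b, w_a, w_b):
    """Fuse two modality features with weighted residual guidance.

    scale_b / shift_b are produced from modality B's features (e.g. by the
    ScaleShift sketch above) and modulate modality A, and vice versa; w_a
    and w_b are the classification-based reliability weights, w_a + w_b = 1.
    """
    # residual connection keeps the original feature; the affine term is
    # down-weighted by the reliability weight of the guiding modality
    f_a_hat = f_a + w_b.view(-1, 1, 1, 1) * (scale_b * f_a + shift_b)
    f_b_hat = f_b + w_a.view(-1, 1, 1, 1) * (scale_a * f_b + shift_a)
    return f_a_hat, f_b_hat

# toy shapes: batch 2, 512 channels, 7x7 feature maps
f_a, f_b = torch.randn(2, 512, 7, 7), torch.randn(2, 512, 7, 7)
scale = torch.sigmoid(torch.randn(2, 512, 7, 7))
shift = torch.relu(torch.randn(2, 512, 7, 7))
w_a = torch.full((2, 1), 0.6)
out_a, out_b = weighted_residual_guidance(f_a, f_b, scale, shift, scale, shift, w_a, 1 - w_a)
```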
The invention learns the accurate reliability weight of each modality in a supervised manner, then performs weighted residual error guidance to mine and utilize the useful features of the two modalities, generates reliable weights, and improves the reliability of multi-modal feature adaptive fusion weight estimation.
In a more specific technical solution, step S3 includes:
s31, in a sample acquisition stage, a three-branch network is operated on different training video sequences, and fusion results of all frames are recorded;
s32, comparing the fusion results of the first branch and the second branch with real data of each frame, and respectively calculating score values;
and S33, normalizing the first branch and the second branch by using a normalization function, and processing according to the score value to obtain the reliable pseudo weight label of the current image pair.
Aiming at the problem of difficult acquisition of the quality label, the invention designs a three-branch network to generate a reliable pseudo quality label, thereby improving the accuracy of weight learning during fusion.
In a more specific technical scheme, in step S33, the first branch and the second branch are normalized using the following logic to obtain the reliable pseudo weight label:

L = score(A, GT) / (score(A, GT) + score(B, GT))

wherein A is the output of the A branch, B is the output of the B branch, GT is the true value, score(·) may be chosen as the IOU, and L is the result of the calculation, representing the label.
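A minimal sketch of this pseudo weight label computation, assuming (as in the reconstructed formula above) that the normalization function is a simple ratio of the two branch scores; the function name is illustrative:

```python
def pseudo_weight_labels(score_a: float, score_b: float, eps: float = 1e-8):
    # normalize the two per-frame branch scores (e.g. IOU against the
    # ground truth) so that the two pseudo weight labels sum to one
    total = score_a + score_b + eps
    return score_a / total, score_b / total

# e.g. A-branch IOU 0.8, B-branch IOU 0.4 -> labels roughly (0.667, 0.333)
l_a, l_b = pseudo_weight_labels(0.8, 0.4)
```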
In a more specific technical scheme, the multi-modal adaptive fusion system based on cross-modal enhancement comprises the following modules:
the classification-based weight prediction module, configured to train a preset number of binary classifiers to obtain a weight learning network, input the fused features of preset distinct modalities into the weight learning network, and predict and distinguish the reliability weights of the preset distinct modalities accordingly, wherein the distinct modalities comprise: a first modality and a second modality;
the weighted residual error guiding module, configured to extract modal features from a preset backbone network, pass the modal features of the first modality through a parallel structure to modulate and obtain the relative scaling factor and relative displacement factor of the first modality with respect to the second modality, process them with the weight learning network to obtain classification-based weights, and guide the other modality to learn discriminative features, thereby obtaining multi-modality training data, the weighted residual error guiding module being connected with the classification-based weight prediction module;
the pseudo weight label generating module, configured to perform offline training on the branch networks of the first modality, the second modality and the mixed modality according to the multi-modality training data so as to generate reliable pseudo weight labels, whereby the reliability weights are estimated, the pseudo weight label generating module being connected with the weighted residual error guiding module.
Compared with the prior art, the invention has the following advantages: the invention learns the weight of each modality in a supervised manner, then performs weighted residual error guidance, and finally mines and utilizes the useful features of both modalities. Reliable pseudo weight labels are generated by a three-branch network, and a simple and efficient classification scheme is used to estimate high-accuracy reliability weights. The invention designs a weighted residual error guiding module, based on the estimated weights and residual connections, to propagate useful features between modalities while mitigating the influence of noisy modality features on the transferred information.
The invention addresses the problem that weight labels usually contain considerable noise, making the fusion result unreliable. Through classification-based weight prediction, the method can estimate high-precision reliability weights even under label noise, improving robustness and the multi-modal fusion effect.
The invention addresses the fact that quality differences between modalities lead to different contributions: it explicitly uses the useful information of one modality to enhance the features of the other, taking the features of one modality as conditioning information and adaptively modulating the features of the other through an affine transformation. The complementarity of the different modalities is fully mined, the robustness of fusion is improved, and the high-quality modality is used to improve the discriminability of the low-quality modality.
The invention learns the accurate reliability weight of each modality in a supervised manner, then performs weighted residual error guidance to mine and utilize the useful features of the two modalities, generates reliable weights, and improves the reliability of multi-modal feature adaptive fusion weight estimation.
Aiming at the difficulty of acquiring quality labels, the invention designs a three-branch network to generate reliable pseudo quality labels, thereby improving the accuracy of weight learning during fusion.
The invention solves the technical problems in the prior art of poor fusion performance, low accuracy of weight learning during fusion, and unreliable estimation of multi-modal feature adaptive fusion weights.
Drawings
FIG. 1 is a schematic diagram of data flow processing based on a cross-modal enhanced multi-modal adaptive fusion method according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a basic module of a multi-modal adaptive fusion data stream processing system based on cross-modal enhancement according to embodiment 1 of the present invention;
FIG. 3 is a schematic diagram illustrating basic steps of a method for processing a multi-modal adaptive fusion data stream based on cross-modal enhancement according to embodiment 1 of the present invention;
FIG. 4 is a schematic diagram of a classification-based weight prediction module according to embodiment 1 of the present invention;
FIG. 5 is a schematic diagram of a weighted residual error guiding module according to embodiment 1 of the present invention;
FIG. 6 is a diagram showing the specific steps of the weighted residual instruction of embodiment 1 of the present invention;
FIG. 7 is a schematic diagram of pseudo weight tag generation according to embodiment 1 of the present invention;
FIG. 8 is a schematic view of an IOU according to example 1 of the invention;
FIG. 9 is a schematic diagram of the experimental result of example 2 of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions in the embodiments of the present invention will be clearly and completely described in the following in conjunction with the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, the multi-mode self-adaptive fusion system based on cross-mode enhancement provided by the invention comprises the following basic modules: a classification-based weight prediction module 1, a weighted residual error guiding module 2 and a pseudo weight tag generating module 3.
As shown in fig. 2 and fig. 3, the multi-mode self-adaptive fusion method based on cross-mode enhancement provided by the invention comprises the following basic steps:
s1, learning the accurate reliability weight of each mode in a supervised mode by using a weight prediction module 1 based on classification;
as shown in fig. 4, in the present embodiment, the classification-based weight prediction module 1 includes a plurality of bi-classifiers, each of which has its decision boundary, and thus has a stronger robustness to noise pseudo weight labels.
In this embodiment, a learning-based network is trained with N bi-classifiers as a classification-based weight prediction module for predicting the reliability weight of different modalities instead of the original regressor. In this embodiment, the number of the two classifiers is 16. Each two-classifier consists of a global averaging pooling layer and two fully connected layers with a ReLU activation function and a Sigmoid activation function, respectively. In each frame, each classifier takes the fusion characteristics of two modes as input, learns and distinguishes the weights of the two modes, and judges whether the weights are greater than a certain threshold delta i Here delta i =(i-1)/N,i=1,2,…,N
Assuming that the prediction probability of the ith classifier is p i The final prediction weight is the average of all the prediction probabilities, and the formula is as follows:
Figure BDA0004094694580000081
in this embodiment, data of two modalities is taken as an example: guiding the generation of the weight by using the weight prediction module based on classification, if the weight of the mode A is W A The weight of the mode B is W B The method comprises the steps of carrying out a first treatment on the surface of the Then there is W A +W B =1 holds.
S2, using the weighted residual error guiding module 2, taking the modal features extracted from the backbone network as input, modulating them through two parallel structures composed of 1×1 convolution layers, which respectively apply scaling and displacement to the features, and guiding the other modality to learn more discriminative features;
As shown in fig. 5, in this embodiment, quality differences between modalities cause different contributions; in order to fully mine the complementarity of the different modalities and improve the robustness of fusion, the high-quality modality is used to improve the discriminability of the low-quality modality. Thus, useful information of one modality is explicitly used to enhance the features of the other modality, improving the feature fusion effect.
In this embodiment, the modal features extracted from the backbone network are taken as input and then modulated through the two parallel structures composed of 1×1 convolution layers, which respectively apply scaling and displacement to the features and guide the other modality to learn more discriminative features.
As shown in fig. 6, in this embodiment, step S2 further includes the following specific steps:
s21, enabling the features of the A mode to sequentially pass through a 1×1 convolution layer, a ReLU layer, a 1×1 convolution layer and a Sigmoid layer; the scaling factor of the mode A relative to the mode B is obtained, and the specific formula is as follows:
Scale A =Sigmoid(conv(ReLU(conv(F A ))))。
wherein Scale is * To scale factors, sigmoid (·) is a Sigmoid function, conv (·) is a 1×1 convolution, reLU (·) is a ReLU function, F * Is characterized in that;
s22, the characteristics of the A mode are subjected to 1×1 convolution, a ReLU layer, 1×1 convolution and a ReLU layer, and at the moment, the displacement factor of the A mode relative to the B mode is obtained, wherein the specific formula is as follows:
Shift A =ReLU(conv(ReLU(conv(F A ))))。
wherein Shift (·) is a Shift factor;
in this embodiment, the ReLU layer differs from the Sigmoid layer in the last layer;
s23, sending the features of the A mode into a weight prediction module based on classification to obtain weight based on classification for subsequent calculation.
Through the operations of the foregoing steps S21 to S23, the final formula is obtained as follows:
Figure BDA0004094694580000091
Figure BDA0004094694580000092
wherein ,
Figure BDA0004094694580000093
is the characteristic of the A mode enhanced by the B characteristic, W B Means that the weight after the classification-based weight prediction module meets the requirement of W A +W B =1。Scale B Refers to the scaling factor of the B mode relative to the A mode obtained through the step (1), shift B Refers to the displacement factor of the mode B relative to the mode A obtained by the step (2), F A Is the original feature of the A mode; a and B are vice versa.
Under different application scenes, the reliability of the mode A is better, or the reliability of the mode B is better. Thus, in both cases, the structure of the weighted residual guide module is the same, but the parameters are different.
S3, using the pseudo weight label generation module 3, performing offline training on the three-branch network for the training data of the two modalities, namely an A-modality branch, a B-modality branch and an A-B mixed-modality branch, so as to generate reliable pseudo quality labels;
As shown in fig. 7, in the present embodiment, the three-branch network is trained offline on the training data of the two modalities, namely the A-modality branch, the B-modality branch and the A-B mixed-modality branch. In the sample acquisition stage, the method runs the three-branch network on different training video sequences and records the fusion results of all frames.
The results of the A and B branches are compared with the ground truth of each frame, and score values are calculated respectively.
The two branches are normalized by a normalization function, and the result of one branch is taken as the pseudo quality label of the current image pair, with the reference formula:

L = score(A, GT) / (score(A, GT) + score(B, GT))

wherein A is the output of the A branch, B is the output of the B branch, GT is the true value, score(·) may be chosen as the IOU, and L is the result of the calculation, representing the label.
As shown in fig. 8, in this embodiment, the specific formula of the IOU is:
IOU = area(A ∩ GT) / area(A ∪ GT)

wherein A is box A in fig. 8 and GT is the GT box in fig. 8; A ∩ GT denotes the intersection of box A and the GT box, and A ∪ GT denotes their union. In this embodiment, the IOU is the ratio of the intersection area of A and GT to their union area: the greater the ratio, the higher the branch score; the smaller the ratio, the lower the branch score.
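A minimal sketch of this IOU computation, assuming axis-aligned boxes given as (x1, y1, x2, y2) tuples:

```python
def iou(box_a, box_gt):
    # IOU = area(A ∩ GT) / area(A ∪ GT) for axis-aligned boxes
    ix1, iy1 = max(box_a[0], box_gt[0]), max(box_a[1], box_gt[1])
    ix2, iy2 = min(box_a[2], box_gt[2]), min(box_a[3], box_gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_a + area_gt - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```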
Example 2
As shown in fig. 9, in the present embodiment, in an application scenario of computer vision tracking, visible light data and near infrared data are selected as the two modalities of the present application, and score(·) is chosen as the IOU.
The algorithm of the present application was evaluated by comparing its tracking performance with several state-of-the-art trackers on four tracking datasets (GTOT, RGBT210, RGBT234 and LasHeR) to verify the validity of the proposed method; see Table 1:
table 1 method results comparison list
The produced results show that, given the respective characteristics of the two modalities, the method can effectively extract features from one modality even when a great amount of noise is present in the other, thereby enhancing the result.
As can be seen from the above table, the present application adopts the precision rate (PR) and success rate (SR) under one-pass evaluation (OPE) as the evaluation indexes for quantitative performance evaluation. The precision rate is the percentage of frames in which the distance between the predicted position and the true value is less than a threshold; the present application sets the threshold to 5 pixels for the GTOT dataset and 20 pixels for the other datasets, since the targets in the GTOT dataset are typically small. The success rate is the percentage of successfully tracked frames whose overlap with the true value exceeds a threshold, and the success rate score is calculated as the area under the success rate curve. Considering that precision is very sensitive to target size, the normalized precision rate (NPR), obtained by normalizing against the true value, is also used to evaluate tracking performance on the LasHeR dataset.
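For illustration, a sketch of how PR and SR can be computed from per-frame center errors and overlaps; the 21-point threshold grid for the success curve is an assumption of the example:

```python
import numpy as np

def precision_rate(center_errors, threshold=20.0):
    # PR: fraction of frames whose predicted center is within `threshold`
    # pixels of the ground truth (5 px for GTOT, 20 px for other datasets)
    errs = np.asarray(center_errors)
    return float((errs <= threshold).mean())

def success_rate(overlaps):
    # SR: area under the success curve, approximated as the mean fraction
    # of frames whose IOU exceeds each overlap threshold in [0, 1]
    ious = np.asarray(overlaps)
    thresholds = np.linspace(0.0, 1.0, 21)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

print(precision_rate([3.0, 12.0, 25.0]))       # 2 of 3 frames within 20 px
print(success_rate([0.9, 0.5, 0.1]))           # AUC of the success curve
```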
In summary, the invention learns the weight of each modality in a supervised manner, then performs weighted residual error guidance, and finally mines and utilizes the useful features of both modalities. Reliable pseudo weight labels are generated by a three-branch network, and a simple and efficient classification scheme is used to estimate high-accuracy reliability weights. The invention designs a weighted residual error guiding module, based on the estimated weights and residual connections, to propagate useful features between modalities while mitigating the influence of noisy modality features on the transferred information.
The invention addresses the problem that weight labels usually contain considerable noise, making the fusion result unreliable. Through classification-based weight prediction, the method can estimate high-precision reliability weights even under label noise, improving robustness and the multi-modal fusion effect.
The invention addresses the fact that quality differences between modalities lead to different contributions: it explicitly uses the useful information of one modality to enhance the features of the other, taking the features of one modality as conditioning information and adaptively modulating the features of the other through an affine transformation. The complementarity of the different modalities is fully mined, the robustness of fusion is improved, and the high-quality modality is used to improve the discriminability of the low-quality modality.
The invention learns the accurate reliability weight of each modality in a supervised manner, then performs weighted residual error guidance to mine and utilize the useful features of the two modalities, generates reliable weights, and improves the reliability of multi-modal feature adaptive fusion weight estimation.
Aiming at the difficulty of acquiring quality labels, the invention designs a three-branch network to generate reliable pseudo quality labels, thereby improving the accuracy of weight learning during fusion.
The invention solves the technical problems in the prior art of poor fusion performance, low accuracy of weight learning during fusion, and unreliable estimation of multi-modal feature adaptive fusion weights.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A multi-modal adaptive fusion method based on cross-modal enhancement, characterized by comprising the following steps:
S1, training a preset number of binary classifiers to obtain a weight learning network, and inputting the fused features of preset distinct modalities into the weight learning network so as to predict and distinguish the reliability weights of the preset distinct modalities, wherein the distinct modalities comprise: a first modality and a second modality;
S2, extracting modal features from a preset backbone network, passing the modal features of the first modality through a parallel structure so as to modulate and obtain the relative scaling factor and relative displacement factor of the first modality with respect to the second modality, thereby obtaining classification-based weights through processing by the weight learning network, and guiding the other modality to learn discriminative features to obtain multi-modality training data;
and S3, performing offline training on the branch networks of the first modality, the second modality and the mixed modality according to the multi-modality training data so as to generate reliable pseudo weight labels, and estimating the reliability weights accordingly.
2. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 1, wherein in step S1 each binary classifier comprises a global average pooling layer and two fully connected layers, the first fully connected layer being followed by a ReLU activation function and the second by a Sigmoid activation function.
3. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 1, wherein step S1 comprises: setting the prediction probability of the i-th binary classifier to p_i, and predicting the preset distinct-modality reliability weight W according to the following logic:

W = (1/N) · Σ_{i=1}^{N} p_i

wherein N is the number of binary classifiers.
4. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 1, wherein the step S2 includes:
s21, enabling the modal characteristics of the first modality to sequentially pass through a first 1 multiplied by 1 convolution layer, a ReLU layer, a second 1 multiplied by 1 convolution layer and a Sigmoid layer to obtain the relative scaling factor;
s22, sequentially passing the modal characteristics of the first modality through the first 1 multiplied by 1 convolution, the ReLU layer, the second 1 multiplied by 1 convolution and the ReLU layer to obtain the relative displacement factor;
s23, the modal characteristics of the first modality are sent into the weight learning network, learning weight is given after processing, and accordingly the first modality enhancement characteristics and the second modality enhancement characteristics are obtained.
5. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 4, wherein in step S21 the relative scaling factor of the first modality with respect to the second modality is obtained through the first 1×1 convolution, the ReLU layer, the second 1×1 convolution and the Sigmoid layer using the following logic:

Scale_A = Sigmoid(conv(ReLU(conv(F_A))))

wherein Scale_* is the scaling factor, Sigmoid(·) is the Sigmoid function, conv(·) is a 1×1 convolution, ReLU(·) is the ReLU function, and F_* is the feature.
6. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 4, wherein in step S22 the relative displacement factor of the first modality with respect to the second modality is obtained using the following logic:

Shift_A = ReLU(conv(ReLU(conv(F_A))))

wherein Shift_* is the displacement factor.
7. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 1, wherein in step S23 the classification-based weights are processed using the following logic to obtain the first modality enhancement features and the second modality enhancement features:

F̂_A = F_A + W_B · (Scale_B ⊙ F_A + Shift_B)
F̂_B = F_B + W_A · (Scale_A ⊙ F_B + Shift_A)

wherein F̂_A is the feature of modality A enhanced by the features of modality B; W_B is the weight output by the classification-based weight prediction module and satisfies W_A + W_B = 1; Scale_B is the scaling factor of modality B relative to modality A obtained in step S21; Shift_B is the displacement factor of modality B relative to modality A obtained in step S22; F_A is the original feature of modality A; and ⊙ denotes element-wise multiplication.
8. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 1, wherein the step S3 includes:
s31, in a sample acquisition stage, a three-branch network is operated on different training video sequences, and fusion results of all frames are recorded;
s32, comparing the fusion results of the first branch and the second branch with real data of each frame, and respectively calculating score values;
and S33, normalizing the first branch and the second branch by using a normalization function, and processing according to the score value to obtain the reliable pseudo weight label of the current image pair.
9. The multi-modal adaptive fusion method based on cross-modal enhancement according to claim 8, wherein in step S33 the first branch and the second branch are normalized using the following logic to obtain the reliable pseudo weight label:

L = score(A, GT) / (score(A, GT) + score(B, GT))

wherein A is the output of the A branch, B is the output of the B branch, GT is the true value, score(·) may be chosen as the IOU, and L is the result of the calculation, representing the label.
10. A multi-modal adaptive fusion system based on cross-modal enhancement, characterized in that the system comprises:
the classification-based weight prediction module, configured to train a preset number of binary classifiers to obtain a weight learning network, input the fused features of preset distinct modalities into the weight learning network, and predict and distinguish the reliability weights of the preset distinct modalities accordingly, wherein the distinct modalities comprise: a first modality and a second modality;
the weighted residual error guiding module, configured to extract modal features from a preset backbone network, pass the modal features of the first modality through a parallel structure to modulate and obtain the relative scaling factor and relative displacement factor of the first modality with respect to the second modality, process them with the weight learning network to obtain classification-based weights, and guide the other modality to learn discriminative features, thereby obtaining multi-modality training data, the weighted residual error guiding module being connected with the classification-based weight prediction module;
the pseudo weight label generating module, configured to perform offline training on the branch networks of the first modality, the second modality and the mixed modality according to the multi-modality training data so as to generate reliable pseudo weight labels, whereby the reliability weights are estimated, the pseudo weight label generating module being connected with the weighted residual error guiding module.
CN202310162234.5A 2023-02-21 2023-02-21 Cross-modal enhancement-based multi-modal self-adaptive fusion method and system Pending CN116127408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310162234.5A CN116127408A (en) 2023-02-21 2023-02-21 Cross-modal enhancement-based multi-modal self-adaptive fusion method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310162234.5A CN116127408A (en) 2023-02-21 2023-02-21 Cross-modal enhancement-based multi-modal self-adaptive fusion method and system

Publications (1)

Publication Number Publication Date
CN116127408A true CN116127408A (en) 2023-05-16

Family

ID=86299118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310162234.5A Pending CN116127408A (en) 2023-02-21 2023-02-21 Cross-modal enhancement-based multi-modal self-adaptive fusion method and system

Country Status (1)

Country Link
CN (1) CN116127408A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
US20180189572A1 (en) * 2016-12-30 2018-07-05 Mitsubishi Electric Research Laboratories, Inc. Method and System for Multi-Modal Fusion Model
CN113963200A (en) * 2021-10-18 2022-01-21 郑州大学 Modal data fusion processing method, device, device and storage medium
CN114966696A (en) * 2021-12-23 2022-08-30 昆明理工大学 Transformer-based cross-modal fusion target detection method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032846A1 (en) * 2016-08-01 2018-02-01 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
US20180189572A1 (en) * 2016-12-30 2018-07-05 Mitsubishi Electric Research Laboratories, Inc. Method and System for Multi-Modal Fusion Model
CN113963200A (en) * 2021-10-18 2022-01-21 郑州大学 Modal data fusion processing method, device, device and storage medium
CN114966696A (en) * 2021-12-23 2022-08-30 昆明理工大学 Transformer-based cross-modal fusion target detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Tingting; Zhang Jianwu; Guo Chunsheng; Chen Huahua; Zhou Di; Wang Yansong; Xu Aihua: "A Survey of Image Object Detection Algorithms Based on Deep Learning", Telecommunications Science, No. 07, 20 July 2020 (2020-07-20), pages 96-110 *
Wang Futian, Zhang Shuyun, Li Chenglong, Luo Bin: "RGBT Tracking with Dynamic Modality Interaction and Adaptive Feature Fusion", Journal of Image and Graphics, Vol. 27, No. 10, 20 October 2022 (2022-10-20), pages 3010-3021 *

Similar Documents

Publication Publication Date Title
CN111160297B (en) Pedestrian Re-identification Method and Device Based on Residual Attention Mechanism Spatio-temporal Joint Model
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
US20230134967A1 (en) Method for recognizing activities using separate spatial and temporal attention weights
CN101701818B (en) Detection methods for distant obstacles
CN114387496A (en) A target detection method and electronic device
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN114419323B (en) Cross-modal learning and domain self-adaptive RGBD image semantic segmentation method
CN113361475A (en) Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing
CN112308921A (en) A dynamic SLAM method for joint optimization based on semantics and geometry
CN112949451A (en) Cross-modal target tracking method and system through modal perception feature learning
CN118865494B (en) Human body action recognition method based on space-time interest point and space-time diagram convolution
Ge et al. Vipose: Real-time visual-inertial 6d object pose tracking
Shen et al. A self‐supervised monocular depth estimation model with scale recovery and transfer learning for construction scene analysis
CN104700105B (en) unstructured outdoor terrain global detection method
He et al. A generative feature-to-image robotic vision framework for 6D pose measurement of metal parts
Hirner et al. FC-DCNN: A densely connected neural network for stereo estimation
CN120708004A (en) Intelligent sorting method for waste plastic bottles based on hyperspectral point cloud based on cross-modal image fusion
Jiang et al. Triangulate geometric constraint combined with visual-flow fusion network for accurate 6DoF pose estimation
CN119068016B (en) An RGBT target tracking method based on modality-aware feature learning
Spurr et al. Adversarial motion modelling helps semi-supervised hand pose estimation
CN113610058A (en) Facial pose enhancement interaction method for facial feature migration
CN114419529A (en) A cross-modal pedestrian re-identification method and system based on distribution space alignment
CN117994594B (en) Power operation risk identification method based on deep learning
CN116127408A (en) Cross-modal enhancement-based multi-modal self-adaptive fusion method and system
Yang et al. TMU-GAN: a compliance detection algorithm for protective equipment in power operations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination