
CN119810116A - A multi-scale attention fusion method for microscopic hyperspectral image segmentation - Google Patents

A multi-scale attention fusion method for microscopic hyperspectral image segmentation

Info

Publication number
CN119810116A
CN119810116A
Authority
CN
China
Prior art keywords
features
module
scale
attention
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411832727.6A
Other languages
Chinese (zh)
Inventor
李臣明
张慧如
高红民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202411832727.6A priority Critical patent/CN119810116A/en
Publication of CN119810116A publication Critical patent/CN119810116A/en
Pending legal-status Critical Current


Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A — TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/10Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale attention fusion method for microscopic hyperspectral image segmentation, aiming to improve the accuracy and robustness of medical image segmentation. The method combines the strengths of the Transformer and the CNN in extracting global and local features, and designs a novel fusion module, MSA (multi-scale attention), which efficiently fuses the global features of the Transformer into the local features of the CNN through multi-scale, channel attention and spatial attention mechanisms. Residual connections are introduced in the up-sampling stage of the Transformer branch to relieve the vanishing-gradient problem and enhance the expressive capacity of the network, and an SE module is added after each CNN stage to dynamically adjust feature-channel weights, enhancing useful features and suppressing redundant information. Experiments show that the method achieves excellent segmentation accuracy, robustness and efficiency on the microscopic hyperspectral image segmentation task, with particularly notable advantages on challenges such as blurred boundaries and tissue mixing in cancerous regions and image differences between patients.

Description

Multi-scale attention fusion microscopic hyperspectral image segmentation method
Technical Field
The invention relates to the fields of medical image processing and deep learning, and in particular to a deep-learning model implementing a multi-scale attention fusion method for microscopic hyperspectral image segmentation and its application to microscopic hyperspectral image segmentation.
Background
Current cancer diagnosis relies primarily on laboratory tests, imaging examinations, tumor-marker detection, and the patient's clinical presentation, where imaging examinations include ultrasound, computed tomography (CT), magnetic resonance imaging (MRI), and so on. However, these diagnostic methods still leave some possibility of non-specific findings, and histopathological examination remains the gold standard for cancer diagnosis. After a pathological section is prepared, a doctor observes it under a microscope and analyzes the tumor cells and tumor tissue. This process is often time-consuming and inefficient, and demands a high level of professional skill from the doctor to achieve sufficiently high accuracy. In particular, when facing tiny foci with high uncertainty, doctors must examine and deliberate repeatedly during diagnosis; fatigue, differences in personal diagnostic standards and other factors may consume a great deal of effort and can lead to misdiagnosis and missed diagnosis. There is therefore an urgent need for an auxiliary means that improves efficiency, reducing doctors' workload while improving the consistency and accuracy of diagnosis. Computer-aided diagnosis (CAD) is increasingly trusted by doctors because it is rapid, efficient, and independent of subjective factors.
In general, CAD-based cancer diagnosis relies mainly on conventional digital pathological section images, which carry a limited amount of information and have gradually approached the ceiling of intelligent pathological cancer diagnosis. Hyperspectral imaging is a technique for acquiring the optical characteristics of an object through spectral analysis and image processing, first applied in the field of remote sensing. Because hyperspectral imaging combines spectroscopy with traditional imaging, it can acquire both the two-dimensional spatial information and the one-dimensional spectral information of the imaged object, integrating image and spectrum in a single measurement. With the development of science and technology and cross-fertilization between disciplines, hyperspectral imaging has produced substantial research results in fields such as agricultural monitoring, food inspection and military reconnaissance. Owing to the complexity of pathological images and the demand of medical image processing for information-rich medical data, hyperspectral imaging has also developed rapidly in the field of medical image processing. Hyperspectral imaging of histopathological sections reveals, through spectral analysis of light at different wavelengths, differences in the optical characteristics of pathological and normal tissue that conventional digital images cannot represent.
At present, deep learning methods continue to make breakthroughs in medical image processing and provide new ideas for cancer diagnosis. The convolutional neural network (CNN), as an end-to-end learning model, can learn complex image features and structural information from large amounts of medical image data and thereby achieve accurate segmentation of medical images. Although CNNs have achieved great success in medical image segmentation, they have drawbacks that are difficult to overcome. A CNN is built on convolution operations with local receptive fields; although the receptive field can be gradually enlarged by stacking convolutional and pooling layers, global information may still be captured insufficiently, and long-distance dependencies are difficult to establish, especially for segmentation tasks requiring wide-range context. To address this problem, many researchers have introduced the Transformer architecture from natural language processing (NLP) into computer vision. The Transformer differs from the CNN in that it replaces convolution operations with a self-attention mechanism, which helps the model understand the global information in an image more completely. Although the Transformer excels at extracting global information, it has certain limitations in extracting local information: its self-attention mechanism ignores the spatial positional relationships of pixels in the image, so local information may be lost when processing images. To integrate local and global information effectively, many researchers have proposed using a CNN and a Transformer simultaneously for feature extraction and fusing their output features, often combining the two with convolution or similar operations to exploit the advantages of each.
However, these approaches have limitations in effectively combining local and global features.
Disclosure of Invention
The invention aims to solve the problem of effectively combining local and global features and to improve segmentation on challenges such as blurred boundaries and tissue mixing in cancerous regions and image differences between patients, and discloses a multi-scale attention fusion method for microscopic hyperspectral image segmentation. To extract features efficiently, we employ a dual-encoder framework in which one branch extracts local features using a CNN and the other captures global context information using a Transformer. The dual-encoder structure extracts more complementary features and overcomes the limitations of using a CNN or a Transformer alone. To fuse the features extracted by the encoder branches, a novel multi-scale attention fusion module, MSA, is proposed that can effectively fuse the features from the different branches.
To achieve the above purposes, the invention discloses a multi-scale attention fusion method for microscopic hyperspectral image segmentation, comprising the following steps:
Step 1, apply principal component analysis (PCA) dimensionality reduction to the input microscopic hyperspectral image, preprocess the data set, and divide the preprocessed data set into a training set, a validation set and a test set;
Step 2, extract image features with a dual-encoder architecture, in which a CNN encoder branch extracts the local features of the image and a Transformer encoder branch captures the global features and long-distance dependencies of the image;
Step 3, design a multi-scale attention fusion module (MSA), comprising a multi-scale feature fusion module (MSF) and a dual-attention module (DAM), to perform multi-scale fusion of the features extracted by the CNN and the Transformer;
Step 4, up-sample and convolve the fused features through a decoder, restore the image size, and generate the segmentation result.
As a preferable scheme of the present invention, the specific content of the step 2 is:
1.1 CNN encoder branch: a ResNet-34 model is adopted, with its first four stages used as the CNN branch structure. An SE module (Squeeze-and-Excitation module) is introduced after the output of each ResNet-34 stage to dynamically adjust the weights of the feature-map channels and enhance feature expression; the output of each CNN stage is fused with the corresponding features of the Transformer branch;
1.2 Transformer encoder branch: the Transformer branch is a model composed of a 6-layer Transformer encoder stack followed by reshape and up-sampling operations. First, the input image is divided into image blocks of size p×p, and each block is converted into a one-dimensional vector and mapped by a linear layer into an embedding of fixed dimension; global features are then captured with the multi-layer Transformer encoder, each layer of which includes a multi-head self-attention mechanism (MSA), a feed-forward network (MLP), residual connections and layer normalization. Finally, bilinear interpolation is used for up-sampling, with residual connections introduced to compensate for the loss of low-level semantic information.
As a preferred solution of the present invention, the specific content of the step 3 is:
2.1 Multi-scale feature fusion module (MSF): the features of the CNN and Transformer branches are concatenated along the channel dimension. Features under different receptive fields are then extracted using multi-scale convolutions (kernel sizes 1×1, 3×3, 5×5, 7×7). The four groups of convolved features are concatenated again along the channel dimension to form a feature representation containing multi-scale information. Finally, a 1×1 convolution maps the concatenated multi-scale features back to the original number of channels.
2.2 Dual-attention module (DAM): consists of a spatial attention module and a channel attention module, and aims to enhance feature expression by integrating spatial-location and channel information. The spatial attention module (SAM) first applies a 3×3 convolution and global average pooling to the input features, then extracts important spatial information through another 3×3 convolution. The resulting feature map is combined with the original input features by element-wise (Hadamard) product and feature addition. This module focuses on key spatial positions in the feature map, enhancing the feature expression of the target region while suppressing irrelevant regions. The channel attention module (CAM) first applies global average pooling and global max pooling to the input features to extract global channel statistics. The outputs of the two pooling operations are multiplied element-wise to obtain an importance score for each channel, which is normalized into a weight by a Sigmoid function. Finally, the channel weights are multiplied with the output features of the spatial attention module to adjust the channel response intensity of the spatial features.
As a preferred solution of the present invention, the specific content of the step 4 is:
The fused features are up-sampled by bilinear interpolation and convolved with two 3×3 convolutions and one 1×1 convolution; a residual connection is introduced after the convolution operations to enhance feature expression; the decoder output is aligned with the resolution of the input image, generating the final segmentation result.
The decoder can be expressed as G_i = Conv(Up_i(G_{i-1}) + F_i), where G_i denotes the features obtained through the i-th decoder block, Up_i(·) the i-th up-sampling operation, and F_i the output of the fusion module.
Compared with the prior art, the invention has the following beneficial technical effects:
A novel dual-encoder architecture is presented that combines the advantages of the Transformer and the convolutional neural network (CNN). While the Transformer encoder is good at capturing long-range dependencies, the CNN encoder excels at capturing local features. By fusing the two encoders, the model can handle complex image segmentation tasks more effectively.
An MSA module is presented that comprises a multi-scale feature fusion module and a dual-attention module. Combining the multi-scale fusion module with the dual-attention module, through the adaptive fusion of multi-scale features and the joint application of attention mechanisms, markedly improves the model's ability to capture complex structures and details.
The effectiveness of the proposed model is verified in a microscopic hyperspectral image segmentation task, and the result shows that the model has remarkable advantages in the aspects of processing complex structures and different patient image differences, and the potential of the model in clinical application is proved.
Drawings
FIG. 1 is a diagram of the overall structure of the model.
FIG. 2 is a Transformer encoder block diagram.
FIG. 3 is a multi-scale attention Module (MSA) schematic.
FIG. 4 is a schematic diagram of a multi-scale feature fusion Module (MSF).
Fig. 5 is a Dual Attention Module (DAM) schematic.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in fig. 1, the embodiment of the invention discloses a multi-scale attention fusion microscopic hyperspectral image segmentation method, which specifically comprises the following steps:
Step A, data set preprocessing: hyperspectral image data sets of gastric cancer and cholangiocarcinoma are selected, the data are reduced in dimension by principal component analysis (PCA), and standardized preprocessing is performed to ensure the consistency and validity of the model input.
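The PCA dimensionality reduction of step A can be sketched with NumPy as follows; the cube size, band count and number of retained components here are purely illustrative, not values from the patent.

```python
import numpy as np

def pca_reduce(cube: np.ndarray, n_components: int) -> np.ndarray:
    """Reduce the spectral dimension of an (H, W, B) hyperspectral cube
    to (H, W, n_components) via principal component analysis."""
    h, w, b = cube.shape
    x = cube.reshape(-1, b).astype(np.float64)
    x -= x.mean(axis=0)                       # center each spectral band
    # Eigen-decomposition of the B x B band covariance matrix
    cov = x.T @ x / (x.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]  # leading principal directions
    return (x @ top).reshape(h, w, n_components)

# Illustrative 32x32 cube with 60 bands reduced to 3 principal components
cube = np.random.default_rng(0).normal(size=(32, 32, 60))
reduced = pca_reduce(cube, 3)
print(reduced.shape)  # (32, 32, 3)
```

In practice the reduced cube would then be standardized (for example to zero mean and unit variance per component) before being fed to the network.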
Step B, constructing the segmentation network model: the network model comprises an encoder, a feature fusion module and a decoder, with the structure shown in figure 2.
Step C, data set division and model training: the data set is divided into a training set, a validation set and a test set in the ratio 7:1:2. The training set is used to learn model parameters, the validation set to tune hyperparameters and prevent overfitting, and the test set to evaluate final model performance.
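The 7:1:2 split described in step C can be sketched as a shuffled index partition; the sample count and random seed are illustrative assumptions.

```python
import numpy as np

def split_dataset(n_samples: int, seed: int = 0):
    """Shuffle sample indices and split them 7:1:2 into
    train / validation / test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(n_samples * 0.7)
    n_val = int(n_samples * 0.1)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(100)
print(len(train), len(val), len(test))  # 70 10 20
```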
The encoder part is a dual-branch structure comprising a CNN encoder and a Transformer encoder:
B.1.1 CNN encoder:
Multi-scale features are extracted layer by layer based on the first four stages of ResNet-34. An SE module is added after the output of each stage, dynamically adjusting channel weights to enhance important features while keeping the computational cost low. An image of input size R^(H×W×C) is processed through the ResNet-34 stages; the output of the i-th stage (i = 1, …, 4) has progressively reduced spatial resolution with the standard ResNet-34 channel counts of 64, 128, 256 and 512. The output of each stage is fused with the corresponding features of the Transformer encoder, combining local and global information.
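The SE module attached after each ResNet-34 stage can be sketched in PyTorch as follows; the reduction ratio of 16 is the common Squeeze-and-Excitation default, an assumption rather than a value stated in the patent.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: squeeze with global average pooling,
    excite with a two-layer bottleneck, then rescale the input channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: (B, C) channel statistics
        return x * w.view(b, c, 1, 1)     # excite: per-channel reweighting

x = torch.randn(2, 64, 56, 56)            # e.g. a stage-1 feature map
out = SEBlock(64)(x)
print(out.shape)  # torch.Size([2, 64, 56, 56])
```

The block is shape-preserving, so it can be dropped in after any stage output without changing the rest of the branch.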
B.1.2 Transformer encoder:
The input image of size R^(H×W×C) is divided into p×p image blocks (default p = 16), generating HW/p² patches. The patches are flattened into one-dimensional vectors and mapped by a linear embedding layer to vectors of fixed dimension. The Transformer module consists of 6 layers of encoder blocks, and the features it outputs are reshaped and up-sampled to obtain features with the same resolutions as the CNN encoder stages.
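The patch-embedding step of the Transformer branch can be sketched as follows; the embedding dimension of 768 is an illustrative assumption (the patent only fixes p = 16 as the default).

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into p x p blocks, flatten each block into a
    one-dimensional vector, and map it to a fixed-dimension embedding
    with a linear layer."""
    def __init__(self, in_ch: int = 3, p: int = 16, embed_dim: int = 768):
        super().__init__()
        self.p = p
        self.proj = nn.Linear(in_ch * p * p, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        p = self.p
        # (B, C, H, W) -> (B, C, H/p, W/p, p, p): one p x p tile per patch
        patches = x.unfold(2, p, p).unfold(3, p, p)
        # -> (B, N, C*p*p) with N = (H/p) * (W/p) patches per image
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        return self.proj(patches)  # (B, N, embed_dim) token sequence

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

The resulting token sequence would then pass through the 6-layer Transformer encoder before being reshaped and up-sampled back to the CNN resolutions.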
Step B.2, feature fusion module: simple feature concatenation, weighting or averaging cannot fully exploit the advantages of local and global features. A fusion module, MSA (shown in FIG. 3), is designed to fuse local and global features more effectively.
B.2.1, a multi-scale feature fusion module:
The multi-scale feature fusion module (shown in FIG. 4) concatenates the features extracted by the CNN and Transformer encoders along the channel dimension, then applies group convolutions with four different kernel sizes (1, 3, 5 and 7) to extract multi-scale features. After the multi-scale features are concatenated again, a 1×1 convolution maps them back to the original number of channels, maintaining input consistency while enhancing feature expression.
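The MSF module just described can be sketched in PyTorch as follows; assuming, for illustration, that the two branches carry the same channel width at a given scale (plain rather than grouped convolutions are used here to keep the sketch minimal).

```python
import torch
import torch.nn as nn

class MSF(nn.Module):
    """Multi-scale feature fusion sketch: concatenate CNN and Transformer
    features on the channel axis, apply four parallel convolutions with
    kernel sizes 1/3/5/7, concatenate their outputs, and project back to
    the original channel count with a 1x1 convolution."""
    def __init__(self, channels: int):
        super().__init__()
        cat = 2 * channels  # CNN features + Transformer features
        self.branches = nn.ModuleList([
            nn.Conv2d(cat, cat, k, padding=k // 2) for k in (1, 3, 5, 7)
        ])
        self.proj = nn.Conv2d(4 * cat, channels, 1)  # back to original width

    def forward(self, f_cnn: torch.Tensor, f_trans: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_cnn, f_trans], dim=1)             # channel concat
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        return self.proj(multi)                            # multi-scale fusion

f_cnn = torch.randn(1, 64, 28, 28)
f_trans = torch.randn(1, 64, 28, 28)
fused = MSF(64)(f_cnn, f_trans)
print(fused.shape)  # torch.Size([1, 64, 28, 28])
```

Padding of k // 2 keeps all four branches at the same spatial size so they can be concatenated directly.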
B.2.2 Dual-attention module:
The dual-attention module (shown in FIG. 5) includes a spatial attention branch and a channel attention branch. The spatial attention module (SAM) first applies a 3×3 convolution and global average pooling to the input features, then extracts important spatial information through another 3×3 convolution. The resulting feature map is combined with the original input features by element-wise (Hadamard) product and feature addition. This module focuses on key spatial positions in the feature map, enhancing the feature expression of the target region while suppressing irrelevant regions. The channel attention module (CAM) first applies global average pooling and global max pooling to the input features to extract global channel statistics. The outputs of the two pooling operations are multiplied element-wise to obtain an importance score for each channel, which is normalized into a weight by a Sigmoid function. Finally, the channel weights are multiplied with the output features of the spatial attention module to adjust the channel response intensity of the spatial features.
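A minimal sketch of the dual-attention module is given below. The exact layer ordering inside each branch is interpreted from the prose above (in particular, the pooling inside the spatial branch is taken here as pooling over the channel axis), so this is an assumption-laden reading rather than the reference implementation.

```python
import torch
import torch.nn as nn

class DAM(nn.Module):
    """Dual-attention sketch: a spatial branch (convolutions around a
    pooling step, Hadamard product plus residual addition with the input)
    followed by a channel branch (average- and max-pooled statistics
    multiplied element-wise and normalized with a sigmoid)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Spatial attention: convolve, pool over channels, convolve,
        # then reweight the input and add it back (Hadamard product + add)
        s = self.conv1(x)
        s = s.mean(dim=1, keepdim=True)            # channel-wise pooling map
        s = torch.sigmoid(self.conv2(s.expand_as(x)))
        sam_out = x * s + x
        # Channel attention: fuse average- and max-pooled channel statistics
        avg = sam_out.mean(dim=(2, 3))
        mx = sam_out.amax(dim=(2, 3))
        w = torch.sigmoid(avg * mx)                # per-channel weights
        return sam_out * w[:, :, None, None]

out = DAM(32)(torch.randn(2, 32, 14, 14))
print(out.shape)  # torch.Size([2, 32, 14, 14])
```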
B.3 decoder part:
The decoder uses bilinear-interpolation up-sampling combined with multi-layer convolution and residual connections. Specifically, the up-sampling part uses bilinear interpolation to increase resolution; the convolution part extracts features with 3×3 convolution kernels and adjusts the number of channels with a 1×1 convolution; residual connections are combined to reduce information loss. The decoder output can be written as
G_i = Conv(Up_i(G_{i-1}) + F_i),
where G_i denotes the features obtained through the i-th decoder block, Up_i(·) the i-th up-sampling operation, and F_i the output of the fusion module.
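One decoder block of this design can be sketched as follows; the channel counts and the 1×1 projection on the skip path are illustrative choices needed to make the residual addition shape-compatible, not values from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """One decoder block: bilinear up-sampling, then two 3x3 convolutions
    and a 1x1 convolution, with a residual connection around the stack."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1)  # match channels for residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(x, scale_factor=2, mode="bilinear",
                           align_corners=False)   # bilinear up-sampling
        return self.conv(up) + self.skip(up)      # residual connection

y = DecoderBlock(128, 64)(torch.randn(1, 128, 28, 28))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```

Stacking such blocks until the input resolution is restored, and ending with a per-pixel classification convolution, yields the segmentation map.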
MSA-Net is built with the PyTorch framework and trained on a single NVIDIA RTX 3090 GPU for 50 epochs with a batch size of 2. An Adam optimizer with a learning rate of 0.00006 and a cross-entropy loss function are used; multi-step learning-rate decay is applied during training, halving the learning rate at the 30th and 40th epochs.
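The training configuration just described maps directly onto PyTorch's standard components; the one-layer model below is a stand-in for MSA-Net, and "reduced to half" is interpreted as a multiplicative decay factor of 0.5 at each milestone.

```python
import torch

model = torch.nn.Conv2d(3, 2, 1)   # hypothetical stand-in for MSA-Net
optimizer = torch.optim.Adam(model.parameters(), lr=6e-5)
criterion = torch.nn.CrossEntropyLoss()
# Multi-step decay: halve the learning rate at epochs 30 and 40
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 40], gamma=0.5)

lrs = []
for epoch in range(50):
    # ... training and validation loops for one epoch would go here ...
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
print(lrs[28], lrs[29], lrs[45])
```

With this schedule the learning rate stays at 6e-5 through epoch 29, drops to 3e-5 at epoch 30, and to 1.5e-5 at epoch 40.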
Step C.2: the training set is input into the network constructed in step B for training; after training, the validation set is used for evaluation and the model parameters with the best results are saved; the best model is then used to segment the images of the test set, producing the segmentation results.
Step C.3: the segmentation performance is evaluated using global accuracy (Accuracy), mean per-class accuracy (MeanAcc), mean intersection-over-union (Mean IoU) and mean Dice coefficient (Mean Dice) as evaluation indices. These indices are defined as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
MeanAcc = (1/K) Σ_k TP_k / (TP_k + FN_k)
Mean IoU = (1/K) Σ_k TP_k / (TP_k + FP_k + FN_k)
Mean Dice = (1/K) Σ_k 2·TP_k / (2·TP_k + FP_k + FN_k)
TP (true positive): the number of samples predicted positive and actually positive. TN (true negative): the number of samples predicted negative and actually negative. FP (false positive): the number of samples predicted positive but actually negative. FN (false negative): the number of samples predicted negative but actually positive. K denotes the number of segmentation classes.
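These standard per-class definitions can be computed directly from integer label maps; the tiny 2×2 prediction and target below are illustrative.

```python
import numpy as np

def seg_metrics(pred: np.ndarray, target: np.ndarray, k: int):
    """Compute global accuracy, mean per-class accuracy, mean IoU and
    mean Dice from integer label maps with k classes."""
    acc = (pred == target).mean()
    accs, ious, dices = [], [], []
    for c in range(k):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        accs.append(tp / max(tp + fn, 1))                   # per-class accuracy
        ious.append(tp / max(tp + fp + fn, 1))              # per-class IoU
        dices.append(2 * tp / max(2 * tp + fp + fn, 1))     # per-class Dice
    return acc, float(np.mean(accs)), float(np.mean(ious)), float(np.mean(dices))

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
acc, mean_acc, miou, mdice = seg_metrics(pred, target, k=2)
print(acc, mean_acc, miou, mdice)
```

The `max(..., 1)` guards simply avoid division by zero for classes absent from both maps.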

Claims (4)

1. A multi-scale attention fusion microscopic hyperspectral image segmentation method, characterized by comprising the following steps:
step 1, applying principal component analysis (PCA) dimensionality reduction to the input microscopic hyperspectral image, preprocessing the data set, and dividing the preprocessed data set into a training set, a validation set and a test set;
step 2, extracting image features with a dual-encoder architecture, in which a CNN encoder branch extracts the local features of the image and a Transformer encoder branch captures the global features and long-distance dependencies of the image;
step 3, designing a multi-scale attention fusion module (MSA), comprising a multi-scale feature fusion module (MSF) and a dual-attention module (DAM), to perform multi-scale fusion of the features extracted by the CNN and the Transformer;
step 4, up-sampling and convolving the fused features through a decoder, restoring the image size, and generating the segmentation result.
2. The method for segmenting the multi-scale attention fusion microscopic hyperspectral image according to claim 1, wherein the specific contents of the step 2 are as follows:
1.1 CNN encoder branch: a ResNet-34 model is adopted, with its first four stages used as the CNN branch structure. An SE module (Squeeze-and-Excitation module) is introduced after the output of each ResNet-34 stage to dynamically adjust the weights of the feature-map channels and enhance feature expression; the output of each CNN stage is fused with the corresponding features of the Transformer branch;
1.2 Transformer encoder branch: the Transformer branch is a model composed of a 6-layer Transformer encoder stack followed by reshape and up-sampling operations. First, the input image is divided into image blocks of size p×p, and each block is converted into a one-dimensional vector and mapped by a linear layer into an embedding of fixed dimension; global features are then captured with the multi-layer Transformer encoder, each layer of which includes a multi-head self-attention mechanism (MSA), a feed-forward network (MLP), residual connections and layer normalization. Finally, bilinear interpolation is used for up-sampling, with residual connections introduced to compensate for the loss of low-level semantic information.
3. The multi-scale attention fusion microscopic hyperspectral image segmentation method according to claim 1, wherein the specific process of the step 3 is as follows:
2.1 Multi-scale feature fusion module (MSF): the features of the CNN and Transformer branches are concatenated along the channel dimension. Features under different receptive fields are then extracted using multi-scale convolutions (kernel sizes 1×1, 3×3, 5×5, 7×7). The four groups of convolved features are concatenated again along the channel dimension to form a feature representation containing multi-scale information. Finally, a 1×1 convolution maps the concatenated multi-scale features back to the original number of channels.
2.2 Dual-attention module (DAM): consists of a spatial attention module and a channel attention module, and aims to enhance feature expression by integrating spatial-location and channel information. The spatial attention module (SAM) first applies a 3×3 convolution and global average pooling to the input features, then extracts important spatial information through another 3×3 convolution. The resulting feature map is combined with the original input features by element-wise (Hadamard) product and feature addition. This module focuses on key spatial positions in the feature map, enhancing the feature expression of the target region while suppressing irrelevant regions. The channel attention module (CAM) first applies global average pooling and global max pooling to the input features to extract global channel statistics. The outputs of the two pooling operations are multiplied element-wise to obtain an importance score for each channel, which is normalized into a weight by a Sigmoid function. Finally, the channel weights are multiplied with the output features of the spatial attention module to adjust the channel response intensity of the spatial features.
4. The method for segmenting the multi-scale attention fusion microscopic hyperspectral image according to claim 1, wherein the specific process of the step 4 is as follows:
The fused features are up-sampled by bilinear interpolation and convolved with two 3×3 convolutions and one 1×1 convolution; a residual connection is introduced after the convolution operations to enhance feature expression; the decoder output is aligned with the resolution of the input image to generate the final segmentation result. The detailed steps are shown in formula 1:
G_i = Conv(Up_i(G_{i-1}) + F_i)    (1)
where G_i denotes the features obtained through the i-th decoder block, Up_i(·) the i-th up-sampling operation, and F_i the output of the fusion module.
CN202411832727.6A 2024-12-12 2024-12-12 A multi-scale attention fusion method for microscopic hyperspectral image segmentation Pending CN119810116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411832727.6A CN119810116A (en) 2024-12-12 2024-12-12 A multi-scale attention fusion method for microscopic hyperspectral image segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411832727.6A CN119810116A (en) 2024-12-12 2024-12-12 A multi-scale attention fusion method for microscopic hyperspectral image segmentation

Publications (1)

Publication Number Publication Date
CN119810116A true CN119810116A (en) 2025-04-11

Family

ID=95275612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411832727.6A Pending CN119810116A (en) 2024-12-12 2024-12-12 A multi-scale attention fusion method for microscopic hyperspectral image segmentation

Country Status (1)

Country Link
CN (1) CN119810116A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230184927A1 (en) * 2021-12-15 2023-06-15 Anhui University Contextual visual-based sar target detection method and apparatus, and storage medium
CN116977348A (en) * 2022-11-22 2023-10-31 新疆理工学院 Medical image segmentation method and system based on transform and multi-scale feature fusion
KR20240086318A (en) * 2022-12-09 2024-06-18 주식회사 딜리셔스 Method of image retrieval using vision transformer and apparatus thereof
CN118521784A (en) * 2024-05-15 2024-08-20 哈尔滨理工大学 Medical image segmentation model construction method based on CNN and SwinTransformer hybrid coding
CN118967714A (en) * 2024-07-29 2024-11-15 华中科技大学 Medical image segmentation model establishment method based on harmonic attention and medical image segmentation method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hu Yishan; Qin Pinle; Zeng Jianchao; Chai Rui; Wang Lifang: "Ultrasound thyroid segmentation combining segmented frequency domain and local attention", Journal of Image and Graphics, no. 10, 16 October 2020 (2020-10-16) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120355920A (en) * 2025-04-15 2025-07-22 广州大学 Medical image segmentation method based on CNN and lightweight Transformer fusion
CN120339306A (en) * 2025-06-20 2025-07-18 厦门工学院 A lightweight segmentation method and system for medical images
CN120451515A (en) * 2025-07-09 2025-08-08 湖南师范大学 A mirror detection method and system with cross-scale multi-directional feature fusion enhancement
CN120876512A (en) * 2025-07-23 2025-10-31 耕宇牧星(北京)空间科技有限公司 Image segmentation method and system based on local-global fusion and channel modulation
CN120635596A (en) * 2025-08-08 2025-09-12 思维链(天津)智能科技有限公司 A device for classifying pneumonia in chest X-ray images
CN120635596B (en) * 2025-08-08 2025-10-10 思维链(天津)智能科技有限公司 A device for classifying pneumonia in chest X-ray images

Similar Documents

Publication Publication Date Title
Chan et al. Texture-map-based branch-collaborative network for oral cancer detection
Zhou et al. LAEDNet: a lightweight attention encoder–decoder network for ultrasound medical image segmentation
CN119810116A (en) A multi-scale attention fusion method for microscopic hyperspectral image segmentation
CN114119515B (en) Brain tumor detection method based on attention mechanism and MRI multi-mode fusion
Juhong et al. Super-resolution and segmentation deep learning for breast cancer histopathology image analysis
CN111429474A (en) Mammary gland DCE-MRI image focus segmentation model establishment and segmentation method based on mixed convolution
CN118864867B (en) Medical image segmentation method based on double-branch hybrid model
Jha et al. Instance segmentation for whole slide imaging: end-to-end or detect-then-segment
Pei et al. Alzheimer’s disease diagnosis based on long-range dependency mechanism using convolutional neural network
Cheng et al. CFNet: Automatic multi-modal brain tumor segmentation through hierarchical coarse-to-fine fusion and feature communication
CN118657800A (en) Joint segmentation method of multiple lesions in retinal OCT images based on hybrid network
Nayantara et al. Automatic liver segmentation from multiphase CT using modified SegNet and ASPP module
CN117274226B (en) Enhanced UNet-based colon polyp medical image parallel diagnosis method
CN115965785B (en) Image segmentation method, device, equipment, program product and medium
Zhang et al. LMFR-Net: lightweight multi-scale feature refinement network for retinal vessel segmentation
Liu et al. SMRU-Net: skin disease image segmentation using channel-space separate attention with depthwise separable convolutions
Kusakunniran et al. Automated tongue segmentation using deep encoder-decoder model
CN119417821B (en) Gastrointestinal surgery auxiliary system and method
Juwita et al. MMPU-Net: A parameter-efficient network for fine-stage of pancreas and pancreas-tumor segmentation on CT scans
CN119515811A (en) A benign and malignant breast tumor recognition system based on a dual-branch attention distillation network
Zhang et al. SAHIS-Net: a spectral attention and feature enhancement network for microscopic hyperspectral cholangiocarcinoma image segmentation
CN118840308A (en) Prostate tumor image analysis method based on bimodal multitask learning model
Dai et al. A generative data augmentation trained by low-quality annotations for cholangiocarcinoma hyperspectral image segmentation
CN116342922A (en) Intelligent liver imaging sign analysis and LI-RADS classification system based on multi-task model
Wu et al. 3D U-TFA: A deep convolutional neural network for automatic segmentation of glioblastoma

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination