
CN118967424B - Screen shot image robust watermarking method based on attention mechanism and contrast learning - Google Patents

Screen shot image robust watermarking method based on attention mechanism and contrast learning Download PDF

Info

Publication number
CN118967424B
Authority
CN
China
Prior art keywords
feature map
image
layer
discriminator
watermark
Prior art date
Legal status
Active
Application number
CN202411448920.XA
Other languages
Chinese (zh)
Other versions
CN118967424A (en)
Inventor
高光勇
李力
陈晓安
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202411448920.XA priority Critical patent/CN118967424B/en
Publication of CN118967424A publication Critical patent/CN118967424A/en
Application granted granted Critical
Publication of CN118967424B publication Critical patent/CN118967424B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/0021 Image watermarking
    • G06T 1/005 Robust watermarking, e.g. average attack or collusion attack resistant
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/094 Adversarial learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 Image coding
    • G06T 9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2201/00 General purpose image data processing
    • G06T 2201/005 Image watermarking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Editing Of Facsimile Originals (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a robust watermarking method for screen-shot images based on an attention mechanism and contrastive learning. An encoder generates an encoded image containing watermark information; the encoded image and the carrier image are input into a discriminator, which outputs a predicted value; the encoded image undergoes distortion simulation, and the distorted encoded image is input into a decoder that extracts the watermark information hidden within it; the encoder, the discriminator, and the decoder are trained according to the predicted value of the discriminator and a joint loss function; and the trained encoder and decoder form a screen-shot watermark model, which encodes and decodes screen-shot images. By optimizing the watermark image encoding process, the method effectively enhances the robustness of the watermark model in real scenes while guaranteeing the invisibility of the encoded image, and better preserves the integrity of the watermark information under screen-shot noise.

Description

Screen shot image robust watermarking method based on attention mechanism and contrast learning
Technical Field
The invention relates to screen-shot image watermarking technology, and in particular to a robust watermarking method for screen-shot images based on an attention mechanism and contrastive learning.
Background
With the rapid iteration of intelligent devices, the circulation and spread of digital information have become unprecedentedly convenient. Digital copyright protection has matured under fast-developing multimedia technology, but copyright protection in screen-shooting scenarios still has gaps, such as piracy of film and television works, medical data, and military secrets. Traditional electronic watermarking technology cannot effectively prevent such malicious behavior, which poses new problems and challenges for the field of image watermarking.
Most traditional digital watermarking schemes are designed for electronic-channel propagation: their goal is to keep the watermark extractable when the watermarked image faces electronic-channel noise such as Gaussian noise, JPEG compression, and color distortion. However, the imaging principles of shooting in real scenes differ greatly from the distortion principles of electronic channels, so the watermarks of traditional schemes can hardly resist such distortion. The screen-shooting scenario introduces many physical distortions, such as lens distortion, illumination distortion, motion blur, and moiré effects, that do not exist in the traditional electronic-channel domain. Screen-shooting robust watermarking therefore emerged; its purpose is to allow the hidden watermark information to be read smoothly after the watermarked image is photographed from a screen, ensuring the security of user data.
The advent of deep learning has powerfully aided research on screen-shooting robust watermarking. Generative adversarial networks in particular offer a new idea for watermarking research: through adversarial training of the encoder and the discriminator, the embedding positions of the watermark information can be fitted to the image features as closely as possible. In addition, a noise layer simulates specific noise that may occur in real scenes, enhancing the decoder's robustness to that noise. However, the prior art still has the following problems:
1. Insufficient robustness: after an encoded image generated by the prior art is subjected to a screen-shooting attack, the watermark extraction rate may still be insufficient.
2. Insufficient invisibility: the embedding positions of the watermark in encoded images generated by the prior art fit the image features poorly, so the similarity to the original image is low and a user can easily observe with the naked eye that the image may contain a watermark.
3. Deficient model architecture: prior-art model architectures suffer from problems such as mode collapse and vanishing gradients, which cap the upper limit of model performance and affect the overall performance of the model.
Disclosure of Invention
In view of the above problems, the invention aims to provide a screen-shot image robust watermarking method based on an attention mechanism and contrastive learning.
The screen-shot image robust watermarking method based on an attention mechanism and contrastive learning disclosed by the invention comprises the following steps:
Inputting the carrier image and the watermark information into an encoder, and generating an encoded image containing the watermark information by the encoder;
inputting the encoded image and the carrier image into a discriminator, and outputting a predicted value using the discriminator;
Performing distortion simulation on the coded image to obtain a distorted coded image;
Inputting the distorted coded image into a decoder, and extracting watermark information hidden in the distorted coded image;
Performing model training on the encoder, the discriminator, and the decoder according to the predicted value of the discriminator and a joint loss function, forming a screen-shot watermark model from the trained encoder and decoder, and encoding and decoding screen-shot images with the screen-shot watermark model.
Further, before inputting the carrier image and the watermark information into the encoder, the method comprises:
randomly extracting n_0 numerical elements from a standard uniform distribution on the interval [0, 1), setting values greater than 0.5 to 1 and values not greater than 0.5 to 0, to form binary watermark ciphertext as the watermark information;
resizing the original carrier image to n_1 × n_1 as the carrier image.
Further, inputting the carrier image and the watermark information into the encoder comprises:
performing three downsampling operations on the carrier image to obtain local feature maps F1, F2, and F3 in sequence, with a max-pooling operation after each downsampling; obtaining a local feature map F4 after the third max-pooling operation, then performing one global average pooling to obtain a global feature map F5; concatenating the local feature map F4 and the global feature map F5 to obtain a feature map F6;
passing the watermark information through a fully connected layer to obtain a watermark tensor M whose dimensions are the same as those of the feature map F6;
concatenating the local feature map F4, the feature map F6, and the watermark tensor M in the channel dimension and performing one upsampling to obtain a feature map D4;
concatenating the local feature map F3, the feature map D4, and the watermark tensor M in the channel dimension and passing them through a secondary convolution layer and an upsampling layer to obtain a feature map D3;
concatenating the local feature map F2, the feature map D3, and the watermark tensor M in the channel dimension and passing them through a secondary convolution layer and an upsampling layer to obtain a feature map D2;
concatenating the local feature map F1, the feature map D2, and the watermark tensor M in the channel dimension and passing them through a secondary convolution layer and a 1×1 convolution layer to obtain the encoded image D1.
Further, performing the three downsampling operations on the carrier image comprises:
at each downsampling, passing the carrier image sequentially through a 3×3 convolution layer, a batch normalization layer, a first activation function layer, a 3×3 convolution layer, a batch normalization layer, a second activation function layer, and an HWC attention module;
the HWC attention module comprising an HAttention module, a WAttention module, and a CAttention module;
in the HAttention module, the input feature map x undergoes an adaptive max-pooling operation and an adaptive global average-pooling operation that compress the width to 1, yielding a feature map max_h and a feature map avg_h; max_h passes sequentially through a 1×1 convolution layer, an activation function layer, and a 1×1 convolution layer to obtain a feature map se(max_h); avg_h passes sequentially through a 1×1 convolution layer, an activation function layer, and a 1×1 convolution layer to obtain a feature map se(avg_h); se(max_h) and se(avg_h) are concatenated and passed through an activation function layer to obtain the feature map A_h;
in the WAttention module, the input feature map x undergoes adaptive max pooling and adaptive global average pooling that compress the height to 1, yielding a feature map max_w and a feature map avg_w; max_w passes sequentially through a 1×1 convolution layer, an activation function layer, and a 1×1 convolution layer to obtain a feature map se(max_w); avg_w passes sequentially through a 1×1 convolution layer, an activation function layer, and a 1×1 convolution layer to obtain a feature map se(avg_w); se(max_w) and se(avg_w) are concatenated and passed through an activation function layer to obtain the feature map A_w;
in the CAttention module, the input feature map x is compressed to one channel by an adaptive max-pooling layer and an adaptive global average-pooling layer respectively; the two resulting tensors are concatenated in the channel dimension, reduced to one channel by a 1×1 convolution layer, and passed through an activation function layer to obtain the feature map A_c;
the input feature map x is multiplied by the feature map A_h and the feature map A_w in the height and width dimensions, then by the feature map A_c in the channel dimension, and the result is added to the input feature map x as the output of the HWC attention module.
Further, the discriminator is a spectrally normalized generative adversarial network; the carrier image and the watermarked encoded image are input to the spectrally normalized generative adversarial network, the output of the discriminator is true or false, the output result of the discriminator is input to the encoder, and the encoder loss function L_C and the discriminator loss function L_E are calculated as:
L_C = α·L_nce(Ĉ, E, X_r, X_w) + β·L_gan(Ĉ, E, X_w),
L_E = L_gan(C, Ê, X_r, X_w),
where α and β are training hyper-parameters, X_r and X_w represent the carrier image and the encoded image respectively, L_nce and L_gan represent the NCE loss and the hinge loss respectively, C represents the weight parameters of the discriminator, Ĉ represents the weight parameters of the fixed discriminator, E represents the weight parameters of the encoder, and Ê represents the weight parameters of the fixed encoder.
Further, performing distortion simulation on the encoded image comprises:
randomly perturbing the four corners of the encoded image using a perspective transformation, then bilinearly resampling the encoded image to create a perspective-warped image;
performing illumination distortion and moiré distortion simulation on the perspective-warped image using an illumination simulation function and a moiré simulation function;
and simulating interference in real scenes using Gaussian noise.
Further, the decoder includes 3 single convolution blocks, 3 residual convolution blocks, 1 single convolution block, 6 residual convolution blocks, 1 single convolution block, and 1 full connection layer, which are sequentially arranged.
Further, after extracting the watermark information hidden in the distorted encoded image, the method comprises:
calculating a decoder loss function L_D from the decoded watermark information and the original watermark information:
L_D = MSE(M, M_d) = MSE(M, D(γ_D, I_n)),
where M represents the original watermark information, M_d represents the decoded watermark information, γ_D represents the decoding hyper-parameters, I_n represents the distorted encoded image, and D(γ_D, I_n) represents the decoder decoding the distorted encoded image.
Further, the joint loss function is composed of the encoder loss function, the discriminator loss function, and the decoder loss function:
L = λ_1·L_C + λ_2·L_E + λ_3·L_D,
where λ_1, λ_2, and λ_3 are the weight parameters of the corresponding loss functions.
Further, model training the encoder, the discriminator, and the decoder based on the predictor of the discriminator and the joint loss function comprises:
The encoder, the discriminator and the decoder are input to an Adam optimizer for iterative training, the maximum iteration number is set, and the joint loss function is utilized for back propagation.
Compared with the prior art, the invention has the following notable advantages:
1. the invention improves encoder performance by designing a new multi-branch convolution and HWC attention mechanism module, thereby improving the invisibility of the encoded watermark;
2. the invention alleviates the mode-collapse problem of traditional model architectures through contrastive learning and a multi-discriminator module, raising the upper limit of model training and enhancing the performance of the trained model;
3. by providing a screen-shot distortion simulation layer, the invention simulates several distortions that may occur during screen shooting in real scenes, trains the decoder with this layer, and improves the decoder's robustness to screen-shot distortion.
Drawings
FIG. 1 is a flow diagram of the screen-shot image robust watermarking method based on an attention mechanism and contrastive learning;
FIG. 2 is a schematic diagram of the structure of the HWC attention mechanism module;
FIG. 3 is a schematic diagram of positive and negative sample pairing in the contrastive learning process;
FIG. 4 shows moiré images under noise interference of different intensities;
FIG. 5 shows screen-shot images under interference at different angles.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent.
The screen-shot image robust watermarking method based on an attention mechanism and contrastive learning in this embodiment includes at least steps 1 to 5; the flowchart is shown in fig. 1.
Step 1, inputting a carrier image and watermark information into an encoder, and generating an encoded image containing the watermark information by using the encoder;
Step 2, inputting the coded image and the carrier image into a discriminator, and outputting a predicted value by using the discriminator;
step 3, performing distortion simulation on the coded image to obtain a distorted coded image;
Step 4, inputting the distorted coded image into a decoder, and extracting watermark information hidden in the distorted coded image;
Step 5, performing model training on the encoder, the discriminator, and the decoder according to the predicted value of the discriminator and the joint loss function, forming a screen-shot watermark model from the trained encoder and decoder, and encoding and decoding screen-shot images with the screen-shot watermark model.
Further, before inputting the carrier image and the watermark information into the encoder, the method comprises:
randomly extracting n_0 numerical elements from a standard uniform distribution on the interval [0, 1), setting values greater than 0.5 to 1 and values not greater than 0.5 to 0, to form binary watermark ciphertext as the watermark information;
resizing the original carrier image to n_1 × n_1 as the carrier image.
In one example, 30 elements may be randomly extracted from a standard uniform distribution on the interval [0, 1), matching the input dimension of the encoder, to form the watermark information to be embedded into the carrier image; n_1 takes the value 128, i.e., the original carrier image is resized to 128×128. The processed watermark information and carrier image are then input into the encoder for encoding, yielding an encoded image containing the watermark information.
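The following minimal PyTorch sketch illustrates this preprocessing step; the helper names make_watermark and prepare_carrier are illustrative and not from the patent:

```python
import torch
import torch.nn.functional as F

def make_watermark(n_bits: int = 30) -> torch.Tensor:
    """Draw n_bits samples from U[0, 1) and threshold at 0.5 into {0, 1} bits."""
    u = torch.rand(n_bits)      # standard uniform on [0, 1)
    return (u > 0.5).float()    # values > 0.5 -> 1, otherwise 0

def prepare_carrier(img: torch.Tensor, size: int = 128) -> torch.Tensor:
    """Resize an RGB carrier image (C, H, W) to size x size."""
    return F.interpolate(img.unsqueeze(0), size=(size, size),
                         mode="bilinear", align_corners=False).squeeze(0)

watermark = make_watermark(30)                       # 30-bit message
carrier = prepare_carrier(torch.rand(3, 256, 256))   # stand-in carrier image
```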
Further, in step 1, inputting the carrier image and the watermark information into the encoder comprises:
in the encoder, performing three downsampling operations on the carrier image to obtain local feature maps F1, F2, and F3 in sequence, with a max-pooling operation after each downsampling; obtaining a local feature map F4 after the third max-pooling operation, then performing one global average pooling to obtain a global feature map F5; concatenating the local feature map F4 and the global feature map F5 to obtain a feature map F6;
passing the watermark information through a fully connected layer to obtain a watermark tensor M whose dimensions are the same as those of the feature map F6;
concatenating the local feature map F4, the feature map F6, and the watermark tensor M in the channel dimension and performing one upsampling to obtain a feature map D4;
concatenating the local feature map F3, the feature map D4, and the watermark tensor M in the channel dimension and passing them through a secondary convolution layer and an upsampling layer to obtain a feature map D3;
concatenating the local feature map F2, the feature map D3, and the watermark tensor M in the channel dimension and passing them through a secondary convolution layer and an upsampling layer to obtain a feature map D2;
concatenating the local feature map F1, the feature map D2, and the watermark tensor M in the channel dimension and passing them through a secondary convolution layer and a 1×1 convolution layer to obtain the encoded image D1.
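The fusion pattern of one such decoding-path step can be sketched as follows; the channel widths and spatial sizes are assumptions chosen only for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

msg_len, c, h, w = 30, 64, 16, 16
fc = nn.Linear(msg_len, c * h * w)                   # fully connected layer
message = torch.randint(0, 2, (1, msg_len)).float()  # binary watermark bits
m = fc(message).view(1, c, h, w)                     # watermark tensor M, shaped like F6
f4 = torch.rand(1, c, h, w)                          # stand-in local feature map F4
f6 = torch.rand(1, c, h, w)                          # stand-in feature map F6
d4 = F.interpolate(torch.cat([f4, f6, m], dim=1),    # channel-dimension concatenation
                   scale_factor=2, mode="nearest")   # one upsampling -> D4
print(d4.shape)  # torch.Size([1, 192, 32, 32])
```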
Specifically, performing the downsampling operation on the carrier image three times includes:
at each downsampling, the carrier image is sequentially passed through a 3×3 convolution layer, a batch normalization layer, a first activation function layer, a 3×3 convolution layer, a batch normalization layer, a second activation function layer, and a HWC attention module.
As shown in fig. 2, the HWC attention module includes HAttention, WAttention, and CAttention modules;
In the HAttention module, the input feature map x of the HWC attention module undergoes an adaptive max-pooling operation and an adaptive global average-pooling operation that compress the width to 1, i.e., the feature map size changes from (B, C, H, W) to (B, C, H, 1), where B, C, H, and W denote the training batch, channel, height, and width dimensions respectively, yielding a feature map max_h and a feature map avg_h. The feature map max_h passes sequentially through a 1×1 convolution layer, a ReLU activation function layer, and a 1×1 convolution layer to obtain the feature map se(max_h); the feature map avg_h passes through the same sequence to obtain the feature map se(avg_h). The feature maps se(max_h) and se(avg_h) are concatenated, and the feature map A_h is obtained after a Sigmoid activation function layer. The number of channels is reduced from C1 to C2 by the first 1×1 convolution layer and restored from C2 to the original C1 by the second.
In the WAttention module, the input feature map x undergoes adaptive max pooling and adaptive global average pooling that compress the height to 1, i.e., the feature map size changes from (B, C, H, W) to (B, C, 1, W), yielding a feature map max_w and a feature map avg_w. The feature map max_w passes sequentially through a 1×1 convolution layer, a ReLU activation function layer, and a 1×1 convolution layer to obtain the feature map se(max_w); the feature map avg_w passes through the same sequence to obtain the feature map se(avg_w). The feature maps se(max_w) and se(avg_w) are concatenated, and the feature map A_w is obtained after a Sigmoid activation function layer. As in the HAttention module, the number of channels is reduced from C1 to C2 by the first convolution layer and restored from C2 to C1 by the second.
In the CAttention module, the input feature map x is compressed to one channel by an adaptive max-pooling layer and an adaptive global average-pooling layer respectively, i.e., the feature map size changes from (B, C, H, W) to (B, 1, H, W); the two resulting tensors are concatenated in the channel dimension, reduced to one channel by a 1×1 convolution layer, and passed through a Sigmoid activation function layer to obtain the feature map A_c.
The input feature map x is multiplied by the feature map A_h and the feature map A_w in the height and width dimensions, then by the feature map A_c in the channel dimension, and the result is added to the input feature map x as the output of the HWC attention module.
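A compact PyTorch sketch of this module follows. Two details are assumptions: the max- and average-pooled branches share one 1×1 bottleneck per direction, and the two branch outputs are fused by element-wise addition before the Sigmoid (the text above splices them before the activation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HWCAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        def bottleneck(c):  # 1x1 conv -> ReLU -> 1x1 conv (C1 -> C2 -> C1)
            return nn.Sequential(
                nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
                nn.Conv2d(c // reduction, c, 1))
        self.se_h = bottleneck(channels)   # HAttention branch
        self.se_w = bottleneck(channels)   # WAttention branch
        self.conv_c = nn.Conv2d(2, 1, 1)   # CAttention fusion conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # HAttention: compress width to 1 -> (B, C, H, 1)
        a_h = torch.sigmoid(self.se_h(F.adaptive_max_pool2d(x, (h, 1)))
                            + self.se_h(F.adaptive_avg_pool2d(x, (h, 1))))
        # WAttention: compress height to 1 -> (B, C, 1, W)
        a_w = torch.sigmoid(self.se_w(F.adaptive_max_pool2d(x, (1, w)))
                            + self.se_w(F.adaptive_avg_pool2d(x, (1, w))))
        # CAttention: compress channels to 1 -> (B, 1, H, W), then fuse
        max_c = torch.max(x, dim=1, keepdim=True).values
        avg_c = torch.mean(x, dim=1, keepdim=True)
        a_c = torch.sigmoid(self.conv_c(torch.cat([max_c, avg_c], dim=1)))
        # Apply the three attention maps by broadcasting, with a residual add
        return x * a_h * a_w * a_c + x

y = HWCAttention(64)(torch.rand(2, 64, 32, 32))  # output shape (2, 64, 32, 32)
```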
In the encoder, the convolution layers use Kaiming normal initialization for their weights; the batch normalization layers initialize weights to the constant 1 and biases to the constant 0; and the fully connected layers use normal initialization with a standard deviation of 0.001 and biases initialized to the constant 0.
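A sketch of this initialization scheme, applied via Module.apply (zeroing the convolution bias is an assumption, since the text does not mention it):

```python
import torch.nn as nn

def init_weights(m: nn.Module) -> None:
    """Kaiming-normal conv weights, constant 1/0 batch-norm weight/bias,
    N(0, 0.001^2) weights and zero bias for fully connected layers."""
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight)
        if m.bias is not None:          # bias handling assumed
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1.0)
        nn.init.constant_(m.bias, 0.0)
    elif isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.001)
        nn.init.zeros_(m.bias)

# usage: encoder.apply(init_weights)
```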
The main function of the HWC module is to let the encoder perform attention operations in the three dimensions of width, height, and channel through the three sub-modules HAttention, WAttention, and CAttention, which enhances the perception of the features of the input carrier image, better determines the watermark embedding positions, and improves the quality of the encoded image. In addition, the weight initialization described above prevents conditions such as vanishing and exploding gradients from interfering with the normal training process.
The discriminator is a spectrally normalized generative adversarial network SNGAN; the carrier image and the watermarked encoded image are input to the spectrally normalized generative adversarial network, the output of the discriminator is true or false, the output result of the discriminator is input to the encoder, and the encoder loss function L_C and the discriminator loss function L_E are calculated as:
L_C = α·L_nce(Ĉ, E, X_r, X_w) + β·L_gan(Ĉ, E, X_w),
L_E = L_gan(C, Ê, X_r, X_w),
where α and β are training hyper-parameters, X_r and X_w represent the carrier image and the encoded image respectively, L_nce and L_gan represent the NCE loss and the hinge loss respectively, C represents the weight parameters of the discriminator, Ĉ represents the weight parameters of the fixed discriminator, E represents the weight parameters of the encoder, and Ê represents the weight parameters of the fixed encoder.
The discriminator SNGAN comprises multiple local feature discriminators and a global feature discriminator which, after extracting the local and global features, project them into a higher-dimensional reproducing kernel Hilbert space (RKHS), capturing the similarity between global and local features with linearly evaluated values, as shown in fig. 1. These projected features then pass through a contrastive-learning sample-pairing stage that creates positive/negative sample pairs, as shown in fig. 3: the local and global features of an M×M input image form a positive pair, while other images in the same batch and images from different batches form negative pairs. These positive and negative pairs are used to compute the NCE loss in the loss function, and finally true or false is output and fed into the next round of the encoder.
The multiple local feature discriminators and the global feature discriminator in this example adopt a multi-discriminator structure, alleviating the catastrophic forgetting that easily occurs in a traditional single discriminator, while contrastive-learning sample pairing and the NCE loss prevent mode collapse in the encoder. These two methods address two common problems that prevent normal training in the prior art, optimize the architecture of the training model, and raise the upper limit of model performance.
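A minimal sketch of the NCE loss over such positive/negative pairs, assuming the RKHS projection heads have already produced one global and one pooled local feature vector per image (the multi-discriminator wiring is omitted):

```python
import torch
import torch.nn.functional as F

def info_nce(global_feat: torch.Tensor, local_feat: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Each image's (global, local) projection pair is the positive;
    all other images in the batch serve as negatives."""
    g = F.normalize(global_feat, dim=1)    # (B, D)
    l = F.normalize(local_feat, dim=1)     # (B, D)
    logits = g @ l.t() / temperature       # (B, B) similarity matrix
    labels = torch.arange(g.size(0), device=g.device)  # diagonal positives
    return F.cross_entropy(logits, labels)

loss_nce = info_nce(torch.rand(16, 128), torch.rand(16, 128))
```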
Further, in step 3, performing distortion simulation on the encoded image includes:
S31, randomly perturbing the four corners of the encoded image using a perspective transformation, then bilinearly resampling the encoded image to create a perspective-warped image;
S32, performing illumination distortion and moiré distortion simulation on the perspective-warped image using an illumination simulation function and a moiré simulation function;
S33, simulating interference in real scenes using Gaussian noise.
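The perspective and Gaussian-noise parts of this distortion layer can be sketched as follows; simulate_screen_shot, max_jitter, and noise_std are illustrative names, and the patent's illumination and moiré simulation functions are not reproduced here:

```python
import random
import torch
import torchvision.transforms.functional as TF

def simulate_screen_shot(encoded: torch.Tensor, max_jitter: int = 8,
                         noise_std: float = 0.02) -> torch.Tensor:
    """Randomly perturb the four corners (perspective warp with bilinear
    resampling), then add Gaussian noise; illumination and moire
    simulation would be applied between these two steps."""
    h, w = encoded.shape[-2], encoded.shape[-1]
    corners = [[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]]
    jittered = [[x + random.randint(-max_jitter, max_jitter),
                 y + random.randint(-max_jitter, max_jitter)]
                for x, y in corners]
    warped = TF.perspective(encoded, corners, jittered)  # bilinear by default
    return warped + noise_std * torch.randn_like(warped)

distorted = simulate_screen_shot(torch.rand(3, 128, 128))
```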
In step 4, the watermark information hidden in the distorted encoded image is extracted using the decoder. The decoder comprises 3 single convolution blocks, 3 residual convolution blocks, 1 single convolution block, 6 residual convolution blocks, 1 single convolution block, and 1 fully connected layer arranged in sequence.
The single convolution block consists of a 3×3 convolution, batch normalization, and an activation function arranged in sequence. The residual convolution block consists of a 3×3 convolution, batch normalization, an activation function, a 3×3 convolution, batch normalization, and a skip connection arranged in sequence; its final output is the sum of the skip connection and the result of the convolution path.
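Under the assumption that the activation function is ReLU (the text does not name it), the two block types can be sketched as:

```python
import torch
import torch.nn as nn

def single_conv_block(c_in: int, c_out: int) -> nn.Sequential:
    """3x3 convolution -> batch normalization -> activation."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class ResidualConvBlock(nn.Module):
    """3x3 conv -> BN -> activation -> 3x3 conv -> BN, with the skip
    connection added to the convolution-path result."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)

out = ResidualConvBlock(32)(torch.rand(1, 32, 64, 64))  # shape preserved
```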
Further, after extracting the watermark information hidden in the distorted encoded image, the method comprises:
calculating a decoder loss function L_D from the decoded watermark information and the original watermark information:
L_D = MSE(M, M_d) = MSE(M, D(γ_D, I_n)),
where M represents the original watermark information, M_d represents the decoded watermark information, γ_D represents the decoding hyper-parameters, I_n represents the distorted encoded image, and D(γ_D, I_n) represents the decoder decoding the distorted encoded image.
Further, the joint loss function is composed of the encoder loss function, the discriminator loss function, and the decoder loss function:
L = λ_1·L_C + λ_2·L_E + λ_3·L_D,
where λ_1, λ_2, and λ_3 are the weight parameters of the corresponding loss functions. In this example, λ_1, λ_2, and λ_3 can be set to 0.5, 0.001, and 3, respectively.
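The weighted combination with the example weights can be written directly:

```python
import torch

def joint_loss(loss_c: torch.Tensor, loss_e: torch.Tensor,
               loss_d: torch.Tensor,
               lambdas: tuple = (0.5, 0.001, 3.0)) -> torch.Tensor:
    """L = lambda1*L_C + lambda2*L_E + lambda3*L_D."""
    l1, l2, l3 = lambdas
    return l1 * loss_c + l2 * loss_e + l3 * loss_d

print(joint_loss(torch.tensor(0.2), torch.tensor(1.1), torch.tensor(0.05)))
```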
In step 5, performing model training on the encoder, the discriminator, and the decoder according to the predicted value of the discriminator and the joint loss function comprises:
inputting the encoder, the discriminator, and the decoder into an Adam optimizer for iterative training, setting the maximum number of iterations, and performing back propagation with the joint loss function to obtain the final trained model.
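A hypothetical training skeleton is sketched below. The encoder, discriminator, decoder, and loader objects, the learning rate, and the simplification of updating the discriminator with a hinge loss before the encoder/decoder step are all assumptions; simulate_screen_shot and joint_loss refer to the sketches above:

```python
import itertools
import torch
import torch.nn.functional as F

# encoder, discriminator, decoder, and loader are assumed to be defined
opt = torch.optim.Adam(itertools.chain(encoder.parameters(),
                                       decoder.parameters()), lr=1e-4)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
max_iters = 10_000  # illustrative maximum iteration count

for step, (carrier, message) in enumerate(loader):
    if step >= max_iters:
        break
    encoded = encoder(carrier, message)
    # discriminator step: hinge loss on carrier (real) vs encoded (fake)
    loss_e = (F.relu(1 - discriminator(carrier)).mean()
              + F.relu(1 + discriminator(encoded.detach())).mean())
    opt_disc.zero_grad(); loss_e.backward(); opt_disc.step()
    # encoder/decoder step driven by the joint loss
    decoded = decoder(simulate_screen_shot(encoded))
    loss_c = -discriminator(encoded).mean()   # adversarial term for encoder
    loss_d = F.mse_loss(decoded, message)     # watermark recovery term
    loss = joint_loss(loss_c, loss_e.detach(), loss_d)  # detach: no grad to D
    opt.zero_grad(); loss.backward(); opt.step()
```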
The trained encoder and decoder form the screen-shot watermark model. The carrier image and watermark information are input into the encoder of the screen-shot watermark model to obtain an encoded image, which is shown on a display; after the screen is photographed with a mobile phone, the decoder of the screen-shot watermark model decodes the watermark information in the screen-shot image, yielding the decoded watermark information.
To verify the effectiveness of the method of the invention, the method of the invention was tested as follows:
For the experiments, 15000 images were randomly selected from the COCO training set as the training set and 500 images were randomly selected from the COCO test set as the test set; PyTorch was chosen as the programming framework, an NVIDIA RTX 3070 GPU served as the training device, AOC LV273HUPR and CSO 1609 displays were used for the experiments, and Realme RMX3366 and HUAWEI DBY-W09 devices were used for shooting. The number of training rounds was set to 100 and the batch size to 16. "The invention" in tables 1 to 4 below denotes the screen-shot watermark model proposed herein; the other models compared are existing models, including the StegaStamp (StegaStamp: Invisible Hyperlinks in Physical Photographs) model, the RIHOOP (RIHOOP: Robust Invisible Hyperlinks in Offline and Online Photographs) model, and the PIMoG (PIMoG: An Effective Screen-Shooting Noise-Layer Simulation for Deep-Learning-Based Watermarking Network) model. The average bit correct extraction rate is selected as the evaluation index for the robustness experiments.
Table 1 gives the robustness results of the proposed model against moiré noise compared with several other models. Since moiré noise intensity has no fixed evaluation index, the consistency of the moiré noise was ensured in this example by fixing the distance and angle of the device: the experiments were carried out at a fixed distance of 20 cm and an angle of 0 degrees, i.e., facing the screen. Moiré images under strong, medium, and weak noise interference are shown in fig. 4.
TABLE 1 Moire noise test results at different intensities
From table 1 it can be seen that the screen-shot watermark model of the invention shows superior performance, with an average extraction rate against moiré noise of up to 99.019%.
Table 2 shows the robustness experiments of the screen-shot watermark model and other models under interference at 40 degrees left, 20 degrees left, 0 degrees, 20 degrees right, and 40 degrees right; screen-shot images under interference at different angles are shown in fig. 5. The data in table 2 show that the proposed model performs best, with a higher extraction rate than the other existing models under all five angles.
Table 2 results of robustness experiments at different angles
Table 3 shows the robustness experiments of the screen-shot watermark model and other comparison models under different devices; the experimental results show that the proposed model outperforms existing schemes on devices of different brands and has higher robustness.
TABLE 3 results of experiments with different devices
Table 4 compares the image quality results of the proposed model with the other comparison models in terms of peak signal-to-noise ratio (PSNR) and structural similarity (SSIM); the image quality of the invention is superior to existing schemes, indicating the effectiveness of the proposed scheme.
TABLE 4 results of image quality experiments
The above comparisons show that the watermarking method of the invention guarantees the quality of the encoded image while achieving a higher extraction rate than the other models, providing both invisibility and robustness.

Claims (9)

1. A screen-shot image robust watermarking method based on an attention mechanism and contrastive learning, characterized by comprising:
inputting a carrier image and watermark information into an encoder, and generating an encoded image containing the watermark information with the encoder;
inputting the encoded image and the carrier image into a discriminator, and outputting a predicted value with the discriminator;
performing distortion simulation on the encoded image to obtain a distorted encoded image;
inputting the distorted encoded image into a decoder, and extracting the watermark information hidden in the distorted encoded image;
performing model training on the encoder, the discriminator, and the decoder according to the predicted value of the discriminator and a joint loss function, forming a screen-shot watermark model from the trained encoder and decoder, and encoding and decoding screen-shot images with the screen-shot watermark model;
wherein performing model training on the encoder, the discriminator, and the decoder according to the predicted value of the discriminator and the joint loss function comprises:
inputting the encoder, the discriminator, and the decoder into an Adam optimizer for iterative training, setting a maximum number of iterations, and performing back propagation with the joint loss function.
2. The screen-shot image robust watermarking method based on an attention mechanism and contrastive learning according to claim 1, characterized in that before inputting the carrier image and the watermark information into the encoder, the method comprises:
randomly extracting n_0 numerical elements from a standard uniform distribution on the interval [0, 1), setting values greater than 0.5 to 1 and values not greater than 0.5 to 0, to form binary watermark ciphertext as the watermark information;
resizing the original carrier image to n_1 × n_1 as the carrier image.
3. The screen-shot image robust watermarking method based on an attention mechanism and contrastive learning according to claim 2, characterized in that inputting the carrier image and the watermark information into the encoder comprises:
performing three downsampling operations on the carrier image to obtain local feature maps F1, F2, and F3 in sequence, with a max-pooling operation after each downsampling; obtaining a local feature map F4 after the third max-pooling operation, then performing one global average pooling to obtain a global feature map F5; concatenating the local feature map F4 and the global feature map F5 to obtain a feature map F6;
passing the watermark information through a fully connected layer to obtain a watermark tensor M whose dimensions are the same as those of the feature map F6;
concatenating the local feature map F4, the feature map F6, and the watermark tensor M in the channel dimension and performing one upsampling to obtain a feature map D4;
concatenating the local feature map F3, the feature map D4, and the watermark tensor M in the channel dimension and passing them through a secondary convolution layer and an upsampling layer to obtain a feature map D3;
concatenating the local feature map F2, the feature map D3, and the watermark tensor M in the channel dimension and passing them through a secondary convolution layer and an upsampling layer to obtain a feature map D2;
concatenating the local feature map F1, the feature map D2, and the watermark tensor M in the channel dimension and passing them through a secondary convolution layer and a 1×1 convolution layer to obtain the encoded image D1.
4. The screen-shot image robust watermarking method based on an attention mechanism and contrastive learning according to claim 3, characterized in that performing three downsampling operations on the carrier image comprises:
at each downsampling, passing the carrier image sequentially through a 3×3 convolution layer, a batch normalization layer, a first activation function layer, a 3×3 convolution layer, a batch normalization layer, a second activation function layer, and an HWC attention module;
the HWC attention module comprising an HAttention module, a WAttention module, and a CAttention module;
wherein in the HAttention module, the input feature map x undergoes an adaptive max-pooling operation and an adaptive global average-pooling operation that compress the width to 1, yielding a feature map max_h and a feature map avg_h; max_h passes sequentially through a 1×1 convolution layer, an activation function layer, and a 1×1 convolution layer to obtain a feature map se(max_h); avg_h passes sequentially through a 1×1 convolution layer, an activation function layer, and a 1×1 convolution layer to obtain a feature map se(avg_h); se(max_h) and se(avg_h) are concatenated and passed through an activation function layer to obtain a feature map A_h;
in the WAttention module, the input feature map x undergoes adaptive max pooling and adaptive global average pooling that compress the height to 1, yielding a feature map max_w and a feature map avg_w; max_w passes sequentially through a 1×1 convolution layer, an activation function layer, and a 1×1 convolution layer to obtain a feature map se(max_w); avg_w passes sequentially through a 1×1 convolution layer, an activation function layer, and a 1×1 convolution layer to obtain a feature map se(avg_w); se(max_w) and se(avg_w) are concatenated and passed through an activation function layer to obtain a feature map A_w;
in the CAttention module, the input feature map x is compressed to one channel by an adaptive max-pooling layer and an adaptive global average-pooling layer respectively; the two resulting tensors are concatenated in the channel dimension, reduced to one channel by a 1×1 convolution layer, and passed through an activation function layer to obtain a feature map A_c;
the input feature map x is multiplied by the feature map A_h and the feature map A_w in the height and width dimensions, then by the feature map A_c in the channel dimension, and the result is added to the input feature map x as the output of the HWC attention module.
5. The screen-shot image robust watermarking method based on an attention mechanism and contrastive learning according to claim 4, characterized in that the discriminator is a spectrally normalized generative adversarial network; the carrier image and the watermarked encoded image are input to the spectrally normalized generative adversarial network, the output of the discriminator is true or false, the output result of the discriminator is input to the encoder, and the encoder loss function L_C and the discriminator loss function L_E are calculated as:
L_C = α·L_nce(Ĉ, E, X_r, X_w) + β·L_gan(Ĉ, E, X_w),
L_E = L_gan(C, Ê, X_r, X_w),
where α and β are training hyper-parameters, X_r and X_w represent the carrier image and the encoded image respectively, L_nce and L_gan represent the NCE loss and the hinge loss respectively, C represents the weight parameters of the discriminator, Ĉ represents the weight parameters of the fixed discriminator, E represents the weight parameters of the encoder, and Ê represents the weight parameters of the fixed encoder.
6. The screen-shot image robust watermarking method based on an attention mechanism and contrastive learning according to claim 5, characterized in that performing distortion simulation on the encoded image comprises:
randomly perturbing the four corners of the encoded image using a perspective transformation, then bilinearly resampling the encoded image to create a perspective-warped image;
performing illumination distortion and moiré distortion simulation on the perspective-warped image using an illumination simulation function and a moiré simulation function;
simulating interference in real scenes using Gaussian noise.
7. The screen-shot image robust watermarking method based on an attention mechanism and contrastive learning according to claim 6, characterized in that the decoder comprises 3 single convolution blocks, 3 residual convolution blocks, 1 single convolution block, 6 residual convolution blocks, 1 single convolution block, and 1 fully connected layer arranged in sequence.
8. The screen-shot image robust watermarking method based on an attention mechanism and contrastive learning according to claim 7, characterized in that after extracting the watermark information hidden in the distorted encoded image, the method comprises:
calculating a decoder loss function L_D from the decoded watermark information and the original watermark information:
L_D = MSE(M, M_d) = MSE(M, D(γ_D, I_n)),
where M represents the original watermark information, M_d represents the decoded watermark information, γ_D represents the decoding hyper-parameters, I_n represents the distorted encoded image, and D(γ_D, I_n) represents the decoder decoding the distorted encoded image.
9. The screen-shot image robust watermarking method based on an attention mechanism and contrastive learning according to claim 8, characterized in that the joint loss function is composed of the encoder loss function, the discriminator loss function, and the decoder loss function:
L = λ_1·L_C + λ_2·L_E + λ_3·L_D,
where λ_1, λ_2, and λ_3 are the weight parameters of the corresponding loss functions.
CN202411448920.XA 2024-10-17 2024-10-17 Screen shot image robust watermarking method based on attention mechanism and contrast learning Active CN118967424B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202411448920.XA | 2024-10-17 | 2024-10-17 | Screen shot image robust watermarking method based on attention mechanism and contrast learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202411448920.XA | 2024-10-17 | 2024-10-17 | Screen shot image robust watermarking method based on attention mechanism and contrast learning

Publications (2)

Publication Number | Publication Date
CN118967424A (en) | 2024-11-15
CN118967424B (en) | 2025-03-14

Family

ID=93393397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411448920.XA Active CN118967424B (en) 2024-10-17 2024-10-17 Screen shot image robust watermarking method based on attention mechanism and contrast learning

Country Status (1)

Country Link
CN (1) CN118967424B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120374346B (en) * 2025-06-26 2025-08-26 Nanjing University of Information Science and Technology Anti-screen robust watermarking method based on generation of antagonism network and multiple tokens

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200710A (en) * 2020-10-08 2021-01-08 东南数字经济发展研究院 Self-adaptive invisible watermark synchronous detection method based on deep learning
CN116992407A (en) * 2023-08-18 2023-11-03 湖南大学 An anti-screen distortion watermarking method based on reversible bijection structure

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114648436A (en) * 2022-03-16 2022-06-21 南京信息工程大学 Screen shot resistant text image watermark embedding and extracting method based on deep learning
CN118037518A (en) * 2024-01-17 2024-05-14 武汉大学 Traceable anti-watermark generation method and system for face image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200710A (en) * 2020-10-08 2021-01-08 东南数字经济发展研究院 Self-adaptive invisible watermark synchronous detection method based on deep learning
CN116992407A (en) * 2023-08-18 2023-11-03 湖南大学 An anti-screen distortion watermarking method based on reversible bijection structure

Also Published As

Publication number Publication date
CN118967424A (en) 2024-11-15

Similar Documents

Publication Publication Date Title
Zhang et al. Robust invisible video watermarking with attention
CN111491170B (en) Method for embedding watermark and watermark embedding device
Fang et al. Encoded feature enhancement in watermarking network for distortion in real scenes
CN115131188B (en) Robust image watermarking method based on generation countermeasure network
CN118967424B (en) Screen shot image robust watermarking method based on attention mechanism and contrast learning
CN114445256A (en) Training method, device, equipment and storage medium for digital watermark
Fu et al. Waverecovery: Screen-shooting watermarking based on wavelet and recovery
He et al. Robust blind video watermarking against geometric deformations and online video sharing platform processing
CN114862645B (en) Anti-printing digital watermarking method and device based on combination of U-Net network and DFT optimal quality radius
Cao et al. Screen-shooting resistant image watermarking based on lightweight neural network in frequency domain
CN120374346B (en) Anti-screen robust watermarking method based on generation of antagonism network and multiple tokens
CN118279119B (en) Image watermark information processing method, device and equipment
CN117455749A (en) A robust screen watermarking method based on wavelet domain cascade network and reverse recovery
Liao et al. GIFMarking: The robust watermarking for animated GIF based deep learning
Zhang et al. Embedding Guided End‐to‐End Framework for Robust Image Watermarking
Liu et al. Hiding functions within functions: Steganography by implicit neural representations
Zhang et al. A convolutional neural network-based blind robust image watermarking approach exploiting the frequency domain
CN117611422A (en) Image steganography method based on Moire pattern generation
KR101169826B1 (en) Bit-accurate film grain simulation method based on pre-computed transformed coefficients
CN115526758A (en) Hadamard transform screen-shot-resistant watermarking method based on deep learning
CN114119330B (en) Robust digital watermark embedding and extracting method based on neural network
Liu et al. Screen shooting resistant watermarking based on cross attention
Zhong et al. Enhanced attention mechanism-based image watermarking with simulated JPEG compression
Chen et al. Rowsformer: A robust watermarking framework with swin transformer for enhanced geometric attack resilience
CN118741149B (en) A watermark updating method suitable for multi-stage transmission process

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant