Disclosure of Invention
The invention aims to solve the technical problem of providing a self-adaptive focusing positioning target detection method which can improve the target detection precision.
To solve the above technical problem, the invention provides a self-adaptive focusing and positioning target detection method, which comprises the following steps:
Receiving an image to be identified;
Inputting the image to be identified into a target detection model to obtain the position and the category of a target in the image to be identified, wherein the target detection model comprises:
the feature extraction layer is used for extracting the features of the image to be identified;
the target category prediction layer is used for performing a blocking operation on each layer of features extracted by the feature extraction layer, and performing category prediction and coefficient prediction on each block;
and the target positioning prediction layer is used for generating a mask tuple according to the features extracted by the feature extraction layer, multiplying the mask tuple by the coefficients obtained by the target category prediction layer, and summing the products to obtain a target mask.
The feature extraction layer adds an aliasing residual structure between adjacent residual structures.
The aliasing residual structure comprises a first processing unit, a second processing unit and a second ReLU activation function layer. The input of the first processing unit is the feature information of the current layer; the output of the first processing unit and the low-layer feature information together serve as the input of the second processing unit; and the output of the second processing unit and the feature information of the current layer serve as the input of the second ReLU activation function layer. The first processing unit comprises a 3×3 convolution layer, a first normalization layer and a first ReLU activation function layer which are sequentially connected, and the second processing unit comprises a 1×1 convolution layer and a second normalization layer which are sequentially connected.
The number of channels of the mask tuple is the same as the vector dimension of the coefficients obtained by the target category prediction layer.
The total loss function of the target detection model comprises a category loss function, a segmentation loss function and a center point loss function, wherein the center point loss function is a mask-guided center point error function.
The center point loss function is:

L_center = (1/m) · Σ_k I(p_{i,j} > 0) · ‖ĉ_k − c_k‖

wherein m represents the number of positive samples; k represents the index of the S×S blocks counted from left to right and from top to bottom, with i = ⌊k/S⌋ and j = k mod S; I represents an indicator function, which is 1 when p_{i,j} > 0 and 0 otherwise; ĉ_k represents the predicted center point position, and c_k represents the true center point position.
Advantageous effects
Compared with the prior art, the invention introduces a residual aliasing module to retain low-level information so as to cope with the similarity between targets in traffic sign recognition tasks, thereby improving recognition accuracy. The invention obtains an independent positioning result for each single target by multiplying the target bases by the coefficients and summing the products, which improves the quality of the target bases and weakens the influence of low-quality target bases, thereby mitigating the aliasing phenomenon between targets that exists in current methods. By introducing a mask-guided center point error, the invention avoids degraded detection results caused by errors that are not emphasized in the segmentation.
Detailed Description
The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
The embodiment of the invention relates to a self-adaptive focusing positioning target detection method, as shown in fig. 1, comprising the following steps: receiving an image to be identified, and inputting the image to be identified into a target detection model to obtain the position and the category of a target in the image to be identified. The target detection model is improved in three respects: (1) a residual aliasing module is introduced to retain low-layer information, so as to improve recognition accuracy; (2) independent single-target positioning results are obtained by multiplying target bases by coefficients and summing the products, which weakens the influence of low-quality target bases and mitigates the aliasing phenomenon between targets that exists in existing methods; (3) a mask-guided center point error is introduced to avoid degraded detection results caused by errors that are not emphasized in segmentation. The detection method of the present embodiment can therefore be applied to the detection of traffic signs, and the invention is further described below taking SOLOv2 as an example.
The main idea of SOLOv2 is to divide the picture into S×S blocks and predict, for each block, its class (including a background class for blocks that do not contain an object). Each block also predicts an N-dimensional coefficient vector. SOLOv2 refers to the coefficient vectors predicted by all blocks as kernels. The target mask is generated by convolving the kernels with the mask branch output. For a large target spanning multiple blocks, SOLOv2 specifies that the target center position determines which block is responsible for predicting the target.
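The center-based assignment rule described above can be sketched as follows. This is a minimal illustration, not SOLOv2's actual implementation; the function name, grid size, and coordinate convention are illustrative.

```python
import numpy as np

def responsible_block(center_xy, image_size, S):
    """Return the (row, col) of the S x S grid block responsible for a target
    whose center falls at center_xy = (x, y) in pixel coordinates.
    Sketch of the center-based assignment rule described above."""
    x, y = center_xy
    w, h = image_size
    col = min(int(x / w * S), S - 1)  # left-to-right index
    row = min(int(y / h * S), S - 1)  # top-to-bottom index
    return row, col

# A target centered at (300, 100) in a 400x400 image with S=4 falls in
# block (row=1, col=3): second row from the top, last column.
```

The flat block index used later in the loss is then k = row·S + col, so that row = ⌊k/S⌋ and col = k mod S.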
The existing SOLOv2 network structure includes two parts: a feature extraction part and a decoding part.
The feature extraction part mainly consists of ResNet plus an FPN. ResNet consists of residual modules and downsampling operations: the residual modules continuously extract features, the downsampling operations produce a new layer of features with a larger receptive field, and the FPN fuses the large-receptive-field features into the lower-layer features.
The decoding part can be divided into two branches. One branch performs a blocking operation on each layer of the FPN output, and each block performs category prediction and coefficient prediction; all the coefficients of each layer form the kernels. The other branch generates a mask whose number of channels equals the coefficient vector dimension. The mask for each target is generated by convolving its kernel with this mask. One benefit of placing the kernel prediction and the category prediction in the same branch is that the result of the category prediction can be used to filter the predicted coefficients, filtering out blocks belonging to the background class.
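The kernel-with-mask convolution above is a 1×1 convolution, i.e. a per-pixel dot product between the N-channel mask features and each N-dimensional coefficient vector. A minimal NumPy sketch, with illustrative shapes and function names:

```python
import numpy as np

def masks_from_kernels(mask_feats, kernels):
    """Generate one raw mask per predicted kernel by 1x1 convolution,
    i.e. a per-pixel dot product between the mask features and each
    coefficient vector.
    mask_feats: (N, H, W) mask-branch output
    kernels:    (K, N)    coefficient vectors from the category branch
    returns:    (K, H, W) one raw mask per kernel"""
    N, H, W = mask_feats.shape
    return (kernels @ mask_feats.reshape(N, H * W)).reshape(-1, H, W)

feats = np.ones((3, 2, 2))             # N=3 channels, 2x2 spatial
kernels = np.array([[1.0, 0.0, 2.0]])  # one predicted kernel
print(masks_from_kernels(feats, kernels)[0, 0, 0])  # 1*1 + 0*1 + 2*1 = 3.0
```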
As shown in fig. 2, the target detection model in this embodiment includes a feature extraction layer, a target category prediction layer, and a target positioning prediction layer. The feature extraction layer is used for extracting features of the image to be identified; the target category prediction layer is used for performing a blocking operation on each layer of features extracted by the feature extraction layer and performing category prediction and coefficient prediction on each block; and the target positioning prediction layer is used for generating a mask tuple according to the features extracted by the feature extraction layer, multiplying the mask tuple by the coefficients obtained by the target category prediction layer, and summing the products to obtain a target mask. The present embodiment improves on the feature extraction layer and the target positioning prediction layer of the existing SOLOv2 network architecture.
In this embodiment, the feature extraction layer improves on ResNet: as shown in fig. 3, an aliasing residual structure is added between adjacent residual structures to enhance the transmission of detail features. The introduced aliasing residual structure retains low-layer information, thereby improving recognition accuracy.
As shown in FIG. 4, the aliasing residual structure comprises a first processing unit, a second processing unit and a second ReLU activation function layer. The input of the first processing unit is the feature information of the current layer; the output of the first processing unit and the low-layer feature information together serve as the input of the second processing unit; and the output of the second processing unit and the feature information of the current layer serve as the input of the second ReLU activation function layer. The first processing unit comprises a 3×3 convolution layer, a first normalization layer and a first ReLU activation function layer which are sequentially connected, and the second processing unit comprises a 1×1 convolution layer and a second normalization layer which are sequentially connected. The structure strengthens the detail features of the input by introducing low-level features into the current module and aliasing them with the input features processed by the first processing unit, and is used between different layers of the encoder so that detail features are preserved and transferred.
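The dataflow of the aliasing residual structure can be sketched as below. This is a simplified illustration under stated assumptions: convolutions are reduced to channel-mixing matrix multiplies, normalization layers are omitted, and the combination of the first unit's output with the low-layer features is assumed to be channel concatenation (the text says only that both serve as the second unit's input); w3 and w1 are illustrative stand-ins for the 3×3 and 1×1 convolution weights.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def aliasing_residual(cur, low, w3, w1):
    """Dataflow sketch of the aliasing residual structure.
    cur: (C, H*W) current-layer features; low: (C, H*W) low-layer features
    w3:  (C, C)   stand-in for the 3x3 convolution weights
    w1:  (C, 2C)  stand-in for the 1x1 convolution weights"""
    # first processing unit: "3x3 conv" + (omitted) norm + first ReLU
    u1 = relu(w3 @ cur)
    # second processing unit takes u1 and the low-layer features together
    u2 = w1 @ np.concatenate([u1, low], axis=0)   # "1x1 conv" + (omitted) norm
    # residual connection with the current-layer features, then second ReLU
    return relu(u2 + cur)
```

With identity-like weights the output simply doubles a constant input, confirming that the current-layer features pass through the residual path unchanged.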
According to the features extracted by the feature extraction layer, the target positioning prediction layer in this embodiment generates a mask tuple whose number of channels is the same as the vector dimension of the coefficients obtained by the target category prediction layer, multiplies the mask tuple by those coefficients, and sums the products to obtain a target mask. The mask tuple in this embodiment is similar to the concept of a basis in a vector space and can therefore be regarded as a set of target bases. In a vector space, once the basis is determined, each vector has a unique representation. In a neural network, however, the number of channels is fixed while the number of targets in the input picture is not, so there is no guarantee that the channel masks are linearly independent. The inventor found that, for a given picture, most channels contribute little to the detection result; the method therefore obtains the target mask by multiplying the mask tuple by the coefficients obtained by the target category prediction layer and summing the products, which improves the quality of the target bases, weakens the influence of low-quality target bases, and mitigates the aliasing phenomenon between targets that exists in the SOLOv2 network.
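The multiply-and-sum combination of target bases can be sketched as follows; shapes and names are illustrative. A small coefficient down-weights a low-quality base, which is the mechanism the paragraph above describes.

```python
import numpy as np

def target_mask(mask_tuple, coeffs):
    """Combine the mask tuple (target bases) into a single target mask:
    multiply each channel by its predicted coefficient, then sum.
    mask_tuple: (N, H, W); coeffs: (N,) -> returns (H, W)"""
    return (coeffs[:, None, None] * mask_tuple).sum(axis=0)

bases = np.stack([np.ones((2, 2)), np.full((2, 2), 0.5)])  # N=2 target bases
coeffs = np.array([0.2, 1.0])  # low coefficient suppresses the first base
print(target_mask(bases, coeffs)[0, 0])  # 0.2*1.0 + 1.0*0.5 = 0.7
```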
The target detection model in this embodiment introduces a mask-guided center point error into the loss function, so the total loss function of the target detection model in this embodiment consists of a category loss function, a segmentation loss function and a center point loss function, wherein the center point loss function is:

L_center = (1/m) · Σ_k I(p_{i,j} > 0) · ‖ĉ_k − c_k‖

wherein m represents the number of positive samples; k represents the index of the S×S blocks counted from left to right and from top to bottom, with i = ⌊k/S⌋ and j = k mod S; I represents an indicator function, which is 1 when p_{i,j} > 0 and 0 otherwise; ĉ_k represents the predicted center point position, and c_k represents the true center point position, where c = (u_c, v_c) and u, v denote pixel positions. The mask-guided center point error is introduced to avoid degraded detection results caused by errors that are not emphasized in the segmentation.
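A minimal sketch of this loss follows, under the assumption that the per-block error is the Euclidean distance between the predicted and true centers (the surviving text names the terms but not the exact error form); array shapes and names are illustrative.

```python
import numpy as np

def center_point_loss(pred_centers, true_centers, p):
    """Sketch of the mask-guided center point loss described above.
    pred_centers, true_centers: (S*S, 2) per-block center positions (u, v)
    p: (S, S) mask-guidance values; a block contributes only when
    p[i, j] > 0 (the indicator I), averaged over the m positive samples."""
    S = p.shape[0]
    total, m = 0.0, 0
    for k in range(S * S):       # blocks left-to-right, top-to-bottom
        i, j = k // S, k % S
        if p[i, j] > 0:          # indicator I(p_{i,j} > 0)
            total += np.linalg.norm(pred_centers[k] - true_centers[k])
            m += 1
    return total / m if m > 0 else 0.0
```

Blocks whose guidance value is zero (e.g. background blocks) contribute nothing, so the loss focuses on the positions the mask marks as positive.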
It is easy to find that the invention introduces a residual aliasing module to retain low-level information so as to cope with the similarity between targets in traffic sign recognition tasks, thereby improving recognition accuracy. The invention obtains an independent positioning result for each single target by multiplying the target bases by the coefficients and summing the products, which improves the quality of the target bases and weakens the influence of low-quality target bases, thereby mitigating the aliasing phenomenon between targets that exists in current methods. By introducing a mask-guided center point error, the invention avoids degraded detection results caused by errors that are not emphasized in the segmentation.