
CN115423818B - Identification method, measurement method and identification device based on close frame standard - Google Patents

Identification method, measurement method and identification device based on close frame standard

Info

Publication number
CN115423818B
Authority
CN
China
Prior art keywords
network
target
output
examples
input image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211058151.3A
Other languages
Chinese (zh)
Other versions
CN115423818A
Inventor
王娟
夏斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sibionics Intelligent Technology Co Ltd
Original Assignee
Shenzhen Sibionics Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sibionics Intelligent Technology Co Ltd filed Critical Shenzhen Sibionics Intelligent Technology Co Ltd
Publication of CN115423818A publication Critical patent/CN115423818A/en
Application granted granted Critical
Publication of CN115423818B publication Critical patent/CN115423818B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Eye Examination Apparatus (AREA)

Abstract


The present disclosure describes a recognition method, a measurement method, and a recognition device based on tight frame markers, in which a target is identified by a network module trained on the target's tight frame markers. The network module includes a segmentation network for image segmentation based on weakly supervised learning and a regression network based on bounding box regression. The recognition method includes: obtaining an input image including at least one target, the at least one target belonging to at least one category of interest; inputting the input image into the network module to obtain a first output produced by the segmentation network and a second output produced by the regression network, where the first output includes the probability that each pixel in the input image belongs to each category, and the second output includes the offset between the position of each pixel in the input image and the tight frame marker of the target of each category; and identifying the target based on the first and second outputs. Thus, the target can be identified.

Description

Identification method, measurement method and identification device based on close frame standard
This application is a divisional application of Chinese patent application No. 2021112166277, filed on October 19, 2021, entitled "Measurement method and measurement device for deep learning based on a tight frame standard".
Technical Field
The disclosure relates generally to the field of recognition technology based on deep learning, and in particular relates to a recognition method, a measurement method and a recognition device based on a close frame standard.
Background
The image often includes information of various targets, and the target can be automatically analyzed based on the information of the targets in the image identified by the image processing technology. For example, in the medical field, tissue objects in medical images may be identified, and the size of the tissue objects can be measured to monitor changes in the tissue objects.
In recent years, artificial intelligence technology represented by deep learning has been remarkably developed, and applications thereof in object recognition or measurement and the like have been increasingly focused. Researchers use deep learning techniques to identify or further measure objects in images. In particular, in some deep learning-based studies, a neural network based on deep learning is often trained with labeling data to identify and segment a target in an image, thereby enabling measurement of the target.
However, the above-described methods of target identification or measurement often require accurate pixel-level labeling data for training of the neural network, and collecting pixel-level labeling data often requires a significant amount of manpower and material resources. In addition, some methods for identifying objects are not based on pixel-level labeling data, but only identify objects in an image, and the boundary identification of the objects is not accurate enough or tends to be lower in accuracy at the boundary position close to the objects, so that the method is not suitable for scenes requiring accurate measurement. In this case, the accuracy of measurement of the target in the image has yet to be improved.
Disclosure of Invention
The present disclosure has been made in view of the above-described circumstances, and an object thereof is to provide a measurement method and a measurement device for deep learning based on a close frame standard, which can identify a target and can accurately measure the target.
To this end, a first aspect of the present disclosure provides a measurement method for deep learning based on a tight frame standard, in which a target is identified by a network module trained based on the tight frame standard of the target, the tight frame standard being the minimum circumscribed rectangle of the target. The measurement method comprises the steps of: obtaining an input image comprising at least one target, the at least one target belonging to at least one category of interest; inputting the input image into the network module to obtain a first output and a second output, the first output comprising the probability that each pixel point in the input image belongs to each category, and the second output comprising the offset between the position of each pixel point in the input image and the tight frame standard of the target of each category, the offset in the second output being taken as the target offset; and identifying the target based on the first output and the second output to obtain the tight frame standard of the target of each category, so that each target can be measured based on its tight frame standard. The network module comprises a backbone network, a segmentation network based on image segmentation with weakly supervised learning, and a regression network based on bounding box regression; the backbone network is used to extract a feature map of the input image whose resolution is consistent with that of the input image, the segmentation network takes the feature map as input and produces the first output, and the regression network takes the feature map as input and produces the second output.
In the method, a network module comprising a backbone network, a segmentation network based on image segmentation of weak supervised learning and a regression network based on frame regression is constructed, the network module is trained based on a tight frame mark of a target, the backbone network receives an input image and extracts a feature map consistent with the resolution of the input image, the feature map is respectively input into the segmentation network and the regression network to obtain a first output and a second output, and then the tight frame mark of the target in the input image is obtained based on the first output and the second output so as to realize measurement. In this case, the training network module based on the target's tight-framing target can accurately predict the target's tight-framing target in the input image, and thus can accurately measure based on the target's tight-framing target.
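The data flow described above can be sketched concretely. The following is a minimal illustration, not the patented implementation: the per-pixel linear "backbone" and the random weights are placeholder assumptions, used only to show how a single feature map at input resolution feeds both a segmentation head (per-pixel, per-category probabilities) and a regression head (per-pixel, per-category four-element offsets):

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(image, out_channels=8):
    # Placeholder feature extractor: a per-pixel linear map standing in for
    # the encoder-decoder backbone; the feature map keeps the input's (H, W).
    weights = rng.standard_normal((image.shape[-1], out_channels))
    return image @ weights  # (H, W, out_channels)

def segmentation_head(features, num_classes=2):
    # First output: probability that each pixel belongs to each category
    # (sigmoid, since the task is described as multi-label classification).
    logits = features @ rng.standard_normal((features.shape[-1], num_classes))
    return 1.0 / (1.0 + np.exp(-logits))  # (H, W, num_classes)

def regression_head(features, num_classes=2):
    # Second output: for each pixel and category, four offsets
    # (left, top, right, bottom) to that category's tight frame.
    h, w, c = features.shape
    out = features @ rng.standard_normal((c, num_classes * 4))
    return out.reshape(h, w, num_classes, 4)

image = rng.random((16, 16, 3))   # toy RGB input
features = backbone(image)        # same spatial resolution as the input
first_output = segmentation_head(features)
second_output = regression_head(features)
```

Both heads consume the same feature map, which is why the backbone must keep the feature map at the input resolution: the offsets and probabilities are defined per input pixel.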
In addition, in the measurement method according to the first aspect of the present disclosure, the size of each target is optionally measured based on a tight frame of the target. Thus, the target can be accurately measured based on the tight frame standard of the target.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the network module is trained by constructing a training sample, input image data of the training sample includes a plurality of images to be trained including an image including a target belonging to at least one category, tag data of the training sample includes a gold standard of the category to which the target belongs and a gold standard of a close frame standard of the target, obtaining, by the network module, prediction segmentation data output by the segmentation network and prediction offset output by the regression network corresponding to the training sample based on the input image data of the training sample, determining a training loss of the network module based on the tag data, the prediction segmentation data, and the prediction offset corresponding to the training sample, and training the network module based on the training loss to optimize the network module. Thereby, an optimized network module can be obtained.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the determining the training loss of the network module based on the label data corresponding to the training sample, the prediction segmentation data, and the prediction offset includes obtaining the segmentation loss of the segmentation network based on the prediction segmentation data and the label data corresponding to the training sample, obtaining the regression loss of the regression network based on the prediction offset corresponding to the training sample and the real offset corresponding to the label data, wherein the real offset is an offset of a position of a pixel point of the image to be trained and a gold standard of a close frame of a target in the label data, and obtaining the training loss of the network module based on the segmentation loss and the regression loss. In this case, the predicted segmentation data of the segmentation network can be approximated to the label data by the segmentation loss, and the predicted offset of the regression network can be approximated to the true offset by the regression loss.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the target offset is an offset normalized based on the average width and average height of targets of each category. Thus, the accuracy of identifying or measuring targets whose sizes do not vary greatly can be improved.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the widths and heights of the tight frames of the targets in the tag data are averaged by category to obtain the average width and average height, respectively. Thus, the average width and average height of the targets can be obtained from the training samples.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, with multi-example learning, a plurality of packets to be trained are acquired by category based on the gold standard of the tight-framed standard of the object in each image to be trained, and the segmentation loss is acquired based on a plurality of packets to be trained of each category, wherein the plurality of packets to be trained include a plurality of positive packets and a plurality of negative packets, all pixels on each of a plurality of straight lines connecting two sides of the tight-framed standard of the object opposite to each other are divided into one positive packet, the plurality of straight lines including at least one set of first parallel lines parallel to each other and second parallel lines perpendicular to each set of first parallel lines, and the negative packets are single pixels of an area other than the gold standard of the tight-framed standard of all objects of one category. Thus, the segmentation loss can be acquired based on the positive and negative packets of the multi-instance learning.
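The bag construction above can be sketched for the simplest case, assuming only axis-aligned (0°) crossing lines, i.e. one set of first parallel lines parallel to the box's top and bottom plus the perpendicular set (the general case also rotates the lines; pixel coordinates are inclusive):

```python
def mil_bags(box, image_shape):
    """Build multiple-instance-learning bags from one gold-standard tight
    frame box = (xl, yt, xr, yb) in an image of shape (H, W).

    Each straight line connecting two opposite sides of the box yields one
    positive bag (all its pixels); every pixel outside the box is its own
    negative bag. Only horizontal/vertical lines are used in this sketch.
    """
    xl, yt, xr, yb = box
    h, w = image_shape
    positive_bags = []
    for y in range(yt, yb + 1):   # lines connecting left and right sides
        positive_bags.append([(y, x) for x in range(xl, xr + 1)])
    for x in range(xl, xr + 1):   # lines connecting top and bottom sides
        positive_bags.append([(y, x) for y in range(yt, yb + 1)])
    negatives = [(y, x) for y in range(h) for x in range(w)
                 if not (xl <= x <= xr and yt <= y <= yb)]
    return positive_bags, negatives
```

The tightness prior guarantees that every such crossing line touches the object at least once, which is exactly what makes each line a valid positive bag.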
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the angle of the first parallel lines is the angle between an extension line of the first parallel lines and an extension line of either non-intersecting side of the gold standard tight frame of the target, and this angle is greater than −90° and less than 90°. In this case, positive packets at different angles can be partitioned to optimize the segmentation network. Thus, the accuracy of the prediction segmentation data of the segmentation network can be improved.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the segmentation loss includes a unary term describing the degree to which each packet to be trained belongs to the gold standard of each category, and a pairwise term describing the degree to which a pixel point of the image to be trained belongs to the same category as its neighboring pixel points. In this case, the tight frame can be constrained by the positive and negative packets simultaneously through the unary term, and the prediction segmentation result can be smoothed by the pairwise term.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, pixel points that fall within the gold standard tight frame of at least one target are selected from the image to be trained as positive samples to optimize the regression network. In this case, the regression network is optimized based on pixel points falling within the real tight frame of at least one target, so the efficiency of regression network optimization can be improved.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, pixel points that fall within the gold standard tight frame of at least one target are selected from the image to be trained, by category, as the positive samples of each category, and the matching tight frame corresponding to each positive sample is obtained so as to screen the positive samples of each category based on the matching tight frame; the regression network is then optimized using the screened positive samples of each category. Here the matching tight frame is the gold standard tight frame, among those the positive sample falls within, whose true offset from the position of the positive sample is smallest. Thus, the regression network can be optimized using the positive samples of each class screened based on the matching tight frames.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, let the position of the pixel point be (x, y), the tight frame of one target corresponding to the pixel point be b = (xl, yt, xr, yb), and the offset of the tight frame b relative to the position of the pixel point be t = (tl, tt, tr, tb); then tl, tt, tr, tb satisfy: tl = (x − xl)/S_c1, tt = (y − yt)/S_c2, tr = (xr − x)/S_c1, tb = (yb − y)/S_c2, where (xl, yt) is the position of the upper-left corner of the target's tight frame, (xr, yb) is the position of its lower-right corner, S_c1 is the average width of targets of the c-th category, and S_c2 is the average height of targets of the c-th category. Thereby, normalized offsets can be obtained.
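The normalization above translates directly into code; in this small sketch the parameter names avg_w and avg_h stand for S_c1 and S_c2:

```python
def tight_box_offsets(x, y, box, avg_w, avg_h):
    """Normalized offsets t = (tl, tt, tr, tb) of pixel (x, y) relative to a
    tight frame b = (xl, yt, xr, yb); avg_w and avg_h play the role of the
    per-category average width S_c1 and average height S_c2."""
    xl, yt, xr, yb = box
    return ((x - xl) / avg_w, (y - yt) / avg_h,
            (xr - x) / avg_w, (yb - y) / avg_h)
```

Note that all four offsets are non-negative exactly when the pixel lies inside the box, which is why positive samples are restricted to pixels inside at least one gold standard tight frame.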
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, for each category, the expected intersection ratio corresponding to each pixel point of the image to be trained is obtained, and the pixel points whose expected intersection ratio is greater than a preset expected intersection ratio are selected as positive samples to optimize the regression network. Thus, positive samples meeting the preset expected intersection ratio can be obtained.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, a plurality of frames of different sizes are constructed with a pixel point of the image to be trained as the center point, and the maximum of the intersection ratios between these frames and the matching tight frame of the pixel point is taken as the expected intersection ratio, where the matching tight frame is the gold standard tight frame, among those containing the pixel point, whose true offset relative to the position of the pixel point is smallest. Thus, the expected intersection ratio can be obtained.
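The construction just described (candidate boxes of varying width and height centered on the pixel, keeping the maximum IoU with the matching tight frame) can be evaluated by brute force. This sketch does exactly that, without relying on any closed-form expression; the grid resolution `steps` is an illustrative choice:

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (xl, yt, xr, yb).
    xl, yt = max(a[0], b[0]), max(a[1], b[1])
    xr, yb = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, xr - xl) * max(0.0, yb - yt)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def expected_iou(x, y, box, steps=100):
    """Maximum IoU over candidate boxes of varying size centered at (x, y),
    against the matching tight frame `box` -- a numerical stand-in for the
    expected intersection ratio."""
    xl, yt, xr, yb = box
    best = 0.0
    for i in range(1, steps + 1):
        for j in range(1, steps + 1):
            w = 2.0 * (xr - xl) * i / steps   # candidate widths up to 2x box
            h = 2.0 * (yb - yt) * j / steps   # candidate heights up to 2x box
            cand = (x - w / 2, y - h / 2, x + w / 2, y + h / 2)
            best = max(best, iou(cand, box))
    return best
```

A pixel at the center of the box achieves an expected intersection ratio of 1 (a centered candidate can coincide with the box exactly), and the ratio decays toward the edges, which is what makes it a useful criterion for screening positive samples.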
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the expected intersection ratio satisfies the formula EIoU(r1, r2) = max{IoU1(r1, r2), IoU2(r1, r2), IoU3(r1, r2), IoU4(r1, r2)}, where (r1, r2), with 0 < r1, r2 < 1, is the relative position of the pixel point of the image to be trained within the matching tight frame, and IoU1(r1, r2) = 4·r1·r2, IoU2(r1, r2) = 2r1/(2r1(1 − 2r2) + 1), IoU3(r1, r2) = 2r2/(2r2(1 − 2r1) + 1), IoU4(r1, r2) = 1/(4(1 − r1)(1 − r2)). Thus, the expected intersection ratio can be obtained.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the minimum true offset is obtained by comparing the L1 norms of the true offsets. In this case, the smallest true offset can be obtained based on the L1 norm, and thus the matching tight frame can be obtained.
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, the regression loss satisfies the formula L_reg = Σ_{c=1}^{C} (1/M_c) Σ_{i=1}^{M_c} s(t_ic − v_ic), where C represents the number of classes, M_c represents the number of positive samples of the c-th class, t_ic represents the true offset corresponding to the i-th positive sample of the c-th class, v_ic represents the predicted offset corresponding to the i-th positive sample of the c-th class, and s(x) represents the sum of the smooth L1 losses of all the elements in x. Thus, the regression loss can be obtained.
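A minimal sketch of this loss follows. The per-class averaging over the M_c positive samples followed by a sum over classes is an assumption about the aggregation; the text itself only fixes s(x) as the element-wise smooth L1 sum:

```python
import numpy as np

def smooth_l1(x):
    # Standard smooth L1: quadratic below 1 in magnitude, linear above.
    a = np.abs(x)
    return np.where(a < 1.0, 0.5 * a * a, a - 0.5)

def regression_loss(true_offsets, pred_offsets):
    """true_offsets / pred_offsets: dicts mapping a category c to (M_c, 4)
    arrays holding the offsets t_ic and v_ic of that category's positive
    samples. Returns the scalar regression loss."""
    total = 0.0
    for c in true_offsets:
        # s(t_ic - v_ic): smooth-L1 summed over the 4 offset elements
        per_sample = smooth_l1(true_offsets[c] - pred_offsets[c]).sum(axis=1)
        total += per_sample.mean()   # average over the M_c positive samples
    return float(total)
```

Smooth L1 keeps gradients bounded for outlier samples while remaining quadratic (and therefore smooth) near zero, which is why it is the usual choice for box regression.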
In addition, in the measurement method according to the first aspect of the present disclosure, optionally, identifying the target based on the first output and the second output to obtain the tight frame of the target of each category proceeds as follows: the position of each pixel point that is a local maximum of the probability of belonging to a category is obtained from the first output as a first position, and the tight frame of the target of each category is obtained based on the first position and the target offset of the corresponding category at that position in the second output. In this case, one or more objects of each category can be identified.
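Decoding a box from a first position is just the inverse of the offset definition; a small sketch (avg_w and avg_h again stand for the per-category averages S_c1 and S_c2):

```python
def decode_tight_box(x, y, offsets, avg_w, avg_h):
    """From a first-position pixel (x, y) and its predicted normalized
    offsets (tl, tt, tr, tb), recover the tight frame (xl, yt, xr, yb)."""
    tl, tt, tr, tb = offsets
    return (x - tl * avg_w, y - tt * avg_h,
            x + tr * avg_w, y + tb * avg_h)
```

Since the box's width (xr − xl) and height (yb − yt) follow immediately from the decoded corners, this is also the step at which the measurement itself is obtained.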
In addition, in the measurement method according to the first aspect of the present disclosure, the sizes of multiple targets of the same category may optionally differ from each other by less than a factor of 10. Thereby, the accuracy of recognition of the target can be further improved.
Further, in the measurement method according to the first aspect of the present disclosure, optionally, the backbone network includes an encoding module configured to extract image features on different scales and a decoding module configured to map the image features extracted on different scales back to the resolution of the input image to output the feature map. Thus, a feature map matching the resolution of the input image can be acquired.
A second aspect of the present disclosure provides a measurement device based on deep learning with a tight frame standard, which realizes measurement by identifying the target with a network module trained based on the tight frame standard, the tight frame standard being the minimum circumscribed rectangle of the target. The measurement device comprises an acquisition module, a network module, and an identification module. The acquisition module is configured to acquire an input image comprising at least one target, the at least one target belonging to at least one category of interest. The network module is configured to receive the input image and acquire a first output and a second output based on the input image, the first output comprising the probability that each pixel point in the input image belongs to each category, and the second output comprising the offset between the position of each pixel point in the input image and the tight frame standard of the target of each category, the offset in the second output being taken as the target offset. The network module comprises a backbone network, a segmentation network based on image segmentation with weakly supervised learning, and a regression network based on bounding box regression; the backbone network is configured to extract a feature map of the input image whose resolution is consistent with that of the input image, the segmentation network takes the feature map as input and produces the first output, and the regression network takes the feature map as input and produces the second output. The identification module is configured to identify the target based on the first output and the second output to obtain the tight frame standard of the target of each category, so that each target can be measured based on its tight frame standard.
In the method, a network module comprising a backbone network, a segmentation network based on image segmentation of weak supervised learning and a regression network based on frame regression is constructed, the network module is trained based on a tight frame mark of a target, the backbone network receives an input image and extracts a feature map consistent with the resolution of the input image, the feature map is respectively input into the segmentation network and the regression network to obtain a first output and a second output, and then the tight frame mark of the target in the input image is obtained based on the first output and the second output so as to realize measurement. In this case, the training network module based on the target's tight-framing target can accurately predict the target's tight-framing target in the input image, and thus can accurately measure based on the target's tight-framing target.
According to the present disclosure, there are provided a measurement method and a measurement apparatus based on tight-framed deep learning capable of identifying a target and accurately measuring the target.
Drawings
The present disclosure will now be explained in further detail by way of example only with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram showing an application scenario of a measurement method based on tight-box-mark deep learning according to an example of the present disclosure.
Fig. 2 (a) is a schematic diagram showing a fundus image to which the example of the present disclosure relates.
Fig. 2 (b) is a schematic diagram showing the recognition result of the fundus image according to the example of the present disclosure.
Fig. 3 is a schematic diagram showing one example of a network module to which examples of the present disclosure relate.
Fig. 4 is a schematic diagram showing another example of a network module to which examples of the present disclosure relate.
Fig. 5 is a flow chart illustrating a training method of a network module according to an example of the present disclosure.
Fig. 6 is a schematic diagram illustrating a positive pack to which examples of the present disclosure relate.
Fig. 7 is a schematic diagram showing a frame constructed centering on a pixel point according to an example of the present disclosure.
Fig. 8 (a) is a flowchart showing a measurement method of tight-box-based deep learning according to an example of the present disclosure.
Fig. 8 (b) is a flowchart showing another example of the measurement method of tight-box-mark-based deep learning according to the example of the present disclosure.
Fig. 9 (a) is a block diagram showing a measurement apparatus for tight-box-based deep learning according to an example of the present disclosure.
Fig. 9 (b) is a block diagram showing another example of the measurement apparatus based on the tight-box-mark deep learning according to the example of the present disclosure.
Fig. 9 (c) is a block diagram showing another example of the measurement apparatus based on the tight-box-mark deep learning according to the example of the present disclosure.
Detailed Description
Hereinafter, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In the following description, the same members are denoted by the same reference numerals, and overlapping description thereof is omitted. In addition, the drawings are schematic, and the relative sizes and shapes of the components may differ from the actual ones. It should be noted that the terms "comprises" and "comprising", and any variations thereof, are used in this disclosure non-exclusively: a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. All methods described in this disclosure can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context.
The measurement method and the measurement device based on tight-frame deep learning according to the present disclosure can identify a target and improve the accuracy of target measurement. For example, the optic disc in a fundus image, or the tight frame of the optic disc, can be identified, and the size of the optic cup or the optic disc can further be measured based on the tight frame. The measurement method based on tight-frame deep learning related to the present disclosure may also be referred to as an identification method, a tight-frame measurement method, a tight-frame identification method, an automatic measurement method, an auxiliary measurement method, and the like. The measurement method of the present disclosure can be applied to any application scenario in which the width and/or height of a target in an image is to be measured accurately.
The measurement method related to the disclosure is a measurement method for realizing measurement by identifying a target by using a network module trained based on a target tight frame standard. The tight frame mark may be the smallest circumscribed rectangle of the target. In this case, the object is in contact with the four sides of the tight tag and does not overlap with the area outside the tight tag (i.e., the object is tangential to the four sides of the tight tag). Thus, the tight boxes can represent the width and height of the target. In addition, training the network module based on the target's close-frame tag can reduce the time and labor cost of collecting pixel-level annotation data (which may also be referred to as tag data) and the network module can accurately identify the target's close-frame tag.
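As a sanity check on this definition, the tight frame of a binary object mask is simply the extreme nonzero coordinates, since the minimum circumscribed rectangle must touch the object's outermost pixels on all four sides; a small numpy sketch:

```python
import numpy as np

def tight_box_from_mask(mask):
    """Tight frame (minimum circumscribed axis-aligned rectangle) of a
    binary mask, returned as (xl, yt, xr, yb); the object touches all four
    sides of the returned box and never crosses them."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

This is also why tight frames are so much cheaper to annotate than pixel-level masks: an annotator only needs to mark four tangent sides, yet the box still pins down the target's width and height exactly.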
The input images to which the present disclosure relates may be from cameras, CT scans, PET-CT scans, SPECT scans, MRI, ultrasound, X-rays, angiography, fluoroscopic, capsule endoscopic images, or combinations thereof. In some examples, the input image may be an image of a tissue object (e.g., fundus image). In some examples, the input image may be a natural image. The natural image may be an image observed or photographed in a natural scene. Thereby, the object in the natural image can be measured. For example, the size of a face in a natural image or the height of a pedestrian may be measured. Examples of the present disclosure are described below taking an input image as a fundus image acquired by a fundus camera as an example, and such description does not limit the scope of the present disclosure.
Fig. 1 is a schematic diagram showing an application scenario of a measurement method based on tight-box-mark deep learning according to an example of the present disclosure. Fig. 2 (a) is a schematic diagram showing a fundus image to which the example of the present disclosure relates. Fig. 2 (b) is a schematic diagram showing the recognition result of the fundus image according to the example of the present disclosure.
In some examples, the measurement methods related to the present disclosure may be applied in an application scenario as shown in fig. 1. In this application scenario, an image of the target object 51 including the position corresponding to the target may be acquired as an input image by the acquisition device 52 (e.g., a camera) (see fig. 1); the input image is input to the network module 20 to identify the target in the input image and acquire the tight box B of the target (see fig. 1), and the target may then be measured based on the tight box B. Taking fundus images as an example, the fundus image shown in fig. 2 (a) is input into the network module 20 to obtain the recognition result shown in fig. 2 (b), where the recognition result may include tight boxes for two categories of targets, the optic cup and the optic disc: the tight box B11 is the tight box of the optic disc, and the tight box B12 is the tight box of the optic cup. In this case, the optic cup and the optic disc can be measured based on their tight boxes.
The network module 20 to which the present disclosure relates may be based on multi-task learning. In some examples, the network module 20 may be a deep-learning-based neural network. In some examples, the network module 20 may include two task branches: one may be a segmentation network 22 (described later) performing weakly supervised image segmentation, and the other may be a regression network 23 (described later) performing bounding-box regression.
In some examples, the segmentation network 22 may segment the input image to obtain a target (e.g., an optic cup and/or an optic disc). In some examples, the segmentation network 22 may be based on multiple instance learning (MIL) and supervised by the tight-box labels. In some examples, the problem addressed by the segmentation network 22 may be a multi-label classification problem. In some examples, the input image may contain targets of at least one category of interest (which may be referred to simply as a category). Thus, the segmentation network 22 can handle input images that contain targets of at least one category of interest. In some examples, the input image may also contain no targets at all. In some examples, the number of targets of each category of interest may be one or more.
In some examples, regression network 23 may be used to predict tight boxes by category. In some examples, regression network 23 may predict the tight-box by predicting the offset of the tight-box relative to the location of each pixel of the input image.
In some examples, the network module 20 may also include a backbone network 21. The backbone network 21 may be used to extract a feature map of the input image (i.e., the original image input to the network module 20). In some examples, the backbone network 21 may extract high-level features for representing the target. In some examples, the resolution of the feature map may be consistent with that of the input image (i.e., the feature map may be single-scale and match the size of the input image). Thus, the accuracy of identifying or measuring targets whose sizes do not vary greatly can be improved. In some examples, a feature map consistent with the scale of the input image may be obtained by continually fusing image features of different scales. In some examples, the feature map may be input to the segmentation network 22 and the regression network 23.
In some examples, backbone network 21 may include an encoding module and a decoding module. In some examples, the encoding module may be configured to extract image features on different scales. In some examples, the decoding module may be configured to map image features extracted on different scales back to the resolution of the input image to output a feature map. Thus, a feature map matching the resolution of the input image can be acquired.
Fig. 3 is a schematic diagram showing one example of the network module 20 to which the examples of the present disclosure relate.
In some examples, as shown in fig. 3, the network module 20 may include a backbone network 21, a split network 22, and a regression network 23. The backbone network 21 may receive an input image and output a feature map. The feature map may be taken as input to the segmentation network 22 and the regression network 23 to obtain corresponding outputs. Specifically, the segmentation network 22 may take the feature map as input to obtain a first output, and the regression network 23 may take the feature map as input to obtain a second output. In this case, the input image can be input to the network module 20 to acquire the first output and the second output.
In some examples, the first output may be a result of an image segmentation prediction. In some examples, the second output may be a result of a bounding box regression prediction.
In some examples, the first output may include the probabilities that the respective pixels in the input image belong to the respective categories. In some examples, the probability that each pixel belongs to each category may be obtained through an activation function. In some examples, the first output may be a matrix. In some examples, the first output may correspond to a matrix of size M×N×C, where M×N may represent the resolution of the input image, M and N may correspond to the rows and columns of the input image, respectively, and C may represent the number of categories. For example, for fundus images targeting both the optic cup and the optic disc, the size of the matrix corresponding to the first output may be M×N×2.
In some examples, the value corresponding to the pixel at each location in the input image in the first output may be a vector, and the number of elements in the vector may be consistent with the number of categories. For example, for the pixel at the k-th position in the input image, the corresponding value in the first output may be a vector p_k; the vector p_k may include C elements, where C is the number of categories. In some examples, the element values of the vector p_k may be numerical values between 0 and 1.
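As a hedged illustration of the first output's layout, the following sketch builds a tiny M×N×C probability matrix with a sigmoid activation; the image size, logit values, and the use of a sigmoid are illustrative assumptions, not details fixed by the disclosure.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw scores for a tiny 2 x 2 "image" with C = 2 categories
# (e.g., optic disc and optic cup); all values are illustrative.
M, N, C = 2, 2, 2
logits = [[[1.2, -0.5], [0.3, 0.3]],
          [[-2.0, 0.1], [4.0, 2.0]]]

# First output: an M x N x C matrix of per-pixel, per-category probabilities,
# obtained here with a sigmoid activation (multi-label setting).
first_output = [[[sigmoid(v) for v in pixel] for pixel in row]
                for row in logits]

assert len(first_output) == M and len(first_output[0]) == N
assert len(first_output[0][0]) == C
assert all(0.0 < p < 1.0
           for row in first_output for pixel in row for p in pixel)
```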
In some examples, the second output may include the offsets of the positions of the individual pixels in the input image from the tight box of the target of each category. That is, the second output may include tight-box offsets for targets of explicit categories. In other words, each offset predicted by the regression network 23 may be tied to a target of an explicit category. In this case, even when targets of different categories overlap heavily, the tight boxes of the targets of the corresponding categories can be distinguished, and thus obtained. Therefore, identification or measurement remains possible for targets of different categories that overlap heavily. In some examples, the offsets in the second output may be taken as the target offsets.
In some examples, the target offset may be a normalized offset. In some examples, the target offset may be an offset normalized based on the average size of the targets of the respective category. In some examples, the target offset may be an offset normalized based on the average width and average height of the targets of the respective category. The target offset and the prediction offset (described later) both correspond to the true offset (described later). That is, if the true offset used when training the network module 20 (which may be simply referred to as the training phase) is normalized, then the target offset predicted by the network module 20 at measurement (which may be simply referred to as the measurement phase) and the prediction offset (corresponding to the training phase) are normalized accordingly. Thus, the accuracy of identifying or measuring targets whose sizes do not vary greatly can be improved.
In some examples, the average size of the target may be obtained by averaging the average width and the average height of the target. In some examples, the average size of the target may be an empirical value (i.e., the average width and the average height may be empirical values). In some examples, the average size of the target may be obtained by counting samples corresponding to the captured input images. In some examples, the widths and heights of the tight boxes of the targets in the label data of the samples may be averaged by category to obtain the average width and the average height, respectively. In some examples, the average width and the average height may be averaged to obtain the average size of the targets of that category. In some examples, the samples may be training samples (described later). That is, the average width and average height of the target and the average size of the target may be obtained by counting training samples. Thus, the average width and average height of the target, or the average size of the target, can be obtained from the training samples.
In some examples, the second output may be a matrix. In some examples, the size of the matrix corresponding to the second output may be M×N×A, where A may represent the total size of the target offsets, M×N may represent the resolution of the input image, and M and N may correspond to the rows and columns of the input image, respectively. In some examples, if one target offset is a 4×1 vector (i.e., can be represented by 4 numbers), A may be C×4, where C may represent the number of categories. For example, for fundus images targeting both the optic cup and the optic disc, the size of the matrix corresponding to the second output may be M×N×8.
In some examples, the value corresponding to the pixel at each location in the input image in the second output may be a vector. For example, for the pixel at the k-th position in the input image, the corresponding value in the second output may be denoted as v_k = [v_k1, v_k2, …, v_kC], where C may be the number of categories and each element of v_k may represent the target offset for the target of one category. Thus, the target offsets and the corresponding categories can be conveniently represented. In some examples, the elements of v_k may be 4-dimensional vectors.
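To make the M×N×(C×4) layout concrete, here is a small sketch that stores, for each pixel, C concatenated 4-element offset vectors and reads back the per-category element for one pixel; the flat class-by-class layout and all values are illustrative assumptions.

```python
M, N, C = 4, 4, 2  # toy resolution and two assumed categories (disc, cup)

# Second output as an M x N x (C*4) structure: for each pixel, C offset
# vectors of 4 numbers (tl, tt, tr, tb), stored class by class.
second_output = [[[0.0] * (C * 4) for _ in range(N)] for _ in range(M)]

def offset_for(output, row, col, c):
    # The per-category element at pixel (row, col): a 4-dimensional vector.
    vec = output[row][col]
    return vec[c * 4:(c + 1) * 4]

# Fill in an offset for category 1 at pixel (1, 2) and read it back.
second_output[1][2][4:8] = [0.1, 0.2, 0.3, 0.4]
assert offset_for(second_output, 1, 2, 1) == [0.1, 0.2, 0.3, 0.4]
assert len(offset_for(second_output, 0, 0, 0)) == 4
```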
In some examples, the backbone network 21 may be a U-Net-based network. In this embodiment, the encoding module of the backbone network 21 may include unit layers and pooling layers. The decoding module of the backbone network 21 may include unit layers, up-sampling layers, and skip-connection units.
In some examples, a unit layer may include a convolution layer, a batch normalization layer, and a rectified linear unit layer (ReLU). In some examples, the pooling layer may be a max-pooling layer. In some examples, the skip-connection unit may be used to combine image features from deep layers with image features from shallow layers.
In addition, the segmentation network 22 may be a feed-forward neural network. In some examples, the segmentation network 22 may include multiple unit layers. In some examples, the segmentation network 22 may include multiple unit layers and convolution layers (Conv).
In addition, the regression network 23 may include dilated convolution layers (Dilated Conv) and rectified linear unit layers. In some examples, the regression network 23 may include dilated convolution layers, rectified linear unit layers, and a convolution layer.
Fig. 4 is a schematic diagram showing another example of the network module 20 to which the examples of the present disclosure relate. In fig. 4, in order to describe the network structure of the network module 20 more clearly, the network layers in the network module 20 are distinguished by the numerals in the arrows, where arrow 1 represents a network layer (i.e., a unit layer) composed of a convolution layer, a batch normalization layer, and a rectified linear unit layer, arrow 2 represents a network layer composed of a dilated convolution layer and a rectified linear unit layer, arrow 3 represents a convolution layer, arrow 4 represents a max-pooling layer, arrow 5 represents an up-sampling layer, and arrow 6 represents a skip-connection unit.
As an example of the network module 20, as shown in fig. 4, an input image with a resolution of 256×256 may be input to the network module 20; image features are extracted through the unit layers (see arrow 1) and max-pooling layers (see arrow 4) of the encoding module at different levels, and image features of different scales are continually fused through the unit layers (see arrow 1), up-sampling layers (see arrow 5), and skip-connection units (see arrow 6) of the decoding module at different levels to obtain a feature map 221 consistent with the scale of the input image; the feature map 221 is then input to the segmentation network 22 and the regression network 23, respectively, to obtain the first output and the second output.
In addition, as shown in fig. 4, the segmentation network 22 may be composed, in order, of a unit layer (see arrow 1) and a convolution layer (see arrow 3), and the regression network 23 may be composed, in order, of several network layers each consisting of a dilated convolution layer and a rectified linear unit layer (see arrow 2), followed by a convolution layer (see arrow 3). A unit layer may be composed of a convolution layer, a batch normalization layer, and a rectified linear unit layer.
In some examples, the convolution kernels of the convolution layers in the network module 20 may be set to a size of 3×3. In some examples, the kernel of the max-pooling layers in the network module 20 may be set to a size of 2×2 with a stride of 2. In some examples, the scale factor of the up-sampling layers in the network module 20 may be set to 2. In some examples, as shown in fig. 4, the dilation factors of the successive dilated convolution layers in the network module 20 may be set to 1, 2, 4, 8, and 16 in sequence (see the numbers above arrow 2). In some examples, as shown in fig. 4, the number of max-pooling layers may be 5. Accordingly, the size of the input image should be divisible by 32 (32 being 2 to the 5th power).
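The divisibility remark above can be checked with a short sketch: five 2×2, stride-2 max-pooling layers halve the spatial resolution five times, so an input side length should divide evenly by 2^5 = 32 (the number of pooling layers is taken from fig. 4; everything else here is illustrative).

```python
def feature_map_sizes(size, num_pools=5):
    # Each 2x2 max-pool with stride 2 halves the spatial size, so with
    # num_pools pooling layers the input must divide evenly by 2**num_pools.
    sizes = [size]
    for _ in range(num_pools):
        assert sizes[-1] % 2 == 0, "resolution not divisible at this level"
        sizes.append(sizes[-1] // 2)
    return sizes

# A 256x256 input (as in fig. 4) shrinks to 8x8 at the deepest level.
assert feature_map_sizes(256) == [256, 128, 64, 32, 16, 8]
assert 256 % 2 ** 5 == 0
```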
As described above, the measurement method according to the present disclosure is a measurement method in which a target is identified by the network module 20 trained based on the target's tight-box labels, thereby realizing measurement. Hereinafter, the training method of the network module 20 (which may be simply referred to as the training method) according to the present disclosure will be described in detail with reference to the accompanying drawings. Fig. 5 is a flowchart illustrating a training method of the network module 20 according to an example of the present disclosure.
In some examples, the segmentation network 22 and the regression network 23 in the network module 20 may be trained simultaneously on an end-to-end basis.
In some examples, the segmentation network 22 and the regression network 23 in the network module 20 may be trained jointly to optimize the segmentation network 22 and the regression network 23 simultaneously. In some examples, the segmentation network 22 and the regression network 23 may adjust network parameters of the backbone network 21 through back propagation through joint training to enable the feature map output by the backbone network 21 to better express features of the input image and input the segmentation network 22 and the regression network 23. In this case, the split network 22 and the regression network 23 each perform processing based on the feature map output from the backbone network 21.
In some examples, the segmentation network 22 may be trained using multiple instance learning. In some examples, the pixels used to train the regression network 23 (described later) may be selected using the expected intersection-over-union corresponding to each pixel of the image to be trained.
In some examples, as shown in fig. 5, the training method may include constructing training samples (step S120), inputting the training samples into the network module 20 to obtain predictive data (step S140), and determining training loss of the network module 20 based on the training samples and the predictive data and optimizing the network module 20 based on the training loss (step S160). Thereby, an optimized (which may also be referred to as trained) network module 20 can be obtained.
In some examples, in step S120, a training sample may be constructed. The training samples may include input image data and label data. In some examples, the input image data may include a plurality of images to be trained. For example, the image to be trained may be a fundus image to be trained.
In some examples, the plurality of images to be trained may include images containing the target. In some examples, the plurality of images to be trained may include images containing the target and images not containing the target. In some examples, the target may belong to at least one category. In some examples, the number of targets of each category in an image to be trained may be 1 or more. For example, taking a fundus image as an example, if the optic cup and optic disc are identified or measured, the targets in the fundus image may be one optic disc and one optic cup; that is, there are two targets to be identified or measured in the fundus image, and the number of targets of each category may be 1. If microaneurysms are identified or measured, the targets in the fundus image may be at least one microaneurysm. Examples of the present disclosure are not intended to limit the number of targets, the categories to which the targets belong, or the number of targets of each category.
In some examples, the label data may include the gold-standard category to which the target belongs (which may sometimes be referred to as the true category) and the gold-standard tight box of the target (which may sometimes be referred to as the true tight box). That is, the label data may be the true category to which the target belongs and the true tight box of the target in the image to be trained. It should be noted that, unless otherwise specified, the tight box of the target and the category to which the target belongs in the label data in the training method may be gold standards by default.
In some examples, the image to be trained may be annotated to obtain tag data. In some examples, the images to be trained may be annotated with an annotation tool, such as a line annotation system. Specifically, a close frame label (i.e., a minimum circumscribed rectangle) of a target in the image to be trained can be labeled by using a labeling tool, and a corresponding category is set for the close frame label to represent a true category to which the target belongs.
In some examples, to inhibit overfitting of the network module 20, data augmentation may be performed on the training samples. In some examples, the data augmentation may include, but is not limited to, flipping (e.g., vertical or horizontal flipping), scaling, rotation, contrast adjustment, brightness adjustment, or color balancing. In some examples, the same data augmentation may be performed on the input image data and the label data in the training samples. This keeps the input image data consistent with the label data.
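The requirement that the same augmentation be applied to the image and its label data can be sketched as follows for a horizontal flip: the tight box (xl, yt, xr, yb) is mirrored about the image width so that it still tightly encloses the flipped target (the toy image and box below are assumptions for illustration).

```python
def hflip_image(img):
    # img: a list of rows of pixel values; mirror each row left-right.
    return [list(reversed(row)) for row in img]

def hflip_box(box, width):
    # Mirror a tight box (xl, yt, xr, yb) about the image width:
    # the left/right x-coordinates swap and are reflected.
    xl, yt, xr, yb = box
    return (width - 1 - xr, yt, width - 1 - xl, yb)

def crop(img, box):
    xl, yt, xr, yb = box
    return [row[xl:xr + 1] for row in img[yt:yb + 1]]

img = [[0, 1, 2, 3],
       [4, 5, 6, 7]]
box = (0, 0, 1, 1)  # tight box around pixel values {0, 1, 4, 5}

f_img, f_box = hflip_image(img), hflip_box(box, width=4)
assert f_box == (2, 0, 3, 1)
# The flipped box covers the same (now mirrored) pixel values.
assert crop(f_img, f_box) == [list(reversed(r)) for r in crop(img, box)]
```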
In some examples, in step S140, training samples may be input to network module 20 to obtain predictive data. As described above, the network module 20 may include the segmentation network 22 and the regression network 23. In some examples, predictive data corresponding to the training samples may be obtained by network module 20 based on the input image data of the training samples. The prediction data may include prediction segmentation data output by the segmentation network 22 and prediction offsets output by the regression network 23.
In addition, the prediction segmentation data may correspond to the first output, and the prediction offset may correspond to the second output (i.e., may correspond to the target offset). That is, the prediction segmentation data may include the probabilities that the respective pixels in the image to be trained belong to the respective categories, and the prediction offset may include the offsets of the positions of the respective pixels in the image to be trained from the tight box of the target of each category. In some examples, corresponding to the target offset, the prediction offset may be an offset normalized based on the average size of the targets of the respective category. Thus, the accuracy of identifying or measuring targets whose sizes do not vary greatly can be improved. Preferably, the sizes of multiple targets of the same category may differ from each other by less than a factor of 10. For example, the sizes of multiple targets of the same category may differ from each other by a factor of 1, 2, 3, 5, 7, 8, or 9, etc. Thereby, the accuracy of identification or measurement of the target can be further improved.
To describe more clearly the offset of a pixel from the tight box of a target, as well as the normalized offset, a description is given below in conjunction with the formula. It should be noted that the prediction offset, the target offset, and the true offset are each a kind of offset, and equation (1) below applies equally to all of them.
Specifically, if the position of a pixel is expressed as (x, y), the tight box of one target corresponding to the pixel is expressed as b = (xl, yt, xr, yb), and the offset of the tight box b of the target relative to the position of the pixel (i.e., the offset of the position of the pixel from the tight box of the target) is expressed as t = (tl, tt, tr, tb), then tl, tt, tr, tb may satisfy formula (1):
tl=(x-xl)/Sc1,
tt=(y-yt)/Sc2,
tr=(xr-x)/Sc1,
tb=(yb-y)/Sc2,
where xl, yt may represent the position of the top-left corner of the target's tight box, xr, yb may represent the position of the bottom-right corner of the target's tight box, c may represent the index of the category to which the target belongs, Sc1 may represent the average width of the targets of the c-th category, and Sc2 may represent the average height of the targets of the c-th category. Thereby, normalized offsets can be obtained. In some examples, Sc1 and Sc2 may both be the average size of the targets of the c-th category.
Examples of the present disclosure are not limited thereto; in other examples, the tight box of a target may be represented by the positions of the bottom-left and top-right corners, or by the position of any one corner together with the length and width. In addition, in other examples, other normalization approaches may be used; for example, the offset may be normalized by the length and width of the target's tight box.
In addition, the pixel points in the formula (1) may be the pixel points of the image to be trained or the input image. That is, equation (1) may be applied to a true offset corresponding to an image to be trained in the training phase and a target offset corresponding to an input image in the measurement phase.
Specifically, for the training phase, the pixel may be a pixel in the image to be trained, the tight box b of the target may be the gold-standard tight box of the target in the image to be trained, and the offset t may be the true offset (which may also be referred to as the gold-standard offset). Thus, the regression loss of the regression network 23 can subsequently be obtained based on the prediction offset and the true offset. In addition, if the pixel is a pixel in the image to be trained and the offset t is the prediction offset, the predicted tight box of the target can be derived by inverting formula (1).
In addition, for the measurement phase, the pixel may be a pixel in the input image and the offset t may be the target offset; the tight box of the target in the input image can then be derived by inverting formula (1) (i.e., the target offset and the position of the pixel may be substituted into formula (1) to solve for the tight box of the target). Thereby, the tight box of the target in the input image can be obtained.
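As a sketch of formula (1) and its inversion, the following computes a normalized offset from a pixel and a tight box, and recovers the box from the offset; the concrete box coordinates and average sizes are illustrative assumptions.

```python
def normalized_offset(px, py, box, avg_w, avg_h):
    # Formula (1): offset of tight box b = (xl, yt, xr, yb) relative to
    # pixel (x, y), normalized by the category's average width/height.
    xl, yt, xr, yb = box
    return ((px - xl) / avg_w, (py - yt) / avg_h,
            (xr - px) / avg_w, (yb - py) / avg_h)

def box_from_offset(px, py, t, avg_w, avg_h):
    # Inverse of formula (1): recover the tight box from an offset.
    tl, tt, tr, tb = t
    return (px - tl * avg_w, py - tt * avg_h,
            px + tr * avg_w, py + tb * avg_h)

box = (10.0, 20.0, 50.0, 60.0)
t = normalized_offset(30.0, 40.0, box, avg_w=40.0, avg_h=40.0)
assert t == (0.5, 0.5, 0.5, 0.5)
assert box_from_offset(30.0, 40.0, t, 40.0, 40.0) == box
```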
In some examples, in step S160, a training loss for network module 20 may be determined based on the training samples and the prediction data and network module 20 may be optimized based on the training loss. In some examples, a training loss for network module 20 may be determined based on the label data, the predictive segmentation data, and the predictive offset corresponding to the training samples, and then network module 20 may be trained based on the training loss to optimize network module 20.
As described above, the network module 20 may include the segmentation network 22 and the regression network 23. In some examples, the training loss may include a segmentation loss of the segmentation network 22 and a regression loss of the regression network 23. That is, the training loss of the network module 20 may be obtained based on the segmentation loss and the regression loss. Thereby, the network module 20 can be optimized based on the training loss. In some examples, the training loss may be a sum of a segmentation loss and a regression loss. In some examples, the segmentation loss may represent a degree to which pixels in the image to be trained in the predictive segmentation data belong to each true class, and the regression loss may represent a proximity of the predictive offset to the true offset.
Fig. 6 is a schematic diagram illustrating a positive bag to which examples of the present disclosure relate.
In some examples, the segmentation loss of the segmentation network 22 may be obtained based on the prediction segmentation data and the label data corresponding to the training samples. This allows the prediction segmentation data of the segmentation network 22 to approximate the label data through the segmentation loss. In some examples, the segmentation loss may be obtained using multiple instance learning. In multiple instance learning, multiple bags to be trained may be acquired by category based on the true tight boxes of the targets in the respective images to be trained (i.e., each category may correspond to its own plurality of bags to be trained). The segmentation loss may be obtained based on the multiple bags to be trained of each category. In some examples, the multiple bags to be trained may include multiple positive bags and multiple negative bags. Thus, the segmentation loss can be acquired based on the positive and negative bags of multiple instance learning. Note that, unless otherwise specified, the positive bags and negative bags below are defined per category.
In some examples, multiple positive bags may be acquired based on the area within the true tight box of the target. As shown in fig. 6, the region A2 in the image to be trained P1 is the region within the true tight box B21 of the target T1.
In some examples, all pixels on each of a plurality of straight lines connecting two sides of the true tight box of the target may be grouped into one positive bag (i.e., one straight line may correspond to one positive bag). Specifically, the two ends of each straight line may lie on the top and bottom sides, or on the left and right sides, of the true tight box. As an example, as shown in fig. 6, the pixels on the straight lines D1, D2, D3, D4, D5, D6, D7, and D8 may each be grouped into one positive bag. Examples of the present disclosure are not limited thereto, and in other examples, other ways of forming positive bags may be used. For example, the pixels at specific positions of the true tight box may be grouped into a positive bag.
In some examples, the plurality of straight lines may include at least one set of first parallel lines that are parallel to each other. For example, the plurality of straight lines may include one set of first parallel lines, two sets of first parallel lines, three sets of first parallel lines, four sets of first parallel lines, or the like. In some examples, the number of straight lines in the first parallel line may be 2 or more.
In some examples, the plurality of straight lines may include at least one set of first parallel lines parallel to each other and second parallel lines parallel to each other perpendicular to each set of first parallel lines, respectively. Specifically, if the plurality of straight lines includes a set of first parallel lines, the plurality of straight lines may further include a set of second parallel lines perpendicular to the set of first parallel lines, and if the plurality of straight lines includes a plurality of sets of first parallel lines, the plurality of straight lines may further include a plurality of sets of second parallel lines perpendicular to each set of first parallel lines, respectively. As shown in fig. 6, one set of first parallel lines may include parallel straight lines D1 and D2, one set of second parallel lines corresponding to the one set of first parallel lines may include parallel straight lines D3 and D4, wherein the straight line D1 may be perpendicular to the straight line D3, and the other set of first parallel lines may include parallel straight lines D5 and D6, and one set of second parallel lines corresponding to the one set of first parallel lines may include parallel straight lines D7 and D8, wherein the straight line D5 may be perpendicular to the straight line D7. In some examples, the number of straight lines in the first parallel line and the second parallel line may be 2 or more.
As described above, in some examples, the plurality of straight lines may include multiple sets of first parallel lines (i.e., the plurality of straight lines may include parallel lines at different angles). In this case, positive bags at different angles can be formed to optimize the segmentation network 22. This can improve the accuracy of the prediction segmentation data of the segmentation network 22.
In some examples, the angle of a set of first parallel lines may be defined as the angle between the extension of the first parallel lines and the extension of either side of the true tight box that does not intersect them, and the angle may be greater than -90° and less than 90°. For example, the angle may be -89°, -75°, -50°, -25°, -20°, 0°, 10°, 20°, 25°, 50°, 75°, 89°, etc. Specifically, if rotating the extension of the non-intersecting side clockwise by less than 90° aligns it with the extension of the first parallel lines, the angle is greater than 0° and less than 90°; if a counterclockwise rotation of less than 90° (i.e., a clockwise rotation of more than 270°) aligns them, the angle is greater than -90° and less than 0°; and if the non-intersecting side is parallel to the first parallel lines, the angle is 0°. As shown in fig. 6, the angles of the straight lines D1, D2, D3, and D4 may be 0°, and the angle of the straight lines D5, D6, D7, and D8 (i.e., angle C1) may be 25°. In some examples, the angle of the first parallel lines may be a hyperparameter and may be optimized during training.
In addition, the angle of the first parallel lines may also be described in terms of rotating the image to be trained: the angle of the first parallel lines may be the rotation angle by which the image to be trained is rotated so that any side of the image that does not intersect the first parallel lines becomes parallel to them, where the angle may be greater than -90° and less than 90°, clockwise rotations count as positive degrees, and counterclockwise rotations count as negative degrees.
Examples of the present disclosure are not limited thereto, however; in other examples, the angle of the first parallel line may fall in other ranges, depending on the manner in which the angle is described. For example, if described based on a side of the real tight box label that intersects the first parallel line, the angle of the first parallel line may also be greater than 0° and less than 180°.
In some examples, multiple negative bags may be acquired based on the area outside the real tight box labels of the targets. As shown in fig. 6, the region A1 in the image to be trained P1 is the region outside the real tight box label B21 of the target T1. In some examples, a negative bag may be a single pixel in the area outside the real tight box labels of all targets of a category (i.e., one pixel may correspond to one negative bag).
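As an illustration, the bag construction described above can be sketched as follows. The helper `make_bags` is hypothetical and assumes a line angle of 0°, so the crossing lines reduce to the rows and columns of the box; each clipped crossing line becomes a positive bag and each outside pixel becomes a one-pixel negative bag:

```python
import numpy as np

def make_bags(image_shape, box):
    """Build MIL bags for one class from one real tight box.

    box = (x0, y0, x1, y1), inclusive pixel indices, angle 0 degrees.
    Positive bags: the pixels of each horizontal/vertical crossing line
    clipped to the box (the tightness prior guarantees each such line
    touches the target). Negative bags: one bag per pixel outside the box.
    """
    h, w = image_shape
    x0, y0, x1, y1 = box
    positive = []
    for y in range(y0, y1 + 1):                 # horizontal crossing lines
        positive.append([(y, x) for x in range(x0, x1 + 1)])
    for x in range(x0, x1 + 1):                 # vertical crossing lines
        positive.append([(y, x) for y in range(y0, y1 + 1)])
    inside = np.zeros(image_shape, dtype=bool)
    inside[y0:y1 + 1, x0:x1 + 1] = True
    negative = [[(y, x)] for y, x in zip(*np.nonzero(~inside))]
    return positive, negative

pos, neg = make_bags((8, 8), (2, 3, 5, 6))      # 4x4 box in an 8x8 image
# 4 rows + 4 columns give 8 positive bags; the 48 outside pixels give
# 48 single-pixel negative bags
```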
As described above, in some examples, the segmentation loss may be obtained based on the multiple bags to be trained of each category. In some examples, the segmentation loss may include a unary term (also referred to as a unary loss) and a pairwise term (also referred to as a pairwise loss). In some examples, the unary term may describe the extent to which each bag to be trained belongs to its true category. In this case, the tight box label can be constrained by both the positive bags and the negative bags through the unary loss. In some examples, the pairwise term may describe the extent to which a pixel of the image to be trained belongs to the same category as its adjacent pixels. In this case, the pairwise loss smooths the predicted segmentation result.
In some examples, the segmentation loss may be obtained category by category, and the segmentation loss (i.e., the total segmentation loss) may be obtained based on the per-category losses. In some examples, the total segmentation loss $L_{seg}$ may satisfy the formula:

$$L_{seg}=\sum_{c=1}^{C}L_{c}$$

where $L_c$ may represent the segmentation loss of category $c$, and $C$ may represent the number of categories. For example, $C$ may be 2 if both the optic cup and the optic disc in a fundus image are identified, and $C$ may be 1 if only the cup or only the disc is identified.
In some examples, the segmentation loss $L_c$ of category $c$ may satisfy the formula:

$$L_{c}=\phi_{c}\left(P;\,\mathcal{B}^{+},\mathcal{B}^{-}\right)+\lambda\,\psi_{c}\left(P\right)$$

where $\phi_c$ may represent the unary term, $\psi_c$ may represent the pairwise term, $P$ may represent the degree (also referred to as probability) to which each pixel predicted by the segmentation network 22 belongs to the respective category, $\mathcal{B}^{+}$ may represent the set of positive bags, $\mathcal{B}^{-}$ may represent the set of negative bags, and $\lambda$ may represent a weight factor. The weight factor $\lambda$ may be a hyper-parameter that can be optimized during the training process. In some examples, the weight factor $\lambda$ may be used to balance the two losses (i.e., the unary term and the pairwise term).
In general, in multiple-instance learning, since each positive bag of a category contains at least one pixel belonging to that category, the pixel with the highest probability of belonging to the category in each positive bag may be taken as a positive sample of the category; and since no pixel belonging to the category exists in any negative bag of the category, even the pixel with the highest probability in a negative bag is taken as a negative sample of the category. Based on this, in some examples, the unary term $\phi_c$ corresponding to category $c$ may satisfy the formula:

$$\phi_{c}=-\frac{1}{\left|\mathcal{B}_{c}^{+}\right|}\sum_{b\in\mathcal{B}_{c}^{+}}\log P_{c}(b)-\beta\sum_{b\in\mathcal{B}_{c}^{-}}\left(P_{c}(b)\right)^{\gamma}\log\left(1-P_{c}(b)\right)$$

where $P_c(b)$ may represent the probability that the bag to be trained $b$ belongs to category $c$ (also referred to as the degree of belonging to category $c$, or the bag probability), $\mathcal{B}_c^{+}$ may represent the set of positive bags, $\mathcal{B}_c^{-}$ may represent the set of negative bags, $\left|\mathcal{B}_c^{+}\right|$ may represent the cardinality of the set of positive bags (i.e., the number of elements of the set), $\beta$ may represent a weighting factor, and $\gamma$ may represent a focusing parameter. In some examples, the value of the unary term is minimal when $P_c(b)$ equals 1 for every positive bag and equals 0 for every negative bag; that is, the unary loss is then minimal.
In some examples, the weighting factor $\beta$ may be between 0 and 1. In some examples, the focusing parameter $\gamma$ may be equal to or greater than 0.
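A minimal sketch of a unary term of this kind is given below. The exact form is an assumption (mean negative log-probability over positive bags plus a focal-style term over negative bags, weighted by `beta` with focusing parameter `gamma`), chosen to be consistent with the stated property that the loss is minimal when positive bag probabilities are 1 and negative bag probabilities are 0:

```python
import numpy as np

def unary_term(pos_probs, neg_probs, beta=0.25, gamma=2.0):
    """Hypothetical unary MIL loss for one class.

    pos_probs / neg_probs: bag probabilities P_c(b), i.e. the maximum
    class probability over the pixels of each positive / negative bag.
    Positive bags are pushed toward 1; negative bags carry a focal-style
    weight P^gamma so that confident false positives dominate the sum.
    """
    pos = np.asarray(pos_probs, dtype=float)
    neg = np.asarray(neg_probs, dtype=float)
    eps = 1e-8                                    # numerical safety
    pos_loss = -np.mean(np.log(pos + eps))        # averaged over |B+|
    neg_loss = -beta * np.sum(neg ** gamma * np.log(1.0 - neg + eps))
    return float(pos_loss + neg_loss)

print(unary_term([1.0, 1.0], [0.0, 0.0]))   # perfect predictions -> ~0
```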
In some examples, $P_c(b)$ may be the highest probability of belonging to category $c$ among the pixels of the bag to be trained. In some examples, $P_c(b)$ may satisfy the formula $P_c(b)=\max_{k\in b}\left(p_{kc}\right)$, where $p_{kc}$ may represent the probability that the pixel at the $k$-th position of the bag to be trained $b$ belongs to category $c$.
In some examples, the highest probability of belonging to a category among the pixels of a bag to be trained (i.e., $P_c(b)$) may be obtained by a smooth maximum approximation function. Thus, a relatively stable maximum probability can be obtained. In some examples, the smooth maximum approximation function may be at least one of the α-softmax function and the α-quasimax function.
In some examples, for the maximum function $f(x)=\max_{1\le i\le n}x_i$, where $\max$ may represent the maximum function, $n$ may represent the number of elements (which may correspond to the number of pixels in the bag to be trained), and $x_i$ may represent the value of an element (which may correspond to the probability that the pixel at the $i$-th position of the bag to be trained belongs to one category), the α-softmax function can satisfy the formula:

$$f_{\alpha}(x)=\frac{\sum_{i=1}^{n}x_{i}\,e^{\alpha x_{i}}}{\sum_{i=1}^{n}e^{\alpha x_{i}}}$$

where $\alpha$ may be a constant. In some examples, the larger $\alpha$ is, the closer $f_{\alpha}(x)$ is to the maximum of the maximum function.
In addition, the α -quasimax function can satisfy the formula:
Where α may be constant. In some examples, the larger α is, the closer to the maximum of the maximum function.
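The two smooth maximum approximations can be sketched as follows; both subtract the true maximum before exponentiating for numerical stability, which leaves the results unchanged:

```python
import numpy as np

def alpha_softmax(x, alpha=4.0):
    """Smooth max approximation: sum(x_i e^{a x_i}) / sum(e^{a x_i})."""
    x = np.asarray(x, dtype=float)
    w = np.exp(alpha * (x - x.max()))       # shifted for stability
    return float((x * w).sum() / w.sum())

def alpha_quasimax(x, alpha=4.0):
    """Smooth max approximation: (1/a) log sum e^{a x_i} - (log n)/a.

    The -(log n)/a correction keeps the value <= max(x).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    lse = x.max() + np.log(np.exp(alpha * (x - x.max())).sum()) / alpha
    return float(lse - np.log(n) / alpha)

probs = [0.1, 0.3, 0.9]
# both values approach max(probs) = 0.9 as alpha grows
```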
As described above, in some examples, the pairwise term may describe the extent to which a pixel of the image to be trained belongs to the same category as its adjacent pixels. That is, the pairwise term may evaluate how close the probabilities of adjacent pixels belonging to the same category are. In some examples, the pairwise term $\psi_c$ corresponding to category $c$ may satisfy the formula:

$$\psi_{c}=\frac{1}{\left|\varepsilon\right|}\sum_{(k,k')\in\varepsilon}\left(p_{kc}-p_{k'c}\right)^{2}$$

where $\varepsilon$ may represent the set of all adjacent pixel pairs, $(k,k')$ may represent a pair of adjacent pixels, $k$ and $k'$ may represent the positions of the two pixels of the adjacent pixel pair, respectively, $p_{kc}$ may represent the probability that the pixel at the $k$-th position belongs to category $c$, and $p_{k'c}$ may represent the probability that the pixel at the $k'$-th position belongs to category $c$.
In some examples, the adjacent pixels may be the pixels in the eight-neighborhood or the four-neighborhood of a pixel. In some examples, the adjacent pixels of each pixel in the image to be trained may be collected to obtain the set of adjacent pixel pairs.
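Under the assumption of a squared-difference penalty averaged over the number of adjacent pairs, the pairwise term for a four-neighborhood can be sketched as:

```python
import numpy as np

def pairwise_term(p):
    """Pairwise smoothness for one class probability map p (H x W).

    Assumed squared-difference form over four-neighborhood pairs,
    averaged over the number of pairs; constant maps give 0.
    """
    p = np.asarray(p, dtype=float)
    vert = (p[1:, :] - p[:-1, :]) ** 2    # vertical neighbour pairs
    horz = (p[:, 1:] - p[:, :-1]) ** 2    # horizontal neighbour pairs
    total = vert.sum() + horz.sum()
    n_pairs = vert.size + horz.size
    return float(total / n_pairs)

print(pairwise_term(np.ones((2, 2))))   # constant map -> 0.0
```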
As described above, the training loss may include a regression loss. In some examples, the regression loss of the regression network 23 may be obtained based on the predicted offsets corresponding to the training samples and the true offsets corresponding to the label data. In this case, the predicted offsets of the regression network 23 can be driven toward the true offsets through the regression loss. In some examples, the true offset may be the offset of the position of a pixel of the image to be trained relative to a real tight box label of a target in the label data. In some examples, corresponding to the predicted offset, the true offset may be an offset normalized based on the average size of the targets of each category. For details, see the description of the offset in equation (1) above.
In some examples, the regression network 23 may be trained by selecting a corresponding pixel from pixels in the image to be trained as a positive sample. That is, the regression network 23 may be optimized with positive samples. Specifically, regression losses may be obtained based on positive samples, and then the regression network 23 is optimized with the regression losses.
In some examples, the regression loss $L_{reg}$ may satisfy the formula:

$$L_{reg}=\sum_{c=1}^{C}\frac{1}{M_{c}}\sum_{i=1}^{M_{c}}s\left(t_{ic}-v_{ic}\right)$$

where $C$ may represent the number of categories, $M_c$ may represent the number of positive samples of the $c$-th category, $t_{ic}$ may represent the true offset corresponding to the $i$-th positive sample of the $c$-th category, $v_{ic}$ may represent the predicted offset corresponding to the $i$-th positive sample of the $c$-th category, and $s(x)$ may represent the sum of the smooth L1 losses of all elements of $x$. In some examples, $s(t_{ic}-v_{ic})$ may represent the degree, calculated using the smooth L1 loss, to which the predicted offset corresponding to the $i$-th positive sample of the $c$-th category is consistent with the true offset corresponding to that positive sample. Here, the positive samples may be the pixels of the image to be trained selected for training the regression network 23 (i.e., for calculating the regression loss). Thus, the regression loss can be obtained.
In some examples, the true offset corresponding to a positive sample may be the offset corresponding to a real tight box label. In some examples, the true offset corresponding to a positive sample may be the offset corresponding to the matching tight box label. Thus, the case where a positive sample falls within multiple real tight box labels can be handled.
In some examples, the smooth L1 loss function may satisfy the formula:

$$\mathrm{smooth}_{L1}(x)=\begin{cases}0.5\,\sigma^{2}x^{2}, & \left|x\right|<\dfrac{1}{\sigma^{2}}\\ \left|x\right|-\dfrac{0.5}{\sigma^{2}}, & \text{otherwise}\end{cases}$$

where $\sigma$ may represent a hyper-parameter that controls the switching point between the squared (L2-like) branch and the linear (L1-like) branch, and $x$ may represent the variable of the smooth L1 loss function.
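The regression loss and the σ-parameterized smooth L1 above can be sketched as follows for one category; the Faster R-CNN style switching point at $1/\sigma^2$ is an assumption consistent with the role of σ described above:

```python
import numpy as np

def smooth_l1(x, sigma=3.0):
    """Elementwise smooth L1 with switching point 1/sigma^2 (assumed
    Faster R-CNN parameterization): quadratic near 0, linear elsewhere."""
    x = np.asarray(x, dtype=float)
    quad = np.abs(x) < 1.0 / sigma ** 2
    return np.where(quad,
                    0.5 * (sigma * x) ** 2,
                    np.abs(x) - 0.5 / sigma ** 2)

def regression_loss(true_offsets, pred_offsets):
    """Mean over positive samples of s(t_i - v_i) for one category.

    true_offsets / pred_offsets: (M, 4) arrays of normalized offsets;
    s() sums the smooth L1 loss over the four offset elements.
    """
    t = np.asarray(true_offsets, dtype=float)
    v = np.asarray(pred_offsets, dtype=float)
    per_sample = smooth_l1(t - v).sum(axis=1)
    return float(per_sample.mean())

t = np.zeros((2, 4))
print(regression_loss(t, t))   # perfect prediction -> 0.0
```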
As described above, in some examples, the regression network 23 may be trained by selecting respective pixels from pixels in the image to be trained as positive samples.
In some examples, the positive samples may be the pixels of the image to be trained that fall within the real tight box label of at least one target (i.e., such pixels may be selected as positive samples). In this case, optimizing the regression network 23 based on pixels that fall within the real tight box label of at least one target can improve the efficiency of the optimization. In some examples, the pixels that fall within the real tight box label of at least one target may be selected from the image to be trained by category as positive samples of the respective categories. In some examples, the regression loss of each category may be obtained based on the positive samples of that category. As described above, pixels that fall within the real tight box label of at least one target can be selected from the image to be trained by category as positive samples of the respective categories. In some examples, the positive samples of each category may be further screened, and the regression network 23 may be optimized based on the screened positive samples. That is, the positive samples used to calculate the regression loss may be the positive samples after screening.
In some examples, after the positive samples of the respective classes are obtained (i.e., after the pixel points in the real tight boxes of at least one target are selected from the image to be trained as the positive samples), the matching tight boxes corresponding to the positive samples may be obtained, and then the positive samples of the respective classes may be screened based on the matching tight boxes. Thus, the regression network 23 can be optimized using positive samples of each class screened based on the matching close-frame markers.
In some examples, the real tight box labels that a pixel (e.g., a positive sample) falls into may be screened to obtain the matching tight box label of the pixel. In some examples, the matching tight box label may be, among the real tight box labels the pixel falls into, the one whose true offset relative to the position of the pixel is smallest. For a positive sample, the matching tight box label may be the real tight box label, among those the positive sample falls into, with the smallest true offset relative to the position of the positive sample.
Specifically, within one category, if a pixel (e.g., a positive sample) falls within the real tight box label of only one target, that real tight box label is taken as the matching tight box label; if the pixel falls within the real tight box labels of multiple targets, the real tight box label whose true offset relative to the position of the pixel is smallest may be taken as the matching tight box label. Thus, the matching tight box label corresponding to the pixel can be obtained.
In some examples, the smallest true offset (i.e., the real tight box label with the smallest true offset) may be obtained by comparing the L1 norms of the true offsets. In this case, the smallest true offset can be obtained based on the L1 norm, and the matching tight box label can thereby be obtained. Specifically, the absolute values of the elements of each of the multiple true offsets may be summed to obtain multiple offset values, and the true offset with the smallest offset value may be obtained as the smallest true offset by comparing these offset values.
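The L1-norm comparison can be sketched as follows; `matching_box` is a hypothetical helper that returns the index of the real tight box whose true offset has the smallest sum of absolute elements:

```python
import numpy as np

def matching_box(true_offsets):
    """Pick the matching tight box for one pixel.

    true_offsets: (N, 4) array, one row per real tight box the pixel
    falls into. The match is the box whose true offset has the smallest
    L1 norm (sum of absolute values of its elements).
    """
    values = np.abs(np.asarray(true_offsets, dtype=float)).sum(axis=1)
    return int(values.argmin())

offsets = [[0.5, 0.5, 0.5, 0.5],    # L1 norm = 2.0
           [0.1, 0.2, 0.1, 0.1]]    # L1 norm = 0.5 -> matched
print(matching_box(offsets))        # -> 1
```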
In some examples, the positive samples of each category may be screened using the expected intersection ratio (expected IoU) corresponding to each pixel (e.g., positive sample). In this case, pixels far from the center of the real or matching tight box label can be screened out. Thereby, the adverse effect of pixels far from the center on the optimization of the regression network 23 can be reduced, and the efficiency of the optimization can be improved.
In some examples, the expected intersection ratio corresponding to a positive sample may be obtained based on its matching tight box label, and the positive samples of each category may be screened based on the expected intersection ratio. Specifically, after the positive samples of each category are obtained, the matching tight box label corresponding to each positive sample may be obtained; the expected intersection ratio corresponding to the positive sample is then obtained based on the matching tight box label, the positive samples of each category are screened based on the expected intersection ratio, and finally the regression network 23 may be optimized with the screened positive samples of each category. However, examples of the present disclosure are not limited thereto. In some examples, the pixels of the image to be trained may be screened by category directly using their expected intersection ratios (i.e., without first selecting the pixels that fall within the real tight box label of at least one target as positive samples). In addition, pixels that do not fall within any real tight box label (i.e., pixels without a matching tight box label) may be marked, for example by setting their expected intersection ratio to 0, so that they can be conveniently screened out subsequently. Specifically, the pixels of the image to be trained may be screened by category based on their expected intersection ratios, and the regression network 23 may be optimized based on the screened pixels.
In some examples, pixels whose expected intersection ratio is greater than a preset expected intersection ratio may be screened from the pixels of the image to be trained to optimize the regression network 23. In some examples, positive samples whose expected intersection ratio is greater than the preset expected intersection ratio may be screened from the positive samples of each category to optimize the regression network 23. Thus, pixels (e.g., positive samples) that meet the preset expected intersection ratio can be obtained. In some examples, the preset expected intersection ratio may be greater than 0 and less than or equal to 1. For example, the preset expected intersection ratio may be 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1, etc. In some examples, the preset expected intersection ratio may be a hyper-parameter, and may be adjusted during the training process of the regression network 23.
In some examples, the expected intersection ratio of a pixel may be obtained based on the matching tight box label of the pixel (e.g., a positive sample). In some examples, a pixel that has no matching tight box label may be ignored, or its expected intersection ratio may be set to 0. In this case, pixels without a matching tight box label are either not used for training the regression network 23 or contribute less to the regression loss. Unless otherwise specified, the following description of the expected intersection ratio corresponding to a pixel applies equally to the expected intersection ratio corresponding to a positive sample.
In some examples, the expected intersection ratio may be the maximum of the intersection ratios (Intersection over Union, IoU) between the matching tight box label of a pixel and multiple boxes constructed centered on the pixel. Thus, the expected intersection ratio can be obtained. However, examples of the present disclosure are not limited thereto; in other examples, the expected intersection ratio may be the maximum of the intersection ratios between the real tight box label of a pixel and multiple boxes constructed centered on the pixel.
In some examples, multiple boxes constructed with a pixel of the image to be trained as their center point can be obtained, and the maximum of the intersection ratios between these boxes and the matching tight box label of the pixel is taken as the expected intersection ratio. In some examples, the dimensions of the multiple boxes may differ; specifically, each box may differ from the others in width or height.
Fig. 7 is a schematic diagram showing a frame constructed centering on a pixel point according to an example of the present disclosure.
For a clearer description of the desired overlap ratio, the following description is made in connection with fig. 7. As shown in fig. 7, the pixel M1 has a matching tight border B31, and the border B32 is an exemplary border constructed centering on the pixel M1.
In some examples, let $W$ be the width of the matching tight box label and $H$ its height, and let $(r_1 W, r_2 H)$ represent the position of the pixel, where $r_1, r_2$ give the relative position of the pixel within the matching tight box label and satisfy $0<r_1,r_2<1$. Multiple boxes can then be constructed centered on the pixel. As an example, as shown in fig. 7, the position of the pixel M1 may be represented as $(r_1 W, r_2 H)$, and the width and height of the matching tight box label B31 may be $W$ and $H$, respectively.
In some examples, the matching tight fiducial may be divided into four regions with two centerlines of the matching tight fiducial. The four regions may be an upper left region, an upper right region, a lower left region, and a lower right region. For example, as shown in fig. 7, the center line D9 and the center line D10 of the matching tight-frame mark B31 may divide the matching tight-frame mark B31 into an upper left area A3, an upper right area A4, a lower left area A5, and a lower right area A6.
The expected intersection ratio is described below taking a pixel in the upper-left region (i.e., $r_1,r_2$ satisfying $0<r_1,r_2\le 0.5$) as an example. For example, as shown in fig. 7, the pixel M1 may be a point in the upper-left area A3.
First, multiple boxes centered on the pixel are constructed. Specifically, for $0<r_1,r_2\le 0.5$, the four boundary conditions corresponding to the pixel M1 may be:
$$\begin{aligned}
w_{1}&=2r_{1}W, & h_{1}&=2r_{2}H\\
w_{2}&=2r_{1}W, & h_{2}&=2(1-r_{2})H\\
w_{3}&=2(1-r_{1})W, & h_{3}&=2r_{2}H\\
w_{4}&=2(1-r_{1})W, & h_{4}&=2(1-r_{2})H
\end{aligned}$$

where $w_j$ and $h_j$ ($j=1,2,3,4$) may represent the width and height of the box under the $j$-th boundary condition.
Second, the intersection ratio between the box under each boundary condition and the matching tight box label is calculated. Specifically, the intersection ratios corresponding to the four boundary conditions may satisfy equation (2):
$$\begin{aligned}
\mathrm{IoU}_{1}(r_{1},r_{2})&=4r_{1}r_{2}\\
\mathrm{IoU}_{2}(r_{1},r_{2})&=\frac{2r_{1}}{2r_{1}(1-2r_{2})+1}\\
\mathrm{IoU}_{3}(r_{1},r_{2})&=\frac{2r_{2}}{2r_{2}(1-2r_{1})+1}\\
\mathrm{IoU}_{4}(r_{1},r_{2})&=\frac{1}{4(1-r_{1})(1-r_{2})}
\end{aligned}\tag{2}$$

where $\mathrm{IoU}_j(r_1,r_2)$ may represent the intersection ratio corresponding to the $j$-th boundary condition. In this case, the intersection ratio corresponding to each boundary condition can be obtained.
Finally, the largest of the intersection ratios under the four boundary conditions is the expected intersection ratio. In some examples, for $r_1,r_2$ satisfying $0<r_1,r_2\le 0.5$, the expected intersection ratio may satisfy equation (3):

$$\mathrm{IoU}_{\mathrm{expected}}(r_{1},r_{2})=\max_{j\in\{1,2,3,4\}}\mathrm{IoU}_{j}(r_{1},r_{2})\tag{3}$$
In addition, the expected intersection ratio of pixels located in the other regions (i.e., the upper-right, lower-left, and lower-right regions) can be obtained by a method similar to that for the upper-left region. In some examples, $r_1$ in equation (3) may be replaced with $1-r_1$ for $r_1$ satisfying $0.5\le r_1<1$, and $r_2$ in equation (3) may be replaced with $1-r_2$ for $r_2$ satisfying $0.5\le r_2<1$. Thus, the expected intersection ratio of pixels located in the other regions can be obtained. That is, pixels located in the other regions may be mapped to the upper-left region by a coordinate transformation, and the expected intersection ratio may then be obtained in the same manner as for the upper-left region. Thus, for $r_1,r_2$ satisfying $0<r_1,r_2<1$, the expected intersection ratio may satisfy equation (4):

$$\mathrm{IoU}_{\mathrm{expected}}(r_{1},r_{2})=\max_{j\in\{1,2,3,4\}}\mathrm{IoU}_{j}\big(\min(r_{1},1-r_{1}),\,\min(r_{2},1-r_{2})\big)\tag{4}$$

where $\mathrm{IoU}_{1}(r_{1},r_{2})$, $\mathrm{IoU}_{2}(r_{1},r_{2})$, $\mathrm{IoU}_{3}(r_{1},r_{2})$, and $\mathrm{IoU}_{4}(r_{1},r_{2})$ can be obtained from equation (2). Thus, the expected intersection ratio can be obtained.
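The computation of equations (2) through (4) can be combined into a single function. The sketch below maps any $(r_1, r_2)$ to the upper-left region with $\min(r, 1-r)$ and returns the largest of the four boundary-condition IoUs:

```python
def expected_iou(r1, r2):
    """Expected IoU of a pixel at relative position (r1, r2) inside
    its matching tight box, with 0 < r1, r2 < 1 (equations (2)-(4))."""
    r1 = min(r1, 1.0 - r1)      # map to the upper-left region
    r2 = min(r2, 1.0 - r2)
    ious = (
        4 * r1 * r2,                          # IoU_1
        2 * r1 / (2 * r1 * (1 - 2 * r2) + 1), # IoU_2
        2 * r2 / (2 * r2 * (1 - 2 * r1) + 1), # IoU_3
        1 / (4 * (1 - r1) * (1 - r2)),        # IoU_4
    )
    return max(ious)

print(expected_iou(0.5, 0.5))   # box center -> 1.0
```

As expected, the value is largest at the box center and decays toward the borders, which is what makes it usable for screening out off-center positive samples.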
As described above, in some examples, the expected intersection ratio of a pixel may be obtained based on the matching tight box label of the pixel (e.g., a positive sample). However, examples of the present disclosure are not limited thereto. In other examples, the matching tight box labels may not be acquired when screening the positive samples of each category or the pixels of the image to be trained. Specifically, the expected intersection ratio corresponding to a pixel (e.g., a positive sample) may be obtained based on the real tight box labels the pixel falls into, and the pixels of each category may be screened based on the expected intersection ratio. In this case, the expected intersection ratio may be the maximum of the expected intersection ratios corresponding to the respective real tight box labels. For obtaining the expected intersection ratio of a pixel based on the real tight box labels, see the description of obtaining the expected intersection ratio of a pixel based on its matching tight box label.
Hereinafter, a measurement method according to the present disclosure will be described in detail with reference to the accompanying drawings. The network module 20 involved in the measurement method may be trained by the training method described above. Fig. 8 (a) is a flowchart showing a measurement method of tight-box-based deep learning according to an example of the present disclosure.
In some examples, as shown in fig. 8 (a), the measurement method may include acquiring an input image (step S220), inputting the input image to the network module 20 to acquire a first output and a second output (step S240), and identifying the targets based on the first output and the second output to acquire the tight box labels of the targets of the respective categories (step S260).
In some examples, in step S220, an input image may be acquired. In some examples, the input image may include at least one target. In some examples, at least one target may belong to at least one category of interest (the category of interest may be simply referred to as a category). In particular, if the input image comprises one object, the object may belong to one category of interest, and if the input image comprises a plurality of objects, the plurality of objects may belong to at least one category of interest. In some examples, the input image may also not include the target. In this case, it is possible to determine that there is no input image of the object.
In some examples, in step S240, the input image may be input to the network module 20 to obtain the first output and the second output. In some examples, the first output may include the probabilities that the respective pixels in the input image belong to the respective categories. In some examples, the second output may include the offsets of the positions of the respective pixels in the input image relative to the tight box labels of the targets of each category. In some examples, the offsets in the second output may be taken as target offsets. In some examples, the network module 20 may include a backbone network 21, a segmentation network 22, and a regression network 23. In some examples, the segmentation network 22 may perform image segmentation based on weakly supervised learning. In some examples, the regression network 23 may be based on bounding box regression. In some examples, the backbone network 21 may be used to extract a feature map of the input image. In some examples, the segmentation network 22 may take the feature map as input to obtain the first output, and the regression network 23 may take the feature map as input to obtain the second output. In some examples, the resolution of the feature map may be consistent with that of the input image. See the relevant description of the network module 20 for details.
In some examples, in step S260, the targets may be identified based on the first output and the second output to obtain the tight box labels of the targets of each category. Thus, the targets can be accurately measured based on their tight box labels. As described above, the first output may include the probabilities that the respective pixels in the input image belong to the respective categories, and the second output may include the offsets of the positions of the respective pixels in the input image relative to the tight box labels of the targets of each category. In some examples, the target offset of a category corresponding to the pixel at a given position may be selected from the second output based on the first output, and the tight box label of each target of each category may be obtained based on the target offset.
In some examples, the positions of the pixels at which the probability of belonging to each category is a local maximum may be obtained from the first output as first positions, and the tight box label of each target of each category may be obtained based on the target offset of the corresponding category at the position in the second output corresponding to each first position. In this case, one or more targets of each category can be identified. In some examples, the first positions may be obtained using non-maximum suppression (NMS). In some examples, the number of first positions corresponding to each category may be 1 or more. However, examples of the present disclosure are not limited thereto. For an input image with only one target per category, in some examples, the position of the pixel with the greatest probability of belonging to each category may be acquired from the first output as the first position, and the tight box label of the target of each category may be obtained based on the position corresponding to the first position in the second output and the target offset of the corresponding category. That is, the first position may be acquired using a maximum method. In some examples, the first positions may also be obtained using a smoothed maximum suppression method.
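For the single-target-per-category case, the maximum method can be sketched as follows. `first_positions` is a hypothetical helper (the threshold and the plain argmax are assumptions); a full NMS over local maxima would replace the argmax when a category can contain multiple targets:

```python
import numpy as np

def first_positions(prob_map, threshold=0.5):
    """Pick, per category, the position of the pixel with the highest
    probability from the first output (single-target case).

    prob_map: (C, H, W) array of per-category probabilities.
    Returns a list with one (y, x) tuple per category, or None when
    no pixel of the category reaches the threshold.
    """
    positions = []
    for c in range(prob_map.shape[0]):
        flat = prob_map[c].argmax()                 # maximum method
        y, x = np.unravel_index(flat, prob_map[c].shape)
        if prob_map[c, y, x] >= threshold:
            positions.append((int(y), int(x)))
        else:
            positions.append(None)                  # category absent
    return positions

probs = np.zeros((1, 4, 4))
probs[0, 2, 3] = 0.9
print(first_positions(probs))   # -> [(2, 3)]
```

Each returned first position would then be combined with the target offset read from the second output at that position to invert equation (1) and recover the tight box label.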
In some examples, the tight box label of a target of each category may be obtained based on the first position and the target offset. In some examples, the first position and the target offset may be substituted into equation (1) to invert the tight box label of the target. Specifically, the first position may be taken as the pixel position (x, y) of equation (1) and the target offset as the offset t, to obtain the tight box label b of the target.
Fig. 8 (b) is a flowchart showing another example of the measurement method of tight-box-based deep learning according to the examples of the present disclosure. In some examples, as shown in fig. 8 (b), the measurement method may further include measuring the size of each target based on its tight box label (step S280). Thus, each target can be accurately measured based on its tight box label. In some examples, the size of a target may be the width and height of the target's tight box label.
Hereinafter, the measurement apparatus 100 for tight-box-based deep learning of the present disclosure will be described in detail with reference to the accompanying drawings. The measurement apparatus 100 may also be referred to as an identification apparatus, a tight box measurement apparatus, a tight box identification apparatus, an automatic measurement apparatus, an auxiliary measurement apparatus, etc. The measurement apparatus 100 of the present disclosure is used for implementing the measurement method described above. Fig. 9 (a) is a block diagram showing a measurement apparatus 100 for tight-box-based deep learning according to an example of the present disclosure.
As shown in fig. 9 (a), in some examples, the measurement apparatus 100 may include an acquisition module 10, a network module 20, and an identification module 30.
In some examples, the acquisition module 10 may be configured to acquire an input image. In some examples, the input image may include at least one target. In some examples, the at least one target may belong to at least one category of interest. See the description of the related in step S220 for details.
In some examples, network module 20 may be configured to receive an input image and obtain a first output and a second output based on the input image. In some examples, the first output may include probabilities that respective pixels in the input image belong to respective categories. In some examples, the second output may include a shift of the location of the individual pixels in the input image from a close-frame label of the object of each category. In some examples, the offset in the second output may be taken as the target offset. In some examples, the network module 20 may include a backbone network 21, a split network 22, and a regression network 23. In some examples, the segmentation network 22 may be image segmentation based on weakly supervised learning. In some examples, regression network 23 may be based on a frame regression. In some examples, the backbone network 21 may be used to extract feature maps of the input image. In some examples, the segmentation network 22 may take the feature map as input to obtain a first output and the regression network 23 may take the feature map as input to obtain a second output. In some examples, the resolution of the feature map may be consistent with the input image. See for details the relevant description of the network module 20.
In some examples, the identification module 30 may be configured to identify the targets based on the first output and the second output so as to obtain the tight frame label of the target of each category. For details, see the related description of step S260.
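As a concrete illustration of how the two outputs might be combined, the hypothetical sketch below selects, for one category, the pixel with the highest probability from the first output and converts its offsets from the second output into a tight box. The function names and the (left, top, right, bottom) offset convention are assumptions, not the patented implementation.

```python
def decode_tight_box(x, y, offsets):
    """Turn a pixel position and its predicted offsets
    (dx1, dy1, dx2, dy2) into a tight box (x1, y1, x2, y2)."""
    dx1, dy1, dx2, dy2 = offsets
    return (x - dx1, y - dy1, x + dx2, y + dy2)

def identify_one_target(first_output, second_output, category):
    """Pick the most confident pixel for a category (first output)
    and decode that pixel's offsets (second output) into a tight box."""
    best, best_prob = None, -1.0
    for y, row in enumerate(first_output):
        for x, probs in enumerate(row):
            if probs[category] > best_prob:
                best_prob, best = probs[category], (x, y)
    x, y = best
    return decode_tight_box(x, y, second_output[y][x][category])
```

For example, a pixel at (10, 10) with offsets (2, 3, 4, 5) decodes to the box (8, 7, 14, 15).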
Fig. 9 (b) is a block diagram showing another example of the measurement apparatus 100 for deep learning based on tight frame labels according to an example of the present disclosure. Fig. 9 (c) is a block diagram showing yet another example of the measurement apparatus 100 for deep learning based on tight frame labels according to an example of the present disclosure.
As shown in Fig. 9 (b) and Fig. 9 (c), in some examples, the measurement apparatus 100 may also include a measurement module 40. The measurement module 40 may be configured to measure the size of each target based on the target's tight frame label. For details, see the related description of step S280.
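Since the tight frame label is the minimum circumscribed rectangle of the target, a measurement step can read the target's extent directly from the sides of the box. The sketch below is a minimal illustration; the function name and the `pixels_per_unit` scale factor are hypothetical, and any calibration from pixel distances to physical units is application-specific.

```python
def measure_target(tight_box, pixels_per_unit=1.0):
    """Return the (width, height) of a target from its tight box
    (x1, y1, x2, y2). Because the tight box is the target's minimum
    circumscribed rectangle, its side lengths give the target's
    extent; pixels_per_unit is a hypothetical scale factor for
    converting pixel distances into physical units."""
    x1, y1, x2, y2 = tight_box
    return ((x2 - x1) / pixels_per_unit, (y2 - y1) / pixels_per_unit)
```

For instance, the box (8, 7, 14, 15) yields a width of 6 and a height of 8 pixels.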
In the measurement method and measurement apparatus 100 according to the present disclosure, a network module 20 is constructed that includes a backbone network 21, a segmentation network 22 for image segmentation based on weakly supervised learning, and a regression network 23 based on bounding box regression. The network module 20 is trained based on the tight frame labels of targets. The backbone network 21 receives an input image (e.g., a fundus image) and extracts a feature map whose resolution is consistent with that of the input image; the feature map is input to the segmentation network 22 and the regression network 23 to obtain a first output and a second output, respectively, and the tight box of each target in the input image is then obtained based on the first output and the second output to realize measurement. In this case, the network module 20, trained based on the targets' tight frame labels, can accurately predict the tight box of a target in the input image, and measurement based on the target's tight box can therefore be accurate. In addition, by having the regression network 23 predict normalized offsets, the accuracy of identifying or measuring targets whose sizes do not vary greatly can be improved. In addition, screening the pixels used to optimize the regression network 23 by means of the expected intersection-over-union can reduce the adverse effect of pixels far from the center on the optimization of the regression network 23 and improve the efficiency of that optimization. Furthermore, the regression network 23 predicts offsets for an explicit category, which can further improve the accuracy of target identification or measurement.
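The pixel screening mentioned above can be approximated with an ordinary intersection-over-union computation: the sketch below keeps only pixels whose predicted box overlaps the gold-standard tight box strongly enough, so pixels far from the target center tend to be dropped. This is a simplified stand-in for the disclosure's expected intersection-over-union criterion; the exact expectation is not reproduced here, and the function names are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def screen_pixels(pixel_boxes, gold_box, threshold):
    """Keep the indices of per-pixel predicted boxes whose IoU with
    the gold-standard tight box exceeds the threshold; the surviving
    pixels are the ones used to optimize the regression head."""
    return [i for i, box in enumerate(pixel_boxes)
            if iou(box, gold_box) > threshold]
```

With a threshold of 0.5, a box identical to the gold standard survives (IoU 1.0) while a box overlapping it by only one unit square out of seven (IoU 1/7) is discarded.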
While the disclosure has been described in detail in connection with the drawings and examples, it is to be understood that the foregoing description is not intended to limit the disclosure in any way. Modifications and variations of the present disclosure may be made as desired by those skilled in the art without departing from the true spirit and scope of the disclosure, and such modifications and variations fall within the scope of the disclosure.

Claims (9)

1. An identification method based on tight frame labels, characterized by comprising identifying a target by using a network module trained based on the tight frame label of the target, the tight frame label being the minimum circumscribed rectangle of the target, and the network module comprising a segmentation network for image segmentation and a regression network based on bounding box regression, wherein the identification method comprises: acquiring an input image comprising at least one target, the at least one target belonging to at least one category of interest; inputting the input image into the network module to obtain a first output from the segmentation network and a second output from the regression network, the first output comprising the probability that each pixel in the input image belongs to each category, and the second output comprising the offset between the position of each pixel in the input image and the tight frame label of the target of each category; and identifying the target based on the first output and the second output; wherein the network module further comprises a backbone network for extracting a feature map of the input image, and the segmentation network and the regression network each take the feature map as input to obtain the first output and the second output, respectively.
2. The identification method as claimed in claim 1, wherein:
the resolution of the feature map is consistent with that of the input image.
3. The identification method as claimed in claim 1, wherein:
the offset in the second output is taken as a target offset, and the target offset is normalized based on the average size of the targets of each category.
4. The identification method as claimed in claim 1, wherein:
the network module is trained by the following method:
constructing training samples; obtaining, through the network module and based on the input image data of the training samples, the predicted segmentation data output by the segmentation network and the predicted offsets output by the regression network; determining the training loss of the network module based on the label data corresponding to the training samples, the predicted segmentation data, and the predicted offsets; and training the network module based on the training loss to optimize the network module.
5. The identification method of claim 4, wherein:
determining the training loss of the network module based on the label data corresponding to the training samples, the predicted segmentation data, and the predicted offsets comprises: obtaining the segmentation loss of the segmentation network based on the predicted segmentation data and the label data corresponding to the training samples; obtaining the regression loss of the regression network based on the predicted offsets corresponding to the training samples and the true offsets corresponding to the label data, the true offset being the offset between the position of a pixel of the image to be trained and the gold-standard tight frame label of the target in the label data; and obtaining the training loss of the network module based on the segmentation loss and the regression loss.
6. The identification method of claim 5, wherein:
using multi-instance learning, a plurality of bags to be trained are obtained per category based on the gold-standard tight frame labels of the targets in each image to be trained, and the segmentation loss is obtained from the bags to be trained of each category, wherein the plurality of bags to be trained comprise a plurality of positive bags and a plurality of negative bags; all pixels on each of a plurality of straight lines connecting two opposite sides of a target's gold-standard tight frame label are grouped into one positive bag, the plurality of straight lines comprising at least one group of first parallel lines parallel to each other and second parallel lines perpendicular to each group of first parallel lines; and each negative bag is a single pixel in the region outside the gold-standard tight frame labels of all targets of a category.
7. The identification method of claim 4, wherein:
pixels whose expected intersection-over-union is greater than a preset expected intersection-over-union are screened, per category, from the pixels of the image to be trained by using the expected intersection-over-union corresponding to each pixel of the image to be trained, and are used to optimize the regression network.
8. A measurement method based on tight frame labels, characterized in that targets are identified based on the identification method of any one of claims 1 to 7 to obtain the tight frame labels of the targets of each category, so as to realize measurement of the targets, the tight frame label being the minimum circumscribed rectangle of the target.
9. An identification device based on tight frame labels, characterized by identifying a target by using a network module trained based on the tight frame label of the target, the tight frame label being the minimum circumscribed rectangle of the target, wherein the identification device comprises an acquisition module, a network module, and an identification module; the acquisition module is configured to acquire an input image comprising at least one target, the at least one target belonging to at least one category of interest; the network module is configured to receive the input image and obtain a first output and a second output based on the input image, the first output comprising the probability that each pixel in the input image belongs to each category, and the second output comprising the offset between the position of each pixel in the input image and the tight frame label of the target of each category; the network module comprises a segmentation network for image segmentation and a regression network based on bounding box regression, the segmentation network being used to produce the first output and the regression network being used to produce the second output; the identification module is configured to identify the target based on the first output and the second output to obtain the tight frame label of the target of each category; and the network module further comprises a backbone network for extracting a feature map of the input image, the segmentation network and the regression network taking the feature map as input.
CN202211058151.3A 2021-10-11 2021-10-19 Identification method, measurement method and identification device based on close frame standard Active CN115423818B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN202111184059 2021-10-11
CN2021111840597 2021-10-11
CN202111216627.7A CN113920126B (en) 2021-10-11 2021-10-19 Measuring method and measuring device for deep learning based on tight frame markers

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202111216627.7A Division CN113920126B (en) 2021-10-11 2021-10-19 Measuring method and measuring device for deep learning based on tight frame markers

Publications (2)

Publication Number Publication Date
CN115423818A CN115423818A (en) 2022-12-02
CN115423818B true CN115423818B (en) 2025-09-16

Family

ID=78871496

Family Applications (6)

Application Number Title Priority Date Filing Date
CN202111216627.7A Active CN113920126B (en) 2021-10-11 2021-10-19 Measuring method and measuring device for deep learning based on tight frame markers
CN202211064151.4A Active CN115359070B (en) 2021-10-11 2021-10-19 Training method and measuring device based on tight frame standard
CN202210920208.XA Pending CN115578577A (en) 2021-10-11 2021-10-19 Recognition device and method for fundus image based on tight frame
CN202210916971.5A Pending CN115331050A (en) 2021-10-11 2021-10-19 Eye ground image measuring method and device based on tight frame mark and network training
CN202211058151.3A Active CN115423818B (en) 2021-10-11 2021-10-19 Identification method, measurement method and identification device based on close frame standard
CN202111216625.8A Active CN113780477B (en) 2021-10-11 2021-10-19 Fundus image measurement method and measurement device based on deep learning of tight frame markers

Family Applications Before (4)

Application Number Title Priority Date Filing Date
CN202111216627.7A Active CN113920126B (en) 2021-10-11 2021-10-19 Measuring method and measuring device for deep learning based on tight frame markers
CN202211064151.4A Active CN115359070B (en) 2021-10-11 2021-10-19 Training method and measuring device based on tight frame standard
CN202210920208.XA Pending CN115578577A (en) 2021-10-11 2021-10-19 Recognition device and method for fundus image based on tight frame
CN202210916971.5A Pending CN115331050A (en) 2021-10-11 2021-10-19 Eye ground image measuring method and device based on tight frame mark and network training

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202111216625.8A Active CN113780477B (en) 2021-10-11 2021-10-19 Fundus image measurement method and measurement device based on deep learning of tight frame markers

Country Status (2)

Country Link
CN (6) CN113920126B (en)
WO (1) WO2023060637A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114549987A (en) * 2022-02-24 2022-05-27 中山大学 Image processing method and image processing device based on multiple tasks
CN116681892B (en) * 2023-06-02 2024-01-26 山东省人工智能研究院 Image precise segmentation method based on multi-center polar mask model improvement
TWI865311B (en) * 2024-01-22 2024-12-01 宏碁股份有限公司 Image processing system and method for analyzing optic cup and optic disc
CN118537543B (en) * 2024-07-22 2024-10-18 杭州未来已来科技有限公司 Image information target detection method and system based on YOLO
CN119068407B (en) * 2024-07-23 2025-08-01 中国长江三峡集团有限公司 Excavation quantity real-time monitoring method, device and product based on double-trunk target detection

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111885A (en) * 2021-04-14 2021-07-13 清华大学深圳国际研究生院 Dynamic resolution instance segmentation method and computer readable storage medium
CN113111722A (en) * 2021-03-17 2021-07-13 天津理工大学 Automatic driving target identification method based on improved Mask R-CNN

Family Cites Families (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102112044A (en) * 2008-05-14 2011-06-29 Agency for Science, Technology and Research Automatic cup-to-disc ratio measurement system
WO2014031086A1 (en) * 2012-08-24 2014-02-27 Agency For Science, Technology And Research Methods and systems for automatic location of optic structures in an image of an eye, and for automatic retina cup-to-disc ratio computation
CN108475331B (en) * 2016-02-17 2022-04-05 英特尔公司 Method, apparatus, system and computer readable medium for object detection
US10229493B2 (en) * 2016-03-16 2019-03-12 International Business Machines Corporation Joint segmentation and characteristics estimation in medical images
CN106214120A (en) * 2016-08-19 2016-12-14 靳晓亮 A kind of methods for screening of glaucoma
CN106530280B (en) * 2016-10-17 2019-06-11 东软医疗系统股份有限公司 Method and device for locating organs in images
US20190205758A1 (en) * 2016-12-30 2019-07-04 Konica Minolta Laboratory U.S.A., Inc. Gland segmentation with deeply-supervised multi-level deconvolution networks
WO2019002474A1 (en) * 2017-06-28 2019-01-03 Deepmind Technologies Limited Generalizable medical image analysis using segmentation and classification neural networks
CN107689047B (en) * 2017-08-16 2021-04-02 汕头大学 A kind of method, device and readable storage medium for automatically cropping fundus image
US9946960B1 (en) * 2017-10-13 2018-04-17 StradVision, Inc. Method for acquiring bounding box corresponding to an object in an image by using convolutional neural network including tracking network and computing device using the same
US11087130B2 (en) * 2017-12-29 2021-08-10 RetailNext, Inc. Simultaneous object localization and attribute classification using multitask deep neural networks
WO2020036182A1 (en) * 2018-08-14 2020-02-20 キヤノン株式会社 Medical image processing device, medical image processing method, and program
CN109829877A (en) * 2018-09-20 2019-05-31 中南大学 A kind of retinal fundus images cup disc ratio automatic evaluation method
CN109584248B (en) * 2018-11-20 2023-09-08 西安电子科技大学 Infrared target instance segmentation method based on feature fusion and dense connection network
CN109670489B (en) * 2019-02-18 2023-06-27 广州视源电子科技股份有限公司 Weakly supervised early age-related macular degeneration classification method based on multi-instance learning
CN110120047B (en) * 2019-04-04 2023-08-08 平安科技(深圳)有限公司 Image segmentation model training method, image segmentation method, device, equipment and medium
CN110321815A (en) * 2019-06-18 2019-10-11 中国计量大学 A road crack recognition method based on deep learning
CN110298298B (en) * 2019-06-26 2022-03-08 北京市商汤科技开发有限公司 Target detection and target detection network training method, device and equipment
CN110349148A (en) * 2019-07-11 2019-10-18 电子科技大学 A Weakly Supervised Learning-Based Image Object Detection Method
WO2021053656A1 (en) * 2019-09-19 2021-03-25 Artificial Learning Systems India Pvt Ltd System and method for deep network-based glaucoma prediction
CN116343008A (en) * 2019-12-04 2023-06-27 深圳硅基智能科技有限公司 Training method and training device for glaucoma recognition based on multiple features
CN111028230A (en) * 2019-12-24 2020-04-17 贵州大学 A detection algorithm for optic disc and macular location in fundus images based on YOLO-V3
CN111340789B (en) * 2020-02-29 2024-10-18 平安科技(深圳)有限公司 Fundus retina blood vessel identification and quantification method, device, equipment and storage medium
CN111652140A (en) * 2020-06-03 2020-09-11 广东小天才科技有限公司 Precise topic segmentation method, device, equipment and medium based on deep learning
CN111815626A (en) * 2020-08-21 2020-10-23 上海导萃智能科技有限公司 Tongue body real-time detection method based on separable convolutional network
CN111862187B (en) * 2020-09-21 2021-01-01 平安科技(深圳)有限公司 Cup-to-disc ratio determining method, device, equipment and storage medium based on neural network
CN112232240B (en) * 2020-10-21 2024-08-27 南京师范大学 Road casting object detection and identification method based on optimized cross-over ratio function
CN112668579A (en) * 2020-12-24 2021-04-16 西安电子科技大学 Weak supervision semantic segmentation method based on self-adaptive affinity and class distribution
CN112883971A (en) * 2021-03-04 2021-06-01 中山大学 SAR image ship target detection method based on deep learning
CN112836713B (en) * 2021-03-12 2024-11-05 南京大学 Mesoscale convective system recognition and tracking method based on image anchor-free frame detection
CN112966684B (en) * 2021-03-15 2022-11-04 北湾科技(武汉)有限公司 Cooperative learning character recognition method under attention mechanism
CN113177576B (en) * 2021-03-31 2022-02-22 中国科学院大学 Multi-example active learning method for target detection
CN113326763B (en) * 2021-05-25 2023-04-18 河南大学 Remote sensing target detection method based on boundary frame consistency

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111722A (en) * 2021-03-17 2021-07-13 天津理工大学 Automatic driving target identification method based on improved Mask R-CNN
CN113111885A (en) * 2021-04-14 2021-07-13 清华大学深圳国际研究生院 Dynamic resolution instance segmentation method and computer readable storage medium

Also Published As

Publication number Publication date
CN113780477A (en) 2021-12-10
CN113920126B (en) 2022-07-22
CN115578577A (en) 2023-01-06
CN115359070B (en) 2025-09-16
WO2023060637A1 (en) 2023-04-20
CN113780477B (en) 2022-07-22
CN115423818A (en) 2022-12-02
CN115331050A (en) 2022-11-11
CN113920126A (en) 2022-01-11
CN115359070A (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN115423818B (en) Identification method, measurement method and identification device based on close frame standard
CN113591795B (en) Lightweight face detection method and system based on mixed attention characteristic pyramid structure
CN112465880B (en) Object Detection Method Based on Cognitive Fusion of Multi-source Heterogeneous Data
US8488878B2 (en) Sky detection system used in image extraction device and method using sky detection system
CN111445478B (en) An automatic detection system and method for intracranial aneurysm area for CTA images
CN111310756B (en) Damaged corn particle detection and classification method based on deep learning
CN113139896B (en) Target detection system and method based on super-resolution reconstruction
CN114693522A (en) Full-focus ultrasonic image splicing method
CN117576029A (en) Component defect detection and evaluation methods and devices based on binocular vision
CN113033315A (en) Rare earth mining high-resolution image identification and positioning method
CN108090928B (en) A method and system for detecting and screening quasi-circular cell area
CN115359264B (en) A deep learning method for identifying densely distributed adhesion cells
CN114927236A (en) A detection method and system for multiple target images
CN117635603A (en) System and method for detecting on-line quality of hollow sunshade product based on target detection
WO2022257314A1 (en) Image detection method, related training method, related apparatus, device, and medium
CN117635579A (en) Product surface defect detection method based on multi-standard image defect classification
CN117830868A (en) Remote sensing image target detection method based on oriented rotation equivariant network
CN119206369A (en) Zero-shot image anomaly detection and localization method based on pre-trained vision-language model
CN119516434A (en) * A method for evaluating students' class status as a whole based on the class
CN117612164B (en) Cell division equilibrium degree detection method based on double edge detection
US20030185431A1 (en) Method and system for golden template image extraction
CN116188970A (en) A road surface water recognition method based on meta-learning target segmentation
CN112348823A (en) Object-oriented high-resolution remote sensing image segmentation algorithm
CN114972854B (en) A two-stage method for photographic body position recognition and body surface anatomical landmark detection based on viewpoint assistance
CN119273695B (en) Mobile pulmonary nodule detection and multi-class classification method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant