
CN113869099B - Image processing method, device, electronic device and storage medium - Google Patents


Info

Publication number
CN113869099B
Authority
CN
China
Prior art keywords
predicate
visual relationship
initial
training
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110693496.5A
Other languages
Chinese (zh)
Other versions
CN113869099A (en)
Inventor
徐路
郭昱宇
高联丽
陈敏
王浩宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
University of Electronic Science and Technology of China
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China, Beijing Dajia Internet Information Technology Co Ltd filed Critical University of Electronic Science and Technology of China
Priority to CN202110693496.5A
Publication of CN113869099A
Application granted
Publication of CN113869099B


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract


The present disclosure relates to an image processing method, device, electronic device and storage medium. The method includes: inputting an image to be processed into an image detection model for object detection, obtaining object detection information corresponding to at least two objects in the image to be processed; inputting the object detection information into a visual relationship detection model for visual relationship detection, obtaining the visual relationship between every two objects, where the visual relationship is obtained after the visual relationship detection model adjusts the amount of semantic information corresponding to it; and inputting the visual relationship into a scene graph generation model to generate a target scene graph corresponding to the image to be processed. Because the method detects the visual relationship between every two objects with a dedicated visual relationship detection model, it can improve the accuracy of visual relationship detection.

Description

Image processing method, device, electronic device and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method, an image processing device, an electronic device, and a storage medium.
Background
A scene graph annotated with visual relationships can be generated through visual relationship detection; it serves as a structured representation of image content and as a bridge between computer vision and natural language. Once such a scene graph is generated, the visual relationship triples formed by subjects, predicates and objects in the image can be read off the scene graph.
In the related art, when visual relationship detection is performed on an image to be processed, the detected visual relationships are easily confused with one another, which reduces the accuracy of visual relationship detection and, in turn, the usefulness of the scene graph annotated with those relationships.
Disclosure of Invention
The disclosure provides an image processing method, an image processing device, an electronic device and a storage medium, which are used for at least solving the problems in the related art that the accuracy of visual relationship detection is low and the effectiveness of a scene graph marked with visual relationships is low. The technical scheme of the present disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided an image processing method, the method comprising:
inputting an image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed respectively;
Inputting the object detection information into a visual relation detection model to perform visual relation detection to obtain visual relation between every two objects, wherein the visual relation represents interaction relation between every two objects in the image to be processed;
Inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model to generate a scene graph, and obtaining a target scene graph corresponding to the image to be processed, wherein the target scene graph is structural information marked with the visual relationship between every two objects.
As an optional embodiment, the visual relationship detection model includes a predicate identification network, and the inputting the object detection information into the visual relationship detection model to perform visual relationship detection, and obtaining the visual relationship between the two objects includes:
Inputting the object detection information into the predicate identification network to identify predicates corresponding to predicate relations among the objects, and obtaining target predicates, wherein the target predicates represent predicates after semantic adjustment;
And obtaining the visual relationship according to the target predicate and the object corresponding to the target predicate.
As an optional embodiment, the predicate identification network includes an initial correlation calculation layer and a semantic adjustment layer, and the inputting the object detection information into the predicate identification network to perform predicate identification between two objects, and obtaining the target predicate includes:
Inputting the object detection information and preset predicates into the initial correlation calculation layer, and performing correlation calculation on predicates corresponding to the object detection information and each preset predicate to obtain initial correlation distribution information, wherein the initial correlation distribution information characterizes correlation between predicates corresponding to the object detection information and each preset predicate before semantic adjustment;
Inputting the initial relevance distribution information into a semantic adjustment layer, and performing predicate semantic adjustment on the initial relevance distribution information based on the preset matrix to obtain target relevance distribution information, wherein the target relevance distribution information characterizes predicates corresponding to the two-to-two object detection information after semantic adjustment and the relevance between each preset predicate;
And determining the target predicate according to the target correlation distribution information.
As an optional embodiment, inputting the initial relevance distribution information into a semantic adjustment layer, performing predicate semantic adjustment on the initial relevance distribution information based on the preset matrix, and obtaining the target relevance distribution information includes:
Determining an initial predicate according to the initial correlation distribution information;
under the condition that the initial predicate is a general predicate, performing predicate semantic adjustment on the initial correlation degree distribution information based on a semantic adjustment matrix in the preset matrix, wherein the general predicate characterizes predicates with use probabilities larger than a preset threshold value in the preset predicates;
and under the condition that the initial predicate is a non-universal predicate, determining the initial correlation distribution information as the target correlation distribution information based on a semantic retention matrix in the preset matrix, wherein the non-universal predicate characterizes predicates with the probability smaller than a preset threshold value in the preset predicates.
As an alternative embodiment, the method further comprises:
Inputting a labeling image into the image detection model for object detection to obtain training object detection information corresponding to each object in the labeling image, wherein the labeling image is labeled with a reference visual relationship between every two objects;
inputting the training object detection information into a first model to be trained for visual relation detection to obtain a first training visual relation between every two objects, wherein the first training visual relation characterizes the interaction relation between every two objects in the labeling image obtained through the first model to be trained;
Inputting the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained to generate a scene graph, and obtaining a first training scene graph corresponding to the labeling image, wherein the first training scene graph is structural information labeled with the first training visual relationship between every two objects;
And training the first model to be trained and the second model to be trained according to the first training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model.
As an optional embodiment, after the training the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model, the method further includes:
Detecting word frequency information corresponding to each reference predicate in the reference visual relationship;
Classifying the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain reference predicate types corresponding to each labeling image;
combining the first visual relation detection model with a preset matrix to obtain a second visual relation detection model;
Inputting the training object detection information into the second visual relation detection model to perform visual relation detection, so as to obtain a second training visual relation between every two objects, wherein the second training visual relation characterizes the interaction relation between every two objects in the annotation image under the condition that a preset matrix exists;
Inputting the training object detection information corresponding to the second training visual relationship and the first training visual relationship into the initial scene graph generation model to generate a scene graph, so as to obtain a second training scene graph corresponding to the labeling image, wherein the second training scene graph is structural information labeled with the second training visual relationship between every two objects;
And adjusting the second visual relation detection model and the initial scene graph generation model based on the reference predicate types, the second training visual relation and the reference visual relation corresponding to each marked image to obtain the visual relation detection model and the scene graph generation model.
As an alternative embodiment, the method further comprises:
Inputting the training object detection information into the first visual relation detection model to perform visual relation detection to obtain an initial visual relation between every two objects, wherein the initial visual relation characterizes the interaction relation between every two objects in the annotation image obtained through the first visual relation detection model;
Inputting the initial visual relationship and training object detection information corresponding to the first training visual relationship into the initial scene graph generation model to generate a scene graph, so as to obtain an initial scene graph corresponding to the labeling image, wherein the initial scene graph is the structural information labeled with the initial visual relationship between every two objects;
Determining an initial matrix according to predicates in the initial visual relationship and reference predicates in the reference visual relationship;
and obtaining a preset matrix according to the normalized matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
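For illustration, the sketch below shows one way the preset matrix components might be assembled, under the assumption that the initial matrix tallies initial predicates against reference predicates over the annotated images; the claim does not fix this construction, and all names are illustrative.

```python
import numpy as np

def build_preset_matrix(initial_predicates, reference_predicates, num_predicates):
    """Sketch: derive the preset matrix components from an initial matrix.

    The initial matrix is assumed here to count how often the first model
    predicts initial predicate u when the reference predicate is v; the
    patent may construct it differently.
    """
    counts = np.zeros((num_predicates, num_predicates))
    for u, v in zip(initial_predicates, reference_predicates):
        counts[u, v] += 1

    # Normalized matrix: each row becomes a distribution over reference
    # predicates (the semantic adjustment component).
    row_sums = counts.sum(axis=1, keepdims=True)
    semantic_adjustment = np.divide(
        counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

    # Identity matrix corresponding to the initial matrix: the semantic
    # retention component.
    semantic_retention = np.eye(num_predicates)
    return semantic_adjustment, semantic_retention
```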
According to a second aspect of embodiments of the present disclosure, there is provided an image processing apparatus, the apparatus including:
the object detection module is configured to input an image to be processed into the image detection model for object detection, and object detection information corresponding to at least two objects in the image to be processed is obtained;
The visual relation detection module is configured to input the object detection information into a visual relation detection model to perform visual relation detection, so as to obtain a visual relation between every two objects, wherein the visual relation characterizes an interaction relation between every two objects in the image to be processed;
The scene graph generating module is configured to input the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generating model to generate a scene graph, so as to obtain a target scene graph corresponding to the image to be processed, wherein the target scene graph is the structural information marked with the visual relationship between every two objects.
As an alternative embodiment, the visual relationship detection model includes a predicate identification network and a visual relationship determination network, the visual relationship detection module including:
The predicate identification unit is configured to perform predicate identification corresponding to a predicate relation between every two objects by inputting the object detection information into the predicate identification network to obtain a target predicate, wherein the target predicate characterizes the predicate with semantic adjustment;
And a visual relationship determining unit configured to obtain the visual relationship from the target predicate and the objects corresponding to the target predicate.
As an alternative embodiment, the predicate identification network includes an initial relevance calculation layer and a semantic adjustment layer, and the predicate identification unit includes:
The initial correlation calculation unit is configured to input the object detection information and preset predicates into the initial correlation calculation layer, and calculate correlation between predicates corresponding to the two-by-two object detection information and each preset predicate to obtain initial correlation distribution information, wherein the initial correlation distribution information characterizes correlation between predicates corresponding to the two-by-two object detection information and each preset predicate before semantic adjustment;
The semantic adjustment unit is configured to input the initial correlation distribution information into a semantic adjustment layer, perform predicate semantic adjustment on the initial correlation distribution information based on the preset matrix to obtain target correlation distribution information, and the target correlation distribution information represents predicates corresponding to the two-by-two object detection information after semantic adjustment and correlation among the preset predicates;
And a target predicate determination unit configured to perform determination of the target predicate according to the target correlation distribution information.
As an alternative embodiment, the semantic adjustment unit comprises:
an initial predicate determination unit configured to perform determination of an initial predicate according to the initial correlation distribution information;
The first semantic adjustment unit is configured to perform predicate semantic adjustment on the initial relevance distribution information based on a semantic adjustment matrix in the preset matrix when the initial predicate is a universal predicate, wherein the universal predicate characterizes predicates with use probabilities larger than a preset threshold value in the preset predicates;
and a second semantic adjustment unit configured to perform determining the initial relevance distribution information as the target relevance distribution information based on a semantic retention matrix in the preset matrix in a case where the initial predicate is a non-universal predicate that characterizes a predicate of the preset predicates that uses a probability smaller than a preset threshold.
As an alternative embodiment, the apparatus further comprises:
The first training feature extraction module is configured to execute feature extraction by inputting a labeling image into the image detection model to obtain training object detection information corresponding to each object in the labeling image, wherein the labeling image is labeled with a reference visual relationship between every two objects;
The first training visual relation detection module is configured to execute the detection of the visual relation by inputting the detection information of the training objects into a first model to be trained to obtain a first training visual relation between every two objects, and the first training visual relation characterizes the interaction relation between every two objects in the labeling image obtained through the first model to be trained;
the first training scene graph generation module is configured to input the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained to generate a scene graph, so as to obtain a first training scene graph corresponding to the labeling image, wherein the first training scene graph is structural information labeled with the first training visual relationship between every two objects;
and the model training module is configured to train the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship, so as to obtain a first visual relationship detection model and an initial scene graph generation model.
As an alternative embodiment, the apparatus further comprises:
the word frequency information detection module is configured to detect word frequency information corresponding to each reference predicate in the reference visual relationship;
A second visual relationship detection model acquisition module configured to perform a combination of the first visual relationship detection model and a preset matrix to obtain a second visual relationship detection model;
The second training visual relationship acquisition module is configured to perform visual relationship detection by inputting the training object detection information into the second visual relationship detection model to obtain a second training visual relationship between every two objects, and the second training visual relationship characterizes the interaction relationship between every two objects in the labeling image obtained through the second visual relationship detection model;
The second training scene graph acquisition module is configured to input training object detection information corresponding to the second training visual relationship and the first training visual relationship into the initial scene graph generation model to generate a scene graph, so as to obtain a second training scene graph corresponding to the labeling image, wherein the second training scene graph is structural information labeled with the second training visual relationship between the two objects;
And the model adjustment module is configured to execute adjustment on the second visual relation detection model and the initial scene graph generation model based on the reference predicate type corresponding to each marked image, the second training visual relation and the reference visual relation to obtain the visual relation detection model and the scene graph generation model.
As an alternative embodiment, the apparatus further comprises:
The initial visual relationship detection module is configured to input the object detection information into the first visual relationship detection model to perform visual relationship detection to obtain an initial visual relationship between every two objects, and the initial visual relationship characterizes the interaction relationship between every two objects in the labeling image obtained through the first visual relationship detection model;
The scene initial diagram generation module is configured to input training object detection information corresponding to the initial visual relationship and the first training visual relationship into the initial scene diagram generation model to generate a scene diagram, so as to obtain an initial scene diagram corresponding to the labeling image, wherein the initial scene diagram is structural information labeled with the initial visual relationship between every two objects;
An initial matrix determination module configured to perform determining an initial matrix from predicates in the initial visual relationship and reference predicates in the reference visual relationship;
the preset matrix determining module is configured to obtain the preset matrix from the normalized matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising:
A processor;
A memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement the image processing method described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the image processing method described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the above-described image processing method.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Inputting an image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed, inputting the object detection information into a visual relation detection model for visual relation detection to obtain visual relation between every two objects, wherein the visual relation is obtained by adjusting semantic information corresponding to the visual relation through the visual relation detection model, and inputting the visual relation into a scene graph generation model for scene graph generation to obtain a target scene graph corresponding to the image to be processed. The method is based on the visual relation detection model, and can detect the visual relation between every two objects, so that the accuracy of visual relation detection can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a schematic view of an application scenario of an image processing method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating an image processing method according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating predicate identification in an image processing method according to an example embodiment.
FIG. 4 is a flowchart illustrating predicate semantic adjustment in an image processing method according to an example embodiment.
FIG. 5 is a flow chart illustrating model training in an image processing method according to an exemplary embodiment.
FIG. 6 is a flowchart illustrating an adjustment of a trained model in an image processing method according to an exemplary embodiment.
Fig. 7 is a schematic diagram showing transfer learning in an image processing method according to an exemplary embodiment.
Fig. 8 is a schematic diagram showing parameter fixing during model training in an image processing method according to an exemplary embodiment.
Fig. 9 is a schematic diagram illustrating a process of inputting a picture to be processed and generating a target scene graph in an image processing method according to an exemplary embodiment.
Fig. 10 is a block diagram of an image processing apparatus according to an exemplary embodiment.
Fig. 11 is a block diagram of a server-side electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Fig. 1 is a schematic diagram of an application scenario of an image processing method according to an exemplary embodiment; as shown in fig. 1, the application scenario includes a client 110 and a server 120. The client 110 collects the image to be processed, and the server 120 receives the image to be processed sent by the client 110 and inputs it into the image detection model for object detection, obtaining object detection information corresponding to at least two objects in the image to be processed. The server 120 inputs the object detection information into the visual relationship detection model to perform visual relationship detection and obtain the visual relationship between every two objects, then inputs the visual relationship and the object detection information corresponding to the visual relationship into the scene graph generation model to obtain the target scene graph corresponding to the image to be processed. The server 120 sends the target scene graph to the client 110 for display.
In the embodiment of the present disclosure, the client 110 includes a smart phone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, a smart wearable device, and other types of physical devices, and may also include software running in the physical devices, such as an application program, and the like. The operating system running on the entity device in the embodiment of the present application may include, but is not limited to, an android system, an IOS system, linux, unix, windows, etc. The client 110 includes a UI (User Interface) layer, and the client 110 provides display of a target scene graph and collection of a to-be-processed image to the outside through the UI layer, and in addition, transmits the to-be-processed image to the server 120 based on an API (Application Programming Interface, application program Interface).
In the disclosed embodiment, the server 120 may include one independently operating server, a distributed server, or a server cluster composed of a plurality of servers. The server 120 may include a network communication unit, a processor, a memory, and the like.
In embodiments of the present disclosure, the server 120 may perform visual relationship detection on the object detection information. Visual relationship detection combines images with semantics: it needs to identify not only the objects in the image and their positions, but also the relationships between the objects. A visual relationship is defined as a pair of objects connected by a predicate, is usually expressed in subject-predicate-object form, and can be used to describe the interaction relationship between every two objects. Visual relationship detection is a basis for image understanding and can be applied to object detection, image description, visual question answering, image retrieval, and the like.
Fig. 2 is a flowchart illustrating an image processing method according to an exemplary embodiment. As shown in fig. 2, the method is used in a server and includes the following steps.
S210, inputting an image to be processed into an image detection model to perform object detection, and obtaining object detection information corresponding to at least two objects in the image to be processed respectively;
As an optional embodiment, in the image detection model, each object in the image to be processed is detected according to a preset labeling frame, and a detection area of each object in the image to be processed is extracted. Extracting feature information corresponding to each object in a detection area corresponding to each object, and determining the object in the image to be processed according to the feature information corresponding to each object so as to obtain object detection information. The image detection model can be different image detection models such as a Fast R-CNN model, an R-CNN model and the like. When the object detection information is input into the visual relation detection model to detect the visual relation, the visual relation corresponding to the object detection information can be detected.
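For illustration only, here is a minimal sketch of this object detection step using a pretrained torchvision Faster R-CNN as a stand-in for the image detection model; the confidence threshold and model choice are illustrative assumptions, not the patent's implementation.

```python
import torch
import torchvision

# Pretrained Faster R-CNN stands in for the image detection model; the
# patent only names "Fast R-CNN model, R-CNN model" and the like.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # placeholder for the image to be processed

with torch.no_grad():
    det = model([image])[0]  # boxes, labels, scores for detected objects

keep = det["scores"] > 0.5   # illustrative confidence threshold
object_detection_info = [
    {"box": box.tolist(), "label": int(label)}
    for box, label in zip(det["boxes"][keep], det["labels"][keep])
]
```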
As an alternative embodiment, the combined detection information may also be acquired based on an image detection model. The image detection model can detect every two objects in the image to be processed, extracts detection areas of every two objects in the image to be processed, extracts joint characteristic information corresponding to the two objects from the detection areas corresponding to the every two objects, and determines every two objects in the image to be processed according to the joint characteristic information, so that combined detection information is obtained. The combination detection information includes two pieces of object detection information having an interactive relationship in the image to be processed. When the combination detection information is input into the visual relationship detection model to perform visual relationship detection, the visual relationship corresponding to the combination detection information may be detected.
The combination of two objects without interaction relationship can be eliminated by utilizing the combination detection information, so that the data volume to be detected is reduced, and the efficiency of visual relationship detection in the subsequent steps is improved.
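A small sketch of this pairing step follows; the `interacts` scorer is a hypothetical placeholder for whatever joint-feature test rules out non-interacting pairs.

```python
from itertools import permutations

def candidate_pairs(object_detection_info, interacts):
    """Enumerate ordered (subject, object) pairs and keep only those with
    an interaction. `interacts` is a hypothetical scorer, e.g. a classifier
    run on the joint features of the union region of the two boxes."""
    return [(subj, obj) for subj, obj in permutations(object_detection_info, 2)
            if interacts(subj, obj)]
```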
S220, inputting object detection information into a visual relation detection model to perform visual relation detection, so as to obtain visual relation between every two objects, wherein the visual relation represents interaction relation between every two objects in an image to be processed;
As an alternative embodiment, the interaction relationship between two objects may include an action relationship, a spatial relationship, a preposition relationship, and a comparison relationship. An action relationship expresses one object taking an action on another object, such as a person riding a bicycle; a spatial relationship expresses the relative position between two objects, such as a cup to the left of a book. A preposition relationship expresses an association between two objects in terms of membership, state, direction, and similar information, such as the tire of a vehicle. A comparison relationship expresses a distinction between two objects, e.g., a first apple being larger than a second apple. The visual relationship detection model can detect the visual relationship between the objects corresponding to the object detection information and semantically adjust that relationship. The visual relationship between two objects corresponds to a triplet consisting of the two objects, as subject and object, and a predicate between them. When the visual relationship detection model performs semantic adjustment on the visual relationship, the amount of information of the predicate between subject and object can be adjusted to obtain a predicate with richer meaning.
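For illustration, such a triple can be represented as a simple typed record; this is a sketch, not the patent's data structure.

```python
from typing import NamedTuple

class VisualRelationship(NamedTuple):
    subject: str    # e.g. "person"
    predicate: str  # e.g. "riding" (action), "left of" (spatial), ...
    object: str     # e.g. "bicycle"

rel = VisualRelationship("person", "riding", "bicycle")  # an action relationship
```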
As an optional embodiment, the visual relationship detection model includes a predicate identification network, the inputting the object detection information into the visual relationship detection model to perform visual relationship detection, and obtaining the visual relationship between every two objects includes:
Inputting the object detection information into a predicate identification network to identify predicates corresponding to predicate relations among the objects, thereby obtaining target predicates;
And obtaining a visual relationship according to the target predicate and the object corresponding to the target predicate.
As an optional embodiment, the visual relationship detection model includes a predicate identification network, in which predicates corresponding to the predicate relationship between two objects are identified, the identified target predicate is a semantically adjusted predicate, and the two objects and the corresponding target predicate form a visual relationship with a subject, a predicate and an object.
Predicate relationships exist between the objects corresponding to the object detection information; the predicates corresponding to these predicate relationships are identified in the predicate identification network, and semantic adjustment can be performed to obtain the target predicates. Each target predicate corresponds to two objects, one of which can be taken as the subject and the other as the object, so that a visual relationship with a subject, a predicate and an object can be determined. For example, when the object "person" is taken as the subject, "has" as the target predicate, and the object "hand" as the object, the resulting visual relationship is (person, has, hand).
The target predicates between every two objects are determined through the predicate identification network, and the predicate identification network comprises a semantic adjustment layer, so that the accuracy of predicate identification can be improved.
As an optional embodiment, referring to fig. 3, the predicate identification network includes an initial correlation calculation layer and a semantic adjustment layer, and inputting object detection information into the predicate identification network to perform predicate identification between two objects, where obtaining a target predicate includes:
s310, inputting object detection information and preset predicates into an initial correlation calculation layer, and performing correlation calculation on predicates corresponding to the object detection information and each preset predicate to obtain initial correlation distribution information;
S320, inputting the initial correlation degree distribution information into a semantic adjustment layer, and performing predicate semantic adjustment on the initial correlation degree distribution information based on a preset matrix to obtain target correlation degree distribution information;
S330, determining target predicates according to the target correlation distribution information.
As an optional embodiment, the predicate identification network includes an initial relevance calculating layer and a semantic adjustment layer, the initial relevance calculating layer may be used to calculate initial relevance distribution information, and the semantic adjustment layer may be used to perform semantic adjustment on the initial relevance distribution information to obtain target relevance distribution information. The preset predicates comprise a plurality of predicates, the initial correlation distribution information is the probability distribution of a certain preset predicate corresponding to the detection information of the two objects before semantic adjustment, and the correlation between the predicates corresponding to the detection information of the two objects before semantic adjustment and each preset predicate is represented.
When the initial correlation distribution information is subjected to predicate semantic adjustment based on a preset matrix to obtain target correlation distribution information, predicates corresponding to the initial correlation distribution information before semantic adjustment and predicates corresponding to the target correlation distribution information after semantic adjustment have semantic correlation, for example, the predicates corresponding to the initial correlation distribution information are "on top", the predicates corresponding to the target correlation distribution information are "riding", wherein "riding" also has the meaning of "on top", and semantic correlation exists between the two predicates, so that the semantic adjustment can be determined to be correct. If the predicate corresponding to the initial relevance distribution information is "above", the predicate corresponding to the target relevance distribution information cannot be adjusted to "below", because the semantics of "above" and "below" are exactly opposite, and there is no semantic relevance between the two predicates.
The target correlation distribution information characterizes correlation between predicates corresponding to the post-semantic-adjustment pairwise object detection information and each preset predicate. According to the magnitude of each correlation in the target correlation distribution information, a correlation maximum value in the target correlation distribution information can be determined, and a preset predicate corresponding to the correlation maximum value is determined as a target predicate.
As an alternative embodiment, the calculation performed when identifying the predicate corresponding to the predicate relationship between a pair of object detection information can be expressed as:

$$p_i^{g} = P\left(y_i^{g} = r \mid (o_j, o_k); \theta\right), \quad r \in R$$

wherein $p_i^{g}$ is the initial correlation distribution information, i.e., the probability distribution over predicates before semantic adjustment. $R$ represents the preset predicates and $K$ represents the number of preset predicates; the preset predicates comprise different kinds of predicates. $y_i$ denotes a predicate; a $g$ superscript denotes an output before semantic adjustment and an $s$ superscript denotes an output after semantic adjustment. $(o_j, o_k)$ represents the object detection information pair, and $\theta$ represents the model parameters of the visual relationship detection model. From this probability distribution before semantic adjustment, the correlation between the predicate corresponding to the object detection information and each preset predicate is determined.

The probability distribution over predicates after semantic adjustment, i.e., the target correlation distribution information, is obtained by adjusting the distribution before semantic adjustment with the preset matrix:

$$p_i^{s} = P\left(y_i^{s} = r \mid (o_j, o_k); \theta\right) = T \, p_i^{g}$$

where $T$ denotes the preset matrix used by the semantic adjustment layer to perform semantic adjustment on the initial correlation distribution information. The correlation between the semantically adjusted predicate and each preset predicate is determined from $p_i^{s}$, and the preset predicate corresponding to the maximum correlation is taken as the predicate for the object detection information pair.

Each entry of the preset matrix can be read as a transition probability $P\left(y^{s} \mid y^{g}\right)$ that measures the confidence of converting a predicate without rich semantic information into a predicate with rich semantic information.
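For illustration only, a minimal numeric sketch of this adjustment step, assuming the preset matrix is given as a dense K x K array; all names and values below are illustrative, not taken from the patent.

```python
import numpy as np

def adjust(p_g, T):
    """Apply the semantic adjustment layer: map the initial correlation
    distribution p_g over the K preset predicates to the target correlation
    distribution p_s through the preset matrix T, then take the argmax."""
    p_s = T @ p_g
    return p_s, int(np.argmax(p_s))  # target distribution, target predicate id

K = 4
p_g = np.array([0.6, 0.2, 0.1, 0.1])          # illustrative initial distribution
p_s, target_predicate = adjust(p_g, np.eye(K))  # identity = pure retention
```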
The initial correlation degree distribution information is subjected to predicate semantic adjustment through a semantic adjustment layer, so that target predicates with rich semantic information quantity can be obtained, the accuracy of predicate identification is improved, and the effectiveness of target scene graphs generated in subsequent steps can be improved.
As an optional embodiment, please refer to fig. 4, inputting the initial relevance distribution information into the semantic adjustment layer, performing predicate semantic adjustment on the initial relevance distribution information based on a preset matrix, and obtaining the target relevance distribution information includes:
s410, determining an initial predicate according to the initial correlation distribution information;
s420, performing predicate semantic adjustment on initial correlation distribution information based on a semantic adjustment matrix in a preset matrix under the condition that the initial predicate is a general predicate, wherein the general predicate characterizes predicates with probability larger than a preset threshold value in the preset predicate;
S430, determining the initial correlation distribution information as target correlation distribution information based on a semantic retention matrix in a preset matrix under the condition that the initial predicate is a non-universal predicate, wherein the non-universal predicate characterizes predicates with use probabilities smaller than a preset threshold value in the preset predicates.
As an alternative embodiment, based on Shannon's theory of semantic information, the amount of semantic information contained in a predicate can be determined from its probability of occurrence: predicates with a small occurrence probability contain more semantic information. The amount of semantic information determines whether a predicate is a general predicate or a non-general predicate. A general predicate contains less semantic information, and its use probability among the preset predicates is larger than a preset threshold; a non-general predicate contains more semantic information, and its use probability among the preset predicates is smaller than the preset threshold. For example, in "a person on a bicycle", "on" describes the relationship between the person and the bicycle, but "riding" in "a person riding a bicycle" also expresses the person's action on the bicycle, so "riding" carries more semantic information than "on". "On" indicates only the relative position of two objects, whereas "riding" indicates an action: wherever "riding" applies ("a person riding a horse"), "on" also applies, but wherever "on" applies ("a book on a table"), "riding" does not necessarily apply. The probability of occurrence of "riding" is therefore significantly lower than that of "on", and predicates with lower occurrence probability carry more semantic information.
When the initial correlation distribution information is input into a semantic adjustment matrix for predicate semantic adjustment, if the initial predicate corresponding to the initial correlation distribution information is a general predicate, the condition that the amount of semantic information contained in the initial predicate is small is indicated, and the correlation distribution in the initial correlation distribution information can be adjusted through the semantic adjustment matrix to obtain target correlation distribution information, so that the predicates corresponding to the two-object detection information are associated with preset predicates containing a large amount of semantic information to obtain the target predicate.
When the initial predicate corresponding to the initial correlation distribution information is a non-general predicate, the initial predicate already contains a large amount of semantic information; the semantic retention matrix leaves the correlation distribution in the initial correlation distribution information unadjusted, so that the initial predicate is retained and the predicate corresponding to the object detection information is associated with a preset predicate containing a large amount of semantic information to obtain the target predicate.
Based on a semantic adjustment matrix in the preset matrix, when the predicates corresponding to the predicate relation between every two object detection information are identified as general predicates, semantic adjustment can be performed on the initial relevance distribution information, and the identification result with the low semantic information content is converted into the identification result with the high semantic information content. Based on the semantic retention matrix in the preset matrix, when the predicates corresponding to the predicate relation between every two object detection information are recognized as non-universal predicates, semantic adjustment on the initial relevance distribution information can be reduced, and therefore recognition results with rich semantic information quantity can be retained.
When the semantic adjustment is carried out, the recognition result without rich semantic information is adjusted, and the recognition result with rich semantic information is maintained, so that the false adjustment of predicates with rich semantic information can be avoided, and the effectiveness of the semantic adjustment is improved.
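A hedged sketch of how the choice between the two matrices might look in code, assuming the set of general predicate indices and the semantic adjustment matrix are produced at training time; the names are illustrative.

```python
import numpy as np

def semantic_adjustment(p_g, adjustment_matrix, general_predicate_ids):
    """Adjust only when the initial predicate is a general predicate;
    otherwise apply the semantic retention (identity) matrix."""
    initial_predicate = int(np.argmax(p_g))
    if initial_predicate in general_predicate_ids:
        return adjustment_matrix @ p_g   # redistribute toward richer predicates
    return np.eye(len(p_g)) @ p_g        # keep the distribution unchanged
```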
S230, inputting the visual relationship and object detection information corresponding to the visual relationship into a scene graph generation model to generate a scene graph, and obtaining a target scene graph corresponding to the image to be processed, wherein the target scene graph is structural information marked with the visual relationship between every two objects.
As an alternative embodiment, the visual relationship and the object detection information corresponding to the visual relationship are input into the scene graph generation model, and a target scene graph can be obtained from the visual relationships between the objects corresponding to the object detection information. The target scene graph is structural information formed by points and edges: the points in the target scene graph represent objects, and the edges represent the visual relationships between every two objects. Visual relationships with subject, predicate and object may be displayed in the target scene graph. For example, the visual relationship (person, has, hand) is displayed in the target scene graph, where "person" is the subject, "has" is the predicate, and "hand" is the object.
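For illustration, a small sketch of a target scene graph as points and edges, using networkx purely as a convenient graph container; this is an assumption, not something the patent prescribes.

```python
import networkx as nx

# Target scene graph: nodes are detected objects, labelled directed edges
# are the visual relationships between pairs of objects.
scene_graph = nx.DiGraph()
for subj, pred, obj in [("person", "has", "hand"),
                        ("person", "riding", "bicycle")]:
    scene_graph.add_edge(subj, obj, predicate=pred)

print(list(scene_graph.edges(data=True)))
# [('person', 'hand', {'predicate': 'has'}),
#  ('person', 'bicycle', {'predicate': 'riding'})]
```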
As an alternative embodiment, the method further includes a model training method, please refer to fig. 5, the model training method includes:
s510, inputting the marked image into an image detection model to perform object detection, and obtaining training object detection information corresponding to each object in the marked image;
s520, inputting training object detection information into a first model to be trained to detect visual relations, and obtaining a first training visual relation between every two objects;
s530, inputting a first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained to generate a scene graph, and obtaining a first training scene graph corresponding to the labeling image;
S540, training the first model to be trained and the second model to be trained according to the first training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model.
As an optional embodiment, during model training the image detection model has been trained in advance, so the training object detection information required for training can be extracted through the image detection model; the training object detection information is the object detection information corresponding to the objects in the labeling image. A fully supervised training mode is adopted, in which labeling images are acquired that are labeled with the reference visual relationship between every two objects. The labeling image is input into the image detection model, each object in the labeling image is detected according to a preset labeling frame, and the detection area of each object in the labeling image is extracted. Feature information corresponding to each object is extracted within its detection area, and the objects in the labeling image are determined according to this feature information to obtain the training object detection information. At this time, the first model to be trained does not include the preset matrix, i.e., the first model to be trained does not yet have the semantic adjustment function. After the preset matrix to be trained is trained based on the training visual relationship and the reference visual relationship, the trained preset matrix is added into the first visual relationship detection model.
The training object detection information is input into a first to-be-trained model for visual relation detection, predicates corresponding to predicate relations between every two training object detection information can be identified based on a first to-be-trained network in the first to-be-trained model to obtain a first training target predicate, the training target predicate and objects corresponding to the first training target predicate can be combined based on a second to-be-trained network in the first to-be-trained model to obtain a first training visual relation with a main subject, a predicate and an object, and the first training visual relation characterizes interaction relations between every two objects in the marked image obtained through the first to-be-trained model.
Inputting the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained to generate a scene graph, and marking the first training visual relationship among objects in a marked image in the second model to be trained to obtain a first training scene graph corresponding to the marked image, wherein the first training scene graph is the structural information marked with the first training visual relationship among every two objects.
First loss data between the first training visual relationship and the reference visual relationship is calculated; the first loss data can be a loss function between the two. The first model to be trained and the second model to be trained are trained according to the first loss data to obtain the first visual relationship detection model and the initial scene graph generation model, where the first visual relationship detection model is a visual relationship detection model without the preset matrix.
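A minimal sketch of this training signal, assuming cross-entropy as the loss function; the patent only says the first loss data "can be a loss function" between the two relationships.

```python
import torch
import torch.nn.functional as F

def first_loss_data(predicate_logits, reference_predicate_ids):
    """Cross-entropy between the first training visual relationship
    (predicate logits per object pair) and the reference visual
    relationship (ground-truth predicate ids); the choice of
    cross-entropy is an assumption."""
    return F.cross_entropy(predicate_logits, reference_predicate_ids)

logits = torch.randn(8, 50, requires_grad=True)  # 8 object pairs, 50 predicates
targets = torch.randint(0, 50, (8,))             # reference predicate ids
loss = first_loss_data(logits, targets)
loss.backward()  # gradients drive both models to be trained
```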
When the model is trained, only the visual relation detection model and the scene graph generation model are required to be trained, most training steps are completed based on the source domain, and only fine adjustment is required to be performed on the target domain, so that the training cost is reduced.
As an alternative embodiment, referring to fig. 6, after training the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model, the method further includes:
S610, detecting word frequency information corresponding to each reference predicate in the reference visual relationship;
S620, classifying the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain a reference predicate type corresponding to each annotated image;
S630, combining the first visual relationship detection model with the preset matrix to obtain a second visual relationship detection model;
S640, inputting the training object detection information into the second visual relationship detection model for visual relationship detection to obtain a second training visual relationship between every two objects;
S650, inputting the second training visual relationship and the training object detection information corresponding to the second training visual relationship into the initial scene graph generation model for scene graph generation to obtain a second training scene graph corresponding to the annotated image;
S660, adjusting the second visual relationship detection model and the initial scene graph generation model based on the reference predicate type corresponding to each annotated image, the second training visual relationship and the reference visual relationship to obtain the visual relationship detection model and the scene graph generation model.
As an alternative embodiment, fig. 7 is a schematic diagram of the transfer learning. Because predicates with a small occurrence probability contain more semantic information, the word frequency information corresponding to each reference predicate in the reference visual relationship is detected; the word frequency information characterizes the occurrence probability of each reference predicate among all the reference predicates. The semantic information amount contained in each reference predicate is then calculated from the word frequency information, estimated by the following formula:
$I(y_i) = -\log_b\left[\Pr(y_i)\right]$
where $y_i$ denotes a predicate, $\Pr(y_i)$ denotes its word frequency information, i.e., the occurrence probability of the predicate, and $I(y_i)$ denotes the semantic information amount of the predicate. The smaller the word frequency information, the larger the semantic information amount contained in the predicate, and the larger the word frequency information, the smaller the semantic information amount. The reference predicates are sorted from small to large by semantic information amount to obtain a reference predicate sequence. A preset number of reference predicates counted from the first reference predicate are taken as general predicates, the remaining reference predicates are taken as non-general predicates, and the reference predicates are thus divided into two classes. General predicates are predicates whose occurrence probability is greater than a preset probability, and non-general predicates are predicates whose occurrence probability is smaller than the preset probability, where the preset probability corresponds to the occurrence probability of the last reference predicate among the preset number of reference predicates. For example, the preset number may be 15, i.e., the first fifteen reference predicates of the sequence are used as general predicates and the reference predicates after them as non-general predicates.
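A minimal sketch of this split, assuming base-2 logarithms (the patent leaves the base $b$ unspecified) and hypothetical word counts:

```python
import math

def split_predicates(freqs, preset_number=15, base=2):
    """Sort reference predicates by semantic information amount
    I(y) = -log_b Pr(y), from small to large, and split them into
    general and non-general predicates."""
    total = sum(freqs.values())
    info = {p: -math.log(n / total, base) for p, n in freqs.items()}
    # Low information amount == high word frequency, so the first
    # `preset_number` predicates of the sequence are the general ones.
    ordered = sorted(info, key=info.get)
    return ordered[:preset_number], ordered[preset_number:]

# Hypothetical word counts per reference predicate.
general, non_general = split_predicates(
    {"on": 900, "has": 700, "riding": 12, "eating": 5}, preset_number=2)
# general -> ['on', 'has'];  non_general -> ['riding', 'eating']
```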
The annotated images are taken as the source domain, and at the same time the annotated images of general predicates are downsampled to obtain the target domain. The first visual relationship detection model, the preset matrix and the initial scene graph generation model trained on the source domain are migrated to the target domain; the first visual relationship detection model and the preset matrix are combined to obtain the second visual relationship detection model; and the last neural network layer among the sequentially arranged neural network layers of the second visual relationship detection model and the initial scene graph generation model is adjusted to obtain the visual relationship detection model and the scene graph generation model, where that last layer is the classification layer. When the second visual relationship detection model and the initial scene graph generation model are adjusted on the target domain, sample images can be acquired from the annotated images bearing the reference predicate types for the adjustment, without using all of the annotated images.
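Adjusting only the last (classification) layer corresponds to freezing all earlier parameters. A minimal sketch, assuming for illustration that a model is expressed as a plain PyTorch `nn.Sequential` stack:

```python
import torch.nn as nn

def finetune_last_layer(model: nn.Sequential):
    """Freeze everything except the final (classification) layer,
    matching the adjustment of only the last neural network layer
    when the models are migrated to the target domain."""
    for param in model.parameters():
        param.requires_grad = False
    last = list(model.children())[-1]  # final layer = classifier
    for param in last.parameters():
        param.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# Only these parameters would be handed to the optimizer on the target domain.
trainable = finetune_last_layer(
    nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 50)))
```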
When the first visual relationship detection model and the initial scene graph generation model are adjusted, the preset matrix can be obtained; the preset matrix is a matrix trained based on the first visual relationship detection model and the initial scene graph generation model. Combining the preset matrix with the first visual relationship detection model yields the second visual relationship detection model, i.e., the visual relationship detection model equipped with the preset matrix.
After the annotated image is input into the image detection model for object detection to obtain the training object detection information, the training object detection information is input into the second visual relationship detection model for visual relationship detection, yielding the second training visual relationship between every two objects; the second training visual relationship characterizes the interaction relationship between every two objects in the annotated image as obtained through the second visual relationship detection model. The preset matrix can perform semantic adjustment on the initial correlation distribution information corresponding to the training object detection information to obtain target correlation distribution information, from which a second target training predicate can be determined, yielding the second training visual relationship.
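The text describes multiplying the initial correlation distribution information by the matrix, with adjustment applied for general predicates and retention for non-general ones; the exact tensor layout is not specified, so this sketch assumes row vectors and a row-stochastic preset matrix:

```python
import numpy as np

def adjust_distribution(initial_dist, preset_matrix, general_ids):
    """Predicate semantic adjustment as described above: if the initially
    most likely predicate is a general one, redistribute the initial
    correlation distribution through the preset matrix; otherwise keep
    it unchanged (semantic retention)."""
    if int(np.argmax(initial_dist)) in general_ids:
        return initial_dist @ preset_matrix  # rows of C* sum to 1
    return initial_dist

initial = np.array([0.6, 0.3, 0.1])  # toy 3-predicate distribution
C_star = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]])
target = adjust_distribution(initial, C_star, general_ids={0, 1})
```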
The second training visual relationship and its corresponding training object detection information are input into the initial scene graph generation model for scene graph generation, yielding the second training scene graph corresponding to the annotated image; the second training scene graph is the structural information annotated with the second training visual relationships between every two objects.
The second training visual relationship is a detection result obtained based on the information amount corresponding to the reference predicate type. Second loss data between the second training visual relationship and the reference visual relationship is calculated based on the reference predicate type corresponding to each annotated image; the second loss data can be a loss function between the second training visual relationship and the reference visual relationship. The second visual relationship detection model and the initial scene graph generation model are then adjusted according to the second loss data to obtain the visual relationship detection model and the scene graph generation model.
The annotated images are classified according to the semantic information amount contained in their predicates, and the model adjustment is performed based on the reference predicate types, the training visual relationship and the reference visual relationship, so the models gain the ability to distinguish general predicates from non-general predicates, improving the accuracy of predicate recognition. Moreover, because the second visual relationship detection model and the initial scene graph generation model are adjusted rather than retrained, overfitting can be avoided.
As an alternative embodiment, referring to fig. 8, training the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model includes:
S810, inputting the training object detection information into the first visual relationship detection model for visual relationship detection to obtain an initial visual relationship between every two objects;
S820, inputting the initial visual relationship and the training object detection information corresponding to the initial visual relationship into the initial scene graph generation model for scene graph generation to obtain an initial scene graph corresponding to the annotated image, the initial scene graph being the structural information annotated with the initial visual relationships between every two objects;
S830, determining an initial matrix according to the predicates in the initial visual relationship and the reference predicates in the reference visual relationship;
S840, obtaining the preset matrix according to the normalized matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
As an optional embodiment, the training object detection information is input into the first visual relationship detection model for visual relationship detection, yielding initial correlation distribution information; an initial predicate can be determined from the initial correlation distribution information, and from the initial predicates the initial visual relationship between every two objects is obtained. The initial visual relationship characterizes the interaction relationship between every two objects in the annotated image as obtained through the first visual relationship detection model.
The initial visual relationship and its corresponding training object detection information are input into the initial scene graph generation model for scene graph generation, yielding the initial scene graph corresponding to the annotated image, i.e., the structural information annotated with the initial visual relationships between every two objects. By comparing the predicates in the initial visual relationship with the reference predicates in the reference visual relationship, the correctly classified predicates and the incorrectly classified predicates can be determined.
As an alternative embodiment, the preset matrix may be expressed as:
$C^* \in \mathbb{R}^{K \times K}$
where $C^*$ denotes the preset matrix, $\mathbb{R}^{K \times K}$ the space of real $K \times K$ matrices, and $K$ the number of preset predicates, the preset predicates being the different predicate classes. In the process of acquiring the preset matrix, a confusion matrix for predicate recognition is first initialized to obtain the initial matrix, which can be expressed as:
$C \in \mathbb{R}^{K \times K}$
Each element of the initial matrix is denoted $C_{j,k}$ and records the number of samples labeled as a class-$j$ predicate but predicted as a class-$k$ predicate. Here $j$ may equal $k$; when $j = k$, the annotation result and the recognition result for the predicate agree.
Since an element of the semantic adjustment part of the preset matrix represents predicates labeled as the $j$-th class but predicted as the $k$-th class, the elements can be determined from the numbers of correctly and incorrectly classified predicates. For example, suppose the reference predicates contain 100 instances of predicate A, which has class index 3, but among the initial visual relationships predicted for those 100 instances only 50 are predicate A, while 30 are predicate B (class index 4) and 20 are predicate C (class index 5). The number of correctly classified predicates is then 50 and the numbers of incorrectly classified predicates are 30 and 20, recorded in the matrix as $C_{3,3} = 50$, $C_{3,4} = 30$ and $C_{3,5} = 20$.
The preset matrix is obtained from the normalized matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix: the normalized matrix is the semantic adjustment matrix, and the identity matrix is the semantic retention matrix. Normalizing each row of the initial matrix yields the normalized semantic adjustment matrix $C'$, which can be calculated as:

$C'_{j,k} = \dfrac{C_{j,k}}{\sum_{k'=1}^{K} C_{j,k'}}$

The semantic adjustment matrix $C'$ represents, to a certain extent, the semantic correlation among predicates, but its diagonal elements are small for predicates rich in semantic information, so directly multiplying the initial correlation distribution information by $C'$ would reduce the probability that semantically rich predicates are recognized. Therefore a semantic retention matrix is added to $C'$; based on the semantic retention matrix the recognition results of semantically rich predicates are retained, and the preset matrix $C^*$ is obtained. The specific formula is:
$C^* = (C' + I_K) \times 0.5$
where $I_K \in \mathbb{R}^{K \times K}$ is the identity matrix; multiplying the whole expression by 0.5 ensures that the elements of each row of the preset matrix sum to 1.
After the preset matrix is obtained, the preset matrix is added into the initial scene graph generation model, so that the initial scene graph generation model has a semantic adjustment function.
In the training process, the trained preset matrix is used and its parameters are fixed, which avoids semantic drift and improves the accuracy of the semantic adjustment.
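A minimal NumPy sketch of steps S830–S840 under the formulas above; the function name and toy data are illustrative, and rows with no labeled samples are kept at zero to avoid division by zero:

```python
import numpy as np

def preset_matrix(labels, preds, K):
    """Build the confusion matrix C (labeled j, predicted k), row-normalize
    it into the semantic adjustment matrix C', and average it with the
    identity (semantic retention) matrix: C* = (C' + I_K) * 0.5."""
    C = np.zeros((K, K))
    for j, k in zip(labels, preds):
        C[j, k] += 1
    row_sums = C.sum(axis=1, keepdims=True)
    # Row normalization; empty rows stay zero instead of dividing by zero.
    C_prime = np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)
    return 0.5 * (C_prime + np.eye(K))  # populated rows still sum to 1

# Toy run with the worked example above: class 3 labeled 100 times,
# predicted as classes 3/4/5 in a 50/30/20 split (other classes omitted).
labels = [3] * 100
preds = [3] * 50 + [4] * 30 + [5] * 20
C_star = preset_matrix(labels, preds, K=6)
# C_star[3] -> [0, 0, 0, 0.75, 0.15, 0.10]
```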
As an alternative embodiment, fig. 9 is a schematic diagram of the process of inputting a picture to be processed and generating a target scene graph. The image to be processed is input into the image detection model for object detection; the annotation box and feature information of each object are detected, the positions of the four objects "racket", "hand", "person" and "short sleeve" are determined, and the object detection information of these four objects is obtained. The object detection information is then input into the visual relationship detection model to obtain the visual relationship between every two objects. In the visual relationship detection model, the predicate corresponding to the two objects "racket" and "hand" is identified as the target predicate "on top", so the visual relationship (racket, on top, hand), composed of subject, predicate and object, can be determined. The predicate corresponding to the two objects "person" and "hand" is identified as the target predicate "has", giving the visual relationship (person, has, hand). The predicate corresponding to "short sleeve" and "person" is identified as the target predicate "on", giving the visual relationship (short sleeve, on, person). The object detection information of the four objects is input into the scene graph generation model, and the object detection information of each object is annotated with the visual relationships between every two objects to obtain the target scene graph. The target scene graph can be applied to image retrieval, visual question answering and other tasks. For example, in image retrieval, if the input retrieval information is an image of a person wearing a short sleeve, target scene graphs containing the visual relationship (short sleeve, on, person), generated by the visual relationship detection model and the scene graph generation model, can be searched to obtain the retrieval result. Or, when a user inputs the question "what is worn on the person", the answer "short sleeve" is obtained by identifying the visual relationship (short sleeve, on, person) in the target scene graph generated by the visual relationship detection model and the scene graph generation model, completing the visual question answering.
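The triples from the fig. 9 walk-through can be held in any graph structure; a minimal sketch of the retrieval/question-answering use described above, with the scene graph stored as a plain list of (subject, predicate, object) tuples:

```python
# Triples from the fig. 9 walk-through, each (subject, predicate, object).
scene_graph = [
    ("racket", "on top", "hand"),
    ("person", "has", "hand"),
    ("short sleeve", "on", "person"),
]

def answer(predicate, obj):
    """Tiny visual question answering over the target scene graph:
    'what is worn on the person' -> find the subject of (?, on, person)."""
    return [s for s, p, o in scene_graph if p == predicate and o == obj]

print(answer("on", "person"))  # -> ['short sleeve']
```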
The embodiment of the disclosure provides an image processing method: an image to be processed is input into an image detection model for object detection to obtain object detection information, and the object detection information is input into a visual relationship detection model for visual relationship detection. During visual relationship detection, semantic adjustment is performed on the predicates corresponding to the predicate relationships between every two detected objects, so the target predicates contain rich semantic information and the accuracy of predicate recognition is improved. In the subsequent steps, the visual relationships between every two objects are generated from the target predicates and their corresponding objects, and the visual relationships are input into a scene graph generation model to generate the target scene graph. The accuracy of the visual relationships annotated in the target scene graph is thereby improved, which improves the effectiveness of the target scene graph.
Fig. 10 is a block diagram of an image processing apparatus according to an exemplary embodiment. Referring to fig. 10, the apparatus includes:
The object detection module 1010 is configured to perform object detection by inputting an image to be processed into the image detection model, so as to obtain object detection information corresponding to at least two objects in the image to be processed respectively;
The visual relation detection module 1020 is configured to perform visual relation detection by inputting object detection information into the visual relation detection model to obtain a visual relation between every two objects, wherein the visual relation represents an interaction relation between every two objects in the image to be processed;
The scene graph generating module 1030 is configured to perform scene graph generation by inputting the visual relationship and the object detection information corresponding to the visual relationship into the scene graph generation model, so as to obtain a target scene graph corresponding to the image to be processed, where the target scene graph is the structural information annotated with the visual relationships between every two objects.
As an alternative embodiment, the visual relationship detection model includes a predicate identification network and the visual relationship detection module 1020 includes:
The predicate identification unit is configured to perform predicate identification corresponding to the predicate relation between every two objects by inputting object detection information into the predicate identification network, so as to obtain a target predicate, wherein the target predicate represents the predicate after semantic adjustment;
And a visual relationship determination unit configured to obtain the visual relationship according to the target predicate and the object corresponding to the target predicate.
As an alternative embodiment, the predicate identification network includes an initial correlation calculation layer and a semantic adjustment layer, and the predicate identification unit includes:
The initial correlation calculation unit is configured to input the object detection information and the preset predicates into the initial correlation calculation layer and perform correlation calculation between the predicates corresponding to the pairwise object detection information and each preset predicate, obtaining initial correlation distribution information, which characterizes the correlation between the predicates corresponding to the pairwise object detection information before semantic adjustment and each preset predicate;
The semantic adjustment unit is configured to input the initial correlation distribution information into the semantic adjustment layer and perform predicate semantic adjustment on it based on the preset matrix, obtaining target correlation distribution information, which characterizes the correlation between the predicates corresponding to the pairwise object detection information after semantic adjustment and each preset predicate;
And a target predicate determination unit configured to perform determination of a target predicate according to the target correlation distribution information.
As an alternative embodiment, the semantic adjustment unit comprises:
an initial predicate determination unit configured to perform determination of an initial predicate according to the initial correlation distribution information;
The first semantic adjustment unit is configured to perform predicate semantic adjustment on the initial correlation distribution information based on the semantic adjustment matrix in the preset matrix in the case where the initial predicate is a general predicate, a general predicate characterizing a preset predicate whose usage probability is greater than a preset threshold;
And a second semantic adjustment unit configured to determine the initial correlation distribution information as the target correlation distribution information based on the semantic retention matrix in the preset matrix in the case where the initial predicate is a non-general predicate, a non-general predicate characterizing a preset predicate whose usage probability is smaller than the preset threshold.
As an alternative embodiment, the apparatus further comprises:
The first training feature extraction module is configured to input the annotated image into the image detection model for feature extraction to obtain training object detection information corresponding to each object in the annotated image;
The first training visual relationship detection module is configured to input the training object detection information into the first model to be trained for visual relationship detection to obtain a first training visual relationship between every two objects, the first training visual relationship characterizing the interaction relationship between every two objects in the annotated image as obtained through the first model to be trained;
The first training scene graph generation module is configured to input the first training visual relationship and the training object detection information corresponding to the first training visual relationship into the second model to be trained for scene graph generation to obtain a first training scene graph corresponding to the annotated image, the first training scene graph being the structural information annotated with the first training visual relationships between every two objects;
The model training module is configured to train the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model.
As an alternative embodiment, the apparatus further comprises:
The word frequency information detection module is configured to detect the word frequency information corresponding to each reference predicate in the reference visual relationship;
The second visual relationship detection model acquisition module is configured to combine the first visual relationship detection model with the preset matrix to obtain the second visual relationship detection model;
The second training visual relationship acquisition module is configured to input the training object detection information into the second visual relationship detection model for visual relationship detection to obtain a second training visual relationship between every two objects, the second training visual relationship characterizing the interaction relationship between every two objects in the annotated image as obtained through the second visual relationship detection model;
The second training scene graph acquisition module is configured to input the second training visual relationship and the training object detection information corresponding to the second training visual relationship into the initial scene graph generation model for scene graph generation to obtain a second training scene graph corresponding to the annotated image, the second training scene graph being the structural information annotated with the second training visual relationships between every two objects;
And the model adjustment module is configured to adjust the second visual relationship detection model and the initial scene graph generation model based on the reference predicate type corresponding to each annotated image, the second training visual relationship and the reference visual relationship to obtain the visual relationship detection model and the scene graph generation model.
As an alternative embodiment, the apparatus further comprises:
The initial visual relationship detection module is configured to input the training object detection information into the first visual relationship detection model for visual relationship detection to obtain an initial visual relationship between every two objects, the initial visual relationship characterizing the interaction relationship between every two objects in the annotated image as obtained through the first visual relationship detection model;
The initial scene graph generation module is configured to input the initial visual relationship and the training object detection information corresponding to the initial visual relationship into the initial scene graph generation model for scene graph generation to obtain an initial scene graph corresponding to the annotated image, the initial scene graph being the structural information annotated with the initial visual relationships between every two objects;
The initial matrix determination module is configured to determine the initial matrix according to the predicates in the initial visual relationship and the reference predicates in the reference visual relationship;
The preset matrix determination module is configured to obtain the preset matrix according to the normalized matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method and will not be repeated here.
Fig. 11 is a block diagram illustrating an electronic device for image processing, which may be a server, according to an exemplary embodiment, and an internal structure diagram thereof may be as shown in fig. 11. The electronic device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an image processing method.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the disclosed aspects and is not limiting of the electronic device to which the disclosed aspects apply, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as memory 1104 including instructions executable by processor 1120 of electronic device 1100 to perform the above-described method. Alternatively, the computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
In an exemplary embodiment, a computer program product is also provided, comprising computer instructions which, when executed by a processor, implement the above-described image processing method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (15)

1.一种图像处理方法,其特征在于,所述方法包括:1. An image processing method, characterized in that the method comprises: 将待处理图像输入到图像检测模型中进行对象检测,得到所述待处理图像中至少两个对象分别对应的对象检测信息;Inputting the image to be processed into the image detection model to perform object detection, and obtaining object detection information corresponding to at least two objects in the image to be processed; 将所述对象检测信息输入到视觉关系检测模型中进行视觉关系检测,得到两两对象间的视觉关系,所述视觉关系表征所述待处理图像中的两两对象间的交互关系;Inputting the object detection information into a visual relationship detection model to perform visual relationship detection to obtain a visual relationship between two objects, wherein the visual relationship represents an interactive relationship between two objects in the image to be processed; 将所述视觉关系和所述视觉关系对应的对象检测信息输入到场景图生成模型中进行场景图生成,得到所述待处理图像对应的目标场景图,所述目标场景图为标注有所述两两对象间的视觉关系的结构信息;Inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model to generate a scene graph, thereby obtaining a target scene graph corresponding to the image to be processed, wherein the target scene graph is structural information annotated with the visual relationship between the two objects; 所述视觉关系检测模型和场景图生成模型的生成方法包括:The method for generating the visual relationship detection model and the scene graph generation model comprises: 将标注图像作为源域,并对通用谓词的标注图像进行下采样得到目标域;所述标注图像中标注有两两对象间的参考视觉关系;所述通用谓词为参考谓词序列中从第一个参考谓词开始的预设数目个参考谓词,所述参考谓词序列是按照语义信息量的大小,从小到大对参考谓词进行排序得到的;The annotated image is used as a source domain, and the annotated image of the general predicate is downsampled to obtain a target domain; the annotated image is annotated with reference visual relations between two objects; the general predicate is a preset number of reference predicates starting from the first reference predicate in a reference predicate sequence, and the reference predicate sequence is obtained by sorting the reference predicates from small to large according to the amount of semantic information; 将在所述源域上训练得到的第一视觉关系检测模型、预设矩阵和初始场景图生成模型迁移到所述目标域上,组合所述第一视觉关系检测模型和所述预设矩阵,得到第二视觉关系检测模型;Migrating the first visual relationship detection model, the preset matrix, and the initial scene graph generation model trained on the source domain to the target domain, and combining the first visual relationship detection model and the preset matrix to obtain a second visual relationship detection model; 对所述第二视觉关系检测模型和所述初始场景图生成模型中按序排列的神经网络层中的最后一个神经网络层进行调整,得到视觉关系检测模型和场景图生成模型;Adjusting the last neural network layer in the neural network layers arranged in sequence in the second visual relationship detection model and the initial scene graph generation model to obtain a visual relationship detection model and a scene graph generation model; 所述预设矩阵的确定方法为:The method for determining the preset matrix is: 将训练对象检测信息输入到所述第一视觉关系检测模型中进行视觉关系检测,得到所述两两对象间的初始视觉关系,所述初始视觉关系表征通过第一视觉关系检测模型得到的所述标注图像中两两对象间的交互关系;Inputting the training object detection information into the first visual relationship detection model to perform visual relationship detection to obtain an initial visual relationship between the two objects, wherein the initial visual relationship represents an interactive relationship between the two objects in the annotated image obtained by the first visual relationship detection model; 将所述初始视觉关系和对应的训练对象检测信息输入到所述初始场景图生成模型中进行场景图生成,得到所述标注图像对应的初始场景图,所述初始场景图为标注有所述两两对象间的初始视觉关系的结构信息;Inputting the initial visual relationship and the corresponding training object detection information into the initial scene graph generation model to generate a scene graph, thereby obtaining 
an initial scene graph corresponding to the annotated image, wherein the initial scene graph is structural information annotated with the initial visual relationship between the two objects; 根据所述初始视觉关系中的谓词和所述参考视觉关系中的参考谓词,确定初始矩阵;determining an initial matrix according to the predicate in the initial visual relationship and the reference predicate in the reference visual relationship; 根据所述初始矩阵对应的归一化矩阵和所述初始矩阵对应的单位矩阵,得到预设矩阵。A preset matrix is obtained according to a normalized matrix corresponding to the initial matrix and a unit matrix corresponding to the initial matrix. 2.根据权利要求1所述的图像处理方法,其特征在于,所述视觉关系检测模型包括谓词识别网络,所述将所述对象检测信息输入到视觉关系检测模型中进行视觉关系检测,得到两两对象间的视觉关系包括:2. The image processing method according to claim 1, wherein the visual relationship detection model comprises a predicate recognition network, and the step of inputting the object detection information into the visual relationship detection model to perform visual relationship detection to obtain the visual relationship between two objects comprises: 将所述图像检测模型输出的对象检测信息输入到所述谓词识别网络中进行两两对象间的谓语关系对应的谓词识别,得到目标谓词,所述目标谓词表征语义调整后的谓词;Inputting the object detection information output by the image detection model into the predicate recognition network to perform predicate recognition corresponding to the predicate relationship between two objects, and obtaining a target predicate, wherein the target predicate represents the semantically adjusted predicate; 根据所述目标谓词和所述目标谓词对应的对象,得到所述视觉关系。The visual relationship is obtained according to the target predicate and the object corresponding to the target predicate. 3.根据权利要求2所述的图像处理方法,其特征在于,所述谓词识别网络包括初始相关度计算层和语义调整层,所述将所述对象检测信息输入到所述谓词识别网络中进行两两对象间的谓词识别,得到目标谓词包括:3. The image processing method according to claim 2, characterized in that the predicate recognition network comprises an initial relevance calculation layer and a semantic adjustment layer, and the step of inputting the object detection information into the predicate recognition network to perform predicate recognition between two objects to obtain the target predicate comprises: 将所述对象检测信息和预设谓词输入到所述初始相关度计算层中,对两两对象检测信息对应的谓词和每个预设谓词进行相关度计算,得到初始相关度分布信息,所述初始相关度分布信息表征语义调整前所述两两对象检测信息对应的谓词和所述每个预设谓词间的相关度;Inputting the object detection information and the preset predicates into the initial relevance calculation layer, performing relevance calculation on the predicates corresponding to the pairwise object detection information and each preset predicate, and obtaining initial relevance distribution information, wherein the initial relevance distribution information represents the relevance between the predicates corresponding to the pairwise object detection information and each preset predicate before semantic adjustment; 将所述初始相关度分布信息输入到语义调整层中,基于所述预设矩阵对所述初始相关度分布信息进行谓词语义调整,得到目标相关度分布信息,所述目标相关度分布信息表征语义调整后所述两两对象检测信息对应的谓词和所述每个预设谓词间的相关度;Inputting the initial relevance distribution information into a semantic adjustment layer, performing predicate semantic adjustment on the initial relevance distribution information based on the preset matrix, and obtaining target relevance distribution information, wherein the target relevance distribution information represents the relevance between the predicate corresponding to the pairwise object detection information and each preset predicate after the semantic adjustment; 根据所述目标相关度分布信息,确定所述目标谓词。The target predicate is determined according to the target relevance distribution information. 4.根据权利要求3所述的图像处理方法,其特征在于,所述将所述初始相关度分布信息输入到语义调整层中,基于所述预设矩阵对所述初始相关度分布信息进行谓词语义调整,得到所述目标相关度分布信息包括:4. 
The image processing method according to claim 3, characterized in that the step of inputting the initial relevance distribution information into the semantic adjustment layer, and performing predicate semantic adjustment on the initial relevance distribution information based on the preset matrix to obtain the target relevance distribution information comprises: 根据所述初始相关度分布信息,确定初始谓词;Determining an initial predicate according to the initial relevance distribution information; 在所述初始谓词为通用谓词的情况下,基于所述预设矩阵中的语义调整矩阵,对所述初始相关度分布信息进行谓词语义调整,所述通用谓词表征所述预设谓词中使用概率大于预设阈值的谓词;In the case where the initial predicate is a universal predicate, based on the semantic adjustment matrix in the preset matrix, the initial relevance distribution information is subjected to predicate semantic adjustment, wherein the universal predicate represents a predicate in the preset predicate whose use probability is greater than a preset threshold; 在所述初始谓词为非通用谓词的情况下,基于所述预设矩阵中的语义保持矩阵,将所述初始相关度分布信息确定为所述目标相关度分布信息,所述非通用谓词表征所述预设谓词中使用概率小于预设阈值的谓词。In the case where the initial predicate is a non-universal predicate, the initial relevance distribution information is determined as the target relevance distribution information based on a semantic preservation matrix in the preset matrix, and the non-universal predicate represents a predicate in the preset predicate whose usage probability is less than a preset threshold. 5.根据权利要求1所述的图像处理方法,其特征在于,所述方法还包括:5. The image processing method according to claim 1, characterized in that the method further comprises: 将标注图像输入到所述图像检测模型中进行对象检测,得到所述标注图像中每个对象对应的训练对象检测信息,所述标注图像标注有两两对象间的参考视觉关系;Inputting the annotated image into the image detection model to perform object detection, and obtaining training object detection information corresponding to each object in the annotated image, wherein the annotated image is annotated with reference visual relationships between two objects; 将所述训练对象检测信息输入到第一待训练模型中进行视觉关系检测,得到两两对象间的第一训练视觉关系,所述第一训练视觉关系表征通过所述第一待训练模型得到的所述标注图像中两两对象间的交互关系;Inputting the training object detection information into a first model to be trained to perform visual relationship detection, thereby obtaining a first training visual relationship between two objects, wherein the first training visual relationship represents an interactive relationship between two objects in the annotated image obtained by the first model to be trained; 将所述第一训练视觉关系和所述第一训练视觉关系对应的训练对象检测信息输入到第二待训练模型中进行场景图生成,得到所述标注图像对应的第一训练场景图,所述第一训练场景图为标注有所述两两对象间的第一训练视觉关系的结构信息;Inputting the first training visual relationship and the training object detection information corresponding to the first training visual relationship into the second to-be-trained model to generate a scene graph, thereby obtaining a first training scene graph corresponding to the annotated image, wherein the first training scene graph is annotated with structural information of the first training visual relationship between the two objects; 根据所述第一训练视觉关系和所述参考视觉关系,对所述第一待训练模型和所述第二待训练模型进行训练,得到第一视觉关系检测模型和初始场景图生成模型,所述第一视觉关系检测模型为不具有预设矩阵的视觉关系检测模型。According to the first training visual relationship and the reference visual relationship, the first model to be trained and the second model to be trained are trained to obtain a first visual relationship detection model and an initial scene graph generation model, wherein the first visual relationship detection model is a visual relationship detection model without a preset matrix. 6.根据权利要求5所述的图像处理方法,其特征在于,所述根据所述第一训练视觉关系和所述参考视觉关系,对所述第一待训练模型和所述第二待训练模型进行训练,得到所述第一视觉关系检测模型和所述初始场景图生成模型之后,所述方法还包括:6. 
The image processing method according to claim 5, characterized in that after the first to-be-trained model and the second to-be-trained model are trained according to the first training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model, the method further comprises: 对所述参考视觉关系中每个参考谓词对应的词频信息进行检测;detecting word frequency information corresponding to each reference predicate in the reference visual relationship; 根据预设的词频分段信息和所述词频信息,对所述参考谓词进行分类,得到每个标注图像对应的参考谓词类型;Classifying the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain a reference predicate type corresponding to each annotated image; 将所述第一视觉关系检测模型和预设矩阵进行组合,得到第二视觉关系检测模型;Combining the first visual relationship detection model and a preset matrix to obtain a second visual relationship detection model; 将所述训练对象检测信息输入到所述第二视觉关系检测模型中进行视觉关系检测,得到所述两两对象间的第二训练视觉关系,所述第二训练视觉关系表征通过所述第二视觉关系检测模型得到的所述标注图像中两两对象间的交互关系;Inputting the training object detection information into the second visual relationship detection model to perform visual relationship detection to obtain a second training visual relationship between the two objects, wherein the second training visual relationship represents an interactive relationship between the two objects in the annotated image obtained by the second visual relationship detection model; 将所述第二训练视觉关系和所述第一训练视觉关系对应的训练对象检测信息输入到所述初始场景图生成模型中进行场景图生成,得到所述标注图像对应的第二训练场景图,所述第二训练场景图为标注有所述两两对象间的第二训练视觉关系的结构信息;Inputting the training object detection information corresponding to the second training visual relationship and the first training visual relationship into the initial scene graph generation model to generate a scene graph, thereby obtaining a second training scene graph corresponding to the annotated image, wherein the second training scene graph is annotated with structural information of the second training visual relationship between the two objects; 基于所述每个标注图像对应的参考谓词类型、所述第二训练视觉关系和所述参考视觉关系,对所述第二视觉关系检测模型和所述初始场景图生成模型进行调整,得到所述视觉关系检测模型和所述场景图生成模型。Based on the reference predicate type corresponding to each annotated image, the second training visual relationship and the reference visual relationship, the second visual relationship detection model and the initial scene graph generation model are adjusted to obtain the visual relationship detection model and the scene graph generation model. 7.一种图像处理装置,其特征在于,所述装置包括:7. 
An image processing device, characterized in that the device comprises: 对象检测模块,被配置为执行将待处理图像输入到图像检测模型中进行对象检测,得到所述待处理图像中至少两个对象分别对应的对象检测信息;An object detection module is configured to input the image to be processed into an image detection model to perform object detection, and obtain object detection information corresponding to at least two objects in the image to be processed; 视觉关系检测模块,被配置为执行将所述对象检测信息输入到视觉关系检测模型中进行视觉关系检测,得到两两对象间的视觉关系,所述视觉关系表征所述待处理图像中的两两对象间的交互关系;A visual relationship detection module is configured to input the object detection information into a visual relationship detection model to perform visual relationship detection to obtain a visual relationship between two objects, wherein the visual relationship represents an interactive relationship between two objects in the image to be processed; 场景图生成模块,被配置为执行将所述视觉关系和所述视觉关系对应的对象检测信息输入到场景图生成模型中进行场景图生成,得到所述待处理图像对应的目标场景图,所述目标场景图为标注有所述两两对象间的视觉关系的结构信息;A scene graph generation module is configured to execute the scene graph generation by inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model to obtain a target scene graph corresponding to the image to be processed, wherein the target scene graph is a structural information annotated with the visual relationship between the two objects; 所述装置还包括模型生成模块,被配置为执行:The device also includes a model generation module configured to execute: 将标注图像作为源域,并对通用谓词的标注图像进行下采样得到目标域;所述标注图像中标注有两两对象间的参考视觉关系;所述通用谓词为参考谓词序列中从第一个参考谓词开始的预设数目个参考谓词,所述参考谓词序列是按照语义信息量的大小,从小到大对参考谓词进行排序得到的;The annotated image is used as a source domain, and the annotated image of the general predicate is downsampled to obtain a target domain; the annotated image is annotated with reference visual relations between two objects; the general predicate is a preset number of reference predicates starting from the first reference predicate in a reference predicate sequence, and the reference predicate sequence is obtained by sorting the reference predicates from small to large according to the amount of semantic information; 将在所述源域上训练得到的第一视觉关系检测模型、预设矩阵和初始场景图生成模型迁移到所述目标域上,组合所述第一视觉关系检测模型和所述预设矩阵,得到第二视觉关系检测模型;Migrating the first visual relationship detection model, the preset matrix and the initial scene graph generation model trained on the source domain to the target domain, combining the first visual relationship detection model and the preset matrix to obtain a second visual relationship detection model; 对所述第二视觉关系检测模型和所述初始场景图生成模型中按序排列的神经网络层中的最后一个神经网络层进行调整,得到视觉关系检测模型和场景图生成模型;Adjusting the last neural network layer in the neural network layers arranged in sequence in the second visual relationship detection model and the initial scene graph generation model to obtain a visual relationship detection model and a scene graph generation model; 所述装置还包括:The device also includes: 初始视觉关系检测模块,被配置为执行将训练对象检测信息输入到所述第一视觉关系检测模型中进行视觉关系检测,得到所述两两对象间的初始视觉关系,所述初始视觉关系表征通过第一视觉关系检测模型得到的所述标注图像中两两对象间的交互关系;an initial visual relationship detection module, configured to input the training object detection information into the first visual relationship detection model to perform visual relationship detection, and obtain the initial visual relationship between the two objects, wherein the initial visual relationship represents the interaction relationship between the two objects in the annotated image obtained by the first visual relationship detection model; 场景初始图生成模块,被配置为执行将所述初始视觉关系和对应的训练对象检测信息输入到所述初始场景图生成模型中进行场景图生成,得到所述标注图像对应的初始场景图,所述初始场景图为标注有所述两两对象间的初始视觉关系的结构信息;A scene initial 
graph generation module is configured to execute scene graph generation by inputting the initial visual relationship and the corresponding training object detection information into the initial scene graph generation model to obtain an initial scene graph corresponding to the annotated image, wherein the initial scene graph is annotated with structural information of the initial visual relationship between the two objects; 初始矩阵确定模块,被配置为执行根据所述初始视觉关系中的谓词和所述参考视觉关系中的参考谓词,确定初始矩阵;an initial matrix determination module, configured to determine an initial matrix according to the predicate in the initial visual relationship and the reference predicate in the reference visual relationship; 预设矩阵确定模块,被配置为执行根据所述初始矩阵对应的归一化矩阵和所述初始矩阵对应的单位矩阵,得到预设矩阵。The preset matrix determination module is configured to execute a normalized matrix corresponding to the initial matrix and a unit matrix corresponding to the initial matrix to obtain a preset matrix. 8.根据权利要求7所述的图像处理装置,其特征在于,所述视觉关系检测模型包括谓词识别网络,所述视觉关系检测模块包括:8. The image processing device according to claim 7, wherein the visual relationship detection model comprises a predicate recognition network, and the visual relationship detection module comprises: 谓词识别单元,被配置为执行将所述图像检测模型输出的对象检测信息输入到所述谓词识别网络中进行两两对象间的谓语关系对应的谓词识别,得到目标谓词,所述目标谓词表征语义调整后的谓词;A predicate recognition unit is configured to input the object detection information output by the image detection model into the predicate recognition network to perform predicate recognition corresponding to the predicate relationship between two objects, and obtain a target predicate, wherein the target predicate represents the semantically adjusted predicate; 视觉关系确定单元,被配置为执行根据所述目标谓词和所述目标谓词对应的对象,得到所述视觉关系。The visual relationship determining unit is configured to obtain the visual relationship according to the target predicate and the object corresponding to the target predicate. 9.根据权利要求8所述的图像处理装置,其特征在于,所述谓词识别网络包括初始相关度计算层和语义调整层,所述谓词识别单元包括:9. 
The image processing device according to claim 8, characterized in that the predicate recognition network comprises an initial relevance calculation layer and a semantic adjustment layer, and the predicate recognition unit comprises: 初始相关度计算单元,被配置为执行将所述对象检测信息和预设谓词输入到所述初始相关度计算层中,对两两对象检测信息对应的谓词和每个预设谓词进行相关度计算,得到初始相关度分布信息,所述初始相关度分布信息表征语义调整前所述两两对象检测信息对应的谓词和所述每个预设谓词间的相关度;an initial relevance calculation unit, configured to input the object detection information and the preset predicate into the initial relevance calculation layer, perform relevance calculation on the predicates corresponding to the pairwise object detection information and each preset predicate, and obtain initial relevance distribution information, wherein the initial relevance distribution information represents the relevance between the predicates corresponding to the pairwise object detection information and each preset predicate before semantic adjustment; 语义调整单元,被配置为执行将所述初始相关度分布信息输入到语义调整层中,基于所述预设矩阵对所述初始相关度分布信息进行谓词语义调整,得到目标相关度分布信息,所述目标相关度分布信息表征语义调整后所述两两对象检测信息对应的谓词和所述每个预设谓词间的相关度;A semantic adjustment unit is configured to input the initial relevance distribution information into a semantic adjustment layer, perform predicate semantic adjustment on the initial relevance distribution information based on the preset matrix, and obtain target relevance distribution information, wherein the target relevance distribution information represents the relevance between the predicate corresponding to the pairwise object detection information and each preset predicate after the semantic adjustment; 目标谓词确定单元,被配置为执行根据所述目标相关度分布信息,确定所述目标谓词。The target predicate determination unit is configured to determine the target predicate according to the target relevance distribution information. 10.根据权利要求9所述的图像处理装置,其特征在于,所述语义调整单元包括:10. The image processing device according to claim 9, wherein the semantic adjustment unit comprises: 初始谓词确定单元,被配置为执行根据所述初始相关度分布信息,确定初始谓词;an initial predicate determination unit, configured to determine an initial predicate according to the initial relevance distribution information; 第一语义调整单元,被配置为执行在所述初始谓词为通用谓词的情况下,基于所述预设矩阵中的语义调整矩阵,对所述初始相关度分布信息进行谓词语义调整,所述通用谓词表征所述预设谓词中使用概率大于预设阈值的谓词;A first semantic adjustment unit is configured to perform predicate semantic adjustment on the initial relevance distribution information based on a semantic adjustment matrix in the preset matrix when the initial predicate is a universal predicate, wherein the universal predicate represents a predicate in the preset predicate whose use probability is greater than a preset threshold; 第二语义调整单元,被配置为执行在所述初始谓词为非通用谓词的情况下,基于所述预设矩阵中的语义保持矩阵,将所述初始相关度分布信息确定为所述目标相关度分布信息,所述非通用谓词表征所述预设谓词中使用概率小于预设阈值的谓词。The second semantic adjustment unit is configured to determine the initial relevance distribution information as the target relevance distribution information based on a semantic preservation matrix in the preset matrix when the initial predicate is a non-universal predicate, wherein the non-universal predicate represents a predicate in the preset predicate whose usage probability is less than a preset threshold. 11.根据权利要求10所述的图像处理装置,其特征在于,所述装置还包括:11. 
The image processing device according to claim 10, characterized in that the device further comprises: 第一训练特征提取模块,被配置为执行将标注图像输入到所述图像检测模型中进行特征提取,得到所述标注图像中每个对象对应的训练对象检测信息,所述标注图像标注有所述两两对象间的参考视觉关系;A first training feature extraction module is configured to perform feature extraction by inputting the annotated image into the image detection model to obtain training object detection information corresponding to each object in the annotated image, wherein the annotated image is annotated with reference visual relationships between the two objects; 第一训练视觉关系检测模块,被配置为执行将所述训练对象检测信息输入到第一待训练模型中进行视觉关系检测,得到所述两两对象间的第一训练视觉关系,所述第一训练视觉关系表征通过所述第一待训练模型得到的所述标注图像中两两对象间的交互关系;A first training visual relationship detection module is configured to input the training object detection information into a first model to be trained to perform visual relationship detection to obtain a first training visual relationship between the two objects, wherein the first training visual relationship represents an interactive relationship between the two objects in the annotated image obtained by the first model to be trained; 第一训练场景图生成模块,被配置为执行将所述第一训练视觉关系和所述第一训练视觉关系对应的训练对象检测信息输入到第二待训练模型中进行场景图生成,得到所述标注图像对应的第一训练场景图,所述第一训练场景图为标注有所述两两对象间第一训练视觉关系的结构信息;A first training scene graph generation module is configured to execute the step of inputting the first training visual relationship and the training object detection information corresponding to the first training visual relationship into a second to-be-trained model to generate a scene graph, thereby obtaining a first training scene graph corresponding to the annotated image, wherein the first training scene graph is structural information annotated with the first training visual relationship between the two objects; 模型训练模块,被配置为执行根据所述第一训练视觉关系和所述参考视觉关系,对所述第一待训练模型和所述第二待训练模型进行训练,得到第一视觉关系检测模型和初始场景图生成模型。The model training module is configured to train the first model to be trained and the second model to be trained according to the first training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model. 12.根据权利要求11所述的图像处理装置,其特征在于,所述装置还包括:12. 
The image processing device according to claim 11, characterized in that the device further comprises: 词频信息检测模块,被配置为执行对所述参考视觉关系中每个参考谓词对应的词频信息进行检测;A word frequency information detection module is configured to detect word frequency information corresponding to each reference predicate in the reference visual relationship; 参考谓词分类模块,被配置为执行根据预设的词频分段信息和所述词频信息,对所述参考谓词进行分类,得到每个标注图像对应的参考谓词类型;A reference predicate classification module is configured to classify the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain a reference predicate type corresponding to each annotated image; 第二视觉关系检测模型获取模块,被配置为执行将所述第一视觉关系检测模型和预设矩阵进行组合,得到第二视觉关系检测模型;A second visual relationship detection model acquisition module is configured to combine the first visual relationship detection model with a preset matrix to obtain a second visual relationship detection model; 第二训练视觉关系获取模块,被配置为执行将所述训练对象检测信息输入到所述第二视觉关系检测模型中进行视觉关系检测,得到所述两两对象间的第二训练视觉关系,所述第二训练视觉关系表征通过所述第二视觉关系检测模型得到的所述标注图像中两两对象间的交互关系;A second training visual relationship acquisition module is configured to input the training object detection information into the second visual relationship detection model to perform visual relationship detection to obtain a second training visual relationship between the two objects, wherein the second training visual relationship represents an interactive relationship between the two objects in the annotated image obtained by the second visual relationship detection model; 第二训练场景图获取模块,被配置为执行将所述第二训练视觉关系和所述第一训练视觉关系对应的训练对象检测信息输入到所述初始场景图生成模型中进行场景图生成,得到所述标注图像对应的第二训练场景图,所述第二训练场景图为标注有所述两两对象间的第二训练视觉关系的结构信息;A second training scene graph acquisition module is configured to execute inputting the training object detection information corresponding to the second training visual relationship and the first training visual relationship into the initial scene graph generation model to generate a scene graph, so as to obtain a second training scene graph corresponding to the annotated image, wherein the second training scene graph is structural information annotated with the second training visual relationship between the two objects; 模型调整模块,被配置为执行基于所述每个标注图像对应的参考谓词类型、所述第二训练视觉关系和所述参考视觉关系,对所述第二视觉关系检测模型和所述初始场景图生成模型进行调整,得到所述视觉关系检测模型和所述场景图生成模型。The model adjustment module is configured to adjust the second visual relationship detection model and the initial scene graph generation model based on the reference predicate type corresponding to each annotated image, the second training visual relationship and the reference visual relationship to obtain the visual relationship detection model and the scene graph generation model. 13.一种电子设备,其特征在于,所述电子设备包括:13. An electronic device, characterized in that the electronic device comprises: 处理器;processor; 用于存储所述处理器可执行指令的存储器;a memory for storing instructions executable by the processor; 其中,所述处理器被配置为执行所述指令,以实现如权利要求1至6中任一项所述的图像处理方法。The processor is configured to execute the instructions to implement the image processing method according to any one of claims 1 to 6. 14.一种计算机可读存储介质,其特征在于,当所述计算机可读存储介质中的指令由电子设备的处理器执行时,使得所述电子设备能够执行如权利要求1至6中任一项所述的图像处理方法。14. A computer-readable storage medium, characterized in that when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the image processing method according to any one of claims 1 to 6. 15.一种计算机程序产品,包括计算机指令,其特征在于,所述计算机指令被处理器执行时实现权利要求1至6任一项所述的图像处理方法。15. 
13. An electronic device, characterized in that the electronic device comprises:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the image processing method according to any one of claims 1 to 6.
14. A computer-readable storage medium, characterized in that when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the image processing method according to any one of claims 1 to 6.
15. A computer program product, comprising computer instructions, characterized in that when the computer instructions are executed by a processor, the image processing method according to any one of claims 1 to 6 is implemented.
CN202110693496.5A 2021-06-22 2021-06-22 Image processing method, device, electronic device and storage medium Active CN113869099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110693496.5A CN113869099B (en) 2021-06-22 2021-06-22 Image processing method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113869099A (en) 2021-12-31
CN113869099B (en) 2024-12-24

Family

ID=78989959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110693496.5A Active CN113869099B (en) 2021-06-22 2021-06-22 Image processing method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113869099B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511779B (en) * 2022-01-20 2023-07-25 电子科技大学 Training method for scene graph generation model, scene graph generation method and device
CN114821188A (en) * 2022-05-20 2022-07-29 京东科技信息技术有限公司 Image processing method, training method of scene graph generation model and electronic equipment
CN116704233A (en) * 2023-03-27 2023-09-05 西安交通大学 Agricultural scene graph generation method and system based on double-attention perception

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126049A (en) * 2019-12-14 2020-05-08 中国科学院深圳先进技术研究院 Object relation prediction method and device, terminal equipment and readable storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPR464601A0 (en) * 2001-04-30 2001-05-24 Commonwealth Of Australia, The Shapes vector
US9904579B2 (en) * 2013-03-15 2018-02-27 Advanced Elemental Technologies, Inc. Methods and systems for purposeful computing
US9864932B2 (en) * 2015-04-14 2018-01-09 Conduent Business Services, Llc Vision-based object detector
AU2016225820B2 (en) * 2015-11-11 2021-04-15 Adobe Inc. Structured knowledge modeling, extraction and localization from images
CN107133274B (en) * 2017-04-10 2020-12-15 浙江鸿程计算机系统有限公司 Distributed information retrieval set selection method based on graph knowledge base
US10452923B2 (en) * 2017-11-28 2019-10-22 Visual Semantics, Inc. Method and apparatus for integration of detected object identifiers and semantic scene graph networks for captured visual scene behavior estimation
US20200242146A1 (en) * 2019-01-24 2020-07-30 Andrew R. Kalukin Artificial intelligence system for generating conjectures and comprehending text, audio, and visual data using natural language understanding
US11373390B2 (en) * 2019-06-21 2022-06-28 Adobe Inc. Generating scene graphs from digital images using external knowledge and image reconstruction
US20210110306A1 (en) * 2019-10-14 2021-04-15 Visa International Service Association Meta-transfer learning via contextual invariants for cross-domain recommendation
CN111626291B (en) * 2020-04-07 2023-04-25 上海交通大学 Method, system and terminal for detecting image visual relationship
CN111612070B (en) * 2020-05-13 2024-04-26 清华大学 Image description generation method and device based on scene graph
CN111931928B (en) * 2020-07-16 2022-12-27 成都井之丽科技有限公司 Scene graph generation method, device and equipment
CN112163608B (en) * 2020-09-21 2023-02-03 天津大学 Visual relation detection method based on multi-granularity semantic fusion

Similar Documents

Publication Publication Date Title
US11348249B2 (en) Training method for image semantic segmentation model and server
US11544588B2 (en) Image tagging based upon cross domain context
US10366313B2 (en) Activation layers for deep learning networks
CN113869099B (en) Image processing method, device, electronic device and storage medium
WO2019119505A1 (en) Face recognition method and device, computer device and storage medium
EP2806374B1 (en) Method and system for automatic selection of one or more image processing algorithm
US20200334867A1 (en) Face synthesis
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN112541529A (en) Expression and posture fusion bimodal teaching evaluation method, device and storage medium
CN110765882B (en) Video tag determination method, device, server and storage medium
WO2020019591A1 (en) Method and device used for generating information
US11126827B2 (en) Method and system for image identification
CN113762237B (en) Text image processing method, device, equipment and storage medium
CN113569081B (en) Image recognition method, device, equipment and storage medium
CN108399379A (en) The method, apparatus and electronic equipment at facial age for identification
CN111753618B (en) Image recognition method, device, computer equipment and computer readable storage medium
CN114913942A (en) Intelligent matching method and device for patient recruitment projects
CN112084954A (en) Video target detection method and device, electronic equipment and storage medium
CN112132026A (en) Animal identification method and device
CN115273136A (en) A model distillation method, target detection method and related equipment
US9208404B2 (en) Object detection with boosted exemplars
US9081800B2 (en) Object detection via visual search
Qin et al. Finger-vein quality assessment based on deep features from grayscale and binary images
CN114463612B (en) Image recognition method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant