
CN113869099B - Image processing method, device, electronic device and storage medium - Google Patents


Info

Publication number
CN113869099B
Authority
CN
China
Prior art keywords
predicate
visual relationship
initial
training
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110693496.5A
Other languages
Chinese (zh)
Other versions
CN113869099A (en)
Inventor
徐路
郭昱宇
高联丽
陈敏
王浩宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
University of Electronic Science and Technology of China
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China, Beijing Dajia Internet Information Technology Co Ltd filed Critical University of Electronic Science and Technology of China
Priority to CN202110693496.5A
Publication of CN113869099A
Application granted
Publication of CN113869099B


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract


The present disclosure relates to an image processing method, device, electronic device and storage medium. The method includes: inputting an image to be processed into an image detection model for object detection, obtaining object detection information corresponding to at least two objects in the image to be processed; inputting the object detection information into a visual relationship detection model for visual relationship detection, obtaining the visual relationship between every two objects, where the visual relationship is obtained after the visual relationship detection model adjusts the amount of semantic information corresponding to it; and inputting the visual relationship into a scene graph generation model to generate a target scene graph corresponding to the image to be processed. Because the method detects the visual relationship between every two objects with a dedicated visual relationship detection model, it can improve the accuracy of visual relationship detection.

Description

Image processing method, device, electronic device and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method, an image processing device, an electronic device, and a storage medium.
Background
A scene graph annotated with visual relationships can be generated through visual relationship detection; it serves as a structured representation of image content and as a bridge between computer vision and natural language. Once such a scene graph is generated, the visual relationship triples formed by subjects, predicates and objects in the image can be read off the scene graph.
In the related art, when visual relationship detection is performed on an image to be processed, the detected visual relationships are easily confused with one another, which reduces the accuracy of visual relationship detection and, in turn, the usefulness of the scene graph annotated with those relationships.
Disclosure of Invention
The disclosure provides an image processing method, an image processing device, an electronic device and a storage medium, which are used for at least solving the problems in the related art that the accuracy of visual relationship detection is low and the effectiveness of a scene graph marked with visual relationships is low. The technical scheme of the present disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided an image processing method, the method comprising:
inputting an image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed respectively;
Inputting the object detection information into a visual relation detection model to perform visual relation detection to obtain visual relation between every two objects, wherein the visual relation represents interaction relation between every two objects in the image to be processed;
Inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model to generate a scene graph, and obtaining a target scene graph corresponding to the image to be processed, wherein the target scene graph is structural information marked with the visual relationship between every two objects.
As an optional embodiment, the visual relationship detection model includes a predicate identification network, and the inputting the object detection information into the visual relationship detection model to perform visual relationship detection, and obtaining the visual relationship between the two objects includes:
Inputting the object detection information into the predicate identification network to identify predicates corresponding to predicate relations among the objects, and obtaining target predicates, wherein the target predicates represent predicates after semantic adjustment;
And obtaining the visual relationship according to the target predicate and the object corresponding to the target predicate.
As an optional embodiment, the predicate identification network includes an initial correlation calculation layer and a semantic adjustment layer, and the inputting the object detection information into the predicate identification network to perform predicate identification between two objects, and obtaining the target predicate includes:
Inputting the object detection information and preset predicates into the initial correlation calculation layer, and performing correlation calculation on predicates corresponding to the object detection information and each preset predicate to obtain initial correlation distribution information, wherein the initial correlation distribution information characterizes correlation between predicates corresponding to the object detection information and each preset predicate before semantic adjustment;
Inputting the initial relevance distribution information into a semantic adjustment layer, and performing predicate semantic adjustment on the initial relevance distribution information based on the preset matrix to obtain target relevance distribution information, wherein the target relevance distribution information characterizes predicates corresponding to the two-to-two object detection information after semantic adjustment and the relevance between each preset predicate;
And determining the target predicate according to the target correlation distribution information.
As an optional embodiment, inputting the initial relevance distribution information into a semantic adjustment layer, performing predicate semantic adjustment on the initial relevance distribution information based on the preset matrix, and obtaining the target relevance distribution information includes:
Determining an initial predicate according to the initial correlation distribution information;
under the condition that the initial predicate is a general predicate, performing predicate semantic adjustment on the initial correlation degree distribution information based on a semantic adjustment matrix in the preset matrix, wherein the general predicate characterizes predicates with use probabilities larger than a preset threshold value in the preset predicates;
and under the condition that the initial predicate is a non-universal predicate, determining the initial correlation distribution information as the target correlation distribution information based on a semantic retention matrix in the preset matrix, wherein the non-universal predicate characterizes predicates with the probability smaller than a preset threshold value in the preset predicates.
As an alternative embodiment, the method further comprises:
Inputting a labeling image into the image detection model for object detection to obtain training object detection information corresponding to each object in the labeling image, wherein the labeling image is labeled with a reference visual relationship between every two objects;
inputting the training object detection information into a first model to be trained for visual relation detection to obtain a first training visual relation between every two objects, wherein the first training visual relation characterizes the interaction relation between every two objects in the labeling image obtained through the first model to be trained;
Inputting the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained to generate a scene graph, and obtaining a first training scene graph corresponding to the labeling image, wherein the first training scene graph is structural information labeled with the first training visual relationship between every two objects;
And training the first model to be trained and the second model to be trained according to the first training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model.
As an optional embodiment, after the training the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model, the method further includes:
Detecting word frequency information corresponding to each reference predicate in the reference visual relationship;
Classifying the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain reference predicate types corresponding to each labeling image;
combining the first visual relation detection model with a preset matrix to obtain a second visual relation detection model;
Inputting the training object detection information into the second visual relation detection model to perform visual relation detection, so as to obtain a second training visual relation between every two objects, wherein the second training visual relation characterizes the interaction relation between every two objects in the annotation image under the condition that a preset matrix exists;
Inputting the training object detection information corresponding to the second training visual relationship and the first training visual relationship into the initial scene graph generation model to generate a scene graph, so as to obtain a second training scene graph corresponding to the labeling image, wherein the second training scene graph is structural information labeled with the second training visual relationship between every two objects;
And adjusting the second visual relation detection model and the initial scene graph generation model based on the reference predicate types, the second training visual relation and the reference visual relation corresponding to each marked image to obtain the visual relation detection model and the scene graph generation model.
As an alternative embodiment, the method further comprises:
Inputting the training object detection information into the first visual relation detection model to perform visual relation detection to obtain an initial visual relation between every two objects, wherein the initial visual relation characterizes the interaction relation between every two objects in the annotation image obtained through the first visual relation detection model;
Inputting the initial visual relationship and training object detection information corresponding to the first training visual relationship into the initial scene graph generation model to generate a scene graph, so as to obtain an initial scene graph corresponding to the labeling image, wherein the initial scene graph is the structural information labeled with the initial visual relationship between every two objects;
Determining an initial matrix according to predicates in the initial visual relationship and reference predicates in the reference visual relationship;
and obtaining a preset matrix according to the normalized matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
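For illustration, the sketch below shows one way the preset matrix components might be assembled, under the assumption that the initial matrix tallies initial predicates against reference predicates over the annotated images; the claim does not fix this construction, and all names are illustrative.

```python
import numpy as np

def build_preset_matrix(initial_predicates, reference_predicates, num_predicates):
    """Sketch: derive the preset matrix components from an initial matrix.

    The initial matrix is assumed here to count how often the first model
    predicts initial predicate u when the reference predicate is v; the
    patent may construct it differently.
    """
    counts = np.zeros((num_predicates, num_predicates))
    for u, v in zip(initial_predicates, reference_predicates):
        counts[u, v] += 1

    # Normalized matrix: each row becomes a distribution over reference
    # predicates (the semantic adjustment component).
    row_sums = counts.sum(axis=1, keepdims=True)
    semantic_adjustment = np.divide(
        counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

    # Identity matrix corresponding to the initial matrix: the semantic
    # retention component.
    semantic_retention = np.eye(num_predicates)
    return semantic_adjustment, semantic_retention
```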
According to a second aspect of embodiments of the present disclosure, there is provided an image processing apparatus, the apparatus including:
the object detection module is configured to input an image to be processed into the image detection model for object detection, and object detection information corresponding to at least two objects in the image to be processed is obtained;
The visual relation detection module is configured to input the object detection information into a visual relation detection model to perform visual relation detection, so as to obtain a visual relation between every two objects, wherein the visual relation characterizes an interaction relation between every two objects in the image to be processed;
The scene graph generating module is configured to input the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generating model to generate a scene graph, so as to obtain a target scene graph corresponding to the image to be processed, wherein the target scene graph is the structural information marked with the visual relationship between every two objects.
As an alternative embodiment, the visual relationship detection model includes a predicate identification network and a visual relationship determination network, the visual relationship detection module including:
The predicate identification unit is configured to perform predicate identification corresponding to a predicate relation between every two objects by inputting the object detection information into the predicate identification network to obtain a target predicate, wherein the target predicate characterizes the predicate with semantic adjustment;
And a visual relationship determining unit configured to obtain the visual relationship from the target predicate and the objects corresponding to the target predicate.
As an alternative embodiment, the predicate identification network includes an initial relevance calculation layer and a semantic adjustment layer, and the predicate identification unit includes:
The initial correlation calculation unit is configured to input the object detection information and preset predicates into the initial correlation calculation layer, and calculate correlation between predicates corresponding to the two-by-two object detection information and each preset predicate to obtain initial correlation distribution information, wherein the initial correlation distribution information characterizes correlation between predicates corresponding to the two-by-two object detection information and each preset predicate before semantic adjustment;
The semantic adjustment unit is configured to input the initial correlation distribution information into a semantic adjustment layer, perform predicate semantic adjustment on the initial correlation distribution information based on the preset matrix to obtain target correlation distribution information, and the target correlation distribution information represents predicates corresponding to the two-by-two object detection information after semantic adjustment and correlation among the preset predicates;
And a target predicate determination unit configured to perform determination of the target predicate according to the target correlation distribution information.
As an alternative embodiment, the semantic adjustment unit comprises:
an initial predicate determination unit configured to perform determination of an initial predicate according to the initial correlation distribution information;
The first semantic adjustment unit is configured to perform predicate semantic adjustment on the initial relevance distribution information based on a semantic adjustment matrix in the preset matrix when the initial predicate is a universal predicate, wherein the universal predicate characterizes predicates with use probabilities larger than a preset threshold value in the preset predicates;
and a second semantic adjustment unit configured to perform determining the initial relevance distribution information as the target relevance distribution information based on a semantic retention matrix in the preset matrix in a case where the initial predicate is a non-universal predicate that characterizes a predicate of the preset predicates that uses a probability smaller than a preset threshold.
As an alternative embodiment, the apparatus further comprises:
The first training feature extraction module is configured to execute feature extraction by inputting a labeling image into the image detection model to obtain training object detection information corresponding to each object in the labeling image, wherein the labeling image is labeled with a reference visual relationship between every two objects;
The first training visual relation detection module is configured to execute the detection of the visual relation by inputting the detection information of the training objects into a first model to be trained to obtain a first training visual relation between every two objects, and the first training visual relation characterizes the interaction relation between every two objects in the labeling image obtained through the first model to be trained;
the first training scene graph generation module is configured to input the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained to generate a scene graph, so as to obtain a first training scene graph corresponding to the labeling image, wherein the first training scene graph is structural information labeled with the first training visual relationship between every two objects;
and the model training module is configured to train the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship, so as to obtain a first visual relationship detection model and an initial scene graph generation model.
As an alternative embodiment, the apparatus further comprises:
the word frequency information detection module is configured to detect word frequency information corresponding to each reference predicate in the reference visual relationship;
A second visual relationship detection model acquisition module configured to perform a combination of the first visual relationship detection model and a preset matrix to obtain a second visual relationship detection model;
The second training visual relationship acquisition module is configured to perform visual relationship detection by inputting the training object detection information into the second visual relationship detection model to obtain a second training visual relationship between every two objects, and the second training visual relationship characterizes the interaction relationship between every two objects in the labeling image obtained through the second visual relationship detection model;
The second training scene graph acquisition module is configured to input training object detection information corresponding to the second training visual relationship and the first training visual relationship into the initial scene graph generation model to generate a scene graph, so as to obtain a second training scene graph corresponding to the labeling image, wherein the second training scene graph is structural information labeled with the second training visual relationship between the two objects;
And the model adjustment module is configured to execute adjustment on the second visual relation detection model and the initial scene graph generation model based on the reference predicate type corresponding to each marked image, the second training visual relation and the reference visual relation to obtain the visual relation detection model and the scene graph generation model.
As an alternative embodiment, the apparatus further comprises:
The initial visual relationship detection module is configured to input the object detection information into the first visual relationship detection model to perform visual relationship detection to obtain an initial visual relationship between every two objects, and the initial visual relationship characterizes the interaction relationship between every two objects in the labeling image obtained through the first visual relationship detection model;
The scene initial diagram generation module is configured to input training object detection information corresponding to the initial visual relationship and the first training visual relationship into the initial scene diagram generation model to generate a scene diagram, so as to obtain an initial scene diagram corresponding to the labeling image, wherein the initial scene diagram is structural information labeled with the initial visual relationship between every two objects;
An initial matrix determination module configured to perform determining an initial matrix from predicates in the initial visual relationship and reference predicates in the reference visual relationship;
the preset matrix determining module is configured to obtain the preset matrix from the normalized matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising:
A processor;
A memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement the image processing method described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the image processing method described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the above-described image processing method.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Inputting an image to be processed into an image detection model for object detection to obtain object detection information corresponding to at least two objects in the image to be processed, inputting the object detection information into a visual relation detection model for visual relation detection to obtain visual relation between every two objects, wherein the visual relation is obtained by adjusting semantic information corresponding to the visual relation through the visual relation detection model, and inputting the visual relation into a scene graph generation model for scene graph generation to obtain a target scene graph corresponding to the image to be processed. The method is based on the visual relation detection model, and can detect the visual relation between every two objects, so that the accuracy of visual relation detection can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a schematic view of an application scenario of an image processing method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating an image processing method according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating predicate identification in an image processing method according to an example embodiment.
FIG. 4 is a flowchart illustrating predicate semantic adjustment in an image processing method according to an example embodiment.
FIG. 5 is a flow chart illustrating model training in an image processing method according to an exemplary embodiment.
FIG. 6 is a flowchart illustrating an adjustment of a trained model in an image processing method according to an exemplary embodiment.
Fig. 7 is a schematic diagram showing transfer learning in an image processing method according to an exemplary embodiment.
Fig. 8 is a schematic diagram showing parameter fixing during model training in an image processing method according to an exemplary embodiment.
Fig. 9 is a schematic diagram illustrating a process of inputting a picture to be processed and generating a target scene graph in an image processing method according to an exemplary embodiment.
Fig. 10 is a block diagram of an image processing apparatus according to an exemplary embodiment.
Fig. 11 is a block diagram of a server-side electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Fig. 1 is a schematic diagram of an application scenario of an image processing method according to an exemplary embodiment; as shown in fig. 1, the application scenario includes a client 110 and a server 120. The client 110 collects the image to be processed, and the server 120 receives the image to be processed sent by the client 110 and inputs it into the image detection model for object detection, obtaining object detection information corresponding to at least two objects in the image to be processed. The server 120 inputs the object detection information into the visual relationship detection model to perform visual relationship detection and obtain the visual relationship between every two objects, then inputs the visual relationship and the object detection information corresponding to the visual relationship into the scene graph generation model to obtain the target scene graph corresponding to the image to be processed. The server 120 sends the target scene graph to the client 110 for display.
In the embodiment of the present disclosure, the client 110 includes a smart phone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, a smart wearable device, and other types of physical devices, and may also include software running in the physical devices, such as an application program, and the like. The operating system running on the entity device in the embodiment of the present application may include, but is not limited to, an android system, an IOS system, linux, unix, windows, etc. The client 110 includes a UI (User Interface) layer, and the client 110 provides display of a target scene graph and collection of a to-be-processed image to the outside through the UI layer, and in addition, transmits the to-be-processed image to the server 120 based on an API (Application Programming Interface, application program Interface).
In the disclosed embodiment, the server 120 may include one independently operating server, a distributed server, or a server cluster composed of a plurality of servers. The server 120 may include a network communication unit, a processor, a memory, and the like.
In embodiments of the present disclosure, the server 120 may perform visual relationship detection on the object detection information. Visual relationship detection combines images with semantics: it needs to identify not only the objects in the image and their positions, but also the relationships between the objects. A visual relationship is defined as a pair of objects connected by a predicate, is usually expressed in subject-predicate-object form, and can be used to describe the interaction relationship between every two objects. Visual relationship detection is a basis for image understanding and can be applied to object detection, image description, visual question answering, image retrieval, and the like.
Fig. 2 is a flowchart illustrating an image processing method according to an exemplary embodiment. As shown in fig. 2, the method is used in a server and includes the following steps.
S210, inputting an image to be processed into an image detection model to perform object detection, and obtaining object detection information corresponding to at least two objects in the image to be processed respectively;
As an optional embodiment, in the image detection model, each object in the image to be processed is detected according to a preset labeling frame, and a detection area of each object in the image to be processed is extracted. Extracting feature information corresponding to each object in a detection area corresponding to each object, and determining the object in the image to be processed according to the feature information corresponding to each object so as to obtain object detection information. The image detection model can be different image detection models such as a Fast R-CNN model, an R-CNN model and the like. When the object detection information is input into the visual relation detection model to detect the visual relation, the visual relation corresponding to the object detection information can be detected.
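For illustration only, here is a minimal sketch of this object detection step using a pretrained torchvision Faster R-CNN as a stand-in for the image detection model; the confidence threshold and model choice are illustrative assumptions, not the patent's implementation.

```python
import torch
import torchvision

# Pretrained Faster R-CNN stands in for the image detection model; the
# patent only names "Fast R-CNN model, R-CNN model" and the like.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # placeholder for the image to be processed

with torch.no_grad():
    det = model([image])[0]  # boxes, labels, scores for detected objects

keep = det["scores"] > 0.5   # illustrative confidence threshold
object_detection_info = [
    {"box": box.tolist(), "label": int(label)}
    for box, label in zip(det["boxes"][keep], det["labels"][keep])
]
```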
As an alternative embodiment, the combined detection information may also be acquired based on an image detection model. The image detection model can detect every two objects in the image to be processed, extracts detection areas of every two objects in the image to be processed, extracts joint characteristic information corresponding to the two objects from the detection areas corresponding to the every two objects, and determines every two objects in the image to be processed according to the joint characteristic information, so that combined detection information is obtained. The combination detection information includes two pieces of object detection information having an interactive relationship in the image to be processed. When the combination detection information is input into the visual relationship detection model to perform visual relationship detection, the visual relationship corresponding to the combination detection information may be detected.
The combination of two objects without interaction relationship can be eliminated by utilizing the combination detection information, so that the data volume to be detected is reduced, and the efficiency of visual relationship detection in the subsequent steps is improved.
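A small sketch of this pairing step follows; the `interacts` scorer is a hypothetical placeholder for whatever joint-feature test rules out non-interacting pairs.

```python
from itertools import permutations

def candidate_pairs(object_detection_info, interacts):
    """Enumerate ordered (subject, object) pairs and keep only those with
    an interaction. `interacts` is a hypothetical scorer, e.g. a classifier
    run on the joint features of the union region of the two boxes."""
    return [(subj, obj) for subj, obj in permutations(object_detection_info, 2)
            if interacts(subj, obj)]
```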
S220, inputting object detection information into a visual relation detection model to perform visual relation detection, so as to obtain visual relation between every two objects, wherein the visual relation represents interaction relation between every two objects in an image to be processed;
As an alternative embodiment, the interaction relationship between two objects may include an action relationship, a spatial relationship, a preposition relationship, and a comparison relationship. An action relationship expresses one object taking an action on another object, such as a person riding a bicycle; a spatial relationship expresses the relative position between two objects, such as a cup to the left of a book. A preposition relationship expresses an association between two objects in terms of membership, state, direction, and similar information, such as the tire of a vehicle. A comparison relationship expresses a distinction between two objects, e.g., a first apple being larger than a second apple. The visual relationship detection model can detect the visual relationship between the objects corresponding to the object detection information and semantically adjust that relationship. The visual relationship between two objects corresponds to a triplet consisting of the two objects, as subject and object, and a predicate between them. When the visual relationship detection model performs semantic adjustment on the visual relationship, the amount of information of the predicate between subject and object can be adjusted to obtain a predicate with richer meaning.
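For illustration, such a triple can be represented as a simple typed record; this is a sketch, not the patent's data structure.

```python
from typing import NamedTuple

class VisualRelationship(NamedTuple):
    subject: str    # e.g. "person"
    predicate: str  # e.g. "riding" (action), "left of" (spatial), ...
    object: str     # e.g. "bicycle"

rel = VisualRelationship("person", "riding", "bicycle")  # an action relationship
```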
As an optional embodiment, the visual relationship detection model includes a predicate identification network, the inputting the object detection information into the visual relationship detection model to perform visual relationship detection, and obtaining the visual relationship between every two objects includes:
Inputting the object detection information into a predicate identification network to identify predicates corresponding to predicate relations among the objects, thereby obtaining target predicates;
And obtaining a visual relationship according to the target predicate and the object corresponding to the target predicate.
As an optional embodiment, the visual relationship detection model includes a predicate identification network, in which predicates corresponding to the predicate relationship between two objects are identified, the identified target predicate is a semantically adjusted predicate, and the two objects and the corresponding target predicate form a visual relationship with a subject, a predicate and an object.
Predicate relationships exist between the objects corresponding to the object detection information; the predicates corresponding to these predicate relationships are identified in the predicate identification network, and semantic adjustment can be performed to obtain the target predicates. Each target predicate corresponds to two objects, one of which can be taken as the subject and the other as the object, so that a visual relationship with a subject, a predicate and an object can be determined. For example, when the object "person" is taken as the subject, "has" as the target predicate, and the object "hand" as the object, the resulting visual relationship is (person, has, hand).
The target predicates between every two objects are determined through the predicate identification network, and the predicate identification network comprises a semantic adjustment layer, so that the accuracy of predicate identification can be improved.
As an optional embodiment, referring to fig. 3, the predicate identification network includes an initial correlation calculation layer and a semantic adjustment layer, and inputting object detection information into the predicate identification network to perform predicate identification between two objects, where obtaining a target predicate includes:
s310, inputting object detection information and preset predicates into an initial correlation calculation layer, and performing correlation calculation on predicates corresponding to the object detection information and each preset predicate to obtain initial correlation distribution information;
S320, inputting the initial correlation degree distribution information into a semantic adjustment layer, and performing predicate semantic adjustment on the initial correlation degree distribution information based on a preset matrix to obtain target correlation degree distribution information;
S330, determining target predicates according to the target correlation distribution information.
As an optional embodiment, the predicate identification network includes an initial relevance calculating layer and a semantic adjustment layer, the initial relevance calculating layer may be used to calculate initial relevance distribution information, and the semantic adjustment layer may be used to perform semantic adjustment on the initial relevance distribution information to obtain target relevance distribution information. The preset predicates comprise a plurality of predicates, the initial correlation distribution information is the probability distribution of a certain preset predicate corresponding to the detection information of the two objects before semantic adjustment, and the correlation between the predicates corresponding to the detection information of the two objects before semantic adjustment and each preset predicate is represented.
When the initial correlation distribution information is subjected to predicate semantic adjustment based on a preset matrix to obtain target correlation distribution information, predicates corresponding to the initial correlation distribution information before semantic adjustment and predicates corresponding to the target correlation distribution information after semantic adjustment have semantic correlation, for example, the predicates corresponding to the initial correlation distribution information are "on top", the predicates corresponding to the target correlation distribution information are "riding", wherein "riding" also has the meaning of "on top", and semantic correlation exists between the two predicates, so that the semantic adjustment can be determined to be correct. If the predicate corresponding to the initial relevance distribution information is "above", the predicate corresponding to the target relevance distribution information cannot be adjusted to "below", because the semantics of "above" and "below" are exactly opposite, and there is no semantic relevance between the two predicates.
The target correlation distribution information characterizes correlation between predicates corresponding to the post-semantic-adjustment pairwise object detection information and each preset predicate. According to the magnitude of each correlation in the target correlation distribution information, a correlation maximum value in the target correlation distribution information can be determined, and a preset predicate corresponding to the correlation maximum value is determined as a target predicate.
As an alternative embodiment, the calculation performed when identifying the predicate corresponding to the predicate relationship between a pair of object detection information can be expressed as:

$$p_i^{g} = P\left(y_i^{g} = r \mid (o_j, o_k); \theta\right), \quad r \in R$$

wherein $p_i^{g}$ is the initial correlation distribution information, i.e., the probability distribution over predicates before semantic adjustment. $R$ represents the preset predicates and $K$ represents the number of preset predicates; the preset predicates comprise different kinds of predicates. $y_i$ denotes a predicate; a $g$ superscript denotes an output before semantic adjustment and an $s$ superscript denotes an output after semantic adjustment. $(o_j, o_k)$ represents the object detection information pair, and $\theta$ represents the model parameters of the visual relationship detection model. From this probability distribution before semantic adjustment, the correlation between the predicate corresponding to the object detection information and each preset predicate is determined.

The probability distribution over predicates after semantic adjustment, i.e., the target correlation distribution information, is obtained by adjusting the distribution before semantic adjustment with the preset matrix:

$$p_i^{s} = P\left(y_i^{s} = r \mid (o_j, o_k); \theta\right) = T \, p_i^{g}$$

where $T$ denotes the preset matrix used by the semantic adjustment layer to perform semantic adjustment on the initial correlation distribution information. The correlation between the semantically adjusted predicate and each preset predicate is determined from $p_i^{s}$, and the preset predicate corresponding to the maximum correlation is taken as the predicate for the object detection information pair.

Each entry of the preset matrix can be read as a transition probability $P\left(y^{s} \mid y^{g}\right)$ that measures the confidence of converting a predicate without rich semantic information into a predicate with rich semantic information.
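For illustration only, a minimal numeric sketch of this adjustment step, assuming the preset matrix is given as a dense K x K array; all names and values below are illustrative, not taken from the patent.

```python
import numpy as np

def adjust(p_g, T):
    """Apply the semantic adjustment layer: map the initial correlation
    distribution p_g over the K preset predicates to the target correlation
    distribution p_s through the preset matrix T, then take the argmax."""
    p_s = T @ p_g
    return p_s, int(np.argmax(p_s))  # target distribution, target predicate id

K = 4
p_g = np.array([0.6, 0.2, 0.1, 0.1])          # illustrative initial distribution
p_s, target_predicate = adjust(p_g, np.eye(K))  # identity = pure retention
```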
The initial correlation degree distribution information is subjected to predicate semantic adjustment through a semantic adjustment layer, so that target predicates with rich semantic information quantity can be obtained, the accuracy of predicate identification is improved, and the effectiveness of target scene graphs generated in subsequent steps can be improved.
As an optional embodiment, please refer to fig. 4, inputting the initial relevance distribution information into the semantic adjustment layer, performing predicate semantic adjustment on the initial relevance distribution information based on a preset matrix, and obtaining the target relevance distribution information includes:
s410, determining an initial predicate according to the initial correlation distribution information;
s420, performing predicate semantic adjustment on initial correlation distribution information based on a semantic adjustment matrix in a preset matrix under the condition that the initial predicate is a general predicate, wherein the general predicate characterizes predicates with probability larger than a preset threshold value in the preset predicate;
S430, determining the initial correlation distribution information as target correlation distribution information based on a semantic retention matrix in a preset matrix under the condition that the initial predicate is a non-universal predicate, wherein the non-universal predicate characterizes predicates with use probabilities smaller than a preset threshold value in the preset predicates.
As an alternative embodiment, based on Shannon's theory of semantic information, the amount of semantic information contained in a predicate can be determined from its probability of occurrence: predicates with a small occurrence probability contain more semantic information. The amount of semantic information determines whether a predicate is a general predicate or a non-general predicate. A general predicate contains less semantic information, and its use probability among the preset predicates is larger than a preset threshold; a non-general predicate contains more semantic information, and its use probability among the preset predicates is smaller than the preset threshold. For example, in "a person on a bicycle", "on" describes the relationship between the person and the bicycle, but "riding" in "a person riding a bicycle" also expresses the person's action on the bicycle, so "riding" carries more semantic information than "on". "On" indicates only the relative position of two objects, whereas "riding" indicates an action: wherever "riding" applies ("a person riding a horse"), "on" also applies, but wherever "on" applies ("a book on a table"), "riding" does not necessarily apply. The probability of occurrence of "riding" is therefore significantly lower than that of "on", and predicates with lower occurrence probability carry more semantic information.
When the initial correlation distribution information is input into a semantic adjustment matrix for predicate semantic adjustment, if the initial predicate corresponding to the initial correlation distribution information is a general predicate, the condition that the amount of semantic information contained in the initial predicate is small is indicated, and the correlation distribution in the initial correlation distribution information can be adjusted through the semantic adjustment matrix to obtain target correlation distribution information, so that the predicates corresponding to the two-object detection information are associated with preset predicates containing a large amount of semantic information to obtain the target predicate.
When the initial predicate corresponding to the initial correlation distribution information is a non-general predicate, the initial predicate already contains a large amount of semantic information; the semantic retention matrix leaves the correlation distribution in the initial correlation distribution information unadjusted, so that the initial predicate is retained and the predicate corresponding to the object detection information is associated with a preset predicate containing a large amount of semantic information to obtain the target predicate.
Based on a semantic adjustment matrix in the preset matrix, when the predicates corresponding to the predicate relation between every two object detection information are identified as general predicates, semantic adjustment can be performed on the initial relevance distribution information, and the identification result with the low semantic information content is converted into the identification result with the high semantic information content. Based on the semantic retention matrix in the preset matrix, when the predicates corresponding to the predicate relation between every two object detection information are recognized as non-universal predicates, semantic adjustment on the initial relevance distribution information can be reduced, and therefore recognition results with rich semantic information quantity can be retained.
When the semantic adjustment is carried out, the recognition result without rich semantic information is adjusted, and the recognition result with rich semantic information is maintained, so that the false adjustment of predicates with rich semantic information can be avoided, and the effectiveness of the semantic adjustment is improved.
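A hedged sketch of how the choice between the two matrices might look in code, assuming the set of general predicate indices and the semantic adjustment matrix are produced at training time; the names are illustrative.

```python
import numpy as np

def semantic_adjustment(p_g, adjustment_matrix, general_predicate_ids):
    """Adjust only when the initial predicate is a general predicate;
    otherwise apply the semantic retention (identity) matrix."""
    initial_predicate = int(np.argmax(p_g))
    if initial_predicate in general_predicate_ids:
        return adjustment_matrix @ p_g   # redistribute toward richer predicates
    return np.eye(len(p_g)) @ p_g        # keep the distribution unchanged
```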
S230, inputting the visual relationship and object detection information corresponding to the visual relationship into a scene graph generation model to generate a scene graph, and obtaining a target scene graph corresponding to the image to be processed, wherein the target scene graph is structural information marked with the visual relationship between every two objects.
As an alternative embodiment, the visual relationship and the object detection information corresponding to the visual relationship are input into the scene graph generation model, and a target scene graph can be obtained from the visual relationships between the objects corresponding to the object detection information. The target scene graph is structural information formed by points and edges: the points in the target scene graph represent objects, and the edges represent the visual relationships between every two objects. Visual relationships with subject, predicate and object may be displayed in the target scene graph. For example, the visual relationship (person, has, hand) is displayed in the target scene graph, where "person" is the subject, "has" is the predicate, and "hand" is the object.
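For illustration, a small sketch of a target scene graph as points and edges, using networkx purely as a convenient graph container; this is an assumption, not something the patent prescribes.

```python
import networkx as nx

# Target scene graph: nodes are detected objects, labelled directed edges
# are the visual relationships between pairs of objects.
scene_graph = nx.DiGraph()
for subj, pred, obj in [("person", "has", "hand"),
                        ("person", "riding", "bicycle")]:
    scene_graph.add_edge(subj, obj, predicate=pred)

print(list(scene_graph.edges(data=True)))
# [('person', 'hand', {'predicate': 'has'}),
#  ('person', 'bicycle', {'predicate': 'riding'})]
```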
As an alternative embodiment, the method further includes a model training method, please refer to fig. 5, the model training method includes:
s510, inputting the marked image into an image detection model to perform object detection, and obtaining training object detection information corresponding to each object in the marked image;
s520, inputting training object detection information into a first model to be trained to detect visual relations, and obtaining a first training visual relation between every two objects;
s530, inputting a first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained to generate a scene graph, and obtaining a first training scene graph corresponding to the labeling image;
S540, training the first model to be trained and the second model to be trained according to the first training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model.
As an optional embodiment, during model training the image detection model has been trained in advance, so the training object detection information required for training can be extracted through the image detection model; the training object detection information is the object detection information corresponding to the objects in the labeling image. A fully supervised training mode is adopted, in which labeling images are acquired that are labeled with the reference visual relationship between every two objects. The labeling image is input into the image detection model, each object in the labeling image is detected according to a preset labeling frame, and the detection area of each object in the labeling image is extracted. Feature information corresponding to each object is extracted within its detection area, and the objects in the labeling image are determined according to this feature information to obtain the training object detection information. At this time, the first model to be trained does not include the preset matrix, i.e., the first model to be trained does not yet have the semantic adjustment function. After the preset matrix to be trained is trained based on the training visual relationship and the reference visual relationship, the trained preset matrix is added into the first visual relationship detection model.
The training object detection information is input into a first to-be-trained model for visual relation detection, predicates corresponding to predicate relations between every two training object detection information can be identified based on a first to-be-trained network in the first to-be-trained model to obtain a first training target predicate, the training target predicate and objects corresponding to the first training target predicate can be combined based on a second to-be-trained network in the first to-be-trained model to obtain a first training visual relation with a main subject, a predicate and an object, and the first training visual relation characterizes interaction relations between every two objects in the marked image obtained through the first to-be-trained model.
Inputting the first training visual relationship and training object detection information corresponding to the first training visual relationship into a second model to be trained to generate a scene graph, and marking the first training visual relationship among objects in a marked image in the second model to be trained to obtain a first training scene graph corresponding to the marked image, wherein the first training scene graph is the structural information marked with the first training visual relationship among every two objects.
First loss data between the first training visual relationship and the reference visual relationship is calculated; the first loss data can be a loss function between the two. The first model to be trained and the second model to be trained are trained according to the first loss data to obtain the first visual relationship detection model and the initial scene graph generation model, where the first visual relationship detection model is a visual relationship detection model without the preset matrix.
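A minimal sketch of this training signal, assuming cross-entropy as the loss function; the patent only says the first loss data "can be a loss function" between the two relationships.

```python
import torch
import torch.nn.functional as F

def first_loss_data(predicate_logits, reference_predicate_ids):
    """Cross-entropy between the first training visual relationship
    (predicate logits per object pair) and the reference visual
    relationship (ground-truth predicate ids); the choice of
    cross-entropy is an assumption."""
    return F.cross_entropy(predicate_logits, reference_predicate_ids)

logits = torch.randn(8, 50, requires_grad=True)  # 8 object pairs, 50 predicates
targets = torch.randint(0, 50, (8,))             # reference predicate ids
loss = first_loss_data(logits, targets)
loss.backward()  # gradients drive both models to be trained
```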
When the model is trained, only the visual relation detection model and the scene graph generation model are required to be trained, most training steps are completed based on the source domain, and only fine adjustment is required to be performed on the target domain, so that the training cost is reduced.
As an alternative embodiment, referring to fig. 6, after training the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model, the method further includes:
S610, detecting word frequency information corresponding to each reference predicate in the reference visual relationship;
S620, classifying the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain a reference predicate type corresponding to each annotated image;
S630, combining the first visual relationship detection model with the preset matrix to obtain a second visual relationship detection model;
S640, inputting the training object detection information into the second visual relationship detection model for visual relationship detection to obtain a second training visual relationship between every two objects;
S650, inputting the second training visual relationship and the training object detection information corresponding to the second training visual relationship into the initial scene graph generation model for scene graph generation to obtain a second training scene graph corresponding to the annotated image;
S660, adjusting the second visual relationship detection model and the initial scene graph generation model based on the reference predicate type corresponding to each annotated image, the second training visual relationship and the reference visual relationship to obtain the visual relationship detection model and the scene graph generation model.
As an alternative embodiment, fig. 7 is a schematic diagram of the transfer learning. Because predicates with a small occurrence probability contain more semantic information, the word frequency information corresponding to each reference predicate in the reference visual relationship is detected; the word frequency information characterizes the occurrence probability of each reference predicate among all the reference predicates. The semantic information amount contained in each reference predicate is then calculated from the word frequency information, estimated by the following formula:
$I(y_i) = -\log_b\left[\Pr(y_i)\right]$
where $y_i$ denotes a predicate, $\Pr(y_i)$ denotes its word frequency information, i.e., the occurrence probability of the predicate, and $I(y_i)$ denotes the semantic information amount of the predicate. The smaller the word frequency information, the larger the semantic information amount contained in the predicate, and the larger the word frequency information, the smaller the semantic information amount. The reference predicates are sorted from small to large by semantic information amount to obtain a reference predicate sequence. A preset number of reference predicates counted from the first reference predicate are taken as general predicates, the remaining reference predicates are taken as non-general predicates, and the reference predicates are thus divided into two classes. General predicates are predicates whose occurrence probability is greater than a preset probability, and non-general predicates are predicates whose occurrence probability is smaller than the preset probability, where the preset probability corresponds to the occurrence probability of the last reference predicate among the preset number of reference predicates. For example, the preset number may be 15, i.e., the first fifteen reference predicates of the sequence are used as general predicates and the reference predicates after them as non-general predicates.
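A minimal sketch of this split, assuming base-2 logarithms (the patent leaves the base $b$ unspecified) and hypothetical word counts:

```python
import math

def split_predicates(freqs, preset_number=15, base=2):
    """Sort reference predicates by semantic information amount
    I(y) = -log_b Pr(y), from small to large, and split them into
    general and non-general predicates."""
    total = sum(freqs.values())
    info = {p: -math.log(n / total, base) for p, n in freqs.items()}
    # Low information amount == high word frequency, so the first
    # `preset_number` predicates of the sequence are the general ones.
    ordered = sorted(info, key=info.get)
    return ordered[:preset_number], ordered[preset_number:]

# Hypothetical word counts per reference predicate.
general, non_general = split_predicates(
    {"on": 900, "has": 700, "riding": 12, "eating": 5}, preset_number=2)
# general -> ['on', 'has'];  non_general -> ['riding', 'eating']
```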
The annotated images are taken as the source domain, and at the same time the annotated images of general predicates are downsampled to obtain the target domain. The first visual relationship detection model, the preset matrix and the initial scene graph generation model trained on the source domain are migrated to the target domain; the first visual relationship detection model and the preset matrix are combined to obtain the second visual relationship detection model; and the last neural network layer among the sequentially arranged neural network layers of the second visual relationship detection model and the initial scene graph generation model is adjusted to obtain the visual relationship detection model and the scene graph generation model, where that last layer is the classification layer. When the second visual relationship detection model and the initial scene graph generation model are adjusted on the target domain, sample images can be acquired from the annotated images bearing the reference predicate types for the adjustment, without using all of the annotated images.
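Adjusting only the last (classification) layer corresponds to freezing all earlier parameters. A minimal sketch, assuming for illustration that a model is expressed as a plain PyTorch `nn.Sequential` stack:

```python
import torch.nn as nn

def finetune_last_layer(model: nn.Sequential):
    """Freeze everything except the final (classification) layer,
    matching the adjustment of only the last neural network layer
    when the models are migrated to the target domain."""
    for param in model.parameters():
        param.requires_grad = False
    last = list(model.children())[-1]  # final layer = classifier
    for param in last.parameters():
        param.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# Only these parameters would be handed to the optimizer on the target domain.
trainable = finetune_last_layer(
    nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 50)))
```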
When the first visual relationship detection model and the initial scene graph generation model are adjusted, the preset matrix can be obtained; the preset matrix is a matrix trained based on the first visual relationship detection model and the initial scene graph generation model. Combining the preset matrix with the first visual relationship detection model yields the second visual relationship detection model, i.e., the visual relationship detection model equipped with the preset matrix.
After the annotated image is input into the image detection model for object detection to obtain the training object detection information, the training object detection information is input into the second visual relationship detection model for visual relationship detection, yielding the second training visual relationship between every two objects; the second training visual relationship characterizes the interaction relationship between every two objects in the annotated image as obtained through the second visual relationship detection model. The preset matrix can perform semantic adjustment on the initial correlation distribution information corresponding to the training object detection information to obtain target correlation distribution information, from which a second target training predicate can be determined, yielding the second training visual relationship.
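The text describes multiplying the initial correlation distribution information by the matrix, with adjustment applied for general predicates and retention for non-general ones; the exact tensor layout is not specified, so this sketch assumes row vectors and a row-stochastic preset matrix:

```python
import numpy as np

def adjust_distribution(initial_dist, preset_matrix, general_ids):
    """Predicate semantic adjustment as described above: if the initially
    most likely predicate is a general one, redistribute the initial
    correlation distribution through the preset matrix; otherwise keep
    it unchanged (semantic retention)."""
    if int(np.argmax(initial_dist)) in general_ids:
        return initial_dist @ preset_matrix  # rows of C* sum to 1
    return initial_dist

initial = np.array([0.6, 0.3, 0.1])  # toy 3-predicate distribution
C_star = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]])
target = adjust_distribution(initial, C_star, general_ids={0, 1})
```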
The second training visual relationship and its corresponding training object detection information are input into the initial scene graph generation model for scene graph generation, yielding the second training scene graph corresponding to the annotated image; the second training scene graph is the structural information annotated with the second training visual relationships between every two objects.
The second training visual relationship is a detection result obtained based on the information amount corresponding to the reference predicate type. Second loss data between the second training visual relationship and the reference visual relationship is calculated based on the reference predicate type corresponding to each annotated image; the second loss data can be a loss function between the second training visual relationship and the reference visual relationship. The second visual relationship detection model and the initial scene graph generation model are then adjusted according to the second loss data to obtain the visual relationship detection model and the scene graph generation model.
The annotated images are classified according to the semantic information amount contained in their predicates, and the model adjustment is performed based on the reference predicate types, the training visual relationship and the reference visual relationship, so the models gain the ability to distinguish general predicates from non-general predicates, improving the accuracy of predicate recognition. Moreover, because the second visual relationship detection model and the initial scene graph generation model are adjusted rather than retrained, overfitting can be avoided.
As an alternative embodiment, referring to fig. 8, training the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model includes:
S810, inputting the training object detection information into the first visual relationship detection model for visual relationship detection to obtain an initial visual relationship between every two objects;
S820, inputting the initial visual relationship and the training object detection information corresponding to the initial visual relationship into the initial scene graph generation model for scene graph generation to obtain an initial scene graph corresponding to the annotated image, the initial scene graph being the structural information annotated with the initial visual relationships between every two objects;
S830, determining an initial matrix according to the predicates in the initial visual relationship and the reference predicates in the reference visual relationship;
S840, obtaining the preset matrix according to the normalized matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
As an optional embodiment, the training object detection information is input into the first visual relationship detection model for visual relationship detection, yielding initial correlation distribution information; an initial predicate can be determined from the initial correlation distribution information, and from the initial predicates the initial visual relationship between every two objects is obtained. The initial visual relationship characterizes the interaction relationship between every two objects in the annotated image as obtained through the first visual relationship detection model.
The initial visual relationship and its corresponding training object detection information are input into the initial scene graph generation model for scene graph generation, yielding the initial scene graph corresponding to the annotated image, i.e., the structural information annotated with the initial visual relationships between every two objects. By comparing the predicates in the initial visual relationship with the reference predicates in the reference visual relationship, the correctly classified predicates and the incorrectly classified predicates can be determined.
As an alternative embodiment, the preset matrix may be expressed as:
$C^* \in \mathbb{R}^{K \times K}$
where $C^*$ denotes the preset matrix, $\mathbb{R}^{K \times K}$ the space of real $K \times K$ matrices, and $K$ the number of preset predicates, the preset predicates being the different predicate classes. In the process of acquiring the preset matrix, a confusion matrix for predicate recognition is first initialized to obtain the initial matrix, which can be expressed as:
$C \in \mathbb{R}^{K \times K}$
Each element of the initial matrix is denoted $C_{j,k}$ and records the number of samples labeled as a class-$j$ predicate but predicted as a class-$k$ predicate. Here $j$ may equal $k$; when $j = k$, the annotation result and the recognition result for the predicate agree.
Since an element of the semantic adjustment part of the preset matrix represents predicates labeled as the $j$-th class but predicted as the $k$-th class, the elements can be determined from the numbers of correctly and incorrectly classified predicates. For example, suppose the reference predicates contain 100 instances of predicate A, which has class index 3, but among the initial visual relationships predicted for those 100 instances only 50 are predicate A, while 30 are predicate B (class index 4) and 20 are predicate C (class index 5). The number of correctly classified predicates is then 50 and the numbers of incorrectly classified predicates are 30 and 20, recorded in the matrix as $C_{3,3} = 50$, $C_{3,4} = 30$ and $C_{3,5} = 20$.
The preset matrix is obtained from the normalized matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix: the normalized matrix is the semantic adjustment matrix, and the identity matrix is the semantic retention matrix. Normalizing each row of the initial matrix yields the normalized semantic adjustment matrix $C'$, which can be calculated as:

$C'_{j,k} = \dfrac{C_{j,k}}{\sum_{k'=1}^{K} C_{j,k'}}$

The semantic adjustment matrix $C'$ represents, to a certain extent, the semantic correlation among predicates, but its diagonal elements are small for predicates rich in semantic information, so directly multiplying the initial correlation distribution information by $C'$ would reduce the probability that semantically rich predicates are recognized. Therefore a semantic retention matrix is added to $C'$; based on the semantic retention matrix the recognition results of semantically rich predicates are retained, and the preset matrix $C^*$ is obtained. The specific formula is:
$C^* = (C' + I_K) \times 0.5$
where $I_K \in \mathbb{R}^{K \times K}$ is the identity matrix; multiplying the whole expression by 0.5 ensures that the elements of each row of the preset matrix sum to 1.
After the preset matrix is obtained, the preset matrix is added into the initial scene graph generation model, so that the initial scene graph generation model has a semantic adjustment function.
In the training process, the trained preset matrix is used and its parameters are fixed, which avoids semantic drift and improves the accuracy of the semantic adjustment.
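A minimal NumPy sketch of steps S830–S840 under the formulas above; the function name and toy data are illustrative, and rows with no labeled samples are kept at zero to avoid division by zero:

```python
import numpy as np

def preset_matrix(labels, preds, K):
    """Build the confusion matrix C (labeled j, predicted k), row-normalize
    it into the semantic adjustment matrix C', and average it with the
    identity (semantic retention) matrix: C* = (C' + I_K) * 0.5."""
    C = np.zeros((K, K))
    for j, k in zip(labels, preds):
        C[j, k] += 1
    row_sums = C.sum(axis=1, keepdims=True)
    # Row normalization; empty rows stay zero instead of dividing by zero.
    C_prime = np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)
    return 0.5 * (C_prime + np.eye(K))  # populated rows still sum to 1

# Toy run with the worked example above: class 3 labeled 100 times,
# predicted as classes 3/4/5 in a 50/30/20 split (other classes omitted).
labels = [3] * 100
preds = [3] * 50 + [4] * 30 + [5] * 20
C_star = preset_matrix(labels, preds, K=6)
# C_star[3] -> [0, 0, 0, 0.75, 0.15, 0.10]
```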
As an alternative embodiment, fig. 9 is a schematic diagram of the process of inputting a picture to be processed and generating a target scene graph. The image to be processed is input into the image detection model for object detection; the annotation box and feature information of each object are detected, the positions of the four objects "racket", "hand", "person" and "short sleeve" are determined, and the object detection information of these four objects is obtained. The object detection information is then input into the visual relationship detection model to obtain the visual relationship between every two objects. In the visual relationship detection model, the predicate corresponding to the two objects "racket" and "hand" is identified as the target predicate "on top", so the visual relationship (racket, on top, hand), composed of subject, predicate and object, can be determined. The predicate corresponding to the two objects "person" and "hand" is identified as the target predicate "has", giving the visual relationship (person, has, hand). The predicate corresponding to "short sleeve" and "person" is identified as the target predicate "on", giving the visual relationship (short sleeve, on, person). The object detection information of the four objects is input into the scene graph generation model, and the object detection information of each object is annotated with the visual relationships between every two objects to obtain the target scene graph. The target scene graph can be applied to image retrieval, visual question answering and other tasks. For example, in image retrieval, if the input retrieval information is an image of a person wearing a short sleeve, target scene graphs containing the visual relationship (short sleeve, on, person), generated by the visual relationship detection model and the scene graph generation model, can be searched to obtain the retrieval result. Or, when a user inputs the question "what is worn on the person", the answer "short sleeve" is obtained by identifying the visual relationship (short sleeve, on, person) in the target scene graph generated by the visual relationship detection model and the scene graph generation model, completing the visual question answering.
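The triples from the fig. 9 walk-through can be held in any graph structure; a minimal sketch of the retrieval/question-answering use described above, with the scene graph stored as a plain list of (subject, predicate, object) tuples:

```python
# Triples from the fig. 9 walk-through, each (subject, predicate, object).
scene_graph = [
    ("racket", "on top", "hand"),
    ("person", "has", "hand"),
    ("short sleeve", "on", "person"),
]

def answer(predicate, obj):
    """Tiny visual question answering over the target scene graph:
    'what is worn on the person' -> find the subject of (?, on, person)."""
    return [s for s, p, o in scene_graph if p == predicate and o == obj]

print(answer("on", "person"))  # -> ['short sleeve']
```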
The embodiment of the disclosure provides an image processing method: an image to be processed is input into an image detection model for object detection to obtain object detection information, and the object detection information is input into a visual relationship detection model for visual relationship detection. During visual relationship detection, semantic adjustment is performed on the predicates corresponding to the predicate relationships between every two detected objects, so the target predicates contain rich semantic information and the accuracy of predicate recognition is improved. In the subsequent steps, the visual relationships between every two objects are generated from the target predicates and their corresponding objects, and the visual relationships are input into a scene graph generation model to generate the target scene graph. The accuracy of the visual relationships annotated in the target scene graph is thereby improved, which improves the effectiveness of the target scene graph.
Fig. 10 is a block diagram of an image processing apparatus according to an exemplary embodiment. Referring to fig. 10, the apparatus includes:
The object detection module 1010 is configured to perform object detection by inputting an image to be processed into the image detection model, so as to obtain object detection information corresponding to at least two objects in the image to be processed respectively;
The visual relation detection module 1020 is configured to perform visual relation detection by inputting object detection information into the visual relation detection model to obtain a visual relation between every two objects, wherein the visual relation represents an interaction relation between every two objects in the image to be processed;
The scene graph generating module 1030 is configured to perform scene graph generation by inputting the visual relationship and the object detection information corresponding to the visual relationship into the scene graph generation model, so as to obtain a target scene graph corresponding to the image to be processed, where the target scene graph is the structural information annotated with the visual relationships between every two objects.
As an alternative embodiment, the visual relationship detection model includes a predicate identification network and the visual relationship detection module 1020 includes:
The predicate identification unit is configured to perform predicate identification corresponding to the predicate relation between every two objects by inputting object detection information into the predicate identification network, so as to obtain a target predicate, wherein the target predicate represents the predicate after semantic adjustment;
And a visual relationship determination unit configured to obtain the visual relationship according to the target predicate and the object corresponding to the target predicate.
As an alternative embodiment, the predicate identification network includes an initial correlation calculation layer and a semantic adjustment layer, and the predicate identification unit includes:
The initial correlation calculation unit is configured to input the object detection information and the preset predicates into the initial correlation calculation layer and perform correlation calculation between the predicates corresponding to the pairwise object detection information and each preset predicate, obtaining initial correlation distribution information, which characterizes the correlation between the predicates corresponding to the pairwise object detection information before semantic adjustment and each preset predicate;
The semantic adjustment unit is configured to input the initial correlation distribution information into the semantic adjustment layer and perform predicate semantic adjustment on it based on the preset matrix, obtaining target correlation distribution information, which characterizes the correlation between the predicates corresponding to the pairwise object detection information after semantic adjustment and each preset predicate;
And a target predicate determination unit configured to perform determination of a target predicate according to the target correlation distribution information.
As an alternative embodiment, the semantic adjustment unit comprises:
an initial predicate determination unit configured to perform determination of an initial predicate according to the initial correlation distribution information;
The first semantic adjustment unit is configured to perform predicate semantic adjustment on the initial correlation distribution information based on the semantic adjustment matrix in the preset matrix in the case where the initial predicate is a general predicate, a general predicate characterizing a preset predicate whose usage probability is greater than a preset threshold;
And a second semantic adjustment unit configured to determine the initial correlation distribution information as the target correlation distribution information based on the semantic retention matrix in the preset matrix in the case where the initial predicate is a non-general predicate, a non-general predicate characterizing a preset predicate whose usage probability is smaller than the preset threshold.
As an alternative embodiment, the apparatus further comprises:
The first training feature extraction module is configured to input the annotated image into the image detection model for feature extraction to obtain training object detection information corresponding to each object in the annotated image;
The first training visual relationship detection module is configured to input the training object detection information into the first model to be trained for visual relationship detection to obtain a first training visual relationship between every two objects, the first training visual relationship characterizing the interaction relationship between every two objects in the annotated image as obtained through the first model to be trained;
The first training scene graph generation module is configured to input the first training visual relationship and the training object detection information corresponding to the first training visual relationship into the second model to be trained for scene graph generation to obtain a first training scene graph corresponding to the annotated image, the first training scene graph being the structural information annotated with the first training visual relationships between every two objects;
The model training module is configured to train the first model to be trained and the second model to be trained according to the training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model.
As an alternative embodiment, the apparatus further comprises:
The word frequency information detection module is configured to detect the word frequency information corresponding to each reference predicate in the reference visual relationship;
The second visual relationship detection model acquisition module is configured to combine the first visual relationship detection model with the preset matrix to obtain the second visual relationship detection model;
The second training visual relationship acquisition module is configured to input the training object detection information into the second visual relationship detection model for visual relationship detection to obtain a second training visual relationship between every two objects, the second training visual relationship characterizing the interaction relationship between every two objects in the annotated image as obtained through the second visual relationship detection model;
The second training scene graph acquisition module is configured to input the second training visual relationship and the training object detection information corresponding to the second training visual relationship into the initial scene graph generation model for scene graph generation to obtain a second training scene graph corresponding to the annotated image, the second training scene graph being the structural information annotated with the second training visual relationships between every two objects;
And the model adjustment module is configured to adjust the second visual relationship detection model and the initial scene graph generation model based on the reference predicate type corresponding to each annotated image, the second training visual relationship and the reference visual relationship to obtain the visual relationship detection model and the scene graph generation model.
As an alternative embodiment, the apparatus further comprises:
The initial visual relationship detection module is configured to input the training object detection information into the first visual relationship detection model for visual relationship detection to obtain an initial visual relationship between every two objects, the initial visual relationship characterizing the interaction relationship between every two objects in the annotated image as obtained through the first visual relationship detection model;
The initial scene graph generation module is configured to input the initial visual relationship and the training object detection information corresponding to the initial visual relationship into the initial scene graph generation model for scene graph generation to obtain an initial scene graph corresponding to the annotated image, the initial scene graph being the structural information annotated with the initial visual relationships between every two objects;
The initial matrix determination module is configured to determine the initial matrix according to the predicates in the initial visual relationship and the reference predicates in the reference visual relationship;
The preset matrix determination module is configured to obtain the preset matrix according to the normalized matrix corresponding to the initial matrix and the identity matrix corresponding to the initial matrix.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method and will not be repeated here.
Fig. 11 is a block diagram illustrating an electronic device for image processing, which may be a server, according to an exemplary embodiment, and an internal structure diagram thereof may be as shown in fig. 11. The electronic device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an image processing method.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the disclosed aspects and is not limiting of the electronic device to which the disclosed aspects apply, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as memory 1104 including instructions executable by processor 1120 of electronic device 1100 to perform the above-described method. Alternatively, the computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
In an exemplary embodiment, a computer program product is also provided, comprising computer instructions which, when executed by a processor, implement the above-described image processing method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any adaptations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (15)

1.一种图像处理方法,其特征在于,所述方法包括:1. An image processing method, characterized in that the method comprises: 将待处理图像输入到图像检测模型中进行对象检测,得到所述待处理图像中至少两个对象分别对应的对象检测信息;Inputting the image to be processed into the image detection model to perform object detection, and obtaining object detection information corresponding to at least two objects in the image to be processed; 将所述对象检测信息输入到视觉关系检测模型中进行视觉关系检测,得到两两对象间的视觉关系,所述视觉关系表征所述待处理图像中的两两对象间的交互关系;Inputting the object detection information into a visual relationship detection model to perform visual relationship detection to obtain a visual relationship between two objects, wherein the visual relationship represents an interactive relationship between two objects in the image to be processed; 将所述视觉关系和所述视觉关系对应的对象检测信息输入到场景图生成模型中进行场景图生成,得到所述待处理图像对应的目标场景图,所述目标场景图为标注有所述两两对象间的视觉关系的结构信息;Inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model to generate a scene graph, thereby obtaining a target scene graph corresponding to the image to be processed, wherein the target scene graph is structural information annotated with the visual relationship between the two objects; 所述视觉关系检测模型和场景图生成模型的生成方法包括:The method for generating the visual relationship detection model and the scene graph generation model comprises: 将标注图像作为源域,并对通用谓词的标注图像进行下采样得到目标域;所述标注图像中标注有两两对象间的参考视觉关系;所述通用谓词为参考谓词序列中从第一个参考谓词开始的预设数目个参考谓词,所述参考谓词序列是按照语义信息量的大小,从小到大对参考谓词进行排序得到的;The annotated image is used as a source domain, and the annotated image of the general predicate is downsampled to obtain a target domain; the annotated image is annotated with reference visual relations between two objects; the general predicate is a preset number of reference predicates starting from the first reference predicate in a reference predicate sequence, and the reference predicate sequence is obtained by sorting the reference predicates from small to large according to the amount of semantic information; 将在所述源域上训练得到的第一视觉关系检测模型、预设矩阵和初始场景图生成模型迁移到所述目标域上,组合所述第一视觉关系检测模型和所述预设矩阵,得到第二视觉关系检测模型;Migrating the first visual relationship detection model, the preset matrix, and the initial scene graph generation model trained on the source domain to the target domain, and combining the first visual relationship detection model and the preset matrix to obtain a second visual relationship detection model; 对所述第二视觉关系检测模型和所述初始场景图生成模型中按序排列的神经网络层中的最后一个神经网络层进行调整,得到视觉关系检测模型和场景图生成模型;Adjusting the last neural network layer in the neural network layers arranged in sequence in the second visual relationship detection model and the initial scene graph generation model to obtain a visual relationship detection model and a scene graph generation model; 所述预设矩阵的确定方法为:The method for determining the preset matrix is: 将训练对象检测信息输入到所述第一视觉关系检测模型中进行视觉关系检测,得到所述两两对象间的初始视觉关系,所述初始视觉关系表征通过第一视觉关系检测模型得到的所述标注图像中两两对象间的交互关系;Inputting the training object detection information into the first visual relationship detection model to perform visual relationship detection to obtain an initial visual relationship between the two objects, wherein the initial visual relationship represents an interactive relationship between the two objects in the annotated image obtained by the first visual relationship detection model; 将所述初始视觉关系和对应的训练对象检测信息输入到所述初始场景图生成模型中进行场景图生成,得到所述标注图像对应的初始场景图,所述初始场景图为标注有所述两两对象间的初始视觉关系的结构信息;Inputting the initial visual relationship and the corresponding training object detection information into the initial scene graph generation model to generate a scene graph, thereby obtaining 
an initial scene graph corresponding to the annotated image, wherein the initial scene graph is structural information annotated with the initial visual relationship between the two objects; 根据所述初始视觉关系中的谓词和所述参考视觉关系中的参考谓词,确定初始矩阵;determining an initial matrix according to the predicate in the initial visual relationship and the reference predicate in the reference visual relationship; 根据所述初始矩阵对应的归一化矩阵和所述初始矩阵对应的单位矩阵,得到预设矩阵。A preset matrix is obtained according to a normalized matrix corresponding to the initial matrix and a unit matrix corresponding to the initial matrix. 2.根据权利要求1所述的图像处理方法,其特征在于,所述视觉关系检测模型包括谓词识别网络,所述将所述对象检测信息输入到视觉关系检测模型中进行视觉关系检测,得到两两对象间的视觉关系包括:2. The image processing method according to claim 1, wherein the visual relationship detection model comprises a predicate recognition network, and the step of inputting the object detection information into the visual relationship detection model to perform visual relationship detection to obtain the visual relationship between two objects comprises: 将所述图像检测模型输出的对象检测信息输入到所述谓词识别网络中进行两两对象间的谓语关系对应的谓词识别,得到目标谓词,所述目标谓词表征语义调整后的谓词;Inputting the object detection information output by the image detection model into the predicate recognition network to perform predicate recognition corresponding to the predicate relationship between two objects, and obtaining a target predicate, wherein the target predicate represents the semantically adjusted predicate; 根据所述目标谓词和所述目标谓词对应的对象,得到所述视觉关系。The visual relationship is obtained according to the target predicate and the object corresponding to the target predicate. 3.根据权利要求2所述的图像处理方法,其特征在于,所述谓词识别网络包括初始相关度计算层和语义调整层,所述将所述对象检测信息输入到所述谓词识别网络中进行两两对象间的谓词识别,得到目标谓词包括:3. The image processing method according to claim 2, characterized in that the predicate recognition network comprises an initial relevance calculation layer and a semantic adjustment layer, and the step of inputting the object detection information into the predicate recognition network to perform predicate recognition between two objects to obtain the target predicate comprises: 将所述对象检测信息和预设谓词输入到所述初始相关度计算层中,对两两对象检测信息对应的谓词和每个预设谓词进行相关度计算,得到初始相关度分布信息,所述初始相关度分布信息表征语义调整前所述两两对象检测信息对应的谓词和所述每个预设谓词间的相关度;Inputting the object detection information and the preset predicates into the initial relevance calculation layer, performing relevance calculation on the predicates corresponding to the pairwise object detection information and each preset predicate, and obtaining initial relevance distribution information, wherein the initial relevance distribution information represents the relevance between the predicates corresponding to the pairwise object detection information and each preset predicate before semantic adjustment; 将所述初始相关度分布信息输入到语义调整层中,基于所述预设矩阵对所述初始相关度分布信息进行谓词语义调整,得到目标相关度分布信息,所述目标相关度分布信息表征语义调整后所述两两对象检测信息对应的谓词和所述每个预设谓词间的相关度;Inputting the initial relevance distribution information into a semantic adjustment layer, performing predicate semantic adjustment on the initial relevance distribution information based on the preset matrix, and obtaining target relevance distribution information, wherein the target relevance distribution information represents the relevance between the predicate corresponding to the pairwise object detection information and each preset predicate after the semantic adjustment; 根据所述目标相关度分布信息,确定所述目标谓词。The target predicate is determined according to the target relevance distribution information. 4.根据权利要求3所述的图像处理方法,其特征在于,所述将所述初始相关度分布信息输入到语义调整层中,基于所述预设矩阵对所述初始相关度分布信息进行谓词语义调整,得到所述目标相关度分布信息包括:4. 
The image processing method according to claim 3, characterized in that the step of inputting the initial relevance distribution information into the semantic adjustment layer, and performing predicate semantic adjustment on the initial relevance distribution information based on the preset matrix to obtain the target relevance distribution information comprises: 根据所述初始相关度分布信息,确定初始谓词;Determining an initial predicate according to the initial relevance distribution information; 在所述初始谓词为通用谓词的情况下,基于所述预设矩阵中的语义调整矩阵,对所述初始相关度分布信息进行谓词语义调整,所述通用谓词表征所述预设谓词中使用概率大于预设阈值的谓词;In the case where the initial predicate is a universal predicate, based on the semantic adjustment matrix in the preset matrix, the initial relevance distribution information is subjected to predicate semantic adjustment, wherein the universal predicate represents a predicate in the preset predicate whose use probability is greater than a preset threshold; 在所述初始谓词为非通用谓词的情况下,基于所述预设矩阵中的语义保持矩阵,将所述初始相关度分布信息确定为所述目标相关度分布信息,所述非通用谓词表征所述预设谓词中使用概率小于预设阈值的谓词。In the case where the initial predicate is a non-universal predicate, the initial relevance distribution information is determined as the target relevance distribution information based on a semantic preservation matrix in the preset matrix, and the non-universal predicate represents a predicate in the preset predicate whose usage probability is less than a preset threshold. 5.根据权利要求1所述的图像处理方法,其特征在于,所述方法还包括:5. The image processing method according to claim 1, characterized in that the method further comprises: 将标注图像输入到所述图像检测模型中进行对象检测,得到所述标注图像中每个对象对应的训练对象检测信息,所述标注图像标注有两两对象间的参考视觉关系;Inputting the annotated image into the image detection model to perform object detection, and obtaining training object detection information corresponding to each object in the annotated image, wherein the annotated image is annotated with reference visual relationships between two objects; 将所述训练对象检测信息输入到第一待训练模型中进行视觉关系检测,得到两两对象间的第一训练视觉关系,所述第一训练视觉关系表征通过所述第一待训练模型得到的所述标注图像中两两对象间的交互关系;Inputting the training object detection information into a first model to be trained to perform visual relationship detection, thereby obtaining a first training visual relationship between two objects, wherein the first training visual relationship represents an interactive relationship between two objects in the annotated image obtained by the first model to be trained; 将所述第一训练视觉关系和所述第一训练视觉关系对应的训练对象检测信息输入到第二待训练模型中进行场景图生成,得到所述标注图像对应的第一训练场景图,所述第一训练场景图为标注有所述两两对象间的第一训练视觉关系的结构信息;Inputting the first training visual relationship and the training object detection information corresponding to the first training visual relationship into the second to-be-trained model to generate a scene graph, thereby obtaining a first training scene graph corresponding to the annotated image, wherein the first training scene graph is annotated with structural information of the first training visual relationship between the two objects; 根据所述第一训练视觉关系和所述参考视觉关系,对所述第一待训练模型和所述第二待训练模型进行训练,得到第一视觉关系检测模型和初始场景图生成模型,所述第一视觉关系检测模型为不具有预设矩阵的视觉关系检测模型。According to the first training visual relationship and the reference visual relationship, the first model to be trained and the second model to be trained are trained to obtain a first visual relationship detection model and an initial scene graph generation model, wherein the first visual relationship detection model is a visual relationship detection model without a preset matrix. 6.根据权利要求5所述的图像处理方法,其特征在于,所述根据所述第一训练视觉关系和所述参考视觉关系,对所述第一待训练模型和所述第二待训练模型进行训练,得到所述第一视觉关系检测模型和所述初始场景图生成模型之后,所述方法还包括:6. 
The image processing method according to claim 5, characterized in that after the first to-be-trained model and the second to-be-trained model are trained according to the first training visual relationship and the reference visual relationship to obtain the first visual relationship detection model and the initial scene graph generation model, the method further comprises: 对所述参考视觉关系中每个参考谓词对应的词频信息进行检测;detecting word frequency information corresponding to each reference predicate in the reference visual relationship; 根据预设的词频分段信息和所述词频信息,对所述参考谓词进行分类,得到每个标注图像对应的参考谓词类型;Classifying the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain a reference predicate type corresponding to each annotated image; 将所述第一视觉关系检测模型和预设矩阵进行组合,得到第二视觉关系检测模型;Combining the first visual relationship detection model and a preset matrix to obtain a second visual relationship detection model; 将所述训练对象检测信息输入到所述第二视觉关系检测模型中进行视觉关系检测,得到所述两两对象间的第二训练视觉关系,所述第二训练视觉关系表征通过所述第二视觉关系检测模型得到的所述标注图像中两两对象间的交互关系;Inputting the training object detection information into the second visual relationship detection model to perform visual relationship detection to obtain a second training visual relationship between the two objects, wherein the second training visual relationship represents an interactive relationship between the two objects in the annotated image obtained by the second visual relationship detection model; 将所述第二训练视觉关系和所述第一训练视觉关系对应的训练对象检测信息输入到所述初始场景图生成模型中进行场景图生成,得到所述标注图像对应的第二训练场景图,所述第二训练场景图为标注有所述两两对象间的第二训练视觉关系的结构信息;Inputting the training object detection information corresponding to the second training visual relationship and the first training visual relationship into the initial scene graph generation model to generate a scene graph, thereby obtaining a second training scene graph corresponding to the annotated image, wherein the second training scene graph is annotated with structural information of the second training visual relationship between the two objects; 基于所述每个标注图像对应的参考谓词类型、所述第二训练视觉关系和所述参考视觉关系,对所述第二视觉关系检测模型和所述初始场景图生成模型进行调整,得到所述视觉关系检测模型和所述场景图生成模型。Based on the reference predicate type corresponding to each annotated image, the second training visual relationship and the reference visual relationship, the second visual relationship detection model and the initial scene graph generation model are adjusted to obtain the visual relationship detection model and the scene graph generation model. 7.一种图像处理装置,其特征在于,所述装置包括:7. 
An image processing device, characterized in that the device comprises: 对象检测模块,被配置为执行将待处理图像输入到图像检测模型中进行对象检测,得到所述待处理图像中至少两个对象分别对应的对象检测信息;An object detection module is configured to input the image to be processed into an image detection model to perform object detection, and obtain object detection information corresponding to at least two objects in the image to be processed; 视觉关系检测模块,被配置为执行将所述对象检测信息输入到视觉关系检测模型中进行视觉关系检测,得到两两对象间的视觉关系,所述视觉关系表征所述待处理图像中的两两对象间的交互关系;A visual relationship detection module is configured to input the object detection information into a visual relationship detection model to perform visual relationship detection to obtain a visual relationship between two objects, wherein the visual relationship represents an interactive relationship between two objects in the image to be processed; 场景图生成模块,被配置为执行将所述视觉关系和所述视觉关系对应的对象检测信息输入到场景图生成模型中进行场景图生成,得到所述待处理图像对应的目标场景图,所述目标场景图为标注有所述两两对象间的视觉关系的结构信息;A scene graph generation module is configured to execute the scene graph generation by inputting the visual relationship and the object detection information corresponding to the visual relationship into a scene graph generation model to obtain a target scene graph corresponding to the image to be processed, wherein the target scene graph is a structural information annotated with the visual relationship between the two objects; 所述装置还包括模型生成模块,被配置为执行:The device also includes a model generation module configured to execute: 将标注图像作为源域,并对通用谓词的标注图像进行下采样得到目标域;所述标注图像中标注有两两对象间的参考视觉关系;所述通用谓词为参考谓词序列中从第一个参考谓词开始的预设数目个参考谓词,所述参考谓词序列是按照语义信息量的大小,从小到大对参考谓词进行排序得到的;The annotated image is used as a source domain, and the annotated image of the general predicate is downsampled to obtain a target domain; the annotated image is annotated with reference visual relations between two objects; the general predicate is a preset number of reference predicates starting from the first reference predicate in a reference predicate sequence, and the reference predicate sequence is obtained by sorting the reference predicates from small to large according to the amount of semantic information; 将在所述源域上训练得到的第一视觉关系检测模型、预设矩阵和初始场景图生成模型迁移到所述目标域上,组合所述第一视觉关系检测模型和所述预设矩阵,得到第二视觉关系检测模型;Migrating the first visual relationship detection model, the preset matrix and the initial scene graph generation model trained on the source domain to the target domain, combining the first visual relationship detection model and the preset matrix to obtain a second visual relationship detection model; 对所述第二视觉关系检测模型和所述初始场景图生成模型中按序排列的神经网络层中的最后一个神经网络层进行调整,得到视觉关系检测模型和场景图生成模型;Adjusting the last neural network layer in the neural network layers arranged in sequence in the second visual relationship detection model and the initial scene graph generation model to obtain a visual relationship detection model and a scene graph generation model; 所述装置还包括:The device also includes: 初始视觉关系检测模块,被配置为执行将训练对象检测信息输入到所述第一视觉关系检测模型中进行视觉关系检测,得到所述两两对象间的初始视觉关系,所述初始视觉关系表征通过第一视觉关系检测模型得到的所述标注图像中两两对象间的交互关系;an initial visual relationship detection module, configured to input the training object detection information into the first visual relationship detection model to perform visual relationship detection, and obtain the initial visual relationship between the two objects, wherein the initial visual relationship represents the interaction relationship between the two objects in the annotated image obtained by the first visual relationship detection model; 场景初始图生成模块,被配置为执行将所述初始视觉关系和对应的训练对象检测信息输入到所述初始场景图生成模型中进行场景图生成,得到所述标注图像对应的初始场景图,所述初始场景图为标注有所述两两对象间的初始视觉关系的结构信息;A scene initial 
graph generation module is configured to execute scene graph generation by inputting the initial visual relationship and the corresponding training object detection information into the initial scene graph generation model to obtain an initial scene graph corresponding to the annotated image, wherein the initial scene graph is annotated with structural information of the initial visual relationship between the two objects; 初始矩阵确定模块,被配置为执行根据所述初始视觉关系中的谓词和所述参考视觉关系中的参考谓词,确定初始矩阵;an initial matrix determination module, configured to determine an initial matrix according to the predicate in the initial visual relationship and the reference predicate in the reference visual relationship; 预设矩阵确定模块,被配置为执行根据所述初始矩阵对应的归一化矩阵和所述初始矩阵对应的单位矩阵,得到预设矩阵。The preset matrix determination module is configured to execute a normalized matrix corresponding to the initial matrix and a unit matrix corresponding to the initial matrix to obtain a preset matrix. 8.根据权利要求7所述的图像处理装置,其特征在于,所述视觉关系检测模型包括谓词识别网络,所述视觉关系检测模块包括:8. The image processing device according to claim 7, wherein the visual relationship detection model comprises a predicate recognition network, and the visual relationship detection module comprises: 谓词识别单元,被配置为执行将所述图像检测模型输出的对象检测信息输入到所述谓词识别网络中进行两两对象间的谓语关系对应的谓词识别,得到目标谓词,所述目标谓词表征语义调整后的谓词;A predicate recognition unit is configured to input the object detection information output by the image detection model into the predicate recognition network to perform predicate recognition corresponding to the predicate relationship between two objects, and obtain a target predicate, wherein the target predicate represents the semantically adjusted predicate; 视觉关系确定单元,被配置为执行根据所述目标谓词和所述目标谓词对应的对象,得到所述视觉关系。The visual relationship determining unit is configured to obtain the visual relationship according to the target predicate and the object corresponding to the target predicate. 9.根据权利要求8所述的图像处理装置,其特征在于,所述谓词识别网络包括初始相关度计算层和语义调整层,所述谓词识别单元包括:9. 
The image processing device according to claim 8, characterized in that the predicate recognition network comprises an initial relevance calculation layer and a semantic adjustment layer, and the predicate recognition unit comprises: 初始相关度计算单元,被配置为执行将所述对象检测信息和预设谓词输入到所述初始相关度计算层中,对两两对象检测信息对应的谓词和每个预设谓词进行相关度计算,得到初始相关度分布信息,所述初始相关度分布信息表征语义调整前所述两两对象检测信息对应的谓词和所述每个预设谓词间的相关度;an initial relevance calculation unit, configured to input the object detection information and the preset predicate into the initial relevance calculation layer, perform relevance calculation on the predicates corresponding to the pairwise object detection information and each preset predicate, and obtain initial relevance distribution information, wherein the initial relevance distribution information represents the relevance between the predicates corresponding to the pairwise object detection information and each preset predicate before semantic adjustment; 语义调整单元,被配置为执行将所述初始相关度分布信息输入到语义调整层中,基于所述预设矩阵对所述初始相关度分布信息进行谓词语义调整,得到目标相关度分布信息,所述目标相关度分布信息表征语义调整后所述两两对象检测信息对应的谓词和所述每个预设谓词间的相关度;A semantic adjustment unit is configured to input the initial relevance distribution information into a semantic adjustment layer, perform predicate semantic adjustment on the initial relevance distribution information based on the preset matrix, and obtain target relevance distribution information, wherein the target relevance distribution information represents the relevance between the predicate corresponding to the pairwise object detection information and each preset predicate after the semantic adjustment; 目标谓词确定单元,被配置为执行根据所述目标相关度分布信息,确定所述目标谓词。The target predicate determination unit is configured to determine the target predicate according to the target relevance distribution information. 10.根据权利要求9所述的图像处理装置,其特征在于,所述语义调整单元包括:10. The image processing device according to claim 9, wherein the semantic adjustment unit comprises: 初始谓词确定单元,被配置为执行根据所述初始相关度分布信息,确定初始谓词;an initial predicate determination unit, configured to determine an initial predicate according to the initial relevance distribution information; 第一语义调整单元,被配置为执行在所述初始谓词为通用谓词的情况下,基于所述预设矩阵中的语义调整矩阵,对所述初始相关度分布信息进行谓词语义调整,所述通用谓词表征所述预设谓词中使用概率大于预设阈值的谓词;A first semantic adjustment unit is configured to perform predicate semantic adjustment on the initial relevance distribution information based on a semantic adjustment matrix in the preset matrix when the initial predicate is a universal predicate, wherein the universal predicate represents a predicate in the preset predicate whose use probability is greater than a preset threshold; 第二语义调整单元,被配置为执行在所述初始谓词为非通用谓词的情况下,基于所述预设矩阵中的语义保持矩阵,将所述初始相关度分布信息确定为所述目标相关度分布信息,所述非通用谓词表征所述预设谓词中使用概率小于预设阈值的谓词。The second semantic adjustment unit is configured to determine the initial relevance distribution information as the target relevance distribution information based on a semantic preservation matrix in the preset matrix when the initial predicate is a non-universal predicate, wherein the non-universal predicate represents a predicate in the preset predicate whose usage probability is less than a preset threshold. 11.根据权利要求10所述的图像处理装置,其特征在于,所述装置还包括:11. 
The image processing device according to claim 10, characterized in that the device further comprises: 第一训练特征提取模块,被配置为执行将标注图像输入到所述图像检测模型中进行特征提取,得到所述标注图像中每个对象对应的训练对象检测信息,所述标注图像标注有所述两两对象间的参考视觉关系;A first training feature extraction module is configured to perform feature extraction by inputting the annotated image into the image detection model to obtain training object detection information corresponding to each object in the annotated image, wherein the annotated image is annotated with reference visual relationships between the two objects; 第一训练视觉关系检测模块,被配置为执行将所述训练对象检测信息输入到第一待训练模型中进行视觉关系检测,得到所述两两对象间的第一训练视觉关系,所述第一训练视觉关系表征通过所述第一待训练模型得到的所述标注图像中两两对象间的交互关系;A first training visual relationship detection module is configured to input the training object detection information into a first model to be trained to perform visual relationship detection to obtain a first training visual relationship between the two objects, wherein the first training visual relationship represents an interactive relationship between the two objects in the annotated image obtained by the first model to be trained; 第一训练场景图生成模块,被配置为执行将所述第一训练视觉关系和所述第一训练视觉关系对应的训练对象检测信息输入到第二待训练模型中进行场景图生成,得到所述标注图像对应的第一训练场景图,所述第一训练场景图为标注有所述两两对象间第一训练视觉关系的结构信息;A first training scene graph generation module is configured to execute the step of inputting the first training visual relationship and the training object detection information corresponding to the first training visual relationship into a second to-be-trained model to generate a scene graph, thereby obtaining a first training scene graph corresponding to the annotated image, wherein the first training scene graph is structural information annotated with the first training visual relationship between the two objects; 模型训练模块,被配置为执行根据所述第一训练视觉关系和所述参考视觉关系,对所述第一待训练模型和所述第二待训练模型进行训练,得到第一视觉关系检测模型和初始场景图生成模型。The model training module is configured to train the first model to be trained and the second model to be trained according to the first training visual relationship and the reference visual relationship to obtain a first visual relationship detection model and an initial scene graph generation model. 12.根据权利要求11所述的图像处理装置,其特征在于,所述装置还包括:12. 
The image processing device according to claim 11, characterized in that the device further comprises: 词频信息检测模块,被配置为执行对所述参考视觉关系中每个参考谓词对应的词频信息进行检测;A word frequency information detection module is configured to detect word frequency information corresponding to each reference predicate in the reference visual relationship; 参考谓词分类模块,被配置为执行根据预设的词频分段信息和所述词频信息,对所述参考谓词进行分类,得到每个标注图像对应的参考谓词类型;A reference predicate classification module is configured to classify the reference predicates according to preset word frequency segmentation information and the word frequency information to obtain a reference predicate type corresponding to each annotated image; 第二视觉关系检测模型获取模块,被配置为执行将所述第一视觉关系检测模型和预设矩阵进行组合,得到第二视觉关系检测模型;A second visual relationship detection model acquisition module is configured to combine the first visual relationship detection model with a preset matrix to obtain a second visual relationship detection model; 第二训练视觉关系获取模块,被配置为执行将所述训练对象检测信息输入到所述第二视觉关系检测模型中进行视觉关系检测,得到所述两两对象间的第二训练视觉关系,所述第二训练视觉关系表征通过所述第二视觉关系检测模型得到的所述标注图像中两两对象间的交互关系;A second training visual relationship acquisition module is configured to input the training object detection information into the second visual relationship detection model to perform visual relationship detection to obtain a second training visual relationship between the two objects, wherein the second training visual relationship represents an interactive relationship between the two objects in the annotated image obtained by the second visual relationship detection model; 第二训练场景图获取模块,被配置为执行将所述第二训练视觉关系和所述第一训练视觉关系对应的训练对象检测信息输入到所述初始场景图生成模型中进行场景图生成,得到所述标注图像对应的第二训练场景图,所述第二训练场景图为标注有所述两两对象间的第二训练视觉关系的结构信息;A second training scene graph acquisition module is configured to execute inputting the training object detection information corresponding to the second training visual relationship and the first training visual relationship into the initial scene graph generation model to generate a scene graph, so as to obtain a second training scene graph corresponding to the annotated image, wherein the second training scene graph is structural information annotated with the second training visual relationship between the two objects; 模型调整模块,被配置为执行基于所述每个标注图像对应的参考谓词类型、所述第二训练视觉关系和所述参考视觉关系,对所述第二视觉关系检测模型和所述初始场景图生成模型进行调整,得到所述视觉关系检测模型和所述场景图生成模型。The model adjustment module is configured to adjust the second visual relationship detection model and the initial scene graph generation model based on the reference predicate type corresponding to each annotated image, the second training visual relationship and the reference visual relationship to obtain the visual relationship detection model and the scene graph generation model. 13.一种电子设备,其特征在于,所述电子设备包括:13. An electronic device, characterized in that the electronic device comprises: 处理器;processor; 用于存储所述处理器可执行指令的存储器;a memory for storing instructions executable by the processor; 其中,所述处理器被配置为执行所述指令,以实现如权利要求1至6中任一项所述的图像处理方法。The processor is configured to execute the instructions to implement the image processing method according to any one of claims 1 to 6. 14.一种计算机可读存储介质,其特征在于,当所述计算机可读存储介质中的指令由电子设备的处理器执行时,使得所述电子设备能够执行如权利要求1至6中任一项所述的图像处理方法。14. A computer-readable storage medium, characterized in that when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the image processing method according to any one of claims 1 to 6. 15.一种计算机程序产品,包括计算机指令,其特征在于,所述计算机指令被处理器执行时实现权利要求1至6任一项所述的图像处理方法。15. 
13. An electronic device, characterized in that the electronic device comprises:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the image processing method according to any one of claims 1 to 6.
14. A computer-readable storage medium, characterized in that when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the image processing method according to any one of claims 1 to 6.
15. A computer program product, comprising computer instructions, characterized in that when the computer instructions are executed by a processor, the image processing method according to any one of claims 1 to 6 is implemented.
CN202110693496.5A 2021-06-22 2021-06-22 Image processing method, device, electronic device and storage medium Active CN113869099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110693496.5A CN113869099B (en) 2021-06-22 2021-06-22 Image processing method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113869099A (en) 2021-12-31
CN113869099B (en) 2024-12-24

Family

ID=78989959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110693496.5A Active CN113869099B (en) 2021-06-22 2021-06-22 Image processing method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113869099B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511779B (en) * 2022-01-20 2023-07-25 电子科技大学 Training method for scene graph generation model, scene graph generation method and device
CN114821188A (en) * 2022-05-20 2022-07-29 京东科技信息技术有限公司 Image processing method, training method of scene graph generation model and electronic equipment
CN116704233A (en) * 2023-03-27 2023-09-05 西安交通大学 Agricultural scene graph generation method and system based on double-attention perception

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126049A (en) * 2019-12-14 2020-05-08 中国科学院深圳先进技术研究院 Object relation prediction method and device, terminal equipment and readable storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPR464601A0 (en) * 2001-04-30 2001-05-24 Commonwealth Of Australia, The Shapes vector
US9904579B2 (en) * 2013-03-15 2018-02-27 Advanced Elemental Technologies, Inc. Methods and systems for purposeful computing
US9864932B2 (en) * 2015-04-14 2018-01-09 Conduent Business Services, Llc Vision-based object detector
AU2016225820B2 (en) * 2015-11-11 2021-04-15 Adobe Inc. Structured knowledge modeling, extraction and localization from images
CN107133274B (en) * 2017-04-10 2020-12-15 浙江鸿程计算机系统有限公司 Distributed information retrieval set selection method based on graph knowledge base
US10452923B2 (en) * 2017-11-28 2019-10-22 Visual Semantics, Inc. Method and apparatus for integration of detected object identifiers and semantic scene graph networks for captured visual scene behavior estimation
US20200242146A1 (en) * 2019-01-24 2020-07-30 Andrew R. Kalukin Artificial intelligence system for generating conjectures and comprehending text, audio, and visual data using natural language understanding
US11373390B2 (en) * 2019-06-21 2022-06-28 Adobe Inc. Generating scene graphs from digital images using external knowledge and image reconstruction
US20210110306A1 (en) * 2019-10-14 2021-04-15 Visa International Service Association Meta-transfer learning via contextual invariants for cross-domain recommendation
CN111626291B (en) * 2020-04-07 2023-04-25 上海交通大学 Method, system and terminal for detecting image visual relationship
CN111612070B (en) * 2020-05-13 2024-04-26 清华大学 Image description generation method and device based on scene graph
CN111931928B (en) * 2020-07-16 2022-12-27 成都井之丽科技有限公司 Scene graph generation method, device and equipment
CN112163608B (en) * 2020-09-21 2023-02-03 天津大学 Visual relation detection method based on multi-granularity semantic fusion

Similar Documents

Publication Publication Date Title
US11348249B2 (en) Training method for image semantic segmentation model and server
US11544588B2 (en) Image tagging based upon cross domain context
US10366313B2 (en) Activation layers for deep learning networks
CN113869099B (en) Image processing method, device, electronic device and storage medium
WO2019119505A1 (en) Face recognition method and device, computer device and storage medium
EP2806374B1 (en) Method and system for automatic selection of one or more image processing algorithm
US20200334867A1 (en) Face synthesis
WO2022105118A1 (en) Image-based health status identification method and apparatus, device and storage medium
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN112541529A (en) Expression and posture fusion bimodal teaching evaluation method, device and storage medium
CN110765882B (en) Video tag determination method, device, server and storage medium
WO2020019591A1 (en) Method and device used for generating information
US11126827B2 (en) Method and system for image identification
CN113762237B (en) Text image processing method, device, equipment and storage medium
CN113569081B (en) Image recognition method, device, equipment and storage medium
CN108399379A (en) The method, apparatus and electronic equipment at facial age for identification
CN111753618B (en) Image recognition method, device, computer equipment and computer readable storage medium
CN114913942A (en) Intelligent matching method and device for patient recruitment projects
CN112084954A (en) Video target detection method and device, electronic equipment and storage medium
CN112132026A (en) Animal identification method and device
CN115273136A (en) A model distillation method, target detection method and related equipment
US9208404B2 (en) Object detection with boosted exemplars
US9081800B2 (en) Object detection via visual search
Qin et al. Finger-vein quality assessment based on deep features from grayscale and binary images
CN114463612B (en) Image recognition method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant