CN115186072A - A Knowledge Graph Visual Question Answering Method Based on Dual-process Cognitive Theory

Info

Publication number: CN115186072A
Application number: CN202110374169.3A
Authority: CN
Inventors: 何小海; 刘露平; 王美玲; 卿粼波; 陈洪刚; 吴小强; 滕奇志
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2021-04-07
Filing date: 2021-04-07
Publication date: 2022-10-14
Anticipated expiration: 2041-04-07
Also published as: CN115186072B

Abstract

The invention discloses a knowledge graph visual question answering method based on dual-process cognitive theory, comprising the following steps: (1) Question-image joint representation: for the input question and image, features are extracted with the pre-trained models BERT and Faster-RCNN respectively and then fed into a dual-stream Transformer model to learn a joint question-image representation. (2) Fact graph and semantic graph construction: a fact graph and a semantic graph are built for each question-image pair; the fact graph is built from facts retrieved from a knowledge base by semantic matching, and the semantic graph is built from a generated image description. (3) Evidence aggregation: graph reasoning networks select evidence from each of the two graphs, and a cross-modal graph reasoning network then aggregates evidence from the semantic graph into the fact graph. (4) Answer reasoning: the nodes of the fact graph are classified as answer or non-answer (binary classification) to obtain the answer. The method has broad application prospects in fields such as education and entertainment.

Description

A Knowledge Graph Visual Question Answering Method Based on Dual-process Cognitive Theory

Technical Field

The invention provides a knowledge graph visual question answering method based on dual-process cognitive theory, and lies at the intersection of natural language processing and computer vision.

Background Art

Enabling an agent to understand the world by analyzing visual and linguistic information has been a hot research topic at the junction of computer vision and natural language processing in recent years. Related research has driven many applications, such as visual question answering (VQA), image indexing, and image captioning. Among these, VQA is a challenging task that requires a model to answer arbitrary questions about a given image. Researchers have carried out a large amount of work on VQA in recent years and have made considerable progress. However, existing VQA methods focus on answering questions from the image content alone and cannot answer questions that also require common-sense knowledge.

To advance the field, Wang et al. proposed the Fact-based Visual Question Answering (FVQA) task (P. Wang, Q. Wu, C. Shen, A. Dick and A. van den Hengel, "FVQA: Fact-Based Visual Question Answering," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 10, pp. 2413-2427, 1 Oct. 2018, doi: 10.1109/TPAMI.2017.2754246). They also released a new dataset that provides an additional supporting fact for each question-answer pair and requires the model to answer the question through joint analysis of the image and external knowledge. The method of Wang et al. first parses the question and maps it onto a knowledge graph, and then uses keyword matching to find the correct answer. This approach has an obvious flaw: it fails when the question mentions no explicit visual concept, or when synonyms and homographs are present.

Later work therefore proposed a retrieval method based on semantic learning, which projects the image-question-visual-concept representation and the candidate facts into a learned embedding space and finds the supporting fact by computing the corresponding distances. However, this method evaluates one fact node at a time and is therefore inefficient; in addition, it cannot exploit the structural information of the external knowledge base. To address this, Narasimhan et al. proposed a graph-inference method that selects the answer by reasoning over the entire graph (Narasimhan M, Schwing A G. Straight to the facts: Learning knowledge base retrieval for factual visual question answering[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 451-468.). This method constructs an entity graph in which each node is represented by the concatenation of the entity, image and question representations, uses a graph convolutional network to aggregate messages and obtain updated node features, and finally predicts the answer from the updated node features. Because a question concerns only part of the visual content, this approach inevitably introduces noisy information.

For a human given an image and a question, inferring the answer can be divided into two steps: (1) by analyzing the image and the question, the brain quickly obtains the content presented in the image and the visual information the question is concerned with; (2) starting from the output of the first step and drawing on knowledge stored in the brain, it performs in-depth analysis to find the correct answer. In cognitive science this is known as dual-process theory. Dual-process cognitive theory holds that the human brain first perceives external input through an implicit, unconscious process performed by a system called System 1. This information is then passed to a system called System 2, which performs an explicit, conscious and controllable reasoning process. System 2 reasons sequentially in working memory; this process is slower, but it is a property specific to humans as advanced intelligent agents. From this perspective, the FVQA problem can be addressed with dual-process cognitive theory: System 1 quickly retrieves information from the image and the question, and System 2 combines external knowledge to perform deep reasoning and find the correct answer.

Inspired by dual-process cognitive theory, the invention proposes a new framework that solves the FVQA problem with a dual-process cognitive system. Specifically, System 1 in the framework is implemented by a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationship between the question and the image. System 1 outputs a joint representation of the image and the question. For System 2, the invention uses a graph neural network (GNN) to reason over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer. System 2 first performs intra-modal evidence selection and aggregation, collecting question-related evidence from each of the two graphs separately; it then performs cross-modal selection, aggregating evidence from the semantic graph into the fact graph (the direction stated in the abstract, the summary steps and the claims) to support better inference of the answer. For the reasoning process, the invention proposes a dual attention mechanism consisting of node-level attention and path-level attention, which captures valuable information from key nodes and paths, makes the reasoning process more reasonable, and further improves the performance of knowledge-graph-based visual question answering.

Summary of the Invention

Inspired by dual-process cognitive theory, the invention proposes a new framework that solves the FVQA problem with a dual-process cognitive system. Specifically, System 1 in the framework is implemented by a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationship between the question and the image and outputs a joint question-image representation. For System 2, a GCN network reasons over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer.

The invention achieves the above purpose through the following technical solution:

1. The knowledge graph visual question answering framework based on dual-process cognitive theory is shown in Figure 1. It comprises two parts, a coordinated perception module (System 1) and an explicit reasoning module (System 2). The inference process of the framework comprises the following steps:

(1) Features are extracted from the input text and image with the pre-trained text model BERT and the pre-trained image model Faster-RCNN respectively. For each question, the [CLS] and [SEP] tokens are added at the beginning and end of the question, which is then fed into the BERT model to extract features. For each image, 36 object regions are extracted; each region carries the visual feature of one object and the feature of the object's spatial position.

(2) The image and text features extracted in step (1) are fed into a dual-stream Transformer network to learn a joint representation of image and text. One single-stream Transformer network learns the image-guided question representation, and the other single-stream Transformer learns the question-guided image representation. Finally, the outputs of the two single-stream networks are average-pooled and then multiplied to obtain the joint representation.

(3) Construction of the fact graph and the semantic graph. A fact graph and a semantic graph are built for each question-image pair. The fact graph is selected from an external knowledge base by sentence-level semantic matching; the semantic graph is obtained by first generating a semantic description of the image and then semantically parsing the generated sentences.

(4) Evidence aggregation based on graph reasoning. First, two attention-based graph reasoning networks aggregate evidence from the fact graph and from the semantic graph respectively; then a cross-modal reasoning network aggregates question-related evidence from the semantic graph into the fact graph.

(5) Answer prediction. The dot product of the joint question-image representation and the feature vector of each node in the fact graph gives the semantic relevance score between that node and the question; the relevance score is then fed into a Sigmoid layer to predict the answer.
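Step (5) can be illustrated with a minimal NumPy sketch. It assumes the joint representation and the node features already share one dimension (4 here for readability; the description uses 768). The function name `predict_answer`, the toy vectors and the choice of returning the highest-scoring node are illustrative assumptions, not part of the patent.

```python
import numpy as np

def predict_answer(joint_repr, node_feats, node_names):
    """Score every fact-graph node against the joint question-image vector."""
    logits = node_feats @ joint_repr          # one dot product per node, shape (N,)
    probs = 1.0 / (1.0 + np.exp(-logits))     # Sigmoid: probability that a node is the answer
    best = int(np.argmax(probs))              # assumption: report the highest-scoring node
    return node_names[best], probs

# Toy inputs (hypothetical values, for illustration only).
joint = np.array([1.0, 0.0, -1.0, 0.5])
nodes = np.array([[0.2, 0.1, 0.3, 0.0],
                  [2.0, 0.0, -1.0, 1.0],
                  [-1.0, 0.5, 1.0, 0.0]])
answer, probs = predict_answer(joint, nodes, ["dog", "frisbee", "grass"])
```

Because each node receives an independent Sigmoid score, training can treat every node as a binary answer / non-answer decision, which matches the binary classification mentioned in the abstract.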

Specifically, in step (1), the pre-trained BERT model (the bert-base-uncased version) is first used to initialize the vectors of the words of the input question, giving the word feature vectors C = [c₀, c₁, ..., c_n], each of dimension 768. The pre-trained Faster-RCNN model is then used to extract 36 object regions from the input image; each region carries the appearance feature of an object, the spatial position feature of the object, and the corresponding label. To capture visual and spatial features at the same time, the invention first projects the appearance features and the spatial position features into the same dimension (768 in the invention) and then averages the two to obtain the feature representation of each object.
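The projection-and-average step for region features can be sketched as follows. The input sizes (2048-dimensional appearance vectors, 4-dimensional box coordinates) and the random stand-in data are assumptions made for illustration; only the 36 regions and the 768-dimensional target space come from the description. In a trained model `W_app` and `W_box` would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_REGIONS, HIDDEN = 36, 768          # values stated in the description
APPEARANCE_DIM, BOX_DIM = 2048, 4      # assumed detector output sizes

# Hypothetical stand-ins for Faster-RCNN outputs.
appearance = rng.normal(size=(NUM_REGIONS, APPEARANCE_DIM))
boxes = rng.uniform(size=(NUM_REGIONS, BOX_DIM))   # e.g. normalized (x1, y1, x2, y2)

# Projection matrices (randomly initialized here; learned in practice).
W_app = rng.normal(scale=0.02, size=(APPEARANCE_DIM, HIDDEN))
W_box = rng.normal(scale=0.02, size=(BOX_DIM, HIDDEN))

def fuse_region_features(appearance, boxes, W_app, W_box):
    """Project both feature types to the same dimension, then average them."""
    app_proj = appearance @ W_app      # (36, 768)
    box_proj = boxes @ W_box           # (36, 768)
    return (app_proj + box_proj) / 2.0

V = fuse_region_features(appearance, boxes, W_app, W_box)
```

The result `V` plays the role of the image feature vector V used in step (2).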

In step (2), a cross-modal Transformer is used to align the two modalities and to learn the image-attended question representation and the question-attended image representation, as shown in Figure 1.

Step (1) yields the question feature vector C and the image feature vector V, which are fed into a dual-stream Transformer network to learn the complex interaction between the question and the image. One single-stream Transformer network learns the image-guided question representation, and the other learns the question-guided image representation. In image-guided question representation learning, the question feature vector serves as the query vector and the image feature vector serves as the key and value vectors; the dependency between the image and the question is computed as follows:

W′_* in formula (1) denotes the parameter matrices to be learned by the model. In question-guided image representation learning, the image feature vector serves as the query vector and the question feature vector serves as the key and value vectors; the dependency between the two is computed as follows:

W″_* in formula (2) likewise denotes the parameter matrices to be learned by the model. After multiple layers (9 layers in the invention) of joint image-text representation learning, the feature at the [CLS] position of the text sequence is taken as the joint representation of the whole question-image pair.
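Formulas (1) and (2) are not reproduced in this text, so the sketch below only illustrates the query/key/value roles described above. It assumes standard single-head scaled dot-product attention, which may differ from the patent's exact formulas; `cross_attention`, `W_prime` and `W_dprime` are illustrative names standing in for the learned matrices W′_* and W″_*.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, W_q, W_k, W_v):
    """One modality supplies the queries, the other supplies keys and values."""
    Q = query_feats @ W_q
    K = context_feats @ W_k
    V = context_feats @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # assumed scaled dot-product form
    weights = softmax(scores, axis=-1)        # each query attends over the other modality
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 768
C = rng.normal(size=(12, d))       # question tokens, including [CLS] and [SEP]
V_img = rng.normal(size=(36, d))   # 36 region features

W_prime = [rng.normal(scale=0.02, size=(d, d)) for _ in range(3)]    # stands in for W'_*
W_dprime = [rng.normal(scale=0.02, size=(d, d)) for _ in range(3)]   # stands in for W''_*

# Image-guided question representation: question is the query.
q_guided, attn_q = cross_attention(C, V_img, *W_prime)
# Question-guided image representation: image is the query.
v_guided, attn_v = cross_attention(V_img, C, *W_dprime)
```

A full Transformer layer would add multiple heads, residual connections, layer normalization and a feed-forward block, and the description stacks 9 such layers; these are omitted here for brevity.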

The fact graph and the semantic graph in step (3) are constructed as follows. First, for each question-image pair, the input question is concatenated in turn with the label of each object detected in the image, giving the question-image instance set; each triple in the external knowledge base is converted into a natural language sentence (by concatenating the head entity, the relation and the tail entity in order), giving the corresponding fact instance set. Second, the question-image instance set and the fact instance set are encoded with the pre-trained sentence encoder Universal-sentence-encoder to obtain the instance representations; the cosine similarity between the representation of each instance in the question-image instance set and the representation of each fact instance is then computed to obtain the corresponding relevance score, all instances are ranked by cosine similarity score, and the 10 highest-scoring instances are taken as candidate supporting facts. Third, the fact graph is built from the retrieved candidate supporting facts: the nodes of the graph are entities in the knowledge base and the edges are the relations between two entities. After the fact graph is built, BERT is used to initialize the representations of the nodes and edges, where the initial representation of a node or edge is the average of the word embeddings of all words in that node or edge.
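The retrieval procedure can be sketched as below. The bag-of-words `toy_encode` function is a deliberately simple, hypothetical stand-in for Universal-sentence-encoder so that the example runs without external models. Taking the maximum similarity over the question-image instances when ranking a fact is also an assumption, since the text does not say how scores from several instances are combined.

```python
import numpy as np

def triple_to_sentence(head, relation, tail):
    """Concatenate head entity, relation and tail entity into one sentence."""
    return f"{head} {relation} {tail}"

def toy_encode(sentences):
    """Hypothetical bag-of-words encoder standing in for a sentence encoder."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(sentences), len(vocab)))
    for row, s in enumerate(sentences):
        for w in s.lower().split():
            vecs[row, index[w]] += 1.0
    return vecs

def retrieve_facts(question, labels, triples, top_k=10):
    # Question-image instances: the question concatenated with each detected label.
    queries = [f"{question} {label}" for label in labels]
    facts = [triple_to_sentence(*t) for t in triples]
    emb = toy_encode(queries + facts)           # shared vocabulary for both sets
    q, f = emb[:len(queries)], emb[len(queries):]
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    sim = q @ f.T                               # cosine similarity, (num_queries, num_facts)
    best = sim.max(axis=0)                      # assumption: best score over all instances
    order = np.argsort(-best)[:top_k]
    return [triples[i] for i in order]

triples = [("dog", "is a", "animal"),
           ("frisbee", "used for", "playing catch"),
           ("car", "has", "wheels")]
top = retrieve_facts("what animal is catching the frisbee",
                     ["dog", "frisbee"], triples, top_k=2)
```

The entities of the retrieved triples then become the nodes of the fact graph and the relations become its edges.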

The evidence aggregation based on graph reasoning in step (4) is divided into intra-modal evidence aggregation and inter-modal evidence aggregation:

In intra-modal evidence aggregation, the two-level attention mechanism proposed by the invention, consisting of node-level attention and path-level attention, is first used for feature selection and aggregation. In node-level attention, the attention score between each node in the graph and the joint question-image representation is computed first; the calculation is as follows:

The attention score is then multiplied by the initial feature vector of each node in the graph to obtain the question-image-guided node feature vector. Path-level attention is concerned with which path is more important to the reasoning process, where each path is defined as the path formed by the nodes and edges directly connected to the target node, as follows:

φ_ij = (v_i, r_ij, v_j)    (4)

where v_i, r_ij and v_j denote the representation of the head node, the representation of the relation and the representation of the tail node in the fact graph, respectively. Once the path representation is obtained, path-level attention is computed as follows:

Features are then aggregated from the neighbor nodes by a message-passing network; the neighbor feature aggregation is computed as follows:

Finally, the features of the neighbor nodes are fused with the features of the target node to update the target node's features. To prevent the neighbor features from over-updating the node's initial features, a gating mechanism is designed to control the proportion between the neighbor features and the original features of the target node; the feature update of the target node is computed as follows:

Evidence selection in the semantic graph follows the same steps as in the fact graph and is not repeated here.
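The formulas for node-level attention, path-level attention, neighbor aggregation and gating are not reproduced in this text. The sketch below therefore shows only one plausible parameterization of the four stages, assuming additive (tanh) attention, softmax normalization, and a sigmoid gate over the concatenated message and node feature. All parameter names in `params`, the treatment of edges as directed (a node aggregates over the paths where it is the head), and the toy sizes are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def intra_modal_step(nodes, edges, q, p):
    """nodes: (N, d); edges: {(i, j): r_ij with shape (d,)}; q: (d,) joint representation."""
    # 1. Node-level attention: score each node against the joint representation.
    node_scores = np.tanh(nodes @ p["W_n"] + q @ p["W_q"]) @ p["w_n"]   # (N,)
    alpha = softmax(node_scores)
    guided = alpha[:, None] * nodes          # question-image-guided node features
    updated = nodes.copy()
    for i in range(len(nodes)):
        nbrs = [(j, r) for (h, j), r in edges.items() if h == i]
        if not nbrs:
            continue                          # isolated target node keeps its features
        # 2. Path-level attention over paths phi_ij = (v_i, r_ij, v_j).
        paths = np.stack([np.concatenate([guided[i], r, guided[j]]) for j, r in nbrs])
        path_scores = np.tanh(paths @ p["W_p"] + q @ p["W_pq"]) @ p["w_p"]
        beta = softmax(path_scores)
        # 3. Message passing: attention-weighted sum of transformed paths.
        message = beta @ (paths @ p["W_m"])   # (d,)
        # 4. Gate between the aggregated message and the original node feature.
        gate = sigmoid(np.concatenate([message, nodes[i]]) @ p["W_g"])
        updated[i] = gate * message + (1.0 - gate) * nodes[i]
    return updated, alpha

rng = np.random.default_rng(0)
d, h = 4, 5                                   # toy sizes; the description uses 768
nodes = rng.normal(size=(3, d))
edges = {(0, 1): rng.normal(size=d), (0, 2): rng.normal(size=d), (1, 2): rng.normal(size=d)}
q = rng.normal(size=d)
params = {"W_n": rng.normal(size=(d, h)), "W_q": rng.normal(size=(d, h)), "w_n": rng.normal(size=h),
          "W_p": rng.normal(size=(3 * d, h)), "W_pq": rng.normal(size=(d, h)), "w_p": rng.normal(size=h),
          "W_m": rng.normal(size=(3 * d, d)), "W_g": rng.normal(size=(2 * d, d))}
updated, alpha = intra_modal_step(nodes, edges, q, params)
```

The gate keeps each updated feature an element-wise convex combination of the aggregated message and the original node feature, which is one way to realize the stated goal of preventing neighbor features from overwriting the initial features.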

Inter-modal evidence aggregation. Under the guidance of the question, the attention weight coefficient between each node in the fact graph and each node in the semantic graph is computed first; the nodes of the semantic graph are then summed, weighted by these attention coefficients, to obtain the relevant features from the semantic graph. The calculation is as follows:

Finally, the feature vector obtained from the semantic graph is fused with the feature vector of the original node in the fact graph to obtain the updated feature after cross-modal fusion. Likewise, to prevent the features from the semantic graph from over-updating the fact node features, a corresponding gating mechanism is designed to control the proportion of the two modalities' features. The calculation is as follows:

Finally, the updated features are used to infer the answer.
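The inter-modal formulas are likewise not reproduced in this text. The following sketch assumes additive question-guided attention between every fact node and every semantic node, followed by a sigmoid gate; the parameter names and toy sizes are illustrative, and the patent's actual formulas may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_aggregate(fact, sem, q, p):
    """fact: (Nf, d) fact-graph nodes; sem: (Ns, d) semantic-graph nodes; q: (d,)."""
    hf = fact @ p["W_f"]                       # (Nf, h)
    hs = sem @ p["W_s"]                        # (Ns, h)
    hq = q @ p["W_q"]                          # (h,), injects the question guidance
    # Pairwise scores between every fact node and every semantic node.
    scores = np.tanh(hf[:, None, :] + hs[None, :, :] + hq) @ p["w"]   # (Nf, Ns)
    gamma = softmax(scores, axis=1)            # each fact node attends over semantic nodes
    message = gamma @ sem                      # weighted sum of semantic features, (Nf, d)
    # Gate controlling the share of semantic evidence versus the original fact feature.
    gate = sigmoid(np.concatenate([message, fact], axis=1) @ p["W_g"])
    return gate * message + (1.0 - gate) * fact, gamma

rng = np.random.default_rng(0)
d, h = 4, 5                                    # toy sizes; the description uses 768
fact_nodes = rng.normal(size=(3, d))
sem_nodes = rng.normal(size=(6, d))
q = rng.normal(size=d)
params = {"W_f": rng.normal(size=(d, h)), "W_s": rng.normal(size=(d, h)),
          "W_q": rng.normal(size=(d, h)), "w": rng.normal(size=h),
          "W_g": rng.normal(size=(2 * d, d))}
fused, gamma = cross_modal_aggregate(fact_nodes, sem_nodes, q, params)
```

The fused fact-node features are what the answer prediction step would score against the joint question-image representation.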

Brief Description of the Drawings

Figure 1 shows the main framework of the network model proposed by the invention.

Figure 2 shows the structure of the cross-modal Transformer network.

Detailed Description

The invention is further described below with reference to the drawings:

Figure 1 shows the structure of the whole network, which consists of two parts: the coordinated perception module (System 1) and the explicit reasoning module (System 2). System 1 is implemented by a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationship between the question and the image, and it outputs a joint representation of the image and the question. For System 2, the invention uses a graph neural network (GNN) to reason over two external knowledge graphs (the fact graph and the semantic graph) to find the correct answer.

Figure 2 shows the cross-modal Transformer framework. The extracted image and text features are fed into a dual-stream Transformer network to learn a joint representation of image and text. One single-stream Transformer network learns the image-guided question representation, and the other single-stream Transformer model learns the question-guided image representation. Finally, the outputs of the two single-stream networks are average-pooled and then multiplied to obtain the joint representation.
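The pooling-and-multiplication step can be illustrated with a small worked example. It assumes that "multiplied" means an element-wise product of the two pooled vectors; the function name and the toy values are illustrative only.

```python
import numpy as np

def joint_representation(question_stream, image_stream):
    """Average-pool each stream over its sequence axis, then multiply element-wise."""
    q_pooled = question_stream.mean(axis=0)    # (d,)
    v_pooled = image_stream.mean(axis=0)       # (d,)
    return q_pooled * v_pooled                 # assumed element-wise product

# Toy outputs of the two streams (2 tokens and 3 regions, d = 2).
q_out = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
v_out = np.array([[2.0, 0.0],
                  [4.0, 2.0],
                  [0.0, 4.0]])
joint = joint_representation(q_out, v_out)     # pooled [2, 3] * pooled [2, 2] = [4, 6]
```

Pooling removes the dependence on sequence length, so the question stream and the image stream can have different numbers of tokens and regions while still producing vectors of the same size.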

Tables 1 and 2 give the experimental results of the invention on the public datasets FVQA and OK-VQA. The experiments show that, compared with the best existing models, the proposed model achieves the best result on the overall evaluation metric, the F1 score.

Table 1. Experimental comparison between the network model of the invention and other existing models on the FVQA dataset

Table 2. Experimental comparison between the network model of the invention and other existing models on the OK-VQA dataset

The above embodiments are only preferred embodiments of the invention and do not limit its technical solution. Any technical solution that can be realized on the basis of the above embodiments without creative work shall be regarded as falling within the scope of protection of this patent.

Claims

1. A knowledge graph visual question answering method based on dual-process cognitive theory, characterized by comprising the following steps:

(1) using the pre-trained text model BERT and the object detection model Faster-RCNN respectively to extract features from the input text and image; for the text, adding the [CLS] and [SEP] tokens at the beginning and end of each sentence and then feeding the sentence into the BERT model to extract features; for each image, extracting 36 object regions, wherein each region comprises the appearance visual feature of an object and the feature of the object's spatial position in the image;

(2) feeding the extracted image and text features into a dual-stream Transformer network to learn a joint representation of the image and the text, wherein one single-stream Transformer network is used to learn the image-guided question representation and the other single-stream Transformer model is used to learn the question-guided image representation, and finally average-pooling the outputs of the two single-stream Transformer networks and multiplying them to obtain the joint question-image representation;

(3) constructing a fact graph and a semantic graph for each question-image pair, wherein the fact graph is constructed by retrieving candidate supporting facts from an external knowledge base by sentence-level semantic matching, and the semantic graph is constructed by first generating a semantic description of the image and then semantically parsing the generated sentences;

(4) performing evidence aggregation based on graph reasoning, comprising first using two graph reasoning networks with attention mechanisms to aggregate evidence from the fact graph and from the semantic graph respectively, and then using a cross-modal reasoning network to aggregate question-related evidence from the semantic graph into the fact graph;

(5) performing answer prediction, namely computing the dot product of the joint question-image representation and the feature vector of each node in the fact graph to obtain the semantic matching score between each node and the question, and finally feeding the matching score into a Sigmoid layer to predict the corresponding answer.

2. The method according to claim 1, wherein the joint question-image representation learning in step (2) comprises the following:

given a question feature vector C and an image feature vector V, feeding the question feature vector C and the image feature vector V into a dual-stream Transformer network to learn the complex interaction between the question and the image, wherein one single-stream Transformer network is used to learn the image-guided question representation and the other single-stream Transformer network is used to learn the question-guided image representation; in image-guided question representation learning, the question feature vector is used as the query vector and the image feature vector is used as the key and value vectors, and the dependency between the image and the question is computed as follows:

W′_* in formula (1) denotes the parameter matrices to be learned by the model; in question-guided image representation learning, the image feature vector is used as the query vector and the question feature vector is used as the key and value vectors, and the dependency between the two is computed as follows:

W″_* in formula (2) likewise denotes the parameter matrices to be learned by the model; after multiple layers (9 layers in the invention) of joint image-text representation learning, the feature at the [CLS] position of the text sequence is taken as the joint representation of the whole question-image pair.

3. The method according to claim 1, wherein the fact graph construction in step (3) comprises the following steps:

(1) for each question-image pair, first concatenating the input question in turn with the labels of the objects detected in the image to obtain a question-image instance set, and then converting each triple in the external knowledge base into a natural language sentence to obtain a corresponding fact instance set, wherein the conversion is performed by concatenating the head entity, the relation and the tail entity in order;

(2) encoding the question-image instance set and the fact instance set with the pre-trained sentence encoder Universal-sentence-encoder to obtain the corresponding instance representations;

(3) computing the cosine similarity between the feature representation of each instance in the question-image instance set and the feature representation of each fact instance to obtain the corresponding relevance scores, ranking all instances by cosine similarity score, and taking the 10 highest-scoring instances as candidate supporting facts;

(4) constructing the fact graph from the retrieved candidate supporting facts, wherein the nodes of the graph are entities in the knowledge base and the edges are the relations between two entities, and, after the fact graph is constructed, initializing the representations of the nodes and edges with a BERT model, wherein the initial representation of a node or edge is the average of the word embeddings of all words in that node or edge.

4. The method according to claim 1, wherein the evidence aggregation in step (4) is divided into intra-modal evidence aggregation and inter-modal evidence aggregation:

(1) in intra-modal evidence aggregation, the two-level attention proposed by the invention, consisting of node-level attention and path-level attention, is first used for feature selection and aggregation; in node-level attention, the attention score between each node in the graph and the joint question-image representation is computed first; the calculation is as follows:

W and b in the above formula denote the parameter matrix and the bias to be learned by the model; W and b in the following formulas have the same meaning and are not described again; after the attention weight coefficient is obtained, the attention score is multiplied by the initial feature vector of each node in the graph to obtain the question-image-guided node feature representation; path-level attention is concerned with which path is more important to the reasoning process, wherein each path is defined as the path formed by the nodes and edges directly connected to the target node, as follows:

φ_ij = (v_i, r_ij, v_j)    (4)

wherein v_i, r_ij and v_j denote the feature representation of the head node, the feature representation of the relation and the feature representation of the tail node in the fact graph, respectively; after the path representation is obtained, path-level attention is computed as follows:

features are then aggregated from the neighbor nodes by a message-passing mechanism, and the neighbor feature aggregation is given by the following formula:

finally, the features of the neighbor nodes are fused with the features of the target node to update the features of the target node; to prevent the neighbor features from over-updating the node features, a gating mechanism is designed to control the proportion between the neighbor features and the original features of the target node, and the feature update of the target node is given by the following formula:

evidence aggregation in the semantic graph follows the same steps as in the fact graph and is not repeated here;

(2) in inter-modal evidence aggregation, the attention weight coefficient between each node in the fact graph and each node in the semantic graph is first computed under the guidance of the question, and the features of the nodes in the semantic graph are then summed, weighted by the attention weight coefficients, to obtain the relevant features from the semantic graph; the calculation is as follows:

finally, the feature vector aggregated from the semantic graph is fused with the feature vector of the node in the fact graph to obtain the new feature after cross-modal fusion; to prevent the features from the semantic graph from over-updating the features of the nodes in the fact graph, a corresponding gating mechanism is designed to control the proportion of the two modalities' features, and the calculation is as follows:

finally, the updated features are used to infer the answer.