CN115186072A - A Knowledge Graph Visual Question Answering Method Based on Dual-process Cognitive Theory - Google Patents
- Publication number
- CN115186072A (application CN202110374169.3A)
- Authority
- CN
- China
- Prior art keywords
- graph
- picture
- fact
- representation
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/53—Querying
- G06F16/532—Query formulation, e.g. graphical querying
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical field
The invention provides a knowledge graph visual question answering method based on dual-process cognitive theory. It lies at the intersection of natural language processing and computer vision.
Background
Giving an agent the ability to understand the world by analyzing visual and linguistic information has become an active research topic at the intersection of computer vision and natural language processing in recent years. This research has driven many applications, such as visual question answering (VQA), image indexing, and image captioning. Among them, VQA is a challenging task that requires a model to answer arbitrary questions about a given image. Researchers have done a great deal of groundwork on VQA in recent years and have made considerable progress. However, existing VQA methods focus on answering questions from image content alone and cannot answer questions that also require common-sense knowledge.
To advance the field, Wang et al. proposed the Fact-based Visual Question Answering (FVQA) task (P. Wang, Q. Wu, C. Shen, A. Dick and A. van den Hengel, "FVQA: Fact-Based Visual Question Answering," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 10, pp. 2413-2427, 1 Oct. 2018, doi:10.1109/TPAMI.2017.2754246). They also released a new dataset that provides additional supporting facts for each question-answer pair and requires the model to answer by jointly analyzing the image and external knowledge. Their approach first parses the question, maps it onto the knowledge graph, and then uses keyword matching to find the correct answer. This approach has an obvious flaw: it fails when the question mentions no explicit visual concept, or when synonyms or homographs are present.
Later work therefore proposed a retrieval method based on semantic learning, which projects the image-question-visual-concept combination and the candidate facts into a learned embedding space and finds the supporting fact by computing the corresponding distances. However, this method evaluates one fact node at a time, so it is inefficient, and it cannot exploit the structural information of the external knowledge base. To address this, Narasimhan et al. proposed a graph-based inference method that selects the answer by reasoning over the whole graph (Narasimhan M, Schwing A G. Straight to the facts: Learning knowledge base retrieval for factual visual question answering [C] // Proceedings of the European Conference on Computer Vision (ECCV). 2018: 451-468.). The method builds an entity graph in which each node is represented by the concatenation of the entity, image, and question representations, uses a graph convolutional network to aggregate messages and obtain updated node features, and finally predicts the answer from those updated features. Because a question concerns only part of the visual content, this approach inevitably introduces noise.
For a human given a picture and a question, inferring the answer can be divided into two steps: (1) by analyzing the image and the question, the brain quickly identifies what the image shows and which visual information the question is about; (2) starting from the output of the first step, it performs a deeper analysis using the knowledge stored in the brain to find the correct answer. In cognitive science this is known as dual-process theory. According to dual-process cognitive theory, the human brain first perceives external input through an implicit, unconscious process carried out by a system called System 1. That information is then passed to a system called System 2, which performs an explicit, conscious, and controllable reasoning process. This reasoning proceeds sequentially in working memory; it is slower, but it is a capability characteristic of humans as higher-level intelligent agents. From this perspective, the FVQA problem can be addressed with dual-process cognitive theory: System 1 quickly retrieves information from the image and the question, and System 2 combines it with external knowledge for deep reasoning to find the correct answer.
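Inspired by dual-process cognitive theory, the invention proposes a new framework that solves the FVQA problem with a dual-process cognitive system. Specifically, System 1 in the framework is implemented as a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationships between the question and the image. System 1 outputs a joint representation of the image and the question. For System 2, the invention uses a graph neural network (GNN) to reason over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer. System 2 first performs intra-modal evidence selection and aggregation, gathering question-related evidence from each of the two knowledge graphs separately; it then performs cross-modal selection, combining the evidence from the fact graph and the semantic graph to support better answer inference. During reasoning, the invention introduces a dual attention mechanism with node-level and path-level attention, which captures valuable information from key nodes and paths, makes the reasoning process more sound, and further improves the performance of knowledge-graph-based visual question answering.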
Summary of the invention
Inspired by dual-process cognitive theory, the invention proposes a new framework that solves the FVQA problem with a dual-process cognitive system. Specifically, System 1 is implemented as a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationships between the question and the image and outputs a joint question-image representation. System 2 uses a GCN to reason over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer.
The invention achieves the above purpose through the following technical solution:
1. The knowledge graph visual question answering framework based on dual-process cognitive theory is shown in Figure 1. It consists of two parts, a coordinated perception module (System 1) and an explicit reasoning module (System 2). Inference proceeds through the following steps:
(1) Features are extracted from the input text and image with the pretrained text model BERT and the pretrained image model Faster-RCNN, respectively. For each question, the [CLS] and [SEP] tokens are added at the start and end of the question, which is then fed to BERT to extract features. For each picture, 36 object regions are extracted; each region carries the object's visual features and its spatial position features.
(2) The image and text features extracted in step (1) are fed into a dual-stream Transformer network that learns a joint representation of image and text. One stream is a single-stream Transformer that learns the image-guided question representation, and the other is a single-stream Transformer that learns the question-guided image representation. Finally, the outputs of the two streams are average-pooled and then multiplied together to obtain the joint representation.
(3) Construction of the fact graph and the semantic graph. A fact graph and a semantic graph are built for each question-image pair. The fact graph is selected from the external knowledge base by sentence-level semantic matching; the semantic graph is obtained by first generating a caption for the picture and then semantically parsing the generated sentence.
(4) Evidence aggregation based on graph reasoning. First, two attention-based graph reasoning networks aggregate evidence separately from the fact graph and the semantic graph; then a cross-modal reasoning network aggregates question-related evidence from the semantic graph into the fact graph.
(5) Answer prediction. The dot product between the joint question-picture representation and the feature vector of each node in the fact graph gives the semantic relevance score between that node and the question. The relevance scores are then passed to a Sigmoid layer to predict the answer.
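Step (5) can be illustrated with a minimal sketch. The dot product and the sigmoid come from the text above; choosing the highest-scoring node as the answer, and the toy vectors and node names, are assumptions made only for illustration.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_answer(joint, node_features, node_names):
    """Score every fact-graph node against the joint question-picture vector."""
    scores = []
    for feat in node_features:
        relevance = sum(j * f for j, f in zip(joint, feat))  # dot product
        scores.append(sigmoid(relevance))                    # squash to (0, 1)
    # Assumption: the node with the highest score is returned as the answer.
    best = max(range(len(scores)), key=scores.__getitem__)
    return node_names[best], scores

# Hypothetical toy data (3-dimensional instead of 768-dimensional).
joint = [0.5, -1.0, 2.0]
nodes = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [-1.0, 1.0, -1.0]]
names = ["dog", "frisbee", "grass"]
answer, probs = predict_answer(joint, nodes, names)
print(answer, [round(p, 3) for p in probs])
```

Because the sigmoid is applied per node, each score can be read as an independent probability that the node is the answer, which fits a binary cross-entropy training objective; the patent text does not state the loss, so that reading is also an assumption.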
Specifically, in step (1), the pretrained BERT model (the bert-base-uncased version) first initializes the vectors of the words in the input question, giving the word feature vectors $C = [c_0, c_1, \ldots, c_n]$, each of dimension 768. The pretrained Faster-RCNN model then extracts 36 object regions from the input picture; each region carries the object's appearance feature, the object's spatial position feature, and the corresponding label. To capture visual and spatial information together, the invention first projects the appearance feature and the spatial position feature into the same dimension (768 in this invention) and then averages the two to obtain the feature representation of each object.
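The region-feature fusion described above (project both features to a common dimension, then average) can be sketched as follows. The linear projections, the 4-value box encoding, the input dimensions, and the random weights are assumptions; the hidden size is reduced from 768 to 8 to keep the example small.

```python
import random

def linear(x, weight, bias):
    """y = W x + b for one vector x; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, rng):
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def fuse_region(appearance, box, app_proj, box_proj):
    """Project both features to the same size, then average them element-wise."""
    a = linear(appearance, *app_proj)
    s = linear(box, *box_proj)
    return [(ai + si) / 2.0 for ai, si in zip(a, s)]

rng = random.Random(0)
HIDDEN = 8                                   # 768 in the patent
APP_DIM, BOX_DIM, NUM_REGIONS = 16, 4, 36    # APP_DIM and BOX_DIM are assumed
app_proj = init_linear(APP_DIM, HIDDEN, rng)
box_proj = init_linear(BOX_DIM, HIDDEN, rng)

# Stand-in for Faster-RCNN output: (appearance vector, box vector) per region.
regions = [
    ([rng.random() for _ in range(APP_DIM)], [rng.random() for _ in range(BOX_DIM)])
    for _ in range(NUM_REGIONS)
]
V = [fuse_region(a, b, app_proj, box_proj) for a, b in regions]
print(len(V), len(V[0]))
```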
In step (2), a cross-modal Transformer aligns the two modalities and learns the image-attended question representation and the question-attended image representation, as shown in Figure 1.
Step (1) yields the question feature vectors C and the picture feature vectors V, which are fed into a dual-stream Transformer network that learns the complex interactions between question and picture. One single-stream Transformer learns the picture-guided question representation, and the other learns the question-guided picture representation. For the picture-guided question representation, the question feature vectors serve as the query and the picture feature vectors serve as the key and value; the dependency between picture and question is computed as follows:
$W'_{*}$ in formula (1) denotes the parameter matrices to be learned by the model. For the question-guided picture representation, the picture feature vectors serve as the query and the question feature vectors serve as the key and value; the dependency between the two is computed as follows:
$W''_{*}$ in formula (2) denotes the parameter matrices to be learned by the model. After several layers (9 in this invention) of joint image-text encoding, the feature at the [CLS] position of the text sequence is taken as the joint representation of the whole question-picture pair.
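The two attention directions can be sketched with standard scaled dot-product attention. This is the usual Transformer formulation and is an assumption here, since the patent's formulas (1) and (2) are not reproduced in this text; multi-head splitting, residual connections, layer normalization, and feed-forward sublayers are omitted, and identity matrices stand in for the learned parameter matrices.

```python
import math

def matvec(weight, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, contexts, w_q, w_k, w_v):
    """Each query attends over all context vectors (keys and values)."""
    q = [matvec(w_q, x) for x in queries]
    k = [matvec(w_k, x) for x in contexts]
    v = [matvec(w_v, x) for x in contexts]
    scale = math.sqrt(len(k[0]))
    out = []
    for qi in q:
        weights = softmax([dot(qi, kj) / scale for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))])
    return out

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]                 # stand-in for learned matrices
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]            # 3 toy question tokens
V = [[0.5, 0.5], [1.0, -1.0]]                       # 2 toy picture regions

# Picture-guided question representation: question = query, picture = key/value.
question_given_image = cross_attention(C, V, IDENTITY, IDENTITY, IDENTITY)
# Question-guided picture representation: picture = query, question = key/value.
image_given_question = cross_attention(V, C, IDENTITY, IDENTITY, IDENTITY)
print(len(question_given_image), len(image_given_question))
```

The output of each direction has one vector per query, so the question stream keeps its [CLS] position and the picture stream keeps its 36 regions from layer to layer.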
Construction of the fact graph and the semantic graph in step (3). First, for each question-picture pair, the input question is concatenated in turn with the label of each object detected in the picture, giving a set of question-picture instances; each triple in the external knowledge base is converted into a natural-language sentence (by concatenating the head entity, the relation, and the tail entity in order), giving the corresponding set of fact instances. Second, the question-picture instances and the fact instances are encoded with the pretrained sentence encoder Universal-sentence-encoder to obtain instance representations; the cosine similarity between the representation of each question-picture instance and that of each fact instance gives a relevance score, and after sorting all instances by cosine similarity, the 10 highest-scoring instances are kept as candidate supporting facts. Third, a fact graph is built from the retrieved candidate supporting facts: the nodes are entities in the knowledge base and the edges are the relations between pairs of entities. Once the fact graph is built, BERT initializes the node and edge representations, where each node or edge is initialized as the average of the word embeddings of all the words it contains.
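The retrieval and graph-building steps can be sketched as below. A toy bag-of-words vector stands in for the Universal Sentence Encoder, and scoring a fact by its best match over all question-picture instances is an assumption, since the text does not say how scores are combined across instances. The question, labels, and triples are hypothetical.

```python
import math
from collections import Counter

def triple_to_sentence(head, relation, tail):
    """Concatenate head entity, relation, and tail entity in order."""
    return f"{head} {relation} {tail}"

def embed(sentence):
    """Toy bag-of-words embedding; a stand-in for a real sentence encoder."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return num / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_facts(question, labels, triples, top_k=10):
    # One question-picture instance per detected object label.
    instances = [embed(f"{question} {label}") for label in labels]
    scored = []
    for triple in triples:
        fact_vec = embed(triple_to_sentence(*triple))
        # Assumption: a fact's score is its best match over all instances.
        scored.append((max(cosine(inst, fact_vec) for inst in instances), triple))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [triple for _, triple in scored[:top_k]]

def build_fact_graph(facts):
    """Nodes are entities; edges are (head, relation, tail) relations."""
    nodes = sorted({entity for h, _, t in facts for entity in (h, t)})
    edges = [(h, r, t) for h, r, t in facts]
    return nodes, edges

question = "what can the animal in the picture catch"
labels = ["dog", "frisbee"]
triples = [
    ("dog", "is capable of", "catching a frisbee"),
    ("cat", "is a", "pet"),
    ("car", "is used for", "driving"),
]
facts = retrieve_facts(question, labels, triples, top_k=2)
nodes, edges = build_fact_graph(facts)
print(facts[0], nodes)
```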
The graph-reasoning-based evidence aggregation in step (4) consists of intra-modal evidence aggregation and inter-modal evidence aggregation:
Intra-modal evidence aggregation uses the two-level attention mechanism proposed by the invention, node-level attention and path-level attention, for feature selection and aggregation. Node-level attention first computes an attention score between each node in the graph and the joint question-picture representation, as follows:
Each node's initial feature vector is then multiplied by its attention score, giving the picture-question-guided node feature vector. Path-level attention focuses on which paths matter most for reasoning, where each path consists of the nodes and edges directly connected to the target node and is defined as follows:
$$\phi_{ij} = (v_i, r_{ij}, v_j) \tag{4}$$
where $v_i$, $r_{ij}$, and $v_j$ denote the representations of the head node, the relation, and the tail node in the fact graph, respectively. Given the path representation, path-level attention is computed as follows:
Features are then aggregated from the neighbor nodes through a message-passing network; the neighbor feature aggregation is computed as follows:
Finally, the aggregated neighbor features are fused with the target node's features to update the target node. To prevent the neighbor features from overwriting the node's initial features, a gating mechanism controls the proportion of neighbor features to the target node's original features. The target node update is computed as follows:
Evidence selection in the semantic graph follows the same steps as in the fact graph and is not repeated here.
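The intra-modal procedure (node-level attention, path-level attention, neighbor aggregation, gated update) can be sketched as one pass over a small graph. The patent's formulas for these steps are not reproduced in this text, so every concrete choice below is an assumption: dot-product scoring with softmax, summing the three parts of a path into one vector, and a scalar sigmoid gate over the concatenated node and message.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_attention(q, nodes):
    """Scale each node by its attention score against the joint representation q."""
    weights = softmax([dot(q, v) for v in nodes])
    return [[w * x for x in v] for w, v in zip(weights, nodes)]

def aggregate(q, nodes, edges, gate_w):
    """edges maps a target node index to a list of (neighbor index, relation vector)."""
    guided = node_attention(q, nodes)
    updated = []
    for i, v_i in enumerate(guided):
        paths = edges.get(i, [])
        if not paths:                      # isolated node: nothing to aggregate
            updated.append(v_i)
            continue
        # Path phi_ij = (v_i, r_ij, v_j), merged by summation (assumed).
        path_vecs = [[a + r + b for a, r, b in zip(v_i, rel, guided[j])] for j, rel in paths]
        beta = softmax([dot(q, p) for p in path_vecs])        # path-level attention
        message = [sum(b * p[d] for b, p in zip(beta, path_vecs)) for d in range(len(v_i))]
        # v_i + message is list concatenation, so gate_w has 2 * dim entries.
        gate = sigmoid(dot(gate_w, v_i + message))
        updated.append([gate * m + (1.0 - gate) * x for m, x in zip(message, v_i)])
    return updated

q = [1.0, 0.0]                                           # toy joint representation
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = {0: [(1, [0.0, 0.0]), (2, [0.5, 0.5])]}          # node 0 has two neighbors
gate_w = [0.0, 0.0, 0.0, 0.0]                            # untrained gate gives 0.5
updated = aggregate(q, nodes, edges, gate_w)
print(updated)
```

The gate interpolates between the aggregated message and the node's own feature, which is one simple way to keep neighbors from overwriting the original node information.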
Inter-modal evidence aggregation. Guided by the question, the attention weight between each node in the fact graph and each node in the semantic graph is computed first; the nodes of the semantic graph are then summed with these weights to obtain the relevant semantic-graph features. The computation is as follows:
Finally, the semantic-graph feature vector is fused with the original feature vector of the fact-graph node to obtain the updated feature after cross-modal fusion. As before, to prevent the semantic-graph features from overwriting the fact node features, a gating mechanism controls the proportion of the two modalities. The computation is as follows:
The updated features are then used to infer the answer.
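A minimal sketch of the inter-modal step follows. The patent's formulas are not reproduced in this text, so the question guidance (element-wise product of the fact node with the question vector), the dot-product attention, and the scalar sigmoid gate are all assumptions chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_modal_update(q, fact_nodes, sem_nodes, gate_w):
    """Update each fact node with an attention-weighted sum of semantic nodes."""
    updated = []
    for f in fact_nodes:
        guided = [fi * qi for fi, qi in zip(f, q)]            # question-guided query (assumed)
        gamma = softmax([dot(guided, s) for s in sem_nodes])  # weights over semantic nodes
        context = [sum(g * s[d] for g, s in zip(gamma, sem_nodes)) for d in range(len(f))]
        # f + context is list concatenation, so gate_w has 2 * dim entries.
        gate = sigmoid(dot(gate_w, f + context))
        updated.append([gate * c + (1.0 - gate) * x for c, x in zip(context, f)])
    return updated

q = [1.0, 1.0]
fact_nodes = [[1.0, 0.0]]
sem_nodes = [[0.0, 1.0]]
gate_w = [0.0, 0.0, 0.0, 0.0]        # untrained gate gives an even 0.5 mix
fused = cross_modal_update(q, fact_nodes, sem_nodes, gate_w)
print(fused)
```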
Brief description of the drawings
Figure 1 shows the main framework of the network model proposed by the invention.
Figure 2 shows the structure of the cross-modal Transformer network.
Detailed description
The invention is further described below with reference to the drawings:
Figure 1 shows the structure of the whole network, which consists of two parts: the coordinated perception module (System 1) and the explicit reasoning module (System 2). System 1 is implemented as a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationships between the question and the image, and it outputs a joint representation of the image and the question. For System 2, the invention uses a graph neural network (GNN) to reason over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer.
Figure 2 shows the cross-modal Transformer framework. The extracted picture and text features are fed into a dual-stream Transformer network that learns a joint representation of picture and text. One single-stream Transformer learns the picture-guided question representation, and the other learns the question-guided picture representation. Finally, the outputs of the two streams are average-pooled and then multiplied together to obtain the joint representation.
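The final fusion described for Figure 2 can be sketched in a few lines. Average pooling over the sequence and a multiplication come from the text; treating the multiplication as element-wise is an assumption, and the toy vectors are hypothetical.

```python
def mean_pool(sequence):
    """Average a list of equal-length vectors along the sequence axis."""
    n = len(sequence)
    return [sum(token[d] for token in sequence) / n for d in range(len(sequence[0]))]

def joint_representation(question_stream, image_stream):
    q = mean_pool(question_stream)     # pooled picture-guided question features
    v = mean_pool(image_stream)        # pooled question-guided picture features
    # Assumption: "multiplied" means an element-wise (Hadamard) product.
    return [a * b for a, b in zip(q, v)]

question_stream = [[1.0, 2.0], [3.0, 4.0]]
image_stream = [[2.0, 0.0], [2.0, 2.0]]
joint = joint_representation(question_stream, image_stream)
print(joint)
```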
Tables 1 and 2 give the experimental results of the invention on the public datasets FVQA and OK-VQA. The experiments show that the proposed model achieves the best overall F1 score compared with the best existing models.
Table 1. Experimental comparison of the network model of the invention with other existing models on the FVQA dataset
Table 2. Experimental comparison of the network model of the invention with other existing models on the OK-VQA dataset
The above embodiments are only preferred embodiments of the invention and do not limit its technical solution. Any technical solution that can be realized on the basis of the above embodiments without creative effort shall be regarded as falling within the scope of protection of this patent.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110374169.3A CN115186072B (en) | 2021-04-07 | 2021-04-07 | A knowledge graph visual question answering method based on dual-process cognitive theory |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110374169.3A CN115186072B (en) | 2021-04-07 | 2021-04-07 | A knowledge graph visual question answering method based on dual-process cognitive theory |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115186072A (en) | 2022-10-14 |
CN115186072B (en) | 2025-05-09 |
Family
ID=83512224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110374169.3A Active CN115186072B (en) | 2021-04-07 | 2021-04-07 | A knowledge graph visual question answering method based on dual-process cognitive theory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115186072B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116150404A (en) * | 2023-03-03 | 2023-05-23 | 成都康赛信息技术有限公司 | A multi-modal knowledge map construction method for educational resources based on federated learning |
CN116401390A (en) * | 2023-05-19 | 2023-07-07 | 中国科学技术大学 | Visual question-answering processing method, system, storage medium and electronic equipment |
CN116976438A (en) * | 2023-06-05 | 2023-10-31 | 山东交通学院 | Visual question-answering double-flow attention multi-hop reasoning method and system based on knowledge base |
CN117892140A (en) * | 2024-03-15 | 2024-04-16 | 浪潮电子信息产业股份有限公司 | Visual question answering and model training method, device, electronic device, and storage medium |
CN119295535A (en) * | 2024-09-13 | 2025-01-10 | 中国电子科技集团公司第五十四研究所 | A semantic point set matching method based on triangle-based companion graph construction |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170132498A1 (en) * | 2015-11-11 | 2017-05-11 | Adobe Systems Incorporated | Structured Knowledge Modeling, Extraction and Localization from Images |
CN110809784A (en) * | 2017-09-27 | 2020-02-18 | 谷歌有限责任公司 | End-to-end network model for high-resolution image segmentation |
Non-Patent Citations (1)
Title |
---|
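Chen Xi; Chen Huajun; Zhang Wen: "Rule-enhanced knowledge graph representation learning method" (规则增强的知识图谱表示学习方法), Technology Intelligence Engineering (情报工程), no. 01, 15 February 2017 (2017-02-15), pages 26 - 34 * |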
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116150404A (en) * | 2023-03-03 | 2023-05-23 | 成都康赛信息技术有限公司 | A multi-modal knowledge map construction method for educational resources based on federated learning |
CN116401390A (en) * | 2023-05-19 | 2023-07-07 | 中国科学技术大学 | Visual question-answering processing method, system, storage medium and electronic equipment |
CN116401390B (en) * | 2023-05-19 | 2023-10-20 | 中国科学技术大学 | Visual question-answering processing method, system, storage medium and electronic equipment |
CN116976438A (en) * | 2023-06-05 | 2023-10-31 | 山东交通学院 | Visual question-answering double-flow attention multi-hop reasoning method and system based on knowledge base |
CN117892140A (en) * | 2024-03-15 | 2024-04-16 | 浪潮电子信息产业股份有限公司 | Visual question answering and model training method, device, electronic device, and storage medium |
CN117892140B (en) * | 2024-03-15 | 2024-05-31 | 浪潮电子信息产业股份有限公司 | Visual question and answer and model training method and device thereof, electronic equipment and storage medium |
CN119295535A (en) * | 2024-09-13 | 2025-01-10 | 中国电子科技集团公司第五十四研究所 | A semantic point set matching method based on triangle-based companion graph construction |
Also Published As
Publication number | Publication date |
---|---|
CN115186072B (en) | 2025-05-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108509519B (en) | General Knowledge Graph Enhanced Question Answering Interaction System and Method Based on Deep Learning | |
CN112200317B (en) | Multi-mode knowledge graph construction method | |
CN118779438B (en) | Data intelligent question-answering method and system integrating domain knowledge | |
CN109783666B (en) | Image scene graph generation method based on iterative refinement | |
US20190303768A1 (en) | Community Question Answering-Based Article Recommendation Method, System, and User Device | |
CN112287170B (en) | Short video classification method and device based on multi-mode joint learning | |
CN115186072B (en) | A knowledge graph visual question answering method based on dual-process cognitive theory | |
CN111782769B (en) | Intelligent knowledge graph question-answering method based on relation prediction | |
CN117131933B (en) | A method for establishing a multimodal knowledge graph and its application | |
CN115618045A (en) | A visual question answering method, device and storage medium | |
CN111783903B (en) | Text processing method, text model processing method and device and computer equipment | |
CN111488467A (en) | Construction method, device, storage medium and computer equipment of geographic knowledge graph | |
CN112182145B (en) | Text similarity determination method, device, equipment and storage medium | |
CN112115253B (en) | Depth text ordering method based on multi-view attention mechanism | |
CN117591663A (en) | A large model prompt generation method based on knowledge graph | |
CN113807307B (en) | Multi-mode joint learning method for video multi-behavior recognition | |
CN111460121A (en) | Visual-semantic dialogue method and system | |
CN112613451B (en) | Modeling method of cross-modal text picture retrieval model | |
Phan et al. | Building a Vietnamese question answering system based on knowledge graph and distributed CNN | |
CN119005308B (en) | A multimodal exercise representation method based on knowledge graph | |
US20250299072A1 (en) | Data processing method and apparatus, device, and readable storage medium | |
CN114239730A (en) | A Cross-modal Retrieval Method Based on Neighbor Ranking Relation | |
CN112417170A (en) | Relation linking method for incomplete knowledge graph | |
CN113010712B (en) | Visual question answering method based on multi-graph fusion | |
CN114328943A (en) | Question answering method, device, equipment and storage medium based on knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||