CN116739056A

CN116739056A - Visual dialogue generation method and device based on counterfactual common sense causal reasoning

Info

Publication number: CN116739056A
Application number: CN202310685976.6A
Authority: CN
Inventors: 刘安安; 黄晨曦; 徐宁; 张勇东
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2023-06-09
Filing date: 2023-06-09
Publication date: 2023-09-12

Abstract

The invention discloses a visual dialogue generating method and device based on the causal reasoning of the counterfacts common sense, wherein the method comprises the following steps: introducing common sense into a visual dialogue task, constructing a visual dialogue causal graph based on common sense fusion, and calculating the total effect of input features on answer prediction; constructing a corresponding anti-facts causal graph based on the visual dialogue causal graph; for the anti-facts causal graph, estimating the influence of harmful deviation generated by the introduced common sense according to the natural direct effect on answer prediction, and removing the influence from the total effect; training a visual dialogue model by adopting model integration, and obtaining an answer prediction result by taking cross entropy and KL divergence loss as training targets; integrating the image features, the dialogue history features, the common sense features and the current problem features, sending the integrated image features, the dialogue history features, the common sense features and the current problem features into a decoder, minimizing a loss function, optimizing network parameters, and finally providing real disaster environment information for disaster rescue sites. The device comprises: a processor and a memory.

Description

Visual dialogue generation method and device based on counterfactual common sense causal reasoning

技术领域Technical field

本发明涉及视觉对话生成领域，尤其涉及一种基于反事实常识因果推理的视觉对话生成方法及装置。The present invention relates to the field of visual dialogue generation, and in particular to a visual dialogue generation method and device based on counterfactual common sense causal reasoning.

背景技术Background technique

随着计算机视觉技术和自然语言处理技术的飞速发展，视觉和语言交互的多模态领域受到了广泛关注。从图像描述^[1]、场景图生成^[2]、视觉问答^[3]，再到视觉对话^[4]，研究者们致力于提高计算机与人类进行持续性交互的能力。其中，视觉对话任务一直是多模态领域的研究重点，它需要智能体不断地根据已有的图像信息和历史问答中所蕴含的文本信息来推理出当前问题的答案，人类与智能体之间的对话将持续多轮。因此，视觉对话任务需要智能体有较强的人机交互能力，基于此，视觉对话在帮助视障人群、灾害救援任务等领域有很大的应用价值。With the rapid development of computer vision technology and natural language processing technology, the multimodal field of visual and language interaction has received widespread attention. From image description ^[1] , scene graph generation ^[2] , visual question answering ^[3] , to visual dialogue ^[4] , researchers are committed to improving the ability of computers to continuously interact with humans. Among them, the visual dialogue task has always been the focus of research in the multi-modal field. It requires the agent to continuously infer the answer to the current question based on the existing image information and the text information contained in historical question and answer. The relationship between humans and agents The dialogue will last for multiple rounds. Therefore, visual dialogue tasks require agents to have strong human-computer interaction capabilities. Based on this, visual dialogue has great application value in fields such as helping the visually impaired and disaster rescue missions.

近年来，在视觉对话领域涌现出了许多优秀的工作。例如：基于循环神经网络的方法^[4]使用循环神经网络及其变体来编码视觉-语言的多模态特征进而得到答案，基于注意力机制的方法^[5][6]主要使用注意力机制来更精细化的提取回答当前问题所需要的图像信息、对话历史中的上下文信息等，基于图结构的方法^[7][8]主要使用图结构来编码图像、对话历史或者当前问题，赋予智能体更强的推理能力来生成答案。它们都是基于已有的图像信息和历史问答中蕴含的文本信息来推理出答案。但是在一些更为复杂的对话场景中，仅仅利用这些信息是远远不够的，智能体还需要像人类一样利用外部的常识知识来辅助答案生成，这往往被研究者们所忽略，从而限制了智能体人机交互能力的提高。例如：在火灾救援现场，智能体可以先一步进入火灾地点拍照并实时地回答救援人员的问题。当救援人员问到“火灾现场有煤气罐吗”的问题时，若智能体不具备“煤气罐在厨房里”等相关常识，将不会关注厨房区域，进而不能正确的回答救援人员的问题。In recent years, many excellent works have emerged in the field of visual dialogue. For example: methods based on recurrent neural networks ^[4] use recurrent neural networks and their variants to encode visual-linguistic multi-modal features to obtain answers, and methods based on attention mechanisms ^[5][6] mainly use attention mechanisms. To more precisely extract the image information needed to answer current questions, contextual information in dialogue history, etc., graph structure-based methods ^[7][8] mainly use graph structures to encode images, dialogue history or current questions, giving intelligence stronger reasoning ability to generate answers. They all infer answers based on existing image information and text information contained in historical questions and answers. However, in some more complex dialogue scenarios, it is not enough to just use this information. The agent also needs to use external common sense knowledge like humans to assist in answer generation. This is often ignored by researchers, thus limiting Improvement of intelligent human-computer interaction capabilities. For example: at a fire rescue scene, the agent can enter the fire location to take photos and answer questions from rescuers in real time. When the rescuers ask the question "Is there a gas tank at the fire scene?", if the intelligent agent does not have relevant common sense such as "The gas tank is in the kitchen", it will not pay attention to the kitchen area, and thus cannot answer the rescuers' questions correctly.

目前仅有基于知识的结构化网络方法(SKANet)^[9]以及基于多结构的常识知识推理方法(RMK)^[10]等将外部常识知识引入了视觉对话任务，并取得了一系列进展。它们都是从外部常识库中提取常识知识，再经编码后融入多模态信息进而得到答案。但是上述框架都是基于这样一个潜在的假设：这些常识知识总是会对答案生成产生正面影响。尽管它们通过计算图像描述与常识知识的语义相似度、通过图嵌入算法(TransE算法)构造常识知识图谱进而计算节点之间的余弦距离等，过滤出与当前对话不相关的常识，然而，常识中蕴含的一些“有害偏差”仍然没有被去除，它将会对答案生成产生负面影响。例如：用于检索常识知识的图像标签或者图像描述中的关键词在常识知识中出现频率较高，其可能会干扰智能体生成答案甚至使智能体生成含有这些高频词的错误答案。例如：智能体若具备了“煤气罐”相关的常识，当回答救援人员“火灾现场有煤气罐吗”的问题时，智能体可能会过于关注含有高频词“煤气罐”的错误答案，从而不能给救援人员提供正确的信息，增加了救援难度。At present, only the knowledge-based structured network method (SKANet) ^[9] and the multi-structure-based common sense knowledge reasoning method (RMK) ^[10] have introduced external common sense knowledge into visual dialogue tasks, and have made a series of progress. They all extract common sense knowledge from external common sense databases, and then integrate multi-modal information after encoding to obtain answers. But the above frameworks are all based on the underlying assumption that this common sense knowledge will always have a positive impact on answer generation. Although they filter out common sense that is not relevant to the current conversation by calculating the semantic similarity between image descriptions and common sense knowledge, constructing a common sense knowledge graph through the graph embedding algorithm (TransE algorithm), and then calculating the cosine distance between nodes, etc., however, in common sense Some of the "harmful bias" contained in the implication has still not been removed and will have a negative impact on answer generation. For example, image tags or keywords in image descriptions used to retrieve common sense knowledge appear more frequently in common sense knowledge, which may interfere with the agent's ability to generate answers or even cause the agent to generate incorrect answers containing these high-frequency words. For example, if the agent has common sense related to "gas tank", when answering the rescuer's question "Is there a gas tank at the fire scene?", the agent may pay too much attention to the wrong answer containing the high-frequency word "gas tank", thus causing the problem. Failure to provide correct information to rescuers increases the difficulty of rescue.

基于此研究现状，目前面临的挑战主要有以下三个方面：(1)如何更加有效的选取并利用与图像和当前对话相关的常识知识来辅助答案生成；(2)如何量化常识中蕴含的“有害偏差”对答案生成的负面影响；(3)如何从总体上去除常识对答案生成的负面影响进而只保留常识对答案生成的正面影响，从而提高智能体在灾害救援现场时的答案预测精度，来辅助救援人员更好的了解灾害现场环境、制定救援计划、展开救援工作等。Based on this research status, the challenges currently faced mainly include the following three aspects: (1) How to more effectively select and use common sense knowledge related to images and current conversations to assist in answer generation; (2) How to quantify the "common sense" contained in common sense The negative impact of "harmful bias" on answer generation; (3) How to remove the negative impact of common sense on answer generation as a whole and retain only the positive impact of common sense on answer generation, thereby improving the answer prediction accuracy of the agent at the disaster rescue site, To assist rescue personnel to better understand the disaster scene environment, formulate rescue plans, and carry out rescue work, etc.

发明内容Contents of the invention

本发明提供了一种基于反事实常识因果推理的视觉对话生成方法及装置，本发明构建基于常识融合的视觉对话事实因果图，从基于该因果图演绎的视觉对话生成的答案预测分数中减去常识对答案生成的自然直接效应，此过程保留了常识对答案生成的正面影响的同时，去除了常识中蕴含的“有害偏差”对答案生成的负面影响，从而提高智能体在灾害救援现场的人机交互能力，为救援人员提供更真实详尽的灾害环境信息，详见下文描述：The present invention provides a visual dialogue generation method and device based on counterfactual common sense causal reasoning. The invention constructs a visual dialogue factual causal graph based on common sense fusion, and subtracts it from the answer prediction score generated by the visual dialogue based on the causal graph deduction. Common sense has a natural direct effect on answer generation. This process retains the positive impact of common sense on answer generation while removing the negative impact of the "harmful bias" contained in common sense on answer generation, thereby improving the performance of the agent at the disaster rescue site. Computer interaction capabilities provide rescuers with more realistic and detailed disaster environment information, as described below:

一种基于反事实常识因果推理的视觉对话生成方法，所述方法包括：A visual dialogue generation method based on counterfactual common sense causal reasoning, which method includes:

对提取出的常识三元组，构建常识子图，使用注意图卷积网络对子图进行编码得到常识特征；For the extracted common sense triplet, a common sense subgraph is constructed, and the attention graph convolution network is used to encode the subgraph to obtain common sense features;

将常识引入视觉对话任务，构建基于常识融合的视觉对话因果图，计算输入特征对答案预测的总效应；基于视觉对话因果图，构建其对应的反事实因果图；Introduce common sense into the visual dialogue task, construct a visual dialogue causal graph based on common sense fusion, and calculate the total effect of input features on answer prediction; based on the visual dialogue causal graph, construct its corresponding counterfactual causal graph;

针对反事实因果图，根据自然直接效应估计引入的常识产生的有害偏差对答案预测的影响，并将其从总效应中去除；For counterfactual causal diagrams, the impact of harmful biases introduced by common sense on answer predictions is estimated based on natural direct effects and removed from the total effect;

采用模型集成训练视觉对话模型，以交叉熵和KL散度损失作为训练目标得到答案预测结果；Model ensemble is used to train the visual dialogue model, and cross-entropy and KL divergence loss are used as training targets to obtain answer prediction results;

将图像特征、对话历史特征、常识特征及当前问题特征进行整合，送入解码器中，最小化损失函数，优化网络参数，最后为灾害救援现场提供真实的灾害环境信息。Integrate image features, conversation history features, common sense features and current problem features and send them to the decoder to minimize the loss function, optimize network parameters, and finally provide real disaster environment information for disaster relief sites.

其中，所述常识三元组为：提取数据库的训练集、验证集和测试集样本中与每个视觉对话单元相关的常识；提取出的常识三元组具体操作为：Among them, the common sense triplet is: extract the common sense related to each visual dialogue unit in the training set, verification set and test set samples of the database; the specific operation of the extracted common sense triplet is:

用Faster R-CNN框架检测出物体标签，并根据每个标签对应的置信分数选出分数最高的前若干个物体标签；同理针对图像描述选出分数最高的前若干个关键词；Use the Faster R-CNN framework to detect object labels, and select the top several object labels with the highest scores based on the confidence scores corresponding to each label; similarly, select the top few keywords with the highest scores for image descriptions;

其中，所述基于视觉对话因果图，构建其对应的反事实因果图为：Among them, based on the visual dialogue causal diagram, the corresponding counterfactual causal diagram constructed is:

通过将I、Q、H、C分别赋予空值，即I＝i*，Q＝q*，H＝h*和C＝c*，进而有K＝k*，阻断I、Q、H和K对答案A预测产生的影响；By assigning null values to I, Q, H, and C respectively, that is, I=i*, Q=q*, H=h*, and C=c*, and then K=k*, blocking I, Q, H, and The impact of K on the prediction of answer A;

常识C在反事实的世界中一次可同时被赋予两个值，即C＝c*和C＝c；前者用以得到K＝k*，而后者与答案A直接连接来评估常识C对答案预测A的自然直接效应。Common sense C can be assigned two values at the same time in the counterfactual world, namely C=c* and C=c; the former is used to obtain K=k*, and the latter is directly connected to the answer A to evaluate the answer prediction of common sense C A natural direct effect.

第二方面、一种基于反事实常识因果推理的视觉对话生成装置，所述装置包括：处理器和存储器，所述存储器中存储有程序指令，所述处理器调用存储器中存储的程序指令以使装置执行第一方面中的任一项所述的方法步骤。The second aspect is a visual dialogue generation device based on counterfactual common sense causal reasoning. The device includes: a processor and a memory. Program instructions are stored in the memory. The processor calls the program instructions stored in the memory to cause The apparatus performs the method steps of any one of the first aspects.

第三方面、一种计算机可读存储介质，所述计算机可读存储介质存储有计算机程序，所述计算机程序包括程序指令，所述程序指令被处理器执行时使所述处理器执行第一方面中的任一项所述的方法步骤。In a third aspect, a computer-readable storage medium stores a computer program. The computer program includes program instructions. When executed by a processor, the program instructions cause the processor to execute the first aspect. The method steps described in any one of.

本发明提供的技术方案的有益效果是：The beneficial effects of the technical solution provided by the present invention are:

1、本发明将外部常识知识引入现有的视觉对话任务中，智能体不仅能根据已有的图像信息和对话历史上下文信息进行推理，还能利用来自常识知识中的信息来生成答案；现有的视觉对话方法往往忽略了常识知识在视觉对话生成过程中的重要作用；本发明关注智能体在推理答案时所需的信息来源，常识知识的引入使智能体人机交互能力不断提高，有效提升了答案生成精度；1. The present invention introduces external common sense knowledge into existing visual dialogue tasks. The agent can not only reason based on existing image information and conversation history context information, but also use information from common sense knowledge to generate answers; existing Visual dialogue methods often ignore the important role of common sense knowledge in the process of visual dialogue generation; this invention focuses on the sources of information required by the agent when reasoning about answers. The introduction of common sense knowledge enables the agent's human-computer interaction capabilities to continuously improve and effectively improve Improved answer generation accuracy;

2、本发明在将常识知识引入视觉对话任务的同时，考虑了常识中蕴含的“有害偏差”对答案生成的负面影响，构建了基于常识融合的反事实因果图，将该负面影响量化为常识对答案生成的自然直接效应；而现有的基于常识融合的视觉对话生成方法仅仅将常识引入视觉对话任务，使智能体在生成答案时易受到“有害偏差”的干扰；本发明关注常识知识在视觉对话任务中对答案生成的影响，利用因果理论中的自然直接效应来量化常识中蕴含的“有害偏差”对答案生成的负面影响，从而进行去除，来进一步提升答案生成精度；2. While introducing common sense knowledge into visual dialogue tasks, the present invention considers the negative impact of "harmful bias" contained in common sense on answer generation, constructs a counterfactual causal diagram based on common sense fusion, and quantifies the negative impact into common sense It has a natural direct effect on answer generation; while the existing visual dialogue generation method based on common sense fusion only introduces common sense into the visual dialogue task, making the agent susceptible to the interference of "harmful bias" when generating answers; the present invention focuses on the use of common sense knowledge in Regarding the impact on answer generation in the visual dialogue task, we use the natural direct effect in the causal theory to quantify the negative impact of the "harmful bias" contained in common sense on answer generation, and then remove it to further improve the accuracy of answer generation;

3、本发明构建了基于常识融合的视觉对话事实因果图，现有的基于常识融合的视觉对话框架往往基于该图演绎，本发明基于此图得出常识对答案生成的总效应，从总效应中去除常识对答案生成的自然直接效应，同时本发明还通过最小化KL-散度损失函数使自然直接效应与总效应的答案概率分布相似，从而更好的去除常识中蕴含的“有害偏差”对答案生成的负面影响，保留常识对答案生成的正面影响，现有的方法往往忽略了这一点；本发明关注去除常识中蕴含的“有害偏差”对答案生成的负面影响，提高智能体的人机交互能力；3. The present invention constructs a visual dialogue fact-cause and effect diagram based on common sense fusion. Existing visual dialogue frameworks based on common sense fusion are often deduced based on this diagram. Based on this diagram, the present invention obtains the total effect of common sense on answer generation. From the total effect The natural direct effect of common sense on answer generation is removed. At the same time, the present invention also minimizes the KL-divergence loss function to make the answer probability distribution of the natural direct effect and the total effect similar, thereby better removing the "harmful bias" contained in common sense. Existing methods often ignore the negative impact on answer generation and retain the positive impact of common sense on answer generation; the present invention focuses on removing the negative impact of "harmful bias" contained in common sense on answer generation, and improves the human intelligence of the agent. Computer interaction capabilities;

4、本发明方案借助因果理论中的反事实分析方法去除常识中蕴含的“有害偏差”对答案生成的负面影响，具有一定的泛化性，可以适用于大部分的视觉对话模型；4. The solution of the present invention uses the counterfactual analysis method in the causal theory to remove the negative impact of "harmful bias" contained in common sense on answer generation. It has a certain degree of generalization and can be applied to most visual dialogue models;

5、将本发明的生成结果应用于灾害救援任务中，提高了智能体在实时地回答救援人员问题时的答案预测精度，使救援人员能够充分了解灾害现场的环境信息，从而更好的制定救援计划，降低了救援难度，避免了救援伤亡。5. Apply the generated results of the present invention to disaster rescue tasks, which improves the answer prediction accuracy of the intelligent agent when answering rescuers' questions in real time, allowing rescuers to fully understand the environmental information of the disaster site, thereby better formulating rescue plans. The plan reduces the difficulty of rescue and avoids rescue casualties.

附图说明Description of drawings

图1为一种基于反事实常识因果推理的视觉对话生成方法的流程图；Figure 1 is a flow chart of a visual dialogue generation method based on counterfactual common sense causal reasoning;

图2为因果理论与反事实分析的示意图；Figure 2 is a schematic diagram of causal theory and counterfactual analysis;

图3为基于常识融合的视觉对话事实因果图；Figure 3 is a visual dialogue fact-effect diagram based on common sense fusion;

图4为基于常识融合的视觉对话反事实因果图。Figure 4 shows the counterfactual causal diagram of visual dialogue based on common sense fusion.

图5为基于反事实常识因果推理的视觉对话生成方法模型图。Figure 5 is a model diagram of a visual dialogue generation method based on counterfactual common sense causal reasoning.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚，下面对本发明实施方式作进一步地详细描述。In order to make the purpose, technical solutions and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below.

实施例1Example 1

为了解决目前的视觉对话在融合常识知识时忽略了由常识带来的有害偏差的问题，从而提高视觉对话答案预测的精度，常用的指标为召回率和归一化折损累计增益，参见图1，本发明实施例提供了一种基于反事实常识因果推理的视觉对话生成方法，该方法包括以下步骤：In order to solve the problem that current visual dialogue ignores the harmful bias caused by common sense when integrating common sense knowledge, thereby improving the accuracy of visual dialogue answer prediction, commonly used indicators are recall rate and normalized loss cumulative gain, see Figure 1 , embodiments of the present invention provide a visual dialogue generation method based on counterfactual common sense causal reasoning. The method includes the following steps:

101：对提取出的常识三元组，构建常识子图，使用注意图卷积网络对子图进行编码得到常识特征；101: Construct a common sense subgraph for the extracted common sense triplet, and use the attention graph convolution network to encode the subgraph to obtain common sense features;

102：将常识引入视觉对话任务，构建基于常识融合的视觉对话因果图，计算输入特征对答案预测的总效应；基于视觉对话因果图，构建其对应的反事实因果图；102: Introduce common sense into the visual dialogue task, construct a visual dialogue causal graph based on common sense fusion, and calculate the total effect of input features on answer prediction; based on the visual dialogue causal graph, construct its corresponding counterfactual causal graph;

103：针对反事实因果图，根据自然直接效应估计引入的常识产生的有害偏差对答案预测的影响，并将其从总效应中去除；103: For counterfactual causal diagrams, estimate the impact of harmful biases introduced by common sense on answer predictions based on natural direct effects and remove them from the total effect;

104：采用模型集成训练视觉对话模型，以交叉熵和KL散度损失作为训练目标得到答案预测结果；104: Use model integration to train the visual dialogue model, and use cross-entropy and KL divergence loss as training targets to obtain answer prediction results;

105：将图像特征、对话历史特征、常识特征及当前问题特征进行整合，送入解码器中，最小化损失函数，优化网络参数，最后为灾害救援现场提供真实的灾害环境信息。105: Integrate image features, conversation history features, common sense features and current problem features, and send them to the decoder to minimize the loss function, optimize network parameters, and finally provide real disaster environment information for disaster relief sites.

综上所述，本发明实施例通过上述步骤101-105实现了去除常识中蕴含的“有害偏差”对答案生成的负面影响，从而提高智能体在灾害救援现场的人机交互能力，为救援人员提供更真实详尽的灾害环境信息。To sum up, the embodiment of the present invention removes the negative impact of "harmful deviations" contained in common sense on answer generation through the above steps 101-105, thereby improving the human-computer interaction ability of the agent at the disaster rescue site and providing rescuers with Provide more realistic and detailed disaster environment information.

实施例2Example 2

下面结合具体的计算公式、实例对实施例1中的方案进行进一步地介绍，详见下文描述：The solution in Embodiment 1 will be further introduced below with specific calculation formulas and examples. For details, see the following description:

201：针对所使用的数据库，提取其训练集、验证集和测试集样本中与每个视觉对话单元相关的常识；201: For the database used, extract the common sense related to each visual dialogue unit in its training set, verification set and test set samples;

对于所使用的数据集，其由一个个视觉对话单元组成。每一个视觉对话单元包括：图像及对应的图片描述、历史问答对和当前需要回答的问题，且每一个问题对应100个候选答案。这里使用的是较为常用的VisDial v0.9和VisDial v1.0数据库，但本发明实施例不仅局限于该数据库，可以是包含所需视觉对话单元的任意数据库。For the dataset used, it consists of visual dialogue units. Each visual dialogue unit includes: image and corresponding picture description, historical question and answer pairs and current questions that need to be answered, and each question corresponds to 100 candidate answers. The more commonly used VisDial v0.9 and VisDial v1.0 databases are used here, but the embodiments of the present invention are not limited to this database, and can be any database containing required visual dialogue units.

针对所需的常识来源，本发明实施例使用的是较为常用的ConceptNet常识知识库。该常识知识库是基于一个人所知道的最基本事物的常识知识库，其由基本的常识三元组组成。三元组的具体形式为<开始节点，关系标签，结束节点>，其中开始节点和结束节点均指的是自然语言的单词或者短语，例如<bus，capable of，transport people>，且每一个三元组会被给定一个可信分数，分数越大表明该三元组的开始节点和结束节点的关系越可靠。For the required common sense sources, the embodiment of the present invention uses the more commonly used ConceptNet common sense knowledge base. The common sense knowledge base is a common sense knowledge base based on the most basic things a person knows, and is composed of basic common sense triples. The specific form of a triple is <start node, relationship label, end node>, where both the start node and the end node refer to natural language words or phrases, such as <bus, capable of, transport people>, and each three-tuple The tuple will be given a credibility score. The larger the score, the more reliable the relationship between the start node and the end node of the triple.

首先，为了从常识知识库中检索出与每一个对话单元相关的常识，需要先生成检索词。在本发明实施例中，检索词来源于图像中物体的标签和图像描述。具体地，针对图像中物体的标签，用Faster R-CNN(快速的区域卷积神经网络)框架检测出物体标签，并根据每个标签对应的置信分数选出分数最高的前5个物体标签；针对图像描述，关键词，同样地，根据每个关键词对应的置信分数，选出分数最高的前5个关键词。由此，提取出的5个物体标签和5个关键词共同组成了用于检索常识的检索词。First, in order to retrieve common sense related to each dialogue unit from the common sense knowledge base, search terms need to be generated first. In this embodiment of the present invention, the search terms are derived from labels and image descriptions of objects in the image. Specifically, for the labels of objects in the image, the Faster R-CNN (fast regional convolutional neural network) framework is used to detect the object labels, and the top five object labels with the highest scores are selected based on the confidence scores corresponding to each label; For image descriptions and keywords, similarly, based on the confidence score corresponding to each keyword, the top 5 keywords with the highest scores are selected. From this, the five extracted object labels and five keywords together constitute the search terms used to retrieve common sense.

其次，使用检索词在ConceptNet常识知识库中检索相关常识。具体地，依次使用检索词匹配常识库中三元组的开始节点，选出检索词与开始节点相同的三元组，根据常识知识库针对每个三元组给定的可信分数，选出分数最高的前20个三元组。Secondly, use the search terms to search for relevant common sense in the ConceptNet common sense knowledge base. Specifically, the search terms are used sequentially to match the starting nodes of the triples in the common sense database, and the triples whose search terms are the same as the starting nodes are selected. Based on the credibility score given by the common sense knowledge base for each triple, the triplet is selected. The top 20 triples with the highest scores.

202：对于步骤201提取出的常识三元组，首先构建常识子图，之后使用aGCN(注意图卷积网络)对该子图进行编码得到常识特征c；202: For the common sense triplet extracted in step 201, first construct a common sense subgraph, and then use aGCN (attention graph convolution network) to encode the subgraph to obtain the common sense feature c;

首先，使用步骤201提取出的与每个对话单元对应的20个常识三元组来构建常识子图G_c。其中G_c所包含的节点特征为所有常识三元组的开始节点和结束节点对应的节点特征。具体地，对每一个常识三元组中的开始节点和结束节点进行Glove词向量嵌入，得到节点特征其中，d指的是嵌入维度，i指的是第i个节点。First, use the 20 common sense triples corresponding to each dialogue unit extracted in step 201 to construct the common sense subgraph G _c . The node features included in G _c are the node features corresponding to the start nodes and end nodes of all common sense triples. Specifically, Glove word vector embedding is performed on the start node and end node in each common sense triplet to obtain the node features. Among them, d refers to the embedding dimension, and i refers to the i-th node.

之后，使用aGCN(注意图卷积网络)进一步编码常识子图G_c来更新当前节点特征z_i。具体地：Afterwards, aGCN (Attention Graph Convolutional Network) is used to further encode the common sense subgraph G _c to update the current node feature z _i . specifically:

其中，W是可学习的线性转换矩阵；N(i)是第i个节点的相邻节点的集合；s是非线性激活函数sigmoid；L是aGCN网络的卷积层数；a_ij是第i个节点特征z_i与第j个节点特征z_j之间的预定义权重；Among them, W is a learnable linear transformation matrix; N(i) is the set of adjacent nodes of the i-th node; s is the nonlinear activation function sigmoid; L is the number of convolutional layers of the aGCN network; a _ij is the i-th node The predefined weight between node feature z _i and j-th node feature z _j ;

具体地：specifically:

a_ij＝softmax(u_ij)a _ij =softmax(u _ij )

其中，W_h和均为可学习参数；[·,·]指将节点特征进行拼接；softmax是非线性激活函数。Among them, W _h and All are learnable parameters; [·,·] refers to splicing node features; softmax is a nonlinear activation function.

最后，常识子图G_c中的每一个节点特征通过aGCN(注意图卷积网络)进行更新，都具有了其相邻节点的特征信息。所有更新后的节点特征进行平均池化后输出常识特征c。Finally, each node feature in the common sense subgraph G _c is updated through aGCN (attention graph convolution network), and has the feature information of its adjacent nodes. All updated node features are average pooled and the common sense feature c is output.

203：因果图与反事实分析的概念；203: Concepts of causal diagrams and counterfactual analysis;

在本发明实施例中，使用一种强有力的图形工具——因果图，来探索变量背后的因果关系。它是一种有向无环图。因果图中的每一个节点代表一个变量，变量间的箭头代表它们之间的因果联系，通常箭头的起点是“原因”，终点是“结果”。如图2(a)所示，X→Y表示X对Y有直接的影响，其中X是“原因”，Y是“结果”。X→M→Y表示X通过中介M间接地影响Y。当变量被赋予某一值时，如X＝x和X＝x^*(在本发明实施例中，用“小写字母”表示变量具体的观测值，用“小写字母^*”表示空值)，则X对Y的总效应(Total Effect,TE)计算方法为：In the embodiment of the present invention, a powerful graphical tool—causality diagram—is used to explore the causal relationship behind variables. It is a directed acyclic graph. Each node in a cause-and-effect diagram represents a variable, and the arrows between variables represent the causal connection between them. Usually the starting point of the arrow is the "cause" and the end point is the "result." As shown in Figure 2(a), X→Y means that X has a direct impact on Y, where X is the "cause" and Y is the "result". X→M→Y means that X indirectly affects Y through the intermediary M. When a variable is assigned a ^certain value, such as X=x ^and The calculation method of the total effect (TE) of X on Y is:

其中，和/>分别表示在X＝x和X＝x^*时，变量X通过X→Y与X→M→Y对变量Y造成的影响。in, and/> Respectively represent the influence of variable X on variable Y through X→Y and X→M→Y when X=x and X=x ^* .

进一步地，总效应TE可以分解为自然直接效应(Natural Direct Effect,NDE)和总间接效应(Total Indirect Effect,TIE)，即：Furthermore, the total effect TE can be decomposed into natural direct effect (NDE) and total indirect effect (TIE), namely:

TE＝NDE+TIETE＝NDE+TIE

具体地，为了计算自然直接效应NDE，本发明实施例构建了一个反事实的世界，如图2(d)所示。在反事实的世界中，变量一次可同时被赋予两个值(如X＝x或X＝x^*)，这在事实的世界中显然不可能发生，在事实的世界(图2(b)、图2(c))中，变量一次只能被赋予一个值(如X＝x或X＝x^*)。当变量X通过X→Y直接影响变量Y时，将变量X赋值为X＝x；当变量X通过X→M→Y间接影响Y时，将变量X赋值为X＝x^*，则变量X对Y的总影响为自然直接效应NDE的计算方法为：Specifically, in order to calculate the natural direct effect NDE, the embodiment of the present invention constructs a counterfactual world, as shown in Figure 2(d). In the counterfactual world, a variable can be assigned two values at the same time (such as X=x or X=x ^* ), which is obviously impossible to happen in the factual world. In Figure 2(c)), a variable can only be assigned one value at a time (such as X=x or X=x ^* ). When variable X directly affects variable ^Y through X→Y, assign variable X to X=x; when variable X indirectly affects Y through X→M→Y, assign variable The total impact of Y is The calculation method of natural direct effect NDE is:

直观上来说，NDE表示排除了X对Y的间接影响(X→M→Y)之后，X对Y的直接影响(X→Y)有多少。Intuitively speaking, NDE indicates how much direct influence X has on Y (X→Y) after excluding the indirect influence of X on Y (X→M→Y).

进一步地，总间接效应TIE的计算方法为：Furthermore, the calculation method of total indirect effect TIE is:

204：将常识引入视觉对话任务，构建基于常识融合的视觉对话因果图；204: Introduce common sense into visual dialogue tasks and construct a visual dialogue causal graph based on common sense fusion;

针对现有的大部分基于常识融合的视觉对话模型，构建基于常识融合的视觉对话因果图如图3所示。For most of the existing visual dialogue models based on common sense fusion, a visual dialogue causal diagram based on common sense fusion is constructed, as shown in Figure 3.

首先，使用视觉编码器(大部分模型使用Faster-RCNN)从图像中提取视觉特征，在因果图(图3)中将该视觉特征记为节点I；当前问题和对话历史通常使用长短期记忆网络(LSTM)进行编码得到对应的问题特征与历史特征，在因果图中将其分别表示为节点Q和节点H。步骤202输出的常识特征在因果图中记为节点C。First, a visual encoder (most models use Faster-RCNN) is used to extract visual features from the image, and the visual feature is recorded as node I in the causal diagram (Figure 3); the current question and conversation history usually use long short-term memory networks (LSTM) is encoded to obtain the corresponding problem features and historical features, which are represented as node Q and node H respectively in the causal graph. The common sense feature output in step 202 is recorded as node C in the causal graph.

之后，子图(I→K,Q→K,H→K,H→Q→K,C→K)表示视觉对话模型中的编码器输入H,Q,I,C，输出多模态特征K，其中，由于当前需要回答的问题起源于之前的对话历史，故有H→Q。After that, the subgraph (I→K, Q→K, H→K, H→Q→K, C→K) represents the encoder input H, Q, I, C in the visual dialogue model and outputs multi-modal features K , where, since the current question that needs to be answered originates from the previous conversation history, there is H→Q.

最后，问题特征Q与多模态特征K输入判别式解码器或者生成式解码器来产生当前问题的答案A，对应的子图为Q→A和K→A。Finally, the question feature Q and the multimodal feature K are input into the discriminative decoder or the generative decoder to generate the answer A to the current question, and the corresponding subgraphs are Q→A and K→A.

另有，如步骤201所述，尽管用于检索常识的检索词来源于图像和对话历史中的图像描述，即I→C,H→C，但子图I→C,H→C不影响基于常识融合的视觉对话因果图(图3)所表示的视觉对话模型输入(H,Q,I,C)与输出(A)之间的因果性，为了简化，故不在图3上予以标注。尽管现有的基于常识融合的视觉对话模型利用常识知识提高答案预测精度的方式可能不同，然而，在不失一般性的情况下，均可以用子图C→K来表示常识特征与多模态特征之间的因果性。同时，本发明实施例切断了H与A的直接连接，即去除了子图H→A来避免视觉对话模型学习到H中蕴含的语言偏差来对答案预测产生负面影响。In addition, as mentioned in step 201, although the search terms used to retrieve common sense are derived from image descriptions in the image and conversation history, that is, I→C, H→C, the subgraph I→C, H→C does not affect the search results based on The causality between the visual dialogue model input (H, Q, I, C) and the output (A) represented by the common sense fusion visual dialogue causality diagram (Figure 3) is not marked on Figure 3 for the sake of simplicity. Although existing visual dialogue models based on common sense fusion may use common sense knowledge to improve answer prediction accuracy in different ways, without loss of generality, subgraph C→K can be used to represent common sense features and multi-modality Causality between features. At the same time, the embodiment of the present invention cuts off the direct connection between H and A, that is, removes the subgraph H→A to prevent the visual dialogue model from learning the language bias contained in H to have a negative impact on answer prediction.

205：根据步骤204的基于常识融合的视觉对话因果图，计算输入特征(H,Q,I,C)对答案(A)预测的总效应TE。205: According to the visual dialogue causal graph based on common sense fusion in step 204, calculate the total effect TE of the input features (H, Q, I, C) on the prediction of the answer (A).

首先，对于步骤204的基于常识融合的视觉对话因果图，根据步骤203的总效应TE的计算方法，对输入变量H,Q,I,C分别赋予具体的观测值和空值，即I＝i,Q＝q,H＝h,C＝c和I＝i^*,Q＝q^*,H＝h^*,C＝c^*；进一步地，针对基于常识融合的视觉对话模型，观测值i指的是某一具体图像经步骤204的使用视觉编码器后输出的图像特征，观测值q,h分别指的是该图像对应的当前问题和对话历史经步骤204的使用长短期记忆网络(LSTM)进行编码得到对应的问题特征与历史特征；观测值c指当前视觉对话单元经步骤201、步骤202提取出的常识特征；空值i^*,q^*,h^*,c^*分别指的是对视觉对话模型不输入图像、问题、对话历史和常识对应的特征。First, for the visual dialogue causal graph based on common sense fusion in step 204, according to the calculation method of the total effect TE in step 203, the input variables H, Q, I, and C are assigned specific observation values and null values respectively, that is, I=i , Q=q, H=h, C=c and I=i ^* , Q=q ^* , H=h ^* , C=c ^* ; further, for the visual dialogue model based on common sense fusion, the observation value i refers to It is the image feature output by a specific image after using the visual encoder in step 204. The observation values q and h respectively refer to the current question and dialogue history corresponding to the image, which are processed using the long short-term memory network (LSTM) in step 204. The corresponding problem features and historical features are obtained by coding; the observation value c refers to the common sense feature extracted by the current visual dialogue unit through steps 201 and 202; the null values i ^* , q ^* , h ^* , c ^* respectively refer to the visual dialogue The model does not input features corresponding to images, questions, conversation history, and common sense.

之后，输入变量H,Q,I,C被赋予具体的观测值时，其对答案(A)预测产生的影响记为A(I＝i,Q＝q,H＝h,C＝c)，输入变量被赋予空值时，其对答案(A)预测产生的影响记为A(I＝i^*,Q＝q^*,H＝h^*,C＝c^*)，根据步骤103的总效应TE的计算方法得：Afterwards, when the input variables H, Q, I, C are assigned specific observed values, their impact on the prediction of answer (A) is recorded as A (I=i, Q=q, H=h, C=c), When the input variable is assigned a null value, its impact on the prediction of answer (A) is recorded as A (I=i ^* , Q=q ^* , H=h ^* , C=c ^* ). According to the total effect TE in step 103 The calculation method is:

TE＝A(I＝i,Q＝q,H＝h,C＝c)-A(I＝i^*,Q＝q^*,H＝h^*,C＝c^*)TE=A(I=i,Q=q,H=h,C=c)-A(I=i ^* ,Q=q ^* ,H=h ^* ,C=c ^* )

为了简化表示，本发明实施例将A(I＝i,Q＝q,H＝h,C＝c)简记为A_i,q,h,c，将A(I＝i^*,Q＝q^*,H＝h^*,C＝c^*)简记为A_i*,q*,h*,c*。In order to simplify the expression, in the embodiment of the present invention, A (I=i, Q=q, H=h, C=c) is abbreviated as A _i,q,h,c , and A (I=i ^* , Q=q ^* , H＝h ^* , C＝c ^* ) are abbreviated as A _{i*, q*, h*, c*} .

如图3所示，对于答案(A)有两条直接的路径与其相连，即K→A和Q→A，进一步地，本发明实施例使用P(·)来表示K，Q对A产生的综合效应，则A_i,q,h,c又可记为P_k,q，其中P_k,q＝P(K＝k,Q＝q)，观测值k指的是步骤204的视觉对话模型中的编码器在输入I＝i,Q＝q,H＝h,C＝c之后输出的多模态特征k；类似地，A_i*,q*,h*,c*可记为P_k*,q*，其中P_k*,q*＝P(K＝k*,Q＝q*)，空值k*指的是步骤204的视觉对话模型中的编码器在输入空值I＝i^*,Q＝q^*,H＝h^*,C＝c^*之后输出的多模态空值k*；则总效应TE表示为：As shown in Figure 3, there are two direct paths connecting the answer (A), namely K→A and Q→A. Furthermore, the embodiment of the present invention uses P(·) to represent the effect of K and Q on A. Comprehensive effect, then A _{i, q, h, c} can be recorded as P _{k, q} , where P _{k, q} = P (K = k, Q = q), and the observation value k refers to the visual dialogue model in step 204 The multi-modal feature k output by the encoder in after inputting I=i, Q=q, H=h, C=c; Similarly, A _{i*, q*, h*, c*} can be recorded as P _{k *,q*} , where P _k*,q* =P(K=k*,Q=q*), the null value k* refers to the encoder in the visual dialogue model in step 204 when inputting the null value I=i ^* ,Q=q ^* ,H=h ^* ,C=c ^* The multi-modal null value k* output after; then the total effect TE is expressed as:

现有的大部分基于常识融合的视觉对话方法通过演绎图3所示的因果图来进行答案预测，忽略了引入的常识产生的有害偏差对答案预测的影响；即总效应TE中不仅包含了引入的常识对答案预测产生的“有益”影响，也包含了有害偏差对答案预测的负面影响。Most of the existing visual dialogue methods based on common sense fusion predict answers by deducing the cause-and-effect diagram shown in Figure 3, ignoring the impact of harmful biases caused by introduced common sense on answer prediction; that is, the total effect TE not only includes the introduced The "beneficial" impact of common sense on answer predictions also includes the negative impact of harmful biases on answer predictions.

206：为了去除引入的常识在视觉对话任务中产生的有害偏差，基于步骤204中所构建的基于常识融合的视觉对话因果图，构建其对应的反事实因果图来估计有害偏差对答案预测的负面影响。206: In order to remove the harmful bias caused by the introduced common sense in the visual dialogue task, based on the visual dialogue causal graph based on common sense fusion constructed in step 204, construct its corresponding counterfactual causal graph to estimate the negative impact of harmful bias on answer prediction. Influence.

针对步骤204中所构建的基于常识融合的视觉对话因果图，常识C通过多模态知识K对答案预测产生间接影响，即子图C→K→A。然而，在实际中模型不仅会学习到有助于提升答案预测精度的“有益”的常识知识，也会学习到降低了答案预测精度的“有害”偏差。例如，对于当前需要回答的问题“What is the bus used for”，经步骤201提取出的常识三元组中不仅包含了“有益”的常识知识<bus,capable of,transport people>，其有助于模型预测出正确答案“Transport people”；同时提取出的常识三元组中也包含了<bus,is a,public transport>、<bus,is a,car>以及<bus,at location,road>等，其形成了降低答案预测精度的“有害”偏差，例如，开始节点“bus”由于在提取出的常识三元组中经常出现，会误导模型青睐于含有“bus”的候选答案，如“City bus”、“On a bus”，甚至使其预测分数高于正确答案“Transport people”的预测分数。Regarding the visual dialogue causal graph based on common sense fusion constructed in step 204, common sense C has an indirect impact on answer prediction through multi-modal knowledge K, that is, subgraph C→K→A. However, in practice, the model not only learns "beneficial" common sense knowledge that helps improve answer prediction accuracy, but also learns "harmful" biases that reduce answer prediction accuracy. For example, for the current question "What is the bus used for" that needs to be answered, the common sense triplet extracted in step 201 not only contains the "useful" common sense knowledge <bus, capable of, transport people>, but also helps The model predicts the correct answer "Transport people"; at the same time, the extracted common sense triplet also contains <bus, is a, public transport>, <bus, is a, car> and <bus, at location, road> etc., which forms a "harmful" bias that reduces the accuracy of answer prediction. For example, since the starting node "bus" often appears in the extracted common sense triplet, it will mislead the model to favor candidate answers containing "bus", such as " City bus”, “On a bus”, and even makes its predicted score higher than the correct answer “Transport people”.

然而，现有的大部分基于常识融合的视觉对话模型忽略了由引入的常识产生的“有害”偏差，在本发明实施例中，为了估计并去除“有害”偏差对答案预测产生的负面影响，根据步骤204中所构建的基于常识融合的视觉对话因果图，构建其对应的反事实因果图。其核心思想是假设“如果对视觉对话模型只输入常识变量C，同时阻断其余变量I、Q、H、K的输入，这对答案预测会产生什么影响？”具体来说，本发明实施例通过阻断I、Q、H和K对答案(A)预测产生的影响来评估常识(C)对答案预测(A)的自然直接效应(NDE)。However, most of the existing visual dialogue models based on common sense fusion ignore the "harmful" bias caused by the introduced common sense. In the embodiment of the present invention, in order to estimate and remove the negative impact of "harmful" bias on answer prediction, According to the visual dialogue causal graph based on common sense fusion constructed in step 204, its corresponding counterfactual causal graph is constructed. The core idea is to assume that "if only the common sense variable C is input to the visual dialogue model, and the input of the other variables I, Q, H, and K is blocked, what impact will this have on answer prediction?" Specifically, the embodiment of the present invention Evaluate the natural direct effect (NDE) of common sense (C) on the prediction of the answer (A) by blocking the effects of I, Q, H, and K on the prediction of the answer (A).

本发明实施例构建的基于常识融合的反事实因果图如图4所示。The counterfactual causal diagram based on common sense fusion constructed by the embodiment of the present invention is shown in Figure 4.

首先，本发明实施例在基于常识融合的因果图(图3)的基础上，通过将I、Q、H、C分别赋予空值，即I＝i*，Q＝q*，H＝h*和C＝c*，进而有K＝k*，阻断I、Q、H和K对答案(A)预测产生的影响。First, based on the causal diagram (Figure 3) based on common sense fusion, the embodiment of the present invention assigns null values to I, Q, H, and C respectively, that is, I=i*, Q=q*, H=h* and C=c*, and then K=k*, blocking the influence of I, Q, H and K on the prediction of answer (A).

之后，如步骤203所述，常识C在反事实的世界中一次可同时被赋予两个值，即C＝c*和C＝c；其中，前者用以得到K＝k*，而后者与答案A直接连接来评估常识(C)对答案预测(A)的自然直接效应(NDE)。由于模型在反事实的世界中只能通过常识(C)来预测答案，因此自然直接效应(NDE)保留了引入的常识产生的有害偏差对答案预测的影响。After that, as described in step 203, common sense C can be assigned two values at the same time in the counterfactual world, namely C=c* and C=c; among them, the former is used to obtain K=k*, and the latter is related to the answer A direct connection evaluates the natural direct effect (NDE) of common sense (C) on answer prediction (A). Since the model can only predict the answer through common sense (C) in the counterfactual world, the natural direct effect (NDE) preserves the impact of the harmful bias introduced by common sense on the answer prediction.

207：针对步骤206所构建的基于常识融合的反事实因果图，根据步骤203中的自然直接效应NDE的计算方法，估计引入的常识产生的有害偏差对答案预测的影响，并将其从步骤205中所述的总效应TE中去除；207: For the counterfactual causal diagram based on common sense fusion constructed in step 206, according to the calculation method of natural direct effect NDE in step 203, estimate the impact of the harmful bias caused by the introduced common sense on the answer prediction, and use it from step 205 removed from the total effect TE described in;

首先，针对步骤206所构建的基于常识融合的反事实因果图，根据步骤203中的自然直接效应NDE的计算方法，得到常识对答案预测的自然直接效应NDE为：First, for the counterfactual causal diagram based on common sense fusion constructed in step 206, according to the calculation method of natural direct effect NDE in step 203, the natural direct effect NDE of common sense on answer prediction is obtained:

其中，指阻断了问题Q和多模态知识K的输入而仅保留常识C之后对答案预测的影响，/>指对视觉对话模型所有输入变量均赋予空值之后对答案预测的影响。in, Refers to the impact on answer prediction after blocking the input of question Q and multi-modal knowledge K and only retaining common sense C,/> Refers to the impact on answer prediction after all input variables of the visual dialogue model are assigned null values.

之后，为了去除引入的常识产生的有害偏差对答案预测的影响，将自然直接效应NDE从总效应TE中去除，得到总间接效应TIE为：After that, in order to remove the harmful bias caused by the introduced common sense on answer prediction, the natural direct effect NDE is removed from the total effect TE, and the total indirect effect TIE is obtained:

最后，“有益”的常识知识的影响(即TIE)被保留，而“有害”常偏差的影响(即NDE)被消除。这里需要明确的是，等价于/>因为它们都指的是模型所有输入都被阻断后对答案预测的影响。此外，本发明实施例提出的反事实因果推理方法具有较强的泛化能力，可适用于各种基于常识融合的视觉对话模型。Finally, the influence of “beneficial” commonsense knowledge (i.e., TIE) is retained, while the influence of “harmful” common sense deviations (i.e., NDE) is eliminated. What needs to be made clear here is that Equivalent to/> Because they all refer to the impact on answer prediction after all inputs to the model are blocked. In addition, the counterfactual causal reasoning method proposed in the embodiment of the present invention has strong generalization ability and can be applied to various visual dialogue models based on common sense fusion.

208：基于步骤207的总间接效应TIE，采用模型集成的方法训练视觉对话模型，同时以交叉熵损失和KL散度损失作为训练目标得到答案预测结果；208: Based on the total indirect effect TIE of step 207, use the model integration method to train the visual dialogue model, and use cross-entropy loss and KL divergence loss as training targets to obtain the answer prediction results;

针对步骤207的总间接效应TIE，在实际应用中，本发明实施例采用模型集成的方法进行训练如图5所示。具体地，P_q,k的对话方法可以分解为P_q和P_k两个分量，分别由神经网络N _Q和N _K计算，即P_q＝N_Q(q)和P_k＝N_K(i,q,h,c)。N_Q(q)表示子图Q→A，其中问题经过LSTM编码后的特征输入softmax解码器得到候选答案的预测分数。N_K(i,q,h,c)覆盖了子图{I→K,Q→K,H→K,H→Q→K,C→K}，其中问题、历史、图像和常识之间的多模态交互使用注意力机制完成。Regarding the total indirect effect TIE in step 207, in practical applications, the embodiment of the present invention uses a model integration method for training, as shown in Figure 5. Specifically, the dialogue method of P _q,k can be decomposed into two components, P _q and P _k , which are calculated by neural networks N _Q and N _K respectively, that is, P _q =N _Q (q) and P _k =N _K (i ,q,h,c). N _Q (q) represents the subgraph Q→A, in which the features of the question encoded by LSTM are input into the softmax decoder to obtain the prediction score of the candidate answer. N _K (i,q,h,c) covers the subgraph {I→K,Q→K,H→K,H→Q→K,C→K}, where the relationship between problems, history, images and common sense Multimodal interaction is accomplished using attention mechanisms.

本发明实施例融合了来自两个网络的输出结果，生成了事实世界中P_q,k的最终答案分数：The embodiment of the present invention combines the output results from the two networks to generate the final answer score of P _q,k in the real world:

其中，s是非线性激活函数sigmoid。Among them, s is the nonlinear activation function sigmoid.

同样，在反事实的世界中，被分解为/>和P_c。正如步骤206所述，为了评估C对A的直接影响，包括Q和K在内的所有信号都被阻塞，即N_Q和N_K的输入是无效的。在反事实场景中遵循一个合理的策略，假设两个网络都将在没有任何特定处理的情况下猜测和挑选出答案。具体地说，如果没有给出i、q、h或c中的一个或多个，则/>和/>被分配给一个具有均匀分布的可学习参数m。Likewise, in the counterfactual world, is decomposed into/> and P _c . As described in step 206, in order to evaluate the direct impact of C on A, all signals including Q and K are blocked, that is, the inputs of N _Q and N _K are invalid. A reasonable strategy to follow in the counterfactual scenario is to assume that both networks will guess and pick out the answer without any specific processing. Specifically, if one or more of i, q, h, or c is not given, then/> and/> is assigned to a learnable parameter m with a uniform distribution.

同时，P_c由神经网络N_C计算，即P_c＝N_C(c)，对应于子图CA。具体来说，常识特征用LSTM编码，然后用2层MLP对候选答案进行排序。最后，将网络的所有输出结果进行融合，得到反事实世界中的/>的答案分数：At the same time, P _c is calculated by the neural network N _C , that is, P _c = N _C (c), corresponding to the subgraph C A. Specifically, common sense features are encoded with LSTM, and then candidate answers are ranked using 2-layer MLP. Finally, all the output results of the network are fused to obtain /> in the counterfactual world. Answer points for:

在推理阶段，使用反事实推理进行无偏常识性学习，以促进VD预测：In the inference stage, counterfactual reasoning is used for unbiased commonsense learning to facilitate VD prediction:

通过最小化以下损失函数来进行网络训练：The network is trained by minimizing the following loss function:

其中，a为地面真实答案；γ为损失权重；CE表示真实答案与其预测答案之间的交叉熵损失。具体来说，使用集成模型来执行反事实计算，其中K→A、Q→A和C→A这三个分支分别在P_k、P_q和P_c之上。同时，为了控制分布的锐度，使用KL散度D_kl来优化可学习参数m：Among them, a is the ground truth answer; γ is the loss weight; CE represents the cross entropy loss between the real answer and its predicted answer. Specifically, an ensemble model is used to perform counterfactual calculations, where the three branches K→A, Q→A, and C→A are above P _k , P _q , and P _c , respectively. At the same time, in order to control The sharpness of the distribution, using KL divergence D _kl to optimize the learnable parameter m:

其中，p(a|q,k)＝softmax(P_q,k)，softmax为非线性激活函数，当计算D_kl时，只有可学习参数m被优化。Among them, p(a|q,k)=softmax(P _q,k ), Softmax is a nonlinear activation function. When calculating D _kl , only the learnable parameter m is optimized.

209：将图像特征、对话历史特征、常识特征及当前问题特征进行整合，送入解码器中，同时最小化损失函数，优化网络参数，最后生成人类针对当前场景所提出问题的正确答案。209: Integrate image features, conversation history features, common sense features and current question features and send them to the decoder. At the same time, the loss function is minimized, the network parameters are optimized, and finally the correct answers to questions raised by humans for the current scene are generated.

具体地，基于常识融合的事实因果图，视觉对话网络输入真实的图像特征、对话历史特征、常识特征及当前问题特征，经融合后送入解码器；基于常识融合的反事实因果图，该视觉对话网络依旧输入真实的常识特征，同时用可学习参数代替真实的图像特征、对话历史特征及当前问题特征的输入，经融合后送入同一解码器；基于步骤207的总间接效应TIE的计算方法以及该解码器在这两种情况下的输出结果，得到候选答案的排名预测结果。Specifically, based on the factual causal graph fused with common sense, the visual dialogue network inputs real image features, conversation history features, common sense features and current problem features, and then fed into the decoder after fusion; Based on the counterfactual causal graph fused with common sense, the visual dialogue network The dialogue network still inputs real common sense features, and at the same time uses learnable parameters to replace the input of real image features, conversation history features and current question features, and then sends them to the same decoder after fusion; Calculation method of total indirect effect TIE based on step 207 As well as the output results of the decoder in these two cases, the ranking prediction results of the candidate answers are obtained.

同时，以最小化步骤208中所述的损失函数为训练目标，不断优化视觉对话网络参数，使答案预测结果达到最优性能，最终得到人类针对当前场景所提出问题的准确答案。本发明方案具有一定的泛化性，可适用于大部分的视觉对话模型，提高了智能体在实时地回答救援人员问题时的答案预测精度，使救援人员能够充分了解灾害现场的环境信息，从而更好的制定救援计划，降低了救援难度，避免了救援伤亡。At the same time, minimizing the loss function described in step 208 is used as the training goal, and the visual dialogue network parameters are continuously optimized to achieve the optimal performance of the answer prediction results, and finally obtain accurate answers to questions raised by humans for the current scene. The solution of the present invention has certain generalizability and can be applied to most visual dialogue models. It improves the answer prediction accuracy of the agent when answering rescuers' questions in real time, so that rescuers can fully understand the environmental information of the disaster site, thereby Better rescue plans can be made to reduce the difficulty of rescue and avoid rescue casualties.

实施例3Example 3

下面结合具体的算例，表1和表2对实施例和2中的方案进行可行性验证，详见下文描述：The feasibility verification of the solutions in Examples and 2 is carried out below with specific calculation examples and Table 1 and Table 2. See the following description for details:

为了验证本发明的效果，分别在Visdial-v1.0和Visdial-v0.9数据集上用本发明的方法进行实验。答案预测的评价指标包括召回率(R@1、R@5、R@10)、归一化折损累计增益(NDCG)、正确答案的平均倒数排名(MRR)和正确答案的平均排名(Mean)。召回率(R@1、R@5、R@10)、归一化折损累计增益(NDCG)和正确的答案的平均倒数排名(MRR)越高，同时正确答案的平均排名(Mean)越低，表示该视觉对话生成方法越好。实验结果分别如表1(使用Visdial-v1.0数据集)和表2(使用Visdial-v0.9数据集)所示。In order to verify the effect of the present invention, experiments were conducted using the method of the present invention on the Visdial-v1.0 and Visdial-v0.9 data sets respectively. Evaluation indicators for answer prediction include recall rate (R@1, R@5, R@10), normalized loss cumulative gain (NDCG), mean reciprocal ranking of correct answers (MRR) and mean ranking of correct answers (Mean ). The higher the recall rate (R@1, R@5, R@10), normalized loss cumulative gain (NDCG) and the average reciprocal ranking (MRR) of the correct answer, the higher the average ranking (Mean) of the correct answer. Low means that the visual dialogue generation method is better. The experimental results are shown in Table 1 (using the Visdial-v1.0 data set) and Table 2 (using the Visdial-v0.9 data set) respectively.

表1本发明方法在Visdial-v1.0测试集上的结果Table 1 Results of the method of the present invention on the Visdial-v1.0 test set

表2本发明方法在Visdial-v0.9验证集上的结果Table 2 Results of the method of the present invention on the Visdial-v0.9 verification set

从表1和表2的结果可以看出，本发明将常识引入视觉对话方法，验证了常识知识能有效提升答案预测精度；同时，在引入常识知识后，本发明使用反事实分析方法去除常识中蕴含的“有害偏差”对答案生成的负面影响，保留常识对答案生成的正面影响，实验结果验证了本发明的有效性。It can be seen from the results in Table 1 and Table 2 that the present invention introduces common sense into the visual dialogue method, verifying that common sense knowledge can effectively improve the accuracy of answer prediction; at the same time, after introducing common sense knowledge, the present invention uses counterfactual analysis method to remove common sense. The contained "harmful bias" has a negative impact on answer generation, while retaining common sense has a positive impact on answer generation. The experimental results verify the effectiveness of the present invention.

本发明方案可以应用在灾害救援任务。当发生灾害时，救援人员可以使智能体先一步到达灾害现场，将灾害现场图像、相关的常识知识及问答历史输入智能体，智能体根据这些信息实时地回答救援人员的问题，以便救援人员充分了解灾害现场环境，合理的制定救援计划，减少救援伤亡。The solution of the invention can be applied in disaster rescue missions. When a disaster occurs, rescuers can make the intelligent agent arrive at the disaster scene first and input the disaster scene images, relevant common sense knowledge and question and answer history into the intelligent agent. The intelligent agent can answer the rescuer's questions in real time based on this information, so that the rescuer can fully Understand the disaster scene environment, formulate a reasonable rescue plan, and reduce rescue casualties.

实施例4Example 4

一种基于反事实常识因果推理的视觉对话生成装置，装置包括：处理器和存储器，存储器中存储有程序指令，处理器调用存储器中存储的程序指令以使装置执行任一项的方法步骤：A visual dialogue generation device based on counterfactual common sense causal reasoning. The device includes: a processor and a memory. Program instructions are stored in the memory. The processor calls the program instructions stored in the memory to cause the device to perform any of the method steps:

其中，常识三元组为：提取数据库的训练集、验证集和测试集样本中与每个视觉对话单元相关的常识；提取出的常识三元组具体操作为：Among them, the common sense triplet is: extract the common sense related to each visual dialogue unit in the training set, verification set and test set samples of the database; the specific operation of the extracted common sense triplet is:

进一步地，使用注意图卷积网络对子图进行编码得到常识特征为：使用aGCN编码常识子图G_c更新当前节点特征z_i；Further, using the attention graph convolutional network to encode the subgraph to obtain common sense features is: using aGCN to encode the common sense subgraph G _c to update the current node feature z _i ;

其中，W是可学习的线性转换矩阵；N(i)是第i个节点的相邻节点的集合；s是非线性激活函数sigmoid；L是aGCN网络的卷积层数；a_ij是第i个节点特征z_i与第j个节点特征z_j之间的预定义权重；所有更新后的节点特征进行平均池化后输出常识特征c。Among them, W is a learnable linear transformation matrix; N(i) is the set of adjacent nodes of the i-th node; s is the nonlinear activation function sigmoid; L is the number of convolutional layers of the aGCN network; a _ij is the i-th node The predefined weight between node feature z _i and j-th node feature z _j ; common sense feature c is output after average pooling of all updated node features.

其中，将常识引入视觉对话任务，构建基于常识融合的视觉对话因果图具体为：Among them, common sense is introduced into the visual dialogue task and a visual dialogue causal graph based on common sense fusion is constructed as follows:

使用视觉编码器从图像中提取视觉特征，在因果图中将视觉特征记为节点I；在因果图中将其分别表示为节点Q和节点H，常识特征在因果图中记为节点C；Use the visual encoder to extract visual features from the image. The visual features are recorded as node I in the causal diagram; they are represented as node Q and node H respectively in the causal diagram, and the common sense features are recorded as node C in the causal diagram;

之后，子图(I→K,Q→K,H→K,H→Q→K,C→K)表示视觉对话模型中的编码器输入H,Q,I,C，输出多模态特征K；After that, the subgraph (I→K, Q→K, H→K, H→Q→K, C→K) represents the encoder input H, Q, I, C in the visual dialogue model and outputs multi-modal features K ;

问题特征Q与多模态特征K输入判别式解码器或生成式解码器产生当前问题的答案A，对应的子图为Q→A和K→A。The question feature Q and the multimodal feature K are input into the discriminative decoder or the generative decoder to generate the answer A to the current question, and the corresponding subgraphs are Q→A and K→A.

进一步地，计算输入特征对答案预测的总效应为：Furthermore, the total effect of input features on answer prediction is calculated as:

首先，对输入变量H,Q,I,C分别赋予具体的观测值和空值，即I＝i,Q＝q,H＝h,C＝c和I＝i^*,Q＝q^*,H＝h^*,C＝c^*；First, the input variables H, Q, I, and C are assigned specific observed values and null values respectively, that is, I=i, Q=q, H=h, C=c and I=i ^* , Q=q ^* , H =h ^* ,C=c ^* ;

输入变量H,Q,I,C被赋予具体的观测值时，对答案A预测产生的影响记为A(I＝i,Q＝q,H＝h,C＝c)，输入变量被赋予空值时，对答案A预测产生的影响记为A(I＝i^*,Q＝q^*,H＝h^*,C＝c^*)，总效应TE的计算方法为：When the input variables H, Q, I, and C are assigned specific observed values, the impact on the prediction of answer A is recorded as A (I=i, Q=q, H=h, C=c), and the input variables are assigned empty values. When the value is , the impact on the prediction of answer A is recorded as A (I=i ^* , Q=q ^* , H=h ^* , C=c ^* ). The calculation method of the total effect TE is:

其中，基于视觉对话因果图，构建其对应的反事实因果图为：Among them, based on the visual dialogue causal diagram, the corresponding counterfactual causal diagram is constructed as:

进一步地，针对反事实因果图，根据自然直接效应估计引入的常识产生的有害偏差对答案预测的影响，并将其从总效应中去除为：Furthermore, for the counterfactual causal diagram, the impact of the harmful bias introduced by common sense on answer prediction is estimated based on the natural direct effect, and is removed from the total effect as:

其中，指阻断了问题Q和多模态知识K的输入而仅保留常识C之后对答案预测的影响，/>指对视觉对话模型所有输入变量均赋予空值之后对答案预测的影响；in, Refers to the impact on answer prediction after blocking the input of question Q and multi-modal knowledge K and only retaining common sense C,/> Refers to the impact on answer prediction after all input variables of the visual dialogue model are assigned null values;

得到总间接效应TIE为：The total indirect effect TIE is obtained as:

其中，采用模型集成训练视觉对话模型，以交叉熵和KL散度损失作为训练目标得到答案预测结果为：Among them, model ensemble is used to train the visual dialogue model, and cross-entropy and KL divergence loss are used as training targets to obtain the answer prediction result:

其中，s是非线性激活函数sigmoid，被分解为/>和P_c，分别由神经网络N_Q、N_K和N_C计算，P_q,k分解为P_q和P_k两个分量，分别由神经网络N_Q和N_K计算。Among them, s is the nonlinear activation function sigmoid, is decomposed into/> and P _c are calculated by neural networks N _Q , N _K and N _C respectively. P _q,k is decomposed into two components P _q and P _k , which are calculated by neural networks N _Q and N _K respectively.

这里需要指出的是，以上实施例中的装置描述是与实施例中的方法描述相对应的，本发明实施例在此不做赘述。It should be pointed out here that the device description in the above embodiments corresponds to the method description in the embodiments, and the embodiments of the present invention will not be described again here.

上述的处理器和存储器的执行主体可以是计算机、单片机、微控制器等具有计算功能的器件，具体实现时，本发明实施例对执行主体不做限制，根据实际应用中的需要进行选择。The execution subjects of the above-mentioned processors and memories can be computers, microcontrollers, microcontrollers and other devices with computing functions. During specific implementation, the embodiments of the present invention do not limit the execution subjects and can be selected according to the needs of actual applications.

存储器和处理器之间通过总线传输数据信号，本发明实施例对此不做赘述。Data signals are transmitted between the memory and the processor through a bus, which will not be described in detail in the embodiment of the present invention.

基于同一发明构思，本发明实施例还提供了一种计算机可读存储介质，存储介质包括存储的程序，在程序运行时控制存储介质所在的设备执行上述实施例中的方法步骤。Based on the same inventive concept, embodiments of the present invention also provide a computer-readable storage medium. The storage medium includes a stored program. When the program is running, the device where the storage medium is located is controlled to execute the method steps in the above embodiments.

该计算机可读存储介质包括但不限于快闪存储器、硬盘、固态硬盘等。The computer-readable storage medium includes but is not limited to flash memory, hard disk, solid state drive, etc.

这里需要指出的是，以上实施例中的可读存储介质描述是与实施例中的方法描述相对应的，本发明实施例在此不做赘述。It should be pointed out here that the description of the readable storage medium in the above embodiments corresponds to the description of the method in the embodiments, and the embodiments of the present invention will not be described again here.

在上述实施例中，可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时，可以全部或部分地以计算机程序产品的形式实现。计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时，全部或部分地产生按照本发明实施例的流程或功能。In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, it may be implemented in whole or in part in the form of a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, processes or functions according to embodiments of the present invention are generated in whole or in part.

计算机可以是通用计算机、专用计算机、计算机网络、或者其它可编程装置。计算机指令可以存储在计算机可读存储介质中，或者通过计算机可读存储介质进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。可用介质可以是磁性介质或者半导体介质等。The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. Computer instructions may be stored in or transmitted over a computer-readable storage medium. Computer-readable storage media can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or other integrated media that contains one or more available media. Available media may be magnetic media or semiconductor media, etc.

参考文献：references:

[1]X.Yang,H.Zhang,and J.Cai,“Deconfounded image captioning:A causalretrospect,”IEEE Transactions on Pattern Analysis and Machine Intelligence,pp.1–1,2021.[1]X.Yang, H.Zhang, and J.Cai, "Deconfounded image captioning: A causalretrospect," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp.1–1, 2021.

[2]X.Han,X.Dong,X.Song,T.Gan,Y.Zhan,Y.Yan,and L.Nie,“Divide-and-conquer predictor for unbiased scene graph generation,”IEEE Trans.CircuitsSyst.Video Technol.,vol.32,pp.8611–8622,2022.[2]X.Han, .Video Technol.,vol.32,pp.8611–8622,2022.

[3]Y.Niu,K.Tang,H.Zhang,Z.Lu,X.Hua,and J.Wen,“Counterfactual VQA:Acause-effect look at language bias,”in CVPR,2021,pp.12 700–12 710.[3] Y. Niu, K. Tang, H. Zhang, Z. Lu, X. Hua, and J. Wen, "Counterfactual VQA: A cause-effect look at language bias," in CVPR, 2021, pp. 12 700 –12 710.

[4]A.Das,S.Kottur,K.Gupta,A.Singh,D.Yadav,J.M.F.Moura,D.Parikh,andD.Batra,“Visual dialog,”in CVPR,2017,pp.1080–1089.[4] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, and D. Batra, "Visual dialog," in CVPR, 2017, pp. 1080–1089.

[5]J.Lu,A.Kannan,J.Yang,D.Parikh,and D.Batra,“Best of both worlds:Transferring knowledge from discriminative learning to a generative visualdialog model,”in NIPS,2017,pp.314–324.[5] J.Lu, A.Kannan, J.Yang, D.Parikh, and D.Batra, "Best of both worlds: Transferring knowledge from discriminative learning to a generative visualdialog model," in NIPS, 2017, pp.314 –324.

[6]Y.Niu,H.Zhang,M.Zhang,J.Zhang,Z.Lu,and J.Wen,“Recursive visualattention in visual dialog,”in CVPR,2019,pp.6679–6688.[6] Y.Niu, H.Zhang, M.Zhang, J.Zhang, Z.Lu, and J.Wen, “Recursive visualattention in visual dialog,” in CVPR, 2019, pp.6679–6688.

[7]X.Jiang,S.Du,Z.Qin,Y.Sun,and J.Yu,“KBGN:knowledge-bridge graphnetwork for adaptive vision-text reasoning in visual dialogue,”in ACMMultimedia,2020,pp.1265–1273.[7] –1273.

[8]D.Guo,H.Wang,and M.Wang,“Context-aware graph inference withknowledge distillation for visual dialog,”IEEE Trans.PatternAnal.Mach.Intell.,vol.44,no.10,pp.6056–6073,2022.[9]L.Zhao,H.Zhang,X.Li,S.Yang,and Y.Song,“You should know more:Learning external knowledge forvisual dialog,”Neurocomputing,vol.488,pp.54–65,2022.[8] D.Guo, H.Wang, and M.Wang, "Context-aware graph inference with knowledge distillation for visual dialog," IEEE Trans.PatternAnal.Mach.Intell., vol.44, no.10, pp.6056 –6073, 2022. [9] L. Zhao, H. Zhang, X. Li, S. Yang, and Y. Song, “You should know more: Learning external knowledge for visual dialog,” Neurocomputing, vol.488, pp. 54–65, 2022.

[10]S.Zhang,X.Jiang,Z.Yang,T.Wan,and Z.Qin,“Reasoning with multi-structure commonsense knowledge in visual dialog,”in CVPR Workshops,2022,pp.4599–4608.[10] S. Zhang,

[11]R.Speer,J.Chin,and C.Havasi,“Conceptnet 5.5:An open multilingualgraph of general knowledge,”in AAAI,2017,pp.4444–4451.[11] R.Speer, J.Chin, and C.Havasi, “Conceptnet 5.5: An open multilingualgraph of general knowledge,” in AAAI, 2017, pp.4444–4451.

本发明实施例对各器件的型号除做特殊说明的以外，其他器件的型号不做限制，只要能完成上述功能的器件均可。The embodiments of the present invention do not limit the models of each device unless otherwise specified, as long as the devices can complete the above functions.

本领域技术人员可以理解附图只是一个优选实施例的示意图，上述本发明实施例序号仅仅为了描述，不代表实施例的优劣。Those skilled in the art can understand that the accompanying drawing is only a schematic diagram of a preferred embodiment, and the above-mentioned serial numbers of the embodiments of the present invention are only for description and do not represent the advantages and disadvantages of the embodiments.

以上所述仅为本发明的较佳实施例，并不用以限制本发明，凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the present invention shall be included in the protection of the present invention. within the range.

Claims

1. A visual dialogue generation method based on counterfactual common sense causal reasoning, characterized in that the method includes:

For the extracted common sense triplet, a common sense subgraph is constructed, and the attention graph convolution network is used to encode the subgraph to obtain common sense features;

Introduce common sense into the visual dialogue task, construct a visual dialogue causal graph based on common sense fusion, and calculate the total effect of input features on answer prediction; based on the visual dialogue causal graph, construct its corresponding counterfactual causal graph;

For counterfactual causal diagrams, the impact of harmful biases introduced by common sense on answer predictions is estimated based on natural direct effects and removed from the total effect;

Model ensemble is used to train the visual dialogue model, and cross-entropy and KL divergence loss are used as training targets to obtain answer prediction results;

Integrate image features, conversation history features, common sense features and current problem features and send them to the decoder to minimize the loss function, optimize network parameters, and finally provide real disaster environment information for disaster relief sites.

2. A kind of visual dialogue generation method based on counterfactual common sense causal reasoning according to claim 1, characterized in that the common sense triplet is: extracting the training set, verification set and test set sample of the database with each Common sense related to each visual dialogue unit; the specific operation of the extracted common sense triplet is:

Use the Faster R-CNN framework to detect object labels, and select the top several object labels with the highest scores based on the confidence scores corresponding to each label; similarly, select the top few keywords with the highest scores for image descriptions.

3. A visual dialogue generation method based on counterfactual common sense causal reasoning according to claim 1, characterized in that the common sense feature obtained by encoding the subgraph using an attention graph convolutional network is: using aGCN to encode the common sense subgraph Graph G _c updates the current node feature z _i ;

Among them, W is a learnable linear transformation matrix; N(i) is the set of adjacent nodes of the i-th node; s is the nonlinear activation function sigmoid; L is the number of convolutional layers of the aGCN network; a _ij is the i-th node The predefined weight between node feature z _i and j-th node feature z _j ; common sense feature c is output after average pooling of all updated node features.

4. A visual dialogue generation method based on counterfactual common sense causal reasoning according to claim 1, characterized in that the introduction of common sense into the visual dialogue task and the construction of a visual dialogue causal graph based on common sense fusion are specifically:

Use the visual encoder to extract visual features from the image. The visual features are recorded as node I in the causal diagram; they are represented as node Q and node H respectively in the causal diagram, and the common sense features are recorded as node C in the causal diagram;

After that, the subgraph (I→K, Q→K, H→K, H→Q→K, C→K) represents the encoder input H, Q, I, C in the visual dialogue model and outputs multi-modal features K ;

The question feature Q and the multimodal feature K are input into the discriminative decoder or the generative decoder to generate the answer A to the current question, and the corresponding subgraphs are Q→A and K→A.

5. A visual dialogue generation method based on counterfactual common sense causal reasoning according to claim 1, characterized in that the total effect of the calculated input features on answer prediction is:

First, the input variables H, Q, I, and C are assigned specific observed values and null values respectively, that is, I=i, Q=q, H=h, C=c and I=i ^* , Q=q ^* , H =h ^* ,C=c ^* ;

When the input variables H, Q, I, and C are assigned specific observed values, the impact on the prediction of answer A is recorded as A (I=i, Q=q, H=h, C=c), and the input variables are assigned empty values. When the value is , the impact on the prediction of answer A is recorded as A (I=i ^* , Q=q ^* , H=h ^* , C=c ^* ). The calculation method of the total effect TE is:

TE=A(I=i,Q=q,H=h,C=c)-A(I=i ^* ,Q=q ^* ,H=h ^* ,C=c ^* )

=P _k,q -P _k*,q* .

6. A visual dialogue generation method based on counterfactual common sense causal reasoning according to claim 1, characterized in that, based on the visual dialogue causal diagram, constructing its corresponding counterfactual causal diagram is:

By assigning null values to I, Q, H, and C respectively, that is, I=i*, Q=q*, H=h*, and C=c*, and then K=k*, blocking I, Q, H, and The impact of K on the prediction of answer A;

Common sense C can be assigned two values at the same time in the counterfactual world, namely C=c* and C=c; the former is used to obtain K=k*, and the latter is directly connected to the answer A to evaluate the answer prediction of common sense C A natural direct effect.

7. A visual dialogue generation method based on counterfactual common sense causal reasoning according to claim 1, characterized in that, for the counterfactual causal diagram, the answer prediction is based on harmful biases caused by common sense introduced by natural direct effect estimation. and removing it from the total effect is:

in, Refers to the impact on answer prediction after blocking the input of question Q and multi-modal knowledge K and only retaining common sense C,/> Refers to the impact on answer prediction after all input variables of the visual dialogue model are assigned null values;

The total indirect effect TIE is obtained as:

8. A visual dialogue generation method based on counterfactual common sense causal reasoning according to claim 1, characterized in that the visual dialogue model is trained using model integration, and the answer is obtained with cross entropy and KL divergence loss as training targets. The predicted result is:

Among them, s is the nonlinear activation function sigmoid, is decomposed into/> and P _c are calculated by neural networks N _Q , N _K and N _C respectively. P _q,k is decomposed into two components P _q and P _k , which are calculated by neural networks N _Q and N _K respectively.

9. A visual dialogue generation device based on counterfactual common sense causal reasoning, characterized in that the device includes: a processor and a memory, program instructions are stored in the memory, and the processor calls the program instructions stored in the memory To cause the device to perform the method steps described in any one of claims 1-8.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program includes program instructions, and when executed by a processor, the program instructions cause the processor to execute the right The method steps described in any one of claims 1-8.