CN111988673A - Video description sentence generation method and related equipment - Google Patents


Info

Publication number
CN111988673A
CN111988673A
Authority
CN
China
Prior art keywords
vector
time
video
neural network
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010764613.8A
Other languages
Chinese (zh)
Other versions
CN111988673B (en)
Inventor
袁艺天
马林
朱文武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tsinghua University
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202010764613.8A
Publication of CN111988673A
Application granted
Publication of CN111988673B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application provides a method and related device for generating a video description sentence. The method includes: obtaining a syntactic feature vector of a target exemplar sentence; determining, according to the syntactic feature vector, the syntax of the video description sentence to be generated, to obtain syntax information; determining, according to the syntax information and a video semantic feature vector of a target video, the semantics of the video description sentence to be generated under that syntax, to obtain semantic information; and generating the video description sentence of the target video according to the semantic information. In this way, video description sentences with different syntactic structures can be generated by selecting different target exemplar sentences, which solves the problem that generated video description sentences have a single, uniform syntax.

Description

Method for generating video description sentences and related device

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a method for generating video description sentences and a related device.

Background

Video captioning refers to generating, for a given video, a sentence that describes the content of the video; the generated sentence is called a video description sentence. With a video description sentence, a user can quickly grasp the content of a video from the sentence alone, without watching the video. In the related art, the generated video description sentences suffer from a single, uniform syntax.

Summary

Embodiments of the present application provide a method and related device for generating video description sentences, which at least to some extent solve the problem that video description sentences have a single syntax.

Other features and advantages of the present application will become apparent from the following detailed description, or may be learned in part by practice of the present application.

According to one aspect of the embodiments of the present application, a method for generating a video description sentence is provided. The method includes: obtaining a syntactic feature vector of a target exemplar sentence; determining the syntax of the video description sentence to be generated according to the syntactic feature vector, to obtain syntax information; determining, according to the syntax information and a video semantic feature vector of a target video, the semantics of the video description sentence to be generated under that syntax, to obtain semantic information; and generating the video description sentence of the target video according to the semantic information.

According to one aspect of the embodiments of the present application, an apparatus for generating a video description sentence is provided. The apparatus includes: an obtaining module, configured to obtain a syntactic feature vector of a target exemplar sentence; a syntax determination module, configured to determine the syntax of the video description sentence to be generated according to the syntactic feature vector, to obtain syntax information; a semantic determination module, configured to determine, according to the syntax information and a video semantic feature vector of a target video, the semantics of the video description sentence to be generated under that syntax, to obtain semantic information; and a video description sentence determination module, configured to generate the video description sentence of the target video according to the semantic information.

In some embodiments of the present application, the syntax determination module is configured to: generate, by a first neural network included in a description generation model, a first hidden vector according to the syntactic feature vector, the first hidden vector indicating the syntax information. The description generation model further includes a second neural network cascaded with the first neural network, and both networks are gated recurrent neural networks.

In this embodiment, the semantic determination module is configured to: generate, by the second neural network, a second hidden vector according to the first hidden vector and the video semantic feature vector, the second hidden vector indicating the semantic information.

In some embodiments of the present application, the video description sentence determination module is configured to: determine the word vector at time t according to the second hidden vector generated by the second neural network at time t, and generate the video description sentence from the word vectors output at all time steps.

In this embodiment, the syntax determination module includes a first hidden vector generation unit, configured to have the first neural network output the first hidden vector at time t according to the syntactic feature vector, the word vector at time t-1, and the first hidden vector generated by the first neural network at time t-1.

In this embodiment, the semantic determination module includes a second hidden vector generation unit, configured to have the second neural network output the second hidden vector at time t according to the video semantic feature vector, the first hidden vector at time t, and the second hidden vector generated by the second neural network at time t-1.

In some embodiments of the present application, the first hidden vector generation unit includes: a first soft-attention weighting unit, configured to apply soft-attention weighting to the syntactic feature vector according to the first hidden vector at time t-1, to obtain a target syntactic feature vector for time t; a first concatenation unit, configured to concatenate the target syntactic feature vector for time t with the word vector at time t-1, to obtain a first concatenated vector for time t; and a first output unit, configured to have the first neural network take the first concatenated vector for time t as input and output the first hidden vector at time t, as sketched below.
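The following is a minimal sketch of this soft-attention and concatenation step, assuming an additive (Bahdanau-style) scoring function and the layer sizes shown, none of which are fixed by the source:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Weights a sequence of feature vectors by a query vector."""
    def __init__(self, query_dim: int, feat_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_q = nn.Linear(query_dim, attn_dim, bias=False)
        self.w_f = nn.Linear(feat_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h_prev: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # h_prev: (batch, query_dim)         -- first hidden vector at time t-1
        # feats:  (batch, seq_len, feat_dim) -- syntactic feature vectors
        scores = self.v(torch.tanh(self.w_q(h_prev).unsqueeze(1) + self.w_f(feats)))
        alpha = torch.softmax(scores, dim=1)   # attention weights over the sequence
        context = (alpha * feats).sum(dim=1)   # target syntactic feature vector for time t
        return context

# Usage: concatenate the attended context with the previous word vector to form
# the first concatenated vector that the first neural network consumes:
# x_t = torch.cat([attn(h_prev, H_s), w_prev], dim=-1)
```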

In some embodiments of the present application, the first neural network includes a first input gate, a first forget gate, and a first output gate, and the first output unit includes: a first forget gate vector calculation unit, configured to have the first forget gate compute the first forget gate vector at time t from the first concatenated vector for time t; a first input gate vector calculation unit, configured to have the first input gate compute the first input gate vector at time t from the first concatenated vector for time t; a first cell state vector calculation unit, configured to compute the first cell state vector at time t from the first forget gate vector at time t, the first input gate vector at time t, the first candidate vector at time t, and the first cell state vector of the first neural network at time t-1, where the first candidate vector at time t is obtained by applying a hyperbolic tangent to the first concatenated vector for time t; and a first hidden vector calculation unit, configured to compute the first hidden vector at time t from the first cell state vector at time t and the first output gate vector at time t, where the first output gate vector at time t is computed by the first output gate from the first concatenated vector for time t.

In some embodiments of the present application, the syntax determination module further includes: a first normalization unit, configured to separately normalize the first input gate vector, the first forget gate vector, the first output gate vector, and the first candidate vector in the first neural network; and a first transformation unit, configured to transform the normalized first input gate vector, first forget gate vector, first output gate vector, and first candidate vector according to a first shift vector and a first scaling vector, to obtain a target first input gate vector, a target first forget gate vector, a target first output gate vector, and a target first candidate vector. The first shift vector is output by a first multilayer perceptron from the target syntactic feature vector for time t, the first scaling vector is output by a second multilayer perceptron from the target syntactic feature vector for time t, and the first multilayer perceptron is independent of the second multilayer perceptron. A sketch of this normalize-then-transform step follows.
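Below is a hedged sketch of one such gate transformation, assuming the normalization is a per-vector standardization and that the two MLPs (hypothetical modules here) produce a shift and a scale of the gate's dimensionality; the source does not fix these details:

```python
import torch

def modulated_gate(gate_vec: torch.Tensor, context: torch.Tensor,
                   mlp_shift, mlp_scale, eps: float = 1e-5) -> torch.Tensor:
    """Normalize a gate vector, then shift and scale it with vectors produced
    by two independent MLPs from the attended syntactic feature vector."""
    mean = gate_vec.mean(dim=-1, keepdim=True)
    std = gate_vec.std(dim=-1, keepdim=True)
    normalized = (gate_vec - mean) / (std + eps)
    beta = mlp_shift(context)          # first shift vector
    gamma = mlp_scale(context)         # first scaling vector
    return gamma * normalized + beta   # target gate vector
```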

In this embodiment, the first cell state vector calculation unit is further configured to compute the first cell state vector at time t from the target first forget gate vector, the target first input gate vector, the target first candidate vector, and the first cell state vector at time t-1.

In this embodiment, the first hidden vector calculation unit is further configured to compute the first hidden vector at time t from the first cell state vector at time t and the target first output gate vector.

In some embodiments of the present application, the second hidden vector generation unit includes: a second soft-attention weighting unit, configured to apply soft-attention weighting to the video semantic feature vector according to the second hidden vector at time t-1, to obtain a target video semantic vector for time t (mirroring the soft-attention sketch above, with video semantic features in place of syntactic features); a second concatenation unit, configured to concatenate the target video semantic vector for time t with the first hidden vector at time t, to obtain a second concatenated vector for time t; and a second output unit, configured to have the second neural network take the second concatenated vector for time t as input and output the second hidden vector at time t.

In some embodiments of the present application, the second neural network includes a second input gate, a second forget gate, and a second output gate, and the second output unit includes: a second forget gate vector calculation unit, configured to have the second forget gate compute the second forget gate vector at time t from the second concatenated vector for time t; a second input gate vector calculation unit, configured to have the second input gate compute the second input gate vector at time t from the second concatenated vector for time t; a second cell state vector calculation unit, configured to compute the second cell state vector at time t from the second forget gate vector at time t, the second input gate vector at time t, the second candidate vector at time t, and the second cell state vector of the second neural network at time t-1, where the second candidate vector at time t is obtained by applying a hyperbolic tangent to the second concatenated vector for time t; and a second hidden vector calculation unit, configured to compute the second hidden vector at time t from the second cell state vector at time t and the second output gate vector at time t, where the second output gate vector at time t is computed by the second output gate from the second concatenated vector for time t.

In some embodiments of the present application, the semantic determination module further includes: a second normalization unit, configured to separately normalize the second input gate vector, the second forget gate vector, the second output gate vector, and the second candidate vector in the second neural network; and a second transformation unit, configured to transform the normalized second input gate vector, second forget gate vector, second output gate vector, and second candidate vector according to a second shift vector and a second scaling vector, to obtain a target second input gate vector, a target second forget gate vector, a target second output gate vector, and a target second candidate vector. The second shift vector is output by a third multilayer perceptron from the target video semantic vector for time t, the second scaling vector is output by a fourth multilayer perceptron from the target video semantic vector for time t, and the third multilayer perceptron is independent of the fourth multilayer perceptron.

In this embodiment, the second cell state vector calculation unit is further configured to compute the second cell state vector at time t from the target second forget gate vector, the target second input gate vector, the target second candidate vector, and the second cell state vector at time t-1.

In this embodiment, the second hidden vector calculation unit is further configured to compute the second hidden vector at time t from the second cell state vector at time t and the target second output gate vector.

In some embodiments of the present application, the apparatus for generating video description sentences further includes: a training data acquisition module, configured to acquire training data, the training data including a number of sample videos and the sample video description sentences corresponding to them; a semantic feature extraction module, configured to perform semantic feature extraction on a sample video to obtain its sample video semantic feature vector; a syntactic feature extraction module, configured to perform syntactic feature extraction on the sample video description sentence corresponding to the sample video to obtain its sample syntactic feature vector; a first syntax loss determination module, configured to have the first neural network output a first hidden vector sequence according to the sample syntactic feature vector and to compute a first syntax loss from that sequence; a first semantic loss determination module, configured to have the second neural network output a second hidden vector sequence according to the first hidden vector sequence and the sample video semantic feature vector and to compute a first semantic loss from that sequence; a first target loss calculation module, configured to compute a first target loss from the first syntax loss and the first semantic loss; and a first adjustment module, configured to adjust the parameters of the description generation model based on the first target loss. A sketch of such a combined loss follows.
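As a hedged sketch of how the two terms might be combined (cross-entropy over token sequences and an unweighted sum are assumptions; the source only states that the first target loss is computed from the first syntax loss and the first semantic loss):

```python
import torch
import torch.nn.functional as F

def first_target_loss(parse_logits: torch.Tensor, parse_targets: torch.Tensor,
                      word_logits: torch.Tensor, word_targets: torch.Tensor) -> torch.Tensor:
    # Syntax term: predicted parse-tree token sequence vs. the actual parse tree.
    syntax_loss = F.cross_entropy(parse_logits.flatten(0, 1), parse_targets.flatten())
    # Semantic term: predicted caption words vs. the reference description sentence.
    semantic_loss = F.cross_entropy(word_logits.flatten(0, 1), word_targets.flatten())
    return syntax_loss + semantic_loss
```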

In some embodiments of the present application, the first syntax loss determination module includes: a parse tree prediction unit, configured to predict a parse tree for the sample description sentence from the first hidden vector sequence through a sixth neural network, the sixth neural network being a gated recurrent neural network; and a first syntax loss calculation unit, configured to compute the first syntax loss from the predicted parse tree and the actual parse tree of the sample description sentence.

In some embodiments of the present application, the first semantic loss determination module includes: a first description sentence output unit, configured to output a first description sentence for the sample video from the second hidden vector sequence through a fifth multilayer perceptron; and a first semantic loss calculation unit, configured to compute the first semantic loss from the first description sentence and the sample video description sentence corresponding to the sample video.

In some embodiments of the present application, the apparatus for generating video description sentences further includes: a first sample syntactic feature vector acquisition module, configured to obtain sample syntactic feature vectors of sample sentences, the sample sentences including sample exemplar sentences and the sample video description sentences corresponding to sample videos; a second syntax loss calculation module, configured to have the first neural network output a first hidden vector sequence according to the sample syntactic feature vector of a sample sentence and to compute a second syntax loss from the first hidden vector sequence corresponding to the sample sentence; a second semantic loss calculation module, configured to have the second neural network output a second hidden vector sequence according to the sample semantic feature vector of the sample sentence and the first hidden vector sequence corresponding to the sample sentence, where the sample semantic feature vector is obtained by performing semantic feature extraction on the sample sentence, and to compute a second semantic loss from the second hidden vector sequence corresponding to the sample sentence; a second target loss calculation module, configured to compute a second target loss from the second syntax loss and the second semantic loss; and a second adjustment module, configured to adjust the parameters of the description generation model based on the second target loss.

In some embodiments of the present application, the training data further includes a number of sample exemplar sentences, and the apparatus further includes: a second sample syntactic feature vector acquisition module, configured to obtain the sample syntactic feature vectors of the sample exemplar sentences; a first hidden vector sequence output module, configured to have the first neural network output a first hidden vector sequence according to the sample syntactic feature vector of a sample exemplar sentence; a second hidden vector sequence output module, configured to have the second neural network output a second hidden vector sequence according to the first hidden vector sequence corresponding to the sample exemplar sentence and the sample video semantic feature vector of a sample video; a second description sentence determination module, configured to determine a second description sentence from the second hidden vector sequence corresponding to the sample video; a third target loss calculation module, configured to compute a third target loss from the parse tree corresponding to the sample exemplar sentence and the parse tree corresponding to the second description sentence; and a third adjustment module, configured to adjust the parameters of the description generation model based on the third target loss.

In some embodiments of the present application, the obtaining module includes: a character feature vector obtaining unit, configured to obtain character feature vectors of the characters included in each word of the target exemplar sentence, a character feature vector being obtained by encoding a character; a third hidden vector output unit, configured to have a third neural network output a third hidden vector for each character according to the character feature vector of that character; an averaging unit, configured to average, for each word in the target exemplar sentence, the third hidden vectors of the characters in that word, to obtain the feature vector of the word; and a fourth hidden vector output unit, configured to have a fourth neural network output fourth hidden vectors according to the feature vectors of the words in the target exemplar sentence, the fourth hidden vectors serving as the syntactic feature vector. The third and fourth neural networks are gated recurrent neural networks.

In some embodiments of the present application, the apparatus for generating video description sentences further includes: a video frame sequence acquisition module, configured to obtain the video frame sequence produced by splitting the target video into frames; a semantic extraction module, configured to perform semantic extraction on each video frame in the sequence through a convolutional neural network, to obtain a semantic vector for each frame; and a fifth hidden vector output module, configured to have a fifth neural network output fifth hidden vectors according to the semantic vectors of the frames in the sequence, the fifth hidden vectors serving as the video semantic feature vector. The fifth neural network is a gated recurrent neural network. A sketch of this frame-level pipeline follows.
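A hedged sketch of the video side: a CNN produces a semantic vector per frame, and a gated recurrent network runs over the frame sequence. ResNet-50, the GRU, and all sizes are assumptions; the source only requires a convolutional network followed by a gated recurrent network:

```python
import torch
import torch.nn as nn
from torchvision import models

cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = nn.Identity()   # keep the 2048-d pooled feature of each frame
cnn.eval()
rnn = nn.GRU(input_size=2048, hidden_size=512, batch_first=True)  # fifth network

frames = torch.randn(1, 16, 3, 224, 224)                  # (batch, time, C, H, W)
with torch.no_grad():
    feats = cnn(frames.flatten(0, 1)).view(1, 16, 2048)   # per-frame semantic vectors
video_semantic_features, _ = rnn(feats)                   # fifth hidden vectors
```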

According to one aspect of the embodiments of the present application, an electronic device is provided, including: a processor; and a memory storing computer-readable instructions which, when executed by the processor, implement the method described above.

According to one aspect of the embodiments of the present application, a computer-readable storage medium is provided, storing computer-readable instructions which, when executed by a processor, implement the method described above.

In the technical solutions provided by some embodiments of the present application, syntax information that guides the syntactic structure of the video description sentence to be generated is first obtained from the syntactic feature vector of the target exemplar sentence; semantic information is then determined from the syntax information and the video semantic feature vector of the target video, so that the semantics of the sentence to be generated fit the syntactic structure indicated by the syntactic feature vector; finally, the video description sentence of the target video is generated from the semantic information. This not only ensures that the generated video description sentence has a syntactic structure similar to that of the target exemplar sentence, but also ensures that the sentence is semantically related to the target video, that is, that it accurately describes the content of the target video.

Because the syntax of the generated video description sentence is controlled by the syntactic feature vector of the target exemplar sentence, selecting target exemplar sentences with different syntactic structures for the same target video yields video description sentences with different syntactic structures. Diverse video description sentences can therefore be generated for the same target video simply by changing the target exemplar sentence, which effectively solves the single-syntax problem of video description sentences in the prior art.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present application.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application. Obviously, the drawings described below are only some embodiments of the present application; other drawings can be derived from them by a person of ordinary skill in the art without creative effort. In the drawings:

FIG. 1 is a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present application can be applied;

FIG. 2 is a flowchart of a method for generating a video description sentence according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a long short-term memory network;

FIG. 4 is a schematic diagram of video description sentences generated for the same target video under different target exemplar sentences, according to a specific embodiment;

FIG. 5 is a flowchart of outputting the first hidden vector according to an embodiment;

FIG. 6 is a flowchart of outputting the second hidden vector according to an embodiment;

FIG. 7 is a flowchart of training the description generation model according to an embodiment;

FIG. 8 is a flowchart of training the description generation model according to another embodiment;

FIG. 9 is a flowchart of training the description generation model according to yet another embodiment;

FIG. 10 is a schematic diagram of generating a video description sentence according to an embodiment;

FIG. 11 is a block diagram of an apparatus for generating video description sentences according to an embodiment;

FIG. 12 is a schematic structural diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this application will be thorough and complete and will fully convey the concepts of the example embodiments to those skilled in the art.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present application. However, those skilled in the art will appreciate that the technical solutions of the present application may be practiced without one or more of these specific details, or with other methods, components, devices, steps, and so on. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the present application.

The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities; that is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flowcharts shown in the drawings are only exemplary illustrations; they do not necessarily include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed while others may be combined or partially combined, so the actual execution order may change according to the situation.

Artificial intelligence (AI) comprises the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning, and decision-making. It is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing (NLP), and machine learning/deep learning.

FIG. 1 is a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present application can be applied.

As shown in FIG. 1, the system architecture may include terminal devices (one or more of the smartphone 101, tablet computer 102, and portable computer 103 shown in FIG. 1, or of course a desktop computer and the like), a network 104, and a server 105. The network 104 is the medium that provides the communication links between the terminal devices and the server 105, and may include various connection types, such as wired or wireless communication links.

It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs; for example, the server 105 may be a server cluster composed of multiple servers.

In an embodiment of the present application, the server may obtain a target exemplar sentence and a target video uploaded from a terminal device, perform syntactic feature extraction on the target exemplar sentence to obtain its syntactic feature vector, and perform video semantic extraction on the target video to obtain its video semantic feature vector.

In an embodiment of the present application, the server may also store a set of exemplar sentences. The server may receive an exemplar sentence selection instruction sent by a terminal device, determine the target exemplar sentence according to the instruction, and then perform syntactic feature extraction on it.

Of course, in other embodiments, the server may also store a number of videos for the user to choose from; that is, the server may receive a video selection instruction sent by the terminal device and take the selected video as the target video for which a video description sentence is to be generated.

In an embodiment of the present application, after obtaining the syntactic feature vector of the target exemplar sentence and the video semantic feature vector of the target video, the server generates a video description sentence for the target video based on the two vectors, so that the generated sentence, on the one hand, has the same or a similar syntax as the target exemplar sentence and, on the other hand, is semantically related to the video content of the target video.

In an embodiment of the present application, after generating the video description sentence for the target video, the server feeds the sentence back to the terminal device so that the terminal device can present it to the user.

It should be noted that the method for generating video description sentences provided by the embodiments of the present application is generally executed by the server 105, and accordingly the apparatus for generating video description sentences is generally arranged in the server 105. However, in other embodiments of the present application, a terminal device may have functions similar to the server and thus execute the method provided by the embodiments of the present application.

The implementation details of the technical solutions of the embodiments of the present application are elaborated below.

FIG. 2 is a flowchart of a method for generating a video description sentence according to an embodiment of the present application. The method can be executed by a device with computing and processing capabilities, for example by the server 105 shown in FIG. 1. Referring to FIG. 2, the method includes at least steps 210 to 240, described in detail as follows.

Step 210: obtain the syntactic feature vector of the target exemplar sentence.

The target exemplar sentence is the exemplar sentence used to constrain the syntactic structure of the video description sentence to be generated.

In some embodiments of the present application, a set of exemplar sentences may be built in advance, and the user selects one of them as the target exemplar sentence, thereby constraining the syntactic structure of the video description sentence to be generated. Of course, in other embodiments, the target exemplar sentence may also be a sentence uploaded by the user through a terminal device.

The syntactic feature vector of the target exemplar sentence describes its syntactic structure; it reflects the dependency relations between the words of the sentence and its syntactic structure information (for example subject-predicate-object and attribute-adverbial-complement relations).

The syntactic feature vector of the target exemplar sentence can be obtained by syntactic analysis of the sentence. The syntactic analysis may be constituency parsing (also called phrase-structure or constituent parsing), dependency parsing (also called dependency analysis), or deep-grammar syntactic analysis.

In some embodiments of the present application, a parsing tool such as Stanford CoreNLP, HanLP, spaCy, or FudanNLP may be used to parse the target exemplar sentence, and the syntactic feature vector is then generated from the parsing result. The parsing result may be a constituency parse tree generated for the target exemplar sentence; serializing the constituency parse tree yields the syntactic feature vector of the target exemplar sentence. The snippet below illustrates one of these tools.
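For illustration only, here is spaCy (one of the tools named above) in action. spaCy's default pipeline produces a dependency analysis; obtaining a constituency parse tree as described would instead use a tool such as Stanford CoreNLP. The model name and example sentence are assumptions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline (assumed installed)
doc = nlp("A man is playing a guitar on the stage.")
for token in doc:
    # token.dep_ is the dependency relation of the token to its head word
    print(token.text, token.dep_, token.head.text)
```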

In some embodiments of the present application, syntactic feature extraction may be performed by two cascaded gated recurrent neural networks to obtain the syntactic feature vector of the target exemplar sentence.

Specifically, the character feature vectors of the characters in each word of the target exemplar sentence are obtained first, a character feature vector being the encoding of a character. Then a third neural network outputs a third hidden vector for each character according to its character feature vector. Next, for each word in the target exemplar sentence, the third hidden vectors of the characters in that word are averaged to obtain the feature vector of the word. Finally, a fourth neural network produces a fourth hidden vector sequence from the feature vectors of the words of the target exemplar sentence, and this fourth hidden vector sequence serves as the syntactic feature vector. The third and fourth neural networks are gated recurrent neural networks.

Here, a gated recurrent neural network may be a long short-term memory network (LSTM) or a gated recurrent unit network (GRU).

The GRU is a variant of the LSTM. Compared with the LSTM, the GRU merges the forget gate and the input gate into a single update gate and adds a reset gate; it also drops the LSTM's distinction between the cell state vector (internal state) and the hidden vector (external state), instead introducing a direct linear dependency between the current network state h_t and the state of the previous time step h_{t-1}. The textbook GRU update shown below makes this explicit.
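For reference, the standard GRU update, with update gate $z_t$ and reset gate $r_t$ (a textbook formulation, not quoted from the patent):

$$
\begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]), \\
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]), \\
\tilde{h}_t &= \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t]), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t.
\end{aligned}
$$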

In the following, the process of generating the syntactic feature vector for the target exemplar sentence is described with both the third and fourth neural networks taken to be LSTMs. Before going into detail, it is necessary to describe the structure of the LSTM and the computations involved.

FIG. 3 is a schematic diagram of an LSTM. As shown in FIG. 3, at any time step $t$, the LSTM has three inputs: the input vector $x_t$ at time $t$, the hidden vector $h_{t-1}$ output at the previous time step, and the cell state vector $c_{t-1}$ of the previous time step. The cell state vector reflects the state of the memory cell at the corresponding time step, and the hidden vector is the LSTM's output at that time step.

As shown in FIG. 3, the LSTM contains a forget gate, an input gate, and an output gate. The forget gate decides how much of the previous cell state vector $c_{t-1}$ is retained in the current cell state vector $c_t$; the input gate decides how much of the current input vector $x_t$ is stored into the current cell state vector $c_t$; and the output gate controls how much of the cell state vector $c_t$ is emitted as the LSTM's current output $h_t$.

In the LSTM, the hidden vector and cell state vector at each time step are determined by the computations of the forget, input, and output gates. For ease of description, the vector computed directly by the input gate is called the input gate vector, the vector computed by the output gate is called the output gate vector, and the vector computed by the forget gate is called the forget gate vector.

For any time step $t$, the forget gate vector $f_t$, input gate vector $i_t$, candidate vector $g_t$, cell state vector $c_t$, output gate vector $o_t$, and hidden vector $h_t$ are computed as follows.

The forget gate vector $f_t$ is:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f), \qquad (1)$$

where $\sigma$ is the sigmoid function, whose range is $(0, 1)$; $W_f$ is the weight matrix of the forget gate; $[h_{t-1}, x_t]$ denotes the concatenation of the two vectors; and $b_f$ is the bias term of the forget gate. $W_f$ and $b_f$ can be determined by training.

The input gate vector $i_t$ is:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \qquad (2)$$

where $W_i$ is the weight matrix of the input gate and $b_i$ its bias term, both determined by training.

The LSTM also computes a candidate vector, which describes the current input. The candidate vector $g_t$ is:

$$g_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c), \qquad (3)$$

where $\tanh$ is the hyperbolic tangent function, $W_c$ is a weight matrix, and $b_c$ is a bias term; $W_c$ and $b_c$ can be determined by training.

The candidate vector $g_t$ is used to compute the cell state vector $c_t$:

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad (4)$$

where the symbol $\odot$ denotes element-wise multiplication.

The output gate vector $o_t$ is:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad (5)$$

and the hidden vector $h_t$ is:

$$h_t = o_t \odot \tanh(c_t). \qquad (6)$$
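A compact NumPy rendering of equations (1) through (6) may make the data flow clearer; the per-gate weight/bias layout is an assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W and b are dicts holding the weight matrix and bias
    of each gate, e.g. W["f"], b["f"] for the forget gate."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # (1) forget gate vector
    i_t = sigmoid(W["i"] @ z + b["i"])      # (2) input gate vector
    g_t = np.tanh(W["c"] @ z + b["c"])      # (3) candidate vector
    c_t = f_t * c_prev + i_t * g_t          # (4) cell state vector
    o_t = sigmoid(W["o"] @ z + b["o"])      # (5) output gate vector
    h_t = o_t * np.tanh(c_t)                # (6) hidden vector
    return h_t, c_t
```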

继续回到为目标范例句生成句法特征向量的过程。例如,针对“apple”这个词,其字符序列为:“a-p-p-l-e”,先将该字符序列中每一字符的字符特征向量顺序输入到第三神经网络LSTMc中,得到每一字符对应的第三隐向量。具体的,假设目标范例句中第n个词中第l个字符的字符特征向量为

Figure BDA00026134048800001314
其输入到第三神经网络LSTMc中,其在第三神经网络LSTMc网络中的计算过程可以描述为:Continue back to the process of generating syntactic feature vectors for the target paradigm example sentences. For example, for the word "apple", its character sequence is: "apple", first input the character feature vector of each character in the character sequence into the third neural network LSTMc, and obtain the third hidden value corresponding to each character. vector. Specifically, it is assumed that the character feature vector of the l-th character in the n-th word in the target paradigm example sentence is
Figure BDA00026134048800001314
It is input into the third neural network LSTMc, and its calculation process in the third neural network LSTMc network can be described as:

Figure BDA0002613404880000132
Figure BDA0002613404880000132

其中,

Figure BDA0002613404880000133
为第三神经网络针对第n个词中第l-1个字符所输出的隐向量,
Figure BDA0002613404880000134
为第三神经网络针对第n个词中第l-1个字符所得到的细胞单元向量,
Figure BDA0002613404880000135
为第三神经网络针对第n个词中第1个字符所得到的细胞单元向量.
Figure BDA0002613404880000136
为第三神经网络第n个词中第1个字符所输出的隐向量,为便于区分,将第三神经网络所输出的隐向量称为第三隐向量。in,
Figure BDA0002613404880000133
is the latent vector output by the third neural network for the l-1th character in the nth word,
Figure BDA0002613404880000134
is the cell unit vector obtained by the third neural network for the l-1th character in the nth word,
Figure BDA0002613404880000135
is the cell unit vector obtained by the third neural network for the first character in the nth word.
Figure BDA0002613404880000136
is the latent vector output by the first character in the nth word of the third neural network. For the convenience of distinction, the latent vector output by the third neural network is called the third latent vector.

得到各个字符所对应的第三隐向量后,根据该第n个词中各个字符所对应的第三隐向量进行平均计算,将所得到的平均向量作为该第n个词的特征向量wnAfter the third latent vector corresponding to each character is obtained, average calculation is performed according to the third latent vector corresponding to each character in the nth word, and the obtained average vector is used as the feature vector w n of the nth word:

$$w_n = \frac{1}{L_n}\sum_{l=1}^{L_n} h_{n,l}^c \quad (8)$$

where L_n is the number of characters in the n-th word.

Finally, the feature vectors of the words in the target example sentence are input in order into the fourth neural network LSTM_w, where the processing of the feature vector w_n of the n-th word in the fourth neural network can be described as:

$$h_n^w,\; c_n^w = \mathrm{LSTM}_w\left(w_n,\; h_{n-1}^w,\; c_{n-1}^w\right) \quad (9)$$

where h_n^w is the hidden vector output by the fourth neural network for the feature vector of the n-th word; c_n^w is the cell state vector output by the fourth neural network for the feature vector of the n-th word; h_{n-1}^w is the hidden vector output by the fourth neural network for the feature vector of the (n-1)-th word; and c_{n-1}^w is the cell state vector output by the fourth neural network for the feature vector of the (n-1)-th word. For ease of distinction, the hidden vectors output by the fourth neural network are called fourth hidden vectors.

Thus, the fourth hidden vectors output at the successive steps are combined into the fourth hidden vector sequence H^s = [h_1^w, ..., h_N^w], where N is the number of words in the target example sentence. This fourth hidden vector sequence H^s serves as the syntactic feature vector of the target example sentence and is used to control the syntax of the video description sentence to be generated.

Through the above process, a feature vector is obtained for each word in the target example sentence on the basis of character-level encoding. Starting the encoding at the character level ensures that the resulting word feature vectors fully capture the characteristics of each word.
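As an illustrative sketch (not part of the patent text), the character-level encoding of equations (7)-(9) could be realized in PyTorch as follows; the embedding dimension, hidden size, and module names are assumptions.

```python
import torch
import torch.nn as nn

class SyntaxEncoder(nn.Module):
    """LSTM_c over characters, averaged into word vectors (eq. 8),
    then LSTM_w over words; the stacked word states form H^s."""
    def __init__(self, n_chars, char_dim=64, hid=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm_c = nn.LSTMCell(char_dim, hid)   # third neural network
        self.lstm_w = nn.LSTMCell(hid, hid)        # fourth neural network

    def forward(self, words):  # words: list of 1-D tensors of character ids
        h_w = c_w = torch.zeros(1, self.lstm_w.hidden_size)
        H_s = []
        for chars in words:
            h_c = c_c = torch.zeros(1, self.lstm_c.hidden_size)
            char_states = []
            for ch in chars:                                   # eq. (7)
                h_c, c_c = self.lstm_c(self.char_emb(ch.view(1)), (h_c, c_c))
                char_states.append(h_c)
            w_n = torch.stack(char_states).mean(dim=0)         # eq. (8)
            h_w, c_w = self.lstm_w(w_n, (h_w, c_w))            # eq. (9)
            H_s.append(h_w)
        return torch.cat(H_s, dim=0)  # H^s: one fourth hidden vector per word
```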

Referring again to FIG. 2, in step 220, the syntax of the video description sentence to be generated is determined according to the syntactic feature vector, yielding the syntax information.

A video description sentence is generated by predicting and outputting one word at a time, and the words output at different time steps play different syntactic roles in the sentence. For example, if the target example sentence has a subject-predicate-object structure, the corresponding words are output in the order subject, predicate, object, forming the video description sentence.

Since the syntactic feature vector of the target example sentence describes the dependencies between words and the syntactic structure information of the target example sentence, the syntactic structure of the target example sentence can be determined from its syntactic feature vector. This syntactic structure is taken as the syntactic structure of the video description sentence to be generated, and on this basis the syntax information guides the sentence component to be output at each time step.

That is, the syntax information indicates the sentence component of the video description sentence to be output at each time step, such as subject, predicate, object, attribute, or adverbial. Since the video description sentence to be generated shares the syntactic structure of the target example sentence, the sentence component to be output at each time step is determined according to the syntactic structure indicated by the syntactic feature vector, ensuring that the syntactic structure of the output video description sentence is consistent with that of the target example sentence.

Step 230: determine, according to the syntax information and the video semantic feature vector of the target video, the semantics of the video description sentence to be generated that correspond to the syntax, obtaining the semantic information.

The target video does not refer to one specific video; it refers generally to any video for which a video description sentence is to be generated.

The video semantic feature vector of the target video describes the content of the video, which can be understood as the semantics of the video. The content of the video may include the objects in the video (such as people, animals, items, and scenery) and the behaviors of those objects.

Objects in the video can be determined by performing object recognition on individual video frames; the behavior of an object can be determined by performing action recognition over several consecutive video frames.

In some embodiments of the present application, to obtain the video semantic feature vector of the target video, the target video may first be divided into frames, and object recognition, action recognition, and the like may be performed based on the image features of each video frame, thereby generating the video semantic feature vector of the target video.

In some embodiments of the present application, the video semantic feature vector of the target video can be extracted as follows. First, the video frame sequence obtained by dividing the target video into frames is acquired; then, semantic extraction is performed on each video frame in the sequence by a convolutional neural network, yielding the semantic vector of each video frame; finally, a fifth neural network outputs a fifth hidden vector sequence according to the semantic vectors of the video frames, and this fifth hidden vector sequence serves as the video semantic feature vector. The fifth neural network is a gated recurrent neural network.

The hidden vectors output by the fifth neural network are called fifth hidden vectors, and the fifth hidden vector sequence is obtained by combining the fifth hidden vectors output by the fifth neural network for the respective video frames.

Suppose the semantic vectors of the video frames are combined into the video semantic sequence V = [v_1, ..., v_m, ..., v_M], where v_m is the semantic vector of the m-th video frame.

Then, the video semantic sequence is input into the fifth neural network (for example, a long short-term memory network LSTM_v) for encoding, obtaining the feature sequence H^v = [h_1^v, ..., h_M^v] that contains the video context. The processing of the semantic vector v_m of the m-th video frame in the fifth neural network LSTM_v can be described as:

$$h_m^v,\; c_m^v = \mathrm{LSTM}_v\left(v_m,\; h_{m-1}^v,\; c_{m-1}^v\right) \quad (10)$$

where h_{m-1}^v is the hidden vector output by the fifth neural network LSTM_v for the semantic vector of the (m-1)-th video frame; c_{m-1}^v is the cell state vector output by LSTM_v for the semantic vector of the (m-1)-th video frame; h_m^v is the hidden vector output by LSTM_v for the semantic vector of the m-th video frame; and c_m^v is the cell state vector output by LSTM_v for the semantic vector of the m-th video frame. For ease of distinction, the hidden vectors output by the fifth neural network LSTM_v are called fifth hidden vectors.

The fifth hidden vectors output by the fifth neural network LSTM_v for the video frames of the target video are combined in the temporal order of the frames, yielding the fifth hidden vector sequence.
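For illustration, a sketch of this frame-level extraction in PyTorch follows; it is not part of the patent text, and the choice of ResNet-18 as the convolutional network, along with all sizes and names, is an assumption.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VideoEncoder(nn.Module):
    """Per-frame CNN semantic vectors v_1..v_M encoded by LSTM_v into H^v."""
    def __init__(self, hid=512):
        super().__init__()
        cnn = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier
        self.lstm_v = nn.LSTMCell(512, hid)                   # fifth network

    def forward(self, frames):               # frames: (M, 3, H, W)
        feats = self.cnn(frames).flatten(1)  # semantic vectors, (M, 512)
        h = c = torch.zeros(1, self.lstm_v.hidden_size)
        H_v = []
        for v_m in feats:                    # eq. (10), frame by frame
            h, c = self.lstm_v(v_m.unsqueeze(0), (h, c))
            H_v.append(h)
        return torch.cat(H_v, dim=0)         # fifth hidden vector sequence
```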

As described above, the syntax information indicates the sentence component to be output at each time step. Therefore, to ensure that the output video description sentence accurately expresses the content of the video, semantics are assigned to each sentence component by combining the video semantic feature vector of the target video with the syntax information, yielding the semantic information corresponding to each sentence component.

It can be understood that if the syntactic structure of the target example sentence changes, the sentence components to be output in sequence change correspondingly. The semantic information output for the target video is therefore controlled by the syntactic structure imposed by the target example sentence.

Step 240: generate the video description sentence of the target video according to the semantic information.

In some embodiments of the present application, after the semantic information corresponding to each syntactic component is obtained, words are predicted according to the semantic information, and the words predicted at the successive time steps are combined in order to obtain the video description sentence of the target video.

In some embodiments of the present application, a vocabulary is deployed in advance for word prediction. Based on the obtained semantic information, the probability that each word in the vocabulary corresponds to the semantic information is predicted, and the word corresponding to the semantic information is then determined from the predicted probabilities, for example by taking the word with the highest probability as the word corresponding to the semantic information.
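A minimal sketch of this vocabulary prediction step follows (illustrative only; the linear projection and all sizes are assumptions):

```python
import torch
import torch.nn as nn

vocab_size, hid = 10000, 512            # illustrative sizes
to_vocab = nn.Linear(hid, vocab_size)   # projects semantic state to vocabulary

def predict_word(h_sem_t):
    """Pick the word for one time step from the semantic hidden state."""
    probs = torch.softmax(to_vocab(h_sem_t), dim=-1)  # probability per word
    return probs.argmax(dim=-1)                       # highest-probability word
```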

Through the solution of the present application, syntax information for guiding the syntactic structure of the video description sentence to be generated is first obtained from the syntactic feature vector of the target example sentence; the semantic information of the video description sentence corresponding to the syntactic structure indicated by the syntactic feature vector is then determined according to the syntax information and the video semantic feature vector of the target video; finally, the video description sentence is generated for the target video according to the semantic information. This not only ensures that the generated video description sentence has the same syntactic structure as the target example sentence, but also ensures that it is semantically related to the target video, i.e., the video description sentence can accurately describe the content of the target video.

It can be understood that, for the same target video, selecting target example sentences with different syntactic structures to constrain the syntax of the video description sentence yields video description sentences with different syntactic structures. Thus, by changing the target example sentence, video description sentences with different syntactic structures can be generated for the same target video, achieving diverse video description sentences for the target video.

Referring to FIG. 4, for the same target video: if the target example sentence is "Aerial view of a group of sheeps on a grass field.", the video description sentence generated for the target video by the method of the present application based on that example sentence is "Cooking video of a recipe with ingredients in a glass bowl."; if the target example sentence is "Female patient watching TV and remoting control in hand in hospital bed", the generated video description sentence is "Woman cook slicing egg and mixing salad in bow in kitchen table"; and if the target example sentence is "Water splashes when a coin dropped in a glass", the generated video description sentence is "Egg scatters when a knife cut at board".

In some embodiments of the present application, the processes of step 220 and step 230 are each implemented by a gated recurrent neural network.

In this embodiment, step 220 includes: generating, by a first neural network included in a description generation model, a first hidden vector according to the syntactic feature vector, the first hidden vector indicating the syntax information. The description generation model further includes a second neural network; the first neural network and the second neural network are gated recurrent neural networks.

In some embodiments of the present application, the first neural network outputs the first hidden vector at time t according to the syntactic feature vector, the word vector at time t-1, and the first hidden vector at time t-1 generated by the first neural network.

Specifically, the first hidden vector can be generated through steps 510-530 shown in FIG. 5, described as follows:

Step 510: perform soft attention weighting on the syntactic feature vector according to the first hidden vector at time t-1 to obtain the target syntactic feature vector corresponding to time t.

Step 520: concatenate the target syntactic feature vector corresponding to time t with the word vector at time t-1 to obtain the first concatenated vector corresponding to time t.

Step 530: the first neural network takes the first concatenated vector corresponding to time t as input and outputs the first hidden vector at time t.

Soft attention weighting, also known as the soft attention mechanism, selectively down-weights part of the information and performs a reweighted aggregation over the remaining information.
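The patent text does not spell out the scoring function of the soft attention; a common choice is additive attention, sketched below for illustration (all module names are assumptions).

```python
import torch
import torch.nn as nn

class SoftAtt(nn.Module):
    """Soft attention: score each row of H against a query q, then return
    the softmax-weighted sum of the rows (a reweighted aggregation)."""
    def __init__(self, dim):
        super().__init__()
        self.w_h = nn.Linear(dim, dim, bias=False)
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, H, q):   # H: (N, dim), q: (1, dim)
        scores = self.v(torch.tanh(self.w_h(H) + self.w_q(q)))  # (N, 1)
        alpha = torch.softmax(scores, dim=0)                    # weights
        return (alpha * H).sum(dim=0, keepdim=True)             # (1, dim)
```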

Continuing the above example, steps 510-530 are described below, taking the first neural network to be a long short-term memory network.

Suppose the first hidden vector output by the first neural network at time t-1 is h_{t-1}^{syn}. Soft attention weighting of the syntactic feature vector H^s = [h_1^w, ..., h_N^w] of the target example sentence by this first hidden vector h_{t-1}^{syn} can be described as:

$$\psi_t^s = \mathrm{SoftAtt}\left(H^s,\; h_{t-1}^{syn}\right) \quad (11)$$

where ψ_t^s is the target syntactic feature vector corresponding to time t, and SoftAtt(·) denotes the soft attention weighting operation described above.

The target syntactic feature vector ψ_t^s at time t is concatenated with the word vector e_{t-1} at time t-1, and the resulting first concatenated vector corresponding to time t is [ψ_t^s; e_{t-1}].

Then, the first concatenated vector [ψ_t^s; e_{t-1}] corresponding to time t serves as the input vector of the first neural network LSTM_syn at time t, and the first neural network LSTM_syn outputs the first hidden vector h_t^{syn} corresponding to time t. This process can be described as:

$$h_t^{syn},\; c_t^{syn} = \mathrm{LSTM}_{syn}\left([\psi_t^s;\, e_{t-1}],\; h_{t-1}^{syn},\; c_{t-1}^{syn}\right) \quad (12)$$

where c_t^{syn} is the cell state vector of the first neural network LSTM_syn at time t, h_t^{syn} is the first hidden vector of the first neural network LSTM_syn at time t, and c_{t-1}^{syn} is the cell state vector of the first neural network LSTM_syn at time t-1.

When the first neural network is a long short-term memory network LSTM_syn, its structure can be seen in FIG. 3. The first neural network includes a first input gate, a first forget gate, and a first output gate. Step 530 includes: computing, by the first forget gate, the first forget gate vector at time t according to the first concatenated vector corresponding to time t; computing, by the first input gate, the first input gate vector at time t according to the first concatenated vector corresponding to time t; computing the first cell state vector at time t according to the first forget gate vector at time t, the first input gate vector at time t, the first cell input vector at time t, and the first cell state vector of the first neural network at time t-1, where the first cell input vector at time t is obtained by a hyperbolic tangent computation on the first concatenated vector corresponding to time t; and finally, computing the first hidden vector at time t according to the first cell state vector at time t and the first output gate vector at time t, where the first output gate vector at time t is computed by the first output gate according to the first concatenated vector corresponding to time t.

In this embodiment, the first forget gate vector refers to the forget gate vector in the first neural network; similarly, the first input gate vector, the first output gate vector, and the first cell state vector refer respectively to the input gate vector, the output gate vector, and the cell state vector in the first neural network.

The computation of the first input gate vector, the first forget gate vector, the first output gate vector, the first cell state vector, the first cell input vector, and the first hidden vector at time t follows formulas (1)-(6) above and is not repeated here.

In some embodiments of the present application, the second neural network, a gated recurrent neural network, implements step 230 as follows: the second neural network generates a second hidden vector according to the first hidden vector and the video semantic feature vector, and the second hidden vector indicates the semantic information.

In some embodiments of the present application, the second neural network outputs the second hidden vector at time t according to the video semantic feature vector, the first hidden vector at time t, and the second hidden vector at time t-1 generated by the second neural network.

When the second neural network is a long short-term memory network, the generation of the second hidden vector at time t may include steps 610-630 shown in FIG. 6, described as follows:

Step 610: perform soft attention weighting on the video semantic feature vector according to the second hidden vector at time t-1 to obtain the target video semantic vector corresponding to time t.

Step 620: concatenate the target video semantic vector corresponding to time t with the first hidden vector at time t to obtain the second concatenated vector corresponding to time t.

Step 630: the second neural network takes the second concatenated vector corresponding to time t as input and outputs the second hidden vector at time t.

Continuing the above example, the video semantic feature vector of the target video is H^v = [h_1^v, ..., h_M^v]. Soft attention weighting of the video semantic feature vector H^v by the second hidden vector h_{t-1}^{sem} of the second neural network at time t-1 can be described as:

$$\psi_t^v = \mathrm{SoftAtt}\left(H^v,\; h_{t-1}^{sem}\right) \quad (13)$$

where ψ_t^v is the target video semantic vector corresponding to time t.

Then, the target video semantic vector ψ_t^v at time t is concatenated with the first hidden vector h_t^{syn} output by the first neural network at time t, yielding the second concatenated vector [ψ_t^v; h_t^{syn}] corresponding to time t.

This second concatenated vector [ψ_t^v; h_t^{syn}] corresponding to time t serves as the input vector of the second neural network LSTM_sem at time t, and the second neural network LSTM_sem outputs the second hidden vector at time t. This process can be described as:

$$h_t^{sem},\; c_t^{sem} = \mathrm{LSTM}_{sem}\left([\psi_t^v;\, h_t^{syn}],\; h_{t-1}^{sem},\; c_{t-1}^{sem}\right) \quad (14)$$

where c_t^{sem} is the cell state vector of the second neural network LSTM_sem at time t, h_t^{sem} is the second hidden vector of the second neural network LSTM_sem at time t, and c_{t-1}^{sem} is the cell state vector of the second neural network LSTM_sem at time t-1.

When the second neural network is a long short-term memory network, it includes a second input gate, a second forget gate, and a second output gate. In this embodiment, step 630 includes: computing, by the second forget gate, the second forget gate vector at time t according to the second concatenated vector corresponding to time t; computing, by the second input gate, the second input gate vector at time t according to the second concatenated vector corresponding to time t; computing the second cell state vector at time t according to the second forget gate vector at time t, the second input gate vector at time t, the second cell input vector at time t, and the second cell state vector of the second neural network at time t-1, where the second cell input vector at time t is obtained by a hyperbolic tangent computation on the second concatenated vector corresponding to time t; and computing the second hidden vector at time t according to the second cell state vector at time t and the second output gate vector at time t, where the second output gate vector at time t is computed by the second output gate according to the second concatenated vector corresponding to time t.

In this embodiment, the second forget gate vector refers to the forget gate vector in the second neural network; similarly, the second input gate vector, the second output gate vector, and the second cell state vector refer respectively to the input gate vector, the output gate vector, and the cell state vector in the second neural network.

The computation of the second input gate vector, the second forget gate vector, the second output gate vector, the second cell state vector, the second cell input vector, and the second hidden vector at time t follows formulas (1)-(6) above and is not repeated here.

After the second hidden vector output by the second neural network at time t is obtained, the word vector at time t is determined according to the second hidden vector generated by the second neural network at time t, and the video description sentence is generated from the word vectors output at the successive time steps.

The word vector is the vector encoding of the word to be output. The word vector at time t is predicted from the second hidden vector generated by the second neural network at time t, and the words corresponding to the word vectors predicted at the successive time steps are combined to obtain the video description sentence of the target video.
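Combining the two branches, a greedy decoding loop could be sketched as follows (illustrative only; all modules are the sketches above, and bos_id/eos_id are assumed begin/end-of-sentence token ids):

```python
import torch

def decode(H_s, H_v, att_syn, att_sem, lstm_syn, lstm_sem, to_vocab, emb,
           bos_id, eos_id, max_len=20):
    """Greedy decoding: the syntax branch attends over H^s, the semantic
    branch attends over H^v conditioned on the syntax state, and a linear
    layer predicts one word per time step."""
    h_syn = c_syn = torch.zeros(1, lstm_syn.hidden_size)
    h_sem = c_sem = torch.zeros(1, lstm_sem.hidden_size)
    word, out = torch.tensor([bos_id]), []
    for _ in range(max_len):
        e_prev = emb(word)                           # word vector of previous word
        psi_s = att_syn(H_s, h_syn)                  # eq. (11)
        h_syn, c_syn = lstm_syn(torch.cat([psi_s, e_prev], -1), (h_syn, c_syn))
        psi_v = att_sem(H_v, h_sem)                  # eq. (13)
        h_sem, c_sem = lstm_sem(torch.cat([psi_v, h_syn], -1), (h_sem, c_sem))
        word = torch.softmax(to_vocab(h_sem), -1).argmax(-1)  # next word id
        if word.item() == eos_id:
            break
        out.append(word.item())
    return out
```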

In a model with multiple neural network layers, the output of one layer is the input of the next. Because the neural network involves computations such as linear transformations and activation functions, the inputs of successive layers may differ considerably in their value ranges. If the distribution of the inputs to a given layer changes, its parameters must be relearned; this phenomenon is called internal covariate shift.

Therefore, to avoid internal covariate shift in the description generation model, a conditional layer normalization (CLN) operation is applied to the first input gate vector, the first forget gate vector, the first output gate vector, and the first cell input vector of the first neural network, and to the second input gate vector, the second forget gate vector, the second output gate vector, and the second cell input vector of the second neural network.

The conditional layer normalization operation is defined as:

$$\mathrm{CLN}(x, y) = f_\gamma(y) \odot \frac{x - \mu(x)}{\sigma(x)} + f_\beta(y) \quad (15)$$

where x is the variable to be conditionally layer-normalized, μ(x) is the mean of x, and σ(x) is the standard deviation of x; f_γ(y) is the vector output by one multilayer perceptron taking the condition vector y as input (call it the first vector), and f_β(y) is the vector output by another multilayer perceptron taking the condition vector y as input (call it the second vector). The multilayer perceptron that outputs the first vector is independent of the multilayer perceptron that outputs the second vector.

As can be seen, to perform the conditional layer normalization operation, the variable is first normalized; the first vector then serves as a scaling vector for a scaling transformation of the normalized variable, and the second vector serves as an offset vector for an offset transformation of the normalized variable.
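A sketch of the operation defined in formula (15) follows (illustrative; the depth and activation of the two multilayer perceptrons are assumptions):

```python
import torch
import torch.nn as nn

class CLN(nn.Module):
    """Conditional layer normalization, eq. (15): normalize x, then scale by
    f_gamma(y) and offset by f_beta(y), two independent MLPs on condition y."""
    def __init__(self, x_dim, y_dim, eps=1e-5):
        super().__init__()
        self.f_gamma = nn.Sequential(nn.Linear(y_dim, x_dim), nn.Tanh(),
                                     nn.Linear(x_dim, x_dim))
        self.f_beta = nn.Sequential(nn.Linear(y_dim, x_dim), nn.Tanh(),
                                    nn.Linear(x_dim, x_dim))
        self.eps = eps

    def forward(self, x, y):
        mu = x.mean(dim=-1, keepdim=True)     # mean of x
        sigma = x.std(dim=-1, keepdim=True)   # standard deviation of x
        return self.f_gamma(y) * (x - mu) / (sigma + self.eps) + self.f_beta(y)
```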

To perform the conditional layer normalization operation on the first input gate vector, the first forget gate vector, the first output gate vector, and the first cell input vector of the first neural network, these four vectors are first normalized respectively; the normalized vectors are then transformed respectively according to a first offset vector and a first scaling vector, yielding the target first input gate vector, the target first forget gate vector, the target first output gate vector, and the target first cell input vector. The first offset vector is output by a first multilayer perceptron according to the target syntactic feature vector corresponding to time t, the first scaling vector is output by a second multilayer perceptron according to the target syntactic feature vector corresponding to time t, and the first multilayer perceptron is independent of the second multilayer perceptron.

In this embodiment, computing the first cell state vector at time t according to the first forget gate vector at time t, the first input gate vector at time t, the first cell input vector at time t, and the first cell state vector of the first neural network at time t-1 includes: computing the first cell state vector at time t according to the target first forget gate vector, the target first input gate vector, the target first cell input vector, and the first cell state vector at time t-1.

In this embodiment, computing the first hidden vector at time t according to the first cell state vector at time t and the first output gate vector at time t includes: computing the first hidden vector at time t according to the first cell state vector at time t and the target first output gate vector.

For the conditional layer normalization operation on the first input gate vector, the first forget gate vector, the first output gate vector, and the first cell input vector of the first neural network, the condition vector y above is the target syntactic feature vector corresponding to time t computed above.

The conditional layer normalization operation on the first input gate vector, the first forget gate vector, the first output gate vector, and the first cell input vector of the first neural network can be expressed as:

$$\left[f_t^{syn};\; i_t^{syn};\; o_t^{syn};\; g_t^{syn}\right] = \left[\sigma;\; \sigma;\; \sigma;\; \tanh\right]\, \mathrm{CLN}\!\left(W_h^{syn} h_{t-1}^{syn} + W_i^{syn} x_t^{syn} + b^{syn},\; \psi_t^s\right) \quad (16)$$

where W_h^{syn} and W_i^{syn} are weight matrices, b^{syn} is a bias term, and x_t^{syn} is the first concatenated vector corresponding to time t; W_h^{syn}, W_i^{syn}, and b^{syn} are determined through training. On the left-hand side of equation (16), f_t^{syn}, i_t^{syn}, o_t^{syn}, and g_t^{syn} are respectively the target first forget gate vector, the target first input gate vector, the target first output gate vector, and the target first cell input vector obtained after the conditional layer normalization operation.

After the target first forget gate vector, the target first input gate vector, the target first output gate vector, and the target first cell input vector corresponding to time t are obtained, they participate in the computation of the first hidden vector and the first cell state vector of the long short-term memory network. Specifically, the first cell state vector c_t^{syn} at time t is computed according to formula (17) below, and the first hidden vector h_t^{syn} at time t is computed according to formula (18) below.

$$c_t^{syn} = f_t^{syn} \odot c_{t-1}^{syn} + i_t^{syn} \odot g_t^{syn} \quad (17)$$

$$h_t^{syn} = o_t^{syn} \odot \tanh\left(c_t^{syn}\right) \quad (18)$$
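For illustration, equations (16)-(18) could be sketched as an LSTM cell whose gate pre-activations pass through the CLN sketch above; the exact placement of the normalization is an assumption, not the patent's definitive formulation.

```python
import torch
import torch.nn as nn

class CLNLSTMCell(nn.Module):
    """LSTM step with conditional layer normalization on the gate
    pre-activations, in the spirit of eqs. (16)-(18)."""
    def __init__(self, in_dim, hid, cond_dim):
        super().__init__()
        self.w_i = nn.Linear(in_dim, 4 * hid, bias=False)   # W_i^syn
        self.w_h = nn.Linear(hid, 4 * hid, bias=False)      # W_h^syn
        self.b = nn.Parameter(torch.zeros(4 * hid))         # b^syn
        self.cln = CLN(4 * hid, cond_dim)                   # CLN sketch above

    def forward(self, x_t, h_prev, c_prev, cond):
        pre = self.cln(self.w_h(h_prev) + self.w_i(x_t) + self.b, cond)  # eq. (16)
        f, i, o, g = pre.chunk(4, dim=-1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        g = torch.tanh(g)
        c_t = f * c_prev + i * g          # eq. (17)
        h_t = o * torch.tanh(c_t)         # eq. (18)
        return h_t, c_t
```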

The conditional layer normalization operation on the second input gate vector, the second forget gate vector, the second output gate vector, and the second cell input vector of the second neural network proceeds as follows: these four vectors are first normalized respectively; the normalized vectors are then transformed respectively according to a second offset vector and a second scaling vector, yielding the target second input gate vector, the target second forget gate vector, the target second output gate vector, and the target second cell input vector. The second offset vector is output by a third multilayer perceptron according to the target video semantic vector corresponding to time t, the second scaling vector is output by a fourth multilayer perceptron according to the target video semantic vector corresponding to time t, and the third multilayer perceptron is independent of the fourth multilayer perceptron.

In this embodiment, computing the second cell state vector at time t according to the second forget gate vector at time t, the second input gate vector at time t, the second cell input vector at time t, and the second cell state vector of the second neural network at time t-1 includes: computing the second cell state vector at time t according to the target second forget gate vector, the target second input gate vector, the target second cell input vector, and the second cell state vector at time t-1.

In this embodiment, computing the second hidden vector at time t according to the second cell state vector at time t and the second output gate vector at time t includes: computing the second hidden vector at time t according to the second cell state vector at time t and the target second output gate vector.

The computations involved in the conditional layer normalization operation on the second input gate vector, the second forget gate vector, the second output gate vector, and the second cell input vector of the second neural network are similar to those for the first neural network; reference may be made to the computation described above for the conditional layer normalization operation in the first neural network, which is not repeated here.

A gated recurrent neural network adjusts the structure of a simple recurrent neural network by adding a gating mechanism to control the flow of information through the network. The gating mechanism can control how much information in the memory cell is retained, how much is discarded, and how much new state information is saved into the memory cell. This allows gated recurrent neural networks to learn dependencies of relatively long span without suffering from vanishing or exploding gradients.

Moreover, a gated recurrent neural network retains the strength of the simple recurrent neural network, namely the ability to process data streams with dependencies, such as the target example sentence, the target video, the syntactic feature vector of the target example sentence, and the video semantic feature vector of the target video. With the gating mechanism, dependencies of relatively long span can be learned without vanishing or exploding gradients.

It is worth mentioning that although the solution of the present application has been illustrated above with the first and second neural networks being long short-term memory networks, the solution is not limited to long short-term memory networks; it can also be implemented with gated recurrent networks, following the process described above for the long short-term memory implementation.

In some embodiments of the present application, to ensure the accuracy of the video description sentences predicted by the description generation model, the description generation model also needs to be trained. The training process may include steps 710-760 shown in FIG. 7, described as follows:

Step 710: acquire training data, the training data including several sample videos and the sample video description sentences corresponding to the sample videos.

The training data may use the existing video description generation datasets MSRVTT and ActivityNet. Of course, in other embodiments, training data may also be constructed as required.

Step 720: perform semantic feature extraction on the sample video to obtain the sample video semantic feature vector of the sample video; and perform syntactic feature extraction on the sample video description sentence corresponding to the sample video to obtain the sample syntactic feature vector of the sample video description sentence.

The process of extracting the sample video semantic feature vector from the sample video can be implemented with the convolutional neural network and the fifth neural network described above; see the description above for the specific process, which is not repeated here.

The syntactic feature extraction on the sample description sentence to obtain the sample syntactic feature vector can be implemented with the third and fourth neural networks described above; see the description above for the specific process, which is not repeated here.

Step 730: output, by the first neural network, a first hidden vector sequence according to the sample syntactic feature vector, and compute the first syntactic loss from the first hidden vector sequence.

In some embodiments of the present application, to compute the first syntactic loss from the first hidden vector sequence, a sixth neural network first predicts a syntax tree for the sample description sentence according to the first hidden vector sequence, the sixth neural network being a gated recurrent neural network; the first syntactic loss is then computed according to the predicted syntax tree and the actual syntax tree of the sample description sentence.

In some embodiments of the present application, the description generation model can be supervised on the syntactic side with a negative log-likelihood loss function. The first syntactic loss function L_{syn}^{v,c} is defined as:

$$\mathcal{L}_{syn}^{v,c} = -\log P\left(C^{syn} \mid H^{syn};\; V, C\right) \quad (19)$$

where P(C^{syn} | H^{syn}; V, C) denotes the probability that the similarity between the syntax tree predicted from the first hidden vector sequence H^{syn}, obtained from the sample video V and its corresponding sample video description sentence C, and the actual syntax tree C^{syn} of the sample description sentence C satisfies the first preset condition.

The first preset condition may be set according to a first syntax tree similarity threshold: for example, if the similarity between the predicted syntax tree and the actual syntax tree of the sample description sentence is greater than or equal to the first syntax tree similarity threshold, the first preset condition is deemed satisfied.

Thus, the first syntactic loss is computed for each sample video and its corresponding sample video description sentence according to the above first syntactic loss function.

Step 740: output, by the second neural network, a second hidden vector sequence according to the first hidden vector sequence and the sample video semantic feature vector of the sample video, and compute the first semantic loss from the second hidden vector sequence.

In some embodiments of the present application, to compute the first semantic loss from the second hidden vector sequence, a fifth multilayer perceptron first outputs a first description sentence for the sample video according to the second hidden vector sequence; the first semantic loss is then computed according to the first description sentence and the sample video description sentence corresponding to the sample video.

In some embodiments of the present application, the description generation model is supervised on the semantic side by computing a negative log-likelihood loss function, where the first semantic loss function L_{sem}^{v,c} is defined as:

$$\mathcal{L}_{sem}^{v,c} = -\log P\left(C \mid H^{sem};\; V, C\right) \quad (20)$$

where P(C | H^{sem}; V, C) is the probability that the semantic similarity between the first description sentence, predicted based on the sample video V and its corresponding video description sentence C, and the sample description sentence satisfies the second preset condition.

The second preset condition may be set according to a first semantic similarity threshold: for example, if the semantic similarity between the predicted first description sentence and the sample description sentence is greater than or equal to the first semantic similarity threshold, the second preset condition is deemed satisfied.

It can be understood that, to compute the semantic similarity between the first description sentence and the sample description sentence, the semantic vector of the first description sentence and the semantic vector of the sample description sentence need to be constructed separately; the semantic similarity is then obtained by a similarity computation between the two semantic vectors.

Thus, the first semantic loss is computed for each sample video and its corresponding sample video description sentence according to the above first semantic loss function.

Step 750: compute the first target loss according to the first syntactic loss and the first semantic loss.

In some embodiments of the present application, the first syntactic loss function and the first semantic loss function may be weighted, and the weighted result taken as the first target loss function.

In a specific embodiment, the first syntactic loss function and the first semantic loss function are added, and the sum is taken as the first target loss function, i.e., the first target loss function L_{v,c} is:

$$\mathcal{L}_{v,c} = \mathcal{L}_{syn}^{v,c} + \mathcal{L}_{sem}^{v,c} \quad (21)$$

Substituting the first syntactic loss computed in step 730 and the first semantic loss computed in step 740 yields the first target loss.
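As an illustrative sketch (not the patent's training code), the first target loss could be computed with token-level negative log-likelihoods, treating the syntax-tree prediction and the word prediction as classification over their respective label sets; representing the two probabilities of equations (19) and (20) this way is an assumption.

```python
import torch.nn.functional as F

def first_target_loss(syn_logits, syn_tree_ids, word_logits, word_ids):
    """First target loss, eq. (21): NLL of the actual syntax-tree tokens
    (eq. 19) plus NLL of the ground-truth words (eq. 20)."""
    loss_syn = F.cross_entropy(syn_logits, syn_tree_ids)   # eq. (19)
    loss_sem = F.cross_entropy(word_logits, word_ids)      # eq. (20)
    return loss_syn + loss_sem                             # eq. (21)
```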

Step 760: adjust the parameters of the description generation model based on the first target loss.

Thus, the parameters of the description generation model are adjusted according to the computed first target loss until the first target loss function converges.

During training, limited training data may constrain the performance of the description generation model. To avoid this, on the basis of the training scheme corresponding to FIG. 7, at least one of the training schemes corresponding to FIG. 8 and FIG. 9 below is introduced to further train the description generation model and assist the training scheme shown in FIG. 7.

In some embodiments of the present application, as shown in FIG. 8, the method further includes:

Step 810: acquire the sample syntactic feature vector of a sample sentence, the sample sentences including sample example sentences and the sample video description sentences corresponding to the sample videos.

The syntactic feature vector of the sample sentence can be obtained with the third and fourth neural networks described above; see the description above for the specific process, which is not repeated here.

Step 820: output, by the first neural network, a first hidden vector sequence according to the sample syntactic feature vector of the sample sentence, and compute the second syntactic loss from the first hidden vector sequence corresponding to the sample sentence.

In some embodiments of the present application, to compute the second syntactic loss from the first hidden vector sequence corresponding to the sample sentence, the sixth neural network first predicts a syntax tree for the sample sentence according to the first hidden vector sequence corresponding to the sample sentence, the sixth neural network being a gated recurrent neural network; the second syntactic loss is then computed according to the predicted syntax tree and the actual syntax tree of the sample sentence.

In some embodiments of the present application, the description generation model is supervised on the syntactic side with a negative log-likelihood loss function. The second syntactic loss function L_{syn}^{s,s} is defined as:

$$\mathcal{L}_{syn}^{s,s} = -\log P\left(S^{syn} \mid H^{syn};\; S, S\right) \quad (22)$$

where P(S^{syn} | H^{syn}; S, S) denotes the probability that the similarity between the syntax tree predicted from the first hidden vector sequence H^{syn} obtained from the sample sentence S and the actual syntax tree S^{syn} of the sample sentence S satisfies the third preset condition.

The third preset condition may be set according to a second syntax tree similarity threshold: for example, if the similarity between the syntax tree predicted for the sample sentence and the actual syntax tree of the sample sentence is greater than or equal to the second syntax tree similarity threshold, the third preset condition is deemed satisfied.

Thus, the second syntactic loss is computed for each sample sentence according to the above second syntactic loss function.

Step 830: output, by the second neural network, a second hidden vector sequence according to the sample semantic feature vector of the sample sentence and the first hidden vector sequence corresponding to the sample sentence, the sample semantic feature vector being obtained by semantic feature extraction on the sample sentence; and compute the second semantic loss from the second hidden vector sequence corresponding to the sample sentence.

In some embodiments of the present application, a sentence semantic encoding module may be constructed in advance to perform semantic feature extraction on sample sentences. Specifically, for a sample sentence, each word is first encoded with a GloVe word vector; the encoded word vector sequence is then input into a long short-term memory network, whose output is the sample semantic feature vector of the sample sentence.
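For illustration, such a sentence semantic encoding module might be sketched as follows (the frozen GloVe table, sizes, and names are assumptions):

```python
import torch
import torch.nn as nn

class SentenceSemanticEncoder(nn.Module):
    """GloVe word vectors fed to an LSTM; the final hidden state is taken
    as the sample semantic feature vector of the sentence."""
    def __init__(self, glove_weights, hid=512):   # glove_weights: (V, d)
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.lstm = nn.LSTM(glove_weights.size(1), hid, batch_first=True)

    def forward(self, word_ids):                  # word_ids: (1, T)
        out, (h_n, _) = self.lstm(self.emb(word_ids))
        return h_n[-1]                            # (1, hid) sentence semantics
```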

In some embodiments of the present application, to compute the second semantic loss from the second hidden vector sequence corresponding to the sample sentence, a multilayer perceptron first outputs a third description sentence for the sample sentence according to the second hidden vector sequence corresponding to the sample sentence; the second semantic loss is then computed according to the third description sentence and the sample sentence.

In some embodiments of the present application, the description generation model is supervised on the semantic side by computing a negative log-likelihood loss function, where the second semantic loss function L_{sem}^{s,s} is defined as:

$$\mathcal{L}_{sem}^{s,s} = -\log P\left(S \mid H^{sem};\; S, S\right) \quad (23)$$

where P(S | H^{sem}; S, S) is the probability that the semantic similarity between the third description sentence predicted based on the sample sentence S and the sample sentence S satisfies the fourth preset condition.

The fourth preset condition may be set according to a second semantic similarity threshold: for example, if the semantic similarity between the predicted third description sentence and the sample sentence is greater than or equal to the second semantic similarity threshold, the fourth preset condition is deemed satisfied.

Similarly, the semantic similarity is computed from the semantic vector of the third description sentence and the semantic vector of the sample sentence.

从而，按照上述的第二语义损失函数，可以为样本语句对应计算得到第二语义损失。Thus, according to the above second semantic loss function, the second semantic loss can be calculated for the sample sentence.
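A minimal sketch of how such a negative log-likelihood loss could be computed from per-step vocabulary logits; the shapes and the batch-mean reduction are assumptions:

```python
import torch
import torch.nn.functional as F

def semantic_nll_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the sample sentence under the model's
    per-step word distributions (a sketch; shapes are assumptions).
    logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Gather the log-probability assigned to each ground-truth word.
    picked = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return -picked.sum(dim=1).mean()   # -log P(S | ...), averaged over the batch
```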

步骤840,根据第二句法损失和第二语义损失计算得到第二目标损失。Step 840: Calculate and obtain a second target loss according to the second syntactic loss and the second semantic loss.

在本申请的一些实施例中，将第二句法损失函数和第二语义损失函数相加，并将相加的结果作为第二目标损失函数，即第二目标损失函数L^{s,s}为：In some embodiments of the present application, the second syntactic loss function and the second semantic loss function are added, and the sum is taken as the second objective loss function; that is, the second objective loss function L^{s,s} is:

L^{s,s} = L_syn^{s,s} + L_sem^{s,s}    (24)

从而，在计算得到第二句法损失和第二语义损失的基础上，按照上式(24)计算得到针对样本语句的第二目标损失。Thus, on the basis of the calculated second syntactic loss and second semantic loss, the second target loss for the sample sentence is obtained according to the above formula (24).

步骤850，基于第二目标损失调整描述生成模型的参数。Step 850: Adjust the parameters of the description generation model based on the second target loss.

在本申请的一些实施例中，训练数据还包括若干样本范例句，如图9所示，该方法还包括：In some embodiments of the present application, the training data further includes several sample example sentences; as shown in FIG. 9, the method further includes:

步骤910，获取样本范例句的样本句法特征向量。Step 910: Obtain a sample syntactic feature vector of the sample example sentence.

步骤920,由第一神经网络根据样本范例句的样本句法特征向量输出第一隐向量序列。Step 920, the first neural network outputs the first latent vector sequence according to the sample syntax feature vector of the sample example sentence.

步骤930，由第二神经网络根据对应于样本范例句的第一隐向量序列和样本视频的样本视频语义特征向量输出第二隐向量序列。Step 930: The second neural network outputs a second latent vector sequence according to the first latent vector sequence corresponding to the sample example sentence and the sample video semantic feature vector of the sample video.

步骤940,根据对应于样本视频的第二隐向量序列确定第二描述语句。Step 940: Determine the second description sentence according to the second latent vector sequence corresponding to the sample video.

在本申请的一些实施例中,可以通过一多层感知机根据样本视频的第二隐向量序列为样本视频输出第二描述语句。In some embodiments of the present application, a multilayer perceptron may be used to output a second description sentence for the sample video according to the second latent vector sequence of the sample video.

步骤950,根据样本范例句对应的语法树和第二描述语句对应的语法树计算得到第三目标损失。Step 950: Calculate and obtain a third target loss according to the syntax tree corresponding to the sample example sentence and the syntax tree corresponding to the second description sentence.

在本申请的一些实施例中，样本范例句的语法树和第二描述语句的语法树可以通过句法分析工具获得，例如上文中所列举的工具；也可以按照上述的方法，通过基于门控的循环神经网络来确定样本范例句和第二描述语句的语法树。In some embodiments of the present application, the syntax tree of the sample example sentence and the syntax tree of the second description sentence may be obtained by a syntactic analysis tool, such as the tools listed above; alternatively, following the method described above, they may be determined by a gate-based recurrent neural network.

在本申请的一些实施例中，定义第三目标损失函数为：In some embodiments of the present application, the third objective loss function is defined as:

-log P(E_syn | H_syn; V, E)

其中，P(Esyn|Hsyn;V,E)表示基于样本视频V和样本范例句E所得到第二描述语句的语法树Hsyn与样本范例句E的语法树Esyn之间的相似度满足第五预设条件的概率。Here, P(E_syn|H_syn; V, E) denotes the probability that the similarity between the syntax tree H_syn of the second description sentence, obtained based on the sample video V and the sample example sentence E, and the syntax tree E_syn of the sample example sentence E satisfies the fifth preset condition.

其中，该第五预设条件可以是根据第三语法树相似度阈值来设定，例如若第二描述语句的语法树与样本范例句的语法树的相似度大于等于该第三语法树相似度阈值，则视为满足第五预设条件。The fifth preset condition may be set according to a third syntax tree similarity threshold; for example, if the similarity between the syntax tree of the second description sentence and that of the sample example sentence is greater than or equal to the third syntax tree similarity threshold, the fifth preset condition is deemed to be satisfied.

由此,按照上述的第三目标损失函数来对应计算针对样本视频和样本范例句的第三目标损失。Therefore, the third target loss for the sample video and the sample example sentence is correspondingly calculated according to the above-mentioned third target loss function.

步骤960，基于第三目标损失调整描述生成模型的参数。Step 960: Adjust the parameters of the description generation model based on the third target loss.

在本申请的一些实施例中，可以结合图7-图9的三种训练方式对描述生成模型进行训练，在此种情况下，描述生成模型的总损失函数L可以定义为第一目标损失函数、第二目标损失函数和第三目标损失函数之和。In some embodiments of the present application, the description generation model may be trained by combining the three training methods of FIG. 7 to FIG. 9; in this case, the total loss function L of the description generation model may be defined as the sum of the first objective loss function, the second objective loss function and the third objective loss function.

当然，在其他实施例中，结合实际需要，还可以仅结合图7与图8的训练方式，或者仅结合图7和图9的训练方式对描述生成模型进行训练。Of course, in other embodiments, depending on actual needs, the description generation model may also be trained by combining only the training methods of FIG. 7 and FIG. 8, or only those of FIG. 7 and FIG. 9.
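As a minimal sketch of one optimization step under this combined objective, assuming the three loss tensors have already been computed for the current batch and an optimizer has been constructed over the model's parameters (both assumptions for illustration):

```python
import torch

def train_step(optimizer: torch.optim.Optimizer,
               first_target_loss: torch.Tensor,
               second_target_loss: torch.Tensor,
               third_target_loss: torch.Tensor) -> float:
    """One parameter update under the combined objective: the total loss is
    the plain, unweighted sum of the three objective losses described above."""
    total = first_target_loss + second_target_loss + third_target_loss
    optimizer.zero_grad()
    total.backward()   # backpropagate through the description generation model
    optimizer.step()   # adjust the model parameters
    return float(total.detach())
```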

图10是根据一实施例示出的生成视频描述语句的示意图，如图10所示，先分别通过视频语义编码模块提取输入视频的视频语义特征向量、以及通过层级结构的句子句法编码模块提取得到范例句的句法特征向量，然后由描述生成模型根据输入视频的视频语义特征向量和范例句的句法特征向量为输入视频输出视频描述语句。FIG. 10 is a schematic diagram of generating a video description sentence according to an embodiment. As shown in FIG. 10, the video semantic feature vector of the input video is first extracted by the video semantic encoding module, and the syntactic feature vector of the example sentence is extracted by the hierarchical sentence syntax encoding module; the description generation model then outputs a video description sentence for the input video according to the video semantic feature vector of the input video and the syntactic feature vector of the example sentence.

具体的，如图10所示，视频语义编码模块包括卷积神经网络(Convolutional Neural Networks,CNN)和长短时记忆神经网络(LSTM)。在输入视频后，先通过卷积神经网络分别对各视频帧进行特征提取，得到各视频帧的语义向量；然后，将各视频帧的语义向量按照视频帧的时序输入到长短时记忆神经网络中，由长短时记忆神经网络输出针对各视频帧的隐向量；进而，由各视频帧对应的隐向量组合得到输入视频的视频语义特征向量。Specifically, as shown in FIG. 10, the video semantic encoding module includes a convolutional neural network (CNN) and a long short-term memory network (LSTM). After a video is input, feature extraction is first performed on each video frame by the convolutional neural network to obtain the semantic vector of each frame; the semantic vectors of the frames are then fed into the LSTM in the temporal order of the frames, and the LSTM outputs a hidden vector for each frame; finally, the video semantic feature vector of the input video is obtained by combining the hidden vectors corresponding to the frames.
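A sketch of this encoder, under the assumption that the per-frame CNN features (for example, pooled activations of a pretrained image network) are precomputed; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class VideoSemanticEncoder(nn.Module):
    """Sketch of the video semantic encoding module: a CNN embeds each frame,
    an LSTM runs over the frame embeddings in temporal order, and the
    per-frame hidden vectors form the video semantic feature sequence."""
    def __init__(self, frame_feat_dim: int = 2048, hidden_size: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(frame_feat_dim, hidden_size, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, frame_feat_dim), precomputed CNN output
        hidden_seq, _ = self.lstm(frame_feats)
        return hidden_seq                  # (batch, num_frames, hidden)
```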

层级结构的句子句法编码模块包括两层长短时记忆神经网络，分别为LSTMc和LSTMw，其中，长短时记忆神经网络LSTMc用于对范例句中各词中每一字符的字符特征向量进行编码，由长短时记忆神经网络LSTMc针对各字符输出隐向量；然后，针对范例句中的每一词，根据该词中各字符的隐向量进行平均计算，得到该词的特征向量；然后，由长短时记忆神经网络LSTMw根据范例句中各个词的特征向量输出隐向量序列，所输出的隐向量序列作为范例句的句法特征向量。The hierarchical sentence syntax encoding module includes two layers of long short-term memory networks, LSTM_c and LSTM_w. LSTM_c encodes the character feature vector of every character of each word in the example sentence and outputs a hidden vector for each character; then, for each word in the example sentence, the hidden vectors of its characters are averaged to obtain the feature vector of that word; finally, LSTM_w outputs a hidden vector sequence according to the feature vectors of the words in the example sentence, and this hidden vector sequence serves as the syntactic feature vector of the example sentence.
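A sketch of this hierarchical encoder follows; batch size 1, the embedding and hidden dimensions, and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn
from typing import List

class SentenceSyntaxEncoder(nn.Module):
    """Sketch of the hierarchical syntax encoder: LSTM_c encodes the
    characters of each word, the character hidden vectors are averaged
    into a word feature, and LSTM_w runs over the word features; its
    hidden-vector sequence is the example sentence's syntactic feature."""
    def __init__(self, num_chars: int, char_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)
        self.lstm_c = nn.LSTM(char_dim, hidden, batch_first=True)
        self.lstm_w = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, char_ids_per_word: List[torch.Tensor]) -> torch.Tensor:
        word_feats = []
        for char_ids in char_ids_per_word:        # one (1, n_chars) tensor per word
            h_seq, _ = self.lstm_c(self.char_embed(char_ids))
            word_feats.append(h_seq.mean(dim=1))  # average the character hiddens
        words = torch.stack(word_feats, dim=1)    # (1, n_words, hidden)
        h_seq_w, _ = self.lstm_w(words)
        return h_seq_w                            # syntactic feature sequence
```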

描述生成模型包括级联的两层长短时记忆神经网络，分别为LSTMsyn(第一神经网络)和LSTMsem(第二神经网络)，其中，长短时记忆神经网络LSTMsyn用于控制所要生成视频描述语句的句法，长短时记忆神经网络LSTMsem用于给视频描述语句赋予语义含义。具体的，将范例句的句法特征向量输入至LSTMsyn中，将该LSTMsyn输出的隐向量作为LSTMsem的输入，LSTMsyn输出的隐向量用于对所要生成的视频描述语句进行句法指导；然后，LSTMsem根据LSTMsyn输出的隐向量和输入视频的视频语义特征向量，输出隐向量；最后，根据LSTMsem所输出的隐向量确定词向量，并进而根据所确定的词向量对应的词生成视频描述语句，以此保证，所生成的视频描述语句既能描述视频的内容，在句法上又和范例句类似。The description generation model includes two cascaded long short-term memory networks, LSTM_syn (the first neural network) and LSTM_sem (the second neural network). LSTM_syn controls the syntax of the video description sentence to be generated, while LSTM_sem endows the video description sentence with semantic meaning. Specifically, the syntactic feature vector of the example sentence is input into LSTM_syn, and the hidden vector output by LSTM_syn is used as an input of LSTM_sem to provide syntactic guidance for the video description sentence to be generated; LSTM_sem then outputs a hidden vector according to the hidden vector output by LSTM_syn and the video semantic feature vector of the input video; finally, a word vector is determined from the hidden vector output by LSTM_sem, and the video description sentence is generated from the words corresponding to the determined word vectors. This ensures that the generated video description sentence both describes the content of the video and is syntactically similar to the example sentence.
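One decoding step of this cascade could be sketched as follows. Here `syn_ctx` and `vid_ctx` stand for the attention-weighted syntactic and video features described elsewhere in this application, and the class name, dimensions and the linear output layer are assumptions:

```python
import torch
import torch.nn as nn

class SyntaxSemanticsDecoder(nn.Module):
    """Sketch of one step of the cascaded decoder: LSTM_syn consumes the
    attended syntactic feature and the previous word, its hidden state is fed
    together with the attended video feature into LSTM_sem, and an output
    layer maps the semantic hidden state to vocabulary logits."""
    def __init__(self, syn_dim: int, vid_dim: int, word_dim: int,
                 hidden: int, vocab: int):
        super().__init__()
        self.lstm_syn = nn.LSTMCell(syn_dim + word_dim, hidden)
        self.lstm_sem = nn.LSTMCell(vid_dim + hidden, hidden)
        self.out = nn.Linear(hidden, vocab)

    def step(self, syn_ctx, vid_ctx, prev_word, state_syn, state_sem):
        h_syn, c_syn = self.lstm_syn(torch.cat([syn_ctx, prev_word], -1), state_syn)
        h_sem, c_sem = self.lstm_sem(torch.cat([vid_ctx, h_syn], -1), state_sem)
        return self.out(h_sem), (h_syn, c_syn), (h_sem, c_sem)
```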

如图10所示，范例句为“Aerial view of a group of sheeps on a grass field.”，基于该范例句为输入视频所生成的视频描述语句为“Cooking video of a recipe with ingredients in a glass bowl”。As shown in FIG. 10, the example sentence is "Aerial view of a group of sheeps on a grass field.", and the video description sentence generated for the input video based on this example sentence is "Cooking video of a recipe with ingredients in a glass bowl".

在本申请的一些实施例中,在生成视频描述语句后,还可以通过语音技术输出。In some embodiments of the present application, after the video description sentence is generated, it can also be output through a voice technology.

以下介绍本申请的装置实施例,可以用于执行本申请上述实施例中的方法。对于本申请装置实施例中未披露的细节,请参照本申请上述的方法实施例。The apparatus embodiments of the present application are introduced below, which can be used to execute the methods in the foregoing embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the above-mentioned method embodiments of the present application.

本申请提供了一种视频描述语句的生成装置1100,该视频描述语句的生成装置1100可以配置于图1所示的服务器,如图11所示,该视频描述语句的生成装置1100包括:The present application provides an apparatus 1100 for generating video description sentences. The apparatus 1100 for generating video description sentences can be configured on the server shown in FIG. 1 . As shown in FIG. 11 , the apparatus 1100 for generating video description sentences includes:

获取模块1110，用于获取目标范例句的句法特征向量。The acquisition module 1110 is configured to acquire the syntactic feature vector of the target example sentence.

句法确定模块1120,用于根据句法特征向量确定所要生成视频描述语句的句法,得到句法信息。The syntax determination module 1120 is configured to determine the syntax of the video description sentence to be generated according to the syntax feature vector to obtain syntax information.

语义确定模块1130,用于根据句法信息和目标视频的视频语义特征向量确定所要生成视频描述语句对应于句法的语义,得到语义信息。The semantic determination module 1130 is configured to determine the semantic corresponding to the syntax of the video description sentence to be generated according to the syntax information and the video semantic feature vector of the target video, and obtain the semantic information.

视频描述语句确定模块1140,用于根据语义信息生成目标视频的视频描述语句。The video description sentence determining module 1140 is configured to generate a video description sentence of the target video according to the semantic information.

在本申请的一些实施例中,句法确定模块被配置为:由描述生成模型所包含的第一神经网络根据句法特征向量生成第一隐向量,第一隐向量用于指示句法信息,描述生成模型还包括与第一神经网络级联的第二神经网络,第一神经网络和第二神经网络是基于门控的循环神经网络。In some embodiments of the present application, the syntax determination module is configured to: generate a first latent vector according to the syntactic feature vector by the first neural network included in the description generation model, where the first latent vector is used to indicate syntax information, and the description generation model Also included is a second neural network cascaded with the first neural network, the first neural network and the second neural network being gated-based recurrent neural networks.

在本实施例中,语义确定模块被配置为:由第二神经网络根据第一隐向量和视频语义特征向量生成第二隐向量,第二隐向量用于指示语义信息。In this embodiment, the semantic determination module is configured to: generate a second latent vector by the second neural network according to the first latent vector and the video semantic feature vector, and the second latent vector is used to indicate semantic information.

在本申请的一些实施例中，视频描述语句确定模块被配置为：根据第二神经网络在t时刻生成的第二隐向量确定t时刻的词向量；根据各时刻所输出的词向量生成视频描述语句。In some embodiments of the present application, the video description sentence determination module is configured to: determine the word vector at time t according to the second latent vector generated by the second neural network at time t; and generate the video description sentence according to the word vectors output at the respective times.

在本实施例中，句法确定模块包括第一隐向量生成单元，其用于由第一神经网络根据句法特征向量、t-1时刻的词向量和第一神经网络所生成t-1时刻的第一隐向量，输出t时刻的第一隐向量。In this embodiment, the syntax determination module includes a first latent vector generation unit, configured to cause the first neural network to output the first latent vector at time t according to the syntactic feature vector, the word vector at time t-1 and the first latent vector at time t-1 generated by the first neural network.

在本实施例中，语义确定模块包括第二隐向量生成单元，其用于由第二神经网络根据视频语义特征向量、t时刻的第一隐向量和第二神经网络所生成t-1时刻的第二隐向量，输出t时刻的第二隐向量。In this embodiment, the semantic determination module includes a second latent vector generation unit, configured to cause the second neural network to output the second latent vector at time t according to the video semantic feature vector, the first latent vector at time t and the second latent vector at time t-1 generated by the second neural network.

在本申请的一些实施例中,第一隐向量生成单元包括:第一软注意力加权单元,用于根据t-1时刻的第一隐向量对句法特征向量进行软注意力加权,得到对应于t时刻的目标句法特征向量。第一拼接单元,用于将对应于t时刻的目标句法特征向量与t-1时刻的词向量进行拼接,得到对应于t时刻的第一拼接向量。第一输出单元,用于由第一神经网络以对应于t时刻的第一拼接向量作为输入,对应输出t时刻的第一隐向量。In some embodiments of the present application, the first latent vector generating unit includes: a first soft attention weighting unit, configured to perform soft attention weighting on the syntactic feature vector according to the first latent vector at time t-1, to obtain a value corresponding to The target syntactic feature vector at time t. The first splicing unit is used for splicing the target syntax feature vector corresponding to time t and the word vector at time t-1 to obtain a first splicing vector corresponding to time t. The first output unit is configured to use the first splicing vector corresponding to time t as input by the first neural network, and output the first latent vector corresponding to time t.
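A sketch of the soft-attention weighting described above, using a standard additive scoring function as one plausible instantiation (the scoring form is not fixed by this application); the same form would serve the video-side weighting described later:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Sketch: the decoder's previous hidden state scores every position of
    the feature sequence; the softmax weights produce the target feature
    vector for the current step."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim, bias=False)
        self.w_hid = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, feats: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq, feat_dim); h_prev: (batch, hidden_dim) at t-1
        scores = self.v(torch.tanh(self.w_feat(feats)
                                   + self.w_hid(h_prev).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)   # attention weights over seq
        return (alpha * feats).sum(dim=1)      # attended target feature at t
```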

在本申请的一些实施例中，第一神经网络包括第一输入门、第一遗忘门和第一输出门，第一输出单元包括：第一遗忘门向量计算单元，用于由第一遗忘门根据对应于t时刻的第一拼接向量计算得到t时刻的第一遗忘门向量。以及第一输入门向量计算单元，用于由第一输入门根据对应于t时刻的第一拼接向量计算得到t时刻的第一输入门向量。第一细胞单元向量计算单元，用于根据t时刻的第一遗忘门向量、t时刻的第一输入门向量、t时刻的第一单元向量和第一神经网络所对应t-1时刻的第一细胞单元向量计算得到t时刻的第一细胞单元向量，t时刻的第一单元向量是根据对应于t时刻的第一拼接向量进行双曲正切计算得到的。第一隐向量计算单元，用于根据t时刻的第一细胞单元向量和t时刻的第一输出门向量计算得到t时刻的第一隐向量，t时刻的第一输出门向量是由第一输出门根据对应于t时刻的第一拼接向量计算得到的。In some embodiments of the present application, the first neural network includes a first input gate, a first forget gate and a first output gate, and the first output unit includes: a first forget gate vector calculation unit, configured to cause the first forget gate to calculate the first forget gate vector at time t according to the first splicing vector corresponding to time t; a first input gate vector calculation unit, configured to cause the first input gate to calculate the first input gate vector at time t according to the first splicing vector corresponding to time t; a first cell unit vector calculation unit, configured to calculate the first cell unit vector at time t according to the first forget gate vector at time t, the first input gate vector at time t, the first unit vector at time t and the first cell unit vector at time t-1 corresponding to the first neural network, where the first unit vector at time t is obtained by hyperbolic tangent calculation on the first splicing vector corresponding to time t; and a first latent vector calculation unit, configured to calculate the first latent vector at time t according to the first cell unit vector at time t and the first output gate vector at time t, where the first output gate vector at time t is calculated by the first output gate according to the first splicing vector corresponding to time t.
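The gate computations described above follow the familiar LSTM update. As a hedged reconstruction (the text derives every gate from the first splicing vector alone, so the standard recurrent term through the previous hidden state is omitted here; recurrence is assumed to enter via the attention step):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + b_f), \qquad
i_t = \sigma(W_i x_t + b_i), \qquad
o_t = \sigma(W_o x_t + b_o), \\
g_t &= \tanh(W_g x_t + b_g), \qquad
c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad
h_t = o_t \odot \tanh(c_t),
\end{aligned}
```

where x_t is the first splicing vector at time t, f_t, i_t and o_t are the forget, input and output gate vectors, g_t is the first unit vector, c_t the first cell unit vector and h_t the first latent vector.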

在本申请的一些实施例中，句法确定模块还包括：第一归一化单元，用于对第一神经网络中的第一输入门向量、第一遗忘门向量、第一输出门向量和第一单元向量分别进行归一化；第一变换单元，根据第一偏移向量和第一缩放向量分别对归一化后的第一输入门向量、第一遗忘门向量、第一输出门向量和第一单元向量进行变换，得到目标第一输入门向量、目标第一遗忘门向量、目标第一输出门向量和目标第一单元向量，第一偏移向量是第一多层感知机根据对应于t时刻的目标句法特征向量输出的，第一缩放向量是第二多层感知机根据对应于t时刻的目标句法特征向量输出的，第一多层感知机与第二多层感知机相独立。In some embodiments of the present application, the syntax determination module further includes: a first normalization unit, configured to normalize the first input gate vector, the first forget gate vector, the first output gate vector and the first unit vector in the first neural network respectively; and a first transformation unit, configured to transform the normalized first input gate vector, first forget gate vector, first output gate vector and first unit vector according to a first offset vector and a first scaling vector respectively, to obtain a target first input gate vector, a target first forget gate vector, a target first output gate vector and a target first unit vector, where the first offset vector is output by a first multilayer perceptron according to the target syntactic feature vector corresponding to time t, the first scaling vector is output by a second multilayer perceptron according to the target syntactic feature vector corresponding to time t, and the first multilayer perceptron and the second multilayer perceptron are independent of each other.

在本实施例中，第一细胞单元向量计算单元进一步被配置为：根据目标第一遗忘门向量、目标第一输入门向量、目标第一单元向量和t-1时刻的第一细胞单元向量计算得到t时刻的第一细胞单元向量；In this embodiment, the first cell unit vector calculation unit is further configured to calculate the first cell unit vector at time t according to the target first forget gate vector, the target first input gate vector, the target first unit vector and the first cell unit vector at time t-1;

在本实施例中,第一隐向量计算单元进一步被配置为:根据t时刻的第一细胞单元向量和目标输出门向量计算得到t时刻的第一隐向量。In this embodiment, the first hidden vector calculation unit is further configured to: calculate and obtain the first hidden vector at time t according to the first cell unit vector at time t and the target output gate vector.
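The normalize-then-transform scheme of the preceding paragraphs resembles a conditional layer normalization, with the scale and shift predicted from the step's attended feature. A sketch, with the MLP depths, activation and dimensions as assumptions:

```python
import torch
import torch.nn as nn

class GateModulation(nn.Module):
    """Sketch: a gate vector is normalized, then scaled and shifted by
    vectors that two independent MLPs predict from the step's target
    (attended) feature, as described for the offset and scaling vectors."""
    def __init__(self, ctx_dim: int, gate_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.shift_mlp = nn.Sequential(nn.Linear(ctx_dim, gate_dim), nn.Tanh(),
                                       nn.Linear(gate_dim, gate_dim))
        self.scale_mlp = nn.Sequential(nn.Linear(ctx_dim, gate_dim), nn.Tanh(),
                                       nn.Linear(gate_dim, gate_dim))

    def forward(self, gate: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # gate: (batch, gate_dim) raw gate vector; ctx: (batch, ctx_dim)
        mean = gate.mean(dim=-1, keepdim=True)
        std = gate.std(dim=-1, keepdim=True)
        normed = (gate - mean) / (std + self.eps)   # per-vector normalization
        return self.scale_mlp(ctx) * normed + self.shift_mlp(ctx)
```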

在本申请的一些实施例中,第二隐向量生成单元,包括:第二软注意力加权单元,用于根据t-1时刻的第二隐向量对视频语义特征向量进行软注意力加权,得到对应于t时刻的目标视频语义向量。第二拼接单元,用于将对应于t时刻的目标视频语义向量与t时刻的第一隐向量进行拼接,得到对应于t时刻的第二拼接向量。第二输出单元,用于由第二神经网络以对应于t时刻的第二拼接向量作为输入,对应输出t时刻的第二隐向量。In some embodiments of the present application, the second latent vector generating unit includes: a second soft attention weighting unit, configured to perform soft attention weighting on the video semantic feature vector according to the second latent vector at time t-1, to obtain The target video semantic vector corresponding to time t. The second splicing unit is used for splicing the target video semantic vector corresponding to time t with the first latent vector at time t to obtain a second splicing vector corresponding to time t. The second output unit is used for the second neural network to take the second splicing vector corresponding to time t as an input, and to output the second latent vector corresponding to time t.

在本申请的一些实施例中，第二神经网络包括第二输入门、第二遗忘门和第二输出门，第二输出单元包括：第二遗忘门向量计算单元，用于由第二遗忘门根据对应于t时刻的第二拼接向量计算得到t时刻的第二遗忘门向量。以及第二输入门向量计算单元，用于由第二输入门根据对应于t时刻的第二拼接向量计算得到t时刻的第二输入门向量。第二细胞单元向量计算单元，用于根据t时刻的第二遗忘门向量、t时刻的第二输入门向量、t时刻的第二单元向量和第二神经网络所对应t-1时刻的第二细胞单元向量计算得到t时刻的第二细胞单元向量，t时刻的第二单元向量是根据对应于t时刻的第二拼接向量进行双曲正切计算得到的。第二隐向量计算单元，用于根据t时刻的第二细胞单元向量和t时刻的第二输出门向量计算得到t时刻的第二隐向量，t时刻的第二输出门向量是由第二输出门根据对应于t时刻的第二拼接向量计算得到的。In some embodiments of the present application, the second neural network includes a second input gate, a second forget gate and a second output gate, and the second output unit includes: a second forget gate vector calculation unit, configured to cause the second forget gate to calculate the second forget gate vector at time t according to the second splicing vector corresponding to time t; a second input gate vector calculation unit, configured to cause the second input gate to calculate the second input gate vector at time t according to the second splicing vector corresponding to time t; a second cell unit vector calculation unit, configured to calculate the second cell unit vector at time t according to the second forget gate vector at time t, the second input gate vector at time t, the second unit vector at time t and the second cell unit vector at time t-1 corresponding to the second neural network, where the second unit vector at time t is obtained by hyperbolic tangent calculation on the second splicing vector corresponding to time t; and a second latent vector calculation unit, configured to calculate the second latent vector at time t according to the second cell unit vector at time t and the second output gate vector at time t, where the second output gate vector at time t is calculated by the second output gate according to the second splicing vector corresponding to time t.

在本申请的一些实施例中，语义确定模块还包括：第二归一化单元，用于对第二神经网络中的第二输入门向量、第二遗忘门向量、第二输出门向量和第二单元向量分别进行归一化。第二变换单元，用于根据第二偏移向量和第二缩放向量分别对归一化后的第二输入门向量、第二遗忘门向量、第二输出门向量和第二单元向量进行变换，得到目标第二输入门向量、目标第二遗忘门向量、目标第二输出门向量和目标第二单元向量，第二偏移向量是第三多层感知机根据对应于t时刻的目标视频语义向量输出的，第二缩放向量是第四多层感知机根据对应于t时刻的目标视频语义向量输出的，第三多层感知机与第四多层感知机相独立。In some embodiments of the present application, the semantic determination module further includes: a second normalization unit, configured to normalize the second input gate vector, the second forget gate vector, the second output gate vector and the second unit vector in the second neural network respectively; and a second transformation unit, configured to transform the normalized second input gate vector, second forget gate vector, second output gate vector and second unit vector according to a second offset vector and a second scaling vector respectively, to obtain a target second input gate vector, a target second forget gate vector, a target second output gate vector and a target second unit vector, where the second offset vector is output by a third multilayer perceptron according to the target video semantic vector corresponding to time t, the second scaling vector is output by a fourth multilayer perceptron according to the target video semantic vector corresponding to time t, and the third multilayer perceptron and the fourth multilayer perceptron are independent of each other.

在本实施例中，第二细胞单元向量计算单元进一步被配置为：根据目标第二遗忘门向量、目标第二输入门向量、目标第二单元向量和t-1时刻的第二细胞单元向量计算得到t时刻的第二细胞单元向量；In this embodiment, the second cell unit vector calculation unit is further configured to calculate the second cell unit vector at time t according to the target second forget gate vector, the target second input gate vector, the target second unit vector and the second cell unit vector at time t-1;

在本实施例中,第二隐向量计算单元进一步被配置为:根据t时刻的第二细胞单元向量和目标第二输出门向量计算得到t时刻的第二隐向量。In this embodiment, the second hidden vector calculation unit is further configured to: calculate and obtain the second hidden vector at time t according to the second cell unit vector at time t and the target second output gate vector.

在本申请的一些实施例中,视频描述语句的生成装置还包括:训练数据获取模块,用于获取训练数据,训练数据包括若干样本视频和样本视频对应的样本视频描述语句。语义特征提取模块,用于对样本视频进行语义特征提取,得到样本视频的样本视频语义特征向量;以及句法特征提取模块,用于对样本视频所对应样本视频描述语句进行句法特征提取,得到样本视频描述语句的样本句法特征向量。第一句法损失确定模块,用于由第一神经网络根据样本句法特征向量输出第一隐向量序列,通过第一隐向量序列计算第一句法损失。第一语义损失确定模块,用于由第二神经网络根据第一隐向量序列和样本视频的样本视频语义特征向量输出第二隐向量序列,通过第二隐向量序列计算第一语义损失。第一目标损失计算模块,用于根据第一句法损失和第一语义损失计算得到第一目标损失。第一调整模块,用于基于第一目标损失调整描述生成模型的参数。In some embodiments of the present application, the apparatus for generating video description sentences further includes: a training data acquisition module, configured to acquire training data, where the training data includes several sample videos and sample video description sentences corresponding to the sample videos. The semantic feature extraction module is used to extract the semantic features of the sample video to obtain the sample video semantic feature vector of the sample video; and the syntactic feature extraction module is used to extract the syntactic features of the sample video description sentences corresponding to the sample video to obtain the sample video. A sample syntactic feature vector describing the sentence. The first syntax loss determining module is used for outputting the first latent vector sequence by the first neural network according to the sample syntax feature vector, and calculating the first syntax loss by using the first latent vector sequence. The first semantic loss determination module is configured to output a second latent vector sequence by the second neural network according to the first latent vector sequence and the sample video semantic feature vector of the sample video, and calculate the first semantic loss by using the second latent vector sequence. The first target loss calculation module is configured to calculate and obtain the first target loss according to the first syntactic loss and the first semantic loss. The first adjustment module is configured to adjust the parameters describing the generation model based on the first target loss.

在本申请的一些实施例中，第一句法损失确定模块包括：语法树预测单元，用于通过第六神经网络根据第一隐向量序列为样本描述语句预测得到语法树，第六神经网络是基于门控的循环神经网络。第一句法损失计算单元，用于根据所预测得到的语法树和样本描述语句的实际语法树计算得到第一句法损失。In some embodiments of the present application, the first syntax loss determination module includes: a syntax tree prediction unit, configured to predict a syntax tree for the sample description sentence according to the first latent vector sequence through a sixth neural network, the sixth neural network being a gate-based recurrent neural network; and a first syntax loss calculation unit, configured to calculate the first syntax loss according to the predicted syntax tree and the actual syntax tree of the sample description sentence.

在本申请的一些实施例中,第一语义损失确定模块包括:第一描述语句输出单元,用于通过第五多层感知机根据第二隐向量序列为样本视频输出第一描述语句。第一语义损失计算单元,用于根据第一描述语句和样本视频所对应样本视频描述语句计算得到第一语义损失。In some embodiments of the present application, the first semantic loss determination module includes: a first description sentence output unit, configured to output the first description sentence for the sample video according to the second latent vector sequence through the fifth multilayer perceptron. The first semantic loss calculation unit is configured to calculate and obtain the first semantic loss according to the first description sentence and the sample video description sentence corresponding to the sample video.

在本申请的一些实施例中，视频描述语句的生成装置还包括：第一样本句法特征向量获取模块，用于获取样本语句的样本句法特征向量，样本语句包括样本范例句和样本视频对应的样本视频描述语句。第二句法损失计算模块，用于由第一神经网络根据样本语句的样本句法特征向量输出第一隐向量序列，通过样本语句对应的第一隐向量序列计算第二句法损失。第二语义损失计算模块，用于由第二神经网络根据样本语句的样本语义特征向量和样本语句对应的第一隐向量序列输出第二隐向量序列，样本语义特征向量是对样本语句进行语义特征提取所得到的，通过样本语句对应的第二隐向量序列计算第二语义损失。第二目标损失计算模块，用于根据第二句法损失和第二语义损失计算得到第二目标损失。第二调整模块，用于基于第二目标损失调整描述生成模型的参数。In some embodiments of the present application, the apparatus for generating a video description sentence further includes: a first sample syntactic feature vector acquisition module, configured to acquire a sample syntactic feature vector of a sample sentence, where the sample sentence includes a sample example sentence and a sample video description sentence corresponding to a sample video; a second syntactic loss calculation module, configured to cause the first neural network to output a first latent vector sequence according to the sample syntactic feature vector of the sample sentence and to calculate the second syntactic loss through the first latent vector sequence corresponding to the sample sentence; a second semantic loss calculation module, configured to cause the second neural network to output a second latent vector sequence according to the sample semantic feature vector of the sample sentence and the first latent vector sequence corresponding to the sample sentence, where the sample semantic feature vector is obtained by performing semantic feature extraction on the sample sentence, and to calculate the second semantic loss through the second latent vector sequence corresponding to the sample sentence; a second target loss calculation module, configured to calculate the second target loss according to the second syntactic loss and the second semantic loss; and a second adjustment module, configured to adjust the parameters of the description generation model based on the second target loss.

在本申请的一些实施例中，训练数据还包括若干样本范例句，视频描述语句的生成装置还包括：第二样本句法特征向量获取模块，用于获取样本范例句的样本句法特征向量。第一隐向量序列输出模块，用于由第一神经网络根据样本范例句的样本句法特征向量输出第一隐向量序列。第二隐向量序列输出模块，用于由第二神经网络根据对应于样本范例句的第一隐向量序列和样本视频的样本视频语义特征向量输出第二隐向量序列。第二描述语句确定模块，用于根据对应于样本视频的第二隐向量序列确定第二描述语句。第三目标损失计算模块，用于根据样本范例句对应的语法树和第二描述语句对应的语法树计算得到第三目标损失。第三调整模块，用于基于第三目标损失调整描述生成模型的参数。In some embodiments of the present application, the training data further includes several sample example sentences, and the apparatus for generating a video description sentence further includes: a second sample syntactic feature vector acquisition module, configured to acquire a sample syntactic feature vector of the sample example sentence; a first latent vector sequence output module, configured to cause the first neural network to output a first latent vector sequence according to the sample syntactic feature vector of the sample example sentence; a second latent vector sequence output module, configured to cause the second neural network to output a second latent vector sequence according to the first latent vector sequence corresponding to the sample example sentence and the sample video semantic feature vector of the sample video; a second description sentence determination module, configured to determine the second description sentence according to the second latent vector sequence corresponding to the sample video; a third target loss calculation module, configured to calculate the third target loss according to the syntax tree corresponding to the sample example sentence and the syntax tree corresponding to the second description sentence; and a third adjustment module, configured to adjust the parameters of the description generation model based on the third target loss.

在本申请的一些实施例中，通过句法模型获得目标范例句对应的句法特征向量，句法模型包括级联的第三神经网络和第四神经网络，第三神经网络和第四神经网络是基于门控的循环神经网络。在本实施例中，获取模块包括：字符特征向量获取单元，用于获取目标范例句中各词所包括字符的字符特征向量，字符特征向量是对字符进行编码得到的。第三隐向量输出单元，用于由第三神经网络根据各字符的字符特征向量输出各字符对应的第三隐向量。平均计算单元，用于针对目标范例句中的每一词，根据该词中各字符对应的第三隐向量进行平均计算，得到该词的特征向量。第四隐向量输出单元，用于由第四神经网络根据目标范例句中各词的特征向量输出第四隐向量，第四隐向量作为句法特征向量。In some embodiments of the present application, the syntactic feature vector corresponding to the target example sentence is obtained through a syntax model, where the syntax model includes a cascaded third neural network and fourth neural network, and the third and fourth neural networks are gate-based recurrent neural networks. In this embodiment, the acquisition module includes: a character feature vector acquisition unit, configured to acquire character feature vectors of the characters included in each word of the target example sentence, a character feature vector being obtained by encoding the character; a third latent vector output unit, configured to cause the third neural network to output the third latent vector corresponding to each character according to the character's feature vector; an average calculation unit, configured to, for each word in the target example sentence, average the third latent vectors corresponding to the characters of the word to obtain the feature vector of the word; and a fourth latent vector output unit, configured to cause the fourth neural network to output a fourth latent vector according to the feature vectors of the words in the target example sentence, the fourth latent vector serving as the syntactic feature vector.

在本申请的一些实施例中，通过视频语义模型获得目标视频对应的视频语义特征向量，视频语义模型包括级联的卷积神经网络和第五神经网络，第五神经网络是基于门控的循环神经网络，视频描述语句的生成装置还包括：视频帧序列获取模块，用于获取对目标视频进行分帧所得到的视频帧序列。语义提取模块，用于通过卷积神经网络对视频帧序列中的各视频帧进行语义提取，得到各视频帧的语义向量。第五隐向量输出模块，用于通过第五神经网络根据视频帧序列中各视频帧的语义向量输出第五隐向量，第五隐向量作为视频语义特征向量。In some embodiments of the present application, the video semantic feature vector corresponding to the target video is obtained through a video semantic model, where the video semantic model includes a cascaded convolutional neural network and fifth neural network, and the fifth neural network is a gate-based recurrent neural network. The apparatus for generating a video description sentence further includes: a video frame sequence acquisition module, configured to acquire a video frame sequence obtained by dividing the target video into frames; a semantic extraction module, configured to perform semantic extraction on each video frame in the video frame sequence through the convolutional neural network to obtain the semantic vector of each video frame; and a fifth latent vector output module, configured to output a fifth latent vector according to the semantic vectors of the video frames in the video frame sequence through the fifth neural network, the fifth latent vector serving as the video semantic feature vector.

图12示出了适于用来实现本申请实施例的电子设备的计算机系统的结构示意图。FIG. 12 shows a schematic structural diagram of a computer system suitable for implementing the electronic device according to the embodiment of the present application.

需要说明的是,图12示出的电子设备的计算机系统1200仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。It should be noted that the computer system 1200 of the electronic device shown in FIG. 12 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present application.

如图12所示，计算机系统1200包括中央处理单元(Central Processing Unit,CPU)1201，其可以根据存储在只读存储器(Read-Only Memory,ROM)1202中的程序或者从存储部分1208加载到随机访问存储器(Random Access Memory,RAM)1203中的程序而执行各种适当的动作和处理，例如执行上述实施例中的方法。在RAM 1203中，还存储有系统操作所需的各种程序和数据。CPU 1201、ROM 1202以及RAM 1203通过总线1204彼此相连。输入/输出(Input/Output,I/O)接口1205也连接至总线1204。As shown in FIG. 12, the computer system 1200 includes a central processing unit (CPU) 1201, which can perform various appropriate actions and processes, for example the methods in the above embodiments, according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage portion 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for system operation. The CPU 1201, the ROM 1202 and the RAM 1203 are connected to each other through a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.

以下部件连接至I/O接口1205:包括键盘、鼠标等的输入部分1206;包括诸如阴极射线管(Cathode Ray Tube,CRT)、液晶显示器(Liquid Crystal Display,LCD)等以及扬声器等的输出部分1207;包括硬盘等的存储部分1208;以及包括诸如LAN(Local AreaNetwork,局域网)卡、调制解调器等的网络接口卡的通信部分1209。通信部分1209经由诸如因特网的网络执行通信处理。驱动器1210也根据需要连接至I/O接口1205。可拆卸介质1211,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器1210上,以便于从其上读出的计算机程序根据需要被安装入存储部分1208。The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, etc.; an output section 1207 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc. ; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN (Local Area Network) card, a modem, and the like. The communication section 1209 performs communication processing via a network such as the Internet. Drivers 1210 are also connected to I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 1210 as needed so that a computer program read therefrom is installed into the storage section 1208 as needed.

特别地,根据本申请的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本申请的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信部分1209从网络上被下载和安装,和/或从可拆卸介质1211被安装。在该计算机程序被中央处理单元(CPU)1201执行时,执行本申请的系统中限定的各种功能。In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network through the communication portion 1209, and/or installed from the removable medium 1211. When the computer program is executed by the central processing unit (CPU) 1201, various functions defined in the system of the present application are executed.

需要说明的是，本申请实施例所示的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件，或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于：具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(Erasable Programmable Read Only Memory,EPROM)、闪存、光纤、便携式紧凑磁盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本申请中，计算机可读存储介质可以是任何包含或存储程序的有形介质，该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本申请中，计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号，其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式，包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质，该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输，包括但不限于：无线、有线等等，或者上述的任意合适的组合。It should be noted that the computer-readable medium shown in the embodiments of the present application may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and it can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wired, etc., or any suitable combination of the above.

附图中的流程图和框图，图示了按照本申请各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。其中，流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分，上述模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意，在有些作为替换的实现中，方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如，两个接连地表示的方框实际上可以基本并行地执行，它们有时也可以按相反的顺序执行，这依所涉及的功能而定。也要注意的是，框图或流程图中的每个方框、以及框图或流程图中的方框的组合，可以用执行规定的功能或操作的专用的基于硬件的系统来实现，或者可以用专用硬件与计算机指令的组合来实现。The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved. It is also noted that each block of the block diagrams or flowcharts, and combinations of blocks therein, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

描述于本申请实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现,所描述的单元也可以设置在处理器中。其中,这些单元的名称在某种情况下并不构成对该单元本身的限定。The units involved in the embodiments of the present application may be implemented in software or hardware, and the described units may also be provided in a processor. Among them, the names of these units do not constitute a limitation on the unit itself under certain circumstances.

在本申请实施例的一个方面，提供了一种电子设备，该电子设备包括：处理器；存储器，存储器上存储有计算机可读指令，计算机可读指令被处理器执行时，实现上述任一实施例中视频描述语句的生成方法。In one aspect of the embodiments of the present application, an electronic device is provided, including: a processor; and a memory storing computer-readable instructions which, when executed by the processor, implement the method for generating a video description sentence in any of the above embodiments.

作为另一方面，本申请还提供了一种计算机可读存储介质，该计算机可读存储介质可以是上述实施例中描述的电子设备中所包含的；也可以是单独存在，而未装配入该电子设备中。该计算机可读存储介质上存储有计算机可读指令，当所述计算机可读指令被处理器执行时，实现如上任一实施例中的方法。As another aspect, the present application further provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the method in any of the above embodiments.

应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本申请的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。It should be noted that although several modules or units of the apparatus for action performance are mentioned in the above detailed description, this division is not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided into multiple modules or units to be embodied.

通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本申请实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、触控终端、或者网络设备等)执行根据本申请实施方式的方法。Those skilled in the art can easily understand from the description of the above embodiments that the exemplary embodiments described herein may be implemented by software, or by a combination of software and necessary hardware. Therefore, the technical solutions according to the embodiments of the present application may be embodied in the form of software products, and the software products may be stored in a non-volatile storage medium (which may be CD-ROM, U disk, mobile hard disk, etc.) or on the network , which includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiment of the present application.

本领域技术人员在考虑说明书及实践这里公开的实施方式后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses or adaptations of this application that follow the general principles of this application and include common knowledge or conventional techniques in the technical field not disclosed in this application .

应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求来限制。It is to be understood that the present application is not limited to the precise structures described above and illustrated in the accompanying drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (17)

1. A method for generating a video description sentence, the method comprising:
obtaining a syntactic characteristic vector of a target example sentence;
determining the syntax of a video description sentence to be generated according to the syntax feature vector to obtain syntax information;
determining the semantics of the video description sentence to be generated corresponding to the syntax according to the syntax information and the video semantic feature vector of the target video to obtain semantic information;
and generating a video description statement of the target video according to the semantic information.
2. The method of claim 1, wherein determining the syntax of the video description sentence to be generated according to the syntax feature vector, and obtaining syntax information comprises:
generating a first hidden vector according to the syntactic characteristic vector by a first neural network contained in a description generation model, wherein the first hidden vector is used for indicating the syntactic information, the description generation model further comprises a second neural network, and the first neural network and the second neural network are gate-based cyclic neural networks;
determining the semantics of the to-be-generated video description sentence corresponding to the syntax according to the syntax information and the video semantic feature vector of the target video to obtain semantic information, wherein the semantic information comprises:
generating, by the second neural network, a second hidden vector from the first hidden vector and the video semantic feature vector, the second hidden vector being used to indicate the semantic information.
3. The method of claim 2, wherein the generating a video description sentence of the target video according to the semantic information comprises:
determining a word vector at the t moment according to a second hidden vector generated by the second neural network at the t moment;
generating the video description sentence according to the word vector output at each moment;
generating a first hidden vector by a first neural network contained in a description generation model according to the syntactic feature vector, wherein the first hidden vector comprises:
outputting, by the first neural network, a first hidden vector at a time t according to the syntactic feature vector, the word vector at the time t-1 and the first hidden vector at the time t-1 generated by the first neural network;
generating, by the second neural network, a second hidden vector from the first hidden vector and the video semantic feature vector, comprising:
and outputting the second hidden vector at the t moment by the second neural network according to the video semantic feature vector, the first hidden vector at the t moment and the second hidden vector at the t-1 moment generated by the second neural network.
4. The method of claim 3, wherein outputting, by the first neural network, the first hidden vector at time t from the syntactic feature vector, the word vector at time t-1, and the first hidden vector at time t-1 generated by the first neural network comprises:
carrying out soft attention weighting on the syntactic characteristic vector according to the first implicit vector at the time t-1 to obtain a target syntactic characteristic vector corresponding to the time t;
splicing the target syntactic characteristic vector corresponding to the time t with the word vector at the time t-1 to obtain a first spliced vector corresponding to the time t;
and correspondingly outputting a first implicit vector at the time t by using the first splicing vector corresponding to the time t as an input of the first neural network.
5. The method of claim 4, wherein the first neural network comprises a first input gate, a first forgetting gate, and a first output gate, wherein the outputting, by the first neural network, the first stitched vector corresponding to time t as an input and the first hidden vector corresponding to time t as an output comprises:
calculating by the first forgetting gate according to the first splicing vector corresponding to the time t to obtain a first forgetting gate vector at the time t; calculating to obtain a first input gate vector at the t moment by the first input gate according to the first splicing vector corresponding to the t moment;
calculating to obtain a first cell unit vector at the time t according to the first forgetting gate vector at the time t, the first input gate vector at the time t, the first unit vector at the time t and the first cell unit vector at the time t-1 corresponding to the first neural network, wherein the first unit vector at the time t is obtained by performing hyperbolic tangent calculation according to the first splicing vector corresponding to the time t;
and calculating to obtain a first implicit vector at the time t according to the first cell unit vector at the time t and a first output gate vector at the time t, wherein the first output gate vector at the time t is calculated by the first output gate according to the first splicing vector corresponding to the time t.
6. The method of claim 5, wherein before the calculating the first cell unit vector at time t according to the first forgetting gate vector at time t, the first input gate vector at time t, the first unit vector at time t, and the first cell unit vector at time t-1 corresponding to the first neural network, the method further comprises:
respectively normalizing a first input gate vector, a first forgetting gate vector, a first output gate vector and a first unit vector in the first neural network;
respectively transforming the normalized first input gate vector, the normalized first forgetting gate vector, the normalized first output gate vector and the normalized first unit vector according to a first offset vector and a first scaling vector to obtain a target first input gate vector, a target first forgetting gate vector, a target first output gate vector and a target first unit vector, wherein the first offset vector is output by a first multilayer perceptron according to the target syntactic feature vector corresponding to the moment t, the first scaling vector is output by a second multilayer perceptron according to the target syntactic feature vector corresponding to the moment t, and the first multilayer perceptron and the second multilayer perceptron are independent;
the calculating according to the first forgetting gate vector at the time t, the first input gate vector at the time t, the first unit vector at the time t and the first cell unit vector at the time t-1 corresponding to the first neural network to obtain the first cell unit vector at the time t includes:
calculating to obtain a first cell unit vector at the time t according to the target first forgetting gate vector, the target first input gate vector, the target first unit vector and the first cell unit vector at the time t-1;
the calculating to obtain the first implicit vector at the time t according to the first cell unit vector at the time t and the first output gate vector at the time t includes:
and calculating to obtain a first implicit vector at the time t according to the first cell unit vector at the time t and the target output gate vector.
7. The method of claim 3, wherein outputting, by the second neural network, the second hidden vector at time t from the video semantic feature vector, the first hidden vector at time t, and the second hidden vector at time t-1 generated by the second neural network comprises:
carrying out soft attention weighting on the video semantic feature vector according to the second hidden vector at the time t-1 to obtain a target video semantic vector corresponding to the time t;
splicing the target video semantic vector corresponding to the time t with the first hidden vector corresponding to the time t to obtain a second spliced vector corresponding to the time t;
and correspondingly outputting a second implicit vector at the t moment by the second neural network by taking the second splicing vector corresponding to the t moment as an input.
8. The method of claim 7, wherein the second neural network comprises a second input gate, a second forgetting gate, and a second output gate, wherein the outputting, by the second neural network, the second stitched vector corresponding to time t as an input and the second implicit vector corresponding to time t as an output comprises:
calculating by the second forgetting gate according to the second splicing vector corresponding to the time t to obtain a second forgetting gate vector at the time t; and calculating a second input gate vector at the time t by the second input gate according to the second splicing vector corresponding to the time t;
calculating a second cell unit vector at the time t according to the second forgetting gate vector at the time t, the second input gate vector at the time t, the second unit vector at the time t and the second cell unit vector at the time t-1 corresponding to the second neural network, wherein the second unit vector at the time t is obtained by performing hyperbolic tangent calculation according to the second splicing vector corresponding to the time t;
and calculating to obtain a second implicit vector at the time t according to the second cell unit vector at the time t and a second output gate vector at the time t, wherein the second output gate vector at the time t is calculated by the second output gate according to the second splicing vector corresponding to the time t.
9. The method of claim 8, wherein before the calculating of the second cell unit vector at time t according to the second forget gate vector at time t, the second input gate vector at time t, the second unit vector at time t, and the second cell unit vector at time t-1 of the second neural network, the method further comprises:
respectively normalizing the second input gate vector, the second forget gate vector, the second output gate vector, and the second unit vector in the second neural network;
respectively transforming the normalized second input gate vector, second forget gate vector, second output gate vector, and second unit vector according to a second offset vector and a second scaling vector to obtain a target second input gate vector, a target second forget gate vector, a target second output gate vector, and a target second unit vector, wherein the second offset vector is output by a third multilayer perceptron according to the target video semantic vector corresponding to time t, the second scaling vector is output by a fourth multilayer perceptron according to the target video semantic vector corresponding to time t, and the third multilayer perceptron and the fourth multilayer perceptron are independent of each other;
the calculating of the second cell unit vector at time t according to the second forget gate vector at time t, the second input gate vector at time t, the second unit vector at time t, and the second cell unit vector at time t-1 of the second neural network comprises:
calculating the second cell unit vector at time t according to the target second forget gate vector, the target second input gate vector, the target second unit vector, and the second cell unit vector at time t-1;
the calculating of the second hidden vector at time t according to the second cell unit vector at time t and the second output gate vector at time t comprises:
calculating the second hidden vector at time t according to the second cell unit vector at time t and the target second output gate vector.
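The normalize-then-transform step resembles a conditional layer normalization, and can be sketched as below (the MLP depths and activations are assumptions, since the claim only requires two independent multilayer perceptrons producing the offset and scaling vectors from the target video semantic vector):

    import torch
    import torch.nn as nn

    class ConditionedGateNorm(nn.Module):
        # Normalize a gate vector, then shift and scale it with vectors
        # produced by two independent MLPs from the target video semantic
        # vector. A sketch of the claim-9 transform; names are illustrative.
        def __init__(self, gate_dim, semantic_dim, eps=1e-5):
            super().__init__()
            self.eps = eps
            # Third multilayer perceptron: emits the second offset vector.
            self.offset_mlp = nn.Sequential(
                nn.Linear(semantic_dim, gate_dim), nn.Tanh(),
                nn.Linear(gate_dim, gate_dim))
            # Fourth multilayer perceptron: emits the second scaling vector.
            self.scale_mlp = nn.Sequential(
                nn.Linear(semantic_dim, gate_dim), nn.Tanh(),
                nn.Linear(gate_dim, gate_dim))

        def forward(self, gate_vec, target_semantic_vec):
            # Layer-style normalization of the gate vector.
            mean = gate_vec.mean(dim=-1, keepdim=True)
            std = gate_vec.std(dim=-1, keepdim=True)
            normed = (gate_vec - mean) / (std + self.eps)
            # Shift and scale conditioned on the video semantics.
            offset = self.offset_mlp(target_semantic_vec)
            scale = self.scale_mlp(target_semantic_vec)
            return scale * normed + offset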
10. The method of claim 2, further comprising:
acquiring training data, wherein the training data comprises a plurality of sample videos and a sample video description statement corresponding to each sample video;
performing semantic feature extraction on the sample video to obtain a sample video semantic feature vector of the sample video, and performing syntactic feature extraction on the sample video description statement corresponding to the sample video to obtain a sample syntactic feature vector of the sample video description statement;
outputting, by the first neural network, a first hidden vector sequence according to the sample syntactic feature vector, and calculating a first syntactic loss through the first hidden vector sequence;
outputting, by the second neural network, a second hidden vector sequence according to the first hidden vector sequence and the sample video semantic feature vector of the sample video, and calculating a first semantic loss through the second hidden vector sequence;
calculating a first target loss according to the first syntactic loss and the first semantic loss;
and adjusting parameters of the description generative model based on the first target loss.
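One training step under this claim can be sketched as follows (the networks and loss functions are passed as callables and the equal weighting of the two losses is an assumption, since the claim only states that the first target loss is computed from both):

    import torch

    def training_step(first_net, second_net, syntax_loss_fn, semantic_loss_fn,
                      optimizer, sample_syntax_feats, sample_video_feats,
                      syntax_target, caption_target):
        # first_net / second_net: the first and second neural networks
        # syntax_loss_fn / semantic_loss_fn: compute the first syntactic and
        # first semantic losses from the respective hidden-vector sequences
        first_hidden_seq = first_net(sample_syntax_feats)
        second_hidden_seq = second_net(first_hidden_seq, sample_video_feats)

        first_syntactic_loss = syntax_loss_fn(first_hidden_seq, syntax_target)
        first_semantic_loss = semantic_loss_fn(second_hidden_seq, caption_target)
        first_target_loss = first_syntactic_loss + first_semantic_loss

        optimizer.zero_grad()
        first_target_loss.backward()  # adjust the description generative model
        optimizer.step()
        return first_target_loss.item()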
11. The method of claim 10, wherein the calculating of a first syntactic loss through the first hidden vector sequence comprises:
predicting, by a sixth neural network, a syntax tree of the sample video description statement according to the first hidden vector sequence, wherein the sixth neural network is a gated recurrent neural network;
and calculating the first syntactic loss according to the predicted syntax tree and the actual syntax tree of the sample video description statement.
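A possible sketch of this loss follows, under the assumption that the syntax tree is linearized into a token sequence of the same length as the first hidden vector sequence (padding and truncation omitted); the linearization and the use of cross-entropy are assumptions, as the claim names neither:

    import torch
    import torch.nn as nn

    class SyntaxTreePredictor(nn.Module):
        def __init__(self, hidden_dim, num_tree_tokens):
            super().__init__()
            self.sixth_gru = nn.GRU(hidden_dim, hidden_dim)  # sixth neural network
            self.head = nn.Linear(hidden_dim, num_tree_tokens)

        def forward(self, first_hidden_seq):
            # first_hidden_seq: (seq_len, 1, hidden_dim)
            out, _ = self.sixth_gru(first_hidden_seq)
            return self.head(out.squeeze(1))  # (seq_len, num_tree_tokens) logits

    def first_syntactic_loss(predictor, first_hidden_seq, actual_tree_token_ids):
        # actual_tree_token_ids: (seq_len,) ids of the ground-truth tree tokens
        logits = predictor(first_hidden_seq)
        return nn.functional.cross_entropy(logits, actual_tree_token_ids)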
12. The method of claim 10, wherein the calculating of a first semantic loss through the second hidden vector sequence comprises:
outputting, by a fifth multilayer perceptron, a first description statement for the sample video according to the second hidden vector sequence;
and calculating the first semantic loss according to the first description statement and the sample video description statement corresponding to the sample video.
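A minimal sketch of this step (cross-entropy against the sample description is an assumption, as the claim does not name the loss function; the fifth multilayer perceptron is passed in as a module):

    import torch.nn as nn

    def first_semantic_loss(second_hidden_seq, target_word_ids, fifth_mlp):
        # second_hidden_seq: (seq_len, hidden_dim)
        # target_word_ids:   (seq_len,) word ids of the sample video
        #                    description statement
        # fifth_mlp:         module mapping hidden_dim -> vocab_size logits
        logits = fifth_mlp(second_hidden_seq)  # per-step word distributions
        return nn.functional.cross_entropy(logits, target_word_ids)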
13. The method of claim 10, further comprising:
acquiring a sample syntactic feature vector of a sample statement, wherein the sample statement comprises a sample example sentence and the sample video description statement corresponding to the sample video;
outputting, by the first neural network, a first hidden vector sequence according to the sample syntactic feature vector of the sample statement, and calculating a second syntactic loss through the first hidden vector sequence corresponding to the sample statement;
outputting, by the second neural network, a second hidden vector sequence according to a sample semantic feature vector of the sample statement and the first hidden vector sequence corresponding to the sample statement, wherein the sample semantic feature vector is obtained by performing semantic feature extraction on the sample statement, and calculating a second semantic loss through the second hidden vector sequence corresponding to the sample statement;
calculating a second target loss according to the second syntactic loss and the second semantic loss;
and adjusting parameters of the description generative model based on the second target loss.
14. The method of claim 10 or 13, wherein the training data further comprises a plurality of sample example sentences, and the method further comprises:
acquiring a sample syntactic feature vector of a sample example sentence;
outputting, by the first neural network, a first hidden vector sequence according to the sample syntactic feature vector of the sample example sentence;
outputting, by the second neural network, a second hidden vector sequence according to the first hidden vector sequence corresponding to the sample example sentence and the sample video semantic feature vector of the sample video;
determining a second description statement according to the second hidden vector sequence corresponding to the sample video;
calculating a third target loss according to the syntax tree corresponding to the sample example sentence and the syntax tree corresponding to the second description statement;
and adjusting parameters of the description generative model based on the third target loss.
15. The method of claim 1, wherein the acquiring of the syntactic feature vector of the target example sentence comprises:
acquiring a character feature vector of each character included in each word of the target example sentence, wherein the character feature vector is obtained by encoding the character;
outputting, by a third neural network, a third hidden vector corresponding to each character according to the character feature vector of the character;
for each word in the target example sentence, averaging the third hidden vectors corresponding to the characters in the word to obtain a feature vector of the word;
and outputting, by a fourth neural network, a fourth hidden vector sequence according to the feature vector of each word in the target example sentence, wherein the fourth hidden vector sequence serves as the syntactic feature vector, and the third neural network and the fourth neural network are gated recurrent neural networks.
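This character-to-word pipeline can be sketched as below (PyTorch; the embedding dimensions and the use of nn.Embedding to encode characters are illustrative assumptions):

    import torch
    import torch.nn as nn

    class SyntaxEncoder(nn.Module):
        # Character-level GRU -> per-word averaging -> word-level GRU,
        # a sketch of the claim-15 syntactic feature extraction.
        def __init__(self, num_chars, char_dim=64, hidden_dim=128):
            super().__init__()
            self.char_embed = nn.Embedding(num_chars, char_dim)  # encodes characters
            self.char_gru = nn.GRU(char_dim, hidden_dim)   # third neural network
            self.word_gru = nn.GRU(hidden_dim, hidden_dim) # fourth neural network

        def forward(self, words):
            # words: list of 1-D LongTensors, one tensor of char ids per word
            word_vecs = []
            for chars in words:
                char_feats = self.char_embed(chars).unsqueeze(1)  # (chars, 1, dim)
                third_hidden, _ = self.char_gru(char_feats)       # third hidden vectors
                word_vecs.append(third_hidden.mean(dim=0))        # average over characters
            word_seq = torch.stack(word_vecs)                     # (words, 1, hidden)
            fourth_hidden_seq, _ = self.word_gru(word_seq)        # syntactic feature vector
            return fourth_hidden_seq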
16. The method of claim 1, wherein before the determining, according to the syntax information and the video semantic feature vector of the target video, of the semantics of the video description statement to be generated corresponding to the syntax to obtain the semantic information, the method further comprises:
acquiring a video frame sequence obtained by framing the target video;
performing semantic extraction on each video frame in the video frame sequence through a convolutional neural network to obtain a semantic vector of each video frame;
and outputting, by a fifth neural network, a fifth hidden vector sequence according to the semantic vector of each video frame in the video frame sequence, wherein the fifth hidden vector sequence serves as the video semantic feature vector, and the fifth neural network is a gated recurrent neural network.
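A sketch of this video-side encoder follows, assuming a ResNet-18 backbone as the convolutional neural network (the patent does not name a specific CNN) and a GRU as the fifth neural network:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class VideoSemanticEncoder(nn.Module):
        def __init__(self, hidden_dim=512):
            super().__init__()
            backbone = models.resnet18(weights=None)
            # Drop the classification head to get per-frame semantic vectors.
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])
            self.frame_gru = nn.GRU(512, hidden_dim)  # fifth neural network

        def forward(self, frames):
            # frames: (num_frames, 3, H, W) sequence from framing the video
            feats = self.cnn(frames).flatten(1)  # (num_frames, 512) semantic vectors
            fifth_hidden_seq, _ = self.frame_gru(feats.unsqueeze(1))
            return fifth_hidden_seq              # video semantic feature vector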
17. An apparatus for generating a video description statement, the apparatus comprising:
an obtaining module, configured to obtain a syntactic feature vector of a target example sentence;
a syntax determining module, configured to determine the syntax of a video description statement to be generated according to the syntactic feature vector to obtain syntax information;
a semantic determining module, configured to determine, according to the syntax information and a video semantic feature vector of a target video, the semantics of the video description statement to be generated corresponding to the syntax to obtain semantic information;
and a video description statement determining module, configured to generate the video description statement of the target video according to the semantic information.
CN202010764613.8A 2020-07-31 2020-07-31 Method and related equipment for generating video description sentences Active CN111988673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010764613.8A CN111988673B (en) 2020-07-31 2020-07-31 Method and related equipment for generating video description sentences

Publications (2)

Publication Number Publication Date
CN111988673A true CN111988673A (en) 2020-11-24
CN111988673B CN111988673B (en) 2023-05-23

Family

ID=73444934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010764613.8A Active CN111988673B (en) 2020-07-31 2020-07-31 Method and related equipment for generating video description sentences

Country Status (1)

Country Link
CN (1) CN111988673B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6941325B1 (en) * 1999-02-01 2005-09-06 The Trustees Of Columbia University Multimedia archive description scheme
US20160117954A1 (en) * 2014-10-24 2016-04-28 Lingualeo, Inc. System and method for automated teaching of languages based on frequency of syntactic models
US20170150235A1 (en) * 2015-11-20 2017-05-25 Microsoft Technology Licensing, Llc Jointly Modeling Embedding and Translation to Bridge Video and Language
US20180285744A1 (en) * 2017-04-04 2018-10-04 Electronics And Telecommunications Research Institute System and method for generating multimedia knowledge base
WO2019000293A1 (en) * 2017-06-29 2019-01-03 Intel Corporation Techniques for dense video descriptions
WO2019144856A1 (en) * 2018-01-24 2019-08-01 腾讯科技(深圳)有限公司 Video description generation method and device, video playing method and device, and storage medium
CN108765383A (en) * 2018-03-22 2018-11-06 山西大学 Video presentation method based on depth migration study
CN109359214A (en) * 2018-10-15 2019-02-19 平安科技(深圳)有限公司 Video description generation method, storage medium and terminal device based on neural network
CN109960747A (en) * 2019-04-02 2019-07-02 腾讯科技(深圳)有限公司 The generation method of video presentation information, method for processing video frequency, corresponding device
CN111291221A (en) * 2020-01-16 2020-06-16 腾讯科技(深圳)有限公司 Method and device for generating semantic description for data source and electronic device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YITIAN YUAN et al.: "Sentence Specified Dynamic Video Thumbnail Generation" *
YUECONG XU et al.: "Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature" *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115410120A (en) * 2022-08-13 2022-11-29 复旦大学 Video description generation method based on perceptual grammar knowledge

Also Published As

Publication number Publication date
CN111988673B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
US12182507B2 (en) Text processing model training method, and text processing method and apparatus
US11475225B2 (en) Method, system, electronic device and storage medium for clarification question generation
CN108960277B (en) Cold fusion of sequence-to-sequence models using language models
CN109670029B (en) Method, apparatus, computer device and storage medium for determining answers to questions
Chen et al. Deep Learning for Video Captioning: A Review.
US10628529B2 (en) Device and method for natural language processing
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN115485696A (en) Countermeasure pretraining of machine learning models
WO2023134082A1 (en) Training method and apparatus for image caption statement generation module, and electronic device
CN114090815A (en) An image description model training method and training device
CN112131430B (en) Video clustering method, device, storage medium and electronic equipment
US11715008B2 (en) Neural network training utilizing loss functions reflecting neighbor token dependencies
CN111339255A (en) Target emotion analysis method, model training method, medium, and device
WO2023137911A1 (en) Intention classification method and apparatus based on small-sample corpus, and computer device
JP2025517085A (en) Contrasting Caption Neural Networks
CN111639186B (en) Multi-category multi-label text classification model and device with dynamic embedded projection gating
US12277767B2 (en) Multimodal unsupervised video temporal segmentation for summarization
CN110852066B (en) A method and system for multilingual entity relation extraction based on adversarial training mechanism
CN113762459B (en) A model training method, text generation method, device, medium and equipment
US20240338859A1 (en) Multilingual text-to-image generation
US20240169541A1 (en) Amodal instance segmentation using diffusion models
US20250119624A1 (en) Video generation using frame-wise token embeddings
CN116977885A (en) Video text task processing method and device, electronic equipment and readable storage medium
US20250054322A1 (en) Attribute Recognition with Image-Conditioned Prefix Language Modeling
CN112784003A (en) Method for training statement repeat model, statement repeat method and device thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant