CN111402130A

CN111402130A - Data processing method and data processing device

Info

Publication number: CN111402130A
Application number: CN202010110945.4A
Authority: CN
Inventors: 李松江; 磯部骏; 贾旭; 袁善欣; 格雷戈里·斯拉堡; 许春景; 田奇
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2020-07-10
Anticipated expiration: 2040-02-21
Also published as: CN111402130B

Abstract

The embodiment of the application discloses a data processing method, which is applied to the field of artificial intelligence, in particular to an image processing technology, and comprises the following steps: acquiring a sequence of frames, frames in the sequence of frames having a first resolution; determining at least two frame groups from the frame sequence, wherein the frame groups comprise a first target frame and at least two adjacent frames of the first target frame, the first target frame is any one frame in the frame sequence, and the adjacent frames are frames except the first target frame in the frame sequence; determining the characteristics of each frame group in at least two frame groups through a three-dimensional convolution neural network, wherein the size of a convolution kernel in the three-dimensional convolution neural network in a time dimension is positively correlated with the number of frames in the frame groups; fusing the characteristics of each frame group of the at least two frame groups to determine the detail characteristics of the first target frame; and acquiring a first target frame with a second resolution according to the detail features and the first target frame, wherein the second resolution is greater than the first resolution.

Description

Data processing method and data processing device

技术领域technical field

本申请涉及图像处理技术领域，尤其涉及一种数据处理方法和数据处理装置。The present application relates to the technical field of image processing, and in particular, to a data processing method and a data processing device.

背景技术Background technique

超分辨率(super resolution，SR)，简称超分，是指根据低分辨率图像重建出相应的高分辨率图像的技术。通过将低分辨率的图像上采样放大，借助图像先验知识等手段填充细节，生成对应的高分辨率图像。超分辨率技术在高清电视、监控设备、卫星图像和医学影像等领域有重要的应用价值。Super-resolution (SR), referred to as super-resolution, refers to the technology of reconstructing corresponding high-resolution images from low-resolution images. By upsampling and enlarging the low-resolution image, and filling in the details by means of image prior knowledge, the corresponding high-resolution image is generated. Super-resolution technology has important application value in the fields of high-definition television, surveillance equipment, satellite imagery and medical imaging.

视频超分辨率(video super resolution，VSR)，简称视频超分，是将低分辨率的视频生成对应的高分辨率视频，视频超分区别于图像超分的一个核心操作是运动补偿，即对目标帧附近的多帧图像进行信息提取和融合，利用目标帧前后的邻近帧的相似性获取细节信息，进而生成高分辨率视频。具体地，现有技术中，将包含目标帧的至少7帧低分辨率帧序列输入三维(3D)卷积神经网络，通过3*3*3尺寸的卷积核隐式地进行细节信息的提取和融合，由于3*3*3尺寸的卷积核在时间维度的尺寸为3，即一次可以同时处理3帧，然后通过卷积核依次滑动1帧进行信息提取，获取细节信息，根据该细节信息并通过上采样放大低分辨率的目标帧，最终获取高分辨的目标帧。Video super resolution (VSR), referred to as video super-resolution, is to generate a corresponding high-resolution video from a low-resolution video. A core operation that distinguishes video super-resolution from image super-resolution is motion compensation, that is, to The multi-frame images near the target frame are subjected to information extraction and fusion, and the similarity of adjacent frames before and after the target frame is used to obtain detailed information, thereby generating high-resolution video. Specifically, in the prior art, at least 7 low-resolution frame sequences including target frames are input into a three-dimensional (3D) convolutional neural network, and detailed information is implicitly extracted through a 3*3*3 convolution kernel. And fusion, because the size of the 3*3*3 convolution kernel in the time dimension is 3, that is, 3 frames can be processed at the same time, and then the convolution kernel slides 1 frame in turn for information extraction to obtain detailed information, according to the details. information and enlarge the low-resolution target frame by upsampling, and finally obtain the high-resolution target frame.

现有技术利用3D卷积神经网络进行运动补偿时，受限于卷积核的尺寸限制，每次滑动处理3帧，以输入7帧的帧序列，且第4帧为目标帧为例，卷积核滑动至第1、2、3帧时，由于缺少目标帧的指导信息，将导致特征提取较为盲目，特征提取效率低。When the prior art uses a 3D convolutional neural network for motion compensation, it is limited by the size of the convolution kernel, and 3 frames are processed each time the sliding process is performed. When the accumulated kernel slides to the 1st, 2nd, and 3rd frames, due to the lack of guidance information of the target frame, the feature extraction will be relatively blind and the feature extraction efficiency will be low.

发明内容SUMMARY OF THE INVENTION

本申请实施例提供了一种数据处理方法，用于帧序列超分辨率，可以提高特征提取效率，减少计算量。The embodiment of the present application provides a data processing method for frame sequence super-resolution, which can improve the efficiency of feature extraction and reduce the amount of calculation.

本申请实施例第一方面提供了一种数据处理方法，包括：数据处理装置获取帧序列，所述帧序列中的帧具有第一分辨率；所述数据处理装置从所述帧序列中确定至少两个帧组，所述帧组包括第一目标帧和所述第一目标帧的至少两个邻近帧，所述第一目标帧为所述帧序列中的任意一帧，所述邻近帧为所述帧序列中除所述第一目标帧以外的帧；所述数据处理装置通过三维卷积神经网络确定所述至少两个帧组中每个帧组的特征，所述每个帧组的特征指示基于所述第一目标帧，从所述每个帧组内的邻近帧中获取的细节信息，所述三维卷积神经网络中卷积核在时间维度的尺寸与所述帧组中帧的数量正相关；所述数据处理装置融合所述至少两个帧组中每个帧组的特征，以确定所述第一目标帧的细节特征，所述细节特征指示基于所述第一目标帧，从所述至少两个帧组内的邻近帧中获取的细节信息；所述数据处理装置根据所述细节特征和所述第一目标帧，获取具有第二分辨率的第一目标帧，所述第二分辨率大于所述第一分辨率。可选地，所述三维卷积神经网络中卷积核在时间维度的尺寸越大，帧组中帧的数量也越大，可选地，所述三维卷积神经网络中卷积核在时间维度的尺寸与所述帧组中帧的数量相等。A first aspect of an embodiment of the present application provides a data processing method, including: a data processing device acquires a frame sequence, where the frames in the frame sequence have a first resolution; the data processing device determines from the frame sequence at least Two frame groups, the frame group includes a first target frame and at least two adjacent frames of the first target frame, the first target frame is any frame in the frame sequence, and the adjacent frame is Frames other than the first target frame in the frame sequence; the data processing device determines the characteristics of each frame group in the at least two frame groups through a three-dimensional convolutional neural network, and the The feature indication is based on the detailed information obtained from the adjacent frames in each frame group based on the first target frame, and the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension is the same as the frame in the frame group. are positively correlated; the data processing device fuses the features of each frame group in the at least two frame groups to determine the detail feature of the first target frame, and the detail feature indication is based on the first target frame , the detail information obtained from the adjacent frames in the at least two frame groups; the data processing device obtains the first target frame with the second resolution according to the detail feature and the first target frame, so The second resolution is greater than the first resolution. Optionally, the larger the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension, the larger the number of frames in the frame group. The dimensions of the dimension are equal to the number of frames in the frame group.

本申请实施例提供的数据处理方法，帧序列中第一目标帧超分过程中，先从帧序列中确定至少两个帧组，帧组中包括该第一目标帧，分别将帧组输入三维卷积神经网络提取帧组的组特征，然后将组特征融合，确定第一目标帧的细节特征，根据该细节特征将第一分辨率的第一目标帧转换为第二分辨率的第一目标帧。该数据处理方法，由于从帧序列中确定的帧组中包含第一目标帧，且三维卷积神经网络中卷积核在时间维度的尺寸与所述帧组中帧的数量正相关，根据三维卷积神经网络中卷积核在时间维度的尺寸，设置帧组中帧的室内，由此，通过三维卷积神经网络提取帧组的特征时，不仅可以减少卷积核滑动，降低计算量，还可以得到目标帧的指导，细节特征提取效率高。In the data processing method provided by the embodiment of the present application, in the process of super-dividing the first target frame in the frame sequence, at least two frame groups are first determined from the frame sequence, the frame groups include the first target frame, and the frame groups are respectively input into the three-dimensional frame. The convolutional neural network extracts the group features of the frame group, and then fuses the group features to determine the detail features of the first target frame, and converts the first target frame of the first resolution to the first target of the second resolution according to the detail features. frame. In this data processing method, since the frame group determined from the frame sequence includes the first target frame, and the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension is positively related to the number of frames in the frame group, according to the three-dimensional The size of the convolution kernel in the convolutional neural network in the time dimension is set in the room of the frame in the frame group. Therefore, when extracting the features of the frame group through the three-dimensional convolutional neural network, it can not only reduce the sliding of the convolution kernel, but also reduce the amount of calculation. The guidance of the target frame can also be obtained, and the detailed feature extraction efficiency is high.

在第一方面的一种可能的实现方式中，所述帧组包括所述第一目标帧和两个所述邻近帧。In a possible implementation manner of the first aspect, the frame group includes the first target frame and two adjacent frames.

本申请实施例提供的数据处理方法，确定的帧组数量为3帧，即所述卷积核在时间维度的尺寸为3，可以通过卷积核尺寸为3*3*3的3D卷积神经网络进行特征提取，计算量较小。In the data processing method provided in this embodiment of the present application, the determined number of frame groups is 3 frames, that is, the size of the convolution kernel in the time dimension is 3, and a 3D convolutional neural network with a convolution kernel size of 3*3*3 can be used. The network performs feature extraction, and the amount of computation is small.

在第一方面的一种可能的实现方式中，所述两个所述邻近帧包括第一邻近帧和第二邻近帧，所述第一邻近帧在所述帧序列中与所述第一目标帧之间的间隔，与所述第二邻近帧在所述帧序列中与所述第一目标帧之间的间隔相等。In a possible implementation manner of the first aspect, the two adjacent frames include a first adjacent frame and a second adjacent frame, and the first adjacent frame is associated with the first target in the frame sequence The interval between frames is equal to the interval between the second adjacent frame and the first target frame in the frame sequence.

本申请实施例提供的数据处理方法，帧组中第一邻近帧和第二邻近帧在帧序列中关于目标帧对称，考虑到运动的连续性，对于通过相同时间间隔连续获取的帧序列，两个邻近帧在时间维度关于目标帧对称，可更有效提取特征。In the data processing method provided by the embodiment of the present application, the first adjacent frame and the second adjacent frame in the frame group are symmetrical with respect to the target frame in the frame sequence. Considering the continuity of motion, for the frame sequence obtained continuously through the same time interval, the two The adjacent frames are symmetrical about the target frame in the time dimension, which can extract features more efficiently.

在第一方面的一种可能的实现方式中，所述至少两个帧组包括三个帧组。In a possible implementation manner of the first aspect, the at least two frame groups include three frame groups.

由于帧组数量越多，特征提取的计算量越大，帧组数量越少，可以获取的细节信息越少。本申请实施例提供的数据处理方法，通过融合三个帧组的特征进行细节特征提取，可以在提供足够的信息量与减少计算量中取得较好的平衡。Since the number of frame groups is larger, the computational complexity of feature extraction is larger, and the smaller the number of frame groups, the less detailed information can be obtained. The data processing method provided by the embodiment of the present application can achieve a better balance between providing sufficient information and reducing the amount of calculation by fusing the features of three frame groups to perform detailed feature extraction.

在第一方面的一种可能的实现方式中，所述方法还包括：所述数据处理装置将所述至少两个帧组中每个帧组内的帧对齐，确定对齐的至少两个帧组；所述数据处理装置通过三维卷积神经网络确定所述至少两个帧组中每个帧组的特征包括：所述数据处理装置通过三维卷积神经网络确定所述对齐的至少两个帧组中每个帧组的特征。In a possible implementation manner of the first aspect, the method further includes: the data processing apparatus aligns the frames in each of the at least two frame groups, and determines the aligned at least two frame groups Determining the characteristics of each frame group in the at least two frame groups by the data processing device through a three-dimensional convolutional neural network includes: the data processing device determining the aligned at least two frame groups through a three-dimensional convolutional neural network Features of each frame group in .

本申请实施例提供的数据处理方法，在通过三维卷积神经网络提取帧组的特征之前，可以先对第一帧组进行帧对齐处理，由此，可以更有效地提取帧组的特征。In the data processing method provided by the embodiment of the present application, before extracting the features of the frame group through the three-dimensional convolutional neural network, frame alignment processing can be performed on the first frame group, thereby, the features of the frame group can be extracted more effectively.

在第一方面的一种可能的实现方式中，所述数据处理装置将所述至少两个帧组中每个帧组内的帧对齐，确定对齐的至少两个帧组包括：所述数据处理装置确定由所述至少两个帧组中的帧组成的队列内所有连续两帧之间的单应矩阵；所述数据处理装置根据所述单应矩阵确定所述对齐的至少两个帧组。In a possible implementation manner of the first aspect, the data processing apparatus aligns the frames in each of the at least two frame groups, and determining the aligned at least two frame groups includes: the data processing The apparatus determines a homography matrix between all two consecutive frames in a queue composed of frames in the at least two frame groups; the data processing apparatus determines the aligned at least two frame groups according to the homography matrices.

本申请实施例提供的数据处理方法，第一目标帧的帧组通过单应矩阵的方法进行帧对齐，可以减少计算量。In the data processing method provided by the embodiment of the present application, the frame group of the first target frame is aligned by the method of the homography matrix, which can reduce the amount of calculation.

在第一方面的一种可能的实现方式中，所述方法还包括：所述数据处理装置确定所述至少两个帧组中每个帧组的特征的权值；所述数据处理装置融合所述至少两个帧组中每个帧组的特征，以确定所述第一目标帧的细节特征包括：所述数据处理装置根据所述权值融合所述至少两个帧组中每个帧组的特征，以确定所述第一目标帧的细节特征。In a possible implementation manner of the first aspect, the method further includes: the data processing apparatus determines a weight of the feature of each frame group in the at least two frame groups; the data processing apparatus fuses the The characteristics of each frame group in the at least two frame groups to determine the detailed characteristics of the first target frame include: the data processing device fuses each frame group in the at least two frame groups according to the weights features to determine the detailed features of the first target frame.

本申请实施例提供的数据处理方法，在融合多个特征时，可以通过深度学习网络注意力机制计算注意力掩膜，确定各个帧组的特征的权值，据此融合各特征，最终确定目标帧的细节特征。In the data processing method provided by the embodiments of the present application, when multiple features are fused, an attention mask can be calculated through a deep learning network attention mechanism, the weights of the features of each frame group can be determined, and the features can be fused accordingly to finally determine the target. Details of the frame.

本申请实施例第二方面提供了一种数据处理装置，所述装置包括：获取单元，用于获取帧序列，所述帧序列中的帧具有第一分辨率；确定单元，用于从所述帧序列中确定至少两个帧组，所述帧组包括第一目标帧和所述第一目标帧的至少两个邻近帧，所述第一目标帧为所述帧序列中的任意一帧，所述邻近帧为所述帧序列中除所述第一目标帧以外的帧；所述确定单元，还用于通过三维卷积神经网络确定所述至少两个帧组中每个帧组的特征，所述每个帧组的特征指示基于所述第一目标帧，从所述每个帧组内的邻近帧中获取的细节信息，所述三维卷积神经网络中卷积核在时间维度的尺寸与所述帧组中帧的数量正相关；处理单元，用于融合所述至少两个帧组中每个帧组的特征，以确定所述第一目标帧的细节特征，所述细节特征指示基于所述第一目标帧，从所述至少两个帧组内的邻近帧中获取的细节信息；所述获取单元，还用于根据所述细节特征和所述第一目标帧，获取具有第二分辨率的第一目标帧，所述第二分辨率大于所述第一分辨率。A second aspect of an embodiment of the present application provides a data processing apparatus, the apparatus includes: an acquisition unit, configured to acquire a frame sequence, where the frames in the frame sequence have a first resolution; a determination unit, configured to obtain a frame sequence from the At least two frame groups are determined in the frame sequence, the frame group includes a first target frame and at least two adjacent frames of the first target frame, and the first target frame is any frame in the frame sequence, The adjacent frames are frames other than the first target frame in the frame sequence; the determining unit is further configured to determine the feature of each frame group in the at least two frame groups through a three-dimensional convolutional neural network , the feature indication of each frame group is based on the first target frame, the detailed information obtained from the adjacent frames in each frame group, and the convolution kernel in the three-dimensional convolutional neural network is in the time dimension. The size is positively related to the number of frames in the frame group; the processing unit is used for fusing the features of each frame group in the at least two frame groups to determine the detail features of the first target frame, the detail features Indicates the detail information obtained from the adjacent frames in the at least two frame groups based on the first target frame; the obtaining unit is further configured to obtain, according to the detail feature and the first target frame A first target frame of a second resolution, the second resolution being greater than the first resolution.

在第二方面的一种可能的实现方式中，所述帧组包括所述第一目标帧和两个所述邻近帧。In a possible implementation manner of the second aspect, the frame group includes the first target frame and two adjacent frames.

在第二方面的一种可能的实现方式中，所述两个所述邻近帧包括第一邻近帧和第二邻近帧，所述第一邻近帧在所述帧序列中与所述第一目标帧之间的间隔，与所述第二邻近帧在所述帧序列中与所述第一目标帧之间的间隔相等。In a possible implementation manner of the second aspect, the two adjacent frames include a first adjacent frame and a second adjacent frame, and the first adjacent frame is associated with the first target in the frame sequence The interval between frames is equal to the interval between the second adjacent frame and the first target frame in the frame sequence.

在第二方面的一种可能的实现方式中，所述至少两个帧组包括三个帧组。In a possible implementation manner of the second aspect, the at least two frame groups include three frame groups.

在第二方面的一种可能的实现方式中，所述确定单元，还用于将所述至少两个帧组中每个帧组内的帧对齐，确定对齐的至少两个帧组；所述确定单元具体用于：通过三维卷积神经网络确定所述对齐的至少两个帧组中每个帧组的特征。In a possible implementation manner of the second aspect, the determining unit is further configured to align the frames in each of the at least two frame groups, and determine the aligned at least two frame groups; the The determining unit is specifically configured to: determine the feature of each frame group in the aligned at least two frame groups through a three-dimensional convolutional neural network.

在第二方面的一种可能的实现方式中，所述确定单元具体用于：确定由所述至少两个帧组中的帧组成的队列内所有连续两帧之间的单应矩阵；根据所述单应矩阵确定所述对齐的至少两个帧组。In a possible implementation manner of the second aspect, the determining unit is specifically configured to: determine a homography matrix between all consecutive two frames in a queue composed of frames in the at least two frame groups; The homography matrix determines the aligned at least two frame groups.

在第二方面的一种可能的实现方式中，所述确定单元还用于：所述确定单元还用于：通过深度神经网络确定所述至少两个帧组中每个帧组的特征的权值；所述处理单元具体用于：根据所述权值融合所述至少两个帧组中每个帧组的特征，以确定所述第一目标帧的细节特征。In a possible implementation manner of the second aspect, the determining unit is further configured to: the determining unit is further configured to: determine, through a deep neural network, a weight of the feature of each frame group in the at least two frame groups value; the processing unit is specifically configured to: fuse the features of each frame group in the at least two frame groups according to the weight value to determine the detailed feature of the first target frame.

在第二方面的一种可能的实现方式中，所述三维卷积神经网络中卷积核在时间维度的尺寸与所述帧组中帧的数量相等。In a possible implementation manner of the second aspect, the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension is equal to the number of frames in the frame group.

本申请实施例第三方面提供了一种包含指令的计算机程序产品，其特征在于，当其在计算机上运行时，使得所述计算机执行如上述第一方面或第一方面任意一种可能实现方式的方法。A third aspect of the embodiments of the present application provides a computer program product containing instructions, characterized in that, when it runs on a computer, the computer is caused to execute the first aspect or any one of the possible implementations of the first aspect. Methods.

本申请实施例第四方面提供了一种计算机可读存储介质，包括指令，其特征在于，当所述指令在计算机上运行时，使得计算机执行如上述第一方面或第一方面任意一种可能实现方式的方法。A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, including instructions, characterized in that, when the instructions are executed on a computer, the computer is made to execute the first aspect or any one of the first aspects. method of implementation.

本申请实施例第五方面提供了一种芯片系统，该芯片系统包括处理器，处理器用于读取并执行存储器中存储的计算机程序，以执行上述任一方面任意可能的实现方式所涉及的功能。在一种可能的设计中，芯片系统还包括存储器，该存储器与该处理器电连接。进一步可选地，该芯片还包括通信接口，处理器与该通信接口连接。通信接口用于接收需要处理的数据和/或信息，处理器从该通信接口获取该数据和/或信息，并对该数据和/或信息进行处理，并通过该通信接口输出处理结果。该通信接口可以是输入输出接口。该芯片系统，可以由芯片构成，也可以包含芯片和其他分立器件。A fifth aspect of the embodiments of the present application provides a chip system, where the chip system includes a processor, and the processor is configured to read and execute a computer program stored in a memory, so as to execute the functions involved in any possible implementation manner of any of the foregoing aspects . In one possible design, the system-on-a-chip further includes a memory that is electrically connected to the processor. Further optionally, the chip further includes a communication interface, and the processor is connected to the communication interface. The communication interface is used for receiving data and/or information to be processed, the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface. The communication interface may be an input-output interface. The chip system may be composed of chips, or may include chips and other discrete devices.

其中，第二方面、第三方面、第四方面以及第五方面中任一种实现方式所带来的技术效果可参见第一方面中相应实现方式所带来的技术效果，此处不再赘述。Wherein, for the technical effect brought by any one of the second aspect, the third aspect, the fourth aspect and the fifth aspect, please refer to the technical effect brought by the corresponding implementation in the first aspect, which will not be repeated here. .

本申请实施例提供的数据处理方法，其优点在于：The data processing method provided by the embodiment of the present application has the advantages of:

本申请实施例提供的数据处理方法，帧序列中第一目标帧超分过程中，先从帧序列中确定至少两个帧组，帧组中包括该第一目标帧，分别将帧组输入三维卷积神经网络提取帧组的组特征，然后将组特征融合，确定第一目标帧的细节特征，根据该细节特征将第一分辨率的第一目标帧转换为第二分辨率的第一目标帧。该数据处理方法，由于从帧序列中确定的帧组中包含第一目标帧，且三维卷积神经网络中卷积核在时间维度的尺寸与所述帧组中帧的数量匹配，因此，通过三维卷积神经网络提取帧组的组特征时，不仅可以减少卷积核滑动，降低计算量，还可以得到目标帧的指导，细节特征提取效率高。In the data processing method provided by the embodiment of the present application, in the process of super-dividing the first target frame in the frame sequence, at least two frame groups are first determined from the frame sequence, the frame groups include the first target frame, and the frame groups are respectively input into the three-dimensional frame. The convolutional neural network extracts the group features of the frame group, and then fuses the group features to determine the detail features of the first target frame, and converts the first target frame of the first resolution to the first target of the second resolution according to the detail features. frame. In this data processing method, since the frame group determined from the frame sequence includes the first target frame, and the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension matches the number of frames in the frame group, therefore, by When the 3D convolutional neural network extracts the group features of the frame group, it can not only reduce the sliding of the convolution kernel and reduce the amount of calculation, but also obtain the guidance of the target frame, and the extraction efficiency of details is high.

附图说明Description of drawings

图1为本申请实施例提供的一种人工智能主体框架示意图；1 is a schematic diagram of an artificial intelligence main body framework provided by an embodiment of the present application;

图2为本申请实施例提供的一种应用环境示意图；2 is a schematic diagram of an application environment provided by an embodiment of the present application;

图3为本申请实施例提供的一种卷积神经网络结构示意图；3 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application;

图4为本申请实施例提供的另一种卷积神经网络结构示意图；4 is a schematic structural diagram of another convolutional neural network provided by an embodiment of the present application;

图5为本申请实施例中数据处理方法的一个应用场景示意图；5 is a schematic diagram of an application scenario of the data processing method in the embodiment of the present application;

图6为本申请实施例中数据处理方法的另一个应用场景示意图；6 is a schematic diagram of another application scenario of the data processing method in the embodiment of the present application;

图7为本申请实施例提供的数据处理方法的一个实施例示意图；FIG. 7 is a schematic diagram of an embodiment of a data processing method provided by an embodiment of the present application;

图8为本申请实施例中帧对齐方法的一个实施例示意图；FIG. 8 is a schematic diagram of an embodiment of a frame alignment method in an embodiment of the present application;

图9为本申请实施例中单应矩阵计算的一个实施例示意图；FIG. 9 is a schematic diagram of an embodiment of homography matrix calculation in the embodiment of the present application;

图10为本申请实施例中时序分组的一个实施例示意图；FIG. 10 is a schematic diagram of an embodiment of timing grouping in an embodiment of the present application;

图11为本申请实施例中组间特征融合的一个实施例示意图；11 is a schematic diagram of an embodiment of feature fusion between groups in an embodiment of the present application;

图12为本申请实施例中进行组间特征融合的另一个实施例示意图；12 is a schematic diagram of another embodiment of performing feature fusion between groups in an embodiment of the present application;

图13为本申请实施例中采样放大的一个实施例示意图；13 is a schematic diagram of an embodiment of sampling amplification in the embodiment of the present application;

图14为本申请实施例提供的数据处理方法的另一个实施例示意图；FIG. 14 is a schematic diagram of another embodiment of the data processing method provided by the embodiment of the present application;

图15为本申请实施例提供的数据处理装置的一个实施例示意图；FIG. 15 is a schematic diagram of an embodiment of a data processing apparatus provided by an embodiment of the present application;

图16为本申请实施例提供的一种芯片硬件结构图。FIG. 16 is a structural diagram of a chip hardware provided by an embodiment of the present application.

具体实施方式Detailed ways

本申请实施例提供了一种数据处理方法，用于连续拍摄获取的帧序列的超分辨率，可以减少计算量，提高细节特征提取效果。The embodiment of the present application provides a data processing method, which is used for super-resolution of a frame sequence obtained by continuous shooting, which can reduce the amount of calculation and improve the effect of detail feature extraction.

下面对本申请实施例涉及的术语进行简要介绍。The terms involved in the embodiments of the present application are briefly introduced below.

视频：视频由一系列静态图像构成，其中图像通常称为帧(frame)，每秒传输或显示的帧数称为帧率(frames per second，FPS)，帧率越大，画面越流畅；帧率越小，画面越有跳动感。当视频帧率不低于24fps时，由于“视觉暂留”现象，人眼无法辨别单幅的静态画面，看上去是平滑连续的视觉效果，这样连续的帧序列即为视频。本申请实施例中的视频是指经连续拍摄得到的帧序列。Video: Video consists of a series of static images, in which the image is usually called a frame, and the number of frames transmitted or displayed per second is called the frame rate (frames per second, FPS). The higher the frame rate, the smoother the picture; the frame The smaller the rate, the more jumpy the picture is. When the video frame rate is not lower than 24fps, due to the phenomenon of "persistence of vision", the human eye cannot distinguish a single static picture, and it looks like a smooth and continuous visual effect, such a continuous frame sequence is a video. The video in the embodiment of the present application refers to a frame sequence obtained by continuous shooting.

图像分辨率(resolution)：分辨率指图像中存储的信息量，是每英寸图像内有多少个像素点，分辨率的单位为PPI(pixels per inch)，通常叫做像素每英寸。表达方式为：水平像素数×垂直像素数，常见分辨率有1280﹡720PPI；1920﹡1080PPI等规格。分辨率越高，图像越大；反之，图像越小。Image resolution (resolution): The resolution refers to the amount of information stored in the image, which is how many pixels per inch of the image. The unit of resolution is PPI (pixels per inch), usually called pixels per inch. The expression is: the number of horizontal pixels × the number of vertical pixels, and the common resolutions are 1280*720PPI; 1920*1080PPI and other specifications. The higher the resolution, the larger the image; conversely, the smaller the image.

视频超分辨率：简称“视频超分”，由低分辨率视频重建出相应的高分辨率视频。通过将低分辨率的图像上采样放大，借助图像先验知识、图像自相似性和多帧图像互补信息等手段填充细节，生成对应的高分辨率图像。Video super-resolution: referred to as "video super-resolution", a corresponding high-resolution video is reconstructed from a low-resolution video. By upsampling and enlarging the low-resolution image, the corresponding high-resolution image is generated by filling in the details by means of image prior knowledge, image self-similarity, and multi-frame image complementary information.

连续拍摄获取的帧序列中，相邻的帧通常很相似。对于帧序列的超分，例如视频流超分、视频监控超分、旧电影高清化等任务中，简单应用图像超分辨率方法对视频逐帧处理并不能达到理想的效果。一方面，由于图像超分并没有考虑前后帧的信息，缺乏时间连续性，生成的高分辨率视频会出现闪烁、抖动等伪影，极大地影响观看和后续的应用；另一方面，图像超分辨率方法缺乏前后帧的互补信息，较低的信息利用率限制了其超分的性能。因此，在图像超分辨率的基础上，能够有效利用前后帧信息的视频超分辨率技术(VSR)得到了更多的关注。In a sequence of frames obtained by continuous shooting, adjacent frames are usually very similar. For the super-resolution of frame sequences, such as video streaming super-resolution, video surveillance super-resolution, and old movie high-definition tasks, simply applying image super-resolution methods to video frame-by-frame processing cannot achieve ideal results. On the one hand, since the image super-resolution does not consider the information of the previous and previous frames and lacks temporal continuity, the generated high-resolution video will have artifacts such as flicker and jitter, which greatly affects viewing and subsequent applications; on the other hand, the image super-resolution The resolution method lacks the complementary information of the front and rear frames, and the low information utilization limits its super-resolution performance. Therefore, on the basis of image super-resolution, video super-resolution (VSR), which can effectively utilize the information of previous and previous frames, has received more attention.

运动补偿：是一种描述相邻帧差别的方法。连续拍摄获取的帧序列中，相邻的帧通常很相似，也就是说包含了很多冗余，简单的运动补偿是从当前帧中减去参考帧，从而得到帧间的区别，即超分技术中需要获取的细节信息。Motion compensation: is a method of describing the difference between adjacent frames. In the frame sequence obtained by continuous shooting, adjacent frames are usually very similar, that is to say, they contain a lot of redundancy. Simple motion compensation is to subtract the reference frame from the current frame to obtain the difference between frames, that is, super-division technology. Details to be obtained in .

视频超分辨率的一个核心操作是对多帧时空信息的提取和融合，因而需要运动补偿来处理视频帧之间的运动。根据运动补偿的方式，现有的视频超分辨率方法可以分为两大类：显式运动补偿和隐式运动补偿。A core operation of video super-resolution is the extraction and fusion of multi-frame spatiotemporal information, so motion compensation is needed to deal with the motion between video frames. According to the way of motion compensation, existing video super-resolution methods can be divided into two categories: explicit motion compensation and implicit motion compensation.

显式运动补偿方法在预处理阶段使用光流等手段直接将图像扭曲并对齐，这类方法计算量大，且存在明显的伪影。The explicit motion compensation method uses optical flow and other means to directly distort and align the image in the preprocessing stage, which is computationally intensive and has obvious artifacts.

而隐式运动补偿则借助3D卷积或者可变形卷积等操作在神经网络中隐式地进行运动补偿，这类方法受限于3D卷积和可变形卷积的结构，需要巨大的计算量，导致其运行速度很慢。因此，在有限计算量的情况下更有效地融合视频序列中多帧时序信息的视频超分方法成为了当前工业界和学术界研究的热点。Implicit motion compensation uses operations such as 3D convolution or deformable convolution to implicitly perform motion compensation in neural networks. Such methods are limited by the structure of 3D convolution and deformable convolution, and require a huge amount of computation. , causing it to run very slowly. Therefore, a video super-segmentation method that more effectively fuses the timing information of multiple frames in a video sequence under the condition of limited computation has become a hot research topic in the current industry and academia.

三维卷积，简称3D卷积，是一种应用于三维数据上的卷积，其卷积核包含三个维度，比常用的2D卷积核在长和宽的维度之外多一个深度维度，深度维度可以是视频的多个帧、也可以是立体图像的不同切片。3D卷积可以有效提取视频序列或三维图像中的时空信息，常应用于运动识别、医疗图像处理和视频处理等任务中。本申请实施例三维卷积核中的深度维度是指时间维度，指视频中在不同时间点获取的多个帧。Three-dimensional convolution, referred to as 3D convolution, is a convolution applied to three-dimensional data. Its convolution kernel contains three dimensions, which is one more depth dimension than the commonly used 2D convolution kernel in addition to the length and width dimensions. The depth dimension can be multiple frames of a video or different slices of a stereoscopic image. 3D convolution can effectively extract spatiotemporal information in video sequences or 3D images, and is often used in tasks such as motion recognition, medical image processing, and video processing. The depth dimension in the three-dimensional convolution kernel in the embodiment of the present application refers to the time dimension, which refers to multiple frames obtained at different time points in the video.

帧序列：指具有顺序的多帧图像。Frame Sequence: Refers to a sequence of multiple frames of images.

帧组：指本申请实施例中同时输入三维卷积神经网络进行信息提取的多个帧，包括目标帧和目标帧的邻近帧。Frame group: refers to multiple frames that are simultaneously input to the three-dimensional convolutional neural network for information extraction in the embodiment of the present application, including the target frame and the adjacent frames of the target frame.

下面结合附图，对本申请的实施例进行描述，显然，所描述的实施例仅仅是本申请一部分的实施例，而不是全部的实施例。本领域普通技术人员可知，随着技术的发展和新场景的出现，本申请实施例提供的技术方案对于类似的技术问题，同样适用。The embodiments of the present application will be described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Those of ordinary skill in the art know that with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.

本申请中出现的术语“和/或”，可以是一种描述关联对象的关联关系，表示可以存在三种关系，例如，A和/或B，可以表示：单独存在A，同时存在A和B，单独存在B的情况，其中A，B可以是单数或者复数。另外，本申请中字符“/”，一般表示前后关联对象是一种“或”的关系。本申请中，“至少一个”是指一个或多个，“多个”是指两个或两个以上。“以下至少一项(个)”或其类似表达，是指的这些项中的任意组合，包括单项(个)或复数项(个)的任意组合。例如，a，b，或c中的至少一项(个)，可以表示：a，b，c，a-b，a-c，b-c，或a-b-c，其中a，b，c可以是单个，也可以是多个。The term "and/or" that appears in this application can be an association relationship to describe associated objects, indicating that there can be three kinds of relationships, for example, A and/or B, it can mean that A exists alone, and A and B exist at the same time , the case of B alone, where A and B can be singular or plural. In addition, the character "/" in this application generally indicates that the related objects are an "or" relationship. In this application, "at least one" means one or more, and "plurality" means two or more. "At least one item(s) below" or similar expressions thereof refer to any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (a) of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, c may be single or multiple .

本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象，而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换，这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外，术语“包括”和“具有”以及他们的任何变形，意图在于覆盖不排他的包含，以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元，而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。The terms "first", "second" and the like in the description and claims of the present application and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the terms used in this way can be interchanged under appropriate circumstances, and this is only a distinguishing manner adopted when describing objects with the same attributes in the embodiments of the present application. Furthermore, the terms "comprising" and "having" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, product or device comprising a series of elements is not necessarily limited to those elements, but may include no explicit or other units inherent to these processes, methods, products, or devices.

图1示出一种人工智能主体框架示意图，该主体框架描述了人工智能系统总体工作流程，适用于通用的人工智能领域需求。Figure 1 shows a schematic diagram of an artificial intelligence main frame, which describes the overall workflow of an artificial intelligence system and is suitable for general artificial intelligence field requirements.

下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主题框架进行阐述。The above artificial intelligence theme framework will be explained from the two dimensions of "intelligent information chain" (horizontal axis) and "IT value chain" (vertical axis).

“智能信息链”反映从数据的获取到处理的一列过程。举例来说，可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中，数据经历了“数据—信息—知识—智慧”的凝练过程。The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".

“IT价值链”从人智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程，反映人工智能为信息技术产业带来的价值。The "IT value chain" reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system.

(1)基础设施：(1) Infrastructure:

基础设施为人工智能系统提供计算能力支持，实现与外部世界的沟通，并通过基础平台实现支撑。通过传感器与外部沟通；计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供；基础平台包括分布式计算框架及网络等相关的平台保障和支持，可以包括云存储和计算、互联互通网络等。举例来说，传感器和外部沟通获取数据，这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。The infrastructure provides computing power support for artificial intelligence systems, realizes communication with the outside world, and supports through the basic platform. Communication with the outside world through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA); the basic platform includes distributed computing framework and network-related platform guarantee and support, which can include cloud storage and computing, interconnection networks, etc. For example, sensors communicate with external parties to obtain data, and these data are provided to the intelligent chips in the distributed computing system provided by the basic platform for calculation.

(2)数据(2) Data

基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本，还涉及到传统设备的物联网数据，包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。The data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence. The data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.

(3)数据处理(3) Data processing

数据处理通常包括数据训练，机器学习，深度学习，搜索，推理，决策等方式。Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.

其中，机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。Among them, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.

推理是指在计算机或智能系统中，模拟人类的智能推理方式，依据推理控制策略，利用形式化的信息进行机器思维和求解问题的过程，典型的功能是搜索与匹配。Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.

决策是指智能信息经过推理后进行决策的过程，通常提供分类、排序、预测等功能。Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.

(4)通用能力(4) General ability

对数据经过上面提到的数据处理后，进一步基于数据处理的结果可以形成一些通用的能力，比如可以是算法或者一个通用系统，例如，翻译，文本的分析，计算机视觉的处理，语音识别，图像的识别等等。After the above-mentioned data processing, some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.

(5)智能产品及行业应用(5) Smart products and industry applications

智能产品及行业应用指人工智能系统在各领域的产品和应用，是对人工智能整体解决方案的封装，将智能信息决策产品化、实现落地应用，其应用领域主要包括：智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶，平安城市，智能终端等。Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, and the productization of intelligent information decision-making and implementation of applications. Its application areas mainly include: intelligent manufacturing, intelligent transportation, Smart home, smart medical care, smart security, autonomous driving, safe city, smart terminals, etc.

参见附图2，本申请实施例提供了一种系统架构200。数据采集设备260用于采集连续拍摄的帧序列并存入数据库230，训练设备220基于数据库230中维护的帧序列数据生成目标模型/规则201。下面将更详细地描述训练设备220如何基于帧序列数据得到目标模型/规则201，目标模型/规则201能够用于视频超分、图像序列超分等应用场景。Referring to FIG. 2 , an embodiment of the present application provides a system architecture 200 . The data collection device 260 is used to collect continuously shot frame sequences and store them in the database 230 , and the training device 220 generates the target model/rule 201 based on the frame sequence data maintained in the database 230 . The following will describe in more detail how the training device 220 obtains the target model/rule 201 based on the frame sequence data. The target model/rule 201 can be used in application scenarios such as video super-score and image sequence super-score.

该目标模型/规则201可以是基于深度神经网络得到的，下面对深度神经网络进行介绍。The target model/rule 201 may be obtained based on a deep neural network, and the deep neural network will be introduced below.

深度神经网络中的每一层的工作可以用数学表达式

来描述：从物理层面深度神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作，完成输入空间到输出空间的变换(即矩阵的行空间到列空间)，这五种操作包括：1、升维/降维；2、放大/缩小；3、旋转；4、平移；5、“弯曲”。其中1、2、3的操作由

完成，4的操作由+b完成，5的操作则由a()来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物，而是一类事物，空间是指这类事物所有个体的集合。其中，W是权重向量，该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文所述的输入空间到输出空间的空间变换，即每一层的权重W控制着如何变换空间。训练深度神经网络的目的，也就是最终得到训练好的神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。因此，神经网络的训练过程本质上就是学习控制空间变换的方式，更具体的就是学习权重矩阵。The work of each layer in a deep neural network can be expressed mathematically

To describe: from the physical level, the work of each layer in the deep neural network can be understood as completing the transformation from the input space to the output space (that is, the row space of the matrix to the column through five operations on the input space (set of input vectors). Space), these five operations include: 1. Dimension raising/lowering; 2. Enlarging/reducing; 3. Rotation; 4. Translation; 5. "Bending". Among them, the operations of 1, 2, and 3 are determined by

Complete, the operation of 4 is completed by +b, and the operation of 5 is realized by a(). The reason why the word "space" is used here is because the object to be classified is not a single thing, but a type of thing, and space refers to the collection of all individuals of this type of thing. Among them, W is the weight vector, and each value in the vector represents the weight value of a neuron in the neural network of this layer. This vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training the deep neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning the way to control the spatial transformation, and more specifically, learning the weight matrix.

因为希望深度神经网络的输出尽可能的接近真正想要预测的值，所以可以通过比较当前网络的预测值和真正想要的目标值，再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然，在第一次更新之前通常会有初始化的过程，即为深度神经网络中的各层预先配置参数)，比如，如果网络的预测值高了，就调整权重向量让它预测低一些，不断的调整，直到神经网络能够预测出真正想要的目标值。因此，就需要预先定义“如何比较预测值和目标值之间的差异”，这便是损失函数(loss function)或目标函数(objectivefunction)，它们是用于衡量预测值和目标值的差异的重要方程。其中，以损失函数举例，损失函数的输出值(loss)越高表示差异越大，那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。Because it is hoped that the output of the deep neural network is as close as possible to the value you really want to predict, you can compare the predicted value of the current network with the target value you really want, and then update each layer of neural network according to the difference between the two. The weight vector of the network (of course, there is usually an initialization process before the first update, that is, the parameters are pre-configured for each layer in the deep neural network), for example, if the predicted value of the network is high, adjust the weight vector to make it Predict lower and keep adjusting until the neural network can predict the actual desired target value. Therefore, it is necessary to pre-define "how to compare the difference between the predicted value and the target value", which is the loss function (loss function) or objective function (objective function), which are important for measuring the difference between the predicted value and the target value equation. Among them, taking the loss function as an example, the higher the output value of the loss function (loss), the greater the difference, then the training of the deep neural network becomes the process of reducing the loss as much as possible.

训练设备220得到的目标模型/规则可以应用不同的系统或设备中。在附图2中，执行设备210配置有I/O接口212，与外部设备进行数据交互，“用户”可以通过客户设备240向I/O接口212输入数据。The target models/rules obtained by training the device 220 can be applied in different systems or devices. In FIG. 2 , the execution device 210 is configured with an I/O interface 212 for data interaction with external devices, and a “user” can input data to the I/O interface 212 through the client device 240 .

执行设备210可以调用数据存储系统250中的数据、代码等，也可以将数据、指令等存入数据存储系统250中。The execution device 210 can call data, codes, etc. in the data storage system 250 , and can also store data, instructions, etc. in the data storage system 250 .

计算模块211使用目标模型/规则201对输入的数据进行处理，以图像超分为例，计算模块211可以对输入的图像或图像序列进行解析，得到图像特征。The calculation module 211 uses the target model/rule 201 to process the input data. Taking image super-score as an example, the calculation module 211 can analyze the input image or image sequence to obtain image features.

关联功能模块213可以对计算模块211中的图像数据进行预处理，例如进行帧对齐或图像分组等。The association function module 213 may preprocess the image data in the calculation module 211, such as frame alignment or image grouping.

关联功能模块214可以对计算模块211中的图像数据进行预处理，例如进行帧对齐或图像分组等。The association function module 214 may preprocess the image data in the calculation module 211, such as frame alignment or image grouping.

最后，I/O接口212将处理结果返回给客户设备240，提供给用户。Finally, the I/O interface 212 returns the processing result to the client device 240, which is provided to the user.

更深层地，训练设备220可以针对不同的目标，基于不同的数据生成相应的目标模型/规则201，以给用户提供更佳的结果。More deeply, the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.

在附图2中所示情况下，用户可以手动指定输入执行设备210中的数据，例如，在I/O接口212提供的界面中操作。另一种情况下，客户设备240可以自动地向I/O接口212输入数据并获得结果，如果客户设备240自动输入数据需要获得用户的授权，用户可以在客户设备240中设置相应权限。用户可以在客户设备240查看执行设备210输出的结果，具体的呈现形式可以是显示、声音、动作等具体方式。客户设备240也可以作为数据采集端将采集到训练数据存入数据库230。In the case shown in FIG. 2 , the user can manually specify data in the input execution device 210 , eg, operate in the interface provided by the I/O interface 212 . In another case, the client device 240 can automatically input data to the I/O interface 212 and obtain the result. If the client device 240 automatically inputs data and needs to obtain the user's authorization, the user can set the corresponding permission in the client device 240 . The user can view the result output by the execution device 210 on the client device 240, and the specific presentation form can be a specific manner such as display, sound, and action. The client device 240 may also serve as a data collection terminal to store the collected training data in the database 230 .

值得注意的，附图2仅是本申请实施例提供的一种系统架构的示意图，图中所示设备、器件、模块等之间的位置关系不构成任何限制，例如，在附图2中，数据存储系统250相对执行设备210是外部存储器，在其它情况下，也可以将数据存储系统250置于执行设备210中。It is worth noting that FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG. 2 , The data storage system 250 is an external memory relative to the execution device 210 , and in other cases, the data storage system 250 may also be placed in the execution device 210 .

卷积神经网络(convolutional neural network，CNN)是一种带有卷积结构的深度神经网络，是一种深度学习(deep learning)架构，深度学习架构是指通过机器学习的算法，在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构，CNN是一种前馈(feed-forward)人工神经网络，以图像处理为例，该前馈人工神经网络中的各个神经元对输入其中的图像中的重叠区域作出响应。A convolutional neural network (CNN) is a deep neural network with a convolutional structure and a deep learning architecture. Learning at multiple levels. As a deep learning architecture, CNN is a feed-forward artificial neural network, taking image processing as an example, in which each neuron responds to overlapping regions in the image input into it .

如图3所示，卷积神经网络(CNN)100可以包括输入层110，卷积层/池化层120，其中池化层为可选的，以及神经网络层130。As shown in FIG. 3 , a convolutional neural network (CNN) 100 may include an input layer 110 , a convolutional/pooling layer 120 , where the pooling layer is optional, and a neural network layer 130 .

卷积层/池化层120：Convolutional layer/pooling layer 120:

卷积层：Convolutional layer:

如图3所示卷积层/池化层120可以包括如示例121-126层，在一种实现中，121层为卷积层，122层为池化层，123层为卷积层，124层为池化层，125为卷积层，126为池化层；在另一种实现方式中，121、122为卷积层，123为池化层，124、125为卷积层，126为池化层。即卷积层的输出可以作为随后的池化层的输入，也可以作为另一个卷积层的输入以继续进行卷积操作。As shown in FIG. 3, the convolutional/pooling layer 120 may include layers 121-126 as examples. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, and layer 124 is a convolutional layer. Layers are pooling layers, 125 are convolutional layers, and 126 are pooling layers; in another implementation, 121 and 122 are convolutional layers, 123 are pooling layers, 124 and 125 are convolutional layers, and 126 are pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or it can be used as the input of another convolutional layer to continue the convolution operation.

以卷积层121为例，卷积层121可以包括很多个卷积算子，卷积算子也称为核，其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器，卷积算子本质上可以是一个权重矩阵，这个权重矩阵通常被预先定义，在对图像进行卷积操作的过程中，权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理，从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关，需要注意的是，权重矩阵的纵深维度(depthdimension)和输入图像的纵深维度是相同的，在进行卷积运算的过程中，权重矩阵会延伸到输入图像的整个深度。因此，和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出，但是大多数情况下不使用单一权重矩阵，而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征，例如一个权重矩阵用来提取图像边缘信息，另一个权重矩阵用来提取图像的特定颜色，又一个权重矩阵用来对图像中不需要的噪点进行模糊化……该多个权重矩阵维度相同，经过该多个维度相同的权重矩阵提取后的特征图维度也相同，再将提取到的多个维度相同的特征图合并形成卷积运算的输出。Taking the convolution layer 121 as an example, the convolution layer 121 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix. The convolution operator can be essentially a weight matrix. This weight matrix is usually pre-defined. In the process of convolving an image, the weight matrix is usually pixel by pixel along the horizontal direction on the input image ( Or two pixels after two pixels...depending on the value of stride), which completes the work of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image. During the convolution operation, the weight matrix will extend to the input the entire depth of the image. Therefore, convolution with a single weight matrix will produce a single depth dimension of the convolutional output, but in most cases a single weight matrix is not used, but multiple weight matrices of the same dimension are applied. The output of each weight matrix is stacked to form the depth dimension of the convolutional image. Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image. Perform fuzzification... The dimensions of the multiple weight matrices are the same, and the dimension of the feature maps extracted from the weight matrices with the same dimensions are also the same, and then the multiple extracted feature maps with the same dimensions are combined to form the output of the convolution operation .

根据所需处理数据的维度不同，卷积核也有多种格式。常用的卷积核包括二维卷积核和三维卷积核。二维卷积核主要应用于处理二维的图像数据，而三维卷积核则由于增加了深度/时间方向的维度，可应用于视频处理、立体图像处理等。相比于二维卷积核，增加了一个维度的三维卷积核所需的参数量和计算量均大幅增加。在实际应用中，需要对三维卷积的网络进行精心设计：卷积核较大时滑动次数较少，但计算量大幅增加；卷积核较小时滑动次数较多，对深度/时间维度的特征提取较为盲目且低效，限制了网络性能的发挥。Depending on the dimension of the data to be processed, convolution kernels also come in a variety of formats. Commonly used convolution kernels include two-dimensional convolution kernels and three-dimensional convolution kernels. The two-dimensional convolution kernel is mainly used to process two-dimensional image data, while the three-dimensional convolution kernel can be applied to video processing, stereo image processing, etc. due to the increased depth/time dimension. Compared with the two-dimensional convolution kernel, the amount of parameters and computation required for the three-dimensional convolution kernel with one additional dimension increases significantly. In practical applications, it is necessary to carefully design the three-dimensional convolution network: when the convolution kernel is large, the number of slips is small, but the amount of calculation increases greatly; when the convolution kernel is small, the number of slips is large, and the features of the depth/time dimension Extraction is blind and inefficient, which limits the performance of the network.

这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到，通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取信息，从而帮助卷积神经网络100进行正确的预测。The weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.

当卷积神经网络100有多个卷积层的时候，初始的卷积层(例如121)往往提取较多的一般特征，该一般特征也可以称之为低级别的特征；随着卷积神经网络100深度的加深，越往后的卷积层(例如126)提取到的特征越来越复杂，比如高级别的语义之类的特征，语义越高的特征越适用于待解决的问题。为方便描述网络结构，可以将多个卷积层称为一个块(block)。When the convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (for example, 121) often extracts more general features, which can also be called low-level features; with the convolutional neural network As the depth of the network 100 deepens, the features extracted by the later convolutional layers (eg 126) become more and more complex, such as features such as high-level semantics. Features with higher semantics are more suitable for the problem to be solved. For the convenience of describing the network structure, multiple convolutional layers can be referred to as a block.

池化层：Pooling layer:

由于常常需要减少训练参数的数量，因此卷积层之后常常需要周期性的引入池化层，即如图3中120所示例的121-126各层，可以是一层卷积层后面跟一层池化层，也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中，池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子，以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外，就像卷积层中用权重矩阵的大小应该与图像大小相关一样，池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸，池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。Since it is often necessary to reduce the number of training parameters, it is often necessary to periodically introduce a pooling layer after the convolutional layer, that is, each layer 121-126 exemplified by 120 in Figure 3, which can be a convolutional layer followed by a layer The pooling layer can also be a multi-layer convolutional layer followed by one or more pooling layers. During image processing, the only purpose of pooling layers is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image. The average pooling operator can calculate the average value of the pixel values in the image within a certain range. The max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image. The size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.

神经网络层130：Neural network layer 130:

在经过卷积层/池化层120的处理后，卷积神经网络100还不足以输出所需要的输出信息。因为如前所述，卷积层/池化层120只会提取特征，并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息)，卷积神经网络100需要利用神经网络层130来生成一个或者一组所需要的类的数量的输出。因此，在神经网络层130中可以包括多层隐含层(如图3所示的131、132至13n)以及输出层140，该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到，例如该任务类型可以包括图像识别，图像分类，图像超分辨率重建等等。After being processed by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not sufficient to output the required output information. Because as mentioned before, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (required class information or other related information), the convolutional neural network 100 needs to utilize the neural network layer 130 to generate one or a set of outputs of the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 3) and the output layer 140, and the parameters contained in the multiple hidden layers may be based on specific task types The relevant training data is pre-trained, for example, the task type can include image recognition, image classification, image super-resolution reconstruction and so on.

在神经网络层130中的多层隐含层之后，也就是整个卷积神经网络100的最后层为输出层140，该输出层140具有类似分类交叉熵的损失函数，具体用于计算预测误差，一旦整个卷积神经网络100的前向传播(如图3由110至140的传播为前向传播)完成，反向传播(如图3由140至110的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差，以减少卷积神经网络100的损失及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。After the multi-layer hidden layers in the neural network layer 130, that is, the last layer of the entire convolutional neural network 100 is the output layer 140, the output layer 140 has a loss function similar to the classification cross entropy, and is specifically used to calculate the prediction error, Once the forward propagation of the entire convolutional neural network 100 (as shown in Fig. 3 from 110 to 140 is forward propagation) is completed, the back propagation (as shown in Fig. 3 from 140 to 110 as back propagation) will start to update The weight values and biases of the aforementioned layers are used to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.

需要说明的是，如图3所示的卷积神经网络100仅作为一种卷积神经网络的示例，在具体的应用中，卷积神经网络还可以以其他网络模型的形式存在，例如，如图4所示的多个卷积层/池化层并行，将分别提取的特征均输入给全神经网络层130进行处理。It should be noted that the convolutional neural network 100 shown in FIG. 3 is only used as an example of a convolutional neural network. In a specific application, the convolutional neural network can also exist in the form of other network models, for example, such as The multiple convolutional layers/pooling layers shown in FIG. 4 are in parallel, and the extracted features are input to the full neural network layer 130 for processing.

视频超分辨率方法的应用场景广阔，下面结合图5和图6举例进行介绍。The application scenarios of the video super-resolution method are broad, which will be introduced below with reference to Figures 5 and 6 as examples.

应用场景1：高清流视频系统Application Scenario 1: HD Streaming Video System

请参阅图5，为本申请实施例中数据处理方法的一个应用场景示意图；Please refer to FIG. 5, which is a schematic diagram of an application scenario of the data processing method in the embodiment of the present application;

随着智能手机、平板电脑的普及，流视频已经逐渐成为当前主流的视频娱乐方式之一。流视频平台提供的视频资源分辨率逐渐提高，这对网络带宽和网络的稳定性提出了更高的要求。基于视频超分辨率技术，可以直接向用户传输较低分辨率的视频画面，客户端可借助自身的计算能力对低质量视频进行视频超分辨率，最后向用户呈现高质量的画面。这种方式可以在不明显降低高清画面质量的情况下，大幅度减少视频流的带宽需求。With the popularity of smartphones and tablet computers, streaming video has gradually become one of the current mainstream video entertainment methods. The resolution of video resources provided by streaming video platforms is gradually increasing, which puts forward higher requirements for network bandwidth and network stability. Based on video super-resolution technology, lower-resolution video images can be directly transmitted to users, and the client can perform video super-resolution on low-quality videos with its own computing power, and finally present high-quality images to users. In this way, the bandwidth requirements of the video stream can be greatly reduced without significantly reducing the quality of the high-definition picture.

应用场景2：高清监控系统Application Scenario 2: HD Surveillance System

请参阅图6，为本申请实施例中数据处理方法的另一个应用场景示意图；Please refer to FIG. 6 , which is a schematic diagram of another application scenario of the data processing method in the embodiment of the present application;

视频监控是平安城市系统的重要组成部分，越来越多的视频监控被布置在城市的各个角落，守护城市的安全。受限于摄像头素质、安装位置、有限的存储空间等不利条件，部分视频监控的画面质量较差，限制了其后续的应用。如图2所示，本申请实施例可以将低分辨率的视频监控画面转化为高分辨率的高清画面。通过前后帧信息和图像先验知识，实现对监控画面中大量细节的有效恢复，为后续视频分析提供更有效、丰富的信息，提升平安城市系统的鲁棒性。Video surveillance is an important part of the safe city system. More and more video surveillance is deployed in every corner of the city to protect the safety of the city. Limited by the camera quality, installation location, limited storage space and other unfavorable conditions, the picture quality of some video surveillance is poor, which limits its subsequent application. As shown in FIG. 2 , in this embodiment of the present application, a low-resolution video surveillance picture can be converted into a high-resolution high-definition picture. Through the information of the frames before and after and the prior knowledge of the image, the effective recovery of a large number of details in the monitoring picture is realized, which provides more effective and rich information for the subsequent video analysis, and improves the robustness of the safe city system.

现有的视频超分方法主要有基于光流的视频超分方法和基于隐式运动补偿的视频超分方法，下面简要进行介绍。Existing video super-resolution methods mainly include optical flow-based video super-resolution methods and implicit motion compensation-based video super-resolution methods, which are briefly introduced below.

基于光流的视频超分方法，即显式的运动补偿方法，对输入的多帧图像逐帧估计邻近帧到中间帧的稠密光流，基于光流估计结果，将邻近帧扭曲对齐到中间帧，形成对齐的图像序列，对对齐的图像序列进行特征提取和融合，输出对中间帧的超分辨率结果。The optical flow-based video super-segmentation method, that is, an explicit motion compensation method, estimates the dense optical flow from adjacent frames to the intermediate frame for the input multi-frame image frame by frame, and aligns the adjacent frame distortion to the intermediate frame based on the optical flow estimation results. , form an aligned image sequence, perform feature extraction and fusion on the aligned image sequence, and output the super-resolution result of the intermediate frame.

由于显式的运动补偿方法在预处理阶段使用光流等手段直接将图像扭曲并对齐，受限于光流计算的精确度，往往存在较大的伪影。Because the explicit motion compensation method uses optical flow and other means to directly distort and align the image in the preprocessing stage, limited by the accuracy of the optical flow calculation, there are often large artifacts.

基于隐式运动补偿的视频超分方法，为了避免稠密光流巨大的计算量以及其伪影问题，通过具有运动补偿能力的神经网络模块，例如3D卷积神经网络，在图像信息提取和融合的过程中，隐式地进行运动补偿。在结构上，3D卷积核比2D卷积增加了一个维度，增大卷积核会导致参数量和计算量的急剧增加，因而很难加深3D卷积核在时间维度的深度。在实际应用中，往往采用3x3x3尺寸的3D卷积核，来取得性能和计算量的均衡，3x3x3尺寸的3D卷积核一次提取3帧的信息。在视频超分辨率任务中，直接输入7帧图像至三维卷积神经网络时，每次处理3帧，并通过卷积核依次滑动1帧进行处理，共计计算5次，以第4帧为目标帧为例，最远的邻近帧与目标帧的距离为3帧，第一次计算时为第1至第3帧，由于不包含目标帧，因此提取特征时得不到目标帧的指导，特征融合过程间接且盲目，特征提取效率低。In the video super-segmentation method based on implicit motion compensation, in order to avoid the huge computational load of dense optical flow and its artifacts, a neural network module with motion compensation capability, such as 3D convolutional neural network, is used in image information extraction and fusion. During the process, motion compensation is performed implicitly. Structurally, the 3D convolution kernel adds one dimension to the 2D convolution. Increasing the convolution kernel will lead to a sharp increase in the amount of parameters and computation, so it is difficult to deepen the depth of the 3D convolution kernel in the time dimension. In practical applications, a 3D convolution kernel with a size of 3x3x3 is often used to achieve a balance between performance and computation, and a 3D convolution kernel with a size of 3x3x3 extracts 3 frames of information at a time. In the video super-resolution task, when 7 frames of images are directly input to the 3D convolutional neural network, 3 frames are processed each time, and the convolution kernel slides 1 frame for processing, a total of 5 times, with the fourth frame as the target Take the frame as an example, the distance between the farthest adjacent frame and the target frame is 3 frames, and the first to third frames are in the first calculation. Since the target frame is not included, the guidance of the target frame cannot be obtained when extracting features. The fusion process is indirect and blind, and the feature extraction efficiency is low.

本申请实施例提供的数据处理方法可以应用于视频，视频帧率可以是24、30、60、120或300等，对于视频的具体帧率此处不做限定。此外，该数据处理方法还可以应用于连续拍摄的图像序列，例如用户以不同时间间隔连续拍摄的图像序列。本申请实施例中数据处理对象统一称为帧序列。The data processing method provided in this embodiment of the present application may be applied to video, and the video frame rate may be 24, 30, 60, 120, or 300, etc. The specific frame rate of the video is not limited here. In addition, the data processing method can also be applied to a sequence of images captured continuously, for example, a sequence of images captured continuously by a user at different time intervals. In the embodiments of the present application, the data processing objects are collectively referred to as frame sequences.

本申请实施例提出的数据处理方法，用于帧序列的超分辨率中高效融合多帧信息，可以减少计算量，提高处理速度。请参阅图7，为本申请实施例提供的数据处理方法的一个实施例示意图。The data processing method proposed in the embodiment of the present application is used to efficiently fuse multi-frame information in the super-resolution of the frame sequence, which can reduce the amount of calculation and improve the processing speed. Please refer to FIG. 7 , which is a schematic diagram of an embodiment of the data processing method provided by the embodiment of the present application.

701、数据处理装置对帧序列进行帧对齐处理；701. The data processing apparatus performs frame alignment processing on the frame sequence;

数据处理装置对输入的帧序列进行帧对齐处理，将不同帧中相同的画面内容对齐，得到对齐的视频帧，可以减少帧间画面差异，降低后续信息提取和融合的难度。The data processing device performs frame alignment processing on the input frame sequence, and aligns the same picture content in different frames to obtain aligned video frames, which can reduce picture differences between frames and reduce the difficulty of subsequent information extraction and fusion.

可选地，将用于信息提取的邻近帧与目标帧对齐，例如，对于第一目标帧，将该第一目标帧提取细节信息的多个邻近帧与该第一目标帧对齐，例如第一目标帧的6个邻近帧与第一目标帧对齐。类似地，对于每次进行超分的目标帧进行一次帧对齐处理。Optionally, the adjacent frames used for information extraction are aligned with the target frame, for example, for the first target frame, multiple adjacent frames for extracting detailed information from the first target frame are aligned with the first target frame, such as the first target frame. The 6 adjacent frames of the target frame are aligned with the first target frame. Similarly, a frame alignment process is performed for each target frame that is over-segmented.

可选地，将同时输入卷积神经网络进行信息提取的多个帧对齐，例如包含目标帧的3帧在输入三维卷积神经网络之前，先将这3帧对齐。Optionally, align multiple frames that are simultaneously input to the convolutional neural network for information extraction, for example, 3 frames containing the target frame are aligned before being input to the 3D convolutional neural network.

可选地，本申请实施例提供的数据处理方法通过单应矩阵的方法实现快速帧对齐。为例便于理解，下面首先对快速帧对齐的原理、单应矩阵的特点进行简要介绍。Optionally, the data processing method provided by this embodiment of the present application implements fast frame alignment by using a homography matrix method. As an example to facilitate understanding, the following briefly introduces the principle of fast frame alignment and the characteristics of the homography matrix.

对于通过连续拍摄得到的视频帧，帧间的运动即前后两帧之间画面内容的变化，可以由相机运动和物体运动两部分组成，而相机运动可以通过单应矩阵进行粗略描述，因此本方法使用基于单应矩阵的帧对齐实现粗略的运动补偿。For the video frames obtained by continuous shooting, the motion between frames, that is, the change of the picture content between the two frames before and after, can be composed of two parts: camera motion and object motion, and the camera motion can be roughly described by the homography matrix, so this method Coarse motion compensation is achieved using homography-based frame alignment.

一个平面经过透视变换得到另一个平面，单应矩阵可以刻画平面的透视变换。单应矩阵具有如下性质：One plane is transformed into another plane by perspective transformation, and the homography matrix can describe the perspective transformation of the plane. A homography matrix has the following properties:

1)、A到C的单应矩阵，可以由A到B的单应矩阵、B到C的单应矩阵计算：1) The homography matrix of A to C can be calculated by the homography matrix of A to B and the homography matrix of B to C:

H_A→C＝H_A→B·H_B→C H _A→C = H _A→B ·H _B→C

2)、B到A的单应矩阵为A到B的单应矩阵的逆矩阵2), the homography matrix from B to A is the inverse matrix of the homography matrix from A to B

请参阅图8，为本申请实施例中帧对齐方法的一个实施例示意图。Please refer to FIG. 8 , which is a schematic diagram of an embodiment of a frame alignment method in an embodiment of the present application.

数据处理装置计算连续两帧之间的单应矩阵，可选地，数据处理装置对单应矩阵进行校验，受限于运动的复杂性，部分样本并不适合进行对齐，例如固定的监控相机不存在相机运行，或者帧间运动中物体运动占比过大，这类样本进行帧间对齐会导致对齐错误。为了避免错误的对齐影响后续的特征提取和融合，可选地，在帧对齐之前进行适用性检测，数据处理装置通过单应矩阵校验判断帧序列是否适合对齐，若适合，则进行对齐，若不适合，则直接输出原始输入的帧序列。The data processing device calculates the homography matrix between two consecutive frames. Optionally, the data processing device checks the homography matrix. Due to the complexity of motion, some samples are not suitable for alignment, such as fixed surveillance cameras. There is no camera operation, or the motion of objects in the inter-frame motion is too large, and inter-frame alignment of such samples will lead to alignment errors. In order to avoid erroneous alignment from affecting subsequent feature extraction and fusion, optionally, an applicability test is performed before frame alignment, and the data processing device judges whether the frame sequence is suitable for alignment through the homography matrix check. If it is not suitable, the frame sequence of the original input is directly output.

本申请实施例中通过单应矩阵的方法实现帧对齐，可以减少计算量，加快帧对齐速度，下面与现有的光流方法对比进行介绍。In the embodiment of the present application, the method of homography matrix is used to realize frame alignment, which can reduce the amount of calculation and speed up the frame alignment. The following is an introduction in comparison with the existing optical flow method.

请参阅图9，本申请实施例中单应矩阵计算的一个实施例示意图。数据处理装置对于每一帧，只计算其与上一帧的基本单应矩阵，其他帧之间的单应矩阵根据单应矩阵的性质，由基本单应矩阵计算，考虑对一段M帧的视频进行处理，每次取2N+1帧作为一组网络输入，即一个目标帧和2N个邻近帧，基于光流的方法需要对每个邻近帧与目标帧之间进行计算，共计2NM次，而基于单应矩阵的对齐方法中，对于每个2N+1帧的输入，只需要计算一次单应矩阵，因此总计算次数为M次，大幅度减少了计算量，实现了明显的速度提升。同时，本方法减少了光流计算中的像素层面的形变，对后续的超分辨率处理提供了良好的对齐。Please refer to FIG. 9 , which is a schematic diagram of an embodiment of the calculation of the homography matrix in the embodiment of the present application. For each frame, the data processing device only calculates the basic homography matrix with the previous frame, and the homography matrix between other frames is calculated by the basic homography matrix according to the properties of the homography matrix. Considering a video of M frames For processing, each time 2N+1 frames are taken as a set of network inputs, that is, a target frame and 2N adjacent frames, the optical flow-based method needs to calculate between each adjacent frame and the target frame, a total of 2NM times, and In the homography matrix-based alignment method, for each input of 2N+1 frames, the homography matrix only needs to be calculated once, so the total number of calculations is M times, which greatly reduces the amount of calculation and achieves a significant speed improvement. At the same time, this method reduces the deformation of the pixel level in the optical flow calculation, and provides a good alignment for the subsequent super-resolution processing.

需要说明的是，步骤701为可选步骤，可以执行，也可以不执行，具体此处不做限定。It should be noted that step 701 is an optional step, which may or may not be performed, which is not specifically limited here.

可选地，第一目标帧的帧组可以分别进行帧对齐，也可以第一目标帧的所有帧组共同进行帧对齐，具体此处不做限定。Optionally, frame alignment may be performed on the frame groups of the first target frame separately, or frame alignment may be performed on all frame groups of the first target frame jointly, which is not specifically limited here.

702、数据处理装置确定第一目标帧的至少两个帧组；702. The data processing apparatus determines at least two frame groups of the first target frame;

为帧序列中的目标帧进行细节信息提取时，可以考虑从目标帧的邻近帧中获取细节信息，目标帧的邻近帧，后简称邻近帧，是帧序列中除目标帧以外的任意帧。可选地，是在图像拍摄时与目标帧拍摄时间比较接近，且与目标帧包含部分重叠图像信息的帧，通常可提供目标帧不具有的细节信息。When extracting the detail information for the target frame in the frame sequence, it can be considered to obtain the detail information from the adjacent frames of the target frame. The adjacent frame of the target frame, hereinafter referred to as the adjacent frame, is any frame in the frame sequence except the target frame. Optionally, it is a frame that is relatively close to the target frame when the image is captured, and contains partially overlapping image information with the target frame, usually providing detailed information that the target frame does not have.

数据处理装置对帧序列进行分组，为该帧序列中每一帧确定进行信息提取的至少两个帧组。该帧序列可以是步骤701中对帧序列进行对齐后输出的对齐的帧序列，或者是未经帧对齐的帧序列，具体此处不做限定。The data processing device groups the frame sequence, and determines at least two frame groups for information extraction for each frame in the frame sequence. The frame sequence may be an aligned frame sequence output after aligning the frame sequence in step 701, or a frame sequence without frame alignment, which is not specifically limited here.

以第一目标帧为例，数据处理装置确定包含至少两个包含第一目标帧和该第一目标帧的邻近帧的帧组。帧组的具体数量此处不做限定，例如可以是3个、4个或7个等。可以理解的是，每个帧组包含的帧数量一定的情况下，帧组越多，可以获取的细节信息越多，计算量越大，反之，帧组越少，可获取的细节信息越少，计算量越小。实际应用中，可以根据超分需求确定帧组的数量。Taking the first target frame as an example, the data processing apparatus determines a frame group including at least two frames including the first target frame and adjacent frames of the first target frame. The specific number of frame groups is not limited here, for example, it may be 3, 4, or 7, and so on. It can be understood that under the condition that each frame group contains a certain number of frames, the more frame groups, the more detailed information that can be obtained, and the greater the amount of calculation. On the contrary, the fewer the frame groups, the less detailed information can be obtained. , the smaller the amount of computation. In practical applications, the number of frame groups can be determined according to the over-division requirement.

本申请实施例中用于同时输入3D卷积神经网络的帧组中包含目标帧，由此，目标帧可以在卷积过程中为特征提取提供指导信息，提高特征提取的有效性，帧组中的帧数量与卷积核尺寸中深度维度的尺寸匹配，帧组的帧数量为奇数。可选地，选用3*3*3尺寸的卷积核，则帧组中的帧数量为3帧，即一个目标帧和两个邻近帧，可选地，若使用5*5*5尺寸的卷积核，则帧组的帧数量5个，如使用7*7*7尺寸的卷积核，则帧组的帧数量为7个，具体尺寸不做限定。由于卷积核尺寸增加时，计算量将快速增大，通常算力有限的终端数据处理设备都使用3*3*3尺寸的卷积核进行特征提取。In the embodiment of the present application, the frame group used to simultaneously input the 3D convolutional neural network includes the target frame. Therefore, the target frame can provide guidance information for feature extraction during the convolution process, and improve the effectiveness of the feature extraction. The number of frames matches the size of the depth dimension in the kernel size, and the number of frames in the frame group is odd. Optionally, if a convolution kernel of size 3*3*3 is used, the number of frames in the frame group is 3 frames, that is, one target frame and two adjacent frames. If the convolution kernel is used, the number of frames in the frame group is 5. If a convolution kernel with a size of 7*7*7 is used, the number of frames in the frame group is 7, and the specific size is not limited. As the size of the convolution kernel increases, the amount of computation will increase rapidly. Usually, terminal data processing equipment with limited computing power uses a 3*3*3 size convolution kernel for feature extraction.

帧组中邻近帧选取方法有多种，可选地，为第一目标帧确定的多个帧组中的所有邻近帧，与第一目标帧构成的帧集合，为一组连续帧序列，且该多个帧组中的所有邻近帧中不存在重复的邻近帧。There are various methods for selecting adjacent frames in the frame group. Optionally, all adjacent frames in the plurality of frame groups determined for the first target frame and the frame set formed by the first target frame are a group of consecutive frame sequences, and There are no duplicate adjacent frames among all adjacent frames in the plurality of frame groups.

可选地，按照与目标帧的时间间隔确定帧组，N个帧组记为{G_1，G_2，…，G_n}，n∈[1:N]。按照对应邻近帧到中间帧之间不同的时间距离，每一组3帧G_n＝{I_(t-n)，I_t，I_(t+n)}，其中，I_t为目标帧，I_(t-n)为前邻近帧，具体是目标帧之前n帧，I_(t+n)为后邻近帧，具体在目标帧之后n帧，n∈[1:N]。对于按照相同时间间隔获取的帧序列，该方法确定的帧组中，前邻近帧和后邻近帧与目标帧之间的时间间隔相同，即在时间维度关于目标帧对称，考虑到运动的连续性，关于目标帧对称的两个临近帧可更有效提取特征。Optionally, the frame group is determined according to the time interval with the target frame, and the N frame groups are denoted as {G_1, G_2, . . . , G_n}, n∈[1:N]. According to the different time distances between the corresponding adjacent frames and the intermediate frames, each group of 3 frames G_n={I _(tn) , I _t , I _(t+n) }, where I _t is the target frame, I _(tn) is the front adjacent frame, specifically n frames before the target frame, I _(t+n) is the rear adjacent frame, specifically n frames after the target frame, n∈[1:N]. For the frame sequence obtained at the same time interval, in the frame group determined by this method, the time interval between the front adjacent frame and the back adjacent frame and the target frame is the same, that is, the time dimension is symmetrical with respect to the target frame, considering the continuity of motion , two adjacent frames symmetrical about the target frame can extract features more effectively.

以7帧输入为例，请参阅图10，本申请实施例中时序分组的一个实施例示意图。Taking the input of 7 frames as an example, please refer to FIG. 10 , which is a schematic diagram of an embodiment of timing grouping in the embodiment of the present application.

本模块按照邻近帧到目标帧的时间距离，将其分为3组。第1帧、第4帧和第7帧为第一帧组，第2帧、第4帧和第6帧为第二帧组，第3帧、第4帧和第5帧为第三帧组。This module divides adjacent frames into 3 groups according to the time distance from the adjacent frame to the target frame. Frames 1, 4, and 7 are the first frame group, frames 2, 4, and 6 are the second frame group, and frames 3, 4, and 5 are the third frame group .

由于每一帧组中均包含了目标帧作为帧组内信息融合的指导，可有效提升信息提取效率。Since the target frame is included in each frame group as a guide for information fusion within the frame group, the information extraction efficiency can be effectively improved.

此外，现有技术中通过卷积核滑动的方式每次处理3帧，共需要计算6组，本申请实施例提供的数据处理方法仅需进行3组计算，可以显著减少计算量。In addition, in the prior art, 3 frames are processed each time by means of sliding the convolution kernel, and a total of 6 groups need to be calculated. The data processing method provided in the embodiment of the present application only needs to perform 3 groups of calculations, which can significantly reduce the amount of calculation.

703、数据处理装置获取帧组的组特征；703. The data processing device acquires the group feature of the frame group;

根据步骤702中确定的帧组，数据处理装置通过三维卷积神经网络分别对帧组进行特征的提取和融合，为方便描述，下面将帧组的特征特称为组特征。可选地，还可以结合二维卷积神经网络和三维卷积神经网络进行特征提取和融合。数据处理装置获取N个帧组中每个帧组的组特征

According to the frame group determined in step 702, the data processing device extracts and fuses the features of the frame group respectively through the three-dimensional convolutional neural network. For the convenience of description, the features of the frame group are hereinafter referred to as group features. Optionally, feature extraction and fusion can also be performed in combination with a two-dimensional convolutional neural network and a three-dimensional convolutional neural network. The data processing device obtains the group feature of each frame group in the N frame groups

可选地，按照邻近帧到目标帧的时间距离确定帧组时，数据处理装置可以对该目标帧的不同的帧组使用权重共享的方式提取和融合特征。权重共享是指用同一个网络结构实体处理不同批次的数据。Optionally, when the frame group is determined according to the temporal distance from the adjacent frame to the target frame, the data processing apparatus may extract and fuse features by using a weight sharing method for different frame groups of the target frame. Weight sharing refers to using the same network structure entity to process different batches of data.

以7帧输入为例，其中第4帧为目标帧，由于处于帧序列的中间位置，下面也称作中间帧。第1帧、第4帧和第7帧为第一帧组，第2帧、第4帧和第6帧为第二帧组，第3帧、第4帧和第5帧为第三帧组。考虑到3个帧组对应的帧间距离分别为1、2和3，提取每个帧组的特征时，使用同一个3D网络，网络中卷积核的扩张率(dilation rate)分别设置为该帧组的帧间距离，即第一帧组、第二帧组和第三帧组分别为1、2和3。Dilation rate对应卷积核的感受野，即空间覆盖范围，对运动较大的组使用较大的dilation rate可以更好提取空间运动信息，实现更高效的组内融合。Taking the input of 7 frames as an example, the 4th frame is the target frame. Since it is in the middle of the frame sequence, it is also called the middle frame below. Frames 1, 4, and 7 are the first frame group, frames 2, 4, and 6 are the second frame group, and frames 3, 4, and 5 are the third frame group . Considering that the inter-frame distances corresponding to the three frame groups are 1, 2, and 3, respectively, when extracting the features of each frame group, the same 3D network is used, and the dilation rate of the convolution kernel in the network is set to this value. The inter-frame distances of the frame groups, that is, the first frame group, the second frame group and the third frame group are 1, 2 and 3, respectively. The dilation rate corresponds to the receptive field of the convolution kernel, that is, the spatial coverage. Using a larger dilation rate for groups with large motion can better extract spatial motion information and achieve more efficient intra-group fusion.

类似地，数据处理装置为每个目标帧的多个帧组分别提取组特征，具体不再赘述。Similarly, the data processing apparatus separately extracts group features for multiple frame groups of each target frame, and details are not repeated here.

704、数据处理装置融合第一目标帧的组特征，获取第一目标帧的组间特征；704. The data processing device fuses the group feature of the first target frame to obtain the inter-group feature of the first target frame;

数据处理装置将同一目标帧的多个帧组的组特征

融合到一起，得到目标帧的组间特征F_A。该组间特征即为从邻近帧中提取的细节特征，可用于下一步骤中进行目标帧的超分。The data processing device converts the group features of a plurality of frame groups of the same target frame

fused together to obtain the inter _- group feature FA of the target frame. The inter-group feature is the detailed feature extracted from the adjacent frames, which can be used for the super-score of the target frame in the next step.

将融合多个组特征，确定目标帧的特征F_A的方式有多种，具体此处不做限定。可选的，请参阅图11，本申请实施例中组间特征融合的一个实施例示意图。There are many ways to fuse multiple group features to determine the feature _FA of the target frame, which is not specifically limited here. Optionally, please refer to FIG. 11 , which is a schematic diagram of an embodiment of feature fusion between groups in an embodiment of the present application.

对第一目标帧的所有帧组的组特征

首先使用2D卷积对其进行特征提取得到1维的特征

之后基于

计算注意力掩膜M_n。注意力掩膜M_n可以理解为特征

的权重，计算公式如下：Group features for all frame groups of the first target frame

First use 2D convolution to perform feature extraction on it to obtain 1-dimensional features

based on

Compute the attention mask M _n . The attention mask _Mn can be understood as a feature

The weight of , the calculation formula is as follows:

其中，M_n(x，y)_i代表注意力掩膜，

代表第i组的组特征图；(x，y)_j代表特征图中位置j的像素坐标，i代表第i个帧组，N代表帧组总数。where M _n (x, y) _i represents the attention mask,

Represents the group feature map of the ith group; (x, y) _j represents the pixel coordinates of position j in the feature map, i represents the ith frame group, and N represents the total number of frame groups.

根据计算得到的注意力掩膜对组特征进行加权，根据下面的公式获取经过加权的目标帧的特征

The group features are weighted according to the calculated attention mask, and the features of the weighted target frame are obtained according to the following formula

其中，⊙为按照元素执行的点乘(阿达玛乘积)。where ⊙ is the element-wise dot product (Hadamard product).

可选地，再使用包含3D卷积的三维卷积块(3D Block)和包含2D卷积的二维卷积块(2D Block)对加权后的组特征

进行进一步融合，生成融合之后的特征F_A。请参阅图12，为本申请实施例中进行组间特征融合的另一个实施例示意图。Optionally, use a three-dimensional convolution block (3D Block) containing 3D convolution and a two-dimensional convolution block (2D Block) containing 2D convolution to weight the group features

Further fusion is performed to generate the fused feature F _A . Please refer to FIG. 12 , which is a schematic diagram of another embodiment of feature fusion between groups in an embodiment of the present application.

类似地，数据处理装置对帧序列中每个目标帧的帧组特征进行组间融合，分别获取每个目标帧的特征，具体此处不再一一赘述。Similarly, the data processing apparatus performs inter-group fusion on the frame group features of each target frame in the frame sequence, and obtains the features of each target frame respectively, and details are not repeated here.

705、数据处理装置根据组间特征获取高分辨率的目标帧。705. The data processing apparatus acquires a high-resolution target frame according to the inter-group feature.

目标帧的特征F_A可以理解为多通道的图像，也称特征图，数据处理模块将该特征图放大到目标分辨率，输出残差图即细节特征。可选地，数据处理装置使用级联的2D卷积和像素重组算法(PixelShuffle)实现特征图的放大，可选地，每次放大2倍，直到放大倍率达到目标分辨率。 _The feature FA of the target frame can be understood as a multi-channel image, also called a feature map. The data processing module enlarges the feature map to the target resolution, and outputs a residual map, which is a detailed feature. Optionally, the data processing apparatus uses cascaded 2D convolution and pixel reshuffle algorithm (PixelShuffle) to realize the enlargement of the feature map, optionally, each time enlarges by 2 times until the magnification reaches the target resolution.

数据处理装置通过上采样将低分辨率的目标帧放大，得到放大的模糊图像，一般可以认为：清晰图像＝模糊图像+残差图。模糊图像可以由输入的低分辨率目标帧直接通过插值得到，再结合残差图即可得到清晰的高分辨率的目标帧。The data processing device amplifies the low-resolution target frame through up-sampling to obtain an enlarged blurred image, which can generally be considered as: clear image=blurred image+residual image. The blurred image can be directly obtained by interpolation from the input low-resolution target frame, and then combined with the residual map, a clear high-resolution target frame can be obtained.

示例性的，请参阅图13，为本申请实施例中目标帧采样放大的一个示意图。Exemplarily, please refer to FIG. 13 , which is a schematic diagram of sample amplification of a target frame in an embodiment of the present application.

数据处理装置将融合的特征F_A，通过级联的2D卷积和像素重组算法(PixelShuffle)放大，每次放大2倍，直至放大到目标分辨率之后输出残差图。数据处理装置将原始的目标帧经过双三次插值(Bicubic)方法上采样放大之后得到的模糊的放大图，数据处理装置根据该放大图与该残差图相加，得到目标帧即中间帧I4对应的高分辨率的目标帧。The data processing device amplifies the fused feature FA through the cascaded 2D convolution and pixel reshuffle algorithm ( _PixelShuffle ), and each time amplifies it by 2 times, until it is enlarged to the target resolution and then outputs a residual image. The data processing device samples the original target frame through a bicubic interpolation (Bicubic) method to obtain a blurred enlarged image obtained after upsampling and amplification, and the data processing device adds the residual image according to the enlarged image to obtain the target frame, that is, the corresponding intermediate frame I4. high-resolution target frame.

类似地，数据处理装置根据帧序列中每个目标帧的特征进行上采样放大，分别获取每个目标帧的高分辨率帧，具体此处不再一一赘述。Similarly, the data processing device performs up-sampling and amplification according to the characteristics of each target frame in the frame sequence, and obtains high-resolution frames of each target frame respectively, and details are not repeated here.

请参阅图14，为本申请实施例提供的数据处理方法的另一个实施例示意图；Please refer to FIG. 14 , which is a schematic diagram of another embodiment of the data processing method provided by the embodiment of the present application;

本申请实施例提供的数据处理方法中，数据处理装置获取输入的帧序列后，对帧序列进行快速帧对齐处理，然后根据邻近帧与目标帧的时间间隔进行时序分组，将分组1至分组N分别根据3D卷积神经网络进行特征组内融合，再根据注意力机制获取组内融合特征的权重，根据该权重和N个组内融合特征进行组间融合，获取目标帧的细节特征，最后，通过上采样输出高分辨率帧。In the data processing method provided by the embodiment of the present application, after the data processing device acquires the input frame sequence, the frame sequence is subjected to fast frame alignment processing, and then time sequence grouping is performed according to the time interval between the adjacent frame and the target frame, and grouping 1 to group N are grouped. According to the 3D convolutional neural network, the intra-group fusion of features is performed, and then the weight of the intra-group fusion features is obtained according to the attention mechanism, and the inter-group fusion is performed according to the weight and the N intra-group fusion features to obtain the detailed features of the target frame. Finally, Output high-resolution frames by upsampling.

下面对实现数据处理的数据处理装置进行介绍，请参阅图15，为本申请实施例提供的数据处理装置的一个实施例示意图；The following describes a data processing apparatus for implementing data processing. Please refer to FIG. 15 , which is a schematic diagram of an embodiment of the data processing apparatus provided by the embodiment of the present application;

本申请实施例提供的数据处理装置，包括：The data processing device provided by the embodiment of the present application includes:

获取单元1501，用于用于获取帧序列，所述帧序列中的帧具有第一分辨率；an acquiring unit 1501, configured to acquire a frame sequence, where the frames in the frame sequence have a first resolution;

确定单元1502，用于从所述帧序列中确定至少两个帧组，所述帧组包括第一目标帧和所述第一目标帧的至少两个邻近帧，所述第一目标帧为所述帧序列中的任意一帧，所述邻近帧为所述帧序列中除所述第一目标帧以外的帧；The determining unit 1502 is configured to determine at least two frame groups from the frame sequence, where the frame group includes a first target frame and at least two adjacent frames of the first target frame, and the first target frame is the any frame in the frame sequence, the adjacent frame is a frame other than the first target frame in the frame sequence;

所述确定单元1502，还用于通过三维卷积神经网络确定所述至少两个帧组中每个帧组的特征，所述每个帧组的特征指示基于所述第一目标帧，从所述每个帧组内的邻近帧中获取的细节信息，所述三维卷积神经网络中卷积核在时间维度的尺寸与所述帧组中帧的数量正相关；The determining unit 1502 is further configured to determine, by using a three-dimensional convolutional neural network, the characteristics of each frame group in the at least two frame groups, where the characteristics of each frame group indicate that the first target frame is obtained from all the frame groups. The detailed information obtained in the adjacent frames in each frame group, the size of the convolution kernel in the time dimension in the three-dimensional convolutional neural network is positively correlated with the number of frames in the frame group;

处理单元1503，用于融合所述至少两个帧组中每个帧组的特征，以确定所述第一目标帧的细节特征，所述细节特征指示基于所述第一目标帧，从所述至少两个帧组内的邻近帧中获取的细节信息；The processing unit 1503 is configured to fuse the features of each frame group in the at least two frame groups to determine the detail feature of the first target frame, where the detail feature indicates that based on the first target frame, from the Detailed information obtained from adjacent frames in at least two frame groups;

所述获取单元1501，还用于根据所述细节特征和所述第一目标帧，获取具有第二分辨率的第一目标帧，所述第二分辨率大于所述第一分辨率。The obtaining unit 1501 is further configured to obtain, according to the detailed feature and the first target frame, a first target frame with a second resolution, where the second resolution is greater than the first resolution.

可选的，所述帧组包括所述第一目标帧和两个所述邻近帧。Optionally, the frame group includes the first target frame and the two adjacent frames.

可选的，所述两个所述邻近帧包括第一邻近帧和第二邻近帧，所述第一邻近帧在所述帧序列中与所述第一目标帧之间的间隔，与所述第二邻近帧在所述帧序列中与所述第一目标帧之间的间隔相等。Optionally, the two adjacent frames include a first adjacent frame and a second adjacent frame, and the interval between the first adjacent frame and the first target frame in the frame sequence is the same as the interval between the first adjacent frame and the first target frame. A second adjacent frame is equally spaced from the first target frame in the sequence of frames.

可选的，所述至少两个帧组包括三个帧组。Optionally, the at least two frame groups include three frame groups.

可选的，所述确定单元1502，还用于将所述至少两个帧组中每个帧组内的帧对齐，确定对齐的至少两个帧组；所述确定单元具体用于：通过三维卷积神经网络确定所述对齐的至少两个帧组中每个帧组的特征。Optionally, the determining unit 1502 is further configured to align the frames in each frame group of the at least two frame groups, and determine the aligned at least two frame groups; the determining unit is specifically configured to: through the three-dimensional A convolutional neural network determines features for each of the aligned at least two frame groups.

可选的，所述确定单元1502具体用于：确定由所述至少两个帧组中的帧组成的队列内所有连续两帧之间的单应矩阵；根据所述单应矩阵确定所述对齐的至少两个帧组。Optionally, the determining unit 1502 is specifically configured to: determine a homography matrix between all consecutive two frames in a queue composed of frames in the at least two frame groups; determine the alignment according to the homography matrix. of at least two frame groups.

可选的，所述确定单元1502还用于：所述确定单元还用于：通过深度神经网络确定所述至少两个帧组中每个帧组的特征的权值；所述处理单元具体用于：根据所述权值融合所述至少两个帧组中每个帧组的特征，以确定所述第一目标帧的细节特征。Optionally, the determining unit 1502 is further configured to: the determining unit is further configured to: determine the weight of the feature of each frame group in the at least two frame groups through a deep neural network; the processing unit specifically uses In: fusing the features of each of the at least two frame groups according to the weight to determine the detailed features of the first target frame.

可选的，所述三维卷积神经网络中卷积核在时间维度的尺寸与所述帧组中帧的数量相等。Optionally, the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension is equal to the number of frames in the frame group.

图3和图4所示的基于卷积神经网络的算法可以在图16所示的NPU芯片中实现。The algorithms based on the convolutional neural network shown in Figures 3 and 4 can be implemented in the NPU chip shown in Figure 16.

神经网络处理器NPU 50作为协处理器挂载到主CPU(Host CPU)上，由Host CPU分配任务。NPU的核心部分为运算电路503，通过控制器504控制运算电路503提取存储器中的矩阵数据并进行乘法运算。The neural network processor NPU 50 is mounted on the main CPU (Host CPU) as a co-processor, and the Host CPU assigns tasks. The core part of the NPU is the arithmetic circuit 503, which is controlled by the controller 504 to extract the matrix data in the memory and perform multiplication operations.

在一些实现中，运算电路503内部包括多个处理单元(process engine，PE)。在一些实现中，运算电路503是二维脉动阵列。运算电路503还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中，运算电路503是通用的矩阵处理器。In some implementations, the arithmetic circuit 503 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 503 is a general-purpose matrix processor.

举例来说，假设有输入矩阵A，权重矩阵B，输出矩阵C。运算电路从权重存储器502中取矩阵B相应的数据，并缓存在运算电路中每一个PE上。运算电路从输入存储器501中取矩阵A数据与矩阵B进行矩阵运算，得到的矩阵的部分结果或最终结果，保存在累加器508accumulator中。For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to the matrix B from the weight memory 502 and buffers it on each PE in the operation circuit. The arithmetic circuit fetches the matrix A data from the input memory 501 and performs matrix operations on the matrix B, and the obtained partial result or final result of the matrix is stored in the accumulator 508 accumulator.

统一存储器506用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器505(direct memory access controller，DMAC)被搬运到权重存储器502中。输入数据也通过DMAC被搬运到统一存储器506中。Unified memory 506 is used to store input data and output data. The weight data is directly transferred to the weight memory 502 through a storage unit access controller 505 (direct memory access controller, DMAC). Input data is also moved to unified memory 506 via the DMAC.

BIU为Bus Interface Unit即，总线接口单元510，用于AXI总线与DMAC和取指存储器509Instruction Fetch Buffer的交互。The BIU is the Bus Interface Unit, that is, the bus interface unit 510, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch memory 509 Instruction Fetch Buffer.

总线接口单元510(bus interface unit，简称BIU)，用于取指存储器509从外部存储器获取指令，还用于存储单元访问控制器505从外部存储器获取输入矩阵A或者权重矩阵B的原数据。The bus interface unit 510 (BIU for short) is used for the instruction fetch memory 509 to obtain instructions from the external memory, and also for the storage unit access controller 505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.

DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器506或将权重数据搬运到权重存储器502中或将输入数据数据搬运到输入存储器501中。The DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 506 , the weight data to the weight memory 502 , or the input data to the input memory 501 .

向量计算单元507可以包括多个运算处理单元，在需要的情况下，对运算电路的输出做进一步处理，如向量乘，向量加，指数运算，对数运算，大小比较等等。主要用于神经网络中非卷积/FC层网络计算，如Pooling(池化)，Batch Normalization(批归一化)，LocalResponse Normalization(局部响应归一化)等。The vector calculation unit 507 may include a plurality of operation processing units, if necessary, further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. Mainly used for non-convolutional/FC layer network calculation in neural network, such as Pooling (pooling), Batch Normalization (batch normalization), LocalResponse Normalization (local response normalization), etc.

在一些实现中，向量计算单元能507将经处理的输出的向量存储到统一缓存器506。例如，向量计算单元507可以将非线性函数应用到运算电路503的输出，例如累加值的向量，用以生成激活值。在一些实现中，向量计算单元507生成归一化的值、合并值，或二者均有。在一些实现中，处理过的输出的向量能够用作到运算电路503的激活输入，例如用于在神经网络中的后续层中的使用。In some implementations, the vector computation unit can 507 store the processed output vectors to the unified buffer 506 . For example, the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values. In some implementations, vector computation unit 507 generates normalized values, merged values, or both. In some implementations, the vector of processed outputs can be used as activation input to the arithmetic circuit 503, eg, for use in subsequent layers in a neural network.

控制器504连接的取指存储器(instruction fetch buffer)509，用于存储控制器504使用的指令；an instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;

统一存储器506，输入存储器501，权重存储器502以及取指存储器509均为On-Chip存储器。外部存储器私有于该NPU硬件架构。The unified memory 506, the input memory 501, the weight memory 502 and the instruction fetch memory 509 are all On-Chip memories. External memory is private to the NPU hardware architecture.

其中，图3和图4所示的卷积神经网络中各层的运算可以由矩阵计算单元212或向量计算单元507执行。The operations of each layer in the convolutional neural network shown in FIG. 3 and FIG. 4 may be performed by the matrix computing unit 212 or the vector computing unit 507 .

本申请上述方法实施例可以应用于处理器中，或者由处理器实现上述方法实施例的步骤。处理器可能是一种集成电路芯片，具有信号的处理能力。在实现过程中，上述方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器可以是中央处理器(central processing unit，CPU)，网络处理器(networkprocessor，NP)或者CPU和NP的组合、数字信号处理器(digital signal processor，DSP)、专用集成电路(application specific integrated circuit，ASIC)、现成可编程门阵列(field programmable gate array，FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请所公开的方法的步骤可以直接体现为硬件译码处理器执行完成，或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器，闪存、只读存储器，可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器，处理器读取存储器中的信息，结合其硬件完成上述方法的步骤。虽然图中仅仅示出了一个处理器，该装置可以包括多个处理器或者处理器包括多个处理单元。具体的，处理器可以是一个单核(single-CPU)处理器，也可以是一个多核(multi-CPU)处理器。The foregoing method embodiments of the present application may be applied to a processor, or the processor may implement the steps of the foregoing method embodiments. A processor may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method embodiments may be completed by a hardware integrated logic circuit in a processor or an instruction in the form of software. The above processor may be a central processing unit (CPU), a network processor (NP) or a combination of CPU and NP, a digital signal processor (DSP), an application specific integrated circuit (application specific integrated circuit) integrated circuit, ASIC), off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. Various methods, steps and logic block diagrams disclosed in this application can be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps in combination with the method disclosed in this application can be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art. The storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware. Although only one processor is shown, the apparatus may include multiple processors or processors include multiple processing units. Specifically, the processor may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.

存储器用于存储处理器执行的计算机指令。存储器可以是存储电路也可以是存储器。存储器可以是易失性存储器或非易失性存储器，或可包括易失性和非易失性存储器两者。其中，非易失性存储器可以是只读存储器(read-only memory，ROM)、可编程只读存储器(programmable ROM，PROM)、可擦除可编程只读存储器(erasable PROM，EPROM)、电可擦除可编程只读存储器(electrically EPROM，EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory，RAM)，其用作外部高速缓存。存储器可以独立于处理器，也可以是处理器中的存储单元，在此不做限定。虽然图中仅仅示出了一个存储器，该装置也可以包括多个存储器或者存储器包括多个存储单元。Memory is used to store computer instructions for execution by the processor. The memory may be a storage circuit or a memory. The memory can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically programmable Erase programmable read-only memory (electrically EPROM, EEPROM) or flash memory. Volatile memory may be random access memory (RAM), which acts as an external cache. The memory may be independent of the processor, or may be a storage unit in the processor, which is not limited herein. Although only one memory is shown in the figures, the apparatus may include multiple memories or the memories include multiple storage units.

收发器用于实现处理器与其他单元或者网元的内容交互。具体的，收发器可以是该装置的通信接口，也可以是收发电路或者通信单元，还可以是收发信机。收发器还可以是处理器的通信接口或者收发电路。一种可能的实现方式，收发器可以是一个收发芯片。该收发器还可以包括发送单元和/或接收单元。在一种可能的实现方式中，该收发器可以包括至少一个通信接口。在另一种可能的实现方式中，该收发器也可以是以软件形式实现的单元。在本申请的各实施例中，处理器可以通过收发器与其他单元或者网元进行交互。例如：处理器通过该收发器获取或者接收来自其他网元的内容。若处理器与收发器是物理上分离的两个部件，处理器可以不经过收发器与该装置的其他单元进行内容交互。The transceiver is used to realize the content interaction between the processor and other units or network elements. Specifically, the transceiver may be a communication interface of the device, a transceiver circuit or a communication unit, or a transceiver. The transceiver may also be a communication interface or a transceiver circuit of the processor. In a possible implementation manner, the transceiver may be a transceiver chip. The transceiver may also include a transmitting unit and/or a receiving unit. In one possible implementation, the transceiver may include at least one communication interface. In another possible implementation, the transceiver may also be a unit implemented in the form of software. In various embodiments of the present application, the processor may interact with other units or network elements through the transceiver. For example, the processor obtains or receives content from other network elements through the transceiver. If the processor and the transceiver are two physically separate components, the processor can interact with other units of the device without going through the transceiver.

一种可能的实现方式中，处理器、存储器以及收发器可以通过总线相互连接。总线可以是外设部件互连标准(peripheral component interconnect，PCI)总线或扩展工业标准结构(extended industry standard architecture，EISA)总线等。所述总线可以分为地址总线、数据总线、控制总线等。In one possible implementation, the processor, the memory, and the transceiver may be connected to each other through a bus. The bus may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus or the like. The bus can be divided into an address bus, a data bus, a control bus, and the like.

本申请实施例中，“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其他实施例或设计方案更优选或更具优势。确切而言，使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。In the embodiments of the present application, words such as "exemplary" or "for example" are used to represent examples, illustrations or illustrations. Any embodiment or design described in the embodiments of the present application as "exemplary" or "such as" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present the related concepts in a specific manner.

在本申请的各实施例中，为了方面理解，进行了多种举例说明。然而，这些例子仅仅是一些举例，并不意味着是实现本申请的最佳实现方式。In the various embodiments of the present application, various illustrations are provided for the sake of understanding. However, these examples are merely examples and are not meant to be the best way to implement the present application.

上述实施例，可以全部或部分地通过软件、硬件、固件或者其任意组合来实现，当使用软件实现时，可以全部或部分地以计算机程序产品的形式实现。The above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof, and when implemented in software, may be implemented in whole or in part in the form of computer program products.

所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机执行指令时，全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中，或者从一个计算机可读存储介质向另一计算机可读存储介质传输，例如，所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质，(例如，软盘、硬盘、磁带)、光介质(例如，DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。The computer program product includes one or more computer instructions. When the computer-executed instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated. The computer may be a general purpose computer, special purpose computer, computer network, or other programmable device. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be downloaded from a website site, computer, server, or data center Transmission to another website site, computer, server, or data center is by wire (eg, coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (eg, infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a server, data center, etc., which includes one or more available media integrated. The usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVD), or semiconductor media (eg, Solid State Disk (SSD)), among others.

以上对本申请所提供的技术方案进行了详细介绍，本申请中应用了具体个例对本申请的原理及实施方式进行了阐述，以上实施例的说明只是用于帮助理解本申请的方法及其核心思想；同时，对于本领域的一般技术人员，依据本申请的思想，在具体实施方式及应用范围上均会有改变之处，综上所述，本说明书内容不应理解为对本申请的限制。The technical solutions provided by the present application have been introduced in detail above, and the principles and implementations of the present application have been described with specific examples in the present application. The descriptions of the above embodiments are only used to help understand the methods and core ideas of the present application. At the same time, for those skilled in the art, according to the idea of the application, there will be changes in the specific implementation and application scope. To sum up, the content of this specification should not be construed as a limitation to the application.

Claims

1. a data processing method, is characterized in that, described method comprises:

The data processing device acquires a sequence of frames, the frames in the sequence of frames have a first resolution;

The data processing device determines at least two frame groups from the frame sequence, the frame groups including a first target frame and at least two adjacent frames of the first target frame, the first target frame being the any frame in the frame sequence, the adjacent frame is a frame other than the first target frame in the frame sequence;

The data processing device determines the characteristics of each frame group in the at least two frame groups through a three-dimensional convolutional neural network, and the characteristic indication of each frame group is based on the first target frame, from the each frame. Detailed information obtained from adjacent frames in the group, the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension is positively correlated with the number of frames in the frame group;

The data processing device fuses the features of each of the at least two frame groups to determine a detail feature of the first target frame, the detail feature indicating that based on the first target frame, from the at least Detail information obtained from adjacent frames within two frame groups;

The data processing apparatus acquires, according to the detailed feature and the first target frame, a first target frame with a second resolution, where the second resolution is greater than the first resolution.

2. The method of claim 1, wherein the frame group includes the first target frame and two of the adjacent frames.

3. The method of claim 2, wherein the two adjacent frames include a first adjacent frame and a second adjacent frame, the first adjacent frame being the same as the first adjacent frame in the frame sequence. The interval between a target frame is equal to the interval between the second adjacent frame and the first target frame in the frame sequence.

4. The method of any one of claims 1 to 3, wherein the at least two frame groups comprise three frame groups.

5. The method according to claim 1, wherein the method further comprises:

The data processing device aligns the frames in each of the at least two frame groups, and determines the aligned at least two frame groups;

The characteristics of each frame group in the at least two frame groups determined by the data processing device through a three-dimensional convolutional neural network include:

The data processing device determines the feature of each frame group in the aligned at least two frame groups through a three-dimensional convolutional neural network.

6. The method according to claim 5, wherein the data processing device aligns the frames in each of the at least two frame groups, and determining the aligned at least two frame groups comprises:

The data processing device determines the homography matrix between all two consecutive frames in the queue formed by the frames in the at least two frame groups;

The data processing apparatus determines the aligned at least two frame groups according to the homography matrix.

7. The method according to any one of claims 1 to 6, wherein the method further comprises:

The data processing device determines the weight of the feature of each frame group in the at least two frame groups through a deep neural network;

The data processing device fuses the features of each frame group in the at least two frame groups to determine the detailed features of the first target frame including:

The data processing device fuses the features of each frame group in the at least two frame groups according to the weight value to determine the detailed features of the first target frame.

8. The method according to any one of claims 1 to 7, wherein the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension is equal to the number of frames in the frame group.

9. A data processing device, comprising:

an acquisition unit, configured to acquire a frame sequence, the frames in the frame sequence have a first resolution;

a determining unit, configured to determine at least two frame groups from the frame sequence, the frame group including a first target frame and at least two adjacent frames of the first target frame, and the first target frame is the any frame in the frame sequence, the adjacent frame is a frame other than the first target frame in the frame sequence;

The determining unit is further configured to determine, by using a three-dimensional convolutional neural network, a feature of each frame group in the at least two frame groups, where the feature indication of each frame group is based on the first target frame and obtained from the Detailed information obtained from adjacent frames in each frame group, the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension is positively correlated with the number of frames in the frame group;

A processing unit, configured to fuse the features of each frame group in the at least two frame groups to determine the detail feature of the first target frame, the detail feature indicating that based on the first target frame, from the at least Detail information obtained from adjacent frames within two frame groups;

The obtaining unit is further configured to obtain, according to the detailed feature and the first target frame, a first target frame with a second resolution, where the second resolution is greater than the first resolution.

10. The apparatus of claim 9, wherein the frame group includes the first target frame and two of the adjacent frames.

11. The apparatus of claim 10, wherein the two adjacent frames comprise a first adjacent frame and a second adjacent frame, the first adjacent frame being the same as the first adjacent frame in the frame sequence. The interval between a target frame is equal to the interval between the second adjacent frame and the first target frame in the frame sequence.

12. The device according to any one of claims 9 to 11, wherein

The at least two frame groups include three frame groups.

13. The apparatus according to claim 9, wherein the determining unit,

is also used to align the frames in each of the at least two frame groups, and determine the aligned at least two frame groups;

The determining unit is specifically used for:

A feature of each of the aligned at least two frame groups is determined by a three-dimensional convolutional neural network.

14. The apparatus according to claim 13, wherein the determining unit is specifically configured to:

determining a homography matrix between all two consecutive frames in the queue consisting of frames in the at least two frame groups;

The aligned at least two frame groups are determined from the homography matrix.

15. The device according to any one of claims 9 to 14, wherein

The determining unit is also used for:

Determine the weight of the feature of each frame group in the at least two frame groups by a deep neural network;

The processing unit is specifically used for:

The features of each of the at least two frame groups are fused according to the weights to determine the detailed features of the first target frame.

The apparatus according to any one of claims 9 to 15, wherein the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension is equal to the number of frames in the frame group.

17. A data processing device, comprising a processor and a memory, wherein the processor and the memory are connected to each other, wherein the memory is used to store a computer program, the computer program includes program instructions, the The processor is configured to invoke the program instructions to execute the method as claimed in any one of claims 1 to 8.

18. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 8.

19. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 8.