HK40080593B - Video decoding method, video decoding apparatus and storage medium - Google Patents
- Publication number: HK40080593B (application number HK62023069217.5A)
- Authority: HK (Hong Kong)
- Prior art keywords: neural network, parameters, trained, video, block
Description
Incorporation by Reference
This application claims priority to U.S. Patent Application No. 17/729,978, "BLOCK-WISE CONTENT-ADAPTIVE ONLINE TRAINING IN NEURAL IMAGE," filed on April 26, 2022, which in turn claims priority to U.S. Provisional Application No. 63/182,366, "Block-wise Content-Adaptive Online Training in Neural Image Compression," filed on April 30, 2021. The entire disclosures of the prior applications are incorporated herein by reference.
Technical Field
The present disclosure describes embodiments generally related to video coding.
Background
The background description provided herein is for the purpose of generally presenting the context of the disclosure. To the extent the work of the presently named inventors is described in this background section, that work, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, is neither expressly nor impliedly admitted as prior art against the present disclosure.
Video encoding and decoding can be performed using inter-picture prediction with motion compensation. Uncompressed digital images and/or video can include a series of pictures, each picture having a spatial dimension of, for example, 1920×1080 luma samples and associated chroma samples. The series of pictures can have a fixed or variable picture rate (informally also known as frame rate) of, for example, 60 pictures per second or 60 Hz. Uncompressed images and/or video have specific bit rate requirements. For example, 1080p60 4:2:0 video at 8 bits per sample (1920×1080 luma sample resolution at a 60 Hz frame rate) requires close to 1.5 Gbit/s of bandwidth. An hour of such video requires more than 600 GBytes of storage space.
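The bandwidth and storage figures above can be checked with a short back-of-the-envelope computation (a sketch; the 4:2:0 sample count and unit conversions are the only assumptions made here):

```python
# Uncompressed bit rate for a W x H luma picture with 4:2:0 chroma
# subsampling: the two chroma planes each carry a quarter of the luma
# samples, so there are 1.5 samples per pixel in total.
def uncompressed_bitrate(width, height, fps, bits_per_sample):
    samples_per_picture = width * height * 1.5
    return samples_per_picture * bits_per_sample * fps  # bits per second

bps = uncompressed_bitrate(1920, 1080, 60, 8)
print(bps / 1e9)             # ~1.49 Gbit/s, i.e. close to 1.5 Gbit/s
print(bps * 3600 / 8 / 1e9)  # ~672 GBytes for one hour of such video
```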
One purpose of video encoding and decoding is to reduce redundancy in the input image and/or video signal through compression. Compression can help reduce the aforementioned bandwidth and/or storage space requirements, in some cases by two orders of magnitude or more. Although the description herein uses video encoding/decoding as an illustrative example, the same techniques can be applied to image encoding/decoding in a similar manner without departing from the spirit of the present disclosure. Both lossless compression and lossy compression, as well as combinations thereof, can be employed. Lossless compression refers to techniques by which an exact copy of the original signal can be reconstructed from the compressed signal. When lossy compression is used, the reconstructed signal may not be identical to the original signal, but the distortion between the original and reconstructed signals is small enough that the reconstructed signal is useful for the intended application. In the case of video, lossy compression is widely employed. The amount of distortion tolerated depends on the application; for example, users of certain consumer streaming applications may tolerate higher distortion than users of television distribution applications. The achievable compression ratio can reflect that a higher allowable/tolerable distortion can yield a higher compression ratio.
Video encoders and decoders can utilize techniques from several broad categories, including, for example, motion compensation, transform, quantization, and entropy coding.
Video codec technologies can include techniques known as intra coding. In intra coding, sample values are represented without reference to samples or other data from previously reconstructed reference pictures. In some video codecs, a picture is spatially subdivided into blocks of samples. When all blocks of samples are coded in intra mode, that picture can be an intra picture. Intra pictures and their derivatives (for example, independent decoder refresh pictures) can be used to reset the decoder state and can therefore be used as the first picture in a coded video bitstream and video session, or as a still image. The samples of an intra block can be subjected to a transform, and the transform coefficients can be quantized before entropy coding. Intra prediction can be a technique that minimizes sample values in the pre-transform domain. In some cases, the smaller the DC value after the transform and the smaller the AC coefficients, the fewer bits are required to represent the entropy-coded block at a given quantization step size.
Traditional intra coding, for example as known from MPEG-2 generation coding technologies, does not use intra prediction. However, some newer video compression technologies include techniques that make attempts from, for example, surrounding sample data and/or metadata obtained during the encoding and/or decoding of spatially neighboring blocks of data that precede in decoding order. Such techniques are henceforth called "intra prediction" techniques. Note that, at least in some cases, intra prediction uses only reference data from the current picture under reconstruction, and not reference data from reference pictures.
There can be many different forms of intra prediction. When more than one such technique can be used in a given video coding technology, the technique in use can be coded as an intra prediction mode. In certain cases, modes can have submodes and/or parameters, and those can be coded individually or included in the mode codeword. Which codeword to use for a given mode/submode/parameter combination can have an impact on the coding efficiency gain through intra prediction, and so can the entropy coding technology used to translate the codewords into a bitstream.
Certain intra prediction modes were introduced with H.264, refined in H.265, and further refined in newer coding technologies such as the Joint Exploration Model (JEM), Versatile Video Coding (VVC), and Benchmark Set (BMS). A prediction block can be formed using neighboring sample values belonging to already available samples. Sample values of neighboring samples are copied into the prediction block according to a direction. A reference to the direction in use can be coded in the bitstream or may itself be predicted.
Referring to Figure 1A, depicted in the lower right is a subset of nine prediction directions known from the 33 possible prediction directions of H.265 (corresponding to the 33 angular modes of the 35 intra modes). The point (101) where the arrows converge represents the sample being predicted. The arrows represent the direction from which the sample is predicted. For example, arrow (102) indicates that sample (101) is predicted from one or more samples to the upper right, at a 45° angle from the horizontal. Similarly, arrow (103) indicates that sample (101) is predicted from one or more samples to the lower left of sample (101), at a 22.5° angle from the horizontal.
Still referring to Figure 1A, depicted in the upper left is a square block (104) of 4×4 samples (indicated by a bold dashed line). The square block (104) includes 16 samples, each labeled with "S", its position in the Y dimension (for example, row index), and its position in the X dimension (for example, column index). For example, sample S21 is the second sample in the Y dimension (from the top) and the first sample in the X dimension (from the left). Similarly, sample S44 is the fourth sample in block (104) in both the Y and X dimensions. As the block is 4×4 samples in size, S44 is at the lower right. Further shown are reference samples that follow a similar numbering scheme. A reference sample is labeled with R, its Y position (for example, row index), and its X position (column index) relative to block (104). In both H.264 and H.265, prediction samples neighbor the block under reconstruction; therefore, negative values need not be used.
Intra picture prediction can work by copying reference sample values from the neighboring samples, as appropriate according to the signaled prediction direction. For example, assume the coded video bitstream contains signaling that, for this block, indicates a prediction direction consistent with arrow (102); that is, samples are predicted from one or more prediction samples to the upper right, at a 45° angle from the horizontal. In that case, samples S41, S32, S23, and S14 are predicted from the same reference sample R05. Sample S44 is then predicted from reference sample R08.
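The copying described above for the 45° case can be sketched as follows (an illustrative toy model, not the normative H.264/H.265 prediction process; the array layout is an assumption):

```python
# Predict a 4x4 block from the reference row above it along the 45-degree
# up-right direction of arrow (102): sample S(y,x) takes the value of
# reference sample R0(x+y), so S41, S32, S23 and S14 all copy R05,
# and S44 copies R08.
def predict_45_up_right(top_refs, n=4):
    # top_refs[k] holds the value of reference sample R0k
    pred = [[0] * n for _ in range(n)]
    for y in range(1, n + 1):          # row index of S, 1-based
        for x in range(1, n + 1):      # column index of S, 1-based
            pred[y - 1][x - 1] = top_refs[x + y]
    return pred

refs = list(range(10))                 # stand-in values for R00..R09
block = predict_45_up_right(refs)
print(block[3][0], block[0][3])        # S41 and S14 both equal refs[5] (R05)
print(block[3][3])                     # S44 equals refs[8] (R08)
```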
In certain cases, the values of multiple reference samples can be combined, for example through interpolation, in order to calculate a reference sample; especially when the direction is not evenly divisible by 45°.
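A minimal sketch of such a combination, assuming a simple two-tap linear filter at 1/32-sample precision (the actual filters and precision are codec-specific):

```python
# Blend two neighboring reference samples when the prediction direction
# points between integer reference positions; frac32/32 is the fractional
# offset toward refs[idx + 1], and rounding is to the nearest integer.
def interpolate_reference(refs, idx, frac32):
    return ((32 - frac32) * refs[idx] + frac32 * refs[idx + 1] + 16) >> 5

refs = [100, 132]
print(interpolate_reference(refs, 0, 8))    # a quarter of the way: 108
print(interpolate_reference(refs, 0, 16))   # halfway: 116
```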
The number of possible directions has increased as video coding technology has developed. In H.264 (year 2003), nine different directions could be represented. That increased to 33 in H.265 (year 2013), and JEM/VVC/BMS could, at the time of disclosure, support up to 65 directions. Experiments have been conducted to identify the most likely directions, and certain techniques in entropy coding are used to represent those likely directions in a small number of bits, accepting a certain penalty for less likely directions. Further, the directions themselves can sometimes be predicted from neighboring directions used in neighboring, already decoded blocks.
Figure 1B shows a schematic (110) that depicts 65 intra prediction directions according to JEM, to illustrate the number of prediction directions increasing over time.
The mapping of intra prediction direction bits that represent the direction in the coded video bitstream can differ from video coding technology to video coding technology; it can range, for example, from simple direct mappings of prediction direction to intra prediction mode or to codewords, to complex adaptive schemes involving most probable modes, and similar techniques. In all cases, however, certain directions are statistically less likely to occur in video content than certain other directions. As the goal of video compression is the reduction of redundancy, in a well-working video coding technology those less likely directions will be represented by a larger number of bits than more likely directions.
Motion compensation can be a lossy compression technique and can relate to techniques in which a block of sample data from a previously reconstructed picture or part thereof (the reference picture), after being spatially shifted in a direction indicated by a motion vector (MV henceforth), is used for the prediction of a newly reconstructed picture or picture part. In some cases, the reference picture can be the same as the picture currently under reconstruction. MVs can have two dimensions, X and Y, or three dimensions, the third being an indication of the reference picture in use (the latter, indirectly, can be a time dimension).
In some video compression techniques, an MV applicable to a certain area of sample data can be predicted from other MVs, for example from those related to another area of sample data spatially adjacent to the area under reconstruction and preceding that MV in decoding order. Doing so can substantially reduce the amount of data required for coding the MV, thereby removing redundancy and increasing compression. MV prediction can work effectively, for example, because when coding an input video signal derived from a camera (known as natural video) there is a statistical likelihood that areas larger than the area to which a single MV is applicable move in a similar direction, and therefore a similar MV derived from the MVs of neighboring areas can, in some cases, be used for prediction. That results in the MV found for a given area being similar or identical to the MV predicted from the surrounding MVs, which in turn, after entropy coding, can be represented in a smaller number of bits than would be used if the MV were coded directly. In some cases, MV prediction can be an example of lossless compression of a signal (namely, the MVs) derived from the original signal (namely, the sample stream). In other cases, MV prediction itself can be lossy, for example because of rounding errors when calculating a predictor from several surrounding MVs.
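One plausible sketch of this idea uses a component-wise median of neighboring MVs as the predictor and codes only the residual (the median is one common choice for illustration; actual codecs define their own predictor derivation):

```python
import statistics

# Predict a block's MV from neighboring MVs; when neighboring regions
# move similarly, the residual against this predictor is small and
# entropy-codes cheaply.
def predict_mv(neighbor_mvs):
    return (statistics.median(mv[0] for mv in neighbor_mvs),
            statistics.median(mv[1] for mv in neighbor_mvs))

def mv_residual(actual_mv, neighbor_mvs):
    px, py = predict_mv(neighbor_mvs)
    return (actual_mv[0] - px, actual_mv[1] - py)

neighbors = [(4, 1), (5, 1), (4, 2)]
print(predict_mv(neighbors))            # (4, 1)
print(mv_residual((5, 1), neighbors))   # (1, 0): cheaper to code than (5, 1)
```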
Various MV prediction mechanisms are described in H.265/HEVC (ITU-T Rec. H.265, "High Efficiency Video Coding", December 2016). Out of the many MV prediction mechanisms that H.265 offers, described here is a technique henceforth referred to as "spatial merge".
Referring to Figure 2, the current block (201) comprises samples that have been found by the encoder during a motion search to be predictable from a previous block of the same size that has been spatially shifted. Instead of coding that MV directly, the MV can be derived from metadata associated with one or more reference pictures, for example from the most recent (in decoding order) reference picture, using the MV associated with any one of five surrounding samples, denoted A0, A1, and B0, B1, B2 (202 through 206, respectively). In H.265, MV prediction can use predictors from the same reference picture that the neighboring block is using.
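The candidate-selection idea behind spatial merge can be sketched as follows (a simplified illustration; H.265's actual availability checks, pruning, and candidate ordering are more involved):

```python
# Build a merge candidate list from the MVs at neighboring positions
# A0, A1, B0, B1, B2 (labels as in Figure 2); the encoder then signals
# an index into this list instead of coding an MV directly.
def build_merge_candidates(neighbor_mvs):
    # neighbor_mvs: dict mapping position label -> MV tuple, or None
    # when that neighbor is unavailable (e.g., outside the picture).
    candidates = []
    for label in ("A0", "A1", "B0", "B1", "B2"):
        mv = neighbor_mvs.get(label)
        if mv is not None and mv not in candidates:  # skip unavailable/duplicate
            candidates.append(mv)
    return candidates

mvs = {"A0": (3, 0), "A1": (3, 0), "B0": None, "B1": (2, 1), "B2": (3, 0)}
print(build_merge_candidates(mvs))   # [(3, 0), (2, 1)]
```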
Summary
Aspects of the disclosure provide methods and apparatuses for video encoding and decoding. In some examples, an apparatus for video decoding includes processing circuitry. The processing circuitry is configured to decode first neural network update information for a first neural network in a coded bitstream. The first neural network is configured with a first set of pretrained parameters. The first neural network update information corresponds to a first block in an image to be reconstructed and indicates a first replacement parameter corresponding to a first pretrained parameter in the first set of pretrained parameters. The processing circuitry can update the first neural network for the first block based on the first replacement parameter, and decode the first block based on the updated first neural network for the first block.
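The decoder-side update can be pictured as swapping one named parameter in the pretrained set (a hedged sketch; the parameter names and the flat dictionary representation are illustrative assumptions, not the actual network structure):

```python
# Apply block-specific update information: replace one pretrained
# parameter with the signaled replacement value, leaving the shared
# pretrained set intact so later blocks can start from it again.
def apply_update(pretrained_params, update_info):
    name, replacement = update_info        # which parameter, and its new value
    updated = dict(pretrained_params)      # copy; pretrained set stays intact
    updated[name] = replacement
    return updated

pretrained = {"conv1.weight": 0.50, "conv1.bias": 0.10}   # hypothetical names
updated = apply_update(pretrained, ("conv1.bias", 0.13))
print(updated["conv1.bias"])     # 0.13: used when decoding this block
print(pretrained["conv1.bias"])  # 0.1: pretrained copy unchanged
```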
In one embodiment, the first neural network update information further indicates one or more replacement parameters for one or more remaining neural networks, among a plurality of neural networks, other than the first neural network. The processing circuitry can update the one or more remaining neural networks based on the one or more replacement parameters.
In one embodiment, the processing circuitry decodes second neural network update information for a second neural network in the coded bitstream. The second neural network is configured with a second set of pretrained parameters. The second neural network update information corresponds to a second block in the image to be reconstructed and indicates a second replacement parameter corresponding to a second pretrained parameter in the second set of pretrained parameters. In one example, the second neural network is different from the first neural network. The processing circuitry can update the second neural network for the second block based on the second replacement parameter, and decode the second block based on the updated second neural network for the second block.
In one embodiment, the first pretrained parameter is one of a pretrained weight coefficient and a pretrained bias term.
In one embodiment, the second pretrained parameter is the other of the pretrained weight coefficient and the pretrained bias term.
In one embodiment, the processing circuitry decodes a second block in the coded bitstream based on the updated first neural network for the first block.
In one embodiment, the first neural network update information indicates a difference between the first replacement parameter and the first pretrained parameter. The processing circuitry determines the first replacement parameter based on a sum of the difference and the first pretrained parameter.
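The difference-based signaling in this embodiment amounts to the following one-liner (a sketch; integer, quantized parameter values are assumed here to keep the arithmetic exact):

```python
# The bitstream carries (replacement - pretrained); the decoder recovers
# the replacement parameter as the sum of the signaled difference and
# the pretrained parameter.
def recover_replacement(pretrained_value, signaled_diff):
    return pretrained_value + signaled_diff

print(recover_replacement(100, 3))    # 103
print(recover_replacement(100, -7))   # 93
```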
In one embodiment, the processing circuitry decodes the first neural network update information based on one of (i) a variation of the Lempel-Ziv-Markov chain algorithm (LZMA2) and (ii) the bzip2 algorithm.
In one example, the processing circuitry decodes the second neural network update information based on the other of the LZMA2 and bzip2 algorithms.
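Both compressors named above are available in the Python standard library, which makes the round trip easy to sketch (the serialized form of the update information is an assumption; the text above does not fix one):

```python
import bz2
import lzma

update_info = b"conv1.bias:+3"        # illustrative serialized update payload

xz_data = lzma.compress(update_info)  # lzma module: LZMA2 in the .xz container
bz_data = bz2.compress(update_info)   # bzip2

# Either algorithm round-trips the update information losslessly.
assert lzma.decompress(xz_data) == update_info
assert bz2.decompress(bz_data) == update_info
```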
Aspects of the disclosure also provide a non-transitory computer-readable storage medium storing a program executable by at least one processor to perform the video decoding method.
Brief Description of the Drawings

Further features, the nature, and various advantages of the disclosed subject matter will be more apparent from the following detailed description and the accompanying drawings, in which:

Figure 1A is a schematic illustration of an exemplary subset of intra prediction modes;
Figure 1B is an illustration of exemplary intra prediction directions;
Figure 2 shows a current block and surrounding samples in accordance with one embodiment;
Figure 3 is a schematic illustration of a simplified block diagram of a communication system in accordance with one embodiment;
Figure 4 is a schematic illustration of a simplified block diagram of a communication system in accordance with one embodiment;
Figure 5 is a schematic illustration of a simplified block diagram of a decoder in accordance with one embodiment;
Figure 6 is a schematic illustration of a simplified block diagram of an encoder in accordance with one embodiment;
Figure 7 shows a block diagram of an encoder in accordance with another embodiment;
Figure 8 shows a block diagram of a decoder in accordance with another embodiment;
Figure 9A shows an example of block-wise image coding according to an embodiment of the disclosure;
Figure 9B shows an exemplary NIC framework according to an embodiment of the disclosure;
Figure 10 shows an exemplary convolutional neural network (CNN) of a main encoder network according to an embodiment of the disclosure;
Figure 11 shows an exemplary CNN of a main decoder network according to an embodiment of the disclosure;
Figure 12 shows an exemplary CNN of a hyper encoder according to an embodiment of the disclosure;
Figure 13 shows an exemplary CNN of a hyper decoder according to an embodiment of the disclosure;
Figure 14 shows an exemplary CNN of a context model network according to an embodiment of the disclosure;
Figure 15 shows an exemplary CNN of an entropy parameter network according to an embodiment of the disclosure;
Figure 16A shows an exemplary video encoder according to an embodiment of the disclosure;
Figure 16B shows an exemplary video decoder according to an embodiment of the disclosure;
Figure 17 shows an exemplary video encoder according to an embodiment of the disclosure;
Figure 18 shows an exemplary video decoder according to an embodiment of the disclosure;
Figure 19 shows a flowchart outlining a process according to an embodiment of the disclosure;
Figure 20 shows a flowchart outlining a process according to an embodiment of the disclosure;
Figure 21 is a schematic illustration of a computer system in accordance with one embodiment.
Detailed Description
Figure 3 illustrates a simplified block diagram of a communication system (300) according to an embodiment of the present disclosure. The communication system (300) includes a plurality of terminal devices that can communicate with each other via, for example, a network (350). For example, the communication system (300) includes a first pair of terminal devices (310) and (320) interconnected via the network (350). In the example of Figure 3, the first pair of terminal devices (310) and (320) performs unidirectional transmission of data. For example, the terminal device (310) may code video data (for example, a stream of video pictures captured by the terminal device (310)) for transmission to the other terminal device (320) via the network (350). The coded video data can be transmitted in the form of one or more coded video bitstreams. The terminal device (320) may receive the coded video data from the network (350), decode the coded video data to recover the video pictures, and display the video pictures according to the recovered video data. Unidirectional data transmission is common in media serving applications and the like.
In another example, the communication system (300) includes a second pair of terminal devices (330) and (340) that perform bidirectional transmission of coded video data, which may occur, for example, during videoconferencing. For bidirectional transmission of data, in one example, each terminal device of the terminal devices (330) and (340) may code video data (for example, a stream of video pictures captured by that terminal device) for transmission to the other terminal device of the terminal devices (330) and (340) via the network (350). Each terminal device of the terminal devices (330) and (340) may also receive the coded video data transmitted by the other terminal device of the terminal devices (330) and (340), may decode the coded video data to recover the video pictures, and may display the video pictures on an accessible display device according to the recovered video data.
In the example of Figure 3, the terminal devices (310), (320), (330), and (340) may be illustrated as servers, personal computers, and smart phones, but the principles of the present disclosure may not be so limited. Embodiments of the present disclosure find application with laptop computers, tablet computers, media players, and/or dedicated videoconferencing equipment. The network (350) represents any number of networks that convey coded video data among the terminal devices (310), (320), (330), and (340), including, for example, wireline (wired) and/or wireless communication networks. The communication network (350) may exchange data in circuit-switched and/or packet-switched channels. Representative networks include telecommunications networks, local area networks, wide area networks, and/or the Internet. For the purposes of the present discussion, the architecture and topology of the network (350) may be immaterial to the operation of the present disclosure unless explained herein below.
Figure 4 illustrates, as an example of an application for the disclosed subject matter, the placement of a video encoder and a video decoder in a streaming environment. The disclosed subject matter can be equally applicable to other video-enabled applications, including, for example, videoconferencing, digital TV, and storing of compressed video on digital media including CD, DVD, memory stick, and the like.
A streaming system may include a capture subsystem (413) that can include a video source (401), for example a digital camera, creating, for example, a stream of video pictures (402) that are uncompressed. In one example, the stream of video pictures (402) includes samples that are taken by the digital camera. The stream of video pictures (402), depicted as a bold line to emphasize a high data volume when compared to coded video data (404) (or coded video bitstreams), can be processed by an electronic device (420) that includes a video encoder (403) coupled to the video source (401). The video encoder (403) can include hardware, software, or a combination thereof to enable or implement aspects of the disclosed subject matter as described in more detail below. The coded video data (404) (or coded video bitstream (404)), depicted as a thin line to emphasize a lower data volume when compared to the stream of video pictures (402), can be stored on a streaming server (405) for future use. One or more streaming client subsystems, such as the client subsystems (406) and (408) in Figure 4, can access the streaming server (405) to retrieve copies (407) and (409) of the coded video data (404). A client subsystem (406) can include, for example, a video decoder (410) in an electronic device (430). The video decoder (410) decodes the incoming copy (407) of the coded video data and creates an outgoing stream of video pictures (411) that can be rendered on a display (412) (for example, a display screen) or another rendering device (not depicted). In some streaming systems, the coded video data (404), (407), and (409) (for example, video bitstreams) can be coded according to certain video coding/compression standards. Examples of those standards include ITU-T Recommendation H.265. In one example, a video coding standard under development is informally known as Versatile Video Coding (VVC). The disclosed subject matter may be used in the context of VVC.
注意,电子装置(420)和(430)可以包括其他组件(未示出)。例如,电子装置(420)可以包括视频解码器(未示出),并且电子装置(430)也可以包括视频编码器(未示出)。Note that electronic devices (420) and (430) may include other components (not shown). For example, electronic device (420) may include a video decoder (not shown), and electronic device (430) may also include a video encoder (not shown).
图5示出了根据本公开实施例的视频解码器(510)的框图。视频解码器(510)可以包括在电子装置(530)中。电子装置(530)可以包括接收机(531)(例如,接收电路)。视频解码器(510)可以用来代替图4示例中的视频解码器(410)。Figure 5 shows a block diagram of a video decoder (510) according to an embodiment of the present disclosure. The video decoder (510) may be included in an electronic device (530). The electronic device (530) may include a receiver (531) (e.g., receiving circuitry). The video decoder (510) may be used in place of the video decoder (410) in the example of Figure 4.
接收机(531)可以接收将由视频解码器(510)解码的一个或多个编码视频序列;在同一个或另一个实施例中,一次一个编码视频序列,其中,每个编码视频序列的解码独立于其他编码视频序列。可以从信道(501)接收编码视频序列,该信道可以是到存储编码视频数据的存储装置的硬件/软件链接。接收机(531)可以接收编码的视频数据和其他数据,例如,编码的音频数据和/或辅助数据流,这些数据可以被转发到其相应的使用实体(未示出)。接收机(531)可以将编码的视频序列与其他数据分离。为了对抗网络抖动,缓冲存储器(515)可以耦合在接收机(531)和熵解码器/解析器(520)(以下称为“解析器(520)”)之间。在某些应用中,缓冲存储器(515)是视频解码器(510)的一部分。在其他情况下,可以在视频解码器(510)之外(未示出)。在其他情况下,在视频解码器(510)之外可以有缓冲存储器(未示出),例如,用于对抗网络抖动,此外,在视频解码器(510)内部可以有另一个缓冲存储器(515),例如,用于处理播放定时。当接收机(531)从具有足够带宽和可控性的存储/转发装置或者从等同步网络接收数据时,缓冲存储器(515)可以是不需要的,或者可以是小的。为了在诸如因特网之类的尽力而为的分组网络上使用,可能需要缓冲存储器(515),该缓冲存储器可以相对较大,并且可以有利地具有自适应大小,并且可以至少部分地在视频解码器(510)外部的操作系统或类似元件(未示出)中实现。The receiver (531) can receive one or more encoded video sequences to be decoded by the video decoder (510); in the same or another embodiment, one encoded video sequence at a time, wherein the decoding of each encoded video sequence is independent of the other encoded video sequences. The encoded video sequences can be received from a channel (501), which can be a hardware/software link to a storage device storing the encoded video data. The receiver (531) can receive encoded video data and other data, such as encoded audio data and/or auxiliary data streams, which can be forwarded to their respective user entities (not shown). The receiver (531) can separate the encoded video sequences from other data. To combat network jitter, a buffer memory (515) can be coupled between the receiver (531) and the entropy decoder/parser (520) (hereinafter referred to as "parser (520)"). In some applications, the buffer memory (515) is part of the video decoder (510). In other cases, it can be outside the video decoder (510) (not shown). In other cases, a buffer memory (not shown) may be present outside the video decoder (510), for example, to combat network jitter. Additionally, another buffer memory (515) may be present inside the video decoder (510), for example, to handle playback timing. 
The buffer memory (515) may be unnecessary or small when the receiver (531) receives data from a store/forward device with sufficient bandwidth and controllability or from an isochronous network. For use on best-effort packet networks such as the Internet, a buffer memory (515) may be required. This buffer memory may be relatively large and advantageously have an adaptive size, and may be implemented at least partially outside the video decoder (510) in an operating system or similar component (not shown).
视频解码器(510)可以包括解析器(520)，以从编码的视频序列中重构符号(521)。这些符号的类别包括用于管理视频解码器(510)的操作的信息以及潜在地控制诸如呈现装置(512)(例如，显示屏)等呈现装置的信息，该呈现装置不是电子装置(530)的组成部分，但是可以耦合到电子装置(530)，如图5所示。用于呈现装置的控制信息可以是补充增强信息(SEI消息)或视频可用性信息(VUI)参数集片段(未示出)的形式。解析器(520)可以对接收到的编码视频序列进行解析/熵解码。编码视频序列的编码可以根据视频编码技术或标准，并且可以遵循各种原理，包括可变长度编码、Huffman编码、具有或不具有上下文敏感性的算术编码等。解析器(520)可以基于对应于该组的至少一个参数，从编码视频序列中提取视频解码器中的至少一个像素子组的一组子组参数。子组可以包括图片组(GOP)、图片、瓦片、切片、宏块、编码单元(CU)、块、变换单元(TU)、预测单元(PU)等。解析器(520)还可以从编码的视频序列中提取信息，例如，变换系数、量化器参数值、运动矢量等。The video decoder (510) may include a parser (520) to reconstruct symbols (521) from the encoded video sequence. Categories of those symbols include information used to manage the operation of the video decoder (510) and potentially information to control a rendering device such as a display (512) (e.g., a display screen) that is not an integral part of the electronic device (530) but can be coupled to it, as shown in Figure 5. The control information for the rendering device(s) may be in the form of Supplemental Enhancement Information (SEI messages) or Video Usability Information (VUI) parameter set fragments (not depicted). The parser (520) may parse/entropy-decode the received encoded video sequence. The coding of the encoded video sequence can be in accordance with a video coding technology or standard, and can follow various principles, including variable-length coding, Huffman coding, arithmetic coding with or without context sensitivity, and so forth. The parser (520) may extract, from the encoded video sequence, a set of subgroup parameters for at least one of the subgroups of pixels in the video decoder, based on at least one parameter corresponding to the group. Subgroups can include Groups of Pictures (GOPs), pictures, tiles, slices, macroblocks, Coding Units (CUs), blocks, Transform Units (TUs), Prediction Units (PUs), and so forth. The parser (520) may also extract from the encoded video sequence information such as transform coefficients, quantizer parameter values, motion vectors, and so forth.
解析器(520)可以对从缓冲存储器(515)接收的视频序列执行熵解码/解析操作,以便创建符号(521)。The parser (520) can perform entropy decoding/parsing operations on the video sequence received from the buffer memory (515) in order to create symbols (521).
根据编码视频图片或其部分的类型(例如:帧间和帧内图片、帧间和帧内块)以及其他因素,符号(521)的重构可以涉及多个不同的单元。可以通过由解析器(520)从编码视频序列中解析的子组控制信息来控制涉及哪些单元以及如何涉及。为了清楚起见,没有描述解析器(520)和下面的多个单元之间的这种子组控制信息流。Depending on the type of the encoded video picture or its portions (e.g., inter-frame and intra-frame pictures, inter-frame and intra-frame blocks) and other factors, the reconstruction of the symbol (521) can involve multiple different units. Which units are involved and how they are involved can be controlled by subgroup control information parsed from the encoded video sequence by the parser (520). For clarity, the flow of such subgroup control information between the parser (520) and the multiple units below is not described.
除了已经提到的功能块之外,视频解码器(510)可以在概念上细分成如下所述的多个功能单元。在商业限制下操作的实际实现中,许多这些单元彼此紧密交互,并且可以至少部分地彼此集成。然而,为了描述所公开的主题,在概念上细分成以下功能单元是合适的。In addition to the functional blocks already mentioned, the video decoder (510) can be conceptually subdivided into several functional units as described below. In practical implementations operating under commercial constraints, many of these units interact closely with each other and can be integrated at least partially with each other. However, for the purpose of describing the disclosed subject matter, it is appropriate to conceptually subdivide it into the following functional units.
第一单元是定标器/逆变换单元(551)。定标器/逆变换单元(551)接收量化的变换系数以及控制信息,包括使用哪个变换、块大小、量化因子、量化缩放矩阵等,作为来自解析器(520)的符号(521)。定标器/逆变换单元(551)可以输出包括样本值的块,这些块可以被输入到聚集器(555)中。The first unit is the scaler/inverse transform unit (551). The scaler/inverse transform unit (551) receives the quantized transform coefficients and control information, including which transform to use, block size, quantization factor, quantization scaling matrix, etc., as symbols (521) from the parser (520). The scaler/inverse transform unit (551) can output blocks containing sample values, which can be input into the aggregator (555).
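As a hedged illustration of this stage only (the 2×2 Hadamard-like transform and the single scalar quantization step below are hypothetical stand-ins, not the transforms or scaling lists any standard specifies), dequantization followed by a small inverse transform can be sketched as:

```python
def dequantize(coeffs, qstep):
    """Scale quantized transform coefficients back to the transform domain
    using a single quantization step (a simplification of real scaling)."""
    return [[c * qstep for c in row] for row in coeffs]

def inverse_hadamard_2x2(block):
    """2x2 inverse Hadamard-like transform with a /4 normalization,
    producing a block of residual sample values."""
    a, b = block[0]
    c, d = block[1]
    return [
        [(a + b + c + d) // 4, (a - b + c - d) // 4],
        [(a + b - c - d) // 4, (a - b - c + d) // 4],
    ]

# Quantized coefficients received as symbols from the parser, plus a
# hypothetical quantization step of 4 carried in the control information.
quantized = [[8, 2], [1, 0]]
residual = inverse_hadamard_2x2(dequantize(quantized, qstep=4))
```

The resulting block of sample values is what would be handed to the aggregator (555).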
在一些情况下，定标器/逆变换(551)的输出样本可以属于帧内编码块；即：没有使用来自先前重构图像的预测信息，但是可以使用来自当前图片的先前重构部分的预测信息的块。这种预测信息可以由帧内图片预测单元(552)提供。在一些情况下，帧内图片预测单元(552)使用从当前图片缓冲器(558)获取的周围已经重构的信息，生成与重构中的块具有相同大小和形状的块。当前图片缓冲器(558)缓冲例如部分重构的当前图片和/或完全重构的当前图片。在一些情况下，聚集器(555)基于每个样本将帧内预测单元(552)已经生成的预测信息添加到由定标器/逆变换单元(551)提供的输出样本信息。In some cases, the output samples of the scaler/inverse transform (551) can pertain to an intra coded block; that is, a block that does not use prediction information from previously reconstructed pictures, but can use prediction information from previously reconstructed parts of the current picture. Such prediction information can be provided by an intra picture prediction unit (552). In some cases, the intra picture prediction unit (552) uses surrounding already-reconstructed information fetched from the current picture buffer (558) to generate a block of the same size and shape as the block under reconstruction. The current picture buffer (558) buffers, for example, the partly reconstructed current picture and/or the fully reconstructed current picture. In some cases, the aggregator (555) adds, on a per-sample basis, the prediction information that the intra prediction unit (552) has generated to the output sample information provided by the scaler/inverse transform unit (551).
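A minimal sketch of the idea, assuming a hypothetical DC-style intra predictor over integer samples (real codecs define many directional modes as well), showing prediction from already-reconstructed neighbours followed by the aggregator's per-sample addition of prediction and residual:

```python
def dc_intra_predict(top, left, size):
    """DC-style intra prediction: fill a size x size block with the rounded
    mean of already-reconstructed neighbouring samples above and to the left."""
    neighbours = top + left
    dc = (sum(neighbours) + len(neighbours) // 2) // len(neighbours)
    return [[dc] * size for _ in range(size)]

def aggregate(prediction, residual):
    """Per-sample addition of prediction and residual (the aggregator's job)."""
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, residual)]

# Hypothetical reconstructed neighbour rows/columns from the current picture buffer.
pred = dc_intra_predict(top=[100, 102, 101, 99], left=[98, 100, 103, 101], size=4)
residual = [[2, -1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, -2]]
recon = aggregate(pred, residual)
```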
在其他情况下，定标器/逆变换单元(551)的输出样本可以属于帧间编码的并且可能是运动补偿的块。在这种情况下，运动补偿预测单元(553)可以访问参考图片存储器(557)，以获取用于预测的样本。在根据与该块有关的符号(521)对提取的样本进行运动补偿之后，这些样本可以由聚集器(555)添加到定标器/逆变换单元(551)的输出(在这种情况下称为残差样本或残差信号)，以便生成输出样本信息。运动补偿预测单元(553)从中获取预测样本的参考图片存储器(557)内的地址可以由运动向量来控制，运动补偿预测单元(553)可以以符号(521)的形式获得这些地址，这些符号可以具有例如X、Y和参考图片组件。当使用子采样精确运动矢量时，运动补偿还可以包括从参考图片存储器(557)获取的采样值的插值、运动矢量预测机制等。In other cases, the output samples of the scaler/inverse transform unit (551) can pertain to an inter coded, and potentially motion compensated, block. In such a case, a motion compensation prediction unit (553) can access the reference picture memory (557) to fetch samples used for prediction. After motion compensating the fetched samples in accordance with the symbols (521) pertaining to the block, those samples can be added by the aggregator (555) to the output of the scaler/inverse transform unit (551) (in this case called the residual samples or residual signal) so as to generate output sample information. The addresses within the reference picture memory (557) from which the motion compensation prediction unit (553) fetches prediction samples can be controlled by motion vectors, available to the motion compensation prediction unit (553) in the form of symbols (521) that can have, for example, X, Y, and reference picture components. Motion compensation can also include interpolation of sample values as fetched from the reference picture memory (557) when sub-sample-accurate motion vectors are in use, motion vector prediction mechanisms, and so forth.
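The fetch that the motion vector controls can be sketched as follows; this sketch assumes integer-pel motion vectors and a reference picture stored as a 2-D list of samples, and omits the sub-sample interpolation and motion vector prediction mechanisms mentioned above:

```python
def motion_compensate(ref, x, y, mv_x, mv_y, w, h):
    """Fetch the w x h prediction block that the integer-pel motion vector
    (mv_x, mv_y) addresses, relative to block position (x, y), in a
    previously reconstructed reference picture `ref`."""
    return [row[x + mv_x : x + mv_x + w]
            for row in ref[y + mv_y : y + mv_y + h]]

# A toy 4x4 reference picture whose sample values encode their position.
ref = [[r * 4 + c for c in range(4)] for r in range(4)]
pred = motion_compensate(ref, x=0, y=0, mv_x=1, mv_y=2, w=2, h=2)
```

The `pred` block would then be added by the aggregator to the residual output of the scaler/inverse transform unit.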
聚集器(555)的输出样本可以在环路滤波器单元(556)中经受各种环路滤波技术。视频压缩技术可以包括环路滤波技术,这些技术由编码视频序列(也称为编码视频比特流)中包含的参数控制,并且作为来自解析器(520)的符号(521)可用于环路滤波单元(556),但是也可以响应于在编码图像或编码视频序列的先前(按照解码顺序)部分的解码期间获得的元信息以及响应于先前重构的和环路滤波的样本值。The output samples of the aggregator (555) can undergo various loop filtering techniques in the loop filter unit (556). The video compression techniques may include loop filtering techniques controlled by parameters contained in the encoded video sequence (also known as the encoded video bitstream) and available to the loop filter unit (556) as symbols (521) from the parser (520), but may also be in response to metadata obtained during the decoding of a previous (in the order of decoding) portion of the encoded image or encoded video sequence and to previously reconstructed and loop-filtered sample values.
环路滤波器单元(556)的输出可以是样本流,该样本流可以输出到呈现装置(512)以及存储在参考图片存储器(557)中,以用于将来的帧间图片预测。The output of the loop filter unit (556) can be a sample stream, which can be output to the presentation device (512) and stored in the reference image memory (557) for future inter-frame image prediction.
一旦完全重构,某些编码图像可以用作未来预测的参考图片。例如,一旦完全重构对应于当前图片的编码图像,并且该编码的图片已经被识别为参考图片(例如,通过解析器(520)),则当前图片缓冲器(558)可以成为参考图片存储器(557)的一部分,并且在开始下一个编码图像的重构之前,可以重新分配新的当前图片缓冲器。Once fully reconstructed, certain encoded images can be used as reference images for future predictions. For example, once the encoded image corresponding to the current image has been fully reconstructed and that encoded image has been identified as a reference image (e.g., by the parser (520)), the current image buffer (558) can become part of the reference image memory (557), and a new current image buffer can be reallocated before the reconstruction of the next encoded image begins.
视频解码器(510)可以根据诸如ITU-T Rec.H.265等标准中的预定视频压缩技术来执行解码操作。在编码视频序列符合视频压缩技术或标准的语法以及视频压缩技术或标准中记载的简档的意义上，编码视频序列可以符合由所使用的视频压缩技术或标准规定的语法。具体地，简档可以从视频压缩技术或标准中可用的所有工具中选择某些工具，作为在该简档下可用的唯一工具。符合标准还需要编码视频序列的复杂度在视频压缩技术或标准的水平所定义的范围内。在某些情况下，级别限制了最大图像尺寸、最大帧速率、最大重构采样率(例如，以每秒兆样本为单位测量)、最大参考图片尺寸等。在某些情况下，由级别设置的限制可以通过假设参考解码器(HRD)规范和编码视频序列中信令的HRD缓冲器管理的元数据来进一步限制。The video decoder (510) can perform decoding operations according to a predetermined video compression technology in a standard, such as ITU-T Rec. H.265. The encoded video sequence may conform to a syntax specified by the video compression technology or standard being used, in the sense that the encoded video sequence adheres to both the syntax of the video compression technology or standard and the profiles documented in the video compression technology or standard. Specifically, a profile can select certain tools from all the tools available in the video compression technology or standard as the only tools available for use under that profile. Also necessary for compliance is that the complexity of the encoded video sequence is within bounds defined by the level of the video compression technology or standard. In some cases, levels restrict the maximum picture size, maximum frame rate, maximum reconstruction sample rate (measured in, for example, megasamples per second), maximum reference picture size, and so on. Limits set by levels can, in some cases, be further restricted through Hypothetical Reference Decoder (HRD) specifications and metadata for HRD buffer management signaled in the encoded video sequence.
在一个实施例中,接收机(531)可以接收具有编码视频的额外(冗余)数据。可以包括额外数据,作为编码视频序列的一部分。视频解码器(510)可以使用额外数据来正确解码数据和/或更准确地重构原始视频数据。额外数据可以是例如时间、空间或信噪比(SNR)增强层、冗余切片、冗余图片、前向纠错码等形式。In one embodiment, the receiver (531) may receive additional (redundant) data with encoded video. This additional data may be included as part of the encoded video sequence. The video decoder (510) may use the additional data to correctly decode the data and/or more accurately reconstruct the original video data. The additional data may be, for example, in the form of temporal, spatial, or signal-to-noise ratio (SNR) enhancement layers, redundant slices, redundant images, forward error correction codes, etc.
图6示出了根据本公开实施例的视频编码器(603)的框图。视频编码器(603)包括在电子装置(620)中。电子装置(620)包括发射机(640)(例如,发射电路)。视频编码器(603)可以用来代替图4示例中的视频编码器(403)。Figure 6 shows a block diagram of a video encoder (603) according to an embodiment of the present disclosure. The video encoder (603) is included in an electronic device (620). The electronic device (620) includes a transmitter (640) (e.g., a transmitting circuit). The video encoder (603) can be used in place of the video encoder (403) in the example of Figure 4.
视频编码器(603)可以从视频源(601)(其不是图6示例中的电子装置(620)的一部分)接收视频样本,该视频源可以捕捉要由视频编码器(603)编码的视频图片。在另一个示例中,视频源(601)是电子装置(620)的一部分。The video encoder (603) can receive video samples from a video source (601) (which is not part of the electronic device (620) in the example of Figure 6), which can capture video images to be encoded by the video encoder (603). In another example, the video source (601) is part of the electronic device (620).
视频源(601)可以以数字视频样本流的形式提供要由视频编码器(603)编码的源视频序列，该数字视频样本流可以具有任何合适的比特深度(例如：8比特、10比特、12比特、…)、任何颜色空间(例如，BT.601Y CrCB、RGB、…)和任何合适的采样结构(例如，Y CrCb4:2:0、Y CrCb 4:4:4)。在媒体服务系统中，视频源(601)可以是存储先前准备的视频的存储装置。在视频会议系统中，视频源(601)可以是捕捉本地图像信息作为视频序列的相机。可以提供视频数据，作为多个单独的图片，当按顺序观看时，这些图片赋予运动。图片本身可以被组织为像素的空间阵列，其中，每个像素可以包括一个或多个样本，这取决于使用中的采样结构、颜色空间等。本领域技术人员可以容易地理解像素和样本之间的关系。下面的描述集中在样本上。The video source (601) may provide the source video sequence to be coded by the video encoder (603) in the form of a digital video sample stream that can be of any suitable bit depth (for example: 8 bit, 10 bit, 12 bit, ...), any color space (for example, BT.601 Y CrCb, RGB, ...), and any suitable sampling structure (for example, Y CrCb 4:2:0, Y CrCb 4:4:4). In a media serving system, the video source (601) may be a storage device storing previously prepared video. In a videoconferencing system, the video source (601) may be a camera that captures local image information as a video sequence. Video data may be provided as a plurality of individual pictures that impart motion when viewed in sequence. The pictures themselves may be organized as a spatial array of pixels, wherein each pixel can comprise one or more samples depending on the sampling structure, color space, and so forth in use. A person skilled in the art can readily understand the relationship between pixels and samples. The description below focuses on samples.
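As a worked example of how sampling structure and bit depth determine raw data volume (the roughly 1.5 Gbit/s figure quoted earlier for 1080p60 4:2:0 8-bit video follows from this accounting), a small helper might look like the sketch below; the function name and interface are illustrative assumptions, not part of any codec API:

```python
def stream_bitrate(width, height, fps, bits_per_sample, chroma="4:2:0"):
    """Uncompressed bitrate in bits per second: one luma plane plus two
    chroma planes whose relative resolution depends on the subsampling."""
    luma = width * height
    chroma_factor = {"4:4:4": 1.0, "4:2:2": 0.5, "4:2:0": 0.25}[chroma]
    samples_per_picture = luma + 2 * luma * chroma_factor
    return samples_per_picture * bits_per_sample * fps

rate = stream_bitrate(1920, 1080, 60, 8)  # the 1080p60 4:2:0, 8-bit case
```

`rate` works out to about 1.49 Gbit/s, matching the background section's estimate.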
根据一个实施例,视频编码器(603)可以实时地或者在应用所需的任何其他时间约束下,将源视频序列的图片编码和压缩成编码的视频序列(643)。实施适当的编码速度是控制器(650)的一个功能。在一些实施例中,控制器(650)控制如下所述的其他功能单元,并且在功能上耦合到其他功能单元。为了清楚起见,没有描述耦合。控制器(650)设置的参数可以包括速率控制相关参数(图片跳过、量化器、率失真优化技术的λ值、…)、图片大小、图片组(GOP)布局、最大运动矢量搜索范围等。控制器(650)可以被配置为具有针对特定系统设计而优化的与视频编码器(603)相关的其他合适的功能。According to one embodiment, the video encoder (603) can encode and compress images of a source video sequence into an encoded video sequence (643) in real time or under any other time constraints required by the application. Implementing an appropriate encoding rate is a function of the controller (650). In some embodiments, the controller (650) controls and is functionally coupled to other functional units as described below. For clarity, coupling is not described. Parameters set by the controller (650) may include rate control-related parameters (image skipping, quantizer, λ value of rate-distortion optimization techniques, etc.), image size, group of pictures (GOP) layout, maximum motion vector search range, etc. The controller (650) can be configured to have other suitable functions related to the video encoder (603) optimized for a particular system design.
在一些实施例中,视频编码器(603)被配置为在编码循环中操作。作为过于简化的描述,在一个示例中,编码环路可以包括源编码器(630)(例如,负责基于要编码的输入图片和参考图片来创建符号,例如,符号流)以及嵌入在视频编码器(603)中的(本地)解码器(633)。解码器(633)以类似于(远程)解码器也将创建的方式重构符号,以创建样本数据(因为在所公开的主题中考虑的视频压缩技术中,符号和编码视频比特流之间的任何压缩都是无损的)。重构的样本流(样本数据)被输入到参考图片存储器(634)。由于符号流的解码导致独立于解码器位置(本地或远程)的位精确结果,所以参考图片存储器(634)中的内容在本地编码器和远程编码器之间也是位精确的。换言之,当在解码期间使用预测时,编码器的预测部分作为参考图片样本“看到”与解码器“看到”的样本值完全相同的样本值。也在一些相关技术中使用参考图片同步性(以及由此产生的漂移,如果不能保持同步性,例如,由于信道误差)的该基本原理。In some embodiments, the video encoder (603) is configured to operate within an encoding loop. As an oversimplification, in one example, the encoding loop may include a source encoder (630) (e.g., responsible for creating symbols, such as a symbol stream, based on the input picture to be encoded and a reference picture) and a (local) decoder (633) embedded within the video encoder (603). The decoder (633) reconstructs the symbols in a manner similar to that created by the (remote) decoder to create sample data (since any compression between the symbols and the encoded video bitstream is lossless in the video compression techniques considered in the disclosed subject matter). The reconstructed sample stream (sample data) is input to a reference picture memory (634). Since decoding of the symbol stream results in bit-precise results independent of the decoder location (local or remote), the contents of the reference picture memory (634) are also bit-precise between the local and remote encoders. In other words, when prediction is used during decoding, the encoder's prediction portion, as a reference picture sample, "sees" the exact same sample values as the decoder "sees." This basic principle of reference image synchronization (and the resulting drift if synchronization cannot be maintained, for example, due to channel errors) is also used in some related technologies.
“本地”解码器(633)的操作可以与诸如视频解码器(510)之类的“远程”解码器的操作相同,上面已经结合图5对其进行了详细描述。然而,简要地参考图5,由于符号是可用的,并且熵编码器(645)和解析器(520)对编码视频序列的符号编码/解码可以是无损的,所以可以不完全在本地解码器(633)中实现视频解码器(510)的熵解码部分,包括缓冲存储器(515)和解析器(520)。The operation of the “local” decoder (633) can be the same as that of a “remote” decoder such as a video decoder (510), which has been described in detail above in conjunction with Figure 5. However, briefly referring to Figure 5, since symbols are available and the encoding/decoding of symbols for the encoded video sequence by the entropy encoder (645) and the parser (520) can be lossless, the entropy decoding portion of the video decoder (510), including the buffer memory (515) and the parser (520), may not be fully implemented in the local decoder (633).
在一个实施例中，除了解码器中存在的解析/熵解码之外，解码器技术以相同或基本相同的功能形式存在于相应的编码器中。因此，所公开的主题集中于解码器操作。编码器技术的描述可以简化，因为这些技术是全面描述的解码器技术的逆。在某些领域，下面提供了更详细的描述。In an embodiment, the decoder technologies, except for the parsing/entropy decoding that is present in a decoder, are present, in the same or a substantially identical functional form, in the corresponding encoder. For this reason, the disclosed subject matter focuses on decoder operation. The description of encoder technologies can be abbreviated, as they are the inverse of the comprehensively described decoder technologies. Only in certain areas is a more detailed description provided below.
在操作期间,在一些示例中,源编码器(630)可以执行运动补偿预测编码,其参考来自视频序列的被指定为“参考图片”的一个或多个先前编码的图片来预测性地编码输入图片。以这种方式,编码引擎(632)对输入图片的像素块和可被选为输入图片的预测参考的参考图片的像素块之间的差异进行编码。During operation, in some examples, the source encoder (630) may perform motion-compensated predictive coding, which predictively encodes the input image by referencing one or more previously encoded images from the video sequence designated as "reference images". In this way, the encoding engine (632) encodes the differences between pixel blocks of the input image and pixel blocks of the reference image, which may be selected as a predictive reference for the input image.
本地视频解码器(633)可以基于由源编码器(630)创建的符号,对可以被指定为参考图片的图片的编码视频数据进行解码。编码引擎(632)的操作可以有利地是有损过程。当编码的视频数据可以在视频解码器(图6中未示出)处被解码时,重构的视频序列通常可以是具有一些误差的源视频序列的副本。本地视频解码器(633)复制可以由视频解码器对参考图片执行的解码过程,并且可以使得重构的参考图片存储在参考图片缓存(634)中。以这种方式,视频编码器(603)可以本地存储重构的参考图片的副本,这些副本具有与将由远端视频解码器获得的重构的参考图片相同的内容(不存在传输误差)。The local video decoder (633) can decode encoded video data of a picture that can be designated as a reference picture based on symbols created by the source encoder (630). The operation of the encoding engine (632) can advantageously be a lossy process. When the encoded video data can be decoded at the video decoder (not shown in FIG. 6), the reconstructed video sequence can typically be a copy of the source video sequence with some errors. The local video decoder (633) replicates the decoding process that can be performed on the reference picture by the video decoder, and can make the reconstructed reference picture stored in a reference picture buffer (634). In this way, the video encoder (603) can locally store copies of the reconstructed reference pictures that have the same content as the reconstructed reference pictures that will be obtained by the remote video decoder (without transmission errors).
预测器(635)可以对编码引擎(632)执行预测搜索。也就是说，对于要编码的新图片，预测器(635)可以在参考图片存储器(634)中搜索样本数据(作为候选参考像素块)或某些元数据，例如，参考图片运动矢量、块形状等，其可以用作新图片的适当预测参考。预测器(635)可以在逐个样本块-像素块的基础上操作，以找到合适的预测参考。在一些情况下，如由预测器(635)获得的搜索结果所确定的，输入图片可以具有从存储在参考图片存储器(634)中的多个参考图片中提取的预测参考。The predictor (635) may perform prediction searches for the coding engine (632). That is, for a new picture to be coded, the predictor (635) may search the reference picture memory (634) for sample data (as candidate reference pixel blocks) or certain metadata such as reference picture motion vectors, block shapes, and so on, that may serve as an appropriate prediction reference for the new picture. The predictor (635) may operate on a sample-block-by-pixel-block basis to find appropriate prediction references. In some cases, as determined by search results obtained by the predictor (635), an input picture may have prediction references drawn from multiple reference pictures stored in the reference picture memory (634).
控制器(650)可以管理源编码器(630)的编码操作,包括例如用于编码视频数据的参数和子组参数的设置。The controller (650) can manage the encoding operations of the source encoder (630), including, for example, the settings of parameters and subgroup parameters for encoding video data.
所有前述功能单元的输出可以在熵编码器中经历熵编码(645)。熵编码器(645)通过根据诸如Huffman编码、可变长度编码、算术编码等技术无损压缩符号,将由各种功能单元生成的符号转换成编码的视频序列。The outputs of all the aforementioned functional units can undergo entropy encoding (645) in the entropy encoder. The entropy encoder (645) converts the symbols generated by the various functional units into an encoded video sequence by losslessly compressing the symbols according to techniques such as Huffman coding, variable length coding, and arithmetic coding.
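As one concrete example of such lossless variable-length coding, order-0 Exp-Golomb codes (used for many syntax elements in H.264/H.265 bitstreams, and shown here purely as an illustrative sketch over bit strings) can be encoded and decoded as:

```python
def ue_encode(v):
    """Order-0 unsigned Exp-Golomb code: leading zeros, one per bit of
    (v + 1) beyond the first, followed by the binary form of v + 1."""
    code = v + 1
    return "0" * (code.bit_length() - 1) + format(code, "b")

def ue_decode(bits):
    """Inverse: count leading zeros, read that many bits past the first
    '1', interpret as binary, and subtract 1."""
    zeros = len(bits) - len(bits.lstrip("0"))
    return int(bits[zeros : 2 * zeros + 1], 2) - 1
```

Smaller symbol values get shorter codewords, which is the essence of the variable-length coding the entropy encoder relies on.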
发射机(640)可以缓冲由熵编码器(645)创建的编码视频序列,以准备经由通信信道(660)传输,通信信道可以是到将存储编码视频数据的存储装置的硬件/软件链接。发射机(640)可以将来自视频编码器(603)的编码视频数据与要传输的其他数据合并,例如,编码音频数据和/或辅助数据流(源未示出)。The transmitter (640) can buffer the encoded video sequence created by the entropy encoder (645) in preparation for transmission via a communication channel (660), which may be a hardware/software link to a storage device storing the encoded video data. The transmitter (640) can combine the encoded video data from the video encoder (603) with other data to be transmitted, such as encoded audio data and/or auxiliary data streams (source not shown).
控制器(650)可以管理视频编码器(603)的操作。在编码期间,控制器(650)可以向每个编码图片分配特定的编码图片类型,这可以影响可以应用于相应图片的编码技术。例如,图片通常可以被指定为以下图片类型之一:The controller (650) can manage the operation of the video encoder (603). During encoding, the controller (650) can assign a specific encoded picture type to each encoded picture, which can affect the encoding techniques that can be applied to the corresponding picture. For example, a picture can typically be designated as one of the following picture types:
帧内图片(I图片)可以是不使用序列中的任何其他图片作为预测源而被编码和解码的图片。一些视频编解码器允许不同类型的帧内图片,包括例如独立解码器刷新(“IDR”)图片。本领域技术人员知道I图片的那些变体以及其相应的应用和特征。An intra-frame picture (I-picture) can be a picture that is encoded and decoded without using any other pictures in the sequence as a prediction source. Some video codecs allow different types of intra-frame pictures, including, for example, Independent Decoder Refresh (“IDR”) pictures. Those skilled in the art will recognize those variations of I-pictures and their corresponding applications and characteristics.
预测图片(P图片)可以是使用最多一个运动矢量和参考索引来预测每个块的样本值,使用帧内预测或帧间预测来编码和解码的图片。A predicted image (P-image) can be an image that uses at most one motion vector and a reference index to predict the sample value of each block, and is encoded and decoded using intra-frame prediction or inter-frame prediction.
双向预测图片(B图片)可以是使用最多两个运动矢量和参考索引来预测每个块的样本值,使用帧内预测或帧间预测来编码和解码的图片。类似地,多预测图片可以使用两个以上的参考图片和相关元数据来重构单个块。A bidirectional prediction image (B-image) can be an image that uses up to two motion vectors and a reference index to predict sample values for each block, and is encoded and decoded using intra-frame prediction or inter-frame prediction. Similarly, a multi-prediction image can reconstruct a single block using two or more reference images and associated metadata.
源图片通常可以在空间上被细分成多个样本块(例如，每个样本块为4×4、8×8、4×8或16×16个样本)，并且在分块的基础上编码。可以参考由应用于块的相应图片的编码分配所确定的其他(已经编码的)块来预测性地编码块。例如，I图片的块可以被非预测性地编码，或者可以参考同一图片的已经编码的块被预测性地编码(空间预测或帧内预测)。参考一个先前编码的参考图片，经由空间预测或经由时间预测，P图片的像素块可以预测性地编码。参考一个或两个先前编码的参考图片，经由空间预测或经由时间预测，可以预测性地编码B图片的块。A source picture can commonly be subdivided spatially into a plurality of sample blocks (for example, blocks of 4×4, 8×8, 4×8, or 16×16 samples each) and coded on a block-by-block basis. Blocks may be coded predictively with reference to other (already coded) blocks, as determined by the coding assignment applied to the blocks' respective pictures. For example, blocks of I pictures may be coded non-predictively, or they may be coded predictively with reference to already coded blocks of the same picture (spatial prediction or intra prediction). Pixel blocks of P pictures may be coded predictively, via spatial prediction or via temporal prediction with reference to one previously coded reference picture. Blocks of B pictures may be coded predictively, via spatial prediction or via temporal prediction with reference to one or two previously coded reference pictures.
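The spatial subdivision described above can be sketched as follows; the raster-order split and the 4×4 toy picture are illustrative assumptions:

```python
def split_into_blocks(picture, n):
    """Spatially subdivide a picture (2-D array of samples) into n x n
    sample blocks in raster order, as done before block-by-block coding."""
    h, w = len(picture), len(picture[0])
    return [[row[bx:bx + n] for row in picture[by:by + n]]
            for by in range(0, h, n)
            for bx in range(0, w, n)]

# A toy 4x4 picture whose samples encode their position; split into 2x2 blocks.
picture = [[r * 4 + c for c in range(4)] for r in range(4)]
blocks = split_into_blocks(picture, 2)
```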
视频编码器(603)可以根据预定的视频编码技术或标准(例如,ITU-T Rec.H.265)来执行编码操作。在其操作中,视频编码器(603)可以执行各种压缩操作,包括利用输入视频序列中的时间和空间冗余的预测编码操作。因此,编码的视频数据可以符合由正在使用的视频编码技术或标准指定的语法。The video encoder (603) can perform encoding operations according to a predetermined video coding technique or standard (e.g., ITU-T Rec.H.265). In its operation, the video encoder (603) can perform various compression operations, including predictive coding operations that utilize temporal and spatial redundancy in the input video sequence. Therefore, the encoded video data can conform to the syntax specified by the video coding technique or standard being used.
在一个实施例中,发射机(640)可以与编码视频一起传输额外数据。源编码器(630)可以包括这样的数据,作为编码视频序列的一部分。额外数据可以包括时间/空间/SNR增强层、其他形式的冗余数据(例如,冗余图片和切片)、SEI消息、VUI参数集片段等。In one embodiment, the transmitter (640) may transmit additional data along with the encoded video. The source encoder (630) may include such data as part of the encoded video sequence. The additional data may include temporal/spatial/SNR enhancement layers, other forms of redundant data (e.g., redundant images and slices), SEI messages, VUI parameter set fragments, etc.
可以捕捉视频,作为时间序列中的多个源图片(视频图片)。帧内图片预测(通常缩写为帧内预测)利用给定图片中的空间相关性,而帧间图片预测利用图片之间的(时间或其他)相关性。在一个示例中,被称为当前图片的编码/解码中的特定图片被分割成块。在当前图片中的块类似于视频中先前编码且仍被缓冲的参考图片中的参考块时,当前图片中的块可以通过称为运动矢量的矢量来编码。在使用多个参考图片的情况下,运动矢量指向参考图片中的参考块,并且可以具有识别参考图片的第三维。Video can be captured as multiple source images (video pictures) in a time series. Intra-frame picture prediction (often abbreviated as intra-prediction) utilizes spatial correlations within a given picture, while inter-frame picture prediction utilizes (temporal or other) correlations between pictures. In one example, a specific picture in the encoding/decoding process, referred to as the current picture, is segmented into blocks. When a block in the current picture resembles a reference block in a previously encoded and still buffered reference picture in the video, the block in the current picture can be encoded using a vector called a motion vector. In the case of using multiple reference pictures, the motion vector points to the reference block in the reference picture and can have a third dimension that identifies the reference picture.
在一些实施例中,可以在帧间图片预测中使用双向预测技术。根据双向预测技术,使用两个参考图片,例如,第一参考图片和第二参考图片,这两个参考图片在解码顺序上都在视频中的当前图片之前(但是在显示顺序上可以分别在过去和未来)。当前图片中的块可由指向第一参考图片中的第一参考块的第一运动向量和指向第二参考图片中的第二参考块的第二运动向量来编码。可以通过第一参考块和第二参考块的组合来预测该块。In some embodiments, bidirectional prediction techniques can be used in inter-frame image prediction. According to bidirectional prediction, two reference images are used, for example, a first reference image and a second reference image, both of which precede the current image in the video in the decoding order (but can be displayed in the past and future, respectively). A block in the current image can be encoded by a first motion vector pointing to a first reference block in the first reference image and a second motion vector pointing to a second reference block in the second reference image. The block can be predicted using a combination of the first and second reference blocks.
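A minimal sketch of the combination step in bi-prediction, assuming the two reference blocks have already been motion compensated and that the combination is a rounded per-sample average (one common choice; weighted combinations are also possible):

```python
def bi_predict(block0, block1):
    """Combine two motion-compensated reference blocks by per-sample
    averaging with rounding, as in B-picture bi-prediction."""
    return [[(a + b + 1) >> 1 for a, b in zip(r0, r1)]
            for r0, r1 in zip(block0, block1)]

# Hypothetical blocks fetched from the first and second reference pictures.
prediction = bi_predict([[100, 104], [96, 100]], [[102, 104], [98, 102]])
```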
此外,可以在帧间图片预测中使用合并模式技术来提高编码效率。In addition, merging mode techniques can be used in inter-frame image prediction to improve coding efficiency.
根据本公开的一些实施例,以块为单位执行预测,例如,帧间图片预测和帧内图片预测。例如,根据HEVC标准,视频图片序列中的图片被分割成编码树单元(CTU),用于压缩,图片中的CTU具有相同的大小,例如,64×64像素、32×32像素或16×16像素。通常,CTU包括三个编码树块(CTB),即一个亮度CTB和两个色度CTB。每个CTU可以被递归地四叉树分割成一个或多个编码单元(CU)。例如,64×64像素的CTU可以被分割成一个64×64像素的CU,或者4个32×32像素的CU,或者16个16×16像素的CU。在一示例中,分析每个CU,以确定CU的预测类型,例如,帧间预测类型或帧内预测类型。根据时间和/或空间可预测性,CU被分成一个或多个预测单元(PU)。通常,每个PU包括一个亮度预测块(PB)和两个色度PB。在实施例中,以预测块为单位执行编码(编码/解码)中的预测操作。使用亮度预测块作为预测块的示例,预测块包含像素的值(例如,亮度值)的矩阵,例如,8×8像素、16×16像素、8×16像素、16×8像素等。According to some embodiments of this disclosure, prediction is performed on a block-by-block basis, such as inter-frame picture prediction and intra-frame picture prediction. For example, according to the HEVC standard, pictures in a video picture sequence are segmented into Coding Tree Units (CTUs) for compression. The CTUs in the pictures have the same size, for example, 64×64 pixels, 32×32 pixels, or 16×16 pixels. Typically, a CTU comprises three Coding Tree Blocks (CTBs): one luma CTB and two chroma CTBs. Each CTU can be recursively quadtree-segmented into one or more Coding Units (CUs). For example, a 64×64 pixel CTU can be segmented into one 64×64 pixel CU, or four 32×32 pixel CUs, or sixteen 16×16 pixel CUs. In one example, each CU is analyzed to determine its prediction type, such as inter-frame prediction or intra-frame prediction. Based on temporal and/or spatial predictability, CUs are divided into one or more Prediction Units (PUs). Typically, each PU comprises one luma prediction block (PB) and two chroma PBs. In this embodiment, the prediction operation in the encoding (encoding/decoding) is performed on a per-prediction-block basis. A luminance prediction block is used as an example; a prediction block contains a matrix of pixel values (e.g., luminance values), such as 8×8 pixels, 16×16 pixels, 8×16 pixels, 16×8 pixels, etc.
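The recursive quadtree partitioning of a CTU into CUs can be sketched as below; the `should_split` callback is a hypothetical stand-in for the encoder's real split decision (e.g., driven by rate-distortion cost), and CUs are reported as (x, y, size) tuples:

```python
def quadtree_split(x, y, size, min_size, should_split):
    """Recursively split a CTU at (x, y) into CUs: a node splits into four
    equal quadrants whenever the decision function says so and the minimum
    CU size has not been reached."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        cus = []
        for dy in (0, half):
            for dx in (0, half):
                cus += quadtree_split(x + dx, y + dy, half, min_size, should_split)
        return cus
    return [(x, y, size)]

# Split a 64x64 CTU exactly once: the toy decision splits only the full CTU,
# yielding four 32x32 CUs.
cus = quadtree_split(0, 0, 64, 16, lambda x, y, s: s == 64)
```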
图7示出了根据本公开的另一实施例的视频编码器(703)的示图。视频编码器(703)被配置为接收视频图片序列中的当前视频图片内的样本值的处理块(例如,预测块),并将该处理块编码成作为编码视频序列的一部分的编码图像。在一个示例中,视频编码器(703)用于代替图4示例中的视频编码器(403)。Figure 7 illustrates a diagram of a video encoder (703) according to another embodiment of the present disclosure. The video encoder (703) is configured to receive a processing block (e.g., a prediction block) of sample values within a current video image in a video image sequence and to encode the processing block into an encoded image as part of an encoded video sequence. In one example, the video encoder (703) is used instead of the video encoder (403) in the example of Figure 4.
在HEVC示例中，视频编码器(703)接收处理块的样本值矩阵，例如，8×8样本的预测块等。视频编码器(703)确定使用帧内模式、帧间模式还是使用例如率失真优化的双向预测模式对处理块进行最佳编码。当要以帧内模式对处理块进行编码时，视频编码器(703)可以使用帧内预测技术将处理块编码成编码图片；并且当要以帧间模式或双向预测模式对处理块进行编码时，视频编码器(703)可以分别使用帧间预测或双向预测技术来将处理块编码成编码图片。在某些视频编码技术中，合并模式可以是帧间图片预测子模式，其中，从一个或多个运动向量预测器中导出运动向量，而不受益于预测器之外的编码运动向量组件。在某些其他视频编码技术中，可能存在适用于对象块的运动矢量组件。在一个示例中，视频编码器(703)包括其他组件，例如，用于确定处理块的模式的模式判定模块(未示出)。In the HEVC example, the video encoder (703) receives a sample value matrix of the processing block, such as an 8×8 sample prediction block. The video encoder (703) determines whether the processing block is best coded using intra-frame mode, inter-frame mode, or bidirectional prediction mode, using, for example, rate-distortion optimization. When encoding the processing block in intra-frame mode, the video encoder (703) can use intra-frame prediction techniques to encode the processing block into a coded picture; and when encoding the processing block in inter-frame mode or bidirectional prediction mode, the video encoder (703) can use inter-frame prediction or bidirectional prediction techniques, respectively, to encode the processing block into a coded picture. In some video coding techniques, the merge mode may be an inter-frame picture prediction sub-mode, in which motion vectors are derived from one or more motion vector predictors without benefiting from coded motion vector components outside the predictors. In some other video coding techniques, there may be motion vector components applicable to the subject block. In one example, the video encoder (703) includes other components, such as a mode decision module (not shown) for determining the mode of the processing block.
在图7的示例中，视频编码器(703)包括如图7所示耦合在一起的帧间编码器(730)、帧内编码器(722)、残差计算器(723)、开关(726)、残差编码器(724)、通用控制器(721)和熵编码器(725)。In the example of Figure 7, the video encoder (703) includes an inter-frame encoder (730), an intra-frame encoder (722), a residual calculator (723), a switch (726), a residual encoder (724), a general controller (721), and an entropy encoder (725) coupled together as shown in Figure 7.
帧间编码器(730)被配置为接收当前块(例如,处理块)的样本,将该块与参考图片中的一个或多个参考块(例如,先前图片和后续图片中的块)进行比较,生成帧间预测信息(例如,根据帧间编码技术的冗余信息的描述、运动向量、合并模式信息),并且使用任何合适的技术基于帧间预测信息来计算帧间预测结果(例如,预测块)。在一些示例中,参考图片是基于编码视频信息解码的解码参考图片。An inter-frame encoder (730) is configured to receive samples of the current block (e.g., the processing block), compare that block with one or more reference blocks in a reference image (e.g., blocks in previous and subsequent images), generate inter-frame prediction information (e.g., descriptions of redundancy information based on inter-frame coding techniques, motion vectors, merging mode information), and compute inter-frame prediction results (e.g., predicted blocks) based on the inter-frame prediction information using any suitable technique. In some examples, the reference image is a decoded reference image based on the decoded encoded video information.
帧内编码器(722)被配置为接收当前块(例如,处理块)的样本,在一些情况下,将该块与已经在同一图片中编码的块进行比较,在变换后生成量化系数,并且在一些情况下,还生成帧内预测信息(例如,根据一种或多种帧内编码技术的帧内预测方向信息)。在一个示例中,帧内编码器(722)还基于同一图片中的帧内预测信息和参考块来计算帧内预测结果(例如,预测块)。The intra encoder (722) is configured to receive samples of the current block (e.g., the processing block), in some cases compare the block with blocks already encoded in the same image, generate quantization coefficients after transformation, and in some cases also generate intra prediction information (e.g., intra prediction direction information based on one or more intra coding techniques). In one example, the intra encoder (722) also computes an intra prediction result (e.g., a prediction block) based on the intra prediction information in the same image and a reference block.
通用控制器(721)被配置为确定通用控制数据并基于通用控制数据控制视频编码器(703)的其他组件。在一个示例中,通用控制器(721)确定块的模式,并基于该模式向开关(726)提供控制信号。例如,当模式是帧内模式时,通用控制器(721)控制开关(726)选择帧内模式结果供残差计算器(723)使用,并控制熵编码器(725)选择帧内预测信息并将帧内预测信息包括在比特流中;当模式是帧间模式时,通用控制器(721)控制开关(726)选择帧间预测结果供残差计算器(723)使用,并控制熵编码器(725)选择帧间预测信息并将帧间预测信息包括在比特流中。A general controller (721) is configured to determine general control data and control other components of the video encoder (703) based on the general control data. In one example, the general controller (721) determines the mode of a block and provides control signals to a switch (726) based on that mode. For example, when the mode is intra-frame mode, the general controller (721) controls the switch (726) to select intra-frame mode results for use by the residual calculator (723) and controls the entropy encoder (725) to select intra-frame prediction information and include the intra-frame prediction information in the bitstream; when the mode is inter-frame mode, the general controller (721) controls the switch (726) to select inter-frame prediction results for use by the residual calculator (723) and controls the entropy encoder (725) to select inter-frame prediction information and include the inter-frame prediction information in the bitstream.
残差计算器(723)被配置为计算接收的块和从帧内编码器(722)或帧间编码器(730)选择的预测结果之间的差(残差数据)。残差编码器(724)被配置为基于残差数据进行操作,以编码残差数据,来生成变换系数。在一个示例中,残差编码器(724)被配置为将残差数据从空间域转换到频域,并生成变换系数。然后对变换系数进行量化处理,以获得量化的变换系数。在各种实施例中,视频编码器(703)还包括残差解码器(728)。残差解码器(728)被配置为执行逆变换,并生成解码后的残差数据。解码的残差数据可以被帧内编码器(722)和帧间编码器(730)适当地使用。例如,帧间编码器(730)可基于解码的残差数据和帧间预测信息生成解码的块,帧内编码器(722)可基于解码的残差数据和帧内预测信息生成解码的块。适当地处理经解码的块,以生成经解码的图片,并且经解码的图片可以缓存在存储器电路(未示出)中,并且在一些示例中用作参考图片。A residual calculator (723) is configured to calculate the difference (residual data) between the received block and the prediction result selected from the intra encoder (722) or inter encoder (730). A residual encoder (724) is configured to operate based on the residual data to encode the residual data to generate transform coefficients. In one example, the residual encoder (724) is configured to transform the residual data from the spatial domain to the frequency domain and generate transform coefficients. The transform coefficients are then quantized to obtain quantized transform coefficients. In various embodiments, the video encoder (703) also includes a residual decoder (728). The residual decoder (728) is configured to perform an inverse transform and generate decoded residual data. The decoded residual data can be appropriately used by the intra encoder (722) and the inter encoder (730). For example, the inter encoder (730) can generate a decoded block based on the decoded residual data and inter-frame prediction information, and the intra encoder (722) can generate a decoded block based on the decoded residual data and intra-frame prediction information. The decoded blocks are processed appropriately to generate a decoded image, and the decoded image can be cached in a memory circuit (not shown) and used as a reference image in some examples.
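残差计算与重构的样本级运算可以用如下草图示意：残差为接收块与预测块之差，重构为预测块加上(解码后的)残差并截断到样本取值范围；这只是概念性示意，不代表任何具体编解码器实现。The sample-wise residual and reconstruction operations above can be sketched as follows: the residual is the difference between the received block and the prediction, and reconstruction adds the (decoded) residual back to the prediction with clipping to the sample range; this is a conceptual sketch, not any particular codec implementation.

```python
def compute_residual(block, prediction):
    """Residual data: sample-wise difference between block and prediction."""
    return [o - p for o, p in zip(block, prediction)]

def reconstruct(prediction, residual, max_val=255):
    """Add the residual back to the prediction, clipped to [0, max_val]."""
    return [min(max(p + r, 0), max_val) for p, r in zip(prediction, residual)]
```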
熵编码器(725)被配置为格式化比特流,以包括编码块。熵编码器(725)被配置为根据合适的标准,例如,HEVC标准,包括各种信息。在一个示例中,熵编码器(725)被配置为包括通用控制数据、所选预测信息(例如,帧内预测信息或帧间预测信息)、残差信息和比特流中的其他合适的信息。注意,根据所公开的主题,当在帧间模式或双向预测模式的合并子模式中对块进行编码时,没有残差信息。The entropy encoder (725) is configured to format the bitstream to include coded blocks. The entropy encoder (725) is configured to include various information according to a suitable standard, such as the HEVC standard. In one example, the entropy encoder (725) is configured to include general control data, selected prediction information (e.g., intra-frame prediction information or inter-frame prediction information), residual information, and other suitable information in the bitstream. Note that, according to the disclosed subject matter, no residual information is present when blocks are encoded in a merged sub-mode of inter-frame mode or bidirectional prediction mode.
图8示出了根据本公开的另一实施例的视频解码器(810)的示图。视频解码器(810)被配置为接收作为编码视频序列的一部分的编码图片,并对编码图片进行解码,以生成重构图片。在一个示例中,视频解码器(810)用于代替图4示例中的视频解码器(410)。Figure 8 illustrates a video decoder (810) according to another embodiment of the present disclosure. The video decoder (810) is configured to receive an encoded image as part of an encoded video sequence and decode the encoded image to generate a reconstructed image. In one example, the video decoder (810) is used instead of the video decoder (410) in the example of Figure 4.
在图8的示例中,视频解码器(810)包括如图8所示耦合在一起的熵解码器(871)、帧间解码器(880)、残差解码器(873)、重构模块(874)和帧内解码器(872)。In the example of Figure 8, the video decoder (810) includes an entropy decoder (871), an inter-frame decoder (880), a residual decoder (873), a reconstruction module (874), and an intra-frame decoder (872) coupled together as shown in Figure 8.
熵解码器(871)可以被配置为从编码图像中重构表示组成编码图像的语法元素的某些符号。这种符号可以包括例如对块进行编码的模式(例如，帧内模式、帧间模式、双向预测模式，其中后两种模式可以处于合并子模式或另一子模式)、可以识别分别由帧内解码器(872)或帧间解码器(880)用于预测的特定样本或元数据的预测信息(例如，帧内预测信息或帧间预测信息)、例如量化变换系数形式的残差信息等。在一个示例中，当预测模式是帧间或双向预测模式时，帧间预测信息被提供给帧间解码器(880)；并且当预测类型是帧内预测类型时，帧内预测信息被提供给帧内解码器(872)。残差信息可以经历逆量化，并被提供给残差解码器(873)。The entropy decoder (871) can be configured to reconstruct, from the coded picture, certain symbols representing the syntax elements of which the coded picture is made up. Such symbols may include, for example, the mode in which the block is coded (e.g., intra-frame mode, inter-frame mode, bidirectional prediction mode, the latter two in a merge sub-mode or another sub-mode), prediction information (e.g., intra-prediction information or inter-prediction information) that can identify specific samples or metadata used for prediction by the intra-frame decoder (872) or the inter-frame decoder (880), respectively, residual information in the form of, for example, quantized transform coefficients, and so on. In one example, when the prediction mode is inter-frame or bidirectional prediction mode, the inter-frame prediction information is provided to the inter-frame decoder (880); and when the prediction type is the intra-frame prediction type, the intra-frame prediction information is provided to the intra-frame decoder (872). The residual information may undergo inverse quantization and be provided to the residual decoder (873).
帧间解码器(880)被配置为接收帧间预测信息,并基于帧间预测信息生成帧间预测结果。The inter-frame decoder (880) is configured to receive inter-frame prediction information and generate inter-frame prediction results based on the inter-frame prediction information.
帧内解码器(872)被配置为接收帧内预测信息,并基于帧内预测信息生成预测结果。The intra-frame decoder (872) is configured to receive intra-frame prediction information and generate prediction results based on the intra-frame prediction information.
残差解码器(873)被配置为执行逆量化，以提取去量化的变换系数，并处理去量化的变换系数，以将残差从频域转换到空间域。残差解码器(873)还可能需要某些控制信息(例如，量化器参数(QP))，并且该信息可以由熵解码器(871)提供(数据路径未示出，因为这可能只是少量控制信息)。The residual decoder (873) is configured to perform inverse quantization to extract the dequantized transform coefficients and to process the dequantized transform coefficients to transform the residual from the frequency domain to the spatial domain. The residual decoder (873) may also require certain control information (for example, the quantizer parameter (QP)), and this information can be provided by the entropy decoder (871) (the data path is not shown, as this may only be a small amount of control information).
重构模块(874)被配置为在空间域中组合由残差解码器(873)输出的残差和预测结果(由帧间或帧内预测模块输出,视情况而定),以形成重构块,该重构块可以是重构图片的一部分,该重构图片又可以是重构视频的一部分。注意,可以执行诸如去块操作等其他合适的操作来提高视觉质量。The reconstruction module (874) is configured to combine the residual output by the residual decoder (873) and the prediction results (output by the inter-frame or intra-frame prediction module, as appropriate) in the spatial domain to form a reconstruction block, which can be part of a reconstructed image, which in turn can be part of a reconstructed video. Note that other suitable operations, such as deblocking, can be performed to improve visual quality.
注意，视频编码器(403)、(603)和(703)以及视频解码器(410)、(510)和(810)可以使用任何合适的技术来实现。在一个实施例中，视频编码器(403)、(603)和(703)以及视频解码器(410)、(510)和(810)可以使用一个或多个集成电路来实现。在另一个实施例中，视频编码器(403)、(603)和(703)以及视频解码器(410)、(510)和(810)可以使用执行软件指令的一个或多个处理器来实现。Note that the video encoders (403), (603), and (703) and the video decoders (410), (510), and (810) can be implemented using any suitable technology. In one embodiment, the video encoders (403), (603), and (703) and the video decoders (410), (510), and (810) can be implemented using one or more integrated circuits. In another embodiment, the video encoders (403), (603), and (703) and the video decoders (410), (510), and (810) can be implemented using one or more processors that execute software instructions.
本公开描述了与神经图像压缩技术和/或神经视频压缩技术相关的视频编码技术,例如,基于人工智能(AI)的神经图像压缩(NIC)。本公开的方面包括NIC中的内容自适应在线训练,例如,用于基于神经网络的端到端(E2E)优化图像编码框架的分块内容自适应在线训练NIC方法。神经网络(NN)可以包括人工神经网络(ANN),例如,深度神经网络(DNN)、卷积神经网络(CNN)等。This disclosure describes video coding techniques related to neural image compression and/or neural video compression, such as artificial intelligence (AI)-based neural image compression (NIC). Aspects of this disclosure include content-adaptive online training in NICs, for example, a chunked content-adaptive online training method for optimizing an image coding framework based on neural networks for end-to-end (E2E) optimization. Neural networks (NNs) can include artificial neural networks (ANNs), such as deep neural networks (DNNs), convolutional neural networks (CNNs), etc.
在一个实施例中,相关的混合视频编解码器很难被整体优化。例如,混合视频编解码器中单个模块(例如,编码器)的改进可能不会导致整体性能的编码增益。在基于NN的视频编码框架中,可以通过执行学习过程或训练过程(例如,机器学习过程)来从输入到输出共同优化不同的模块,以改善最终目标(例如,率失真性能,例如,本公开中描述的率失真损耗L),并因此产生E2E优化的NIC。In one embodiment, it is difficult to optimize a hybrid video codec as a whole. For example, improvements to a single module (e.g., the encoder) in a hybrid video codec may not result in a coding gain for the overall performance. In a neural network-based video coding framework, different modules can be jointly optimized from input to output by performing a learning or training process (e.g., a machine learning process) to improve the final objective (e.g., rate-distortion performance, such as the rate-distortion loss L described in this disclosure) and thus produce an E2E optimized NIC.
可以如下描述示例性NIC框架或系统。NIC框架可以使用输入块x作为神经网络编码器(例如，基于神经网络(例如，DNN)的编码器)的输入，以计算压缩表示(例如，紧凑表示)。该压缩表示可以是紧凑的，例如，用于存储和传输目的。神经网络解码器(例如，基于神经网络(例如，DNN)的解码器)可以使用压缩表示作为输入来重构输出块(也称为重构块)。在各种实施例中，输入块x和重构块在空间域中，压缩表示在不同于空间域的域中。在一些示例中，压缩表示被量化和熵编码。An exemplary NIC framework or system can be described as follows. The NIC framework can use an input block x as input to a neural network encoder (e.g., a neural-network-based (e.g., DNN-based) encoder) to compute a compressed representation (e.g., a compact representation). The compressed representation can be compact, for example, for storage and transmission purposes. A neural network decoder (e.g., a neural-network-based (e.g., DNN-based) decoder) can use the compressed representation as input to reconstruct an output block (also called a reconstructed block). In various embodiments, the input block x and the reconstructed block are in the spatial domain, while the compressed representation is in a domain different from the spatial domain. In some examples, the compressed representation is quantized and entropy-encoded.
在一些示例中，NIC框架可以使用变分自动编码器(VAE)结构。在VAE结构中，神经网络编码器可以直接使用整个输入块x作为神经网络编码器的输入。整个输入块x可以穿过一组作为黑盒工作的神经网络层，来计算压缩表示。压缩表示是神经网络编码器的输出。神经网络解码器可以将整个压缩表示作为输入。压缩表示可以穿过另一组作为另一个黑盒工作的神经网络层，来计算重构块。率失真(R-D)损耗L=λD+R可以被优化，以利用折衷超参数λ实现重构块的失真损耗D和紧凑表示的比特消耗R之间的折衷。In some examples, the NIC framework can use a variational autoencoder (VAE) structure. In the VAE structure, the neural network encoder can directly use the entire input block x as its input. The entire input block x can pass through a set of neural network layers that work as a black box to compute the compressed representation. The compressed representation is the output of the neural network encoder. The neural network decoder can take the entire compressed representation as input. The compressed representation can pass through another set of neural network layers that work as another black box to compute the reconstructed block. The rate-distortion (R-D) loss L = λD + R can be optimized to achieve a trade-off between the distortion loss D of the reconstructed block and the bit consumption R of the compact representation, using the trade-off hyperparameter λ.
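上述率失真折衷可以用如下草图计算；这里仅为示意而假设失真D取均方误差、码率R为比特数。The rate-distortion trade-off above can be computed as in the following sketch; purely for illustration, the distortion D is assumed to be mean squared error and the rate R a bit count.

```python
def mse(x, x_hat):
    """Distortion D as mean squared error between input and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def rd_loss(x, x_hat, bits, lambda_):
    """R-D loss L = lambda * D + R with trade-off hyperparameter lambda."""
    return lambda_ * mse(x, x_hat) + bits
```

较大的λ更偏向低失真，较小的λ更偏向低码率。A larger λ favors lower distortion; a smaller λ favors a lower bit rate.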
神经网络(例如,ANN)可以从示例中学习执行任务,而无需特定于任务的编程。ANN可以配置有连接的节点或人工神经元。节点之间的连接可以将信号从第一节点传输到第二节点(例如,接收节点),并且该信号可以通过权重来修改,该权重可以由该连接的权重系数来指示。接收节点可以处理来自向接收节点发送信号的节点的信号(即,接收节点的输入信号),然后通过对输入信号应用函数来生成输出信号。该函数可以是线性函数。在一个示例中,输出信号是输入信号的加权和。在一个示例中,输出信号被可以由偏置项指示的偏置进一步修改,因此输出信号是偏置和输入信号的加权和的和。该函数可以包括例如对输入信号的加权和或偏置和加权和的和的非线性运算。输出信号可以被发送到连接到接收节点的节点(下游节点)。ANN可以由参数(例如,连接和/或偏置的权重)来表示或配置。可以通过用可以迭代调整权重和/或偏置的示例训练ANN来获得权重和/或偏置。配置有所确定的权重和/或所确定的偏置的经训练的ANN可用于执行任务。Neural networks (e.g., ANNs) can learn to perform tasks from examples without task-specific programming. An ANN can be configured with connected nodes or artificial neurons. Connections between nodes can transmit signals from a first node to a second node (e.g., a receiving node), and these signals can be modified by weights, which can be indicated by the weight coefficients of the connections. A receiving node can process signals from nodes that send signals to it (i.e., the input signals to the receiving node) and then generate an output signal by applying a function to the input signals. This function can be a linear function. In one example, the output signal is a weighted sum of the input signals. In another example, the output signal is further modified by a bias that can be indicated by a bias term, so the output signal is the sum of the bias and the weighted sum of the input signals. The function can include, for example, a nonlinear operation on the weighted sum of the input signals or the weighted sum of the bias. The output signal can be sent to nodes connected to the receiving node (downstream nodes). An ANN can be represented or configured by parameters (e.g., the weights of the connections and/or biases). The weights and/or biases can be obtained by training the ANN with examples where the weights and/or biases can be iteratively adjusted. A trained ANN with defined weights and/or defined biases can be used to perform tasks.
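上述单个人工神经元的运算(输入信号的加权和，加偏置，再经非线性运算)可以用如下草图示意；这里以ReLU作为非线性运算的一个假设示例。The operation of a single artificial neuron described above (a weighted sum of input signals, plus a bias, followed by a nonlinear operation) can be sketched as follows; ReLU is used here as an assumed example of the nonlinearity.

```python
def neuron(inputs, weights, bias):
    """Output = nonlinearity(bias + weighted sum of input signals)."""
    weighted_sum = sum(w * s for w, s in zip(weights, inputs))
    return max(0.0, weighted_sum + bias)  # ReLU as the nonlinear operation
```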
ANN中的节点可以用任何合适的架构来组织。在各种实施例中,ANN中的节点被组织成层,包括接收到ANN的输入信号的输入层和从ANN输出输出信号的输出层。在一个实施例中,ANN还包括输入层和输出层之间的层,例如,隐藏层。不同层可以对不同层的相应输入执行不同种类的变换。信号可以从输入层传输到输出层。Nodes in an ANN can be organized using any suitable architecture. In various embodiments, nodes in an ANN are organized into layers, including an input layer that receives the input signals to the ANN and an output layer that outputs signals from the ANN. In one embodiment, the ANN also includes layers between the input and output layers, such as hidden layers. Different layers can perform different kinds of transformations on their respective inputs. Signals can be transmitted from the input layer to the output layer.
在输入层和输出层之间具有多层的ANN可以被称为DNN。在一个实施例中,DNN是前馈网络,其中,数据从输入层流向输出层,而没有环回。在一个示例中,DNN是全连接网络,其中,一层中的每个节点都连接到下一层中的所有节点。在一个实施例中,DNN是递归神经网络(RNN),其中,数据可以向任何方向流动。在一个实施例中,DNN是CNN。An ANN with multiple layers between the input and output layers can be called a DNN. In one embodiment, a DNN is a feedforward network, where data flows from the input layer to the output layer without loopback. In one example, a DNN is a fully connected network, where each node in one layer is connected to all nodes in the next layer. In one embodiment, a DNN is a recurrent neural network (RNN), where data can flow in any direction. In one embodiment, a DNN is a CNN.
CNN可以包括输入层、输出层以及输入层和输出层之间的隐藏层。隐藏层可以包括执行卷积(例如,二维(2D)卷积)的卷积层(例如,用在编码器中)。在一个实施例中,在卷积层中执行的2D卷积是在卷积核(也称为滤波器或信道,例如,5×5矩阵)和卷积层的输入信号(例如,2D矩阵,例如,2D块,256×256矩阵)之间。在各种示例中,卷积核的维度(例如,5×5)小于输入信号的维度(例如,256×256)。因此,输入信号(例如,256×256矩阵)中被卷积核覆盖的部分(例如,5×5区域)小于输入信号的区域(例如,256×256区域),因此可以被称为下一层中相应节点中的感受野。A CNN may include an input layer, an output layer, and hidden layers between the input and output layers. Hidden layers may include convolutional layers (e.g., used in an encoder) that perform convolutions (e.g., two-dimensional (2D) convolutions). In one embodiment, the 2D convolution performed in the convolutional layer is between the convolutional kernel (also called a filter or channel, e.g., a 5×5 matrix) and the input signal of the convolutional layer (e.g., a 2D matrix, such as a 2D block, a 256×256 matrix). In various examples, the dimension of the convolutional kernel (e.g., 5×5) is smaller than the dimension of the input signal (e.g., 256×256). Therefore, the portion of the input signal (e.g., a 256×256 matrix) covered by the convolutional kernel (e.g., a 5×5 region) is smaller than the region of the input signal (e.g., a 256×256 region), and can therefore be referred to as the receptive field in the corresponding node of the next layer.
在卷积期间,计算卷积核和输入信号中相应感受野的点积。因此,卷积核的每个元素是应用于感受野中相应样本的权重,因此卷积核包括权重。例如,由5×5矩阵表示的卷积核具有25个权重。在一些示例中,向卷积层的输出信号施加偏置,并且输出信号基于点积和偏置的和。During convolution, the dot product of the convolution kernel and the corresponding receptive field in the input signal is calculated. Therefore, each element of the convolution kernel is a weight applied to the corresponding sample in the receptive field, thus the convolution kernel comprises weights. For example, a convolution kernel represented by a 5×5 matrix has 25 weights. In some examples, a bias is applied to the output signal of the convolutional layer, and the output signal is based on the sum of the dot product and the bias.
卷积核可以沿着输入信号(例如,2D矩阵)移动被称为步幅的大小,因此卷积运算生成特征图或激活图(例如,另一个2D矩阵),这又有助于CNN中下一层的输入。例如,输入信号是具有256×256个样本的2D块,步幅是2个样本(例如,步幅为2)。对于步幅2,卷积核沿着X方向(例如,水平方向)和/或Y方向(例如,垂直方向)移动2个样本。The convolutional kernel can move along the input signal (e.g., a 2D matrix) by a size called stride, so the convolution operation generates feature maps or activation maps (e.g., another 2D matrix), which in turn contribute to the input of the next layer in the CNN. For example, the input signal is a 2D block with 256×256 samples, and the stride is 2 samples (e.g., stride 2). For a stride of 2, the convolutional kernel moves 2 samples along the X direction (e.g., horizontal) and/or the Y direction (e.g., vertical).
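上述带步幅的二维卷积可以用如下草图示意：卷积核按步幅在输入上滑动，对每个感受野取点积；无填充时输出特征图的每一维大小为floor((输入−核)/步幅)+1。这是概念性示意，不针对任何特定框架。The strided 2D convolution above can be sketched as follows: the kernel slides over the input by the stride, taking a dot product over each receptive field; without padding, each output dimension is floor((input − kernel)/stride) + 1. This is a conceptual sketch, not tied to any particular framework.

```python
def conv2d(inp, kernel, stride=1, bias=0.0):
    """Valid (no-padding) 2D convolution of a 2D input with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(inp) - kh + 1, stride):
        row = []
        for j in range(0, len(inp[0]) - kw + 1, stride):
            acc = bias
            for a in range(kh):            # dot product over the receptive field
                for b in range(kw):
                    acc += kernel[a][b] * inp[i + a][j + b]
            row.append(acc)
        out.append(row)
    return out                              # the feature map (activation map)
```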
多个卷积核可以在相同的卷积层中应用于输入信号,以分别生成多个特征图,其中,每个特征图可以表示输入信号的特定特征。一般来说,具有N个信道(即,N个卷积核)的卷积层(每个卷积核具有M×M个样本和步幅S)可以被指定为Conv:MxM cN sS。例如,具有192个信道的卷积层(每个卷积核具有5×5个样本,并且步幅为2)被指定为Conv:5x5 c192s2。隐藏层可以包括执行去卷积(例如,2D去卷积)的去卷积层(例如,用在解码器中)。去卷积是卷积的逆运算。具有192个信道的去卷积层(每个去卷积核具有5×5个样本,并且步幅为2)被指定为DeConv:5x5 c192 s2。Multiple convolutional kernels can be applied to the input signal within the same convolutional layer to generate multiple feature maps, each representing a specific feature of the input signal. Generally, a convolutional layer with N channels (i.e., N convolutional kernels) (each kernel having M×M samples and a stride S) can be specified as Conv:MxM cN sS. For example, a convolutional layer with 192 channels (each kernel having 5×5 samples and a stride of 2) is specified as Conv:5x5 c192s2. Hidden layers can include deconvolutional layers that perform deconvolution (e.g., 2D deconvolution) (e.g., used in decoders). Deconvolution is the inverse operation of convolution. A deconvolutional layer with 192 channels (each deconvolutional kernel having 5×5 samples and a stride of 2) is specified as DeConv:5x5 c192 s2.
在各种实施例中,CNN具有以下好处。CNN中的多个可学习参数(即,待训练的参数)可以显著小于DNN(例如,前馈DNN)中的多个可学习参数。在CNN中,相对大量的节点可以共享相同的滤波器(例如,相同的权重)和相同的偏置(如果使用偏置的话),因此可以减少存储器占用,因为可以在共享相同滤波器的所有感受野上使用单个偏置和单个权重向量。例如,对于具有100×100个样本的输入信号,具有5×5个样本的卷积核的卷积层具有25个可学习的参数(例如,权重)。如果使用偏置,则一个信道使用26个可学习参数(例如,25个权重和一个偏置)。如果卷积层具有N个信道,则总的可学习参数为26xN。另一方面,对于DNN中全连接层,100x100(即10000)个权重用于下一层中的每个节点。如果下一层有L个节点,则总的可学习参数是10000xL。In various embodiments, CNNs offer the following advantages. The number of learnable parameters (i.e., parameters to be trained) in a CNN can be significantly smaller than the number of learnable parameters in a DNN (e.g., a feedforward DNN). In a CNN, a relatively large number of nodes can share the same filters (e.g., the same weights) and the same biases (if biases are used), thus reducing memory footprint because a single bias and a single weight vector can be used across all receptive fields sharing the same filters. For example, for an input signal with 100×100 samples, a convolutional layer with a kernel of 5×5 samples has 25 learnable parameters (e.g., weights). If biases are used, one channel uses 26 learnable parameters (e.g., 25 weights and one bias). If the convolutional layer has N channels, the total learnable parameters are 26xN. On the other hand, for a fully connected layer in a DNN, 100x100 (i.e., 10000) weights are used for each node in the next layer. If the next layer has L nodes, the total learnable parameters are 10000xL.
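正文中的可学习参数对比可以直接按下式计算：卷积层每信道为核权重数加1个偏置，全连接层为输入样本数乘以下一层节点数。The learnable-parameter comparison in the text can be computed directly: per channel, a convolutional layer uses the kernel's weights plus one bias, while a fully connected layer uses (input samples) × (next-layer nodes) weights.

```python
def conv_params(kernel, channels, use_bias=True):
    """Learnable parameters of a conv layer: (k*k weights [+1 bias]) per channel."""
    return (kernel * kernel + (1 if use_bias else 0)) * channels

def fc_params(in_h, in_w, out_nodes):
    """Weights of a fully connected layer on an in_h x in_w input."""
    return in_h * in_w * out_nodes
```

例如，5×5核、1个信道带偏置为26个参数，192个信道为26×192个参数；而100×100输入的全连接层每个输出节点就需要10000个权重。For example, a 5×5 kernel with one channel and a bias gives 26 parameters, and 192 channels give 26×192; a fully connected layer on a 100×100 input needs 10000 weights per output node.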
CNN还可以包括一个或多个其他层,例如,池化层、可以将一层中的每个节点连接到另一层中的每个节点的全连接层、标准化层等。CNN中的层可以以任何合适的顺序和任何合适的架构(例如,前馈架构、递归架构)排列。在一个示例中,卷积层之后是其他层,例如,池化层、全连接层、标准化层等。CNNs can also include one or more other layers, such as pooling layers, fully connected layers that connect each node in one layer to each node in another layer, normalization layers, etc. The layers in a CNN can be arranged in any suitable order and with any suitable architecture (e.g., feedforward architecture, recursive architecture). In one example, convolutional layers are followed by other layers, such as pooling layers, fully connected layers, normalization layers, etc.
通过将来自一层的多个节点的输出组合到下一层的单个节点中,可以使用池化层来减少数据的维度。下面描述了将特征图作为输入的池化层的池化操作。该描述可以适当地适用于其他输入信号。特征图可以被分成子区域(例如,矩形子区域),并且相应子区域中的特征可以被独立地下采样(或池化)为单个值,例如,通过在平均池化中取平均值或在最大池化中取最大值。Pooling layers can be used to reduce the dimensionality of data by combining the outputs of multiple nodes in one layer into a single node in the next layer. The pooling operation of a pooling layer with a feature map as input is described below. This description can be appropriately applied to other input signals. The feature map can be divided into sub-regions (e.g., rectangular sub-regions), and features within the respective sub-regions can be independently downsampled (or pooled) into single values, for example, by averaging in average pooling or maximizing in max pooling.
池化层可以执行池化,例如,本地池化、全局池化、最大池化、平均池化等。池化是非线性下采样的一种形式。本地池化组合特征图中的少量节点(例如,本地节点集群,例如,2×2节点)。例如,全局池化可以组合特征图的所有节点。Pooling layers can perform pooling operations, such as local pooling, global pooling, max pooling, and average pooling. Pooling is a form of non-linear downsampling. Local pooling combines a small number of nodes in a feature map (e.g., a local cluster of nodes, such as a 2×2 node cluster). Global pooling, for example, can combine all nodes in a feature map.
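上述将特征图划分为子区域并各自下采样为单个值的池化操作可以用如下2×2草图示意(假设特征图的宽高均为偶数)。The pooling operation above, which partitions a feature map into sub-regions and downsamples each to a single value, can be sketched for the 2×2 case as follows (assuming even feature-map dimensions).

```python
def pool2x2(fmap, mode="max"):
    """Downsample each 2x2 sub-region to one value (max or average pooling)."""
    out = []
    for i in range(0, len(fmap), 2):
        row = []
        for j in range(0, len(fmap[0]), 2):
            vals = [fmap[i][j], fmap[i][j + 1],
                    fmap[i + 1][j], fmap[i + 1][j + 1]]
            row.append(max(vals) if mode == "max" else sum(vals) / 4)
        out.append(row)
    return out
```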
池化层可以减少表示的大小，从而减少CNN中的参数数量、内存占用和计算量。在一个示例中，在CNN中的连续卷积层之间插入一个池化层。在一个示例中，池化层之后是激活函数，例如，修正线性单元(ReLU)层。在一个示例中，在CNN中的连续卷积层之间省略了池化层。Pooling layers can reduce the size of the representation, thereby reducing the number of parameters, memory footprint, and computational cost in a CNN. In one example, a pooling layer is inserted between consecutive convolutional layers in a CNN. In another example, the pooling layer is followed by an activation function, such as a Rectified Linear Unit (ReLU) layer. In yet another example, the pooling layer is omitted between consecutive convolutional layers in a CNN.
归一化层可以是ReLU、泄漏ReLU、广义除法归一化(GDN)、逆GDN(IGDN)等。ReLU可以应用非饱和激活函数，通过将负值设置为零来从输入信号(例如，特征图)中去除负值。对于负值，泄漏ReLU可以具有小斜率(例如，0.01)，而不是平坦的斜率(例如，0)。因此，如果值x大于0，则来自泄漏ReLU的输出是x。否则，来自泄漏ReLU的输出是值x乘以小斜率(例如，0.01)。在一个示例中，斜率是在训练之前确定的，因此在训练期间不学习。Normalization layers can be ReLU, leaky ReLU, generalized divisive normalization (GDN), inverse GDN (IGDN), etc. ReLU can apply a non-saturating activation function to remove negative values from the input signal (e.g., a feature map) by setting them to zero. For negative values, leaky ReLU can have a small slope (e.g., 0.01) instead of a flat slope (e.g., 0). Therefore, if the value x is greater than 0, the output from leaky ReLU is x. Otherwise, the output from leaky ReLU is the value x multiplied by the small slope (e.g., 0.01). In one example, the slope is determined before training and therefore is not learned during training.
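ReLU与泄漏ReLU的逐元素定义可以直接写成如下草图；正文指出斜率(例如0.01)在训练前固定、不参与学习。The element-wise definitions of ReLU and leaky ReLU can be written directly as the following sketch; as the text notes, the slope (e.g., 0.01) is fixed before training and not learned.

```python
def relu(x):
    """ReLU: negative values are set to zero."""
    return x if x > 0 else 0.0

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: negative values are multiplied by a small fixed slope."""
    return x if x > 0 else slope * x
```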
在基于NN的图像压缩方法中,例如,基于DNN或基于CNN的图像压缩方法,不是直接编码整个图像,而是基于块或分块编码机制可以有效地压缩基于DNN的视频编码标准(例如,FVC)中的图像。整个图像可以被分割成大小相同(或不同)的块,并且可以单独压缩这些块。在一个实施例中,图像可以被分成大小相等或不相等的块。可以压缩分割的块而不是图像。图9A示出了根据本公开实施例的分块图像编码的示例。图像(980)可以被分割成块,例如,块(981)-(996)。例如,可以根据扫描顺序来压缩块(981)-(996)。在图9A所示的示例中,已经压缩块(981)-(989),并且将压缩块(990)-(996)。In NN-based image compression methods, such as DNN-based or CNN-based image compression methods, instead of directly encoding the entire image, block or chunked encoding mechanisms can effectively compress images in DNN-based video coding standards (e.g., FVC). The entire image can be divided into blocks of the same (or different) size, and these blocks can be compressed individually. In one embodiment, the image can be divided into blocks of equal or unequal size. The segmented blocks, rather than the image itself, can be compressed. Figure 9A illustrates an example of chunked image encoding according to an embodiment of this disclosure. The image (980) can be divided into blocks, for example, blocks (981)-(996). For example, blocks (981)-(996) can be compressed according to the scan order. In the example shown in Figure 9A, blocks (981)-(989) have already been compressed, and blocks (990)-(996) will be compressed.
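上述按扫描顺序将图像分割成块的做法可以用如下草图示意；这里假设按光栅扫描顺序枚举块，且当图像尺寸不是块尺寸的整数倍时边缘块可以较小。The block partitioning in scan order described above can be sketched as follows; raster-scan enumeration is assumed, and edge blocks may be smaller when the image size is not a multiple of the block size.

```python
def split_into_blocks(width, height, block_size):
    """Enumerate (x, y, block_w, block_h) tiles of an image in raster-scan order."""
    blocks = []
    for y in range(0, height, block_size):
        for x in range(0, width, block_size):
            bw = min(block_size, width - x)   # edge blocks may be narrower
            bh = min(block_size, height - y)  # or shorter
            blocks.append((x, y, bw, bh))
    return blocks
```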
图像可以被视为一个块。在一个实施例中,压缩图像,而没有被分成块。整个图像可以是E2E NIC框架的输入。An image can be viewed as a single block. In one embodiment, the image is compressed without being divided into blocks. The entire image can be the input to an E2E NIC framework.
图9B示出了根据本公开实施例的示例性NIC框架(900)(例如,NIC系统)。NIC框架(900)可以基于神经网络,例如,DNN和/或CNN。NIC框架(900)可用于压缩(例如,编码)块和解压缩(例如,解码或重构)压缩块(例如,编码块)。NIC框架(900)可以包括使用神经网络来实现的两个子神经网络,即第一子神经网络(951)和第二子NN(952)。Figure 9B illustrates an exemplary NIC framework (900) (e.g., a NIC system) according to an embodiment of this disclosure. The NIC framework (900) may be based on a neural network, such as a DNN and/or a CNN. The NIC framework (900) may be used to compress (e.g., encode) blocks and decompress (e.g., decode or reconstruct) compressed blocks (e.g., encoded blocks). The NIC framework (900) may include two sub-neural networks implemented using a neural network, namely a first sub-neural network (951) and a second sub-NN (952).
第一子NN(951)可以类似于自动编码器，并且可以被训练，以生成输入块x的压缩块，并且解压缩压缩块，以获得重构块。第一子NN(951)可以包括多个组件(或模块)，例如，主编码器神经网络(或主编码器网络)(911)、量化器(912)、熵编码器(913)、熵解码器(914)和主解码器神经网络(或主解码器网络)(915)。参考图9B，主编码器网络(911)可以从输入块x(例如，要压缩或编码的块)生成潜在或潜在表示y。在一个示例中，主编码器网络(911)使用CNN来实现。潜在表示y和输入块x之间的关系可以使用等式2来描述。The first sub-NN (951) can be similar to an autoencoder and can be trained to generate a compressed block of an input block x and to decompress the compressed block to obtain a reconstructed block. The first sub-NN (951) can include multiple components (or modules), such as a main encoder neural network (or main encoder network) (911), a quantizer (912), an entropy encoder (913), an entropy decoder (914), and a main decoder neural network (or main decoder network) (915). Referring to Figure 9B, the main encoder network (911) can generate a latent or latent representation y from the input block x (e.g., the block to be compressed or encoded). In one example, the main encoder network (911) is implemented using a CNN. The relationship between the latent representation y and the input block x can be described using Equation 2.
y=f1(x;θ1) (等式2)y = f1(x; θ1) (Equation 2)
其中,参数θ1表示参数,例如,主编码器网络(911)中卷积核中使用的权重和偏置(如果在主编码器网络(911)中使用偏置)。Here, parameter θ1 represents parameters, such as the weights and biases used in the convolution kernels of the master encoder network (911) (if biases are used in the master encoder network (911)).
可以使用量化器(912)量化潜在表示y，以生成量化的潜在ŷ。可以压缩量化的潜在ŷ，例如，熵编码器(913)使用无损压缩，来生成压缩块(例如，编码块)(931)，该编码块(931)是输入块x的压缩表示。熵编码器(913)可以使用熵编码技术，例如，Huffman编码、算术编码等。在一个示例中，熵编码器(913)使用算术编码，并且是算术编码器。在一个示例中，在编码比特流中传输编码块(931)。The latent representation y can be quantized using the quantizer (912) to generate a quantized latent ŷ. The quantized latent ŷ can be compressed, for example, by the entropy encoder (913) using lossless compression, to generate a compressed block (e.g., a coded block) (931), which is a compressed representation of the input block x. The entropy encoder (913) can use entropy coding techniques, such as Huffman coding, arithmetic coding, etc. In one example, the entropy encoder (913) uses arithmetic coding and is an arithmetic encoder. In one example, the coded block (931) is transmitted in a coded bitstream.
编码块(931)可以由熵解码器(914)解压缩(例如，熵解码)，以生成输出。熵解码器(914)可以使用与熵编码器(913)中使用的熵编码技术对应的熵解码技术，例如，Huffman编码、算术编码等。在一个示例中，熵解码器(914)使用算术解码，并且是算术解码器。在一个示例中，在熵编码器(913)中使用无损压缩，在熵解码器(914)中使用无损解压缩，并且可以忽略诸如由于编码块(931)的传输而产生的噪声，来自熵解码器(914)的输出是量化的潜在ŷ。The coded block (931) can be decompressed (e.g., entropy decoded) by the entropy decoder (914) to generate an output. The entropy decoder (914) can use an entropy decoding technique corresponding to the entropy coding technique used in the entropy encoder (913), such as Huffman coding, arithmetic coding, and the like. In one example, with lossless compression used in the entropy encoder (913), lossless decompression used in the entropy decoder (914), and noise such as that introduced by transmission of the coded block (931) being negligible, the output from the entropy decoder (914) is the quantized latent ŷ.
主解码器网络(915)可以解码量化的潜在ŷ，以生成重构块x̂。在一个示例中，主解码器网络(915)使用CNN来实现。重构块x̂(即，主解码器网络(915)的输出)和量化的潜在ŷ(即，主解码器网络(915)的输入)之间的关系可以使用等式3来描述。The master decoder network (915) can decode the quantized latent ŷ to generate the reconstructed block x̂. In one example, the master decoder network (915) is implemented using a CNN. The relationship between the reconstructed block x̂ (i.e., the output of the master decoder network (915)) and the quantized latent ŷ (i.e., the input of the master decoder network (915)) can be described using Equation 3.
x̂=f2(ŷ;θ2)　等式3　x̂ = f2(ŷ; θ2)　Equation 3
其中，参数θ2表示参数，例如，在主解码器网络(915)中的卷积核中使用的权重和偏置(如果在主解码器网络(915)中使用偏置)。因此，第一子NN(951)可以压缩(例如，编码)输入块x，以获得编码块(931)，并且解压缩(例如，解码)编码块(931)，以获得重构块x̂。由于量化器(912)引入的量化损耗，重构块x̂可能不同于输入块x。Here, the parameter θ2 represents parameters, such as the weights and biases used in the convolution kernels in the master decoder network (915) (if biases are used in the master decoder network (915)). Thus, the first sub-NN (951) can compress (e.g., encode) the input block x to obtain the coded block (931) and decompress (e.g., decode) the coded block (931) to obtain the reconstructed block x̂. Due to the quantization loss introduced by the quantizer (912), the reconstructed block x̂ can be different from the input block x.
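第一子NN(951)的编码-量化-解码往返可以用如下最小草图示意(假设性示例：用简单的线性映射代替等式2和等式3中的CNN；名称W1、W2及各维度仅作说明用途)。The encode-quantize-decode round trip of the first sub-NN (951) can be sketched minimally as follows (a hypothetical example: simple linear maps stand in for the CNNs of Equations 2 and 3; the names W1 and W2 and all dimensions are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the main encoder/decoder networks (911)/(915):
# simple linear maps W1 ("theta_1") and W2 ("theta_2"), not the patent's CNNs.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 8))

def main_encoder(x):
    # y = f1(x; theta_1), cf. Equation 2
    return W1 @ x

def quantize(y):
    # Quantizer (912): round each latent element to the nearest integer
    return np.round(y)

def main_decoder(y_hat):
    # x_hat = f2(y_hat; theta_2), cf. Equation 3
    return W2 @ y_hat

x = rng.normal(size=16)       # input block (flattened, illustrative)
y = main_encoder(x)           # latent representation
y_hat = quantize(y)           # quantized latent
x_hat = main_decoder(y_hat)   # reconstructed block
# Rounding in the quantizer makes x_hat differ from x (quantization loss).
```

在该草图中，熵编码/解码被省略，因为无损压缩不改变量化的潜在。In this sketch, entropy encoding/decoding is omitted because lossless compression does not change the quantized latent.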
第二子NN(952)可以在用于熵编码的量化的潜在ŷ上学习熵模型(例如，先验概率模型)。因此，熵模型可以是取决于输入块x的条件熵模型，例如，高斯混合模型(GMM)、高斯尺度模型(GSM)。第二子NN(952)可以包括上下文模型NN(916)、熵参数NN(917)、超级编码器(921)、量化器(922)、熵编码器(923)、熵解码器(924)和超级解码器(925)。在上下文模型NN(916)中使用的熵模型可以是潜在(例如，量化的潜在ŷ)的自回归模型。在一个示例中，超级编码器(921)、量化器(922)、熵编码器(923)、熵解码器(924)和超级解码器(925)形成超级神经网络(例如，超级NN)。超级神经网络可以表示对校正基于上下文的预测有用的信息。来自上下文模型NN(916)和超级神经网络的数据可以通过熵参数NN(917)来组合。熵参数NN(917)可以生成参数，例如，用于诸如条件高斯熵模型(例如，GMM)等熵模型的均值和尺度参数。The second sub-NN (952) can learn an entropy model (e.g., a prior probabilistic model) over the quantized latent ŷ used in entropy coding. Thus, the entropy model can be a conditional entropy model that depends on the input block x, such as a Gaussian mixture model (GMM) or a Gaussian scale model (GSM). The second sub-NN (952) can include a context model NN (916), an entropy parameter NN (917), a super encoder (921), a quantizer (922), an entropy encoder (923), an entropy decoder (924), and a super decoder (925). The entropy model used in the context model NN (916) can be an autoregressive model over the latents (e.g., the quantized latent ŷ). In one example, the super encoder (921), quantizer (922), entropy encoder (923), entropy decoder (924), and super decoder (925) form a super neural network (e.g., a super NN). The super neural network can represent information useful for correcting context-based predictions. Data from the context model NN (916) and the super neural network can be combined by the entropy parameter NN (917). The entropy parameter NN (917) can generate parameters, such as mean and scale parameters, for an entropy model such as a conditional Gaussian entropy model (e.g., a GMM).
参考图9B，在编码器侧，来自量化器(912)的量化的潜在ŷ被馈入上下文模型NN(916)。在解码器侧，来自熵解码器(914)的量化的潜在ŷ被馈入上下文模型NN(916)。上下文模型NN(916)可以使用诸如CNN之类的神经网络来实现。上下文模型NN(916)可以基于上下文ŷ<i生成输出ocm,i，其中，上下文ŷ<i是对上下文模型NN(916)可用的量化的潜在。上下文ŷ<i可以包括编码器侧的先前量化的潜在或解码器侧的先前熵解码的量化的潜在。上下文模型NN(916)的输出ocm,i和输入(例如，ŷ<i)之间的关系可以使用等式4来描述。Referring to Figure 9B, on the encoder side, the quantized latent ŷ from the quantizer (912) is fed into the context model NN (916). On the decoder side, the quantized latent ŷ from the entropy decoder (914) is fed into the context model NN (916). The context model NN (916) can be implemented using a neural network, such as a CNN. The context model NN (916) can generate an output ocm,i based on a context ŷ<i, where the context ŷ<i is the quantized latents available to the context model NN (916). The context ŷ<i can include previously quantized latents on the encoder side or previously entropy-decoded quantized latents on the decoder side. The relationship between the output ocm,i of the context model NN (916) and the input (e.g., ŷ<i) can be described using Equation 4.
ocm,i=f3(ŷ<i;θ3)　等式4　ocm,i = f3(ŷ<i; θ3)　Equation 4
其中，参数θ3表示参数，例如，在上下文模型NN(916)中的卷积核中使用的权重和偏置(如果在上下文模型NN(916)中使用偏置)。Here, the parameter θ3 represents parameters, such as the weights and biases used in the convolution kernels in the context model NN (916) (if biases are used in the context model NN (916)).
来自上下文模型NN(916)的输出ocm,i和来自超级解码器(925)的输出ohc被馈入熵参数NN(917),以生成输出oep。熵参数NN(917)可以使用诸如CNN之类的神经网络来实现。熵参数NN(917)的输出oep和输入(例如,ocm,i和ohc)之间的关系可以使用等式5来描述。The outputs o cm,i from the context model NN (916) and o hc from the super decoder (925) are fed into the entropy parameter NN (917) to generate the output o ep . The entropy parameter NN (917) can be implemented using a neural network such as a CNN. The relationship between the output o ep of the entropy parameter NN (917) and the inputs (e.g., o cm,i and o hc ) can be described using Equation 5.
oep=f4(ocm,i,ohc;θ4)等式5o ep = f 4 (o cm,i ,o hc ;θ 4 ) Equation 5
其中,参数θ4表示参数,例如,熵参数NN(917)中卷积核中使用的权重和偏置(如果在熵参数NN(917)中使用偏置)。熵参数NN(917)的输出oep可以用于确定(例如,调节)熵模型,并且因此经调节的熵模型可以取决于输入块x,例如,经由来自超级解码器(925)的输出ohc。在一个示例中,输出oep包括用于调节熵模型(例如,GMM)的参数,例如,均值和尺度参数。参考图9B,熵编码器(913)和熵解码器(914)可以分别在熵编码和熵解码中使用熵模型(例如,条件熵模型)。Here, parameter θ4 represents parameters, such as the weights and biases used in the convolutional kernel of the entropy parameter NN (917) (if biases are used in the entropy parameter NN (917)). The output oep of the entropy parameter NN (917) can be used to determine (e.g., adjust) the entropy model, and thus the adjusted entropy model can depend on the input block x, for example, via the output ohc from the super decoder (925). In one example, the output oep includes parameters for adjusting the entropy model (e.g., GMM), such as the mean and scale parameters. Referring to Figure 9B, the entropy encoder (913) and entropy decoder (914) can use the entropy model (e.g., the conditional entropy model) in entropy encoding and entropy decoding, respectively.
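作为示意，下述草图展示了如何将熵参数NN(917)输出的均值和尺度参数用于条件高斯熵模型来估计某个量化潜在符号的比特开销(假设性示例：函数名与数值均为说明用途，并非专利的实现)。As an illustration, the sketch below shows how mean and scale parameters, such as those output by the entropy parameter NN (917), can be used in a conditional Gaussian entropy model to estimate the bit cost of a quantized latent symbol (a hypothetical example; the function names and values are illustrative only and are not the patent's implementation).

```python
import numpy as np
from math import erf, sqrt

def gauss_cdf(x, mu, sigma):
    # CDF of a Gaussian with the given mean and scale
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def bits_for_symbol(y_hat, mu, sigma):
    # Probability mass the conditional Gaussian assigns to the integer bin
    # [y_hat - 0.5, y_hat + 0.5]; its negative log2 is the ideal bit cost.
    p = gauss_cdf(y_hat + 0.5, mu, sigma) - gauss_cdf(y_hat - 0.5, mu, sigma)
    return -np.log2(max(p, 1e-12))

# A symbol the model predicts well (mean near y_hat) is cheap to code...
cheap = bits_for_symbol(3.0, mu=3.0, sigma=0.5)
# ...while a poorly predicted symbol is expensive.
costly = bits_for_symbol(3.0, mu=0.0, sigma=0.5)
```

这也说明了为何由上下文和边信息调节的熵模型可以比固定熵模型更精确：更好的预测直接转化为更低的比特率。This also illustrates why an entropy model conditioned on context and side information can outperform a fixed entropy model: better prediction translates directly into a lower bit rate.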
可以如下描述第二子NN(952)。潜在y可以被馈送到超级编码器(921)中,以生成超级潜在z。在一个示例中,超级编码器(921)是使用诸如CNN之类的神经网络来实现的。超级潜在z和潜在y之间的关系可以使用等式6来描述。The second sub-NN (952) can be described as follows. The latent y can be fed into the super encoder (921) to generate the super latent z. In one example, the super encoder (921) is implemented using a neural network such as a CNN. The relationship between the super latent z and the latent y can be described using Equation 6.
z=f5(y;θ5)等式6z = f5 (y; θ5 ) Equation 6
其中,参数θ5表示参数,例如,超级编码器(921)中卷积核中使用的权重和偏置(如果在超级编码器(921)中使用了偏置)。Here, parameter θ5 represents parameters, such as the weights and biases used in the convolution kernel of the super encoder (921) (if biases are used in the super encoder (921)).
量化器(922)对超级潜在z进行量化，以生成量化的潜在ẑ。可以压缩量化的潜在ẑ，例如，由熵编码器(923)使用无损压缩，来生成边信息，例如，来自超级神经网络的编码比特(932)。熵编码器(923)可以使用熵编码技术，例如，Huffman编码、算术编码等。在一个示例中，熵编码器(923)使用算术编码，并且是算术编码器。在一个示例中，诸如编码比特(932)之类的边信息可以在编码比特流中例如与编码块(931)一起传输。The quantizer (922) quantizes the super latent z to generate a quantized latent ẑ. The quantized latent ẑ can be compressed, for example, by the entropy encoder (923) using lossless compression, to generate side information, such as the coded bits (932) from the super neural network. The entropy encoder (923) can use entropy coding techniques, such as Huffman coding, arithmetic coding, and the like. In one example, the entropy encoder (923) uses arithmetic coding and is an arithmetic encoder. In one example, the side information, such as the coded bits (932), can be transmitted in the coded bitstream, for example, together with the coded block (931).
诸如编码比特(932)的边信息可以由熵解码器(924)解压缩(例如，熵解码)，以生成输出。熵解码器(924)可以使用与熵编码器(923)中使用的熵编码技术对应的熵解码技术，例如，Huffman编码、算术编码等。在一个示例中，熵解码器(924)使用算术解码，并且是算术解码器。在一个示例中，在熵编码器(923)中使用无损压缩，在熵解码器(924)中使用无损解压缩，并且可以忽略诸如由于边信息的传输而导致的噪声，来自熵解码器(924)的输出可以是量化的潜在ẑ。超级解码器(925)可以解码量化的潜在ẑ，以生成输出ohc。输出ohc和量化的潜在ẑ之间的关系可以使用等式7来描述。The side information, such as the coded bits (932), can be decompressed (e.g., entropy decoded) by the entropy decoder (924) to generate an output. The entropy decoder (924) can use an entropy decoding technique corresponding to the entropy coding technique used in the entropy encoder (923), such as Huffman coding, arithmetic coding, and the like. In one example, the entropy decoder (924) uses arithmetic decoding and is an arithmetic decoder. In one example, with lossless compression used in the entropy encoder (923), lossless decompression used in the entropy decoder (924), and noise such as that caused by transmission of the side information being negligible, the output from the entropy decoder (924) can be the quantized latent ẑ. The super decoder (925) can decode the quantized latent ẑ to generate the output ohc. The relationship between the output ohc and the quantized latent ẑ can be described using Equation 7.
ohc=f6(ẑ;θ6)　等式7　ohc = f6(ẑ; θ6)　Equation 7
其中，参数θ6表示参数，例如，超级解码器(925)中卷积核中使用的权重和偏置(如果在超级解码器(925)中使用偏置)。Here, the parameter θ6 represents parameters, such as the weights and biases used in the convolution kernels in the super decoder (925) (if biases are used in the super decoder (925)).
如上所述,压缩或编码的比特(932)可以作为边信息被添加到编码的比特流,这使得熵解码器(914)能够使用条件熵模型。因此,熵模型可以是块相关的和空间自适应的,因此可以比固定熵模型更精确。As described above, compressed or encoded bits (932) can be added as side information to the encoded bitstream, which enables the entropy decoder (914) to use a conditional entropy model. Therefore, the entropy model can be block-dependent and spatially adaptive, and thus more accurate than a fixed entropy model.
可以适当地修改NIC框架(900),例如,省略图9B中所示的一个或多个组件,修改图9B中所示的一个或多个组件,和/或包括图9B中未示出的一个或多个组件。在一个示例中,使用固定熵模型的NIC框架包括第一子NN(951),并且不包括第二子NN(952)。在一个示例中,NIC框架包括NIC框架(900)中除熵编码器(923)和熵解码器(924)之外的组件。The NIC framework (900) can be modified appropriately, for example, by omitting one or more components shown in FIG. 9B, modifying one or more components shown in FIG. 9B, and/or including one or more components not shown in FIG. 9B. In one example, the NIC framework using a fixed entropy model includes a first sub-NN (951) and does not include a second sub-NN (952). In one example, the NIC framework includes components in the NIC framework (900) other than the entropy encoder (923) and the entropy decoder (924).
在一个实施例中,图9B所示的NIC框架(900)中的一个或多个组件使用神经网络(例如,CNN)来实现。NIC框架(例如,NIC框架(900))中的每个基于NN的组件(例如,主编码器网络(911)、主解码器网络(915)、上下文模型NN(916)、熵参数NN(917)、超级编码器(921)或超级解码器(925))可以包括任何合适的架构(例如,具有任何合适的层组合)、包括任何合适类型的参数(例如,权重、偏置、权重和偏置的组合和/或诸如此类),并且包括任何合适数量的参数。In one embodiment, one or more components in the NIC framework (900) shown in Figure 9B are implemented using a neural network (e.g., a CNN). Each NN-based component in the NIC framework (e.g., the master encoder network (911), the master decoder network (915), the context model NN (916), the entropy parameter NN (917), the super encoder (921), or the super decoder (925)) may include any suitable architecture (e.g., with any suitable combination of layers), any suitable type of parameters (e.g., weights, biases, combinations of weights and biases, and/or the like), and any suitable number of parameters.
在一个实施例中,主编码器网络(911)、主解码器网络(915)、上下文模型NN(916)、熵参数NN(917)、超级编码器(921)和超级解码器(925)使用相应的CNN来实现。In one embodiment, the master encoder network (911), master decoder network (915), context model NN (916), entropy parameter NN (917), super encoder (921), and super decoder (925) are implemented using corresponding CNNs.
图10示出了根据本公开实施例的主编码器网络(911)的示例性CNN。例如,主编码器网络(911)包括四组层,其中,每组层包括一个卷积层5x5 c192 s2,其后是一个GDN层。可以修改和/或省略图10中所示的一层或多层。可以将额外层添加到主编码器网络(911)。Figure 10 illustrates an exemplary CNN of a master encoder network (911) according to an embodiment of the present disclosure. For example, the master encoder network (911) includes four sets of layers, wherein each set of layers includes a 5x5 c192 s2 convolutional layer followed by a GDN layer. One or more layers shown in Figure 10 may be modified and/or omitted. Additional layers may be added to the master encoder network (911).
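作为示意，四个步长为2的5x5卷积层会将空间分辨率逐层减半；下述草图计算了该下采样过程(假设输入为256×256、填充为2，这些数值仅为说明用途，并非专利规定的配置)。As an illustration, the four stride-2 5x5 convolutional layers halve the spatial resolution at each stage; the sketch below computes this downsampling (assuming a 256×256 input and padding of 2; the values are illustrative only and are not a configuration mandated by the patent).

```python
def conv_out_dim(n, kernel=5, stride=2, pad=2):
    # Standard convolution output-size formula: floor((n + 2p - k) / s) + 1
    return (n + 2 * pad - kernel) // stride + 1

dims = [256]  # hypothetical input resolution
for _ in range(4):  # the four conv 5x5 s2 stages described above
    dims.append(conv_out_dim(dims[-1]))
# Each stride-2 stage halves the spatial size: 256 -> 128 -> 64 -> 32 -> 16
print(dims)
```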
图11示出了根据本公开实施例的主解码器网络(915)的示例性CNN。例如,主解码器网络(915)包括三组层,其中,每组层包括去卷积层5x5 c192 s2,随后是IGDN层。此外,三组层之后是去卷积层5x5 c3 s2,之后是IGDN层。可以修改和/或省略图11中所示的一层或多层。可以将额外层添加到主解码器网络(915)。Figure 11 illustrates an exemplary CNN of a master decoder network (915) according to an embodiment of the present disclosure. For example, the master decoder network (915) includes three sets of layers, each set comprising a 5x5 c192 s2 deconvolutional layer followed by an IGDN layer. Furthermore, after the three sets of layers is a 5x5 c3 s2 deconvolutional layer, followed by an IGDN layer. One or more layers shown in Figure 11 may be modified and/or omitted. Additional layers may be added to the master decoder network (915).
图12示出了根据本公开实施例的超级编码器(921)的示例性CNN。例如,超级编码器(921)包括卷积层3x3 c192 s1,随后是泄漏ReLU,卷积层5x5 c192s2,随后是泄漏ReLU,以及卷积层5x5 c192 s2。可以修改和/或省略图12中所示的一层或多层。可以将额外层添加到超级编码器(921)。Figure 12 illustrates an exemplary CNN of a super encoder (921) according to an embodiment of the present disclosure. For example, the super encoder (921) includes a 3x3 c192 s1 convolutional layer followed by a leaky ReLU, a 5x5 c192 s2 convolutional layer followed by a leaky ReLU, and another 5x5 c192 s2 convolutional layer. One or more layers shown in Figure 12 may be modified and/or omitted. Additional layers may be added to the super encoder (921).
图13示出了根据本公开实施例的超级解码器(925)的示例性CNN。例如，超级解码器(925)包括去卷积层5x5 c192 s2，随后是泄漏ReLU，卷积层5x5 c288 s2，随后是泄漏ReLU，以及卷积层3x3 c384 s1。可以修改和/或省略图13中所示的一层或多层。可以将额外层添加到超级解码器(925)。Figure 13 illustrates an exemplary CNN of the super decoder (925) according to an embodiment of the present disclosure. For example, the super decoder (925) includes a 5x5 c192 s2 deconvolutional layer followed by a leaky ReLU, a 5x5 c288 s2 convolutional layer followed by a leaky ReLU, and a 3x3 c384 s1 convolutional layer. One or more layers shown in Figure 13 may be modified and/or omitted. Additional layers may be added to the super decoder (925).
图14示出了根据本公开实施例的上下文模型NN(916)的示例性CNN。例如,上下文模型NN(916)包括用于上下文预测的掩蔽卷积5x5 c384 s1,因此等式4中的上下文包括有限的上下文(例如,5×5卷积核)。可以修改图14中的卷积层。可以将额外层添加到上下文模型NN(916)。Figure 14 illustrates an exemplary CNN of a context model NN (916) according to an embodiment of the present disclosure. For example, the context model NN (916) includes a masked convolution 5x5 c384 s1 for context prediction, so the context in Equation 4 includes a finite context (e.g., a 5×5 convolutional kernel). The convolutional layers in Figure 14 can be modified. Additional layers can be added to the context model NN (916).
图15示出了根据本公开实施例的熵参数NN(917)的示例性CNN。例如,熵参数NN(917)包括卷积层1x1 c640 s1,随后是泄漏ReLU,卷积层1x1 c512s1,随后是泄漏ReLU,以及卷积层1x1 c384 s1。可以修改和/或省略图15中所示的一层或多层。可以将额外层添加到熵参数NN(917)。Figure 15 illustrates an exemplary CNN with entropy parameter NN (917) according to an embodiment of this disclosure. For example, the entropy parameter NN (917) includes a 1x1 convolutional layer c640 s1 followed by a leaky ReLU, a 1x1 convolutional layer c512 s1 followed by a leaky ReLU, and a 1x1 convolutional layer c384 s1. One or more layers shown in Figure 15 may be modified and/or omitted. Additional layers may be added to the entropy parameter NN (917).
NIC框架(900)可以使用CNN来实现,如参考图10-15所描述的。NIC框架(900)可以被适当地适配,使得NIC框架(900)中的一个或多个组件(例如,(911)、(915)、(916)、(917)、(921)和/或(925))使用任何适当类型的神经网络(例如,基于CNN或非CNN的神经网络)来实现。NIC框架(900)的一个或多个其他组件可以使用神经网络来实现。The NIC framework (900) can be implemented using a CNN, as described with reference to Figures 10-15. The NIC framework (900) can be appropriately adapted so that one or more components of the NIC framework (900) (e.g., (911), (915), (916), (917), (921) and/or (925)) can be implemented using any suitable type of neural network (e.g., a CNN-based or non-CNN-based neural network). One or more other components of the NIC framework (900) can be implemented using neural networks.
可以训练包括神经网络(例如,CNN)的NIC框架(900)来学习神经网络中使用的参数。例如,当使用CNN时,可以分别在训练过程中学习由θ1-θ6表示的参数,例如,在主编码器网络(911)中的卷积核中使用的权重和偏置(如果在主编码器网络(911)中使用偏置)、在主解码器网络(915)中的卷积核中使用的权重和偏置(如果在主解码器网络(915)中使用偏置)、超级编码器(921)中的卷积核中使用的权重和偏置(如果在超级编码器(921)中使用偏置)、超级解码器(925)中的卷积核中使用的权重和偏置(如果在超级解码器(925)中使用偏置)、上下文模型NN(916)中的卷积核中使用的权重和偏置(如果在上下文模型NN(916)中使用偏置)以及在熵参数NN(917)中的卷积核中使用的权重和偏置(如果在熵参数NN(917)中使用了偏置)。The NIC framework (900) including neural networks (e.g., CNNs) can be trained to learn the parameters used in the neural networks. For example, when using a CNN, the parameters represented by θ1 - θ6 can be learned during training, such as the weights and biases used in the convolutional kernels in the main encoder network (911) (if biases are used in the main encoder network (911)), the weights and biases used in the convolutional kernels in the main decoder network (915) (if biases are used in the main decoder network (915)), the weights and biases used in the convolutional kernels in the super encoder (921) (if biases are used in the super encoder (921)), the weights and biases used in the convolutional kernels in the super decoder (925) (if biases are used in the super decoder (925)), the weights and biases used in the convolutional kernels in the context model NN (916) (if biases are used in the context model NN (916)), and the weights and biases used in the convolutional kernels in the entropy parameter NN (917) (if biases are used in the entropy parameter NN (917)).
在一个示例中,参考图10,主编码器网络(911)包括四个卷积层,其中,每个卷积层具有5×5的卷积核和192个信道。因此,在主编码器网络(911)中的卷积核中使用的权重的数量是19200(即,4x5x5x192)。主编码器网络(911)中使用的参数包括19200权重和可选偏置。当在主编码器网络(911)中使用偏置和/或额外NN时,可以包括额外参数。In one example, referring to Figure 10, the master encoder network (911) comprises four convolutional layers, each with a 5×5 convolutional kernel and 192 channels. Therefore, the number of weights used in the convolutional kernels of the master encoder network (911) is 19200 (i.e., 4x5x5x192). The parameters used in the master encoder network (911) include 19200 weights and optional biases. Additional parameters may be included when biases and/or additional neural networks are used in the master encoder network (911).
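上述权重数量的计算可以如下复核(该计数遵循本文的简化约定，即每层按单个5×5×192核计；实际CNN各层的权重数还取决于其输入通道数)。The weight count above can be checked as follows (the count follows the simplified convention of this description, i.e., one 5×5×192 kernel per layer; the weight count of each layer of an actual CNN also depends on its input channel count).

```python
layers = 4
kernel_h, kernel_w, channels = 5, 5, 192

# Simplified per-layer count used in the text: 5 x 5 x 192 weights per layer
weights = layers * kernel_h * kernel_w * channels
print(weights)  # 19200
```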
参考图9B,NIC框架(900)包括至少一个建立在神经网络上的组件或模块。该至少一个组件可以包括主编码器网络(911)、主解码器网络(915)、超级编码器(921)、超级解码器(925)、上下文模型NN(916)和熵参数NN(917)中的一个或多个。可以单独训练该至少一个组件。在一个示例中,训练过程用于分别学习每个组件的参数。该至少一个组件可以作为一个组共同训练。在一个示例中,训练过程用于共同学习至少一个组件的子集的参数。在一个示例中,训练过程用于学习所有至少一个组件的参数,因此被称为E2E优化。Referring to Figure 9B, the NIC framework (900) includes at least one component or module built on a neural network. This at least one component may include one or more of a master encoder network (911), a master decoder network (915), a super encoder (921), a super decoder (925), a context model NN (916), and an entropy parameter NN (917). This at least one component can be trained individually. In one example, the training process is used to learn the parameters of each component separately. The at least one component can also be trained as a group. In one example, the training process is used to collectively learn the parameters of a subset of at least one component. In one example, the training process is used to learn the parameters of all at least one component, hence the term E2E optimization.
在NIC框架(900)中的一个或多个组件的训练过程中,可以初始化一个或多个组件的权重(或权重系数)。在一个示例中,基于预训练相应神经网络模型(例如,DNN模型、CNN模型)来初始化权重。在一个示例中,通过将权重设置为随机数来初始化权重。During the training of one or more components in the NIC framework (900), the weights (or weight coefficients) of one or more components can be initialized. In one example, the weights are initialized based on a pre-trained corresponding neural network model (e.g., a DNN model, a CNN model). In another example, the weights are initialized by setting them to random numbers.
例如,在初始化权重之后,可以采用一组训练块来训练一个或多个组件。该组训练块可以包括具有任何合适尺寸的任何合适的块。在一些示例中,该组训练块包括在空间域中的原始图像、自然图像、计算机生成的图像等的块。在一些示例中,该组训练块包括在空间域中具有残差数据的残差块或残差图像的块。残差数据可以由残差计算器(例如,残差计算器(723))来计算。在一些示例中,原始图像和/或包括残差数据的残差图像可以直接用于在NIC框架中训练神经网络。因此,原始图像、残差图像、来自原始图像的块和/或来自残差图像的块可以用于在NIC框架中训练神经网络。For example, after initializing the weights, a set of training blocks can be used to train one or more components. This set of training blocks can include any suitable block with any appropriate size. In some examples, the set of training blocks includes blocks of the original image, natural image, computer-generated image, etc., in the spatial domain. In some examples, the set of training blocks includes blocks of residual blocks or residual images with residual data in the spatial domain. The residual data can be computed by a residual calculator (e.g., a residual calculator (723)). In some examples, the original image and/or the residual image including the residual data can be directly used to train the neural network in the NIC framework. Therefore, the original image, the residual image, blocks from the original image, and/or blocks from the residual image can be used to train the neural network in the NIC framework.
为了简洁起见，下面使用训练块作为示例来描述训练过程。该描述可以适当地适用于训练图像。该组训练块中的训练块t可以通过图9B中的编码过程，以生成压缩表示(例如，编码信息，例如，比特流)。编码信息可以通过图9B中描述的解码过程被解码，以计算重构块。For brevity, the training process is described below using a training block as an example. The description can be suitably adapted to training images. A training block t in the set of training blocks can be passed through the encoding process in Figure 9B to generate a compressed representation (e.g., coded information, e.g., a bitstream). The coded information can be decoded via the decoding process described in Figure 9B to compute a reconstructed block.
对于NIC框架(900)，平衡了两个竞争目标，例如，重构质量和比特消耗。质量损耗函数(例如，失真或失真损耗)可以用于指示重构质量，例如，重构块和原始块(例如，训练块t)之间的差异。速率(或速率损耗)R可以用于指示压缩表示的比特消耗。在一个示例中，速率损耗R还包括例如在确定上下文模型时使用的边信息。For the NIC framework (900), two competing targets, e.g., reconstruction quality and bit consumption, are balanced. A quality loss function (e.g., a distortion or distortion loss) can be used to indicate the reconstruction quality, such as the difference between the reconstructed block and the original block (e.g., the training block t). A rate (or rate loss) R can be used to indicate the bit consumption of the compressed representation. In one example, the rate loss R further includes the side information used, for example, in determining the context model.
对于神经图像压缩,可以在E2E优化中使用量化的可微分近似。在各种示例中,在基于神经网络的图像压缩的训练过程中,使用噪声注入来模拟量化,因此量化是通过噪声注入来模拟的,而不是由量化器(例如,量化器(912))来执行。因此,利用噪声注入的训练可以可变地逼近量化误差。每像素比特(BPP)估计器可用于模拟熵编码器,因此熵编码由BPP估计器模拟,而不是由熵编码器(例如,(913))和熵解码器(例如,(914))执行。因此,例如,可以基于噪声注入和BPP估计器来估计训练过程中等式1所示的损耗函数L中的速率损耗R。一般而言,较高的速率R可以实现较低的失真D,而较低的速率R会导致较高的失真D。等式1中的权衡超参数λ可用于优化共同R-D损耗L,其中,L作为λD和R的总和可被优化。训练过程可用于调整NIC框架(900)中的一个或多个组件(例如(911)、(915))的参数,使得共同R-D损耗L被最小化或优化。在一个示例中,可以使用折衷超参数λ来优化联合率失真(R-D)损耗,如下所示:For neural image compression, a differentiable approximation of quantization can be used in E2E optimization. In various examples, noise injection is used to simulate quantization during the training process of neural network-based image compression, so quantization is simulated by noise injection rather than performed by a quantizer (e.g., quantizer (912)). Therefore, training with noise injection can variably approximate the quantization error. A bit-per-pixel (BPP) estimator can be used to simulate an entropy encoder, so entropy encoding is simulated by the BPP estimator rather than performed by an entropy encoder (e.g., (913)) and an entropy decoder (e.g., (914)). Thus, for example, the rate loss R in the loss function L shown in Equation 1 during training can be estimated based on noise injection and the BPP estimator. In general, a higher rate R can achieve a lower distortion D, while a lower rate R results in a higher distortion D. The tradeoff hyperparameter λ in Equation 1 can be used to optimize the common R-D loss L, where L can be optimized as a sum of λD and R. The training process can be used to tune the parameters of one or more components (e.g., (911), (915)) in the NIC framework (900) such that the common R-D loss L is minimized or optimized. In one example, a trade-off hyperparameter λ can be used to optimize the joint rate-distortion (R-D) loss, as follows:
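训练中用噪声注入近似量化、并按等式1将速率与失真组合的做法，可以用如下草图示意(假设性示例：以均匀噪声模拟舍入误差，数值仅作说明)。The use of noise injection to approximate quantization during training, and the combination of rate and distortion per Equation 1, can be sketched as follows (a hypothetical example: uniform noise simulates the rounding error; the values are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)  # stand-in for a latent produced by the encoder

# Training: quantization simulated by additive uniform noise in [-0.5, 0.5)
# (a differentiable proxy), instead of being performed by the quantizer (912).
y_train = y + rng.uniform(-0.5, 0.5, size=y.shape)
# Inference: hard rounding performed by the quantizer.
y_infer = np.round(y)

# Both the injected noise and the true rounding error stay within +/-0.5.
max_train_err = np.max(np.abs(y_train - y))
max_infer_err = np.max(np.abs(y_infer - y))

def rd_loss(distortion, rate, lam):
    # Joint R-D objective of Equation 1: L = lambda * D + R
    return lam * distortion + rate
```

由于注入的噪声与真实舍入误差落在相同的范围内，训练时对速率和失真的估计可以合理地近似推理时的行为。Because the injected noise and the true rounding error fall in the same range, the rate and distortion estimated during training reasonably approximate the behavior at inference.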
L=λD+R+βE　L = λD + R + βE
其中，E测量与编码前的原始块残差相比的解码块残差的失真，其充当残差编码/解码DNN和编码/解码DNN的正则化损耗。β是用于平衡正则化损耗重要性的超参数。Here, E measures the distortion of the decoded block residuals compared with the original block residuals before encoding, and acts as a regularization loss for the residual encoding/decoding DNN and the encoding/decoding DNN. β is a hyperparameter used to balance the importance of the regularization loss.
可以使用各种模型来确定失真损耗D和速率损耗R，从而确定等式1中的共同R-D损耗L。在一个示例中，失真损耗D被表示为峰值信噪比(PSNR)(一种基于均方误差的度量)、多尺度结构相似性(MS-SSIM)质量指数、PSNR和MS-SSIM的加权组合等。Various models can be used to determine the distortion loss D and the rate loss R, and thus the joint R-D loss L in Equation 1. In one example, the distortion loss D is expressed as a peak signal-to-noise ratio (PSNR), which is a metric based on mean squared error, a multi-scale structural similarity (MS-SSIM) quality index, a weighted combination of PSNR and MS-SSIM, or the like.
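失真度量PSNR可以由均方误差计算得到，如下草图所示(假设8比特样本、峰值为255；数值仅作说明)。The distortion metric PSNR can be computed from the mean squared error, as sketched below (assuming 8-bit samples with a peak value of 255; the values are illustrative only).

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    # PSNR = 10 * log10(peak^2 / MSE)
    mse = np.mean((original.astype(np.float64)
                   - reconstructed.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((4, 4))
b = np.full((4, 4), 25.5)  # constant error => MSE = 25.5^2 = peak^2 / 100
print(psnr(a, b))  # 20.0 dB
```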
在一个示例中,训练过程的目标是训练编码神经网络(例如,编码DNN),例如,要在编码器侧使用的视频编码器,以及训练解码神经网络(例如,解码DNN),例如,要在解码器侧使用的视频解码器。在一个示例中,参考图9B,编码神经网络可以包括主编码器网络(911)、超级编码器(921)、超级解码器(925)、上下文模型NN(916)和熵参数NN(917)。解码神经网络可以包括主解码器网络(915)、超级解码器(925)、上下文模型NN(916)和熵参数NN(917)。视频编码器和/或视频解码器可以包括基于NN和/或不基于NN的其他组件。In one example, the goal of the training process is to train an encoding neural network (e.g., an encoding DNN), such as a video encoder to be used on the encoder side, and to train a decoding neural network (e.g., a decoding DNN), such as a video decoder to be used on the decoder side. In one example, referring to Figure 9B, the encoding neural network may include a main encoder network (911), a super encoder (921), a super decoder (925), a context model NN (916), and an entropy parameter NN (917). The decoding neural network may include a main decoder network (915), a super decoder (925), a context model NN (916), and an entropy parameter NN (917). The video encoder and/or video decoder may include other components based on and/or not based on NNs.
可以以E2E方式训练NIC框架(例如,NIC框架(900))。在一个示例中,编码神经网络和解码神经网络在训练过程中基于反向传播梯度以E2E方式共同更新。The NIC framework can be trained in an E2E manner (e.g., the NIC framework (900)). In one example, the encoding neural network and the decoding neural network are jointly updated in an E2E manner based on backpropagation gradients during training.
在训练NIC框架(900)中的神经网络的参数之后，NIC框架(900)中的一个或多个组件可以用于编码和/或解码块。在一个实施例中，在编码器侧，视频编码器被配置为将输入块x编码成要在比特流中传输的编码块(931)。视频编码器可以包括NIC框架(900)中的多个组件。在一个实施例中，在解码器侧，对应的视频解码器被配置为将比特流中的编码块(931)解码成重构块x̂。视频解码器可以包括NIC框架(900)中的多个组件。After training the parameters of the neural networks in the NIC framework (900), one or more components in the NIC framework (900) can be used to encode and/or decode blocks. In one embodiment, on the encoder side, a video encoder is configured to encode an input block x into a coded block (931) to be transmitted in a bitstream. The video encoder may include multiple components in the NIC framework (900). In one embodiment, on the decoder side, a corresponding video decoder is configured to decode the coded block (931) in the bitstream into a reconstructed block x̂. The video decoder may include multiple components in the NIC framework (900).
在一个示例中,例如,当采用内容自适应在线训练时,视频编码器包括NIC框架(900)中的所有组件。In one example, for instance, when using content-adaptive online training, the video encoder includes all components of the NIC framework (900).
图16A示出了根据本公开实施例的示例性视频编码器(1600A)。视频编码器(1600A)包括参考图9B描述的主编码器网络(911)、量化器(912)、熵编码器(913)和第二子NN(952),为了简洁起见,省略了详细描述。图16B示出了根据本公开实施例的示例性视频解码器(1600B)。视频解码器(1600B)可以对应于视频编码器(1600A)。视频解码器(1600B)可以包括主解码器网络(915)、熵解码器(914)、上下文模型NN(916)、熵参数NN(917)、熵解码器(924)和超级解码器(925)。参考图16A-16B,在编码器侧,视频编码器(1600A)可以生成要在比特流中传输的编码块(931)和编码比特(932)。在解码器侧,视频解码器(1600B)可以接收和解码编码块(931)和编码比特(932)。Figure 16A illustrates an exemplary video encoder (1600A) according to an embodiment of the present disclosure. The video encoder (1600A) includes a master encoder network (911), a quantizer (912), an entropy encoder (913), and a second sub-NN (952) described with reference to Figure 9B, the details of which are omitted for brevity. Figure 16B illustrates an exemplary video decoder (1600B) according to an embodiment of the present disclosure. The video decoder (1600B) may correspond to the video encoder (1600A). The video decoder (1600B) may include a master decoder network (915), an entropy decoder (914), a context model NN (916), an entropy parameter NN (917), an entropy decoder (924), and a super decoder (925). Referring to Figures 16A-16B, on the encoder side, the video encoder (1600A) may generate coded blocks (931) and coded bits (932) to be transmitted in the bitstream. On the decoder side, the video decoder (1600B) can receive and decode coded blocks (931) and coded bits (932).
图17-18分别示出了根据本公开的实施例的示例性视频编码器(1700)和相应的视频解码器(1800)。参考图17,编码器(1700)包括主编码器网络(911)、量化器(912)和熵编码器(913)。参考图9B描述主编码器网络(911)、量化器(912)和熵编码器(913)的示例。参考图18,视频解码器(1800)包括主解码器网络(915)和熵解码器(914)。参考图9B描述主解码器网络(915)和熵解码器(914)的示例。参考图17和18,视频编码器(1700)可以生成将在比特流中传输的编码块(931)。视频解码器(1800)可以接收并解码编码块(931)。Figures 17-18 illustrate exemplary video encoders (1700) and corresponding video decoders (1800) according to embodiments of the present disclosure. Referring to Figure 17, the encoder (1700) includes a master encoder network (911), a quantizer (912), and an entropy encoder (913). An example of the master encoder network (911), quantizer (912), and entropy encoder (913) is described with reference to Figure 9B. Referring to Figure 18, the video decoder (1800) includes a master decoder network (915) and an entropy decoder (914). An example of the master decoder network (915) and entropy decoder (914) is described with reference to Figure 9B. Referring to Figures 17 and 18, the video encoder (1700) can generate encoded blocks (931) that will be transmitted in a bitstream. The video decoder (1800) can receive and decode the encoded blocks (931).
如上所述,包括视频编码器和视频解码器的NIC框架(900)可以基于该组训练图像中的图像和/或块来训练。在一些示例中,要压缩(例如,编码)和/或传输的一个或多个块具有与该组训练块显著不同的属性。因此,分别使用基于该组训练块训练的视频编码器和视频解码器直接对一个或多个块进行编码和解码,会导致相对较差的R-D损耗L(例如,相对较大的失真和/或相对较大的比特率)。因此,本公开的方面描述NIC的内容自适应在线训练方法,例如,NIC的分块内容自适应在线训练方法。As described above, the NIC framework (900), including a video encoder and a video decoder, can be trained based on images and/or blocks in the set of training images. In some examples, one or more blocks to be compressed (e.g., encoded) and/or transmitted have properties significantly different from those of the set of training blocks. Therefore, directly encoding and decoding one or more blocks using the video encoder and video decoder trained based on the set of training blocks results in relatively poor R-D loss L (e.g., relatively large distortion and/or relatively large bit rate). Therefore, aspects of this disclosure describe content-adaptive online training methods for NICs, such as a chunked content-adaptive online training method for NICs.
在分块内容自适应在线训练方法中,输入图像可以被分成块,并且一个或多个块可以用于通过优化率失真性能来将预训练的NIC框架中的一个或多个参数更新为一个或多个替换参数。指示一个或多个替换参数或一个或多个替换参数的子集的神经网络更新信息可以与编码的一个或多个块一起被编码到比特流中。在解码器侧,视频解码器可以对编码的一个或多个块进行解码,并且可以通过使用一个或多个替换参数或一个或多个替换参数的子集来实现更好的压缩性能。分块内容自适应在线训练方法可以用作预处理步骤(例如,预编码步骤),用于提高预训练的E2ENIC压缩方法的压缩性能。In the chunked content adaptive online training method, the input image can be divided into chunks, and one or more chunks can be used to update one or more parameters in a pre-trained NIC framework with one or more replacement parameters by optimizing rate-distortion performance. Neural network update information indicating one or more replacement parameters, or a subset of one or more replacement parameters, can be encoded into a bitstream along with the encoded one or more chunks. On the decoder side, the video decoder can decode the encoded one or more chunks and can achieve better compression performance by using one or more replacement parameters, or a subset of one or more replacement parameters. The chunked content adaptive online training method can be used as a preprocessing step (e.g., a pre-encoding step) to improve the compression performance of the pre-trained E2ENIC compression method.
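分块内容自适应在线训练的思想可以用如下玩具草图示意(假设性示例：用单个标量"替换参数"theta对一个块最小化λD+R形式的损耗；该草图仅在精神上对应专利中对NIC框架参数的更新，并非其实现)。The idea of block-wise content-adaptive online training can be illustrated with the toy sketch below (a hypothetical example: a single scalar "replacement parameter" theta is tuned to minimize a loss of the form λD + R for one block; the sketch corresponds to the patent's update of NIC framework parameters only in spirit and is not its implementation).

```python
import numpy as np

rng = np.random.default_rng(1)
block = rng.normal(loc=2.0, size=64)  # the block to be encoded

LAM = 1.0  # trade-off hyperparameter lambda

def rd_loss(theta, x):
    # Hypothetical stand-ins: distortion falls as theta fits the block,
    # rate grows with |theta| (more side information to signal the update).
    distortion = np.mean((x - theta) ** 2)
    rate = 0.01 * abs(theta)
    return LAM * distortion + rate

# Online training: plain gradient descent on the single replacement parameter,
# using only the current block (no access to the original training set).
theta = 0.0
for _ in range(200):
    grad = -2.0 * np.mean(block - theta) + 0.01 * np.sign(theta)
    theta -= 0.1 * grad

base = rd_loss(0.0, block)     # pretrained parameter left unchanged
tuned = rd_loss(theta, block)  # block-adapted replacement parameter
print(base > tuned)  # adaptation reduces the R-D loss for this block
```

速率项惩罚替换参数的信令开销，这对应于将神经网络更新信息与编码块一起写入比特流的成本。The rate term penalizes signaling the replacement parameter, corresponding to the cost of writing the neural network update information into the bitstream alongside the coded blocks.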
为了区分基于该组训练块的训练过程和基于要压缩(例如,编码)和/或传输的一个或多个块的内容自适应在线训练过程,由该组训练块训练的NIC框架(900)、视频编码器和视频解码器分别被称为预训练NIC框架(900)、预训练视频编码器和预训练视频解码器。预训练NIC框架(900)、预训练视频编码器或预训练视频解码器中的参数分别被称为NIC预训练参数、编码器预训练参数和解码器预训练参数。在一个示例中,NIC预训练参数包括编码器预训练参数和解码器预训练参数。在一个示例中,编码器预训练参数和解码器预训练参数不重叠,其中,编码器预训练参数都不包括在解码器预训练参数中。例如,(1700)中的编码器预训练参数(例如,主编码器网络(911)中的预训练参数)和(1800)中的解码器预训练参数(例如,主解码器网络(915)中的预训练参数)不重叠。在一个示例中,编码器预训练参数和解码器预训练参数重叠,其中,编码器预训练参数中的至少一个包括在解码器预训练参数中。例如,(1600A)中的编码器预训练参数(例如,上下文模型NN(916)中的预训练参数)和(1600B)中的解码器预训练参数(例如,上下文模型NN(916)中的预训练参数)重叠。可以基于该组训练块中的块和/或图像来获得NIC预训练参数。To distinguish between the training process based on this set of training blocks and the content-adaptive online training process based on one or more blocks to be compressed (e.g., encoded) and/or transmitted, the NIC framework (900), video encoder, and video decoder trained by this set of training blocks are referred to as the pre-trained NIC framework (900), pre-trained video encoder, and pre-trained video decoder, respectively. Parameters in the pre-trained NIC framework (900), pre-trained video encoder, or pre-trained video decoder are referred to as NIC pre-training parameters, encoder pre-training parameters, and decoder pre-training parameters, respectively. In one example, the NIC pre-training parameters include both encoder pre-training parameters and decoder pre-training parameters. In one example, the encoder pre-training parameters and decoder pre-training parameters do not overlap, wherein none of the encoder pre-training parameters are included in the decoder pre-training parameters. For example, the encoder pre-training parameters in (1700) (e.g., pre-training parameters in the main encoder network (911)) and the decoder pre-training parameters in (1800) (e.g., pre-training parameters in the main decoder network (915)) do not overlap. 
In one example, the encoder pre-training parameters and decoder pre-training parameters overlap, wherein at least one of the encoder pre-training parameters is included in the decoder pre-training parameters. For example, the encoder pre-training parameters (e.g., pre-training parameters in the context model NN (916)) in (1600A) and the decoder pre-training parameters (e.g., pre-training parameters in the context model NN (916)) in (1600B) overlap. The NIC pre-training parameters can be obtained based on the blocks and/or images in the set of training blocks.
内容自适应在线训练过程可被称为微调过程，如下所述。可以基于要编码和/或传输的一个或多个块进一步训练(例如，微调)预训练NIC框架(900)中的NIC预训练参数中的一个或多个预训练参数，其中，一个或多个块可以不同于该组训练块。NIC预训练参数中的一个或多个预训练参数可以通过基于一个或多个块优化共同R-D损耗L来微调。已经由一个或多个块微调的一个或多个预训练参数被称为一个或多个替换参数或一个或多个微调参数。在一个实施例中，在NIC预训练参数中的一个或多个预训练参数已经被一个或多个替换参数微调(例如，替换)之后，神经网络更新信息被编码到比特流中，以指示一个或多个替换参数或一个或多个替换参数的子集。在一个示例中，更新(或微调)NIC框架(900)，其中，一个或多个预训练参数分别被一个或多个替换参数替换。The content-adaptive online training process can be referred to as a fine-tuning process, as described below. One or more pre-trained parameters among the NIC pre-trained parameters in the pre-trained NIC framework (900) can be further trained (e.g., fine-tuned) based on one or more blocks to be encoded and/or transmitted, where the one or more blocks may differ from the set of training blocks. The one or more pre-trained parameters among the NIC pre-trained parameters can be fine-tuned by optimizing the joint R-D loss L based on the one or more blocks. The one or more pre-trained parameters that have been fine-tuned by the one or more blocks are referred to as one or more replacement parameters or one or more fine-tuned parameters. In one embodiment, after the one or more pre-trained parameters among the NIC pre-trained parameters have been fine-tuned (e.g., replaced) by the one or more replacement parameters, neural network update information is encoded into a bitstream to indicate the one or more replacement parameters or a subset of the one or more replacement parameters. In one example, the NIC framework (900) is updated (or fine-tuned), where the one or more pre-trained parameters are replaced by the one or more replacement parameters, respectively.
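To make the fine-tuning loop concrete, here is a minimal, self-contained sketch that fine-tunes a single scalar parameter by gradient descent on a toy R-D objective L = R + λD. The quadratic rate and distortion terms, the λ value, the step size, and the iteration count are all illustrative assumptions, not values from the disclosure.

```python
# Toy online fine-tuning of one "replacement parameter": minimize a toy
# R-D loss L = R + lambda * D for one block by gradient descent.
# The quadratics stand in for the real rate/distortion computed by the
# NIC networks.

LAMBDA = 0.5      # rate-distortion trade-off weight (illustrative)
STEP_SIZE = 0.1   # learning rate of the online training (illustrative)
STEPS = 200       # number of iterations (illustrative)

def rd_loss(theta):
    rate = (theta - 1.0) ** 2        # toy rate term R
    distortion = (theta + 1.0) ** 2  # toy distortion term D
    return rate + LAMBDA * distortion

def rd_grad(theta):
    # Analytic gradient of rd_loss with respect to theta.
    return 2.0 * (theta - 1.0) + LAMBDA * 2.0 * (theta + 1.0)

theta = 0.8  # pre-trained parameter value (illustrative)
for _ in range(STEPS):
    theta -= STEP_SIZE * rd_grad(theta)

replacement = theta  # the fine-tuned ("replacement") parameter
```

With these toy terms the loss is minimized at θ = 1/3, and the descent loop converges there; in the actual framework the gradient would come from backpropagation through the neural networks.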
在第一种情况下,一个或多个预训练参数包括一个或多个预训练参数的第一子集和一个或多个预训练参数的第二子集。一个或多个替换参数包括一个或多个替换参数的第一子集和一个或多个替换参数的第二子集。In the first case, the one or more pre-trained parameters comprise a first subset of one or more pre-trained parameters and a second subset of one or more pre-trained parameters. The one or more replacement parameters comprise a first subset of one or more replacement parameters and a second subset of one or more replacement parameters.
一个或多个预训练参数的第一子集在预训练视频编码器中使用，并且例如在训练过程中被一个或多个替换参数的第一子集替换。因此，通过训练过程，预训练视频编码器被更新为更新的视频编码器。神经网络更新信息可以指示一个或多个替换参数的第二子集，其将替换一个或多个预训练参数的第二子集。可以使用更新的视频编码器对一个或多个块进行编码，并在具有神经网络更新信息的比特流中传输。A first subset of the one or more pre-trained parameters is used in the pre-trained video encoder and, for example, is replaced by a first subset of the one or more replacement parameters during the training process. Therefore, through the training process, the pre-trained video encoder is updated to an updated video encoder. The neural network update information can indicate the second subset of the one or more replacement parameters, which will replace the second subset of the one or more pre-trained parameters. The one or more blocks can be encoded using the updated video encoder and transmitted in a bitstream with the neural network update information.
在解码器侧,一个或多个预训练参数的第二子集用于预训练视频解码器中。在一个实施例中,预训练视频解码器接收并解码神经网络更新信息,以确定一个或多个替换参数的第二子集。当预训练视频解码器中的一个或多个预训练参数的第二子集被一个或多个替换参数的第二子集替换时,预训练视频解码器被更新为更新的视频解码器。可以使用更新的视频解码器来解码一个或多个编码块。On the decoder side, a second subset of one or more pre-trained parameters is used in the pre-trained video decoder. In one embodiment, the pre-trained video decoder receives and decodes neural network update information to determine a second subset of one or more replacement parameters. When the second subset of one or more pre-trained parameters in the pre-trained video decoder is replaced by a second subset of one or more replacement parameters, the pre-trained video decoder is updated to an updated video decoder. The updated video decoder can then be used to decode one or more coded blocks.
图16A-16B示出了第一种情况的示例。例如，一个或多个预训练参数包括预训练上下文模型NN(916)中的N1预训练参数和预训练主解码器网络(915)中的N2预训练参数。因此，一个或多个预训练参数的第一子集包括N1预训练参数，并且一个或多个预训练参数的第二子集与一个或多个预训练参数相同。因此，预训练上下文模型NN(916)中的N1预训练参数可以被N1对应的替换参数替换，使得预训练视频编码器(1600A)可以被更新为更新的视频编码器(1600A)。预训练上下文模型NN(916)也被更新为更新的上下文模型NN(916)。在解码器侧，N1预训练参数可以由N1对应替换参数替换，N2预训练参数可以由N2对应替换参数替换，将预训练上下文模型NN(916)更新为更新上下文模型NN(916)，并将预训练主解码器网络(915)更新为更新主解码器网络(915)。因此，预训练视频解码器(1600B)可以被更新为更新的视频解码器(1600B)。Figures 16A-16B illustrate examples of the first case. For example, the one or more pre-trained parameters include N1 pre-trained parameters in the pre-trained context model NN (916) and N2 pre-trained parameters in the pre-trained main decoder network (915). Therefore, the first subset of the one or more pre-trained parameters includes the N1 pre-trained parameters, and the second subset of the one or more pre-trained parameters is the same as the one or more pre-trained parameters. Therefore, the N1 pre-trained parameters in the pre-trained context model NN (916) can be replaced by the N1 corresponding replacement parameters, so that the pre-trained video encoder (1600A) can be updated to the updated video encoder (1600A). The pre-trained context model NN (916) is also updated to the updated context model NN (916). On the decoder side, the N1 pre-trained parameters can be replaced by the N1 corresponding replacement parameters, and the N2 pre-trained parameters can be replaced by the N2 corresponding replacement parameters, updating the pre-trained context model NN (916) to the updated context model NN (916) and updating the pre-trained main decoder network (915) to the updated main decoder network (915). Therefore, the pre-trained video decoder (1600B) can be updated to the updated video decoder (1600B).
在第二种情况下，在编码器侧的预训练视频编码器中不使用一个或多个预训练参数。相反，在解码器侧的预训练视频解码器中使用一个或多个预训练参数。因此，不更新预训练视频编码器，并且在训练过程之后继续是预训练视频编码器。在一个实施例中，神经网络更新信息指示一个或多个替换参数。可以使用预训练视频编码器对一个或多个块进行编码，并在具有神经网络更新信息的比特流中传输。In the second case, the one or more pre-trained parameters are not used in the pre-trained video encoder on the encoder side. Instead, the one or more pre-trained parameters are used in the pre-trained video decoder on the decoder side. Therefore, the pre-trained video encoder is not updated and remains the pre-trained video encoder after the training process. In one embodiment, the neural network update information indicates the one or more replacement parameters. The one or more blocks can be encoded using the pre-trained video encoder and transmitted in a bitstream with the neural network update information.
在解码器侧,预训练视频解码器可以接收和解码神经网络更新信息,以确定一个或多个替换参数。当预训练视频解码器中的一个或多个预训练参数被一个或多个替换参数替换时,预训练视频解码器被更新为更新的视频解码器。可以使用更新的视频解码器来解码一个或多个编码块。On the decoder side, the pre-trained video decoder can receive and decode neural network update information to determine one or more replacement parameters. When one or more pre-trained parameters in the pre-trained video decoder are replaced by one or more replacement parameters, the pre-trained video decoder is updated to the updated video decoder. The updated video decoder can then be used to decode one or more coded blocks.
图16A-16B示出了第二种情况的示例。例如，一个或多个预训练参数包括预训练主解码器网络(915)中的N2预训练参数。因此，在编码器侧的预训练视频编码器(例如，预训练视频编码器(1600A))中没有使用一个或多个预训练参数。因此，在训练过程之后，预训练视频编码器(1600A)继续是预训练视频编码器。在解码器侧，N2预训练参数可以被N2对应的替换参数替换，这将预训练主解码器网络(915)更新为更新的主解码器网络(915)。因此，预训练视频解码器(1600B)可以被更新为更新的视频解码器(1600B)。Figures 16A-16B illustrate an example of the second case. For instance, the one or more pre-trained parameters include the N2 pre-trained parameters in the pre-trained main decoder network (915). Therefore, the one or more pre-trained parameters are not used in the pre-trained video encoder (e.g., the pre-trained video encoder (1600A)) on the encoder side. Thus, after the training process, the pre-trained video encoder (1600A) remains the pre-trained video encoder. On the decoder side, the N2 pre-trained parameters can be replaced by the N2 corresponding replacement parameters, which updates the pre-trained main decoder network (915) to the updated main decoder network (915). Therefore, the pre-trained video decoder (1600B) can be updated to the updated video decoder (1600B).
在第三种情况下，一个或多个预训练参数在预训练视频编码器中使用，并且例如在训练过程中被一个或多个替换参数替换。因此，通过训练过程，预训练视频编码器被更新为更新的视频编码器。可以使用更新的视频编码器对一个或多个块进行编码，并在比特流中传输。比特流中没有编码神经网络更新信息。在解码器侧，预训练视频解码器没有更新，仍然是预训练视频解码器。可以使用预训练视频解码器来解码一个或多个编码块。In the third case, the one or more pre-trained parameters are used in the pre-trained video encoder and are replaced, for example, by the one or more replacement parameters during the training process. Therefore, through the training process, the pre-trained video encoder is updated to an updated video encoder. The one or more blocks can be encoded using the updated video encoder and transmitted in a bitstream. No neural network update information is encoded in the bitstream. On the decoder side, the pre-trained video decoder is not updated and remains the pre-trained video decoder. The pre-trained video decoder can be used to decode the one or more encoded blocks.
图16A-16B示出了第三种情况的示例。例如，一个或多个预训练参数在预训练主编码器网络(911)中。因此，预训练主编码器网络(911)中的一个或多个预训练参数可以被一个或多个替换参数替换，使得预训练视频编码器(1600A)可以被更新为更新的视频编码器(1600A)。预训练主编码器网络(911)也被更新为更新的主编码器网络(911)。在解码器侧，不更新预训练视频解码器(1600B)。Figures 16A-16B illustrate an example of the third case. For example, the one or more pre-trained parameters are in the pre-trained main encoder network (911). Therefore, the one or more pre-trained parameters in the pre-trained main encoder network (911) can be replaced by the one or more replacement parameters, such that the pre-trained video encoder (1600A) can be updated to the updated video encoder (1600A). The pre-trained main encoder network (911) is also updated to the updated main encoder network (911). On the decoder side, the pre-trained video decoder (1600B) is not updated.
在例如第一、第二和第三种情况中描述的各种示例中，视频解码可以由具有不同能力的预训练解码器来执行，包括具有和不具有更新预训练参数的能力的解码器。In various examples, such as those described in the first, second, and third cases, video decoding can be performed by pre-trained decoders with different capabilities, including decoders with and without the ability to update pre-trained parameters.
在一个示例中,与用预训练视频编码器和预训练视频解码器对一个或多个块进行编码相比,通过用更新的视频编码器和/或更新的视频解码器对一个或多个块进行编码,可以提高压缩性能。因此,内容自适应的在线训练方法可以用于使预训练NIC框架(例如,预训练NIC框架(900))适应目标块内容(例如,要传输的一个或多个块),从而微调预训练NIC框架。因此,可以更新编码器侧的视频编码器和/或解码器侧的视频解码器。In one example, compression performance can be improved by encoding one or more blocks with an updated video encoder and/or an updated video decoder compared to encoding one or more blocks with a pre-trained video encoder and a pre-trained video decoder. Therefore, content-adaptive online training methods can be used to adapt a pre-trained NIC framework (e.g., a pre-trained NIC framework (900)) to the target block content (e.g., one or more blocks to be transmitted), thereby fine-tuning the pre-trained NIC framework. Thus, the video encoder on the encoder side and/or the video decoder on the decoder side can be updated.
内容自适应在线训练方法可以用作预处理步骤(例如,预编码步骤),用于提高预训练E2E NIC压缩方法的压缩性能。Content-adaptive online training methods can be used as a preprocessing step (e.g., a precoding step) to improve the compression performance of pre-trained E2E NIC compression methods.
在一个实施例中，一个或多个块包括单个输入块，并且利用该单个输入块执行微调过程。基于单个输入块训练和更新(例如，微调)NIC框架(900)。编码器侧的更新的视频编码器和/或解码器侧的更新的视频解码器可以用于编码单个输入块和可选的其他输入块。神经网络更新信息可以与编码的单个输入块一起被编码到比特流中。In one embodiment, the one or more blocks include a single input block, and the fine-tuning process is performed with the single input block. The NIC framework (900) is trained and updated (e.g., fine-tuned) based on the single input block. The updated video encoder on the encoder side and/or the updated video decoder on the decoder side can be used to code the single input block and, optionally, other input blocks. The neural network update information can be encoded into the bitstream along with the encoded single input block.
在一个实施例中，一个或多个块包括多个输入块，并且利用多个输入块执行微调过程。基于多个输入块训练和更新(例如，微调)NIC框架(900)。编码器侧的更新的视频编码器和/或解码器侧的更新的视频解码器可以用于编码多个输入块和可选的其他输入块。神经网络更新信息可以与编码的多个输入块一起被编码到比特流中。In one embodiment, the one or more blocks include multiple input blocks, and the fine-tuning process is performed with the multiple input blocks. The NIC framework (900) is trained and updated (e.g., fine-tuned) based on the multiple input blocks. The updated video encoder on the encoder side and/or the updated video decoder on the decoder side can be used to code the multiple input blocks and, optionally, other input blocks. The neural network update information can be encoded into the bitstream along with the encoded multiple input blocks.
速率损耗R可以随着比特流中神经网络更新信息的信令而增加。当一个或多个块包括单个输入块时，针对每个编码块信令神经网络更新信息，并且速率损耗R的第一次增加用于指示由于为每个块信令神经网络更新信息而导致的速率损耗R的增加。当一个或多个块包括多个输入块时，神经网络更新信息为多个输入块信令并由多个输入块共享，并且速率损耗R的第二次增加用于指示由于为多个输入块信令神经网络更新信息而导致的速率损耗R的增加。因为神经网络更新信息由多个输入块共享，所以速率损耗R的第二次增加可以小于速率损耗R的第一次增加。因此，在一些示例中，使用多个输入块来微调NIC框架可能是有利的。The rate loss R can increase with the signaling of the neural network update information in the bitstream. When the one or more blocks include a single input block, the neural network update information is signaled for each coded block, and a first increase in the rate loss R indicates the increase in the rate loss R due to signaling the neural network update information for each block. When the one or more blocks include multiple input blocks, the neural network update information is signaled for the multiple input blocks and shared by the multiple input blocks, and a second increase in the rate loss R indicates the increase in the rate loss R due to signaling the neural network update information for the multiple input blocks. Because the neural network update information is shared by the multiple input blocks, the second increase in the rate loss R can be smaller than the first increase in the rate loss R. Therefore, in some examples, using multiple input blocks to fine-tune the NIC framework may be advantageous.
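The amortization argument can be checked with toy numbers; both figures below are assumptions for illustration only, not values from the disclosure.

```python
# Illustrative arithmetic: signaling the neural network update
# information costs a fixed number of bits; sharing one set of update
# information across multiple blocks amortizes that overhead.

update_info_bits = 4000   # assumed cost of signaling the replacement parameters
num_blocks = 16           # assumed number of blocks sharing one update

# Single-block case: the full overhead is paid per block.
per_block_if_signaled_each = update_info_bits

# Multi-block case: the overhead is shared across all blocks.
per_block_if_shared = update_info_bits / num_blocks
```

With these toy numbers the per-block overhead drops from 4000 bits to 250 bits, which is why the second increase in the rate loss R can be smaller than the first.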
在一个实施例中,要更新的一个或多个预训练参数在预训练NIC框架(900)的一个组件中。因此,基于一个或多个替换参数来更新预训练NIC框架(900)的一个组件,并且不更新预训练NIC框架(900)的其他组件。In one embodiment, one or more pre-trained parameters to be updated are in a component of the pre-trained NIC framework (900). Therefore, one component of the pre-trained NIC framework (900) is updated based on one or more replacement parameters, and the other components of the pre-trained NIC framework (900) are not updated.
一个组件可以是预训练上下文模型NN(916)、预训练熵参数NN(917)、预训练主编码器网络(911)、预训练主解码器网络(915)、预训练超级编码器(921)或预训练超级解码器(925)。根据更新预训练NIC框架(900)中的哪个组件,可以更新预训练视频编码器和/或预训练视频解码器。A component can be a pre-trained context model NN (916), a pre-trained entropy parameter NN (917), a pre-trained master encoder network (911), a pre-trained master decoder network (915), a pre-trained super encoder (921), or a pre-trained super decoder (925). The pre-trained video encoder and/or pre-trained video decoder can be updated depending on which component in the pre-trained NIC framework (900) is updated.
在一个示例中,要更新的一个或多个预训练参数在预训练上下文模型NN(916)中,因此更新预训练上下文模型NN(916),而不更新剩余的组件(911)、(915)、(921)、(917)和(925)。在一个示例中,编码器侧的预训练视频编码器和解码器侧的预训练视频解码器包括预训练上下文模型NN(916),因此更新预训练视频编码器和预训练视频解码器。In one example, one or more pre-trained parameters to be updated are in the pre-trained context model NN (916), so the pre-trained context model NN (916) is updated, but the remaining components (911), (915), (921), (917), and (925) are not updated. In another example, the pre-trained video encoder on the encoder side and the pre-trained video decoder on the decoder side include the pre-trained context model NN (916), so the pre-trained video encoder and the pre-trained video decoder are updated.
在一个示例中，要更新的一个或多个预训练参数在预训练超级解码器(925)中，因此更新预训练超级解码器(925)，而不更新其余组件(911)、(915)、(916)、(917)和(921)。因此，不更新预训练视频编码器，而更新预训练视频解码器。In one example, the one or more pre-trained parameters to be updated are in the pre-trained super decoder (925), so the pre-trained super decoder (925) is updated, but the remaining components (911), (915), (916), (917), and (921) are not updated. Therefore, the pre-trained video encoder is not updated, while the pre-trained video decoder is updated.
在一个实施例中，要更新的一个或多个预训练参数在预训练NIC框架(900)的多个组件中。因此，基于一个或多个替换参数来更新预训练NIC框架(900)的多个组件。在一个示例中，预训练NIC框架(900)的多个组件包括配置有神经网络(例如，DNN、CNN)的所有组件。在一个示例中，预训练NIC框架(900)的多个组件包括基于CNN的组件：预训练主编码器网络(911)、预训练主解码器网络(915)、预训练上下文模型NN(916)、预训练熵参数NN(917)、预训练超级编码器(921)和预训练超级解码器(925)。In one embodiment, the one or more pre-trained parameters to be updated are in multiple components of the pre-trained NIC framework (900). Therefore, multiple components of the pre-trained NIC framework (900) are updated based on the one or more replacement parameters. In one example, the multiple components of the pre-trained NIC framework (900) include all components configured with neural networks (e.g., DNN, CNN). In one example, the multiple components of the pre-trained NIC framework (900) include the CNN-based components: the pre-trained main encoder network (911), the pre-trained main decoder network (915), the pre-trained context model NN (916), the pre-trained entropy parameter NN (917), the pre-trained super encoder (921), and the pre-trained super decoder (925).
如上所述,在一个示例中,要更新的一个或多个预训练参数在预训练NIC框架的预训练视频编码器中(900)。在一个示例中,要更新的一个或多个预训练参数在NIC框架(900)的预训练视频解码器中。在一个示例中,要更新的一个或多个预训练参数在预训练NIC框架(900)的预训练视频编码器和预训练视频解码器中。As described above, in one example, one or more pre-trained parameters to be updated are in the pre-trained video encoder (900) of the pre-trained NIC framework. In one example, one or more pre-trained parameters to be updated are in the pre-trained video decoder of the NIC framework (900). In one example, one or more pre-trained parameters to be updated are in both the pre-trained video encoder and the pre-trained video decoder of the pre-trained NIC framework (900).
NIC框架(900)可以基于神经网络，例如，NIC框架(900)中的一个或多个组件可以包括神经网络，例如，CNN、DNN等。如上所述，神经网络可以由不同类型的参数指定，例如，权重、偏置等。NIC框架(900)中的每个基于神经网络的组件(例如，上下文模型NN(916)、熵参数NN(917)、主编码器网络(911)、主解码器网络(915)、超级编码器(921)或超级解码器(925))可以配置有合适的参数，例如，相应的权重、偏置或权重和偏置的组合。当使用CNN时，权重可以包括卷积核中的元素。一种或多种类型的参数可用于指定神经网络。在一个实施例中，要更新的一个或多个预训练参数是偏置项，并且只有偏置项被一个或多个替换参数替换。在一个实施例中，要更新的一个或多个预训练参数是权重，并且只有权重被一个或多个替换参数替换。在一个实施例中，要更新的一个或多个预训练参数包括权重和偏置项，并且包括权重和偏置项的所有预训练参数被一个或多个替换参数替换。在一个实施例中，可以使用其他参数来指定神经网络，并且可以微调其他参数。The NIC framework (900) can be based on a neural network; for example, one or more components in the NIC framework (900) may include a neural network, such as a CNN, DNN, etc. As mentioned above, the neural network can be specified by different types of parameters, such as weights, biases, etc. Each neural network-based component in the NIC framework (900) (e.g., the context model NN (916), the entropy parameter NN (917), the main encoder network (911), the main decoder network (915), the super encoder (921), or the super decoder (925)) can be configured with appropriate parameters, such as corresponding weights, biases, or combinations of weights and biases. When using a CNN, weights may include elements in the convolutional kernel. One or more types of parameters can be used to specify the neural network. In one embodiment, the one or more pre-trained parameters to be updated are bias terms, and only the bias terms are replaced by the one or more replacement parameters. In one embodiment, the one or more pre-trained parameters to be updated are weights, and only the weights are replaced by the one or more replacement parameters. In one embodiment, the one or more pre-trained parameters to be updated include weights and bias terms, and all pre-trained parameters including the weights and bias terms are replaced by the one or more replacement parameters. In one embodiment, other parameters can be used to specify the neural network, and these other parameters can be fine-tuned.
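As a hypothetical illustration of why a bias-only update can be attractive, consider the parameter counts of a single convolutional layer; the kernel size and channel counts below are assumed, not taken from the disclosure.

```python
# Parameter count of one CNN layer: the weights are the elements of the
# convolution kernels, plus one bias per output channel.

def conv_param_counts(kernel_size, in_channels, out_channels):
    weights = kernel_size * kernel_size * in_channels * out_channels
    biases = out_channels
    return weights, biases

# A hypothetical 3x3 convolution with 64 input and 64 output channels.
w, b = conv_param_counts(3, 64, 64)
```

With these assumed sizes the layer has 36864 weights but only 64 biases, so replacing only bias terms leaves far fewer parameters to signal as neural network update information than replacing the weights.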
微调过程可以包括多个阶段(例如,迭代),其中,在迭代微调过程中更新一个或多个预训练参数。当训练损耗变平或即将变平时,微调过程可以停止。在一个示例中,当训练损耗(例如,R-D损耗L)低于第一阈值时,微调过程停止。在一个示例中,当两个连续训练损耗之间的差低于第二阈值时,微调过程停止。The fine-tuning process can include multiple stages (e.g., iterations), during which one or more pre-trained parameters are updated. The fine-tuning process can stop when the training loss flattens out or is about to flatten out. In one example, the fine-tuning process stops when the training loss (e.g., R-D loss L) falls below a first threshold. In another example, the fine-tuning process stops when the difference between two consecutive training losses falls below a second threshold.
两个超参数(例如,步长和最大步数)可以与损耗函数(例如,R-D损耗L)一起用于微调过程。最大迭代次数可以用作终止微调过程的最大迭代次数的阈值。在一个示例中,当迭代次数达到最大迭代次数时,微调过程停止。Two hyperparameters (e.g., step size and maximum number of steps) can be used together with a loss function (e.g., R-D loss L) in the fine-tuning process. The maximum number of iterations can be used as a threshold for the maximum number of iterations to terminate the fine-tuning process. In one example, the fine-tuning process stops when the maximum number of iterations is reached.
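The three termination tests described above (loss below a first threshold, consecutive-loss difference below a second threshold, and a maximum iteration count) can be sketched as a single predicate; the threshold values are illustrative assumptions.

```python
# Minimal sketch of the stopping rule for the fine-tuning process.

LOSS_THRESHOLD = 0.05   # first threshold, on the R-D loss itself (assumed)
FLAT_THRESHOLD = 1e-4   # second threshold, on the consecutive-loss change (assumed)
MAX_ITERATIONS = 1000   # maximum number of iterations (assumed)

def should_stop(prev_loss, curr_loss, iteration):
    """Return True when any of the three stopping conditions holds."""
    if curr_loss < LOSS_THRESHOLD:
        return True  # training loss is below the first threshold
    if prev_loss is not None and abs(prev_loss - curr_loss) < FLAT_THRESHOLD:
        return True  # loss has flattened between consecutive iterations
    return iteration >= MAX_ITERATIONS  # iteration budget exhausted
```

The fine-tuning loop would call `should_stop` once per iteration, passing the previous and current R-D losses.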
步长可以指示在线训练过程(例如,在线微调过程)的学习速率。步长可用于在微调过程中执行的梯度下降算法或反向传播计算。可以使用任何合适的方法来确定步长。The step size can indicate the learning rate during an online training process (e.g., an online fine-tuning process). The step size can be used in gradient descent algorithms or backpropagation calculations performed during fine-tuning. Any suitable method can be used to determine the step size.
图像中每个块的步长可以不同。在一个实施例中，可以为图像分配不同的步长，以便实现更好的压缩结果(例如，更好的R-D损耗L)。The step size of each block in an image can be different. In one embodiment, different step sizes can be assigned for an image to achieve better compression results (e.g., a better R-D loss L).
在一些示例中，基于NIC框架(例如，NIC框架(900))的视频编码器和视频解码器可以直接编码和解码图像。因此，分块内容自适应在线训练方法可以适于通过直接使用一个或多个图像来更新NIC框架中的某些参数，从而更新视频编码器和/或视频解码器中的某些参数。不同的图像可以具有不同的步长，以实现优化的压缩结果。In some examples, video encoders and video decoders based on a NIC framework (e.g., the NIC framework (900)) can directly encode and decode images. Therefore, the block-wise content-adaptive online training method can be adapted to update certain parameters in the NIC framework, and thus certain parameters in the video encoder and/or the video decoder, by directly using one or more images. Different images can have different step sizes to achieve optimized compression results.
在一个实施例中,不同的步长用于具有不同类型内容的块,以获得最佳结果。不同的类型可以指不同的差异。在一个示例中,基于用于更新NIC框架的块的差异来确定步长。例如,具有高差异的块的步长大于具有低差异的块的步长,其中,高差异大于低差异。In one embodiment, different step sizes are used for blocks with different types of content to achieve optimal results. Different types can refer to different differences. In one example, the step size is determined based on the differences of the blocks used to update the NIC framework. For example, the step size for blocks with high differences is greater than the step size for blocks with low differences, where high differences are greater than low differences.
在一个实施例中,基于块或图像的特征,例如,块的RGB差异,来选择步长。在一个实施例中,基于块的RD性能(例如,R-D损耗L)来选择步长。可以基于不同的步长生成多组替换参数,并且可以选择具有更好压缩性能(例如,更小的R-D损耗)的组。In one embodiment, the step size is selected based on features of the block or image, such as the RGB difference of the block. In another embodiment, the step size is selected based on the RD performance of the block (e.g., R-D loss L). Multiple sets of replacement parameters can be generated based on different step sizes, and the set with better compression performance (e.g., lower R-D loss) can be selected.
在一个实施例中,第一步长可以用于运行一定数量(例如,100)的迭代。然后,第二步长(例如,第一步长加上或减去大小增量)可以用于运行一定数量的迭代。可以比较第一步长和第二步长的结果,以确定要使用的步长。可以测试两个以上的步长,来确定最佳步长。In one embodiment, the first step size can be used to run a certain number of iterations (e.g., 100). Then, a second step size (e.g., the first step size plus or minus a size increment) can be used to run a certain number of iterations. The results of the first and second step sizes can be compared to determine the step size to use. More than two step sizes can be tested to determine the optimal step size.
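A minimal sketch of this step-size search, with a toy stand-in for the fine-tuning run; the candidate values and the toy loss function are assumptions for illustration.

```python
# Try several candidate step sizes, run the same (fixed) number of
# fine-tuning iterations with each, and keep the step size whose
# resulting R-D loss is smallest.

def pick_step_size(candidates, run_fine_tuning):
    """run_fine_tuning(step) returns the R-D loss after a fixed number
    of iterations with that step size; the candidate with the smallest
    loss wins."""
    return min(candidates, key=run_fine_tuning)

def toy_loss(step):
    # Stand-in for an actual fine-tuning run; here the loss happens to
    # be minimized near a step size of 0.2.
    return (step - 0.2) ** 2

best = pick_step_size([0.1, 0.2, 0.3], toy_loss)
```

In practice `run_fine_tuning` would perform the actual gradient-descent iterations on the block and return the measured R-D loss.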
在微调过程中,步长可以变化。步长可在微调过程开始时具有初始值,并且可在微调过程的稍后阶段(例如,在一定次数的迭代之后)减小初始值(例如,减半),以实现更精细的调谐。在迭代在线训练期间,步长或学习速率可以由调度器来改变。调度器可以包括用于调整步长的参数调整方法。调度器可以确定步长的值,使得步长可以在多个间隔中增加、减小或保持恒定。在一个示例中,学习速率在每个步骤中由调度器改变。单个调度器或多个不同的调度器可以用于不同的块。因此,可以基于多个调度器生成多组替换参数,并且可以选择多组替换参数中具有更好压缩性能(例如,更小R-D损耗)的一组。During fine-tuning, the step size can vary. The step size may have an initial value at the start of the fine-tuning process and may be reduced (e.g., halved) at a later stage (e.g., after a certain number of iterations) to achieve finer tuning. During iterative online training, the step size or learning rate can be changed by the scheduler. The scheduler may include parameter tuning methods for adjusting the step size. The scheduler can determine the value of the step size such that it can be increased, decreased, or kept constant over multiple intervals. In one example, the learning rate is changed by the scheduler at each step. A single scheduler or multiple different schedulers can be used for different blocks. Therefore, multiple sets of replacement parameters can be generated based on multiple schedulers, and the set with better compression performance (e.g., lower R-D loss) can be selected from among the multiple sets of replacement parameters.
在一个实施例中，为不同的块分配多个学习速率调度，以便实现更好的压缩结果。在一个实施例中，图像中的所有块共享相同的学习速率调度。在一个实施例中，学习速率调度的选择基于块的特性，例如，块的RGB差异。在一个实施例中，学习速率调度的选择基于块的RD性能。In one embodiment, multiple learning rate schedules are assigned to different blocks to achieve better compression results. In one embodiment, all blocks in an image share the same learning rate schedule. In one embodiment, the learning rate schedule is selected based on block characteristics, such as the RGB differences of the blocks. In one embodiment, the learning rate schedule is selected based on the RD performance of the blocks.
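A simple scheduler of the halving kind described above might look like the following sketch; the initial value and halving interval are assumptions, not values from the disclosure.

```python
# A step-size (learning-rate) scheduler that halves the step size after
# every fixed number of iterations, for finer tuning in later stages.

def step_size_schedule(initial, halve_every, iteration):
    """Return the step size to use at `iteration` (0-based)."""
    return initial * (0.5 ** (iteration // halve_every))
```

For example, with an initial step size of 0.1 halved every 100 iterations, iterations 0-99 use 0.1, iterations 100-199 use 0.05, and so on; different blocks could be given different `initial` or `halve_every` values, matching the multi-schedule embodiments above.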
在一个实施例中，不同的块可以用于更新NIC框架中的不同组件(例如，上下文模型NN(916)或超级解码器(925))中的不同参数。例如，第一块用于更新上下文模型NN(916)中的参数，第二块用于更新超级解码器(925)中的参数。In one embodiment, different blocks can be used to update different parameters in different components of the NIC framework (e.g., the context model NN (916) or the super decoder (925)). For example, a first block is used to update parameters in the context model NN (916), and a second block is used to update parameters in the super decoder (925).
在一个实施例中,不同的块可以用于更新NIC框架中不同类型的参数(例如,偏置或权重)。例如,第一块用于更新NIC框架中的一个或多个神经网络中的至少一个偏置,第二块用于更新NIC框架中的一个或多个神经网络中的至少一个权重。In one embodiment, different blocks can be used to update different types of parameters (e.g., biases or weights) in the NIC framework. For example, a first block is used to update at least one bias in one or more neural networks in the NIC framework, and a second block is used to update at least one weight in one or more neural networks in the NIC framework.
在一个实施例中,图像中的多个块(例如,所有块)更新相同的一个或多个参数。In one embodiment, multiple blocks in an image (e.g., all blocks) update the same one or more parameters.
在一个实施例中,基于块的特性,例如,块的RGB差异,选择要更新的一个或多个参数。在一个实施例中,基于块的RD性能来选择要更新的一个或多个参数。In one embodiment, one or more parameters to be updated are selected based on block characteristics, such as the RGB differences of the block. In another embodiment, one or more parameters to be updated are selected based on the RD performance of the block.
在微调过程结束时,可以为相应的一个或多个替换参数计算一个或多个更新的参数。在一个实施例中,计算一个或多个更新的参数,作为一个或多个替换参数和相应的一个或多个预训练参数之间的差。在一个实施例中,一个或多个更新的参数分别是一个或多个替换参数。At the end of the fine-tuning process, one or more updated parameters can be calculated for the corresponding one or more replacement parameters. In one embodiment, one or more updated parameters are calculated as the difference between one or more replacement parameters and the corresponding one or more pre-trained parameters. In another embodiment, the one or more updated parameters are one or more replacement parameters.
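The two relationships between the updated parameters and the replacement parameters can be sketched as follows; the parameter vectors are toy data, not values from the disclosure.

```python
# First embodiment: signal the difference between the replacement
# parameters and the corresponding pre-trained parameters.
def updated_as_difference(replacement, pretrained):
    return [r - p for r, p in zip(replacement, pretrained)]

# Second embodiment: signal the replacement parameters directly.
def updated_as_is(replacement):
    return list(replacement)

# Decoder side of the first embodiment: add the signaled differences
# back to the locally known pre-trained parameters.
def recover_replacement(updated, pretrained):
    return [u + p for u, p in zip(updated, pretrained)]

pretrained = [1.0, -2.0, 0.5]    # toy pre-trained parameters
replacement = [1.25, -2.5, 0.5]  # toy fine-tuned (replacement) parameters
diff = updated_as_difference(replacement, pretrained)
```

Signaling differences can pay off when the fine-tuning changes only a few parameters, since most entries of the difference are then zero and compress well.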
在一个实施例中,可以从一个或多个替换参数中生成一个或多个更新的参数,例如,使用特定的线性或非线性变换,并且一个或多个更新的参数是基于一个或多个替换参数生成的代表性参数。一个或多个替换参数被转换成一个或多个更新的参数,以进行更好的压缩。In one embodiment, one or more updated parameters can be generated from one or more replacement parameters, for example, using a specific linear or nonlinear transformation, and the one or more updated parameters are representative parameters generated based on the one or more replacement parameters. The one or more replacement parameters are transformed into one or more updated parameters for better compression.
一个或多个更新参数的第一子集对应于一个或多个替换参数的第一子集，一个或多个更新参数的第二子集对应于一个或多个替换参数的第二子集。A first subset of the one or more updated parameters corresponds to the first subset of the one or more replacement parameters, and a second subset of the one or more updated parameters corresponds to the second subset of the one or more replacement parameters.
在一个实施例中,不同的块在一个或多个更新的参数和一个或多个替换参数之间具有不同的关系。例如,对于第一块,计算一个或多个更新的参数,作为一个或多个替换参数和相应的一个或多个预训练参数之间的差。对于第二块,一个或多个更新的参数分别是一个或多个替换参数。In one embodiment, different blocks have different relationships between one or more updated parameters and one or more replacement parameters. For example, for a first block, one or more updated parameters are computed as the difference between one or more replacement parameters and the corresponding one or more pre-trained parameters. For a second block, the one or more updated parameters are, respectively, one or more replacement parameters.
在一个实施例中,图像中的多个块(例如,所有块)在一个或多个更新参数和一个或多个替换参数之间具有相同的关系。In one embodiment, multiple blocks in an image (e.g., all blocks) have the same relationship between one or more update parameters and one or more replacement parameters.
在一个实施例中,基于块的特性,例如,块的RGB差异,选择一个或多个更新参数和一个或多个替换参数之间的关系。在一个实施例中,基于块的RD性能来选择一个或多个更新参数和一个或多个替换参数之间的关系。In one embodiment, the relationship between one or more update parameters and one or more replacement parameters is selected based on block characteristics, such as the RGB differences of the block. In another embodiment, the relationship between one or more update parameters and one or more replacement parameters is selected based on the block's RD performance.
在一个示例中,可以压缩一个或多个更新的参数,例如,使用LZMA2、bzip2算法等,该LZMA2是Lempel–Ziv–Markov链算法(LZMA)的变体。在一个示例中,对于一个或多个更新的参数,省略压缩。在一些实施例中,一个或多个更新参数或一个或多个更新参数的第二子集可以作为神经网络更新信息被编码到比特流中,其中,神经网络更新信息指示一个或多个替换参数或一个或多个替换参数的第二子集。In one example, one or more updated parameters may be compressed, for example, using algorithms such as LZMA2 or bzip2, where LZMA2 is a variant of the Lempel–Ziv–Markov chain algorithm (LZMA). In another example, compression is omitted for one or more updated parameters. In some embodiments, one or more updated parameters, or a second subset of one or more updated parameters, may be encoded into a bitstream as neural network update information, wherein the neural network update information indicates one or more replacement parameters, or a second subset of one or more replacement parameters.
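Python's standard library happens to provide both codecs mentioned above (the `lzma` module uses the LZMA2 filter in its default `.xz` container, and `bz2` implements bzip2), so the compression of serialized updated parameters can be sketched as follows; the parameter values are toy data.

```python
import bz2
import lzma
import struct

# Serialize the updated parameters as 32-bit floats, then compress the
# payload with LZMA (LZMA2 filter) or bzip2.

updated_params = [0.0] * 500 + [0.25, -0.125]  # mostly-zero deltas (toy data)
raw = struct.pack(f"{len(updated_params)}f", *updated_params)

lzma_payload = lzma.compress(raw)
bzip2_payload = bz2.compress(raw)

# Round-trip check: decompression must restore the exact bytes.
assert lzma.decompress(lzma_payload) == raw
assert bz2.decompress(bzip2_payload) == raw
```

Because the toy deltas are mostly zeros, both payloads come out much smaller than the raw serialization; an encoder could compare the two payload sizes per block and signal the smaller one, as in the per-block method selection described below.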
在一个实施例中,一个或多个更新参数的压缩方法对于不同的块是不同的。例如,对于第一块,LZMA2用于压缩一个或多个更新的参数,对于第二块,bzip2用于压缩一个或多个更新的参数。在一个实施例中,使用相同的压缩方法来压缩图像中多个块(例如,所有块)的一个或多个更新参数。在一个实施例中,基于块的特性,例如,块的RGB差异,选择压缩方法。在一个实施例中,基于块的RD性能来选择压缩方法。In one embodiment, the compression method for one or more updated parameters is different for different blocks. For example, LZMA2 is used to compress one or more updated parameters for the first block, and bzip2 is used to compress one or more updated parameters for the second block. In one embodiment, the same compression method is used to compress one or more updated parameters for multiple blocks (e.g., all blocks) in the image. In one embodiment, the compression method is selected based on block characteristics, such as the RGB differences of the block. In one embodiment, the compression method is selected based on the RD performance of the block.
在微调过程之后,在一些示例中,可以基于(i)一个或多个替换参数的第一子集或(ii)一个或多个替换参数来更新或微调编码器侧的预训练视频编码器。可以使用更新的视频编码器将输入图像(例如,用于微调过程的一个或多个块之一)编码到比特流中。因此,比特流包括编码块和神经网络更新信息。Following the fine-tuning process, in some examples, the pre-trained video encoder on the encoder side can be updated or fine-tuned based on (i) a first subset of one or more replacement parameters or (ii) one or more replacement parameters. The updated video encoder can be used to encode the input image (e.g., one or more blocks used in the fine-tuning process) into a bitstream. Therefore, the bitstream includes encoded blocks and neural network update information.
如果适用，在一个示例中，神经网络更新信息由预训练视频解码器解码（例如，解压缩），以获得一个或多个更新的参数或一个或多个更新的参数的第二子集。在一个示例中，可以基于一个或多个更新的参数和上述一个或多个替换参数之间的关系来获得一个或多个替换参数或一个或多个替换参数的第二子集。如上所述，可以微调预训练视频解码器，并且可以使用更新的视频解码器来解码该编码块。If applicable, in one example, the neural network update information is decoded (e.g., decompressed) by the pre-trained video decoder to obtain the one or more updated parameters or a second subset of the one or more updated parameters. In one example, the one or more replacement parameters or a second subset of the one or more replacement parameters can be obtained based on the relationship between the one or more updated parameters and the one or more replacement parameters described above. As described above, the pre-trained video decoder can be fine-tuned, and the updated video decoder can be used to decode the coded block.
NIC框架可以包括任何类型的神经网络，并且使用任何基于神经网络的图像压缩方法，例如，上下文超先验编码器-解码器框架（例如，图9B中所示的NIC框架）、尺度超先验编码器-解码器框架、高斯混合似然框架及其变体、基于RNN的递归压缩方法及其变体等。The NIC framework can include any type of neural network and use any neural network-based image compression method, such as a context hyperprior encoder-decoder framework (e.g., the NIC framework shown in Figure 9B), a scale hyperprior encoder-decoder framework, a Gaussian mixture likelihood framework and variants thereof, an RNN-based recurrent compression method and variants thereof, and the like.
与相关的E2E图像压缩方法相比，本公开的内容自适应在线训练方法和设备具有以下优点：利用内容自适应在线训练机制来提高NIC编码效率；使用灵活通用的框架，可以适应各种类型的预训练框架和质量度量。例如，各种类型的预训练框架中的某些预训练参数可以通过对要编码和传输的块进行在线训练来替换。Compared to related E2E image compression methods, the content-adaptive online training methods and apparatuses of this disclosure have the following advantages: a content-adaptive online training mechanism is used to improve the NIC coding efficiency, and the flexible and general framework can accommodate various types of pre-trained frameworks and quality metrics. For example, certain pre-trained parameters in various types of pre-trained frameworks can be replaced by online training on the blocks to be encoded and transmitted.
图19示出了概述根据本公开实施例的过程(1900)的流程图。过程(1900)可用于编码块，例如，原始图像中的块或残差图像中的块。在各种实施例中，过程(1900)由处理电路执行，例如，终端装置(310)、(320)、(330)和(340)中的处理电路、执行视频编码器(1600A)的功能的处理电路、执行视频编码器(1700)的功能的处理电路。在一个示例中，处理电路执行(i)视频编码器(403)、(603)和(703)之一以及(ii)视频编码器(1600A)和视频编码器(1700)之一的功能的组合。在一些实施例中，过程(1900)在软件指令中实现，因此当处理电路执行软件指令时，处理电路执行过程(1900)。该过程开始于(S1901)。在一个示例中，NIC框架基于神经网络。在一个示例中，NIC框架是参考图9B描述的NIC框架(900)。NIC框架可以基于CNN，例如，参考图10-15所描述的。如上所述，视频编码器（例如，(1600A)或(1700)）和相应的视频解码器（例如，(1600B)或(1800)）可以在NIC框架中包括多个组件。预训练基于神经网络的NIC框架，从而预训练视频编码器和视频解码器。过程(1900)进行到(S1910)。Figure 19 shows a flowchart outlining a process (1900) according to an embodiment of the present disclosure. The process (1900) can be used to encode blocks, such as blocks in an original image or blocks in a residual image. In various embodiments, the process (1900) is executed by processing circuitry, such as the processing circuitry in the terminal devices (310), (320), (330), and (340), processing circuitry performing the functions of the video encoder (1600A), or processing circuitry performing the functions of the video encoder (1700). In one example, the processing circuitry performs a combination of the functions of (i) one of the video encoders (403), (603), and (703) and (ii) one of the video encoder (1600A) and the video encoder (1700). In some embodiments, the process (1900) is implemented in software instructions, so that the processing circuitry executes the process (1900) when it executes the software instructions. The process begins at (S1901). In one example, the NIC framework is based on a neural network. In one example, the NIC framework is the NIC framework (900) described with reference to Figure 9B. The NIC framework can be based on CNNs, for example, as described with reference to Figures 10-15. As described above, the video encoder (e.g., (1600A) or (1700)) and the corresponding video decoder (e.g., (1600B) or (1800)) can include multiple components in the NIC framework. The neural network-based NIC framework is pre-trained, thereby pre-training the video encoder and the video decoder. The process (1900) proceeds to (S1910).
在(S1910),基于一个或多个块(或输入块)在NIC框架上执行微调过程。输入块可以是具有任何合适大小的任何合适的块。在一些示例中,输入块包括空间域中的原始图像、自然图像、计算机生成的图像等中的块。In (S1910), a fine-tuning process is performed on the NIC framework based on one or more blocks (or input blocks). The input block can be any suitable block of any appropriate size. In some examples, the input block includes blocks from the original image, natural image, computer-generated image, etc., in the spatial domain.
在一些示例中，输入块包括例如由残差计算器（例如，残差计算器(723)）计算的空间域中的残差数据。各种设备中的组件可以被适当地组合以实现(S1910)；例如，参考图7和图9B，来自残差计算器的残差块可以被馈送到NIC框架中的主编码器网络(911)中。In some examples, the input block includes residual data in the spatial domain, calculated, for example, by a residual calculator (e.g., the residual calculator (723)). Components in various devices can be appropriately combined to implement (S1910); for example, referring to Figures 7 and 9B, the residual block from the residual calculator can be fed into the main encoder network (911) in the NIC framework.
如上所述,NIC框架(例如,预训练NIC框架)中的一个或多个神经网络(例如,一个或多个预训练神经网络)中的一个或多个参数(例如,一个或多个预训练参数)可以分别被更新为一个或多个替换参数。在一个实施例中,在(S1910)中描述的训练过程期间,例如,在每个步骤中,更新一个或多个神经网络中的一个或多个参数。As described above, one or more parameters (e.g., one or more pre-trained parameters) in one or more neural networks (e.g., one or more pre-trained neural networks) within a NIC framework (e.g., a pre-trained NIC framework) can be updated to one or more replacement parameters, respectively. In one embodiment, during the training process described in (S1910), for example, at each step, one or more parameters in one or more neural networks are updated.
在一个实施例中,视频编码器(例如,预训练视频编码器)中的至少一个神经网络配置有一个或多个预训练参数的第一子集,因此视频编码器中的至少一个神经网络可以基于一个或多个替换参数的相应第一子集来更新。在一个示例中,一个或多个替换参数的第一子集包括全部一个或多个替换参数。在一个示例中,当一个或多个预训练参数的第一子集分别被一个或多个替换参数的第一子集替换时,更新视频编码器中的至少一个神经网络。在一个示例中,视频编码器中的至少一个神经网络在微调过程中迭代更新。在一个示例中,一个或多个预训练参数都不包括在视频编码器中,因此视频编码器不更新并且保持预训练视频编码器。In one embodiment, at least one neural network in the video encoder (e.g., a pre-trained video encoder) is configured with a first subset of one or more pre-trained parameters, so that the at least one neural network in the video encoder can be updated based on a corresponding first subset of one or more replacement parameters. In one example, the first subset of one or more replacement parameters includes all one or more replacement parameters. In one example, at least one neural network in the video encoder is updated as the first subset of one or more pre-trained parameters is replaced by the first subset of one or more replacement parameters, respectively. In one example, at least one neural network in the video encoder is iteratively updated during fine-tuning. In one example, one or more pre-trained parameters are not included in the video encoder, so the video encoder is not updated and remains a pre-trained video encoder.
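The subset-replacement behavior described above can be sketched as follows; the flat name-to-value parameter dictionaries and the parameter names are illustrative assumptions:

```python
def update_encoder_parameters(pretrained, replacements, subset):
    """Replace only the named first subset of pre-trained parameters.

    `pretrained` and `replacements` are flat name-to-value dicts and the
    names are hypothetical. Parameters outside `subset` keep their
    pre-trained values, so an empty subset leaves the encoder unchanged.
    """
    updated = dict(pretrained)  # do not mutate the pre-trained copy
    for name in subset:
        updated[name] = replacements[name]
    return updated
```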
在(S1920),可以使用具有至少一个更新的神经网络的视频编码器来编码一个或多个块中的一个,其中,视频编码器配置有一个或多个替换参数的第一子集。在一个示例中,在更新视频编码器中的至少一个神经网络之后,编码一个或多个块中的一个。At (S1920), a video encoder having at least one updated neural network can be used to encode one of one or more blocks, wherein the video encoder is configured with a first subset of one or more replacement parameters. In one example, one of one or more blocks is encoded after updating at least one neural network in the video encoder.
可以适当地修改步骤(S1920)。例如，当一个或多个替换参数都不包括在视频编码器中的至少一个神经网络中时，不更新视频编码器，因此可以使用预训练视频编码器（例如，包括至少一个预训练神经网络的视频编码器）来编码一个或多个块中的一个。Step (S1920) can be modified appropriately. For example, when none of the one or more replacement parameters is included in the at least one neural network in the video encoder, the video encoder is not updated, and thus the pre-trained video encoder (e.g., a video encoder including at least one pre-trained neural network) can be used to encode the one of the one or more blocks.
在(S1930)，指示一个或多个替换参数的第二子集的神经网络更新信息可以被编码到比特流中。在一个示例中，一个或多个替换参数的第二子集将用于更新解码器侧的视频解码器中的至少一个神经网络。可以省略步骤(S1930)；例如，如果一个或多个替换参数的第二子集不包括参数并且比特流中没有信令神经网络更新信息，则不更新视频解码器中的神经网络。At (S1930), neural network update information indicating the second subset of the one or more replacement parameters can be encoded into the bitstream. In one example, the second subset of the one or more replacement parameters is to be used to update at least one neural network in the video decoder on the decoder side. Step (S1930) can be omitted; for example, if the second subset of the one or more replacement parameters includes no parameters and no neural network update information is signaled in the bitstream, then no neural network in the video decoder is updated.
在(S1940),可以传输包括一个或多个块中的编码块和神经网络更新信息的比特流。可以适当地修改步骤(S1940)。例如,如果省略步骤(S1930),则比特流不包括神经网络更新信息。过程(1900)进行到(S1999),并且终止。In (S1940), a bitstream including coded blocks from one or more blocks and neural network update information can be transmitted. Step (S1940) can be modified appropriately. For example, if step (S1930) is omitted, the bitstream will not include neural network update information. The process (1900) proceeds to (S1999) and terminates.
过程(1900)可以适合于各种情况,并且可以相应地调整过程(1900)中的步骤。可以修改、省略、重复和/或组合过程(1900)中的一个或多个步骤。可以使用任何合适的顺序来实现该过程(1900)。可以添加额外的步骤。例如,除了对一个或多个块中的一个进行编码之外,还在(S1920)中对一个或多个块进行编码,并在(S1940)中传输。The process (1900) can be adapted to various situations, and the steps in the process (1900) can be adjusted accordingly. One or more steps in the process (1900) can be modified, omitted, repeated, and/or combined. The process (1900) can be implemented using any suitable order. Additional steps can be added. For example, in addition to encoding one of the one or more blocks, one or more blocks can be encoded in (S1920) and transmitted in (S1940).
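The encoder-side flow (S1910)-(S1940) can be sketched as below; `fine_tune` and `encode_block` are hypothetical callables standing in for the NIC framework's online training step and entropy coding, and the dict-shaped "bitstream" is only illustrative:

```python
def process_1900(blocks, pretrained_params, fine_tune, encode_block):
    """Sketch of the encoder-side flow (S1910)-(S1940).

    `fine_tune` and `encode_block` are hypothetical callables standing in
    for the NIC framework's online training step and entropy coding.
    """
    replacements = fine_tune(blocks, pretrained_params)     # (S1910)
    updated_params = {**pretrained_params, **replacements}  # update encoder
    coded_block = encode_block(blocks[0], updated_params)   # (S1920)
    return {                                                # (S1930)/(S1940)
        "block": coded_block,
        "nn_update": replacements,
    }
```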
在过程(1900)的一些示例中,一个或多个块中的一个被更新的视频编码器编码并在比特流中传输。因为微调过程基于一个或多个块,所以微调过程基于要编码的上下文,因此是基于上下文的。In some examples of process (1900), one of one or more blocks is encoded by an updated video encoder and transmitted in a bitstream. Because the fine-tuning process is based on one or more blocks, it is context-based, as it is based on the context to be encoded.
在一些示例中，神经网络更新信息还指示一个或多个预训练参数的第二子集（或一个或多个替换参数的对应第二子集）是什么参数，使得可以更新视频解码器中的对应预训练参数。神经网络更新信息可以指示一个或多个预训练参数的第二子集的组件信息（例如，(915)）、层信息（例如，第四层DeConv: 5x5 c3 s2）、信道信息（例如，第二信道）等。因此，参考图11，一个或多个替换参数的第二子集包括主解码器网络(915)中DeConv: 5x5 c3 s2的第二信道的卷积核，并且相应地更新预训练主解码器网络(915)中DeConv: 5x5 c3 s2的第二信道的卷积核。在一些示例中，一个或多个预训练参数的第二子集的组件信息（例如，(915)）、层信息（例如，第四层DeConv: 5x5 c3 s2）、信道信息（例如，第二信道）等预先确定并存储在预训练视频解码器中，因此不信令。In some examples, the neural network update information also indicates which parameters form the second subset of the one or more pre-trained parameters (or the corresponding second subset of the one or more replacement parameters), so that the corresponding pre-trained parameters in the video decoder can be updated. The neural network update information may indicate component information (e.g., (915)), layer information (e.g., the fourth layer DeConv: 5x5 c3 s2), channel information (e.g., the second channel), and the like of the second subset of the one or more pre-trained parameters. Thus, referring to Figure 11, the second subset of the one or more replacement parameters includes the convolution kernel of the second channel of DeConv: 5x5 c3 s2 in the main decoder network (915), and accordingly the convolution kernel of the second channel of DeConv: 5x5 c3 s2 in the pre-trained main decoder network (915) is updated. In some examples, the component information (e.g., (915)), layer information (e.g., the fourth layer DeConv: 5x5 c3 s2), channel information (e.g., the second channel), and the like of the second subset of the one or more pre-trained parameters are predetermined and stored in the pre-trained video decoder, and therefore are not signaled.
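The component/layer/channel addressing described above might be modeled as a nested lookup; the dictionary structure and keys here are assumptions for illustration only:

```python
def locate_parameter(model, component, layer, channel):
    """Resolve (component, layer, channel) update information to a kernel.

    The nested-dict model {component: {layer: [kernel per channel]}} is an
    assumption for illustration, e.g. component "915", layer
    "DeConv: 5x5 c3 s2", channel index 1 for the second channel.
    """
    return model[component][layer][channel]

def replace_parameter(model, component, layer, channel, new_kernel):
    """Overwrite the addressed pre-trained kernel with its replacement."""
    model[component][layer][channel] = new_kernel
```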
图20示出了概述根据本公开实施例的过程(2000)的流程图。过程(2000)可以用于编码块的重构。在各种实施例中，过程(2000)由处理电路执行，例如，终端装置(310)、(320)、(330)和(340)中的处理电路、执行视频解码器(1600B)的功能的处理电路、执行视频解码器(1800)的功能的处理电路。在一个示例中，处理电路执行(i)视频解码器(410)、视频解码器(510)和视频解码器(810)中的一个以及(ii)视频解码器(1600B)和视频解码器(1800)中的一个的功能的组合。在一些实施例中，过程(2000)在软件指令中实现，因此当处理电路执行软件指令时，处理电路执行过程(2000)。该过程开始于(S2001)。在一个示例中，NIC框架基于神经网络。在一个示例中，NIC框架是参考图9B描述的NIC框架(900)。NIC框架可以基于CNN，例如参考图10-15所描述的。如上所述，视频解码器（例如，(1600B)或(1800)）可以在NIC框架中包括多个组件。可以对基于神经网络的NIC框架进行预训练。可以用预训练参数对视频解码器进行预训练。过程(2000)进行到(S2010)。Figure 20 shows a flowchart outlining a process (2000) according to an embodiment of the present disclosure. The process (2000) can be used for the reconstruction of a coded block. In various embodiments, the process (2000) is executed by processing circuitry, such as the processing circuitry in the terminal devices (310), (320), (330), and (340), processing circuitry performing the functions of the video decoder (1600B), or processing circuitry performing the functions of the video decoder (1800). In one example, the processing circuitry performs a combination of the functions of (i) one of the video decoder (410), the video decoder (510), and the video decoder (810) and (ii) one of the video decoder (1600B) and the video decoder (1800). In some embodiments, the process (2000) is implemented in software instructions, so that the processing circuitry executes the process (2000) when it executes the software instructions. The process begins at (S2001). In one example, the NIC framework is based on a neural network. In one example, the NIC framework is the NIC framework (900) described with reference to Figure 9B. The NIC framework can be based on CNNs, as described with reference to Figures 10-15. As described above, the video decoder (e.g., (1600B) or (1800)) can include multiple components within the NIC framework. The neural network-based NIC framework can be pre-trained, and the video decoder can be pre-trained with pre-trained parameters. The process (2000) proceeds to (S2010).
在(S2010)，可以解码编码比特流中的第一神经网络更新信息。第一神经网络更新信息可以用于视频解码器中的第一神经网络。第一神经网络可以配置有第一组预训练参数。第一神经网络更新信息可以对应于要重构的图像中的第一块，并且指示与第一组预训练参数中的第一预训练参数对应的第一替换参数。At (S2010), first neural network update information in the coded bitstream can be decoded. The first neural network update information can be for a first neural network in the video decoder. The first neural network can be configured with a first set of pre-trained parameters. The first neural network update information can correspond to a first block in an image to be reconstructed and indicate a first replacement parameter corresponding to a first pre-trained parameter in the first set of pre-trained parameters.
在一个示例中,第一预训练参数是预训练偏置项。In one example, the first pre-trained parameter is the pre-trained bias term.
在一个示例中,第一预训练参数是预训练权重系数。In one example, the first pre-trained parameter is the pre-trained weight coefficient.
在一个实施例中,视频解码器包括多个神经网络。第一神经网络更新信息可以指示多个神经网络中的一个或多个剩余神经网络的更新信息。例如,第一神经网络更新信息还指示多个神经网络中的一个或多个剩余神经网络的一个或多个替换参数。一个或多个替换参数对应于一个或多个剩余神经网络的一个或多个相应的预训练参数。在一个示例中,第一预训练参数和一个或多个预训练参数中的每一个都是相应的预训练偏置项。在一个示例中,第一预训练参数和一个或多个预训练参数中的每一个都是相应的预训练权重系数。在一个示例中,第一预训练参数和一个或多个预训练参数包括多个神经网络中的一个或多个预训练偏置项和一个或多个预训练权重系数。In one embodiment, the video decoder includes multiple neural networks. First neural network update information may indicate update information for one or more remaining neural networks among the multiple neural networks. For example, the first neural network update information may also indicate one or more replacement parameters for one or more remaining neural networks among the multiple neural networks. The one or more replacement parameters correspond to one or more corresponding pre-trained parameters of the one or more remaining neural networks. In one example, each of the first pre-trained parameter and each of the one or more pre-trained parameters is a corresponding pre-trained bias term. In one example, each of the first pre-trained parameter and each of the one or more pre-trained parameters is a corresponding pre-trained weight coefficient. In one example, the first pre-trained parameter and the one or more pre-trained parameters include one or more pre-trained bias terms and one or more pre-trained weight coefficients from the multiple neural networks.
在一个示例中,第一神经网络更新信息指示多个神经网络的子集的更新信息,并且不更新多个神经网络的剩余子集。In one example, the update information of the first neural network indicates the update information of a subset of multiple neural networks, and does not update the remaining subset of multiple neural networks.
在一个示例中，视频解码器是图18中所示的视频解码器(1800)。第一神经网络是主解码器网络(915)。In one example, the video decoder is the video decoder (1800) shown in Figure 18. The first neural network is the main decoder network (915).
在一个示例中，视频解码器是图16B中所示的视频解码器(1600B)。视频解码器中的多个神经网络包括主解码器网络(915)、上下文模型NN(916)、熵参数NN(917)和超解码器(925)。第一神经网络是主解码器网络(915)、上下文模型NN(916)、熵参数NN(917)和超解码器(925)中的一个，例如，上下文模型NN(916)。在一个示例中，第一神经网络更新信息还包括视频解码器中的一个或多个剩余神经网络（例如，主解码器网络(915)、熵参数NN(917)和/或超解码器(925)）的一个或多个替换参数。In one example, the video decoder is the video decoder (1600B) shown in Figure 16B. The multiple neural networks in the video decoder include the main decoder network (915), the context model NN (916), the entropy parameter NN (917), and the hyper decoder (925). The first neural network is one of the main decoder network (915), the context model NN (916), the entropy parameter NN (917), and the hyper decoder (925), for example, the context model NN (916). In one example, the first neural network update information also includes one or more replacement parameters for one or more remaining neural networks in the video decoder (e.g., the main decoder network (915), the entropy parameter NN (917), and/or the hyper decoder (925)).
在(S2020),可以基于第一神经网络更新信息来确定第一替换参数。在一个实施例中,从第一神经网络更新信息中获得更新的参数。在一个示例中,可以通过解压缩(例如,LZMA2或bzip2算法)从第一神经网络更新信息中获得更新的参数。At (S2020), the first replacement parameter can be determined based on the first neural network update information. In one embodiment, the updated parameter is obtained from the first neural network update information. In one example, the updated parameter can be obtained from the first neural network update information by decompression (e.g., LZMA2 or bzip2 algorithm).
在一个示例中,第一神经网络更新信息指示更新的参数是第一替换参数和第一预训练参数之间的差。可以根据更新参数和第一预训练参数的总和来计算第一替换参数。In one example, the first neural network update information indicates that the updated parameter is the difference between the first replacement parameter and the first pre-trained parameter. The first replacement parameter can be calculated based on the sum of the updated parameter and the first pre-trained parameter.
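The sum relationship above can be sketched directly; the parameter names are hypothetical:

```python
def recover_replacements(pretrained, signaled_deltas):
    """Recover each replacement parameter as pretrained + signaled difference.

    Both arguments are name-to-value dicts and the names are hypothetical.
    Only parameters with a signaled delta are replaced.
    """
    return {name: pretrained[name] + delta
            for name, delta in signaled_deltas.items()}
```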
在一个实施例中,第一替换参数被确定为更新的参数。In one embodiment, the first replacement parameter is determined to be the updated parameter.
在一个实施例中,更新的参数是基于编码器侧的第一替换参数生成的代表性参数(例如,使用线性或非线性变换),并且基于代表性参数获得第一替换参数。In one embodiment, the updated parameters are representative parameters generated based on the first replacement parameters on the encoder side (e.g., using a linear or nonlinear transformation), and the first replacement parameters are obtained based on the representative parameters.
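As one hedged example of such a representative parameter, a linear quantization transform could be applied at the encoder and inverted at the decoder; the disclosure leaves the actual (linear or nonlinear) transform open, so the scale below is purely an assumption:

```python
def to_representative(replacement, scale=256.0):
    """Encoder side: map a replacement parameter to a representative value.

    A linear quantization is assumed here purely for illustration; the
    disclosure leaves the actual (linear or nonlinear) transform open.
    """
    return round(replacement * scale)

def from_representative(representative, scale=256.0):
    """Decoder side: invert the assumed transform to approximate the
    replacement parameter (up to a quantization error of 0.5 / scale)."""
    return representative / scale
```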
在(S2030),可以基于第一替换参数更新(或微调)视频解码器中的第一神经网络,例如,通过用第一神经网络中的第一替换参数替换第一预训练参数。如果视频解码器包含多个神经网络,并且第一神经网络更新信息指示多个神经网络的更新信息(例如,额外替换参数),则可更新多个神经网络。例如,第一神经网络更新信息还包括用于视频解码器中的一个或多个剩余神经网络的一个或多个替换参数,并且可以基于一个或多个替换参数来更新一个或多个剩余神经网络。In (S2030), the first neural network in the video decoder can be updated (or fine-tuned) based on the first substitution parameters, for example, by replacing the first pre-trained parameters with the first substitution parameters in the first neural network. If the video decoder contains multiple neural networks, and the first neural network update information indicates update information (e.g., additional substitution parameters) for the multiple neural networks, then the multiple neural networks can be updated. For example, the first neural network update information also includes one or more substitution parameters for one or more remaining neural networks in the video decoder, and the one or more remaining neural networks can be updated based on the one or more substitution parameters.
在(S2040),比特流中的编码的第一块可以由例如基于更新的第一神经网络更新的视频解码器解码。在(S2040)生成的输出块可以是具有任何合适大小的任何合适的块。在一些示例中,输出块是空间域中的重构图像中的重构块。In (S2040), the first block encoded in the bitstream can be decoded, for example, by a video decoder updated based on the updated first neural network. The output block generated in (S2040) can be any suitable block with any appropriate size. In some examples, the output block is a reconstructed block in the reconstructed image in the spatial domain.
在一些示例中，视频解码器的输出块包括空间域中的残差数据，因此可以使用进一步处理来基于输出块生成重构块。例如，重构模块(874)被配置为在空间域中组合残差数据和预测结果（由帧间或帧内预测模块输出），以形成可以是重构图像的一部分的重构块。可以执行额外的适当操作，例如，去块操作等，以提高视觉质量。各种设备中的组件可以适当地组合以实现(S2040)；例如，参考图8和图9，来自视频解码器中的主解码器网络(915)的残差数据和相应的预测结果被馈送到重构模块(874)中，以生成重构图像。In some examples, the output block of the video decoder includes residual data in the spatial domain, and therefore further processing can be used to generate a reconstructed block based on the output block. For example, the reconstruction module (874) is configured to combine, in the spatial domain, the residual data and the prediction results (output by the inter-frame or intra-frame prediction module) to form a reconstructed block that can be part of a reconstructed image. Additional suitable operations, such as deblocking operations, can be performed to improve visual quality. Components in various devices can be appropriately combined to implement (S2040); for example, referring to Figures 8 and 9, the residual data from the main decoder network (915) in the video decoder and the corresponding prediction results are fed into the reconstruction module (874) to generate the reconstructed image.
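The combine step of the reconstruction module can be sketched as an element-wise addition with clipping to the sample range; the list-of-lists block layout and 8-bit default range are assumptions:

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Combine prediction and decoded residual data in the spatial domain.

    Minimal sketch of the reconstruction module's combine step; the
    list-of-lists block layout and 8-bit default range are assumptions.
    """
    max_value = (1 << bit_depth) - 1
    return [[max(0, min(max_value, p + r)) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residual)]
```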
在一个示例中，比特流还包括用于确定解码编码块的上下文模型的一个或多个编码比特。视频解码器可以包括主解码器网络（例如，(915)）、上下文模型网络（例如，(916)）、熵参数网络（例如，(917)）和超解码器网络（例如，(925)）。第一神经网络是主解码器网络、上下文模型网络、熵参数网络和超解码器网络中的一个。可以使用超解码器网络来解码一个或多个编码比特。可以使用上下文模型网络和熵参数网络，基于上下文模型网络可用的编码块的解码比特和量化潜在表示，来确定熵模型（例如，上下文模型）。可以使用主解码器网络和熵模型对编码块进行解码。In one example, the bitstream also includes one or more coded bits used to determine a context model for decoding the coded block. The video decoder may include a main decoder network (e.g., (915)), a context model network (e.g., (916)), an entropy parameter network (e.g., (917)), and a hyper decoder network (e.g., (925)). The first neural network is one of the main decoder network, the context model network, the entropy parameter network, and the hyper decoder network. The hyper decoder network can be used to decode the one or more coded bits. The context model network and the entropy parameter network can be used to determine an entropy model (e.g., the context model) based on the decoded bits and the quantized latents of the coded block available to the context model network. The coded block can be decoded using the main decoder network and the entropy model.
过程(2000)进行到(S2099),并终止。The process (2000) proceeds to (S2099) and terminates.
过程(2000)可以适合于各种情况,并且可以相应地调整过程(2000)中的步骤。可以修改、省略、重复和/或组合过程(2000)中的一个或多个步骤。可以使用任何合适的顺序来实现该过程(2000)。可以添加额外的步骤。The process (2000) can be adapted to various situations, and the steps in the process (2000) can be adjusted accordingly. One or more steps in the process (2000) can be modified, omitted, repeated, and/or combined. The process (2000) can be implemented using any suitable order. Additional steps can be added.
在一个示例中，在(S2040)，基于为第一块更新的第一神经网络来解码编码比特流中的另一块。In one example, at (S2040), another block in the coded bitstream is decoded based on the first neural network updated for the first block.
在一个示例中，在(S2010)，解码编码比特流中用于视频解码器中的第二神经网络的第二神经网络更新信息。第二神经网络配置有第二组预训练参数。第二神经网络更新信息对应于要重构的图像中的第二块，并且指示对应于第二组预训练参数中的第二预训练参数的第二替换参数。第二神经网络（例如，上下文模型NN(916)）可以不同于第一神经网络（例如，主解码器网络(915)）。在(S2030)，可以基于第二替换参数更新视频解码器中的第二神经网络。在(S2040)，可以基于为第二块更新的第二神经网络来解码第二块。在一个示例中，第一预训练参数是预训练权重系数和预训练偏置项中的一个。在一个示例中，第二预训练参数是预训练权重系数和预训练偏置项中的另一个。In one example, at (S2010), second neural network update information in the coded bitstream is decoded for a second neural network in the video decoder. The second neural network is configured with a second set of pre-trained parameters. The second neural network update information corresponds to a second block in the image to be reconstructed and indicates a second replacement parameter corresponding to a second pre-trained parameter in the second set of pre-trained parameters. The second neural network (e.g., the context model NN (916)) can be different from the first neural network (e.g., the main decoder network (915)). At (S2030), the second neural network in the video decoder can be updated based on the second replacement parameter. At (S2040), the second block can be decoded based on the second neural network updated for the second block. In one example, the first pre-trained parameter is one of a pre-trained weight coefficient and a pre-trained bias term. In one example, the second pre-trained parameter is the other of the pre-trained weight coefficient and the pre-trained bias term.
本公开中的实施例可以单独使用或者以任何顺序组合使用。此外,方法(或实施例)、编码器和解码器中的每一个可以由处理电路(例如,一个或多个处理器或一个或多个集成电路)来实现。在一个示例中,一个或多个处理器执行存储在非暂时性计算机可读介质中的程序。The embodiments in this disclosure can be used individually or in any combination in any order. Furthermore, each of the methods (or embodiments), encoders, and decoders can be implemented by processing circuitry (e.g., one or more processors or one or more integrated circuits). In one example, one or more processors execute a program stored on a non-transitory computer-readable medium.
本公开没有对用于编码器(例如,基于神经网络的编码器)、解码器(例如,基于神经网络的解码器)的方法施加任何限制。编码器、解码器等中使用的神经网络可以是任何合适类型的神经网络,例如,DNN、CNN等。This disclosure does not impose any limitations on the methods used for encoders (e.g., neural network-based encoders), decoders (e.g., neural network-based decoders). The neural network used in the encoder, decoder, etc., can be any suitable type of neural network, such as DNN, CNN, etc.
因此,本公开的内容自适应在线训练方法可以适应不同类型的NIC框架,例如,不同类型的编码DNN、解码DNN、编码CNN、解码CNN等。Therefore, the adaptive online training method disclosed herein can be adapted to different types of NIC frameworks, such as different types of encoding DNN, decoding DNN, encoding CNN, decoding CNN, etc.
上述技术可以被实现为使用计算机可读指令的计算机软件,并且物理地存储在一个或多个计算机可读介质中。例如,图21示出了适于实现所公开主题的某些实施例的计算机系统(2100)。The above-described techniques can be implemented as computer software using computer-readable instructions and physically stored in one or more computer-readable media. For example, Figure 21 illustrates a computer system (2100) suitable for implementing certain embodiments of the disclosed subject matter.
计算机软件可以使用任何合适的机器代码或计算机语言来编码,其可以经受汇编、编译、链接或类似机制来创建包括指令的代码,这些指令可以由一个或多个计算机中央处理单元(CPU)、图形处理单元(GPU)等直接执行,或者通过解释、微代码执行等来执行。Computer software can be encoded using any suitable machine code or computer language, and can be assembled, compiled, linked or similarly to create code containing instructions that can be executed directly by one or more computer central processing units (CPUs), graphics processing units (GPUs), etc., or executed through interpretation, microcode execution, etc.
指令可以在各种类型的计算机或其组件上执行,包括例如个人计算机、平板计算机、服务器、智能手机、游戏装置、物联网装置等。The instructions can be executed on various types of computers or their components, including, for example, personal computers, tablet computers, servers, smartphones, gaming devices, Internet of Things devices, etc.
图21中所示的计算机系统(2100)的组件本质上是示例性的,并且不旨在对实现本公开的实施例的计算机软件的使用范围或功能提出任何限制。组件的配置也不应被解释为对计算机系统(2100)的示例性实施例中所示的任何一个组件或组件组合有任何依赖性或要求。The components of the computer system (2100) shown in Figure 21 are exemplary in nature and are not intended to impose any limitation on the scope or functionality of computer software used to implement embodiments of this disclosure. The configuration of the components should also not be construed as having any dependency or requirement on any component or combination of components shown in the exemplary embodiments of the computer system (2100).
计算机系统(2100)可以包括某些人机接口输入装置。这种人机接口输入装置可以响应一个或多个人类用户通过例如触觉输入(例如:击键、滑动、数据手套移动)、音频输入(例如:语音、鼓掌)、视觉输入(例如:手势)、嗅觉输入(未示出)进行的输入。人机接口装置还可以用于捕捉不一定与人的有意识输入直接相关的某些媒体,例如,音频(例如:语音、音乐、环境声音)、图像(例如:扫描图像、从静止图像相机获得的照片图像)、视频(例如,二维视频、包括立体视频的三维视频)。The computer system (2100) may include certain human-machine interface input devices. Such human-machine interface input devices may respond to input from one or more human users via, for example, tactile input (e.g., keystrokes, swipes, data glove movements), audio input (e.g., voice, clapping), visual input (e.g., gestures), and olfactory input (not shown). The human-machine interface devices may also be used to capture certain media that are not necessarily directly related to conscious human input, such as audio (e.g., speech, music, ambient sounds), images (e.g., scanned images, photographic images obtained from still image cameras), and video (e.g., two-dimensional video, three-dimensional video including stereoscopic video).
输入人机接口装置可以包括以下一个或多个(每个仅描绘了一个):键盘(2101)、鼠标(2102)、轨迹板(2103)、触摸屏(2110)、数据手套(未示出)、操纵杆(2105)、麦克风(2106)、扫描仪(2107)、相机(2108)。The input human-machine interface device may include one or more of the following (only one of each is depicted): keyboard (2101), mouse (2102), trackpad (2103), touch screen (2110), data glove (not shown), joystick (2105), microphone (2106), scanner (2107), and camera (2108).
计算机系统(2100)还可以包括某些人机接口输出装置。这种人机接口输出装置可以通过例如触觉输出、声音、光和气味/味道来刺激一个或多个人类用户的感觉。这种人机接口输出装置可以包括触觉输出装置(例如,通过触摸屏(2110)、数据手套(未示出)或操纵杆(2105)的触觉反馈,但是也可以有不用作输入装置的触觉反馈装置)、音频输出装置(例如:扬声器(2109)、耳机(未示出))、视觉输出装置(例如,屏幕(2110),包括CRT屏幕、LCD屏幕、等离子屏幕、OLED屏幕,每个都具有或不具有触摸屏输入能力,每个都具有或不具有触觉反馈能力——其中一些能够通过诸如立体输出之类的方式输出二维视觉输出或多于三维的输出;虚拟现实眼镜(未示出)、全息显示器和烟雾箱(未示出))以及打印机(未示出)。The computer system (2100) may also include certain human-machine interface output devices. Such human-machine interface output devices may stimulate the senses of one or more human users through, for example, tactile output, sound, light, and smell/taste. Such human-machine interface output devices may include tactile output devices (e.g., tactile feedback via a touchscreen (2110), data gloves (not shown), or joystick (2105), but may also include tactile feedback devices that are not used as input devices), audio output devices (e.g., speakers (2109), headphones (not shown)), visual output devices (e.g., screens (2110), including CRT screens, LCD screens, plasma screens, OLED screens, each with or without touchscreen input capability, each with or without tactile feedback capability—some of which are capable of outputting two-dimensional or more than three-dimensional visual output in a manner such as stereoscopic output; virtual reality glasses (not shown), holographic displays, and smoke boxes (not shown)), and printers (not shown).
计算机系统(2100)还可以包括人类可访问的存储装置及其相关联的介质,例如,包括具有CD/DVD或类似介质(2121)的CD/DVD ROM/RW(2120)的光学介质、拇指驱动器(2122)、可移动硬盘驱动器或固态驱动器(2123)、诸如磁带和软盘(未示出)之类的传统磁介质、诸如安全加密狗(未示出)之类的专用ROM/ASIC/PLD装置等。The computer system (2100) may also include human-accessible storage devices and their associated media, such as optical media including CD/DVD ROM/RW (2120) having CD/DVD or similar media (2121), thumb drives (2122), removable hard disk drives or solid-state drives (2123), conventional magnetic media such as magnetic tapes and floppy disks (not shown), dedicated ROM/ASIC/PLD devices such as security dongles (not shown), etc.
本领域技术人员还应该理解,结合当前公开的主题使用的术语“计算机可读介质”不包括传输介质、载波或其他瞬时信号。Those skilled in the art should also understand that the term "computer-readable medium" as used in connection with the presently disclosed subject matter does not include transmission media, carrier waves, or other transient signals.
计算机系统(2100)还可以包括到一个或多个通信网络(2155)的接口(2154)。网络例如可以是无线的、有线的、光学的。网络还可以是局域的、广域的、大都市的、车辆的和工业的、实时的、延迟容忍的等。网络的示例包括诸如以太网、无线LAN之类的局域网，包括GSM、3G、4G、5G、LTE等的蜂窝网络，包括有线电视、卫星电视和地面广播电视的电视有线或无线广域数字网络，包括CANBus的车辆和工业网络等。某些网络通常需要连接到某些通用数据端口或外围总线(2149)（例如，计算机系统(2100)的USB端口）的外部网络接口适配器；其他网络通常通过连接到如下所述的系统总线而集成到计算机系统(2100)的核心中（例如，PC计算机系统中的以太网接口或智能电话计算机系统中的蜂窝网络接口）。使用这些网络中的任何一个，计算机系统(2100)可以与其他实体通信。这种通信可以是单向的、只接收的（例如，广播电视），单向的、只发送的（例如，到某些CANbus装置的CANbus），或者是双向的，例如，到使用局域或广域数字网络的其他计算机系统。如上所述，某些协议和协议栈可以用在这些网络和网络接口的每一个上。The computer system (2100) may also include an interface (2154) to one or more communication networks (2155). The networks may be, for example, wireless, wired, or optical. The networks may also be local-area, wide-area, metropolitan, vehicular and industrial, real-time, delay-tolerant, and so on. Examples of networks include local area networks such as Ethernet and wireless LANs; cellular networks including GSM, 3G, 4G, 5G, LTE, and the like; television wired or wireless wide-area digital networks including cable television, satellite television, and terrestrial broadcast television; and vehicular and industrial networks including CANBus. Certain networks typically require external network interface adapters attached to certain general-purpose data ports or peripheral buses (2149) (e.g., a USB port of the computer system (2100)); others are typically integrated into the core of the computer system (2100) by attachment to a system bus as described below (e.g., an Ethernet interface in a PC computer system or a cellular network interface in a smartphone computer system). Using any of these networks, the computer system (2100) can communicate with other entities. Such communication can be unidirectional and receive-only (e.g., broadcast television), unidirectional and send-only (e.g., CANbus to certain CANbus devices), or bidirectional, for example, to other computer systems using local-area or wide-area digital networks. As described above, certain protocols and protocol stacks can be used on each of those networks and network interfaces.
The aforementioned human interface devices, human-accessible storage devices, and network interfaces can be attached to the core (2140) of the computer system (2100).
The core (2140) can include one or more Central Processing Units (CPUs) (2141), Graphics Processing Units (GPUs) (2142), specialized programmable processing units in the form of Field Programmable Gate Arrays (FPGAs) (2143), hardware accelerators (2144) for certain tasks, graphics adapters (2150), and so forth. These devices, along with Read-Only Memory (ROM) (2145), Random Access Memory (2146), and internal mass storage (2147) such as internal non-user-accessible hard drives and SSDs, can be connected through a system bus (2148). In some computer systems, the system bus (2148) can be accessible in the form of one or more physical plugs to enable extension by additional CPUs, GPUs, and the like. Peripheral devices can be attached either directly to the core's system bus (2148) or through a peripheral bus (2149). In one example, the screen (2110) can be connected to the graphics adapter (2150). Architectures for a peripheral bus include PCI, USB, and the like.
The CPUs (2141), GPUs (2142), FPGAs (2143), and accelerators (2144) can execute certain instructions that, in combination, can make up the aforementioned computer code. That computer code can be stored in ROM (2145) or RAM (2146). Transitional data can also be stored in RAM (2146), whereas permanent data can be stored, for example, in the internal mass storage (2147). Fast storage and retrieval to any of the memory devices can be enabled through the use of cache memory, which can be closely associated with one or more of the CPUs (2141), GPUs (2142), mass storage (2147), ROM (2145), RAM (2146), and the like.
The computer-readable media can have computer code thereon for performing various computer-implemented operations. The media and computer code can be those specially designed and constructed for the purposes of the present disclosure, or they can be of the kind well known and available to those having skill in the computer software arts.
As an example and not by way of limitation, the computer system having architecture (2100), and specifically the core (2140), can provide functionality as a result of processors (including CPUs, GPUs, FPGAs, accelerators, and the like) executing software embodied in one or more tangible, computer-readable media. Such computer-readable media can be media associated with user-accessible mass storage as introduced above, as well as certain storage of the core (2140) that is of a non-transitory nature, such as the core-internal mass storage (2147) or ROM (2145). The software implementing various embodiments of the present disclosure can be stored in such devices and executed by the core (2140). A computer-readable medium can include one or more memory devices or chips, according to particular needs. The software can cause the core (2140), and specifically the processors therein (including CPUs, GPUs, FPGAs, and the like), to execute particular processes or particular parts of particular processes described herein, including defining data structures stored in RAM (2146) and modifying such data structures according to the processes defined by the software. In addition or as an alternative, the computer system can provide functionality as a result of logic hardwired or otherwise embodied in a circuit (for example, the accelerator (2144)), which can operate in place of or together with software to execute particular processes or particular parts of particular processes described herein. Reference to software can encompass logic, and vice versa, where appropriate. Reference to a computer-readable medium can encompass a circuit (such as an integrated circuit (IC)) storing software for execution, a circuit embodying logic for execution, or both, where appropriate. The present disclosure encompasses any suitable combination of hardware and software.
Appendix A: Abbreviations
JEM: Joint Exploration Model
VVC: Versatile Video Coding
BMS: Benchmark Set
MV: Motion Vector
HEVC: High Efficiency Video Coding
SEI: Supplementary Enhancement Information
VUI: Video Usability Information
GOP: Groups of Pictures
TU: Transform Units
PU: Prediction Units
CTU: Coding Tree Units
CTB: Coding Tree Blocks
PB: Prediction Blocks
HRD: Hypothetical Reference Decoder
SNR: Signal-to-Noise Ratio
CPU: Central Processing Units
GPU: Graphics Processing Units
CRT: Cathode Ray Tube
LCD: Liquid-Crystal Display
OLED: Organic Light-Emitting Diode
CD: Compact Disc
DVD: Digital Video Disc
ROM: Read-Only Memory
RAM: Random Access Memory
ASIC: Application-Specific Integrated Circuit
PLD: Programmable Logic Device
LAN: Local Area Network
GSM: Global System for Mobile communications
LTE: Long-Term Evolution
CANBus: Controller Area Network Bus
USB: Universal Serial Bus
PCI: Peripheral Component Interconnect
FPGA: Field Programmable Gate Arrays
SSD: Solid-State Drive
IC: Integrated Circuit
CU: Coding Unit
NIC: Neural Image Compression
R-D: Rate-Distortion
E2E: End-to-End
ANN: Artificial Neural Network
DNN: Deep Neural Network
CNN: Convolutional Neural Network
While this disclosure has described several exemplary embodiments, there are alterations, permutations, and various substitute equivalents that fall within the scope of the disclosure. It will thus be appreciated that those skilled in the art will be able to devise numerous systems and methods that, although not explicitly shown or described herein, embody the principles of the disclosure and are thus within its spirit and scope.
Claims (17)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US63/182,366 | 2021-04-30 | | |
| US17/729,978 | 2022-04-26 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK40080593A HK40080593A (en) | 2023-05-05 |
| HK40080593B true HK40080593B (en) | 2025-11-21 |