WO2025153016A1 - Method, apparatus, and medium for visual data processing - Google Patents
Method, apparatus, and medium for visual data processing
- Publication number
- WO2025153016A1 (PCT/CN2025/072729)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- visual data
- codestream
- indication
- level
- tool
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/70—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
Definitions
- a method for visual data processing comprises: performing a conversion between visual data and a codestream of the visual data with a neural network (NN) -based model, wherein the codestream comprises a first indication indicating a stream profile to which the codestream conforms, and at least one second indication indicating at least one decoder profile to which the codestream conforms.
- NN neural network
- the proposed method can provide a mechanism for signaling of information regarding stream profiles and decoder profiles, and thus can better support the application of different stream profiles and different decoder profiles. Thereby, the coding flexibility can be improved.
- a non-transitory computer-readable storage medium stores instructions that cause a processor to perform a method in accordance with the first aspect of the present disclosure.
- Fig. 1A is a block diagram illustrating an example visual data coding system, in accordance with some embodiments of the present disclosure
- Fig. 1B is a schematic diagram illustrating an example transform coding scheme
- Fig. 5 illustrates an example encoding process
- Fig. 8 illustrates an example interaction of the stream profile and the decoder profile
- references in the present disclosure to “one embodiment, ” “an embodiment, ” “an example embodiment, ” and the like indicate that the embodiment described may include a particular feature, structure, or characteristic, but it is not necessary that every embodiment include the particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an example embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments, whether or not explicitly described.
- Fig. 1A is a block diagram that illustrates an example visual data coding system 100 that may utilize the techniques of this disclosure.
- the visual data coding system 100 may include a source device 110 and a destination device 120.
- the source device 110 can be also referred to as a visual data encoding device, and the destination device 120 can be also referred to as a visual data decoding device.
- the source device 110 can be configured to generate encoded visual data and the destination device 120 can be configured to decode the encoded visual data generated by the source device 110.
- the source device 110 may include a visual data source 112, a visual data encoder 114, and an input/output (I/O) interface 116.
- I/O input/output
- the visual data source 112 may include a source such as a visual data capture device.
- Examples of the visual data capture device include, but are not limited to, an interface to receive visual data from a visual data provider, a computer graphics system for generating visual data, and/or a combination thereof.
- the visual data may comprise one or more pictures of a video or one or more images.
- the visual data encoder 114 encodes the visual data from the visual data source 112 to generate a bitstream.
- the bitstream may include a sequence of bits that form a coded representation of the visual data.
- the bitstream may include coded pictures and associated visual data.
- the coded picture is a coded representation of a picture.
- the associated visual data may include sequence parameter sets, picture parameter sets, and other syntax structures.
- the I/O interface 116 may include a modulator/demodulator and/or a transmitter.
- the encoded visual data may be transmitted directly to destination device 120 via the I/O interface 116 through the network 130A.
- the encoded visual data may also be stored onto a storage medium/server 130B for access by destination device 120.
- the destination device 120 may include an I/O interface 126, a visual data decoder 124, and a display device 122.
- the I/O interface 126 may include a receiver and/or a modem.
- the I/O interface 126 may acquire encoded visual data from the source device 110 or the storage medium/server 130B.
- the visual data decoder 124 may decode the encoded visual data.
- the display device 122 may display the decoded visual data to a user.
- the display device 122 may be integrated with the destination device 120, or may be external to the destination device 120, in which case the destination device 120 is configured to interface with an external display device.
- the visual data encoder 114 and the visual data decoder 124 may operate according to a visual data coding standard, such as a video coding standard or a still picture coding standard, and other current and/or further standards.
- This disclosure is related to neural network (NN) -based image and video coding. Specifically, it is related to definitions of profiles, levels and versions of a neural network based image or video codec.
- the ideas may be applied individually or in various combinations, for image and/or video coding methods and specifications.
- Image/video compression (also referred to as image/video coding) usually refers to the computing technology that compresses image/video into binary code to facilitate storage and transmission.
- the binary codes may or may not support lossless reconstruction of the original image/video; the two cases are termed lossless compression and lossy compression, respectively.
- Neural network-based video compression comes in two flavors: neural network-based coding tools and end-to-end neural network-based video compression.
- the former is embedded into existing classical video codecs as coding tools and only serves as part of the framework, while the latter is a separate framework developed based on neural networks without depending on classical video codecs.
- a series of classical video coding standards have been developed to accommodate the increasing visual content.
- the international standardization organization ISO/IEC has two expert groups, namely the Joint Photographic Experts Group (JPEG) and the Moving Picture Experts Group (MPEG) , and ITU-T also has its own Video Coding Experts Group (VCEG) , which is for standardization of image/video coding technology.
- JPEG Joint Photographic Experts Group
- MPEG Moving Picture Experts Group
- VCEG Video Coding Experts Group
- Deep learning eliminates the necessity of handcrafted representations, and is thus regarded as useful especially for processing natively unstructured data, such as acoustic and visual signals, whereas processing such data has been a longstanding difficulty in the artificial intelligence field.
- Neural networks for image and video compression: existing neural networks for image compression methods can be classified into two categories, i.e., pixel probability modeling and auto-encoder. The former belongs to the predictive coding strategy, while the latter is the transform-based solution. Sometimes, these two methods are combined together in the literature.
- In the random access case, decoding is required to be able to start from any point of the sequence; typically, the entire sequence is divided into multiple individual segments and each segment can be decoded independently. The low-latency case aims at reducing decoding time, so usually only temporally previous frames can be used as reference frames to decode subsequent frames. Pixel probability modeling: according to Shannon’s information theory, the optimal method for lossless coding can reach the minimal coding rate -log2 p(x), where p(x) is the probability of symbol x. A number of lossless coding methods were developed in the literature, and among them arithmetic coding is believed to be among the optimal ones.
- arithmetic coding ensures that the coding rate is as close as possible to its theoretical limit -log2 p(x), without considering the rounding error. Therefore, the remaining problem is how to determine the probability, which is however very challenging for natural images/video due to the curse of dimensionality.
- one way to model p(x), where x is an image, is to predict the pixel probabilities one by one in raster scan order based on previous observations.
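- The raster-scan factorization p(x) = p(x_1) · p(x_2 | x_1) · ... can be made concrete with a small sketch. The snippet below computes the ideal code length -Σ log2 p(x_i | x_<i) for a hypothetical per-pixel probability model `predict_prob`; that callable is an illustrative assumption, not a model from this disclosure.

```python
import numpy as np

def ideal_code_length_bits(image, predict_prob):
    """Total ideal code length -sum(log2 p(x_i | x_<i)) for an
    autoregressive pixel-probability model, scanned in raster order.
    `predict_prob(context, pixel)` is a hypothetical callable returning
    the model's probability of `pixel` given previously seen pixels."""
    h, w = image.shape
    total_bits = 0.0
    context = []
    for i in range(h):
        for j in range(w):
            p = predict_prob(context, image[i, j])  # p(x_ij | previous pixels)
            total_bits += -np.log2(max(p, 1e-12))   # guard against log(0)
            context.append(image[i, j])
    return total_bits

# Usage with a trivial (uniform) model on an 8-bit grayscale image:
uniform = lambda ctx, px: 1.0 / 256.0
img = np.random.randint(0, 256, size=(8, 8))
print(ideal_code_length_bits(img, uniform))  # 64 pixels * 8 bits = 512.0
```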
- the prototype auto-encoder for image compression is illustrated in Fig. 1B, which can be regarded as a transform coding strategy.
- the synthesis network will inversely transform the quantized latent representation back to obtain the reconstructed image
- the framework is trained with the rate-distortion loss function, i.e., L = D + λR, where D is the distortion between x and the reconstruction, R is the rate calculated or estimated from the quantized representation, and λ is the Lagrange multiplier. It should be noted that D can be calculated in either the pixel domain or the perceptual domain.
- the encoder subnetwork transforms the image vector x using a parametric analysis transform into a latent representation y, which is then quantized. Because the quantized latent is discrete-valued, it can be losslessly compressed using entropy coding techniques such as arithmetic coding and transmitted as a sequence of bits.
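- As a minimal sketch of this analysis-quantization-synthesis pipeline and the loss L = D + λR, the toy PyTorch model below uses straight-through rounding in place of real quantization; the layer sizes, channel count, and the way the rate term is supplied are illustrative assumptions, not the design of the codec described here.

```python
import torch
import torch.nn as nn

class TinyAutoEncoder(nn.Module):
    """Minimal sketch: analysis transform g_a, training-time quantization,
    synthesis transform g_s. Layer sizes are illustrative only."""
    def __init__(self, ch=64):
        super().__init__()
        self.g_a = nn.Sequential(nn.Conv2d(3, ch, 5, 2, 2), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 5, 2, 2))
        self.g_s = nn.Sequential(nn.ConvTranspose2d(ch, ch, 5, 2, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(ch, 3, 5, 2, 2, 1))

    def forward(self, x):
        y = self.g_a(x)                            # latent representation y
        y_hat = y + (torch.round(y) - y).detach()  # straight-through rounding
        x_hat = self.g_s(y_hat)                    # reconstruction
        return x_hat, y_hat

def rd_loss(x, x_hat, rate_bits, lam=0.01):
    """Rate-distortion loss L = D + lambda * R (distortion in pixel domain)."""
    D = torch.mean((x - x_hat) ** 2)
    return D + lam * rate_bits

# Usage (rate_bits would normally come from an entropy model):
x = torch.rand(1, 3, 64, 64)
x_hat, y_hat = TinyAutoEncoder()(x)
loss = rd_loss(x, x_hat, rate_bits=torch.tensor(0.5))
```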
- the image compression network is depicted in Fig. 3.
- the left-hand side of the model is the encoder g_a and decoder g_s .
- the right-hand side contains the additional hyper encoder h_a and hyper decoder h_s networks that are used to obtain the estimated standard deviations (scales) .
- the encoder subjects the input image x to g_a , yielding the responses y with spatially varying standard deviations.
- the responses y are fed into h_a , summarizing the distribution of standard deviations in z. z is then quantized, compressed, and transmitted as side information.
- Fig. 4 is a schematic diagram illustrating an example combined model configured to jointly optimize a context model along with a hyperprior and the autoencoder.
- the following table illustrates the meaning of different symbols. Table – Illustration of symbols
- a joint architecture can be utilized where both a hyper prior model subnetwork (hyper encoder and hyper decoder) and a context model subnetwork are utilized.
- the hyper prior and the context model are combined to learn a probabilistic model over the quantized latents, which is then used for entropy coding, as depicted in Fig. 4.
- the interpolated gain vector can be obtained via the following equations:
- m_v = [ (m_r)^l ⊙ (m_t)^(1-l) ]
- m′_v = [ (m′_r)^l ⊙ (m′_t)^(1-l) ]
- where l ∈ R is an interpolation coefficient, which controls the corresponding bit rate of the generated gain vector pair. Since l is a real number, an arbitrary bit rate between the given two gain vector pairs can be achieved.
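- A small numpy sketch of this element-wise exponential interpolation is given below; the example gain-vector values are made up for illustration.

```python
import numpy as np

def interpolate_gain(m_r, m_t, l):
    """Gain-vector interpolation as in the equations above:
    m_v = (m_r)^l ⊙ (m_t)^(1-l), applied element-wise.
    m_r and m_t are the gain vectors of two trained rate points; the real
    coefficient l selects an intermediate bit rate between them."""
    m_r, m_t = np.asarray(m_r, dtype=float), np.asarray(m_t, dtype=float)
    return (m_r ** l) * (m_t ** (1.0 - l))

# The inverse-gain vector m'_v is obtained in the same way from m'_r and m'_t.
m_r, m_t = np.array([1.5, 2.0]), np.array([0.5, 1.0])
print(interpolate_gain(m_r, m_t, 0.0))  # -> m_t
print(interpolate_gain(m_r, m_t, 1.0))  # -> m_r
print(interpolate_gain(m_r, m_t, 0.5))  # element-wise geometric mean
```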
- the encoding process using the joint auto-regressive hyper prior model: Fig. 4 corresponds to a state-of-the-art compression method.
- the quantized hyper latent includes information about the probability distribution of the quantized latent
- the Entropy Parameters subnetwork generates the probability distribution estimations, that are used to encode the quantized latent
- the information that is generated by the Entropy Parameters subnetwork typically includes a mean μ and a scale (or variance) σ parameter, which are together used to obtain a Gaussian probability distribution.
- a Gaussian distribution of a random variable x is defined as f(x) = (1 / (σ·sqrt(2π))) · exp(-(x - μ)² / (2σ²)), wherein the parameter μ is the mean or expectation of the distribution (and also its median and mode) , while the parameter σ is its standard deviation (or variance, or scale) . In order to define a Gaussian distribution, the mean and the variance need to be determined.
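- The snippet below shows one common way, given here as an illustration rather than the codec's exact formulation, in which such mean/scale parameters are turned into per-symbol probabilities for the arithmetic coder: the Gaussian is integrated over each integer quantization bin.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """The Gaussian density f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def symbol_probability(q, mu, sigma):
    """Probability mass of an integer-quantized latent value q, obtained by
    integrating the Gaussian over the bin [q - 0.5, q + 0.5]; this is what
    an arithmetic coder would consume (illustrative construction)."""
    return gaussian_cdf(q + 0.5, mu, sigma) - gaussian_cdf(q - 0.5, mu, sigma)

print(symbol_probability(0, mu=0.0, sigma=1.0))  # ≈ 0.383
```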
- the context module: in raster scan order, the rows of a matrix are processed from top to bottom, and the samples in a row are processed from left to right.
- In such a scenario (wherein the raster scan order is used by the AE to encode the samples into the bitstream) , the context module generates the information pertaining to a sample using the samples encoded before it, in raster scan order.
- the information generated by the context module and the hyper decoder are combined by the entropy parameters module to generate the probability distributions that are used to encode the quantized latent into bitstream (bits1) .
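- One common realization of such a causal context module is a masked convolution, sketched below under the assumption of a 5×5 kernel and 192 latent channels; the actual codec may use a different structure.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Causal 2D convolution often used as a context model: for each output
    position, the kernel only sees latent samples that come earlier in raster
    scan order (rows above, or samples to the left in the same row).
    This is one common construction, not necessarily the codec's own."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2:] = 0   # current position and everything to its right
        mask[kh // 2 + 1:, :] = 0     # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask  # zero out "future" taps before convolving
        return super().forward(x)

# e.g. context features from the partially reconstructed quantized latent:
ctx = MaskedConv2d(192, 384, kernel_size=5, padding=2)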
- bits1 bitstream
- the first and the second bitstream are transmitted to the decoder as a result of the encoding process.
- the factorized entropy module typically generates the probability distributions using a predetermined template, for example using predetermined mean and variance values in the case of gaussian distribution.
- the output of the arithmetic decoding process of bits2 is the quantized hyper latent.
- the AD process reverts the AE process that was applied in the encoder.
- the processes of AE and AD are lossless, meaning that the quantized hyper latent that was generated by the encoder can be reconstructed at the decoder without any change.
- After the quantized hyper latent is obtained, it is processed by the hyper decoder, whose output is fed to the entropy parameters module.
- the three subnetworks, context, hyper decoder and entropy parameters that are employed in the decoder are identical to the ones in the encoder.
- the same probability distributions are therefore generated in the decoder (as in the encoder) , which is essential for reconstructing the quantized latent without any loss.
- the identical version of the quantized latent that was obtained in the encoder can be obtained in the decoder.
- the arithmetic decoding module decodes the samples of the quantized latent one by one from the bitstream bits1. From a practical standpoint, the autoregressive model (the context model) is inherently serial, and therefore cannot be sped up using techniques such as parallelization. Finally, the fully reconstructed quantized latent is input to the synthesis transform (denoted as decoder in Fig. 6) to obtain the reconstructed image.
- decoder (or auto-decoder) : the synthesis transform that converts the quantized latent into the reconstructed image.
- Neural networks for video compression: similar to conventional video coding technologies, neural image compression serves as the foundation of intra compression in neural network-based video compression. The development of neural network-based video compression technology thus comes later than that of neural network-based image compression, but it needs far more effort to solve the challenges due to its complexity. Starting from 2017, a few researchers have been working on neural network-based video compression schemes. Compared with image compression, video compression needs efficient methods to remove inter-picture redundancy. Inter-picture prediction is therefore a crucial step in these works.
- a grayscale digital image can be represented by x ∈ D^(m×n) , where D is the set of values of a pixel, m is the image height and n is the image width.
- In the YUV color space, an image is decomposed into three channels, namely Y, Cb and Cr, where Y is the luminance component and Cb/Cr are the chroma components.
- Cb and Cr are typically downsampled to achieve pre-compression, since the human visual system is less sensitive to the chroma components.
- a color video sequence is composed of multiple color images, called frames, to record scenes at different timestamps.
- the distortion can be measured by calculating the average squared difference between the original image and the reconstructed image, i.e., mean-squared-error (MSE) .
- MSE mean-squared-error
- the quality of the reconstructed image compared with the original image can be measured by the peak signal-to-noise ratio (PSNR) : PSNR = 10 · log10 ( (max(D))² / MSE ) , where max(D) is the maximal value in D, e.g., 255 for 8-bit grayscale images.
- PSNR peak signal-to-noise ratio
- There are other quality evaluation metrics such as structural similarity (SSIM) and multi-scale SSIM (MS-SSIM) .
- the luma and chroma components of an image can be decoded using separate subnetworks.
- the luma component of the image is processed by the subnetworks “Synthesis” , “Prediction fusion” , “Mask Conv” , “Hyper Decoder” , “Hyper scale decoder” etc.
- the chroma components are processed by the subnetworks: “Synthesis UV” , “Prediction fusion UV” , “Mask Conv UV” , “Hyper Decoder UV” , “Hyper scale decoder UV” etc.
- a benefit of the above separate processing is that the computational complexity of processing an image is reduced.
- the computational complexity is proportional to the square of the number of feature maps. If the total number of feature maps is equal to 192, for example, the computational complexity is proportional to 192x192. On the other hand, if the feature maps are divided into 128 for luma and 64 for chroma (in the case of separate processing) , the computational complexity is proportional to 128x128 + 64x64, which corresponds to a reduction in complexity of roughly 45%. Typically the separate processing of the luma and chroma components of an image does not result in a prohibitive reduction in performance, as the correlation between the luma and chroma components is typically very small.
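- A quick arithmetic check of these figures, under the stated assumption that complexity scales with the square of the feature-map count:

```python
# Back-of-the-envelope check of the complexity figures quoted above.
joint = 192 * 192                 # joint processing of 192 feature maps
split = 128 * 128 + 64 * 64       # separate luma (128) and chroma (64) branches
reduction = 1 - split / joint
print(joint, split, f"{reduction:.1%}")   # 36864 20480 ≈ 44.4%, i.e. roughly the 45% cited
```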
- the processing (decoding process) in Fig. 7 can be explained as follows:
- 1. the factorized entropy model is used to decode the quantized hyper latents for luma and chroma in Fig. 7.
- 2. the probability parameters (e.g. variance) generated by the second network are used to generate a quantized residual latent by performing the arithmetic decoding process.
- 3. the quantized residual latent is inversely gained with the inverse gain unit (iGain) , as shown in orange color in Fig. 7; the outputs of the inverse gain units are the gained quantized residual latents for the luma and chroma components, respectively.
- 4. the following steps are performed in a loop until all elements of the latent are obtained: a. a first subnetwork is used to estimate a mean value parameter of the quantized latent using the already obtained samples; b. the quantized residual latent and the mean value are used to obtain the next element of the latent.
- 5. a synthesis transform can be applied to obtain the reconstructed image.
- 6. for the chroma component, steps 4 and 5 are the same but with a separate set of networks.
- 7. the decoded luma component is used as additional information to obtain the chroma component: the luma is fed into the Inter Channel Correlation Information (ICCI) filter sub-network as additional information to assist the chroma component decoding.
- 8. an adaptive color transform is performed after the luma and chroma components are reconstructed.
- ICCI Inter Channel Correlation Information filter sub-network
- the module named ICCI is a neural-network based postprocessing module.
- the example embodiments of the present disclosure are not limited to the ICCI subnetwork; any other neural network based postprocessing module might also be used.
- An exemplary implementation of some example embodiments of the present disclosure is depicted in Fig. 7 (the decoding process) .
- the framework comprises two branches for the luma and chroma components respectively. In each branch, the first subnetwork comprises the context, prediction and optionally the hyper decoder modules.
- the second network comprises the hyper scale decoder module.
- the quantized hyper latents are obtained for the luma and chroma components, respectively.
- the arithmetic decoding process generates the quantized residual latents, which are further fed into the iGain units to obtain the gained quantized residual latents for the luma and chroma components.
- a recursive prediction operation is performed to obtain the latents for the luma and chroma components.
- the following steps describe how to obtain the samples of the luma latent; the chroma component is processed in the same way but with different networks.
- 1. an autoregressive context module is used to generate the first input of a prediction module using the samples of the latent that are already obtained, where the (m, n) pairs are the indices of those samples.
- 2. the second input of the prediction module is obtained by using a hyper decoder and a quantized hyper latent.
- 3. using the first input and the second input, the prediction module generates the mean value mean [: , i, j] .
- 4. the mean value mean [: , i, j] and the quantized residual latent are added together to obtain the latent.
- 5. the steps 1-4 are repeated for the next sample.
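- As a compact illustration of steps 1-5, the sketch below rebuilds the latent sample by sample in raster scan order; `context_net` and `prediction_net` are stand-ins for the first subnetwork and the prediction module, and their interfaces are assumptions made for this sketch.

```python
import numpy as np

def reconstruct_latent(residual, context_net, hyper_feat, prediction_net):
    """Illustrative recursive prediction: the latent is rebuilt position by
    position, each time predicting a mean from already reconstructed samples
    (context) plus the hyper-decoder output, then adding the decoded
    quantized residual. Shapes are assumed to be (C, H, W)."""
    C, H, W = residual.shape
    latent = np.zeros_like(residual)
    for i in range(H):
        for j in range(W):                      # raster scan order
            ctx = context_net(latent, i, j)     # uses only samples already filled in
            mean = prediction_net(ctx, hyper_feat[:, i, j])  # mean[:, i, j]
            latent[:, i, j] = mean + residual[:, i, j]       # step 4
    return latent
```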
- Whether to and/or how to apply at least one method disclosed in the document may be signaled from the encoder to the decoder, e.g. in the bitstream. Alternatively, whether to and/or how to apply at least one method disclosed in the document may be determined by the decoder based on coding information, such as dimensions, color format, etc.
- some modules might be missing and some of the modules might be displaced in processing order; additional modules might also be included.
- 1. the ICCI module might be removed. In that case the output of the Synthesis module and the Synthesis UV module might be combined by means of another module, which might be based on neural networks.
- 2. one or more of the modules named MS1, MS2 or MS3+O might be removed. The core of the proposed solution is not affected by the removal of one or more of the said scaling and adding modules.
- In Fig. 7, other operations that are performed during the processing of the luma and chroma components are also indicated using the star symbol. These processes are denoted as MS1, MS2, MS3+O.
- such processing might include, but is not limited to, adaptive quantization, latent sample scaling, and latent sample offsetting operations.
- an adaptive quantization process might correspond to scaling of a sample with a multiplier before the prediction process, wherein the multiplier is predefined or its value is indicated in the bitstream.
- the latent scaling process might correspond to the process where a sample is scaled with a multiplier after the prediction process, wherein the value of the multiplier is either predefined or indicated in the bitstream.
- the offsetting operation might correspond to adding an additive element to the sample, again wherein the value of the additive element might be indicated in the bitstream or inferred or predetermined.
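- A trivial sketch of the scaling and offsetting operations just described; the multiplier and offset are plain function arguments here, standing in for values that would be predefined or signalled in the bitstream.

```python
def apply_latent_scaling_and_offset(sample, multiplier=1.0, offset=0.0):
    """Latent sample scaling followed by offsetting, as described above.
    In a real codec the multiplier and offset would be predefined, inferred,
    or indicated in the bitstream; here they are plain arguments."""
    return sample * multiplier + offset

# e.g. descale a decoded latent sample with illustrative values:
y = apply_latent_scaling_and_offset(-3.0, multiplier=0.5, offset=0.25)  # -> -1.25
```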
- this sub-stream contains information about the image height H, width W, latent space tile locations and sizes, control flags for each tool, scaling factors for the primary and secondary components, modelIdx (the learnable model index) , and the displacement for the rate control parameters (one for the primary and one for the secondary component) .
- picture_header_size is the number of bytes in the picture header excluding the first two-byte marker
- img_width plus 64 specifies the width of the input picture (from 64 to 65600)
- img_height plus 64 specifies the height of the input picture (from 64 to 65600)
- bit_depth is the bit depth of the output picture ( “0” corresponds to 8 and “1” corresponds to 10)
- color_transform_offset [i] is an offset for the color transformation. If not present (color_transform_enable is false) , then the default ITU-R BT.
- Tools header This optional sub-stream contains information about tools.
- the tools_header () marker segment is not present in the bitstream, the following tool enabling flags are set to be 0: rvs_enable_flag, lsbs_enable_flag, grfs_enable_flag, gain_3D_enable_flag, icci_enable_flag, LEF_enabled_flag, EFE_upsampler_enabled_flag, and EFE_nonlinear_filter_enabled_flag.
- tools_header_size is the number of bytes in the tools header excluding the first two-byte marker; rvs_enable_flag is a flag used in the RVS for each color component: 0 indicates RVS disabled, 1 indicates RVS enabled.
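- To make the header semantics above concrete, the sketch below reads the picture-header fields and lists the default tool flags; the byte layout (field order and widths) is an assumption for illustration only and is not the normative JPEG AI syntax.

```python
import struct

def parse_picture_header(payload: bytes):
    """Illustrative reading of the picture-header fields listed above
    (img_width, img_height, bit_depth). The field order and bit widths used
    here are assumptions; the real syntax is defined by the specification."""
    picture_header_size, raw_w, raw_h, bit_depth = struct.unpack_from(">HHHB", payload, 0)
    return {
        "picture_header_size": picture_header_size,   # bytes, excluding the two-byte marker
        "img_width": raw_w + 64,                      # "plus 64" semantics from the text
        "img_height": raw_h + 64,
        "output_bit_depth": 8 if bit_depth == 0 else 10,
    }

# When the tools_header() marker segment is absent, these enabling flags are all 0:
DEFAULT_TOOL_FLAGS = {
    "rvs_enable_flag": 0, "lsbs_enable_flag": 0, "grfs_enable_flag": 0,
    "gain_3D_enable_flag": 0, "icci_enable_flag": 0, "LEF_enabled_flag": 0,
    "EFE_upsampler_enabled_flag": 0, "EFE_nonlinear_filter_enabled_flag": 0,
}
```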
- Profile A profile for a codec (image or video codec) is a set of features of that codec identified to meet a certain set of specifications of intended applications. This means that many of the features listed are not supported in some profiles.
- Level As the term is used in the standard, a "level" is a specified set of constraints that indicate a degree of required decoder performance for a profile. A decoder that conforms to a given level must be able to decode all bitstreams encoded for that level and all lower levels.
- JPEG AI profiling framework: in an existing design, a framework for JPEG AI profiling is provided. According to the design, a codec can conform to a stream profile and, at the same time, to one or more decoder profiles. Fig. 8 illustrates an example interaction of the stream profile and the decoder profile. On the other hand, the details of how the profiles are specified and how they are signalled are not specified. 3. Problems: an existing design has the following problems: 1) the draft specification does not consider how the profiles and levels of a JPEG AI standard are specified. In other words, the details of the profile and level descriptions are missing.
- a first profile and/or a second profile is specified.
- the name of the first profile might be: 1.
- ii. Name of the second profile might be: 1.
- i. an indication might be included in the bitstream to indicate which first profile the bitstream or decoding process conforms to.
- ii. an indication might be included in the bitstream to indicate which second profile the bitstream or decoding process conforms to.
- 1. the indication might indicate a set of second profiles that a bitstream conforms to.
- a. the indication might indicate “up to which second profile” the bitstream conforms.
- the indication might indicate up to “second profile N” , and all second profiles from “second profile 1” , “second profile 2” up to “second profile N” are supported.
- b. the indication might indicate the id of the second profile that the bitstream or the decoding process conforms to.
- 2. at least one (or more) indication might be included in the bitstream to indicate which second profile the bitstream or decoding process conforms to.
- the indication indicating which first or second profile the bitstream or decoding process conforms to might be included in the bitstream using: 1. a fixed length coding (each indication coded with a fixed number of bits) ; 2. a variable length coding.
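- As an illustration of the two signaling options just listed, the sketch below encodes a profile identifier with a fixed-length code and, alternatively, with an unsigned Exp-Golomb code as one possible variable-length scheme; the 8-bit width and the choice of Exp-Golomb are assumptions made for this sketch, not requirements of this disclosure.

```python
def encode_profile_idc_fixed(profile_id: int, num_bits: int = 8) -> str:
    """Option 1: the profile indication coded with a fixed number of bits
    (the 8-bit width here is an illustrative assumption)."""
    return format(profile_id, f"0{num_bits}b")

def encode_profile_idc_ue(profile_id: int) -> str:
    """Option 2: a variable-length code, here an unsigned Exp-Golomb code as
    one possible variable-length scheme (the text does not mandate any
    particular VLC)."""
    value = profile_id + 1
    length = value.bit_length()
    return "0" * (length - 1) + format(value, "b")

print(encode_profile_idc_fixed(3))  # '00000011'
print(encode_profile_idc_ue(3))     # '00100'
```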
- the first/second profile might indicate a restriction or a capability based on any of the following: i. a restriction or a capability based on a picture size (a higher limit, or a lower limit) . ii. a restriction or a capability based on the number of threads used in the entropy decoding process. 1. the number of threads might be a distinct number.
- the number of threads (e.g. parallel processing units, or number of sub-bitstreams) might be 1 (any other value other than 1 is disallowed) .
- the number of threads might be 4 (any other value other than 4 is disallowed) .
- the number of threads might be 8 (any other value other than 8 is disallowed) .
- the number of threads might be N, M, K... (any value other than N, M, K... is disallowed) .
- a. {M, N, K...} might be {1, 4 and 8} .
- b. {M, N, K...} might be {4 and 8} . c. etc.
- a dependently or independently decodable region
- a tile (independently or dependently decodable)
- the size might be restricted to be a certain amount, a set of allowed values (e.g. {0, 16, 32, 48, 64...} ) , a certain upper limit, or a certain lower limit.
- the overlap amount might be restricted to be a certain amount, a set of allowed values (e.g. {0, 16, 32, 48, 64...} ) , a certain upper limit, or a certain lower limit.
- 2. a first level and/or a second level is specified.
- the restriction on the bitstream size might be an upper limit.
- the bitstream size might be measured in terms of a compression ratio.
- the bitstream size might be measured between two markers (e.g. JPEG markers) .
- a restriction or a capability based on the value of a syntax element.
- the syntax element might be an enabled flag, indicating if a tool is enabled or disabled (used or not used) .
- the restriction/capabilities might disable/enable application of a neural network layer or a tool or a module.
- Restriction or capability might indicate a scaling ratio between a luma component and a chroma component.
- the metric might be a conformance metric, or a similarity metric, or a deviation metric, etc.
- the metric might measure a deviation from a reference reconstruction.
- the reconstruction might be a reconstructed image, or a reconstructed sample or a reconstructed latent tensor, or a reconstructed residual tensor.
- the reference reconstruction might be a reconstruction obtained by a reference implementation of a codec.
- the metric might be PSNR (peak signal to noise ratio) .
- MSE mean squared error
- e. the average or sum of the number of samples that are different from the corresponding samples in the reference reconstruction.
- the metric might be calculated for an NxN region of the reconstruction.
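- The sketch below illustrates how such a deviation check against a reference reconstruction could be evaluated per NxN region, reporting MSE, PSNR and the count of differing samples; the region size of 16 and the 8-bit peak value of 255 are illustrative assumptions.

```python
import numpy as np

def region_deviation(recon, reference, n=16):
    """Evaluate deviation metrics against a reference reconstruction on
    NxN regions: per-region MSE, PSNR (8-bit peak assumed) and the number
    of samples that differ from the reference."""
    results = []
    H, W = reference.shape[:2]
    for top in range(0, H, n):
        for left in range(0, W, n):
            r = reference[top:top + n, left:left + n].astype(np.float64)
            x = recon[top:top + n, left:left + n].astype(np.float64)
            mse = np.mean((r - x) ** 2)
            psnr = float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)
            diff_count = int(np.count_nonzero(r != x))
            results.append({"top": top, "left": left, "mse": mse,
                            "psnr": psnr, "diff_samples": diff_count})
    return results
```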
- A.3.3.3.4 Model Name M3 For conciseness, the parameter (s) for this model is not further described in this application.
- A.3.3.3.5 Model Name M4 For conciseness, the parameter (s) for this model is not further described in this application.
- A.3.3.3.6 Model Name M5 For conciseness, the parameter (s) for this model is not further described in this application.
- the input of the multistage context module (a.k.a., a multistage context modelling process) may comprise reconstructed residual tensor, which is the output of inverse residual scale process, and re-shuffled explicit prediction tensor, which is the output of the hyper decoder.
- the output of the multistage context module may comprise reconstructed latent tensor.
- Clause 5 The method of clause 4, wherein the syntax element indicates whether a tool of the NN-based model is enabled.
- Clause 7 The method of any of clauses 4-6, wherein in accordance with a determination that an enable flag of a first tool indicates that the first tool is enabled and the combination of the stream profile and the decoder profile indicates a restriction that the first tool is disabled, the first tool is disabled.
- Clause 9 The method of any of clauses 1-3, wherein samples associated with the visual data are partitioned into a plurality of partitions, and a combination of the stream profile and one of the at least one decoder profile indicates a restriction or a capability based on at least one of the following: a size of one of the plurality of partitions, a size of a substream of the codestream that corresponds to one of the plurality of partitions, or an amount of overlap between two of the plurality of partitions.
- Clause 10 The method of any of clauses 1-9, wherein the codestream conforms to at least one of the following: at least one first level, or at least one second level.
- Clause 12 The method of any of clauses 10-11, wherein the codestream further comprises at least one of the following: at least one third indication indicating the at least one first level, or at least one fourth indication indicating the at least one second level.
- Clause 13 The method of clause 12, wherein the at least one third indication is coded with a fixed number of bits, or the at least one third indication is coded with a variable number of bits, or the at least one fourth indication is coded with a fixed number of bits, or the at least one fourth indication is coded with a variable number of bits.
- Clause 14 The method of any of clauses 10-13, wherein one of the at least one first level, one of the at least one second level, or a combination of the first and second levels indicates a restriction or a capability based on at least one of the following: a size of the visual data, the number of threads used in an entropy coding process during the conversion, a size of the codestream, a compression ratio of the codestream, a value of a syntax element in the codestream, whether to enable an NN layer of the NN-based model, whether to enable a tool of the NN-based model, whether to enable a module of the NN-based model, a ratio between a height of a primary component and a height of a secondary component of the visual data, or a ratio between a width of the primary component and a width of the secondary component.
- Clause 15 The method of clause 14, wherein the size of the codestream is measured in terms of the number of bits, or the number of bits per sample, or the size of the codestream is measured between two markers, or the syntax element indicates whether a tool of the NN-based model is enabled.
- Clause 17 The method of clause 9 or 16, wherein one of the plurality of partitions comprises a region or a tile.
- Clause 19 The method of clause 18, wherein the reconstruction quality or the deviation measures a quality of a reconstruction associated with the visual data.
- Clause 22 The method of clause 21, wherein the metric comprises at least one of a conformance metric, a similarity metric, or a deviation metric.
- Clause 23 The method of clause 22, wherein the deviation metric measures a deviation from a reference reconstruction associated with the visual data.
- Clause 25 The method of any of clauses 23-24, wherein the deviation metric is one of the following: peak signal to noise ratio (PSNR) , mean squared error (MSE) , an average of the number of samples that are different from corresponding samples of the reference reconstruction, or a sum of the number of samples that are different from corresponding samples of the reference reconstruction.
- PSNR peak signal to noise ratio
- MSE mean squared error
- Clause 26 The method of any of clauses 23-25, wherein the deviation metric is determined for a region within the reference reconstruction, or the deviation metric is determined for each region within the reference reconstruction.
- Clause 27 The method of clause 26, wherein a size of the region is N ⁇ M, and each of N and M is a positive number.
- Clause 30 The method of clause 29, wherein the at least one constant parameter comprises at least one of the following: a model weight, a multiplicative constant, an additive constant, a threshold of an NN layer, or a descaling parameter applied to a module of the NN-based model.
- one of the at least one constant parameter is a parameter of a tool used for the conversion, and the tool comprises at least one of the following: a quantization tool, a scaling tool, an RVS tool, or an LSBS tool.
- Clause 32 The method of any of clauses 29-31, wherein a name of the at least one fifth indication is version or decoder version.
- Clause 33 The method of any of clauses 29-32, wherein the value of the at least one constant parameter is determined based on a combination of a value of the at least one fifth indication and a value of a sixth indication.
- Clause 34 The method of clause 33, wherein a mapping relationship between candidate values of the sixth indication and candidate values of the at least one constant parameter is determined based on the value of the at least one fifth indication.
- Clause 35 The method of any of clauses 29-34, wherein the at least one fifth indication is a version number, a weights version number, or a constants version number.
- Clause 36 The method of any of clauses 29-35, wherein the at least one fifth indication is associated with a profile to which the codestream conforms.
- Clause 37 The method of any of clauses 29-36, wherein the at least one fifth indication is coded with a fixed number of bits, or the at least one fifth indication is coded with a variable number of bits.
- Clause 38 The method of any of clauses 29-37, wherein the number of the fifth indication is equal to the number of the stream profile or the number of the at least one decoder profile.
- a non-transitory computer-readable recording medium storing a codestream of visual data which is generated by a method performed by an apparatus for visual data processing, wherein the method comprises: performing a conversion between the visual data and the codestream with a neural network (NN) -based model, wherein the codestream comprises a first indication indicating a stream profile to which the codestream conforms, and at least one second indication indicating at least one decoder profile to which the codestream conforms.
- NN neural network
- the storage unit 1030 may be any detachable or non-detachable medium and may include a machine-readable medium such as a memory, flash memory drive, magnetic disk or other media, which can be used for storing information and/or visual data and can be accessed in the computing device 1000.
- the computing device 1000 may further include additional detachable/non-detachable, volatile/non-volatile memory medium.
- a magnetic disk drive for reading from and/or writing into a detachable and non-volatile magnetic disk
- an optical disk drive for reading from and/or writing into a detachable non-volatile optical disk.
- each drive may be connected to a bus (not shown) via one or more visual data medium interfaces.
- the communication unit 1040 communicates with a further computing device via the communication medium.
- the functions of the components in the computing device 1000 can be implemented by a single computing cluster or multiple computing machines that can communicate via communication connections. Therefore, the computing device 1000 can operate in a networked environment using a logical connection with one or more other servers, networked personal computers (PCs) or further general network nodes.
- PCs personal computers
- the input device 1050 may be one or more of a variety of input devices, such as a mouse, keyboard, tracking ball, voice-input device, and the like.
- the output device 1060 may be one or more of a variety of output devices, such as a display, loudspeaker, printer, and the like.
- the computing device 1000 can further communicate with one or more external devices (not shown) such as the storage devices and display device, with one or more devices enabling the user to interact with the computing device 1000, or any devices (such as a network card, a modem and the like) enabling the computing device 1000 to communicate with one or more other computing devices, if required. Such communication can be performed via input/output (I/O) interfaces (not shown) .
- I/O input/output
- some or all components of the computing device 1000 may also be arranged in cloud computing architecture.
- the components may be provided remotely and work together to implement the functionalities described in the present disclosure.
- cloud computing provides computing, software, visual data access and storage services, without requiring end users to be aware of the physical locations or configurations of the systems or hardware providing these services.
- the cloud computing provides the services via a wide area network (such as the Internet) using suitable protocols.
- a cloud computing provider provides applications over the wide area network, which can be accessed through a web browser or any other computing components.
- the software or components of the cloud computing architecture and corresponding visual data may be stored on a server at a remote position.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
Embodiments of the present disclosure provide a solution for visual data processing. The present disclosure relates to a method for visual data processing. The method comprises: performing a conversion between visual data and a codestream of the visual data with a neural network (NN)-based model, wherein the codestream comprises a first indication indicating a stream profile to which the codestream conforms, and at least one second indication indicating at least one decoder profile to which the codestream conforms.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CNPCT/CN2024/072854 | 2024-01-17 | ||
| CN2024072854 | 2024-01-17 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2025153016A1 true WO2025153016A1 (fr) | 2025-07-24 |
| WO2025153016A8 WO2025153016A8 (fr) | 2025-11-06 |
Family
ID=96470796
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2025/072729 Pending WO2025153016A1 (fr) | 2024-01-17 | 2025-01-16 | Procédé, appareil, et support de traitement de données visuelles |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025153016A1 (fr) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20110129202A1 (en) * | 2009-12-01 | 2011-06-02 | Divx, Llc | System and method for determining bit stream compatibility |
| US20170111661A1 (en) * | 2015-10-20 | 2017-04-20 | Intel Corporation | Method and system of video coding with post-processing indication |
| US20220201321A1 (en) * | 2020-12-22 | 2022-06-23 | Tencent America LLC | Method and apparatus for video coding for machine |
| US20230122449A1 (en) * | 2021-10-18 | 2023-04-20 | Tencent America LLC | Substitutional quality factor learning in the latent space for neural image compression |
| US20230262243A1 (en) * | 2020-10-20 | 2023-08-17 | Huawei Technologies Co., Ltd. | Signaling of feature map data |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2025153016A8 (fr) | 2025-11-06 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN118872263A (zh) | 用于视觉数据处理的方法、装置和介质 | |
| CN119156819A (zh) | 用于视觉数据处理的方法、设备和介质 | |
| WO2023138686A1 (fr) | Procédé, appareil et support de traitement de données | |
| JP2025506992A (ja) | 視覚データ処理のための方法、装置及び媒体 | |
| WO2025072500A1 (fr) | Procédé, appareil et support de traitement de données visuelles | |
| CN120419185A (zh) | 用于视觉数据处理的方法、装置和介质 | |
| CN119586129A (zh) | 用于视觉数据处理的方法、装置和介质 | |
| WO2025153016A1 (fr) | Procédé, appareil, et support de traitement de données visuelles | |
| WO2024149394A1 (fr) | Procédé, appareil, et support de traitement de données visuelles | |
| WO2024083202A1 (fr) | Procédé, appareil, et support de traitement de données visuelles | |
| WO2024149392A1 (fr) | Procédé, appareil et support de traitement de données visuelles | |
| WO2025077746A1 (fr) | Procédé, appareil et support pour le traitement de données visuelles | |
| WO2025149063A1 (fr) | Procédé, appareil et support de traitement de données visuelles | |
| WO2024169958A1 (fr) | Procédé, appareil et support de traitement de données visuelles | |
| WO2025077742A1 (fr) | Procédé, appareil, et support de traitement de données visuelles | |
| WO2025082522A1 (fr) | Procédé, appareil et support pour le traitement de données visuelles | |
| WO2024169959A1 (fr) | Procédé, appareil et support de traitement de données visuelles | |
| US20250379990A1 (en) | Method, apparatus, and medium for visual data processing | |
| WO2024149395A1 (fr) | Procédé, appareil et support de traitement de données visuelles | |
| WO2025157163A1 (fr) | Procédé, appareil et support de traitement de données visuelles | |
| WO2025146073A1 (fr) | Procédé, appareil, et support de traitement de données visuelles | |
| WO2024083247A1 (fr) | Procédé, appareil et support de traitement de données visuelles | |
| WO2025131046A1 (fr) | Procédé, appareil, et support de traitement de données visuelles | |
| WO2025151780A1 (fr) | Procédé, appareil et support de traitement de données visuelles | |
| WO2025044947A1 (fr) | Procédé, appareil et support de traitement de données visuelles |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25741577; Country of ref document: EP; Kind code of ref document: A1 |