
CN119152101B - Zero sample 3D generation method for enhancing viewing angle consistency - Google Patents


Info

Publication number: CN119152101B (granted publication of application CN202411625074.4A; earlier publication CN119152101A)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: view, model, angle, noise, subject
Legal status: Active (granted)
Inventors: 周媛, 金世龙
Original and current assignee: Nanjing University of Information Science and Technology

Classifications

    • G06T15/005 — General purpose rendering architectures (3D image rendering)
    • G06F16/3344 — Query execution using natural language analysis
    • G06F40/258 — Heading extraction; automatic titling; numbering
    • G06N3/08 — Neural networks; learning methods
    • G06N7/01 — Probabilistic graphical models, e.g. probabilistic networks
    • G06T17/00 — Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T5/60 — Image enhancement or restoration using machine learning, e.g. neural networks
    • G06T5/70 — Denoising; smoothing
    • G06T7/80 — Analysis of captured images to determine intrinsic or extrinsic camera parameters (camera calibration)
    • Y02T10/40 — Engine management systems


Abstract

The invention provides a zero-sample 3D generation method for enhancing viewing-angle consistency, aiming to solve the inconsistency of the same object under different viewing angles, i.e. the "multi-face" problem, in current 3D generation technology. This problem stems from the typical viewing-angle preference of pre-trained generative models. The method therefore adopts a view decoupling method VDM that extracts view features to eliminate the view prior preference and strengthen view control, and introduces a similarity partial-order loss PSL to optimize the similarity distribution of images across viewing angles, ensuring that the generated 3D content remains highly consistent under different viewing angles. In addition, the technique is combined with 3D Gaussian splatting to further enhance the rendering quality and detail of the model. The scheme significantly improves the realism and consistency of 3D content, is well suited to fields such as virtual reality, game design, and industrial design, and greatly promotes the practicality and development of zero-sample 3D generation technology.

Description

Zero sample 3D generation method for enhancing viewing angle consistency
Technical Field
The invention belongs to the field of 3D generation, and particularly relates to a zero sample 3D generation method for enhancing viewing angle consistency.
Background
3D generation technology plays a vital role in many fields such as industrial design, game design, and virtual reality. In particular, progress in zero-sample text-to-3D generation offers new possibilities for creating designs from scratch, and provides powerful support for interactive user experience and simulation, making the transition from concept to reality more efficient and intuitive. However, unlike the comparatively small domain gap of text-to-image generation, the inherent complexity of real scenes and the scarcity of 3D data make generating high-quality 3D content from text a great challenge. Recently, DreamFusion leveraged the prior knowledge of the mature text-to-image generation model, the Stable Diffusion model SD, and lifted 2D results across viewing angles into the 3D world through the score distillation sampling technique SDS, without relying on any 3D dataset, enabling unsupervised 3D content generation. In subsequent studies, 3D Gaussian splatting was introduced to replace traditional implicit 3D representations, and this paradigm was optimized in terms of generation quality, generation speed, and geometry.
However, existing text-to-3D generation methods are still largely limited by geometric collapse and viewing-angle inconsistency. One significant issue is the "multi-face" problem, which typically manifests as the same 3D object presenting different or even contradictory faces from different viewing angles, severely harming the realism of the 3D generation result. This phenomenon occurs because a key factor is ignored: to improve training efficiency, existing text-to-image models mostly use canonical-view images (typical-perspective images, in which objects or scenes are presented at common, standard angles) as training data. Such data usually favor common viewpoints that best expose an object's features, such as the front view and the oblique front view. Consequently, without a specific viewing-angle condition, a pre-trained text-to-image model such as the Stable Diffusion model SD will generate an image at a canonical viewing angle based on its view prior knowledge. Recent text-to-3D generation methods optimize a parameterized 3D model by matching rendered images of the target 3D object from various viewpoints with the generation results of the Stable Diffusion model SD. However, when a specific view description is introduced into the original prompt to guide the model to generate an image of the corresponding view, the view prior knowledge inherent to the Stable Diffusion model SD may conflict with the newly added view control information, blurring the view semantics and thereby causing the multi-view consistency problem.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a zero-sample 3D generation method for enhancing viewing-angle consistency, which comprises the following steps:
Step 1, analyzing the multi-face problem in the three-dimensional object generation process at the mathematical level and proposing an optimization target;
Step 2, acquiring the text prompt input by a user and selecting the subject word in the prompt; connecting the subject word with three different view descriptors (appending "front view", "side view", and "back view" to the end of the subject word respectively) to construct three phrases containing the subject word with a view description; inputting the three phrases into a pre-trained text encoder CLIP to obtain three encoding vectors; extracting the encoding part corresponding to the subject word from each of the three encoding vectors; and then extracting the front-view, side-view, and back-view features of the subject word from the three subject-word encodings by extracting the vertical (orthogonal) component;
Step 3, inputting the user prompt into a pre-trained text-to-3D diffusion model PointE to generate coarse three-dimensional point cloud data, fitting the coarse point cloud with 3D Gaussian splatting to obtain an explicitly represented three-dimensional Gaussian model, and rendering the three-dimensional Gaussian model with randomly generated camera parameters to obtain rendered images from two or more viewing angles;
Step 4, inputting the user prompt into the pre-trained text encoder CLIP to obtain the prompt encoding vector, locating the position of the subject word in the encoding vector according to the subject word, injecting the front-view, side-view, and back-view features extracted in step 2 into the subject-word encoding part of the user prompt encoding according to the camera parameters randomly generated in step 3 to obtain view-controlled prompt encodings corresponding to the respective camera parameters, and inputting the new prompt encoding vectors together with the rendered images of two or more viewing angles obtained in step 3 into a pre-trained Stable Diffusion model SD;
Step 5, generating corresponding two-dimensional images with the Stable Diffusion model SD according to the user prompt and the rendered images, extracting the unconditional predicted noise produced during generation to obtain unconditional denoising results, exploring the similarity distribution of the images, and constraining the unconditional denoising results with a similarity partial-order loss, thereby enhancing the viewing-angle consistency of the three-dimensional Gaussian model.
In step 1, the "multi-face" problem is as follows:
Existing text-to-3D generation methods are still largely limited by geometric collapse and viewing-angle inconsistency. One significant issue is the multi-face problem, which typically manifests as the same 3D object presenting different or even contradictory faces from different viewing angles, severely harming the realism of the 3D generation result. This phenomenon occurs because a key factor is ignored: to improve training efficiency, existing text-to-image models mostly use canonical-view images (typical-perspective images, in which objects or scenes are presented at common, standard angles) as training data. Such data usually favor common viewpoints that best expose an object's features, such as the straight-ahead view and the oblique front view. Consequently, without a specific viewing-angle condition, a pre-trained text-to-image model such as the Stable Diffusion model SD will generate an image at a canonical viewing angle based on its view prior knowledge. Recent text-to-3D generation methods optimize a parameterized 3D model by matching rendered images of the target 3D object from various viewpoints with the generation results of the Stable Diffusion model SD. However, when a specific view description is introduced into the original prompt to guide the model to generate an image of the corresponding view, the view prior knowledge inherent to the Stable Diffusion model SD may conflict with the newly added view control information, blurring the view semantics and thereby causing the multi-view consistency problem.
In step 1, the mathematical level analysis includes:
Diffusion models are a class of generative models for learning and sampling from complex distributions, originating from stochastic processes in statistical physics. Their core idea is to gradually "diffuse" data from its original form into random noise, and then recover data from noise through a reverse process, which usually denoises the noisy samples with a trained neural network. On this basis, the denoising diffusion probabilistic model DDPM was proposed, which optimizes the following simplified objective:

$$\mathcal{L}_{\mathrm{DDPM}}=\mathbb{E}_{x_0,\epsilon,t}\Big[\big\|\epsilon-\epsilon_\phi(x_t,t)\big\|_2^2\Big]$$

where $\mathcal{L}_{\mathrm{DDPM}}$ is the loss function of the denoising diffusion probabilistic model DDPM, used to measure the difference between the model prediction and the actually added noise; by minimizing $\mathcal{L}_{\mathrm{DDPM}}$, the DDPM gradually learns how to denoise and generate high-quality samples. The expectation $\mathbb{E}_{x_0,\epsilon,t}$ is taken over the original sample data $x_0$, the noise $\epsilon$, and the time step $t$: $x_0$ is the input data of the DDPM; $\epsilon$ is random noise following a standard normal distribution with mean 0 and variance 1, added to the samples during training; and $t$ is the diffusion time step controlling the degree of noise addition, increasing gradually from 1 to $T$. $\epsilon_\phi(x_t,t)$ is the noise predicted by the DDPM with model parameters $\phi$ from the noise-perturbed sample $x_t$ at time step $t$, and the squared two-norm $\|\epsilon-\epsilon_\phi(x_t,t)\|_2^2$ between the added noise and the predicted noise measures their difference. During inference, the DDPM starts from $x_T$ sampled from the standard normal distribution $\mathcal{N}(0,\mathbf{I})$ and denoises step by step. The DDPM demonstrated the excellent ability of diffusion models to capture and simulate real-world image data, which motivated a series of innovations and improvements, including the Stable Diffusion model SD. The Stable Diffusion model SD is based on the latent diffusion model LDM and highlights two key advantages in design and implementation: first, by performing the diffusion process in a lower-dimensional latent space it significantly improves the efficiency and speed of image generation and greatly reduces the computational burden; second, by introducing a conditioning mechanism, the latent diffusion model LDM encodes text condition information into the latent space with the text encoder CLIP, so that the model can efficiently generate highly relevant and realistic images according to specific condition information. The latent diffusion model LDM optimizes the following simplified objective:
$$\mathcal{L}_{\mathrm{LDM}}=\mathbb{E}_{z_0,y,\epsilon,t}\Big[\big\|\epsilon-\epsilon_\phi\big(z_t,t,\tau(y)\big)\big\|_2^2\Big]$$

where $\mathcal{L}_{\mathrm{LDM}}$ is the loss function of the latent diffusion model LDM; the expectation $\mathbb{E}_{z_0,y,\epsilon,t}$ is taken over the latent encoding $z_0$ of the original sample data $x_0$, the user prompt $y$, the random noise $\epsilon$ following a standard normal distribution, and the time step $t$; the condition encoding $\tau(y)$ is the result of encoding the user prompt $y$ with the text encoder CLIP; and $\|\epsilon-\epsilon_\phi(z_t,t,\tau(y))\|_2^2$ is the squared two-norm of the difference between the added noise and the predicted noise. The present method is built on the Stable Diffusion model SD and the latent diffusion model LDM.
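For concreteness, the noise-prediction objective above can be sketched in a few lines of PyTorch-style code. This is a minimal illustration under assumptions, not the patent's implementation; `unet`, `text_encoder`, `vae`, and `alphas_cumprod` are placeholder names for a latent-diffusion backbone.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(unet, text_encoder, vae, x0, prompt, alphas_cumprod):
    """One simplified LDM training step: predict the noise added to a latent.

    Assumed components: `vae.encode` maps images to latents, `text_encoder`
    maps prompts to condition embeddings, `unet` predicts noise from
    (noisy latent, timestep, condition). All are hypothetical placeholders.
    """
    z0 = vae.encode(x0)                                    # latent encoding of the sample
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                             # standard normal noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)            # cumulative noise schedule
    zt = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps      # forward diffusion q(z_t | z_0)
    cond = text_encoder(prompt)                            # text condition tau(y)
    eps_pred = unet(zt, t, cond)                           # predicted noise
    return F.mse_loss(eps_pred, eps)                       # || eps - eps_phi ||^2
```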
Under the DDPM and latent diffusion model LDM framework, image generation can be viewed as a reverse recovery process from the final noise state $x_T$ to a noise-free initial clear state $x_0$, and this process is precisely controlled by the user prompt $y$. Specifically, the reverse recovery process gradually restores a high-noise state to the noise-free original image state, and the probability density of the final noise-free clear state $x_0$ is generated by the following equation:

$$p(x_0\mid y)=\int p(x_T)\prod_{t=1}^{T}p_\phi\big(x_{t-1}\mid x_t,y\big)\,\mathrm{d}x_{1:T}$$

where $T$ is the maximum time step that $t$ can take, $p(x_T)$ is the probability distribution of the final state of the diffusion process, namely a standard normal distribution, and $p_\phi(x_{t-1}\mid x_t,y)$ is the conditional probability of generating the latent representation $x_{t-1}$ of the previous step given the user prompt $y$ and the current representation $x_t$. The symbol $\mathrm{d}x_{1:T}$ is the differential element of the integration: during integration every possible intermediate state undergoes an infinitesimal variation, and the joint probability from the initial state to the final state is accumulated over all possible intermediate states through these differential elements. In the zero-sample text-to-3D generation task, the model integrates, from various viewing angles, the two-dimensional image representations generated by the DDPM through multiple iterative steps, and thereby constructs a non-normalized probability density function $\tilde p(\theta)$ of the three-dimensional parameters $\theta$, obtained as follows:
$$\tilde p(\theta)=\prod_{v\in V}p\big(x^{(v)}(\theta)\mid y\big)$$

where $V$ is the set of all viewing angles; the expression aggregates, for each viewing angle $v$ selected from $V$, the result $x^{(v)}(\theta)$ of generating an image through the DDPM process, and thereby reveals the contribution of the images generated at different viewing angles to the estimation of the overall three-dimensional model parameters. Although the DDPM approach excels at generating high-quality two-dimensional images, the randomness of the reverse process and its sensitivity to input fluctuations may cause the image features generated from different viewpoints to be inconsistent, thereby degrading the quality of the overall three-dimensional model. In particular, during the training or distillation phase, the pre-trained DDPM tends to produce pseudo ground truths of low quality with inconsistent features; this is especially pronounced in multi-view fusion, leading to three-dimensional reconstruction results that are overly smooth and lack detail. To address this challenge, much prior work introduces the denoising diffusion implicit model DDIM, which significantly reduces the random variation in each iteration step through a more deterministic recursive process. Although DDIM effectively improves the quality and consistency of the pseudo ground truths, the "multi-face" problem remains significant, so attention turns to the multi-view fitting process outside the diffusion process. The present method redefines the density function $\tilde p(\theta)$ of the three-dimensional parameters $\theta$ as the product, over each iteration step $i$ of a given series of optimization steps, of the conditional likelihood of the 3D Gaussian model projection $g(\theta,c_i)$ at the corresponding viewing angle, given the view control $c_i$ and the user prompt $y$:
$$\tilde p(\theta)=\prod_{i=1}^{N}p\big(g(\theta,c_i)\mid c_i,y\big)$$

where $N$ is the maximum optimization step and $g(\theta,c_i)$ is the 3D model state controlled by the three-dimensional parameters $\theta$ at iteration step $i$. The user prompt $y$ should contain only guidance information related to the generated content and no information directly related to the viewing angle. However, the study found that, due to exposure to training data at canonical viewpoints, the pre-trained diffusion model automatically introduces view prior knowledge when parsing the user prompt $y$, even though $y$ contains no explicit view information. To control and understand this phenomenon precisely at the theoretical level, from the perspective of how the Stable Diffusion model interprets the user prompt, the user prompt $y$ is divided into a content part $y_c$ and a view prior part $y_d$, where $y_c$ and $y_d$ are orthogonal: $y_c$ directly reflects the content entered by the user, while $y_d$ is the typical view preference that the model introduces, based on its pre-training data, when parsing the user input. The density function $\tilde p(\theta)$ of the three-dimensional parameters $\theta$ is then rewritten as:
$$\tilde p(\theta)=\prod_{i=1}^{N}p\big(g(\theta,c_i)\mid c_i,y_c,y_d\big),\qquad \text{s.t. }\langle y_c,y_d\rangle=0$$

where the constraint $\langle y_c,y_d\rangle=0$ indicates that the content part $y_c$ and the view prior part $y_d$ share no component in any direction of the vector space, ensuring that the generation of content is completely independent of the influence of any particular view. Taking the logarithm of both sides of the equation gives:

$$\log\tilde p(\theta)=\sum_{i=1}^{N}\log p\big(g(\theta,c_i)\mid c_i,y_c,y_d\big)$$
Then, using the chain rule, the gradient $\nabla_\theta\log\tilde p(\theta)$ of $\log\tilde p(\theta)$ with respect to $\theta$ is expressed as:

$$\nabla_\theta\log\tilde p(\theta)=\sum_{i=1}^{N}\frac{\partial g(\theta,c_i)}{\partial\theta}\cdot\frac{\partial\log p\big(g(\theta,c_i)\mid c_i,y_c,y_d\big)}{\partial g(\theta,c_i)}$$

where $\partial g(\theta,c_i)/\partial\theta$ is the partial derivative of the 3D model state $g(\theta,c_i)$ with respect to the three-dimensional parameters $\theta$, and $\partial\log p\big(g(\theta,c_i)\mid c_i,y_c,y_d\big)/\partial g(\theta,c_i)$ is the partial derivative of the log probability density with respect to the 3D model state;
Expanding $\log p\big(g(\theta,c_i)\mid c_i,y_c,y_d\big)$ further with Bayes' theorem:

$$\log p\big(g(\theta,c_i)\mid c_i,y_c,y_d\big)=\log p\big(g(\theta,c_i)\big)+\log p\big(c_i,y_c,y_d\mid g(\theta,c_i)\big)-\log p\big(c_i,y_c,y_d\big)$$

where the term $\log p\big(c_i,y_c,y_d\big)$ is constant with respect to the 3D model state $g(\theta,c_i)$, so its partial derivative is 0 and the term can be dropped. The term $\log p\big(g(\theta,c_i)\big)$ intuitively represents the model's instinctive perception of the 3D object in the absence of external view guidance; this perception, however, is particularly susceptible to view prior knowledge, causing the model to rely on the most common or salient view features from past experience when not constrained by an external view. This dependence biases the model when dealing with new viewpoints: when facing a situation that differs significantly from the viewpoints of the training data, the result deviates from a faithful 3D representation. In addition, the second term $\log p\big(c_i,y_c,y_d\mid g(\theta,c_i)\big)$ can be further expanded using pointwise conditional mutual information PCMI, which provides a quantitative way to show how the information interaction between different variables under specific conditions goes beyond a simple combination of their individual behaviors. It decomposes further into:
$$\log p\big(c_i,y_c,y_d\mid g(\theta,c_i)\big)=\log p\big(y_c\mid g(\theta,c_i)\big)+\log p\big(c_i\mid g(\theta,c_i),y_c\big)+\log p\big(y_d\mid g(\theta,c_i),y_c\big)+\mathrm{PCMI}\big(c_i;y_d\mid g(\theta,c_i),y_c\big)$$

where $\mathrm{PCMI}\big(c_i;y_d\mid g(\theta,c_i),y_c\big)$ is the pointwise conditional mutual information between the view control $c_i$ and the view prior part $y_d$; the terms $\log p\big(y_c\mid g(\theta,c_i)\big)$ and $\log p\big(c_i\mid g(\theta,c_i),y_c\big)$ can be regarded as constants within the current optimization step, so the expression simplifies further. The PCMI term is expanded by the definition of conditional probability:

$$\mathrm{PCMI}\big(c_i;y_d\mid g(\theta,c_i),y_c\big)=\log\frac{p\big(c_i,y_d\mid g(\theta,c_i),y_c\big)}{p\big(c_i\mid g(\theta,c_i),y_c\big)\,p\big(y_d\mid g(\theta,c_i),y_c\big)}$$

This quantity measures whether, given $g(\theta,c_i)$, the view control $c_i$ and the view prior $y_d$ are independent with respect to the additional information they carry. If the view control $c_i$ conflicts with the view prior part $y_d$ of the pre-trained model, i.e. $p\big(c_i,y_d\mid g(\theta,c_i),y_c\big)\ll p\big(c_i\mid g(\theta,c_i),y_c\big)\,p\big(y_d\mid g(\theta,c_i),y_c\big)$, then for every $g(\theta,c_i)$ the ratio inside the PCMI term approaches 0, and the unconditional term $\log p\big(g(\theta,c_i)\big)$ and the view-prior-related term $\log p\big(y_d\mid g(\theta,c_i),y_c\big)$ will simultaneously adversely affect the 3D generation. Therefore, establishing view semantic consistency between $c_i$ and $y_d$ is particularly critical to alleviating the "multi-face" problem.
The step 2 comprises the following steps:
Acquiring the text prompt input by the user and selecting the subject word in the prompt: for example, in the prompt "a photo of an automobile", the subject word is "automobile". The subject word is directly input into the pre-trained CLIP encoder to obtain the view-free subject-word encoding vector $e_s$.
The subject word is then connected with three different view descriptors, i.e. "front view", "side view", and "back view" are appended to the end of the subject word respectively, constructing three phrases containing the subject word with a view description. For example, the subject word "automobile" becomes "automobile, front view", "automobile, side view", and "automobile, back view" after concatenation with the view description phrases.
The three phrases containing the subject word with a view description are input into the CLIP encoder respectively, and the positions of the subject-word encodings are located, giving the subject-word encodings $e_v$ that contain the corresponding view description, $v\in\{\text{front},\text{side},\text{back}\}$. Unlike $e_s$, the view-described subject-word encoding $e_v$ contains additional context information, which provides a way to extract from $e_v$ a content-independent view feature encoding $f_v$. This process uses the extraction of the vertical (orthogonal) component, formulated as:

$$f_v=e_v-\operatorname{proj}_{e_s}(e_v)=e_v-\frac{\langle e_v,e_s\rangle}{\lVert e_s\rVert^2}\,e_s$$

where $\operatorname{proj}_{e_s}(e_v)$ is the projection of the view-described subject-word encoding $e_v$ onto the view-free subject-word encoding vector $e_s$; subtracting this projection component yields the content-independent view feature encoding $f_v$, which represents the extracted pure view difference. This difference can be applied to the encoding of other prompts, thereby realizing view control. When a 2D image is generated, if no view-related description exists in the prompt, the result is usually at a canonical view. To make the generated views better satisfy users' diverse requirements, the prior view features of the subject-word encoding in the user prompt are decoupled, the typical view preference is eliminated, and view control information is injected on this basis. Taking "back view" as an example, the back-view feature is implanted after decoupling the "front view" and "side view" features, formulated as:

$$\tilde e_s=e_s-\sum_{v\in\{\text{front},\text{side}\}}\operatorname{proj}_{f_v}(e_s)+\gamma\,f_{\text{back}}$$

where $v\in\{\text{front},\text{side}\}$ indicates the views whose prior knowledge needs to be eliminated, covering both the front and side cases; $\operatorname{proj}_{f_v}(e_s)$ is the projection of the view-free subject-word encoding vector $e_s$ in the direction of the content-independent view feature encoding $f_v$; $f_{\text{back}}$ is the content-independent back-view feature encoding; and $\gamma$ is a scale factor representing the implant intensity of the new view.
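The view decoupling and injection described above reduce to simple vector operations on CLIP embeddings. The sketch below is illustrative only; `encode_subject` is an assumed helper returning the pooled CLIP embedding of the subject-word tokens, and `gamma` is a free hyperparameter.

```python
import torch

def proj(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Projection of vector a onto vector b."""
    return (a @ b) / (b @ b) * b

def extract_view_feature(e_view: torch.Tensor, e_subj: torch.Tensor) -> torch.Tensor:
    """Content-independent view feature: remove the component of the
    view-described encoding that lies along the plain subject encoding."""
    return e_view - proj(e_view, e_subj)

def inject_view(e_subj: torch.Tensor,
                f_priors: list[torch.Tensor],
                f_target: torch.Tensor,
                gamma: float = 1.0) -> torch.Tensor:
    """Decouple typical view priors (e.g. front/side) from the subject
    encoding, then implant the target view feature with intensity gamma."""
    e = e_subj.clone()
    for f in f_priors:
        e = e - proj(e_subj, f)       # strip each prior view direction
    return e + gamma * f_target       # implant the desired view

# Usage sketch (encode_subject is a hypothetical CLIP helper):
# e_s     = encode_subject("automobile")
# f_front = extract_view_feature(encode_subject("automobile, front view"), e_s)
# f_side  = extract_view_feature(encode_subject("automobile, side view"),  e_s)
# f_back  = extract_view_feature(encode_subject("automobile, back view"),  e_s)
# e_back  = inject_view(e_s, [f_front, f_side], f_back, gamma=1.0)
```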
The step 3 comprises the following steps:
Inputting the user prompt into the pre-trained text-to-3D diffusion model PointE generates coarse three-dimensional point cloud data, which contains basic geometric information of the object surface;
An anisotropic Gaussian is a three-dimensional Gaussian distribution that allows different variances in different directions, i.e. the shape of the distribution can be stretched or scaled along each direction. This property makes it adapt more flexibly to the local structure of the point cloud data. By fitting anisotropic Gaussians, a Gaussian matching the local distribution of the point cloud can be generated for each local region of the point cloud, forming a continuous, smooth representation. This fitting process takes into account the density distribution and local geometry of the point cloud so as to capture the shape details of the object more accurately. The geometric features of the coarse three-dimensional point cloud data generated by the three-dimensional diffusion model PointE are captured with anisotropic Gaussian primitives, whose position, orientation, size, and other information directly define the geometric structure, making this an explicit representation. This process fits an explicitly represented three-dimensional Gaussian model;
The GPU-optimized tile rasterization technique efficiently renders a 3D scene into a 2D image by dividing the image into small tiles and exploiting the parallel processing capability of the graphics processing unit GPU. The three-dimensional Gaussian model is rendered with this GPU-optimized tile rasterization technique under the randomly generated camera parameters to obtain rendered images from multiple viewing angles.
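As an illustration of this pipeline, the sketch below initializes one anisotropic Gaussian per point of a coarse point cloud and renders it from randomly sampled cameras. It assumes hypothetical helpers `text_to_pointcloud` and `rasterize_gaussians` standing in for a PointE-style generator and a tile-based Gaussian rasterizer; it is not the patent's implementation.

```python
import numpy as np

def init_gaussians_from_points(points: np.ndarray, base_scale: float = 0.02):
    """One anisotropic Gaussian per point: mean at the point, isotropic
    initial covariance that later optimization can stretch per axis."""
    n = points.shape[0]
    return {
        "means": points,                               # (n, 3) positions
        "scales": np.full((n, 3), base_scale),         # per-axis standard deviations
        "rotations": np.tile([1.0, 0, 0, 0], (n, 1)),  # unit quaternions
        "colors": np.full((n, 3), 0.5),                # initial gray albedo
        "opacities": np.full((n, 1), 0.8),
    }

def sample_camera(radius: float = 2.5):
    """Random camera on an upper-hemisphere orbit, looking at the origin."""
    azimuth = np.random.uniform(-180.0, 180.0)
    elevation = np.random.uniform(10.0, 60.0)
    return {"azimuth": azimuth, "elevation": elevation, "radius": radius}

# Usage sketch (both helpers below are hypothetical placeholders):
# points    = text_to_pointcloud("a photo of an automobile")   # PointE-style model
# gaussians = init_gaussians_from_points(points)
# views     = [sample_camera() for _ in range(4)]
# renders   = [rasterize_gaussians(gaussians, cam) for cam in views]
```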
Step 4 comprises:
The user prompt is input into the pre-trained text encoder CLIP to obtain the prompt encoding vector, and the position of the subject word in the encoding vector is located according to the subject word.
The plurality of camera parameters randomly generated in step 3 allow the view control process to be realized adaptively through the camera parameters of each optimization step. Specifically, when the azimuth angle $\varphi$ of the camera parameters lies in the range $(-90^{\circ},90^{\circ})$:

$$\tilde e_s=e_s-\operatorname{proj}_{f_{\text{front}}}(e_s)$$

where $\operatorname{proj}_{f_{\text{front}}}(e_s)$ is the projection of the view-free subject-word encoding vector $e_s$ in the direction of the content-independent front-view feature encoding $f_{\text{front}}$;

when the azimuth angle lies in the range $(90^{\circ},180^{\circ}]\cup[-180^{\circ},-90^{\circ})$:

$$\tilde e_s=e_s-\operatorname{proj}_{f_{\text{front}}}(e_s)-\operatorname{proj}_{f_{\text{side}}}(e_s)+\omega(\varphi)\,f_{\text{back}}$$

where the weight $\omega(\varphi)$, adjusted according to the azimuth angle of the camera parameters, embodies the intensity with which $f_{\text{back}}$ is injected: the closer the azimuth is to $180^{\circ}$ or $-180^{\circ}$, i.e. the closer the current viewing angle is to the back, the larger the weight $\omega(\varphi)$. With this view control at the encoding level, the Stable Diffusion model SD can understand the view control $c_i$ specified by the rendering camera more accurately, without being disturbed by the view preference common in the Stable Diffusion model SD, so that the view prior part $y_d$ becomes more consistent with the view control $c_i$ imposed by the camera parameters. This step yields the new, view-controlled prompt encoding.
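A possible realization of this azimuth-dependent control is sketched below. The split at ±90° and the linear ramp for the injection weight are assumptions for illustration; the text only specifies that the weight grows as the azimuth approaches ±180°.

```python
import torch

def view_controlled_encoding(e_subj: torch.Tensor,
                             f_front: torch.Tensor,
                             f_side: torch.Tensor,
                             f_back: torch.Tensor,
                             azimuth_deg: float) -> torch.Tensor:
    """Adapt the subject-word encoding to the rendering camera's azimuth.

    Front/side region: strip the canonical front-view prior.
    Back region: also strip the side prior and inject the back-view
    feature with a weight that ramps up toward the directly-back view.
    """
    def proj(a, b):
        return (a @ b) / (b @ b) * b

    e = e_subj - proj(e_subj, f_front)            # remove typical front preference
    if abs(azimuth_deg) > 90.0:                   # back half of the azimuth range
        w = (abs(azimuth_deg) - 90.0) / 90.0      # 0 at +/-90 deg, 1 at +/-180 deg (assumed ramp)
        e = e - proj(e_subj, f_side) + w * f_back
    return e
```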
Then, the new prompt encoding vectors are input into the pre-trained Stable Diffusion model SD together with the multi-view rendered images obtained in step 3.
The step 5 comprises the following steps:
During the training of the Stable Diffusion model SD, the data used are text-image pairs. However, descriptions of viewing angles in text are usually rough and generalized, lacking the necessary quantification and precision. This blurred view information limits the ability of the Stable Diffusion model SD to build robust view perception during learning, which in turn makes the conversion from 2D images to 3D content significantly harder. To alleviate this, step 4 improves the view semantic clarity of the user prompt during image generation and ensures that the user prompt and the generated object are aligned in view semantics, thereby effectively reducing the dependence of the generated 3D content on prior view knowledge. However, the mathematical analysis of the "multi-face" problem in step 1 shows another gradient term, $\partial\log p\big(g(\theta,c_i)\big)/\partial g(\theta,c_i)$, which is an unconditional term that does not interact with the user prompt; its effect of viewing angle on the generated content relies only on the view prior knowledge of the Stable Diffusion model SD, so this term also potentially injects prior view preferences into the generated content. To eliminate this preference, the goal is to find a way to link the unconditional guidance term to the view control, i.e. to let the unconditional term be controlled by the camera parameters, thereby injecting view perception capability into the model.
The contrastive language-image pre-training model CLIP is a machine learning model designed to measure the similarity between an image and a text, producing a cosine similarity score between them. The link between view control and rendered images is explored through the cosine similarity score distribution between view description texts (e.g. "back", "side") and rendered images. Two groups of results, with and without the "multi-face" problem, are each sampled uniformly: 500 uniformly spaced camera parameters are sampled from an upper hemisphere of fixed radius, all at the same height and all pointing at the sphere center, and 500 images are rendered from the 3D scene. The cosine similarity between each image and the view description text is computed with the pre-trained contrastive language-image pre-training model. It is found that, for the samples without the "multi-face" problem, the cosine similarity score distribution varies approximately periodically with the angle, and the variation is approximately continuous.
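The probing experiment described here can be reproduced in outline as follows. The renderer and the CLIP-like encoders are passed in as hypothetical callables (`render_scene`, `clip_image_embed`, `clip_text_embed`); the prompt wording and elevation are assumptions, not values from the patent.

```python
import numpy as np

def hemisphere_cameras(n: int = 500, elevation_deg: float = 30.0, radius: float = 2.5):
    """n uniformly spaced azimuths on an upper-hemisphere orbit at a fixed
    elevation, all looking at the sphere center."""
    azimuths = np.linspace(-180.0, 180.0, n, endpoint=False)
    return [{"azimuth": a, "elevation": elevation_deg, "radius": radius} for a in azimuths]

def view_similarity_profile(render_scene, clip_image_embed, clip_text_embed,
                            view_text: str = "the back view of an object"):
    """Cosine similarity between a view-description text and each rendered view."""
    t = clip_text_embed(view_text)
    t = t / np.linalg.norm(t)
    scores = []
    for cam in hemisphere_cameras():
        v = clip_image_embed(render_scene(cam))
        v = v / np.linalg.norm(v)
        scores.append(float(v @ t))        # cosine similarity for this azimuth
    return np.array(scores)                # roughly periodic when views are consistent
```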
Further, based on the exploration of the unconditional term, a cosine-similarity distribution law is obtained by examining the cosine similarity between images. One image is randomly selected from all rendered images as the reference image, and the cosine similarity between every other rendered image and the reference image is computed. Considering the random fluctuations of pitch angle and view size during optimization, as well as image flipping operations, a more global feature is needed; the first-layer vector of the pre-trained encoder ViT is an ideal choice, because it not only retains the detail information of every image patch but also effectively fuses the overall layout and relations of the whole image, giving the feature representation a comprehensive global view. Experiments show that the cosine similarity distribution computed between all rendered images and the reference image is indeed related to the azimuth of the camera parameters: rendered images whose azimuths are closer to that of the reference image have higher cosine similarity scores, and the scores decrease gradually on both sides of the reference azimuth. When the "multi-face" problem appears, however, this cosine similarity distribution is disturbed, which provides a way to explicitly constrain the unconditional denoising results away from the "multi-face" problem.

Specifically, in each round of optimization, $M$ view controls $c_1,\dots,c_M$ are randomly generated, and an order is determined for these view controls according to their azimuth angles; this order follows the cosine-similarity distribution law. A rectangular coordinate system is established and a unit circle is constructed with the origin as its center; the angle of a point on the unit circle increases gradually from $0^{\circ}$ in the negative direction of the vertical axis to $180^{\circ}$ through the positive direction of the horizontal axis, and decreases to $-180^{\circ}$ through the negative direction of the horizontal axis. The randomly generated view controls are placed on the unit circle according to their azimuths, one of them is randomly selected as the reference camera, and a secant line $\ell$ parallel to the horizontal axis is constructed through its point on the unit circle. The distance $d_i$ from the point corresponding to every other view control to the secant $\ell$ is calculated, and the $M$ camera controls are sorted according to the magnitude of $d_i$.
Then, rendered images $x_i$ are obtained under these view controls $c_i$; the latent representation $z_i$ of each rendered image $x_i$ is computed and noise is added to it, and the unconditional noise prediction $\epsilon_\phi(z_i^{t},t,\varnothing)$ output by the Stable Diffusion model SD for the noised latent representation $z_i^{t}$ of the rendered image $x_i$ is obtained, where $\epsilon_\phi(z_i^{t},t,\varnothing)$ denotes the noise component estimated by the Stable Diffusion model SD after the noising process.
Then, the unconditional noise prediction $\epsilon_\phi(z_i^{t},t,\varnothing)$ is removed from the noised latent representation to obtain the unconditional denoising result $\hat z_i$, and the partial-order loss between the unconditional denoising results is calculated. The loss function is:

$$\mathcal{L}_{\mathrm{PSL}}=\sum_{i=1}^{M-1}\max\Big(0,\ \cos\big(v_{i+1},v_{\mathrm{ref}}\big)-\cos\big(v_i,v_{\mathrm{ref}}\big)\Big)$$

where $M$ is the number of view controls $c_i$ in each iteration, $i$ is the index position within the $M$ view controls after sorting, $v_i$ is the first-layer encoding result of the ViT encoder for the unconditional denoising result $\hat z_i$ corresponding to the $i$-th rendered image $x_i$, $v_{\mathrm{ref}}$ is the encoding of the reference view, and $\cos(v_i,v_{\mathrm{ref}})$ is the cosine similarity between $v_i$ and $v_{\mathrm{ref}}$. The $\mathcal{L}_{\mathrm{PSL}}$ term penalizes unconditional denoising results that do not conform to the cosine-similarity distribution law and pulls them toward the cosine-similarity distribution free of the "multi-face" problem. This partial-order loss can repair multi-faced objects in the early stage of iterative optimization, thereby effectively alleviating the "multi-face" problem.
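Putting the ordering and the constraint together, a possible implementation looks like the sketch below. It assumes the monotone, hinge-style form of the partial-order penalty written above and a hypothetical `vit_first_layer` feature extractor; both are illustrative stand-ins rather than the patent's exact formulation.

```python
import math
import torch
import torch.nn.functional as F

def secant_distances(azimuths_deg, ref_index):
    """Distance of each camera's point on the unit circle to the horizontal
    secant line through the reference camera's point (angles measured from
    the negative vertical axis, as in the construction above)."""
    ys = [-math.cos(math.radians(a)) for a in azimuths_deg]   # height on the unit circle
    ref_y = ys[ref_index]
    return [abs(y - ref_y) for y in ys]

def partial_order_loss(uncond_denoised, azimuths_deg, vit_first_layer, ref_index=0):
    """Hinge penalty whenever cosine similarity to the reference fails to
    decrease along the secant-distance ordering of the views."""
    d = secant_distances(azimuths_deg, ref_index)
    order = sorted(range(len(d)), key=lambda i: d[i])
    feats = [vit_first_layer(uncond_denoised[i]).flatten() for i in order]
    ref = feats[0]                                            # reference view (distance 0)
    sims = [F.cosine_similarity(f, ref, dim=0) for f in feats]
    loss = torch.zeros((), device=ref.device)
    for nearer, farther in zip(sims[:-1], sims[1:]):
        loss = loss + torch.clamp(farther - nearer, min=0.0)  # penalize order violations
    return loss
```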
Based on the above method, the invention also provides a zero-sample 3D generating device for enhancing viewing-angle consistency, comprising:
the text input module is used for acquiring text prompt words input by a user;
the point cloud generation module is used for generating rough 3D point cloud according to the text prompt words;
the 3D fitting module is used for fitting the rough point cloud by utilizing 3D Gaussian splatting to generate an explicit 3D Gaussian model;
The multi-view rendering module is used for rendering the 3D Gaussian model from a plurality of view angles to obtain a multi-view rendering image;
The visual angle coding control module is used for eliminating typical visual angle preference and injecting visual angle control, so that visual angle semantic consistency from text to 2D to 3D is improved;
The text generation image module is used for adding noise to the multi-view rendering image, predicting noise according to the user prompt and the multi-view rendering image, and updating 3D Gaussian model parameters according to the prediction result;
and the similarity partial order loss constraint module is used for establishing the viewing angle consistency between the multi-viewing angle images and improving the viewing angle perception capability of the 3D model.
The progress tracking module monitors the training progress and records and displays various indexes in the training process;
the model check point management module is responsible for storing and loading the model state so as to facilitate the sustainability and recovery of model training;
The network interface module is used for processing network communication, receiving external commands and sending rendered images;
and the video generation module is used for generating a video file according to the rendered image.
The coding control module includes:
the visual angle characteristic decoupling unit is used for extracting visual angle characteristic difference values from the text prompts;
The visual angle control unit is used for eliminating typical visual angle preference by utilizing the visual angle characteristic difference value after characteristic decoupling and injecting new visual angle control to enhance visual angle semantic definition;
The text generation image module includes:
an image encoding unit for encoding the multi-view image to obtain a potential representation;
an image noising unit for adding noise to the potential representation of the multi-view image;
the image denoising unit is used for performing conditional denoising operation on the noised image according to the user prompt and performing unconditional denoising operation on the noised image;
the noise comparison unit is used for comparing the predicted noise with the noise adding part and calculating the gradient of the noise tensor;
the parameter updating unit is used for updating the parameters of the 3D Gaussian model through a back propagation algorithm according to the noise tensor gradient;
The similarity partial order loss constraint module comprises:
the camera parameter ordering unit is used for ordering the randomly generated camera parameters according to an ideal cosine similarity distribution rule;
the unconditional denoising result acquisition unit is used for selecting unconditional prediction noise in the image denoising unit according to the ordered camera parameters and acquiring unconditional denoising results according to the unconditional prediction noise;
and the loss calculation unit is used for calculating cosine similarity between unconditional denoising results, calculating partial sequence loss according to an ideal cosine similarity distribution rule, and updating the 3D Gaussian model parameters during counter propagation together with the parameter updating unit.
The invention also provides an electronic device, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to perform the steps of the method according to the instructions.
The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor realizes the steps of the method.
The invention significantly improves the 3D consistency of the generation target at two levels. On the one hand, addressing the conflict brought by the view prior, a view decoupling method VDM is proposed; VDM is a view feature decoupling method that effectively eliminates the view semantic blurring caused by typical view preference, strengthens view control when generating 3D content, and significantly enhances the consistency of view semantics in the text-to-image-to-3D process. On the other hand, addressing the lack of explicit 3D consistency constraints, the invention explores the distribution law of image similarity and proposes the similarity partial-order loss PSL, strengthening view perception in the 2D-to-3D process. Experiments show that the method builds strong associations of view semantics across data of different modalities and effectively alleviates the "multi-face" problem.
The zero-sample 3D generation method for enhancing viewing-angle consistency provides systematic optimization measures for the multi-view consistency problem common in traditional methods, i.e. the "multi-face" problem, and significantly improves the geometric and semantic consistency of the 3D model under different viewing angles. By introducing the view decoupling method VDM and the similarity partial-order loss PSL, the view perception capability of the model is enhanced, and the generation conflicts and view blurring caused by the view prior are effectively resolved through fine-grained view control. Furthermore, the invention uses 3D Gaussian splatting to improve rendering efficiency and model expressiveness, allowing large-scale and dynamic scenes to be processed efficiently in real time, which is particularly important for real-time applications such as virtual reality and game design. The explicit 3D representation is combined with high-quality viewing-angle consistency, guaranteeing consistency and realism when viewed from different angles and greatly improving the user experience and application value of the model. Through the comprehensive application of these techniques, the invention significantly reduces the occurrence of the "multi-face" problem and improves the generation speed and quality of 3D content. The method is not limited to a single scene; it adapts to changing environments and complex scene requirements, providing broader practicality and flexibility for 3D content creation. It therefore has wide application prospects in fields such as industrial design, virtual reality, and game design, and is expected to promote the development of 3D generation technology and bring innovation and value to related industries.
Drawings
Fig. 1 is a training flowchart of a zero sample 3D generation method for enhancing viewing angle consistency according to an embodiment of the present invention.
Fig. 2 is a model frame diagram of a zero-sample 3D generation method for enhancing viewing angle consistency according to an embodiment of the present invention.
Detailed Description
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
As shown in FIG. 1, the embodiment of the invention provides a zero-sample 3D generation method for enhancing viewing-angle consistency. The training process is executed first: at the beginning of training, the environment and a 3D Gaussian model are initialized, and the 3D Gaussian model is used to represent the 3D object. This step mainly prepares the environment and model structure for subsequent training. During rendering, the objects in the 3D Gaussian model are projected onto the imaging plane of the camera, and the projected two-dimensional Gaussian distributions are distributed over the individual tiles. Checkpoints are then used to recover a previously trained state: if a checkpoint file is provided, the program loads the previous model parameters and state to continue the previously interrupted training; otherwise, training starts from scratch. Next, a pre-trained diffusion model (the Stable Diffusion model SD) and the text encodings are set up to guide model training. The text encoding is the vector representation of the subject word selected from the received user prompt, extracted with the pre-trained text encoder CLIP; it is used in the subsequent view feature decoupling process. The user prompt is then input into the pre-trained three-dimensional diffusion model to generate a coarse 3D point cloud.

The training loop then checks whether all specified training iterations have been completed: if the number of iterations reaches the preset maximum, training stops; otherwise, the next iteration begins. The learning rate is adjusted dynamically and gradually reduced during training to stabilize convergence. Camera parameters are then configured, setting the position, viewing angle, focal length, and other parameters; these parameters select the view to be optimized in the current step. The scene is rendered with the currently set camera parameters, generating a set of images used to train the 3D Gaussian model. The prompt text is encoded with the pre-trained text encoder CLIP, view feature decoupling is performed using the prepared subject-word encodings, the typical view preference is eliminated and the view control is injected, and the rendered images are encoded by a pre-trained image encoder VAE. The text encodings and the rendered-image encodings are then input into the Stable Diffusion model SD for noise prediction; the resulting unconditional noise prediction and the text-guided conditional noise prediction are aggregated with weights, and the similarity partial-order loss PSL, the predicted-noise loss, the scale loss, and the total encoding difference loss are calculated using the unconditional noise. Next, the 3D Gaussian model parameters are updated by back-propagation, processor events are recorded to monitor training time and resource consumption, and it is checked whether the current iteration satisfies the conditions for saving the model (e.g. the specified saving interval); if so, the state and parameters of the current model are saved. Finally, the current training state, model parameters, and logs are saved to disk for subsequent continued training or evaluation, and the whole flow ends after all training steps are completed. The overall loop is condensed in the sketch below.
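The outline below condenses this flow into a training-loop skeleton. Every callable it uses is a hypothetical placeholder for the corresponding step of FIG. 1, bundled into a `ctx` object; none of them is an actual API of the patent or of any library.

```python
def train(ctx, gaussians, optimizer, prompt, subject_word, max_iters, save_every=500):
    """ctx bundles the hypothetical helpers: adjust_lr, sample_camera, render,
    encode_prompt, decouple_and_inject, vae_encode, sd_predict_noise,
    sds_loss, psl_loss, save_checkpoint."""
    for it in range(max_iters):
        ctx.adjust_lr(optimizer, it)                          # decay for stable convergence
        cam = ctx.sample_camera()                             # view optimized in this step
        image = ctx.render(gaussians, cam)                    # tile-rasterized render
        text_emb = ctx.decouple_and_inject(                   # remove prior, inject view control
            ctx.encode_prompt(prompt, subject_word), cam)
        latent = ctx.vae_encode(image)                        # latent of the rendered view
        eps_cond, eps_uncond = ctx.sd_predict_noise(latent, text_emb)
        loss = ctx.sds_loss(eps_cond, eps_uncond, latent) + ctx.psl_loss(eps_uncond, cam)
        loss.backward()                                       # gradients w.r.t. Gaussian params
        optimizer.step()
        optimizer.zero_grad()
        if (it + 1) % save_every == 0:
            ctx.save_checkpoint(gaussians, optimizer, it)     # periodic state saving
```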
As shown in FIG. 2, the network framework of this embodiment first obtains the text prompt input by the user and selects the subject word in the prompt; for example, for the prompt "a squirrel is eating a hamburger", the subject word is "squirrel". The subject word is directly input into the pre-trained CLIP encoder to obtain the view-free subject-word encoding vector $e_s$. The subject word is then connected with three different view descriptors, i.e. "front view", "side view", and "back view" are appended to its end, constructing three phrases containing the subject word with a view description; for example, the subject word "squirrel" becomes "squirrel, front view", "squirrel, side view", and "squirrel, back view" after concatenation with the view description phrases.
The three phrases containing the subject word with a view description are input into the CLIP encoder respectively, and the positions of the subject-word encodings are located, giving the subject-word encodings $e_v$ that contain the corresponding view description. Unlike $e_s$, the view-described subject-word encoding $e_v$ contains additional context information, which provides a way to extract from $e_v$ a content-independent view feature encoding $f_v$. This process, which extracts the vertical (orthogonal) component, can be formulated as:

$$f_v=e_v-\operatorname{proj}_{e_s}(e_v)=e_v-\frac{\langle e_v,e_s\rangle}{\lVert e_s\rVert^2}\,e_s$$

where $\operatorname{proj}_{e_s}(e_v)$ is the projection of the view-described subject-word encoding $e_v$ onto the view-free subject-word encoding vector $e_s$; subtracting this projection component yields the content-independent view feature encoding $f_v$, which represents the extracted pure view difference. This difference can be applied to the encoding of other prompts, thereby realizing view control. The user prompt is input into the pre-trained text-to-3D diffusion model PointE to generate coarse three-dimensional point cloud data containing basic geometric information of the object surface. An anisotropic Gaussian is a three-dimensional Gaussian distribution that allows different variances in different directions, i.e. the shape of the distribution can be stretched or scaled along each direction; this property makes it adapt more flexibly to the local structure of the point cloud data. By fitting anisotropic Gaussians, a Gaussian matching the local distribution of the point cloud can be generated for each local region of the point cloud, forming a continuous, smooth representation. This fitting process takes into account the density distribution and local geometry of the point cloud so as to capture the shape details of the object more accurately. The geometric features of the coarse three-dimensional point cloud data generated by the three-dimensional diffusion model PointE are captured with anisotropic Gaussian primitives, whose position, orientation, size, and other information directly define the geometric structure, making this an explicit representation; this process fits an explicitly represented three-dimensional Gaussian model. The GPU-optimized tile rasterization technique efficiently renders a 3D scene into a 2D image by dividing the image into small tiles and exploiting the parallel processing capability of the graphics processing unit GPU; the three-dimensional Gaussian model is rendered with this technique under the randomly generated camera parameters to obtain rendered images from multiple viewing angles. The user prompt is input into the pre-trained text encoder CLIP to obtain the prompt encoding vector, and the position of the subject word in the encoding vector is located according to the subject word.
The view control process can be realized adaptively through the camera parameters of each optimization step. Specifically, when the azimuth angle $\varphi$ of the camera parameters lies in the range $(-90^{\circ},90^{\circ})$:

$$\tilde e_s=e_s-\operatorname{proj}_{f_{\text{front}}}(e_s)$$

where $\operatorname{proj}_{f_{\text{front}}}(e_s)$ is the projection of the view-free subject-word encoding vector $e_s$ in the direction of the content-independent front-view feature encoding $f_{\text{front}}$.

When the azimuth angle lies in the range $(90^{\circ},180^{\circ}]\cup[-180^{\circ},-90^{\circ})$:

$$\tilde e_s=e_s-\operatorname{proj}_{f_{\text{front}}}(e_s)-\operatorname{proj}_{f_{\text{side}}}(e_s)+\omega(\varphi)\,f_{\text{back}}$$

where the weight $\omega(\varphi)$, adjusted according to the azimuth angle of the camera parameters, embodies the intensity with which $f_{\text{back}}$ is injected: the closer the azimuth is to $180^{\circ}$ or $-180^{\circ}$, i.e. the closer the current viewing angle is to the back, the larger the weight $\omega(\varphi)$. With this view control at the encoding level, the Stable Diffusion model SD can understand the view control $c_i$ specified by the rendering camera more accurately, without being disturbed by the view preference common in the Stable Diffusion model SD, so that the view prior part $y_d$ becomes more consistent with the view control $c_i$ imposed by the camera parameters. This step yields the new, view-controlled prompt encoding. The new prompt encoding vector is then input into the pre-trained Stable Diffusion model SD together with the rendered images.
Further, according to the characteristic exploration image of unconditional item noise and the cosine similarity relation between images, one piece is selected randomly from all rendering images to serve as a reference image, cosine similarity between other rendering images and the reference image is calculated, the random fluctuation of pitch angle and view size in the optimization process and the overturning operation of the images are considered, the feature of the image is required to be obtained, the first layer vector in the pre-trained encoder ViT is an ideal choice, and the first layer vector in the encoder ViT not only keeps the detailed information of each small block of the image, but also effectively fuses the overall layout and relation of the whole image, so that the feature representation has a comprehensive global view angle. Experiments show that cosine similarity distribution obtained by calculation of all rendering graphs and a reference image is really related to azimuth angles of camera parameters, cosine similarity scores of the rendering graphs corresponding to azimuth angles which are closer to the azimuth angle of the reference image are higher, and scores gradually decrease from the azimuth angle of the reference image to two sides, however, when a multi-face problem occurs, the cosine similarity distribution is disturbed, and a way is provided for explicitly restraining an unconditional denoising result from being far away from the multi-face problem. Specifically, in each round of optimization steps, random generation is performedIndividual viewing angle controlAnd is controlled according to the viewing angleThe azimuth angles in (a) determine an order for the view control, which follows the cosine similarity distribution law. Establishing a rectangular coordinate system, constructing a unit circle by taking an origin as a circle center, gradually increasing the angle corresponding to the point on the unit circle from 0 DEG in the negative direction of the vertical axis to 180 DEG in the positive direction of the horizontal axis and decreasing the angle to-180 DEG in the negative direction of the horizontal axis, and controlling the randomly generated visual anglePlacing the camera on a unit circle according to the azimuth angle, randomly selecting one of the view angle controls as a reference camera, and constructing a cutting line parallel to the transverse axis by taking the point as a starting pointCalculating point distance secant corresponding to other camerasDistance of (2)According toIs of the size of (a)The individual camera parameters are ordered from small to large. Then, control by these viewing anglesObtaining a rendered imageRendering an imagePotential representation of (a)Rendering an imageIs added to the noisy potential representation of (a) and a rendered image is obtainedPotential representation of (a)Corresponding unconditional noise prediction result after stable diffusion model SD,Representation model estimationNoise components after the noise adding process. Then, willRemoval of unconditional denoising results from denoised potential representationsAnd calculates unconditional denoising resultsPartial sequence loss in between. The loss function is as follows:
In the partial order loss, c_batchsize denotes the number of view controls λ used in each iteration, i denotes the index position among the c_batchsize view controls, f_i denotes the first-layer encoding result of the ViT encoder for the unconditional denoising result corresponding to the i-th rendered image R_i, and cos(f_i, f_j) denotes the cosine similarity between f_i and f_j. The loss term penalizes unconditional denoising results that do not follow the cosine similarity distribution law and pulls them towards the cosine similarity distribution of results free of the multi-face problem. The partial order loss can repair multi-faced objects in the early stage of the iterative optimization, thereby effectively alleviating the multi-face problem. Finally, the model parameters of the 3D Gaussian model are updated through the back-propagation algorithm, improving the quality of the generated 3D object.
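The partial order loss itself can be sketched as follows. Since the exact formula of Loss_partial is not reproduced in this text, the hinge-style form below is an assumption: after the view controls have been sorted by the secant distance d, the cosine similarity of each unconditional denoising result to the reference should not increase along the order, and any violation is penalized.

```python
import torch
import torch.nn.functional as F

def partial_order_loss(vit_feats, order):
    """Assumed hinge-style realization of a partial-order loss over cosine
    similarities (the patented Loss_partial formula may differ).

    vit_feats : (c_batchsize, dim) first-layer ViT encodings f_i of the
                unconditional denoising results
    order     : camera indices sorted by the secant distance d, reference first
    """
    f = F.normalize(vit_feats[order], dim=-1)
    sim = f @ f[0]                      # cosine similarity to the reference result
    # Penalize any later (farther) view that is more similar to the reference
    # than an earlier (closer) view, i.e. violations of the expected monotone order.
    return torch.relu(sim[1:] - sim[:-1]).sum()
```

Under this assumption the loss is zero whenever the similarities already follow the ideal monotone distribution, so it only acts on renders that exhibit the disturbed, multi-face-like distribution.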
Another embodiment of the present invention provides a zero-sample 3D generating apparatus for enhancing viewing angle consistency, including:
the text input module is used for acquiring text prompt words input by a user;
the point cloud generation module is used for generating rough 3D point cloud according to the text prompt words;
the 3D fitting module is used for fitting the rough point cloud by utilizing a 3D Gaussian splatter technology to generate an explicit 3D Gaussian model;
The multi-view rendering module is used for rendering the 3D Gaussian model from a plurality of view angles to obtain a multi-view rendering image;
The visual angle coding control module is used for eliminating typical visual angle preference and injecting visual angle control, so that visual angle semantic consistency from text to 2D to 3D is improved;
The text generation image module is used for adding noise to the multi-view rendering image, predicting noise according to the user prompt and the multi-view rendering image, and updating 3D Gaussian model parameters according to the prediction result;
and the similarity partial order loss constraint module is used for establishing the viewing angle consistency between the multi-viewing angle images and improving the viewing angle perception capability of the 3D model;
The progress tracking module is used for monitoring the training progress and recording and displaying various metrics in the training process;
the model checkpoint management module is used for storing and loading the model state so as to facilitate the continuation and recovery of model training;
The network interface module is used for processing network communication, receiving external commands and sending rendered images;
and the video generation module is used for generating a video file according to the rendered image.
The visual angle coding control module includes:
the visual angle characteristic decoupling unit is used for extracting visual angle characteristic difference values from the text prompts;
The visual angle control unit is used for eliminating typical visual angle preference by utilizing the visual angle characteristic difference value after characteristic decoupling and injecting new visual angle control to enhance visual angle semantic definition;
The text generation image module includes:
an image encoding unit for encoding the multi-view image to obtain a potential representation;
an image noise adding unit for adding noise to the potential representation of the multi-view image;
The image denoising unit is used for performing a conditional denoising operation on the noise-added potential representation according to the user prompt and performing an unconditional denoising operation on the noise-added potential representation;
the noise comparison unit is used for comparing the predicted noise with the noise adding part and calculating the gradient of the noise tensor;
the parameter updating unit is used for updating the parameters of the 3D Gaussian model through a back propagation algorithm according to the noise tensor gradient;
The similarity partial order loss constraint module comprises:
the camera parameter ordering unit is used for ordering the randomly generated camera parameters according to an ideal cosine similarity distribution rule;
the unconditional denoising result acquisition unit is used for selecting unconditional prediction noise in the image denoising unit according to the ordered camera parameters and acquiring unconditional denoising results according to the unconditional prediction noise;
and the loss calculation unit is used for calculating cosine similarity between unconditional denoising results, calculating partial sequence loss according to an ideal cosine similarity distribution rule, and updating the 3D Gaussian model parameters during counter propagation together with the parameter updating unit.
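Purely as an illustration of how the units above could fit together, the following interface sketch uses hypothetical class and method names (none of them are defined in the patent) to make the data flow between the view encoding control, text-to-image and similarity partial order loss components explicit:

```python
from typing import Protocol, Sequence
import torch

class ViewEncodingControlModule(Protocol):
    def inject(self, prompt: str, azimuth_deg: float) -> torch.Tensor:
        """Return a view-controlled prompt encoding for one camera."""

class TextToImageModule(Protocol):
    def predict_noise(self, noised_latents: torch.Tensor,
                      cond: torch.Tensor) -> torch.Tensor:
        """Conditional or unconditional noise prediction for noised latents."""

class SimilarityPartialOrderModule(Protocol):
    def loss(self, uncond_results: torch.Tensor,
             order: Sequence[int]) -> torch.Tensor:
        """Partial order loss over the unconditional denoising results."""
```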
Another embodiment of the present invention provides an electronic device, including a processor and a storage medium;
the processor is operative to perform steps of the method in accordance with the instructions.
Another embodiment of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method.
The method of the embodiment of the invention mainly addresses the multi-view consistency problem, i.e. the multi-face problem, that is common in the prior art. The problem typically appears as inconsistent images generated for the same 3D object at different viewing angles, severely affecting the authenticity and visual effect of the generated results. To solve this problem, the embodiment of the invention introduces the 3D Gaussian splatting technique, the view-angle decoupling method VDM and the similarity partial order loss PSL, and forms a systematic optimization flow. First, a text prompt word describing the 3D object to be generated is input, and the model generates a rough 3D point cloud according to the prompt word. This initially generated point cloud captures the basic shape of the object but falls short in detail and consistency. To improve the accuracy and consistency of the generation, the invention fits the rough point cloud with the 3D Gaussian splatting technique and converts it into an explicit 3D representation consisting of anisotropic Gaussians. Then a 5000-step iterative optimization strategy is adopted, and each iteration comprises the following key steps: a certain viewing angle of the current 3D model is selected and a projection image under that viewing angle is rendered; the text prompt word, the rendered image and the camera parameters are input into the pre-trained diffusion model; and the explicit parameters of the 3D model are updated through the generated 2D image. Through repeated iterative optimization, the consistency of the 3D model across different viewing angles is continuously enhanced. To further alleviate the multi-face problem, the invention introduces the view-angle decoupling method VDM and the similarity partial order loss PSL. The view-angle decoupling method VDM is a view feature decoupling method that effectively eliminates the view-angle semantic ambiguity caused by the typical viewing angle preference, strengthens the view control when generating 3D content, and markedly improves the consistency of view semantics in the text-to-image-to-3D process. On the other hand, aiming at the lack of an explicit 3D consistency constraint, the invention explores the distribution law of image similarity and proposes the similarity partial order loss PSL, thereby enhancing the view-angle perception capability in the 2D-to-3D lifting process. Experimental results show that the occurrence frequency of the multi-face problem is markedly reduced during generation, and the generated 3D model has higher authenticity and consistency across multiple viewing angles. Compared with traditional methods, the method not only improves the generation speed and quality but also shows significant advantages in solving the multi-view consistency problem, and can better meet the demands of fields such as industrial design, virtual reality and game design for high-quality 3D content.
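As a concrete, self-contained illustration of this optimization flow, the toy loop below mimics one score-distillation-style update per step. It is not the patented implementation: the 3D Gaussian renderer, the PointE initialization, the pre-trained SD model, the view-decoupled prompt encoding and the partial order loss are all replaced by stand-ins or omitted, and the noise schedule is assumed; only the loop structure (render, add noise, predict noise with a frozen model, push the resulting gradient back into the 3D parameters) follows the description above.

```python
import torch

# Toy stand-ins: `theta` plays the role of the explicit 3D Gaussian parameters and
# `denoiser` the role of the frozen pre-trained diffusion model. The real method
# renders theta from a sampled camera and encodes the image before noising; here
# the render/encode step is an identity so the loop stays self-contained.
theta = torch.randn(4, 64, 64, requires_grad=True)
denoiser = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
optimizer = torch.optim.Adam([theta], lr=1e-2)

for step in range(5000):                                   # 5000-step optimization as above
    z = theta                                              # "render + encode" (identity stand-in)
    t = torch.randint(20, 980, (1,)).item()
    alpha = 1.0 - t / 1000.0                               # assumed noise schedule
    eps = torch.randn_like(z)
    z_t = alpha ** 0.5 * z + (1.0 - alpha) ** 0.5 * eps    # add noise at time step t
    with torch.no_grad():
        eps_pred = denoiser(z_t.unsqueeze(0)).squeeze(0)   # predicted noise (model is frozen)
    grad = eps_pred - eps                                  # score-distillation gradient signal
    loss = (grad.detach() * z).sum()                       # surrogate whose gradient w.r.t. z is grad
    optimizer.zero_grad()
    loss.backward()                                        # back-propagate into the 3D parameters
    optimizer.step()
```

In the actual method the identity stand-in for rendering would be replaced by the differentiable 3D Gaussian splatting projection under a randomly sampled camera, and the gradient signal would additionally include the similarity partial order loss PSL.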
The invention provides a zero-sample 3D generation method for enhancing viewing angle consistency, and there are numerous methods and ways to implement the technical scheme; the above description is only a preferred embodiment of the invention, and it should be noted that, for those skilled in the art, a number of improvements and modifications may be made without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented by means of the prior art.

Claims (7)

1. A zero sample 3D generation method for enhancing viewing angle consistency, comprising the steps of:
Step 1, carrying out mathematical-level analysis on the multi-face problem in the process of generating a three-dimensional object and providing an optimization target;
Step 2, acquiring text prompt words input by a user, selecting the subject words in the prompt words, connecting the subject words with three different view angle descriptive words, namely, connecting the tail of the subject words with a front view angle, a side view angle and a back view angle respectively, constructing three phrases containing the subject words with view angle descriptions, respectively inputting the three phrases containing the subject words into a pre-trained text encoder CLIP to obtain three coding vectors, respectively extracting coding parts corresponding to the subject words from the three coding vectors, and then respectively extracting the features of the front view angle, the side view angle and the back view angle of the subject word object from the three subject word codes by adopting a method of extracting vertical components;
The step 2 comprises the following steps:
Acquiring the text prompt words input by a user, selecting the subject word in the prompt words, and directly inputting the subject word into the pre-trained CLIP encoder to obtain the view-free subject-word encoding vector c_s describing the subject word without any view angle;
Connecting the subject word with three descriptors with different visual angles, namely connecting the tail of the subject word with a front visual angle, a side visual angle and a back visual angle respectively, and constructing three phrases containing the subject word with visual angle descriptions;
Inputting the three phrases containing the subject word with view descriptions into the CLIP encoder respectively and locating the positions of the subject-word codes to obtain the subject-word encodings c_s^view containing the corresponding view descriptions, with view ∈ {front, side, back}, and extracting from them the content-independent view feature encodings e_view by the method of extracting the vertical component, formulated as follows:
e_view = c_s^view - proj_{c_s}(c_s^view)
wherein proj_{c_s}(c_s^view) represents the projection portion of the subject-word encoding c_s^view containing the corresponding view description on the view-free subject-word encoding vector c_s;
decoupling the prior front-view and side-view features from the subject-word encoding in the user prompt to eliminate the typical viewing angle preference and injecting view control information on this basis, wherein for the back view the back-view feature is injected after the front-view and side-view features have been decoupled, the formula being as follows:
c_s' = c_s - Σ_{view ∈ {front, side}} proj_{e_view}(c_s) + η · e_back
where view = {front, side} represents the views whose prior knowledge needs to be eliminated, covering both the front and the side cases, proj_{e_view}(c_s) represents the projection of the view-free subject-word encoding vector c_s in the direction of the content-independent view feature encoding e_view, e_back represents the content-independent back-view feature encoding, η is a scale factor, and c_s' is the subject-word encoding after view control;
step 3, inputting a pre-trained text into a three-dimensional diffusion model PointE by using a user prompt word to generate rough three-dimensional point cloud data, simultaneously fitting the rough three-dimensional point cloud by using a three-dimensional Gaussian splatter technology to generate a three-dimensional Gaussian model with explicit representation, and then rendering the three-dimensional Gaussian model by adopting camera parameters generated randomly to obtain rendering graphs with more than two visual angles;
Step 4, inputting the user prompt word into the pre-trained text encoder CLIP to obtain a prompt word encoding vector, positioning the position of the subject word in the encoding vector according to the subject word, injecting the front-view, side-view and back-view features extracted in step 2 into the subject-word encoding part of the user prompt word encoding respectively according to the camera parameters randomly generated in step 3, so as to obtain the prompt word encodings corresponding to all camera parameters after view control, and inputting the new prompt word encoding vectors and the rendering maps of more than two viewing angles obtained in step 3 into the pre-trained stable diffusion model SD;
Step 5, generating a corresponding two-dimensional image by the stable diffusion model SD according to the user prompt word and the rendered images, extracting the unconditional prediction noise generated in the generation process, obtaining the unconditional denoising results, exploring the similarity distribution of the images, and constraining the unconditional denoising results with the similarity partial order loss, so as to enhance the viewing angle consistency of the three-dimensional Gaussian model.
2. The method of claim 1, wherein in step 1, the mathematical level analysis comprises:
the following simplified objective is optimized using the denoising diffusion probability model DDPM:
L_DDPM = E_{x, ε, t} [ || ε - ε_φ(x_t, t) ||² ]
wherein L_DDPM represents the loss function of the denoising diffusion probability model DDPM, the expectation E_{x, ε, t} is taken over the original sample data x, the noise ε and the time step t, the original sample data x serves as the input data of the denoising diffusion probability model DDPM, the noise ε follows a standard normal distribution, and the time step t represents the diffusion time step controlling the degree of noise addition and increases gradually from 1 to T in the diffusion model; ε_φ(x_t, t), with φ denoting the model parameters of the denoising diffusion probability model DDPM, is the noise value predicted from the noise-disturbed sample x_t at time step t and from the time step t; the squared two-norm || ε - ε_φ(x_t, t) ||² between the added noise ε and the model-predicted noise measures the difference between the predicted noise and the actually added noise; in the inference process, the denoising diffusion probability model DDPM starts from x_t and samples the previous sample x_{t-1} from a normal distribution with probability density p_φ(x_{t-1} | x_t);
the following simplified objective is optimized by the latent diffusion model LDM:
L_LDM = E_{z, c, ε, t} [ || ε - ε_φ(z_t, t, CLIP(c)) ||² ]
wherein L_LDM is the loss function of the latent diffusion model LDM, the expectation E_{z, c, ε, t} is taken over the latent encoding result z of the original sample data x, the user prompt c, the random noise ε following a standard normal distribution and the time step t, the condition information encoding CLIP(c) represents the result of encoding the user prompt c by the text encoder CLIP, and || ε - ε_φ(z_t, t, CLIP(c)) ||² is the squared two-norm of the difference between the added noise and the predicted noise;
under the denoising diffusion probability model DDPM and latent diffusion model LDM framework, the generation of an image is regarded as a reverse recovery process from the final noise state z_T to the noise-free initial clear state z_0, in which the probability density p_2D(z_0 | c) of generating the final noise-free initial clear state z_0 is expressed by the following formula:
p_2D(z_0 | c) = ∫ p(z_T) ∏_{t=1}^{T} p(z_{t-1} | z_t, c) dz_{1:T}
where T represents the maximum time step that t can take, p(z_T) represents the probability distribution of the final state of the diffusion process, p(z_{t-1} | z_t, c) represents the conditional probability of generating the previous-step latent representation z_{t-1} given the user prompt c and the current latent representation z_t, and dz_{1:T} is the differential element in the integration;
an unnormalized probability density function q_θ of the three-dimensional parameter θ is constructed over all viewing angles, wherein Λ represents the set of all viewing angles;
the density function q_θ of the three-dimensional parameter θ is then redefined, under the condition of a given series of optimization steps τ, the user prompt c and the view control λ_τ together with the projection of the 3D Gaussian model at the corresponding view, as the product of the conditional likelihoods of the individual iteration steps τ:
q_θ = ∏_{τ=1}^{N} p(Z_θ,τ | c, λ_τ)
wherein N represents the maximum optimization step that τ can take, and Z_θ,τ represents the 3D model state controlled by the three-dimensional parameter θ at iteration step τ;
from the perspective of how the stable diffusion model understands the user prompt, the user prompt c is divided into a content part c_c and a view-angle prior part c_v, the content part c_c and the view-angle prior part c_v being orthogonal, and the density function q_θ of the three-dimensional parameter θ is newly expressed as:
q_θ = ∏_{τ=1}^{N} p(Z_θ,τ | c_c, c_v, λ_τ),  s.t.  c_c ⊥ c_v
wherein s.t. denotes the constraint c_c ⊥ c_v; taking the logarithm of both sides of the equation gives:
log q_θ = Σ_{τ=1}^{N} log p(Z_θ,τ | c_c, c_v, λ_τ)
then, using the chain rule, the gradient of log q_θ with respect to θ is expressed as:
∇_θ log q_θ = Σ_{τ=1}^{N} ( ∂ log p(Z_θ,τ | c_c, c_v, λ_τ) / ∂Z_θ,τ ) · ( ∂Z_θ,τ / ∂θ )
wherein ∂Z_θ,τ/∂θ represents the partial derivative of the 3D model state Z_θ,τ with respect to the three-dimensional parameter θ, and ∂ log p(Z_θ,τ | c_c, c_v, λ_τ)/∂Z_θ,τ represents the partial derivative of the logarithmic probability density with respect to the 3D model state Z_θ,τ;
the term ∂ log p(Z_θ,τ | c_c, c_v, λ_τ)/∂Z_θ,τ is further expanded by using the Bayesian theorem and further decomposed, wherein the point conditional mutual information between the view control λ_τ and the view-angle prior part c_v, given the 3D model state Z_θ,τ and the content part c_c, is treated as constant in the current optimization step, and the term PCMI(λ_τ, c_v | Z_θ,τ) is further simplified and expanded with the definition of conditional probability.
3. The method of claim 2, wherein in step 3, the coarse three-dimensional point cloud data comprises basic geometric information of the object surface.
4. A method according to claim 3, wherein step 4 comprises:
Inputting the user prompt words into the pre-trained text encoder CLIP to obtain prompt word encoding vectors, and positioning the positions of the subject words in the encoding vectors according to the subject words;
The camera parameters randomly generated in step 3 are acquired, and the view-angle control process can be adaptively realized through the camera parameters of each optimization step, specifically comprising: when the azimuth angle in the camera parameters lies in (-90°, 90°):
c_s' = c_s - proj_{e_front}(c_s)
wherein proj_{e_front}(c_s) represents the projection of the view-free subject-word encoding vector c_s in the direction of the content-independent front-view feature encoding e_front, and c_s' is the subject-word encoding after view control;
when the azimuth angle lies in (-180°, -90°) ∪ (90°, 180°):
c_s' = c_s - proj_{e_front}(c_s) + w · e_back
wherein the weight w, adjusted according to the azimuth angle in the camera parameters, embodies the strength with which the back-view feature encoding e_back is injected, and the closer the azimuth angle in the camera parameters is to 180° or -180°, i.e. the closer the current viewing angle is to the directly-behind view, the larger the weight w;
then, the new prompt word encoding vector is input into the pre-trained stable diffusion model SD together with the view rendering maps obtained in step 3.
5. The method of claim 4, wherein step 5 comprises:
Uniformly sampling two groups of results, with and without the multi-face problem, wherein the sampling method is to sample uniformly spaced camera parameters from an upper hemisphere of fixed radius, all cameras lying at the same height and pointing to the center of the sphere, rendering images of the 3D scene, exploring the cosine similarity relations among the images, and obtaining the cosine similarity distribution rule;
Randomly selecting one rendering image from all rendering images as a reference image, calculating the cosine similarity between every other rendering image and the reference image, randomly generating c_batchsize view controls λ in each round of optimization steps, and determining an order for the view controls according to the azimuth angles in the view controls λ, the order obeying the cosine similarity distribution rule; establishing a rectangular coordinate system, constructing a unit circle centered at the origin, the angle corresponding to a point on the unit circle increasing gradually from 0° at the negative direction of the vertical axis, through the positive direction of the horizontal axis, up to 180°, and decreasing through the negative direction of the horizontal axis down to -180°; placing the randomly generated view controls λ on the unit circle according to their azimuth angles, randomly selecting one of the view controls as the reference camera, constructing, starting from that point, a secant s parallel to the horizontal axis on the unit circle, calculating the distances d between the points corresponding to the other view controls and the secant s, and sorting the c_batchsize view controls from small to large according to the size of d;
Obtaining the rendered images R, the latent representation ψ of each rendered image R and the noise-added latent representation of the rendered image R, and obtaining, after processing by the stable diffusion model SD, the unconditional noise prediction result ε_φ corresponding to the latent representation ψ of the rendered image R, wherein ε_φ represents the stable diffusion model SD's estimate of the noise component of ψ after the noise addition;
removing ε_φ from the noise-added latent representation to obtain the unconditional denoising results, and calculating the partial order loss between the unconditional denoising results from their pairwise cosine similarities, the loss function being Loss_partial, where c_batchsize denotes the number of view controls λ in each iteration, i denotes the index position among the c_batchsize view controls, f_i denotes the first-layer encoding result of the ViT encoder for the unconditional denoising result corresponding to the i-th rendered image R_i, and cos(f_i, f_j) denotes the cosine similarity between f_i and f_j.
6. An electronic device, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to execute the steps of the method according to any one of claims 1 to 5 according to the instruction.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1-5.
CN202411625074.4A 2024-11-14 2024-11-14 Zero sample 3D generation method for enhancing viewing angle consistency Active CN119152101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411625074.4A CN119152101B (en) 2024-11-14 2024-11-14 Zero sample 3D generation method for enhancing viewing angle consistency

Publications (2)

Publication Number Publication Date
CN119152101A CN119152101A (en) 2024-12-17
CN119152101B true CN119152101B (en) 2025-05-27

Family

ID=93810420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411625074.4A Active CN119152101B (en) 2024-11-14 2024-11-14 Zero sample 3D generation method for enhancing viewing angle consistency

Country Status (1)

Country Link
CN (1) CN119152101B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117853686A (en) * 2023-12-26 2024-04-09 浙江大学 Free text guided arbitrary track three-dimensional scene construction and roaming video generation method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024197740A1 (en) * 2023-03-30 2024-10-03 中山大学 Method and apparatus for generating common-sense after-class exercise in low-resource scenario
CN118411467A (en) * 2024-04-25 2024-07-30 中国科学技术大学 Three-dimensional content generation method, system, equipment and medium based on three-dimensional Gaussian

Also Published As

Publication number Publication date
CN119152101A (en) 2024-12-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant