Object identification method and device based on three-dimensional reconstruction
Technical Field
The present invention relates to the field of computer image processing, and in particular to an object identification method and device based on three-dimensional reconstruction.
Background
In the field of three-dimensional reconstruction, the current mainstream schemes include multi-view geometry based methods and NeRF.
Methods based on multi-view geometry are the traditional approach that predates the popularity of neural networks and are typified by the COLMAP software. COLMAP extracts geometric information from two-dimensional images captured at multiple viewing angles and derives a three-dimensional structure. The pipeline mainly comprises feature extraction and matching, SfM-based sparse reconstruction, MVS-based dense reconstruction, and BA-based camera calibration and optimization; after this process, a three-dimensional model is finally obtained.
NeRF (Neural Radiance Fields) is a neural-network-based three-dimensional reconstruction and rendering technique capable of generating a high-quality three-dimensional scene representation from a set of two-dimensional images. Its core idea is to learn the color and density distribution of each spatial position and viewing angle in the scene through a multi-layer perceptron (MLP) network, and then to produce an output by volume rendering. Multi-view-geometry-based schemes, by contrast, cannot handle texture-less regions.
NeRF training takes images and camera parameters as input; the algorithm applies a high-frequency encoding to each three-dimensional coordinate and feeds it into the neural network, which outputs the color (RGB) and density (the opacity used in volume rendering) of that point. The rendered image is compared with the input real image, and the network parameters are adjusted so that the rendered image better matches the real one. However, NeRF-based schemes suffer from low computational efficiency.
In the field of point-cloud-based three-dimensional scene recognition, PointNet uses a shared multi-layer perceptron to learn the features of each point, so that unordered point clouds can be processed directly. Researchers subsequently proposed the PointNet++ [19] network on this basis, adding hierarchical neighborhood feature extraction and aggregation so that the local and global structures of point cloud data are better captured. DGCNN constructs the adjacency relations of the point cloud data through a dynamic graph, so that the topological structure between points can be captured better. However, the reconstructed model has an irregular topology and is difficult to re-render.
Disclosure of Invention
The invention aims to provide an object identification method and device based on three-dimensional reconstruction, which overcome the defects in the prior art. The technical problem to be solved by the invention is addressed by the following technical solution.
The first aspect of the invention provides an object identification method based on three-dimensional reconstruction, which comprises the following steps:
A three-dimensional reconstruction step based on multi-view geometry, wherein a three-dimensional reconstruction algorithm based on multi-view geometry takes a plurality of scene images as input and outputs a point cloud of the scene;
A texture-less region detection step, namely setting a threshold according to the depth information obtained during the multi-view-geometry three-dimensional reconstruction, and, for the depth information of each photo, determining that a texture-less region exists if the depth value is smaller than the threshold and the region contains no point cloud;
A NeRF-based three-dimensional reconstruction step, namely processing the input images with NeRF to generate a corresponding point cloud, specifically realized by the following substeps:
An image input substep, wherein NeRF relies on multi-view image input, the images coming from multiple camera positions and covering different viewing angles of the target scene to ensure the completeness and accuracy of the reconstruction;
A neural network modeling substep, wherein NeRF uses a multi-layer perceptron (MLP) neural network to fit the radiance field of the scene, the network receiving the pose and ray direction of each camera as input and outputting color and density values for each 3D spatial location, which allows NeRF to generate a continuous three-dimensional representation and to reconstruct depth and geometry information even in texture-less regions;
A volume rendering substep, wherein NeRF renders the density and color values of the three-dimensional space back to a two-dimensional image through volume rendering, gradually optimizes a loss function on the two-dimensional image, and generates a high-quality rendered image matching the input photos;
A point cloud extraction substep, wherein, after the reconstruction of the three-dimensional scene is completed, points with higher density are extracted from the NeRF model as point cloud data, these points representing object boundaries or other important geometric features in the scene;
A point cloud conversion substep, wherein the scene output by NeRF is converted into a three-dimensional point cloud, each point in the point cloud being represented by its position and color value, and, for texture-less regions, NeRF fits their geometric characteristics based on the spatial information of the input pictures so as to generate the corresponding points;
A point cloud fusion step, namely aligning, correcting and integrating the point clouds obtained from NeRF and from the multi-view geometry to ensure that all point clouds lie in the same coordinate system;
and a point cloud classification step of identifying objects from the point clouds based on a deep-learning classification model, wherein each classified point cloud corresponds to a specific object category in the asset library.
With reference to the first aspect, the above-mentioned point cloud fusion step further includes:
A camera pose acquisition substep, wherein, in the multi-view geometric reconstruction, each photo corresponds to the pose of one camera, comprising the rotation and translation information of that camera;
A NeRF point cloud generation substep, wherein the point cloud extracted from NeRF represents a denser, optimized distribution of spatial points in the scene, including color and depth information;
A multi-view geometry point cloud substep, wherein a set of preliminary point clouds from different angles is generated using an SfM (structure from motion) or MVS (multi-view stereo) matching algorithm;
And a fusion substep, wherein the fusion is optimized with an iterative closest point (ICP) algorithm so that the two groups of point clouds are accurately matched in three-dimensional space.
With reference to the first aspect, the above-mentioned point cloud classifying step further includes:
A preprocessing substep, namely preprocessing the fused point cloud, the preprocessing comprising denoising, downsampling and point cloud normalization, so as to reduce the computational burden and improve classification accuracy;
A feature extraction substep, wherein point cloud classification relies on a feature extraction algorithm that extracts the geometric information and spatial distribution of each point into high-dimensional feature vectors, the feature extraction algorithm comprising local curvature, normal vector estimation, and a deep-learning-based point cloud convolutional neural network;
A classification model training substep, wherein deep-learning classification models such as PointNet and PointCNN process the sparse and irregular point cloud data and classify each point cloud based on global and local features;
And an object identification and segmentation substep, wherein the identified object is segmented from the point cloud after classification, the classified point cloud corresponding to a specific object category in the asset library.
With reference to the first aspect, the above-mentioned point cloud classification step further includes a point cloud homogenization step:
A mesh conversion substep, namely converting the point cloud data into a triangular mesh representation using a mesh generation algorithm such as Marching Cubes or Poisson Surface Reconstruction, the connectivity of the mesh vertices describing the surface structure of the object more accurately;
A fixed-resolution sampling substep, namely, after the mesh representation is generated, resampling the mesh at a set resolution and generating new point cloud data through a uniform sampling algorithm, so that the data points of the new point cloud are distributed more evenly in space;
And a point cloud homogenization result substep, wherein the spacing between points of the resulting point cloud data is approximately equal and the points uniformly cover the different regions of the object, thereby avoiding an excessive concentration of point density.
With reference to the first aspect, the above-mentioned point cloud classification step further includes a PCA-based object classification step:
A PCA feature extraction substep, namely performing principal component analysis (PCA) on the homogenized point cloud data, computing a covariance matrix and extracting the first 10 principal components, these principal components describing most of the geometric features of the object;
A principal component alignment substep, namely aligning the extracted principal components with the principal components of similar objects in the asset library, the first principal component being aligned to ensure that the two sets of point clouds are compared in the same orientation, the alignment involving a rigid-body transformation so that the objects are geometrically aligned for subsequent classification;
A principal component distance calculation substep, namely, after alignment is completed, calculating the principal component distance between the current object and each similar object in the asset library, the principal component distance being a Euclidean-distance-based metric that measures the similarity of two objects in the low-dimensional feature space, an object with a smaller principal component distance being considered to have a higher similarity to the current object;
And a most-similar-object return substep, namely selecting the closest object in the asset library according to the calculated principal component distances, this object being the classification result most similar to the current object.
A second aspect of the present invention provides an object identification device based on three-dimensional reconstruction, the device comprising:
A multi-view-geometry-based three-dimensional reconstruction module, configured to take a plurality of scene images as input and output a point cloud of the scene using a three-dimensional reconstruction algorithm based on multi-view geometry;
A texture-less region detection module, configured to set a threshold according to the depth information obtained during the multi-view-geometry three-dimensional reconstruction and, for the depth information of each photo, to determine that a texture-less region exists if the depth value is smaller than the threshold and the region contains no point cloud;
A NeRF-based three-dimensional reconstruction module, configured to process the input images with NeRF and generate a corresponding point cloud, specifically realized by the following submodules:
An image input submodule, wherein NeRF relies on multi-view image input, the images coming from multiple camera positions and covering different viewing angles of the target scene so as to ensure the completeness and accuracy of the reconstruction; for texture-less regions, the neural network of NeRF can be used to fit the color and geometric characteristics;
A NeRF neural network modeling submodule, which uses a multi-layer perceptron (MLP) neural network to fit the radiance field of the scene, the network accepting the pose and ray direction of each camera as input and outputting the color and density values of each 3D spatial location, allowing NeRF to generate a continuous three-dimensional representation and to reconstruct depth and geometry information even in texture-less regions;
A volume rendering submodule, wherein NeRF renders the density and color values of the three-dimensional space back to a two-dimensional image through volume rendering, gradually optimizes a loss function on the two-dimensional image, and generates a high-quality rendered image matching the input photos;
A point cloud extraction submodule, configured to extract points with higher density from the NeRF model as point cloud data after the reconstruction of the three-dimensional scene is completed, these points representing object boundaries or other important geometric features in the scene;
A point cloud conversion submodule, configured to convert the scene output by NeRF into a three-dimensional point cloud, each point in the point cloud being represented by its position and color value; for texture-less regions, NeRF fits their geometric characteristics based on the spatial information of the input pictures so as to generate the corresponding points;
A point cloud fusion module, configured to align, correct and integrate the point clouds obtained from NeRF and from the multi-view geometry, ensuring that all point clouds lie in the same coordinate system;
And a point cloud classification module, configured to identify objects from the point cloud based on a deep-learning classification model, each classified point cloud corresponding to a specific object category in the asset library.
The invention combines the multi-view-geometry and NeRF schemes for three-dimensional reconstruction, can be applied to film and television production, virtual reality and augmented reality, improves the efficiency and accuracy of object identification, and reduces production costs.
Drawings
FIG. 1 is a flow chart of the steps of a three-dimensional reconstruction-based object recognition method of the present invention;
Fig. 2 is an internal structural diagram of the electronic device provided by the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
Abbreviations and key terms commonly used in the art are defined as follows:
A point cloud refers to a set of discrete three-dimensional data points acquired by three-dimensional scanning or stereovision, etc., each point containing spatial coordinates (e.g., x, y, z). The point cloud is generally used for representing the surface structure of an object or a scene, and is widely applied to the fields of computer vision, 3D reconstruction, automatic driving, virtual reality and the like.
A mesh is a three-dimensional geometry made up of vertices, edges, and faces, typically used to represent the surface of an object. The mesh model can accurately describe the shape of objects, and is particularly suitable for scene and object representation in three-dimensional modeling, animation, game development, and computer graphics.
NeRF is a neural network based 3D scene representation method that enables high quality three-dimensional reconstruction from 2D images by mapping arbitrary points in three-dimensional space to color and volume density. NeRF is widely used in the fields of new view angle synthesis, 3D reconstruction, virtual reality and the like.
Multi-view geometry is the field that studies the geometric relationships used to derive three-dimensional structure from images acquired from multiple viewpoints. By analyzing multiple images of the same scene taken by cameras at different positions, multi-view geometric techniques can reconstruct the three-dimensional shape of the scene, and they are widely applied to three-dimensional reconstruction and pose estimation in computer vision.
Classification is a task in machine learning that aims at assigning input data into predefined categories or labels. Classification algorithms typically identify patterns or features in input data by training models and categorize new data based on these features. Classification is widely used in the fields of image recognition, text analysis, speech recognition, etc.
PCA (principal component analysis) is a statistical method for dimension reduction that preserves the principal features or variances of data by transforming high-dimensional data into a low-dimensional space. PCA is widely applied to scenes such as data compression, feature extraction, noise filtration and the like, and has obvious effect especially in processing a high-dimensional data set.
As shown in fig. 1, the specific implementation process of the present invention is as follows:
1) Three-dimensional reconstruction based on multi-view geometry. Any current mainstream three-dimensional reconstruction algorithm based on multi-view geometry can be used; its input is a plurality of scene images and its output is a point cloud of the scene.
2) Texture-less region detection. A threshold is set using the depth information obtained during the multi-view-geometry three-dimensional reconstruction. For the depth information of each photo, if the depth value is smaller than the threshold and the region contains no point cloud, a texture-less region is determined, as in the sketch below.
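This is a minimal Python sketch of the check, assuming the per-view depth map from the multi-view reconstruction is available as a NumPy array together with a boolean mask marking where reconstructed points project into the view; the function name, the mask, and the default threshold value are illustrative assumptions rather than part of the claimed method.

```python
import numpy as np

def detect_textureless_region(depth_map, point_mask, depth_threshold=0.1):
    """Flag pixels belonging to a texture-less region in one view.

    depth_map       : (H, W) per-view depth from the multi-view reconstruction
    point_mask      : (H, W) boolean, True where reconstructed points project into the view
    depth_threshold : depth values below this are treated as unreliable
    """
    low_depth = depth_map < depth_threshold   # depth information smaller than the threshold
    no_points = ~point_mask                   # region not covered by the reconstructed point cloud
    return low_depth & no_points              # both conditions hold -> texture-less region
```

The regions flagged by this mask are the ones handled by the NeRF-based reconstruction in step 3).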
3) Three-dimensional reconstruction based on NeRF.
A) Input photos. NeRF relies on multi-view photo input. The photos may come from multiple camera positions, covering different perspectives of the target scene and preferably evenly distributed, to ensure the completeness and accuracy of the reconstruction. In particular, for texture-less regions, the neural network of NeRF can be used to fit the color and geometric features.
B) Neural network modeling. NeRF uses an MLP (multi-layer perceptron) neural network to fit the radiance field of the scene. The network accepts each camera pose and ray direction as input and outputs color and density values for each 3D spatial location. This allows NeRF to generate a continuous three-dimensional representation that can reconstruct depth and geometry information even in texture-less regions (see the sketch after this list).
C) Volume rendering. NeRF renders the density and color values of the three-dimensional space back to a two-dimensional image through volume rendering, gradually optimizes a loss function on the two-dimensional image, and generates a high-quality rendered image matching the input photos.
D) Point cloud extraction. After the reconstruction of the three-dimensional scene is completed, points with higher density are extracted from the NeRF model as point cloud data. These points typically represent object boundaries or other important geometric features in the scene.
E) Conversion to a point cloud. The scene output by NeRF may be converted into a three-dimensional point cloud, each point in the point cloud being represented by its position and color value. For texture-less regions, NeRF fits their geometric features based on the spatial information of the input pictures, thereby generating the corresponding points.
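The following is a compact PyTorch sketch of the radiance-field MLP and the volume-rendering quadrature described in steps B) and C); the layer sizes, number of frequency bands, and helper names are illustrative assumptions, not the specific network used by the invention.

```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs=10):
    """High-frequency encoding of 3D coordinates before they enter the network."""
    out = [x]
    for k in range(n_freqs):
        out += [torch.sin(2.0 ** k * x), torch.cos(2.0 ** k * x)]
    return torch.cat(out, dim=-1)

class TinyRadianceField(nn.Module):
    """Minimal MLP mapping (encoded position, view direction) -> (RGB, density)."""
    def __init__(self, n_freqs=10, hidden=128):
        super().__init__()
        in_dim = 3 + 3 * 2 * n_freqs + 3          # encoded xyz + raw view direction
        self.n_freqs = n_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                  # 3 color channels + 1 density
        )

    def forward(self, xyz, view_dir):
        h = torch.cat([positional_encoding(xyz, self.n_freqs), view_dir], dim=-1)
        out = self.mlp(h)
        rgb = torch.sigmoid(out[..., :3])          # color in [0, 1]
        sigma = torch.relu(out[..., 3])            # non-negative volume density
        return rgb, sigma

def volume_render(rgb, sigma, deltas):
    """Composite the samples along one ray into a pixel color (step C)).

    rgb: (n, 3) colors, sigma: (n,) densities, deltas: (n,) distances between samples.
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)       # per-segment opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10], dim=0), dim=0)[:-1]  # transmittance
    weights = alpha * trans                        # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)     # rendered pixel color
```

Training renders rays through such a function and minimizes a photometric loss (for example, mean squared error) against the corresponding pixels of the input photos, which corresponds to the loss optimization described in step C).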
4) Point cloud fusion.
A) Camera pose acquisition. In the multi-view geometric reconstruction, each photo corresponds to the pose of one camera (comprising the rotation and translation information of that camera). These poses are used to determine the spatial position of each picture in the scene.
B) NeRF point cloud. The point cloud extracted from NeRF represents a denser, optimized distribution of spatial points in the scene, including color and depth information.
C) Multi-view geometry point clouds. A set of preliminary point clouds from different angles can also be generated by conventional multi-view geometry algorithms such as SfM (structure from motion) or MVS (multi-view stereo matching). These point clouds are typically sparse, but provide additional geometric constraints on the scene.
D) Fusion process. Point cloud fusion involves the alignment, correction and integration of the point clouds obtained from the different methods (NeRF and multi-view geometry). By aligning the camera pose information, each point cloud can be brought into the same coordinate system. The fusion can be further refined with the iterative closest point (ICP) algorithm or another technique, so that the two sets of point clouds are accurately matched in three-dimensional space; a sketch is given below.
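As one possible realization of the fusion in step D), the sketch below uses the Open3D library to register the NeRF point cloud onto the multi-view-geometry point cloud with point-to-point ICP and then merge the two; the voxel size and the identity initialization (in practice the shared camera poses provide a better initial guess) are illustrative assumptions.

```python
import numpy as np
import open3d as o3d

def fuse_point_clouds(nerf_pcd, mvs_pcd, voxel_size=0.02):
    """Align the NeRF point cloud to the multi-view-geometry cloud with ICP and merge them."""
    # Downsample both clouds so the correspondence search stays tractable.
    src = nerf_pcd.voxel_down_sample(voxel_size)
    tgt = mvs_pcd.voxel_down_sample(voxel_size)

    # Point-to-point ICP; the identity matrix stands in for an initial guess
    # derived from the shared camera poses.
    result = o3d.pipelines.registration.registration_icp(
        src, tgt,
        max_correspondence_distance=voxel_size * 2.0,
        init=np.eye(4),
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
    )

    nerf_pcd.transform(result.transformation)   # apply the estimated rigid transform in place
    return nerf_pcd + mvs_pcd                   # merged cloud in a single coordinate system
```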
5) Point cloud classification.
A) Point cloud preprocessing. The fused point cloud is preprocessed, for example by denoising, downsampling and point cloud normalization, to reduce the computational load and improve classification accuracy.
B) Feature extraction. Point cloud classification typically relies on feature extraction algorithms to extract the geometric information and spatial distribution of each point as high-dimensional feature vectors. Common feature extraction methods include local curvature, normal vector estimation, and deep-learning-based point cloud convolutional neural networks (e.g., PointNet).
C) Classification model training. Deep-learning classification models, in particular PointNet and PointCNN, can process sparse, irregular point cloud data and classify each point based on global and local features.
D) Object identification and segmentation. After classification is completed, the identified objects are segmented from the point cloud. Each classified point corresponds to a particular object category in the asset library. A sketch of the preprocessing and basic feature extraction in steps A) and B) is given below.
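The following Open3D sketch illustrates the preprocessing and a simple geometric feature extraction from steps A) and B); the parameter values are illustrative assumptions, and a learned feature extractor such as PointNet would be applied on top of the normalized points rather than being replaced by this code.

```python
import numpy as np
import open3d as o3d

def preprocess_and_extract_features(pcd, voxel_size=0.01):
    """Denoise, downsample and normalize the fused cloud, then estimate per-point normals."""
    # Denoising: remove statistical outliers.
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

    # Downsampling to reduce the computational load.
    pcd = pcd.voxel_down_sample(voxel_size)

    # Normalization: center the cloud and scale it into a unit sphere.
    pts = np.asarray(pcd.points).copy()
    pts -= pts.mean(axis=0)
    pts /= np.linalg.norm(pts, axis=1).max()
    pcd.points = o3d.utility.Vector3dVector(pts)

    # Normal-vector estimation as a basic geometric feature.
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=voxel_size * 4, max_nn=30))
    features = np.hstack([np.asarray(pcd.points), np.asarray(pcd.normals)])
    return pcd, features
```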
6) Point cloud homogenization.
A) Point cloud to mesh conversion. The point cloud data is converted into a triangular mesh representation using a mesh generation algorithm (e.g., Marching Cubes or Poisson Surface Reconstruction). This conversion describes the surface structure of the object more accurately through the connectivity of the mesh vertices.
B) Fixed-resolution sampling. After the mesh representation is generated, the mesh is resampled at a set resolution, and new point cloud data is generated through a uniform sampling algorithm so that the data points of the new point cloud are distributed more evenly in space. This is critical for the subsequent analysis and comparison.
C) Point cloud homogenization result. The spacing between points in the resulting point cloud data is approximately equal, and the points uniformly cover the different regions of the object, avoiding an excessive concentration of point density. A sketch of steps A) and B) is given below.
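The sketch below shows one way to realize steps A) and B) with Open3D, using Poisson surface reconstruction for mesh generation and uniform, area-weighted sampling for the fixed-resolution resampling; the point count and reconstruction depth are illustrative assumptions.

```python
import open3d as o3d

def homogenize_point_cloud(pcd, n_points=4096, poisson_depth=8):
    """Rebuild a surface mesh from the cloud, then resample it into evenly spaced points."""
    # Surface reconstruction; Marching Cubes over a fitted volume is an equivalent alternative.
    pcd.estimate_normals()
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=poisson_depth)

    # Fixed-resolution resampling: draw points uniformly over the mesh surface.
    return mesh.sample_points_uniformly(number_of_points=n_points)
```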
7) Object classification based on PCA.
A) PCA feature extraction. Principal component analysis (PCA) is performed on the homogenized point cloud data. The goal of PCA is to project the high-dimensional point cloud data into a low-dimensional feature space, typically by computing the covariance matrix and extracting the first few principal components. The first 10 principal components may be extracted; they are typically able to describe most of the geometric features of the object.
B) Principal component alignment. The extracted principal components are aligned with the principal components of similar objects in the asset library. Comparing the two sets of point clouds in the same orientation is typically ensured by aligning the first principal component. The alignment may involve a rigid-body transformation so that the objects are geometrically aligned for subsequent classification.
C) Principal component distance calculation. After alignment is completed, the principal component distance between the current object and each similar object in the asset library is calculated. This distance may be based on a metric such as the Euclidean distance to measure the similarity of two objects in the low-dimensional feature space. Objects with smaller principal component distances are considered more similar to the current object.
D) Return of the most similar object. The closest object in the asset library is selected according to the calculated principal component distances. This object is the classification result most similar to the current object and can generally be used for further object recognition or classification tasks. A sketch of steps A) through D) is given below.
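A NumPy sketch of the PCA-based comparison in steps A) through D) follows. The exact feature representation on which the covariance is computed is not specified above, so the sketch assumes each homogenized cloud is described by an (n_points, d) per-point feature matrix with d greater than 10 (for example, the high-dimensional features from step 5)B)); the signature construction, the sign-flip alignment of the first component, and the distance formula are illustrative assumptions.

```python
import numpy as np

def principal_components(features, n_components=10):
    """Top principal directions and variances of one cloud's per-point feature matrix."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)                  # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]      # largest-variance directions first
    return eigvecs[:, order], eigvals[order]

def align_first_component(query_vecs, asset_vecs):
    """Flip the query components so the first axis points the same way as the asset's."""
    if np.dot(query_vecs[:, 0], asset_vecs[:, 0]) < 0:
        query_vecs = -query_vecs
    return query_vecs

def most_similar_asset(query_features, asset_library):
    """Return the asset whose principal-component signature is closest to the query.

    asset_library : dict mapping asset name -> (n_points, d) feature matrix.
    """
    q_vecs, q_vals = principal_components(query_features)
    best_name, best_dist = None, np.inf
    for name, feats in asset_library.items():
        a_vecs, a_vals = principal_components(feats)
        q_aligned = align_first_component(q_vecs, a_vecs)
        # Euclidean distance between the two principal-component signatures.
        dist = np.linalg.norm(q_aligned - a_vecs) + np.linalg.norm(q_vals - a_vals)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist
```

The asset returned by this comparison is the category assigned to the current object, as described in step D).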
Fig. 2 illustrates a physical structure diagram of an electronic device, which may be an intelligent terminal, and an internal structure diagram thereof may be as shown in fig. 2. The electronic device includes a processor, an internal memory, and a network interface connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the electronic device is used for communicating with an external terminal through a network connection.
It will be appreciated by those skilled in the art that the structure shown in fig. 2 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the electronic device to which the present inventive arrangements are applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
The invention has been proposed for use in a communication cloud tray product, and provides innovative solutions for multiple industries by extracting key information and performing intelligent classification. In the fields of film and television production, virtual reality (VR) and augmented reality (AR), the technology can remarkably improve creation efficiency and reduce production costs. For example, complex backgrounds or props can be built quickly for film and television special effects, realistic environment and object models can be generated automatically during game design, and richer, more personalized user experiences can be provided in VR/AR applications. With the rapid growth of the digital content industry, the demand for efficiently generating high-quality visual content keeps increasing. Therefore, the invention not only meets existing market demand but also has the potential to open up new application scenarios, such as the production of interactive teaching materials in online education.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application.
In the above detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, like numerals typically identify like components unless context indicates otherwise. The illustrated embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.