
GB2562037A - Three-dimensional scene reconstruction - Google Patents

Three-dimensional scene reconstruction

Info

Publication number
GB2562037A
GB2562037A GB1706499.9A GB201706499A GB2562037A GB 2562037 A GB2562037 A GB 2562037A GB 201706499 A GB201706499 A GB 201706499A GB 2562037 A GB2562037 A GB 2562037A
Authority
GB
United Kingdom
Prior art keywords
interest
region
space
viewpoint
scanning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1706499.9A
Other versions
GB201706499D0 (en)
Inventor
Tapio Roimela Kimmo
Cricri Francesco
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1706499.9A priority Critical patent/GB2562037A/en
Publication of GB201706499D0 publication Critical patent/GB201706499D0/en
Publication of GB2562037A publication Critical patent/GB2562037A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S17/00Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S17/88Lidar systems specially adapted for specific applications
    • G01S17/89Lidar systems specially adapted for specific applications for mapping or imaging
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/207Image signal generators using stereoscopic image cameras using a single 2D image sensor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/271Image signal generators wherein the generated image signals comprise depth maps or disparity maps
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S13/00Systems using the reflection or reradiation of radio waves, e.g. radar systems; Analogous systems using reflection or reradiation of waves whose nature or wavelength is irrelevant or unspecified
    • G01S13/86Combinations of radar systems with non-radar systems, e.g. sonar, direction finder
    • G01S13/867Combination of radar systems with cameras
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/521Depth or shape recovery from laser ranging, e.g. using interferometry; from the projection of structured light
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/133Equalising the characteristics of different image components, e.g. their average brightness or colour balance

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Electromagnetism (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Analysis (AREA)

Abstract

A method for three-dimensional scene reconstruction is disclosed, comprising: (a) receiving data from a single- or multi-camera device representing a visual image of a space comprising an object, the visual image being captured from a first viewpoint; (b) determining a 3D region of interest associated with the object using the visual image, possibly by means of a convolutional neural network. A depth map is then generated from a second viewpoint, different from the first viewpoint, by scanning a limited portion of the space, which includes the region of interest, at a predetermined scanning resolution or density, possibly using LiDAR. The region of interest may be updated using the depth map or information derived therefrom, which may involve reducing its volume.

Description

(54) Title of the Invention: Three-dimensional scene reconstruction
Abstract Title: Three-dimensional scene reconstruction from images (57) A method for three-dimensional scene reconstruction is disclosed, comprising: (a) receiving data from a single- or multi-camera device representing a visual image of a space comprising an object, the visual image being captured from a first viewpoint; (b) determining a 3D region of interest associated with the object using the visual image, possibly by means of a convolutional neural network. A depth map is then generated from a second viewpoint, different from the first viewpoint, by scanning a limited portion of the space, which includes the region of interest, at a predetermined scanning resolution or density, possibly using LiDAR. The region of interest may be updated using the depth map or information derived therefrom, which may involve reducing its volume.
[Drawing sheets 1/8 to 8/8, containing Figures 1 to 10 of the published application, are omitted here; see the Brief Description of the Drawings below.]
Three-Dimensional Scene Reconstruction
Field of the Invention
This invention relates to methods and systems for three-dimensional scene reconstruction.
Background of the Invention
Three-dimensional scene reconstruction generally refers to the creation of three-dimensional scenes or models from a set of images, for example frames of a video sequence. The video sequence may comprise static and/or dynamic objects.
Using computer vision techniques, such as depth mapping and/or structure-from-motion, it is possible to estimate the three-dimensional geometry of a scene using multiple camera images. The quality of the reconstruction can be enhanced by adding depth sensors to actively measure the three-dimensional scene geometry.
One method for performing depth mapping is using light imaging, detection and ranging (LiDAR) which is a surveying method that measures the distance to a target by illuminating the target with a light beam, usually a laser beam. LiDAR is conventionally used for making high-resolution maps, for example, and more recently for performing control and navigation in some autonomous vehicles.
LiDAR sensors generally work by transmitting laser light pulses to the environment and measuring the time of flight between transmission and the return reflection. Traditional LiDAR sensors employ rotating designs where, for example, a spinning mirror directs the laser pulses to different directions which may enable scanning of a 360 degree environment. More recently, solid-state LiDAR sensors are being used where the rotating design is replaced by a phased array of laser transmitters and receivers with electronic steering of the laser beam.
Typically, LiDAR sensors scan in a fixed pattern. Over a large scan volume, for example a 360 degrees volume, the angular resolution is relatively low.
Summary of the Invention
A first aspect of the invention provides a method comprising: (a) receiving data representing a visual image of a space comprising an object, the visual image being captured from a first viewpoint; (b) determining a region of interest associated with the object using the visual image;
(c) generating a depth map from a second viewpoint, different from the first viewpoint, by scanning a limited portion of the space, which includes the region of interest, at a predetermined scanning resolution or density; and (d) updating the region of interest using the depth map or information derived therefrom.
The method may be performed for a first frame of video content and may further comprise (e) determining if the updated region of interest corresponds to a predefined object or scene and, if so, classifying said region of interest to the object or scene.
If the updated region of interest does not correspond to a predefined object or scene, the method may comprise transferring the updated region of interest to a second, subsequent frame of the video content and repeating (c), (d) and (e) for the updated region of interest.
Generating the depth map may comprise scanning the limited portion of the space at a scanning resolution or density greater than the scanning resolution or density used outside of the limited portion of the space.
Generating the depth map may comprise scanning the limited portion of the space at a slower scan rate than is used for scanning outside of the limited portion of the space.
Generating the depth map may comprise scanning the limited portion of the space using a raster-type pattern in which the vertical line spacing is smaller than is used for scanning outside the limited portion of the space.
The visual image may be captured using a camera device and the depth map may be generated using a light scanning device, wherein the limited portion of the space which is scanned by the light scanning device is identified based on the determined region of interest and the relative positions of the camera device and the light scanning device.
The region of interest may be a three-dimensional region of interest.
The three-dimensional region of interest may be determined by identifying an object in the visual image and generating the three-dimensional region of interest therefrom.
The object may be identified using one or more of motion detection, semantic object detection and face detection.
The object may be identified using a convolutional neural network, the output of which represents a probability distribution over object classes.
The three-dimensional region of interest may be generated in (b) by projecting a volumetric shape over the object.
The volumetric shape may be substantially a cone.
Multiple two-dimensional visual images may be captured using a multi-capture device at the first viewpoint and the three-dimensional region of interest is provided as the intersections of the volumetric shapes.
The three-dimensional region of interest may be updated in (d) by reducing the volume of the volumetric shape based on the depth map or information derived therefrom.
The three-dimensional region of interest may be updated in (d) by identifying using the depth map the presence of an object within the region of interest, and reducing the volume of the volumetric shape in accordance with at least part of its position or shape within the region of interest.
The three-dimensional region of interest may be updated in (d) by identifying a part of the object which is closest to the first viewpoint, and by reducing the volumetric shape such that it extends substantially from said closest part.
The visual image data may be received from a multi-camera device.
The depth map may be generated using a LiDAR depth sensor.
A second aspect of the invention provides a computer program comprising instructions that, when executed by a computer, control it to perform the method of (a) receiving data representing a visual image of a space comprising an object, the visual image being captured from a first viewpoint; (b) determining a region of interest associated with the object using the visual image; (c) generating a depth map from a second viewpoint, different from the first viewpoint, by scanning a limited portion of the space, which includes the region of interest, at a predetermined scanning resolution or density; and (d) updating the region of interest using the depth map or information derived therefrom.
A third aspect of the invention provides a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising: (a) receiving data representing a visual image of a space comprising an object, the visual image being captured from a first viewpoint; (b) determining a region of interest associated with the object using the visual image; (c) generating a depth map from a second viewpoint, different from the first viewpoint, by scanning a limited portion of the space, which includes the region of interest, at a predetermined scanning resolution or density; and (d) updating the region of interest using the depth map or information derived therefrom.
A fourth aspect of the invention provides an apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor: to receive data representing a visual image of a space comprising an object, the visual image being captured from a first viewpoint; to determine a region of interest associated with the object using the visual image; to generate a depth map from a second viewpoint, different from the first viewpoint, by scanning a limited portion of the space, which includes the region of interest, at a predetermined scanning resolution or density; and to update the region of interest using the depth map or information derived therefrom.
A fifth aspect of the invention provides an apparatus configured to perform the method of (a) receiving data representing a visual image of a space comprising an object, the visual image being captured from a first viewpoint; (b) determining a region of interest associated with the object using the visual image; (c) generating a depth map from a second viewpoint, different from the first viewpoint, by scanning a limited portion of the space, which includes the region of interest, at a predetermined scanning resolution or density; and (d) updating the region of interest using the depth map or information derived therefrom.
Brief Description of the Drawings
The invention will now be described, by way of non-limiting example, with reference to the accompanying drawings, in which:
Figure 1 is a plan view of a capture space comprising an object, a camera and a depth scanning device, according to embodiments of the invention;
Figure 2 is a perspective view of the Figure 1 embodiment, including an analysis and control system;
Figure 3 is a perspective view of the Figure 1 embodiment, including an alternative analysis and control system arrangement;
Figure 4 is a schematic diagram of components of the Figure 2 analysis and control system; Figure 5 is a flow diagram showing processing steps performed by the Figure 2 analysis and control system;
Figure 6 is a perspective view of the Figure 1 embodiment, indicating image capture from the camera, which is useful for understanding the invention;
Figures 7a and 7b are, respectively, graphical representations of a region of interest and cone of interest shown in the camera and depth scanning device co-ordinate systems;
Figure 8 is a graphical representation of how scanning may be performed by a depth scanning device relative to a region of interest detected by the Figure 2 analysis and control system;
Figure 9 is a plan view showing, in graphical representation, how a region of interest can be refined using methods and systems of the preferred embodiments of the invention; and
Figure 10 is a flow diagram showing processing steps performed by the Figure 2 analysis and control system in accordance with a further embodiment.
Detailed Description of Preferred Embodiments
Embodiments herein relate to methods and systems for three-dimensional (3D) scene reconstruction, for example for creating 3D scenes or models from images, which may be images of a video sequence. The video sequence may comprise static and/or dynamic scenes, for example comprising one or more objects.
For example, some embodiments may relate to the creation of a 3D scene from a real-world scene for use in a virtual reality (VR) system or in a 3D computer game.
For example, the generated 3D scene may be provided as VR content from a content provider to one or more remote devices over a network, for example an IP network such as the internet, for output to a VR display system such as a head mounted display (HMD). The VR content may be live or pre-captured omnidirectional video data, representing a 360 degree field of view.
The VR display system may be provided with a live or stored feed from a video content source, the feed representing a virtual reality space for immersive output through the display system. In some embodiments, audio is provided, which may be spatial audio.
Nokia’s OZO (RTM) VR camera is an example of a VR capture device which comprises a camera and microphone array to provide VR video and a spatial audio signal, but it will be appreciated that the embodiments are not limited to VR applications nor the use of microphone arrays at the capture point.
In order to generate, or reconstruct, a real-world scene into a 3D scene for use in such applications, embodiments herein may employ LiDAR or similar ranging technologies to sense depth information which is used in addition to visual imaging information captured using one or more cameras. Depth information is provided in the form of a so-called ‘depth map’ which can be of any data format.
A depth map may be considered any array or set of data points consisting of a direction or position over a surface and a corresponding distance from a reference point or surface.
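By way of a non-limiting illustration only (the patent does not prescribe any particular data format), such a depth map could be held as a flat array of direction/distance samples. The field names and layout in the following sketch are assumptions made for clarity, not part of the disclosed method.

```python
# Illustrative sketch only: one possible in-memory layout for a depth map,
# stored as (azimuth, elevation, distance) samples relative to the sensor.
import numpy as np

depth_map_dtype = np.dtype([
    ("azimuth", np.float32),    # horizontal beam direction, radians
    ("elevation", np.float32),  # vertical beam direction, radians
    ("distance", np.float32),   # measured range from the reference point, metres
])

def make_depth_map(azimuths, elevations, distances):
    """Pack parallel sequences of directions and ranges into one structured array."""
    dm = np.empty(len(distances), dtype=depth_map_dtype)
    dm["azimuth"], dm["elevation"], dm["distance"] = azimuths, elevations, distances
    return dm
```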
More particularly, one or more visual cameras are employed to capture a visual image, e.g. an image represented by RGB data and visible to the human eye when reproduced to an output display device. The visual image may be a still or moving image. The visual image may be encoded in any suitable format, e.g. RAW, JPEG, BMP, TIFF, PNG, MPEG, FLV or the like.
In some embodiments, one or more multi-camera devices may be used, for example the above-mentioned OZO camera.
Additionally, one or more LiDAR (or similar) sensors are employed at a spatially separate location from the visual image camera(s), but with a known positional relationship between the two types of device.
In some embodiments, rather than (or in addition to) employing a LiDAR sensor to provide a depth map of the overall capture space or environment, the visual image data from the camera and the relative position of said camera from the LiDAR sensor are used to focus or modulate a higher resolution scanning over a limited area of the overall scene. This more limited area may be scanned at a higher, or dense, resolution in order to provide more accurate depth information of any object or region of interest identified within the visual camera image, for example, which can then be used to refine said object or region of interest.
In some embodiments, the refined object or region of interest is then moved to the next frame in the video sequence using optical flow for repeating the process in order to further refine the object or region of interest in subsequent frames. If the object or region of interest is moving, then its shape in the next frame may not be the same, and hence further refinement may be performed.
In some embodiments, the refinement process of the object or region of interest is repeated until one or more so-called stopping criteria is or are detected. For example, there may be provided a subsequent classification stage whereby an object or region of interest is classified as a particular object or object type, forming part of the 3D scene for subsequent use. Classification may use one or more example methods to be described later on. Once classified as a particular type of object or region of interest, only those objects or regions of interest which remain unclassified may continue to undergo the refinement process.
Figure 1 is a top plan view of a capture space 1 within which one three-dimensional object, in this case a cone 5, is shown situated on the ground 3. In practice, there may be multiple objects distributed within the capture space 1 but for ease of explanation, only one is shown in the Figure.
Also provided in the capture space 1 is a camera device 7, which may be (but is not limited to) a multi-camera device, such as an OZO capture device. Such a multi-camera device 7 may comprise a body having an array of separate cameras distributed over the surface of the body at known positions in order to capture, for example, a 360 degree field-of-view which can be joined together using image processing techniques to generate all or part of the field-of-view.
The multi-camera device 7 may capture still images or frames of video.
For the avoidance of doubt, the use of a multi-camera device is not essential and for ease of explanation, we will assume the use of a single camera device 7.
The camera device 7 is positioned at a first position or viewpoint 8 on the ground 3, spatially separate from the cone object 5.
Also provided in the capture space 1 is a LiDAR sensor 9 at a second position or viewpoint 10 on the ground 3, spatially separate from the cone object 5 and from the camera device 7. The
LiDAR sensor 9 may be of any known type, for example a coherent or incoherent device and/or provided as a spinning mirror or solid state array. The LiDAR sensor 9 measures the distance to a target by illuminating the target with a light beam, usually a laser beam, to generate therefrom a depth map. The depth map comprises an array of 3D points consisting, for each point, of a direction and a depth.
Figure 2 shows the cone object 5, camera device 7 and LiDAR sensor 9 in perspective view. The camera device 7 is shown mounted on a stand 11, e.g. a tripod or similar device, which raises it above the ground 3. Similarly, the LiDAR sensor 9 is shown mounted on a stand 12.
Associated with both the camera device 7 and the LiDAR sensor 9 is an analysis and control system 13. The analysis and control system 13 in overview is a computer processing system, or network of such systems, arranged to receive the image and depth map data from the respective camera device 7 and LiDAR sensor 9, to perform functions to be described below, including in some cases providing one or more control signals to the camera device 7 and the LiDAR sensor 9.
With reference to Figure 3, in a further embodiment, the analysis and control system 13 may be divided into separate constituent modules. For example, a visual analysis system (VAS) 14 is associated with the camera device 7. A depth analysis system (DAS) 15 is associated with the LiDAR sensor 9. A controller system 17 is connected to both the VAS 14 and DAS 15 for receiving analysis outputs on respective signal lines 19, 21 and, responsive thereto, for producing control outputs on signal lines 23, 25.
In some embodiments, the VAS 14 may be a learned model such as a neural network. For example, the VAS 14 may be a convolutional neural network (CNN). The CNN may be coupled with a recurrent neural network (RNN) for temporal analysis. The output of the VAS may be a vector, for example representing a probability distribution over object classes, if the aim is to scan an object. From the probability distribution, the system may infer the object class with highest probability.
Similarly, in some embodiments, the DAS 15 may be a learned model or neural network. For example, the DAS 15 may be a CNN, optionally preceded by a dense neural layer, such as a fully connected layer, for converting the relatively sparse LiDAR data matrix to a dense matrix of activations, more suited to the input of a CNN. The output of the DAS 15 may be a vector, for example representing a probability distribution over object classes. From the probability distribution, the system may infer the object class with highest probability.
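As a hedged sketch only, a classifier of the kind described for the VAS or DAS could look as follows in PyTorch; the layer sizes, the class count and the dense "densifying" layer used to convert a sparse LiDAR matrix before the convolutions are illustrative assumptions, not details taken from this disclosure.

```python
# Minimal sketch of a CNN classifier whose output is a probability
# distribution over object classes (layer sizes and class count are assumed).
import torch
import torch.nn as nn

class SimpleClassifier(nn.Module):
    def __init__(self, in_channels=3, num_classes=10, densify=False, in_features=None):
        super().__init__()
        # Optional dense layer mapping a flattened sparse LiDAR matrix to a
        # dense 64x64 activation map before the convolutions (DAS variant).
        self.densify = nn.Linear(in_features, 64 * 64) if densify else None
        self.features = nn.Sequential(
            nn.Conv2d(1 if densify else in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.head = nn.Linear(16 * 8 * 8, num_classes)

    def forward(self, x):
        if self.densify is not None:
            x = self.densify(x.flatten(1)).view(-1, 1, 64, 64)
        z = self.features(x).flatten(1)
        return torch.softmax(self.head(z), dim=1)  # probability distribution over classes

# The object class with the highest probability is then inferred with:
#   probs = model(batch); predicted_class = probs.argmax(dim=1)
```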
Figure 4 is an example schematic diagram of components of the analysis and control system 13. Alternatively, Figure 4 may represent components of each of the VAS 14, DAS 15 and controller system 17 in the Figure 3 embodiment.
The analysis and control system 13 has a controller 30, RAM 32, a memory 34, and a network interface 40. The controller 30 is connected to each of the other components in order to control operation thereof. In some embodiments, an optional display and user control peripheral(s) may be provided for displaying and controlling a user interface for user control of the processing system 30.
The memory 34 may be a non-volatile memory such as read only memory (ROM), a hard disk drive (HDD) or a solid state drive (SSD). The memory 34 stores, amongst other things, an operating system 46 and one or more software applications 44. The RAM 32 is used by the controller 30 for the temporary storage of data. The operating system 46 may contain code which, when executed by the controller 30 in conjunction with RAM 32, controls operation of each of the hardware components.
The controller 30 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, or plural processors.
The analysis and control system 13 (or, if separate systems are provided, the VAS 14, DAS 15 and control system 17) may be a standalone computer, a server, a console, or a network thereof.
In some embodiments, the analysis and control system 13 may also be associated with external software applications. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications may be termed cloud-hosted applications.
The network interface 40 receives video data from the VAS 14 and the DAS 15.
Figure 5 is a flow diagram showing processing steps performed in overview by the analysis and control system 13, or, more particularly, the software application 44 indicated in Figure 4. The software application 44 may store the relative positions between the camera device 7 and the LiDAR sensor 9, using any co-ordinate system that permits a re-projection of the second viewpoint 10 relative to the first viewpoint 8.
In a first step 5.1, visual capture is performed, e.g. by initiating capture by the camera device 7 of one or more visual images of the capture space 1. If a multi-camera device is used, the camera device 7 may capture multiple images, e.g. simultaneously, thereby to provide 360 degree coverage of the capture space 1.
In a second step 5.2, which may be performed in parallel with step 5.1, the LiDAR sensor 9 is initiated to generate a uniform depth map, similarly of the overall capture space 1 (or at least a substantial part of it).
At this stage in operation, little or nothing is known about the scene contents and hence the overall capture space is scanned uniformly to get a basic level of depth sensing for the whole scene.
In the Figure 3 embodiment, step 5.1 may be performed by the VAS 14 and step 5.2 by the DAS 15, under the control of the controller system 17.
In a subsequent step 5.3, one or more regions of interest are identified by the analysis and control system 13 (or by the VAS 14 in the Figure 3 embodiment).
As will be known, various methods and systems exist for identifying regions of interest in two-dimensional (2D) visual content. These may include motion detection techniques, semantic object detection, salient object detection and facial detection. These methods may employ the CNN model mentioned previously. Specific details of these methods are known in the art and hence a detailed explanation of each is not required herein. Each of these methods and systems generates from analysis of the 2D visual content a 2D region of pixels that represent an estimated region of interest.
In the context of 3D applications, each 2D region of interest may be projected into 3D. For example, at a basic level, each 2D region of interest may be projected into a so-called ‘cone of interest’ (COI) which extends outwardly from the viewport of the visual camera 7 and towards the region of interest (hence producing a cone-like region.) However, when the depth of the object within the region of interest can be estimated, as is performed subsequently, the COI may extend outwardly from the estimated visual object itself and is therefore a refined and more accurate representation.
In a multi-camera setup, for example if an OZO device were used as the camera device 7, each individual image may represent a given region of interest captured from a slightly different angle. Hence, there may be generated, or estimated, multiple COIs. The intersections between the individual COIs may form volumes of interest (VOIs). Therefore, by detecting or estimating the same semantic content in multiple views, the VOI encapsulating the object may be defined in 3D data format. Note that a COI may be considered a VOI (as it is a 3D projection from the 2D region of interest) but is less optimal than a multi-camera VOI.
In the foregoing, the terms COI and VOI are applicable in the cases where a single camera device and a multi-camera device are employed, respectively. The general principles of operation remain the same, however.
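A minimal sketch of lifting a 2D region of interest to a cone of interest, assuming a pinhole camera with known intrinsic matrix K, is given below. The bounding-box representation and the frustum membership test are assumptions made for illustration; a VOI from a multi-camera device would simply require a point to pass the corresponding test for every camera.

```python
# Illustrative only: back-project the corners of a 2D ROI through the camera
# intrinsics to obtain a cone (frustum) of interest in the camera frame.
import numpy as np

def roi_to_cone_rays(bbox_xyxy, K):
    """Return unit rays, in the camera frame, through the corners of a 2D ROI.

    Corners are taken in the order top-left, top-right, bottom-right,
    bottom-left so that consecutive cross products point into the frustum.
    """
    x0, y0, x1, y1 = bbox_xyxy
    corners = np.array([[x0, y0, 1.0], [x1, y0, 1.0], [x1, y1, 1.0], [x0, y1, 1.0]])
    rays = (np.linalg.inv(K) @ corners.T).T
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

def point_in_cone(p_cam, corner_rays):
    """Crude membership test: is a camera-frame point inside the corner-ray frustum?"""
    for i in range(4):
        n = np.cross(corner_rays[i], corner_rays[(i + 1) % 4])  # inward side-plane normal
        if np.dot(n, p_cam) < 0:
            return False
    return True

# With several cameras, a point belongs to the volume of interest (VOI) only
# if point_in_cone(...) holds for its position expressed in every camera frame.
```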
The notions of a COI and a VOI are significant in situations where the LiDAR sensor 9 is not positioned at the same location as the camera 7, from where each visual region of interest is captured. This enables the system to re-project each COI or VOI to the LiDAR co-ordinate system, and hence limit the LiDAR sensor’s re-scanning area or zone at the higher resolution or density. Hence, for the remainder of the description, it is to be assumed that the camera device 7 and the LiDAR sensor 9 are not co-located, and hence re-projection is needed to map the position of the COI or VOI to the LiDAR sensor’s co-ordinate system, or more particularly the DAS 15 or associated control system associated with the LiDAR sensor 9. If the camera device 7 and LiDAR sensor 9 were co-located, re-projection would be trivial.
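Continuing the sketch above, and assuming the calibration between the two viewpoints is stored as a rigid transform (rotation R, translation t) from the camera frame to the LiDAR frame, the re-projection of the cone into the LiDAR co-ordinate system might look as follows; the near and far limits are hypothetical example values.

```python
# Sketch only: re-project camera-frame geometry into an assumed LiDAR frame.
import numpy as np

def camera_to_lidar(points_cam, R_lidar_cam, t_lidar_cam):
    """Re-project Nx3 camera-frame points into the LiDAR co-ordinate system."""
    return points_cam @ R_lidar_cam.T + t_lidar_cam

def cone_in_lidar_frame(corner_rays_cam, R_lidar_cam, t_lidar_cam, near=0.5, far=50.0):
    """Express the cone of interest as an 8-vertex frustum in the LiDAR frame."""
    near_pts = camera_to_lidar(corner_rays_cam * near, R_lidar_cam, t_lidar_cam)
    far_pts = camera_to_lidar(corner_rays_cam * far, R_lidar_cam, t_lidar_cam)
    return near_pts, far_pts  # bounds the zone to be re-scanned at higher density
```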
In a subsequent step 5.4, for each region of interest (or in the present case, for each COI or VOI) identified in step 5.3, the analysis and control system 13 (or the DAS 15) controls the LiDAR sensor 9 to perform a higher resolution depth scanning over a limited area of the capture space 1. This depth scan is based on the identified COI or VOI and the relative position between the camera device 7 and the LiDAR sensor 9, which is saved in the memory 34. Thus, the LiDAR sensor 9 does not perform a uniform scan as before, but rather performs a higher resolution scan of the COI or VOI (or a limited volume of the capture space 1 that includes or encloses the COI or VOI) to generate a more accurate, denser, depth map from the perspective of the second viewpoint 10.
In a subsequent step 5.5, the region of interest, or COI or VOI identified in step 5.3, is refined based on the high-resolution depth map generated in step 5.4. Typically, this will limit or reduce the size or volume of the region of interest, or the COI or VOI, making it a more accurate representation of the associated object.
In a subsequent step 5.6, the refined region of interest, or COI or VOI, is moved into the next frame of the video sequence, and the process may repeat from step 5.3. This is in order to further refine the region of interest, or COI or VOI, in subsequent frames. If the object or region of interest is moving, then its shape in the next frame may not be the same, and hence further refinement may be performed.
To illustrate the Figure 5 process, Figure 6 shows the camera device 7 capturing a visual image of the cone object 5 from its first viewpoint 8. In fact, the camera device 7 may capture images of the whole surrounding environment, but it is assumed that the analysis and control system 13 (or the VAS 14 in the Figure 3 embodiment) identifies the cone object 5 as a 2D region of interest.
Figure 7a is a graphical representation of the 2D region of interest 50 corresponding to the cone object 5 captured by the camera device 7, which is shown within said device’s co-ordinate system 52. This 2D region of interest 50 may be projected into a COI or VOI extending from the first viewpoint 8 towards the region of interest 50.
This region of interest, or COI or VOI, may then be re-projected into the LiDAR co-ordinate system 54, based on the pre-calibrated relationship between the first and second viewpoints 8,10.
Figure 7b represents the viewpoint of the LiDAR sensor 9, with the region 56 indicating the COI or VOI extending from the first viewpoint (indicated by the apex 57 of said region.)
Thus, in step 5.4 of the overall method, the subsequent LiDAR scan may be performed at a higher resolution over this region 56 rather than over the entire scene. For ease of explanation, the region 56 may be referred to hereafter as ROI_LiDAR.
In step 5.4, the scanning of the ROI_LiDAR may be performed in a number of ways. To some extent, the scanning is dependent on the particular form of LiDAR sensor 9 used. For example, the LiDAR sensor 9 may use a constant laser firing rate and the analysis and control system 13 (or the DAS 15) may modulate or control the beam direction only.
In some embodiments, the LiDAR sensor 9 may be a solid-state phased-array with electronic modulation of the beam direction. Such a LiDAR sensor 9 will typically have a limited field-of-view which in use may be scanned in a raster-like pattern, similar to a cathode ray tube (CRT). In the present case, the steering signals issued to the LiDAR sensor 9 may be modulated so that the scanning beam moves faster over regions outside of ROI_LiDAR 56 and slows down over ROI_LiDAR 56. This, combined with the fixed firing rate of the laser beam, produces a denser sampling of the ROI_LiDAR 56. In a practical capture setup, multiple such LiDAR sensors 9 may be employed and the steering signals of each modulated individually.
In some embodiments, the LiDAR sensor 9 may be a rotating optical sensor that scans the laser beam over 360 degrees in the horizontal direction. For example, the LiDAR sensor 9 may be a solid-state unit with a one-dimensional beam steering system, resulting in a constant speed in the horizontal direction, but modulated in the vertical direction.
Referring to Figure 8, for example, the resulting pattern of a fixed number of 360 degree scan lines may be adjusted so that as the horizontal direction approaches each part of the ROI_LiDAR 56, the scan line spacing is reduced, indicated by reference numeral 58, to produce a higher resolution scan. Outside of the ROI_LiDAR 56 the scan line spacing is wider.
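The dwell-time idea can be illustrated with a short sketch: with a fixed laser firing rate, dividing the sweep speed inside the ROI_LiDAR azimuth band by some factor makes consecutive pulses land closer together there. The azimuth band and densification factor below are made-up example values, not parameters taken from the disclosure.

```python
# Illustrative sketch: denser angular sampling inside an ROI band at a
# constant firing rate, obtained purely by slowing the beam sweep there.
import numpy as np

def sweep_azimuths(n_pulses, roi_az=(0.8, 1.2), densify=5.0):
    """Return azimuth angles (radians) for one 360-degree sweep.

    `n_pulses` is the pulse count of a uniform sweep; inside the ROI band the
    sweep speed is divided by `densify`, so more pulses (and hence samples)
    fall there while the firing rate itself never changes.
    """
    step = 2.0 * np.pi / n_pulses
    az, angles = 0.0, []
    while az < 2.0 * np.pi and len(angles) < 10 * n_pulses:  # safety bound
        angles.append(az)
        inside = roi_az[0] <= az <= roi_az[1]
        az += step / densify if inside else step
    return np.array(angles)
```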
In step 5.5 of the Figure 5 method, refinement of each region of interest (or COI or VOI) is performed. In the previous step 5.4, more accurate depth measurements are produced by the LiDAR sensor 9 in the form of a depth map. Each LiDAR return value is effectively a 3D point, consisting of a direction and depth parameter, and only the points falling inside the region of interest (or COI or VOI) 56 are actually part of the region of interest seen by the camera device 7.
Referring to Figure 9, for example, the region indicated by reference numeral 60 indicates that part of the scan pattern which is scanned at a higher resolution by the LiDAR sensor 9 in step 5.4. Any LiDAR sensor rays that pass through the COI or VOI 56, i.e. those outside of region 60, are interpreted by the analysis and control system 13 as indicating that the COI or VOI is empty in those locations. Where the cone object 5 is detected by the LiDAR sensor 9 due to the receipt of return reflections, there is provided an estimate for the closest depth inside the COI or VOI 56 as seen by the camera 7. This enables the COI or VOI 56 to be reduced in size for the next iteration, and also makes any stereo depth estimation of the ROI more accurate.
For example, reference numeral 61 indicates the near-depth range of the COI or VOI 56, i.e.
that which is nearest to the camera device 7. In the next iteration, therefore, the COI or VOI may be assumed to start from that depth, rather than from that of the camera device 7. For the next iteration of the Figure 5 method, the COI or VOI will therefore comprise a smaller volume in which the regions either side of the object 5 are eliminated. When seen from the camera device 7, it will coincidentally appear like the view in Figure 7a.
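A hedged sketch of this refinement step is given below, using the same frustum convention as the earlier cone-of-interest sketch: it keeps only the LiDAR returns that fall inside the cone, takes the nearest depth along the camera's optical axis, and adopts it as the cone's new near limit for the next iteration.

```python
# Sketch only: shrink the cone of interest using high-density LiDAR returns.
import numpy as np

def _inside_cone(p_cam, corner_rays):
    """Frustum membership test (same corner-ray convention as the COI sketch)."""
    return all(np.dot(np.cross(corner_rays[i], corner_rays[(i + 1) % 4]), p_cam) >= 0
               for i in range(4))

def refine_cone_near_limit(returns_lidar, corner_rays_cam, R_lidar_cam, t_lidar_cam, old_near):
    """Return an updated near distance for the COI, given Nx3 LiDAR return points."""
    # Map LiDAR returns back into the camera frame (inverse of the rigid transform).
    pts_cam = (returns_lidar - t_lidar_cam) @ R_lidar_cam
    inside = np.array([_inside_cone(p, corner_rays_cam) for p in pts_cam])
    if not inside.any():
        return old_near                      # cone looks empty here; leave it unchanged
    nearest = pts_cam[inside][:, 2].min()    # closest depth along the camera optical axis
    return max(old_near, nearest)            # the COI may now be assumed to start here
```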
Thus, as further iterations are performed, a more accurate representation of the cone object 5 (and other objects, if multiple objects are present in the capture space 1) is generated which can be transferred to the next frame.
In some embodiments, the refinement process of the object or region of interest is repeated until one or more so-called stopping criteria is or are detected. For example, there may be provided a subsequent classification stage whereby an object or region of interest is classified as a particular object or object type, forming part of the 3D scene for subsequent use. Once classified as a particular type of object or region of interest, only those regions of interest, or
COIs or VOIs, which remain unclassified may continue to undergo the refinement process. When all regions of interest, or COIs or VOIs, in a scene are classified, the 3D scene is reconstructed.
Figure 10 is a flow diagram showing processing steps performed in overview by the analysis and control system 13, or, more particularly, the software application 44 indicated in Figure 4 in accordance with a further embodiment. The processing steps are similar to those shown in Figure 5, save for additional steps of testing each region of interest against a set of stopping criteria to determine whether or not to propagate the region(s) of interest to the next frame for further refinement.
In a first step 10.1, visual capture is performed, e.g. by initiating capture by the camera device 7 of one or more visual images of the capture space 1. If a multi-camera device is used, the camera device 7 may capture multiple images, e.g. simultaneously, thereby to provide 360 degree coverage of the capture space 1.
In a second step 10.2, which may be performed in parallel with step 10.1, the LiDAR sensor 9 is initiated to generate a uniform depth map, similarly of the overall capture space 1 (or at least a substantial part of it).
In a subsequent step 10.3, one or more regions of interest are identified by the analysis and control system 13 (or by the VAS 14 in the Figure 3 embodiment). The methods and systems mentioned previously for identifying regions of interest in two-dimensional (2D) visual content are applicable as before.
In a subsequent step 10.4, for each region of interest (or in the present case, for each COI or VOI) identified in step 10.3, the analysis and control system 13 (or the DAS 15) controls the LiDAR sensor 9 to perform a higher resolution depth scanning over a limited area of the capture space 1. This depth scan is based on the identified COI or VOI and the relative position between the camera device 7 and the LiDAR sensor 9, which is saved in the memory 34. Thus, the LiDAR sensor 9 does not perform a uniform scan as before, but rather performs a higher resolution scan of the COI or VOI (or a limited volume of the capture space 1 that includes or encloses the COI or VOI) to generate a more accurate, denser, depth map from the perspective of the second viewpoint 10.
In a subsequent step 10.5, the region of interest, or COI or VOI identified in step 10.3, is refined based on the high-resolution depth map generated in step 10.4. Typically, this will limit or reduce the size or volume of the region of interest, or the COI or VOI, making it a more accurate representation of the associated object.
In a subsequent step 10.6, each refined region of interest, or COI or VOI, is taken in turn. In step 10.7, each refined region of interest, or COI or VOI, is tested against a set of stopping
criteria, examples of which will be described below. If the stopping criteria is fulfilled, the process moves to step 10.8 whereby the current region of interest, or COI or VOI, is removed from the active refinement list so that no further iteration is performed in respect of this region. If the stopping criteria is/are not fulfilled, the process moves to step 10.9 and the process may repeat from step 10.3.
For example, the analysis and control system 13 may use one or more of the following criteria or tests to decide when to stop the scanning in step 10.7.
In a first example, the analysis and control system 13 may run a classifier, such as a CNN model, on the depth data generated by the LiDAR sensor 9. At a time when the confidence of the classifier in the class of an object or area in the region of interest, or COI or VOI, is above a predetermined threshold, the scanning is stopped.
The classifier may be trained in an offline stage using a dataset comprising pairs of depth data and corresponding object labels. The confidence data may be obtained for example by a probability assigned to each output class. Alternatively, or additionally, the confidence data may be obtained by running an ensemble of neural networks and then considering the different outputs as a measure of uncertainty of the ensemble.
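As an illustrative sketch only, the confidence test could reduce to comparing the top-class probability against a threshold, with ensemble disagreement as an alternative measure of uncertainty; the numeric thresholds are arbitrary example values.

```python
# Sketch of the first stopping criterion: stop re-scanning a region once the
# depth-data classifier is confident enough about its class.
import numpy as np

def confident_enough(class_probs, threshold=0.9):
    """class_probs: 1D array of per-class probabilities for one region."""
    return float(class_probs.max()) >= threshold

def ensemble_uncertain(prob_list, max_std=0.1):
    """Alternative: treat disagreement across an ensemble as uncertainty."""
    stacked = np.stack(prob_list)             # shape (n_models, n_classes)
    return float(stacked.std(axis=0).max()) > max_std
```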
In a further example, which may be considered an improvement, visual data classification methods may be used. In general, such methods may be found to surpass human-level visual data classification. A classifier is first run on the visual data corresponding to the region of interest, or COI or VOI. Then, a LiDAR data classifier is continuously run on the region of interest captured by the LiDAR sensor. If the LiDAR data classifier correctly classifies the region of interest, i.e. as the same class as predicted by the visual data classifier, and optionally with a sufficiently high confidence level, then scanning may be stopped. In other words, two classifiers are employed, and both need to agree in order to stop the scanning. If the two classifiers do not agree after a certain time period, or after they have both achieved a target confidence, the scanning is stopped in any case.
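A sketch of this dual-classifier criterion, with an assumed confidence threshold, might be as simple as the following; the threshold value is illustrative.

```python
# Sketch only: both classifiers must agree (and, optionally, be confident)
# before scanning of the region is stopped.
def classifiers_agree(visual_probs, lidar_probs, min_conf=0.8):
    same_class = int(visual_probs.argmax()) == int(lidar_probs.argmax())
    confident = min(float(visual_probs.max()), float(lidar_probs.max())) >= min_conf
    return same_class and confident
```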
In a further example, offline processing may be performed once, to train a generative model such as a neural auto-encoder or an adversarial neural network or other suitable model. This may be performed to map depth data of objects, e.g. LiDAR data, to the corresponding visual appearance of the same objects, e.g. using RGB data. This trained model may then be used for continuously receiving LiDAR data from the focussed regions of interest and generating the corresponding visual appearance. The generated visual appearance may improve over time with the increased LiDAR data from a certain region of interest. Further, the generated
image may be continuously compared to the actual region of interest captured by the camera. This comparison may use loss measurements, e.g. cross-entropy or mean squared error (MSE) or perceptual losses, such as from a pre-trained feature extractor. If the loss value is below a threshold, or alternatively if it does not decrease any further, the stopping condition may be satisfied.
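As a minimal sketch, assuming both images are cropped to the same region of interest and scaled to the range [0, 1], the loss test could be a mean-squared-error comparison against an arbitrary example threshold (the generative model producing the synthetic appearance is not shown).

```python
# Sketch only: compare the appearance generated from LiDAR data with the
# captured camera image of the same region and test the MSE loss.
import numpy as np

def appearance_loss_low(generated_rgb, captured_rgb, threshold=0.01):
    """Both inputs: HxWx3 float arrays in [0, 1] covering the same ROI crop."""
    diff = generated_rgb.astype(np.float32) - captured_rgb.astype(np.float32)
    return float(np.mean(diff ** 2)) < threshold
```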
Thus, the above methods and systems may be employed to generate or reconstruct static and/or dynamic scenes into a 3D scene. The methods and systems are performed in an improved manner by identifying one or more regions of interest using visual data captured by a camera, and then performing higher resolution, or more dense, scanning of one or more regions corresponding to the regions of interest when re-projected to the LiDAR co-ordinate system. This permits a more accurate representation of the regions of interest for use in subsequent stages, for example in iterative refinement and/or classification of the regions of interest as particular objects.
Static regions of interest may be discarded or ignored for further optimisation when a certain level of confidence is achieved, so that refinement is performed over subsequent iterations for those regions of interest which are uncertain or moving. This can be performed by optical flow between the successive frames. For example, the pixels belonging to each region of interest in one frame are moved to the next frame with the flow, resulting in a potential region of interest for the next frame.
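A sketch of this propagation step is shown below; OpenCV's Farneback dense optical flow is used here purely as a stand-in for whatever optical-flow method the system actually employs, and the boolean mask representation of a region of interest is an assumption.

```python
# Sketch only: carry a region-of-interest mask from one frame to the next
# using dense optical flow.
import cv2
import numpy as np

def propagate_roi_mask(prev_gray, next_gray, roi_mask):
    """Move a boolean HxW ROI mask from the previous frame into the next one."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    ys, xs = np.nonzero(roi_mask)
    new_xs = np.clip((xs + flow[ys, xs, 0]).round().astype(int), 0, roi_mask.shape[1] - 1)
    new_ys = np.clip((ys + flow[ys, xs, 1]).round().astype(int), 0, roi_mask.shape[0] - 1)
    new_mask = np.zeros_like(roi_mask)
    new_mask[new_ys, new_xs] = True
    return new_mask  # a potential region of interest for the next frame
```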
It will be appreciated that static regions of interest may start moving at a later time, and hence refinement using the above methods and systems may be performed again on such regions of interest. A motion detection method or algorithm may be used for this purpose, e.g. by monitoring the position of classified regions of interest and detecting movement over successive frames to re-allocate said region(s) to the refinement process outlined above.
The proposed methods and systems provide an efficient way for sensing depth information in a semantically meaningful way. Overall, the system performance may be improved by achieving a higher quality of reconstruction of the 3D scene with the same or similar data throughput.
It will be appreciated that the above described embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present application.
Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.

Claims (23)

Claims
    1. A method comprising:
    (a) receiving data representing a visual image of a space comprising an object, the visual image being captured from a first viewpoint;
    (b) determining a region of interest associated with the object using the visual image;
    (c) generating a depth map from a second viewpoint, different from the first viewpoint, by scanning a limited portion of the space, which includes the region of interest, at a predetermined scanning resolution or density; and
    (d) updating the region of interest using the depth map or information derived therefrom.
  2. The method of claim 1, the method being performed for a first frame of video content and further comprising (e) determining if the updated region of interest corresponds to a
    predefined object or scene and, if so, classifying said region of interest to the object or scene.
  3. The method of claim 2, wherein, if the updated region of interest does not correspond to a predefined object or scene, transferring the updated region of interest to a second, subsequent frame of the video content and repeating (c), (d) and (e) for the updated region of
    interest.
  4. The method of any preceding claim, wherein (c) comprises scanning the limited portion of the space at a scanning resolution or density greater than the scanning resolution or density used outside of the limited portion of the space.
  5. The method of claim 4, wherein (c) comprises scanning the limited portion of the space at a slower scan rate than is used for scanning outside of the limited portion of the space.
  6. The method of claim 4 or claim 5, wherein (c) comprises scanning the limited portion of the space using a raster-type pattern in which the vertical line spacing is smaller than is used for scanning outside the limited portion of the space.
  7. The method of any preceding claim, wherein the visual image is captured using a
    camera device and the depth map is generated using a light scanning device, wherein the limited portion of the space which is scanned by the light scanning device is identified based on the determined region of interest and the relative positions of the camera device and the light scanning device.
  8. The method of any preceding claim, wherein the region of interest is a three-dimensional region of interest.
  9. The method of claim 8, wherein the three-dimensional region of interest is determined by identifying an object in the visual image and generating the three-dimensional region of interest therefrom.
  10. The method of claim 9, wherein the object is identified using one or more of motion detection, semantic object detection and face detection.
  11. The method of claim 10, wherein the object is identified using a convolutional neural network, the output of which represents a probability distribution over object classes.
  12. The method of any of claims 8 to 11, wherein the three-dimensional region of interest is generated in (b) by projecting a volumetric shape over the object.
  13. The method of claim 12, wherein the volumetric shape is substantially a cone.
  14. The method of claim 12 or claim 13, wherein multiple two-dimensional visual images are captured using a multi-capture device at the first viewpoint and the three-dimensional region of interest is provided as the intersections of the volumetric shapes.
  15. The method of any of claims 12 to 14, wherein the three-dimensional region of
    interest is updated in (d) by reducing the volume of the volumetric shape based on the depth map or information derived therefrom.
  16. The method of claim 15, wherein the three-dimensional region of interest is updated in (d) by identifying using the depth map the presence of an object within the region of
    interest, and reducing the volume of the volumetric shape in accordance with at least part of its position or shape within the region of interest.
  17. The method of claim 16, wherein the three-dimensional region of interest is updated in (d) by identifying a part of the object which is closest to the first viewpoint, and reducing
    the volumetric shape such that it extends substantially from said closest part.
  18. The method of any preceding claim, wherein the visual image data is received from a multi-camera device.
  19. The method of any preceding claim, wherein the depth map is generated using a LiDAR depth sensor.
  20. A computer program comprising instructions that, when executed by a computer, control it to perform the method of any preceding claim.
  21. A non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising:
    (a) receiving data representing a visual image of a space comprising an object, the visual image being captured from a first viewpoint;
    (b) determining a region of interest associated with the object using the visual image;
    (c) generating a depth map from a second viewpoint, different from the first
    viewpoint, by scanning a limited portion of the space, which includes the region of interest, at a predetermined scanning resolution or density; and (d) updating the region of interest using the depth map or information derived therefrom.
  22. An apparatus, the apparatus having at least one processor and at least one memory having computer-readable code stored thereon which when executed controls the at least one processor:
    to receive data representing a visual image of a space comprising an object, the visual image being captured from a first viewpoint;
    to determine a region of interest associated with the object using the visual image;
    to generate a depth map from a second viewpoint, different from the first viewpoint, by scanning a limited portion of the space, which includes the region of interest, at a predetermined scanning resolution or density; and to update the region of interest using the depth map or information derived
    therefrom.
  23. An apparatus configured to perform the method of any of claims 1 to 19.
GB1706499.9A 2017-04-25 2017-04-25 Three-dimensional scene reconstruction Withdrawn GB2562037A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1706499.9A GB2562037A (en) 2017-04-25 2017-04-25 Three-dimensional scene reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1706499.9A GB2562037A (en) 2017-04-25 2017-04-25 Three-dimensional scene reconstruction

Publications (2)

Publication Number Publication Date
GB201706499D0 GB201706499D0 (en) 2017-06-07
GB2562037A true GB2562037A (en) 2018-11-07

Family

ID=58795746

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1706499.9A Withdrawn GB2562037A (en) 2017-04-25 2017-04-25 Three-dimensional scene reconstruction

Country Status (1)

Country Link
GB (1) GB2562037A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3702809A1 (en) * 2019-02-27 2020-09-02 Nokia Solutions and Networks Oy Device navigation
GB2598078A (en) * 2020-06-17 2022-02-23 Jaguar Land Rover Ltd Vehicle control system using a scanning system
WO2023102224A1 (en) * 2021-12-03 2023-06-08 Innopeak Technology, Inc. Data augmentation for multi-task learning for depth mapping and semantic segmentation
US12223588B2 (en) 2018-09-27 2025-02-11 Snap Inc. Three dimensional scene inpainting using stereo extraction

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474161B2 (en) * 2017-07-03 2019-11-12 Baidu Usa Llc High resolution 3D point clouds generation from upsampled low resolution lidar 3D point clouds and camera images
CN110599416B (en) * 2019-09-02 2022-10-11 太原理工大学 Non-cooperative target image blind restoration method based on spatial target image database
CN111709982B (en) * 2020-05-22 2022-08-26 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN115861572B (en) * 2023-02-24 2023-05-23 腾讯科技(深圳)有限公司 Three-dimensional modeling method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014154839A1 (en) * 2013-03-27 2014-10-02 Mindmaze S.A. High-definition 3d camera device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014154839A1 (en) * 2013-03-27 2014-10-02 Mindmaze S.A. High-definition 3d camera device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12223588B2 (en) 2018-09-27 2025-02-11 Snap Inc. Three dimensional scene inpainting using stereo extraction
EP3702809A1 (en) * 2019-02-27 2020-09-02 Nokia Solutions and Networks Oy Device navigation
GB2598078A (en) * 2020-06-17 2022-02-23 Jaguar Land Rover Ltd Vehicle control system using a scanning system
GB2598078B (en) * 2020-06-17 2022-12-21 Jaguar Land Rover Ltd Vehicle control system using a scanning system
WO2023102224A1 (en) * 2021-12-03 2023-06-08 Innopeak Technology, Inc. Data augmentation for multi-task learning for depth mapping and semantic segmentation

Also Published As

Publication number Publication date
GB201706499D0 (en) 2017-06-07

Similar Documents

Publication Publication Date Title
GB2562037A (en) Three-dimensional scene reconstruction
US11915502B2 (en) Systems and methods for depth map sampling
US10460463B2 (en) Modelling a three-dimensional space
KR101783379B1 (en) Depth camera compatibility
EP2992508B1 (en) Diminished and mediated reality effects from reconstruction
KR101772719B1 (en) Depth camera compatibility
CN114761997B (en) Target detection methods, terminal equipment and media
US20210374986A1 (en) Image processing to determine object thickness
US20180189565A1 (en) Mapping a space using a multi-directional camera
US20120242795A1 (en) Digital 3d camera using periodic illumination
KR102151815B1 (en) Method and Apparatus for Vehicle Detection Using Lidar Sensor and Camera Convergence
KR20210042942A (en) Object instance mapping using video data
WO2021114776A1 (en) Object detection method, object detection device, terminal device, and medium
JP7285834B2 (en) Three-dimensional reconstruction method and three-dimensional reconstruction apparatus
US20180350087A1 (en) System and method for active stereo depth sensing
CN104581124A (en) Method and apparatus for generating depth map of a scene
KR20120003232A (en) Apparatus and method for volume prediction based occlusion zone reconstruction
WO2021114773A1 (en) Target detection method, device, terminal device, and medium
JP2016537901A (en) Light field processing method
CN110278431A (en) Phase Detection Autofocus 3D Image Capture System
JP2018195241A (en) Information processing apparatus, information processing method, and program
JP2022045947A5 (en)
KR20220133766A (en) Real-time omnidirectional stereo matching method using multi-view fisheye lenses and system therefore
Xu et al. Kinect-based easy 3d object reconstruction
KR20190029842A (en) Three-Dimensional Restoration Cloud Point Creation Method Using GPU Accelerated Computing

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)