
HK1182248B - Generating free viewpoint video using stereo imaging - Google Patents

Generating free viewpoint video using stereo imaging

Info

Publication number
HK1182248B
Authority
HK
Hong Kong
Prior art keywords
stereo
scene
active
point cloud
depth map
Prior art date
Application number
HK13109440.2A
Other languages
Chinese (zh)
Other versions
HK1182248A1 (en)
Inventor
查尔斯.日特尼克
辛.秉.康
亚当.柯克
帕特里克.斯威尼
阿米特.米塔尔
大卫.哈尼特
大卫.埃雷克
干尚‧米特拉
克斯图提斯.帕蒂耶尤纳斯
亚龙.埃谢
西蒙.温德
Original Assignee
Microsoft Technology Licensing, LLC
Priority date
Filing date
Publication date
Priority claimed from US13/273,213 (US20130095920A1)
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1182248A1
Publication of HK1182248B


Description

Generating free viewpoint video using stereo imaging
Technical Field
The invention relates to a method and a system for generating free viewpoint video.
Background
Free Viewpoint Video (FVV) is a video capture and playback technique in which an entire scene is captured from multiple angles simultaneously and the viewing angle is controlled dynamically by the viewer during playback. Unlike traditional video, which is captured by a single camera and features a fixed viewing perspective, FVV capture uses an array of video cameras and related techniques to record a video scene from multiple perspectives at once. During playback, intermediate synthetic viewpoints between the known real viewpoints are synthesized, allowing seamless spatial navigation within the camera array. In general, denser camera arrays consisting of more video cameras produce more realistic results during FVV playback. When more real data is recorded by a dense camera array, image-based rendering methods are more likely to generate high-quality synthetic viewpoints, since they draw on more real data. In sparser camera arrays with less real data, more estimation and approximation is required to generate the synthetic viewpoints, so the results are less accurate and therefore less realistic.
Newer active depth-sensing techniques, such as the Kinect™ system from Microsoft Corporation, improve three-dimensional reconstruction by using structured light (i.e., active stereo) to extract geometry from a video scene, as opposed to passive methods that rely solely on image data captured with a video camera under ambient or natural lighting conditions. The structured-light approach allows FVV to extract denser depth data, because the light pattern provides additional texture on the scene for denser stereo matching. By comparison, passive methods are generally unable to produce reliable data at surfaces that appear textureless under ambient and natural lighting conditions. Because they produce denser depth data, active stereo techniques tend to require fewer cameras for high-quality three-dimensional scene reconstruction.
In prior-art systems such as the Kinect™ system, an Infrared (IR) pattern is projected onto a scene and captured by a single IR camera. The depth map can be extracted by finding local shifts of the light pattern. Despite the advantages of structured-light techniques, a number of problems limit the utility of such devices for creating FVV.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects described herein. This summary is not an extensive overview of the claimed subject matter. It is intended to neither identify key or critical elements of the claimed subject matter nor delineate the scope of the subject innovation. Its sole purpose is to present some concepts of the claimed subject matter in a simplified form as a prelude to the more detailed description that is presented later.
Embodiments provide a method for generating video using an active Infrared (IR) stereo module. The method includes calculating a depth map of a scene with the active IR stereo module. The depth map may be calculated by: projecting an IR dot pattern onto the scene, capturing a stereoscopic image from each of two or more synchronized IR cameras, detecting a plurality of dots of the IR dot pattern within the stereoscopic images, calculating a plurality of feature descriptors corresponding to the plurality of dots of the IR dot pattern in the stereoscopic images, calculating a disparity map between the stereoscopic images based on a comparison of corresponding feature descriptors in each stereoscopic image, and generating a depth map of the scene using the disparity map. The method further includes generating a point cloud of the scene in three-dimensional space using the depth map. The method also includes generating a mesh of the point cloud and generating a projected texture map of the scene from the mesh of the point cloud. The method further includes generating the video by combining the projected texture map with real images.
Another embodiment provides a system for generating video with an active IR stereo module. The system includes a depth map calculation module configured to calculate a depth map of the scene with the active IR stereo module and to calculate a disparity map between stereo images of the scene based on a comparison of corresponding feature descriptors of the IR dot pattern in each stereo image, and a point cloud generation module configured to generate a point cloud of the scene in three-dimensional space using the depth map, wherein the active IR stereo module includes three or more synchronized cameras and an IR dot pattern projector to project an IR dot pattern onto the scene. The modules further include a point cloud mesh generation module configured to generate a mesh of the point cloud and a projected texture map generation module configured to generate a projected texture map of the scene from the mesh of the point cloud. Further, the modules include a video generation module configured to generate a video of the scene using the projected texture map.
Furthermore, another embodiment provides one or more non-transitory computer-readable storage media for storing computer-readable instructions. The computer readable instructions, when executed by one or more processing devices, provide a stereo module system for generating video with an active IR stereo module. The computer-readable instructions include code configured to compute a depth map of a scene with an active IR stereo module by: projecting an IR dot pattern onto a scene, capturing a stereoscopic image from each of two or more synchronized IR cameras, detecting a plurality of dots within the stereoscopic image, calculating a plurality of feature descriptors corresponding to the plurality of dots in the stereoscopic image, calculating a disparity map between the stereoscopic images, and generating a depth map of the scene using the disparity map. The computer-readable instructions further include code configured to generate a point cloud of the scene in three-dimensional space using the depth map, generate a mesh of the point cloud, generate a projected texture map of the scene from the mesh of the point cloud, and generate the video by combining the projected texture map with the real image.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
FIG. 1 is a block diagram of a stereo module system for generating Free Viewpoint Video (FVV) using an active IR stereo module;
FIG. 2 is a schematic diagram of an active IR stereo module that may be used to generate a depth map of a scene;
FIG. 3 is a process flow diagram illustrating a method of generating a depth map using an active IR stereo module;
FIG. 4 is a schematic diagram of one type of partitioning (binning) method that may be used to identify feature descriptors within stereo images;
FIG. 5 is a schematic diagram of another type of partitioning method that may be used to identify feature descriptors within a stereo image;
fig. 6 is a process flow diagram illustrating a method for generating FVV using an active IR stereo module;
FIG. 7 is a schematic diagram of a system of active IR stereo modules connected by a synchronization signal that may be used to generate a depth map of a scene;
FIG. 8 is a process flow diagram illustrating a method of generating a depth map for each of two or more genlocked active IR stereo modules;
fig. 9 is a process flow diagram illustrating a method for generating FVV using two or more genlocked active IR stereo modules; and
fig. 10 is a block diagram illustrating a tangible computer-readable medium storing code adapted to generate FVV using an active IR stereo module.
The same numbers are used throughout the disclosure and figures to reference like components and features. The numbers in the 100 series refer to the features originally established in fig. 1, the numbers in the 200 series refer to the features originally established in fig. 2, the numbers in the 300 series refer to the features originally established in fig. 3, and so on.
Detailed Description
As discussed above, Free Viewpoint Video (FVV) is a technique for video playback in which the viewing perspective is dynamically controlled by the viewer. Unlike traditional video, which is captured by a single camera and features a fixed viewing perspective, FVV capture uses an array of video cameras and related techniques to simultaneously record a video scene from multiple perspectives. Data from the camera array is processed using a three-dimensional reconstruction method to extract texture-mapped geometry of the scene. Synthetic viewpoints are then generated at arbitrary positions using an image-based rendering method. The texture-mapped geometry recovered at each time frame allows the viewer to control both the spatial and temporal position of the virtual camera or viewpoint, which is the essence of FVV. In other words, virtual navigation through both space and time is achieved.
Embodiments disclosed herein describe methods and systems for generating an FVV of a scene using active stereopsis. Stereopsis (or simply "stereo") is the process of extracting depth information of a scene from two or more different perspectives. Stereo is "active" when structured light is used. A three-dimensional view of a scene may be obtained by generating a depth map using disparity detection between stereo images captured from different perspectives.
The depth distribution of a stereo image pair is determined by matching points across the images. Once corresponding points within the stereo images have been identified, triangulation is performed to recover depth. Triangulation is the process of determining the location of each point in three-dimensional space by minimizing the back-projection error, which is the sum of the distances between each three-dimensional point projected onto the stereo images and the originally extracted matching points. Other similar error measures may be used for triangulation.
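For illustration only, the following sketch shows one common linear (DLT) triangulation of a single matched point pair using numpy. It minimizes an algebraic proxy of the back-projection error rather than the geometric error itself, and the projection matrices P1 and P2 are assumed to be known from camera calibration; none of the specific values are prescribed by this disclosure.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one matched point pair.

    P1, P2 : 3x4 camera projection matrices (assumed known from calibration).
    x1, x2 : matched 2D points (x, y) in the first and second stereo image.
    Returns a 3D point that approximately minimizes the algebraic
    back-projection error.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize to (x, y, z)
```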
FVV of a scene may be generated with one or more active IR stereo modules in a sparse wide baseline configuration. Sparse camera array configurations in active IR stereo modules can produce accurate results since more accurate geometries can be achieved by augmenting the scene with IR light patterns from the active IR stereo module. The IR light patterns can then be used to enhance image-based rendering methods by generating more accurate geometries, and these patterns do not interfere with RGB imaging.
In an embodiment, the use of IR light projected onto a scene allows highly accurate geometry to be extracted from the video of the scene during FVV processing. The use of projected IR light also allows a scene at or near the center of a sparse camera array to be recorded, such as four modules in a circular track configuration placed ninety degrees apart. Furthermore, the results obtained with a sparse camera array are more realistic than what is possible with conventional passive stereo.
In an embodiment, a depth map of a scene may be recorded with an active IR stereo module. As used herein, an "active IR stereo module" refers to a type of imaging device that generates a three-dimensional depth map of a scene using stereo observation. The term "depth map" is commonly used in three-dimensional computer graphics to describe an image containing information related to the distance from the camera viewpoint to the surfaces of objects in a scene. Stereo vision uses image features, which may include brightness, to estimate stereo disparity. The disparity map can be converted into a depth map using the intrinsic and extrinsic camera configuration. According to the current approach, one or more active IR stereo modules may be utilized to create a three-dimensional depth map of a scene.
A depth map may be generated using a combination of sparse and dense stereo techniques. Dense depth maps may be generated using a regularization-based representation such as a Markov random field. In image processing and computer vision, Markov random fields are undirected graphical models that are often used to model various low- to medium-level tasks. A sparse depth map may be generated using the feature descriptors. This approach allows different depth maps to be generated and combined with different confidences: sparse depth maps carry higher confidence, while dense depth maps carry lower confidence. For purposes of the methods disclosed herein, depth maps generated using sparse stereopsis may be preferred, since sparse data may be more reliable than dense data. Sparse depth maps are computed by comparing feature descriptors between stereo images, which tend to match either with very high confidence or not at all.
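For illustration only, the following sketch shows one possible confidence-weighted combination of a sparse and a dense depth map. The confidence values and the convention that a value of zero marks a missing estimate are illustrative assumptions, not requirements of this disclosure.

```python
import numpy as np

def merge_depth_maps(sparse_depth, dense_depth,
                     sparse_conf=0.95, dense_conf=0.30):
    """Confidence-weighted merge of a sparse and a dense depth map.

    sparse_depth : HxW array, 0 where no sparse estimate exists.
    dense_depth  : HxW array with an estimate at (almost) every pixel.
    The confidence values are illustrative, not taken from this disclosure.
    """
    has_sparse = sparse_depth > 0
    has_dense = dense_depth > 0
    merged = np.zeros_like(dense_depth, dtype=np.float64)

    # Where both estimates exist, blend them by relative confidence.
    both = has_sparse & has_dense
    w = sparse_conf / (sparse_conf + dense_conf)
    merged[both] = w * sparse_depth[both] + (1.0 - w) * dense_depth[both]

    # Otherwise fall back to whichever estimate is available.
    merged[has_sparse & ~has_dense] = sparse_depth[has_sparse & ~has_dense]
    merged[has_dense & ~has_sparse] = dense_depth[has_dense & ~has_sparse]
    return merged
```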
In an embodiment, the active IR stereo module may include a random Infrared (IR) laser dot pattern projector, one or more RGB cameras, and two or more stereo IR cameras, all synchronized (i.e., genlocked). The active IR stereo module may be used to project a random IR dot pattern onto a scene with the random IR laser dot pattern projector and to capture stereo images of the scene with the two or more genlocked IR cameras. The term "genlock" is commonly used to describe a technique for maintaining temporal coherence between two or more signals, i.e., synchronization between signals. Genlocking the cameras in the active IR stereo module ensures that capture occurs across the cameras at precisely the same time. This ensures that the mesh of a moving object will have the proper shape and texture at any given time during FVV navigation.
Dots can be detected within the stereoscopic IR images, and several feature descriptors can be calculated for the dots. The feature descriptors may provide a starting point for comparing stereo images from two or more genlocked cameras and may include points of interest in the stereo images. For example, a particular dot in one stereo image may be analyzed and compared to the corresponding dot in another genlocked stereo image.
A disparity map between two or more stereo images may be computed using conventional stereo techniques, and a depth map of a scene may be generated using the disparity map. As used herein, a "disparity map" refers to a distribution of pixel offsets across two or more stereoscopic images. The disparity map may be used to measure differences between stereoscopic images captured from two or more different corresponding viewpoints. Furthermore, a simple algorithm can be used to convert the disparity map into a depth map.
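For illustration only, the following sketch shows one simple conversion from a disparity map to a depth map for a rectified stereo pair, using the standard relation depth = focal length × baseline / disparity. The rectified-pair assumption and the use of zero to mark invalid pixels are illustrative choices rather than requirements of this disclosure.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) into a depth map (metres) for a
    rectified stereo pair: depth = focal_length * baseline / disparity.
    Pixels with zero or negative disparity are marked invalid (depth 0)."""
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth
```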
It should be noted that the current method is not limited to the use of a random IR dot pattern projector or IR cameras. Rather, any type of pattern projector that projects identifiable features, such as dots, triangles, or grids, may be used. Furthermore, any type of camera capable of detecting the presence of the features projected onto a scene may be used.
In an embodiment, once a depth map of a scene has been determined using an active IR stereo module, a point cloud may be generated for the scene using the depth map. A point cloud is one type of scene geometry that can provide a three-dimensional representation of a scene. Generally, a point cloud is a collection of vertices in a three-dimensional coordinate system that can be used to represent the external surfaces of objects in a scene. Once the point cloud has been generated, the surface normal may be calculated for each point in the point cloud.
The three-dimensional point cloud may be used to generate a geometric mesh of the point cloud. As used herein, a geometric mesh is a collection of vertices, edges, and faces that defines the shape of a three-dimensional object. The RGB image data from the active IR stereo module can be projected onto the mesh of the point cloud to generate a projected texture map. FVV may be generated from the projected texture map by blending the contributions from the RGB image data and the mesh of the point cloud, allowing the scene to be viewed from any number of different camera angles. It is also possible to generate the texture-mapped geometric mesh separately for each stereo module, in which case rendering involves blending the rendered views of the nearest-neighbor meshes.
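For illustration only, the following sketch shows one simple way to project RGB image data onto geometry: each 3D vertex is projected into a single RGB camera and a colour is sampled. The projection matrix is assumed to be known from calibration, and occlusion handling is omitted for brevity.

```python
import numpy as np

def project_colors_onto_points(points_3d, rgb_image, P_rgb):
    """Assign an RGB colour to each 3D point by projecting it into one
    RGB camera (a minimal stand-in for projective texture mapping).

    points_3d : Nx3 array of mesh vertices in world coordinates.
    rgb_image : HxWx3 image from the module's RGB camera.
    P_rgb     : 3x4 projection matrix of that RGB camera (assumed known).
    Points projecting outside the image keep colour (0, 0, 0).
    """
    h, w, _ = rgb_image.shape
    homog = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])
    proj = homog @ P_rgb.T                      # N x 3 homogeneous pixels
    uv = proj[:, :2] / proj[:, 2:3]             # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    colors = np.zeros((points_3d.shape[0], 3), dtype=rgb_image.dtype)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (proj[:, 2] > 0)
    colors[inside] = rgb_image[v[inside], u[inside]]
    return colors
```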
Embodiments provide a system of multiple active IR stereo modules connected by a synchronization signal. The system may include any number of active IR stereo modules, each including three or more genlocked cameras. In particular, each active IR stereo module may include two or more genlocked IR cameras and one or more genlocked RGB cameras. A system of multiple active IR stereo modules may be utilized to generate depth maps of a scene from different locations or perspectives.
A system of multiple active IR stereo modules may be genlocked with synchronization signals between the active IR stereo modules. The synchronization signal may be any signal that results in temporal coherence of the active IR stereo module. In this embodiment, the temporal coherence of the active IR stereo modules ensures that all active IR stereo modules capture images at the same instant, so that the stereo images from the active IR stereo modules are directly related to each other. Once all active IR stereo modules have acknowledged receipt of the synchronization signal, each active IR stereo module may generate a depth map according to the method described above with respect to the single stereo module system.
In an embodiment, the above-described system of multiple active IR stereo modules uses an algorithm based on random structured light, in the form of a random IR dot pattern that is projected onto a scene and recorded with two or more genlocked stereo IR cameras, to generate a depth map. When additional active IR stereo modules are used to record the same scene, the multiple random IR dot patterns are viewed constructively by the IR cameras in each active IR stereo module. This is possible because the multiple active IR stereo modules do not experience interference as more active IR stereo modules are added to the recording array.
The problem of interference between active IR stereo modules is greatly reduced by the nature of the random IR dot pattern. Each active IR stereo module does not attempt to match the random IR dot pattern detected by its cameras to a particular original pattern that has been projected onto the scene. Instead, each module simply observes the current dot pattern as a random dot texture on the scene. Thus, although the dot pattern currently projected onto the scene may be a combination of dots from multiple random IR dot pattern projectors, the actual arrangement of the dots is irrelevant, because the dot pattern is not compared to any reference dot pattern. This allows multiple active IR stereo modules to image the same scene without interference. In fact, as more active IR stereo modules are added to the FVV recording array, the number of features visible in the IR spectrum increases, up to a point, resulting in increasingly accurate depth maps.
Once a depth map has been created for each active IR stereo module, each depth map may be used to generate a point cloud of the scene. Further, the point cloud may be interpolated to include areas of the scene that are not captured by the active IR stereo module. The point clouds generated by multiple active IR stereo modules may be combined to create a point cloud of a scene. Since each active IR stereo module may record a scene from a different location, the combined point cloud may represent image data acquired from multiple different perspectives or viewpoints. Further, combining the point clouds from the active IR stereo modules may create a single world coordinate system for the scene based on calibration of the cameras. A grid of point clouds may then be created and used to generate FVV for the scene, as described above.
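For illustration only, the following sketch shows one possible way to bring per-module point clouds into a single world coordinate system and concatenate them. The per-module rigid transforms are assumed to come from extrinsic calibration; their values are not specified by this disclosure.

```python
import numpy as np

def combine_point_clouds(clouds, module_to_world):
    """Transform each module's point cloud into a shared world coordinate
    system and concatenate them into a single cloud.

    clouds          : list of Nx3 arrays, one per active IR stereo module.
    module_to_world : list of 4x4 rigid transforms from each module's
                      camera frame to world coordinates (assumed calibrated).
    """
    merged = []
    for pts, T in zip(clouds, module_to_world):
        homog = np.hstack([pts, np.ones((pts.shape[0], 1))])
        merged.append((homog @ T.T)[:, :3])
    return np.vstack(merged)
```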
As a preliminary matter, some of the figures describe concepts in the context of one or more structural components, which are respectively referred to as functions, modules, features, elements, etc. The various components shown in the figures may be implemented in any manner, such as through software, hardware (e.g., discrete logic components, etc.), firmware, etc., or any combination of these implementations. In one embodiment, the various components may reflect the use of the corresponding components in an actual implementation. In other embodiments, any single component illustrated in the figures may be implemented by several actual components. The depiction of any two or more separate components in the figures may reflect different functions performed by a single actual component. FIG. 1, discussed below, provides details regarding one system that may be used to implement the functionality shown in the figures.
Other figures describe the concepts in flow chart form. In this form, certain operations are described as constituting distinct blocks performed in a certain order. Such embodiments are exemplary and non-limiting. Certain blocks described herein can be grouped together and performed in a single operation, certain blocks can be separated into component blocks, and certain blocks can be performed in an order different than illustrated herein, including performing blocks in a parallel manner. The blocks shown in the flow diagrams may be implemented by software, hardware, firmware, manual processing, etc., or any combination of these implementations. As used herein, hardware may include a computer system, discrete logic components, such as an Application Specific Integrated Circuit (ASIC), and the like, as well as any combination thereof.
With respect to terminology, the phrase "configured to" includes any manner in which any type of functionality is configured to perform an identified operation. The functions may be configured to perform operations using, for example, software, hardware, firmware, etc., or any combination thereof.
The term "logic" includes any functionality that performs a task. For example, each operation illustrated in the flowcharts corresponds to logic to perform the operation. The operations may be performed using, for example, software, hardware, firmware, etc., or any combination thereof.
As used herein, the terms "component," "system," "client" and the like are intended to refer to a computer-related entity, either hardware, software (e.g., in execution), and/or firmware, or a combination thereof. For example, a component may be a process running on a processor, an object, an executable, a program, a function, a library, a subroutine, and/or a computer or a combination of software and hardware.
By way of illustration, both an application running on a server and the server can be a component. One or more components may reside within a process and a component may be localized on one computer and/or distributed between two or more computers. The term "processor" is generally understood to refer to a hardware component, such as a processing unit of a computer system.
Furthermore, the claimed subject matter may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof to control a computer to implement the disclosed subject matter. The term "article of manufacture" as used herein is intended to encompass a computer program accessible from any non-transitory computer-readable device or media.
Non-transitory computer-readable storage media may include, but are not limited to, magnetic storage devices (e.g., hard disk, floppy disk, and magnetic strips, etc.), optical disks (e.g., Compact Disk (CD) and Digital Versatile Disk (DVD), etc.), smart cards, and flash memory devices (e.g., card, stick, and key drive, etc.). In contrast, computer-readable media typically (i.e., not necessarily storage media) may additionally include communication media such as transmission media for wireless signals and the like.
Fig. 1 is a block diagram of a stereo module system 100 that generates FVV using an active IR stereo module. The stereo module system 100 may include a processor 102 adapted to execute stored instructions and a storage device 104 storing instructions executable by the processor. The processor 102 may be a single-core processor, a multi-core processor, a computing cluster, or any number of other configurations. The storage device 104 may include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, or any other suitable memory system. The instructions implement a method comprising: calculating a depth map of the scene using an active IR stereo module; generating a point cloud of the scene in three-dimensional space using the depth map; generating a mesh of the point cloud; generating a projected texture map of the scene from the mesh of the point cloud; and generating an FVV using the projected texture map. The processor 102 is connected to one or more input and output devices by a bus 106.
The stereoscopic modular system 100 may also include a storage 108 adapted to store an active stereoscopic algorithm 110, a depth map 112, a point cloud 114, a projected texture map 116, an FVV processing algorithm 118, and an FVV 120 generated by the stereoscopic modular system 100. The storage 108 may include a hard disk drive, an optical disk drive, a thumb drive, an array of drives, or any combination thereof. The network interface controller 122 may be adapted to connect the stereoscopic modular system 100 to a network 124 via the bus 106. Electronic text and imaging input files 126 may be downloaded over the network 124 and stored in the computer's storage system 108. Further, the stereoscopic modular system 100 may transmit a depth map, point cloud, or FVV over the network 124.
The stereoscopic module system 100 may be linked by a bus 106 to a display interface 128 adapted to connect the system 100 to a display device 130, where the display device 130 may include a computer monitor, a camera, a television, a projector, a virtual reality display, a mobile device, or the like. The display device 130 may also be a three-dimensional stereoscopic display device. Human machine interface 132 within stereoscopic module system 100 may connect the system to a keyboard 134 and a pointing device 136, where pointing device 136 may include a mouse, trackball, trackpad, joystick, pointing stick, stylus or touch screen, or the like. It should also be noted that the stereoscopic modular system 100 may include any number of other components, including a print interface suitable for connecting the stereoscopic modular system 100 to a printing device, and the like.
The stereoscopic modular system 100 may also be linked by a bus 106 to a random dot pattern projector interface 138 adapted to connect the stereoscopic modular system 100 to a random dot pattern projector 140. Further, the camera interface 142 may be adapted to connect the stereoscopic modular system 100 to three or more genlocked cameras 144, wherein the three or more genlocked cameras may include one or more genlocked RGB cameras and two or more genlocked IR cameras. A random dot pattern projector 140 and three or more genlock cameras 144 may be included within an active IR stereo module 146. In an embodiment, the stereo module system 100 may be connected to multiple active IR stereo modules 146 at once. In another embodiment, each active IR stereo module 146 may be connected to a separate stereo module system 100. In other words, any number of stereo module systems 100 may be connected to any number of active IR stereo modules 146. In an embodiment, each active IR stereo module 146 may include local memory on the module, such that each active IR stereo module 146 may locally store an independent view of the scene. Furthermore, in another embodiment, the entire system 100 may be included within the active IR stereo module 146. Any number of additional active IR stereo modules may also be connected to the active IR stereo module 146 via the network 124.
Fig. 2 is a schematic diagram 200 of an active IR stereo module 202 that may be used to generate a depth map of a scene. As noted, the active IR stereo module 202 may include two IR cameras 204 and 206, an RGB camera 208, and a random dot pattern projector 210. The IR cameras 204 and 206 may be genlocked, or synchronized. Genlocking the IR cameras 204 and 206 ensures that the cameras are temporally coherent, so that the captured stereo images are directly related to each other. Further, any number of IR cameras may be added to the active IR stereo module 202 in addition to the two IR cameras 204 and 206. In addition, the active IR stereo module 202 is not limited to the use of IR cameras, as many other types of cameras may be used in the active IR stereo module 202.
The RGB camera 208 may be used to capture a color image of a scene by acquiring three different color signals, e.g., red, green, and blue. In addition to one RGB camera 208, any number of additional RGB cameras may be added to the active IR stereo module 202. The output of the RGB camera 208 may provide a useful output for the creation of depth maps for FVV applications.
A random dot pattern projector 210 may be used to project a random pattern 212 of IR dots onto a scene 214. In addition, any type of dot projector may be used instead of the random dot pattern projector 210.
The two genlocked IR cameras 204 and 206 may be used to capture images of the scene, including the random pattern 212 of IR dots. The images from the two IR cameras 204 and 206 may be analyzed according to the method of fig. 3, described below, to generate a depth map of the scene.
Fig. 3 is a process flow diagram illustrating a method 300 of generating a depth map using an active IR stereo module. At block 302, a random IR dot pattern is projected onto a scene. The random IR dot pattern may be an IR laser dot pattern generated by a projector in the active IR stereo module. The random IR dot pattern may also be any other type of dot pattern projected by any module near the scene.
At block 304, stereo images may be captured from two or more stereo cameras within the active IR stereo module. The stereo cameras may be IR cameras, as discussed above, and may be genlocked to ensure that they are temporally coherent. The stereo images captured at block 304 may include the random IR dot pattern projected at block 302.
At block 306, dots may be detected within the stereoscopic image. The detection of dots may be performed within the stereoscopic modular system 100. In particular, the stereo images may be processed by a dot detector within the stereo module system 100 to identify individual dots within the stereo images. The dot detector may achieve sub-pixel accuracy by processing the dot center.
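For illustration only, the following sketch shows one possible dot detector: the IR frame is thresholded and connected components are extracted, with each component's centroid serving as a sub-pixel dot center. The threshold and blob-size limits are illustrative placeholders, and an 8-bit grayscale IR frame is assumed.

```python
import cv2
import numpy as np

def detect_dot_centers(ir_image, threshold=128):
    """Detect projected IR dots in an 8-bit grayscale frame and return
    their centroids with sub-pixel accuracy, using connected components
    on a thresholded image. The threshold is an illustrative placeholder."""
    _, binary = cv2.threshold(ir_image, threshold, 255, cv2.THRESH_BINARY)
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    # Label 0 is the background; keep only reasonably sized blobs.
    dots = [tuple(centroids[i]) for i in range(1, num)
            if 2 <= stats[i, cv2.CC_STAT_AREA] <= 200]
    return np.array(dots)  # N x 2 array of (x, y) dot centers
```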
At block 308, feature descriptors may be computed for the dots detected within the stereo image. The feature descriptors can be computed using several different methods, including several different partitioning methods, as described below with respect to fig. 4 and 5. The feature descriptors can be used to match similar features between stereo images.
At block 310, a disparity map between stereo images may be computed. The disparity map may be computed using conventional stereo techniques, such as the active stereo algorithm discussed with respect to fig. 1. Feature descriptors may also be used to create disparity maps that may map the similarity between stereoscopic images according to the identification of corresponding dots within the stereoscopic images.
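For illustration only, the following sketch shows one possible descriptor-based matching step for sparse disparities: each dot in the left image is compared against candidate dots on roughly the same scanline in the right image, assuming a rectified pair. The scanline tolerance and cost threshold are illustrative assumptions.

```python
import numpy as np

def match_dots(desc_left, desc_right, dots_left, dots_right,
               max_y_diff=2.0, max_cost=4.0):
    """Match dots between rectified left/right images by comparing their
    feature descriptors; returns (left dot, disparity) pairs, where
    disparity = x_left - x_right. Inputs are numpy arrays: descriptors
    are NxD, dot centers are Nx2."""
    matches = []
    for i, d in enumerate(desc_left):
        # Candidates roughly on the same scanline (rectified pair assumed).
        cand = np.where(np.abs(dots_right[:, 1] - dots_left[i, 1]) < max_y_diff)[0]
        if cand.size == 0:
            continue
        costs = np.linalg.norm(desc_right[cand] - d, axis=1)
        j = cand[np.argmin(costs)]
        if costs.min() < max_cost:
            matches.append((dots_left[i], dots_left[i, 0] - dots_right[j, 0]))
    return matches
```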
At block 312, a depth map may be generated using the disparity map from block 310. The depth map may also be computed using conventional stereo techniques, such as the active stereo algorithm discussed with respect to fig. 1. The depth map may represent a three-dimensional view of the scene. It should be noted that the flow chart is not intended to indicate that the steps of the method should be performed in any particular order.
FIG. 4 is a schematic diagram of one type of partitioning method 400 that may be used to identify feature descriptors within a stereoscopic image. The partitioning method 400 utilizes a two-dimensional grid applied to the stereoscopic images. Dots within a stereo image may be assigned to specific coordinate locations in a given tile (bin). This may allow the feature descriptors of individual dots to be identified based on the coordinates of neighboring dots.
Fig. 5 is a schematic diagram of another type of partitioning method 500 that may be used to identify feature descriptors within a stereoscopic image. The partitioning method 500 uses concentric circles and a grid, i.e., a polar coordinate system that forms another two-dimensional arrangement of blocks (bins). A center point of the grid is selected, and each block may be located by its angle relative to a selected axis and its distance from the center point. Within a block, a dot may be characterized by its spatial position, intensity, or radial position. For spatial positioning, a block may be characterized by a hard count of the dots inside it when there is no ambiguity, or by a soft count when dots overlap between blocks. For intensity, the total (aggregate) brightness of all dots within a particular block may be estimated, or an intensity histogram may be calculated. Further, within a particular block, a radial descriptor may be determined for each dot based on the distance between that dot and an adjacent dot and a reference angle.
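For illustration only, the following sketch implements a hard-count polar-bin descriptor of the kind described above for a single dot. The numbers of rings and sectors and the neighbourhood radius are illustrative assumptions, not values specified by this disclosure.

```python
import numpy as np

def polar_dot_descriptor(center, dots, n_radial=4, n_angular=8, radius=40.0):
    """Build a simple polar-bin descriptor for one dot by counting the
    neighbouring dots that fall in each (ring, sector) block around it.

    center : (x, y) of the dot being described.
    dots   : N x 2 array of all detected dot centers in the same image.
    """
    offsets = dots - np.asarray(center)
    dist = np.hypot(offsets[:, 0], offsets[:, 1])
    ang = np.arctan2(offsets[:, 1], offsets[:, 0])        # -pi .. pi
    keep = (dist > 0) & (dist < radius)                   # exclude the dot itself
    r_bin = np.clip((dist[keep] / radius * n_radial).astype(int), 0, n_radial - 1)
    a_bin = np.clip(((ang[keep] + np.pi) / (2 * np.pi) * n_angular).astype(int),
                    0, n_angular - 1)
    hist = np.zeros((n_radial, n_angular))
    np.add.at(hist, (r_bin, a_bin), 1.0)                  # hard count per block
    return hist.ravel()                                   # descriptor vector
```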
Although fig. 4 and 5 illustrate two types of partitioning methods that may be used to identify feature descriptors in a stereoscopic image, it should be noted that any other type of partitioning method may be used. In addition, other methods for identifying feature descriptors that are independent of partitioning may also be used.
Fig. 6 is a process flow diagram illustrating a method 600 of generating FVV using an active IR stereo module. As discussed above with respect to fig. 2, a single active IR stereo module may be used to generate a texture-mapped geometric model suitable for FVV rendering when a sparse camera array records the scene. At block 602, a depth map may be computed for the scene using the active IR stereo module, as discussed above with respect to fig. 3. Furthermore, as discussed above, a depth map of the scene may be created using a combination of sparse and dense stereo observations.
At block 604, a point cloud may be generated for the scene using the depth map. This is achieved by transforming the depth map into a point cloud in three-dimensional space and calculating a surface normal for each point in the point cloud. At block 606, a mesh of the point cloud may be generated to define the shape of the three-dimensional object in the scene.
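For illustration only, the following sketch shows one possible back-projection of a depth map into a point cloud with per-point surface normals. The pinhole intrinsics are assumed to be known, and the gradient-based normal estimate is an illustrative choice rather than a required method.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud and estimate a
    per-point surface normal from local depth gradients.
    fx, fy, cx, cy are the (assumed known) pinhole camera intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.dstack([x, y, z])                    # H x W x 3

    # Normals from the cross product of neighbouring-point differences.
    dx = np.gradient(points, axis=1)
    dy = np.gradient(points, axis=0)
    normals = np.cross(dx, dy)
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    normals = normals / np.clip(norm, 1e-9, None)

    valid = z > 0                                    # drop pixels with no depth
    return points[valid], normals[valid]
```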
At block 608, a projected texture map may be generated by projecting the RGB image data from the active IR stereo module onto a mesh of the point cloud. At block 610, FVV may be generated from the projected texture map by blending the contributions from the meshes of the RGB image data and the point cloud to allow viewing of the scene from different camera angles. In an embodiment, the FVV may be displayed on a display device such as a three-dimensional stereoscopic display. Furthermore, spatiotemporal navigation may be enabled by the user during FVV playback. Spatio-temporal navigation may allow a user to interactively control a video viewing window in both space and time.
Fig. 7 is a schematic diagram of a system 700 of active IR stereo modules 702 and 704 connected by a synchronization signal 706 that may be used to generate a depth map of a scene 708. It should be noted that the system may employ any number of active IR stereo modules in addition to the two active IR stereo modules 702 and 704. Further, each of the active IR stereo modules 702 and 704 may include two or more stereo cameras 710,712,714, and 716, one or more RGB cameras 718 and 720, and random dot pattern projectors 722 and 724, as discussed above with respect to fig. 2.
Each of the random dot pattern projectors 722 and 724 for the active IR stereo modules 702 and 704 may be used to project a random IR dot pattern 726 onto the scene 708. It should be noted, however, that not every active IR stereo module 702 and 704 necessarily includes a random dot pattern projector 722 and 724. Any number of random IR dot patterns may be projected onto the scene from any number of active IR stereo modules or from any number of separate projection devices independent of the active IR stereo modules.
A synchronization signal 706 between active IR stereo modules 702 and 704 may be used to genlock active IR stereo modules 702 and 704 so that they operate at the same time. According to the method of fig. 3 mentioned above, a depth map may be generated for each of the active IR stereo modules 702 and 704.
Fig. 8 is a process flow diagram illustrating a method 800 of generating a depth map for each of two or more genlocked active IR stereo modules. At block 802, a random IR dot pattern is projected onto a scene. The random IR dot pattern may be an IR laser dot pattern generated by a projector in an active IR stereo module. The random IR dot pattern may also be any other type of dot pattern projected by any module near the scene. In addition, any number of active IR stereo modules in the system may project random IR dot patterns simultaneously. As discussed above, due to the random nature of the dot pattern, the overlap of multiple dot patterns on the scene does not cause interference problems.
At block 804, a synchronization signal may be generated. The synchronization signal may be used to genlock two or more active IR stereo modules. This ensures temporal coherence of the active IR stereo modules. Further, synchronization signals may be generated by one central module and transmitted to each active IR stereo module, may be generated by one active IR stereo module and transmitted to all other active IR stereo modules, may be generated by each active IR stereo module and transmitted to each other active IR stereo module, and so on. It should also be noted that software or hardware genlock may be used to maintain temporal coherence between the active IR stereo modules. At block 806, the genlock of the active IR stereo modules may be confirmed by establishing receipt of the synchronization signal by each active IR stereo module. At block 808, a depth map of the scene may be generated by each active IR stereo module according to the method described with respect to fig. 3. Although each active IR stereo module can generate an independent depth map, the genlock of the active IR stereo modules ensures that all cameras record the scene at the same time. This allows an accurate FVV to be created using depth maps acquired from a plurality of different perspectives.
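For illustration only, the following sketch shows one possible software realization of the synchronization step, in which a central controller sends a synchronization datagram to each active IR stereo module and waits for acknowledgments. The addresses, port, message format, and timeout are illustrative assumptions; a hardware genlock would be implemented differently.

```python
import socket

def broadcast_sync_and_collect_acks(module_addresses, port=9999, timeout=1.0):
    """Send a SYNC datagram to every active IR stereo module and wait for
    an ACK from each before triggering capture (a minimal software-genlock
    sketch; message format and addresses are illustrative assumptions)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    for addr in module_addresses:
        sock.sendto(b"SYNC", (addr, port))

    pending = set(module_addresses)
    try:
        while pending:
            data, (addr, _) = sock.recvfrom(64)
            if data == b"ACK" and addr in pending:
                pending.discard(addr)
    except socket.timeout:
        pass  # some modules never acknowledged within the timeout
    sock.close()
    return len(pending) == 0  # True only if every module acknowledged
```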
Fig. 9 is a process flow diagram illustrating a method 900 of generating FVV using two or more genlocked active IR stereo modules. At block 902, a depth map may be computed for each of the two or more genlocked active IR stereo modules, as discussed above with respect to fig. 8. The active IR stereo modules can record the scene from different locations and can be genlocked by network communication or any other type of synchronization signal to ensure that all cameras in each module are synchronized in time.
At block 904, a point cloud may be generated for each of two or more genlocked active IR stereo modules, as discussed with respect to fig. 6. At block 906, the independently generated point clouds may be combined into a single point cloud, or world coordinate system, based on the calibration of the camera in post-processing.
At block 908, after the normals have been calculated for the points, a geometric mesh of the combined point cloud may be generated. At block 910, an FVV may be generated by creating a projected texture map using the RGB image data and the mesh of the combined point cloud. The RGB image data can be texture-mapped onto the mesh of the combined point cloud in a view-dependent texture mapping manner, so that different viewing angles result from a proportional blending of two RGB images. In an embodiment, the FVV may be displayed on a display device and the user may be enabled to navigate spatio-temporally.
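For illustration only, the following sketch shows one minimal proportional blend of two source views for a virtual viewpoint, weighting each camera by how closely its viewing direction matches the virtual one. It assumes both images have already been rendered or warped to the virtual view, which is a simplifying assumption.

```python
import numpy as np

def blend_views(img_a, img_b, cam_a_dir, cam_b_dir, virtual_dir):
    """Proportionally blend two rendered RGB views for a virtual viewpoint,
    weighting each source camera by the alignment of its viewing direction
    with the virtual one (a minimal form of view-dependent texturing).
    Directions are unit vectors; images are assumed pre-aligned to the
    virtual view by an upstream rendering step."""
    wa = max(float(np.dot(cam_a_dir, virtual_dir)), 0.0)
    wb = max(float(np.dot(cam_b_dir, virtual_dir)), 0.0)
    total = wa + wb if (wa + wb) > 0 else 1.0
    wa, wb = wa / total, wb / total
    blended = wa * img_a.astype(np.float64) + wb * img_b.astype(np.float64)
    return blended.astype(img_a.dtype)
```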
Fig. 10 is a block diagram illustrating a tangible computer-readable medium 1000 storing code adapted to generate FVV using an active IR stereo module. The tangible computer-readable medium 1000 may be accessed by the processor 1002 through a computer bus 1004. Further, the tangible computer-readable medium 1000 may include code configured to instruct the processor 1002 to perform the steps of the present method.
The various software components discussed herein may be stored on a tangible computer-readable medium 1000, as shown in FIG. 10. For example, the depth map calculation module 1006 may be configured to calculate a depth map of the scene using an active IR stereo module. The point cloud generation module 1008 may be configured to generate a point cloud of a scene in three-dimensional space using a depth map. The point cloud mesh generation module 1010 may be configured to generate a mesh of the point cloud. The projection texture map generation module 1012 may be configured to generate a projection texture map of the scene, and the video generation module 1014 may be configured to generate the FVV by combining the projection texture map with the real image.
It should be noted that the block diagram of fig. 10 is not intended to indicate that the tangible computer-readable medium 1000 must include all of the software components 1006, 1008, 1010, 1012, and 1014. Further, the tangible computer-readable medium 1000 may include additional software components not shown in fig. 10. For example, the tangible computer-readable medium 1000 may also include a video display module configured to display the FVV on a display device and a video playback module configured to enable spatiotemporal navigation by a user during FVV playback.
In embodiments, the present systems and methods can be used to create a three-dimensional representation of scene geometry using both sparse and dense data. Points in a particular point cloud created from sparse data may be close to a one hundred percent confidence level, while points in a point cloud created from dense data may have a very low confidence level. By blending sparse and dense data together, the resulting three-dimensional representation of the scene may exhibit a balance between accuracy and richness of the three-dimensional visualization. Thus, in this way, different types of FVV may be created for each particular application depending on the quality of the FVV desired.
The present systems and methods may be used in a variety of applications. In an embodiment, FVV generated with active stereo may be used for teleconferencing applications. For example, using multiple active IR stereo modules to generate FVV for a conference call may allow people in dispersed locations to effectively feel as if they were all in the same room.
In another embodiment, the present system and method may be used for gaming applications. For example, generating FVV using multiple active IR stereo modules may allow for accurate three-dimensional rendering of multiple people playing a game together from different locations. Dynamic real-time data captured by an active IR stereo module can be used to create an augmented reality experience in which a person playing a game can virtually see three-dimensional images of other persons playing the game from different locations. The user of the gaming application may also control the viewing window during FVV playback to navigate through space and time. FVV may also be used to coach sports, such as diving, where actions may be compared by superimposing actions at different times or by different players.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
According to an embodiment of the present invention, the following scheme is provided:
1. a method of generating video with an active Infrared (IR) stereo module, comprising:
computing a depth map of a scene with the active IR stereo module, wherein computing the depth map comprises:
projecting an IR dot pattern onto the scene;
capturing a stereoscopic image from each of two or more synchronized IR cameras;
detecting a plurality of dots of the IR dot pattern within the stereoscopic image;
calculating a plurality of feature descriptors corresponding to the plurality of dots of the IR dot pattern in the stereoscopic image;
calculating a disparity map between the stereo images based on a comparison of corresponding feature descriptors in each of the stereo images; and
generating a depth map of the scene using the disparity map;
generating a point cloud of the scene in three-dimensional space using the depth map;
generating a mesh of the point cloud;
generating a projected texture map of the scene from the mesh of the point cloud; and
generating the video of the scene using the projected texture map.
2. The method according to supplementary note 1, wherein the video is a Free Viewpoint Video (FVV).
3. The method according to supplementary note 1, comprising:
displaying the video on a display device; and
spatio-temporal navigation is enabled by a user during video playback.
4. The method of supplementary note 1, comprising capturing stereoscopic images from each of two or more synchronized IR cameras with one or more IR projectors, one or more synchronized RGB cameras, or any combination thereof.
5. The method according to supplementary note 1, comprising:
computing a depth map for each of two or more synchronized active IR stereo modules;
generating a point cloud of the scene in three-dimensional space for each of the two or more synchronized active IR stereo modules;
combining the point clouds generated by the two or more synchronized active IR stereo modules;
creating a mesh of the combined point clouds; and
the video is generated by creating a projected texture map on the mesh.
6. The method of supplementary note 5, wherein computing the depth map for each of two or more synchronized active IR stereo modules comprises:
projecting the IR dot pattern onto a scene;
generating a synchronization signal for genlocking the two or more synchronized active IR stereo modules; and
confirming that each of the two or more synchronized active IR stereo modules has received the synchronization signal, and, if confirmation is received, generating the depth map of the scene for each of the two or more synchronized active IR stereo modules.
7. The method of supplementary note 1, wherein generating the point cloud of the scene in three-dimensional space using the depth map comprises converting the depth map into a three-dimensional point cloud.
8. The method of supplementary note 1, wherein generating the mesh of the point cloud comprises transforming the point cloud into a geometric mesh that is a three-dimensional representation of an object in the scene.
9. The method of supplementary note 1, wherein generating the projected texture map of the scene comprises generating the projected texture map by projecting RGB image data from the active IR stereo module onto the mesh of the point cloud.
10. The method of supplementary note 1, wherein generating the video by creating the projected texture map comprises combining the projected texture map with a real image using an image-based rendering method to create a composite viewpoint between the real images.
11. A system for generating video with an active Infrared (IR) stereo module, comprising:
a depth map computation module configured to: calculate a depth map of a scene with the active IR stereo module, wherein the active IR stereo module comprises three or more synchronized cameras and an IR dot pattern projector to project an IR dot pattern onto the scene; and calculate a disparity map between stereoscopic images of the scene based on a comparison of corresponding feature descriptors of the IR dot pattern in each of the stereoscopic images;
a point cloud generation module configured to generate a point cloud of the scene in three-dimensional space using the depth map;
a point cloud mesh generation module configured to generate a mesh of the point cloud;
a projection texture map generation module configured to generate a projection texture map of the scene from the mesh of the point cloud; and
a video generation module configured to generate the video of the scene using the projected texture map.
12. The system according to supplementary note 11, comprising:
a video display module configured to display the video on a display device; and
a video playback module configured to enable spatiotemporal navigation by a user during video playback.
13. The system of supplementary note 11, wherein the system comprises a conferencing system for generating real-time video in a room with one or more active IR stereo modules.
14. The system of supplementary note 11, wherein the system comprises a gaming system for generating real-time video with one or more active IR stereo modules connected to a gaming device.
15. The system of supplementary note 14, wherein the three or more synchronized cameras include two or more synchronized IR cameras and one or more synchronized RGB cameras.
16. One or more non-transitory computer-readable storage media storing computer-readable instructions that, when executed by one or more processing devices, provide a stereo module system for generating video with an active Infrared (IR) stereo module, the computer-readable instructions comprising code configured to:
calculating a depth map of a scene using the active IR stereo module, wherein calculating the depth map comprises:
projecting an IR dot pattern onto the scene;
capturing a stereoscopic image from each of two or more synchronized IR cameras;
detecting a plurality of dots within the stereoscopic image;
computing a plurality of feature descriptors corresponding to the plurality of dots in the stereoscopic image;
calculating a disparity map between the stereo images; and
generating a depth map of the scene using the disparity map;
generating a point cloud of the scene in three-dimensional space using the depth map;
generating a mesh of the point cloud;
generating a projected texture map of the scene from the mesh of the point cloud; and
generating the video by combining the projected texture map with a real image.
17. The non-transitory computer-readable storage medium of supplementary note 16, wherein the computer-readable instructions include code further configured to:
displaying the video on a display device; and
spatio-temporal navigation is enabled by a user during video playback.
18. The non-transitory computer readable storage medium of supplementary note 16, wherein the active IR stereo module comprises two or more synchronized IR cameras, one or more synchronized RGB cameras, or any combination thereof.
19. The non-transitory computer-readable storage medium of supplementary note 16, wherein the computer-readable instructions comprise code further configured to:
computing a depth map for each of two or more synchronized active IR stereo modules;
generating a point cloud of the scene in three-dimensional space for each of the two or more synchronized active IR stereo modules;
combining the point clouds generated by the two or more synchronized active IR stereo modules;
creating a mesh of the combined point clouds; and
the video is generated by creating a projected texture map of the scene.
20. The non-transitory computer-readable storage medium of supplementary note 19, wherein the code configured to calculate the depth map for each of the two or more synchronized active IR stereo modules further comprises code configured to:
projecting the IR dot pattern onto a scene;
generating a synchronization signal for genlocking the two or more synchronized active IR stereo modules; and
confirming that each of the two or more synchronized active IR stereo modules has received the synchronization signal, and, if confirmation is received, generating the depth map of the scene for each of the two or more synchronized active IR stereo modules.

Claims (10)

1. A method of generating video with an active Infrared (IR) stereo module (146,702,704), comprising:
computing (602) a depth map (112) of a scene (708) with the active IR stereo module (146,702,704), wherein computing the depth map (112) comprises:
projecting (302) an IR dot pattern (726) onto the scene (708);
capturing (304) stereoscopic images from each of two or more synchronized IR cameras (144,710,712,714,716,718,720);
detecting (306) a plurality of dots of the IR dot pattern (726) within the stereoscopic image;
calculating (308) a plurality of feature descriptors for the plurality of dots corresponding to the IR dot pattern (726) in the stereo image;
computing (310) a disparity map between the stereo images based on a comparison of corresponding feature descriptors in each of the stereo images; and
generating (312) a depth map (112) of the scene (708) using the disparity map;
generating (604) a point cloud (114) of the scene (708) in three-dimensional space using the depth map (112);
generating (606) a mesh of the point cloud (114);
generating (608) a projected texture map (116) of the scene (708) from the mesh of the point cloud (114); and
generating (610) the video of the scene (708) using the projected texture map (116).
2. The method of claim 1, wherein the video is a Free Viewpoint Video (FVV) (120).
3. The method of claim 1, comprising:
displaying the video on a display device (130); and
spatio-temporal navigation is enabled by a user during video playback.
4. The method of claim 1, comprising capturing stereoscopic images from each of two or more synchronized IR cameras (144,710,712,714,716,718,720) with one or more IR projectors (140,722,724), one or more synchronized RGB cameras, or any combination thereof.
5. The method of claim 1, comprising:
computing (902) a depth map (112) for each of two or more synchronized active IR stereo modules (146,702,704);
generating (904) a point cloud (114) of the scene (708) in three-dimensional space for each of the two or more synchronized active IR stereo modules (146,702,704);
combining (906) the point clouds (114) generated by the two or more synchronized active IR stereo modules (146,702,704);
creating (908) a mesh of the combined point clouds; and
the video is generated (910) by creating a projected texture map (116) on the mesh.
6. The method of claim 5, wherein computing the depth map for each of two or more synchronized active IR stereo modules (146,702,704) comprises:
projecting an IR dot pattern (726) onto a scene (708);
generating a synchronization signal for genlocking the two or more synchronized active IR stereo modules (146,702,704); and
confirming that each of the two or more synchronized active IR stereo modules (146,702,704) has received the synchronization signal, and, if confirmation is received, generating the depth map (112) of the scene (708) for each of the two or more synchronized active IR stereo modules (146,702,704).
7. The method of claim 1, wherein generating the point cloud (114) of the scene (708) in three-dimensional space using the depth map (112) comprises transforming the depth map (112) into a three-dimensional point cloud (114).
8. The method of claim 1, wherein generating the mesh of the point cloud (114) comprises transforming the point cloud (114) into a geometric mesh that is a three-dimensional representation of objects in the scene (708).
9. A system for generating video with an active Infrared (IR) stereo module (146,702,704), comprising:
a depth map computation module (1006), the depth map computation module (1006) configured to: compute a depth map (112) of a scene (708) with the active IR stereo module (146,702,704), wherein the active IR stereo module (146,702,704) includes three or more synchronized cameras (144,710,712,714,716,718,720) and an IR dot pattern projector (140,722,724) to project an IR dot pattern onto the scene (708); and calculate a disparity map between the stereo images of the scene (708) based on a comparison of the corresponding feature descriptors of the IR dot pattern in each of the stereo images;
a point cloud generation module (1008) configured to generate a point cloud (114) of the scene (708) in three-dimensional space using the depth map (112);
a point cloud mesh generation module (1010) configured to generate a mesh of the point cloud (114);
a projection texture map generation module (1012) configured to generate a projection texture map (116) of the scene (708) from the mesh of the point cloud (114); and
a video generation module (1014) configured to generate the video of the scene (708) using the projected texture map (116).
10. The system of claim 9, comprising:
a video display module configured to display the video on a display device; and
a video playback module configured to enable spatiotemporal navigation by a user during video playback.
HK13109440.2A 2011-10-13 2013-08-13 Generating free viewpoint video using stereo imaging HK1182248B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US13/273,213 US20130095920A1 (en) 2011-10-13 2011-10-13 Generating free viewpoint video using stereo imaging
US13/273,213 2011-10-13

Publications (2)

Publication Number Publication Date
HK1182248A1 (en) 2013-11-22
HK1182248B (en) 2016-08-12
