WO2013120041A1 - Method and apparatus for 3d spatial localization and tracking of objects using active optical illumination and sensing - Google Patents
- Publication number
- WO2013120041A1 (PCT/US2013/025468)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- scene
- data acquisition
- acquisition system
- detector
- digital processor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S17/00—Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
- G01S17/87—Combinations of systems using electromagnetic waves other than radio waves
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/20—Image signal generators
- H04N13/204—Image signal generators using stereoscopic image cameras
- H04N13/254—Image signal generators using stereoscopic image cameras in combination with electromagnetic radiation sources for illuminating objects
Definitions
- Subject matter disclosed herein relates generally to data acquisition techniques and, more particularly, to techniques and structures for determining object locations in a region of interest.
- desirable features in next generation 3D sensors include depth accuracy, resolution, and frame rate that are high relative to those provided by conventional 3D sensors.
- Other desirable features in next generation 3D sensors include insensitivity to ambient light and large working volume. It remains, however, difficult to provide such features at a high performance level. Furthermore, it is particularly difficult when such requirements are combined with other constraints, such as constraints related to device size and power.
- while natural user interfaces (NUIs) are prevalent in the gaming industry, mobile devices have yet to fully take advantage of the many benefits that NUIs can provide.
- sensors that capture free-form gestures do not yet fit the mobile form-factor in power, size, and cost.
- state-of-the-art active depth sensing techniques like active stereo, time-of-flight (TOF) cameras, and pulsed LIDAR systems either use raster scanned illumination or focusing optics along with a two-dimensional sensor array to form spatial correspondence.
- a method for spatial information measurement regarding a scene of interest includes illuminating a scene with a signal having an intensity which varies with time (i.e., a time-varying intensity signal); capturing reflections from the scene using at least one detector, the at least one detector being in known positions relative to the at least one light source; digitizing signals captured by the at least one detector; processing the digitized signals using parametric signal processing to recover parameters related to a plurality of optical impulse responses of the scene; and processing the parameters related to optical impulse responses of the scene to develop spatial information regarding the scene.
- the time-varying intensity signal is provided from an illumination source.
- the illumination source may be provided from one or more discrete sources or a continuous source.
- the individual discrete sources may be provided, for example, as one or more light sources.
- processing the digitized signals includes decreasing the low-frequency content of the digitized signals.
- processing the digitized signals includes high pass filtering the digitized signals.
- processing the digitized signals includes using parametric signal deconvolution on at least one of the digitized signals to obtain at least one approximate optical impulse response.
- the parametric form of the at least one approximate optical impulse response is a piecewise polynomial function or other piecewise continuous function.
- the parametric form of the at least one approximate optical impulse response is a sum of short pulse functions.
- processing the parameters includes processing a plurality of approximate optical impulse responses to identify a region of the scene from which a substantial fraction of the reflected light originates.
- processing the parameters includes processing a plurality of approximate optical impulse responses to identify the position and orientation of an object in the scene relative to the information measurement apparatus.
- the method further comprises capturing a two-dimensional image of the scene using a camera.
- the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the method further comprises improving spatial resolution of the estimated spatial position of the at least one object using the two-dimensional image.
- the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the method further comprises applying a search or optimization process to the two-dimensional image using the estimated spatial position of the at least one object.
- the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the method further comprises generating another image using the two-dimensional image based, at least in part, on the estimated spatial position of the at least one object.
- the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the method further comprises estimating pose, position, or orientation of at least one of a user's hand or other body part or parts, a body, a stylus, or other object using the estimated spatial position of the at least one object.
- the method further comprises generating a region of interest of the scene using the estimated spatial position of the at least one object, and searching for the user's hand or other body part or parts, the body, the stylus, or the other object within the region of interest.
- generating a region of interest of the scene includes generating a region of interest using an estimated spatial position of at least one surface, and the method further comprises generating an image to change the appearance of the region of interest.
- the method further comprises identifying at least one gesture using the spatial information regarding the scene, and interacting with a representation of the two-dimensional image or portions thereof using the at least one gesture.
- the method further comprises: identifying at least one gesture using the spatial information regarding the scene, and controlling an augmented reality display using the at least one gesture.
- the method further comprises: identifying at least one gesture using the spatial information regarding the scene, and controlling an electronic device using the at least one gesture.
- the electronic device includes one of a phone, a tablet, an e-book reader, a watch, smart glasses, or a computer.
- a data acquisition system comprises: at least one light source to illuminate a scene; at least one detector to detect light signals reflected from the scene, the at least one detector being in known positions with respect to the at least one light source; one or more digitizers to digitize output signals of the at least one detector; and at least one digital processor to: (a) process digitized output signals of the at least one detector using parametric signal processing to determine parameters related to a plurality of optical impulse responses of the scene of interest; and (b) process the parameters related to the plurality of optical impulse responses of the scene to develop spatial information regarding the scene.
- the at least one light source is controlled to illuminate the scene with a time-varying intensity.
- the at least one light source is controlled to illuminate the scene with a series of short light pulses.
- the at least one digital processor is configured to process digitized output signals of the at least one detector to decrease low frequency content therein.
- the at least one digital processor is configured to use parametric signal deconvolution to process at least one digitized output signal to obtain at least one approximate optical impulse response.
- the parametric form of the at least one approximate optical impulse response is a piecewise polynomial function or other piecewise continuous function.
- the parametric form of the at least one approximate optical impulse response is a sum of short pulse functions.
- the at least one digital processor is configured to process the parameters related to the plurality of approximate optical impulse responses to identify a region of the scene from which a substantial fraction of the reflected light originates.
- the at least one digital processor is configured to process the parameters related to the plurality of approximate optical impulse responses to identify the position and orientation of an object in the scene relative to the data acquisition system.
- the data acquisition system further comprises a camera for capturing a two-dimensional image of the scene.
- the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the at least one digital processor is configured to enhance the spatial resolution of the estimated spatial position of the at least one object using the two-dimensional image.
- the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the at least one digital processor is configured to apply a search or optimization process to the two-dimensional image using the estimated spatial position of the at least one object.
- the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the at least one digital processor is configured to generate another image using the two-dimensional image based, at least in part, on the estimated spatial position of the at least one object.
- the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the at least one digital processor is configured to estimate pose, position, or orientation of at least one of a user's hand or other body part or parts, a body, a stylus, or other structure using the estimated spatial position of the at least one object.
- the at least one digital processor is configured to generate a region of interest of the scene using the estimated spatial position of the at least one object; and the at least one digital processor is configured to search for at least one of the user's hand or other body part or parts, the body, the stylus, or the other structure within the region of interest.
- the at least one digital processor is configured to generate a region of interest of the scene using an estimated spatial position of at least one surface; and the at least one digital processor is configured to generate an image to change the appearance of the region of interest.
- the at least one digital processor is configured to identify at least one gesture using the spatial information regarding the scene; and the at least one digital processor is configured to interact with a representation of the two-dimensional image or portions thereof using the at least one gesture.
- the at least one digital processor is configured to identify at least one gesture using the spatial information regarding the scene; and the at least one digital processor is configured to control an augmented reality display using the at least one gesture.
- the at least one digital processor is configured to identify at least one gesture using the spatial information regarding the scene; and the at least one digital processor is configured to control an electronic device using the at least one gesture.
- the electronic device includes one of a phone, a tablet, an e-book reader, a watch, smart glasses, or a computer.
- the data acquisition system is part of the electronic device.
- the at least one detector includes a photodiode.
- the at least one light source includes at least one of: a light emitting diode or a laser diode.
- the at least one light source and the at least one detector are fixed within a common plane.
- the at least one detector operates without focusing optics.
- the at least one detector consists of six or fewer single-pixel detectors.
- a method for spatial localization of objects using at least one active illumination source and multiple detectors comprises: illuminating a scene with an impulse illumination via the at least one active illumination source; receiving, at each of the multiple detectors, a response to the impulse illumination of the scene; processing the response received at each detector using parametric signal processing to determine distance information for one or more objects in the scene with respect to the detector and the illumination source; and using a localization process to reconstruct positions in space for the one or more objects in the scene.
- the method is used to determine a spatial location of hands for a human-computer interaction.
- a depth imager comprises: a modulated illumination source to generate a periodic light waveform to illuminate a scene; multiple time-resolved detectors spaced in a plane with the modulated illumination source to receive reflected light from the scene; and a processor to: process signals output by the multiple time-resolved detectors using parametric signal processing to identify depths of interest for objects in the scene; and use source localization and the depths of interest to identify spatial positions for objects in the scene.
- the multiple time-resolved detectors include multiple photodiodes.
- the multiple time-resolved detectors include four time-resolved detectors.
- the four time-resolved detectors are spaced at the corners of a rectangle in a common plane with the modulated illumination source.
- the modulated illumination source is provided as a low-bandwidth light emitting diode or laser diode.
- FIG. 1 is a block diagram illustrating an exemplary data acquisition system that may be used to acquire object localization information for objects within a 3-dimensional scene in accordance with an embodiment
- FIG. 2 is a diagram illustrating an exemplary process for modeling the impulse response of a scene of interest in accordance with an embodiment
- FIG. 3 is a flowchart illustrating an exemplary method for determining the locations of objects that provide high frequency content in a scene response in accordance with an embodiment
- Fig. 4 is a diagram illustrating an exemplary signal processing pipeline that may be used to develop object localization information for a scene in accordance with an embodiment
- FIG. 5 is a flowchart illustrating an exemplary method for performing parametric signal processing in accordance with an embodiment
- Fig. 6 is a diagram illustrating a smart phone having a light source and four detectors in accordance with an embodiment
- Fig. 7A is a diagram illustrating a laptop computer having a light source and four detectors in accordance with an embodiment
- Fig. 7B is a diagram illustrating a flat panel monitor having a light source and four detectors in accordance with an embodiment
- Fig. 8 is a diagram illustrating smart glasses having a light source and one or more detectors in accordance with an embodiment
- FIG. 9 is a diagram illustrating a setup for use in performing hand tracking in accordance with an embodiment
- Fig. 10 is a diagram illustrating a setup for use in generating physically-accurate rendered augmentations in accordance with an embodiment
- FIG. 11 is a signal flow diagram representing an exemplary signal acquisition pipeline at a detector in accordance with an embodiment
- Fig. 12 includes a series of graphs illustrating, in a top row, continuous-time scene impulse responses for hand gesture and single plane applications and, in a bottom row, the samples acquired at individual detectors in the absence of noise for Gaussian s(t) and h_k(t) for hand gesture and single plane applications;
- Fig. 13 includes a pair of graphs illustrating the results of signal parameter estimation for a hand tracking application (left) and a single plane application (right);
- Fig. 14 includes a pair of graphs illustrating the effects of noise on accuracy for localizing each of two hands in a detector's FOV (left) and recovering plane location and orientation information (right);
- Fig. 15 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a user holds a mobile phone (e.g., a smart phone, etc.) containing a sensing apparatus in accordance with an embodiment;
- FIG. 16 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a user wears smart glasses containing a sensing apparatus in accordance with an embodiment
- FIG. 17 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a computer contains a sensing apparatus in accordance with an embodiment
- FIGs. 18A and 18B are diagrams illustrating an exemplary use of a data acquisition system as described herein where a user wears smart glasses containing a sensing apparatus in accordance with an embodiment
- Fig. 19 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a tablet contains a sensing apparatus in accordance with an embodiment
- FIGs. 20A and 20B are diagrams illustrating an exemplary use of a device including a camera that contains a sensing apparatus in accordance with an embodiment.
- Techniques, systems, and devices described herein relate to active optical sensor systems that are capable of capturing three-dimensional (3D) scene structure.
- Systems and techniques are provided herein that are capable of capturing 3D information for a scene of interest in a low complexity and low power manner.
- the techniques are capable of being implemented using relatively simple hardware construction and standard, commercially available components in a compact form factor.
- the techniques are also capable of producing accurate object location information with high frame rates, insensitivity to ambient light, and low power consumption.
- the techniques and systems described herein do not provide full spatial resolution depth maps for general scenes, but rather provide high 3D object localization accuracy in application specific scenarios.
- the techniques are capable of accurately estimating and tracking locations of a small number of distinct objects in a scene for applications such as, for example, hand gesture tracking for controlling mobile devices and 3D sensing for interactive augmented reality scenarios.
- the term "gesture” and related terms encompass communication from machine to machine or human to machine based on static or dynamic positioning of any part of a body or extension to a body such as a stylus or other object.
- the techniques described herein are capable of achieving accurate gesture identification or object localization using one or more relatively low bandwidth detectors and light sources. In some implementations, sub-centimeter range resolutions have been achieved using the disclosed technology.
- a system for determining object locations in a scene of interest using, for example, a single low cost (omnidirectional) light source and a small number of unfocused photo-detectors (although other configurations may alternatively be used).
- Active time-of-flight (TOF) techniques may be used that are based on parametric modeling and processing of scene impulse responses.
- parametric signal processing may be used to first measure distances associated with one or more objects within a scene of interest at multiple different detectors. Object localization procedures may then be employed to determine the spatial locations of the different objects based on the distance information.
- the techniques described herein may utilize a signal processing approach called "parametric deconvolution" (see, e.g., "Parametric Deconvolution of Positive Spike Trains" by Li and Speed, Annals of Statistics, vol. 28, no. 5, pp. 1279-1301, Oct. 2000, which is incorporated by reference herein in its entirety, and references therein).
- Parametric deconvolution has been used previously in processing optical intensity signals (see "Exploiting Sparsity in Time-of-Flight Range Acquisition Using a Single Time-Resolved Sensor," by Kirmani et al., Optics Express, vol. 19, no. 22, pp. 21485-21507, Oct. 2011, which is hereby incorporated by reference in its entirety).
- in that work, transverse spatial resolution was obtained through the use of spatially-varying illumination intensity or spatially-varying detection sensitivity.
- the techniques and systems described herein can generate spatial information with omnidirectional illumination and detection. Where imaging with omnidirectional illumination and detection was achieved previously (see, e.g., "Diffuse Imaging: Creating Optical Images with Unfocused Time-Resolved Illumination and Sensing" by Kirmani et al., IEEE Signal Processing Letters, vol. 19, no. 1, pp. 31-34, Jan. 2012), the non-parametric modeling and reconstruction methods required a large product of the number of illumination sources and the number of detectors to achieve useful resolution. In contrast, some embodiments of the techniques and systems described herein have only a small number of illumination sources and a small number of detectors.
- Mature 3D acquisition technologies, like Microsoft's Kinect, are power-hungry and bulky.
- Traditional 3D sensing devices typically initially acquire a full depth map.
- the application pipeline will then use only the important edge information and discard the rest of the acquired pixels.
- the techniques and systems described herein can accurately identify edge information during acquisition, thereby eliminating the collection of wasteful pixels during the acquisition process.
- parametric signals may be described using only a small number of parameters.
- parametric signal processing techniques may be used including, for example, parametric signal deconvolution and other techniques.
- Fig. 1 is a block diagram illustrating an exemplary data acquisition system 10 that may be used to acquire object localization information for objects within a 3-dimensional scene 12.
- the data acquisition system 10 may use optical time-of-flight information to determine locations of objects within the scene.
- the data acquisition system 10 may be implemented as a stand-alone unit.
- the data acquisition system 10 may be made part of a larger system, such as a desktop computer system, to serve as, for example, a user input device.
- the data acquisition system 10 may be used in any type of electronic system. Because of its relative simplicity, low cost, and compact size capabilities, the data acquisition system 10 is particularly suited for use in portable and mobile devices such as, for example, smart phones and other handheld communication devices, laptop and tablet computers, portable audio/video players, smart glasses, and other portable devices.
- data acquisition system 10 may include: a light source 16; one or more photo detectors 18, 20, 22, 24; one or more analog-to-digital (A/D) converters 26, 28, 30, 32; a signal processing unit 34; memory 36; a controller 38; and an RGB camera 40.
- the light source 16 may include any type of light source that is capable of illuminating the scene 12 with an impulse of light of sufficient intensity to be reflected from objects within the scene 12 and then be sensed by the photo detectors 18, 20, 22, 24.
- a single low-cost, omnidirectional light source may be used (e.g., a single light emitting diode, etc.).
- the light source will be modulated during operation.
- the light source 16 may be modulated to provide a train of successive light pulses for use in tracking the locations of objects (e.g., hands, etc.) in the scene 12.
- the photo-detectors 18, 20, 22, 24 may include any type of detectors or sensors that are capable of sensing the type of light (e.g., wavelength and bandwidth, etc.) generated by light source 16.
- the photo-detectors 18, 20, 22, 24 may consist of low-cost, unfocused, single-pixel photo-detectors, such as silicon PIN photodiodes. Other types of photo-detectors may alternatively be used, as would be known to those skilled in the art.
- the photo-detectors 18, 20, 22, 24 are operative for sensing light signals reflected from the scene 12 (e.g., from objects within scene 12).
- the photo-detectors 18, 20, 22, 24 are unfocused detectors that do not make use of focusing optics (although focusing optics may be provided in some embodiments).
- optical filtering may preferentially sense the wavelengths of light used for illumination and reject other wavelengths to reduce the effect of ambient light.
- optical sensing may preferentially reject the constant and slowly- varying components of the light to further reduce the effect of ambient light.
- four photo-detectors 18, 20, 22, 24 are used. It should be appreciated, however, that any number of detectors may be used in different embodiments.
- the photo-detectors 18, 20, 22, 24 will be located in known positions with respect to light source 16. These known positions may be fixed with respect to the light source, but in some embodiments the positions may vary with respect to the light source in a known manner. In some implementations, the photo-detectors 18, 20, 22, 24 may be mounted within the same plane (or substantially the same plane) as light source 16, although non-planar implementations also exist.
- the A/D converters 26, 28, 30, 32 are operative for sampling the output signals of the photo-detectors 18, 20, 22, 24 and converting them to a digital representation (i.e., digitizing the signals). As shown in Fig. 1, one A/D converter may be provided for each detector in some embodiments.
- the signal processing unit 34 is operative for processing the digitized output signals of the detectors 18, 20, 22, 24 to determine object spatial position information for the scene of interest. Controller 38 may be used to control the various components of data acquisition system 10 to generate the object position information.
- Controller 38 may control, for example, when light source 16 is illuminated, the intensity of the light signal generated, when A/D converters 26, 28, 30, 32 are activated, sampling rate of the A/D converters 26, 28, 30, 32, and/or other operational parameters of system 10.
- signal processing unit 34 and controller 38 may be implemented using one or more digital processing devices.
- the digital processing device(s) may include, for example, a general purpose microprocessor, a digital signal processor (DSP), a reduced instruction set computer (RISC), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic array (PLA), a microcontroller, an embedded processor, and/or others, including combinations of the above.
- Memory 36 may be used by signal processing unit 34 to, for example, store object spatial position information for the scene of interest. Memory 36 may also be used to store, for example, an operating system to be executed within data acquisition system 10 or a host device, one or more application programs to be executed within data acquisition system 10 or a host device, user data, and/or other information.
- Memory 36 may include, for example, random access memories (RAMs), read only memories (ROMs), erasable programmable ROMs (EPROMs), electrically erasable programmable ROMs (EEPROMs), semiconductor memories, registers, disk-based data storage, floppy disks, hard disks, optical disks, compact disc read only memories (CD-ROMs), Blu-ray disks, magneto-optical disks, magnetic or optical cards, flash memory, and/or other types of digital data storage.
- signal processing unit 34 may use a signal processing technique known as "parametric signal processing" to process reflected light signals detected by each of the photo-detectors 18, 20, 22, 24.
- This parametric signal processing may generate distinct distance data related to different objects within the scene 12 with respect to the corresponding detectors 18, 20, 22, 24. Localization processing may then be used to process the distance data associated with the detectors 18, 20, 22, 24 to determine spatial regions within which the different objects in the scene 12 are located.
- multilateration techniques are used by the signal processing unit 34 to process the distance data to determine the spatial regions, although other location techniques may be used in other embodiments.
- optional RGB camera 40 may be used to take one or more 2-dimensional pictures of scene 12. Processing applied to the 2-dimensional picture(s) may then be modified based on the spatial regions computed for the objects in the scene in a fusion process. Although illustrated as an RGB camera in Fig. 1, it should be appreciated that any type of 2-dimensional imaging device may be used.
- a digitized signal may be processed using parametric signal deconvolution to generate an approximate optical impulse response for a scene. Multiple digitized signals may result in multiple approximate optical impulse responses for the scene. These multiple scene impulse responses may then be processed to develop spatial position information for objects in the scene. Each digitized signal may correspond, for example, to a different source-detector pair. Thus, multiple different optical impulse responses may be achieved using, for example, multiple light sources with a single detector, a single light source with multiple detectors, multiple light sources with multiple detectors, or a single light source and a single detector where one moves in a known manner relative to the other.
- Fig. 2 is a diagram illustrating an exemplary process for modeling the impulse responses of the scene of interest (e.g., scene 12 in Fig. 1) in accordance with an embodiment.
- a rather simple scene includes a ball 50 and a mug 52 lying on a flat surface 54 (e.g., a table top, etc.).
- a light source 56 emits an omnidirectional light pulse toward the scene.
- the light pulse may include, for example, an impulse of light that may be described using the Dirac delta function δ(t). It should be appreciated that the Dirac delta function is an idealized form of impulse that may be difficult or impossible to achieve in practice.
- pulse is not limited to impulse type waveforms.
- the light pulse is reflected from each of the objects in the scene and the reflected signals are sensed by two detectors 58, 60. Each detector will receive one reflection component for each of the two objects in the scene.
- the reflected light signals received at the two detectors from a particular object may differ in delay, in amplitude, and in other ways (e.g., B_1(t) at detector 58 and B_2(t) at detector 60 resulting from ball 50, or M_1(t) at detector 58 and M_2(t) at detector 60 resulting from mug 52).
- reflected light signals from surface 54 (e.g., b_1(t) and b_2(t)) are likewise received at the two detectors.
- the reflected light signals for all objects within the scene will combine in the detectors 58, 60. These combined signals will then be digitized and processed to determine spatial locations for the objects within the scene.
- the main idea behind the processing is that when small objects that occupy a small portion of a sensor's field of view (FOV) (e.g., ball 50 and mug 52) are illuminated with a short impulse of light, the reflected signal will be a short time duration signal having a fast rising edge and a fast falling edge (e.g., B_1(t) and B_2(t) in Fig. 2).
- one assumption behind the digital processing that may be used to process the reflected signals is that the high frequency response contributed by each of the small, distinct objects is well-modeled using parametric signals, irrespective of the object shape.
- the high frequency scene response may be modeled as the sum of time-shifted boxcar functions, with one boxcar function from each of the distinct objects (as shown in Fig. 2).
- the goal of the digital processing may be to estimate the 3D positions of scene objects that contribute to the high frequency scene response.
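As an illustration of the boxcar model just described, the following minimal numpy sketch constructs a high frequency scene response as a sum of time-shifted boxcar functions, one per distinct object. All distances, extents, and amplitudes are hypothetical values chosen for illustration, not parameters from this disclosure.

```python
import numpy as np

C = 3e8            # speed of light (m/s)
FS = 10e9          # simulation time resolution (Hz)
T = 50e-9          # observation window (s)
t = np.arange(0.0, T, 1.0 / FS)

def boxcar(t, onset, width, amplitude):
    """Time-shifted boxcar: `amplitude` on [onset, onset + width), 0 elsewhere."""
    return amplitude * ((t >= onset) & (t < onset + width))

# Hypothetical round-trip path lengths (m), extents (m), and reflectance-scaled
# amplitudes for two small, distinct objects (e.g., the ball and mug of Fig. 2).
objects = [
    {"path_m": 1.2, "extent_m": 0.06, "amplitude": 1.0},   # ball
    {"path_m": 2.0, "extent_m": 0.10, "amplitude": 0.6},   # mug
]

# High frequency scene response: one boxcar per object, time-shifted in
# proportion to round-trip distance and widened by the object's extent.
scene_response = sum(
    boxcar(t, obj["path_m"] / C, obj["extent_m"] / C, obj["amplitude"])
    for obj in objects
)
```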
- FIG. 3 is a flowchart illustrating an exemplary method 80 for determining the location of objects that provide high frequency content in a scene response in accordance with an embodiment.
- a light pulse is first transmitted toward a scene of interest (block 82). Reflected light pulse components are then detected at multiple light detectors (block 84). The reflected light pulse components may, for example, result from different objects within the scene of interest.
- the output signals of the detectors may then each be digitized (block 86).
- the digitized samples associated with each detector may then be digitally processed using parametric signal processing to recover parameters related to peak amplitudes and distance (or time) of signal jumps (block 88).
- the signal jumps may each correspond to a distinct object within the scene of interest.
- source localization processing may next be performed using the parameter information associated with each of the detectors to determine estimated spatial regions within which each object within the scene resides (block 90).
- this spatial region information may be used to perform one or more object location related applications (e.g., gesture tracking, augmented reality, etc.).
- further processing may first be performed to refine the object spatial region information (e.g., to improve spatial resolution) before it is used.
- a two-dimensional image of the scene of interest may be obtained using, for example, an RGB camera or other two dimensional imager (block 92).
- the two dimensional image may then be fused with the object spatial region information to improve the spatial resolution thereof (block 94).
- Example techniques for performing the fusion process will be described below.
- the above- described process may be continually repeated in order to track object position within a scene of interest.
- a train of light pulses may be used to illuminate the scene of interest, where the object location information is updated for each new pulse.
- Other tracking techniques are also possible.
- Fig. 4 is a diagram illustrating an exemplary signal processing pipeline that may be used to develop object localization information for a scene in accordance with an embodiment.
- the pipeline of Fig. 4 assumes a similar scene to that shown in Fig. 2, with a ball and a mug resting on a flat surface.
- the pipeline of Fig. 4 also assumes that unfocused detectors are being used.
- a scene response 100, 102 is received at each of two detectors.
- Each scene response includes a superposition of responses received from different objects in the scene.
- the scene response of each small, distinct object in the scene will take the form of (or at least approximate) a boxcar function at each detector.
- the scene responses 100, 102 received at the detectors will be processed by the transfer functions (e.g., impulse responses) of the corresponding detectors 104, 106 (e.g., h_1(t), h_2(t)). As is well known, this amounts to a convolution operation in the time domain. In many cases, the transfer functions of the different detectors will be substantially the same, although detectors having different transfer functions may be used in some implementations. In general, the transfer functions of the detectors will take the form of low pass filters.
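Continuing the sketch above (it reuses t, FS, and scene_response), the detector transfer function and the A/D conversion stages can be modeled as a convolution with a low-pass impulse response followed by uniform sampling. The Gaussian detector response, A/D rate, and noise level below are illustrative assumptions.

```python
# Assumed low-pass detector impulse response (Gaussian, unit DC gain).
sigma = 0.5e-9                               # assumed response width (s)
h = np.exp(-0.5 * ((t - 5 * sigma) / sigma) ** 2)
h /= h.sum()

# Convolution in time models the detector transfer function h_k(t).
detector_output = np.convolve(scene_response, h)[: len(t)]

# A/D conversion: uniform sampling at an assumed rate, plus additive noise.
AD_RATE = 1e9                                # assumed A/D sampling rate (Hz)
step = int(FS / AD_RATE)
samples = detector_output[::step] + 1e-3 * np.random.randn(len(t[::step]))
```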
- the scene responses will then be sampled and converted to a digital format in A/D converters 108, 110 to generate digital responses 112, 114.
- the digitized time-samples are then processed using parametric signal processing to obtain parameters that completely specify the underlying parametric signals; namely, the positions and amplitudes of signal jumps within the responses.
- this processing may include high pass filtering the digitized time-sample information using digital high pass filtration 116, 118 and spectral estimation.
- the parameters that are extracted may be inferred from power spectral density (PSD) plots 120, 122 as a function of frequency.
- a maximum likelihood (ML) source localization process 124 may be undertaken to develop location information for each small, distinct object within the scene (or some subset of some partition of the scene).
- the source localization process will generate 3D regions of interest (ROI) 126 for each object that bounds the object location. If a 2D image 128 is available, the computed 3D ROIs 126 may be fused with the 2D image data to create a depth map of scene objects 130 that contribute to high frequency temporal variations.
- Fig. 5 is a flowchart illustrating an exemplary method 200 for performing parametric signal deconvolution processing in accordance with an embodiment.
- this parametric signal processing may be used to extract signal parameters from scene responses received at different detectors that can then be processed to determine object location information.
- the method 200 of Fig. 5 may be used to estimate a scene impulse response using digital samples collected at a detector's output and digital samples of the detector's impulse response.
- K digital samples are obtained from the detector (block 202).
- the K digital samples may be denoted y_1, y_2, ..., y_K.
- An N-point discrete Fourier transform (N ≥ K) of y_1, ..., y_K may then be computed to generate frequency domain samples Y_1, ..., Y_N (block 204).
- An N-point discrete Fourier transform may then be computed for the detector impulse response to generate frequency domain samples H_1, ..., H_N (block 206).
- An appropriate interpolation kernel may then be selected based on the kind of scene being imaged (block 208). If the scene to be imaged is mostly comprised of planar objects, then a linear interpolation kernel is appropriate. If there are curved objects in the scene, a higher order kernel like splines may be used. Exponential splines and other parametric function classes known to those skilled in the art may also be used.
- the Fourier coefficients of the interpolation kernel may be denoted as G_1, ..., G_N.
- the frequency domain samples Y_1, ..., Y_N may next be rescaled by their corresponding values of H_1, ..., H_N and G_1, ..., G_N (block 210).
- the rescaled data may be denoted as Z_1, ..., Z_N.
- the rescaled data Z_1, ..., Z_N may be used to estimate the number of discontinuities (or kinks) in the scene impulse response (block 212). In at least one embodiment, this may be accomplished by forming a structured Hankel matrix using Z_1, ..., Z_N, followed by computing the rank of the matrix using singular value decomposition.
- the number of discontinuities may be denoted by L. Note that, by the definition of matrix rank, L ≤ N.
- the computed value of L may then be used, along with the rescaled data Z_1, ..., Z_N, to compute the positions of the discontinuities in the scene impulse response (block 214). In at least one embodiment, this may be accomplished by forming a structured Hankel matrix H of size (N−K+1) × (L+1) and computing the eigenvector of H associated with its smallest eigenvalue; the entries of this eigenvector serve as the coefficients of a polynomial whose roots are the estimates for the positions of the kinks in the scene impulse response.
- the L kink position estimates may be denoted as d_1, ..., d_L.
- other methods to estimate kink positions in the scene impulse response, based on spectral estimation techniques, may alternatively be employed.
- the amplitudes of the kinks may be estimated using the data Z_1, ..., Z_N and d_1, ..., d_L (block 216). In at least one embodiment, this may be accomplished using a fast implementation of a linear Vandermonde filtering operation. Other techniques may alternatively be used.
- the amplitude estimates may be denoted as A_1, ..., A_L.
- the estimates d_1, ..., d_L and A_1, ..., A_L, along with the interpolation kernel G, may then be used to estimate the scene impulse response (block 218).
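A minimal numpy/scipy sketch of blocks 204-216 follows. It assumes the rescaled data Z_1, ..., Z_N behave as a sum of L complex exponentials (the model underlying the Hankel construction above); the function names are ours, SVD is used in place of an explicit eigendecomposition, and the model order L is passed in rather than estimated from the Hankel rank (block 212), which a fuller implementation would also perform.

```python
import numpy as np
from scipy.linalg import hankel

def annihilating_filter(z, L):
    """Recover the L exponential modes of z[n] = sum_l A_l * u_l**n."""
    N = len(z)
    # Structured Hankel matrix built from the rescaled data (block 214).
    A = hankel(z[: N - L], z[N - L - 1 :])        # shape (N - L) x (L + 1)
    # The annihilating filter lies in the numerical null space of A; take the
    # right singular vector associated with the smallest singular value.
    _, _, Vh = np.linalg.svd(A)
    v = Vh[-1].conj()
    # Roots of the filter polynomial v[0] + v[1] x + ... + v[L] x**L.
    return np.roots(v[::-1])

def estimate_kinks(y, h_dft, g_dft, L, N):
    """Estimate kink positions d_1..d_L and amplitudes A_1..A_L (blocks 204-216)."""
    Y = np.fft.fft(y, N)                          # block 204
    # Rescale by detector and kernel spectra (block 210); in practice only
    # frequencies where h_dft * g_dft is safely non-zero would be retained.
    Z = Y / (h_dft * g_dft)
    u = annihilating_filter(Z, L)
    d = (-np.angle(u) * N / (2 * np.pi)) % N      # kink positions (in samples)
    # Kink amplitudes from the Vandermonde system Z[n] = sum_l A_l * u_l**n
    # (block 216), solved here by least squares.
    n = np.arange(N)
    V = u[np.newaxis, :] ** n[:, np.newaxis]
    amps, *_ = np.linalg.lstsq(V, Z, rcond=None)
    return d, amps
```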
- multilateration techniques may be used to process parameters associated with optical impulse responses of a scene to develop spatial information for the scene (e.g., to develop location information or 3D region of interest information for objects in a scene).
- multilateration is a technique that may be used to determine relative location based on measurement of the difference in distance to two or more known locations.
- these locations may form, for example, an ellipsoid (or a portion thereof). If multiple measurements are taken, a solution may be arrived at as, for example, an intersection of ellipsoids.
- Other location techniques may alternatively be used.
- source localization may be performed as follows. First, denote by a number L the number of objects producing substantial responses. L distance estimates may then be found for each of a number M of detectors using, for example, the method described in Fig. 5. Each distance estimate for each detector corresponds to a path length from illumination source to scene point to detector and hence to an ellipsoid in the scene volume. In a noiseless setting, LM ellipsoids may be obtained which contain the objects producing substantial responses. To account for noise, for each object, a fitting error may be minimized (such as sum of squared differences) between the path length to/from the candidate location and the M estimated distances for the object. This act may include a search over associations between objects and distances or other suitable optimizations.
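The fitting step described above reduces to a small least-squares problem. In the hypothetical sketch below (the geometry, starting point, and choice of optimizer are all assumptions), a single object's location is recovered from four estimated source-object-detector path lengths:

```python
import numpy as np
from scipy.optimize import minimize

def path_length(x, source, detector):
    """Source -> scene point x -> detector path; a level set is an ellipsoid."""
    return np.linalg.norm(x - source) + np.linalg.norm(x - detector)

def localize(d_est, source, detectors):
    """Least-squares multilateration from M estimated path lengths."""
    def cost(x):
        return sum((path_length(x, source, det) - d) ** 2
                   for det, d in zip(detectors, d_est))
    # Start in the z > 0 half-space (the sensor's FOV) to resolve the mirror
    # ambiguity of a planar source/detector layout.
    return minimize(cost, x0=np.array([0.0, 0.0, 0.5]),
                    method="Nelder-Mead").x

# Example: source at the origin, detectors at the corners of a small rectangle.
source = np.zeros(3)
detectors = [np.array([sx * 0.04, sy * 0.02, 0.0])
             for sx in (-1, 1) for sy in (-1, 1)]
truth = np.array([0.05, -0.03, 0.40])
d_est = [path_length(truth, source, det) for det in detectors]
print(localize(d_est, source, detectors))   # approximately [0.05, -0.03, 0.40]
```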
- 2D image information may be fused with object region-of-interest (ROI) information to improve the spatial resolution of the object location information.
- this processing may only be performed when performing applications that require a higher degree of resolution than provided by the source localization procedure.
- fusion is not used at all.
- fusion may be performed by first denoting by a number L the number of regions of interest (ROIs) in a scene. L ROIs may then be identified using, for example, the method described in the previous paragraph. Next, a two-dimensional image may be captured by a camera.
- a computer vision algorithm, examples of which are described below, may then be applied only within the identified ROIs, improving the accuracy and reducing the computational complexity of the computer vision task.
- the term "computer vision" and related terms encompass any computations that substantially use the two-dimensional image and may be modified to exploit one or more identified ROIs.
- a device will need to include, on an exterior portion thereof, at least one light source and at least one light detector. Since the light sources and light detectors will need to be used to acquire at least two approximate optical impulse responses of the scene, it is convenient to illustrate embodiments as having multiple light detectors. It will be appreciated that the effect of having multiple detectors can be achieved by using multiple light sources or by moving a light source or a detector.
- Fig. 6 is a diagram illustrating a back panel of a smart phone 500 having a light source 502 and four detectors 504, 506, 508, 510 installed thereon.
- the light source 502 is centrally located along a top edge of the panel and the detectors 504, 506, 508, 510 are located in the four corners of the panel. Other arrangements may alternatively be used.
- An RGB camera 512 may also be provided within the smart phone 500 for use in supporting fusion operations and/or other purposes. As will be appreciated, in some implementations, smart phone 500 may use the object location capabilities described herein to support user input by hand gestures. Other additional or alternative applications of the object location capabilities may also be provided.
- Fig. 7A is a diagram illustrating a laptop computer 520 configured to use the object location techniques described herein in accordance with an embodiment.
- a number of photo-detectors 522, 524, 526, 528 may be situated around a bezel of a display 530 of the laptop 520 for use in detecting reflected light signals.
- One or more light sources 532 may be situated along a top edge of the display 530. Other light source and detector positions may alternatively be used.
- the object location capabilities of the laptop 520 may, in some implementations, be used to support user input operations by hand gestures. Other applications also exist.
- one or more pixels of display 530 may be used as a light source for object location operations, thereby dispensing with the need for a separate light source.
- a similar arrangement may be used with tablet computers, e-Book readers, or other forms of computing devices.
- Fig. 7B is a diagram illustrating a flat panel monitor 540 configured to use the object location techniques described herein in conjunction with a desktop computer.
- a number of photo-detectors 542, 544, 546, 548 may be situated around an edge of a display 550 of the monitor 540 for use in detecting reflected light signals.
- a light source 552 may be located along a top edge of the monitor 540. Other light source and detector positions may alternatively be used.
- hand gestures 554 may be used to provide user input to the desktop computer using the object location capabilities of the computer. Other or alternative applications may also be provided.
- one or more pixels of display 550 may be used as a light source for object location operations, thereby dispensing with the need for a separate light source.
- Fig. 8 is a diagram illustrating smart glasses 560 configured to use the object localization techniques described herein in accordance with an embodiment.
- a number of photo-detectors 562 may be situated on the frame of the glasses 560 along with a light source 564.
- An RGB camera may also be provided (not shown).
- capabilities described herein may support user input by hand gestures. Other additional or alternative applications of the object location capabilities may also be provided.
- the image capture techniques described herein may be used in a variety of different applications. For example, as described above, the techniques may be used to provide user input to various devices and systems via hand gestures. The techniques may also be used in applications related to augmented reality and information display; smart glasses, smart watches, and other wearable sensor applications; mobile gaming; 3D gaming; navigation; and photography.
- Fig. 9 is a diagram illustrating a setup for use in performing hand tracking in accordance with an embodiment.
- Fig. 10 is a diagram illustrating a setup for use in generating physically-accurate rendered augmentations in accordance with an embodiment. Both of these applications rely on the estimation of a few scene features, such as object pose and position.
- Prior image processing and computer vision techniques operate by first capturing a full 2D or 3D image of a scene, then processing the image to detect objects, and finally performing object parameter estimation (see, e.g., "Computer Vision: Algorithms and Applications" by R. Szeliski, Springer, 2010).
- These prior techniques are effective for general scenes under favorable lighting and resolution conditions, but require significant computation and acquisition resources. More significantly, these techniques generate full-resolution images even though the objects of interest in the scenes are simple and few.
- hands can be tracked and/or planar object pose and orientation can be inferred with low computational complexity using only a few sensors.
- the features of interest in the scene are the 3D locations of the two hands.
- the features of interest are the pose and position of the plane relative to the imaging device.
- a single intensity-modulated light source 600 is being used to illuminate the corresponding scenes with a T-periodic signal s(t).
- 4 time-resolved detectors 602, 604, 606, 608 are being used to capture light signals reflected from the scenes.
- a plurality of scene impulse responses may be obtained in many other configurations.
- the light source 600 and detectors 602, 604, 606, 608 are synchronized to a common time origin. It will further be assumed that the illumination period T is large enough to avoid distance aliasing.
- the scene impulse response is therefore the sum of two box functions that are time shifted in proportion to the respective object distances and scaled in amplitude by the object reflectances.
- the box function widths are governed by the object poses, positions, and areas. As Δ → 0, it may be assumed that B(t, Δ) ≈ Δ·δ(t), so the response for two small objects in the scene can be modeled as a sum of two time-shifted, amplitude-scaled Dirac delta functions.
- the locations and amplitudes of the delta functions constitute the signal parameters to be recovered.
- the total light incident at detector k at time t may be expressed as an integral of the scene reflectance over the locus of scene points whose total source-to-point-to-detector path length corresponds to time t; for a scene with reflectance a(x), this takes the form g_k(t) = ∫ a(x) δ(t − d_k(x)/c) dx over the detector's FOV, where d_k(x) is the total path length defined below and c is the speed of light.
- the intensity g_k(t) thus contains the contour integrals over the object surface, where the contours are ellipses. As will be described in greater detail, g_k(t) will be zero until a certain onset time τ_k and thereafter will be well approximated by a polynomial spline of degree at most 2.
- a practical detector has an impulse response h k (t) that needs to be considered.
- a Dirac impulse illumination cannot generally be realized in practice.
- the signal acquisition pipeline at detector k may be accurately represented using the signal flow diagram of Fig. 11, where g_k(t) is the light incident at detector k, h_k(t) is the impulse response of detector k, η_k(t) is the photodetector noise, and T_s is the sampling period.
- the scene features of interest are the 3D locations of the objects (i.e., hands) in the sensor's FOV.
- the scene features of interest are the plane position and orientation.
- a two-step process may be used to perform the feature estimation. That is, (1) use parametric deconvolution to estimate the scene impulse response g_k(t) from the acquired samples r_k[n]; and (2) use the set of estimated signal parameters from the 4 scene impulse responses to recover the scene features.
- the coordinate system may be defined relative to the imaging device with the illumination source 600 as the origin and the device lying in the x-y plane (see Figs. 11 and 12).
- the locations of the detectors 602, 604, 606, 608 may be defined as (±w_1, ±w_2, 0).
- the imaging direction may be selected as the positive z-direction so that the sensor's FOV lies in the half-space where z > 0.
- in Step 1 of the two-step process, the amplitudes and time shifts of the Dirac delta functions in the scene impulse response may be directly estimated. Since the two objects are to be localized, model order 2 may be assumed in the parametric deconvolution scheme and the scene impulse response may be recovered as the sum of 2 Diracs. From the time shifts, the distances d_Ak and d_Bk can be estimated. To recover the spatial locations of the two objects in the FOV, the distances d_ik may be used.
- the scene features may be recovered in step 2. It must first be determined which estimated distance corresponds to each object. This may be accomplished by first finding the equations describing the 8 total ellipsoids for which the total distance from the source to a point on the ellipsoid and back to detector k equals d_Ak or d_Bk.
- the 8 ellipsoids can be partitioned into two disjoint sets of 4 ellipsoids each, with the first set defined by the distances d_1k and the second set defined by the distances d_2k, such that each set intersects in one unique point. These two points of intersection, x_1 and x_2, are the estimates for the locations of the two objects.
- d_k(x) may be defined as the total distance traveled by the contribution from point x in the detector's FOV.
- a method for recovering piecewise-polynomial signals may be employed to determine both the locations in time and amplitudes of the kinks (see, e.g., "Sampling Signals with Finite Rate of Innovation," by Vetterli et al., IEEE Trans. Signal Process., vol. 50, no. 6, pp. 1417-1428, Jun. 2002 and "Sparse Sampling of Signal Innovations," by Blu et al., IEEE Signal Process. Mag., vol. 25, no. 2, pp. 31-40, Mar. 2008).
- although a typical signal may have as few as 4 or 5 kinks, recovery in practice was found to be more accurate when initially assuming 10 or 12 kinks and rejecting kinks with low amplitude.
- the time location of the first kink, or onset time τ_k of the impulse response g_k(t), is needed.
- the onset times correspond to the times at which the light that travels the shortest total path length is incident on the detector.
- the ordered triple n uniquely determines both the normal direction and a point on the plane.
- let d_k^min(n) be the minimum path length from the origin to P(n) and back to detector k.
- using the shortest path length estimate d_k^min for each source-detector pair k, the following optimization problem may be solved to find the plane P(n) for which the sum of squared differences between the total distances to the plane and the estimated total distances is minimized: minimize over n the quantity Σ_k (d_k^min(n) − d̂_k)², where d̂_k denotes the shortest total distance estimated for detector k from the onset time τ_k.
- the resulting plane P(n) is the estimate for the plane.
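A sketch of this plane-fitting step appears below. The minimum path length d_k^min(n) is evaluated with the standard mirror-image identity (the shortest source-to-plane-to-detector path equals the distance from the source to the detector's reflection across the plane); the four-parameter plane representation and the optimizer are assumptions for illustration, and the normal is recovered only up to scale.

```python
import numpy as np
from scipy.optimize import minimize

def min_path(plane, source, detector):
    """Shortest source -> plane -> detector path via the mirror-image identity."""
    n, offset = plane[:3], plane[3]          # plane: {x : n_hat . x = offset}
    n = n / np.linalg.norm(n)
    mirror = detector - 2 * (detector @ n - offset) * n
    return np.linalg.norm(mirror - source)

def fit_plane(d_min_est, source, detectors):
    """Least-squares fit of plane parameters to estimated onset distances."""
    def cost(p):
        return sum((min_path(p, source, det) - d) ** 2
                   for det, d in zip(detectors, d_min_est))
    return minimize(cost, x0=np.array([0.0, 0.0, 1.0, 0.5]),
                    method="Nelder-Mead").x

# Example: detectors at rectangle corners; true plane slightly tilted, 0.4 m away.
source = np.zeros(3)
detectors = [np.array([sx * 0.04, sy * 0.02, 0.0])
             for sx in (-1, 1) for sy in (-1, 1)]
true_plane = np.array([0.1, 0.0, 1.0, 0.4])
d_min_est = [min_path(true_plane, source, det) for det in detectors]
print(fit_plane(d_min_est, source, detectors))
```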
- Fig. 13 shows the results of the signal parameter estimation of Step 1 for the hand tracking application (left) and the single plane application (right). These results allow the recovery of the important times and distances needed for estimating scene features.
- the time locations d_ik of the hands and the onsets τ_k of the large plane are captured accurately. Note that the exact amplitudes of the piecewise-polynomial fit for the scene impulse response of the large plane are not completely preserved due to the mismatch in the model, but the time locations of the kinks are still preserved.
- Fig. 14 shows the effects of noise on accuracy for (1) localizing each of the two hands in the detector's FOV and (2) recovering plane location and orientation. The normalized MSE averaged over 500 trials at each SNR level was calculated. It can be seen that the recovery of the two hands is more robust to noise than the recovery of the single plane, due to the lower complexity and better fit of the signal model.
- Fig. 15 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a user holds a mobile phone (e.g., a smart phone, etc.) containing a sensing apparatus in accordance with an embodiment.
- The 3D spatial information estimated in accordance with the invention allows the position of the user's fingers to be localized and tracked over space and time and hence associated with a gesture.
- The gesture may include movement in three dimensions.
- Gestural input is possible with any of a wide variety of different electronic devices containing a sensing apparatus as described herein. This may include, for example, a phone, a tablet, an e-book reader, a watch, smart glasses, a computer, and/or other devices or systems.
- Fig. 16 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a user wears smart glasses containing a sensing apparatus in accordance with an embodiment.
- The 3D spatial information estimated using the sensing apparatus allows the position of a planar surface in a user's field of view to be estimated accurately. This enables augmentation of the user's view, with the augmentation having the appearance of being on the planar surface. Augmentations of reality are possible with other display devices and on other types of surfaces.
- Fig. 17 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a computer contains a sensing apparatus in accordance with an embodiment.
- The 3D spatial information estimated in this scenario may allow the position of the user, the orientation of the screen, the size of the table, and the movements of the user's hands and face to be estimated. As will be appreciated, this information may be of value within any number of different user applications. Many other placements of a sensing apparatus as described herein are also possible.
- FIGs. 18A and 18B are diagrams illustrating an exemplary use of a data acquisition system as described herein where a user wears smart glasses containing a sensing apparatus in accordance with an embodiment.
- The 3D spatial information estimated in this scenario may allow a user to control a gestural interface to a catalog of furniture items.
- The gestural interface allows placement of the selected furniture in an augmented-reality view. Similar combinations of gestural interfaces with augmented reality would be clear to those of skill in the art.
- Fig. 19 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a tablet contains a sensing apparatus in accordance with an embodiment.
- The 3D spatial information estimated in accordance with the invention allows several users to interact simultaneously with the same tablet through a gestural interface. Gestures may include movements in three dimensions in some implementations. Specific uses of such shared gestural interfaces would be clear to those of skill in the art.
- Figs. 20A and 20B are diagrams illustrating an exemplary use of a device including a camera that contains the sensing apparatus in accordance with an embodiment.
- The user may interactively select spatial boundaries of objects in a photograph or 3D scene and place them at new 3D positions through the use of gestures.
- The sensing apparatus on the device senses hand gestures and tracks movement of the user's hands for selection and re-positioning of objects. This information may be used, for example, for editing photographs and 3D content and other uses.
Description
METHOD AND APPARATUS FOR 3D SPATIAL LOCALIZATION AND TRACKING OF OBJECTS USING ACTIVE OPTICAL ILLUMINATION AND SENSING
CROSS REFERENCE TO RELATED APPLICATION
[0001] The present application claims the benefit of U.S. Provisional Patent Application No. 61/597,233 filed on February 10, 2012, which is hereby incorporated herein by reference in its entirety.
FIELD
[0002] Subject matter disclosed herein relates generally to data acquisition techniques and, more particularly, to techniques and structures for determining object locations in a region of interest.
BACKGROUND
[0003] As is known in the art, there is an increasing trend to integrate three-dimensional (3D) sensing capabilities into a variety of hardware and software products including both government and consumer products, software, and user interfaces. The demand for accurate, low power, and compact depth cameras is evident and growing. In recent months, a flurry of commercial activity has occurred in 3D sensing applications including the introduction in consumer markets of several compact and low cost depth sensors for consumer electronics and other applications. Some desirable features in next generation 3D sensors include depth accuracy, resolution and frame rate which are relatively high compared with depth accuracy, resolution and frame rate provided by conventional 3D sensors. Other desirable features in next generation 3D sensors include insensitivity to ambient light and large working volume. It remains, however, difficult to provide such features at a high performance level. Furthermore, it is particularly difficult when such requirements are combined with other constraints, such as constraints related to device size and power.
[0004] As is also known, there is also an increasing demand for so-called natural user interfaces (NUIs) which enable unencumbered, free-form gestural input to electronic
devices. While NUIs are prevalent in the gaming industry, mobile devices have yet to fully take advantage of the many benefits that NUIs can provide. One reason for this is that sensors that capture free-form gestures do not yet fit the mobile form-factor in power, size, and cost. State-of-the-art active depth sensing techniques like active stereo, time-of-flight (TOF) cameras, and pulsed LIDAR systems either use raster scanned illumination or focusing optics along with a two-dimensional sensor array to form spatial correspondence. These technologies are thus too expensive and/or too bulky and/or consume too much power to be effectively implemented in mobile devices.
[0005] There is thus a need for 3D sensing systems that are power efficient, economical to implement, and compact in size, and that are suited for use in, for example, mobile communication and computing devices.
SUMMARY
[0006] In accordance with one aspect of the concepts, systems, circuits, and techniques described herein, a method for spatial information measurement regarding a scene of interest includes illuminating a scene with a signal having an intensity which varies with time (i.e., a time-varying intensity signal); capturing reflections from the scene using at least one detector, the at least one detector being in known positions relative to at least one light source; digitizing signals captured by the at least one detector; processing the digitized signals using parametric signal processing to recover parameters related to a plurality of optical impulse responses of the scene; and processing the parameters related to the optical impulse responses of the scene to develop spatial information regarding the scene.
[0007] In one embodiment, the time-varying intensity signal is provided from an illumination source. In one embodiment, the illumination source may be provided from one or more discrete sources or a continuous source. In one embodiment, the individual discrete sources may be provided, for example, as one or more light sources. In one embodiment, processing the digitized signals includes decreasing the low-frequency content of the digitized signals.
[0008] In one embodiment, processing the digitized signals includes high pass filtering the digitized signals.
[0009] In one embodiment, processing the digitized signals includes using parametric signal deconvolution on at least one of the digitized signals to obtain at least one approximate optical impulse response.
[0010] In one embodiment, the parametric form of the at least one approximate optical impulse response is a piecewise polynomial function or other piecewise continuous function.
[0011] In one embodiment, the parametric form of the at least one approximate optical impulse response is a sum of short pulse functions.
[0012] In one embodiment, processing the parameters includes processing a plurality of approximate optical impulse responses to identify a region of the scene from which a substantial fraction of the reflected light originates.
[0013] In one embodiment, processing the parameters includes processing a plurality of approximate optical impulse responses to identify the position and orientation of an object in the scene relative to the information measurement apparatus.
[0014] In one embodiment, the method further comprises capturing a two-dimensional image of the scene using a camera.
[0015] In one embodiment, the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the method further comprises improving spatial resolution of the estimated spatial position of the at least one object using the two-dimensional image.
[0016] In one embodiment, the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the method further comprises applying a search or optimization process to the two-dimensional image using the estimated spatial position of the at least one object.
[0017] In one embodiment, the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the method further comprises generating another image using the two-dimensional image based, at least in part, on the estimated spatial position of the at least one object.
[0018] In one embodiment, the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the method further comprises estimating pose, position, or orientation of at least one of a user's hand or other
body part or parts, a body, a stylus, or other object using the estimated spatial position of the at least one object.
[0019] In one embodiment, the method further comprises generating a region of interest of the scene using the estimated spatial position of the at least one object, and searching for the user's hand or other body part or parts, the body, the stylus, or the other object within the region of interest.
[0020] In one embodiment, generating a region of interest of the scene includes generating a region of interest using an estimated spatial position of at least one surface, and the method further comprises generating an image to change the appearance of the region of interest.
[0021] In one embodiment, the method further comprises identifying at least one gesture using the spatial information regarding the scene, and interacting with a
representation of the two-dimensional image or portions thereof using the at least one gesture.
[0022] In one embodiment, the method further comprises: identifying at least one gesture using the spatial information regarding the scene, and controlling an augmented reality display using the at least one gesture.
[0023] In one embodiment, the method further comprises: identifying at least one gesture using the spatial information regarding the scene, and controlling an electronic device using the at least one gesture. In one embodiment, the electronic device includes one of a phone, a tablet, an e-book reader, a watch, smart glasses, or a computer.
[0024] In accordance with another aspect of the concepts, systems, circuits, and techniques described herein, a data acquisition system comprises: at least one light source to illuminate a scene; at least one detector to detect light signals reflected from the scene, the at least one detector being in known positions with respect to the at least one light source; one or more digitizers to digitize output signals of the at least one detector; and at least one digital processor to: (a) process digitized output signals of the at least one detector using parametric signal processing to determine parameters related to a plurality of optical impulse responses of the scene of interest; and (b) process the parameters related to the plurality of optical impulse responses of the scene to develop spatial information regarding the scene.
[0025] In one embodiment, the at least one light source is controlled to illuminate the scene with a time-varying intensity.
[0026] In one embodiment, the at least one light source is controlled to illuminate the scene with a series of short light pulses.
[0027] In one embodiment, the at least one digital processor is configured to process digitized output signals of the at least one detector to decrease low frequency content therein.
[0028] In one embodiment, the at least one digital processor is configured to use parametric signal deconvolution to process at least one digitized output signal to obtain at least one approximate optical impulse response.
[0029] In one embodiment, the parametric form of the at least one approximate optical impulse response is a piecewise polynomial function or other piecewise continuous function.
[0030] In one embodiment, the parametric form of the at least one approximate optical impulse response is a sum of short pulse functions.
[0031] In one embodiment, the at least one digital processor is configured to process the parameters related to the plurality of approximate optical impulse responses to identify a region of the scene from which a substantial fraction of the reflected light originates.
[0032] In one embodiment, the at least one digital processor is configured to process the parameters related to the plurality of approximate optical impulse responses to identify the position and orientation of an object in the scene relative to the data acquisition system.
[0033] In one embodiment, the data acquisition system further comprises a camera for capturing a two-dimensional image of the scene. In one embodiment, the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the at least one digital processor is configured to enhance the spatial resolution of the estimated spatial position of the at least one object using the two- dimensional image.
[0034] In one embodiment, the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the at least one digital processor is configured to apply a search or optimization process to the two-dimensional image using the estimated spatial position of the at least one object.
[0035] In one embodiment, the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the at least one digital processor is configured to generate another image using the two-dimensional image based, at least in part, on the estimated spatial position of the at least one object.
[0036] In one embodiment, the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the at least one digital processor is configured to estimate pose, position, or orientation of at least one of a user's hand or other body part or parts, a body, a stylus, or other structure using the estimated spatial position of the at least one object.
[0037] In one embodiment, the at least one digital processor is configured to generate a region of interest of the scene using the estimated spatial position of the at least one object; and the at least one digital processor is configured to search for at least one of the user's hand or other body part or parts, the body, the stylus, or the other structure within the region of interest.
[0038] In one embodiment, the at least one digital processor is configured to generate a region of interest of the scene using an estimated spatial position of at least one surface; and the at least one digital processor is configured to generate an image to change the appearance of the region of interest.
[0039] In one embodiment, the at least one digital processor is configured to identify at least one gesture using the spatial information regarding the scene; and the at least one digital processor is configured to interact with a representation of the two-dimensional image or portions thereof using the at least one gesture.
[0040] In one embodiment, the at least one digital processor is configured to identify at least one gesture using the spatial information regarding the scene; and the at least one digital processor is configured to control an augmented reality display using the at least one gesture.
[0041] In one embodiment, the at least one digital processor is configured to identify at least one gesture using the spatial information regarding the scene; and the at least one digital processor is configured to control an electronic device using the at least one gesture. In one embodiment, the electronic device includes one of a phone, a tablet, an e-book
reader, a watch, smart glasses, or a computer. In one embodiment, the data acquisition system is part of the electronic device.
[0042] In one embodiment, the at least one detector includes a photodiode.
[0043] In one embodiment, the at least one light source includes at least one of: a light emitting diode or a laser diode.
[0044] In one embodiment, the at least one light source and the at least one detector are fixed within a common plane.
[0045] In one embodiment, the at least one detector operates without focusing optics.
[0046] In one embodiment, the at least one detector consists of six or fewer single-pixel detectors.
[0047] In accordance with still another aspect of the concepts, systems, circuits, and techniques described herein, a method for spatial localization of objects using at least one active illumination source and multiple detectors comprises: illuminating a scene with an impulse illumination via the at least one active illumination source; receiving, at each of the multiple detectors, a response to the impulse illumination of the scene; processing the response received at each detector using parametric signal processing to determine distance information for one or more objects in the scene with respect to the detector and the illumination source; and using a localization process to reconstruct positions in space for the one or more objects in the scene.
[0048] In one embodiment, the method is used to determine a spatial location of hands for a human-computer interaction.
[0049] In accordance with a further aspect of the concepts, systems, circuits, and techniques described herein, a depth imager comprises: a modulated illumination source to generate a periodic light waveform to illuminate a scene; multiple time-resolved detectors spaced in a plane with the modulated illumination source to receive reflected light from the scene; and a processor to: process signals output by the multiple time-resolved detectors using parametric signal processing to identify depths of interest for objects in the scene; and use source localization and the depths of interest to identify spatial positions for objects in the scene.
[0050] In one embodiment, the multiple time-resolved detectors include multiple photodiodes.
[0051] In one embodiment, the multiple time-resolved detectors include four time-resolved detectors.
[0052] In one embodiment, the four time-resolved detectors are spaced at the corners of a rectangle in a common plane with the modulated illumination source.
[0053] In one embodiment, the modulated illumination source is provided as a low-bandwidth light emitting diode or laser diode.
BRIEF DESCRIPTION OF THE DRAWINGS
[0054] The foregoing features may be more fully understood from the following description of the drawings in which:
[0055] Fig. 1 is a block diagram illustrating an exemplary data acquisition system that may be used to acquire object localization information for objects within a 3-dimensional scene in accordance with an embodiment;
[0056] Fig. 2 is a diagram illustrating an exemplary process for modeling the impulse response of a scene of interest in accordance with an embodiment;
[0057] Fig. 3 is a flowchart illustrating an exemplary method for determining the locations of objects that provide high frequency content in a scene response in accordance with an embodiment;
[0058] Fig. 4 is a diagram illustrating an exemplary signal processing pipeline that may be used to develop object localization information for a scene in accordance with an embodiment;
[0059] Fig. 5 is a flowchart illustrating an exemplary method for performing parametric signal processing in accordance with an embodiment;
[0060] Fig. 6 is a diagram illustrating a smart phone having a light source and four detectors in accordance with an embodiment;
[0061] Fig. 7A is a diagram illustrating a laptop computer having a light source and four detectors in accordance with an embodiment;
[0062] Fig. 7B is a diagram illustrating a flat panel monitor having a light source and four detectors in accordance with an embodiment;
[0063] Fig. 8 is a diagram illustrating smart glasses having a light source and one or more detectors in accordance with an embodiment;
[0064] Fig. 9 is a diagram illustrating a setup for use in performing hand tracking in accordance with an embodiment;
[0065] Fig. 10 is a diagram illustrating a setup for use in generating physically-accurate rendered augmentations in accordance with an embodiment;
[0066] Fig. 11 is a signal flow diagram representing an exemplary signal acquisition pipeline at a detector in accordance with an embodiment;
[0067] Fig. 12 includes a series of graphs illustrating, in a top row, continuous-time scene impulse responses for hand gesture and single plane applications and, in a bottom row, the samples acquired at individual detectors in the absence of noise for Gaussian s(t) and h_k(t) for hand gesture and single plane applications;
[0068] Fig. 13 includes a pair of graphs illustrating the results of signal parameter estimation for a hand tracking application (left) and a single plane application (right);
[0069] Fig. 14 includes a pair of graphs illustrating the effects of noise on accuracy for localizing each of two hands in a detector's FOV (left) and recovering plane location and orientation information (right);
[0070] Fig. 15 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a user holds a mobile phone (e.g., a smart phone, etc.) containing a sensing apparatus in accordance with an embodiment;
[0071] Fig. 16 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a user wears smart glasses containing a sensing apparatus in accordance with an embodiment;
[0072] Fig. 17 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a computer contains a sensing apparatus in accordance with an embodiment;
[0073] Figs. 18A and 18B are diagrams illustrating an exemplary use of a data acquisition system as described herein where a user wears smart glasses containing a sensing apparatus in accordance with an embodiment;
[0074] Fig. 19 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a tablet contains a sensing apparatus in accordance with an embodiment; and
[0075] Figs. 20A and 20B are diagrams illustrating an exemplary use of a device including a camera that contains a sensing apparatus in accordance with an embodiment.
DETAILED DESCRIPTION
[0076] Techniques, systems, and devices described herein relate to active optical sensor systems that are capable of capturing three-dimensional (3D) scene structure. Systems and techniques are provided herein that are capable of capturing 3D information for a scene of interest in a low complexity and low power manner. The techniques are capable of being implemented using relatively simple hardware construction and standard, commercially available components in a compact form factor. The techniques are also capable of producing accurate object location information with high frame rates, insensitivity to ambient light, and low power consumption. Unlike existing depth sensors, the techniques and systems described herein do not provide full spatial resolution depth maps for general scenes, but rather provide high 3D object localization accuracy in application specific scenarios. The techniques are capable of accurately estimating and tracking locations of a small number of distinct objects in a scene for applications such as, for example, hand gesture tracking for controlling mobile devices and 3D sensing for interactive augmented reality scenarios. As used herein, the term "gesture" and related terms encompass communication from machine to machine or human to machine based on static or dynamic positioning of any part of a body or extension to a body such as a stylus or other object. In addition, the techniques described herein are capable of achieving accurate gesture identification or object localization using one or more relatively low bandwidth detectors and light sources. In some implementations, sub-centimeter range resolutions have been achieved using the disclosed technology.
[0077] In at least one embodiment, a system is provided for determining object locations in a scene of interest using, for example, a single low cost (omnidirectional) light source and a small number of unfocused photo-detectors (although other configurations may alternatively be used). Active time-of-flight (TOF) techniques may be used that are based on parametric modeling and processing of scene impulse responses. In some embodiments,
parametric signal processing may be used to first measure distances associated with one or more objects within a scene of interest at multiple different detectors. Object localization procedures may then be employed to determine the spatial locations of the different objects based on the distance information.
[0078] The techniques and structures described herein may utilize a signal processing approach called "parametric deconvolution" (see, e.g., "Parametric Deconvolution of Positive Spike Trains" by Li and Speed, Annals of Statistics, vol. 28, no. 5, pp. 1279-1301, Oct. 2000, which is incorporated by reference herein in its entirety, and references therein). Parametric deconvolution has been used previously in processing optical intensity signals (see "Exploiting Sparsity in Time-of-Flight Range Acquisition Using a Single Time-Resolved Sensor," by Kirmani et al., Optics Express, vol. 19, no. 22, pp. 21485-21507, Oct. 2011, which is hereby incorporated by reference in its entirety). In previous uses, transverse spatial resolution was obtained through the use of spatially-varying illumination intensity or spatially-varying detection sensitivity. In contrast, the techniques and systems described herein can generate spatial information with omnidirectional illumination and detection. Where imaging with omnidirectional illumination and detection was achieved previously (see, e.g., "Diffuse Imaging: Creating Optical Images with Unfocused Time-Resolved Illumination and Sensing" by Kirmani et al., IEEE Signal Processing Letters, vol. 19, no. 1, pp. 31-34, Jan. 2012), the non-parametric modeling and reconstruction methods required a large product of the number of illumination sources and number of detectors to achieve useful resolution. In contrast, some embodiments of the techniques and systems described herein have only a small number of illumination sources and a small number of detectors.
[0079] Mature 3D acquisition technologies, like Microsoft's Kinect, are power hungry and bulky. Traditional 3D sensing devices typically initially acquire a full depth map. The application pipeline will then use only the important edge information and discard the rest of the acquired pixels. In contrast, the techniques and systems described herein can accurately identify edge information during acquisition, thereby eliminating the collection of wasteful pixels during the acquisition process.
[0080] In conceiving some of the features described herein, it was appreciated that reflected light signals from objects in many types of scenes take the form of parametric signals. In contrast to general signals, parametric signals may be described using only a
small number of parameters. As will be described in greater detail, in some embodiments, parametric signal processing techniques may be used including, for example, parametric signal deconvolution and other techniques.
[0081] Fig. 1 is a block diagram illustrating an exemplary data acquisition system 10 that may be used to acquire object localization information for objects within a 3-dimensional scene 12. The data acquisition system 10 may use optical time-of-flight information to determine locations of objects within the scene. In some implementations, the data acquisition system 10 may be implemented as a stand-alone unit. In other implementations, the data acquisition system 10 may be made part of a larger system, such as a desktop computer system, to serve as, for example, a user input device. The data acquisition system 10 may be used in any type of electronic system. Because of its relative simplicity, low cost, and compact size capabilities, the data acquisition system 10 is particularly suited for use in portable and mobile devices such as, for example, smart phones and other handheld communication devices, laptop and tablet computers, portable audio/video players, smart glasses, and other portable devices.
[0082] As illustrated in Fig. 1, data acquisition system 10 may include: a light source 16; one or more photo detectors 18, 20, 22, 24; one or more analog-to-digital (A/D) converters 26, 28, 30, 32; a signal processing unit 34; memory 36; a controller 38; and an RGB camera 40. The light source 16 may include any type of light source that is capable of illuminating the scene 12 with an impulse of light of sufficient intensity to be reflected from objects within the scene 12 and then be sensed by the photo detectors 18, 20, 22, 24. In some implementations, a single low-cost, omnidirectional light source may be used (e.g., a single light emitting diode, etc.). However, other types of light sources may alternatively be used (e.g., laser diodes, light sources including multiple different light generating elements, vertical-cavity surface-emitting diode lasers, and others known to those skilled in the art). In some embodiments, the light source will be modulated during operation. For example, in some implementations, the light source 16 may be modulated to provide a train of successive light pulses for use in tracking the locations of objects (e.g., hands, etc.) in the scene 12.
[0083] The photo-detectors 18, 20, 22, 24 may include any type of detectors or sensors that are capable of sensing the type of light (e.g., wavelength and bandwidth, etc.) generated by light source 16. In some embodiments, the photo-detectors 18, 20, 22, 24 may consist of
low-cost, unfocused, single-pixel photo-detectors, such as silicon PIN photodiodes. Other types of photo-detectors may alternatively be used, as would be known to those skilled in the art. As described above, the photo-detectors 18, 20, 22, 24 are operative for sensing light signals reflected from the scene 12 (e.g., from objects within scene 12). In some implementations, the photo-detectors 18, 20, 22, 24 are unfocused detectors that do not make use of focusing optics (although focusing optics may be provided in some embodiments). In some embodiments, optical filtering may preferentially sense the wavelengths of light used for illumination and reject other wavelengths to reduce the effect of ambient light. In some embodiments, optical sensing may preferentially reject the constant and slowly-varying components of the light to further reduce the effect of ambient light. In the illustrated embodiment, four photo-detectors 18, 20, 22, 24 are used. It should be appreciated, however, that any number of detectors may be used in different
embodiments (e.g., two or more if there is a single light source, or one or more if there is more than one light source). In at least one approach, six or fewer single pixel photo- detectors are provided. In general, the photo-detectors 18, 20, 22, 24 will be located in known positions with respect to light source 16. These known positions may be fixed with respect to the light source, but in some embodiments the positions may vary with respect to the light source in a known manner. In some implementations, the photo-detectors 18, 20, 22, 24 may be mounted within the same plane (or substantially the same plane) as light source 16, although non-planar implementations also exist.
[0084] The A/D converters 26, 28, 30, 32 are operative for sampling the output signals of the photo-detectors 18, 20, 22, 24 and converting them to a digital representation (i.e., digitizing the signals). As shown in Fig. 1, one A/D converter may be provided for each detector in some embodiments. The signal processing unit 34 is operative for processing the digitized output signals of the detectors 18, 20, 22, 24 to determine object spatial position information for the scene of interest. Controller 38 may be used to control the various components of data acquisition system 10 to generate the object position information. Controller 38 may control, for example, when light source 16 is illuminated, the intensity of the light signal generated, when A/D converters 26, 28, 30, 32 are activated, sampling rate of the A/D converters 26, 28, 30, 32, and/or other operational parameters of system 10. In some implementations, signal processing unit 34 and controller 38 may be implemented using one or more digital processing devices. The digital processing device(s) may include, for example, a general purpose microprocessor, a digital signal processor
(DSP), a reduced instruction set computer (RISC), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic array (PLA), a microcontroller, an embedded processor, and/or others, including combinations of the above.
[0085] Memory 36 may be used by signal processing unit 34 to, for example, store object spatial position information for the scene of interest. Memory 36 may also be used to store, for example, an operating system to be executed within data acquisition system 10 or a host device, one or more application programs to be executed within data acquisition system 10 or a host device, user data, and/or other information. Memory 36 may include, for example, random access memories (RAMs), read only memories (ROMs), erasable programmable ROMs (EPROMs), electrically erasable programmable ROMs (EEPROMs), semiconductor memories, registers, disk-based data storage, floppy disks, hard disks, optical disks, compact disc read only memories (CD-ROMs), Blu-ray disks, magneto-optical disks, magnetic or optical cards, flash memory, and/or other types of digital data storage.
[0086] As will be described in greater detail, in some implementations, signal processing unit 34 may use a signal processing technique known as "parametric signal processing" to process reflected light signals detected by each of the photo-detectors 18, 20, 22, 24. This parametric signal processing may generate distinct distance data related to different objects within the scene 12 with respect to the corresponding detectors 18, 20, 22, 24. Localization processing may then be used to process the distance data associated with the detectors 18, 20, 22, 24 to determine spatial regions within which the different objects in the scene 12 are located. In at least one embodiment, multilateration techniques are used by the signal processing unit 34 to process the distance data to determine the spatial regions, although other location techniques may be used in other embodiments. In some
embodiments, optional RGB camera 40 may be used to take one or more 2-dimensional pictures of scene 12. Processing applied to the 2-dimensional picture(s) may then be modified based on the spatial regions computed for the objects in the scene in a fusion process. Although illustrated as an RGB camera in Fig. 1, it should be appreciated that any type of 2-dimensional imaging device may be used.
[0087] As part of the processing in signal processing unit 34, a digitized signal may be processed using parametric signal deconvolution to generate an approximate optical impulse response for a scene. Multiple digitized signals may result in multiple approximate optical
impulse responses for the scene. These multiple scene impulse responses may then be processed to develop spatial position information for objects in the scene. Each digitized signal may correspond, for example, to a different source-detector pair. Thus, multiple different optical impulse responses may be achieved using, for example, multiple light sources with a single detector, a single light source with multiple detectors, multiple light sources with multiple detectors, or a single light source and a single detector where one moves in a known manner relative to the other.
[0088] Fig. 2 is a diagram illustrating an exemplary process for modeling the impulse responses of the scene of interest (e.g., scene 12 in Fig. 1) in accordance with an embodiment. As shown in Fig. 2, a rather simple scene includes a ball 50 and a mug 52 lying on a flat surface 54 (e.g., a table top, etc.). A light source 56 emits an omnidirectional light pulse toward the scene. The light pulse may include, for example, an impulse of light that may be described using the Dirac delta function δ(t). It should be appreciated that the Dirac delta function is an idealized form of impulse that may be difficult or impossible to achieve in practice. Those skilled in the art will recognize that other light intensity waveforms may alternatively be used, including those that approximate a Dirac delta function as well as others. As used herein, the term "pulse" is not limited to impulse type waveforms. The light pulse is reflected from each of the objects in the scene and the reflected signals are sensed by two detectors 58, 60. Each detector will receive one reflection component for each of the two objects in the scene. However, because of the different distances travelled by the light, and other reasons, the reflected light signals received at the two detectors from a particular object may differ in delay, in amplitude, and in other ways (e.g., B_1(t) at detector 58 and B_2(t) at detector 60 resulting from ball 50, or M_1(t) at detector 58 and M_2(t) at detector 60 resulting from mug 52). Reflected light signals from surface 54 (e.g., b_1(t) and b_2(t)) may also be received at the two detectors 58, 60. The reflected light signals for all objects within the scene will combine in the detectors 58, 60. These combined signals will then be digitized and processed to determine spatial locations for the objects within the scene.
[0089] The main idea behind the processing is that when small objects that occupy a small portion of a sensor's field of view (FOV) (e.g., ball 50 and mug 52) are illuminated with a short impulse of light, then the reflected signal will be a short time duration signal having a fast rising edge and a fast falling edge (e.g., B_1(t) and B_2(t) in Fig. 2, etc.). Thus, distinct small objects in the scene will contribute to the high frequency content of the scene
impulse response. In contrast, large smooth objects like surface 54 will contribute smooth, low frequency content to the reflected response. Therefore, in some implementations, one assumption behind the digital processing that may be used to process the reflected signals is that the high frequency response contributed by each of the small, distinct objects is well- modeled using parametric signals, irrespective of the object shape. In one possible approach, for example, the high frequency scene response may be modeled as the sum of time-shifted boxcar functions, with one boxcar function from each of the distinct objects (as shown in Fig. 2).
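For illustration, such a composite response might be synthesized as follows (all pulse times, widths, and amplitudes are arbitrary demonstration values, not taken from the source):

```python
import numpy as np

def boxcar(t, t0, width, amp):
    """Reflection from a small, distinct object: a time-shifted boxcar."""
    return amp * ((t >= t0) & (t < t0 + width)).astype(float)

t = np.linspace(0.0, 20e-9, 2000)                # 20 ns observation window
r = (boxcar(t, 3e-9, 0.4e-9, 1.0)                # small object 1 (e.g., ball)
     + boxcar(t, 7e-9, 0.8e-9, 0.6)              # small object 2 (e.g., mug)
     + 0.2 * np.clip((t - 5e-9) / 10e-9, 0, 1))  # smooth large-surface term
```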
[0090] The use of parametric signal processing enables the following key advantages to be achieved: higher depth accuracy and resolution relative to the modulation bandwidth, effective rejection of smooth scene components including ambient light, simplified hardware architecture, and estimation of more than one depth value per sensor pixel. By using parametric modeling and processing of the scene impulse response, therefore, the cost, power consumption, and complexity of 3D acquisition can be significantly reduced. In addition, because the described techniques rely on high frequency content of detected signals, multipath distortion effects will have reduced significance since most multipath scattering of light in natural scenes is diffuse and contributes only a low frequency response.
[0091] It is well known from the theory of parametric signal estimation that high frequency signals which are well approximated using weighted sums of time-shifted boxcar functions can be robustly sampled and reconstructed at sub-Nyquist rates by first convolving the signals with a low-pass filter response to avoid aliasing, and then sampling and performing nonlinear spectral estimation on the result to estimate the signal parameters (see, e.g., "Sampling Signals with Finite Rate of Innovation" by Vetterli et al., IEEE Transactions on Signal Processing, vol. 50, no. 6, pp. 1417-1428, June 2002, "Sampling Moments and Reconstructing Signals of Finite Rate of Innovation: Shannon Meets Strang-Fix" by Dragotti et al., IEEE Transactions on Signal Processing, vol. 55, no. 5, pp. 1741-1757, May 2007, and "Sparse Sampling of Signal Innovations" by Blu et al., IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 31-40, March 2008, all of which are incorporated by reference herein in their entireties). In some implementations, therefore, the goal of the digital processing may be to estimate the 3D positions of scene objects that contribute to the high frequency scene response.
[0092] Fig. 3 is a flowchart illustrating an exemplary method 80 for determining the
location of objects that provide high frequency content in a scene response in accordance with an embodiment. A light pulse is first transmitted toward a scene of interest (block 82). Reflected light pulse components are then detected at multiple light detectors (block 84). The reflected light pulse components may, for example, result from different objects within the scene of interest. The output signals of the detectors may then each be digitized (block 86). The digitized samples associated with each detector may then be digitally processed using parametric signal processing to recover parameters related to peak amplitudes and distance (or time) of signal jumps (block 88). The signal jumps may each correspond to a distinct object within the scene of interest. As will be described in greater detail, source localization processing may next be performed using the parameter information associated with each of the detectors to determine estimated spatial regions within which each object within the scene resides (block 90).
[0093] In some implementations, this spatial region information may be used to perform one or more object location related applications (e.g., gesture tracking, augmented reality, etc.). However, in some implementations, further processing may first be performed to refine the object spatial region information (e.g., to improve spatial resolution) before it is used. For example, in some embodiments, a two-dimensional image of the scene of interest may be obtained using, for example, an RGB camera or other two dimensional imager (block 92). The two dimensional image may then be fused with the object spatial region information to improve the spatial resolution thereof (block 94). Example techniques for performing the fusion process will be described below. As will be appreciated, the above- described process may be continually repeated in order to track object position within a scene of interest. For example, in some implementations, a train of light pulses may be used to illuminate the scene of interest, where the object location information is updated for each new pulse. Other tracking techniques are also possible.
[0094] Fig. 4 is a diagram illustrating an exemplary signal processing pipeline that may be used to develop object localization information for a scene in accordance with an embodiment. To simplify description, the pipeline of Fig. 4 assumes a similar scene to that shown in Fig. 2, with a ball and a mug resting on a flat surface. The pipeline of Fig. 4 also assumes that unfocused detectors are being used. As illustrated, a scene response 100, 102 is received at each of two detectors. Each scene response includes a superposition of responses received from different objects in the scene. The scene response of each small, distinct object in the scene will take the form of (or at least approximate) a boxcar function
at each detector. The boxcar functions will add at the detector (e.g., r_1(t) = B_1(t) + M_1(t), etc.). Due to optical path length differences, each of the baseline separated, unfocused photo-detectors will receive a different boxcar function combination. It is these
"parameter" variations that encode the 3D locations of the objects. A response associated with large smooth structures in the scene (e.g., bi(t) and b2(t) for the flat surface) will also add into the overall scene response at each detector.
[0095] The scene responses 100, 102 received at the detectors will be processed by the transfer functions (e.g., impulse responses) of the corresponding detectors 104, 106 (e.g., h_1(t), h_2(t)). As is well known, this amounts to a convolution operation in the time domain. In many cases, the transfer functions of the different detectors will be substantially the same, although detectors having different transfer functions may be used in some implementations. In general, the transfer functions of the detectors will take the form of low pass filters. The scene responses will then be sampled and converted to a digital format in A/D converters 108, 110 to generate digital responses 112, 114. The digitized time-samples are then processed using parametric signal processing to obtain parameters that completely specify the underlying parametric signals; namely, the positions and amplitudes of signal jumps within the responses. Among other things, this processing may include high pass filtering the digitized time-sample information using digital high pass filtration 116, 118 and spectral estimation. As shown in Fig. 4, in some cases the parameters that are extracted may be inferred from power spectral density (PSD) plots 120, 122 as a function of frequency.
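A rough end-to-end simulation of this pipeline for one detector; the Gaussian detector response, the decimation factor, and the filter settings are all assumed for illustration:

```python
import numpy as np
from scipy import signal

fs = 100e9                                   # dense "continuous" grid, 100 GS/s
t = np.arange(0, 20e-9, 1 / fs)

# Scene response r(t): two boxcars (small objects) plus a smooth background.
r = (((t > 3e-9) & (t < 3.4e-9)) * 1.0
     + ((t > 7e-9) & (t < 7.8e-9)) * 0.6
     + 0.2 * np.clip((t - 5e-9) / 10e-9, 0, 1))

# Detector transfer function h(t), modeled here as a Gaussian low-pass.
h = np.exp(-0.5 * ((t - 1e-9) / 0.3e-9) ** 2)
h /= h.sum()

y = np.convolve(r, h)[:len(t)]               # detector output r(t) * h(t)
y_adc = y[::20]                              # A/D sampling at fs/20 = 5 GS/s

# Digital high-pass filtering suppresses the smooth background component,
# leaving the jumps contributed by the small, distinct objects.
b, a = signal.butter(2, 0.05, btype="highpass")
y_hp = signal.filtfilt(b, a, y_adc)

# Power spectral density of the filtered samples, from which the jump
# parameters (positions/amplitudes) are subsequently inferred.
f, psd = signal.welch(y_hp, fs=fs / 20, nperseg=64)
```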
[0096] After the signal parameters have been estimated, a maximum likelihood (ML) source localization process 124 (or similar process) may be undertaken to develop location information for each small, distinct object within the scene (or some subset of some partition of the scene). In one approach, the source localization process will generate 3D regions of interest (ROI) 126 for each object that bounds the object location. If a 2D image 128 is available, the computed 3D ROIs 126 may be fused with the 2D image data to create a depth map of scene objects 130 that contribute to high frequency temporal variations.
[0097] Fig. 5 is a flowchart illustrating an exemplary method 200 for performing parametric signal deconvolution processing in accordance with an embodiment. As described previously, this parametric signal processing may be used to extract signal parameters from scene responses received at different detectors that can then be processed
to determine object location information. The method 200 of Fig. 5 may be used to estimate a scene impulse response using digital samples collected at a detector's output and digital samples of the detector's impulse response. First, K digital samples are obtained from the detector (block 202). The K digital samples may be denoted y_1, y_2, y_3, ..., y_K. An N-point discrete Fourier transform (N > K) of y_1, ..., y_K may then be computed to generate frequency domain samples Y_1, ..., Y_N (block 204). An N-point discrete Fourier transform may then be computed for the detector impulse response to generate frequency domain samples H_1, ..., H_N (block 206). An appropriate interpolation kernel may then be selected based on the kind of scene being imaged (block 208). If the scene to be imaged is mostly comprised of planar objects, then a linear interpolation kernel is appropriate. If there are curved objects in the scene, a higher order kernel like splines may be used. Exponential splines and other parametric function classes known to those skilled in the art may also be used. The Fourier coefficients of the interpolation kernel may be denoted as G_1, ..., G_N.
[0098] The frequency domain samples Y_1, ..., Y_N may next be rescaled by their corresponding values of H_1, ..., H_N and G_1, ..., G_N (block 210). The rescaled data may be denoted as Z_1, ..., Z_N. Next, the rescaled data Z_1, ..., Z_N may be used to estimate the number of discontinuities (or kinks) in the scene impulse response (block 212). In at least one embodiment, this may be accomplished by forming a structured Hankel matrix using Z_1, ..., Z_N, followed by computing the rank of the matrix using singular value decomposition. The number of discontinuities may be denoted by L. Note that by the definition of matrix rank, L < N. The computed value of L may then be used, along with the rescaled data Z_1, ..., Z_N, to compute the positions of the discontinuities in the scene impulse response (block 214). In at least one embodiment, this may be accomplished by forming a structured Hankel matrix H of size (N-K+1) × (L+1) and computing the smallest eigenvector of H to be used as the coefficients of the polynomial whose roots are the estimates for the positions of the kinks in the scene impulse response. The L kink position estimates may be denoted as d_1, ..., d_L. Other methods to estimate kink positions in the scene impulse response, based on spectral estimation techniques, may alternatively be employed. Once the L kink locations have been identified, the amplitudes of the kinks may be estimated using the data Z_1, ..., Z_N and d_1, ..., d_L (block 216). In at least one embodiment, this may be accomplished using a fast implementation of the linear Vandermonde filtering operation. Other techniques may alternatively be used. The amplitude estimates may be denoted as A_1, ..., A_L. The estimates d_1, ..., d_L and A_1, ..., A_L, along with the interpolation kernel G, may then be used to estimate the scene impulse response (block 218).
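A compact sketch of blocks 202-216 for the sum-of-Diracs case follows. It assumes circular convolution so that the DFT relations are exact, and it omits the interpolation kernel G (i.e., G_m = 1); the Hankel/SVD step plays the role of blocks 212-214 and the Vandermonde least-squares solve plays the role of block 216. All names and the Gaussian kernel are illustrative, not the patented implementation.

```python
import numpy as np

def estimate_kinks(y, h, L):
    """Estimate L Dirac positions and amplitudes from samples y of their
    circular convolution with a known kernel h."""
    N = len(y)
    Z = np.fft.fft(y) / np.fft.fft(h)            # rescaled data (block 210)
    Zc = np.concatenate([Z[-L:], Z[:L + 1]])     # keep DFT bins m = -L..L
    # Annihilating filter: null vector of a Hankel-structured matrix
    # built from Zc (blocks 212-214).
    A = np.array([Zc[i:i + L + 1][::-1] for i in range(L + 1)])
    c = np.linalg.svd(A)[2][-1].conj()           # smallest right singular vector
    roots = np.roots(c)                          # roots encode the positions
    t_hat = np.sort(np.mod(-np.angle(roots) / (2 * np.pi), 1.0) * N)
    # Amplitudes via least squares on the Vandermonde system (block 216).
    m = np.arange(-L, L + 1)
    V = np.exp(-2j * np.pi * np.outer(m, t_hat) / N)
    a_hat = np.linalg.lstsq(V, Zc, rcond=None)[0]
    return t_hat, a_hat.real

# Demo: two Diracs blurred by a Gaussian kernel (arbitrary values).
N = 128
n = np.arange(N)
h = np.exp(-0.5 * ((n - 32) / 3.0) ** 2)
x = np.zeros(N); x[20], x[55] = 1.0, 0.6
y = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))
print(estimate_kinks(y, h, L=2))   # positions ~ [20, 55], amplitudes ~ [1.0, 0.6]
```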
[0099] In at least one embodiment, as described above, multilateration techniques may be used to process parameters associated with optical impulse responses of a scene to develop spatial information for the scene (e.g., to develop location information or 3D region of interest information for objects in a scene). As is known, multilateration is a technique that may be used to determine relative location based on measurement of the difference in distance to two or more known locations. As will be appreciated, the use of difference in distance will identify an infinite number of locations that satisfy the measurement. When plotted in three dimensions, these locations may form, for example, an ellipsoid (or a portion thereof). If multiple measurements are taken, a solution may be arrived at as, for example, an intersection of ellipsoids. Other location techniques may alternatively be used.
[00100] In at least one embodiment, source localization may be performed as follows. First, denote by a number L the number of objects producing substantial responses. L distance estimates may then be found for each of a number M of detectors using, for example, the method described in Fig. 5. Each distance estimate for each detector corresponds to a path length from illumination source to scene point to detector and hence to an ellipsoid in the scene volume. In a noiseless setting, LM ellipsoids may be obtained which contain the objects producing substantial responses. To account for noise, for each object, a fitting error may be minimized (such as sum of squared differences) between the path length to/from the candidate location and the M estimated distances for the object. This act may include a search over associations between objects and distances or other suitable optimizations.
[0101] As described above, in some implementations, 2D image information may be fused with object region-of-interest (ROI) information to improve the spatial resolution of the object location information. In some implementations, this processing may only be performed when performing applications that require a higher degree of resolution than provided by the source localization procedure. In some embodiments, fusion is not used at all. In at least one embodiment, fusion may be performed by first denoting by a number L the number of regions of interest (ROIs) in a scene. L ROIs may then be identified using, for example, the method described in the previous paragraph. Next, a two-dimensional image may be captured by a camera. A computer vision algorithm, examples of which are
described below, may then be applied only within the identified ROIs, improving the accuracy and reducing the computational complexity of the computer vision task. As used herein, the term "computer vision" and related terms encompass any computations that substantially use the two-dimensional image and may be modified to exploit one or more identified ROI.
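A hedged sketch of such fusion, assuming a pinhole camera with intrinsics K, ROIs given as corner points in camera coordinates (z > 0), and an application-supplied 2D routine detector_fn; all names are hypothetical:

```python
import numpy as np

def project_roi(roi_corners, K):
    """Project the corners of a 3D ROI through the pinhole intrinsics K
    and take the axis-aligned 2D bounding box."""
    p = (K @ roi_corners.T).T
    px = p[:, :2] / p[:, 2:3]
    x0, y0 = np.floor(px.min(axis=0)).astype(int)
    x1, y1 = np.ceil(px.max(axis=0)).astype(int)
    return max(x0, 0), max(y0, 0), x1, y1

def detect_in_rois(image, rois_3d, K, detector_fn):
    """Run the 2D computer vision routine only inside the projected ROIs
    rather than over the full frame, reducing computation."""
    results = []
    for roi in rois_3d:
        x0, y0, x1, y1 = project_roi(roi, K)
        results.append(detector_fn(image[y0:y1, x0:x1]))
    return results
```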
[0102] As described previously, the techniques and structures described herein are particularly well suited for use within portable and/or mobile devices and systems.
However, they may also be used in more stationary systems. In general, to use these techniques, a device will need to include, on an exterior portion thereof, at least one light source and at least one light detector. Since the light sources and light detectors will need to be used to acquire at least two approximate optical impulse responses of the scene, it is convenient to illustrate embodiments as having multiple light detectors. It will be appreciated that the effect of having multiple detectors can be achieved by using multiple light sources or by moving a light source or a detector. For example, Fig. 6 is a diagram illustrating a back panel of a smart phone 500 having a light source 502 and four detectors 504, 506, 508, 510 installed thereon. In the illustrated embodiment, the light source 502 is centrally located along a top edge of the panel and the detectors 504, 506, 508, 510 are located in the four corners of the panel. Other arrangements may alternatively be used. An RGB camera 512 may also be provided within the smart phone 500 for use in supporting fusion operations and/or other purposes. As will be appreciated, in some implementations, smart phone 500 may use the object location capabilities described herein to support user input by hand gestures. Other additional or alternative applications of the object location capabilities may also be provided.
[0103] Fig. 7A is a diagram illustrating a laptop computer 520 configured to use the object location techniques described herein in accordance with an embodiment. As shown, a number of photo-detectors 522, 524, 526, 528 may be situated around a bezel of a display 530 of the laptop 520 for use in detecting reflected light signals. One or more light sources 532 may be situated along a top edge of the display 530. Other light source and detector positions may alternatively be used. Again, the object location capabilities of the laptop 520 may, in some implementations, be used to support user input operations by hand gestures. Other applications also exist. In at least one embodiment, one or more pixels of display 530 may be used as a light source for object location operations, thereby dispensing
with the need for a separate light source. A similar arrangement may be used with tablet computers, e-Book readers, or other forms of computing devices.
[0104] Fig. 7B is a diagram illustrating a flat panel monitor 540 configured to use the object location techniques described herein in conjunction with a desktop computer. As shown, a number of photo-detectors 542, 544, 546, 548 may be situated around an edge of a display 550 of the monitor 540 for use in detecting reflected light signals. A light source 552 may be located along a top edge of the monitor 540. Other light source and detector positions may alternatively be used. As shown, hand gestures 554 may be used to provide user input to the desktop computer using the object location capabilities of the computer. Other or alternative applications may also be provided. As with the laptop display, in some embodiments, one or more pixels of display 550 may be used as a light source for object location operations, thereby dispensing with the need for a separate light source.
[0105] Fig. 8 is a diagram illustrating smart glasses 560 configured to use the object localization techniques described herein in accordance with an embodiment. As shown, a number of photo-detectors 562 may be situated on the frame of the glasses 560 along with a light source 564. An RGB camera may also be provided (not shown). As will be appreciated, capabilities described herein may support user input by hand gestures. Other additional or alternative applications of the object location capabilities may also be provided.
[0106] The image capture techniques described herein may be used in a variety of different applications. For example, as described above, the techniques may be used to provide user input to various devices and systems via hand gestures. The techniques may also be used in applications related to augmented reality and information display; smart glasses, smart watches, and other wearable sensor applications; mobile gaming; 3D gaming; navigation; and photography.
[0107] In the discussion that follows, features of the described techniques will be illustrated in the context of simulations for two applications: (a) the localization of two hands for a gestural interface, and (b) estimation of plane position and pose for augmented reality. Fig. 9 is a diagram illustrating a setup for use in performing hand tracking in accordance with an embodiment. Fig. 10 is a diagram illustrating a setup for use in generating physically-accurate rendered augmentations in accordance with an embodiment. Both of these applications rely on the estimation of a few scene features, such as object pose and position. Prior image processing and computer vision techniques operate by first capturing a full 2D or 3D image of a scene, then processing the image to detect objects, and finally performing object parameter estimation (see, e.g., "Computer Vision: Algorithms and Applications" by R. Szeliski, Springer, 2010). These prior techniques are effective for general scenes under favorable lighting and resolution conditions, but they require significant computation and acquisition resources. More significantly, these techniques generate full-resolution images even though the objects of interest in the scenes are simple and few. Using the techniques and features described herein, hands can be tracked and/or planar object pose and orientation can be inferred with low computational complexity using only a few sensors.
[0108] In the scenario of Fig. 9, the features of interest in the scene are the 3D locations of the two hands. In the scenario of Fig. 10, the features of interest are the pose and position of the plane relative to the imaging device. In describing these two applications below, it will be assumed that a single intensity-modulated light source 600 is being used to illuminate the corresponding scenes with a $T$-periodic signal $s(t)$. It will also be assumed that 4 time-resolved detectors 602, 604, 606, 608 are being used to capture light signals reflected from the scenes. It will be appreciated that a plurality of scene impulse responses may be obtained in many other configurations. The intensity of reflected light at detector $k$ will be expressed as $r_k(t)$, where $k = 1, \ldots, 4$. The light source 600 and detectors 602, 604, 606, 608 are synchronized to a common time origin. It will further be assumed that the illumination period $T$ is large enough to avoid distance aliasing. To derive the scene impulse response, the $T$-periodic signal may be set to the impulse function, $s(t) = \delta(t)$.
[0109] The imaging setups and signal models for the two applications will first be discussed. For the hand tracking application of Fig. 9, it will be assumed that the two hands 610, 612 in the scene occupy a small area in the sensor field of view (FOV). Considering each hand 610, 612 as a small planar facet, it has been shown that the impulse response may be accurately modeled as follows:
$$g_k(t) = a_1\,B(t - d_{1k}/c,\ \Delta_1) + a_2\,B(t - d_{2k}/c,\ \Delta_2),$$
where $c$ is the speed of light, $a_i$ is the reflectance of object $i$, $d_{ik}$ is the total length of the path from the source to object $i$ to detector $k$, and
$$B(t, \Delta) = u(t) - u(t - \Delta)$$
denotes the causal box function of width $\Delta$, with $u(t)$ the unit step (see, e.g., "Exploiting Sparsity in Time-of-Flight Range Acquisition Using a Single Time-Resolved Sensor," by Kirmani et al., Optics Express, vol. 19, no. 22, pp. 21485-21507, Oct. 2011). The scene impulse response is therefore the sum of two box functions that are time shifted in proportion to the respective object distances and scaled in amplitude by the object reflectances. The box function widths are governed by the object poses, positions, and areas. As $\Delta \to 0$, it may be assumed that $B(t, \Delta) \approx \Delta\,\delta(t)$, so the response for two small objects in the scene can be
approximated simply as a sum of two scaled, shifted Dirac delta functions:
$$g_k(t) = a_1\,\delta(t - d_{1k}/c) + a_2\,\delta(t - d_{2k}/c).$$
Using this approximation, the locations and amplitudes of the delta functions constitute the signal parameters to be recovered.
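As a concrete illustration of this model, the short Python sketch below computes the Dirac locations $d_{ik}/c$ for an assumed geometry: source at the origin, four detectors at corner positions $(\pm w_1, \pm w_2, 0)$ in the device plane, and two small objects at made-up positions. All numerical values are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

C = 299_792_458.0        # speed of light (m/s)
w1, w2 = 0.125, 0.10     # assumed half-dimensions of the device (m)

source = np.zeros(3)     # illumination source at the origin
detectors = [np.array([sx * w1, sy * w2, 0.0])
             for sx in (-1, 1) for sy in (-1, 1)]

# Two small objects ("hands") at assumed positions in the FOV (z > 0):
objects = [np.array([-0.15, 0.05, 0.40]), np.array([0.20, -0.10, 0.55])]

def total_path(obj, det):
    """d_ik: source -> object i -> detector k path length."""
    return np.linalg.norm(obj - source) + np.linalg.norm(det - obj)

for k, det in enumerate(detectors, start=1):
    delays = [total_path(obj, det) / C for obj in objects]   # d_ik / c
    print(f"detector {k}: Dirac locations "
          + ", ".join(f"{t * 1e9:.2f} ns" for t in delays))
```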
[0110] A scene comprising a single plane occupying the entire FOV will now be considered (Fig. 10). In the analysis that follows, the variable $x = (x_1, x_2) \in [0, L]^2$ will be used to represent a point on the scene plane, the variable $d^{(s)}(x)$ will be used to denote the distance from the illumination source 600 to point $x$, and the variable $d_k(x)$ will be used to denote the distance from point $x$ to detector $k$. The total distance traveled by the contribution from point $x$ can then be expressed as $\tilde{d}_k(x) = d^{(s)}(x) + d_k(x)$. This contribution will be attenuated by the reflectance $f(x)$, the square-law radial fall-off, and $\cos(\theta(x))$ to account for foreshortening of the surface with respect to the illumination, where $\theta(x)$ is the angle between the surface normal at point $x$ and a vector from point $x$ to the illumination source 600. Using $s(t) = \delta(t)$, the amplitude contribution from point $x$ is the light signal $a_k(x)\,f(x)\,\delta\big(t - \tilde{d}_k(x)/c\big)$,
where:
$$a_k(x) = \frac{\cos(\theta(x))}{\big(d^{(s)}(x)\big)^2}. \qquad (1)$$
Combining contributions over the plane, the total light incident at detector $k$ at time $t$ may be expressed as:
$$g_k(t) = \int_{[0,L]^2} a_k(x)\,f(x)\,\delta\big(t - \tilde{d}_k(x)/c\big)\,dx.$$
The intensity $g_k(t)$ thus contains the contour integrals over the object surface, where the contours are ellipses. As will be described in greater detail, $g_k(t)$ will be zero until a certain onset time $\tau_k$ and thereafter will be well approximated by a polynomial spline of degree at most 2.
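The behavior described above can be reproduced numerically. The following sketch (a brute-force quadrature, not the analytical model) discretizes an assumed planar facet, computes each patch's arrival time $\tilde{d}_k(x)/c$ and amplitude $a_k(x)$ with $f(x) = 1$, and histograms the contributions; the result is zero before the onset and piecewise smooth thereafter. The plane size, normal, and bin count are illustrative assumptions.

```python
import numpy as np

C = 3e8  # speed of light (m/s), rounded

def plane_response(detector, n_hat, p0, L=0.5, grid=400, bins=600, T=20e-9):
    """Approximate g_k(t) for an L x L planar facet by direct summation.

    detector -- 3-vector detector position (source assumed at the origin)
    n_hat    -- unit surface normal (must not be parallel to the y-axis here)
    p0       -- a corner of the facet, lying on the plane
    """
    u = np.linspace(0.0, L, grid)
    e1 = np.cross(n_hat, [0.0, 1.0, 0.0]); e1 /= np.linalg.norm(e1)
    e2 = np.cross(n_hat, e1)                         # in-plane basis
    pts = p0 + u[:, None, None] * e1 + u[None, :, None] * e2
    d_src = np.linalg.norm(pts, axis=-1)             # source -> x
    d_det = np.linalg.norm(pts - detector, axis=-1)  # x -> detector k
    t = (d_src + d_det) / C                          # arrival times (ellipses)
    cos_theta = np.abs(pts @ n_hat) / np.maximum(d_src, 1e-12)
    amp = cos_theta / d_src**2                       # a_k(x), with f(x) = 1
    g, edges = np.histogram(t, bins=bins, range=(0.0, T), weights=amp)
    return edges[:-1], g

# Plane with normal (0.6, 0, 0.8) about 1 m away, one corner detector:
t_axis, g = plane_response(np.array([0.125, 0.10, 0.0]),
                           np.array([0.6, 0.0, 0.8]),
                           np.array([0.0, 0.0, 1.0]))
print("onset bin:", int(np.flatnonzero(g > 0)[0]))
```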
[0111] The sampling of the scene response will now be addressed for the two applications. An implementable digital system requires sampling at the detectors.
Moreover, a practical detector has an impulse response $h_k(t)$ that needs to be considered. Furthermore, a Dirac impulse illumination cannot generally be realized in practice. Based on the fact that light transport is linear and time invariant, the signal acquisition pipeline at detector $k$ may be accurately represented using the signal flow diagram of Fig. 11, where $g_k(t)$ is the light incident at detector $k$, $h_k(t)$ is the impulse response of detector $k$, $\eta_k(t)$ is the photodetector noise, and $T_s$ is the sampling period. At detector $k$, $N$ digital samples may be acquired per illumination period using the sampling period of $T_s = T/N$ as follows:
$$r_k[n] = (s * g_k * h_k)(n T_s) + \eta_k(n T_s), \quad n = 0, 1, \ldots, N - 1,$$
where $\eta_k(t)$ represents the photodetector noise. Except at very low flux, $\eta_k(t)$ is modeled well as signal-independent, zero-mean, white and Gaussian with noise variance $\sigma_k^2$. For simplicity, it may be assumed that the 4 detectors have identical responses and noise variances (i.e., $h_k(t) = h(t)$ and $\sigma_k^2 = \sigma^2$ for $k = 1, \ldots, 4$).
[0112] The top row of Fig. 12 shows the continuous-time scene impulse responses for the scenes under consideration, and the bottom row of Fig. 12 shows the samples acquired at the individual detectors in the absence of noise for Gaussian $s(t)$ and $h_k(t)$. The goal is to use the samples $r_k[n]$, $k = 1, \ldots, 4$, to estimate the desired scene features.
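A simulation of this acquisition pipeline, under the Gaussian-pulse assumptions used in the simulations described below, might look as follows. The two Dirac delays, pulse widths, and noise level are illustrative assumptions; the discrete convolution approximates the continuous one up to a factor of $T_s$, and centering the pulses away from zero adds a known, removable offset.

```python
import numpy as np

T, N = 20e-9, 501            # illumination period and samples per period
Ts = T / N                   # sampling period T_s = T / N
t = np.arange(N) * Ts

def gaussian(t, center, fwhm):
    sigma = fwhm / 2.3548    # FWHM -> standard deviation
    return np.exp(-0.5 * ((t - center) / sigma) ** 2)

# Scene impulse response g_k(t): two Diracs at assumed delays d_ik / c
g = np.zeros(N)
for a_i, tau_i in [(0.8, 3.0e-9), (0.6, 4.2e-9)]:
    g[int(round(tau_i / Ts))] += a_i

s = gaussian(t, 2e-9, 1.0e-9)   # 1 ns illumination pulse s(t)
h = gaussian(t, 2e-9, 0.5e-9)   # assumed detector response h(t)
sigma_noise = 0.05              # assumed photodetector noise std

clean = np.convolve(np.convolve(s, g)[:N], h)[:N]   # (s * g * h)(n T_s)
r = clean + sigma_noise * np.random.randn(N)        # samples r_k[n]
print("peak sample:", int(np.argmax(r)))
```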
[0113] The estimation of the scene features will now be described. As described previously, for the hand tracking application, the scene features of interest are the 3D locations of the objects (i.e., hands) in the sensor's FOV. For the single plane, the scene features of interest are the plane position and orientation. In each case, a two-step process may be used to perform the feature estimation. That is, (1) use parametric deconvolution to estimate the scene impulse response $g_k(t)$ from the acquired samples $r_k[n]$; and (2) use the set of estimated signal parameters from the 4 scene impulse responses to recover the scene features.
[0114] In both cases, the coordinate system may be defined relative to the imaging device with the illumination source 600 as the origin and the device lying in the x-y plane (see Figs. 11 and 12). The locations of the detectors 602, 604, 606, 608 may be defined as $(\pm w_1, \pm w_2, 0)$. The imaging direction may be selected as the positive z-direction so that the sensor's FOV lies in the half-space where $z > 0$.
[0115] The estimation of signal parameters for the hand tracking application will now be described. Following Step 1 of the two-step process, the amplitudes and time shifts of the Dirac delta functions in the scene impulse response may be directly estimated. Since the two objects are to be localized, model order 2 may be assumed in the parametric deconvolution scheme and the scene impulse response may be recovered as the sum of 2 Diracs. From the time shifts, the distance estimates $\hat{d}_{1k}$ and $\hat{d}_{2k}$ can be obtained. To recover the spatial locations of the two objects in the FOV, the distances $\hat{d}_{ik}$ may be used.
[0116] Once the estimates $\hat{d}_{1k}$ and $\hat{d}_{2k}$ have been made for each detector $k$ in step 1, the scene features may be recovered in step 2. It must first be determined which estimated distance corresponds to each object. This may be accomplished by first finding the equations describing the 8 total ellipsoids for which the total distance from the source to a point on the ellipsoid and back to detector $k$ is $\hat{d}_{1k}$ or $\hat{d}_{2k}$. In the ideal noiseless case, the 8 ellipsoids can be partitioned into two disjoint sets of 4 ellipsoids each, with the first set defined by $\hat{d}_{1k}$ and the second set defined by $\hat{d}_{2k}$, such that each set intersects in one unique point. These two points of intersection $\hat{x}_1$ and $\hat{x}_2$ are the estimates for the locations of the two objects.
[0117] In the noisy case, the two sets will nearly intersect in one point. The variable $\tilde{d}_k(x)$ may be defined as the total distance traveled by the contribution from point $x$ in the detector's FOV. To estimate the locations of the objects under noisy conditions, the following optimization problem may be solved, which finds the point for which the sum of squared differences between total distances to the point and estimated total distances is minimized:
$$\hat{x}_i = \arg\min_{x} \sum_{k=1}^{4} \big( \tilde{d}_k(x) - \hat{d}_{ik} \big)^2, \qquad i = 1, 2.$$
These $\hat{x}_i$ are the recovered locations of the two objects (e.g., the hands). Note that the estimates in the ideal noiseless case also satisfy this minimization problem.
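A least-squares solver for this minimization can be sketched in a few lines. Here scipy's Nelder-Mead routine stands in for whatever solver an implementation might use, and the device geometry, starting point, and noise level are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

w1, w2 = 0.125, 0.10                     # assumed device half-dimensions (m)
detectors = np.array([[sx * w1, sy * w2, 0.0]
                      for sx in (-1, 1) for sy in (-1, 1)])

def d_total(x, det):
    """Total source -> x -> detector path length (source at the origin)."""
    return np.linalg.norm(x) + np.linalg.norm(det - x)

def localize(d_hat, x0=np.array([0.0, 0.0, 0.5])):
    """Minimize sum_k (d_total(x, det_k) - d_hat_k)^2 over x."""
    cost = lambda x: sum((d_total(x, det) - d) ** 2
                         for det, d in zip(detectors, d_hat))
    return minimize(cost, x0, method="Nelder-Mead").x

# Round trip: noisy distance estimates for a known object position.
x_true = np.array([-0.15, 0.05, 0.40])
d_hat = [d_total(x_true, det) + 1e-4 * np.random.randn() for det in detectors]
print(localize(d_hat))   # approximately x_true
```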
[0118] The estimation of signal parameters for the planar scene will now be described. It has been shown that the impulse responses of large planar scenes can be modeled as continuous piecewise-polynomial signals with several kinks (or discontinuities) in slope. The kinks in the slope directly correspond to spatial locations of scene features (such as, for example, nearest points, edges, and vertices), and thus key parameters that may be recovered from the signal include the time locations of the kinks. For Step 1 of the process, a method for recovering piecewise-polynomial signals may be employed to determine both the locations in time and amplitudes of the kinks (see, e.g., "Sampling Signals with Finite Rate of Innovation," by Vetterli et al., IEEE Trans. Signal Process., vol. 50, no. 6, pp. 1417-1428, Jun. 2002 and "Sparse Sampling of Signal Innovations," by Blu et al., IEEE Signal Process. Mag., vol. 25, no. 2, pp. 31-40, Mar. 2008). Though a typical signal can have as few as 4 or 5 kinks, recovery in practice was found to be more accurate by initially assuming 10 or 12 kinks and rejecting kinks with low amplitude. To determine the location and orientation of the large planar scene, the time location of the first kink, or onset time $\tau_k$, of the impulse response $g_k(t)$ is needed. The onset times correspond to the times at which the light that travels the shortest total path length is incident on the detector. Thus, from each $\tau_k$, the shortest path length $d_k^{\min} = c\,\tau_k$ can be calculated for each source-detector pair $k$.
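The full finite-rate-of-innovation recovery of Vetterli et al. and Blu et al. is beyond the scope of a short sketch, but when only the onset time is needed, a crude threshold detector already illustrates the idea: $\tau_k$ is taken as the first sample that rises meaningfully above the noise floor. The threshold factor and toy waveform below are illustrative assumptions, and in practice the known pulse offset of $s(t)$ and $h(t)$ would be subtracted from the estimate.

```python
import numpy as np

def onset_time(r, Ts, noise_sigma, factor=5.0):
    """Estimate the onset tau_k as the first sample above a noise threshold.

    A simplified stand-in for the cited piecewise-polynomial (FRI)
    recovery; adequate when the first kink is the only feature needed.
    """
    above = np.flatnonzero(r > factor * noise_sigma)
    if above.size == 0:
        raise ValueError("no onset found above the threshold")
    return above[0] * Ts

# Toy response: noise-only prefix, then a step-like return.
Ts, sigma, C = 20e-9 / 501, 0.01, 3e8
r = np.concatenate([0.005 * np.random.randn(120),
                    0.4 * np.ones(60), np.zeros(321)])
tau_k = onset_time(r, Ts, sigma)
d_min = C * tau_k            # shortest path length d_k^min = c * tau_k
print(f"tau_k = {tau_k * 1e9:.2f} ns, d_min = {d_min:.3f} m")
```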
[0119] The plane $P(n)$ may be described by the point $n = (a, b, c)$ on the plane having the minimum distance to the origin. The plane is also equivalently described by the equation $n \cdot x = n \cdot n$. For any plane not passing through the origin, the ordered triple $n$ uniquely determines both the normal direction and a point on the plane. Let $d_k^{\min}(n)$ be the minimum path length from the origin to $P(n)$ and back to detector $k$. With the shortest path length estimate $\hat{d}_k^{\min}$ for each source-detector pair $k$, the following optimization problem may be solved, which finds the plane $P(n)$ for which the sum of squared differences between total distances to the plane and estimated total distances is minimized:
$$\hat{n} = \arg\min_{n} \sum_{k=1}^{4} \big( d_k^{\min}(n) - \hat{d}_k^{\min} \big)^2.$$
The resulting plane $P(\hat{n})$ is the estimate for the plane.
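For a sketch of this step, the model distance $d_k^{\min}(n)$ for an unbounded plane can be computed with the classical mirror-image construction: reflecting detector $k$ across $P(n)$ turns the shortest bounce path into a straight line from the source to the reflected detector. The solver, geometry, and starting point below are illustrative assumptions, chosen to match the simulation plane $n = (0.6, 0, 0.8)$.

```python
import numpy as np
from scipy.optimize import minimize

w1, w2 = 0.125, 0.10
detectors = np.array([[sx * w1, sy * w2, 0.0]
                      for sx in (-1, 1) for sy in (-1, 1)])

def d_min_model(n, det):
    """Shortest source -> P(n) -> detector path, plane P(n): n . x = n . n.

    Reflect the detector across the plane; the shortest bounce path equals
    the straight-line distance from the source (origin) to the reflection.
    Assumes an unbounded plane, as in the single-plane model.
    """
    dist = np.linalg.norm(n)                 # plane distance from the origin
    n_hat = n / dist
    reflected = det - 2.0 * (det @ n_hat - dist) * n_hat
    return np.linalg.norm(reflected)

def fit_plane(d_min_hat, n0=np.array([0.0, 0.0, 0.5])):
    """Minimize sum_k (d_k^min(n) - d_hat_k^min)^2 over n."""
    cost = lambda n: sum((d_min_model(n, det) - d) ** 2
                         for det, d in zip(detectors, d_min_hat))
    return minimize(cost, n0, method="Nelder-Mead").x

# Round trip with the simulation plane n = (0.6, 0, 0.8):
n_true = np.array([0.6, 0.0, 0.8])
d_hat = [d_min_model(n_true, det) for det in detectors]
print(fit_plane(d_hat))   # approximately (0.6, 0, 0.8)
```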
[0120] Simulations were performed for a device of dimensions 25 cm x 20 cm, which is the size of a typical current-generation tablet device. The illumination signal $s(t)$ was a Gaussian pulse of width 1 ns with a pulse repetition rate of 50 MHz (i.e., signal period $T = 20$ ns), with $N = 501$ samples per repetition period. To demonstrate the framework for the two example applications discussed above, the following two cases were considered: (1) two small rectangular planes of dimension 5 cm x 10 cm (approximately the size of average human hands) fronto-parallel to the device; and (2) a single, large, tilted rectangular plane of dimension 50 cm x 50 cm defined by its nearest point and normal direction $n = (0.6, 0, 0.8)$ relative to the device.
[0121] Fig. 13 shows the results of the signal parameter estimation of Step 1 for the hand tracking application (left) and the single plane application (right). These results allow the recovery of the important times and distances needed for estimating scene features. The time locations $d_{ik}/c$ of the hands and the onsets $\tau_k$ of the large plane are captured accurately. Note that the exact amplitudes of the piecewise-polynomial fit for the scene impulse response of the large plane are not completely preserved due to the mismatch in the model, but the time locations of the kinks are still preserved.
[0122] Fig. 14 shows the effects of noise on accuracy for (1) localizing each of the two hands in the detector's FOV and (2) recovering plane location and orientation. The normalized MSE averaged over 500 trials at each SNR level was calculated. It can be seen that the recovery of the two hands is more robust to noise than the recovery of the single plane, due to the lower complexity and better fit of the signal model.
[0123] Fig. 15 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a user holds a mobile phone (e.g., a smart phone) containing a sensing apparatus in accordance with an embodiment. The 3D spatial information estimated in accordance with the invention allows the position of the user's fingers to be localized and tracked over space and time and hence associated with a gesture. The gesture may include movement in three dimensions. Gestural input is possible with a wide variety of different electronic devices containing a sensing apparatus as described herein, including, for example, a phone, a tablet, an e-book reader, a watch, smart glasses, a computer, and/or other devices or systems.
[0124] Fig. 16 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a user wears smart glasses containing a sensing apparatus in accordance with an embodiment. The 3D spatial information estimated using the sensing apparatus allows the position of a planar surface in a user's field of view to be estimated accurately. This enables augmentation of the user's view, with the augmentation having the appearance of being on the planar surface. Augmentations of reality are possible with other display devices and on other types of surfaces.
[0125] Fig. 17 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a computer contains a sensing apparatus in accordance with an embodiment. The 3D spatial information estimated in this scenario may allow the position of the user, the orientation of the screen, the size of the table, and the movements of the user's hands and face to be estimated. As will be appreciated, this information may be of value within any number of different user applications. Many other placements of a sensing apparatus as described herein are also possible.
[0126] Figs. 18A and 18B are diagrams illustrating an exemplary use of a data acquisition system as described herein where a user wears smart glasses containing a sensing apparatus in accordance with an embodiment. The 3D spatial information estimated in this scenario may allow a user to control a gestural interface to a catalog of furniture
items. The gestural interface allows placement of the selected furniture in an augmented-reality view. Similar combinations of gestural interfaces with augmented reality would be clear to those of skill in the art.
[0127] Fig. 19 is a diagram illustrating an exemplary use of a data acquisition system as described herein where a tablet contains a sensing apparatus in accordance with an embodiment. The 3D spatial information estimated in accordance with the invention allows several users to interact simultaneously with the same tablet through a gestural interface. Gestures may include movements in three dimensions in some implementations. Specific uses of such shared gestural interfaces would be clear to those of skill in the art.
[0128] Figs. 20A and 20B are diagrams illustrating an exemplary use of a device including a camera that contains the sensing apparatus in accordance with an embodiment. The user may interactively select spatial boundaries of objects in a photograph or 3D scene and place them at new 3D positions through the use of gestures. The sensing apparatus on the device senses hand gestures and tracks movement of the user's hands for selection and re-positioning of objects. This information may be used, for example, for editing photographs and 3D content, among other uses.
[0129] Having described exemplary embodiments of the invention, it will now become apparent to one of ordinary skill in the art that other embodiments incorporating their concepts may also be used. The invention should not be limited to the disclosed embodiments but rather should be limited only by the spirit and scope of the appended claims. All publications and references cited herein are expressly incorporated herein by reference in their entirety.
Claims
1. A method for spatial information measurement regarding a scene of interest comprising:
illuminating a scene with a time-varying intensity from at least one light source; capturing reflections from the scene using at least one detector, the at least one detector being in known positions relative to the at least one light source;
digitizing signals captured by the at least one detector;
processing the digitized signals using parametric signal processing to recover parameters related to a plurality of optical impulse responses of the scene; and
processing the parameters related to optical impulse responses of the scene to develop spatial information regarding the scene.
2. The method of claim 1, wherein:
processing the digitized signals includes decreasing the low-frequency content of the digitized signals.
3. The method of claim 2, wherein:
processing the digitized signals includes high pass filtering the digitized signals.
4. The method of claim 1, wherein:
processing the digitized signals includes using parametric signal deconvolution on at least one of the digitized signals to obtain at least one approximate optical impulse response.
5. The method of claim 4, wherein:
the parametric form of the at least one approximate optical impulse response is a piecewise polynomial function or other piecewise continuous function.
6. The method of claim 4, wherein:
the parametric form of the at least one approximate optical impulse response is a sum of short pulse functions.
7. The method of claim 4, wherein:
processing the parameters includes processing a plurality of approximate optical impulse responses to identify a region of the scene from which a substantial fraction of the reflected light originates.
8. The method of claim 4, wherein:
processing the parameters includes processing a plurality of approximate optical impulse responses to identify the position and orientation of an object in the scene relative to the information measurement apparatus.
9. The method of claim 1, further comprising:
capturing a two-dimensional image of the scene using a camera.
10. The method of claim 9, wherein:
the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and
the method further comprises improving spatial resolution of the estimated spatial position of the at least one object using the two-dimensional image.
11. The method of claim 9, wherein:
the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and
the method further comprises applying a search or optimization process to the two- dimensional image using the estimated spatial position of the at least one object.
12. The method of claim 9, wherein:
the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and the method further comprises generating another image using the two-dimensional image based, at least in part, on the estimated spatial position of the at least one object.
13. The method of claim 9, wherein:
the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and
the method further comprises estimating pose, position, or orientation of at least one of a user's hand or other body part or parts, a body, a stylus, or other object using the estimated spatial position of the at least one object.
14. The method of claim 13, further comprising:
generating a region of interest of the scene using the estimated spatial position of the at least one object; and
searching for the user's hand or other body part or parts, the body, the stylus, or the other object within the region of interest.
15. The method of claim 14, wherein:
generating a region of interest of the scene includes generating a region of interest using an estimated spatial position of at least one surface; and
the method further comprises generating an image to change the appearance of the region of interest.
16. The method of claim 9, further comprising:
identifying at least one gesture using the spatial information regarding the scene; and interacting with a representation of the two-dimensional image or portions thereof using the at least one gesture.
17. The method of claim 1, further comprising:
identifying at least one gesture using the spatial information regarding the scene; and controlling an augmented reality display using the at least one gesture.
18. The method of claim 1, further comprising:
identifying at least one gesture using the spatial information regarding the scene; and controlling an electronic device using the at least one gesture.
19. The method of claim 18, wherein:
the electronic device includes one of a phone, a tablet, an e-book reader, a watch, smart glasses, or a computer.
20. A data acquisition system comprising:
at least one light source to illuminate a scene;
at least one detector to detect light signals reflected from the scene, the at least one detector being in known positions with respect to the at least one light source;
one or more digitizers to digitize output signals of the at least one detector; and at least one digital processor to: (a) process digitized output signals of the at least one detector using parametric signal processing to determine parameters related to a plurality of optical impulse responses of the scene of interest; and (b) process the parameters related to the plurality of optical impulse responses of the scene to develop spatial information regarding the scene.
21. The data acquisition system of claim 20, wherein:
the at least one light source is controlled to illuminate the scene with a time-varying intensity.
22. The data acquisition system of claim 20, wherein:
the at least one light source is controlled to illuminate the scene with a series of short light pulses.
23. The data acquisition system of claim 20, wherein:
the at least one digital processor is configured to process digitized output signals of the at least one detector to decrease low frequency content therein.
24. The data acquisition system of claim 20, wherein:
the at least one digital processor is configured to use parametric signal deconvolution to process at least one digitized output signal to obtain at least one approximate optical impulse response.
25. The data acquisition system of claim 24, wherein:
the parametric form of the at least one approximate optical impulse response is a piecewise polynomial function or other piecewise continuous function.
26. The data acquisition system of claim 24, wherein:
the parametric form of the at least one approximate optical impulse response is a sum of short pulse functions.
27. The data acquisition system of claim 20, wherein:
the at least one digital processor is configured to process the parameters related to the plurality of approximate optical impulse responses to identify a region of the scene from which a substantial fraction of the reflected light originates.
28. The data acquisition system of claim 20, wherein:
the at least one digital processor is configured to process the parameters related to the plurality of approximate optical impulse responses to identify the position and orientation of an object in the scene relative to the data acquisition system.
29. The data acquisition system of claim 20, further comprising:
a camera for capturing a two-dimensional image of the scene.
30. The data acquisition system of claim 29, wherein:
the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and
the at least one digital processor is configured to enhance the spatial resolution of the estimated spatial position of the at least one object using the two-dimensional image.
31. The data acquisition system of claim 29, wherein:
the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and
the at least one digital processor is configured to apply a search or optimization process to the two-dimensional image using the estimated spatial position of the at least one object.
32. The data acquisition system of claim 29, wherein:
the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and
the at least one digital processor is configured to generate another image using the two-dimensional image based, at least in part, on the estimated spatial position of the at least one object.
33. The data acquisition system of claim 29, wherein:
the spatial information regarding the scene includes an estimated spatial position of at least one object within the scene; and
the at least one digital processor is configured to estimate pose, position, or orientation of at least one of a user's hand or other body part or parts, a body, a stylus, or other structure using the estimated spatial position of the at least one object.
34. The data acquisition system of claim 33, wherein:
the at least one digital processor is configured to generate a region of interest of the scene using the estimated spatial position of the at least one object; and the at least one digital processor is configured to search for at least one of the user's hand or other body part or parts, the body, the stylus, or the other structure within the region of interest.
35. The data acquisition system of claim 29, wherein:
the at least one digital processor is configured to generate a region of interest of the scene using an estimated spatial position of at least one surface; and
the at least one digital processor is configured to generate an image to change the appearance of the region of interest.
36. The data acquisition system of claim 29, wherein:
the at least one digital processor is configured to identify at least one gesture using the spatial information regarding the scene; and
the at least one digital processor is configured to interact with a representation of the two-dimensional image or portions thereof using the at least one gesture.
37. The data acquisition system of claim 20, wherein:
the at least one digital processor is configured to identify at least one gesture using the spatial information regarding the scene; and
the at least one digital processor is configured to control an augmented reality display using the at least one gesture.
38. The data acquisition system of claim 20, wherein:
the at least one digital processor is configured to identify at least one gesture using the spatial information regarding the scene; and
the at least one digital processor is configured to control an electronic device using the at least one gesture.
39. The data acquisition system of claim 38, wherein:
the electronic device includes one of a phone, a tablet, an e-book reader, a watch, smart glasses, or a computer.
40. The data acquisition system of claim 38, wherein:
the data acquisition system is part of the electronic device.
41. The data acquisition system of claim 20, wherein:
the at least one detector includes a photodiode.
42. The data acquisition system of claim 20, wherein:
the at least one light source includes at least one of: a light emitting diode or a laser diode.
43. The data acquisition system of claim 20, wherein:
the at least one light source and the at least one detector are fixed within a common plane.
44. The data acquisition system of claim 20, wherein:
the at least one detector operates without focusing optics.
45. The data acquisition system of claim 20, wherein:
the at least one detector consists of six or fewer single-pixel detectors.
46. A method for spatial localization of objects using at least one active illumination source and multiple detectors, the method comprising:
illuminating a scene with an impulse illumination via the at least one active illumination source;
receiving, at each of the multiple detectors, a response to the impulse illumination of the scene; processing the response received at each detector using parametric signal processing to determine distance information for one or more objects in the scene with respect to the detector and the illumination source; and
using a localization process to reconstruct positions in space for the one or more objects in the scene.
47. The method of claim 46 wherein:
the method is used to determine a spatial location of hands for a human-computer interaction.
48. A depth imager comprising:
a modulated illumination source to generate a periodic light waveform to illuminate a scene;
multiple time-resolved detectors spaced in a plane with the modulated illumination source to receive reflected light from the scene; and
a processor to:
process signals output by the multiple time-resolved detectors using parametric signal processing to identify depths of interest for objects in the scene; and
use source localization and the depths of interest to identify spatial positions for objects in the scene.
49. The depth imager of claim 48, wherein:
the multiple time-resolved detectors include multiple photodiodes.
50. The depth imager of claim 48, wherein:
the multiple time-resolved detectors include four time-resolved detectors.
51. The depth imager of claim 50, wherein:
the four time-resolved detectors are spaced at the corners of a rectangle in a common plane with the modulated illumination source.
52. The depth imager of claim 48, wherein:
the modulated illumination source is provided as a low-bandwidth light emitting diode or laser diode.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201261597233P | 2012-02-10 | 2012-02-10 | |
| US61/597,233 | 2012-02-10 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2013120041A1 true WO2013120041A1 (en) | 2013-08-15 |
Family
ID=48948084
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2013/025468 Ceased WO2013120041A1 (en) | 2012-02-10 | 2013-02-10 | Method and apparatus for 3d spatial localization and tracking of objects using active optical illumination and sensing |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2013120041A1 (en) |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5237642A (en) * | 1986-03-07 | 1993-08-17 | Adler Research Associates | Optimal parametric signal processor |
| US6658203B1 (en) * | 1995-06-26 | 2003-12-02 | Phase One A/S | System compensating for intensity variations in an illuminating light source in successive recording of a single frame image |
| US7339170B2 (en) * | 2003-07-16 | 2008-03-04 | Shrenik Deliwala | Optical encoding and reconstruction |
| US20100260385A1 (en) * | 2007-12-07 | 2010-10-14 | Tom Chau | Method, system and computer program for detecting and characterizing motion |
| US20100328502A1 (en) * | 2009-06-30 | 2010-12-30 | Kabushiki Kaisha Toshiba | Image processing device, image processing method, and imaging apparatus |
Cited By (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| RU2699177C2 (en) * | 2014-06-12 | 2019-09-03 | Тераби С.А.С. | Dynamic tracking system and automatic control method using three-dimensional time-of-flight cameras |
| US10062201B2 (en) | 2015-04-21 | 2018-08-28 | Microsoft Technology Licensing, Llc | Time-of-flight simulation of multipath light phenomena |
| US10311378B2 (en) | 2016-03-13 | 2019-06-04 | Microsoft Technology Licensing, Llc | Depth from time-of-flight using machine learning |
| US10234561B2 (en) | 2016-05-09 | 2019-03-19 | Microsoft Technology Licensing, Llc | Specular reflection removal in time-of-flight camera apparatus |
| US10302768B2 (en) | 2016-05-09 | 2019-05-28 | Microsoft Technology Licensing, Llc | Multipath signal removal in time-of-flight camera apparatus |
| US10311593B2 (en) | 2016-11-16 | 2019-06-04 | International Business Machines Corporation | Object instance identification using three-dimensional spatial configuration |
| EP3489872A1 (en) * | 2017-11-23 | 2019-05-29 | Leuze electronic GmbH + Co. KG | Sensor assembly |
| FR3131005A1 (en) * | 2021-12-21 | 2023-06-23 | Centre National De La Recherche Scientifique | Wide field of view static lidar |
| WO2023117778A1 (en) * | 2021-12-21 | 2023-06-29 | Centre National De La Recherche Scientifique | Wide-field-of-view static lidar |
| CN117111751A (en) * | 2023-10-25 | 2023-11-24 | 北京大学 | Gesture change detection method, device, equipment and medium based on pulse array |
| CN117111751B (en) * | 2023-10-25 | 2024-04-02 | 北京大学 | Gesture change detection method, device, equipment and medium based on pulse array |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2013120041A1 (en) | Method and apparatus for 3d spatial localization and tracking of objects using active optical illumination and sensing | |
| Kadambi et al. | 3d depth cameras in vision: Benefits and limitations of the hardware: With an emphasis on the first-and second-generation kinect models | |
| Matsuda et al. | Mc3d: Motion contrast 3d scanning | |
| KR102738167B1 (en) | Reduced power operation of the time of flight camera | |
| CN113016177B (en) | Depth measurement component with structured light source and time-of-flight camera | |
| Zennaro et al. | Performance evaluation of the 1st and 2nd generation Kinect for multimedia applications | |
| Callenberg et al. | Low-cost spad sensing for non-line-of-sight tracking, material classification and depth imaging | |
| US9602807B2 (en) | Single frequency time of flight de-aliasing | |
| US8983233B2 (en) | Time-of-flight depth imaging | |
| US9989630B2 (en) | Structured-light based multipath cancellation in ToF imaging | |
| CN107111746B (en) | Model fit from raw time-of-flight images | |
| US9389069B2 (en) | Compact 3D depth capture systems | |
| Gupta et al. | What are optimal coding functions for time-of-flight imaging? | |
| US20130088726A1 (en) | Method and Apparatus to Determine Depth Information For A Scene of Interest | |
| CN106066987B (en) | Parameter on-line calibration and compensation in TOF imaging | |
| US20140375541A1 (en) | Eye tracking via depth camera | |
| US11601607B2 (en) | Infrared and non-infrared channel blender for depth mapping using structured light | |
| Feigin et al. | Resolving multipath interference in kinect: An inverse problem approach | |
| Paredes et al. | Performance evaluation of state-of-the-art high-resolution time-of-flight cameras | |
| US9129375B1 (en) | Pose detection | |
| Hermans et al. | Depth from sliding projections | |
| Dashpute et al. | Event-based motion-robust accurate shape estimation for mixed reflectance scenes | |
| Shim et al. | Performance evaluation of time-of-flight and structured light depth sensors in radiometric/geometric variations | |
| Colaco et al. | 3dim: Compact and low power time-of-flight sensor for 3d capture using parametric signal processing | |
| Chen et al. | A learning method to optimize depth accuracy and frame rate for Time of Flight camera |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 13746394; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 13746394; Country of ref document: EP; Kind code of ref document: A1 |