
WO2008073563A1 - Method and system for gaze estimation - Google Patents


Info

Publication number
WO2008073563A1
Authority
WO
WIPO (PCT)
Prior art keywords
interest
region
image capturing
video sequence
model
Prior art date
Application number
PCT/US2007/081023
Other languages
French (fr)
Inventor
Xiaoming Liu
Nils Oliver Krahnstoever
A.G. Amitha Perera
Anthony J. Hoogs
Peter Tu
Gianfranco Doretto
Original Assignee
Nbc Universal, Inc.
Application filed by Nbc Universal, Inc.
Publication of WO2008073563A1
Priority to US12/474,962 (published as US20090290753A1)

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models


Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

A method and system, the method including capturing a video sequence of images with an image capturing system, designating at least one landmark in a region of interest of the captured video sequence, fitting a model of the region of interest to the region of interest in the captured video sequence, and determining a pose parameter for the model fitted to the region of interest.

Description

METHOD AND SYSTEM FOR GAZE ESTIMATION
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims benefit of and priority to U.S. Provisional Patent Application Serial No. 60/869,216, filed on December 8, 2006, entitled "METHOD AND SYSTEM FOR GAZE ESTIMATION", the contents of which are incorporated herein by reference for all purposes.
BACKGROUND
[0002] The present disclosure relates, generally, to gaze estimation. In particular, a system and method are disclosed for determining and presenting an estimate of the gaze of a subject in a video sequence of captured images.
[0003] Regarding captured video of various events, viewing of the video may allow a viewer to see the event from the perspective and location of the subject even though the viewer did not witness the event in person as it occurred. While the video may sufficiently capture and present the event, the presentation of the event may be enhanced to increase the viewing pleasure of the viewer. In some contexts, an on-air commentator may provide commentary in conjunction with a video broadcast in an effort to convey additional knowledge and information regarding the event to the viewer. It is noted, however, that the on-air commentator must take care not to say so much that it, for example, distracts from the video broadcast.
[0004] In some embodiments, it would be beneficial to convey information and data regarding captured video to a viewer using a visualization mechanism as opposed to a spoken commentary. In this manner, the viewing of a video sequence of an event may be enhanced by efficient image visualizations that convey information and data regarding the event.
SUMMARY
[0005] In some embodiments, a method including capturing a video sequence of images with an image capturing system, designating at least one landmark in a region of interest of the captured video sequence, fitting, based on the at least one landmark, a model of the region of interest to the region of interest in the captured video sequence, and determining a pose parameter for the model fitted to the region of interest may be provided. In some embodiments, the pose parameter includes an estimation of a gaze of a subject associated with the region of interest.
[0006] In some embodiments herein, the method may further include determining the pose parameter for the model over a period of time. In some embodiments still, extracted data may be associated with the pose parameter of the region of interest. The extracted data may be presented in a user-viewable format.
[0007] In some aspects, the image capturing system may be calibrated relative to a location of the image capturing system, including determining geometrical information associated with the location. The geometrical information may include at least one of information regarding a specification of the image capturing system components, the location of the image capturing system with respect to an area captured in the video sequence, and a pan, tilt, and roll parameter.
[0008] In some embodiments herein, a system including at least one image capturing device and a processor may be provided to implement the methods disclosed herein. In some embodiments, program instructions or code may be provided on a tangible medium for execution by a system or device (e.g., a processor) to implement some methods herein.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIG. 1 is an illustrative depiction of an image captured by an image capturing system, including gaze estimation overlays, in accordance with some embodiments herein;
[0010] FIG. 2 is an illustrative depiction of a re-visualization of an image captured by an image capturing system, including gaze estimation overlays, in accordance with some embodiments herein;
[0011] FIG. 3 is an illustrative depiction of an image captured by an image capturing system, including a display area, in accordance with some embodiments herein;
[0012] FIG. 4 is an exemplary illustration of a number of models, in accordance herewith;
[0013] FIG. 5 is an exemplary depiction of a number of models, in accordance herewith;
[0014] FIG. 6A is an exemplary illustration of an image captured by an image capturing system, in accordance herewith;
[0015] FIG. 6B is an illustrative depiction of a model used, for example, in association with the captured image of FIG. 6A, in accordance herewith;
[0016] FIG. 7 is an illustrative graphical representation, in accordance with aspects herein; and
[0017] FIG. 8 is an illustrative depiction of a captured image, including visualizations, in accordance with some embodiments herein.
DETAILED DESCRIPTION
[0018] The present disclosure relates to video visualization. In particular, some embodiments herein provide a method, system, apparatus, and program instructions for gaze estimation of an individual captured by a video system.
[0019] A machine-based gaze estimation process and system is provided that determines an estimate of the gaze direction of an individual captured in a video sequence. Some embodiments further provide a visual presentation of the gaze estimation. The visual presentation or visualization of the gaze estimation may be provided alone or in combination with a video sequence and in a variety of formats.
[0020] In some embodiments, a computer vision algorithm estimates the gaze of a subject individual. Portions of the process of estimating the gaze of the individual may be accomplished manually, semi-manually, semi-automatically, or automatically.
[0021] In some embodiments, the gaze estimation process may, in general, comprise two processing stages. A first stage includes a training stage wherein a number of landmarks on a region(s) of interest are labeled. The landmark labeling operation may include manually designating the region(s) of interest given a sequence of video images. In the context of gaze estimation, the region of interest includes the head of the subject individual for whom the gaze estimation is being determined.
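As a minimal illustration of the bookkeeping such a training stage might use, the Python sketch below stores manually labeled landmarks per frame and stacks them into a matrix for model learning. The names (LandmarkAnnotation, as_training_matrix) are hypothetical and not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class LandmarkAnnotation:
    """Manually labeled landmarks for one region of interest in one frame."""
    frame_index: int
    # (x, y) pixel coordinates of each labeled landmark, e.g. points on the
    # outline of the subject's head or helmet.
    points: np.ndarray  # shape (num_landmarks, 2)


def as_training_matrix(annotations: List[LandmarkAnnotation]) -> np.ndarray:
    """Stack all labeled frames into an (N, 2 * num_landmarks) matrix.

    Each row is one training shape with its landmarks flattened as
    [x1, y1, x2, y2, ...]; this is the usual input for learning a
    statistical shape model such as an AAM.
    """
    return np.stack([a.points.reshape(-1) for a in annotations], axis=0)
```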
[0022] In some embodiments, a shape model may be used to represent the shape of a region of interest (i.e., the head of a subject individual). The shape model may be associated with appearance information such as, for example, texture information. In some embodiments, the shape model may be an Active Appearance Model (AAM) using, for example, two subspace models; a deformable model; or a rigid model. Herein, the term AAM should be understood to be one of many model representations that can be used to estimate gaze direction.
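One way to realize the shape-subspace portion of such a model is ordinary principal component analysis over the labeled shapes. The sketch below is a simplification offered only for illustration; it omits Procrustes alignment and the appearance subspace that a full AAM would include.

```python
import numpy as np


def learn_shape_subspace(shapes: np.ndarray, num_modes: int = 4):
    """Learn a linear shape model from labeled shapes.

    shapes: (N, 2 * num_landmarks) matrix, one flattened shape per row.
    Returns the mean shape and the first `num_modes` principal directions.
    """
    mean_shape = shapes.mean(axis=0)
    centered = shapes - mean_shape
    # SVD of the centered data gives the principal shape variation modes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:num_modes]          # (num_modes, 2 * num_landmarks)
    return mean_shape, basis


def shape_parameters(shape: np.ndarray, mean_shape: np.ndarray,
                     basis: np.ndarray) -> np.ndarray:
    """Project a fitted shape onto the subspace to get its shape parameters."""
    return basis @ (shape - mean_shape)
```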
[0023] In a second processing stage, a test stage, the model of the region of interest (i.e., the head) is automatically fitted to the subject individual in a sequence of video. For example, a mesh model may be automatically correlated to the subject individual's head in each frame of a video sequence by estimating the shape and appearance for the subject individual. Based on the resulting shape parameter(s), an estimation of the gaze of the subject individual may be determined for each frame of the video sequence.
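The disclosure does not prescribe how shape parameters are converted into a gaze estimate; purely as an assumed example, the sketch below learns a linear least-squares map from per-frame shape parameters (as produced by the subspace sketch above) to a labeled gaze angle and applies it to new frames.

```python
import numpy as np


def fit_gaze_regressor(train_params: np.ndarray, train_angles: np.ndarray):
    """Least-squares map from shape parameters to gaze angle (degrees).

    train_params: (N, num_modes) shape parameters of the training frames.
    train_angles: (N,) manually labeled gaze angles for those frames.
    """
    # Append a constant column so the regressor has a bias term.
    design = np.hstack([train_params, np.ones((train_params.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, train_angles, rcond=None)
    return weights


def estimate_gaze(params: np.ndarray, weights: np.ndarray) -> float:
    """Apply the learned regressor to one frame's shape parameters."""
    return float(np.append(params, 1.0) @ weights)
```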
[0024] In some embodiments, the gaze estimation methods disclosed herein may efficiently provide gaze estimation in real time. For example, gaze estimation in accordance with the present disclosure may be performed substantially concurrent with the capture of video sequences such that gaze estimation data relating to the captured video sequences is available for presentation, visualization and otherwise, in real time coincident with a live broadcast of the video sequences.
[0025] The images used to learn an AAM may be, in some embodiments, relatively few as compared to the applicability of the AAM. For example, nine (9) images may be used to learn an AAM that in turn is used to estimate the gaze for about one hundred (100) frames of video.
[0026] In some embodiments, the gaze estimation methods disclosed herein may provide gaze estimation data even in an instance where low resolution video is used as a basis for the gaze estimation processing. By using AAM and/or other robust techniques to ascertain the shape and appearance of the subject individual, the methods herein may be effectively used with low resolution video.
[0027] In some embodiments and contexts, the gaze estimation herein may be extended to subject individuals having at least a portion of their face obscured. For example, the gaze estimation methods, systems, and related implementations herein may be used to provide gaze estimation for subject individuals captured on video participating in various contexts and sporting events wherein the face and head of the subject individual are visually obscured, such as in football, hockey, and other activities where a helmet is worn.
[0028] In some embodiments, the gaze direction of a football player may be provided as an overlay in broadcast video footage, in real time or subsequently (e.g., a replay). In the context of a broadcast, on-air commentators may offer, for example, on-air analysis of a quarterback's decision process before and/or during a football play by visually showing the broadcast viewers via gaze estimation overlays how and when the quarterback scans the football field and looks at different receivers and/or defenders before making a football pass.
[0029] Gaze estimation overlays may be obtained using a variety of techniques, ranging from a completely manual technique performed by a graphics artist, without any tools requiring specialized skills and knowledge from the domain of computer vision, to a fully automatic process that requires significant technology from the realm of computer vision.
[0030] Regarding a manual technique, an individual such as, for example, a graphic artist or special effects artist may visually inspect a sequence of video and manually draw lines in every video frame to visually indicate the gaze direction of the football player. In some embodiments, an on-air commentator may use a broadcast tool/process (e.g., a Telestrator®) to manually draw overlays into the broadcast that indicate gaze direction. In this manner, a gaze estimation visualization may be provided.
[0031] In some semi-manual techniques for providing gaze estimation, an operator may manually inspect and draw gaze direction estimation indicators (e.g., lines, highlights, etc.) on certain frames of a sequence of video. The certain frames may be every few "key" frames of video in the footage. An interpolation operation may be performed on the non-key frames to obtain gaze direction estimates for every frame of the video.
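A minimal sketch of that interpolation step, assuming the operator's annotations are stored as (key frame index, gaze angle in degrees) pairs and that the angles do not wrap around between key frames:

```python
import numpy as np


def interpolate_gaze(key_frames, key_angles, num_frames):
    """Linearly interpolate operator-drawn gaze angles to every frame.

    key_frames: increasing frame indices that were annotated by hand.
    key_angles: gaze angle (degrees) drawn at each key frame.
    num_frames: total number of frames in the footage.
    """
    all_frames = np.arange(num_frames)
    return np.interp(all_frames, key_frames, key_angles)


# Example: key frames 0, 30 and 60 annotated at -20, 5 and 40 degrees.
angles = interpolate_gaze([0, 30, 60], [-20.0, 5.0, 40.0], 61)
```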
[0032] In some embodiments, an operator may use a special tool to improve upon the accuracy and/or efficiency of the manual gaze direction estimation process in frames or key-frames. Such a tool may display a graphical model of a football player's helmet or an athlete's head, represented by points and/or lines. The graphical (i.e., virtual) model may be displayed on a display screen and, using a suitable graphical user interface, the location, scale, and pose of the model may be manipulated until there is a good visual match between the virtual model and the true helmet of the subject football player. Accordingly, the gaze direction of the subject player in the video footage would correspond to the pose of the virtual football helmet or head of the subject after alignment.
[0033] In some embodiments, a model of the football helmet or head may be a 3-D model that closely approximates or resembles an actual football helmet. In some embodiments, a model of the football helmet or head may be a 2-D model that resembles the projection of an actual football helmet. Pose and shape parameters of the helmet or head model may be used to represent 3-D location and 3-D pose, or more abstract shape and appearance parameters may be used that describe the deformation of a 2-D helmet model in a 2-D image.
[0034] In some embodiments, the gaze estimation capture tool may further use knowledge about a broadcast camera that recorded the video footage. In particular, the location of the camera with respect to the field, the pan, tilt, and roll of the camera, the focal length, the zoom factor, and other parameters and characteristics of the camera may be used to effectuate some gaze estimations, in accordance herewith. This camera knowledge may define certain constraints regarding the possible locations of the virtual helmet/head in the video imagery, thereby aiding the alignment process between the virtual model helmet and the captured video footage for the operator. The constraints arise because the helmet/head of the subject is, in practical terms, typically limited to between about 10 cm and about 250 cm above the football field and is typically limited to a fixed range of poses (i.e., a human head primarily pans and tilts).
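To make the height constraint concrete, the sketch below back-projects the helmet's observed image location through an assumed calibrated camera (intrinsics K, rotation R, translation t, field plane at z = 0 with z pointing up) and keeps only the segment of the viewing ray whose height lies in the roughly 10 cm to 250 cm band noted above. All symbols and names here are assumptions for illustration.

```python
import numpy as np


def feasible_helmet_positions(u, v, K, R, t, h_min=0.10, h_max=2.50):
    """Bound the 3-D helmet location implied by one image observation.

    (u, v): pixel location of the helmet in the image.
    K, R, t: camera intrinsics and extrinsics (world -> camera, x_cam = R X + t).
    Returns the two world points where the viewing ray crosses the heights
    h_min and h_max above the field plane (z = 0, z pointing up).
    """
    cam_center = -R.T @ t                                   # camera center in world coords
    ray = R.T @ np.linalg.inv(K) @ np.array([u, v, 1.0])    # viewing ray direction
    points = []
    for h in (h_min, h_max):
        s = (h - cam_center[2]) / ray[2]                    # ray parameter at height h
        points.append(cam_center + s * ray)
    return points                                           # helmet must lie between these
```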
[0035] Also, the gaze estimation capture tool may use multiple viewing angles of a football player. Given accurate camera information for multiple viewing angles, the operator may perform the alignment process between the virtual model and the actual video footage based on multiple viewing directions simultaneously, thereby making such alignment processes more accurate and more robust.
[0036] In some embodiments, a semi-automatic approach for providing a gaze estimation overlay includes associating a virtual model of the helmet/head of the subject individual with appearance information such as, for example, "image texture". The appearance information facilitates the generation of a virtual football helmet/head that appears substantially similar to the actual video-captured helmet/head in the broadcast footage. In the instance of such an accurate model, the alignment between the virtual helmet and the image of the helmet may be automated. In some embodiments, an operator may initially bring the virtual helmet into an approximate alignment with the actual (i.e., real) helmet, and an optimization algorithm may further refine the location and pose parameters of the virtual helmet in order to maximize a similarity between the video footage's real helmet and the virtual helmet.
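As a deliberately simplified stand-in for such an optimization, the sketch below refines only the 2-D image location of the helmet by a local search that maximizes normalized cross-correlation between an appearance template (e.g., a crop of the helmet from an already aligned frame) and the current frame; a fuller implementation would also refine scale and pose. The function names are hypothetical.

```python
import numpy as np


def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized cross-correlation between two equally sized gray patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else -1.0


def refine_location(frame: np.ndarray, template: np.ndarray,
                    x0: int, y0: int, radius: int = 8):
    """Search a small window around (x0, y0) for the best template match."""
    th, tw = template.shape
    best = (x0, y0, -2.0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = x0 + dx, y0 + dy
            if x < 0 or y < 0:
                continue                      # window fell off the image
            patch = frame[y:y + th, x:x + tw]
            if patch.shape != template.shape:
                continue
            score = ncc(patch, template)
            if score > best[2]:
                best = (x, y, score)
    return best                               # refined top-left corner and score
```

The same routine can be applied frame after frame, which is essentially the incremental use of automatic refinement described below.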
[0037] In some embodiments, the automatic refinement may be selectively or exclusively performed with shape information (i.e., without appearance information in some instances) by performing a manual or purely shape-based alignment once, followed by an acquisition of appearance information from the video footage (e.g., texture information is mapped from the broadcast footage onto the virtual model of the helmet). Subsequent alignments may then be performed using the acquired appearance information.
[0038] The amount and degree of operator intervention may be further reduced to a single rough alignment between the virtual helmet and the helmet/head of the broadcast footage by using the automatic pose refinement incrementally. For example, after an alignment has been established for one frame, subsequent alignments may be obtained by maximizing the similarity between the model and the captured imagery, as described hereinabove.
[0039] In a fully automatic approach for providing a gaze estimation overlay, operator intervention may be eliminated by developing and using subject (e.g., football player or helmet) detectors. The detector may include an algorithm that automatically determines the location of the subject or subject body (e.g., helmet or head) in a sequence of video images. In some embodiments, the detector may also include determining at least a rough pose of an object or person in a video image.
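Purely as an illustration of what such a detector could look like (the disclosure does not prescribe one), the sketch below uses OpenCV template matching against an assumed helmet template and a score threshold; a production detector would more likely be a trained classifier.

```python
import cv2
import numpy as np


def detect_helmets(frame_gray: np.ndarray, helmet_template: np.ndarray,
                   threshold: float = 0.7):
    """Return candidate (x, y) top-left helmet locations in one 8-bit gray frame."""
    scores = cv2.matchTemplate(frame_gray, helmet_template,
                               cv2.TM_CCOEFF_NORMED)
    ys, xs = np.where(scores >= threshold)
    candidates = sorted(zip(xs, ys), key=lambda p: -scores[p[1], p[0]])

    # Greedy non-maximum suppression so overlapping hits collapse to one.
    th, tw = helmet_template.shape
    kept = []
    for x, y in candidates:
        if all(abs(x - kx) > tw // 2 or abs(y - ky) > th // 2 for kx, ky in kept):
            kept.append((x, y))
    return kept
```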
[0040] In some embodiments, one or more cameras may be used to capture the video. It should be appreciated that the use of more than one camera to yield video containing multiple viewing angles of a scene may contribute to providing a gaze direction estimation that is more accurate than a single camera/single viewing angle approach. Furthermore, knowledge regarding the camera parameters may be obtained from optoelectronic devices attached to the broadcast cameras or via computer vision means that match 2-D image points with 3-D world coordinate points of a video captured environment (e.g., a football field).
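One common computer vision way to recover such camera parameters, given here only as an assumed example, is to solve a Perspective-n-Point problem from known field landmarks (e.g., yard-line intersections whose world coordinates are known) and their clicked pixel locations; the coordinate values below are placeholders.

```python
import cv2
import numpy as np

# World coordinates (meters) of known field landmarks, with the field plane
# at z = 0.  The actual coordinate values are placeholders.
world_points = np.array([[0.0, 0.0, 0.0],
                         [9.144, 0.0, 0.0],
                         [9.144, 48.8, 0.0],
                         [0.0, 48.8, 0.0]], dtype=np.float64)

# Pixel locations of the same landmarks identified in one video frame (placeholders).
image_points = np.array([[412.0, 602.0],
                         [955.0, 598.0],
                         [1210.0, 221.0],
                         [160.0, 230.0]], dtype=np.float64)

# Approximate intrinsics; in practice these come from prior calibration.
camera_matrix = np.array([[1500.0, 0.0, 960.0],
                          [0.0, 1500.0, 540.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(world_points, image_points,
                              camera_matrix, dist_coeffs)
# rvec/tvec encode the camera's pan/tilt/roll and location relative to the field.
```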
[0041] FIG. 1 is an exemplary illustration of a video image 100 including gaze estimation overlay 105. The gaze estimation presents a visualization of the field of vision of player 150 at a given instant in time. The gaze estimation overlay includes boundaries 110, 115, 120, and 125 that define the boundaries of the subject player's field of vision in the video scene. Boundary marking 130 further defines the field of vision. Gaze estimation 105 may be obtained using one or more of the gaze estimation techniques disclosed herein.
[0042] FIG. 2 provides an exemplary illustration of video image 200, including gaze estimation overlay 205. Gaze estimation overlay 205 is provided in conjunction with other visualizations such as telemetry components 240, 245 that provide details of subjects in the video. Gaze estimation overlay 205 includes boundaries 210, 215, 220, 225, and 230 that define the boundaries of the subject player's (250) field of vision in the video scene. Gaze estimation overlay 205 may be continuously updated as video 200 changes to provide an accurate, real time visualization of the gaze direction of player 250. A directional icon 235 is provided to inform viewers of the frame of reference used in the determination and/or presentation of the gaze estimation overlay.
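As an illustrative sketch rather than the broadcast graphics system itself, the boundary and center lines of such an overlay can be generated from the head's image position, the estimated gaze angle, and an assumed half field-of-view angle:

```python
import cv2
import numpy as np


def draw_gaze_wedge(frame, head_xy, gaze_deg, half_fov_deg=30.0,
                    length_px=300, color=(0, 255, 255)):
    """Draw a simple 2-D 'field of vision' wedge onto a video frame.

    head_xy:  (x, y) pixel position of the subject's head/helmet.
    gaze_deg: estimated gaze direction in image coordinates (0 = +x axis).
    """
    x0, y0 = head_xy
    for offset in (-half_fov_deg, 0.0, half_fov_deg):
        theta = np.deg2rad(gaze_deg + offset)
        x1 = int(round(x0 + length_px * np.cos(theta)))
        y1 = int(round(y0 + length_px * np.sin(theta)))
        thickness = 3 if offset == 0.0 else 1     # thicker center line
        cv2.line(frame, (int(x0), int(y0)), (x1, y1), color, thickness)
    return frame
```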
[0043] FIG. 3 provides an exemplary video image 300 including a display area 305 on video image 310. Display area 305 may be used to display textual and/or descriptive information regarding a gaze estimation determination for video image 310. For example, gaze estimation may be performed for a player in video image 310 but instead of an overlay being generated and visualized thereon, display area 305 may be used to display textual and/or descriptive information regarding the gaze estimation. For example, the textual and/or descriptive information may include a gaze angle, rate of change in the gaze angle, maximum distance downfield included in the gaze estimation, and other gaze related information.
[0044] FIG. 4 is an illustrative depiction of a number of mesh models for determining a gaze estimation. Arrows on the mesh model provide, for example, an indication of the direction of the gaze for the model.
[0045] FIG. 5 is an illustrative depiction of a number of visualization models for determining a gaze estimation.
[0046] FIG. 6A is an illustrative depiction of a video image 600 including a helmet 605 worn by football player 610. That is, helmet 605 is the actual or real helmet shown in the video. FIG. 6B is a depiction of a mesh model 615 that has been aligned with helmet 605. The alignment of mesh model 615 with helmet 605 may be accomplished using one or more of the techniques disclosed herein.
[0047] FIG. 7 provides an exemplary graphical presentation 700 relating to a gaze estimation for a video image. Section 705 includes graph line 715 that tracks or represents the gaze direction (i.e., angle) over a period of time. The angle of the subject's helmet/head is determined relative to a central or neutral position 720 (i.e., a gaze angle of 0°). Section 710 of graph 700 includes a segment of the video including the helmet of the player whose gaze is being determined and corresponds to the line graph in section 705.
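A graph like the one in section 705 can be produced directly from per-frame gaze estimates; the sketch below assumes one gaze angle per frame, measured relative to the neutral 0° position, and a known frame rate.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_gaze_over_time(gaze_angles_deg, fps=30.0):
    """Plot per-frame gaze angle relative to the neutral 0 degree pose."""
    times = np.arange(len(gaze_angles_deg)) / fps
    plt.plot(times, gaze_angles_deg)
    plt.axhline(0.0, linestyle="--", linewidth=0.8)   # neutral position
    plt.xlabel("time (s)")
    plt.ylabel("gaze angle (degrees)")
    plt.title("Estimated gaze direction over time")
    plt.show()
```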
[0048] FIG. 8 is an illustration 800 of video image 805 including visualizations of subject detections. A detector method and/or system may be used to detect, in real time or subsequent thereto, the helmets/heads of subjects of interest (e.g., football players) in video image 805. As shown, graphic overlays 810, 815, and 820 visually indicate the detected helmets/heads of, for example, three players. In some embodiments, graphic overlays 810, 815, and 820 may be visualized to indicate the players in the field of vision of another player, such as the quarterback in video image 805. In this manner, gaze estimation data is also provided to a viewer.
[0049] FIG. 9 is an exemplary depiction 900 of a gaze estimation overlay for a video image. The gaze estimation is provided and associated with player 905. The player's jersey number is provided at 915, in close proximity with graphic overlay 910 that tracks the player's helmet. Graphic overlay 910 may, though not necessarily, be obtained using an automatic helmet detector method and system. The gaze direction of player 905 is visualized by a center line 930 and boundaries 920 and 925. In some embodiments, boundaries 925 and 920 may be based on a theoretical or even an estimated range of vision for player 905. In some embodiments, boundaries 925 and 920 may be offset from center line 940 based on a calculation using data specific to the actual range of vision for player 905.
[0050] Display area 935 includes graphical information relating to player 905. The information shown relates to the position of the player relative to a reference point on the field (e.g., the line of scrimmage), and the velocity and acceleration of player 905. Also included is the gaze direction (0°) for the player. It should be appreciated that additional, alternative, or fewer data may be provided in display area 935.
[0051] In some embodiments, gaze overlay information, including the visualization of same, may be presented as lines (solid, dashed, colored, wavy, flashing, etc.) in a 2-D presentation or a 3-D presentation that includes height (up and down), width (side-to-side), and depth (near to far) aspects of an estimated and determined field of vision. The 3-D presentation may resemble a "cone of vision".
[0052] Also, the gaze overlay information may be provided on-screen with a sequence of video images as graphical or textual descriptions. In some embodiments, a frame of reference for the gaze estimation may be presented as and include, for example, a line graph, a circle graph with indications of the gaze estimation therein, a coordinate system, ruler(s), a grid, a gaze angle and time graph, and other visual indicators. In some embodiments, an angular velocity indicative of a rate at which a subject individual changes their gaze direction may be provided. In some embodiments, gaze estimation may be presented on a video image in a split-screen presentation wherein one screen area displays the video without the gaze estimation overlay and another screen displays the video with the gaze estimation overlay. In some embodiments, an indication of a gaze estimation may be presented or associated with or in a computer-generated display or computer visualization (e.g., a PC-based game image, a console game image, etc.).
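For the angular velocity mentioned above, one simple computation (assuming one gaze angle per frame) is a frame-to-frame finite difference with the differences wrapped into [-180°, 180°) so that a gaze swinging across the wrap-around boundary is not misread as a large jump:

```python
import numpy as np


def gaze_angular_velocity(gaze_angles_deg: np.ndarray, fps: float = 30.0):
    """Per-frame rate of change of gaze direction, in degrees per second."""
    diffs = np.diff(gaze_angles_deg)
    # Wrap each frame-to-frame difference into [-180, 180) degrees.
    diffs = (diffs + 180.0) % 360.0 - 180.0
    return diffs * fps
```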
[0053] While the disclosure has been described in detail in connection with only a limited number of embodiments, it should be readily understood that the disclosure is not limited to such disclosed embodiments. Rather, the disclosed embodiments may be modified to incorporate any number of variations, alterations, substitutions, or equivalent arrangements not heretofore described, but which are commensurate with the spirit and scope of the invention. Accordingly, the disclosure is not to be seen as limited by the foregoing description.

Claims

WHAT IS CLAIMED IS:
1. A method comprising:
capturing a video sequence of images with an image capturing system;
designating at least one landmark in a region of interest of the captured video sequence;
fitting, based on the at least one landmark, a model of the region of interest to the region of interest in the captured video sequence; and
determining a pose parameter for the model fitted to the region of interest.
2. The method of claim 1, wherein the pose parameter includes an estimation of a gaze of a subject associated with the region of interest.
3. The method of claim 1, further comprising determining the pose parameter for the model over a period of time.
4. The method of claim 1, further comprising extracting data associated with the pose parameter of the region of interest.
5. The method of claim 4, further comprising presenting the extracted data in a user-viewable format.
6. The method of claim 5, wherein the extracted data is presented as an overlay in a broadcast of the video sequence.
7. The method of claim 5, wherein the extracted data is presented concurrent with a presentation of the video sequence.
8. The method of claim 7, wherein the presentation of the video sequence is a live broadcast.
9. The method of claim 1, wherein the region of interest is selected from the group consisting of: a head, a face, an eye, a body part, or combinations thereof of a subject captured in the video sequence.
10. The method of claim 1, wherein the model is associated with image texture information of the region of interest.
11. The method of claim 1, wherein the model is one of a two-dimensional representation of the region of interest and a three-dimensional representation of the region of interest.
12. The method of claim 1, further comprising calibrating the image capturing system relative to a location of the image capturing system, including determining geometrical information associated with the location.
13. The method of claim 12, wherein the geometrical information includes at least one of information regarding a specification of the image capturing system components, the location of the image capturing system with respect to an area captured in the video sequence, a pan parameter, a tilt parameter, and a roll parameter.
14. The method of claim 1, wherein one or more of the capturing, designating, and fitting are performed either semi-manually, semi-automatically, or fully automatically.
15. The method of claim 14, wherein the capturing, designating, and fitting are performed fully automatically and the region of interest is automatically detected by a machine.
16. The method of claim 1, wherein the image capturing system includes a plurality of image capturing devices.
17. A system, comprising:
an image capturing system; and a computing system connected to the image capturing system, the computing system adapted to:
capture a video sequence of images with an image capturing system;
designate at least one landmark in a region of interest of the captured video sequence;
fit, based on the at least one landmark, a model of the region of interest to the region of interest in the captured video sequence; and
determine a pose parameter for the model fitted to the region of interest.
18. The system of claim 17, wherein the computing system is further adapted to present the extracted data in a user-viewable format.
19. The system of claim 17, wherein the pose parameter includes an estimation of a gaze of a subject associated with the region of interest.
20. The system of claim 17, wherein the computing system is further adapted to determine the pose parameter for the model over a period of time.
21. The system of claim 17, wherein the computing system is further adapted to calibrate the image capturing system relative to a location of the image capturing system, including determining geometrical information associated with location.
PCT/US2007/081023 2006-12-08 2007-10-11 Method and system for gaze estimation WO2008073563A1 (en)

Priority Applications (1)

US12/474,962 (published as US20090290753A1): Priority date 2007-10-11; Filing date 2009-05-29; Title: Method and system for gaze estimation

Applications Claiming Priority (4)

US86921606P (US 60/869,216): Priority date 2006-12-08; Filing date 2006-12-08
US75203007A (US 11/752,030): Priority date 2007-05-22; Filing date 2007-05-22

Related Child Applications (1)

US12/474,962 (Continuation-In-Part, published as US20090290753A1): Priority date 2007-10-11; Filing date 2009-05-29; Title: Method and system for gaze estimation

Publications (1)

WO2008073563A1: Publication date 2008-06-19

Family

Family ID: 39251346

Family Applications (1)

PCT/US2007/081023 (WO2008073563A1): Priority date 2006-12-08; Filing date 2007-10-11; Title: Method and system for gaze estimation

Country Status (1)

Country Link
WO (1) WO2008073563A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999053430A1 (en) * 1998-04-13 1999-10-21 Eyematic Interfaces, Inc. Vision architecture to describe features of persons
US20030076980A1 (en) * 2001-10-04 2003-04-24 Siemens Corporate Research, Inc.. Coded visual markers for tracking and camera calibration in mobile computing systems
US20040002642A1 (en) * 2002-07-01 2004-01-01 Doron Dekel Video pose tracking system and method
US20050073136A1 (en) * 2002-10-15 2005-04-07 Volvo Technology Corporation Method and arrangement for interpreting a subjects head and eye activity
US20060110008A1 (en) * 2003-11-14 2006-05-25 Roel Vertegaal Method and apparatus for calibration-free eye tracking

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
UKITA N ET AL: "Extracting a gaze region with the history of view directions", PATTERN RECOGNITION, 2004. ICPR 2004. PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON CAMBRIDGE, UK AUG. 23-26, 2004, PISCATAWAY, NJ, USA,IEEE, vol. 4, 23 August 2004 (2004-08-23), pages 957 - 960, XP010724080, ISBN: 0-7695-2128-2 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009009253A1 (en) * 2007-07-09 2009-01-15 General Electric Company Method and system for automatic pose and trajectory tracking in video
WO2011160114A1 (en) * 2010-06-18 2011-12-22 Minx, Inc. Augmented reality
WO2013086739A1 (en) * 2011-12-16 2013-06-20 Thomson Licensing Method and apparatus for generating 3d free viewpoint video
GB2584400A (en) * 2019-05-08 2020-12-09 Thirdeye Labs Ltd Processing captured images

Similar Documents

Publication Publication Date Title
US20090290753A1 (en) Method and system for gaze estimation
US10269177B2 (en) Headset removal in virtual, augmented, and mixed reality using an eye gaze database
EP2866201B1 (en) Information processing apparatus and method for controlling the same
CN105678802B (en) Method for generating three-dimensional information by identifying two-dimensional image
US11275434B2 (en) Information processing apparatus, information processing method, and storage medium
TW201322178A (en) System and method for augmented reality
KR101198557B1 (en) 3D stereoscopic image and video that is responsive to viewing angle and position
US12382005B2 (en) Image processing apparatus, image processing method, and storage medium
JP2005198818A (en) Body motion learning support system and learning support method
TWI647424B (en) Sensing device for calculating information on position of moving object and sensing method using the same
CN105611267A (en) Depth and chroma information based coalescence of real world and virtual world images
CN115797439A (en) Flame space positioning system and method based on binocular vision
US20240037843A1 (en) Image processing apparatus, image processing system, image processing method, and storage medium
EP2642446A2 (en) System and method of estimating page position
WO2008073563A1 (en) Method and system for gaze estimation
CN112989908A (en) Data processing method and device, program and non-temporary storage medium
CN107368188B (en) Foreground extraction method and system based on multiple spatial positioning in mediated reality
CN119620855A (en) Trajectory acquisition method, device and system
CN112416124A (en) Dance posture feedback method and device
KR101189665B1 (en) System for analyzing golf putting result using camera and method therefor
WO2019053790A1 (en) Position coordinate calculation method and position coordinate calculation device
Paletta et al. An integrated system for 3D gaze recovery and semantic analysis of human attention
JP2021051374A (en) Shape data generation device, shape data generation method, and program
CN117710445B (en) A target positioning method, device and electronic device applied to AR equipment
CN112166594A (en) Video processing method and device

Legal Events

121 (EP): The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 07853935; Country of ref document: EP; Kind code of ref document: A1.
NENP: Non-entry into the national phase. Ref country code: DE.
122 (EP): PCT application non-entry in European phase. Ref document number: 07853935; Country of ref document: EP; Kind code of ref document: A1.