
HK1190545A - Video summary including a particular person

Info

Publication number: HK1190545A
Application number: HK14103565.3A
Authority: HK (Hong Kong)
Prior art keywords: video, image, digital, image frames, camera system
Other languages: Chinese (zh)
Inventors: Keith Stoll Karn, Bruce Harold Pillman, Aaron Thomas Deever, John R. McCoy, Frank Razavi, Robert Gretzinger
Original assignee: Intellectual Ventures Fund 83 LLC
Application filed by Intellectual Ventures Fund 83 LLC
Publication of HK1190545A

Description

Video summary including a particular person
Technical Field
The present invention relates to the field of digital video processing, and more particularly, to a system and method for forming a digital video summary.
Background
Many digital capture devices are capable of capturing video as well as still images. Managing digital video content, however, can be a difficult task. Videos are typically visually represented using thumbnail images of the first frame of the video. This may not provide a deep understanding of the content of the video. Determining whether a particular event is contained in a given video typically requires viewing the entire video. For lengthy videos, the user may prefer to be able to obtain a quick summary of the video without having to view the entire video.
From a sharing perspective, digital video may also pose practical problems. Many digital capture devices record video at 30 or 60 frames per second with spatial resolutions as high as 1920 x 1080 pixels. Even when compressed, the amount of data produced may make it impractical to share even fairly short videos.
Video editing software can be used to manually summarize a video into a shorter version that can be shared more easily. However, manual video editing can be a lengthy, laborious process, and many users are not interested in manual editing. Automatic video summarization algorithms also exist. These solutions start with the captured video as input and analyze the video to determine a video summary. U.S. Patent 5,995,095 to Ratakonda, entitled "Method for hierarchical summarization and browsing of digital video," discloses a method for generating hierarchical summaries based on key frames of a video sequence. U.S. Patent 7,035,435 to Li et al., entitled "Scalable video summarization and navigation system and method," describes a method for assigning an importance value to each scene, shot, and frame in a video, and using the importance values to determine key frames for a video summary. U.S. Patent 7,483,618 to Edwards et al., entitled "Automatic editing of a visual recording to eliminate content of unacceptably low quality and/or very low interest," discloses a method for determining a video summary that eliminates content of lower quality or lesser interest from a video.
However, automatic video summarization algorithms are very complex, since the video must be decoded to perform the analysis needed to determine the video summary. Thus, a video summary corresponding to a just-captured video cannot be viewed immediately on the digital capture device. This disadvantage makes it difficult to quickly browse and share captured video.
When creating a video summary, it is often desirable for certain features to appear in the summary, so that the summary contains some or all of the video content in which those features are present. Examples of such features include people, pets, events, locations, activities, and objects. Manually creating such a customized video summary can be a tedious process, and having to use desktop software to generate it further hampers the ability to quickly browse and share video summaries.
Accordingly, it would be desirable to provide a system and method for calculating a video summary in a digital capture device. In particular, it would be desirable to provide a solution that allows for the generation of a video summary on a digital capture device with minimal delay after the video capture is completed. Further, it would be desirable to provide a video summary that contains user-specified characteristics.
Disclosure of Invention
The present invention represents a digital camera system for capturing a video sequence and providing an associated video summary, comprising:
an image sensor for capturing a digital image;
an optical system for forming an image of a scene on the image sensor;
a data processing system;
a storage device for storing the captured video sequence; and
a program memory communicatively connected to the data processing system and storing instructions configured to cause the data processing system to perform a method for forming a video summary, wherein the method comprises:
specifying reference data, wherein the reference data contains a particular person;
capturing a video sequence of the scene using the image sensor, the video sequence comprising a temporal sequence of image frames;
processing the captured video sequence using a video processing path to form a digital video file;
during the capturing of the video sequence, analyzing the captured image frames using a person recognition algorithm to identify a subset of the image frames that contain the particular person;
forming the video summary comprising fewer than all image frames in the captured video sequence, wherein the video summary comprises at least a portion of the identified subset of image frames that contain the particular person;
storing the digital video file in the storage device; and
storing a representation of the video summary in the storage device.
An advantage of the present invention is that the video frames are analyzed as they are captured to determine the subset of video frames containing a particular person, thereby eliminating the need to decompress the video data when creating the video summary.
An additional advantage of the present invention is that it stores a representation of a video summary in a storage device without decompressing the stored digital video sequence. This allows the video summary to be generated and viewed on the digital capture device with minimal delay after the video capture is completed.
In some embodiments, the video summary is encoded using metadata in a digital video file without the need to encode the video summary as a separate file. This has the advantage that the video summary can be conveniently viewed using a "smart" video player that understands the metadata of the video summary, while the metadata of the video summary is transparent to legacy players.
Drawings
FIG. 1 is a high-level schematic diagram showing the components of a system for forming a video summary;
FIG. 2 is a flow chart of a method for forming a video summary;
FIG. 3 is a flow diagram illustrating the processing of a digital video sequence using two different video processing paths;
FIG. 4 is a flow diagram illustrating the processing of a digital video sequence using two different video processing paths in accordance with an alternative embodiment;
FIG. 5 is a flow chart of a method for creating a video summary according to a second embodiment; and
FIG. 6 is a flow chart of a method for creating a video summary according to a third embodiment.
Detailed Description
In the following description, a preferred embodiment of the present invention will be described in terms that would normally be implemented as a software program. Those skilled in the art will readily recognize that the equivalent of such software may also be constructed in hardware. Because image processing algorithms and systems are well known, the present description will be directed in particular to algorithms and systems forming part of, or cooperating more directly with, the system and method in accordance with the present invention. Other aspects of such algorithms and systems, and hardware or software for generating and otherwise processing image signals associated therewith, not specifically shown or described herein, may be selected from such systems, algorithms, components, and elements known in the art. In view of the systems described in accordance with the present invention in the following material, software not specifically shown, suggested, or described herein for practicing the invention is conventional and well within the ordinary skill of those in the art.
Further, as used herein, a computer program for performing the methods of the present invention may be stored in a computer readable storage medium, which may include, for example: magnetic storage media, such as a magnetic disk (e.g., a hard drive or floppy disk) or magnetic tape; optical storage media, such as an optical disc, optical tape, or machine-readable bar code; solid state electronic storage devices, such as Random Access Memory (RAM) or Read Only Memory (ROM); or any other physical device or medium employed to store a computer program having instructions for controlling one or more computers to implement the methods in accordance with the present invention.
The invention includes combinations of the embodiments described herein. References to "a particular embodiment" or the like refer to features that are present in at least one embodiment of the invention. Separate references to "an embodiment" or "particular embodiments" or the like do not necessarily refer to the same embodiments; however, these embodiments are not mutually exclusive, unless indicated as mutually exclusive or as would be apparent to one of ordinary skill in the art. The use of the singular or plural to refer to "a method" or "methods" and the like is not limiting. It should be noted that the word "or" is used in this disclosure in a non-exclusive sense unless the context clearly dictates otherwise.
Because digital cameras employing imaging devices and associated circuitry for signal capture and processing and display are well known, the present description will be directed in particular to elements forming part of, or cooperating more directly with, the method and apparatus in accordance with the present invention. Elements not specifically shown or described herein are selected from elements well known in the art. Certain aspects of the embodiments to be described are provided in software. Software not specifically shown, described, or suggested herein for implementing the present invention is conventional and is within the ordinary skill of those in the art in view of the systems shown and described in accordance with the present invention in the following materials.
Those skilled in the art will be familiar with the following description of a digital camera. It is apparent that there are many variations on this embodiment that are possible and selected to reduce cost, add features, or improve the performance of the camera.
Fig. 1 depicts a block diagram of a digital photography system including a digital camera 10 capable of capturing video images in accordance with the present invention. Preferably, the digital camera 10 is a portable battery operated device that is small enough to be easily held by a user while capturing and viewing images. The digital camera 10 produces digital images that are stored as digital image files using the storage device 30. The phrase "digital image" or "digital image file" as used herein refers to any digital image file, for example, a digital still image or a digital video file.
In some embodiments, the digital camera 10 captures both motion video images and still images. In other embodiments, the digital camera 10 is a digital video camera that captures only motion video images. The digital camera 10 may also include other functions including, but not limited to: a digital music player (e.g., MP3 player), a sound recording device, a mobile phone, a GPS receiver, or a Programmable Digital Assistant (PDA).
The digital camera 10 includes a lens 4 having an adjustable aperture and an adjustable shutter 6. In a preferred embodiment, the lens 4 is a zoom lens and is controlled by a zoom and focus motor drive 8. The lens 4 focuses light from a scene (not shown) onto an image sensor 14, such as a monolithic color CCD or CMOS image sensor. The lens 4 is an optical system for forming an image of a scene on the image sensor 14. In other embodiments, the optical system may use a fixed focal length lens with a variable or fixed focus.
The output of the image sensor 14 is converted to digital form by an Analog Signal Processor (ASP) and an analog-to-digital (a/D) converter 16, and temporarily stored in a buffer memory 18. The image data stored in the buffer memory 18 is then processed by the processor 20 using an embedded software program (e.g., firmware) stored in the firmware memory 28. In some embodiments, the software program is permanently stored in firmware memory 28 using Read Only Memory (ROM). In other embodiments, firmware memory 28 may be modified using, for example, flash EPROM memory. In these embodiments, the external device may use the wired interface 38 or the wireless modem 50 to update the software program stored in the firmware memory 28. In these embodiments, firmware memory 28 may also be used to store image sensor calibration data, user setting selections, and other data that must be saved when the camera is turned off. In some embodiments, processor 20 includes a program memory (not shown), and the software programs stored in firmware memory 28 are copied into program memory prior to execution by processor 20.
It will be understood that the functions of processor 20 may be provided using a single programmable processor or using multiple programmable processors, including one or more Digital Signal Processor (DSP) devices. Alternatively, the processor 20 may be provided by custom circuitry (e.g., by one or more custom Integrated Circuits (ICs) specifically designed for use in digital cameras) or by a combination of a programmable processor and custom circuitry. It will be understood that a common data bus may be used to form the connection between the processor 20 and some or all of the various components shown in FIG. 1. For example, in some embodiments, a common data bus may be used to make the connections between processor 20, buffer memory 18, storage device 30, and firmware memory 28.
The processed image is then stored using the storage device 30. It will be understood that the storage device 30 may be any form of memory known to those skilled in the art, including but not limited to: a removable flash memory card, an internal flash memory chip, a magnetic memory, or an optical memory. In some embodiments, storage device 30 may include an internal flash memory chip and a standard interface to a removable flash memory card, such as a Secure Digital (SD) card. Alternatively, a different memory card format may be used, for example, a micro SD card, a Compact Flash (CF) card, a multimedia card (MMC), an xD card, or a memory stick.
The image sensor 14 is controlled by a timing generator 12, wherein the timing generator 12 generates various clock signals to select rows and pixels and to synchronize the operation of the ASP and A/D converter 16. The image sensor 14 may have, for example, 12.4 megapixels (4088 × 3040 pixels) to provide a still image file having approximately 4000 × 3000 pixels. To provide a color image, the image sensor is typically overlaid with a color filter array, which provides an image sensor having an array of pixels that includes pixels of different colors. The different colored pixels can be arranged in many different patterns. As one example, the pixels of different colors may be arranged using the well-known Bayer color filter array, as described in commonly-assigned U.S. Pat. No. 3,971,065 to Bayer, entitled "Color imaging array". As a second example, the pixels of different colors may be arranged as described in U.S. Patent Application Publication 2007/0024931 to Compton and Hamilton, entitled "Image sensor with improved light sensitivity". These examples are not limiting, and many other color patterns may be used.
It will be understood that the image sensor 14, timing generator 12, and ASP and A/D converter 16 may be separately fabricated integrated circuits, or they may be fabricated as a single integrated circuit as is commonly done for CMOS image sensors. In some embodiments, the single integrated circuit may perform some of the other functions shown in fig. 1, including some of the functions provided by processor 20.
When the timing generator 12 drives the image sensor 14 in the first mode, the image sensor 14 provides a motion sequence of lower-resolution sensor image data, which is used to compose the image when capturing video images and also when previewing a still image to be captured. The sensor image data for this preview mode may be provided as HD-resolution image data, e.g., having 1280 x 720 pixels, or as VGA-resolution image data, e.g., having 640 x 480 pixels, or using other resolutions having significantly fewer columns and rows of data than the resolution of the image sensor.
The sensor image data of the preview mode may be provided by combining values of neighboring pixels having the same color or by eliminating some of the pixel values or by combining pixel values of some colors while eliminating pixel values of other colors. Image data for preview mode can be processed as described in U.S. patent 6,292,218, commonly assigned to Parulski et al, entitled "Electronic camera for initiating capture of still images and moving images".
The image sensor 14 is also used to provide high resolution still image data when the timing generator 12 drives the image sensor 14 in the second mode. The sensor image data of this final mode is provided as high resolution output image data, which includes all pixels of the image sensor for scenes with high illumination levels, and may be, for example, 12 megapixel final image data with 4000 × 3000 pixels. At lower illumination levels, the final sensor image data may be provided by "binning" a certain number of similarly colored pixels on the image sensor to increase the signal level of the sensor and thus increase the "ISO speed".
The zoom and focus motor drives 8 are controlled by control signals provided by the processor 20 to provide the appropriate zoom setting and to focus the scene onto the image sensor 14. The exposure level of the image sensor 14 is controlled by controlling the f/number and exposure time of the adjustable aperture and adjustable shutter 6, the exposure period of the image sensor 14 via the timing generator 12, and the gain (i.e., ISO speed) setting of the ASP and A/D converter 16. The processor 20 also controls a flash 2 which can illuminate the scene.
The lens 4 of the digital camera 10 may be focused in the first mode by using "through-the-lens" autofocus, as described in commonly-assigned U.S. Pat. No. 5,668,597 to Parulski et al., entitled "Electronic Camera with Rapid Automatic Focus of an Image upon a Progressive Scan Image Sensor". This is accomplished by using the zoom and focus motor drives 8 to adjust the focus position of the lens 4 to a number of positions ranging between a near focus position and an infinity focus position, while the processor 20 determines the closest focus position that provides the peak sharpness value for a central portion of the image captured by the image sensor 14. The focus distance corresponding to the closest focus position can then be used for several purposes, for example, to automatically set an appropriate scene mode, and it can be stored as metadata in the image file along with other lens and camera settings.
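For illustration, this focus search amounts to a sweep over candidate lens positions, scoring each by a sharpness metric computed over the central image region. The following Python sketch captures the idea; the capture_at callable and the particular sharpness metric are hypothetical stand-ins for camera firmware services, not part of the patent.

```python
import numpy as np

def sharpness(image: np.ndarray) -> float:
    """Simple sharpness proxy: variance of horizontal pixel
    differences over the central portion of the frame."""
    h, w = image.shape[:2]
    center = image[h // 4: 3 * h // 4, w // 4: 3 * w // 4]
    return float(np.var(np.diff(center.astype(np.float64), axis=1)))

def autofocus(capture_at, focus_positions):
    """Sweep the lens from near to infinity focus and return the
    position whose captured frame is sharpest in the center.

    capture_at: hypothetical callable mapping a focus position
    to a captured frame (supplied by the camera firmware).
    """
    scores = {pos: sharpness(capture_at(pos)) for pos in focus_positions}
    return max(scores, key=scores.get)
```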
The processor 20 produces menus and low resolution color images that are temporarily stored in the display memory 36 and displayed on the image display 32. The image display 32 is typically an active matrix color Liquid Crystal Display (LCD), but other types of displays, such as an Organic Light Emitting Diode (OLED) display, may be used. The video interface 44 provides video output signals from the digital camera 10 to a video display 46, such as a flat panel HDTV display. In preview mode or video mode, the digital image data from the buffer memory 18 is processed by the processor 20 to form a series of motion preview images that are typically displayed as color images on the image display 32. In the browse mode, image data from digital image files stored in the storage device 30 is used to generate images for display on the image display 32.
The graphical user interface displayed on the image display 32 is controlled in response to user input provided by the user controls 34. The user controls 34 are used to select various camera modes, such as a video capture mode, a still capture mode, and a review mode, and to initiate the capture of still images and the recording of moving images. In some embodiments, the first mode described above (i.e., the still preview mode) is initiated when the user partially presses the shutter button (which is one of the user controls 34), and the second mode (i.e., the still image capture mode) is initiated when the user fully presses the shutter button. The user controls 34 are also used to turn on the camera, control the lens 4, and initiate the photographing process. User controls 34 typically include some combination of buttons, rocker switches, joysticks, or rotary dials. In some embodiments, some of the user controls 34 are provided by using a touch screen overlay on the image display 32. In other embodiments, additional status displays or image displays may be used.
The camera modes that may be selected using user controls 34 include a "timer" mode. When the "timer" mode is selected, a short delay (e.g., 10 seconds) occurs after the user fully presses the shutter button before the processor 20 initiates the capture of the still image.
An audio codec 22 connected to the processor 20 receives audio signals from a microphone 24 and provides audio signals to a speaker 26. These components can be used to record and playback audio tracks as well as video sequences or still images. If the digital camera 10 is a multi-function device, such as a combination camera and mobile phone, the microphone 24 and speaker 26 may be used to make a call.
In some embodiments, the speaker 26 may be used as part of a user interface, for example, to provide various audible signals indicating that a user control has been pressed, or that a particular mode has been selected. In some embodiments, the microphone 24, audio codec 22, and processor 20 may be used to provide speech recognition, such that a user may provide user input to the processor 20 by using voice commands rather than user controls 34. The speaker 26 may also be used to notify the user of an incoming call. This may be accomplished by using standard ring tones stored in the firmware memory 28 or by using custom ring tones downloaded from the wireless network 58 and stored in the storage device 30. In addition, a vibration device (not shown) may be used to provide silent (e.g., inaudible) notifications regarding incoming calls.
In some embodiments, the digital camera 10 also includes an accelerometer 27, the accelerometer 27 providing data related to any motion of the camera. Preferably, the accelerometer 27 detects both linear and rotational accelerations for each of three orthogonal directions (for a total of 6 dimensions of input).
The processor 20 also provides additional processing of the image data from the image sensor 14 to produce rendered sRGB image data, which is compressed and stored in a "finished" image file, such as a well-known Exif-JPEG image file, in the storage device 30.
The digital camera 10 may be connected via the wired interface 38 to an interface/charger 48, the interface/charger 48 being connected to the computer 40, and the computer 40 may be a desktop computer or a portable computer located in a home or office. The wired interface 38 may conform to, for example, the well-known USB2.0 interface specification. The interface/charger 48 may provide power to a set of rechargeable batteries (not shown) in the digital camera 10 via the wired interface 38.
The digital camera 10 may include a wireless modem 50, the wireless modem 50 interfacing with a wireless network 58 over a radio frequency band 52. The wireless modem 50 may use various wireless interface protocols, such as the well-known Bluetooth wireless interface or the well-known 802.11 wireless interface. The computer 40 may upload images to a photo service provider 72, such as the Kodak EasyShare Gallery, via the Internet 70. Other devices (not shown) may access the images stored by the photo service provider 72.
In an alternative embodiment, the wireless modem 50 communicates over a radio frequency (e.g., wireless) link with a mobile phone network (not shown) such as a 3GSM network that is connected to the internet 70 to transfer digital image files from the digital camera 10. These digital image files may be provided to the computer 40 or the photo service provider 72.
The invention will now be described with reference to fig. 2. First, a digital video capture device, such as the digital camera 10 (FIG. 1), is used in a capture video sequence step 200 to capture a digital video sequence 205 using the image sensor 14, where the digital video sequence 205 comprises a time sequence of image frames.
Concurrently with the capture video sequence step 200, a capture reference image step 210 is performed to capture a reference image 215 using the image sensor 14, wherein the reference image 215 contains a particular person. The reference image 215 may be an image frame from the captured digital video sequence 205, and may be selected using the user controls 34. For example, during video capture, the user may request capture of the reference image 215 by pressing an appropriate user control button, which sends a signal to the processor 20 to designate the current video frame in the buffer memory 18 as the reference image 215. The reference image 215 contains the particular person of interest. In a preferred embodiment, the reference image 215 contains a frontal view of the particular person at sufficient spatial resolution to allow facial features to be determined from the reference image 215. In some embodiments, the reference image 215 contains only a single person, and the particular person is identified as the only person in the reference image 215. In other embodiments, the reference image 215 may contain a plurality of persons, and the particular person may be selected in any of a number of ways, including: selecting the largest person, selecting the person closest to the center of the reference image 215, selecting the person with the largest visible face, or selecting the person manually using an appropriate user interface (e.g., by having the user select a face using a pointing device). Alternatively, each person in the reference image 215 may be designated as a particular person. In some embodiments, the particular person may be selected by comparing the faces present in the reference image 215 to known faces in a face recognition database and selecting one of the known faces. In other embodiments, a user interface may be provided to enable the user to manually designate the particular person contained in the reference image 215.
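As a concrete illustration of the selection rules just listed, the sketch below picks a "particular person" from a set of detected face boxes in the reference image, preferring either the largest visible face or the face closest to the image center. The FaceBox structure and the upstream face detector are assumptions for illustration; any detector could supply them.

```python
from dataclasses import dataclass

@dataclass
class FaceBox:
    x: float  # left edge, pixels
    y: float  # top edge, pixels
    w: float  # width, pixels
    h: float  # height, pixels

def select_particular_person(faces, image_size, rule="largest_face"):
    """Choose one face from the reference image per the rules in the text."""
    if not faces:
        return None
    if rule == "largest_face":
        return max(faces, key=lambda f: f.w * f.h)
    if rule == "closest_to_center":
        cx, cy = image_size[0] / 2, image_size[1] / 2
        def dist2(f):
            fx, fy = f.x + f.w / 2, f.y + f.h / 2
            return (fx - cx) ** 2 + (fy - cy) ** 2
        return min(faces, key=dist2)
    raise ValueError(f"unknown rule: {rule}")
```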
The process captured video sequence step 220 processes the captured digital video sequence 205 using a video processing path to form a digital video file 225. The video processing path may include, for example, a video compression step. Video compression algorithms such as those specified in the MPEG and h.263 standards are well known to those skilled in the art.
During the capture of the digital video sequence 205, an analyze captured image frames step 240 analyzes the image frames in the captured digital video sequence 205 using a person recognition algorithm to identify the subset 245 of image frames containing the particular person. In a preferred embodiment, the person recognition algorithm is a face recognition algorithm, and the analyze captured image frames step 240 identifies image frames containing a face that is the same as the face of the particular person in the reference image. Face recognition algorithms are well known in the art. For example, the article "Eigenfaces for Recognition" by Turk et al. (Journal of Cognitive Neuroscience, Vol. 3, pp. 71-86, 1991) describes a face recognition algorithm that can be used in accordance with the present invention. Alternatively, the person recognition algorithm may be any algorithm that matches a person in an image frame to the particular person in the reference image 215. Such an algorithm may include steps of gender classification, height estimation, and clothing analysis, and may be selected from such algorithms known to those skilled in the art.
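A minimal sketch of this per-frame matching follows: each frame is scored against the reference face, and frames whose best match falls under a distance threshold join the subset 245. The embed function here is a deliberately toy descriptor (block-averaged gray levels) standing in for an eigenface-style projection such as that of Turk et al.; the detect_faces callable and the threshold value are likewise assumptions.

```python
import numpy as np

def embed(face_pixels: np.ndarray, grid: int = 8) -> np.ndarray:
    """Toy face descriptor: average gray level over a grid x grid
    tiling of the face crop, normalized to unit length."""
    h, w = face_pixels.shape
    tiles = face_pixels[: h - h % grid, : w - w % grid]
    tiles = tiles.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    v = tiles.ravel().astype(np.float64)
    return v / (np.linalg.norm(v) + 1e-9)

def frames_with_person(frames, reference_face, detect_faces, threshold=0.2):
    """Return indices of frames containing the particular person.

    detect_faces: hypothetical callable returning cropped grayscale
    face arrays found in one frame.
    """
    ref = embed(reference_face)
    subset = []
    for i, frame in enumerate(frames):
        for face in detect_faces(frame):
            if np.linalg.norm(embed(face) - ref) < threshold:
                subset.append(i)
                break
    return subset
```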
The form video summary step 250 forms a video summary 255 comprising fewer than all of the image frames in the captured digital video sequence 205, wherein the video summary 255 comprises at least a portion of the identified subset 245 of image frames containing the particular person. In one embodiment of the present invention, only those image frames that contain a particular person are used to form the video summary 255.
In some embodiments, the video summary 255 includes only a portion of the subset 245 of image frames that contain a particular person. For example, the video summary 255 may be limited to include fewer image frames than a predetermined number of image frames. In an alternative embodiment, the video summary 255 may include a single image frame from each set of consecutive image frames in the subset 245 of image frames. In this manner, the video summary 255 may be a "slide show" comprised of a set of still images selected from the identified subset 245 of image frames containing a particular person.
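For the slide-show variant, consecutive frame indices in the identified subset 245 can be grouped into runs, keeping one representative still per run. A small sketch in pure Python (no assumptions beyond the index list itself):

```python
def slide_show_frames(subset_indices):
    """Collapse each run of consecutive frame indices to a single
    representative still (here, the middle frame of the run)."""
    if not subset_indices:
        return []
    runs, start, prev = [], subset_indices[0], subset_indices[0]
    for i in subset_indices[1:]:
        if i != prev + 1:  # run ended; record it and start a new one
            runs.append((start, prev))
            start = i
        prev = i
    runs.append((start, prev))
    return [(a + b) // 2 for a, b in runs]

# e.g. the person appears in frames 10-14 and 40-41:
assert slide_show_frames([10, 11, 12, 13, 14, 40, 41]) == [12, 40]
```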
In another embodiment, the video summary 255 includes additional image frames besides the identified subset 245 of image frames containing the particular person. In one embodiment of the invention, the additional image frames include image frames immediately preceding or following the image frames in the identified subset 245 of image frames containing the particular person. Such frames may be selected, for example, as transition frames, to allow a transition from one portion of the video summary 255 to another. Such frames may also be selected so that the video summary 255 comprises groups of pictures that can be easily extracted from the digital video file 225. Video compression standards such as MPEG encode video sequences such that some frames are encoded independently (without reference to other frames), and some groups of temporally sequential frames, or groups of pictures, are encoded without reference to any frames outside the group of pictures. Thus, the compressed video data representing such a group of pictures can be extracted from the compressed digital video file 225 without decoding the compressed video data.
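Expanding a selected span outward to group-of-pictures boundaries is simple index arithmetic when the GOP length is fixed, after which the aligned compressed data can be copied from the digital video file 225 without decoding. A sketch under the assumption of a constant GOP size:

```python
def align_to_gops(first_frame, last_frame, gop_size=15):
    """Expand [first_frame, last_frame] to whole GOP boundaries so the
    compressed data can be extracted without re-encoding."""
    start = (first_frame // gop_size) * gop_size
    end = ((last_frame // gop_size) + 1) * gop_size - 1
    return start, end

# A person detected in frames 37-62 with 15-frame GOPs pulls in
# frames 30-74 (three complete groups of pictures):
assert align_to_gops(37, 62) == (30, 74)
```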
In another embodiment of the present invention, the additional image frames include other portions of the captured digital video sequence 205 that are determined to be important portions. These other important portions of the digital video sequence 205 can be identified by executing a key frame extraction or video summarization algorithm. Such algorithms are described in commonly-assigned, co-pending U.S. Patent Application Publication No. 2011/0292288 to Deever, entitled "Method for determining key video frames," published December 1, 2011, and in commonly-assigned, co-pending U.S. Patent Application Publication No. 2011/0293018 to Deever, entitled "Video summary method and system."
In U.S. Patent Application Publication No. 2011/0293018, a method for forming a video summary is disclosed in which image frames are analyzed at capture time to determine feature values. These feature values are analyzed, without decompressing the compressed digital video sequence, to identify key video snippets that make up a video summary.
In U.S. Patent Application Publication No. 2011/0292288, a method for determining key video snippets is disclosed in which a digital video sequence is analyzed using feature values determined at capture time to determine an importance value as a function of time. The importance value is used to form a warped-time representation of the digital video sequence. Warping time means giving some video frames in the digital video sequence greater temporal weight and other video frames lesser temporal weight. The warped-time representation of the digital video sequence is divided into a set of equal time intervals. A key video frame is selected from each time interval by analyzing the feature values associated with the video frames within the corresponding time interval. Such criteria can include selecting a key video frame that occurs shortly after completion of a zoom-in process, or selecting a key video frame having a moderate level of local motion in the central region of the video frame.
In some embodiments, key video snippets may be formed by selecting a set of video frames surrounding each key video frame. For example, a key video snippet may be formed by selecting video frames two seconds before and two seconds after the key video frame, thereby forming a key video snippet of four seconds in length.
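To make the warped-time construction and the snippet windows concrete, the sketch below treats the per-frame importance values as a sampling density: their cumulative sum is divided into equal parts, one key frame is taken per part (here, simply the most important frame in that part), and each key frame is padded by two seconds on either side. This is a minimal reading of the cited applications under stated assumptions, not their full method.

```python
import numpy as np

def key_frames(importance, n_key_frames):
    """Pick one key frame per equal interval of the warped
    (importance-weighted) timeline."""
    importance = np.asarray(importance, dtype=np.float64)
    warped = np.cumsum(importance)  # warped-time coordinate per frame
    total = warped[-1]
    keys = []
    for k in range(n_key_frames):
        lo, hi = k * total / n_key_frames, (k + 1) * total / n_key_frames
        idx = np.where((warped > lo) & (warped <= hi))[0]
        if idx.size:
            keys.append(int(idx[np.argmax(importance[idx])]))
    return keys

def snippet_bounds(key_frame, n_frames, fps=30.0, seconds=2.0):
    """Frame range two seconds before and after a key video frame,
    clipped to the sequence length."""
    pad = int(round(seconds * fps))
    return max(0, key_frame - pad), min(n_frames - 1, key_frame + pad)
```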
Alternatively, the key video frames may be ranked, and key video snippets formed for only a subset of the key video frames corresponding to the highest-ranked key video frames. Ranking the key video frames can include analyzing the digital video sequence to determine a camera fixation pattern, and ranking the key video frames responsive to the camera fixation pattern. Feature values corresponding to global motion, determined at capture time, can provide a global motion trace that indicates the regions on which the camera fixated throughout the digital video sequence. Video frames corresponding to regions of high fixation (that is, regions at which the camera remained fixated for a substantial proportion of the overall video capture) are ranked more highly. The ranking process can be performed iteratively, selecting the next highest-ranking key video frame at each step; at each step, the process can promote key video frames representing fixation regions not yet represented over key video frames representing fixation regions already included in the ranking. Once the key video frames have been ranked, the highest-ranked key video frames can be selected for inclusion in key video snippets. The additional image frames represented by these key video snippets can be combined with the identified subset of image frames containing the particular person to form the video summary.
The store digital video file step 230 stores the digital video file 225 in the storage device 30, producing a stored digital video file 235. The store video summary representation step 260 stores a representation 265 of the video summary in the storage device 30. In one embodiment, storing the representation 265 of the video summary in the storage device 30 includes storing frame identification metadata providing an indication of the image frames in the digital video sequence 205 that correspond to the video summary 255. The frame identification metadata can be stored in association with the stored digital video file 235. For example, the representation 265 of the video summary can be stored as video summary metadata in the stored digital video file 235, indicating a series of start and end frames corresponding to the snippets contained in the video summary. This allows the representation 265 of the video summary to be stored without using any additional physical memory beyond the small amount of memory required to store the frame identification metadata. The video summary 255 can then be played using a "smart" video player that understands the video summary metadata, while the video summary metadata is transparent to legacy players.
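Stored this way, the representation 265 of the video summary reduces to a list of (start, end) frame ranges carried inside or alongside the digital video file: a "smart" player concatenates just those ranges, while a legacy player never reads the tag and plays the full sequence. A sketch of this metadata round trip, where the tag name and JSON encoding are illustrative assumptions rather than a defined format:

```python
import json

def summary_metadata(segments):
    """Serialize video-summary frame ranges, e.g. [(30, 74), (120, 164)],
    as a metadata string; "video_summary" is a hypothetical tag name."""
    return json.dumps({"video_summary": [list(s) for s in segments]})

def frames_to_play(metadata_str, total_frames):
    """A 'smart' player expands the tagged ranges; a legacy player
    ignores the tag and simply plays frames 0..total_frames-1."""
    tag = json.loads(metadata_str)
    order = []
    for start, end in tag["video_summary"]:
        order.extend(range(start, min(end, total_frames - 1) + 1))
    return order

meta = summary_metadata([(30, 74), (120, 164)])
assert frames_to_play(meta, 1000)[:3] == [30, 31, 32]
```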
In another embodiment, the representation 265 of the video summary is a summarized digital video file. In this embodiment, the video summary 255 is stored as a digital video file separate from the stored digital video file 235. In this case, the representation 265 of the video summary is a summarized digital video file that may be viewed or shared independently of the stored digital video file 235. In a preferred embodiment, the stored representation 265 of the video summary is a summarized digital video file having a format suitable for playing using a standard video player.
In some embodiments, the summary digital video file may be created after the capture of the video sequence is completed. The desired frames of the video summary may be extracted from the stored digital video file 235. If a desired frame of the video summary is selected to correspond to an individually encoded group of pictures, a summary digital video file may be created by extracting compressed data corresponding to the desired frame without decoding the compressed video data.
In some embodiments, the summary digital video file is formed from image frames of the captured digital video sequence 205 using a different video processing path than that used to form the digital video file 225. For example, many video capture devices may capture and encode two video streams simultaneously. In these devices, a single image sensor 14 is used to capture each image frame. Each image frame is then processed using two different processing paths. One processing path may generate a digital video file 225. The second processing path may generate a summary digital video file for storing video summary 255. The second processing path differs from the first processing path in that it can generate a reduced spatial resolution version of each image frame and encode the lower resolution video. Many video capture devices can capture 1080p resolution video and QVGA resolution video simultaneously.
Fig. 3 shows a video capture process having two processing paths. A captured image frame 310 goes to both a first video processing path 320 and a second video processing path 330. The first video processing path 320 includes a first video encoder 340 that provides a first encoded image frame 350 at a first spatial resolution. The second video processing path 330 includes an optional image adjuster 360 that produces a modified image frame 370 having a second spatial resolution. The second video processing path 330 also includes a second video encoder 380 that encodes the modified image frame 370 to provide a second encoded image frame 390. Those skilled in the art will recognize that the first video processing path 320 and the second video processing path 330 may also optionally include other video processing steps 300, such as color and tone scale processing, noise reduction, sharpening, and image stabilization. Although these other video processing steps 300 are shown as being applied first, it should be appreciated that they may be applied at any point within the first video processing path 320 and the second video processing path 330, or may even be interspersed among the other steps. Those skilled in the art will also recognize that fig. 3 illustrates one possible way in which the two video processing paths may differ, and that other differences falling within the scope of the present invention are possible.
The summary digital video file may be created concurrently with the digital video file 225 by using a different video processing path than the video processing path used to form the digital video file 225 (fig. 2). Fig. 4 shows a variation of the method shown in fig. 3, which may be used to create two video files simultaneously. As with fig. 3, each captured image frame 310 passes through a first video processing path 320 that includes a first video encoder 340 to produce a first encoded image frame 350. The first encoded image frame 350 is contained in the digital video file 225.
The captured image frame 310 also passes through a second video processing path 430. The second video processing path 430 is similar to the second video processing path 330 of fig. 3, except that an image-contains-particular-person test 460 is added. The image-contains-particular-person test 460 uses a person recognition algorithm to analyze the captured image frame 310 and determine whether it contains the particular person from the reference image 215. If not, the captured image frame 310 is discarded in a discard image step 470. If the particular person is present in the captured image frame 310, processing proceeds to the image adjuster 360 to determine the modified image frame 370, and then to the second video encoder 380 to provide the second encoded image frame 390. The second encoded image frame 390 is contained in the summary digital video file. In an alternative embodiment, the image adjuster 360 may be applied earlier in the second video processing path 430, before the image-contains-particular-person test 460. In other embodiments, the image adjuster 360 may not be present in the second video processing path 430 at all.
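The per-frame flow of fig. 4 can be summarized in a few lines: every captured frame goes through the first path unconditionally, while the second path tests for the particular person, optionally resizes, and otherwise discards the frame. The encoder, test, and resize objects below are hypothetical stand-ins for the camera's codec and recognition components, sketched under those assumptions.

```python
def process_frame(frame, main_encoder, summary_encoder,
                  contains_person, resize):
    """One iteration of the dual-path capture loop of fig. 4.

    main_encoder / summary_encoder: hypothetical encoder objects with
    an encode() method (e.g. full-resolution and QVGA sessions).
    contains_person: person recognition test against the reference image.
    resize: image adjuster producing the lower-resolution frame.
    """
    # First video processing path 320: every frame, first resolution.
    main_encoder.encode(frame)
    # Second video processing path 430: only frames with the person.
    if contains_person(frame):
        summary_encoder.encode(resize(frame))
    # Frames failing the test are simply discarded from path two.
```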
The first video processing path 320 and the second video processing path 430 shown in fig. 4 are used to generate two separate video files. A first video file (digital video file 225) contains all captured image frames 310 and is encoded at a first spatial resolution. The second video file (summary digital video file) includes only those captured image frames 310 that contain the particular person of interest and is encoded at a second spatial resolution. The second video file is formed simultaneously with the first video file.
Alternatively, the summary digital video file may be formed only partially concurrently with the digital video file. An initial summary digital video file may be created as shown in fig. 4. After the capture is complete, the summary digital video file may be augmented with additional data extracted from the stored digital video file 235. In this case, the summary digital video file may be extended to include image frames that were not originally encoded by the second video processing path. This allows the summarized digital video file to be generated more quickly than if the entire summarized digital video file were formed after capture.
Fig. 5 shows an alternative embodiment of the invention. In this case, rather than capturing the reference image 215 containing the particular person at the same time as the step of capturing the video sequence 200, the reference image 215 containing the particular person is captured in a separate step of capturing reference images 510, wherein the separate step of capturing reference images 510 may be performed before or after the step of capturing the video sequence 200. The reference image 215 may be a single image obtained in still capture mode or may be an image frame from an existing captured digital video sequence. In some embodiments, the digital camera may store a set of person images corresponding to a set of persons of interest to the owner of the digital camera (e.g., a person image may be stored for each family member), and a reference image 215 may be selected from the set of person images using a suitable user interface (e.g., a list of predetermined person names). The reference image 215 may be captured on the digital camera 10, or alternatively, the reference image 215 may be captured on a separate image capture device and the reference image 215 input into the digital camera 10.
In some embodiments, multiple reference images 215 containing different views of a particular person may be specified. The analyze captured image frames step 240 may use multiple reference images 215 to help more reliably determine whether an image frame contains a particular person.
In some embodiments, the analyze captured image frames step 240 occurs concurrently with the capture video sequence step 200 such that the video summary 255 is formed concurrently with the generation of the digital video file 225. In other embodiments, the video summary 255 may be formed using post-processing operations that are performed after the digital video file 225 has been captured and stored. In some cases, the analyze captured image frames step 240 may be performed on the digital camera 10. Alternatively, the analyze captured image frames step 240 may also be performed after the stored digital video file 235 has been loaded onto some other device, such as a host computer or the like. The remaining elements in fig. 5 are the same as those shown in the embodiment of fig. 2.
An alternative embodiment of the present invention will now be described with reference to fig. 6. In this case, rather than analyzing the captured image frames based on the reference image 215, the analyze captured image frames step 640 selects the subset 245 of image frames based on the set of reference data 615 indicative of the feature of interest, wherein the set of reference data 615 is specified in the specify reference data step 610. The remaining elements in the method of fig. 6 are the same as those shown in the embodiment of fig. 2.
The reference data 615 may take many forms. In some embodiments, the reference data 615 may be a textual description of the feature of interest. For example, the reference data 615 may be a name, object, location, or event of a person. In some embodiments, the reference data 615 may be a mathematical representation of the feature of interest. For example, the reference data 615 may be a color histogram, a feature value, a template, or any other feature vector. Those skilled in the art will recognize that there are a variety of methods that may be used to represent image information within the scope of the present invention. In some embodiments, reference data 615 may be associated with non-image information. For example, the reference data 615 may be information associated with: audio information, Global Positioning System (GPS) data, autofocus data, auto-exposure data, auto-white balance data, zoom lens data, accelerometer data, gyroscope data, or infrared sensor data. Those skilled in the art will recognize that there are various types of information that may be provided as reference data 615, which reference data 615 is used by the analyze captured image frame step 640 to identify the subset 245 of image frames.
During the capture of the digital video sequence 205, the analyze captured image frames step 640 analyzes the image frames in the digital video sequence 205 using a feature recognition algorithm to identify the subset 245 of image frames containing the feature of interest, where the feature of interest is specified by the reference data 615. For example, if the feature of interest specified by the reference data 615 is the name of a particular person for whom a reference face image has previously been specified, a face recognition algorithm can be used to analyze the image frames relative to the reference face image to determine whether they contain the named person. In another example, if the feature of interest specified by the reference data 615 is an event label (e.g., "golf swing"), a feature recognition algorithm can be used to determine whether an image frame corresponds to the specified event. In some cases, the feature recognition algorithm may need to analyze a series of image frames to determine the appropriate event label (e.g., to detect which image frames contain the motion characteristic of a golf swing). In another example, if the feature of interest specified by the reference data 615 is an object label (e.g., "whale"), a feature recognition algorithm can be used to determine whether an image frame contains the specified object. All of these examples are cases in which the feature of interest can be identified at capture time to determine the subset 245 of image frames to be included in the video summary.
In some embodiments, the feature of interest specified by the reference data 615 may correspond to a general characteristic associated with the image frames. For example, the feature of interest specified by the reference data 615 may be image frames having low associated camera motion. In this case, a feature recognition algorithm can be used to analyze gyroscope data, accelerometer data, or image-based motion estimates to identify image frames satisfying the specified camera motion condition. Similarly, the feature of interest specified by the reference data 615 may be image frames following a camera zoom process. In this case, a feature recognition algorithm can be used to analyze zoom lens data or image-based zoom estimates to identify image frames occurring just after completion of a camera zoom. In another example, the feature of interest specified by the reference data 615 may be image frames with a large magnitude of object motion in the scene. In this case, a feature recognition algorithm can be used to quantify the amount of object motion in the scene to identify image frames satisfying the object motion condition. Those skilled in the art will recognize that these are merely examples of the many features of interest and feature recognition algorithms falling within the scope of the present invention.
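As one example of such a feature recognition algorithm operating on non-image data, the sketch below flags low-camera-motion frames by thresholding the magnitude of the gyroscope rate recorded with each frame. The threshold value and the per-frame data layout are illustrative assumptions.

```python
import numpy as np

def low_motion_frames(gyro_rates, threshold_dps=5.0):
    """Identify frames whose angular rate magnitude is below a threshold.

    gyro_rates: (n_frames, 3) array of per-frame angular rates in
    degrees/second about three orthogonal axes (an assumed layout).
    """
    magnitudes = np.linalg.norm(np.asarray(gyro_rates, float), axis=1)
    return np.where(magnitudes < threshold_dps)[0]

# Frames 1 and 3 are steady enough to satisfy the condition:
rates = [[20, 5, 0], [1, 2, 0], [15, 0, 3], [0.5, 0.5, 0.2]]
assert low_motion_frames(rates).tolist() == [1, 3]
```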
The computer program product may include one or more storage media, such as: magnetic storage media such as a magnetic disk (e.g., floppy disk) or magnetic tape; optical storage media such as optical disks, optical tape, or machine-readable bar codes; solid state electronic storage devices, such as Random Access Memory (RAM), or Read Only Memory (ROM); or any other physical device or medium employed to store a computer program having instructions for controlling one or more computers to implement the method according to the present invention.
Parts list
2 flash lamp
4 lens
6 Adjustable Aperture and Adjustable shutter
8 zoom and focus motor drives
10 digital camera
12 timing generator
14 image sensor
16 ASP and A/D converter
18 buffer memory
20 processor
22 audio codec
24 microphone
25 pressure sensor
26 speaker
27 accelerometer
28 firmware memory
30 storage device
32 image display
34 user control
36 display memory
38 wired interface
40 computer
42 Pitch sensor
44 video interface
46 video display
48 interface/charger
50 wireless modem
52 radio frequency band
58 wireless network
70 Internet
72 photo service provider
200 capture video sequence step
205 digital video sequence
210 capture reference image step
215 reference image
220 process captured video sequence step
225 digital video file
230 store digital video file step
235 stored digital video file
240 analyze captured image frames step
245 subset of image frames
250 form video summary step
255 video summary
260 store video summary representation step
265 video summary representation
300 further video processing steps
310 captured image frame
320 first video processing path
330 second video processing path
340 first video encoder
350 first encoded image frame
360 image adjuster
370 modified image frame
380 second video encoder
390 second encoded image frame
430 second video processing path
460 image-contains-particular-person test
470 discard image step
510 capture reference image step
610 specify reference data step
615 reference data
640 analyze captured image frames step

Claims (18)

1. A digital camera system for capturing a video sequence and providing an associated video summary, comprising:
an image sensor for capturing a digital image;
an optical system for forming an image of a scene on the image sensor;
a data processing system;
a storage device for storing the captured video sequence; and
a program memory communicatively connected to the data processing system and storing instructions configured to cause the data processing system to perform a method for forming a video summary, wherein the method comprises:
designating a reference image, wherein the reference image contains a particular person;
capturing a video sequence of the scene using the image sensor, the video sequence comprising a temporal sequence of image frames;
processing the captured video sequence using a video processing path to form a digital video file;
analyzing the captured image frames using a person recognition algorithm to identify a subset of the image frames that contain the particular person;
forming the video summary comprising fewer than all image frames in the captured video sequence, wherein the video summary comprises at least a portion of the identified subset of image frames that contain the particular person;
storing the digital video file in the storage device; and
storing a representation of the video summary in the storage device.
2. The digital camera system of claim 1, wherein the reference image is captured using the image sensor.
3. The digital camera system of claim 2, wherein a user selects the reference image by actuating a user control.
4. The digital camera system of claim 1, wherein the reference image is downloaded to the digital camera system.
5. The digital camera system of claim 1, wherein the representation of the video summary is a summary digital video file.
6. The digital camera system of claim 5, wherein the summary digital video file is formed from the image frames of the captured video sequence using a different video processing path than a video processing path used to form the digital video file.
7. The digital camera system of claim 5, wherein the summary digital video file is formed at least partially concurrently with the digital video file.
8. The digital camera system of claim 1, wherein the representation of the video summary includes frame identification metadata specifying a set of image frames in the digital video file to be included in the video summary.
9. The digital camera system of claim 8, wherein the frame identification metadata is stored in the digital video file.
10. The digital camera system of claim 1, wherein the video summary includes additional image frames in addition to the identified subset of image frames containing the particular person.
11. The digital camera system of claim 10, wherein the additional image frames include image frames immediately preceding or following an image frame in the identified subset of image frames containing the particular person.
12. The digital camera system of claim 11, wherein the image frames of the video sequence are compressed for storage in the digital video file and the image frames immediately preceding or following an image frame in the identified subset of image frames containing the particular person are selected such that the video summary includes a set of image frames that can be extracted from the digital video file without decoding the compressed image frames.
13. The digital camera system of claim 10, wherein the additional image frames include other portions of the captured video sequence that are determined to be significant portions.
14. The digital camera system of claim 13, wherein the significant portion of the captured video sequence comprises a key image frame identified using a key frame extraction algorithm.
15. The digital camera system of claim 1, wherein the step of analyzing the captured image frames is performed during the capturing of the video sequence.
16. The digital camera system of claim 1, wherein a plurality of reference images are specified, each reference image containing the particular person.
17. The digital camera system of claim 1, wherein the video summary is a set of still images selected from the identified subset of image frames that contain the particular person.
18. A method for forming a video summary of a video sequence, comprising:
receiving a video sequence comprising a temporal sequence of image frames;
designating a reference image, wherein the reference image contains a particular person;
using a data processor to automatically analyze the image frames with a person recognition algorithm to identify a subset of the image frames that contain the particular person;
forming the video summary comprising fewer than all image frames in the video sequence, wherein the video summary comprises at least a portion of the identified subset of image frames that contain the particular person; and
storing a representation of the video summary in a storage device accessible to a processor.
HK14103565.3A 2011-05-18 2012-05-14 Video summary including a particular person HK1190545A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/110,056 2011-05-18

Publications (1)

Publication Number Publication Date
HK1190545A (en) 2014-07-04


Similar Documents

Publication Publication Date Title
US9013604B2 (en) Video summary including a particular person
EP2710594B1 (en) Video summary including a feature of interest
US8599316B2 (en) Method for determining key video frames
US8605221B2 (en) Determining key video snippets using selection criteria to form a video summary
US8520088B2 (en) Storing a video summary as metadata
US8432965B2 (en) Efficient method for assembling key video snippets to form a video summary
US8446490B2 (en) Video capture system producing a video summary
US8619150B2 (en) Ranking key video frames using camera fixation
HK1190545A (en) Video summary including a particular person