
HK1073375B - Video surveillance system and method employing video primitives


Info

Publication number
HK1073375B
HK1073375B (application number HK05106910.9A)
Authority
HK
Hong Kong
Prior art keywords
video
primitives
identifying
event
extracting
Prior art date
Application number
HK05106910.9A
Other languages
Chinese (zh)
Other versions
HK1073375A1 (en)
Inventor
Alan J. Lipton
Thomas M. Strat
Peter L. Venetianer
Mark C. Allmen
William E. Severson
Niels Haering
Andrew J. Chosak
Zhong Zhang
Matthew F. Frazier
James S. Sfekas
Tasuki Hirata
John I. W. Clark
Original Assignee
Avigilon Fortress Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 09/987,707 (published as US 2005/0146605 A1)
Application filed by Avigilon Fortress Corporation
Publication of HK1073375A1
Publication of HK1073375B


Description

Video surveillance system and method employing video primitives
Technical Field
The present invention relates to an automated video surveillance system that employs video primitives.
References
For the reader's convenience, the references referred to herein are listed below. In the specification, the numerals within braces refer to the respective references. All of the references listed herein are incorporated by reference.
The following references describe moving object detection:
{1} A. Lipton, H. Fujiyoshi, and R.S. Patil, "Moving Target Detection and Classification from Real-Time Video", Proceedings of IEEE WACV '98, Princeton, NJ, 1998, pp. 8-14.
{2} W.E.L. Grimson et al., "Using Adaptive Tracking to Classify and Monitor Activities in a Site", CVPR, pp. 22-29, June 1998.
{3} A.J. Lipton, H. Fujiyoshi, and R.S. Patil, "Moving Target Classification and Tracking from Real-Time Video", IUW, pp. 129-136, 1998.
{4} T.J. Olson and F.Z. Brill, "Moving Object Detection and Event Recognition Algorithms for Smart Cameras", IUW, pp. 159-175, May 1997.
The following references describe the detection and tracking of humans:
{5} A.J. Lipton, "Local Application of Optical Flow to Analyse Rigid versus Non-Rigid Motion", International Conference on Computer Vision, Corfu, Greece, September 1999.
{6} F. Bartolini, V. Cappellini, and A. Mecocci, "Counting People Getting In and Out of a Bus by Real-Time Image-Sequence Processing", IVC, 12(1):36-41, January 1994.
{7} M. Rossi and A. Bozzoli, "Tracking and Counting Moving People", ICIP94, pp. 212-216, 1994.
{8} C.R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-Time Tracking of the Human Body", Vismod, 1995.
{9} L. Khoudour, L. Duvieubourg, and J.P. Deparis, "Real-Time Pedestrian Counting by Active Linear Cameras", JEI, 5(4):452-459, October 1996.
{10} S. Ioffe and D.A. Forsyth, "Probabilistic Methods for Finding People", IJCV, 43(1):45-68, June 2001.
{11} M. Isard and J. MacCormick, "BraMBLe: A Bayesian Multiple-Blob Tracker", ICCV, 2001.
The following references describe blob analysis:
{12} D.M. Gavrila, "The Visual Analysis of Human Movement: A Survey", CVIU, 73(1):82-98, January 1999.
{13} Niels Haering and Niels da Vitoria Lobo, "Visual Event Detection", Video Computing Series, editor Mubarak Shah, 2001.
The following references describe blob analysis for trucks, cars, and people:
{14} Collins, Lipton, Kanade, Fujiyoshi, Duggins, Tsin, Tolliver, Enomoto, and Hasegawa, "A System for Video Surveillance and Monitoring: VSAM Final Report", Technical Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, May 2000.
{15} Lipton, Fujiyoshi, and Patil, "Moving Target Classification and Tracking from Real-Time Video", 1998 DARPA IUW, November 20-23, 1998.
The following references describe analyzing blobs of single people and their contours:
{16} C.R. Wren, A. Azarbayejani, T. Darrell, and A.P. Pentland, "Pfinder: Real-Time Tracking of the Human Body", PAMI, Vol. 19, pp. 780-784, 1997.
The following references describe the internal motion of blobs, including any motion-based segmentation:
{17} M. Allmen and C. Dyer, "Long-Range Spatiotemporal Motion Understanding Using Spatiotemporal Flow Curves", Proc. IEEE CVPR, Lahaina, Maui, Hawaii, pp. 303-309, 1991.
{18} L. Wixson, "Detecting Salient Motion by Accumulating Directionally-Consistent Flow", IEEE Trans. Pattern Anal. Mach. Intell., 22(8):774-780, August 2000.
Background
Video surveillance in public places has gained widespread popularity and acceptance by the general public. Unfortunately, conventional video surveillance systems produce large volumes of data, and thus encounter difficult problems in the analysis of video surveillance data.
There is a need to reduce the amount of video surveillance data so that analysis of the video surveillance data can be performed.
The video surveillance data needs to be filtered to identify a desired portion of the video surveillance data.
Disclosure of Invention
It is an object of the present invention to reduce the amount of video surveillance data, thereby enabling analysis of the video surveillance data.
It is another object of the present invention to filter video surveillance data to identify desired portions of the video surveillance data.
It is another object of the present invention to generate real-time alerts based on automatic detection of events from video surveillance data.
It is another object of the present invention to integrate data from surveillance sensors other than video to improve search capabilities.
It is another object of the present invention to integrate data from surveillance sensors other than video to improve event detection capabilities.
The present invention includes an article, method, system, and apparatus for video surveillance.
The article of manufacture of the present invention comprises a computer readable medium embodying software for a video surveillance system, the computer readable medium comprising code segments for operating the video surveillance system in accordance with video primitives.
The article of manufacture of the present invention comprises a computer readable medium comprising software for a video surveillance system, the computer readable medium comprising code segments for accessing archived video primitives, and code segments for extracting event occurrences from the accessed archived video primitives.
The system of the invention includes a computer system including a computer-readable medium having software to operate a computer in accordance with the invention.
The apparatus of the present invention includes a computer including a computer-readable medium having software to operate the computer in accordance with the present invention.
The article of manufacture of the invention includes a computer readable medium having software to operate a computer in accordance with the invention.
Moreover, the above objects and advantages are illustrative, not exhaustive, of those that can be achieved by the present invention. Accordingly, these and other objects and advantages of the present invention will be apparent from the description herein, both as embodied herein and as modified in view of any variations apparent to those skilled in the art.
Definitions
"video" means a moving picture represented in analog and/or digital form. Examples thereof include: television, movies, image sequences from video cameras or other observers, and computer-generated image sequences.
A "frame" represents a particular image or other discrete unit in a video.
An "object" represents an item of interest in a video. Examples of objects include: human, vehicle, animal, physical subject.
"activity" means one or more actions of one or more objects and/or one or more compositions of actions. Examples thereof include: enter, leave, stop, move, rise, fall, grow, and shrink.
"location" means a space in which an activity can occur. For example, the location may be scene-based or image-based. Examples of scene-based locations include: public places, stores, retail stores, offices, warehouses, hotel rooms, hotel lobbies, halls of buildings, casinos, bus stops, train stations, airports, ports, buses, trains, airplanes, and boats. Examples of image-based locations include: a video image, a line in a video image, a region in a video image, a rectangular portion of a video image, and a polygonal portion of a video image.
An "event" represents one or more objects engaged in an activity. Events may be represented by location and/or time.
"computer" means any device capable of accepting a structural input, processing the structural input according to a specified rule, and producing a result of the processing as an output. Examples of the computer include: computers, general purpose computers, supercomputers, mainframes, super mini-computers, workstations, microcomputers, servers, interactive television, hybrid combinations of computers and interactive television, and special purpose hardware that emulates computers and/or software. The computer may have single or multiple processors, which may operate in parallel and/or non-parallel. A computer also means two or more computers connected together by a network for transmitting and receiving information between the computers. Examples of such computers include distributed computer systems that process information through computers connected by a network.
"computer-readable medium" means any storage device for storing computer-accessible data. Examples of computer readable media include: magnetic hard disks, floppy disks, optical disks such as CD-ROMs and DVDs, magnetic tapes, memory chips, and carrier waves for carrying computer-readable electronic data, such as those used in transmitting and receiving electronic mail or in accessing networks.
"software" means a specified rule for operating a computer. Examples of software include: software, code segments, instructions, computer programs and program control logic.
"computer system" means a system having a computer, wherein the computer includes a computer-readable medium embodying software to operate the computer.
"network" means a plurality of computers and associated equipment connected by a communications facility. Networks involve permanent connections, such as cables, or temporary connections made through telephone or other communication links. Examples of networks include the Internet, such as the Internet, intranets, Local Area Networks (LANs), Wide Area Networks (WANs), and combinations of networks, such as a combination of the Internet and an intranet, and the like.
Drawings
Embodiments of the invention are explained in more detail by the drawings, in which like reference numerals refer to like parts.
FIG. 1 shows a plan view of a video surveillance system of the present invention.
Fig. 2 shows a flow diagram of the video surveillance system of the present invention.
FIG. 3 shows a flow diagram for assigning tasks to a video surveillance system.
Fig. 4 shows a flow chart for operating a video surveillance system.
Fig. 5 shows a flow diagram for extracting video primitives for a video surveillance system.
FIG. 6 shows a flow chart for taking action with a video surveillance system.
Fig. 7 shows a flow chart of a semi-automatic calibration of a video surveillance system.
FIG. 8 shows a flow chart for automatic calibration of a video surveillance system.
Fig. 9 shows another flow diagram of the video surveillance system of the present invention.
FIGS. 10-15 illustrate examples of the video surveillance system of the present invention applied to monitoring a grocery store.
Detailed Description
The automatic video surveillance system of the present invention is used to monitor a particular site, for example, for market research or security purposes. The system may be a dedicated video surveillance appliance with a specially made surveillance component, or the system may be a retrofit to existing video surveillance equipment located on a surveillance video feed line. The system is capable of analyzing video data from a live source or recorded media. The system may have a specific response to the analysis, for example, recording data, activating an alarm mechanism, or initiating another sensor system. The system can also be integrated with other monitoring system components. The present system generates a security or market research report that can be tailored to the needs of the operator and, optionally, can be displayed through an interactive web-based interface or other reporting mechanism.
The operator is provided with maximum flexibility in configuring the system by using event discriminators. An event discriminator is described in terms of one or more objects (whose descriptions are based on video primitives), together with one or more optional spatial attributes and/or one or more optional temporal attributes. For example, the operator may define an event discriminator (called a "loitering" event in this example) as a "person" object in the "automatic teller machine" space for "longer than 15 minutes" and "between 10:00 p.m. and 6:00 a.m."
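To make this concrete, such an event discriminator could be represented as a small rule object matched against incoming primitives. The following Python sketch is purely illustrative: the class, its field names, and the matching logic are assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass
from datetime import time

@dataclass
class EventDiscriminator:
    """Hypothetical discriminator: object class + area + dwell time + time window."""
    object_class: str       # e.g. "person"
    area: str               # named spatial region, e.g. "automatic teller machine"
    min_duration_s: float   # minimum dwell time, in seconds
    start: time             # start of time-of-day window
    end: time               # end of time-of-day window

    def matches(self, obj_class, area, duration_s, t):
        """True if an object's primitives satisfy this discriminator."""
        if self.start > self.end:   # window wraps past midnight
            in_window = t >= self.start or t <= self.end
        else:
            in_window = self.start <= t <= self.end
        return (obj_class == self.object_class and area == self.area
                and duration_s >= self.min_duration_s and in_window)

# The "loitering" example from the text: a person at the ATM area for
# longer than 15 minutes, between 10:00 p.m. and 6:00 a.m.
loitering = EventDiscriminator("person", "automatic teller machine",
                               15 * 60, time(22, 0), time(6, 0))
print(loitering.matches("person", "automatic teller machine",
                        20 * 60, time(23, 30)))  # True
```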
Although the video surveillance system of the present invention utilizes well known computer vision techniques from the public domain, the video surveillance system of the present invention has several unique and novel features that currently remain unaddressed. For example, current video surveillance systems use a large number of video images as the primary product of information exchange. The system of the present invention uses video primitives as the primary product, while typical video images are used as indirect evidence. The system of the present invention may also be calibrated (manually, semi-automatically, or automatically) and then automatically infer video primitives from the video images. The present system can also analyze previously processed video without having to completely reprocess the video. By analyzing previously processed video, the system can perform inference analysis based on previously recorded video primitives, greatly increasing the analysis speed of the computer system.
As another example, the system of the present invention provides unique system task assignment. Current video systems allow a user to assign tasks only through device control directives and, in some sophisticated conventional systems, by marking out areas of interest or disinterest. A device control directive is an instruction to control the position, orientation, and focus of a video camera. Instead of device control directives, the system of the present invention uses event discriminators based on video primitives as the primary task assignment mechanism. Using event discriminators and video primitives gives the operator a much more intuitive way to extract useful information from the system than conventional systems provide. Rather than assigning tasks to the system with device control directives such as "camera A pan 45 degrees to the left," tasks may be assigned to the system of the present invention in a human-intuitive manner using one or more event discriminators based on video primitives, such as "a person enters restricted area A."
Using the present invention for market research, the following is an example of the type of video surveillance that can be performed with the present invention: calculating the number of people in the store; calculating a number of people in a portion of the store; calculating the number of people staying at a specific location in the store; measuring how long people spend in a store; measuring how long people spend in a portion of the store; and measuring the length of lines in the store.
Using the present invention for security, the following are examples of the types of video surveillance that can be performed with the present invention: determining when a person enters a restricted area and storing the associated imagery; determining when a person enters an area at an unusual time; determining when a shelf or storage space changes in a way that may be unauthorized; determining when a passenger on an aircraft approaches the cockpit; determining when a person tailgates through a secure portal; determining whether there is an unattended bag in an airport; and determining whether an asset has been stolen.
FIG. 1 shows a plan view of a video surveillance system of the present invention. The computer system 11 includes a computer 12, the computer 12 having a computer readable medium 13 embodying software to operate the computer 12 in accordance with the present invention. The computer system 11 is connected to one or more video sensors 14, one or more video recorders 15, and one or more input/output (I/O) devices 16. The video sensor 14 may optionally be connected to a video recorder 15 for direct recording of video surveillance data. The computer system is optionally connected to other sensors 17.
The video sensor 14 provides source video to the computer system 11. For example, each video sensor 14 may be connected to the computer system 11 using a direct connection (e.g., a firewire digital camera interface) or a network. The video sensor 14 may be present prior to installation of the present invention or may be installed as part of the present invention. Examples of the video sensor 14 include: cameras, digital cameras, color cameras, monochrome cameras, camcorders, PC cameras, web cameras, infrared cameras, and CCTV cameras.
The video recorder 15 receives video surveillance data from the computer system 11 for recording and/or provides source video to the computer system 11. For example, each video recorder 15 may be connected to the computer system 11 using a direct connection or a network. The video recorder 15 may be present prior to installation of the invention or may be installed as part of the invention. Examples of the video recorder 15 include: video tape recorders, digital video recorders, video discs, DVDs and computer readable media.
The I/O devices 16 provide input to the computer system 11 and receive output from the computer system 11. The I/O devices 16 may be used to assign tasks to the computer system 11 and generate reports from the computer system 11. Examples of the I/O device 16 include: a keyboard, a mouse, a stylus, a monitor, a printer, another computer system, a network, and an alarm.
Other sensors 17 provide additional input to the computer system 11. For example, each of the other sensors 17 may be connected to the computer system 11 using a direct connection or a network. The other sensors 17 may be present prior to installation of the invention or may be installed as part of the invention. Examples of other sensors 17 include: motion sensors, optical tripwires, biometric sensors, and card-based or keyboard-based authentication systems. The computer system 11, recording device and/or recording system may record the output of other sensors 17.
Fig. 2 shows a flow diagram of the video surveillance system of the present invention. Aspects of the present invention are illustrated with reference to fig. 10-15, which show examples of a video surveillance system of the present invention for monitoring a grocery store.
In block 21, a video surveillance system is set up as discussed with respect to FIG. 1. Each video sensor 14 is directed towards a video surveillance site. Computer system 11 is connected to video feeds from video devices 14 and 15. The video surveillance system may be implemented using existing equipment or newly installed equipment at the site.
In block 22, the video surveillance system is calibrated. Calibration occurs once the video surveillance system is in place per block 21. The result of block 22 is that the video surveillance system is able to determine the approximate absolute size and velocity of a particular object (e.g., a person) at multiple locations in the video images provided by the video sensor. The system may be calibrated using manual calibration, semi-automatic calibration, or automatic calibration. Calibration is described further after the discussion of block 24.
In block 23 of fig. 2, a task is assigned to the video surveillance system. Task allocation occurs after calibration in block 22 and is optional. Assignment of tasks to the video surveillance system involves specification of one or more event discriminators. Without task allocation, the video surveillance system operates by detecting and archiving video primitives and associated video images without taking any action, as indicated by block 45 in FIG. 4.
FIG. 3 illustrates a flow chart for assigning tasks to the video surveillance system to determine event discriminators. An event discriminator refers to one or more objects optionally interacting with one or more spatial attributes and/or one or more temporal attributes. Event discriminators are described in terms of video primitives. A video primitive is an observable attribute of an object viewed in a video feed. Examples of video primitives include the following: classification, size, shape, color, texture, position, velocity, internal motion, motion, significant motion, features of significant motion, scene change, features of scene change, and predefined model.
Classification refers to the identification of an object as belonging to a particular category or class. Examples of classifications include: person, dog, vehicle, police car, individual person, and specific type of object.
Size represents the dimensional attributes of an object. Examples of sizes include: large; medium; small; flat; taller than 6 feet; shorter than 1 foot; wider than 3 feet; thinner than 4 feet; about the size of a person; bigger than a person; smaller than a person; about the size of a car; a rectangle in an image with approximate pixel dimensions; and a number of image pixels.
The color represents a color attribute of the object. Examples of colors include: white, black, gray, red, HSV value ranges, YUV value ranges, RGB value ranges, average RGB values, average YUV values, and histograms of RGB values.
The texture represents the pattern properties of the object. Examples of texture features include: self-similarity, spectral power, linearity, and roughness.
Internal motion represents a measure of the rigidity of an object. An example of a fairly rigid object is a car, which does not exhibit much internal motion. An example of a fairly non-rigid object is a person with swinging limbs, who exhibits a large amount of internal motion.
Motion represents any motion that can be automatically detected. Examples of motions include: appearance of an object, disappearance of an object, vertical motion of an object, horizontal motion of an object, periodic motion of an object.
Significant motion means any motion that can be automatically detected and tracked over a period of time. Such moving objects exhibit apparently purposeful motion. Examples of significant motion include: from one location to another, and to interact with another object.
A feature of significant motion represents a property of the significant motion. Examples of features of significant motion include: the trajectory; the length of the trajectory in image space; the approximate length of the trajectory in a three-dimensional representation of the environment; the position of the object in image space as a function of time; the approximate position of the object in a three-dimensional representation of the environment as a function of time; the duration of the trajectory; the velocity (e.g., speed and direction) in image space; the approximate velocity (e.g., speed and direction) in a three-dimensional representation of the environment; the duration of a velocity; the change of velocity in image space; the approximate change of velocity in a three-dimensional representation of the environment; the duration of a change of velocity; the cessation of motion; and the duration of the cessation of motion. A velocity represents the speed and direction of an object at a particular time. A trajectory represents the set of (position, velocity) pairs of an object for as long as the object can be tracked.
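As an illustration only, a few of these trajectory features (path length in image space, duration, and mean speed) can be derived from a track of timestamped positions. The function name and the (t, x, y) track layout below are assumptions made for the sketch.

```python
import math

def trajectory_features(track):
    """Summarize a trajectory given as a list of (t, x, y) samples.

    Returns image-space path length, duration, and mean speed --
    a few of the features of significant motion named above.
    """
    length = sum(math.hypot(x2 - x1, y2 - y1)
                 for (_, x1, y1), (_, x2, y2) in zip(track, track[1:]))
    duration = track[-1][0] - track[0][0]
    mean_speed = length / duration if duration > 0 else 0.0
    return {"path_length_px": length,
            "duration_s": duration,
            "mean_speed_px_per_s": mean_speed}

print(trajectory_features([(0, 0, 0), (1, 3, 4), (2, 6, 8)]))
# {'path_length_px': 10.0, 'duration_s': 2, 'mean_speed_px_per_s': 5.0}
```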
A scene change represents any region of the scene that changes over a period of time. Examples of scene changes include: a stationary object leaving the scene; an object entering the scene and becoming stationary; an object changing position in the scene; and an object changing appearance (e.g., color, shape, or size).
A feature of a scene change represents a property of the scene change. Examples of features of scene changes include: the size of the scene change in image space; the approximate size of the scene change in a three-dimensional representation of the environment; the time at which the scene change occurred; the position of the scene change in image space; and the approximate position of the scene change in a three-dimensional representation of the environment.
A predefined model represents an a priori known model of an object. Examples of predefined models include: adult, child, vehicle, and semi-trailer.
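Taken together, primitives such as those defined above can be stored as compact timestamped records, one per object per observation. The dataclass below is one hypothetical layout; the patent does not prescribe a schema, so every field name here is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class VideoPrimitive:
    """One observable attribute record for one tracked object (illustrative)."""
    object_id: int
    frame: int
    timestamp: float     # seconds since the start of the stream
    classification: str  # e.g. "person", "vehicle"
    bbox: tuple          # (x, y, width, height) in pixels
    velocity: tuple      # (vx, vy) in pixels per frame
    extras: dict = field(default_factory=dict)  # color, texture, etc.
```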
In block 31, one or more objects of interest are identified in terms of video primitives or their abstractions. Examples of one or more objects include: an object, a person, a red object, two objects, two persons, and a vehicle.
In block 32, one or more spatial regions of interest are identified. The regions represent one or more portions of an image from a source video or a spatial portion of a scene being viewed by a video sensor. Regions also include combinations of regions from multiple scenes and/or images. The region may be an image-based space (e.g., a line, rectangle, polygon, or circle in a video image), or a three-dimensional space (e.g., a cube, or a region of floor space in a building).
FIG. 12 illustrates identified areas along an aisle in a grocery store. Four areas are identified: coffee, carbonated beverages, snacks, and bottled water. These areas are identified by means of a point-and-click interface of the system.
In block 33, one or more temporal attributes of interest are optionally identified. Examples of temporal attributes include: every 15 minutes, between 9:00pm and 6:00am, less than 5 minutes, longer than 30 seconds, weekends, and within 20 minutes.
In block 34, a response is optionally identified. Examples of responses include the following: activating a visual and/or audio alarm on a display of the system, activating a visual and/or audio alarm system located at the site, activating a silent alarm, activating a quick response mechanism, locking a door, contacting a security service, forwarding data (e.g., image data, video primitives, and/or analyzed data) to another computer system over a network such as the internet, saving the data onto a designated computer-readable medium, activating some other sensor or monitoring system, assigning tasks to computer system 11 and/or another computer system, and directing computer system 11 and/or another computer system.
In block 35, one or more discriminators are identified by describing the interaction between the video primitives (or abstractions thereof), the spatial region of interest, and the temporal attribute of interest. The interaction is determined for a combination of the one or more objects defined in block 31, the one or more spatial regions of interest defined in block 32, and the one or more temporal attributes of interest defined in block 33. One or more responses identified in block 34 are optionally associated with each event discriminator.
Examples of event discriminators for a single object include: an object appears; a person appears; and a red object moves faster than 10 m/s.
Examples of event discriminators for multiple objects include: two objects together, a person in a car, and a red object moving behind a blue object.
Examples of event discriminators for an object and a spatial attribute include: an object crosses a line; an object enters an area; and a person crosses a line from the left.
Examples of event discriminators for an object and a temporal attribute include: an object appears at 10:00 p.m.; a person moves faster than 2 m/s between 9:00 a.m. and 5:00 p.m.; and a vehicle appears on the weekend.
Examples of event discriminators for an object, a spatial attribute, and a temporal attribute include: a person crosses an area between midnight and 6:00 a.m.; and a vehicle stops in an area for longer than 10 minutes.
An example of an event discriminator for an object, a spatial attribute, and a temporal attribute associated with a response is: a person enters an area between midnight and 6:00 a.m., and a security service is notified.
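In code, pairing a discriminator with a response might look like the sketch below. The predicate fields, the `notify_security` stand-in, and the primitive record keys are all hypothetical; a deployed system would wire the response to the alarm mechanisms described in block 34.

```python
def notify_security(description):
    # Stand-in response; a real system might page a guard or sound an alarm.
    print(f"ALERT: {description}")

rules = [
    # (description, predicate over one primitive record, response)
    ("person enters area A between midnight and 6:00 a.m.",
     lambda p: p["class"] == "person" and p["area"] == "A" and 0 <= p["hour"] < 6,
     notify_security),
]

def check(primitive):
    for description, predicate, respond in rules:
        if predicate(primitive):
            respond(description)

check({"class": "person", "area": "A", "hour": 2})  # fires the alert
```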
In block 24 of fig. 2, the video surveillance system is operated. The video surveillance system of the present invention operates automatically, detects and archives video primitives of objects in a scene, and detects event occurrences in real-time using event discriminators. Further, where appropriate, actions are taken in real time, such as activating an alarm, generating a report, generating an output, and so forth. The reports and outputs may be displayed and/or stored locally on the system, or displayed and/or stored elsewhere over a network such as the internet. Fig. 4 shows a flow chart for operating a video surveillance system.
In block 41, the computer system 11 obtains source video from the video sensor 14 and/or the video recorder 15.
In block 42, video primitives are extracted from the source video in real-time. As an option, non-video primitives may be obtained and/or extracted from one or more other sensors 17 and used with the present invention. Figure 5 shows the extraction of video primitives.
Fig. 5 shows a flow diagram for extracting video primitives for a video surveillance system. Blocks 51 and 52 operate in parallel and may be performed in any order or simultaneously. In block 51, an object is detected by motion. For this block, any motion detection algorithm for detecting motion between frames at the pixel level may be used. As an example, a three frame difference technique may be used, which is discussed in {1 }. The detected object is forwarded to block 53.
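For illustration, the three-frame-difference idea from {1} can be rendered compactly with OpenCV. This sketch assumes grayscale uint8 frames and an arbitrary threshold; it is a minimal rendering of the technique, not the patented pipeline.

```python
import cv2

def three_frame_difference(f0, f1, f2, thresh=25):
    """Motion mask for the middle of three consecutive grayscale frames.

    A pixel counts as moving only if it differs from BOTH the previous
    and the next frame, which suppresses the ghosting that plain
    two-frame differencing leaves behind.
    """
    d1 = cv2.absdiff(f1, f0)
    d2 = cv2.absdiff(f2, f1)
    motion = cv2.bitwise_and(d1, d2)          # moving in both comparisons
    _, mask = cv2.threshold(motion, thresh, 255, cv2.THRESH_BINARY)
    return mask
```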
In block 52, an object is detected via change. Any change detection algorithm for detecting changes from a background model may be used for this block. An object is detected in this block if one or more pixels in a frame are deemed to be in the foreground of the frame because the pixels do not conform to the background model of the frame. As an example, a stochastic background modeling technique, such as the dynamically adaptive background subtraction described in {1} and in U.S. patent application No. 09/694,712, filed October 24, 2000, may be used. The detected object is forwarded to block 53.
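A simple stand-in for such a change detector is a running-average background model, sketched below. The learning rate, threshold, and class name are assumptions; this is not the dynamically adaptive algorithm of the cited application, only a minimal illustration of detecting change against an adapting background.

```python
import cv2
import numpy as np

class AdaptiveBackground:
    """Minimal change detector against a slowly adapting background model."""

    def __init__(self, first_frame, alpha=0.02, thresh=30):
        # first_frame: grayscale uint8 image used to seed the model
        self.bg = first_frame.astype(np.float32)
        self.alpha = alpha      # assumed learning rate
        self.thresh = thresh    # assumed foreground threshold

    def apply(self, frame):
        diff = cv2.absdiff(frame, self.bg.astype(np.uint8))
        _, fg = cv2.threshold(diff, self.thresh, 255, cv2.THRESH_BINARY)
        # Adapt the background only where no change was detected, so that
        # foreground objects are not absorbed into the model.
        cv2.accumulateWeighted(frame, self.bg, self.alpha,
                               mask=cv2.bitwise_not(fg))
        return fg
```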
The motion detection technique of block 51 and the change detection technique of block 52 are complementary techniques, each of which advantageously remedies a deficiency in the other technique, and as an option, additional and/or alternative detection schemes may be used with respect to the techniques discussed with respect to blocks 51 and 52. Examples of additional and/or alternative detection schemes include the following: pfinder detection scheme, skin color detection scheme, face detection scheme, and model-based detection scheme for finding a person as described in {8 }. The results of these additional and/or alternative detection schemes are provided to block 53.
As an option, if the video sensor 14 is capable of motion (e.g., a camera that can scan, zoom, and/or translate), an additional block may be inserted before blocks 51 and 52 to provide input to blocks 51 and 52 for video stabilization. Video stabilization may be achieved by affine or projective global motion compensation. For example, the image alignment described in U.S. patent application No. 09/609,919, filed July 3, 2000, which is incorporated herein by reference, may be used to obtain video stabilization.
In block 53, blobs are generated. In general, a blob is any object in a frame. Examples of blobs include: a moving object, such as a person or a vehicle; and a consumer product, such as a piece of furniture, a clothing item, or a retail shelf item. Blobs are generated using the detected objects from blocks 51 and 52. Any technique for generating blobs may be used for this block. An exemplary technique for generating blobs from motion detection and change detection uses a connected-components scheme. For example, the morphology and connected-components algorithm described in {1} may be used.
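In OpenCV terms, that step might look like the following sketch; the structuring-element size and minimum blob area are illustrative assumptions.

```python
import cv2

def extract_blobs(mask, min_area=50):
    """Turn a binary foreground mask into blobs via morphology plus
    connected components, in the spirit of the scheme in {1}.
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    clean = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # drop speckle
    clean = cv2.morphologyEx(clean, cv2.MORPH_CLOSE, kernel)  # fill small holes
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(clean)
    blobs = []
    for i in range(1, n):  # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            blobs.append({"bbox": (x, y, w, h),
                          "centroid": tuple(centroids[i])})
    return blobs
```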
In block 54, the blobs are tracked to obtain tracked objects. Any technique for tracking blobs may be used for this block. For example, Kalman filtering or the CONDENSATION algorithm may be used. As another example, the template matching technique described in {1} may be used. As a further example, the multi-hypothesis Kalman tracker described in {5} may be used. As yet another example, the frame-to-frame tracking technique described in U.S. patent application No. 09/694,712, filed October 24, 2000, may be used. For the example in which the site is a grocery store, examples of objects that may be tracked include moving people, inventory items, and inventory-moving equipment, such as shopping carts or trolleys.
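As a toy illustration of frame-to-frame association (far simpler than the Kalman, CONDENSATION, or template-matching options named above), the greedy nearest-neighbor tracker below links blob centroids across frames; the gating distance and class name are assumptions.

```python
import math

class NearestNeighborTracker:
    """Assign each blob centroid to the nearest existing track (illustrative)."""

    def __init__(self, max_dist=50.0):
        self.max_dist = max_dist  # assumed gating distance in pixels
        self.tracks = {}          # track id -> last centroid (x, y)
        self.next_id = 0

    def update(self, centroids):
        """centroids: list of (x, y) tuples from the current frame.

        Returns a mapping of centroid -> track id, creating new tracks
        for blobs with no sufficiently close predecessor.
        """
        assigned = {}
        for c in centroids:
            best, best_d = None, self.max_dist
            for tid, prev in self.tracks.items():
                d = math.dist(c, prev)
                if d < best_d and tid not in assigned.values():
                    best, best_d = tid, d
            if best is None:          # no track close enough: start a new one
                best, self.next_id = self.next_id, self.next_id + 1
            assigned[c] = best
        self.tracks = {tid: c for c, tid in assigned.items()}
        return assigned
```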
As an option, blocks 51-54 can be replaced with any detection and tracking scheme known to those of ordinary skill in the art. An example of such a detection and tracking scheme is described in {11 }.
In block 55, each trajectory of a tracked object is analyzed to determine whether the trajectory is significant. If the trajectory is not significant, it represents an object exhibiting erratic motion or unstable size or color, and the corresponding object is rejected and not analyzed further. If the trajectory is significant, it represents an object of potential interest. Whether a trajectory is significant is determined by applying a significance measure to the trajectory. Techniques for determining whether a trajectory is significant are described in {13} and {18}.
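A crude stand-in for such a significance measure compares net displacement against total path length, on the intuition that purposeful motion actually covers ground. The thresholds below are invented for the sketch; {13} and {18} describe principled measures.

```python
import math

def is_significant(track, min_displacement=40.0, min_purpose=0.6):
    """Heuristic significance test over a track of (x, y) positions.

    A track passes if its net displacement is large and makes up a
    large fraction of the total path length (i.e., not erratic motion).
    """
    (x0, y0), (x1, y1) = track[0], track[-1]
    net = math.hypot(x1 - x0, y1 - y0)
    path = sum(math.hypot(bx - ax, by - ay)
               for (ax, ay), (bx, by) in zip(track, track[1:]))
    return net >= min_displacement and path > 0 and net / path >= min_purpose
```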
In block 56, each object is classified. The general type of each object is determined as the classification of the object. The classification may be performed by a variety of techniques, and examples of such techniques include the use of a neural network classifier {14} and the use of a linear discriminative classifier {14 }. Examples of classifications are the same as those discussed for block 23.
In block 57, video primitives are identified using the information from blocks 51-56 and additional processing as necessary. Examples of the identified video primitives are the same as those discussed with respect to block 23. As an example, for size, the system can use information obtained from the calibration in block 22 as a video primitive; from the calibration, the system has sufficient information to determine the approximate size of an object. As another example, the system may use the velocity measured in block 54 as a video primitive.
In block 43, the video primitives from block 42 are archived. The video primitives may be archived in computer readable medium 13 or another computer readable medium. Along with the video primitives, the relevant frames or video images from the source video may be archived.
In block 44, event occurrences are extracted from the video primitives using the event discriminators. The video primitives are determined in block 42, and the event discriminators are determined from assigning tasks to the system in block 23. The event discriminators are used to filter the video primitives to determine whether any event occurrences took place. For example, an event discriminator may look for a "wrong way" event, defined as a person traveling the "wrong way" into an area between 9:00 a.m. and 5:00 p.m. The event discriminator examines all video primitives generated in accordance with FIG. 5 and determines whether any video primitives exist that have the following properties: a timestamp between 9:00 a.m. and 5:00 p.m., a classification of "person" or "group of people", a position inside the area, and a "wrong" direction of motion.
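A sketch of that filtering step over archived primitives follows; the record keys and the encoding of "wrong way" as a direction string are assumptions for illustration.

```python
from datetime import time

def find_event_occurrences(primitives, area,
                           start=time(9, 0), end=time(17, 0),
                           wrong_direction="north"):
    """Scan primitive records for the 'wrong way' event described above:
    a person (or group) inside `area`, between 9:00 a.m. and 5:00 p.m.,
    moving in the wrong direction.
    """
    hits = []
    for p in primitives:
        if (p["class"] in ("person", "group of people")
                and p["area"] == area
                and start <= p["time_of_day"] <= end
                and p["direction"] == wrong_direction):
            hits.append(p)
    return hits
```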
In block 45, the required action is taken for each event occurrence extracted in block 44. FIG. 6 shows a flow chart for taking action with a video surveillance system.
In block 61, responses are undertaken as dictated by the event discriminators that detected the event occurrences. The responses, if any, were identified for each event discriminator in block 34.
In block 62, an activity record is generated for each event occurrence. The activity record includes, for example: details of the trajectory of the object; the time at which the object was detected; the position at which the object was detected; and a description or definition of the event discriminator that was employed. The activity record may include information, such as video primitives, needed by the event discriminator. The activity record may also include representative video or still images of the object(s) and/or area(s) involved in the event occurrence. The activity record is stored on a computer-readable medium.
In block 63, an output is generated. The output is based on the event occurrences extracted in block 44 and the direct feed of the source video from block 41. The output is stored in a computer readable medium, displayed on the computer system 11 or another computer system, or forwarded to another computer system. Information relating to the occurrence of an event is collected as the system operates, and may be observed by an operator at any time, including in real-time. Examples of formats for receiving information include: a display on a monitor of a computer system, a hard copy, a computer readable medium, and an interactive web page.
The output may include a display of the direct feed of the source video from block 41. For example, the source video may be displayed in a window on the monitor of a computer system or on a closed-circuit monitor. Further, the output may include the source video graphically marked up to highlight the objects and/or areas involved in the event occurrence.
The output may include one or more reports to the operator based on the operator and/or the requirements of the event occurrence. Examples of reports include: the number of occurrences of the event that occurred, the location in the scene where the event occurred, the time when the event occurred, the representative image of each event occurrence, the representative video of each event occurrence, raw statistics, statistics of the event occurrences (e.g., how many, how often, where, and when), and/or a human readable graphical display.
FIGS. 13 and 14 show typical reports for the aisle in the grocery store of FIG. 15. In FIGS. 13 and 14, several areas are identified in block 32 and are labeled accordingly in the images. The areas in FIG. 13 match those in FIG. 12, while the areas in FIG. 14 are different areas. The system is assigned the task of looking for people who stop in the areas.
In FIG. 13, a typical report is an image from the marked-up video, including labels, graphics, statistics, and analysis of the statistics. For example, the area identified as coffee has statistics of an average of 2 customers per hour in the area and an average dwell time in the area of 5 seconds. The system deems this area a "cold" area, meaning that there is not much commercial activity through this area. As another example, the area identified as carbonated beverages has statistics of an average of 15 customers per hour in the area and an average dwell time in the area of 22 seconds. The system deems this area a "hot" area, meaning that there is a large amount of commercial activity in this area.
In FIG. 14, a typical report is likewise an image from the marked-up video, including labels, graphics, statistics, and analysis of the statistics. For example, the area at the back of the aisle has an average of 14 customers per hour and is determined to have low traffic. As another example, the area at the front of the aisle has an average of 83 customers per hour and is determined to have high traffic.
With respect to either FIG. 13 or FIG. 14, if the operator desires more information about any particular area, a point-and-click interface allows the operator to navigate to representative still and video images of the regions and/or activities that the system has detected and archived.
FIG. 15 illustrates another exemplary report for an aisle in a grocery store. The exemplary report includes an image from the marked-up video, including labels and trajectory indications, and text describing the marked-up image. In this example, the system is tasked with looking for several things: the length, position, and time of an object's trajectory; the time and position at which an object was stationary; the correlation of trajectories to areas specified by the operator; and the classification of an object as not a person, one person, two people, or three or more people.
The video image of FIG. 15 is from the time period over which the trajectories were recorded. Of the three objects, two are each classified as one person, and one is classified as not a person. Each object is assigned a label: person ID 1032, person ID 1033, and object ID 32001. For person ID 1032, the system determines that the person stayed in the area for 52 seconds and at the position specified by the circle for 18 seconds. For person ID 1033, the system determines that the person stayed in the area for 1 minute and 8 seconds and at the position specified by the circle for 12 seconds. The trajectories of person ID 1032 and person ID 1033 are included in the marked-up image. For object ID 32001, the system does not analyze the object further and marks its position with an X.
Returning to block 22 in FIG. 2, calibration may be (1) manual, (2) semi-automatic using images from a video sensor or a video recorder, or (3) automatic using images from a video sensor or a video recorder. If images are required, it is assumed that the source video to be analyzed by the computer system 11 comes from the video sensor that obtained the source video used for calibration.
For manual calibration, the operator provides to the computer system 11 the orientation and internal parameters of each video sensor 14, as well as the displacement of each video sensor 14 relative to the site. The computer system 11 may optionally maintain a map of the location and may represent the displacement of the video sensor 14 on the map. The map may be a two-dimensional or three-dimensional display of the environment. In addition, manual calibration provides the system with sufficient information to determine the approximate size and relative position of the object.
Alternatively, for manual calibration, the operator may mark the video image from the sensor with a graphic representing the appearance of an object of known size, such as a human. If the operator can mark the images at least two different locations, the system can infer approximate camera calibration information.
For both semi-automatic and automatic calibration, no knowledge of the camera parameters or scene geometry is required. From semi-automatic and automatic calibration, a lookup table is generated to approximate the size of an object at various areas in the scene, or the internal and external camera calibration parameters of the camera are inferred.
For semi-automatic calibration, the video surveillance system is calibrated using a video source combined with input from the operator. A single person is placed in the field of view of the video sensor to be semi-automatically calibrated. The computer system 11 receives source video regarding the single person and automatically infers the size of a person from this data. The accuracy of the semi-automatic calibration increases with the number of locations in the field of view of the video sensor at which the person is observed, and with the period of time over which the person is observed.
Fig. 7 shows a flow chart of a semi-automatic calibration of a video surveillance system. Block 71 is the same as block 41 except that the representative objects traverse the scene in a variety of trajectories. A typical object may have multiple speeds and may be stationary at multiple locations. For example, a typical object is as close to the video sensor as possible and then as far away from the video sensor as possible. This movement of the representative object may be repeated if desired.
Blocks 72-75 are the same as blocks 51-54, respectively.
In block 76, the typical object is monitored throughout the scene. It is assumed that the only (or at least the most stable) object being tracked is the calibration object in the scene (i.e., the typical object moving through the scene). The size of the stable object is collected at each point in the scene at which it is observed, and this information is used to generate calibration information.
In block 77, the typical size of the typical object is identified for different areas throughout the scene. The size of the typical object is used to determine the approximate size of similar objects at various areas in the scene. With this information, a lookup table is generated that matches typical apparent sizes of the typical object in various areas of the image, or the internal and external camera calibration parameters are inferred. As a sample output, a display marks, in various areas of the image, what the system determined to be an appropriate height; such output is depicted in FIG. 11.
For automatic calibration, a learning phase is conducted in which the computer system 11 determines information regarding the site in the field of view of each video sensor. During automatic calibration, the computer system 11 receives source video of the site for a representative period of time (e.g., minutes, hours, or days) sufficient to obtain a statistically significant sample of the typical objects in the scene; typical apparent sizes and locations are inferred from this sample.
FIG. 8 shows a flow chart for automatic calibration of a video surveillance system. Blocks 81-86 are the same as blocks 71-76 in fig. 7.
In block 87, trackable regions in the field of view of the video sensor are identified. A trackable region refers to a region in the field of view of the video sensor in which an object can be easily and/or accurately tracked. An untrackable region refers to a region in the field of view of the video sensor in which an object is not easily and/or accurately tracked and/or is difficult to track. An untrackable region may be referred to as an unstable or insignificant region. An object may be difficult to track because it is too small (e.g., smaller than a predetermined threshold), appears for too short a time (e.g., less than a predetermined threshold), or exhibits motion that is not significant (e.g., not purposeful). A trackable region may be identified using, for example, the techniques described in {13}.
FIG. 10 illustrates trackable regions determined for an aisle in a grocery store. The region at the far end of the aisle is determined to be insignificant because too many confusers appear in this region. A confuser refers to something in the video that confuses the tracking scheme. Examples of confusers include: leaves blowing, rain, a partially occluded object, and an object that appears for too short a time to be tracked accurately. In contrast, the region at the near end of the aisle is determined to be significant because good tracks are determined in this region.
In block 88, the typical sizes of objects are identified for different areas throughout the scene. The sizes of the objects are used to determine the approximate sizes of similar objects at various areas in the scene. A technique, such as using a histogram or a statistical median, is used to determine the typical apparent height and width of objects as a function of location in the scene. In different portions of the image of the scene, a typical object can have different typical apparent heights and widths. With this information, a lookup table is generated that matches typical apparent sizes of objects in various areas of the image, or the internal and external camera calibration parameters can be inferred.
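One illustrative way to build such a lookup table is to histogram the observed blob heights per image region and take the mode, as sketched below; the region indexing, bin count, and function name are assumptions.

```python
import numpy as np

def typical_height_per_region(observations, n_regions, bins=64):
    """Build a lookup table of typical apparent height per image region.

    `observations` is an iterable of (region_index, height_px) pairs;
    the mode (histogram peak) of the heights seen in each region is
    taken as the typical apparent height there.
    """
    per_region = [[] for _ in range(n_regions)]
    for region, height in observations:
        per_region[region].append(height)
    table = {}
    for region, heights in enumerate(per_region):
        if not heights:
            continue   # no data for this region
        counts, edges = np.histogram(heights, bins=bins)
        peak = counts.argmax()
        table[region] = 0.5 * (edges[peak] + edges[peak + 1])  # bin center
    return table
```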
FIG. 11 illustrates identifying typical sizes for typical objects in the aisle of the grocery store from FIG. 10. The typical object is assumed to be a person and is labeled accordingly. The typical size of a person is determined from graphs of the average height and average width of the people detected in the significant region. In this example, curve A is determined for the average height of an average person, and curve B is determined for the average width of one person, two people, and three people.
For curve A, the x-axis represents the height of a blob in pixels, and the y-axis represents the number of instances of a particular height, as identified on the x-axis, that occur. The peak of curve A corresponds to the most common height of blobs in the designated region of the scene, which in this example corresponds to the average height of a person standing in the designated region.
Assuming that people travel in loosely knit groups, a graph similar to curve A is generated for width as curve B. For curve B, the x-axis represents the width of blobs in pixels, and the y-axis represents the number of instances of a particular width, as identified on the x-axis, that occur. The peaks of curve B correspond to the average widths of blobs. Assuming that most groups contain only one person, the largest peak corresponds to the most common width, which corresponds to the average width of a single person in the designated region. Similarly, the second-largest peak corresponds to the average width of two people in the designated region, and the third-largest peak corresponds to the average width of three people in the designated region.
FIG. 9 shows another flow diagram of the video surveillance system of the present invention. In this additional embodiment, the system analyzes archived video primitives with event discriminators to generate additional reports, for example, without needing to review the entire source video. At any time after a video source has been processed in accordance with the present invention, the video primitives for the source video are archived in block 43 of FIG. 4. The video content can be reanalyzed with this additional embodiment in a relatively short time, because only the video primitives are reviewed and the video source is not reprocessed. This provides a great efficiency improvement over existing systems, because processing video image data is extremely computationally expensive, whereas analyzing the small-sized video primitives extracted from the video is extremely computationally inexpensive. As an example, the following event discriminator can be generated: "the number of people who stopped in area A for more than 10 minutes during the past two months." With this additional embodiment, the source video from the past two months does not need to be reviewed. Instead, only the video primitives from the past two months need to be reviewed, which is a much more efficient process.
Block 91 is the same as block 23 in fig. 2.
In block 92, archived video primitives are accessed. In block 43 of fig. 4, the video primitives are archived.
Blocks 93 and 94 are the same as blocks 44 and 45 in fig. 4.
As a typical application, the present invention can analyze retail market space by measuring the efficacy of a retail display. Large sums of money are spent on retail displays in an effort to be as eye-catching as possible and thereby promote sales of both the items on display and ancillary items. The video surveillance system of the present invention can be configured to measure the efficacy of these retail displays.
For this typical application, the video surveillance system is set up by orienting the field of view of a video sensor toward the space around the retail display of interest. During task assignment, the operator selects an area representing the space around the retail display of interest. As a discriminator, the operator defines that he or she wishes to monitor person-sized objects that enter the area and exhibit either a measurable reduction in speed or a stop for an appreciable amount of time.
After operating for some period of time, the video surveillance system can provide reports for market analysis. The reports can include: the number of people who slowed down around the retail display; the number of people who stopped at the retail display; a breakdown of the people who were interested in the retail display as a function of time, such as how many people were interested on weekends and how many were interested in the evenings; and video snapshots of the people who showed interest in the retail display. The market research information obtained from the video surveillance system can be combined with sales information and customer records from the store to improve analysts' understanding of the efficacy of the retail display.
The embodiments and examples discussed herein are non-limiting examples.
Having described the invention in detail and by reference to preferred embodiments thereof, it will now be apparent from the foregoing description that changes and modifications may be made by those skilled in the art without departing from the invention in its broader aspects and, therefore, the invention is intended to cover all such changes and modifications as fall within the true spirit of the invention, as defined by the appended claims.

Claims (32)

1. A video surveillance method, comprising:
identifying one or more user-defined event discriminators;
extracting video primitives from the video, wherein each video primitive is an observed attribute of an object in the video; and
extracting event occurrences from the video primitives using at least one of the one or more user-defined event discriminators.
2. The method of claim 1, further comprising archiving the extracted video primitives.
3. The method of claim 1, further comprising responding based on the extracted event occurrence.
4. A method according to claim 3, wherein responding comprises activating an additional sensor system.
5. The method of claim 1, further comprising calibrating a video surveillance system used to perform the method.
6. The method of claim 5, wherein said calibrating comprises self-calibrating the video surveillance system.
7. The method of claim 6, wherein the self-calibrating comprises:
detecting at least one object in a source video; and
the object is tracked.
8. The method of claim 7, wherein detecting at least one object comprises:
detecting at least one object by means of a motion of the object; and
at least one object is detected by a change in the background model.
9. The method of claim 5, wherein said calibrating comprises:
identifying a trackable area; and
typical dimensions of typical objects are identified.
10. The method of claim 5, wherein said calibrating comprises:
manual calibration;
semi-automatic calibration; and
automatic calibration.
11. The method of claim 1, further comprising assigning tasks to a video surveillance system using the user-defined event discriminator.
12. The method of claim 11, wherein assigning the task comprises identifying at least one object.
13. The method of claim 11, wherein assigning the task comprises identifying at least one spatial region.
14. The method of claim 11, wherein assigning the task comprises identifying at least one time attribute.
15. The method of claim 11, wherein assigning the task comprises identifying at least one interactive action.
16. The method of claim 11, wherein assigning the task comprises identifying at least one alarm.
17. The method of claim 1, wherein the video primitives are from at least one of a video sensor and an additional sensor.
18. The method of claim 1 wherein the video primitives are retrieved from an archive of video primitives.
19. The method of claim 1, wherein the video surveillance is performed using a computer system.
20. The method of claim 1, further comprising:
detecting an object in a video to obtain a detected object;
tracking the detected object to obtain a tracked object; and
classifying the tracked object to obtain a classified object;
wherein video primitives are extracted from the video based on the classified objects.
21. A method of video surveillance, comprising:
identifying one or more user-defined event discriminators;
accessing archived video primitives extracted from the video; and
extracting event occurrences from the accessed archived video primitives using at least one of the one or more user-defined event discriminators.
22. The method of claim 21, further comprising responding based on the extracted event occurrence.
23. The method of claim 21, further comprising:
detecting an object in a video to obtain a detected object;
tracking the detected object to obtain a tracked object; and
classifying the tracked object to obtain a classified object;
wherein video primitives are extracted from the video based on the classified objects.
24. An apparatus for a video surveillance system, the apparatus comprising:
identifying means for identifying one or more user-defined event discriminators;
first extracting means for extracting video primitives from the video; and
second extracting means for extracting event occurrences from the extracted video primitives using at least one of the one or more user-defined event discriminators.
25. The apparatus of claim 24, wherein the apparatus comprises special purpose hardware emulating a computer and/or software, the special purpose hardware performing the operations of the identifying means, the first extracting means, and the second extracting means.
26. An apparatus for performing video surveillance, the apparatus comprising:
identifying means for identifying one or more user-defined event discriminators;
accessing means for accessing archived video primitives extracted from the video; and
extracting means for extracting event occurrences from the accessed video primitives using at least one of the one or more user-defined event discriminators.
27. The apparatus of claim 26, wherein the apparatus comprises special purpose hardware emulating a computer and/or software, the special purpose hardware performing the operations of the identifying means, the accessing means, and the extracting means.
28. A special purpose hardware for performing video surveillance, the special purpose hardware comprising:
identifying means for identifying one or more user-defined event discriminators;
first extracting means for extracting video primitives from the video; and
second extracting means for extracting event occurrences from the extracted video primitives using at least one of the one or more user-defined event discriminators.
29. The special purpose hardware as recited in claim 28, further comprising: self-calibration means for performing self-calibration.
30. The special purpose hardware according to claim 28, wherein said second extracting means is adapted to extract event occurrences based on video primitives and non-video primitives.
31. The special purpose hardware according to claim 28, wherein the at least one user-defined event discriminator comprises at least two of: an object, a spatial region, a temporal attribute, an interaction, and an alarm.
32. The special purpose hardware according to claim 28, wherein the at least one user-defined event discriminator defines interactions between one or more video primitives, one or more spatial regions of interest, and/or one or more temporal attributes of interest.
HK05106910.9A 2001-11-15 2002-07-17 Video surveillance system and method employing video primitives HK1073375B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US09/987,707 2001-11-15
US09/987,707 US20050146605A1 (en) 2000-10-24 2001-11-15 Video surveillance system employing video primitives
PCT/US2002/022688 WO2003044727A1 (en) 2001-11-15 2002-07-17 Video surveillance system employing video primitives

Publications (2)

Publication Number Publication Date
HK1073375A1 (en) 2005-09-30
HK1073375B true HK1073375B (en) 2009-07-17


Similar Documents

Publication Publication Date Title
CN100433048C (en) Video surveillance system and method employing video primitives
US10347101B2 (en) Video surveillance system employing video primitives
CN105120221B (en) Using the video monitoring method and system of video primitives
CN101310288B (en) Video surveillance system employing video primitives
EP1872583B1 (en) Method of video processing, computer readable medium containing instructions implementing said method and video processing system.
HK1073375B (en) Video surveillance system and method employing video primitives
HK1116969B (en) Method of video processing, computer readable medium containing instructions implementing said method and video processing system