
HK1150284B - Real-time body segmentation system - Google Patents


Info

Publication number: HK1150284B
Application number: HK11104314.8A
Authority: HK (Hong Kong)
Prior art keywords: feature, features, rectangular blocks, error, boosting
Other languages: Chinese (zh)
Other versions: HK1150284A1 (en)
Inventors: 颜庆义, 李宏亮
Original Assignee: 香港中文大学
Priority claimed from: U.S. patent application 12/044,416 (US8233676B2)

Description

Real-time body segmentation system
Cross Reference to Related Applications
This application is based on and claims the benefit of U.S. patent application No. 12/044,416, filed on March 7, 2008, the entire contents of which are incorporated by reference into this application.
Technical Field
The present invention relates to three-dimensional data analysis and, more particularly, to object segmentation for pattern recognition and computer vision applications.
Background
Object segmentation is a key technique for semantic object extraction and is useful in digital video processing, pattern recognition and computer vision. The task of segmenting/tracking three-dimensional images in the form of video objects occurs in many applications such as video surveillance and monitoring, video summarization and indexing, and digital entertainment. Example applications include:
video surveillance, where the segmentation results are used to allow identification of intruders or abnormal situations and to help predict and reveal patterns of actions and interactions in the environment to determine when a security unit should be notified of an "alarm".
Content-based video summarization, e.g. sports summary, video preview, video pattern mining, i.e. work requiring segmented semantic objects to perform content classification, representation or understanding.
Content-based coding applications, wherein each frame of a video sequence is segmented into semantically meaningful objects having arbitrary shapes.
Computer vision, such as video matting, video "artistic stylization" and rendering, where segmented two-dimensional objects from an input image or video sequence can be used for 3D scene reconstruction.
Video conferencing and video telephony applications, where segmentation enables better quality by encoding the most relevant objects at higher quality.
Digital entertainment, where some specific objects can be replaced by segmentation, such as video games.
Other possible applications include industrial inspection, environmental monitoring, or association of metadata with segmented objects, etc.
Human image object segmentation is generally considered a key step in human recognition, behavioral analysis, and human-computer communication. Human objects extracted from images can be applied in many fields, such as video surveillance, computer vision and video entertainment. For example, the extracted human object can be used to identify suspicious behavior, helping to discover problematic behavior and alert a security center to possible hazards.
Generally, object segmentation can be divided into two phases: detection of the desired object, related to pattern recognition, and object extraction, related to clustering techniques. In the detection phase, object segmentation can be performed in two ways, supervised and unsupervised. However, it is often difficult to find the desired object automatically (unsupervised) because of varying object characteristics such as color, intensity, shape and contour. To avoid false detection when segmenting the object of interest, many interactive methods have been developed that require the user to define the desired object in advance. These methods can generally provide better segmentation performance than automatic methods, since the complex steps of object detection are avoided at the expense of interactive effort on the part of the user.
To meet future content-based multimedia services, there is an urgent need to segment meaningful objects in an unsupervised manner in real-world scenarios.
Many video segmentation methods can be found in the literature and typically use spatial and temporal information. The spatial segmentation method divides each frame into homogeneous regions with respect to color or intensity. Typical partitioning methods are generally divided into region-based methods, boundary-based methods, and classification-based methods.
Spatial segmentation methods involving region growing, splitting and merging rely on the homogeneity of local features (e.g., color, texture, motion and other pixel statistics). Temporal segmentation methods use the dominant gradient information to locate object boundaries. In classification-based approaches, a partition of the feature space is first created and then mapped back onto the video signal. Such methods may combine cues such as texture, color, motion, and depth. Spatial segmentation methods can produce relatively accurate object boundaries. However, since the segmentation must be performed over the entire image for each frame, the computational complexity is high, limiting these methods to non-real-time applications. Furthermore, the main problem with spatially based methods is the lack of robustness in "corrupted" situations, such as noisy or blurred video images, where region boundaries are often missing or mixed with other regions.
On the other hand, temporal segmentation uses motion rather than spatial information to obtain the initial position and boundary of an object. The so-called change detection mask (mask) is the most common form of motion information incorporated into the segmentation process. Because the object of interest is usually moving, change detection can be done on an inter-frame or background frame basis. Due to image noise, the boundaries of objects are often irregular and must be refined using spatial information of the image. Higher efficiency is obtained since the boundary refinement procedure only contains segmented motion regions, rather than the entire frame. However, shadowing effects, reflections and noise may be erroneously assigned to foreground objects. Moreover, it is often difficult to discern between changes caused by real object motion and changes caused by noise, shadowing effects, and the like.
Since the object of interest generally corresponds to multiple regions that may exhibit many spatio-temporal variations, most existing video image segmentation techniques cannot extract objects automatically. It is difficult to segment these objects automatically without using any prior segmentation criteria. "Blind segmentation" algorithms make no assumptions about the environment of the segmented objects, with the inherent problem that the object of interest may have no homogeneity with respect to low-level characteristics, or may change with environmental factors (e.g., lighting conditions).
For these and other reasons, there is a need for improved object segmentation that is applicable to dynamic human body shapes.
Disclosure of Invention
In a human feature recognition system that is intended to provide substantially real-time recognition of human body parts, various methods and structures are provided to facilitate real-time recognition with reduced computational requirements, including a face detection module, a body detection module, and a boundary matting module. In particular embodiments, the face detection module uses an active boosting procedure. In another embodiment, the face detection module uses a lazy boosting procedure on a hybrid cascade structure to speed up object detection. The hybrid cascade structure is a tree structure in which one class of nodes represents strong classifiers learned from active boosting, another class of classifiers is obtained by lazy boosting with low computation load, and weak classifiers are obtained from previous layers.
A useful contribution of the present invention is a feature-based rejector that can reject non-face samples while detecting valid faces. The rejector uses non-normalized Haar transforms to assist the rejection process. Other mechanisms include active boosting and lazy boosting, as well as use of foreground features from previous frames. Particular embodiments provide coarse and fine segmentation and a trimap generated automatically using energy-minimization techniques. Other features include methods based on algorithms developed specifically for real-time segmentation.
Among other advantages, the present invention enables objects to be operated independently, thereby enabling a suitable encoding algorithm to be used for each object, resulting in an increase in subjective quality.
The present invention is applicable to prospective content-based real-time multimedia services offered by telecommunication companies and by service industries such as banks, hotels and security agencies. For example, carriers can incorporate the technology into products to provide users with better multimedia services: better visual quality through content-based coding, video content browsing (e.g., television or movie shows) and retrieval, video games, and so on. The present invention also provides a key technology for video surveillance applications, in which segmented objects (people) can be identified and tracked for specific users such as banks, supermarkets and hotels. Furthermore, as an important technology in the field of pattern recognition, the results of the present solution can also be used for facial recognition in passports, identification/credit cards or other photo-bearing certificates.
The product based on the invention can directly enhance the competitiveness of real-time multimedia services and improve the technical level of local telecommunication industry.
The invention will be better understood by reference to the following detailed description and the accompanying drawings.
Drawings
FIG. 1A is a block diagram of an embodiment of an object segmentation system according to the present invention;
FIG. 1B is a flow chart illustrating operation of a system according to an embodiment of the present invention;
FIG. 2 is a simplified flow diagram of a face detection procedure;
FIG. 3 is a diagram illustrating an example of non-Normalized Haar Transform (NHT) coefficients based on a feature set;
FIG. 4 shows the form of a rectangular block used as a basic unit of learning features;
FIG. 5 is a probability map showing the sampling results for a given normalized weight;
FIG. 6 is a pair of related three-dimensional diagrams showing features of two blocks;
FIG. 7A is a diagram of a cascaded classifier as known in the prior art;
FIG. 7B is a diagram of a hybrid cascade structure according to the present invention;
FIG. 8 is a graphical illustration of the calculation of a weighting factor w, where F and B represent the foreground and background, respectively;
FIG. 9 is a two-dimensional graphical illustration showing how rectangles may be used in the initial region selection for coarse segmentation;
FIG. 10 is a diagram illustrating segmentation according to a tracking scheme;
11A-11C are illustrations of a mask problem;
FIGS. 12A-12B show a detailed flow chart of the face detection process;
FIGS. 13A-13F show a detailed flow chart of the body segmentation process; and
Fig. 14A to 14D show a set of diagrams for explaining the processing.
The definitions of the variables in FIGS. 12A-12B are listed below:
SIZE_H: width of the frame image
SIZE_V: height of the frame image
IGR(I): integral image of image I
IGR(I²): integral image of the squared image I²
W: sample window
Ws: size of the sample window
Lnum: number of layers of the classifier 12
NodeNum: number of nodes in each layer
Training data:
window WT; variable VT; variable VT12
feature pattern M[]; variable alpha[]; threshold Threshold;
threshold taoT[]
V2num = 40: number of layers of variance classifiers
B2num = 50: number of layers of block classifiers
p[]: sign of the learned variable
mv[]: learned variables
The definitions of the variables in FIGS. 13A-13F are listed below:
Y, Cb, Cr: components of the YCbCr color space
start_Px: x-coordinate of the upper-left point of the initial window
start_Py: y-coordinate of the upper-left point of the initial window
W: size of the initial window
Sequare: structure for storing foreground pixels
Ri: the i-th object region in the foreground
Centrnumf: number of clusters of the foreground
Centrnumb: number of clusters of the background
VPIXEL: pixel value of the image
Beta ∈ (0, 1): weight of the learning parameter
Struct sequare
{
self; // current pixel
right; // right-neighbor pixel
below; // pixel below
};
μR: mean value of region R
ΣR: covariance of region R
d(V1, V2): Euclidean distance between V1 and V2
Detailed Description
It has been found that by designing appropriate detectors based on physical models or training schemes, some specific objects of interest can be detected. Fig. 1A is a block diagram of an object detection system 10, which object detection system 10 is adapted to recognize a human body part by segmentation of image elements. Fig. 1B is a related flow chart of the system 10.
Referring to fig. 1A, by focusing on the human body, environmental clutter is minimized and automatic segmentation of the human body is facilitated, which is particularly useful for real-time interactive systems with a dominant person in live video. The present system 10 has three key processes: object detection focused on, for example, face detection 12; object segmentation focused on human body structures 14; and boundary matting 16. The output is a usable data description 17 of the human body, which is suitable for use in an application referred to herein as artistic stylization 18. The detailed considerations of each technique are summarized below.
Referring to fig. 1B, a video input is first captured as image data (step A); if the image data is not to be processed (step B), it is directed to a display monitor (step C). If the image data is processed, i.e. automatically segmented (step B), the face detection process is first applied (step D), the integral image output by the face detection process is constructed (step E), and the integral image is checked to confirm a face region (step F). If the confirmation fails, the output is displayed (step C). If the confirmation passes, the data are used to compute a body region and a body segmentation process is applied (step G), interacting with the current mask (step H) and the current frame (step J). If the number of pixels resulting from the segmentation is less than 1000 (step K), the face detection process is invoked again (step D). Otherwise, a boundary matting process (step L) is invoked in conjunction with the integral image input (step E) to produce an output result for display (step C) or other data applications.
Face detection: as a very amorphous object, human faces are characterized by a high degree of variability in size, shape, color, and texture. The boundary between the face and non-facial patterns is also highly non-linear due to variations in facial appearance, lighting, head pose, and expression. In normal operation, the face detector 12 (FIG. 2) scans the image at many scales to find the face position within the scaled window. There are three parts or sub-processors: a skin color filter 18, a rejection cascade 20 and a boosted-face cascade (cascade-of-face) classifier 22A. The skin tone filter 18 is used to clear non-skin areas in the colored image during face detection. The rejection element 20 is used to remove most non-face candidate regions while allowing 100% accuracy of face detection. The likely face-like locations will be examined in the final boosted face classifier 22A-22N.
Fast skin filtering in the Cb-Cr color space: In the skin tone filter 18, skin tones can be detected by sensing the presence of a certain range of chrominance values having a narrow and consistent distribution in the YCbCr color space. Software that classifies the colors at each scan location can be used for this purpose in element 18, as can optical filters and/or wavelength-specific photodetectors connected to an analog-to-digital converter. The empirical ranges of chrominance values used are typically Cr ∈ [130, 175] and Cb ∈ [75, 130]. Since facial regions typically show similar skin tone characteristics regardless of skin type, most non-skin-color regions can be quickly eliminated or skipped, saving considerable computation time.
Let S(m, n) denote the filtered binary result at position (m, n) in the current image, i.e. "1" for skin-tone chrominance values and "0" otherwise. We then compute the integral image of this binary map. If no or only a few skin-tone pixels appear in a scanning window w, we can conclude that no face is present in the current window and the data for that window can be skipped. Here, the threshold is set to a small value (e.g., 0.06·w·w).
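As an illustration only, the following Python sketch shows how such a skin filter and its integral-image window test might be implemented; the Cr/Cb ranges and the 0.06·w·w threshold follow the text above, while the function names and structure are assumptions, not the patented implementation.

import numpy as np

def skin_mask(cb, cr):
    """Binary map S(m, n): 1 where the chrominance lies in the skin range."""
    return ((cr >= 130) & (cr <= 175) & (cb >= 75) & (cb <= 130)).astype(np.int64)

def integral_image(img):
    """Summed-area table with a zero first row/column for easy window sums."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def window_sum(ii, top, left, w):
    return ii[top + w, left + w] - ii[top, left + w] - ii[top + w, left] + ii[top, left]

def has_enough_skin(ii, top, left, w, ratio=0.06):
    """Return False (skip the window) when fewer than ratio*w*w skin pixels appear."""
    return window_sum(ii, top, left, w) >= ratio * w * w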
Feature-based rejector: the feature-based rejector 20 is preferably a cascade of rejection modules that are collectively designed to reject a large number of non-face samples while detecting nearly 100% of valid faces. Cascading can significantly reduce computation time before more complex face classifiers are accessed to achieve low false positive rates. A simple feature can be used as a flag to construct an efficient rejector module that can be part of a cascade. Since these characteristics are also used to boost the face classifier element 22 after the rejection cascade 20, little or no additional computation is required for certain feature generation.
There are two feature sets. The first feature set is the regional variance, which can be obtained from two integral images, namely the "integral image" and the "integral image of the squared image". These integral images are also used to perform illumination correction during the image scanning process. For a 24 × 24 pixel image, there are 76176 variance features to consider. Let σ_i denote the variance of the i-th region; the training process can be described by Algorithm 1 (a code sketch follows the algorithm):
Algorithm 1: training of rejectors using variance features
1. Input training examples (x_1, y_1), ..., (x_n, y_n), with y_i = 0, 1 for non-face and face examples respectively.
2. For y_i = 0, initialize the reject label l_i = 0.
3. For t = 1, ..., T:
4. For each region k, find the minimum and maximum of σ_k over the face training examples, denoted σ_k^min and σ_k^max respectively.
5. Compute the rejection count r_k over the non-face training set, i.e. the number of non-face samples whose σ_k lies outside [σ_k^min, σ_k^max].
6. Select the region with the highest rejection count.
7. Set the label l_i = 1 for all rejected samples i.
8. Repeat from step 3.
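The following Python sketch illustrates one possible reading of Algorithm 1, assuming the per-region variances have already been computed into matrices face_var (faces × regions) and nonface_var (non-faces × regions); the accept interval per region is the face minimum/maximum, so 100% of faces are retained by construction. All names are illustrative.

import numpy as np

def train_variance_rejectors(face_var, nonface_var, rounds):
    rejected = np.zeros(nonface_var.shape[0], dtype=bool)   # reject labels l_i
    lo, hi = face_var.min(axis=0), face_var.max(axis=0)      # sigma_k^min, sigma_k^max per region
    chosen = []
    for _ in range(rounds):
        outside = (nonface_var < lo) | (nonface_var > hi)    # candidate rejections per region
        gain = (outside & ~rejected[:, None]).sum(axis=0)    # newly rejected non-faces per region
        k = int(gain.argmax())                               # region with the highest rejection count
        chosen.append((k, float(lo[k]), float(hi[k])))
        rejected |= outside[:, k]                            # set l_i = 1 for rejected samples
    return chosen, rejected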
The second feature set is the difference between two low-low (LL) non-Normalized Haar Transform (NHT) coefficients. Since these coefficients are also used to compute the high-frequency NHT coefficients that serve as input features for the training phase (e.g., a readily available adaptive boosting (AdaBoost) binary classification learning algorithm), no additional computational burden is associated with generating this feature set. Based on the scaling advantage of the Haar feature, only the LL coefficients with a 4 × 4 block size are used as the training set, which includes 97020 features for a 24 × 24 pixel image. The training method is similar to the first, with small modifications to steps 4 to 6, as in Algorithm 2:
Algorithm 2: training of rejectors using block differences
4. For each pair of coefficients (k, j), find the minimum and maximum of D_(k,j) = LL_k − LL_j over the face training examples, denoted D_(k,j)^min and D_(k,j)^max respectively.
5. Compute the rejection count r_(k,j) over the non-face training set, i.e. the number of non-face samples whose D_(k,j) lies outside [D_(k,j)^min, D_(k,j)^max].
6. Select the difference feature with the highest rejection count.
Using 40 and 50 rejectors for the regional-variance set and the LL NHT coefficient set, respectively, the system produces a rejection rate of 98.269% with a detection rate of 100% on the test data set. The regional variance shows a relatively high rejection rate compared with the LL NHT coefficients. It has been observed that, for a training set consisting of 500000 non-face images and 12536 face images, the first variance-feature rejector alone rejects approximately 47.6% of the non-face images while maintaining a detection rate of 100%.
Because Haar-like features can be computed very quickly using integral images, most of these methods build weak classifiers by selecting one feature from a given feature set. The four classes of features shown in fig. 3 are constructed from different NHT coefficients. In a sense, these coefficients are like "bricks" that can be assembled in a certain style. Each feature is composed of one or more "bricks" combined by arithmetic operations (specifically, addition, subtraction, and absolute value). Examples can be found in the lower right column of fig. 3. The first two features are obtained by calculating the difference between two LH and two HL coefficients, respectively. A center-surround feature can be obtained by adding and subtracting four HH blocks. For a 24 × 24 window size, the number of over-complete features is very large, e.g., 2646 for feature A, 69768 for feature B, 830574 for feature C, and 1045386 for feature D.
Active boosting: active boosting is characterized by a feature set, importance weight sampling, a Kullback-Leibler confidence map, mean-shift based region segmentation, and active feature selection. Each of these features is explained below.
1) Feature set: as shown in fig. 4, the form of a rectangular block (column a) is adopted as a basic unit of the learning feature. Each feature is constructed by a linear combination of two or four blocks (columns B, C and D). By appropriately adjusting the combination of these blocks, coefficients with different Haar-like characteristics can be generated. Assume that the input instance size is 24 x 24. The number of possible rectangular blocks would be 76176. Therefore, the total number of features composed of four blocks will reach 3.3672e + 019. Even with one second for each weak classifier, the full runtime still requires 3.8973e +014 days, which is unthinkable for the training process. It is known that the set of features is over-complete, with many features contributing little to the optimal classifier. However, they will take most of the training time in the process of conducting a brute force search through the entire feature set. The optimal feature set can be constructed in an interactive way.
2) Importance weight sampling: the first step of the algorithm according to this embodiment of the invention is importance sampling from the weight distribution, i.e. eliminating samples with low importance weights and adding samples with high importance weights.
As described above in this disclosure, misclassified examples are assigned more weight in the next boosting stage, while the weights of correctly classified examples remain small. However, the total number of samples is kept fixed, whether individual weights are large or small. Fig. 5 shows the sampling results for a given normalized weight. The curve represents the weight distribution obtained by the boosting procedure. It can be seen that examples with large weights increase in number after the resampling step. This process is important for the subsequent feature selection step, because selection is based on the information gain in the training data.
3) Kullback-Leibler confidence map: Let x_s denote the data resampled from the original examples. Suppose each sample x_s can be written as {x_s(e_1), x_s(e_2), ..., x_s(e_d)}, where e_i is the i-th element (e.g., pixel). Let Ω_e denote the set of elements of a given sample. Using the resampled data, the Kullback-Leibler (KL) divergence is used to measure the difference between the positive and negative samples. Here, a symmetric Kullback-Leibler distance (KLD) is used, defined as the sum of the two directed KL divergences between the positive-sample and negative-sample distributions.
Rule 1: if one of the blocks has a larger KL divergence distance and the other block has a smaller KL divergence distance, then the feature can be considered a candidate weak classifier.
To select an appropriate feature according to rule 1, the KL divergence is first calculated for each element e_i ∈ Ω_e. Let C(e_i) denote the KL measurement at element e_i between the positive and negative samples x_s. The set C is referred to as the KL confidence map of a given sample, which describes the difference between positive and negative samples at each element.
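A minimal sketch of how such a per-element KL confidence map could be computed follows: the intensity histogram over positive samples is compared with the histogram over negative samples at each element using the symmetric Kullback-Leibler distance. The histogram-based density estimate and all names are assumptions for illustration.

import numpy as np

def symmetric_kl(p, q, eps=1e-8):
    """Symmetric KL distance between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def kl_confidence_map(pos, neg, bins=32):
    """Return C(e_i) for every element i of 8-bit samples pos/neg, each shaped (n_samples, n_elements)."""
    n_elem = pos.shape[1]
    conf = np.empty(n_elem)
    for i in range(n_elem):
        hp, _ = np.histogram(pos[:, i], bins=bins, range=(0, 256))
        hq, _ = np.histogram(neg[:, i], bins=bins, range=(0, 256))
        conf[i] = symmetric_kl(hp / hp.sum(), hq / hq.sum())
    return conf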
4) Region segmentation based on mean shift: To find the block features from the KL confidence maps, clustering is done using mean-shift filtering, which divides the confidence map into different regions according to the confidence values. Mean shift is a non-parametric method that estimates the modes of a density distribution through an iterative process. The kernel of the mean-shift procedure moves in the direction of the greatest increase in the joint density gradient.
5) Active feature selection: with (z)i,li) Represents the segmentation result of a confidence map C, where liIs an area label. Respectively by zmaxAnd lmaxDenotes ziAnd the maximum value of label/. A feature search is then performed using a coarse to fine technique.
Since the low-confidence blocks are selected at half resolution, the task of the fine phase is to search the neighboring positions once the best candidate has been found by the boosting phase. An illustration can be found on the right side of fig. 6. Further, this refinement is imposed on each high-confidence block.
In the feature search process, two thresholds, τ_1 and τ_2, are used to truncate the search range of the two feature blocks. A large value of τ_2 or a small value of τ_1 increases the size of the feature set. In a particular embodiment of the present invention, τ_1 > 0.5 and τ_2 < 0.5. Different thresholds are used for different cascaded classifiers.
In addition to two-block features, four-block features are also contemplated within the present invention. After the two-block feature boosting step, some features with lower error are selected and the classification error of their linear combinations is evaluated. The final selected weak classifier is the best of the two-block and four-block features. The detailed combining procedure is given below (a brief code sketch follows the listed steps):
A. For each two-block feature:
execute the boosting procedure, and
record the classification error.
B. Select the two-block features with the lowest error, sorted in ascending order of error.
C. Select the top m two-block features.
Compose four-block features from the m two-block features;
perform the boosting procedure for each four-block feature, and
record the classification error.
D. Compare the four-block features with the best two-block features and select the feature with the lowest error.
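The sketch below illustrates the combining procedure under simplifying assumptions: a "feature" is reduced to its vector of responses over the training set, the weak learner is a decision stump whose weighted error stands in for the boosting-round error, and a four-block feature is formed as the sum of two two-block responses. The helper threshold_error and all names are illustrative.

import numpy as np
from itertools import combinations

def threshold_error(f, y, w):
    """Lowest weighted error of a stump (threshold + polarity) on responses f, labels y in {-1,+1}."""
    order = np.argsort(f)
    y, w = y[order], w[order]
    pos = np.cumsum(w * (y == 1)); neg = np.cumsum(w * (y == -1))
    err_left = neg + (pos[-1] - pos)          # predict +1 below the threshold
    err_right = pos + (neg[-1] - neg)         # predict +1 above the threshold
    return float(min(err_left.min(), err_right.min()))

def select_combined_feature(two_block_feats, y, w, m=4):
    errs = [threshold_error(f, y, w) for f in two_block_feats]   # A: error of each two-block feature
    best2 = np.argsort(errs)[:m]                                 # B/C: m lowest-error two-block features
    best_err, best = errs[best2[0]], ("two", int(best2[0]))
    for i, j in combinations(best2, 2):                          # four-block = linear combination of two
        f4 = two_block_feats[i] + two_block_feats[j]
        e4 = threshold_error(f4, y, w)
        if e4 < best_err:                                        # D: keep whichever has the lowest error
            best_err, best = e4, ("four", (int(i), int(j)))
    return best, best_err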
Lazy boosting: Unlike the active boosting learning algorithm, lazy boosting is designed for the cascade of classifiers, where it improves detection performance and significantly saves computation time. This method is described in more detail after the specific cascade structure is explained.
Hybrid cascade architecture: a form of through cascade structure is known for effectively eliminating as many negative samples as possible. Fig. 7A and 7B are diagrams of cascaded classifiers. Fig. 7A is a cascade structure as known in the art. Fig. 7B is a cascade structure, more specifically a hybrid cascade structure.
FIG. 7A illustrates a decision tree diagram. Each layer in the cascade is tuned to have a higher decision rate with a predictable "failure" rate. Only the "accepted" sample is allowed to enter the next layer. After valuable detection of several layers, most of the calculations are concentrated on a few positive samples, not negative samples.
According to embodiments of the present invention, a hybrid cascade data structure is provided to accelerate object detection and reduce the required processing. The cascade structure of such a tree is shown in FIG. 7B, where node C_i represents a strong classifier learned from active boosting. The primed node C_i' denotes a strong classifier derived by lazy boosting; its weak classifiers come from a previous layer, as shown in FIG. 7B by the dashed lines representing the data paths. Since no extra computation of the weak classifiers is needed in C_i', detection is very efficient: only some multiplications and additions are required to construct the strong classifiers. A detailed description of lazy boosting follows.
Lazy boosting is used for the hybrid cascade structure. The aim of the method is to reuse the weak classifiers as efficiently as possible. In the conventional cascade of classifiers illustrated in fig. 7A, each weak classifier in a given layer is the best one in a given feature set, selected from hundreds of thousands of features. However, these optimal features are used only once. The goal of each layer is to find the best strong classifier that achieves the required detection rate.
In the boosting algorithm according to this embodiment of the invention (i.e., the lazy boosting algorithm), detection efficiency is the focus rather than the best possible performance of each weak classifier. A weak classifier in the lazy boosting algorithm may not be the best available, because it is selected from the previous layers. Since the outputs of these weak classifiers have already been computed before the lazy boosting step, no further computation is required for them. The only remaining computation is the final combined classifier, which involves only some multiplications and additions. The detailed lazy boosting algorithm is given below (a code sketch follows the listing):
Lazy boosting algorithm
1. (Input)
Input training examples (x_1, y_1), ..., (x_n, y_n), where x_i denotes a sample and y_i = 1 or −1 for positive and negative samples respectively.
2. (Initialization)
Initialize the sample weights w_i to 1/(2p) for positive samples and 1/(2q) for negative samples, where p and q are the numbers of positive and negative samples.
3. (Lazy boosting)
For t = 1, ..., T:
a. Normalize the weights w_{t,j}.
b. Obtain the weak classifiers from the previous layers.
c. Compute the classification error of each weak classifier.
d. Select the best weak classifier, i.e. the one with the lowest error.
e. Update the weights.
4. (Output)
Output the combined classifier.
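A minimal sketch of the lazy boosting loop follows. The weak classifiers come from earlier cascade layers, so their outputs H (n_weak × n_samples, values in {−1, +1}) are assumed to be cached already; only weighted-error bookkeeping and the final linear combination remain. The AdaBoost-style weight update is an assumption about the unspecified step (e); names are illustrative.

import numpy as np

def lazy_boost(H, y, rounds):
    n_weak, n = H.shape
    p, q = (y == 1).sum(), (y == -1).sum()
    w = np.where(y == 1, 1.0 / (2 * p), 1.0 / (2 * q))   # initialization of sample weights
    picked, alphas = [], []
    for _ in range(rounds):
        w = w / w.sum()                                   # a. normalize the weights
        errs = (w * (H != y)).sum(axis=1)                 # b/c. weighted error of each cached weak output
        t = int(errs.argmin())                            # d. best (lowest-error) weak classifier
        eps = max(errs[t], 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-alpha * y * H[t])                 # e. weight update (AdaBoost form, assumed)
        picked.append(t); alphas.append(alpha)
    return picked, np.array(alphas)                       # combined classifier: sign(sum_t alpha_t h_t(x))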
Contribution in human segmentation: after face detection, the face position in the input image can be quickly accessed. A minimum cut (min-cut) optimization method is described below to perform a face segmentation algorithm in the obtained face region.
In general, when the user's input explicitly defines the foreground or background, the segmentation can be performed by graph-cut optimization. Unfortunately, in an unsupervised approach only a coarse face position obtained from the boosted face detector is available. The corresponding sub-window may cover the complete face contour or only parts of the face area. This means that pixels outside the window may belong to the background while pixels inside the window may be part of the face; it cannot be determined with confidence which pixels should be labeled background or foreground. The following describes a technique for segmenting the face region under such incomplete labeling conditions.
A. The cost function: A graph G = <V, E> is defined by a set of nodes V (e.g., pixels or regions) and a set of directed edges E connecting the nodes. If each pixel represents a node, the segmentation of the image Z = {z_i} can be obtained by minimizing an energy function built from two cost terms: a data cost E1 for assigning each node i a foreground or background label, and a smoothing cost E2 measuring the similarity between two nodes:
data cost E1: the E1 term is used to set the penalty of assigning each pixel to either the foreground or the background. Generally, in the interactive method, the two terminals F and B, or at least the terminal B, have been defined by the input of the user. In this way, some hard constraints can be imposed to ensure a label that is consistent with the user's drawing with a minimal cutting procedure. This means that infinite costs may apply when the assigned labels violate the user's drawing strokes. However, there is no predefined terminal in the automatic processing. Therefore, the following novel derivation of E1 was used.
In particular embodiments, a Gaussian Mixture Model (GMM) is used to model the color distribution of the foreground and the background based on the initial face position. For a given pixel i with color z_i and label α(i), the distances of pixel i to the foreground F and to the background B are defined by the corresponding equations, where w_k represents the weight corresponding to the percentage of spatial samples for the k-th GMM component, μ_k and Σ_k represent the mean color and covariance matrix respectively, and K is the number of GMM components in the foreground and in the background. Fig. 8 shows an example of the weight calculation, where the left and right sides represent the foreground F and the background B, respectively. The spatial-sample weight of a pixel (between the F and B regions) is calculated from the number of clustered pixels within the circled region.
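Because the distance equations themselves are not reproduced above, the following sketch uses the common GMM negative log-likelihood as a stand-in data cost, d(i, F) = −log Σ_k w_k N(z_i; μ_k, Σ_k), and likewise for B; the component weights, means and covariances are assumed given, and all names are illustrative.

import numpy as np

def gmm_neg_log_likelihood(z, weights, means, covs):
    """Stand-in data cost of color z (3-vector) under a GMM with K components."""
    total = 0.0
    for w, mu, cov in zip(weights, means, covs):
        diff = z - mu
        inv = np.linalg.inv(cov)
        norm = 1.0 / np.sqrt(((2 * np.pi) ** 3) * np.linalg.det(cov))
        total += w * norm * np.exp(-0.5 * diff @ inv @ diff)
    return -np.log(max(total, 1e-300))

def data_cost(z_i, fg_gmm, bg_gmm):
    """E1 penalties for assigning pixel i to foreground or background; each *_gmm = (weights, means, covs)."""
    d_f = gmm_neg_log_likelihood(z_i, *fg_gmm)
    d_b = gmm_neg_log_likelihood(z_i, *bg_gmm)
    return d_f, d_b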
Smoothing cost E2: the term E2 is used to set the penalty for a discontinuity between two nodes (e.g., two pixels). E2 becomes larger when a smaller change is found between pixels i and j, which means that there is less likelihood of edges occurring between adjacent pixels. The general format is defined in terms of local intensity gradients. This novel derivation allows the use of exponential functions based on gradients but not through label constraints.
Background learning: the distance between each node and the background is made up of two partsThe ions are mixed by a weighting coefficient beta. Calculating the first color distribution according to the background in the current frame iAnd learn from initialization to secondBy adjusting the weight β, we are more or less able to incorporate the previous context information into the current edge join calculation. For example, if β is set to 0, then the connection to the background of each node relies only on prior knowledge, which can be obtained for those static scenes (e.g., video surveillance). Conversely, for dynamic scenes, especially for those fast moving backgrounds, larger weights should be considered.
In background learning, each pixel is modeled as a Gaussian distribution whose mean and variance are learned during system initialization. We estimate the model parameters using 500 frames.
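A minimal sketch of such a per-pixel background model follows: each pixel keeps a mean and variance learned over an initialization sequence (500 frames in the text), and the node-to-background distance blends a current-frame estimate with this prior through the weight β. The squared-Mahalanobis form of the per-pixel distance is an assumption; names are illustrative.

import numpy as np

def learn_background(frames):
    """frames: (T, H, W) grayscale initialization sequence (e.g. T = 500)."""
    stack = np.asarray(frames, dtype=np.float64)
    return stack.mean(axis=0), stack.var(axis=0) + 1e-6   # per-pixel mean and variance

def background_distance(pixel, d_current, mu, var, beta):
    """Blend the current-frame distance with the learned prior via beta (beta = 0 uses only the prior)."""
    d_prior = (pixel - mu) ** 2 / var
    return beta * d_current + (1.0 - beta) * d_prior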
B. Segmentation from coarse to fine: the color distribution must be established from regions of the body that are not completely marked. The proposed method works on two levels, a coarse scale and a fine scale.
The initial segmentation is done at the coarse level. As shown in fig. 9, four regions (a-d) are used to estimate the foreground information, while nine regions (A-J) are used to estimate the background. For example, seven sections, namely A, (B + E), (C + F), (D + E), (F + G), H, and I, are selected for estimating the color distribution of the background. Their means and variances are used as the initial clustering of the background, which means that there are four components in the background GMM. For the body region, an initial Gaussian model with four components is likewise set. The initial face area is a small square window of size W/2 × W/2 located at the center of the detection window; its mean and variance are taken as the initial cluster of the face region. For each pixel in this region, the weighted distances to the face and background clusters are calculated from (3), (4) and the other similarity terms. Finally, a minimum cut is used for global optimization.
The second level is a finer segmentation whose aim is to improve the initial segmentation result. The corresponding foreground region is defined as the set of pixels in the current window that belong to the body side, while the background region consists of pixels outside the window that are classified as background. We describe the foreground color and the background color each using 8 components. The mean and covariance of each component are estimated with a K-means algorithm. The energies E1 and E2 are then calculated in a similar way to the coarse scale. Note that in the data cost E1, the weight of each component is estimated from spatial samples in a 20 × 20 pixel window centered on the current node. The minimum cut is again used for the final optimization.
Contribution of tracking-based person segmentation
A. Optimal position prediction: Body segmentation in successive frames is achieved by a tracking technique. The first step in tracking the body in the current frame n is to predict the state space from the previous frame (n − 1). The state space here refers to the position of the body region.
Starting from the previous frame, the current position of the candidate body region is obtained by motion estimation. In a particular system, a motion estimation technique obtains the projected positions of the candidate regions following a coarse-to-fine refinement strategy. First, full-search motion estimation is performed in a half-resolution sample space using a search range of 15. Then, a finer search is performed around the best matching position over its eight neighboring points only. The Sum of Absolute Differences (SAD) is used as the similarity measure between two regions.
Next, a second order auto-regression (AR) model is used to describe the state change. AR models are generally able to provide better prediction results when objects move smoothly. However, human objects often change direction due to sudden movements, which leads to inaccurate predictions for subsequent segmentations. To avoid such a deteriorated situation, an error check is performed between the predicted position and the original position, and the optimal one is selected. Furthermore, a small window at the prediction center is used to find the best matching position, which is preferably set to 5 × 5 pixels and 11 × 11 pixels for the motion case and the still case, respectively.
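The sketch below illustrates only the coarse step of this prediction: a full SAD search at half resolution inside a ±15 window around the previous body position (the eight-neighbor full-resolution refinement and the AR model are omitted). Names are illustrative.

import numpy as np

def sad(a, b):
    """Sum of Absolute Differences between two equally sized blocks."""
    return float(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def coarse_predict(prev, curr, top, left, h, w, search=15):
    ref = prev[top:top + h:2, left:left + w:2]       # half-resolution reference block
    curr2 = curr[::2, ::2]
    t2, l2 = top // 2, left // 2
    bh, bw = ref.shape
    best, best_cost = (t2, l2), float("inf")
    for dy in range(-search, search + 1):            # full search in the half-resolution space
        for dx in range(-search, search + 1):
            y, x = t2 + dy, l2 + dx
            if 0 <= y <= curr2.shape[0] - bh and 0 <= x <= curr2.shape[1] - bw:
                cost = sad(ref, curr2[y:y + bh, x:x + bw])
                if cost < best_cost:
                    best, best_cost = (y, x), cost
    return best[0] * 2, best[1] * 2                   # map the match back to full resolution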
B. Multi-stage segmentation: after state prediction, a coarse mask in the current frame n is obtained. To reduce the computation time, three modes are used for body segmentation. The first mode is based on a trimap (trimap) which consists of three regions, namely a foreground region, a background region and an unknown region. As shown in FIG. 10, On-1Representing candidate face regions in the (n-1) th frame. Obtaining a projection region O by prediction in an n-th framen. Then to the projection area OnErosion and dilation morphological operations are performed. The obtained regions are respectively composed ofAndand (4) showing. The structuring element used was a square structuring element with a width of 10 pixels. Using a process based on the functions represented by equations (2), (3), and (6), the data cost and smoothing cost of the nodes in the unknown region can be determined. A label is then assigned to each node using the minimum partition.
The first mode operates very fast due to the small node set, and it also enables segmentation results similar to the fine segmentation when the human body has not changed unexpectedly. However, in order to avoid possible accumulated errors caused by noise effects in the first mode, a second mode (i.e. fine segmentation) is required to correct these errors. As shown in fig. 10, pixels in the body region and the unknown region are considered as nodes of the constructed graph. A minimum partition is then used to assign a label to each node.
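A minimal sketch of this first, trimap-based mode follows: the projected mask is eroded and dilated with a 10-pixel square structuring element, pixels inside the eroded mask are foreground, pixels outside the dilated mask are background, and the band in between is the unknown region handled by the min-cut step. Names are illustrative.

import numpy as np
from scipy import ndimage

def make_trimap(projected_mask, width=10):
    se = np.ones((width, width), dtype=bool)               # square structuring element, width 10
    fg = ndimage.binary_erosion(projected_mask, structure=se)
    bg = ~ndimage.binary_dilation(projected_mask, structure=se)
    trimap = np.full(projected_mask.shape, 0.5)            # unknown region by default
    trimap[fg] = 1.0                                       # definite foreground
    trimap[bg] = 0.0                                       # definite background
    return trimap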
C. Update of the foreground model (key to the real-time character of the system): To perform fast multi-stage segmentation, foreground information from previous frames is used. After segmentation of the current frame, the foreground model is updated by the following function:
where W is a normalization factor, and μ_i^n and σ_i^n denote the mean and variance of the i-th component of the foreground of frame n.
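Since the update equations are not reproduced above, the sketch below uses a simple exponentially weighted running update of each foreground component's mean and variance, normalized by W, purely as a stand-in; the blending weight rho and the form of the update are assumptions, not the patented formula.

import numpy as np

def update_foreground_model(means, variances, new_means, new_variances, rho=0.3):
    """Blend frame-n component statistics into the running foreground model (stand-in update)."""
    W = 1.0                                                # normalization factor
    upd_mu = ((1 - rho) * np.asarray(means) + rho * np.asarray(new_means)) / W
    upd_var = ((1 - rho) * np.asarray(variances) + rho * np.asarray(new_variances)) / W
    return upd_mu, upd_var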
Contribution of boundary matting: matting is an important operation that separates arbitrarily shaped foreground elements from the image by estimating the opacity of the foreground elements at each pixel. Since the matting equation shown in equation 9 has too many unknowns, the matting is inherently under-constrained.
I=αF+(1-α)B (9)
As can be seen from equation (9), when the mask α changes from 1 (or 0) to 0 (or 1), there is a gradual color transition from the foreground (or background) to the background (or foreground). Since the value of α is limited to the range [0, 1], the original data I in the matting region must lie between the foreground and background values.
A. Adaptive trimap: An adaptive trimap can be used to address the matting problem. FIG. 11A shows a typical matting example with a gradual transition between the F and B regions: the upper part shows the transition between foreground and background, and the lower part shows the curve of the alpha mask. There are also many cases where a distinct boundary can be observed between background and foreground, meaning that no blending occurs between the two regions. When the foreground and background color distributions can be estimated well in the unknown region shown in FIG. 11B, the corresponding α value (1 or 0) can be solved in a similar way to the first case. However, this approach fails when one of the two regions' distributions cannot be estimated from the neighboring regions. For example, as shown in FIG. 11C, there is a distinct boundary and no blending region between the foreground and background, although a transition region can be observed within the foreground region. If the area between the two dashed lines is taken as the unknown region for which the α computation is performed, the optimized mask will smooth the foreground area inside the unknown region, producing noticeable artifacts. Such pathological cases can be avoided by using a manually defined trimap with three regions, namely Ω_F (definite foreground, α = 1), Ω_B (definite background, α = 0) and Ω_U (unknown, α ∈ [0, 1]), but this is difficult to handle in an unsupervised manner. Unfortunately, this last case is often observed along human facial boundaries.
To avoid unmanageable situations and to reduce possible estimation errors, the trimap is constructed using an adaptive approach according to an embodiment of the invention. In general, automatic generation of the trimap can be achieved by using morphological erosion and dilation to generate unknown regions from the initial object boundary, with a uniform structuring-element size of, e.g., 5 pixels. According to certain embodiments of the present invention, however, the size of the trimap depends on the degree of blurring of the centered pixel with respect to its neighbors. This means that if the area is well mixed between foreground and background, a larger size is required; otherwise a small unknown region is used. Since matting starts from a closed face contour, the trimap is computed for each pixel on the initial boundary. A circle of radius r is used to create the unknown region by eroding and dilating the two sides of the contour, respectively. Let r_p denote the radius at pixel p on the face contour. This gives:
r_p = r_max · exp(−k · ‖I_p − (g ∗ I)_p‖)    (10)
where g is a Gaussian function with standard deviation σ (= 5), used to estimate smoothness via a convolution operation, and r_max denotes the maximum radius, which in our work is set to 5.
A new trimap is generated by erosion and dilation of the original facial boundary with the corresponding unknown-region size r. The process erodes and dilates the foreground and background on either side of each initial boundary pixel p by no more than r_p pixels, meaning that the corresponding structuring element is a circle with the computed radius r_p.
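A minimal sketch of the adaptive radius of equation (10) follows: the image is smoothed with a Gaussian of σ = 5, and each contour pixel p receives r_p = r_max·exp(−k·|I_p − (g∗I)_p|) with r_max = 5. The value of k is an assumption (it is not fixed in the text above); names are illustrative.

import numpy as np
from scipy import ndimage

def adaptive_radii(image, contour_points, r_max=5.0, sigma=5.0, k=0.1):
    """Per-pixel unknown-region radius along the initial face contour."""
    smoothed = ndimage.gaussian_filter(image.astype(np.float64), sigma=sigma)
    radii = {}
    for (y, x) in contour_points:                     # pixels on the initial face contour
        blur_diff = abs(float(image[y, x]) - smoothed[y, x])
        radii[(y, x)] = r_max * np.exp(-k * blur_diff)
    return radii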
B. Alpha estimation using energy minimization: To estimate α in the unknown region Ω_U, the following energy function is defined:
E = E1 + λE2
where E1 is the error energy measuring the fidelity of the data estimate and E2 is the smoothing energy, representing the variation between adjacent α values.
Energy E1 is defined in equations (12) and (13). The weight w_{j,k} is defined as α²d for the foreground and (1 − α)²d for the background, where d denotes a distance relative to pixel (j, k). The corresponding foreground and background estimates are computed over an area centered at pixel (m, n); in our work, a circular area of radius 11 is used for the estimation.
The smoothing term E2 is defined over directional filters, where f_k denotes the filter corresponding to direction k. E2 measures the variation of adjacent α values along a given direction, encouraging α to change smoothly. In our work, the filtering is performed along the 0, π/4, π/2 and 3π/4 directions using four 4-tap filters.
To minimize the energy function E, a gradient-descent optimization method is used. In each iteration, the α value is updated by stepping along the negative gradient of E, i.e. α ← α − τ·∂E/∂α, where τ is the step size along the descent direction and is set to 1.5.
The energy minimization is an iterative optimization process consisting of three steps. The first step locates F and B for the unknown region Ω_U according to equations 12 and 13. The second step updates the α values based on gradient-descent optimization: the initial value α_0(x, y) for (x, y) ∈ Ω_U is set to 0.5, and equation 16 is then used to update the mask values. The last step is (F, B) refinement: if α(x, y) > 0.99 for (x, y) ∈ Ω_U, the pixel is classified as foreground by setting its mask value to 1; for α(x, y) < 0.01, the pixel is classified as background; otherwise, for 0.01 ≤ α(x, y) ≤ 0.99, the value of pixel (x, y) is treated as a blend of the foreground Ω_f and the background Ω_b. Such blended pixels are not classified as part of the segmented foreground region.
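A minimal sketch of this three-step iteration follows, with the energy gradient left abstract (grad_E) since the underlying equations are not reproduced here: gradient descent with step size τ = 1.5 inside the unknown region, followed by the 0.99/0.01 (F, B) reclassification. Names are illustrative.

import numpy as np

def refine_alpha(alpha, unknown_mask, grad_E, iters=20, tau=1.5):
    """alpha: HxW matte initialized to 0.5 inside the unknown region; grad_E(alpha) returns dE/dalpha."""
    alpha = alpha.copy()
    unknown = unknown_mask.copy()
    for _ in range(iters):
        g = grad_E(alpha)                         # step 2: gradient of E w.r.t. alpha
        alpha[unknown] -= tau * g[unknown]        # update along the descent direction
        alpha = np.clip(alpha, 0.0, 1.0)
        fg = unknown & (alpha > 0.99)             # step 3: (F, B) refinement
        bg = unknown & (alpha < 0.01)
        alpha[fg], alpha[bg] = 1.0, 0.0
        unknown &= ~(fg | bg)                     # reclassified pixels leave the unknown set
    return alpha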
For further illustration of the present invention, reference is made to FIGS. 12A-12B, 13A-13F and 14A-14D, where FIGS. 12A-12B show a detailed flow chart of the face detection process, FIGS. 13A-13F show a detailed flow chart of the body segmentation process, and FIGS. 14A-14D show a set of diagrams illustrating the processing.
The invention has been explained with reference to one or more specific embodiments. Other embodiments will be apparent to those skilled in the art upon review of this specification. Accordingly, the invention is not to be restricted except as indicated by the claims.

Claims (19)

1. A method for identifying characteristics and features of a selected body part in a system for identifying a person, comprising:
capturing an image of at least a portion of a human body having a known body part as a data set;
identifying a subdata set corresponding to the known human body part from the dataset representing the image by subjecting the dataset to a plurality of feature-based rejection tests;
segmenting the identified subdata sets into subdata set segments for known human body parts;
matching the sub data set segments with available data descriptions describing a human body; and
the matched data description is reported as output to an output device,
wherein the feature-based rejection test uses a set of regional variance features as a first classifier.
2. The method of claim 1, wherein the known body part is a human face.
3. The method of claim 2, wherein the regional variance feature set is constructed from an integral image used for illumination correction during an input scan.
4. The method of claim 2, wherein the feature-based rejection test further uses a feature set consisting only of differences between low-low non-normalized Haar transform coefficients as a second classifier.
5. The method of claim 4, wherein the feature set is formed from a plurality of features constructed from different non-normalized Haar transform coefficients combined by arithmetic operations.
6. The method of claim 1, wherein the segmenting step is implemented using active boosting characterized by importance weight sampling, Kullback-Leibler confidence maps, mean-shift based region segmentation, and active feature selection.
7. The method of claim 6, wherein the set of features comprises a plurality of linear combinations of two rectangular blocks or four rectangular blocks used to generate features representing Haar transform coefficients.
8. The method of claim 7, wherein the combining of the two tiles or the four tiles comprises:
executing an active boosting procedure on each feature consisting of two rectangular blocks;
recording a classification error for each boosting procedure; then
selecting the feature consisting of two rectangular blocks with the lowest error by:
sorting the features consisting of two rectangular blocks in ascending order according to error, and
selecting a plurality of features consisting of two rectangular blocks with the lowest error; then
composing features of four rectangular blocks from the features consisting of two rectangular blocks with the lowest error; then
executing the boosting procedure on each feature consisting of four rectangular blocks;
recording a classification error for the boosting procedure; then
comparing the classification error of the features composed of four rectangular blocks with the classification error of the features composed of two rectangular blocks with the lowest error; and
selecting the feature with the lowest total error.
9. The method of claim 6, wherein the active feature selection comprises searching for features using coarse criteria and then searching for features using fine criteria.
10. The method of claim 1, wherein the data structure of the sub data set is a hybrid cascade structure, the hybrid cascade structure being a tree structure, wherein a first node type represents a strong classifier learned from active boosting, a second node type is derived from lazy boosting with low computational effort, and a weak classifier is derived from previous layers.
11. The method of claim 1, wherein the segmenting step includes using foreground information of a previous image frame for current computation and updating foreground information in preparation for a subsequent frame to facilitate fast multi-layer segmentation.
12. The method of claim 11, wherein the foreground information is updated according to the following function:
wherein W is a normalization factor, and μ and σ denote the mean and variance of the i-th component of the foreground of frame n.
13. The method of claim 1, wherein the segmenting step comprises automatically generating a trimap using energy minimization.
14. A system for identifying a person by identifying characteristics and features of a selected body part, comprising:
means for capturing an image of at least a portion of a human body having a known body part as a data set;
means for identifying a subdata set corresponding to the known human body part from the data set representing the image by subjecting the data set to a plurality of feature-based rejection tests;
means for segmenting the identified subdata set into subdata set segments for known human body parts;
means for matching the sub data set segments with available data descriptions describing a human body; and
means for reporting, as output, the matched data description to an output means,
wherein the feature-based rejection test uses a set of regional variance features as a first classifier.
15. The system of claim 14, wherein the known human body part is a human face, the feature set is constructed from an integral image used for illumination correction during an input scan, and the system comprises means for constructing a second feature set consisting only of differences between low-low non-normalized Haar transform coefficients as a second classifier.
16. The system of claim 15, comprising means for combining two or four rectangular blocks, the blocks being used to generate features representing Haar transform coefficients, the means for combining comprising:
means for performing an active boosting procedure on each feature consisting of two rectangular blocks;
means for recording a classification error for each lifting procedure;
means for selecting the feature consisting of two rectangular blocks with the lowest error by:
sorting the features consisting of two rectangular blocks in ascending order according to error, an
Selecting a plurality of features consisting of two rectangular blocks with the lowest error;
means for composing a feature of four rectangular blocks from those features of two rectangular blocks with the lowest error; then the
Means for executing the lifting program for each of the four rectangular block-composed features;
means for recording classification errors for the boosting procedure;
means for comparing the classification error of the feature composed of four rectangles with the classification error of the feature composed of two rectangles with the lowest error; and
means for selecting the feature having the lowest total error.
17. The system of claim 14, wherein the data structure of the sub data set is a mixed cascade structure, the mixed cascade structure being a tree structure, wherein a first node type represents a strong classifier learned from active boosting and a second node type is derived from low computational lazy boosting and a weak classifier is derived from previous layers, wherein the means for partitioning comprises means for performing a lazy boosting procedure to accelerate matching.
18. The system of claim 17, wherein the foreground information is updated according to the following function:
wherein W is a normalization factor, and μ and σ denote the mean and variance of the i-th component of the foreground of frame n.
19. The system of claim 14, wherein the means for segmenting comprises means for automatically generating a trimap using energy minimization.

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US12/044,416 US8233676B2 (en) 2008-03-07 2008-03-07 Real-time body segmentation system
US12/044,416 2008-03-07
PCT/CN2009/070602 WO2009109127A1 (en) 2008-03-07 2009-03-02 Real-time body segmentation system

Publications (2)

Publication Number Publication Date
HK1150284A1 (en) 2011-11-18
HK1150284B (en) 2014-08-15

