Human head locking method based on RGB-D camera
Technical Field
The invention relates to a human head locking method based on an RGB-D camera.
Background
With the development of camera technology, the RGB-D camera has emerged as a new technology and, as its price has fallen in recent years, has come into wide use. RGB-D cameras are built on several sensing principles, such as structured-light speckle and time-of-flight (TOF), and are gradually being applied in many fields, such as three-dimensional reconstruction, image understanding and video monitoring. An advantage of an RGB-D camera is that the distance from the scene to the camera can be obtained directly and presented to the user as an image (referred to as a depth map or depth image), which is more accurate than a depth map obtained with a conventional binocular (stereo) camera. These advantages make the RGB-D camera very convenient for people counting in complex environments.
People counting has always been one of the core tasks of video monitoring, and it has long remained poorly solved. The main reason is that scenes contain not only human targets but also other objects, and in some crowded scenes, such as public transport, the targets have no obvious colour or edge features, so algorithms designed for a traditional RGB camera often fail to segment them. In a supermarket aisle, for example, bags, carts and purchased articles appear alongside people; such false targets (bags and carried articles) share no obvious distinguishing features with human heads, so a traditional RGB camera can hardly lock onto people accurately.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a human head locking method based on an RGB-D camera, which can accurately lock the human head.
In order to achieve the purpose, the invention adopts the following technical scheme:
a human head locking method based on an RGB-D camera comprises the following steps:
step one: erecting an RGB-D camera in a channel scene, calibrating the camera, and calculating a parameter matrix of the camera, wherein the channel has two opposite directions, A and B;
step two: continuously shooting the channel containing human targets with the camera to obtain N depth maps; obtaining a top view of each depth map; and obtaining the background image I_b from all the obtained top views;
Step three: shooting the channel containing human targets with the camera to obtain the depth map at a certain moment m; obtaining the corresponding top view of the depth map; performing a background-removal operation on the top view to obtain a foreground picture; performing a blocking operation on the foreground picture to obtain a blocked picture; searching the blocked picture for local maximum regions to obtain a local maximum region set; performing an expansion operation on the local maximum region set to obtain an expanded local maximum region set; and performing filtering rectangular-frame processing on the expanded local maximum region set to obtain a rectangular frame set S_Fm containing several elements, thereby achieving the purpose of locking the human heads.
Specifically, in the second step and the third step, the top view of each depth map is obtained by the following formula:
len=m*r
where θ is the angle between the ground plane and the ray corresponding to the point P(x_p, y_p, z_p) of the depth map; G(x_G, y_G, 0) is the intersection of the line through the point P with the ground plane; H_C is the camera height; m (0 < m < D) is the depth value of the point P in the depth map, D being the maximum pixel value set by the user; and r is the distance in world space corresponding to a unit depth value;
the top view I is obtained using the following formula:
wherein (u, v) is the pixel in the top view I corresponding to the point P of the depth map, and I(u, v) is the pixel value at the pixel (u, v);
and for each point in the depth map, the corresponding pixel in the top view and its pixel value are obtained; all the pixel values form the top view I.
Specifically, in step three, the background-removal operation is performed on the top view to obtain the foreground picture using the following formula:
where ε_F is the threshold set by the user for extracting the foreground, I_F(u, v) is the pixel value at pixel (u, v) of the foreground picture I_F, I_b(u, v) is the pixel value at pixel (u, v) of the background image I_b, and I_m(u, v) is the pixel value at pixel (u, v) of the top view I_m.
Specifically, in step three, the blocking operation is performed on the foreground picture to obtain the blocked picture using the following formula:
where I_F(u, v) is the pixel value at coordinates (u, v) of the foreground picture I_F, I_B(x, y) is the pixel value at pixel (x, y) of the picture I_B, and the size of each block is w_b × w_b.
Specifically, in step three, searching the blocked picture for local maximum regions to obtain the local maximum region set comprises the following steps:
for each pixel (x, y) of the picture I_B, examine the eight surrounding pixels; if the pixel value of (x, y) is larger than the pixel values of all eight neighbours, put the pixel into the local maximum region set S_L. S_L^(i) denotes a member of S_L, and S_L^(i) = (u_i, v_i, d_i), where (u_i, v_i) is the pixel and d_i is the pixel value of (u_i, v_i) in the picture I_B.
Specifically, in step three, expanding the local maximum regions in the local maximum region set to obtain the expanded local maximum region set comprises the following steps:
for each element S_L^(i) of the local maximum region set S_L, find the pixel position of S_L^(i) in the foreground picture I_F using the following formula:
where (x_i, y_i) is the position of S_L^(i) in the foreground picture I_F; let S_S^(i) = (x_i, y_i, z_i), where (x_i, y_i) is the pixel of S_L^(i) in the foreground picture I_F, to obtain the set S_S, S_S^(i) being an element of the set S_S;
for each member S_S^(i) = (x_i, y_i, z_i) of S_S, take S_S^(i) as a seed and expand outward by seed filling; the condition of the expansion is: if |I_F(x_i, y_i) − z_i| ≤ ε_E, then a rectangular frame S_E^(i) = (u_i, v_i, H_i, W_i, z_i) is used to frame all the pixels satisfying the condition, where (u_i, v_i) is the upper-left corner of the rectangular frame, (H_i, W_i) are its height and width, z_i is the original pixel value, and ε_E is a specified threshold; the expanded regions form the set S_E, S_E^(i) being an element of the set S_E.
Specifically, in step three, performing the filtering rectangular-frame processing on the expanded local maximum region set to obtain the rectangular frame set containing several elements comprises the following steps:
the elements of S_E are filtered using two conditions:
(1) if the element S_E^(i) satisfies the following condition, the element is deleted;
(2) if two rectangular frames S_E^(i) = (u_i, v_i, H_i, W_i, z_i) and S_E^(j) = (u_j, v_j, H_j, W_j, z_j) satisfy the following condition, S_E^(i) and S_E^(j) are judged to coincide; if they coincide, the rectangular frame with the larger of z_i and z_j is retained;
the retained rectangular frames form the rectangular frame set S_Fm; an element of S_Fm is S_Fm^(i), where m denotes the moment.
Compared with the prior art, the invention has the following technical effect: an RGB-D camera is erected over the channel, the channel containing human targets is shot with the camera to obtain several depth maps, the top views corresponding to the depth maps are obtained, and the rectangular frame set is formed from the top views, so that the human heads can be accurately locked.
The embodiments of the invention are explained in further detail below with reference to the figures and the detailed description.
Drawings
FIG. 1 is a scene model without a coordinate system;
FIG. 2 is a channel model of the world coordinate system;
FIG. 3 is a schematic diagram of a top view image blocking operation;
FIG. 4 is a schematic illustration of finding a local maximum; wherein, (a) represents the picture area in which the maximum value is sought, (b) represents the process of finding the local maximum value, and (c) represents the final finding of the local maximum value;
FIG. 5 is a schematic view of a camera mounting location;
FIG. 6 is a diagram illustrating the selection of six sets of world coordinates and their corresponding image coordinates;
FIG. 7 is a schematic diagram of obtaining a top view from a depth map, wherein (a) is the background image of the channel scene, (b) is the obtained depth map, (c) is the foreground picture obtained by the background-removal operation, and (d) is the top view;
FIG. 8 is a schematic diagram of obtaining the filtered rectangular frame set from a top view, wherein (a) shows the result of the blocking operation, (b) shows the local maximum region set, (c) shows the rectangular frame set after expanding the local maximum regions, and (d) shows the rectangular frame set after the filtering rectangular-frame processing.
Detailed Description
The invention discloses a human head locking method based on an RGB-D camera, which comprises the following steps:
erecting an RGB-D camera in a channel scene, calibrating the camera, and calculating a parameter matrix P of the camera;
step 1.1: select a channel as the scene for people counting; referring to fig. 1, install the camera directly above the channel; several human targets walk along the channel in direction A or direction B, the two directions being opposite;
step 1.2: establish a world coordinate system. Referring to fig. 2, the camera lies on the Z-axis of the world coordinate system, the direction along the channel is the Y-axis direction, the direction perpendicular to the channel is the X-axis direction, and the position of the camera in the world coordinate system is (0, 0, H), where H is the distance from the camera to the origin of the world coordinate system.
Step 1.3: calibrate the camera. Using a calibration support, select N (N ≥ 6) groups of image coordinates and the world coordinates corresponding to them:
the parameter matrix P of the camera is calculated using the following formula:
wherein,
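The calibration formula itself is not reproduced in the text; a standard way to realise it is the direct linear transform (DLT), which recovers the 3 × 4 matrix P from N ≥ 6 world–image point pairs. A minimal sketch follows (the function name and the use of NumPy's SVD are illustrative, not part of the patent):

```python
import numpy as np

def calibrate_dlt(world_pts, image_pts):
    """Estimate the 3x4 camera matrix P from >= 6 point pairs via DLT.

    world_pts: (N, 3) world coordinates; image_pts: (N, 2) pixel coordinates.
    """
    A = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        Xh = np.array([X, Y, Z, 1.0])
        # each correspondence contributes two linear equations in the 12
        # entries of P
        A.append([*Xh, 0.0, 0.0, 0.0, 0.0, *(-u * Xh)])
        A.append([0.0, 0.0, 0.0, 0.0, *Xh, *(-v * Xh)])
    # the solution is the right singular vector of the smallest singular value
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)
```

P is recovered only up to an overall scale, which is all a projective camera matrix requires; at least six non-coplanar world points are needed for the system to be well conditioned.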
step two: continuously shooting the channel containing human targets with the camera to obtain N (N ≥ 50) depth maps; obtaining a top view of each depth map; and obtaining the background image I_b from the top views.
The method for obtaining the top view of each depth map comprises the following steps:
The depth values in the depth map represent points in world coordinate space; for example, the distance len from a point P to the camera is the length of the hypotenuse of the small right triangle in the figure. From the geometric relationships in the world coordinate system we obtain the following formula:
len=m*r (4)
where θ is the angle between the ground plane and the ray corresponding to the point P on the depth map; G(x_G, y_G, 0) is the intersection of the line through P with the ground plane; H_C is the camera height; m (0 < m < D) is the depth value of the point P in the depth map, D being the maximum pixel value set by the user; and r is the distance in world space corresponding to a unit depth value.
After the coordinates of the point P are obtained, P is scaled and translated so that it lies at the centre of the top view I; then:
where (u, v) is the pixel in the top view I corresponding to the point P, I(u, v) is the pixel value at the pixel (u, v), (r_x, r_y) are the scaling coefficients applied to (x_p, y_p) of the point P, and (d_x, d_y) are the translation coefficients applied to (x_p, y_p).
For each point in the depth map, the corresponding pixel in the top view and its pixel value are obtained; all the pixel values form the top view I. Applying this method to the N depth maps yields N top views I_i (i = 1, ..., N).
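The projection of equations (4)–(5) can be sketched as follows, assuming a pinhole camera looking straight down. The intrinsics (fx, fy, cx, cy) and the scaling/translation coefficients are illustrative parameters, and storing the height of the highest point per top-view cell is an assumption consistent with the later search for head tops:

```python
import numpy as np

def depth_to_topview(depth, r, H_c, fx, fy, cx, cy,
                     rx, ry, dx, dy, out_shape=(240, 320)):
    """Project a depth map to a top view whose pixel value is height above ground.

    Assumptions: depth * r is the metric distance along the optical axis of a
    camera looking straight down from height H_c; (rx, ry, dx, dy) scale and
    translate world (x, y) so the scene is centred in the top view.
    """
    I = np.zeros(out_shape)
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth * r                          # len = m * r  (equation 4)
    x = (u - cx) * z / fx                  # back-projection to world x
    y = (v - cy) * z / fy                  # back-projection to world y
    tu = np.round(rx * x + dx).astype(int)  # scale + translate (equation 5)
    tv = np.round(ry * y + dy).astype(int)
    height = H_c - z                       # height of the point above the ground
    ok = (tu >= 0) & (tu < out_shape[1]) & (tv >= 0) & (tv < out_shape[0]) \
         & (depth > 0)
    # keep the highest point that lands in each top-view cell
    np.maximum.at(I, (tv[ok], tu[ok]), height[ok])
    return I
```

A head, being closer to the downward-looking camera than the floor, produces a larger value in the top view, which is what the later local-maximum search exploits.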
The background image I_b is obtained from the top views using the following formula:
where H is the height of the top view, W is the width of the top view, and I_b(x, y) is the pixel value at pixel (x, y) of the background image I_b; the background image I_b is thus obtained.
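The background formula image is not reproduced; one reading consistent with "using all the obtained top views" is a per-pixel temporal mean, sketched below (the mean is an assumption; a per-pixel median would be more robust if people pass through during capture):

```python
import numpy as np

def build_background(topviews):
    """Per-pixel background image I_b from the N top views I_i.

    Assumption: the missing formula is a temporal mean over the N views.
    """
    return np.mean(np.stack(topviews, axis=0), axis=0)
```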
Step three: shooting the channel containing human targets with the camera to obtain the depth map at a certain moment; obtaining the corresponding top view of the depth map; and performing background removal, blocking, local-maximum-region search, local-maximum-region expansion and filtering rectangular-frame processing on the top view to obtain the rectangular frame set S_Fm. This step specifically comprises:
step 3.1: shoot the channel containing human targets with the RGB-D camera to obtain the depth map at a certain moment m (m = 1, 2, ...);
step 3.2: obtain the top view I_m corresponding to the shot depth map, using the same method as for obtaining the top views in step two.
Step 3.3, for top view I obtained in step 3.2mPerforming background removal, blocking, local maximum area searching, local maximum area expanding and rectangular frame filtering processing to obtain a rectangular frame set SFmThe specific treatment process is as follows:
Background removal: for the top view I_m, the foreground picture I_F is obtained using formula (8):
where ε_F is the threshold set by the user for extracting the foreground, and I_F(u, v) is the pixel value at pixel (u, v) of the foreground picture I_F.
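Formula (8) itself is not reproduced; given the definitions of I_m, I_b and ε_F, a thresholded difference of the following form is a natural reading (the keep-value-or-zero form is an assumption):

```python
import numpy as np

def remove_background(I_m, I_b, eps_F):
    """Foreground picture from a top view and a background image.

    Assumed form of equation (8):
    I_F(u, v) = I_m(u, v) if |I_m(u, v) - I_b(u, v)| > eps_F, else 0.
    """
    return np.where(np.abs(I_m - I_b) > eps_F, I_m, 0.0)
```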
Blocking operation: the foreground picture I_F is divided into blocks of size w_b × w_b to obtain the picture I_B, using the following formula:
where I_F(u, v) is the pixel value at coordinates (u, v) of the foreground picture I_F, and I_B(x, y) is the pixel value at pixel (x, y) of the picture I_B.
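The blocking formula is likewise not reproduced; since the next step searches the blocked picture for local maxima of height, taking the maximum of each w_b × w_b block is a natural reading (an assumption):

```python
import numpy as np

def block_picture(I_F, w_b):
    """Downsample I_F into w_b x w_b blocks, keeping each block's maximum.

    Assumption: the missing blocking formula is a block maximum, which
    preserves the head-top heights that the local-maximum search relies on.
    """
    H, W = I_F.shape
    H2, W2 = H // w_b, W // w_b           # drop any partial border blocks
    blocks = I_F[:H2 * w_b, :W2 * w_b].reshape(H2, w_b, W2, w_b)
    return blocks.max(axis=(1, 3))
```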
Searching for local maximum regions: for each pixel (x, y) of the picture I_B, examine the eight surrounding pixels; if the pixel value of (x, y) is larger than the pixel values of all eight neighbours, put the pixel into the local maximum region set S_L. S_L^(i) denotes a member of S_L, and S_L^(i) = (u_i, v_i, d_i), where (u_i, v_i) is the pixel and d_i is the pixel value of (u_i, v_i) in the picture I_B.
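The eight-neighbour comparison can be sketched directly:

```python
import numpy as np

def find_local_maxima(I_B):
    """Return S_L = [(u, v, d), ...]: pixels strictly greater than all
    eight neighbours, with d the pixel value in I_B."""
    S_L = []
    H, W = I_B.shape
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            centre = I_B[y, x]
            window = I_B[y - 1:y + 2, x - 1:x + 2]
            # only the centre itself may reach the centre's value
            if np.sum(window >= centre) == 1:
                S_L.append((x, y, centre))
    return S_L
```

Border pixels are skipped here for simplicity, since they lack a full eight-neighbourhood.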
Expanding the local maximum regions: for each element S_L^(i) of the local maximum region set S_L, find the pixel position of S_L^(i) in the foreground picture I_F using the following formula:
where (x_i, y_i) is the position of S_L^(i) in the foreground picture I_F. Let S_S^(i) = (x_i, y_i, z_i), where (x_i, y_i) is the pixel of S_L^(i) in the foreground picture I_F; this yields the set S_S, S_S^(i) being an element of the set S_S.
For each member S_S^(i) = (x_i, y_i, z_i) of S_S, take S_S^(i) as a seed and expand outward by seed filling; the condition of the expansion is: if |I_F(x_i, y_i) − z_i| ≤ ε_E, where ε_E is a threshold set to 10, then a rectangular frame S_E^(i) = (u_i, v_i, H_i, W_i, z_i) is used to frame all the pixels satisfying the condition, where (u_i, v_i) is the upper-left corner of the rectangular frame, (H_i, W_i) are its height and width, and z_i is the original pixel value (i.e. the spatial height of the rectangular frame). The expanded regions finally form the set S_E, S_E^(i) being an element of the set S_E.
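The seed-filling expansion can be sketched as follows (4-connectivity and breadth-first filling are assumptions; the patent states only the expansion condition):

```python
from collections import deque

import numpy as np

def expand_region(I_F, seed, eps_E=10.0):
    """Grow a region outward from seed = (x, y, z) by seed filling.

    A 4-connected pixel joins the region if |I_F(y', x') - z| <= eps_E
    (4-connectivity is an assumption).  Returns the bounding rectangle
    (u, v, H_i, W_i, z) with (u, v) the upper-left corner and (H_i, W_i)
    its height and width.
    """
    x0, y0, z = seed
    h, w = I_F.shape
    seen = {(x0, y0)}
    queue = deque([(x0, y0)])
    xs, ys = [x0], [y0]
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < w and 0 <= ny < h and (nx, ny) not in seen \
               and abs(I_F[ny, nx] - z) <= eps_E:
                seen.add((nx, ny))
                queue.append((nx, ny))
                xs.append(nx)
                ys.append(ny)
    u, v = min(xs), min(ys)
    return (u, v, max(ys) - v + 1, max(xs) - u + 1, z)
```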
Filtering rectangular frames: after the expanded regions are obtained, coincident regions and abnormal regions must be filtered out, using two conditions: (1) if the rectangular frame S_E^(i) satisfies the following condition, it is not retained; (2) if two rectangular frames S_E^(i) = (u_i, v_i, H_i, W_i, z_i) and S_E^(j) = (u_j, v_j, H_j, W_j, z_j) satisfy the following condition, S_E^(i) and S_E^(j) are judged to coincide, and if they coincide, the rectangular frame with the larger of z_i and z_j is retained.
The retained rectangular frames form the rectangular frame set S_Fm; an element of S_Fm is S_Fm^(i). This completes the human head locking task.
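The two filtering conditions' formulas are not reproduced in the text; the sketch below assumes condition (1) drops frames smaller than a plausible head size and condition (2) treats intersecting rectangles as coincident, keeping the one with the larger height z:

```python
def filter_boxes(S_E, min_side=3):
    """Filter expanded boxes (u, v, H, W, z) into the set S_Fm.

    Assumptions: condition (1) drops boxes with a side below min_side;
    condition (2) treats boxes whose rectangles intersect as coincident,
    keeping the one with the larger height z.
    """
    boxes = [b for b in S_E if b[2] >= min_side and b[3] >= min_side]
    boxes.sort(key=lambda b: b[4], reverse=True)   # taller heads first
    kept = []
    for b in boxes:
        if not any(_overlap(b, k) for k in kept):
            kept.append(b)
    return kept

def _overlap(a, b):
    """True if the two rectangles (u, v, H, W, z) intersect."""
    au, av, ah, aw, _ = a
    bu, bv, bh, bw, _ = b
    return not (au + aw <= bu or bu + bw <= au
                or av + ah <= bv or bv + bh <= av)
```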
Examples
In this embodiment, the sampling frequency is 25 frames/second, the frame image size is 320 × 240, and the scene is the front-door scene of a bus.
The camera is mounted on the bus at the position shown in fig. 5, and a world coordinate system is established with the camera height H = 254 cm. Six groups of world coordinates and the corresponding image coordinates are selected using the calibration frame, and the parameter matrix P of the camera is calculated according to fig. 6.
Here the foreground-extraction threshold is set to 10. As shown in fig. 7, (a) is the background image of the channel scene, (b) is the obtained depth map, (c) is the foreground picture obtained by the background-removal operation, and (d) is the top view.
As shown in fig. 8, the block size is chosen as 5 × 5 and the expansion threshold as 10: (a) shows the result of the blocking operation, the white rectangular frames in (b) are the local maximum region set, the white rectangular frames in (c) are the expanded local maximum regions, and the white rectangular frames in (d) are the set after the filtering rectangular-frame processing.