WO2022188030A1

WO2022188030A1 - Crowd density estimation method, electronic device and storage medium

Info

Publication number: WO2022188030A1
Application number: PCT/CN2021/079755
Authority: WO
Inventors: 胡金星; 杨戈
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2021-03-09
Filing date: 2021-03-09
Publication date: 2022-09-15
Anticipated expiration: 2023-09-09

Abstract

Disclosed in the present application is a crowd density estimation method. The method comprises: acquiring a plurality of crowd images, wherein the plurality of crowd images are respectively acquired by means of a plurality of image acquisition devices; inputting the plurality of crowd images into a crowd density estimation network so as to obtain a first crowd density image corresponding to each crowd image, wherein the crowd density estimation network comprises several feature extraction layers, several feature fusion layers and a crowd density estimation layer, and the several feature extraction layers have different network depths; and according to the positions and image acquisition angles of the plurality of image acquisition devices, combining a plurality of first crowd density images to form a second crowd density image, so as to estimate the pedestrian flow of a target area by using the second crowd density image. By means of the method, the accuracy of acquisition devices acquiring crowd images at different viewing angles and different viewing field distances to perform crowd density estimation can be improved.

Description

Crowd density estimation method, electronic device and storage medium

【Technical field】

本申请涉及人群密度估计技术领域，特别是涉及人群密度估计方法、电子设备及存储介质。The present application relates to the technical field of crowd density estimation, and in particular, to a crowd density estimation method, an electronic device and a storage medium.

【Background technique】

随着城市现代化建设不断加深，城市人公共空间越发庞大复杂，同时城市人口规模也在不断加大，社会公众参与公共活动越来越多，由此带来城市潜在安全风险以及城市空间优化等问题，比如近来持续的公共卫生安全要求保持社交距离等，要求对人群密度较高精度、及时的感知。随着智慧城市的不断发展建设，监控视频网络系统的广泛部署应用使得我们充分感知公共空间的人群分布成为了可能。With the continuous deepening of urban modernization, the public space for urban people is becoming more and more complex, and the scale of urban population is also increasing, and the public is participating in more and more public activities, which brings potential urban security risks and urban space optimization issues. For example, the recent continuous public health security requirements to maintain social distance, etc., require high-precision and timely perception of crowd density. With the continuous development and construction of smart cities, the widespread deployment and application of surveillance video network systems makes it possible for us to fully perceive the distribution of people in public spaces.

相关技术中，对人群密度估计的精度仍有待提高。In the related art, the accuracy of crowd density estimation still needs to be improved.

【发明内容】[Content of the invention]

本申请主要解决的技术问题是提供一种人群密度估计方法、电子设备及存储介质，能够提高对采集设备在不同视角、不同视场远近采集人群图像进行人群密度估计的准确性。The technical problem mainly solved by the present application is to provide a crowd density estimation method, electronic device and storage medium, which can improve the accuracy of crowd density estimation for crowd images collected by a collection device at different viewing angles and different fields of view.

为了解决上述问题，本申请采用的一种技术方案是提供一种人群密度估计方法，该方法包括：获取多个人群图像；其中，多个人群图像分别由多个图像采集设备采集得到；将多个人群图像输入至人群密度估计网络，以得到与每一人群图像对应的第一人群密度图像；其中，人群密度估计网络包括若干特征提取层和若干特征融合层以及人群密度估计层，若干特征提取层具有不同的网络深度；根据多个图像采集设备的位置和图像采集角度，将多个第一人群密度图像进行组合形成第二人群密度图像，以利用第二人群密度图像进行目标区域的人流量估计。In order to solve the above problem, a technical solution adopted in the present application is to provide a method for estimating crowd density, the method comprising: acquiring multiple crowd images; wherein the multiple crowd images are acquired respectively by multiple image acquisition devices; The individual crowd images are input to the crowd density estimation network to obtain a first crowd density image corresponding to each crowd image; wherein, the crowd density estimation network includes several feature extraction layers, several feature fusion layers and crowd density estimation layers, and several feature extraction layers. The layers have different network depths; according to the positions of multiple image acquisition devices and the image acquisition angles, multiple first crowd density images are combined to form a second crowd density image, so as to use the second crowd density image to carry out the flow of people in the target area. estimate.

其中，根据多个图像采集设备的位置和图像采集角度，将多个第一人群密度图像进行组合形成第二人群密度图像，包括：根据每一采集设备的位置和图像采集角度确定每一采集设备的透视变换关系；利用透视变换关系将每一第一人群密度图像进行平面投影，得到对应的人群密度平面图像；对多个人群密度平面图像进行归一化；将归一化后的每一人群密度平面图像进行组合形成第二人群密度图像。Wherein, combining multiple first crowd density images to form a second crowd density image according to the positions and image capturing angles of multiple image capturing devices includes: determining each capturing device according to the position and image capturing angle of each capturing device The perspective transformation relationship is based on the perspective transformation relationship; use the perspective transformation relationship to perform plane projection on each first crowd density image to obtain the corresponding crowd density plane image; normalize multiple crowd density plane images; The density plane images are combined to form a second crowd density image.

其中，根据每一采集设备的位置和图像采集角度确定每一采集设备的透视变换关系，包括：在每一采集设备的位置对应的采集区域中确定至少四个空间坐标；以及在对应采集设备的人群图像中确定与至少四个空间坐标对应的像素点坐标；利用至少四个空间坐标和与至少四个空间坐标对应的像素点坐标确定每一采集设备的透视变换关系。Wherein, determining the perspective transformation relationship of each acquisition device according to the position of each acquisition device and the image acquisition angle includes: determining at least four spatial coordinates in the acquisition area corresponding to the position of each acquisition device; Pixel coordinates corresponding to at least four spatial coordinates are determined in the crowd image; the perspective transformation relationship of each acquisition device is determined by using the at least four spatial coordinates and the pixel coordinates corresponding to the at least four spatial coordinates.

其中，对多个人群密度平面图像进行归一化，包括：确定归一化权重矩阵；将每一人群密度平面图像与归一化权重矩阵点乘，以对每一人群密度平面图像进行归一化。Wherein, normalizing the plurality of crowd density plane images includes: determining a normalization weight matrix; multiplying each crowd density plane image with the normalization weight matrix to normalize each crowd density plane image change.

其中，确定归一化权重矩阵包括：利用以下公式确定归一化权重矩阵的元素：

其中，(x ₀,y ₀)表示人群图像上的像素点坐标，(x,y)表示人群密度平面图像上与人群图像上的像素点坐标相对应的像素点坐标，

为高斯模糊核中心落在人群图像像素点(x ₀,y ₀)的第一人群密度图像；

表示人群密度平面图像，i,j与m,n分别为人群图像上的像素点坐标和人群密度平面图像上的像素点坐标，w _xy为高斯模糊核中心落在人群图像像素点(x ₀,y ₀)的第一人群密度图像在人群密度平面图像(x,y)处像素点的权重，其中，

像素点(x ₀,y ₀)的像素值在使用高斯模糊计算前像素值为1且其他像素点的像素值为0。 Wherein, determining the normalized weight matrix includes: using the following formula to determine the elements of the normalized weight matrix:

Among them, (x ₀ , y ₀ ) represents the pixel coordinates on the crowd image, (x, y) represents the pixel coordinates on the crowd density plane image corresponding to the pixel coordinates on the crowd image,

is the first crowd density image with the Gaussian blur kernel center at the crowd image pixel point (x ₀ , y ₀ );

Represents the crowd density plane image, i, j and m, n are the pixel coordinates on the crowd image and the pixel coordinates on the crowd density plane image, respectively, w _xy is the Gaussian blur kernel center falling on the crowd image pixel (x ₀ , y ₀ ) of the first crowd density image at the crowd density plane image (x, y), where,

The pixel value of the pixel point (x ₀ , y ₀ ) is 1 before the Gaussian blur is calculated and the pixel value of other pixels is 0.

其中，将归一化后的每一人群密度平面图像进行组合形成第二人群密度图像，包括：确定每一人群密度平面图像的加权平均权重；获取每一人群密度平面图像中对应相同平面位置的像素点的第一像素值，得到像素值集合；利用加权平均权重将像素值集合中的第一像素值进行加权求平均，得到第二像素值；将第二像素值作为第二人群密度图像中对应像素点的像素值，以形成第二人群密度图像。Wherein, combining each normalized crowd density plane image to form a second crowd density image includes: determining the weighted average weight of each crowd density plane image; The first pixel value of the pixel point is obtained to obtain a pixel value set; the first pixel value in the pixel value set is weighted and averaged by using the weighted average weight to obtain the second pixel value; the second pixel value is used as the second crowd density image. The pixel values of the corresponding pixel points are used to form a second crowd density image.

其中，若干特征提取层包括第一特征提取层、第二特征提取层、第三特征提取层和第四特征提取层；其中，第一特征提取层、第二特征提取层、第三特征提取层和第四特征提取层的网络深度依次增加；若干特征融合层包括第一特征融合层、第二特征融合层、第三特征融合层、第四特征融合层和第五特征融合层；其中，第一特征融合层、第二特征融合层、第三特征融合层、第四特征融合层的网络深度相同，第五特征融合层的网络深度大于第一特征融合层的网络深度。Among them, several feature extraction layers include a first feature extraction layer, a second feature extraction layer, a third feature extraction layer and a fourth feature extraction layer; wherein, the first feature extraction layer, the second feature extraction layer, the third feature extraction layer and the network depth of the fourth feature extraction layer increase in turn; several feature fusion layers include the first feature fusion layer, the second feature fusion layer, the third feature fusion layer, the fourth feature fusion layer and the fifth feature fusion layer; The network depths of the first feature fusion layer, the second feature fusion layer, the third feature fusion layer, and the fourth feature fusion layer are the same, and the network depth of the fifth feature fusion layer is greater than the network depth of the first feature fusion layer.

其中，将多个人群图像输入至人群密度估计网络，以得到与每一人群图像对应的第一人群密度图像，包括：将每一人群图像输入至第一特征提取层，以输出第一特征图；将第一特征图输入至第二特征提取层，以输出第二特征图；将第二特征图输入至第三特征提取层，以输出第三特征图，以及将第二特征图输入至第一特征融合层，以输出第一特征融合图；将第三特征图输入至第四特征提取层，以输出第四特征图，以及将第三特征图和第一特征融合图输入至第五特征融合层，以输出第二特征融合图，以及将第三特征图输入至第二特征融合层，以输出第三特征融合图；将第四特征图、第二特征融合图和第三特征融合图输入至第三特征融合层，以输出第四特征融合图；将第四特征融合图输入至第四特征融合层，以输出第五特征融合图；将第五特征融合图输入至人群密度估计层，以输出与每一图像对应的第一人群密度图像。Wherein, inputting a plurality of crowd images into a crowd density estimation network to obtain a first crowd density image corresponding to each crowd image includes: inputting each crowd image into a first feature extraction layer to output a first feature map ; Input the first feature map to the second feature extraction layer to output the second feature map; Input the second feature map to the third feature extraction layer to output the third feature map, and input the second feature map to the first feature map a feature fusion layer to output the first feature fusion map; input the third feature map to the fourth feature extraction layer to output the fourth feature map, and input the third feature map and the first feature fusion map to the fifth feature fusion layer to output the second feature fusion map, and input the third feature map to the second feature fusion layer to output the third feature fusion map; combine the fourth feature map, the second feature fusion map and the third feature fusion map Input to the third feature fusion layer to output the fourth feature fusion map; input the fourth feature fusion map to the fourth feature fusion layer to output the fifth feature fusion map; input the fifth feature fusion map to the crowd density estimation layer , to output the first crowd density image corresponding to each image.

其中，第一特征提取层的通道数由输入至输出方向依次为3、64、64和64；第二特征提取层的通道数由输入至输出方向依次为64、128、128和128；第三特征提取层的通道数由输入至输出方向依次为128、256、256、256、256、256、256和256；第四特征提取层的通道数由输入至输出方向依次为256、512、512、512、512、512、512和512；其中，第一特征提取层、第二特征提取层、第三特征提取层和第四特征提取层中的池化层的步长为2和感受域为2；第一特征融合层的通道数由输入至输出方向依次为128和16；第二特征融合层的通道数由输入至输出方向依次为16和16；第三特征融合层的通道数由输入至输出方向依次为16和16；第四特征融合层的通道数由输入至输出方向依次为16、16、 16、16、16和16；第五特征融合层的通道数由输入至输出方向依次为256和16。Among them, the number of channels of the first feature extraction layer is 3, 64, 64 and 64 from the input to the output direction; the number of channels of the second feature extraction layer is 64, 128, 128 and 128 from the input to the output direction; the third The number of channels of the feature extraction layer is 128, 256, 256, 256, 256, 256, 256 and 256 from input to output; the number of channels of the fourth feature extraction layer is 256, 512, 512, 512, 512, 512, 512, and 512; wherein the pooling layers in the first feature extraction layer, the second feature extraction layer, the third feature extraction layer and the fourth feature extraction layer have a step size of 2 and a receptive field of 2 ; the number of channels of the first feature fusion layer is 128 and 16 from the input to the output direction; the number of channels of the second feature fusion layer is 16 and 16 from the input to the output direction; the number of channels of the third feature fusion layer is from the input to the output direction. The output directions are 16 and 16 in turn; the number of channels of the fourth feature fusion layer is 16, 16, 16, 16, 16 and 16 from the input to the output direction; the number of channels of the fifth feature fusion layer is from the input to the output direction. 256 and 16.

其中，该方法还包括：在第一特征融合层、第二特征融合层、第三特征融合层、第四特征融合层和第五特征融合层中输入的特征图大小和通道数不一致时，采用双线性差值法对特征图进行上采样和下采样处理，并使用预设卷积层进行处理，以输出统一通道数的特征图。Wherein, the method further includes: when the input feature map size and the number of channels in the first feature fusion layer, the second feature fusion layer, the third feature fusion layer, the fourth feature fusion layer and the fifth feature fusion layer are inconsistent, adopting The bilinear difference method upsamples and downsamples the feature maps, and uses preset convolutional layers for processing to output feature maps with a uniform number of channels.

为了解决上述问题，本申请采用的另一种技术方案是提供一种电子设备，该电子设备包括处理器和处理器连接的存储器；其中，存储器用于存储程序数据，处理器用于执行程序数据，以实现如上述技术方案提供的方法。In order to solve the above problems, another technical solution adopted in this application is to provide an electronic device, the electronic device includes a processor and a memory connected to the processor; wherein the memory is used for storing program data, the processor is used for executing the program data, In order to realize the method provided by the above technical solution.

为了解决上述问题，本申请采用的另一种技术方案是提供一种计算机可读存储介质，该计算机可读存储介质用于存储程序数据，程序数据在被处理器执行时，用于实现如上述技术方案提供的方法。In order to solve the above problem, another technical solution adopted in the present application is to provide a computer-readable storage medium, the computer-readable storage medium is used to store program data, and when the program data is executed by the processor, it is used to realize the above-mentioned Methods provided by technical solutions.

本申请的有益效果是：区别于现有技术的情况，本申请的一种人群密度估计方法，该方法包括：获取多个人群图像；其中，多个人群图像分别由多个图像采集设备采集得到；将多个人群图像输入至人群密度估计网络，以得到与每一人群图像对应的第一人群密度图像；其中，人群密度估计网络包括若干特征提取层和若干特征融合层以及人群密度估计层，若干特征提取层具有不同的网络深度；根据多个图像采集设备的位置和图像采集角度，将多个第一人群密度图像进行组合形成第二人群密度图像，以利用第二人群密度图像进行目标区域的人流量估计。通过上述方式，利用特征融合层和不同网络深度的特征提取层对每一人群图像进行不同尺度的特征提取以及融合，以适应不同人群图像的采集高度，以便能够更好的进行特征提取和进一步的人群密度估计，能够提高对采集设备在不同视角、不同视场远近采集人群图像进行人群密度估计的准确性，提升在跨视频人群分布统计中进行人群密度估计的准确性。The beneficial effects of the present application are: different from the situation in the prior art, a method for estimating crowd density of the present application includes: acquiring a plurality of crowd images; wherein, the plurality of crowd images are respectively acquired by a plurality of image acquisition devices ; Input a plurality of crowd images into a crowd density estimation network to obtain a first crowd density image corresponding to each crowd image; wherein, the crowd density estimation network includes several feature extraction layers and several feature fusion layers and a crowd density estimation layer, Several feature extraction layers have different network depths; according to the positions and image acquisition angles of multiple image acquisition devices, multiple first crowd density images are combined to form a second crowd density image, so as to use the second crowd density image to carry out the target area. estimated traffic flow. Through the above method, the feature fusion layer and the feature extraction layer with different network depths are used to extract and fuse features of different scales for each crowd image, so as to adapt to the collection height of different crowd images, so as to better perform feature extraction and further processing. Crowd density estimation can improve the accuracy of crowd density estimation for crowd images collected by collection devices from different viewing angles and different fields of view, and improve the accuracy of crowd density estimation in cross-video crowd distribution statistics.

【Description of drawings】

为了更清楚地说明本申请实施例中的技术方案，下面将对实施例描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。其中：In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings that are used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort. in:

图1是本申请提供的人群密度估计方法一实施例流程示意图；1 is a schematic flowchart of an embodiment of a crowd density estimation method provided by the present application;

图2是本申请提供的显示界面的调节方法第二实施例流程示意图；2 is a schematic flowchart of a second embodiment of a method for adjusting a display interface provided by the present application;

图3是本申请提供的人群密度估计方法另一实施例流程示意图；3 is a schematic flowchart of another embodiment of the crowd density estimation method provided by the present application;

图4是本申请提供的步骤33的具体流程示意图；Fig. 4 is the specific flow chart of step 33 provided by this application;

图5是本申请提供的步骤35的具体流程示意图；Fig. 5 is the specific flow chart of step 35 provided by this application;

图6是本申请提供的步骤36的具体流程示意图；Fig. 6 is the specific flow chart of step 36 provided by this application;

图7是本申请提供的人群密度估计方法另一实施例流程示意图；7 is a schematic flowchart of another embodiment of the crowd density estimation method provided by the present application;

图8是本申请提供的人群密度估计方法的一应用示意图；8 is a schematic diagram of an application of the crowd density estimation method provided by the present application;

图9是本申请提供的电子设备一实施例的结构示意图；9 is a schematic structural diagram of an embodiment of an electronic device provided by the present application;

图10是本申请提供的计算机可读存储介质一实施例的结构示意图。FIG. 10 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided by the present application.

【Detailed ways】

下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述。可以理解的是，此处所描述的具体实施例仅用于解释本申请，而非对本申请的限定。另外还需要说明的是，为了便于描述，附图中仅示出了与本申请相关的部分而非全部结构。基于本申请中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application. In addition, it should be noted that, for the convenience of description, the drawings only show some but not all the structures related to the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

在本文中提及“实施例”意味着，结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例，也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是，本文所描述的实施例可以与其它实施例相结合。Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor a separate or alternative embodiment that is mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.

参阅图1，图1是本申请提供的人群密度估计方法一实施例流程示意图。该方法包括：Referring to FIG. 1 , FIG. 1 is a schematic flowchart of an embodiment of a crowd density estimation method provided by the present application. The method includes:

步骤11：获取多个人群图像。Step 11: Acquire multiple crowd images.

其中，多个人群图像分别由多个图像采集设备采集得到。可以理解，人群图像中并不一定含有人群。Wherein, multiple crowd images are acquired by multiple image acquisition devices respectively. It can be understood that crowd images do not necessarily contain crowds.

在一些实施例中，多个图像采集设备可分布于一区域的不同位置，以采集对应位置的人群图像。如该区域为十字路口，参阅图2，将该十字路口的平面图以XOY坐标系进行划分，则在第一象限对应的区域设置采集设备D，在第二象限对应的区域设置采集设备A，在第三象限对应的区域设置采集设备B，在第四象限对应的区域设置采集设备C。采集设备A、采集设备B、采集设备C和采集设备D可分别采集其对应区域的人群图像。In some embodiments, a plurality of image capturing devices may be distributed at different positions in an area to capture crowd images at corresponding positions. If the area is an intersection, referring to Figure 2, the plan view of the intersection is divided by the XOY coordinate system, then the acquisition device D is set in the area corresponding to the first quadrant, the acquisition device A is set in the area corresponding to the second quadrant, and the The acquisition device B is set in the area corresponding to the third quadrant, and the acquisition device C is set in the area corresponding to the fourth quadrant. The acquisition device A, the acquisition device B, the acquisition device C, and the acquisition device D can respectively acquire crowd images in their corresponding regions.

在一些实施例中，步骤11可以是将对多个人群图像进行预处理。具体地，因多个人群图像由不同的采集设备采集得到，则可按照采集设备进行分类，并在分类后按照人群图像的生成时间进行排序。然后遍历每一采集设备对应的人群图像，获取这些人群图像中生成时间相同的多个人群图像。In some embodiments, step 11 may be to preprocess the plurality of crowd images. Specifically, since multiple crowd images are collected by different collection devices, they can be classified according to the collection devices, and after the classification, the crowd images can be sorted according to the generation time of the crowd images. Then traverse the crowd images corresponding to each acquisition device, and acquire multiple crowd images with the same generation time among these crowd images.

步骤12：将多个人群图像输入至人群密度估计网络，以得到与每一人群图像对应的第一人群密度图像；其中，人群密度估计网络包括若干特征提取层和若干特征融合层以及人群密度估计层，若干特征提取层具有不同的网络深度。Step 12: Input multiple crowd images into the crowd density estimation network to obtain a first crowd density image corresponding to each crowd image; wherein, the crowd density estimation network includes several feature extraction layers and several feature fusion layers and crowd density estimation layers, several feature extraction layers have different network depths.

在一些实施例中，每一人群图像可对应输入一人群密度估计网络，以得到与该人群图像对应的第一人群密度图像。In some embodiments, each crowd image may be input to a crowd density estimation network to obtain a first crowd density image corresponding to the crowd image.

在一些实施例中，将多个人群图像进行排序，然后将多个人群图像按照排序的先后顺序依次输入至人群密度估计网络，以使人群密度估计网络输出与每一人群图像对应的第一人群密度图像。In some embodiments, the plurality of crowd images are sorted, and then the plurality of crowd images are sequentially input to the crowd density estimation network according to the sorted order, so that the crowd density estimation network outputs the first crowd corresponding to each crowd image density image.

下面介绍人群密度轨迹网络对人群图像的处理过程：The following describes the process of crowd density trajectory network processing crowd images:

首先，人群图像输入至若干特征提取层中网络深度最小的特征提取层，以在该特征提取层进行对应网络深度的特征提取，得到第一目标特征图；然后将该第一目标特征图输入至下一特征提取层，以在下一特征提取层中得到第二目标特征图，将第二目标特征图分别输入至下一特征提取层和特征融合层，以得到第三目标特征图和第一目标融合图，按照此逻辑，根据特征提取层和特征融合层的数量进行对应的特征提取和特征融合。将最后一个特征融合层输出的目标融合图输入至人群密度估计层，得到与每一人群图像对应的第一人群密度图像。First, the crowd image is input to the feature extraction layer with the smallest network depth among several feature extraction layers, so as to perform feature extraction corresponding to the network depth in the feature extraction layer to obtain the first target feature map; then input the first target feature map to the The next feature extraction layer is to obtain the second target feature map in the next feature extraction layer, and the second target feature map is input to the next feature extraction layer and feature fusion layer respectively to obtain the third target feature map and the first target feature map In the fusion graph, according to this logic, corresponding feature extraction and feature fusion are performed according to the number of feature extraction layers and feature fusion layers. The target fusion map output by the last feature fusion layer is input to the crowd density estimation layer, and the first crowd density image corresponding to each crowd image is obtained.

在一些实施例中，每一特征提取层包括若干卷积层。每一特征融合层包括若干卷积层以及人群密度估计层包括若干卷积层。其中，每一卷积层后具有激活层。In some embodiments, each feature extraction layer includes several convolutional layers. Each feature fusion layer includes several convolutional layers and the crowd density estimation layer includes several convolutional layers. Among them, each convolutional layer is followed by an activation layer.

在一应用场景中，以若干卷积层(每层卷积层后具有ReLu激活层)作为一个特征提取层，以若干卷积层作为一个特征融合层(每层卷积层后具有ReLu激活层)，以若干卷积层作为人群密度估计层(每层卷积层后具有ReLu激活层)以组成人群密度估计网络。In an application scenario, several convolutional layers (with ReLu activation layer after each convolutional layer) are used as a feature extraction layer, and several convolutional layers are used as a feature fusion layer (with ReLu activation layer after each convolutional layer). ), several convolutional layers are used as crowd density estimation layers (each convolutional layer is followed by a ReLu activation layer) to form a crowd density estimation network.

进一步说明，每一特征提取层具有对特征图下采样的功能即特征提取层输出的目标特征图的宽和高降低1/2倍大小，可通过最大池化层或卷积层实现。其中，人群密度估计网络分N个阶段计算输出第一人群密度图像；除第一个阶段的特征提取层的输入为人群图像，每个阶段的特征提取层只输入上一个阶段的特征提取层输出的目标特征图；每个阶段的特征融合层同时输入上一个阶段的特征提取层和特征融合层输出的目标特征图；每个特征融合层以4x、8x分别表示其处理输入大小为1/4、1/8图像大小的目标特征图；每个特征融合层的输入与处理的特征图大小不一致时，采用双线性插值对输入的目标特征图进行上采样和下采样，否则直接复制输入。It is further explained that each feature extraction layer has the function of downsampling the feature map, that is, the width and height of the target feature map output by the feature extraction layer are reduced by 1/2 times the size, which can be realized by a maximum pooling layer or a convolution layer. Among them, the crowd density estimation network calculates and outputs the first crowd density image in N stages; except that the input of the feature extraction layer of the first stage is the crowd image, the feature extraction layer of each stage only inputs the output of the feature extraction layer of the previous stage The target feature map of the target feature map; the feature fusion layer of each stage simultaneously inputs the feature extraction layer of the previous stage and the target feature map output by the feature fusion layer; each feature fusion layer is represented by 4x, 8x, and its processing input size is 1/4 , 1/8 image size of the target feature map; when the input of each feature fusion layer is inconsistent with the size of the processed feature map, bilinear interpolation is used to upsample and downsample the input target feature map, otherwise the input is directly copied.

更进一步说明，在一些实施例中，人群密度估计网络的第一个阶段输入图像依次由两个特征提取层串联构成，第二个阶段由一个4x特征融合层和一个特征提取层并列构成，第三个阶段由一个4x特征融合层、一个8x特征融合层和一个特征提取层并列构成，第四个阶段由一个4x特征融合层、一个8x特征融合层、一个16x特征融合层和一个特征提取层并列构成，第五个阶段由一个4x特征融合模块和人群密度估计层串联构成，特别地第四个阶段的4x特征融合模块是以若干并列的不同分离率(Dilation Rate)的卷积层作为一个特征融合层实现多尺度特征融合(每层卷积层后具有ReLu激活层)。关键地，一个特征融合层同时接受多个特征融合层和特征提取层的输出作为输入时，采用加法将特征图逐元素相加再输入特征融合层进行计算。To further illustrate, in some embodiments, the input image of the first stage of the crowd density estimation network is sequentially composed of two feature extraction layers in series, the second stage is composed of a 4x feature fusion layer and a feature extraction layer in parallel, and the second stage is composed of a 4x feature fusion layer and a feature extraction layer. The three stages consist of a 4x feature fusion layer, an 8x feature fusion layer and a feature extraction layer in parallel, and the fourth stage consists of a 4x feature fusion layer, an 8x feature fusion layer, a 16x feature fusion layer and a feature extraction layer The fifth stage is composed of a 4x feature fusion module and a crowd density estimation layer in series. In particular, the 4x feature fusion module of the fourth stage is composed of several parallel convolutional layers with different separation rates (Dilation Rate) as a The feature fusion layer implements multi-scale feature fusion (with a ReLu activation layer after each convolutional layer). Crucially, when a feature fusion layer accepts the outputs of multiple feature fusion layers and feature extraction layers as input at the same time, the feature maps are added element by element by addition and then input to the feature fusion layer for calculation.

第一、二、三和四阶段的网络构成多尺度特征的融合和提取，以提取出多尺度的隐藏特征；第五个阶段的4x特征融合层构成多尺度感受域卷积网络模块，进一步融合或变换多尺度的隐藏特征；第五个阶段的人群密度估计层输入由多尺度感受域卷积网络模块形成的特征融合层所输出的多尺度隐藏特征，以进行计算输出第一人群密度图像。The first, second, third and fourth stages of the network constitute the fusion and extraction of multi-scale features to extract multi-scale hidden features; the 4x feature fusion layer in the fifth stage constitutes a multi-scale receptive field convolutional network module for further fusion Or transform the multi-scale hidden features; the crowd density estimation layer in the fifth stage inputs the multi-scale hidden features output by the feature fusion layer formed by the multi-scale receptive field convolutional network module to calculate and output the first crowd density image.

步骤13：根据多个图像采集设备的位置和图像采集角度，将多个第一人群密度图像进行组合形成第二人群密度图像，以利用第二人群密度图像进行目标区域的人流量估计。Step 13: Combine multiple first crowd density images to form a second crowd density image according to the positions and image capturing angles of multiple image capturing devices, so as to use the second crowd density image to estimate the flow of people in the target area.

在一些实施例中，因每一采集设备安装的位置和对图像采集的角度不同，则根据每一采集设备的位置和图像采集角度对第一人群密度图像进行坐标转换，将第一人群密度图像转换为采集设备所采集区域的平面图像。此时会得到多个对应采集设备所采集区域的平面图像，然后将这些平面图像进行处理得到第二人群密度图像，此时，可以利用第二人群密度图像进行多个采集设备所在的目标区域的人流量估计。In some embodiments, since the installation position of each collection device and the angle of image collection are different, coordinate transformation is performed on the first crowd density image according to the position of each collection device and the image collection angle, and the first crowd density image Converted to a flat image of the area acquired by the acquisition device. At this time, a plurality of plane images corresponding to the areas collected by the acquisition devices will be obtained, and then these plane images will be processed to obtain a second crowd density image. Estimation of people flow.

如，在得到第二人群密度图像时，将该第二人群密度图像中表示人群的像素区域利用特定的颜色表示。其中，可根据像素区域中对像素点设置不同的像素值，以表示不同的人群密度。For example, when the second crowd density image is obtained, the pixel area representing the crowd in the second crowd density image is represented by a specific color. Among them, different pixel values can be set for pixel points in the pixel area to represent different crowd densities.

在本实施例中，通过获取多个人群图像；其中，多个人群图像分别由多个图像采集设备采集得到；将多个人群图像输入至人群密度估计网络，以得到与每一人群图像对应的第一人群密度图像；其中，人群密度估计网络包括若干特征提取层和若干特征融合层以及人群密度估计层；若干特征提取层具有不同的网络深度；根据多个图像采集设备的位置和图像采集角度，将多个第一人群密度图像进行组合形成第二人群密度图像，以利用第二人群密度图像进行目标区域的人流量估计。通过上述方式，利用特征融合层和不同网络深度的特征提取层对每一人群图像进行不同尺度的特征提取以及融合，以适应不同人群图像的采集高度，以便能够更好的进行特征提取和进一步的人群密度估计，能够提高对采集设备在不同视角、不同视场远近采集人群图像进行人群密度估计的准确性，提升在跨视频人群分布统计中进行人群密度估计的准确性。In this embodiment, by acquiring multiple crowd images; wherein, multiple crowd images are acquired by multiple image acquisition devices respectively; and multiple crowd images are input into the crowd density estimation network to obtain a corresponding image of each crowd image. The first crowd density image; wherein, the crowd density estimation network includes several feature extraction layers, several feature fusion layers and crowd density estimation layers; several feature extraction layers have different network depths; according to the positions of multiple image acquisition devices and image acquisition angles , combining a plurality of first crowd density images to form a second crowd density image, so as to use the second crowd density image to estimate the flow of people in the target area. Through the above method, the feature fusion layer and the feature extraction layer with different network depths are used to extract and fuse features of different scales for each crowd image, so as to adapt to the collection height of different crowd images, so as to better perform feature extraction and further processing. Crowd density estimation can improve the accuracy of crowd density estimation for crowd images collected by collection devices from different viewing angles and different fields of view, and improve the accuracy of crowd density estimation in cross-video crowd distribution statistics.

参阅图3，图3是本申请提供的人群密度估计方法另一实施例流程示意图。该方法包括：Referring to FIG. 3, FIG. 3 is a schematic flowchart of another embodiment of the crowd density estimation method provided by the present application. The method includes:

步骤31：获取多个人群图像。Step 31: Acquire multiple crowd images.

步骤32：将多个人群图像输入至人群密度估计网络，以得到与每一人群图像对应的第一人群密度图像；其中，人群密度估计网络包括若干特征提取层和若干特征融合层以及人群密度估计层，若干特征提取层具有不同的网络深度。Step 32: Input multiple crowd images into the crowd density estimation network to obtain a first crowd density image corresponding to each crowd image; wherein, the crowd density estimation network includes several feature extraction layers, several feature fusion layers and crowd density estimation layers. layers, several feature extraction layers have different network depths.

步骤31-32与上述实施例具有相同或相似的技术方案，这里不做赘述。Steps 31 to 32 have the same or similar technical solutions as the above-mentioned embodiments, and are not repeated here.

步骤33：根据每一采集设备的位置和图像采集角度确定每一采集设备的透视变换关系。Step 33: Determine the perspective transformation relationship of each acquisition device according to the position of each acquisition device and the image acquisition angle.

因每一采集设备的位置和图像采集角度的不同，则每一采集设备对应一透视变换关系。可根据采集设备采集的区域的空间坐标和采集角度，计算的采集设备采集的人群图像与该区域空间坐标之间的透视变换关系。Due to the difference in the position of each acquisition device and the image acquisition angle, each acquisition device corresponds to a perspective transformation relationship. The perspective transformation relationship between the crowd image collected by the collection device and the spatial coordinates of the area can be calculated according to the spatial coordinates and collection angle of the area collected by the collection device.

在一些实施例中，参阅图4，步骤33可以是如下流程：In some embodiments, referring to FIG. 4 , step 33 may be the following process:

步骤331：在每一采集设备的位置对应的采集区域中确定至少四个空间坐标；以及在对应采集设备的人群图像中确定与至少四个空间坐标对应的像素点坐标。Step 331: Determine at least four spatial coordinates in the collection area corresponding to the location of each collection device; and determine pixel point coordinates corresponding to the at least four spatial coordinates in the crowd image corresponding to the collection device.

该至少四个空间坐标可以是该采集设备的位置对应的采集区域中的标志性建筑的空间坐标。因建筑的坐标在该采集区域相对移动的人群是固定的，则以建筑坐标的空间坐标和在人群图像中的像素点坐标作为相对应的参考坐标，执行步骤332。The at least four spatial coordinates may be spatial coordinates of landmark buildings in the acquisition area corresponding to the location of the acquisition device. Since the coordinates of the building are fixed relative to the crowd in the collection area, the spatial coordinates of the building coordinates and the coordinates of the pixel points in the crowd image are used as the corresponding reference coordinates, and step 332 is executed.

步骤332：利用至少四个空间坐标和与至少四个空间坐标对应的像素点坐标确定每一采集设备的透视变换关系。Step 332: Determine the perspective transformation relationship of each acquisition device by using at least four spatial coordinates and pixel point coordinates corresponding to the at least four spatial coordinates.

具体地，可以利用至少四个空间坐标和与至少四个空间坐标对应的像素点坐标确实透视变换矩阵，将此透视变换矩阵作为每一采集设备的透视变换关系。Specifically, at least four spatial coordinates and pixel point coordinates corresponding to the at least four spatial coordinates can be used to determine the perspective transformation matrix, and the perspective transformation matrix can be used as the perspective transformation relationship of each acquisition device.

如，可以利用以下公式计算得到透视变换矩阵：For example, the perspective transformation matrix can be calculated using the following formula:

[x',y',w']＝[x,y,w]*A；[x',y',w']＝[x,y,w]*A;

其中，[x',y',w']是变换后的坐标，即采集区域的空间坐标，[x,y,w]是变换前的坐标，即人群图像中的像素点坐标，A是透视变换矩阵。Among them, [x', y', w'] are the transformed coordinates, that is, the spatial coordinates of the collection area, [x, y, w] are the coordinates before transformation, that is, the pixel coordinates in the crowd image, and A is the perspective Transformation matrix.

将上述的至少四个空间坐标和与至少四个空间坐标对应的像素点坐标对应的代入上述公式，则可得到透视变换矩阵A中的参数a ₁₁、a ₁₂、a ₁₃、a ₂₁、a ₂₂、a ₂₃、a ₃₁、a ₃₂和a ₃₃。 Substituting the above-mentioned at least four spatial coordinates and the pixel coordinates corresponding to the at least four spatial coordinates into the above formula, the parameters a ₁₁ , a ₁₂ , a ₁₃ , a ₂₁ , a ₂₂ in the perspective transformation matrix A can be obtained , a ₂₃ , a ₃₁ , a ₃₂ and a ₃₃ .

其中，在进行二维转换时，在使用上述公式时，可将坐标中的w'和w设置为1。Among them, when performing two-dimensional transformation, when using the above formula, w' and w in the coordinates can be set to 1.

步骤34：利用透视变换关系将每一第一人群密度图像进行平面投影，得到对应的人群密度平面图像。Step 34: Using the perspective transformation relationship, perform plane projection on each first crowd density image to obtain a corresponding crowd density plane image.

在得到透视变换关系后，则将第一人群密度图像中的每一像素点与透视变换关系进行计算，相当于进行平面投影，得到其对应于采集区域的空间坐标，然后以这些空间坐标形成对应的人群密度平面图像。After the perspective transformation relationship is obtained, each pixel in the first crowd density image is calculated with the perspective transformation relationship, which is equivalent to performing plane projection to obtain its spatial coordinates corresponding to the collection area, and then form a correspondence with these spatial coordinates Crowd Density Flat Image.

步骤35：对多个人群密度平面图像进行归一化。Step 35: Normalize the plurality of crowd density plane images.

在一些实施例中，参阅图5，步骤35可以是如下流程：In some embodiments, referring to FIG. 5, step 35 may be the following process:

步骤351：确定归一化权重矩阵。Step 351: Determine the normalized weight matrix.

由于通过透视变换将第一人群密度图像投影到平面上会带来畸变，则需要将其进行归一化。Since the projection of the first crowd density image onto the plane by perspective transformation will bring about distortion, it needs to be normalized.

其中，确定归一化权重矩阵包括：利用以下公式确定归一化权重矩阵：

像素点(x ₀,y ₀)的像素值在使用高斯模糊计算前像素值为1且其他像素点的像素值为0。 Wherein, determining the normalized weight matrix includes: using the following formula to determine the normalized weight matrix:

步骤352：将每一人群密度平面图像与归一化权重矩阵点乘，以对每一人群密度平面图像进行归一化。Step 352: Dot-multiply each crowd density plane image with the normalized weight matrix to normalize each crowd density plane image.

将人群密度平面图上的每一像素点与归一化权重矩阵点乘，得到对应的像素值，基于该像素值组成归一化后的人群密度平面图像。Multiply each pixel on the crowd density plane image with the normalized weight matrix to obtain the corresponding pixel value, and form a normalized crowd density plane image based on the pixel value.

步骤36：将归一化后的每一人群密度平面图像进行组合形成第二人群密度图像。Step 36: Combine each of the normalized crowd density plane images to form a second crowd density image.

在一些实施例中，参阅图6，步骤36可以是如下流程：In some embodiments, referring to FIG. 6, step 36 may be the following process:

步骤361：确定每一人群密度平面图像的加权平均权重。Step 361: Determine the weighted average weight of each crowd density plane image.

步骤362：获取每一人群密度平面图像中对应相同平面位置的像素点的第一像素值，得到像素值集合。Step 362: Acquire the first pixel value of the pixel point corresponding to the same plane position in each crowd density plane image to obtain a set of pixel values.

步骤363：利用加权平均权重将像素值集合中的第一像素值进行加权求平均，得到第二像素值。Step 363: Use the weighted average weight to perform a weighted average of the first pixel values in the pixel value set to obtain a second pixel value.

步骤364：将第二像素值作为第二人群密度图像中对应像素点的像素值，以形成第二人群密度图像。Step 364: Use the second pixel value as the pixel value of the corresponding pixel in the second crowd density image to form a second crowd density image.

对于形成第二人群密度图像而言，需要遍历其每一空间位置(即所有像素点)，采用加权平均的方式将每一人群密度平面图像上对应的像素点的像素值求和平均作为第二人群密度图像中对应像素点的像素值，最终形成第二人群密度图像。其中，加权平均权重为每一人群密度平面图像中每一像素位置(对应世界坐标平面上的位置)被监控视频覆盖的采集设备数量的倒数。For the formation of the second crowd density image, it is necessary to traverse each of its spatial positions (that is, all the pixel points), and use the weighted average method to sum and average the pixel values of the corresponding pixel points on each crowd density plane image as the second The pixel value of the corresponding pixel in the crowd density image finally forms the second crowd density image. Wherein, the weighted average weight is the inverse of the number of collection devices covered by the surveillance video for each pixel position (corresponding to the position on the world coordinate plane) in each crowd density plane image.

可以理解，因采集设备的设置会导致采集设备的采集区域会发生重叠，则此时重叠部分需要按照步骤361-364进行处理。同样，未重叠部分也可按照上述步骤进行，只是未重叠部分的加权平均权重为1。It can be understood that the collection areas of the collection devices may overlap due to the settings of the collection devices, and at this time, the overlapping parts need to be processed according to steps 361-364. Similarly, the non-overlapping parts can also be performed according to the above steps, except that the weighted average weight of the non-overlapping parts is 1.

在本实施例中，通过上述方式，采用透视变换关系将多个采集设备的第一人群密度图像投影变换到同一平面上，并进行归一化和空间融合，以实现跨视频人流量估计。In this embodiment, the perspective transformation relationship is used to project the first crowd density images of multiple collection devices onto the same plane, and normalization and spatial fusion are performed to realize cross-video human flow estimation.

参阅图7和图8，图7是本申请提供的人群密度估计方法另一实施例流程示意图，图8是本申请提供的人群密度估计方法的一应用示意图。在图8中，若干特征提取层包括第一特征提取层、第二特征提取层、第三特征提取层和第四特征提取层；其中，第一特征提取层、第二特征提取层、第三特征提取层和第四特征提取层的网络深度依次增加；若干特征融合层包括第一特征融合层、第二特征融合层、第三特征融合层、第四特征融合层和第五特征融合层；其中，第一特征融合层、第二特征融合层、第三特征融合层、第四特征融合层的网络深度相同，第五特征融合层的网络深度大于第一特征融合层的网络深度。Referring to FIG. 7 and FIG. 8 , FIG. 7 is a schematic flowchart of another embodiment of the crowd density estimation method provided by the present application, and FIG. 8 is an application schematic diagram of the crowd density estimation method provided by the present application. In FIG. 8, several feature extraction layers include a first feature extraction layer, a second feature extraction layer, a third feature extraction layer and a fourth feature extraction layer; wherein the first feature extraction layer, the second feature extraction layer, the third feature extraction layer The network depths of the feature extraction layer and the fourth feature extraction layer are sequentially increased; several feature fusion layers include a first feature fusion layer, a second feature fusion layer, a third feature fusion layer, a fourth feature fusion layer, and a fifth feature fusion layer; The network depths of the first feature fusion layer, the second feature fusion layer, the third feature fusion layer, and the fourth feature fusion layer are the same, and the network depth of the fifth feature fusion layer is greater than the network depth of the first feature fusion layer.

该方法包括：The method includes:

步骤71：获取多个人群图像。Step 71: Acquire multiple crowd images.

步骤72：将每一人群图像输入至第一特征提取层，以输出第一特征图。Step 72: Input each crowd image to the first feature extraction layer to output a first feature map.

步骤73：将第一特征图输入至第二特征提取层，以输出第二特征图。Step 73: Input the first feature map to the second feature extraction layer to output the second feature map.

步骤74：将第二特征图输入至第三特征提取层，以输出第三特征图，以及将第二特征图输入至第一特征融合层，以输出第一特征融合图。Step 74: Input the second feature map to the third feature extraction layer to output the third feature map, and input the second feature map to the first feature fusion layer to output the first feature fusion map.

步骤75：将第三特征图输入至第四特征提取层，以输出第四特征图，以及将第三特征图和第一特征融合图输入至第五特征融合层，以输出第二特征融合图，以及将第三特征图输入至第二特征融合层，以输出第三特征融合图。Step 75: Input the third feature map to the fourth feature extraction layer to output the fourth feature map, and input the third feature map and the first feature fusion map to the fifth feature fusion layer to output the second feature fusion map , and input the third feature map to the second feature fusion layer to output the third feature fusion map.

步骤76：将第四特征图、第二特征融合图和第三特征融合图输入至第三特征融合层，以输出第四特征融合图。Step 76: Input the fourth feature map, the second feature fusion map and the third feature fusion map to the third feature fusion layer to output the fourth feature fusion map.

步骤77：将第四特征融合图输入至第四特征融合层，以输出第五特征融合图。Step 77: Input the fourth feature fusion map to the fourth feature fusion layer to output the fifth feature fusion map.

步骤78：将第五特征融合图输入至人群密度估计层，以输出与每一图像对应的第一人群密度图像。Step 78: Input the fifth feature fusion map to the crowd density estimation layer to output a first crowd density image corresponding to each image.

步骤79：根据多个图像采集设备的位置和图像采集角度，将多个第一人群密度图像进行组合形成第二人群密度图像，以利用第二人群密度图像进行目标区域的人流量估计。Step 79: Combine multiple first crowd density images to form a second crowd density image according to the positions and image capturing angles of multiple image capturing devices, so as to use the second crowd density image to estimate the flow of people in the target area.

在一应用场景中，第一特征提取层的通道数由输入至输出方向依次为3、64、64和64。具体地，第一特征提取层的结构为{C(3,3,64),C(3,64,64),M(2,2)}，其中，C(3,3,64)表示一个卷积核大小为3、输入通道数为3、输出通道数为64、默认激活函数为ReLu的卷积层，M(2,2)表示一个感受域大小为2、步长为2的最大池化层。In an application scenario, the number of channels of the first feature extraction layer is 3, 64, 64 and 64 in order from the input to the output direction. Specifically, the structure of the first feature extraction layer is {C(3,3,64), C(3,64,64), M(2,2)}, where C(3,3,64) represents a The convolution kernel size is 3, the number of input channels is 3, the number of output channels is 64, and the default activation function is ReLu. M(2,2) represents a maximum pool with a receptive field size of 2 and a stride of 2. chemical layer.

第二特征提取层的通道数由输入至输出方向依次为64、128、128和128。具体地，第二特征提取层的结构为{C(3,64,128),C(3,128,128),M(2,2)}。The number of channels of the second feature extraction layer is 64, 128, 128 and 128 in order from input to output. Specifically, the structure of the second feature extraction layer is {C(3, 64, 128), C(3, 128, 128), M(2, 2)}.

第三特征提取层的通道数由输入至输出方向依次为128、256、256、256、256、256、256和256。具体地，第三特征提取层的结构为{C(3,128,256),C(3,256,256),C(3,256,256),C(3,256,256),M(2,2)}。The number of channels of the third feature extraction layer is 128, 256, 256, 256, 256, 256, 256 and 256 in order from input to output. Specifically, the structure of the third feature extraction layer is {C(3,128,256), C(3,256,256), C(3,256,256), C(3,256,256), M(2,2)}.

第四特征提取层的通道数由输入至输出方向依次为256、512、512、512、512、512、512和512。具体地，第四特征提取层的结构为{C(3,256,512),C(3,512,512),C(3,512,512),C(3,512,512),M(2,2)}。The number of channels of the fourth feature extraction layer is 256, 512, 512, 512, 512, 512, 512 and 512 in order from input to output. Specifically, the structure of the fourth feature extraction layer is {C(3,256,512), C(3,512,512), C(3,512,512), C(3,512,512), M(2,2)}.

第一特征融合层的通道数由输入至输出方向依次为128和16。具体地，第一特征融合层的结构为{C(3,128,16)}。The number of channels of the first feature fusion layer is 128 and 16 sequentially from input to output. Specifically, the structure of the first feature fusion layer is {C(3, 128, 16)}.

第二特征融合层的通道数由输入至输出方向依次为16和16。具体地，第二特征融合层的结构为{C(3,16,16)}。The number of channels of the second feature fusion layer is 16 and 16 from the input to the output direction. Specifically, the structure of the second feature fusion layer is {C(3,16,16)}.

第三特征融合层的通道数由输入至输出方向依次为16和16。第四特征融合层的通道数由输入至输出方向依次为16、16、16、16、16和16；具体地，第三特征融合层的结构为{C(3,16,16)}，第四特征融合层的结构为{C(3,16,16),C(3,16,16),C(3,16,16)}。The number of channels of the third feature fusion layer is 16 and 16 in order from input to output. The number of channels of the fourth feature fusion layer is 16, 16, 16, 16, 16 and 16 in turn from input to output; specifically, the structure of the third feature fusion layer is {C(3,16,16)}, the first The structure of the four-feature fusion layer is {C(3,16,16), C(3,16,16), C(3,16,16)}.

第五特征融合层的通道数由输入至输出方向依次为256和16。具体地，第五特征融合层的结构为{C(3,256,16)}。The number of channels of the fifth feature fusion layer is 256 and 16 sequentially from input to output. Specifically, the structure of the fifth feature fusion layer is {C(3, 256, 16)}.

其中，在第一特征融合层、第二特征融合层、第三特征融合层、第四特征融合层和第五特征融合层中输入的目标特征图大小和通道数不一致时，采用双线性差值法对目标特征图进行上采样和下采样处理，并使用预设卷积层进行处理，以输出统一通道数的目标特征图。如卷积层为{C(3,x,16)}。其中，x表示接收到的目标特征图的输入通道数。在第一特征融合层、第二特征融合层、第三特征融合层、第四特征融合层和第五特征融合层中输入的目标特征图大小和通道数一致时，直接复制目标特征图进行输入。Among them, when the size of the target feature map and the number of channels input in the first feature fusion layer, the second feature fusion layer, the third feature fusion layer, the fourth feature fusion layer and the fifth feature fusion layer are inconsistent, the bilinear difference is adopted. The value method upsamples and downsamples the target feature map, and uses a preset convolutional layer for processing to output the target feature map with a uniform number of channels. For example, the convolutional layer is {C(3,x,16)}. where x represents the number of input channels of the received target feature map. When the size of the target feature map and the number of channels input in the first feature fusion layer, the second feature fusion layer, the third feature fusion layer, the fourth feature fusion layer and the fifth feature fusion layer are the same, the target feature map is directly copied for enter.

下面介绍下人群密度估计网络的训练方法，首先，构建如上述任一实施例中的人群密度估计网络。然后进行训练样本的收集。其中，训练样本需要由不同位置的采集设备采集的不同区域的人群图像，以及人群图像对应的真实人群密度图像。这样在训练时可以获取到更多尺度的隐藏特征，提升人群密度估计网络的估计准确性。然后利用训练样本对人群密度估计网络进行训练，其中，将损失函数定义如下：The training method of the crowd density estimation network is described below. First, the crowd density estimation network as in any of the above embodiments is constructed. Then the collection of training samples is carried out. Among them, the training samples need crowd images in different regions collected by collection devices at different locations, and real crowd density images corresponding to the crowd images. In this way, hidden features of more scales can be obtained during training, and the estimation accuracy of the crowd density estimation network can be improved. Then use the training samples to train the crowd density estimation network, where the loss function is defined as follows:

其中，

z和

分别为用于训练的真实人群密度图像和人群密度估计网络预测的第一人群密度图像向量化后的向量，其中，W(·)为最优传输代价函数，可采用Sinkhorn算法求解最优传输代价的解和梯度，λ ₁和λ ₂为损失函数子项权重。 in,

z and

are the vectorized vectors of the real crowd density image used for training and the first crowd density image predicted by the crowd density estimation network, where W( ) is the optimal transmission cost function, and the Sinkhorn algorithm can be used to solve the optimal transmission cost The solution and gradient of λ ₁ and λ ₂ are the sub-term weights of the loss function.

其中，L _c用于表示真实人群密度图像中人群数量与第一人群密度图像中人群数量之间的损失值，L _ot用于表示最优传输损失，L _tv用于表示真实人群密度图像中的像素点与对应的第一人群密度图像中的像素点之间的损失值。 Among them, L _c is used to represent the loss value between the number of people in the real crowd density image and the number of people in the first crowd density image, L _ot is used to represent the optimal transmission loss, and L _tv is used to represent the real crowd density image. The loss value between the pixel point and the corresponding pixel point in the first crowd density image.

通过多次的迭代训练，在损失函数L满足预设条件时，则可以结束训练，人群密度估计网络训练完成，则可以将训练完成的人群密度估计网络用于上述任一实施例中。Through multiple iterations of training, when the loss function L satisfies the preset conditions, the training can be ended, and the training of the crowd density estimation network is completed, and the trained crowd density estimation network can be used in any of the above embodiments.

参阅图9，图9是本申请提供的电子设备一实施例的结构示意图。该电子设备90包括处理器91和处理器91连接的存储器92；其中，存储器92用于存储程序数据，处理器91用于执行程序数据，以实现如下方法：Referring to FIG. 9 , FIG. 9 is a schematic structural diagram of an embodiment of an electronic device provided by the present application. The electronic device 90 includes a processor 91 and a memory 92 connected to the processor 91; wherein, the memory 92 is used to store program data, and the processor 91 is used to execute the program data to realize the following method:

获取多个人群图像；其中，多个人群图像分别由多个图像采集设备采集得到；将多个人群图像输入至人群密度估计网络，以得到与每一人群图像对应的第一人群密度图像；其中，人群密度估计网络包括若干特征提取层和若干特征融合层以及人群密度估计层；若干特征提取层具有不同的网络深度；根据多个图像采集设备的位置和图像采集角度，将多个第一人群密度图像进行组合形成第二人群密度图像，以利用第二人群密度图像进行目标区域的人流量估计。Acquiring a plurality of crowd images; wherein, the plurality of crowd images are acquired by a plurality of image acquisition devices respectively; inputting the plurality of crowd images into a crowd density estimation network to obtain a first crowd density image corresponding to each crowd image; wherein , the crowd density estimation network includes several feature extraction layers, several feature fusion layers and crowd density estimation layers; several feature extraction layers have different network depths; The density images are combined to form a second crowd density image, so as to use the second crowd density image to estimate the flow of people in the target area.

可以理解的，处理器91还用于执行程序数据，以实现上述任一实施例提供的方法，其具体的实施步骤可以参考上述任一实施例，这里不再赘述。It can be understood that the processor 91 is further configured to execute program data to implement the method provided in any of the foregoing embodiments, and the specific implementation steps may refer to any of the foregoing embodiments, which will not be repeated here.

参阅图10，图10是本申请提供的计算机可读存储介质一实施例的结构示意图，该计算机可读存储介质100用于存储程序数据101，程序数据101在被处理器执行时，用于实现如下方法：Referring to FIG. 10, FIG. 10 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided by the present application. The computer-readable storage medium 100 is used to store program data 101, and when the program data 101 is executed by a processor, it is used to realize The following method:

获取多个人群图像；其中，多个人群图像分别由多个图像采集设备采集得到；将多个人群图像输入至人群密度估计网络，以得到与每一人群图像对应的第一人群密度图像；其中，人群密度估计网络包括若干特征提取层和若干特征融合层以及人群密度估计层，若干特征提取层具有不同的网络深度；根据多个图像采集设备的位置和图像采集角度，将多个第一人群密度图像进行组合形成第二人群密度图像，以利用第二人群密度图像进行目标区域的人流量估计。Acquiring a plurality of crowd images; wherein, the plurality of crowd images are acquired by a plurality of image acquisition devices respectively; inputting the plurality of crowd images into a crowd density estimation network to obtain a first crowd density image corresponding to each crowd image; wherein , the crowd density estimation network includes several feature extraction layers, several feature fusion layers and crowd density estimation layers, and several feature extraction layers have different network depths; The density images are combined to form a second crowd density image, so as to use the second crowd density image to estimate the flow of people in the target area.

可以理解的，本实施例中的计算机可读存储介质100应用于电子设备，其具体的实施步骤可以参考上述实施例，这里不再赘述。It can be understood that the computer-readable storage medium 100 in this embodiment is applied to an electronic device, and the specific implementation steps thereof may refer to the foregoing embodiments, which will not be repeated here.

在本申请所提供的几个实施方式中，应该理解到，所揭露的方法以及设备，可以通过其它的方式实现。例如，以上所描述的设备实施方式仅仅是示意性的，例如，所述模块或单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，例如多个单元或组件可以结合或者可以集成到另一个系统，或一些特征可以忽略，或不执行。In the several embodiments provided in this application, it should be understood that the disclosed method and device may be implemented in other manners. For example, the device implementations described above are only illustrative. For example, the division of the modules or units is only a logical function division. In actual implementation, there may be other divisions. For example, multiple units or components may be Incorporation may either be integrated into another system, or some features may be omitted, or not implemented.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施方式方案的目的。The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this implementation manner.

另外，在本申请各个实施方式中的各功能单元可以集成在一个处理单元中，也可以是各个单元单独物理存在，也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现，也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

上述其他实施方式中的集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)或处理器(processor)执行本申请各个实施方式所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(ROM，Read-Only Memory)、随机存取存储器(RAM，Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated units in the other embodiments described above are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, and the computer software products are stored in a storage medium , including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes: U disk, mobile hard disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program codes .

以上所述仅为本申请的实施方式，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。The above description is only an embodiment of the present application, and is not intended to limit the scope of the patent of the present application. Any equivalent structure or equivalent process transformation made by using the contents of the description and drawings of the present application, or directly or indirectly applied to other related technologies Fields are similarly included within the scope of patent protection of this application.

Claims

A method for estimating crowd density, characterized in that the method comprises:

acquiring multiple crowd images; wherein, the multiple crowd images are acquired by multiple image acquisition devices respectively;

Inputting the plurality of crowd images to a crowd density estimation network to obtain a first crowd density image corresponding to each of the crowd images; wherein the crowd density estimation network includes several feature extraction layers and several feature fusion layers and Crowd density estimation layer, the several feature extraction layers have different network depths;

According to the positions of the plurality of image acquisition devices and the image acquisition angles, a plurality of the first crowd density images are combined to form a second crowd density image, so as to use the second crowd density image to estimate the flow of people in the target area .

The method of claim 1, wherein:

The combining the multiple first crowd density images to form the second crowd density image according to the positions and image capturing angles of the multiple image capturing devices includes:

Determine the perspective transformation relationship of each of the acquisition devices according to the position of each of the acquisition devices and the image acquisition angle;

Using the perspective transformation relationship to perform plane projection on each of the first crowd density images to obtain a corresponding crowd density plane image;

normalizing a plurality of the crowd density plane images;

The normalized crowd density plane images are combined to form the second crowd density image.

The method of claim 2, wherein:

The determining the perspective transformation relationship of each of the acquisition devices according to the position of each of the acquisition devices and the image acquisition angle includes:

determining at least four spatial coordinates in a collection area corresponding to the position of each of the collection devices; and determining pixel coordinates corresponding to the at least four spatial coordinates in the crowd image corresponding to the collection device;

The perspective transformation relationship of each acquisition device is determined by using the at least four spatial coordinates and the pixel point coordinates corresponding to the at least four spatial coordinates.

The method of claim 2, wherein:

The normalizing the plurality of the crowd density plane images includes:

Determine the normalized weight matrix;

Each of the crowd density plane images is dot-multiplied by the normalization weight matrix to normalize each of the crowd density plane images.

The method of claim 4, wherein:

The determining normalized weight matrix includes:

The normalized weight matrix is determined using the following formula:

Wherein, (x ₀ , y ₀ ) represents the pixel coordinates on the crowd image, (x, y) represents the pixel coordinates on the crowd density plane image corresponding to the pixel coordinates on the crowd image,

is the first crowd density image with the Gaussian blur kernel center falling on the crowd image pixel point (x ₀ , y ₀ );

Represents the crowd density plane image, i, j and m, n are the pixel coordinates on the crowd image and the pixel coordinates on the crowd density plane image, respectively, w _xy is the Gaussian blur kernel center falls on The weight of the first crowd density image of the crowd image pixel point (x ₀ , y ₀ ) at the crowd density plane image (x, y), wherein,

The method of claim 2, wherein:

The forming the second crowd density image by combining each of the normalized crowd density plane images includes:

determining a weighted average weight for each of said crowd density plane images;

Obtain the first pixel value of the pixel point corresponding to the same plane position in each of the crowd density plane images to obtain a set of pixel values;

Using the weighted average weight to perform a weighted average of the first pixel values in the pixel value set to obtain a second pixel value;

The second pixel value is used as the pixel value corresponding to the pixel point in the second crowd density image to form the second crowd density image.

The method of claim 1, wherein:

The several feature extraction layers include a first feature extraction layer, a second feature extraction layer, a third feature extraction layer and a fourth feature extraction layer; wherein the first feature extraction layer, the second feature extraction layer, the The network depths of the third feature extraction layer and the fourth feature extraction layer are sequentially increased;

The several feature fusion layers include a first feature fusion layer, a second feature fusion layer, a third feature fusion layer, a fourth feature fusion layer, and a fifth feature fusion layer; The network depths of the second feature fusion layer, the third feature fusion layer, and the fourth feature fusion layer are the same, and the network depth of the fifth feature fusion layer is greater than the network depth of the first feature fusion layer.

The method of claim 7, wherein:

The inputting the plurality of crowd images into a crowd density estimation network to obtain a first crowd density image corresponding to each of the crowd images includes:

inputting each of the crowd images to the first feature extraction layer to output a first feature map;

inputting the first feature map to the second feature extraction layer to output a second feature map;

inputting the second feature map to the third feature extraction layer to output a third feature map, and inputting the second feature map to the first feature fusion layer to output a first feature fusion map;

inputting the third feature map to the fourth feature extraction layer to output a fourth feature map, and inputting the third feature map and the first feature fusion map to the fifth feature fusion layer, to output a second feature fusion map, and input the third feature map to the second feature fusion layer to output a third feature fusion map;

Inputting the fourth feature map, the second feature fusion map and the third feature fusion map to the third feature fusion layer to output a fourth feature fusion map;

The fourth feature fusion map is input to the fourth feature fusion layer to output the fifth feature fusion map;

The fifth feature fusion map is input to the crowd density estimation layer to output the first crowd density image corresponding to each of the images.

The method of claim 8, wherein:

The number of channels of the first feature extraction layer is 3, 64, 64 and 64 sequentially from the input to the output direction;

The number of channels of the second feature extraction layer is 64, 128, 128 and 128 in order from the input to the output direction;

The number of channels of the third feature extraction layer is 128, 256, 256, 256, 256, 256, 256 and 256 in order from the input to the output direction;

The number of channels of the fourth feature extraction layer is 256, 512, 512, 512, 512, 512, 512 and 512 sequentially from input to output direction; wherein, the first feature extraction layer and the second feature extraction layer , the step size of the pooling layer in the third feature extraction layer and the fourth feature extraction layer is 2 and the receptive field is 2;

The number of channels of the first feature fusion layer is 128 and 16 sequentially from the input to the output direction;

The number of channels of the second feature fusion layer is 16 and 16 sequentially from the input to the output direction;

The number of channels of the third feature fusion layer is 16 and 16 sequentially from the input to the output direction;

The number of channels of the fourth feature fusion layer is 16, 16, 16, 16, 16 and 16 sequentially from the input to the output direction;

The number of channels of the fifth feature fusion layer is 256 and 16 sequentially from the input to the output direction.

The method of claim 8, wherein:

The method also includes:

When the target feature map input in the first feature fusion layer, the second feature fusion layer, the third feature fusion layer, the fourth feature fusion layer and the fifth feature fusion layer does not meet the conditions , using the bilinear difference method to perform up-sampling and down-sampling processing on the target feature map, and use a preset convolution layer for processing to output the target feature map with a uniform number of channels.

An electronic device, characterized in that the electronic device comprises a processor and a memory connected to the processor;

Wherein, the memory is used for storing program data, and the processor is used for executing the program data, so as to implement the method according to any one of claims 1-10.

A computer-readable storage medium, characterized in that, the computer-readable storage medium is used to store program data, and when the program data is executed by a processor, is used to implement any one of claims 1-10 Methods.