CN102034096B - Video event recognition method based on top-down motion attention mechanism - Google Patents
Video event recognition method based on top-down motion attention mechanism
- Publication number
- CN102034096B, CN102034096A, CN201010591513, CN201010591513A
- Authority
- CN
- China
- Prior art keywords
- video
- motion
- sigma
- video set
- histogram
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
- 238000000034 method Methods 0.000 title claims abstract description 27
- 230000007246 mechanism Effects 0.000 title claims abstract description 14
- 238000012549 training Methods 0.000 claims abstract description 31
- 238000012360 testing method Methods 0.000 claims abstract description 18
- 230000003287 optical effect Effects 0.000 claims abstract description 14
- 238000012706 support-vector machine Methods 0.000 claims abstract description 9
- 239000011159 matrix material Substances 0.000 claims abstract description 7
- 230000000007 visual effect Effects 0.000 claims description 18
- 230000009466 transformation Effects 0.000 claims description 7
- 238000004364 calculation method Methods 0.000 claims 1
- 238000005516 engineering process Methods 0.000 description 11
- 238000001514 detection method Methods 0.000 description 10
- 230000006870 function Effects 0.000 description 9
- 238000000605 extraction Methods 0.000 description 8
- 230000004927 fusion Effects 0.000 description 8
- 241000282414 Homo sapiens Species 0.000 description 6
- 238000012545 processing Methods 0.000 description 4
- 238000007500 overflow downdraw method Methods 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 239000013598 vector Substances 0.000 description 2
- 230000009471 action Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000004883 computer application Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000002790 cross-validation Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000003064 k means clustering Methods 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 238000010295 mobile communication Methods 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
Landscapes
- Image Analysis (AREA)
Abstract
The present invention is a video event recognition method based on a top-down motion attention mechanism, comprising: Step S1: using a difference-of-Gaussians detector, detect on a computer the interest points of every frame of every video in a video set, the video set comprising a training video set and a test video set; Step S2: extract scale-invariant feature transform (SIFT) descriptors and optical flow features at the detected interest points of each frame; Step S3: build an appearance vocabulary and a motion vocabulary; Step S4: on the training video set, learn the probability of each motion word with respect to each event class and from it build an attention histogram based on motion information; Step S5: compute the similarity between videos in the video set with the Earth Mover's Distance and generate a kernel function matrix; Step S6: train a support vector machine classifier with the obtained kernel matrix, obtain the classifier parameters, classify the test video set, and output the classification results.
Description
Technical Field
The invention relates to the field of computer application technology, and in particular to a video event recognition method.
Background Art
In recent years, with the rapid development of the Internet and the spread of technologies such as video compression, DVD, WebTV, and third-generation mobile communication (3G), and especially with the build-out of broadband networks, people have ever more opportunities to access video information interactively. Video portals have emerged in response, such as Youku and Tudou in China and YouTube abroad. Video producers around the world, such as television stations, film studios, and advertisers, as well as the digital capture devices (digital cameras, camcorders, and the like) that have entered ordinary households, continuously generate new video material, and digital video media has begun to fill people's living space.
How to enable people to quickly locate, conveniently obtain, and effectively manage the useful information contained in video is an urgent problem; its essence is how to use computer technology to effectively manage and express video content. Video content understanding has become an international research hotspot, and many researchers have begun to apply video data processing techniques to extract the implicit, useful, and comprehensible semantic information in videos so as to realize video content understanding. Video data has its own characteristics, namely a large data volume and poor structure, so the problems brought by the explosion of video information are severe: in many fields, collected video sits idle because the sheer volume cannot be processed effectively.
Event recognition has always been one of the main tasks of TRECVID. With the continuing growth of multimedia information on the Internet, content-based multimedia retrieval has attracted increasing attention. At present, the biggest problem facing content-based retrieval is the "semantic gap" between low-level features and high-level semantics. Video event detection and recognition combines computer vision with content-based multimedia retrieval, links contextual information with relevant domain knowledge, fuses multiple cues for reasoning, and uses events as the basis for connecting low-level features with high-level semantics. By building an event-based semantic description of video, we can carry out higher-level semantic analysis of multimedia video and build efficient indexing and retrieval mechanisms. Earlier video analysis was limited to videos from fixed cameras or to strictly controlled datasets such as Weizmann, KTH, and IXMAS. Unlike such material, the videos in event detection come from real footage such as news broadcasts, sports games, and movies, which confronts event detection with many challenges: disordered motion, complex backgrounds, target occlusion, illumination changes, geometric deformation of targets, and so on.
A video event is usually described by two aspects: what it is (what) and how it happens (how). "What" usually refers to the appearance features of video frames, such as people, objects, and buildings; "how" usually refers to the dynamic features of the video, i.e., the motion features. Motion information is unique to video data; it represents how video content evolves over time and plays an important role in describing and understanding video content. How to fuse these two aspects effectively is also a challenging problem. At present there is still a lack of effective methods for describing events, mainly because existing methods consider only one aspect of an event, either what or how; in particular, some methods use only the distribution of motion, which is not robust on real videos. There is very little work on fusing the two aspects, and traditional fusion methods such as early fusion and late fusion are essentially bottom-up: they blindly combine the two aspects of an event and are not task-driven.
Summary of the Invention
(1) Technical Problem to Be Solved
In order to solve the technical problem in the prior art that background information interferes with the classification process, so that the extracted features are poorly targeted and the recognition accuracy is low, the purpose of the present invention is to provide a video event recognition method based on a top-down motion attention mechanism that fuses the static and dynamic features of video.
(2) Technical Solution
To achieve the above object, the present invention provides a video event recognition method based on a top-down motion attention mechanism, whose technical solution comprises the following steps:
Step S1: using a difference-of-Gaussians detector, detect on a computer the interest points of every frame of every video in a video set, the video set comprising a training video set and a test video set;
Step S2: extract appearance features and motion features at the detected interest points of each frame, the appearance features being scale-invariant feature transform (SIFT) descriptors and the motion features being optical flow features;
Step S3: cluster the obtained SIFT descriptors and optical flow features, and build an appearance vocabulary and a motion vocabulary respectively;
Step S4: on the training video set, learn the probability of each motion word with respect to each event class and build an attention histogram based on motion information;
Step S5: using the motion attention histogram features of the video set, compute with the Earth Mover's Distance the similarity between training videos and the similarity between training videos and test videos, and generate a kernel function matrix;
Step S6: train a support vector machine classifier with the obtained kernel matrix, obtain the classifier parameters, classify the test video set with the trained support vector machine model, and output the classification results of the test video set.
Wherein the interest points of each frame are extracted with one of: Harris corners, Harris-Laplace interest points, Hessian-Laplace interest points, Harris-affine interest points, Hessian-affine interest points, maximally stable extremal region (MSER) interest points, speeded-up robust feature (SURF) interest points, grid points, or the difference-of-Gaussians detector.
Wherein the step of building the attention histogram based on motion information comprises:
Step S41: represent each frame I_i of a video in the video set as

n_c(w_v \mid I_i) = \sum_{d_j \in I_i} P(C = c \mid w_m^{(j)}) \, \delta(w_v^{(j)} = w_v)

where n(\cdot) is the histogram representation of the i-th frame I_i, w_v is an appearance-feature word, w_m is a motion-feature word, C is the event class label with c \in \{1, 2, ...\}, P(C = c \mid w_m) is the probability that motion word w_m belongs to class c, \delta is the indicator function, and w_m^{(j)} and w_v^{(j)} are respectively the motion-word and appearance-word indices of interest point d_j;
Step S42: build two types of attention histogram, for motion magnitude and motion direction:
The motion-magnitude attention histogram over visual words (MMA-BoW) is expressed as

n_c^{mag}(w_v \mid I_i) = \sum_{d_j \in I_i} P(C = c \mid w_{mag}^{(j)}) \, \delta(w_v^{(j)} = w_v)

where w_{mag}^{(j)} is the motion-magnitude word index of interest point d_j;
The motion-direction attention histogram over visual words (OMA-BoW) is expressed as

n_c^{ori}(w_v \mid I_i) = \sum_{d_j \in I_i} P(C = c \mid w_{ori}^{(j)}) \, \delta(w_v^{(j)} = w_v)

where w_{ori}^{(j)} is the motion-direction word index of interest point d_j;
Step S43: considering both the magnitude and the direction information of the optical flow, build the motion attention histogram based on the bag of visual words (MOMA-BoW), which weights each appearance word by both motion cues:

n_c^{mo}(w_v \mid I_i) = \sum_{d_j \in I_i} P(C = c \mid w_{mag}^{(j)}) \, P(C = c \mid w_{ori}^{(j)}) \, \delta(w_v^{(j)} = w_v).
Wherein, for each class c \in C of the training video set, the probability P(C = c \mid w_m) of each motion word w_m with respect to each class is obtained by Bayes' rule:

P(C = c \mid w_m) = \frac{\lVert \{ d_j \in T_{c+} : w_m^{(j)} = w_m \} \rVert}{\lVert \{ d_j \in T_c : w_m^{(j)} = w_m \} \rVert}

where T_{c+} is the set of all training videos belonging to class c, T_c is the set of all training samples, and \lVert \cdot \rVert denotes the number of interest points.
Wherein the Earth Mover's Distance is used to measure the distance between two video sequences of the video set. Any two videos P and Q are represented as P = \{(p_1, w_{p_1}), ..., (p_m, w_{p_m})\} and Q = \{(q_1, w_{q_1}), ..., (q_n, w_{q_n})\}, where p_i and q_j denote the histogram features of videos P and Q, w_{p_i} and w_{q_j} denote the weights of the i-th frame of video P and the j-th frame of video Q, and m and n denote the numbers of frames of videos P and Q. The similarity D(P, Q) between videos P and Q is computed as

D(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}

where d_{ij} is the Euclidean distance between p_i and q_j, and f_{ij} is the optimal matching between videos P and Q, obtained by solving a linear programming problem.
(3) Beneficial Effects
It can be seen from the above technical solution that the present invention has the following advantages:
1. In the video recognition method provided by the present invention, there are many ways to select interest points, and the choice of local features at the interest points is also flexible, so that if faster and more robust interest point detection or local feature extraction methods appear in the future, they can easily be added to the system to further improve its performance.
2. The number of interest points extracted directly from a video is often very large and contains complex background information, which seriously interferes with subsequent processing and lowers classification accuracy. The video recognition method provided by the present invention uses the human attention mechanism to select interest points, highlighting those that contribute most to event recognition, which greatly reduces the interference of background information in the classification process, makes the extracted features more targeted, and significantly improves recognition accuracy.
3. Traditional feature fusion methods such as early fusion and late fusion are bottom-up, whereas we use the human attention mechanism to fuse the static and dynamic features of a video in a top-down manner, which significantly improves fusion efficiency.
According to the human attention mechanism, the present invention fuses the appearance and motion features of a video in a top-down manner. This fusion method requires no parameter setting and combines well the advantages of early fusion and late fusion, significantly improving recognition efficiency. The invention overcomes the shortcomings of traditional event recognition methods that require background subtraction, target tracking, and detection, and has good application prospects.
Brief Description of the Drawings
Fig. 1 is a flowchart of the video event recognition method based on a top-down motion attention mechanism of the present invention;
Fig. 1a to Fig. 1d show examples of interest point detection and optical flow on video frames of the present invention;
Fig. 2 is a block diagram of the system structure of the present invention.
Detailed Description of the Embodiments
To make the object, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
The execution environment of the present invention is a Pentium 4 computer with a 3.0 GHz central processing unit and 2 GB of memory, on which an efficient video event recognition algorithm was implemented in Matlab and C; other execution environments may also be used and are not described further here.
The overall framework of the system of the present invention is shown in Fig. 2. A computer carries out the video event recognition task based on the top-down motion attention mechanism, and the system contains five main modules:
Interest point detection module 1: the main function of this module is to divide the video database into a training set (training videos) and a test set (test videos), and to detect the interest points of each frame of the training and test videos with the difference-of-Gaussians detector.
Feature extraction module 2: its input is connected to the output of interest point detection module 1; its main function is to extract, on the basis of module 1, the SIFT descriptor and the optical flow feature of each interest point.
Vocabulary construction module 3: its input is connected to the output of feature extraction module 2; it clusters the obtained SIFT descriptors and optical flow features on the training data and builds the appearance vocabulary and the motion vocabulary respectively.
Motion-attention histogram construction module 4: its input is connected to the outputs of feature extraction module 2 and vocabulary construction module 3; from the training data it computes, for each motion word in the motion vocabulary, the class-specific probability for each event, and from these probabilities and the appearance words in the appearance vocabulary it builds the attention histogram based on motion information.
Classification module 5: its input is connected to the output of motion-attention histogram construction module 4; it receives the motion-attention histogram features of the videos, computes the similarity between any two videos with the Earth Mover's Distance, generates the kernel function matrix, trains the support vector machine classifier on the training set to obtain the classifier parameters, classifies the test set with the trained model, and outputs the classification results of the test video set, where "Existing Car, Handshaking, Running, Demonstration Or Protest, Walking, Riot, Dancing, Shooting, People Marching" are our event recognition tasks.
Fig. 1 shows the flowchart of the video event recognition method based on the top-down motion attention mechanism; the details involved in the technical solution of the invention are explained below.
(1) Interest point detection
There are many choices for extracting interest points, such as Harris corners, Harris-Laplace interest points, Hessian-Laplace interest points, Harris-affine interest points, Hessian-affine interest points, maximally stable extremal regions (MSER), speeded-up robust features (SURF), and grid points.
Denote the video V as V = {I_i}, i ∈ {1, 2, ..., N}. For each frame I_i of the video, local extrema are detected in the difference-of-Gaussians (DoG) scale space and taken as interest points.
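Purely as an illustration and not part of the original disclosure, the per-frame DoG detection step could be sketched as follows in Python; OpenCV's SIFT detector, which locates local extrema in the DoG scale space, is an assumed library choice, and the function names and parameters are assumptions of this sketch.

```python
# Illustrative sketch (assumed OpenCV API): detect DoG interest points per frame.
import cv2

def detect_interest_points(frame_bgr):
    """Return keypoints found as local extrema in the DoG scale space."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    detector = cv2.SIFT_create()           # SIFT detection = DoG extrema search
    keypoints = detector.detect(gray, None)
    return keypoints
```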
(2) Feature extraction
Next, local image features are extracted at the interest points. Candidate local feature extraction methods include the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and shape context (SC) descriptors.
We use the 128-dimensional SIFT descriptor to represent the appearance of an interest point, and, from the detected interest points, compute the optical flow of the sparse feature set with the iterative pyramidal Lucas-Kanade method. Fig. 1a to Fig. 1d give examples of detected interest points and optical flow vectors on some video frames.
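As a hedged sketch of this step (the text names SIFT and the iterative pyramidal Lucas-Kanade method but no library), appearance descriptors and sparse optical flow at the interest points might be computed as follows; the OpenCV calls and argument choices are assumptions.

```python
# Illustrative sketch (assumed OpenCV API): 128-D SIFT descriptors ("what") and
# sparse pyramidal Lucas-Kanade optical flow ("how") at the detected keypoints.
import cv2
import numpy as np

def extract_features(gray, next_gray, keypoints):
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.compute(gray, keypoints)       # (N, 128)
    pts = np.float32([kp.pt for kp in keypoints]).reshape(-1, 1, 2)
    # Track each interest point into the next frame; the displacement is its flow.
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(gray, next_gray, pts, None)
    flow = (next_pts - pts).reshape(-1, 2)                        # (N, 2) = (dx, dy)
    return descriptors, flow, status.ravel()
```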
The detected interest points are clustered separately according to their appearance and motion features, with k-means or another clustering method, into two vocabularies: w_m (motion words) and w_v (appearance words); each cluster centre is defined as a word.
In polar coordinates the optical flow can be expressed by a magnitude Mag and a direction Orient; in a two-dimensional motion field, every motion vector carries these two motion cues. The magnitude reflects the spatial amplitude of the motion, and the direction reflects its trend. We therefore have two types of motion word: motion-magnitude words w_{mag} and motion-direction words w_{ori}.
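A minimal sketch of the vocabulary construction, assuming scikit-learn's KMeans and illustrative vocabulary sizes (the text fixes neither), is given below; the helper names are hypothetical.

```python
# Illustrative sketch: cluster training features into appearance, magnitude, and
# direction vocabularies; each cluster centre is one visual/motion word.
import numpy as np
from sklearn.cluster import KMeans

def build_vocab(features, k):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

def quantize(vocab, features):
    """Word index of each feature = index of its nearest cluster centre."""
    return vocab.predict(features)

def build_all_vocabs(sift_train, flow_train, kv=500, km=100):
    # sift_train: (N, 128) SIFT descriptors; flow_train: (N, 2) flow vectors.
    mag = np.linalg.norm(flow_train, axis=1, keepdims=True)              # magnitude
    ori = np.arctan2(flow_train[:, 1], flow_train[:, 0]).reshape(-1, 1)  # direction
    return (build_vocab(sift_train, kv),   # appearance words w_v
            build_vocab(mag, km),          # motion-magnitude words w_mag
            build_vocab(ori, km))          # motion-direction words w_ori
```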
(3) Construction of the attention histogram based on motion information
As Figs. 1a to 1d show, the number of interest points extracted from a video frame is often very large and includes complex background information as well as information irrelevant to our event recognition task, and this information seriously interferes with subsequent processing. The present invention uses the human attention mechanism to select and weight interest points. Biological and psychological research has shown that humans always actively attend to specific regions that produce novel or expected stimuli, called the focus of attention or salient regions. Visual saliency has two modes, bottom-up and top-down; the former is data-driven and the latter is driven by knowledge or the task. We use the top-down attention mechanism to highlight the interest points that contribute most to event recognition and to ignore, as far as possible, those that are irrelevant to the recognition task.
Each frame I_i of a video can be represented as

n_c(w_v \mid I_i) = \sum_{d_j \in I_i} P(C = c \mid w_m^{(j)}) \, \delta(w_v^{(j)} = w_v)

where C is the event class label with c \in \{1, 2, ...\}, \delta is the indicator function, and w_m^{(j)} and w_v^{(j)} are respectively the motion-word and appearance-word indices of interest point d_j.
From this formula we can see that the scale-invariant feature transform (SIFT) feature acts as a descriptor, describing the "what" aspect of the event, while the motion feature serves two purposes: on the one hand it describes the "how" aspect of the event, and on the other it acts as an attention cue that guides the recognition of the corresponding event class.
Two types of attention histogram can be built, for motion magnitude and motion direction:
The motion-magnitude attention histogram over visual words (MMA-BoW) is expressed as

n_c^{mag}(w_v \mid I_i) = \sum_{d_j \in I_i} P(C = c \mid w_{mag}^{(j)}) \, \delta(w_v^{(j)} = w_v)

where w_{mag}^{(j)} is the motion-magnitude word index of interest point d_j;
The motion-direction attention histogram over visual words (OMA-BoW) is expressed as

n_c^{ori}(w_v \mid I_i) = \sum_{d_j \in I_i} P(C = c \mid w_{ori}^{(j)}) \, \delta(w_v^{(j)} = w_v)

where w_{ori}^{(j)} is the motion-direction word index of interest point d_j;
If the magnitude and the direction information of the optical flow are considered together, the class-specific motion attention histogram based on the bag of visual words (MOMA-BoW) weights each appearance word by both motion cues:

n_c^{mo}(w_v \mid I_i) = \sum_{d_j \in I_i} P(C = c \mid w_{mag}^{(j)}) \, P(C = c \mid w_{ori}^{(j)}) \, \delta(w_v^{(j)} = w_v).
For each video event class c \in C, the probability of each motion word with respect to each class can be obtained by Bayes' rule:

P(C = c \mid w_m) = \frac{\lVert \{ d_j \in T_{c+} : w_m^{(j)} = w_m \} \rVert}{\lVert \{ d_j \in T : w_m^{(j)} = w_m \} \rVert}

where T_{c+} is the set of all videos belonging to class c, T is the set of all training samples, and \lVert \cdot \rVert denotes the number of interest points.
From the formula of the motion-based attention histogram it can be seen that the motion information is embedded in the video representation and can also be regarded as a weight on the appearance (SIFT) features. In particular, for a given motion word the probabilities for different event classes differ, i.e., the same motion word contributes differently to the recognition of different classes. For example, when classifying the event "Running", among all detected interest points, the motion words that actually describe the "Run" action should be given larger weights. On the other hand, for events such as "Riot" where motion information is not relevant, the probability of every motion word for that class is essentially the same, and the bag-of-words model degenerates to its basic form.
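The following sketch, written against the reconstructed formulas above and with hypothetical helper names, shows how the class-conditional motion-word probabilities and the resulting attention histogram could be computed; it is an illustration, not the patented implementation.

```python
# Illustrative sketch of the top-down motion attention histogram.
import numpy as np

def motion_word_class_probs(word_idx_per_video, labels, n_words, n_classes):
    """P(C=c | w_m): share of interest points quantized to w_m that occur in
    class-c training videos (counting form of Bayes' rule)."""
    counts = np.zeros((n_classes, n_words))
    for words, c in zip(word_idx_per_video, labels):
        counts[c] += np.bincount(words, minlength=n_words)
    return counts / np.maximum(counts.sum(axis=0, keepdims=True), 1e-12)

def attention_histogram(app_words, mot_words, p_c_given_wm, c, n_app_words):
    """Class-c histogram of one frame: every interest point votes for its
    appearance word, weighted by P(C=c | its motion word)."""
    hist = np.zeros(n_app_words)
    for wv, wm in zip(app_words, mot_words):
        hist[wv] += p_c_given_wm[c, wm]
    return hist
```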
(4) Event recognition
Given a video V, once the motion attention histogram feature p_i of the i-th frame over the bag of visual words has been obtained, the video can be represented as P = \{(p_1, w_{p_1}), ..., (p_m, w_{p_m})\}, where w_{p_i} denotes the weight of the i-th frame and satisfies \sum_{i=1}^{m} w_{p_i} = 1; the default value 1/m is used here. The Earth Mover's Distance (EMD) is adopted to measure the distance between two video sequences. Any two videos P and Q can be represented as P = \{(p_1, w_{p_1}), ..., (p_m, w_{p_m})\} and Q = \{(q_1, w_{q_1}), ..., (q_n, w_{q_n})\}, where p_i and q_j denote the histogram features of videos P and Q, w_{p_i} and w_{q_j} denote the frame weights of P and Q, and m and n denote their numbers of frames. The EMD has the properties of temporal drift and scale change: the former means that the starting frames of one video may be matched to the ending frames of another, and the latter means that one frame of a video may be matched to several frames of another.
The similarity between videos P and Q can be computed as

D(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}

where d_{ij} is the Euclidean distance between p_i and q_j, and f_{ij} is the optimal matching between the two videos P and Q, obtained by solving the linear programming problem

\min_{\{f_{ij}\}} \; \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} d_{ij}

s.t. \quad f_{ij} \ge 0, \quad \sum_{j=1}^{n} f_{ij} \le w_{p_i}, \quad \sum_{i=1}^{m} f_{ij} \le w_{q_j}, \quad \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\!\Big( \sum_{i=1}^{m} w_{p_i}, \; \sum_{j=1}^{n} w_{q_j} \Big).
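As an aside, the frame-level EMD between two videos can be evaluated with an off-the-shelf transportation solver; the sketch below assumes OpenCV's cv2.EMD with a Euclidean ground distance and the default uniform frame weights mentioned above.

```python
# Illustrative sketch: EMD between two videos represented as sets of weighted
# per-frame attention histograms.
import cv2
import numpy as np

def video_emd(P, Q):
    """P: (m, d) per-frame histograms of one video, Q: (n, d) of the other."""
    m, n = len(P), len(Q)
    # Signature format expected by cv2.EMD: first column = weight, rest = feature.
    sig_p = np.hstack([np.full((m, 1), 1.0 / m), P]).astype(np.float32)
    sig_q = np.hstack([np.full((n, 1), 1.0 / n), Q]).astype(np.float32)
    dist, _, _ = cv2.EMD(sig_p, sig_q, cv2.DIST_L2)   # solves the flow LP
    return dist
```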
Next, a support vector machine is used as the classifier, with a one-versus-all classification strategy.
Since 9 events need to be recognized, 9 classifiers are trained; in each classifier the samples of one event class are taken as the positive class and the remaining samples as the negative class. The EMD between videos is embedded into the Gaussian kernel function of the support vector machine classifier:

K(P, Q) = \exp\!\Big( -\frac{D(P, Q)}{\lambda M} \Big)

where M is a normalization factor obtained as the average EMD over all training data, and \lambda is a scale factor that can be determined empirically by cross-validation.
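A possible end-to-end classification sketch, reusing video_emd from above and assuming scikit-learn's SVC with a precomputed kernel and a one-versus-rest wrapper, is shown below; the kernel normalization follows the reconstructed formula above and is itself an assumption.

```python
# Illustrative sketch: Gaussian kernel built from EMD, SVM with precomputed kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

def pairwise_emd(videos_a, videos_b):
    return np.array([[video_emd(P, Q) for Q in videos_b] for P in videos_a])

def train_and_classify(train_videos, test_videos, y, lam=1.0):
    D_train = pairwise_emd(train_videos, train_videos)
    M = D_train.mean()                                   # mean training-set EMD
    K_train = np.exp(-D_train / (lam * M))
    K_test = np.exp(-pairwise_emd(test_videos, train_videos) / (lam * M))
    clf = OneVsRestClassifier(SVC(kernel="precomputed")).fit(K_train, y)
    return clf.predict(K_test)                           # labels of test videos
```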
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any transformation or substitution conceivable to a person familiar with the technology within the technical scope disclosed by the present invention shall be covered by the present invention; therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010591513 CN102034096B (en) | 2010-12-08 | 2010-12-08 | Video event recognition method based on top-down motion attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010591513 CN102034096B (en) | 2010-12-08 | 2010-12-08 | Video event recognition method based on top-down motion attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102034096A CN102034096A (en) | 2011-04-27 |
CN102034096B true CN102034096B (en) | 2013-03-06 |
Family
ID=43886959
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201010591513 Expired - Fee Related CN102034096B (en) | 2010-12-08 | 2010-12-08 | Video event recognition method based on top-down motion attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102034096B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102163290B (en) * | 2011-05-16 | 2012-08-01 | 天津大学 | Method for modeling abnormal events in multi-visual angle video monitoring based on temporal-spatial correlation information |
CN102930302B (en) * | 2012-10-18 | 2016-01-13 | 山东大学 | Incremental human behavior identification method based on online sequential extreme learning machine |
CN103077401A (en) * | 2012-12-27 | 2013-05-01 | 深圳市赛为智能股份有限公司 | Method and system for detecting context histogram abnormal behaviors based on light streams |
CN103093236B (en) * | 2013-01-15 | 2015-11-04 | 北京工业大学 | A Mobile Terminal Pornography Filtering Method Based on Image Semantic Analysis |
CN103116896B (en) * | 2013-03-07 | 2015-07-15 | 中国科学院光电技术研究所 | Automatic detection tracking method based on visual saliency model |
CN103226713B (en) * | 2013-05-16 | 2016-04-13 | 中国科学院自动化研究所 | A kind of various visual angles Activity recognition method |
CN103366370B (en) * | 2013-07-03 | 2016-04-20 | 深圳市智美达科技股份有限公司 | Method for tracking target in video monitoring and device |
CN104679779B (en) | 2013-11-29 | 2019-02-01 | 华为技术有限公司 | The method and apparatus of visual classification |
CN103854016B (en) * | 2014-03-27 | 2017-03-01 | 北京大学深圳研究生院 | Jointly there is human body behavior classifying identification method and the system of feature based on directivity |
CN104200235A (en) * | 2014-07-28 | 2014-12-10 | 中国科学院自动化研究所 | Time-space local feature extraction method based on linear dynamic system |
CN104657468B (en) * | 2015-02-12 | 2018-07-31 | 中国科学院自动化研究所 | The rapid classification method of video based on image and text |
CN105512606B (en) * | 2015-11-24 | 2018-12-21 | 北京航空航天大学 | Dynamic scene classification method and device based on AR model power spectrum |
CN105528594B (en) * | 2016-01-31 | 2019-01-22 | 江南大学 | An event recognition method based on video signal |
CN108268597B (en) * | 2017-12-18 | 2020-10-09 | 中国电子科技集团公司第二十八研究所 | A moving target activity probability map construction and behavioral intent recognition method |
CN108764050B (en) * | 2018-04-28 | 2021-02-26 | 中国科学院自动化研究所 | Method, system and equipment for recognizing skeleton behavior based on angle independence |
CN109670174B (en) * | 2018-12-14 | 2022-12-16 | 腾讯科技(深圳)有限公司 | Training method and device of event recognition model |
CN110288592B (en) * | 2019-07-02 | 2021-03-02 | 中南大学 | A method of zinc flotation dosing state evaluation based on probabilistic semantic analysis model |
CN114494931B (en) * | 2021-11-05 | 2024-11-22 | 福建超智集团有限公司 | A method and system for intelligent classification and processing of video image faults |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1945628A (en) * | 2006-10-20 | 2007-04-11 | 北京交通大学 | Video frequency content expressing method based on space-time remarkable unit |
CN101894276A (en) * | 2010-06-01 | 2010-11-24 | 中国科学院计算技术研究所 | Human action recognition training method and recognition method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6892193B2 (en) * | 2001-05-10 | 2005-05-10 | International Business Machines Corporation | Method and apparatus for inducing classifiers for multimedia based on unified representation of features reflecting disparate modalities |
JP2006012012A (en) * | 2004-06-29 | 2006-01-12 | Matsushita Electric Ind Co Ltd | Event extraction apparatus and method and program thereof |
2010-12-08: CN 201010591513 patent/CN102034096B/en not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1945628A (en) * | 2006-10-20 | 2007-04-11 | 北京交通大学 | Video frequency content expressing method based on space-time remarkable unit |
CN101894276A (en) * | 2010-06-01 | 2010-11-24 | 中国科学院计算技术研究所 | Human action recognition training method and recognition method |
Non-Patent Citations (1)
Title |
---|
JP特開2006-12012A 2006.01.12 |
Also Published As
Publication number | Publication date |
---|---|
CN102034096A (en) | 2011-04-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102034096B (en) | Video event recognition method based on top-down motion attention mechanism | |
Liu et al. | Localization guided learning for pedestrian attribute recognition | |
CN109961051B (en) | A Pedestrian Re-Identification Method Based on Clustering and Block Feature Extraction | |
Sun et al. | Discriminative exemplar coding for sign language recognition with Kinect | |
US20080162561A1 (en) | Method and apparatus for semantic super-resolution of audio-visual data | |
CN105100894A (en) | Face automatic labeling method and system | |
Wang et al. | Video event detection using motion relativity and feature selection | |
Liu et al. | Study of human action recognition based on improved spatio-temporal features | |
Wang et al. | A robust visual tracking method via local feature extraction and saliency detection | |
Tian et al. | Robust joint learning network: improved deep representation learning for person re-identification | |
Ji et al. | Study of human action recognition based on improved spatio-temporal features | |
Tseng et al. | Person retrieval in video surveillance using deep learning–based instance segmentation | |
Aakur et al. | Action localization through continual predictive learning | |
Bai et al. | Extreme low-resolution action recognition with confident spatial-temporal attention transfer | |
Chen et al. | Dual-bottleneck feature pyramid network for multiscale object detection | |
CN115527147A (en) | A multi-modal target re-identification method | |
Chen et al. | Saliency aware: Weakly supervised object localization | |
Sun et al. | Learning spatio-temporal co-occurrence correlograms for efficient human action classification | |
Mallick et al. | Video retrieval using salient foreground region of motion vector based extracted keyframes and spatial pyramid matching | |
Specker et al. | A multitask model for person re-identification and attribute recognition using semantic regions | |
de Boer et al. | Automatic analysis of online image data for law enforcement agencies by concept detection and instance search | |
Geng et al. | Re-ranking pedestrian re-identification with multiple Metrics | |
Zhang et al. | Human action recognition based on multifeature fusion | |
Cheng et al. | Latent semantic learning with time-series cross correlation analysis for video scene detection and classification | |
Liu et al. | Action recognition using spatiotemporal features and hybrid generative/discriminative models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20130306 |