
CN115984966B - Character object interaction action detection method based on feature refining and multiple views - Google Patents


Info

Publication number: CN115984966B
Application number: CN202310005121.4A
Authority: CN (China)
Language: Chinese (zh)
Other versions: CN115984966A
Legal status: Active (granted)
Prior art keywords: features, entity, human, action, character
Inventors: 张铭宣, 吴晓, 袁召全
Current assignee: Southwest Jiaotong University
Original assignee: Southwest Jiaotong University
Legal events: application filed by Southwest Jiaotong University; publication of CN115984966A; application granted; publication of CN115984966B

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a human-object interaction action detection method based on feature refinement and multi-view, comprising the following steps. S1: extract entity temporal features from the video frames to be detected. S2: refine the entity temporal features to obtain refined entity temporal features. S3: based on the refined entity temporal features, construct a human-centered interaction action modeling model. S4: use the human-centered model to construct and fuse multiple groups of multi-view human-object interaction features, obtaining human-object interaction classification features. S5: classify the human-object interactions with an action classifier. The method improves the accuracy of the temporal feature representation, enhances the robustness of the human-object interaction representation, and solves the problem of detecting human-object interactions in a video stream.

Description

A human-object interaction action detection method based on feature refinement and multi-view

Technical field

The invention relates to the technical field of computer vision, and in particular to a human-object interaction action detection method based on feature refinement and multi-view.

Background

The interaction detection task aims to detect and localize interactions between people and objects. Existing methods for interaction recognition can be roughly divided into two categories: single-frame-based interaction detection methods and temporal interaction detection methods.

For single-frame-based interaction detection methods, some existing approaches represent interactions by capturing the visual or spatial information of people and objects. The graph message-passing neural network approach combines graphical models with neural networks to integrate visual feature information from people and objects; it iteratively learns the graph structure and infers message-passing weights to output a final parse graph containing the human-object graph structure and the interaction labels. The visual-spatial graph network approach combines spatial structure with graph networks to integrate the relative spatial positions of people and objects; the architecture extracts and refines visual features from human-object pairs based on their spatial positions, and detects human interactions with a graph convolutional network.

However, none of these methods model the temporal dependencies between people and objects, so they cannot understand human-object interactions on the basis of temporal information. Moreover, they all analyze actions by enumerating every pairwise combination of people and objects, which incurs high computation and inference-time costs.

For temporal interaction detection methods, in order to exploit temporal cues, some existing approaches model temporal interaction representations of people from video temporal features so as to represent interactions accurately. The asynchronous interaction aggregation approach promotes action detection by integrating different interactive behaviors: it first extracts long-term temporal features with an asynchronous memory-update algorithm, then separately models person-person interaction features, human-object interaction features, and temporal features, and finally models the interaction features with an interaction aggregation structure. The higher-order relation modeling approach models high-order interaction relations indirectly by inferring the interactions between multiple actors and their context: it first models first-order actor-context relations, then builds a higher-order relational reasoning model, and finally uses the first-order relations to model second-order relations through the reasoning model.

Although these methods achieve better results than single-frame-based interaction detection methods, they all obtain video features by combining clip-level features with ROI alignment, which makes it impossible to accurately capture the temporal features of fast-moving actions. In addition, existing action detection methods cannot accurately and effectively model and represent human-object interactions, leaving the models uninterpretable and their detection accuracy poor.

Summary of the invention

To solve the above problems, the invention provides a human-object interaction action detection method based on feature refinement and multi-view. The method captures entity temporal features with the YOLO object detection algorithm, the ROI alignment algorithm, and the SlowFast temporal feature generation algorithm. It then tracks and localizes entity motion trajectories with a movement localization method and performs a feature refinement operation that, based on the localization results, concatenates the temporal features at different time steps; this improves the accuracy of the entity temporal features and solves the inaccuracy caused by spatial offsets. Finally, the action relations between the person to be detected and multiple entities are explored from multiple views: the human-object interactions are represented separately under a subject view and a collaboration view, multi-view features are built by fusing the features of the different views, and the interactions are represented by multiple groups of multi-view features. This solves the modeling problem for human-object interactions and enhances the robustness of the interaction representation.

The invention provides a human-object interaction action detection method based on feature refinement and multi-view; the specific technical scheme is as follows:

S1: extract entity temporal features from the video frames to be detected;

S2: refine the entity temporal features to obtain refined entity temporal features, comprising the following steps:

S201: track and localize the entity motion trajectories through a movement localization operation;

S202: based on the localized entity positions, extract temporal features segment by segment and refine them;

S3: based on the refined entity temporal features, construct a human-centered interaction action modeling model;

S4: use the human-centered interaction action modeling model to construct and fuse multiple groups of multi-view human-object interaction features and obtain human-object interaction classification features, comprising the following steps:

S401: use the human-centered interaction action modeling model to construct the human-object interaction features under the subject view;

S402: use the human-centered interaction action modeling model to construct the human-object interaction features under the collaboration view;

S403: construct the human-object interaction classification features from the interaction features under the subject view and the collaboration view;

S5: classify the human-object interactions with an action classifier.

Further, step S1 comprises the following steps:

S101: use the YOLO object detection algorithm to detect, in real time, the categories and bounding boxes of the people and objects in the current frame;

S102: generate the temporal features of the current frame with the SlowFast temporal feature extraction algorithm;

S103: use the ROI alignment algorithm and max pooling to extract the temporal features of the people and objects from the detected entity coordinates and the temporal features of the current frame;

S104: store the bounding boxes and temporal features of the entities in the current frame into a feature pool.

Further, in step S201, the movement localization operation comprises progressive expansion and adaptive localization. By iterating these two steps, the offset position of every entity of the current frame is localized in the preceding and following frames.

Further, the progressive expansion copies the entity bounding boxes localized in the previous iteration into the adjacent, not-yet-localized frames.

Further, in the first iteration the bounding boxes of all entities in the current frame are copied into the adjacent, not-yet-localized frames.

Further, the adaptive localization proceeds as follows:

obtain from the feature pool the entity bounding boxes stored for the adjacent not-yet-localized frames, compute the center points of those stored boxes and of the boxes copied in the progressive expansion step, and compute the distances between the two sets of center points;

solve the correspondence between the center points with the Hungarian algorithm;

based on the correspondence, obtain for each copied bounding box the position of its corresponding stored bounding box, and take that position as the localized entity position.

Further, in step S202, the temporal feature refinement proceeds as follows:

split the temporal features of the current frame into 2D+1 feature blocks, where D is the total number of iterations;

use the localized entity positions and the ROI alignment method to capture each entity's partition temporal features in every feature block;

concatenate all partition temporal features of each entity and construct the refined entity temporal features through a convolution operation.

Further, the convolution operation consists of two stacked groups of a convolution layer, a Dropout layer, and a ReLU layer.

Further, in step S3, the human-centered interaction action modeling model comprises a learnable weight block, a long-term temporal feature enhancement block, and two atomic action representation blocks;

the atomic action representation blocks take as input the refined temporal features of the person to be detected and of the other people/objects participating in the interaction; the outputs of the atomic action representation blocks are connected to the input of the learnable weight block; and the output of the learnable weight block, together with the long-term temporal features, forms the input of the long-term temporal feature enhancement block.

Further, in step S401, the human-object interaction features under the subject view are constructed as follows:

using the human-centered interaction modeling method of step S3, take as input the refined temporal features of the person to be detected and of the two objects participating in the interaction, and obtain the human-object interaction features under the subject view.

Further, in step S402, the human-object interaction features under the collaboration view are constructed as follows:

using the human-centered interaction modeling method of step S3, take as input the refined temporal features of the person to be detected, of the person assisting the interaction, and of the jointly operated object, and obtain the human-object interaction features under the collaboration view.

Further, in step S403, the human-object interaction classification features are constructed as follows:

add the interaction features under the subject view to those under the collaboration view to obtain multi-view human-object interaction features;

construct multiple groups of multi-view human-object interaction features;

add and fuse the multiple groups to obtain the human-object interaction classification features.
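The fusion in step S403 is plain elementwise addition at both levels. A minimal Python sketch, where the feature vectors and the group structure are illustrative assumptions rather than the patent's actual tensors:

```python
def fuse_views(subject_feat, collab_feat):
    """Add the subject-view and collaboration-view features elementwise
    to form one multi-view interaction feature (step S403)."""
    return [a + b for a, b in zip(subject_feat, collab_feat)]

def classification_feature(view_pairs):
    """Fuse several groups of (subject-view, collaboration-view)
    features by summation into the final human-object interaction
    classification feature."""
    fused = [fuse_views(s, c) for s, c in view_pairs]
    return [sum(vals) for vals in zip(*fused)]

# Two illustrative groups of multi-view features.
groups = [([1.0, 2.0], [3.0, 4.0]),
          ([5.0, 6.0], [7.0, 8.0])]
feat = classification_feature(groups)  # [16.0, 20.0]
```

Each group here would correspond to one combination of interaction partners; summation keeps the classification feature the same dimensionality regardless of how many groups are built.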

The beneficial effects of the invention are as follows:

1. After the entity temporal features are extracted, they are refined on the basis of movement localization, which effectively solves the inaccuracy of feature description caused by spatial displacement during temporal feature extraction and improves the accuracy of the temporal features.

2. A human-centered interaction action modeling model is given, composed of two atomic action representation blocks, a set of learnable weights, and a long-term temporal feature enhancement block. Based on this model, human-object interactions are represented separately from two different views; multi-view features are then built by fusing the features of the different views, and the interactions are represented by multiple groups of multi-view features. This effectively solves the inability of existing methods to model human-object interaction features, and the multi-view approach enhances the robustness of those features.

Description of the drawings

Figure 1 is a schematic diagram of the overall process modules of the method;

Figure 2 is a schematic diagram of the structure of the human-centered interaction action modeling model.

Detailed description

The technical schemes in the embodiments of the invention are described clearly and completely below. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. Based on the embodiments of the invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the invention.

In the description of the embodiments, it should be noted that any indicated orientation or positional relationship is based on the orientation or positional relationship shown in the drawings, on the orientation or positional relationship in which the product of the invention is customarily placed in use, or on that commonly understood by those skilled in the art. It serves only to facilitate and simplify the description of the invention, and does not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; it therefore cannot be understood as a limitation of the invention. In addition, the terms "first" and "second" are used only to distinguish descriptions and cannot be understood as indicating or implying relative importance.

In the description of the embodiments it should also be noted that, unless otherwise clearly specified and limited, the terms "arranged" and "connected" are to be understood broadly: for example, a connection may be fixed, detachable, or integral, and it may be direct or indirect through an intermediary. For a person of ordinary skill in the art, the specific meanings of these terms in the invention can be understood on a case-by-case basis.

Example 1

Example 1 of the invention discloses a human-object interaction action detection method based on feature refinement and multi-view which, as shown in Figure 1, comprises the following steps:

S1: extract entity temporal features from the video frames to be detected.

In this example, a video to be detected is defined as V = {v_1, v_2, …, v_T}, where v_t is the t-th frame of the video currently being detected, called the current frame, and T is the total number of frames in the video to be detected.

Specifically, the steps are as follows:

S101: use the YOLO object detection algorithm to detect, in real time, the categories and bounding boxes of the people and objects in the current frame; the boxes of the people and objects are denoted Bh_t and Bo_t respectively, and the set of Bh_t and Bo_t is denoted Be_t;

S102: generate the temporal features of the current frame with the SlowFast temporal feature extraction algorithm;

S103: use the ROI alignment algorithm and max pooling to extract the temporal features of the people and objects from the detected entity coordinates and the temporal features of the current frame, denoted Fh_t and Fo_t respectively; the set of Fh_t and Fo_t is denoted Fe_t;

S104: store the bounding boxes and temporal features of the entities in the current frame into a feature pool, denoted P.
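As a rough illustration of the data flow of S101-S104 only, the sketch below replaces YOLO, SlowFast, and ROI-Align with trivial placeholder functions; every function body here is an assumption for illustration, not the patent's implementation:

```python
def detect_entities(frame):
    # Placeholder for YOLO: returns (category, bounding box) pairs.
    return [("person", (10, 10, 50, 120)), ("cup", (60, 40, 80, 60))]

def clip_temporal_feature(frames):
    # Placeholder for SlowFast: one coarse feature map per clip.
    return [0.0] * 8

def roi_align_maxpool(feature_map, box):
    # Placeholder for ROI-Align + max pooling: a toy per-entity vector.
    return [float(max(box))] * 4

def build_feature_pool(video_frames):
    """Steps S101-S104: detect entities, compute the frame's temporal
    feature, crop per-entity features, and store boxes + features."""
    pool = {}  # t -> (entity boxes Be_t, entity features Fe_t)
    for t, frame in enumerate(video_frames):
        detections = detect_entities(frame)                  # S101
        fmap = clip_temporal_feature(video_frames)           # S102
        boxes = [box for _, box in detections]
        feats = [roi_align_maxpool(fmap, b) for b in boxes]  # S103
        pool[t] = (boxes, feats)                             # S104
    return pool
```

The pool keyed by frame index is what the later movement-localization and long-term-feature steps read from.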

S2: refine the entity temporal features to obtain refined entity temporal features.

Because entities often move quickly during an interaction, the conventional ROI-alignment-based extraction of entity temporal features is prone to inaccuracy when the entity's movement offset is large.

In this example, the entity temporal features are refined to improve their accuracy.

The specific steps are as follows:

S201: track and localize the entity motion trajectories through a movement localization operation.

The purpose of movement localization is, for every entity in the key frame, to localize its offset position in the preceding and following frames.

The movement localization operation comprises progressive expansion and adaptive localization. In this example, these two steps are executed iteratively to localize, in the preceding and following frames, the offset position of every entity of the current frame. In each iteration, the progressive expansion operation is first performed to provide an entity position reference for the displacement localization, and the adaptive localization operation is then performed to match the moved entity bounding boxes.

Specifically, the progressive expansion copies the entity bounding boxes localized in the previous iteration into the adjacent, not-yet-localized frames v_{t+d*Ns} and v_{t-d*Ns}; the copied boxes are denoted Bc_{t+d*Ns} and Bc_{t-d*Ns}, where d is the current iteration index and Ns is the number of frames in the configured displacement-localization interval.

In the first iteration, the bounding boxes of all entities in the current frame are copied into the adjacent, not-yet-localized frames.

The adaptive localization proceeds as follows:

obtain from the feature pool P the entity bounding boxes stored for an adjacent not-yet-localized frame, e.g. Be_{t+d*Ns} for frame v_{t+d*Ns}; compute the center points of the stored boxes Be_{t+d*Ns} and of the boxes Bc_{t+d*Ns} copied in the progressive expansion step; and finally compute the distances between the two sets of center points:

Dis(i, j) = |center(i) − center(j)|,  i ∈ [1, Sc], j ∈ [1, Se];

where Sc and Se are the numbers of entities in Bc_{t+d*Ns} and Be_{t+d*Ns} respectively, center computes the center point of an entity bounding box, and Dis(i, j) stores the distance between the i-th copied box in Bc_{t+d*Ns} and the j-th stored box in Be_{t+d*Ns};

then solve the correspondence between the center points with the Hungarian algorithm:

Π* = argmin⟨Π, Dis⟩

where Π denotes a correspondence between center points, Dis is the matrix of distances between center points, and Π* stores the solved correspondence.
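In practice the Hungarian step would use an off-the-shelf solver such as scipy.optimize.linear_sum_assignment; the sketch below instead brute-forces the same minimum-cost assignment over permutations, which is equivalent for the small Sc, Se per frame. The boxes and the displacement threshold value are illustrative assumptions:

```python
from itertools import permutations

def center(box):
    # box = (x1, y1, x2, y2)
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def adaptive_localize(copied, stored, max_shift=50.0):
    """Match each copied box Bc to a stored box Be by minimizing the
    total center-point distance (brute-force stand-in for the
    Hungarian algorithm; assumes len(copied) <= len(stored)), keeping
    the copied box when the best match exceeds the displacement
    threshold."""
    cc = [center(b) for b in copied]
    sc = [center(b) for b in stored]
    dist = [[((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5 for b in sc]
            for a in cc]
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(stored)), len(copied)):
        cost = sum(dist[i][j] for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return [stored[j] if dist[i][j] <= max_shift else copied[i]
            for i, j in enumerate(best)]

# Two copied boxes matched against two stored (shifted) boxes:
out = adaptive_localize([(0, 0, 10, 10), (100, 100, 110, 110)],
                        [(102, 101, 112, 111), (2, 1, 12, 11)])
# out == [(2, 1, 12, 11), (102, 101, 112, 111)]
```

The threshold check mirrors the patent's constraint on the admissible motion displacement: a copied box with no stored box within range keeps its reference position.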

In this example, the total number of iterations D is computed as

D = [Nc/Ns] / 2

where Nc is the number of video frames that the SlowFast model takes as input when extracting the temporal features of the current frame. To avoid the excessive iteration count and computation that frame-by-frame offset localization would cause, displacement localization in this example is performed at one-second intervals, i.e. Ns is the number of video frames per second.

Finally, based on the correspondence, the stored entity box Be_{t+d*Ns} corresponding to each copied box Bc_{t+d*Ns} is obtained, and that position is taken as the localized entity position, where proj projects a center point onto its corresponding bounding box, a threshold constrains the admissible motion displacement distance, and Bt_{t+d*Ns} stores the movement-localization results.

S202: based on the localized entity positions, extract temporal features segment by segment and refine them.

The specific process is as follows:

split the temporal features of the current frame into 2D+1 feature blocks, where D is the total number of iterations;

use the localized entity positions and the ROI alignment method to capture each entity's partition temporal features in every feature block, denoted Fe;

concatenate all partition temporal features of each entity and construct the refined entity temporal features through a convolution operation.

In this example, the convolution operation consists of two stacked groups of a convolution layer, a Dropout layer, and a ReLU layer, which can select and amplify, from the concatenated features, the features that help action recognition.

In this process, Wr denotes the stacked convolution operation, Fe the partition temporal features, and Re the refined entity temporal features, which contain the refined temporal features Rh and Ro of the people and objects.
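The split-crop-concatenate part of S202 can be sketched as follows. The `crop` argument stands in for ROI alignment at the entity's tracked position, and the trailing convolution stack Wr is omitted; the toy feature sequence and crop function are illustrative assumptions:

```python
def split_blocks(seq, n):
    """Split a temporal feature sequence into n nearly equal,
    contiguous blocks (the 2*D+1 feature blocks of S202)."""
    k, r = divmod(len(seq), n)
    blocks, i = [], 0
    for b in range(n):
        size = k + (1 if b < r else 0)
        blocks.append(seq[i:i + size])
        i += size
    return blocks

def refine_entity_feature(seq_feats, boxes, D, crop):
    """Crop the entity's region in each of the 2*D+1 blocks at its
    localized position and concatenate the crops; the patent then
    passes this through the convolution stack Wr."""
    blocks = split_blocks(seq_feats, 2 * D + 1)
    parts = [crop(block, box) for block, box in zip(blocks, boxes)]
    return [x for part in parts for x in part]  # concatenation

# Toy crop: summarize a block by its sum (placeholder for ROI-Align).
toy_crop = lambda block, box: [sum(block)]
refined = refine_entity_feature(list(range(10)), [None] * 3, 1, toy_crop)
# refined == [6, 15, 24]
```

One crop per block, taken at the per-block localized position, is what lets the concatenated feature follow a fast-moving entity instead of a fixed box.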

S3:基于精炼的实体时序特征,获取以人为中心的交互动作建模模型。S3: Based on refined entity timing characteristics, obtain a human-centered interactive action modeling model.

如图2所示,以人为中心的交互动作建模模型结构如下:As shown in Figure 2, the structure of the human-centered interactive action modeling model is as follows:

包括可学习的权重块、长期时序特征增强块以及两个原子动作表征块;Includes learnable weight blocks, long-term temporal feature enhancement blocks, and two atomic action representation blocks;

所述原子动作表征块以待检测人体的精炼时序特征和参与交互的其他人/物体的精炼时序特征为输入,所述原子动作表征块的输出与所述可学习的权重块的输入连接,所述可学习的权重块的输出与长期时序特征作为所述长期时序特征增强块的输入。The atomic action representation block takes the refined timing characteristics of the human body to be detected and the refined timing characteristics of other people/objects participating in the interaction as inputs, and the output of the atomic action representation block is connected to the input of the learnable weight block, so The output of the learnable weight block and the long-term temporal features are used as the input of the long-term temporal feature enhancement block.

具体的,模型包括两个原子动作表征块,分别记为原子动作表征块1、原子动作表征块2、可学习的权重块以及长期时序特征增强块;Specifically, the model includes two atomic action representation blocks, respectively recorded as atomic action representation block 1, atomic action representation block 2, learnable weight block and long-term temporal feature enhancement block;

所述原子动作表征块以及长期时序特征增强块可通过多种方式实现,例如AvgPooling、Transformer和Non-Local Block;由于Non-Local Block可以更有效地捕获特征之间的依赖选择对目标人特征高度激活的其他特征,并且可以将他们合并以增强目标人特征,此外还不会消耗大量的计算资源,因此,本实施例中,采用Non-Local Block来提取原子动作特征。The atomic action representation block and the long-term temporal feature enhancement block can be implemented in a variety of ways, such as AvgPooling, Transformer and Non-Local Block; because the Non-Local Block can more effectively capture the dependence between features, the selection of the target person's features is highly Activated other features, and they can be combined to enhance the target person features, and it will not consume a lot of computing resources. Therefore, in this embodiment, Non-Local Block is used to extract atomic action features.

Atomic action representation blocks 1 and 2 take as input the refined temporal features of the person to be detected and of the other people/objects participating in the interaction, and output atomic interaction action features.

The learnable weight block consists of multiple stacked convolutions; it fuses the two atomic interaction features to combine the atomic interaction actions, and outputs the combined interaction feature.
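A minimal sketch of such a fusion block, with the 1x1 convolutions reduced to per-feature linear layers; the layer widths, the ReLU between the two layers, and the random weights are all assumptions for illustration:

```python
import numpy as np

def weight_block(feat_a, feat_b, seed=0):
    """Sketch of the learnable weight block: two stacked linear maps
    (the per-position equivalent of 1x1 convolutions) that fuse two
    atomic interaction features into one combined feature."""
    rng = np.random.default_rng(seed)
    d = feat_a.shape[-1]
    x = np.concatenate([feat_a, feat_b], axis=-1)     # (1, 2d)
    W1 = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    W2 = rng.standard_normal((d, d)) / np.sqrt(d)
    h = np.maximum(x @ W1, 0)                          # first layer + ReLU (assumed)
    return h @ W2                                      # combined feature, (1, d)

fused = weight_block(np.ones((1, 8)), np.zeros((1, 8)))
print(fused.shape)  # (1, 8)
```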

The long-term temporal feature enhancement block first extracts from the feature pool P the temporal features of all entities within a 5-second span centered on the current frame, concatenates all entity features within those 5 seconds to build the long-term temporal feature L, and then uses L to enhance the combined interaction feature.
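A sketch of the long-term feature construction step, assuming a hypothetical frame rate and a dict-based feature pool keyed by frame index (both are illustrative choices, not specified in the text):

```python
import numpy as np

def build_long_term_feature(feature_pool, t, fps=25, span_s=5):
    """Gather the features of all entities stored in the pool within a
    5-second window centred on frame t, and combine them into L.
    feature_pool maps frame index -> list of (d,) entity features."""
    half = int(span_s * fps) // 2
    feats = []
    for f in range(t - half, t + half + 1):
        feats.extend(feature_pool.get(f, []))   # all entities in that frame
    return np.stack(feats) if feats else np.empty((0,))

pool = {10: [np.ones(8)], 11: [np.zeros(8), np.ones(8)]}
L = build_long_term_feature(pool, 11)
print(L.shape)  # (3, 8)
```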

S4: Using the human-centered interaction action modeling model, construct and fuse multiple groups of multi-view human-object interaction action features to obtain human-object interaction action classification features.

As shown in Figure 1, to extract human-object interaction action features more robustly, the human-centered interaction action modeling model is used to extract the features under the subject view and under the collaboration view respectively.

The specific steps are as follows:

S401: Use the human-centered interaction action modeling model to construct the human-object interaction action features under the subject view.

The subject view mainly models the subjects that interact directly. Since the person interacts with two objects during the interaction process, the human-object interaction action can be modeled directly from the features of the two objects.

Under the subject view, the human-object interaction action consists of two atomic actions in which the person to be detected interacts with each of the two objects. Using the human-centered interaction modeling method of step S3, the human-object interaction action under the subject view is represented as:

Isub = Hc_sub(Rh1, Ro1, Ro2, L)

where Hc_sub denotes the human-centered interaction action modeling function, Rh1 the refined features of the person to be detected, Ro1 and Ro2 the refined features of the two interacting objects, L the long-term temporal features, and Isub the output human-object interaction action feature under the subject view.
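An illustrative sketch of how Hc could compose the blocks of step S3; the atomic blocks are reduced to mean pooling and the weight block to a fixed average, which are stand-ins for the learned Non-Local and convolutional components, not the patented implementation:

```python
import numpy as np

def hc_model(r_h, r_e1, r_e2, L):
    """Toy Hc(Rh, Re1, Re2, L): two atomic-action features (person with
    each entity), a fusion step, then additive long-term enhancement."""
    atom1 = (r_h + r_e1) / 2            # atomic action: person with entity 1
    atom2 = (r_h + r_e2) / 2            # atomic action: person with entity 2
    combined = (atom1 + atom2) / 2      # stands in for the learnable weight block
    return combined + L.mean(axis=0)    # long-term temporal enhancement

d = 16
Rh1, Ro1, Ro2 = np.ones(d), np.zeros(d), 2 * np.ones(d)
L = np.ones((5, d))
Isub = hc_model(Rh1, Ro1, Ro2, L)       # subject view: Hc_sub(Rh1, Ro1, Ro2, L)
print(Isub.shape)  # (16,)
```

The same function shape also covers the collaboration view of S402, where the second entity is the assisting person rather than a second object.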

S402: Use the human-centered interaction action modeling model to construct the human-object interaction action features under the collaboration view.

The collaboration view mainly models the person assisting the interaction and the jointly operated object. Since other people may collaborate to complete the interaction, the human-object interaction action can be modeled from the features of the collaborator and of the jointly operated object.

Under the collaboration view, the human-object interaction action consists of two atomic actions in which the person to be detected interacts with the collaborator and with the jointly operated object. Using the human-centered interaction modeling method of step S3, the human-object interaction action under the collaboration view is represented as:

Icol = Hc_col(Rh1, Ro1, Rh2, L)

where Hc_col denotes the human-centered interaction action modeling function, Rh1 and Rh2 the refined features of the two people participating in and assisting the interaction, Ro1 the refined features of the jointly operated object, L the long-term temporal features, and Icol the output human-object interaction action feature under the collaboration view.

S403: Construct the human-object interaction action classification features from the subject view and the collaboration view.

The multi-view human-object interaction action feature is obtained by adding the feature under the subject view to the feature under the collaboration view. Multiple groups of multi-view human-object interaction action features are constructed, and the classification feature is obtained by additively fusing these groups.

The specific process is expressed as:

FHOO = Σ_{i=1}^{g} (Isub_i + Icol_i)

where g is the number of groups of multi-view human-object interaction action features, Isub_i and Icol_i are the subject-view and collaboration-view features of the i-th group, and FHOO is the human-object interaction action classification feature.
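A toy NumPy sketch of the additive multi-view fusion of S403; the group features here are random placeholders for the outputs of the modeling model:

```python
import numpy as np

# For each of g groups, add the subject-view and collaboration-view
# features, then sum over the groups to obtain the classification
# feature F_HOO.
rng = np.random.default_rng(0)
g, d = 3, 16
I_sub = rng.standard_normal((g, d))   # g groups of subject-view features
I_col = rng.standard_normal((g, d))   # g groups of collaboration-view features
multi_view = I_sub + I_col            # per-group multi-view feature
F_HOO = multi_view.sum(axis=0)        # fused classification feature
print(F_HOO.shape)  # (16,)
```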

S5: Classify the human-object interaction actions with an action classifier.

Since the refined human temporal features contain rich interaction semantic information suitable for recognizing interaction actions, in this embodiment the human-object interaction actions are classified by adding the refined human temporal features to the human-object interaction action classification features.

This is expressed as:

P = Wc(FHOO + Rh)

where Wc denotes the action classifier, composed of two fully connected layers and a softmax classifier, FHOO the human-object interaction action classification features, and Rh the refined human temporal features.
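A minimal sketch of the classifier Wc; the hidden width, action count, and random weights are assumptions standing in for learned parameters:

```python
import numpy as np

def action_classifier(F_HOO, Rh, num_actions=10, hidden=32, seed=0):
    """Sketch of Wc: add the refined human temporal feature Rh to
    F_HOO, then apply two fully connected layers and a softmax."""
    rng = np.random.default_rng(seed)
    d = F_HOO.shape[-1]
    W1 = rng.standard_normal((d, hidden)) / np.sqrt(d)
    W2 = rng.standard_normal((hidden, num_actions)) / np.sqrt(hidden)
    x = F_HOO + Rh                       # feature addition from the text
    h = np.maximum(x @ W1, 0)            # fully connected layer 1 + ReLU
    logits = h @ W2                      # fully connected layer 2
    p = np.exp(logits - logits.max())
    return p / p.sum()                   # softmax action probabilities

P = action_classifier(np.ones(16), np.ones(16))
print(P.shape)  # (10,)
```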

The present invention is not limited to the specific embodiments described above. The invention extends to any new feature or any new combination disclosed in this specification, and to any new method or process step, or any new combination thereof, disclosed herein.

Claims (8)

1. A human-object interaction action detection method based on feature refining and multiple views, characterized by comprising:

S1: extracting entity temporal features from the video frames to be detected;

S2: refining the entity temporal features to obtain refined entity temporal features, comprising the following steps:

S201: tracking and locating entity motion trajectories through a moving localization operation;

S202: based on the located entity positions, extracting temporal features segment by segment and refining them;

the temporal feature refining process being as follows: splitting the temporal features of the current frame into 2D+1 feature blocks, where D denotes the total number of iterations; capturing each entity's partition temporal features in each feature block using the located entity positions and the ROI Align method; concatenating all partition temporal features of each entity and constructing the refined entity temporal features through a convolution operation; the convolution operation consisting of two stacked groups of convolution, Dropout, and ReLU layers;

S3: based on the refined entity temporal features, constructing a human-centered interaction action modeling model comprising a learnable weight block, a long-term temporal feature enhancement block, and two atomic action representation blocks; the atomic action representation blocks taking as input the refined temporal features of the person to be detected and of the other people/objects participating in the interaction; the outputs of the atomic action representation blocks being connected to the input of the learnable weight block; and the output of the learnable weight block together with the long-term temporal features serving as the input of the long-term temporal feature enhancement block;

S4: using the human-centered interaction action modeling model, constructing and fusing multiple groups of multi-view human-object interaction action features to obtain human-object interaction action classification features, comprising the following steps:

S401: using the model to construct the human-object interaction action features under the subject view;

S402: using the model to construct the human-object interaction action features under the collaboration view;

S403: constructing the human-object interaction action classification features from the features under the subject view and the collaboration view;

S5: classifying the human-object interaction actions with an action classifier.

2. The method according to claim 1, characterized in that step S1 comprises the following steps:

S101: detecting the categories and bounding boxes of the people and objects in the current frame in real time with the YOLO object detection algorithm;

S102: generating the temporal features of the current frame with the SlowFast temporal feature extraction algorithm;

S103: extracting the temporal features of people and objects from the detected entity coordinates and the temporal features of the current frame using the ROI Align algorithm and max pooling;

S104: storing the bounding boxes and temporal features of the entities in the current frame into a feature pool.

3. The method according to claim 1, characterized in that in step S201 the moving localization operation comprises progressive expansion and adaptive localization; by iteratively executing these two steps, the offset position of every entity of the current frame is located in the preceding and following frames.

4. The method according to claim 3, characterized in that the progressive expansion copies the entity bounding boxes located in the previous iteration into the adjacent unlocated frames; in the first iteration, the bounding boxes of all entities in the current frame are copied into the adjacent unlocated frames.

5. The method according to claim 3, characterized in that the adaptive localization proceeds as follows: obtaining from the feature pool the entity bounding boxes stored for the adjacent unlocated frames; computing the center points of the stored boxes and of the boxes copied in the progressive expansion step, and computing the distances between the two sets of center points; solving the correspondence between the center points with the Hungarian algorithm; and, based on the correspondence, taking the position of the stored box corresponding to each copied box as the located entity position.

6. The method according to claim 1, characterized in that in step S401 the human-object interaction action features under the subject view are constructed as follows: using the human-centered interaction modeling method of step S3, the refined temporal features of the person to be detected and of the two interacting objects are taken as input to obtain the features under the subject view.

7. The method according to claim 1, characterized in that in step S402 the human-object interaction action features under the collaboration view are constructed as follows: using the human-centered interaction modeling method of step S3, the refined temporal features of the person to be detected, of the person assisting the interaction, and of the jointly operated object are taken as input to obtain the features under the collaboration view.

8. The method according to claim 1, characterized in that in step S403 the human-object interaction action classification features are constructed as follows: adding the features under the subject view to those under the collaboration view to obtain the multi-view human-object interaction action features; constructing multiple groups of multi-view human-object interaction action features; and additively fusing the multiple groups to obtain the classification features.
CN202310005121.4A 2023-01-03 2023-01-03 Character object interaction action detection method based on feature refining and multiple views Active CN115984966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310005121.4A CN115984966B (en) 2023-01-03 2023-01-03 Character object interaction action detection method based on feature refining and multiple views

Publications (2)

Publication Number Publication Date
CN115984966A CN115984966A (en) 2023-04-18
CN115984966B true CN115984966B (en) 2023-10-13

Family

ID=85972074

Country Status (1)

Country Link
CN (1) CN115984966B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311535B (en) * 2023-05-17 2023-08-22 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Risky behavior analysis method and system based on human interaction detection

Citations (5)

Publication number Priority date Publication date Assignee Title
CN108805080A (en) * 2018-06-12 2018-11-13 上海交通大学 Multi-level depth Recursive Networks group behavior recognition methods based on context
CN112215110A (en) * 2020-09-29 2021-01-12 北京北斗天巡科技有限公司 Human body posture estimation method based on relational analysis network
CN113221680A (en) * 2021-04-26 2021-08-06 西北工业大学 Text pedestrian retrieval method based on text dynamic guidance visual feature extraction
CN114910071A (en) * 2022-04-13 2022-08-16 同济大学 An object navigation method based on object bias correction and directed attention map
CN115171216A (en) * 2022-07-20 2022-10-11 北方民族大学 Method for detecting and identifying collaboration behavior and related system

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US7986842B2 (en) * 2006-11-10 2011-07-26 Fuji Xerox Co., Ltd. Collective media annotation using undirected random field models
US9135631B2 (en) * 2011-08-18 2015-09-15 Facebook, Inc. Computer-vision content detection for sponsored stories
ES3030783T3 (en) * 2017-01-06 2025-07-02 Sportlogiq Inc Systems and methods for behaviour understanding from trajectories
US11948401B2 (en) * 2019-08-17 2024-04-02 Nightingale.ai Corp. AI-based physical function assessment system

Non-Patent Citations (1)

Title
A Unified Framework for Multi-target Tracking and Collective Activity Recognition;Wongun Choi et al;《ECCV 2012》;215-230 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant