
CN114167978A - A human-computer interaction system mounted on a construction robot - Google Patents

A human-computer interaction system mounted on a construction robot

Info

Publication number
CN114167978A
CN114167978A (application CN202111335163.1A)
Authority
CN
China
Prior art keywords
construction
gesture
human
computer interaction
worker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111335163.1A
Other languages
Chinese (zh)
Inventor
蔡长青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202111335163.1A priority Critical patent/CN114167978A/en
Publication of CN114167978A publication Critical patent/CN114167978A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 Distances to prototypes
    • G06F 18/24137 Distances to cluster centroïds
    • G06F 18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Manipulator (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a human-computer interaction system mounted on a construction robot, which comprises the following parts: a position tracking module for detecting and tracking the positions of construction workers; a gesture tracking module for tracking and recognizing a construction worker's hand movements when the worker is detected making a gesture to the robot; and a gesture recognition module for recognizing the meaning of the worker's gesture and outputting the corresponding instruction. The system captures images of construction workers and applies deep learning to recognize their gestures, so it can receive gesture commands without interrupting the workers' tasks. It offers good generality and effectiveness and can be widely applied in the field of building construction.

Description

Human-computer interaction system carried on construction robot
Technical Field
The invention relates to the field of computer vision, and in particular to a human-computer interaction system mounted on a construction robot.
Background
In building construction, irregular operation and accidents of various kinds cause loss of life and property and reduce construction efficiency. To improve both the safety and the efficiency of building production, construction robots are being applied more and more widely on construction sites. However, a construction robot cannot communicate with humans directly, so communication between construction workers and the construction robot has to be achieved through human-computer interaction technologies.
Common human-computer interaction technologies include joysticks and handheld controllers, but these all require manual operation by the construction worker, so human-computer interaction cannot take place while the worker is working. Construction robots have also been equipped with sensors that track construction workers in order to achieve human-computer interaction.
Compared with the interaction technologies mentioned above, machine-vision-based human-computer interaction has clear advantages. Interaction is realized through specific actions, without requiring the construction worker to wear additional equipment or to input commands. Hand gestures are easy to use, natural, and intuitive, which makes them convenient for construction workers and construction robots to learn and use.
Disclosure of Invention
In view of this, embodiments of the present invention provide a human-computer interaction system mounted on a construction robot.
The invention provides a human-computer interaction system mounted on a construction robot, which is characterized by comprising the following parts:
the position tracking module is used for detecting and tracking the position of a construction worker;
the gesture tracking module is used for tracking and identifying hand actions of construction workers when the construction workers are detected to make gestures to the robot;
and the gesture recognition module is used for recognizing the meaning of the gesture of the worker and outputting a corresponding instruction.
Further, the working steps of the position tracking module include:
collecting image information of construction workers in real time through the construction robot, and building a first video sequence from the collected images;
identifying the construction workers in the video sequence, and assigning a different identification ID to each construction worker;
drawing a bounding box for each construction worker, and modeling the worker's appearance information;
drawing the movement trajectory of each construction worker from the video sequence and the worker's appearance information, and associating the collected images with the trajectory, thereby tracking the worker's position.
Further, the position tracking module uses a convolutional neural network to implement position tracking, the convolutional neural network including the YOLOv3 convolutional neural network.
Further, the working steps of the gesture tracking module include:
upon receiving a gesture signal from a construction worker, zooming the first video sequence in on the worker who issued the signal, and adjusting the first video sequence according to that worker's bounding box to obtain a second video sequence;
performing motion capture of the construction worker's gestures, and generating and outputting a third video sequence.
Further, adjusting the first video sequence according to the construction worker's bounding box specifically comprises: keeping the distance between the construction worker's bounding box and the edge of the collected image no less than one eighth of the corresponding side length of the image.
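For illustration only, the following Python sketch shows one way the bounding-box margin rule above could be checked when forming the second video sequence; it is a minimal example under assumed image and box conventions, not the patented implementation.

```python
# Illustrative check of the margin rule: every side of the worker's bounding
# box should be at least one eighth of the corresponding image side length
# away from the image edge. Box format (x1, y1, x2, y2) is an assumption.
def satisfies_margin(box, image_width, image_height):
    x1, y1, x2, y2 = box
    return (x1 >= image_width / 8 and image_width - x2 >= image_width / 8
            and y1 >= image_height / 8 and image_height - y2 >= image_height / 8)
```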
Further, the working steps of the gesture recognition module comprise:
detecting a gesture using a detector;
identifying a specific meaning of a gesture with a classifier when the gesture is detected by the detector;
and outputting an operation instruction corresponding to the specific meaning according to the specific meaning of the gesture.
Further, the gesture recognition module is realized through a convolutional neural network based on a hierarchical structure.
Further, the detecting a gesture using a detector specifically includes:
sampling 8 video frames from the second video sequence per unit time;
detecting the video frames frame by frame, and extracting gesture features from the video frames through a ResNet-10 convolutional neural network;
the video frame in which the gesture was detected is marked as the first frame.
Further, the identifying, by using the classifier, the specific meaning of the gesture specifically includes:
further sampling the video frames, so that 32 video frames are taken from the second video sequence per unit time;
establishing a video frame index T, recognizing subsequent video frames starting from the first frame T0, and classifying the gestures in the video frames when the difference between T and T0 equals a multiple of a time factor L;
computing a weighted probability for each frame in the video frame index T through a weighting function, computing the difference between the highest and second-highest weighted probabilities, and, when the difference is greater than a preset threshold, looking up the corresponding gesture in a library according to the gesture in the video frames;
and outputting the specific meaning of the corresponding gesture.
Further, the weighting function is given by a formula (reproduced in the original as an image) in which w_T denotes the weight at the T-th frame, u corresponds to the average duration of a gesture in the dataset (i.e., the number of frames), and s is the stride length.
The beneficial effects of the invention are as follows: in an actual construction scene, with the human-computer interaction system mounted on the construction robot, a construction worker can move and gesture to the robot at the same time without wearing any sensor, so the worker's productivity is not affected. Experimental results show that the method achieves good overall accuracy and recall, verifying its effectiveness.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a general flow diagram of a human-computer interaction system mounted on a construction robot;
fig. 2 is a flowchart of the operation of a gesture recognition module in a human-computer interaction system mounted on a construction robot.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
This embodiment provides a human-computer interaction system mounted on a construction robot. A vision-based convolutional neural network is used to capture and interpret construction workers' gestures, so as to guide the operation of a tower crane or other construction equipment.
The system specifically comprises the following modules; the corresponding workflows are shown in Fig. 1, and a schematic sketch of how the modules chain together follows the list:
the position tracking module is used for detecting and tracking the position of a construction worker;
the gesture tracking module is used for tracking and identifying hand actions of construction workers when the construction workers are detected to make gestures to the robot;
and the gesture recognition module is used for recognizing the meaning of the gesture of the worker and outputting a corresponding instruction.
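As a schematic illustration only, the Python sketch below shows how the three modules could be chained; all callables are hypothetical stand-ins for the modules described in this embodiment, not the actual implementation.

```python
# Illustration of the processing chain: position tracking -> gesture tracking
# -> gesture recognition. The three callables are assumed placeholders.
def interaction_pipeline(video_frames, position_tracker, gesture_tracker, gesture_recognizer):
    """video_frames: iterable of images from the robot's camera.
    Returns the instruction decoded from the first recognized gesture, or None."""
    for frame in video_frames:
        workers = position_tracker(frame)              # {worker_id: bounding box}
        for worker_id, box in workers.items():
            clip = gesture_tracker(frame, box)         # cropped gesture region
            instruction = gesture_recognizer(clip)     # gesture meaning -> instruction
            if instruction is not None:
                return instruction
    return None
```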
This embodiment introduces the position tracking module. The working steps of the position tracking module include:
collecting image information of construction workers in real time through the construction robot, and building a first video sequence from the collected images;
identifying the construction workers in the video sequence, and assigning a different identification ID to each construction worker;
drawing a bounding box for each construction worker, and modeling the worker's appearance information;
drawing the movement trajectory of each construction worker from the video sequence and the worker's appearance information, and associating the collected images with the trajectory, thereby tracking the worker's position.
The purpose of the position tracking module is to extract, from the video sequence, the construction worker who makes a gesture. The detector identifies the construction workers in each frame and obtains their bounding boxes. Given the detection results, trajectory and appearance information is modeled to associate the current detections with the existing trajectories and thus track the workers. When multiple workers are present in the scene, the construction worker making the gesture can be identified by the tracking identification number (ID). In this embodiment, a Deep SORT (Simple Online and Realtime Tracking) multi-object tracker based on the YOLOv3 convolutional neural network is used to track the workers: the same worker detected in earlier frames is associated across all frames, and the trajectory and appearance information provided by the detection results is used to track the workers through the video frames.
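As a rough illustration of the detection-and-association idea described above (and not of the patent's actual Deep SORT implementation), the following Python sketch assigns persistent IDs to per-frame worker detections; bounding boxes from a YOLOv3-style detector are taken as given, and plain IoU matching is an assumed simplification that replaces Deep SORT's appearance and motion model.

```python
# Minimal sketch: associate this frame's worker detections with existing IDs
# using bounding-box overlap only (illustrative simplification of tracking).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class WorkerTracker:
    def __init__(self, iou_threshold=0.3):
        self.tracks = {}          # worker ID -> last bounding box
        self.next_id = 0
        self.iou_threshold = iou_threshold

    def update(self, detections):
        """detections: list of boxes for the current frame. Returns {ID: box}."""
        ids = list(self.tracks)
        if ids and detections:
            cost = np.array([[1.0 - iou(self.tracks[i], d) for d in detections]
                             for i in ids])
            rows, cols = linear_sum_assignment(cost)
        else:
            rows, cols = np.array([], dtype=int), np.array([], dtype=int)
        assigned, matched = {}, set()
        for r, c in zip(rows, cols):
            if 1.0 - cost[r, c] >= self.iou_threshold:   # good enough overlap
                assigned[ids[r]] = detections[c]
                matched.add(c)
        for c, det in enumerate(detections):             # new workers -> new IDs
            if c not in matched:
                assigned[self.next_id] = det
                self.next_id += 1
        self.tracks = assigned   # note: unmatched old tracks are simply dropped here
        return assigned
```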
This embodiment introduces the gesture tracking module. The workflow of the gesture tracking module includes:
upon receiving a gesture signal from a construction worker, zooming the first video sequence in on the worker who issued the signal, and adjusting the first video sequence according to that worker's bounding box to obtain a second video sequence;
performing motion capture of the construction worker's gestures, and generating and outputting a third video sequence.
The purpose of the gesture tracking module is to crop the region of the construction worker issuing the gesture from the original frame and to form a queue of regions for gesture detection and classification. The module works in two steps: horizontal extension of the extracted region, and formation of the gesture recognition queue. The extracted region is first expanded horizontally by 25%, a proportion determined by trial and error, so that the worker's gestures are captured adequately. When the worker swings an arm, the region obtained directly from the detection and tracking component may miss part of the hand; after extending the region horizontally to the left and right, the worker's region can capture the entire hand area, which is critical for gesture recognition. The generated gesture recognition queue is then used by the subsequent gesture recognition module.
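The cropping step can be pictured with the short Python sketch below; it is only a schematic of the 25% horizontal expansion described above, with the frame layout, the box format, and the even left/right split assumed for the example.

```python
# Illustrative crop: widen the worker's bounding box horizontally (25% of the
# box width in total, split evenly between left and right as an assumption),
# clip to the frame, and return the region that is queued for recognition.
def crop_gesture_region(frame, box, expand_ratio=0.25):
    """frame: HxWx3 image array (e.g. a numpy ndarray); box: (x1, y1, x2, y2)."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    half = (x2 - x1) * expand_ratio / 2.0
    x1 = int(max(0, x1 - half))
    x2 = int(min(w, x2 + half))
    return frame[int(max(0, y1)):int(min(h, y2)), x1:x2]
```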
The present embodiment introduces a gesture recognition module. The work flow of the gesture recognition module is shown in fig. 2, and comprises the following steps:
detecting a gesture using a detector;
identifying a specific meaning of the gesture with a classifier when the gesture is detected by the detector;
and outputting an operation instruction corresponding to the specific meaning according to the specific meaning of the gesture.
Wherein, using the detector to detect the gesture specifically includes:
sampling 8 video frames from the second video sequence per unit time;
detecting the video frames frame by frame, and extracting gesture features from the video frames through a ResNet-10 convolutional neural network;
denoting the video frame in which a gesture is detected as T0.
The detector acts as a switch that determines whether the classifier needs to be activated. If a gesture is detected and the classifier is not yet active, the classifier is activated and the current frame index is recorded as the first frame; T0 is the frame index at which the gesture was detected.
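A minimal Python sketch of this switching behaviour is given below; `gesture_detector` is a hypothetical callable standing in for the ResNet-10 based detector, so this illustrates only the logic described above, not the patent's network.

```python
# Illustrative switch: scan frames until the detector fires, then report T0,
# the frame index from which the classifier is activated.
def find_first_gesture_frame(frames, gesture_detector, prob_threshold=0.5):
    """frames: iterable of (index, image) pairs.
    Returns T0, the first frame index at which a gesture is detected, or None."""
    for t, frame in frames:
        if gesture_detector(frame) >= prob_threshold:
            return t            # classifier is activated from this frame onward
    return None
```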
The method for recognizing the specific meaning of the gesture by using the classifier specifically comprises the following steps:
further sampling the video frames, so that 32 video frames are taken from the second video sequence per unit time;
establishing a video frame index T, recognizing subsequent video frames starting from the first frame T0, and classifying the gestures in the video frames when the difference between T and T0 equals a multiple of a time factor L;
computing a weighted probability for each frame in the video frame index T through a weighting function, computing the difference between the highest and second-highest weighted probabilities, and, when the difference is greater than a preset threshold, looking up the corresponding gesture in a library according to the gesture in the video frames;
outputting the specific meaning of the corresponding gesture.
The weighting function (reproduced in the original as an image) assigns a weight w_T to the T-th frame, where u corresponds to the average duration of a gesture in the dataset (i.e., the number of frames) and s is the stride length; s can take the value 1, which is small enough not to miss any gesture.
Regarding the difference between the highest and second-highest weighted probabilities: if the difference is greater than a threshold τ, gesture recognition is triggered; otherwise the classifier does not yet have enough confidence to classify the gesture type, and the architecture performs another round of gesture detection and classification until the detector no longer detects a gesture and the classifier is deactivated. The choice of τ and L depends on how easily and how often the user should trigger recognition. After repeated experiments, τ and L were set to 0.20 and 15, respectively.
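The triggering rule can be sketched as follows; this is an illustrative Python example rather than the patented implementation, and `frame_weight` is a deliberate placeholder because the actual weighting formula appears only as an image in the original.

```python
# Illustrative triggering rule: accumulate weighted class probabilities over the
# frames since T0 (classifying only every L frames); when the gap between the
# top two weighted probabilities exceeds tau, look up the gesture in the library.
import numpy as np

def frame_weight(t, t0, u, s):
    # Placeholder: the patent's weighting formula in terms of T, u, and s is
    # shown only as an image in the original document.
    return 1.0

def classify_gesture(clip_probs, t0, time_factor_l=15, tau=0.20, u=32, s=1):
    """clip_probs: dict {frame index T: per-class probability vector} for T >= t0.
    Returns the index of the recognized gesture, or None if confidence is too low."""
    weighted, total_weight = None, 0.0
    for t in sorted(clip_probs):
        if (t - t0) % time_factor_l != 0:      # classify only every L frames
            continue
        w = frame_weight(t, t0, u, s)
        p = np.asarray(clip_probs[t], dtype=float)
        weighted = w * p if weighted is None else weighted + w * p
        total_weight += w
        avg = weighted / total_weight
        top2 = np.sort(avg)[-2:]               # (second-highest, highest)
        if top2[1] - top2[0] > tau:            # confident enough to trigger
            return int(np.argmax(avg))         # index into the gesture library
    return None
```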
Field experiments verify the effectiveness of this embodiment for gesture recognition, with overall accuracy and recall reaching 87.0% and 66.7%, respectively. In addition, a laboratory study demonstrated how the system can be used to interact with a dump truck. Future work will integrate the proposed system into robotic construction machinery.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A human-computer interaction system mounted on a construction robot, characterized by comprising the following parts:
a position tracking module for detecting and tracking the positions of construction workers;
a gesture tracking module for tracking and recognizing a construction worker's hand movements when the worker is detected making a gesture to the robot;
a gesture recognition module for recognizing the meaning of the worker's gesture and outputting the corresponding instruction.
2. The human-computer interaction system mounted on a construction robot according to claim 1, characterized in that the working steps of the position tracking module comprise:
collecting image information of construction workers in real time through the construction robot, and building a first video sequence from the collected images;
identifying the construction workers in the video sequence, and assigning a different identification ID to each construction worker;
drawing a bounding box for each construction worker, and modeling the worker's appearance information;
drawing the movement trajectory of each construction worker from the video sequence and the worker's appearance information, and associating the collected images with the trajectory, thereby tracking the worker's position.
3. The human-computer interaction system mounted on a construction robot according to claim 2, characterized in that the position tracking module uses a convolutional neural network to implement position tracking, the convolutional neural network including the YOLOv3 convolutional neural network.
4. The human-computer interaction system mounted on a construction robot according to claim 2, characterized in that the working steps of the gesture tracking module comprise:
upon receiving a gesture signal from a construction worker, zooming the first video sequence in on the worker who issued the signal, and adjusting the first video sequence according to that worker's bounding box to obtain a second video sequence;
performing motion capture of the construction worker's gestures, and generating and outputting a third video sequence.
5. The human-computer interaction system mounted on a construction robot according to claim 4, characterized in that adjusting the first video sequence according to the construction worker's bounding box specifically comprises: keeping the distance between the construction worker's bounding box and the edge of the collected image no less than one eighth of the corresponding side length of the collected image.
6. The human-computer interaction system mounted on a construction robot according to claim 4, characterized in that the working steps of the gesture recognition module comprise:
detecting a gesture using a detector;
when the detector detects a gesture, identifying the specific meaning of the gesture with a classifier;
outputting the operation instruction corresponding to the specific meaning of the gesture.
7. The human-computer interaction system mounted on a construction robot according to claim 6, characterized in that the gesture recognition module is implemented by a convolutional neural network with a hierarchical structure.
8. The human-computer interaction system mounted on a construction robot according to claim 6, characterized in that detecting a gesture using the detector specifically comprises:
sampling 8 video frames from the second video sequence per unit time;
detecting the video frames frame by frame, and extracting gesture features from the video frames through a ResNet-10 convolutional neural network;
marking the video frame in which a gesture is detected as the first frame.
9. The human-computer interaction system mounted on a construction robot according to claim 6, characterized in that identifying the specific meaning of the gesture with the classifier specifically comprises:
further sampling the video frames, so that 32 video frames are taken from the second video sequence per unit time;
establishing a video frame index T, recognizing subsequent video frames starting from the first frame T0, and classifying the gestures in the video frames when the difference between T and T0 equals a multiple of a time factor L;
computing a weighted probability for each frame in the video frame index T through a weighting function, computing the difference between the highest and second-highest weighted probabilities, and, when the difference is greater than a preset threshold, looking up the corresponding gesture in a library according to the gesture in the video frames;
outputting the specific meaning of the corresponding gesture.
10. The human-computer interaction system mounted on a construction robot according to claim 6, characterized in that the weighting function is given by a formula (reproduced in the original as an image) in which w_T denotes the weight at the T-th frame, u corresponds to the average duration of a gesture in the dataset (i.e., the number of frames), and s is the stride length.
CN202111335163.1A 2021-11-11 2021-11-11 A human-computer interaction system mounted on a construction robot Pending CN114167978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111335163.1A CN114167978A (en) 2021-11-11 2021-11-11 A human-computer interaction system mounted on a construction robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111335163.1A CN114167978A (en) 2021-11-11 2021-11-11 A human-computer interaction system mounted on a construction robot

Publications (1)

Publication Number Publication Date
CN114167978A true CN114167978A (en) 2022-03-11

Family

ID=80478899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111335163.1A Pending CN114167978A (en) 2021-11-11 2021-11-11 A human-computer interaction system mounted on a construction robot

Country Status (1)

Country Link
CN (1) CN114167978A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102426480A (en) * 2011-11-03 2012-04-25 康佳集团股份有限公司 Human-computer interaction system and real-time gesture tracking processing method thereof
CN104460991A (en) * 2014-11-19 2015-03-25 中山大学 Gesture interaction control system based on digital household equipment
CN104992171A (en) * 2015-08-04 2015-10-21 易视腾科技有限公司 Method and system for gesture recognition and man-machine interaction based on 2D video sequence
WO2018033154A1 (en) * 2016-08-19 2018-02-22 北京市商汤科技开发有限公司 Gesture control method, device, and electronic apparatus
CN108256421A (en) * 2017-12-05 2018-07-06 盈盛资讯科技有限公司 Dynamic gesture sequence real-time identification method, system and device
CN111709310A (en) * 2020-05-26 2020-09-25 重庆大学 A Deep Learning-Based Gesture Tracking and Recognition Method
CN112506342A (en) * 2020-12-04 2021-03-16 郑州中业科技股份有限公司 Man-machine interaction method and system based on dynamic gesture recognition
CN113378770A (en) * 2021-06-28 2021-09-10 北京百度网讯科技有限公司 Gesture recognition method, device, equipment, storage medium and program product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIN WANG, ZHENHUA ZHU: "Vision-based framework for automatic interpretation of construction workers' hand gestures", Automation in Construction, vol. 130, pages 1-11 *

Similar Documents

Publication Publication Date Title
CN111726586A (en) A production system operation specification monitoring and reminder system
Wang et al. Vision–based framework for automatic interpretation of construction workers' hand gestures
CN109711262B (en) Intelligent excavator pedestrian detection method based on deep convolutional neural network
CN108596148B (en) System and method for analyzing labor state of construction worker based on computer vision
CN110688980B (en) Human body posture classification method based on computer vision
CN115909396B (en) Foot robot dynamic target tracking method
CN118840633A (en) Visual perception detection method and system
WO2022087506A1 (en) Rail feature identification system
CN108197575A (en) A kind of abnormal behaviour recognition methods detected based on target detection and bone point and device
Rangesh et al. When vehicles see pedestrians with phones: A multicue framework for recognizing phone-based activities of pedestrians
CN108765450A (en) intelligent robot visual tracking method based on deep learning
CN113537019A (en) Detection method for identifying wearing of safety helmet of transformer substation personnel based on key points
CN113344967A (en) Dynamic target identification tracking method under complex background
CN112215112A (en) Method and system for generating neural network model for hand motion recognition
CN115270399B (en) An industrial robot posture recognition method, device and storage medium
CN112149607A (en) Remote intelligent operation and maintenance method based on Bayesian algorithm
CN115169673A (en) Intelligent campus epidemic risk monitoring and early warning system and method
CN114663796A (en) Target person continuous tracking method, device and system
CN109272036A (en) A kind of random fern method for tracking target based on depth residual error network
Jean et al. Body tracking in human walk from monocular video sequences
CN118587426B (en) A lightweight dual-channel railway foreign body intrusion detection method
CN114167978A (en) A human-computer interaction system mounted on a construction robot
CN119369427A (en) Home intelligent elderly care robot system based on multifunctional integration and its control method
CN112308041A (en) A vision-based gesture control method for unmanned platform
CN116091544A (en) Target tracking system based on computer vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220311