Disclosure of Invention
The invention provides a method and a system for automatically evaluating the safety of a strength training action gesture, and aims to overcome at least one of the defects of the prior art.
The invention relates to a method for automatically evaluating the safety of a strength training action gesture, which comprises the following steps:
receiving a video segment transmitted by a user terminal, extracting frame image features, and outputting an action category sequence number;
outputting a calling model vector according to the frame image features and the action category sequence number;
calling a corresponding body part granularity level model according to the calling model vector;
outputting human body posture key point coordinates according to the body part granularity level model;
calculating the movement speed and the joint angle according to the human body posture key point coordinates and the action category sequence number;
judging whether the movement speed or the joint angle is within a safety threshold range;
if the movement speed or the joint angle is not within the safety threshold range, giving a warning prompt.
Further, the step of receiving a video segment transmitted by the user terminal, extracting frame image features and outputting an action category sequence number comprises the following steps:
extracting a series of successive frame images from the video at regular time intervals;
preprocessing the extracted frame images to ensure consistent and valid input to the model;
processing a fixed frame sequence with a 3D convolution layer, where the 3D convolution kernel slides along the three dimensions of time, width, and height so as to extract temporal and spatial features simultaneously; flattening the four-dimensional features produced by the 3D convolution module into a three-dimensional feature representation; and constructing a backbone network based on a dual-stream Transformer architecture, in which each layer uses two parallel Transformer encoder modules to extract temporal and spatial features respectively and fuses the two kinds of features by addition to obtain a more comprehensive feature representation, wherein the four-dimensional features comprise frame number, width, height, and feature dimension, and the three-dimensional features comprise time sequence, space, and feature dimension;
flattening the fused three-dimensional features into a one-dimensional feature vector and inputting it into a fully connected layer, which maps the vector into the action category space with the number of output nodes equal to the number of action categories; converting the output into a probability distribution through a softmax activation function; and, according to the output probability distribution, selecting the category with the highest probability as the prediction result and outputting its action category sequence number.
Further, the step of outputting the calling model vector according to the frame image features and the action category sequence number comprises the following steps:
passing the frame image features and the action category sequence number output in the previous step as inputs to a decision model;
the decision model determines, based on the video quality, the frame image features, and the action category sequence number output in the previous step, which body part models at which granularity levels to call, and outputs the final model call result expressed as a vector.
Further, before the step of receiving a video segment transmitted by the user terminal, extracting frame image features, and outputting an action category sequence number, a data set is prepared: the corresponding body part models are determined and called according to the action category sequence number, the extracted video feature data are passed into the models at each granularity level of those body parts in the model pool, and each model outputs the total number N of key point estimates and the average error ME, where ME is calculated by the following formula:

$$\mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\left(x_i - x_i'\right)^2 + \left(y_i - y_i'\right)^2}$$

where ME represents the average error, the key point positions output by the model are $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, the correctly annotated key point positions in the original picture are $(x_1', y_1'), (x_2', y_2'), \ldots, (x_N', y_N')$, and N represents the total number of key point estimates.
Further, in the step in which the decision model determines, based on the video quality, the frame image features, and the action category sequence number output in the previous step, which body part models at which granularity levels to call, and outputs the final model call result expressed as a vector: if the foot key point estimation model at granularity level 3 and the torso key point estimation model at granularity level 3 are called, the final output vector is [3, 3, 0, 0].
Further, the step of outputting coordinates of key points of the human body posture according to the body part granularity level model includes:
After the final output vector [3, 3, 0, 0] is obtained, it is passed into the pose estimation model pool, indicating that the foot key point estimation model at granularity level 3 and the torso key point estimation model at granularity level 3 need to be called, and the two-dimensional coordinates of the human body posture key points in each frame of the video are obtained.
Further, the joint angles include an elbow joint angle, and in the step of calculating the movement speed and the joint angle according to the human body posture key point coordinates and the action category sequence number, the elbow joint angle is calculated by the following formula:

$$\angle ABC = \arccos\frac{\vec{BA} \cdot \vec{BC}}{\left|\vec{BA}\right|\left|\vec{BC}\right|}$$

where $\angle ABC$ represents the elbow joint angle, $\vec{BA}$ represents the vector from the elbow to the shoulder, and $\vec{BC}$ represents the vector from the elbow to the wrist:

$$\vec{BA} = (x_1 - x_2,\ y_1 - y_2), \qquad \vec{BC} = (x_3 - x_2,\ y_3 - y_2)$$

where the coordinates of the three points of $\angle ABC$ are $A(x_1, y_1)$, $B(x_2, y_2)$, and $C(x_3, y_3)$ respectively.
Further, in the step of calculating the movement speed and the joint angle according to the human body posture key point coordinates and the action category sequence number, the movement speed is calculated by the following formula:

$$v = \frac{s}{t}, \qquad t = \frac{1}{n}, \qquad s = \sqrt{\left(x_1' - x_1\right)^2 + \left(y_1' - y_1\right)^2}$$

where v represents the movement speed, s represents the distance from the previous-frame key point coordinates $A(x_1, y_1)$ to the current-frame key point coordinates $A'(x_1', y_1')$, t represents the interval time between consecutive frames, and n represents the number of frames per second.
Further, the step of determining whether the movement speed or the joint angle is within the safety threshold range further includes:
if both the movement speed and the joint angle are recognized to be within the safety threshold range, the process ends.
The invention also provides a system for automatically evaluating the safety of a strength training action gesture, which applies the above method for automatically evaluating the safety of a strength training action gesture and comprises:
the first output module is used for receiving a video segment transmitted by the user terminal, extracting frame image features, and outputting an action category sequence number;
the second output module is used for outputting a calling model vector according to the frame image features and the action category sequence number;
the calling module is used for calling the corresponding body part granularity level model according to the calling model vector;
the third output module is used for outputting human body posture key point coordinates according to the body part granularity level model;
the calculation module is used for calculating the movement speed and the joint angle according to the human body posture key point coordinates and the action category sequence number;
the judging module is used for judging whether the movement speed or the joint angle is within a safety threshold range;
and the identification module is used for giving a warning prompt if the movement speed or the joint angle is not within the safety threshold range.
The beneficial effects obtained by the invention are as follows:
The invention provides a method and a system for automatically evaluating the safety of a strength training action gesture. A video segment transmitted by a user terminal is received, frame image features are extracted, and an action category sequence number is output; a calling model vector is output according to the frame image features and the action category sequence number; the corresponding body part granularity level model is called according to the calling model vector; human body posture key point coordinates are output according to the body part granularity level model; the movement speed and the joint angle are calculated according to the human body posture key point coordinates and the action category sequence number; whether the movement speed or the joint angle is within a safety threshold range is judged; and a warning prompt is given if the movement speed or the joint angle is not within the safety threshold range. The method and the system for automatically evaluating the safety of a strength training action gesture provided by the invention have the following beneficial effects:
1. Automatic assessment of action gesture accuracy without human observation
The technology for automatically evaluating the safety of strength training actions reduces the subjectivity and inaccuracy of human observation, thereby improving the safety and effectiveness of training. By capturing actions through a camera and combining them with a pose estimation model, movement information such as joint angles and movement speed can be detected automatically, and small but critical posture deviations, such as the position of the knees in a squat, can be captured accurately. This not only reduces observation errors caused by fatigue or inattention, but also provides instant feedback for exercisers, helping them adjust their actions in time and avoid sports injuries.
2. Support multitasking, improve model selection efficiency
The decision model can invoke recognition tasks for a plurality of body parts, which significantly improves the efficiency of model selection. By integrating pose estimation models for different body parts and weighing factors such as accuracy, computational cost, and granularity, a flexible task selection mechanism is designed: the decision model can rapidly select and call pose estimation tasks for several different body parts, or a single task, from the pose estimation model pool, so the user can obtain pose recognition for various body parts without manually selecting a specific model. This task selection mechanism not only improves model selection efficiency but also meets diversified application requirements.
3. Autonomous selection of estimated granularity based on video quality
The decision model can autonomously select an appropriate estimation granularity based on the quality of the input video. For high-quality video, the model can adopt a finer granularity for detailed pose estimation, ensuring accurate and detail-rich results. For lower-quality video, the model selects a coarser granularity to balance computational resource consumption against the credibility of the estimation result. This ability to adjust granularity dynamically allows the model to perform well across video qualities and provide the best available estimation results.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1 and 2, a first embodiment of the present invention proposes a method for automatically evaluating the safety of a strength training action gesture, comprising the following steps:
step S100, a video segment transmitted by a user terminal is received, frame image characteristics are extracted, and action category serial numbers are output.
The video is input into an action category model, which extracts video features and outputs an action category sequence number; for example, an output of "1" indicates that the action category is push-up, an output of "2" indicates that the action category is plank, and so on.
Step S200, outputting a calling model vector according to the frame image features and the action category sequence number.
The frame image features and the action category sequence number output in the previous step are passed as inputs to the decision model. There are four body part key point estimation models: a hand key point estimation model, a foot key point estimation model, a face key point estimation model, and a torso key point estimation model.
Video quality affects the number of key points that each body part model can identify accurately. By consulting the relevant literature and data, the recognition granularity of each body part model is divided into several levels; the lower the level, the coarser the granularity and the fewer key points that can be identified accurately (level 0 indicates that the body part is not present in the video), as shown in Table 1.
TABLE 1
As shown in Table 1, each body part model has data sets at different granularity levels: 3 foot key point estimation models, 4 torso key point estimation models, 5 hand key point estimation models, and 6 face key point estimation models.
A plurality of image data sets with human body posture annotations are collected; the data sets comprise human body images at different postures and angles, and the annotated positions differ across the data sets corresponding to the different body part models.
Taking the torso key point estimation model as an example, its 4 granularity-level data sets are annotated with 4, 8, 12, and 17 torso key points respectively. The model pool is built by using HRNet (High-Resolution Network), a state-of-the-art CNN-based top-down human pose estimation algorithm, as the backbone network, and constructing each key point estimation model by swapping in a different prediction head. Taking the torso key point estimation model at granularity level 4 as an example, the features extracted by the HRNet backbone are input into the corresponding prediction head, which outputs the corresponding 17 key point coordinates.
The models are trained with the collected annotated data, with each data set divided into a training set, a validation set, and a test set in an 8:1:1 ratio. During training, the model parameters are continuously optimized so that the positions of human body key points can be predicted accurately. The model is evaluated on the validation set to tune parameters and guard against over-fitting and under-fitting. After training, the model is finally evaluated on the test set to ensure that it performs well on unseen data.
The other body part models are handled similarly, so several granularity level models are finally obtained for each body part, 18 models in total (3 foot key point estimation models, 4 torso key point estimation models, 5 hand key point estimation models, and 6 face key point estimation models). All the models are assembled together to form the pose estimation model pool.
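As a rough illustration of this shared-backbone, swappable-head construction, the following PyTorch sketch assembles an 18-entry model pool. It is only a minimal sketch: the backbone is a stand-in for HRNet, and all key point counts except the torso's 4/8/12/17 are hypothetical placeholders.

```python
import torch.nn as nn

class PoseModel(nn.Module):
    """One model-pool entry: backbone features + a granularity-specific prediction head."""
    def __init__(self, backbone: nn.Module, num_keypoints: int, feat_channels: int = 32):
        super().__init__()
        self.backbone = backbone
        # The head predicts one heatmap per key point from the backbone features.
        self.head = nn.Conv2d(feat_channels, num_keypoints, kernel_size=1)

    def forward(self, img):
        return self.head(self.backbone(img))   # (B, num_keypoints, H, W)

def make_backbone(feat_channels: int = 32) -> nn.Module:
    # Stand-in feature extractor; a real system would plug HRNet features in here.
    return nn.Sequential(nn.Conv2d(3, feat_channels, 3, padding=1), nn.ReLU())

# Key point counts per granularity level; only the torso counts come from the text.
GRANULARITY_KEYPOINTS = {
    "foot":  [5, 10, 15],                      # levels 1-3 (hypothetical counts)
    "torso": [4, 8, 12, 17],                   # levels 1-4 (from the data sets above)
    "hand":  [4, 8, 12, 16, 21],               # levels 1-5 (hypothetical counts)
    "face":  [5, 11, 21, 35, 51, 68],          # levels 1-6 (hypothetical counts)
}

# The pose estimation model pool: 3 + 4 + 5 + 6 = 18 models.
model_pool = {
    (part, level): PoseModel(make_backbone(), k)
    for part, counts in GRANULARITY_KEYPOINTS.items()
    for level, k in enumerate(counts, start=1)
}
assert len(model_pool) == 18
```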
The models are vectorized using one-hot-style coding: each call is represented by a vector whose length equals the number of body part models, each position in the vector corresponds to one body part model (the first position represents the foot key point estimation model, the second the torso key point estimation model, the third the hand key point estimation model, and the fourth the face key point estimation model), and the number at each position represents the granularity level of the corresponding model, for example:
(1) the foot key point estimation model at granularity level 1 is represented in vector form as [1, 0, 0, 0];
(2) the torso key point estimation model at granularity level 2 is represented in vector form as [0, 2, 0, 0];
(3) the hand key point estimation model at granularity level 3 is represented in vector form as [0, 0, 3, 0];
(4) the face key point estimation model at granularity level 4 is represented in vector form as [0, 0, 0, 4];
(5) calling the hand key point estimation model at granularity level 5 and the face key point estimation model at granularity level 6 simultaneously is represented in vector form as [0, 0, 5, 6];
and so on according to the rules above.
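A minimal sketch of this encoding, assuming the fixed position order foot, torso, hand, face described above:

```python
PARTS = ["foot", "torso", "hand", "face"]   # fixed position order of the call vector

def encode_call_vector(calls: dict) -> list:
    """Map {body part: granularity level} to the 4-element call vector (0 = not called)."""
    return [calls.get(part, 0) for part in PARTS]

def decode_call_vector(vec: list) -> dict:
    """Recover which models to call, and at which granularity level, from a call vector."""
    return {part: level for part, level in zip(PARTS, vec) if level > 0}

print(encode_call_vector({"hand": 5, "face": 6}))   # [0, 0, 5, 6]
print(decode_call_vector([3, 3, 0, 0]))             # {'foot': 3, 'torso': 3}
```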
It can be seen that each sequence number has corresponding body part key point estimation models to be called; for example, sequence number "2" (plank) calls the foot key point estimation model and the torso key point estimation model. After the decision model obtains this information from the action category sequence number, the video quality determines which granularity level of each body part model is called, and the final model call result is output as a vector; for example, if the foot key point estimation model at granularity level 3 and the torso key point estimation model at granularity level 3 are called, the final output vector is [3, 3, 0, 0].
Step S300, calling a corresponding body part granularity level model according to the calling model vector.
After the final output vector [3, 3, 0, 0] is obtained, the vector is passed into the pose estimation model pool, indicating that the foot key point estimation model at granularity level 3 and the torso key point estimation model at granularity level 3 need to be called.
Step S400, outputting human body posture key point coordinates according to the body part granularity level model.
According to the body part granularity level model, the two-dimensional coordinates of the human body posture key points of each frame in the video can be obtained.
Step S500, calculating the movement speed and the joint angle according to the human body posture key point coordinates and the action category sequence number.
The action category sequence number indicates which action is being performed, and each action has different parts of interest. To judge whether the action is performed accurately, the joint angles and movement speeds of the parts of interest must be obtained, and these can be calculated from the posture key point coordinates.
Step S600, judging whether the movement speed or the joint angle is within a safety threshold range.
By consulting the relevant exercise literature and data, a safety threshold range is set for each joint angle and movement speed to be detected for each action. When a joint angle or movement speed is not within its safety threshold range, the action is considered to carry an injury risk at that moment, and a warning reminder is triggered.
And step S700, if the movement speed or the joint angle is not in the safety threshold range, a prompt warning is given.
When the calculated movement speed or joint angle is not within the preset safety threshold range, a warning prompt is given.
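A minimal sketch of the threshold check; the per-action safety ranges here are placeholders, since the real values come from the exercise literature referenced above:

```python
# Hypothetical safety thresholds, keyed by action category sequence number.
SAFETY_THRESHOLDS = {
    1: {"elbow_angle_deg": (70.0, 170.0), "speed": (0.0, 1.5)},   # e.g. push-up
}

def check_safety(action_id: int, metrics: dict) -> list:
    """Return a warning message for every metric outside its safety threshold range."""
    warnings = []
    for name, value in metrics.items():
        lo, hi = SAFETY_THRESHOLDS[action_id][name]
        if not lo <= value <= hi:
            warnings.append(f"{name}={value:.1f} is outside the safe range [{lo}, {hi}]")
    return warnings

alerts = check_safety(1, {"elbow_angle_deg": 45.0, "speed": 0.8})
print(alerts or "all metrics within the safety threshold range")   # warns on the elbow angle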
Further, please refer to fig. 1 and fig. 2; in the method for automatically evaluating the safety of the strength training action gesture according to the present embodiment, step S100 includes:
step S110, extracting a series of continuous frame images from the video at fixed time intervals.
Step S120, preprocessing the extracted frame images to ensure consistent and valid input to the model.
Preprocessing the extracted frame images includes resizing, normalization, cropping, and the like, so as to meet the input requirements of the model.
Step S130, processing a fixed frame sequence with a 3D convolution layer, where the 3D convolution kernel slides along the three dimensions of time, width, and height so as to extract temporal and spatial features simultaneously; flattening the four-dimensional features produced by the 3D convolution module into a three-dimensional feature representation; and constructing a backbone network based on a dual-stream Transformer architecture, in which each layer uses two parallel Transformer encoder modules to extract temporal and spatial features respectively and fuses the two kinds of features by addition to obtain a more comprehensive feature representation. The four-dimensional features comprise frame number, width, height, and feature dimension, and the three-dimensional features comprise time sequence, space, and feature dimension.
The Transformer architecture, proposed by Google in the 2017 paper "Attention Is All You Need", uses a Self-Attention structure in place of the RNN (Recurrent Neural Network) structure commonly used in NLP tasks. Its greatest advantage over an RNN structure is that it can be computed in parallel.
Step S140, flattening the fused three-dimensional features into a one-dimensional feature vector and inputting it into a fully connected layer, which maps the vector into the action category space with the number of output nodes equal to the number of action categories; converting the output into a probability distribution through a softmax activation function; and, according to the output probability distribution, selecting the category with the highest probability as the prediction result and outputting its action category sequence number.
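The following PyTorch sketch illustrates steps S130 and S140 under stated assumptions: a single 3D convolution stands in for the 3D-CNN module, two stacked dual-stream layers stand in for the full backbone, and all dimensions are illustrative. It is a minimal sketch of the described architecture, not the embodiment's exact network.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """One backbone layer: parallel temporal and spatial Transformer encoders, fused by addition."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, T, S, C)
        b, t, s, c = x.shape
        xt = self.temporal(x.permute(0, 2, 1, 3).reshape(b * s, t, c))  # attend over time
        xt = xt.reshape(b, s, t, c).permute(0, 2, 1, 3)
        xs = self.spatial(x.reshape(b * t, s, c))                       # attend over space
        xs = xs.reshape(b, t, s, c)
        return xt + xs                                      # additive fusion of the two streams

class ActionClassifier(nn.Module):
    def __init__(self, num_classes: int, dim: int = 64, layers: int = 2):
        super().__init__()
        self.conv3d = nn.Conv3d(3, dim, kernel_size=3, stride=2, padding=1)  # joint spatio-temporal features
        self.blocks = nn.ModuleList(DualStreamBlock(dim) for _ in range(layers))
        self.head = nn.LazyLinear(num_classes)              # fully connected layer to class space

    def forward(self, video):                               # video: (B, 3, T, H, W)
        f = self.conv3d(video)                              # 4-D features: (B, C, T', H', W')
        b, c, t, h, w = f.shape
        f = f.permute(0, 2, 3, 4, 1).reshape(b, t, h * w, c)  # flatten to 3-D: (B, time, space, C)
        for blk in self.blocks:
            f = blk(f)
        return self.head(f.flatten(1)).softmax(dim=-1)      # flatten to 1-D, FC, softmax

probs = ActionClassifier(num_classes=10)(torch.randn(2, 3, 8, 64, 64))
print(probs.argmax(dim=-1))                                 # predicted action category sequence numbers
```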
Preferably, referring to fig. 1 and 2, in the method for automatically evaluating the safety of the strength training action gesture according to the present embodiment, step S200 includes:
Step S210, passing the frame image features and the action category sequence number output in the previous step as inputs to the decision model, where each action category sequence number has corresponding body part models to be called.
The frame image features and the action category sequence number output in the previous step are passed as inputs to the decision model. Each sequence number has corresponding body part models to be called; for example, sequence number "2" (plank) calls the foot key point estimation model and the torso key point estimation model.
Step S220, the decision model determines and calls the body part models at the corresponding granularity levels according to the video quality, the frame image features, and the action category sequence number output in the previous step, and outputs the final model call result expressed as a vector.
After the decision model obtains this information from the action category sequence number, the video quality determines which granularity level of each body part model is called, and the final model call result is output as a vector; for example, if the foot key point estimation model at granularity level 3 and the torso key point estimation model at granularity level 3 are called, the final output vector is [3, 3, 0, 0].
Further, please refer to fig. 1 and 2; in the method for automatically evaluating the safety of the strength training action gesture according to the present embodiment, before step S100 the method further includes:
Step S100A, data set preparation
The input data of the decision model are a video and its action category sequence number, and the output is a model pool selection result expressed in vector form; for example, if the hand model at granularity level 3 is called, the output is [0, 0, 3, 0], and if the hand model at granularity level 5 and the face model at granularity level 6 are called, the output is [0, 0, 5, 6]. The input data and the corresponding output vectors serve as the data set for subsequently training and validating the decision model, and the data set is prepared as follows.
The corresponding body part models are determined and called according to the action category sequence number. The extracted video feature data are passed into the models at each granularity level of those body parts in the model pool, and each model outputs the total number N of key point estimates and the average error ME, where ME is calculated by the following formula:

$$\mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\left(x_i - x_i'\right)^2 + \left(y_i - y_i'\right)^2} \tag{1}$$

In formula (1), ME represents the average error, the key point positions output by the model are $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, and the correctly annotated key point positions in the original picture are $(x_1', y_1'), (x_2', y_2'), \ldots, (x_N', y_N')$; N represents the total number of key point estimates. The average error ME is measured in mm.
The total number N of key points reflects the granularity of a pose estimation model: within the same body part model, a larger N corresponds to a higher granularity level, i.e., a finer granularity, but also to greater computational resource consumption. The average error ME reflects the accuracy of the model's key point estimates: the lower the ME, the more accurate the position estimate.
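A minimal sketch of formula (1), assuming ME is the mean Euclidean distance between predicted and annotated key points (consistent with the mm unit above):

```python
import numpy as np

def mean_error(pred: np.ndarray, gt: np.ndarray):
    """Formula (1): returns (N, ME) for predicted and ground-truth key points of shape (N, 2)."""
    n = len(pred)
    me = float(np.mean(np.linalg.norm(pred - gt, axis=1)))  # mean Euclidean distance
    return n, me

pred = np.array([[10.0, 20.0], [30.0, 42.0]])   # model outputs (x_i, y_i)
gt = np.array([[13.0, 24.0], [30.0, 40.0]])     # annotations (x_i', y_i')
print(mean_error(pred, gt))                     # (2, 3.5): distances 5.0 and 2.0 average to 3.5
```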
According to a survey of current human pose estimation models, the allowable range of average error differs between models of different granularity levels. The allowable average error ranges for the different granularity levels of each body part model were obtained through repeated literature review, calculation, and verification, as shown in Table 2.
TABLE 2
Table 2 is described below:
(1) For each body part model, when several granularity level models fall within the corresponding allowable average error range, the model with the highest granularity level is selected by default;
(2) For each body part model, a granularity level model that exceeds the set average error range is not selected; if no granularity level qualifies, the granularity level is defined as 0.
The collection of the data set (i.e., the input data and their corresponding output vectors) according to the rules of Table 2 proceeds as follows.
According to its action category sequence number, each video is input into the body part models to be selected from the model pool, and each granularity level of each body part model outputs a pair of N and ME values. By looking these up in Table 2, the corresponding vector form can be output, as shown in Table 3.
For example:
TABLE 3
The final result of Table 3 is [3, 1, 4, 0], so the output vector corresponding to this video is [3, 1, 4, 0]; the selection rule is sketched in code below.
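The selection rule of Table 2 can be sketched as follows; the allowable ME ranges here are hypothetical stand-ins for Table 2's actual values:

```python
# Hypothetical allowable ME ranges (mm) per granularity level; index 0 is level 1.
ALLOWED_ME = {
    "foot":  [(0, 12), (0, 9), (0, 6)],
    "torso": [(0, 14), (0, 11), (0, 8), (0, 5)],
    "hand":  [(0, 10), (0, 8), (0, 6), (0, 4), (0, 3)],
    "face":  [(0, 8), (0, 6), (0, 5), (0, 4), (0, 3), (0, 2)],
}

def select_level(part: str, me_per_level: list) -> int:
    """Highest granularity level whose ME is within its allowable range, else 0."""
    best = 0
    for level, (me, (lo, hi)) in enumerate(zip(me_per_level, ALLOWED_ME[part]), start=1):
        if lo <= me <= hi:
            best = level            # keep the highest qualifying level (rule (1))
    return best                     # 0 if no level qualifies (rule (2))

# Example per-level ME outputs for one video, chosen to reproduce [3, 1, 4, 0].
me_outputs = {
    "foot":  [5.0, 4.2, 3.1],
    "torso": [9.0, 12.0, 13.0, 14.0],
    "hand":  [6.0, 5.0, 4.0, 3.5, 9.9],
    "face":  [20.0] * 6,
}
print([select_level(p, me_outputs[p]) for p in ("foot", "torso", "hand", "face")])  # [3, 1, 4, 0]
```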
Through the above data preparation steps, multiple video segments, their action category sequence numbers, and the corresponding output vectors are finally obtained and used as the data set for the model. The data set is divided into a training set, a validation set, and a test set in an 8:1:1 ratio.
Step S100B, constructing a decision model
1. Model network structure
The model's inputs are an action category sequence number and a video frame sequence, and its output is a model selection vector. Image features are first extracted from the video sequence, reusing the network structure of the action category model: a 3D-CNN followed by a dual-stream Transformer. After the frame sequence features are obtained through the 3D-CNN, the action category sequence number is spliced into the feature vector, the spliced feature vector is processed by the dual-stream Transformer, and finally a fully connected layer maps the features into a 1×4 vector that is output as the model's result.
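A minimal PyTorch sketch of this decision network, with a single Transformer encoder layer standing in for the dual-stream stack and all dimensions illustrative:

```python
import torch
import torch.nn as nn

class DecisionModel(nn.Module):
    """Sketch: 3D-CNN frame features, spliced with the action category number, mapped to a 1x4 vector."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.conv3d = nn.Conv3d(3, feat_dim, kernel_size=3, stride=2, padding=1)
        # One encoder layer stands in for the dual-stream Transformer described above.
        self.encoder = nn.TransformerEncoderLayer(feat_dim + 1, nhead=5, batch_first=True)
        self.fc = nn.LazyLinear(4)                         # maps features to the 1x4 selection vector

    def forward(self, video, class_id):                    # video (B,3,T,H,W), class_id (B,)
        f = self.conv3d(video).flatten(2).transpose(1, 2)  # (B, tokens, C)
        cid = class_id.float().view(-1, 1, 1).expand(-1, f.size(1), 1)
        f = self.encoder(torch.cat([f, cid], dim=-1))      # splice the class number into the features
        return self.fc(f.flatten(1))                       # (B, 4)

vec = DecisionModel()(torch.randn(2, 3, 8, 32, 32), torch.tensor([2, 5]))
print(vec.shape)                                           # torch.Size([2, 4])
```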
2. Model training
The loss function and optimizer are defined before starting to train the model. For the regression task, the Mean Square Error (MSE) may be selected as the loss function and an Adam optimizer may be used to accelerate the training process. The training set data is input into the model for training, and model parameters are optimized through multiple iterations, so that the performance of the model is gradually improved.
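A minimal training-loop sketch continuing the DecisionModel sketch above, with a synthetic batch standing in for the prepared data set:

```python
import torch
import torch.nn as nn

# Synthetic stand-in batch; in practice this comes from the 8:1:1 training split.
video = torch.randn(4, 3, 8, 32, 32)
class_id = torch.randint(1, 10, (4,))
target_vec = torch.tensor([[3, 1, 4, 0]] * 4, dtype=torch.float32)

model = DecisionModel()                       # class from the sketch above
model(video, class_id)                        # dummy pass to materialize the lazy layer
loss_fn = nn.MSELoss()                        # MSE as the regression loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):                       # iterate to optimize the model parameters
    optimizer.zero_grad()
    loss = loss_fn(model(video, class_id), target_vec)
    loss.backward()
    optimizer.step()
```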
3. Model evaluation and tuning
During training, the validation set is used to evaluate model performance, and hyper-parameters such as the learning rate and batch size are tuned according to validation feedback to ensure that the model learns effectively without over-fitting. After training is complete, the final model is further evaluated on the test set to confirm that it performs well on unseen data and avoids both over-fitting and under-fitting.
4. Model deployment
When model training and evaluation are completed, the model's weights and structure are saved for subsequent use. The trained model is deployed into a production environment, where it can run inference on new data and efficiently and accurately call the body part models in the pose estimation model pool.
Further, please refer to fig. 1 and 2; in the method for automatically evaluating the safety of the strength training action gesture according to the present embodiment, the joint angles include an elbow joint angle, and in step S500 the elbow joint angle is calculated by the following formula:

$$\angle ABC = \arccos\frac{\vec{BA} \cdot \vec{BC}}{\left|\vec{BA}\right|\left|\vec{BC}\right|} \tag{2}$$

In formula (2), $\angle ABC$ represents the elbow joint angle, $\vec{BA}$ represents the vector from the elbow to the shoulder, and $\vec{BC}$ represents the vector from the elbow to the wrist:

$$\vec{BA} = (x_1 - x_2,\ y_1 - y_2) \tag{3}$$

$$\vec{BC} = (x_3 - x_2,\ y_3 - y_2) \tag{4}$$

In formulas (3) and (4), the coordinates of the three points of $\angle ABC$ are $A(x_1, y_1)$, $B(x_2, y_2)$, and $C(x_3, y_3)$ respectively.
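A minimal sketch of formulas (2)-(4), with A the shoulder, B the elbow, and C the wrist:

```python
import math

def elbow_angle(A, B, C):
    """Angle ABC in degrees at the elbow B, given shoulder A and wrist C as (x, y) pairs."""
    ba = (A[0] - B[0], A[1] - B[1])            # formula (3): vector from elbow to shoulder
    bc = (C[0] - B[0], C[1] - B[1])            # formula (4): vector from elbow to wrist
    cos = (ba[0] * bc[0] + ba[1] * bc[1]) / (math.hypot(*ba) * math.hypot(*bc))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))  # formula (2), clamped for float safety

# Shoulder directly above the elbow, wrist straight in front of it: a 90-degree bend.
print(elbow_angle(A=(0.0, 1.0), B=(0.0, 0.0), C=(1.0, 0.0)))   # 90.0
```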
Preferably, please refer to fig. 1 and 2; in the method for automatically evaluating the safety of the strength training action gesture in the present embodiment, in step S500, the parts of interest differ from action to action, so the parts for which the movement speed must be calculated also differ. As shown in fig. 2, taking push-ups as an example, the focus is on the movement speed of the key points A, B, and C, such as the point A on the shoulder.
Taking the movement speed of point A as an example, if the video has n frames per second, the interval time between consecutive frames is $t = \frac{1}{n}$, and the distance s from the previous-frame key point coordinates $A(x_1, y_1)$ to the current-frame key point coordinates $A'(x_1', y_1')$ is $s = \sqrt{(x_1' - x_1)^2 + (y_1' - y_1)^2}$. The movement speed is calculated by the following formula:

$$v = \frac{s}{t} = n \cdot s \tag{5}$$

In formula (5), v represents the movement speed, s represents the distance from the previous-frame key point coordinates $A(x_1, y_1)$ to the current-frame key point coordinates $A'(x_1', y_1')$, t represents the interval time between consecutive frames, and n represents the number of frames per second.
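A minimal sketch of formula (5); the coordinates are assumed to be in pixels, so the speed comes out in pixels per second:

```python
import math

def keypoint_speed(prev_xy, curr_xy, n: int) -> float:
    """Formula (5): v = s / t with t = 1/n (n frames per second)."""
    s = math.hypot(curr_xy[0] - prev_xy[0], curr_xy[1] - prev_xy[1])  # displacement s
    t = 1.0 / n                                # interval between consecutive frames
    return s / t

# Key point A moves 3 px right and 4 px up between frames of a 30 fps video.
print(keypoint_speed((100.0, 200.0), (103.0, 204.0), n=30))   # 150.0
```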
Further, please refer to fig. 1 and fig. 2; in the method for automatically evaluating the safety of the strength training action gesture according to the present embodiment, step S600 further includes:
Step S800, if the movement speed and the joint angle are both recognized to be within the safety threshold range, the process ends.
When the calculated movement speed and joint angle are both recognized to be within the preset safety threshold range, the whole flow ends.
The embodiment also provides a system for automatically evaluating the safety of a strength training action gesture, which applies the above method for automatically evaluating the safety of a strength training action gesture and comprises a first output module, a second output module, a calling module, a third output module, a calculation module, a judging module, and an identification module. The first output module is used for receiving a video segment transmitted by the user terminal, extracting frame image features, and outputting an action category sequence number; the second output module is used for outputting a calling model vector according to the frame image features and the action category sequence number; the calling module is used for calling the corresponding body part granularity level model according to the calling model vector; the third output module is used for outputting human body posture key point coordinates according to the body part granularity level model; the calculation module is used for calculating the movement speed and the joint angle according to the human body posture key point coordinates and the action category sequence number; the judging module is used for judging whether the movement speed or the joint angle is within the safety threshold range; and the identification module is used for giving a warning prompt if the movement speed or the joint angle is not within the safety threshold range.
The embodiment provides a method and a system for automatically evaluating the safety of a strength training action gesture. A video segment transmitted by a user terminal is received, frame image features are extracted, and an action category sequence number is output; a calling model vector is output according to the frame image features and the action category sequence number; the corresponding body part granularity level model is called according to the calling model vector; human body posture key point coordinates are output according to the body part granularity level model; the movement speed and the joint angle are calculated according to the human body posture key point coordinates and the action category sequence number; whether the movement speed or the joint angle is within a safety threshold range is judged; and a warning prompt is given if the movement speed or the joint angle is not within the safety threshold range. The method and system for automatically evaluating the safety of a strength training action gesture provided by this embodiment have the following beneficial effects:
1. Automatic assessment of action gesture accuracy without human observation
The technology for automatically evaluating the safety of strength training actions reduces the subjectivity and inaccuracy of human observation, thereby improving the safety and effectiveness of training. By capturing actions through a camera and combining them with a pose estimation model, movement information such as joint angles and movement speed can be detected automatically, and small but critical posture deviations, such as the position of the knees in a squat, can be captured accurately. This not only reduces observation errors caused by fatigue or inattention, but also provides instant feedback for exercisers, helping them adjust their actions in time and avoid sports injuries.
2. Support multitasking, improve model selection efficiency
The decision model can invoke recognition tasks for a plurality of body parts, which significantly improves the efficiency of model selection. By integrating pose estimation models for different body parts and weighing factors such as accuracy, computational cost, and granularity, a flexible task selection mechanism is designed: the decision model can rapidly select and call pose estimation tasks for several different body parts, or a single task, from the pose estimation model pool, so the user can obtain pose recognition for various body parts without manually selecting a specific model. This task selection mechanism not only improves model selection efficiency but also meets diversified application requirements.
3. Autonomous selection of estimated granularity based on video quality
The decision model can autonomously select an appropriate estimation granularity based on the quality of the input video. For high-quality video, the model can adopt a finer granularity for detailed pose estimation, ensuring accurate and detail-rich results. For lower-quality video, the model selects a coarser granularity to balance computational resource consumption against the credibility of the estimation result. This ability to adjust granularity dynamically allows the model to perform well across video qualities and provide the best available estimation results.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.