CN119964191A - A gesture recognition method and its device, equipment and storage medium - Google Patents
- Publication number
- CN119964191A (application CN202311436504.3A)
- Authority
- CN
- China
- Prior art keywords
- features
- image
- sample
- posture
- key point
- Prior art date: 2023-10-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
Abstract
The present application discloses a gesture recognition method and its device, equipment, and storage medium. The gesture recognition method comprises: obtaining an image to be recognized containing a target; extracting features of the image to be recognized using a gesture recognition model to obtain image features and key point features; and performing recognition based on the image features and key point features to obtain a gesture recognition result. The above scheme can realize accurate gesture recognition.
Description
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a gesture recognition method, a device, an apparatus, and a storage medium thereof.
Background
In existing image recognition technology, when the gesture of a person in an image is to be acquired, a deep learning scheme is generally used to extract modal features of the person's gesture, and the gesture is recognized and classified based on these modal features; however, the final recognition result is often inaccurate. For example, in a scheme for identifying an abnormal sitting posture of a child, posture classification is generally performed using only the key point information obtained by 2D or 3D human body posture estimation. Because this scheme recognizes the child's sitting posture from the key point modality alone, the obtained recognition result is often inaccurate, and problems such as limited effective information and insufficient robustness arise.
Disclosure of Invention
The application aims to provide at least a gesture recognition method, together with a corresponding device, equipment and storage medium, which can realize accurate recognition of gestures.
A first aspect of the application provides a gesture recognition method, which comprises: obtaining an image to be recognized containing a target; extracting features of the image to be recognized by using a gesture recognition model to obtain image features and key point features; and performing recognition based on the image features and the key point features to obtain a gesture recognition result.
The gesture recognition model comprises a network feature extraction module, an image feature extraction module and a key point feature extraction module. Extracting features of the image to be recognized with the gesture recognition model to obtain the image features and key point features comprises: performing feature extraction on the image to be recognized with the network feature extraction module to obtain initial features; performing convolution processing on the initial features with the image feature extraction module to obtain the image features; and extracting key point information from the initial features with the key point feature extraction module to obtain the key point features.
The gesture recognition model comprises a fusion module and a recognition module. Performing recognition based on the image features and key point features to obtain the gesture recognition result comprises: inputting the image features and key point features into the fusion module for feature fusion to obtain gesture features; and inputting the gesture features into the recognition module for gesture recognition to determine the gesture recognition result corresponding to the image to be recognized.
Inputting the image features and key point features into the fusion module for feature fusion to obtain gesture features comprises performing weighted fusion of the image features and key point features with the fusion module to obtain the gesture features. And/or, inputting the gesture features into the recognition module for gesture recognition to determine the gesture recognition result corresponding to the image to be recognized comprises using the recognition module to match the gesture features against preset gesture features in a database, determining the gesture category to which the gesture features belong, and thereby determining the gesture recognition result corresponding to the gesture features.
The key point features are the position information of the key points of the target in the image to be identified, and/or the gesture recognition result is the recognition result of the sitting posture of the target.
The method further comprises: obtaining a sample image; performing feature extraction on the sample image with the gesture recognition model to obtain image sample features and key point sample features, and obtaining gesture sample features based on the image sample features and key point sample features; recognizing the image sample features to obtain a first predicted gesture category; recognizing the gesture sample features to obtain a gesture recognition sample result, the gesture recognition sample result comprising a second predicted gesture category; and adjusting parameters of the gesture recognition model by using the first predicted gesture category and a reference gesture category annotated for the sample image, the key point sample features and reference key point information annotated for the sample image, and the second predicted gesture category and the reference gesture category.
Adjusting the parameters of the gesture recognition model comprises: determining a first loss value based on the difference between the first predicted gesture category and the reference gesture category; determining a second loss value based on the difference between the key point sample features and the reference key point information; determining a third loss value based on the difference between the second predicted gesture category and the reference gesture category; and adjusting the parameters of the gesture recognition model using the first loss value, the second loss value and the third loss value.
The gesture recognition model comprises a network feature extraction module, an image feature extraction module, a classification module, a key point feature extraction module, a fusion module and a recognition module. Performing feature extraction on the sample image to obtain the image sample features and key point sample features comprises: performing feature extraction on the sample image with the network feature extraction module to obtain initial sample features; performing convolution processing on the initial sample features with the image feature extraction module to obtain the image sample features; and extracting key point information from the initial sample features with the key point feature extraction module to obtain the key point sample features. Obtaining the gesture sample features based on the image sample features and key point sample features comprises fusing the image sample features and key point sample features with the fusion module to obtain the gesture sample features. Recognizing the gesture sample features to obtain the gesture recognition sample result comprises recognizing the gesture sample features with the recognition module. Recognizing the image sample features to obtain the first predicted gesture category comprises recognizing the image sample features with the classification module to determine the first predicted gesture category corresponding to the image sample features. Adjusting the parameters of the gesture recognition model using the first loss value, the second loss value and the third loss value comprises: adjusting parameters of the network feature extraction module, the classification module and the image feature extraction module using the first loss value; adjusting parameters of the network feature extraction module and the key point feature extraction module using the second loss value; and adjusting parameters of the network feature extraction module, the image feature extraction module, the key point feature extraction module, the fusion module and the recognition module using the third loss value.
Adjusting the parameters of the gesture recognition model using the first loss value, the second loss value and the third loss value comprises: fusing the first loss value, the second loss value and the third loss value to generate a fourth loss value; performing weighted fusion of the second loss value and the fourth loss value to generate a comprehensive loss value; and adjusting all parameters of the gesture recognition model using the comprehensive loss value.
A second aspect of the application provides a training method for a gesture recognition model, which comprises: obtaining a sample image; performing feature extraction on the sample image with the gesture recognition model to obtain image sample features and key point sample features, and obtaining gesture sample features based on the image sample features and key point sample features; recognizing the image sample features to obtain a first predicted gesture category; recognizing the gesture sample features to obtain a gesture recognition sample result, the gesture recognition sample result comprising a second predicted gesture category; and adjusting parameters of the gesture recognition model by using the first predicted gesture category and a reference gesture category annotated for the sample image, the key point sample features and reference key point information annotated for the sample image, and the second predicted gesture category and the reference gesture category.
A third aspect of the application provides a gesture recognition device, which comprises an acquisition module, a feature extraction module and a gesture recognition module. The acquisition module is used for acquiring an image to be recognized containing a target; the feature extraction module is used for extracting features of the image to be recognized with a gesture recognition model to obtain at least image features and key point features; and the gesture recognition module is used for performing recognition based on the image features and key point features to obtain a gesture recognition result.
A fourth aspect of the application provides a model training device, which comprises a sample acquisition module, a sample feature extraction module, an image feature recognition module, a sample gesture recognition module and an adjustment module. The sample acquisition module is used for acquiring a sample image; the sample feature extraction module is used for performing feature extraction on the sample image with a gesture recognition model to obtain image sample features and key point sample features, and for obtaining gesture sample features based on the image sample features and key point sample features; the image feature recognition module is used for recognizing the image sample features to obtain a first predicted gesture category; the sample gesture recognition module is used for recognizing the gesture sample features to obtain a gesture recognition sample result, the gesture recognition sample result comprising a second predicted gesture category; and the adjustment module is used for adjusting parameters of the gesture recognition model by using the first predicted gesture category and a reference gesture category annotated for the sample image, the key point sample features and reference key point information annotated for the sample image, and the second predicted gesture category and the reference gesture category.
A fifth aspect of the present application provides an electronic device, including a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the gesture recognition method in the first aspect or implement the training method of the gesture recognition model in the second aspect.
A sixth aspect of the present application provides a computer-readable storage medium having stored thereon program instructions which, when executed by a processor, implement the gesture recognition method in the first aspect described above, or implement the training method of the gesture recognition model in the second aspect described above.
According to the above scheme, feature extraction is performed on the image to be recognized containing the target by using the gesture recognition model, so that the image features and key point features are obtained, and the final gesture recognition result is obtained by performing gesture recognition based on both the image features and the key point features; gesture recognition is therefore realized using multi-modal features. Moreover, the scheme extracts the multi-modal features with the same model, which reduces the time spent on feature extraction and the processing resources occupied, and, because only one model needs to be trained, greatly reduces the training time and the number of model parameters to be adjusted.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic flow chart of an embodiment of a gesture recognition method of the present application;
FIG. 2 is a schematic diagram of a framework of one embodiment of a gesture recognition model of the present application;
FIG. 3 is a flow chart of an embodiment of a training method of the gesture recognition model of the present application;
FIG. 4 is a schematic diagram of a frame of an embodiment of a gesture recognition apparatus of the present application;
FIG. 5 is a schematic diagram of a frame of an embodiment of the model training apparatus of the present application;
FIG. 6 is a schematic diagram of a frame of an embodiment of an electronic device of the present application;
FIG. 7 is a schematic diagram of a frame of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean that a exists alone, while a and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship. Further, "a plurality" herein means two or more than two. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a gesture recognition method according to the present application.
Specifically, the method may include the steps of:
Step S110, obtaining an image to be identified containing the target.
The gesture recognition method can be applied to application scenarios such as gait analysis, video monitoring, augmented reality, human-computer interaction, finance, mobile payment, entertainment, games and sports science. The method obtains multiple kinds of modal information from the image to be recognized and performs target gesture recognition based on this multi-modal information, so as to monitor or classify the gesture of the target and obtain a gesture recognition result, where the gesture recognition result is a recognition result about the gesture of the target. The image to be recognized may be a 2D image, a 3D image, a thermal image, or the like, and the detected target may be a human or an animal, which is not particularly limited herein.
In some embodiments, to acquire a 2D image containing the target, the target may be captured with a camera; to acquire a 3D image containing the target, a 3D stereo camera may be used to capture the target, or structured light may be used to scan the target.
In addition, the acquired image to be identified can be obtained by shooting by using equipment or can be selected from a database.
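As a minimal illustration of these two ways of obtaining the image to be recognized, the sketch below assumes OpenCV is available; the helper name and file path are hypothetical:

```python
import cv2  # OpenCV, assumed available for image acquisition

def acquire_image(source="camera", path="sample.jpg"):
    """Obtain an image to be recognized, either by capturing a frame from a
    camera or by loading a previously stored image file (hypothetical helper)."""
    if source == "camera":
        cap = cv2.VideoCapture(0)      # open the default camera
        ok, frame = cap.read()         # grab one BGR frame
        cap.release()
        if not ok:
            raise RuntimeError("camera capture failed")
        return frame
    return cv2.imread(path)            # or select an image from disk / a database
```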
Step S120, extracting features of the image to be identified by using the gesture recognition model to obtain image features and key point features.
In the present application, the multi-modal features extracted from the image to be identified by the gesture recognition model are the key point features and the image features. Furthermore, in order to improve the gesture recognition accuracy, additional modal features such as target action features can be added on the basis of the key point features and image features. It can be understood that the application can perform gesture recognition using the two modalities of key point features and image features, and can additionally introduce further modalities on this basis, which is not particularly limited herein.
The key point features may specifically be the position information of the key points of the target in the image to be identified. For example, if the image to be identified is a 2D image, the key point features are the coordinate information of the target key points in the 2D image; if the image to be identified is a 3D image, the key point features include the depth information of the key points in addition to their coordinate information in the image. The gesture recognition result may be a recognition result concerning the sitting posture of the target. The image features are features characterizing the condition of pixels in the image to be identified, such as pixel values of a plurality of statistical regions in the image to be identified.
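Purely as an illustration of this representation, the key point features could be held as a per-keypoint coordinate array; the keypoint count below is an assumed COCO-style value, not one fixed by the application:

```python
import numpy as np

NUM_KEYPOINTS = 17  # assumption: a COCO-style skeleton with 17 joints

# 2D case: (x, y) pixel coordinates of each target key point
keypoints_2d = np.zeros((NUM_KEYPOINTS, 2), dtype=np.float32)

# 3D case: (x, y) coordinates plus per-keypoint depth information
keypoints_3d = np.zeros((NUM_KEYPOINTS, 3), dtype=np.float32)
```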
In some embodiments, the target to be identified is a human, and the gesture recognition model is used to perform feature extraction on an image to be identified containing the human to obtain image features, where the image features include pixel features of the image to be identified, identity information of the human body in the image to be identified, and the like. Then, key points of the human skeleton are detected in the image to be identified by using the gesture recognition model so as to acquire the key point features. In the process of acquiring the key point features, target detection is first performed on the image to be identified to determine the number of people in the image and related information such as whether contact or occlusion exists between them, and then the skeleton key points of each individual person are detected to obtain the key point features. The gesture recognition model used here is a deep neural network based on multi-task training. Compared with the extraction of single-modality features in the prior art, this can effectively improve the accuracy of gesture recognition.
In some embodiments, the gesture recognition model includes two branches, one of which is an image branch and the other a pose estimation branch. The image features can be obtained from the image branch, and the key point features from the pose estimation branch. Specifically, referring to fig. 2, fig. 2 is a schematic frame diagram of an embodiment of the gesture recognition model of the present application. The gesture recognition model includes a network feature extraction module 210, an image feature extraction module 220, and a key point feature extraction module 230. The image feature extraction module 220 is located in the image branch, the key point feature extraction module 230 is located in the pose estimation branch, and the steps executed by each module are described in steps S121 to S123.
Step S121, carrying out feature extraction on the image to be identified by utilizing the network feature extraction module to obtain initial features.
In some embodiments, after the image to be identified is input to the gesture recognition model, the gesture recognition model performs feature extraction on the image to be identified by using the network feature extraction module 210 to obtain initial features. The initial characteristics comprise action information, image characteristics, key point characteristics and the like of a human body in the image to be identified.
Step S122, carrying out convolution processing on the initial features by utilizing the image feature extraction module to obtain image features.
In some embodiments, the acquired initial features are input into an image branch, and the image feature extraction module 220 is utilized to convolve the initial features multiple times to obtain image features with high correlation with the human body posture from a plurality of image features in the initial features.
Step S123, extracting key point information from the initial features by using the key point feature extraction module to obtain key point features.
In some embodiments, the obtained initial features are input into the pose estimation branch, and human body pose estimation is performed on the initial features by the key point feature extraction module 230 to extract human body key point information, so as to obtain the key point features. Specifically, when the recognition target is a human body, the pose estimation branch can adopt a SimDR framework based on 1D heat maps to perform human body pose estimation, so as to obtain 2D human body key point coordinate information as the key point features. It will be appreciated that the human body pose estimation algorithm is not particularly limited herein.
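As a rough sketch of the structure in FIG. 2, the PyTorch module below wires a shared backbone (network feature extraction module 210) to an image branch (220) and a SimDR-style key point branch (230); the backbone choice, layer sizes and 1D heat-map bin counts are illustrative assumptions rather than values specified by the application:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoBranchPoseModel(nn.Module):
    """Sketch of a shared feature extractor with an image branch and a
    SimDR-style key point branch; all dimensions are assumptions."""
    def __init__(self, num_keypoints=17, bins_x=384, bins_y=512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # network feature extraction module: everything up to the last conv stage
        self.shared = nn.Sequential(*list(backbone.children())[:-2])
        # image feature extraction module: further convolutions on the initial features
        self.image_branch = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # key point feature extraction module: 1D x/y coordinate distributions per joint
        self.kpt_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(512, num_keypoints * (bins_x + bins_y)))
        self.num_keypoints, self.bins = num_keypoints, bins_x + bins_y

    def forward(self, x):
        feat = self.shared(x)                 # initial features
        img_feat = self.image_branch(feat)    # image features, shape (B, 128)
        kpt_feat = self.kpt_branch(feat)      # key point features as 1D heat-map logits
        return img_feat, kpt_feat.view(-1, self.num_keypoints, self.bins)
```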
Step S130, performing recognition based on the image features and the key point features to obtain a gesture recognition result.
In some embodiments, after the image features and the key point features are obtained by using the gesture recognition model, the image features and the key point features are analyzed, and the gesture recognition result about the target is obtained by combining the analysis result of the image features and the analysis result of the key point features.
With continued reference to fig. 2, the gesture recognition model includes a fusion module 240 and a recognition module 250. The steps of acquiring the gesture recognition result are steps S131 to S132.
Step S131, inputting the image features and the key point features into the fusion module for feature fusion to obtain gesture features.
In some embodiments, after the image features and the key point features are obtained, they are input into the fusion module 240 of the gesture recognition model for fusion, so as to obtain new gesture features. When fusing, the fusion module 240 may perform weighted fusion of the image features and the key point features according to their respective weights in the gesture features, so as to obtain the gesture features.
Specifically, concat fusion may be used when the image features and the key point features are weighted and fused, or other fusion algorithms may be used, which are not limited in detail herein. However, with simple concatenation the back-propagated gradient of the stronger branch tends to dominate, so that the fused features collapse toward a single modality; therefore, in this embodiment, an SE (squeeze-and-excitation) attention mechanism can be adopted for feature fusion.
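A minimal sketch of such SE-style attention fusion of the two modal feature vectors (the feature dimensions and reduction ratio below are assumptions):

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """Concatenate image and key point features, then re-weight the concatenated
    channels with a squeeze-and-excitation style gate so that neither modality
    trivially dominates the fused gesture features."""
    def __init__(self, img_dim=128, kpt_dim=64, reduction=4):
        super().__init__()
        fused_dim = img_dim + kpt_dim
        self.gate = nn.Sequential(
            nn.Linear(fused_dim, fused_dim // reduction), nn.ReLU(),
            nn.Linear(fused_dim // reduction, fused_dim), nn.Sigmoid())

    def forward(self, img_feat, kpt_feat):
        fused = torch.cat([img_feat, kpt_feat], dim=1)  # concat fusion of the two modalities
        weights = self.gate(fused)                      # learned per-channel attention weights
        return fused * weights                          # weighted gesture features
```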
Step S132, inputting the gesture features into the recognition module for gesture recognition, and determining a gesture recognition result corresponding to the image to be recognized.
In some embodiments, after the gesture feature is obtained, the gesture feature may be matched with a preset gesture feature in the database by using the recognition module 250, so as to determine a gesture category to which the gesture feature belongs, and further determine a gesture recognition result corresponding to the gesture feature.
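For instance, the matching step could be a nearest-neighbour comparison against the preset gesture features; the cosine-similarity criterion used below is an assumption rather than a requirement of the scheme:

```python
import torch
import torch.nn.functional as F

def match_gesture(gesture_feat, preset_feats, preset_labels):
    """Compare one gesture feature (shape (D,)) against preset gesture features
    stored in a database (shape (N, D)) and return the best-matching category."""
    sims = F.cosine_similarity(gesture_feat.unsqueeze(0), preset_feats, dim=1)
    best = torch.argmax(sims).item()
    return preset_labels[best], sims[best].item()
```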
In a specific application scenario, abnormal sitting postures of children are identified by using the gesture recognition method so that their sitting postures can be corrected. Cameras are installed in classrooms and capture the children's sitting postures in real time; the system analyzes the captured video, preprocesses each frame with a top-down human body pose estimation pipeline, cuts out the image of the area where each child is located to serve as an image to be recognized, inputs it into the gesture recognition model, and uses the gesture recognition model to recognize the sitting posture in the image to be recognized.
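A hedged sketch of that per-frame preprocessing step, assuming a person detector has already produced bounding boxes as (x1, y1, x2, y2) tuples (the helper name is hypothetical):

```python
def crop_person_regions(frame, person_boxes):
    """Cut out the region of each detected child from a video frame; each crop
    becomes one image to be recognized for the gesture recognition model."""
    h, w = frame.shape[:2]
    crops = []
    for (x1, y1, x2, y2) in person_boxes:
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        crops.append(frame[y1:y2, x1:x2])
    return crops
```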
After the gesture recognition model receives the image to be recognized, the network feature extraction module 210 extracts the initial features of the image. The initial features are then input into the image branch and the human body pose estimation branch of the gesture recognition model: the image feature extraction module 220 in the image branch convolves the initial features to extract the image features, while the key point feature extraction module 230 in the human body pose estimation branch extracts key point information from the initial features, thereby completing the human body pose estimation task and extracting the key point features.
The obtained image features and key point features are input into the fusion module 240 for weighted fusion, so as to obtain the gesture features. The gesture features are input into the recognition module 250, which matches them with preset gesture features in a database to determine the gesture category to which they belong and thereby the corresponding gesture recognition result, so that the sitting posture category of the child is obtained. If the sitting posture of the child is an abnormal one, such as resting the chin on a hand, lying on the desk, hunching the back or tilting to one side, the system calculates the position of the child in the classroom according to the position of the child's image to be recognized in the original image, and prompts the teacher that the sitting posture of the child at that position is abnormal as a reminder. If the sitting posture of the child is normal, no reminder is needed.
In addition, in order to improve the accuracy of gesture recognition, the gesture recognition model can be trained before the gesture recognition method is formally applied, so that the model is only put into use in a specific application scenario after it has been trained to the expected level of performance. Referring to fig. 3, fig. 3 is a flowchart illustrating an embodiment of the training method of the gesture recognition model of the present application. Specifically, the method may include the following steps:
Step S310, acquiring a sample image.
In some embodiments, the acquired sample image may be an image in a database whose gesture recognition result is already known.
In other embodiments, when the target is a person, the camera assembly of the acquisition device may be used to acquire color (RGB) images, and the human bounding box, joint coordinates and corresponding gesture category are annotated in each color image, the gesture categories including walking, lying on the desk, hunching the back, sitting and the like. The annotated color image is then used as a sample image.
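By way of illustration only, one such annotated training record could be stored as a simple dictionary; the field names, coordinate values and category strings below are assumptions, not a format prescribed by the application:

```python
# Hypothetical annotation record for one person in one RGB sample image
sample_annotation = {
    "image_path": "images/classroom_0001.jpg",          # assumed path layout
    "person_box": [120, 80, 360, 540],                   # human bounding box: x1, y1, x2, y2
    "keypoints": [[240, 130], [235, 160], [250, 160]],   # (x, y) joint coordinates, truncated for brevity
    "pose_category": "lying_on_desk",                     # e.g. walking / lying_on_desk / hunched_back / sitting
}
```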
In addition, the open-source anchor-free NanoDet target detection algorithm can be adopted, and a human body detection model can be obtained by training on the public COCO dataset combined with the acquired human body detection dataset. The algorithm achieves an excellent balance between detection accuracy and speed, providing strong support for the subsequent processing. NanoDet is an ultra-fast, lightweight anchor-free target detection model for mobile devices.
Step S320, carrying out feature extraction on the sample image by utilizing the gesture recognition model to obtain image sample features and key point sample features, and obtaining gesture sample features based on the image sample features and the key point sample features.
In some embodiments, after the sample image is acquired, it is input into the gesture recognition model to train the model. Specifically, the gesture recognition model is composed of a plurality of modules, so the sample image needs to be processed by these modules during training. With continued reference to fig. 2, after the gesture recognition model receives the sample image, the network feature extraction module 210 performs feature extraction on the sample image to obtain initial sample features. The image feature extraction module 220 then convolves the initial sample features to obtain the image sample features. Meanwhile, the key point feature extraction module 230 extracts key point information from the initial sample features to obtain the key point sample features; the key point feature extraction module 230 may adopt SimDR (Simple Disentangled coordinate Representation for pose estimation), based on 1D heat maps, as the framework of the key point branch for target pose estimation, so as to obtain the key point sample features.
After the image sample features and the key point sample features are obtained, they are fused by the fusion module 240 of the gesture recognition model to obtain the gesture sample features.
Step S330, identifying the image sample features to obtain a first predicted gesture category.
In some embodiments, after the image sample features are obtained, in order to further calculate the loss between the gesture category predicted from the image sample features and the actual gesture category, the classification module 260 may be used to recognize and classify the image sample features so as to determine the first predicted gesture category corresponding to the image sample features.
Step S340, recognizing the gesture sample features to obtain a gesture recognition sample result, wherein the gesture recognition sample result comprises a second predicted gesture category.
In some embodiments, after the gesture sample features are obtained, they are recognized by the recognition module 250 to obtain a gesture recognition sample result. The gesture recognition sample result includes the second predicted gesture category, from which it can be determined which gesture the gesture sample features belong to, for example the walking category or the sitting category.
Step S350, adjusting parameters of the gesture recognition model by using the first predicted gesture category and the reference gesture category annotated for the sample image, the key point sample features and the reference key point information annotated for the sample image, and the second predicted gesture category and the reference gesture category.
In some embodiments, the sample image is recognized by the gesture recognition model, and after the predicted recognition result is obtained, it is compared with the actual annotation to determine the loss values of the gesture recognition model; the loss values are then back-propagated so as to adjust the parameters of the gesture recognition model.
Wherein the first loss value may be determined using a difference between the first predicted pose class and the reference pose class, and in particular, the first loss value may be calculated using a cross entropy loss algorithm to supervise the image branches. It will be appreciated that in calculating the first Loss value, other Loss (Loss) algorithms may be used in addition to the cross entropy Loss algorithm, and are not specifically limited herein.
And determining a second loss value by utilizing the difference between the key sample characteristics and the reference key point information, and particularly calculating the second loss value by using KL divergence so as to supervise the gesture estimation task of the gesture estimation branch. It is to be understood that, in addition to the KL-divergence algorithm, other loss algorithms may be used in calculating the second loss value, which is not specifically limited herein.
And determining a third loss value by utilizing the difference between the second predicted gesture category and the reference gesture category, and particularly calculating the third loss value by adopting a cross entropy loss algorithm, and carrying out overall network supervision on the fusion characteristics. It will be appreciated that other loss algorithms may be used in calculating the third loss value, in addition to the cross entropy loss algorithm, and are not specifically limited herein.
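A condensed PyTorch sketch of these three supervision terms (cross-entropy on the image branch, KL divergence on the SimDR-style key point branch, cross-entropy on the fused features); the tensor shapes and the log-softmax form of the KL input are assumptions:

```python
import torch.nn.functional as F

def compute_losses(img_logits, fused_logits, kpt_logits, gt_class, gt_kpt_dist):
    """img_logits / fused_logits: (B, num_classes) predictions from the image branch
    and from the fused gesture features; kpt_logits: (B, K, bins) 1D heat-map logits;
    gt_kpt_dist: (B, K, bins) reference key point distributions; gt_class: (B,) labels."""
    loss_img_cls = F.cross_entropy(img_logits, gt_class)        # first loss value (image branch)
    loss_pose = F.kl_div(F.log_softmax(kpt_logits, dim=-1),     # second loss value (pose estimation)
                         gt_kpt_dist, reduction="batchmean")
    loss_fuse_cls = F.cross_entropy(fused_logits, gt_class)     # third loss value (fused features)
    return loss_img_cls, loss_pose, loss_fuse_cls
```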
And then, adjusting parameters of the gesture recognition model by using the first loss value, the second loss value and the third loss value.
In some embodiments, after obtaining the first loss value, the second loss value, and the third loss value, parameters of the network feature extraction module 210, the classification module 260, and the image feature extraction module 220 in the gesture recognition model are adjusted using the first loss value. Parameters of the network feature extraction module 210 and the keypoint feature extraction module 230 in the gesture recognition model are adjusted using the second loss value. Parameters of the network feature extraction module 210, the image feature extraction module 220, the key point feature extraction module 230, the fusion module 240, and the recognition module 250 in the gesture recognition model are adjusted using the third loss value.
In other embodiments, since the image features often contain far richer information than the key point features, it is difficult to avoid the image features dominating the fused features. Therefore, during training, before the parameters of the gesture recognition model are adjusted, the second loss value corresponding to the key point features is given an additional weight in the fusion, so that even when the information contained in the key point features is far less than that in the image features, a certain gradient is still propagated back to the key point feature extraction module 230; this ensures that the key point feature extraction module 230 keeps being updated and greatly alleviates the feature collapse problem. In addition to supervising the whole model network by adding a classification loss on the fused features, classification losses are also added on the individual image features and key point features, and the three are finally summed to form the overall classification loss. The individual classification loss of each modal branch updates the parameters of its own feature extraction module and affects the parameters of the part of the model shared by the two branches, while the classification loss on the fusion module 240 updates all parameters in the network.
Specifically, the first loss value, the second loss value and the third loss value may be added to obtain a fourth loss value, the second loss value and the fourth loss value are then weighted and fused to generate a comprehensive loss value, and the comprehensive loss value is used to adjust all parameters of the gesture recognition model, as expressed by the following formulas:
Loss_Cls = Loss_Img_Cls + Loss_Pose_Cls + Loss_Fuse_Cls (1)

Loss = α1 × Loss_Pose_Cls + α2 × Loss_Cls (2)

where Loss_Img_Cls is the first loss value, Loss_Pose_Cls is the second loss value, Loss_Fuse_Cls is the third loss value, Loss_Cls is the fourth loss value, Loss is the comprehensive loss value, and α1 and α2 are weighting coefficients.
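Following formulas (1) and (2), a short sketch of combining the loss values (the weighting coefficients α1 and α2 are hyperparameters whose values the application does not specify; 1.0 is only a placeholder):

```python
def combine_losses(loss_img_cls, loss_pose, loss_fuse_cls, alpha1=1.0, alpha2=1.0):
    """Formula (1): the fourth loss value is the sum of the first, second and third
    loss values; formula (2): the comprehensive loss is a weighted fusion of the
    second and fourth loss values."""
    loss_cls = loss_img_cls + loss_pose + loss_fuse_cls   # fourth loss value, Loss_Cls
    loss = alpha1 * loss_pose + alpha2 * loss_cls         # comprehensive loss, Loss
    return loss
```

The comprehensive loss can then be back-propagated once to update all parameters of the gesture recognition model, as described above for the fusion-level supervision.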
Referring to fig. 4, fig. 4 is a schematic frame diagram of an embodiment of a gesture recognition apparatus according to the present application. The gesture recognition apparatus 400 includes an acquisition module 410, a feature extraction module 420, and a gesture recognition module 430. The acquiring module 410 is configured to acquire an image to be identified including a target. The feature extraction module 420 is configured to perform feature extraction on an image to be identified by using the gesture recognition model, so as to obtain at least an image feature and a key point feature. The gesture recognition module 430 is configured to perform recognition based on at least the obtained image feature and the key point feature, and obtain a gesture recognition result.
Referring to fig. 5, fig. 5 is a schematic diagram of a model training apparatus according to an embodiment of the application. The model training apparatus 500 includes a sample acquisition module 510, a sample feature extraction module 520, an image feature recognition module 530, a sample gesture recognition module 540, and an adjustment module 550. The sample acquisition module 510 is used to acquire a sample image. The sample feature extraction module 520 is configured to perform feature extraction on the sample image by using a gesture recognition model to obtain image sample features and key point sample features, and to obtain gesture sample features based on the image sample features and key point sample features. The image feature recognition module 530 is configured to recognize the image sample features to obtain a first predicted gesture category. The sample gesture recognition module 540 is configured to recognize the gesture sample features to obtain a gesture recognition sample result, where the gesture recognition sample result includes a second predicted gesture category. The adjustment module 550 is configured to adjust parameters of the gesture recognition model by using the first predicted gesture category and the reference gesture category annotated for the sample image, the key point sample features and the reference key point information annotated for the sample image, and the second predicted gesture category and the reference gesture category.
According to the application, the gesture recognition model is utilized to extract the characteristics of the image to be recognized containing the target, so that the image characteristics and the key point characteristics are obtained, and the final gesture recognition result is obtained by carrying out gesture recognition on the image characteristics and the key point characteristics, so that the problem of misrecognition caused by inaccurate single-mode data can be effectively reduced, and the accuracy of image recognition and the robustness of the overall scheme are improved.
In addition, the multi-mode feature extraction method based on multitasking is adopted, features of different modes can be extracted by using only a single network, most parameters can be effectively shared, the model reasoning speed is increased, the regularization constraint function can be achieved, and the recognition accuracy of the model is improved. Meanwhile, the application also adopts a method of fusing multi-mode information to carry out gesture recognition, thereby effectively reducing false recognition caused by inaccurate single-mode data and greatly improving the robustness of the whole scheme. The multi-mode gesture recognition scheme based on the multi-task provided by the application skillfully combines multi-task learning with multi-modes, and is an accurate and real-time gesture recognition solution.
In the prior art, only single-mode information is used, so that the problem of inaccuracy or low generalization often exists, and the scheme fuses the image and the key point information to carry out gesture recognition, so that the discrimination capability, the robustness and the generalization of the model are greatly improved.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
Referring to fig. 6, fig. 6 is a schematic diagram of a frame of an electronic device 60 according to an embodiment of the application. The electronic device 60 comprises a memory 61 and a processor 62 coupled to each other, the processor 62 being adapted to execute program instructions stored in the memory 61 for implementing the steps of any of the gesture recognition method embodiments described above or for implementing the steps of any of the gesture recognition model training method embodiments described above. In one specific implementation scenario, electronic device 60 may include, but is not limited to, a microcomputer, a server, and further, electronic device 60 may also include a mobile device such as a notebook computer, a tablet computer, etc., without limitation.
Specifically, the processor 62 is configured to control itself and the memory 61 to implement the steps of any of the gesture recognition method embodiments described above, or to implement the steps of any of the gesture recognition model training method embodiments described above. The processor 62 may also be referred to as a CPU (Central Processing Unit). The processor 62 may be an integrated circuit chip having signal processing capabilities. The processor 62 may also be a general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 62 may be commonly implemented by an integrated circuit chip.
Referring to FIG. 7, FIG. 7 is a schematic diagram of a computer readable storage medium 70 according to an embodiment of the application. The computer readable storage medium 70 stores program instructions 701 executable by a processor, the program instructions 701 for implementing the steps of any of the gesture recognition method embodiments described above, or implementing the steps of any of the gesture recognition model training method embodiments described above.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The foregoing description of various embodiments is intended to highlight differences between the various embodiments, which may be the same or similar to each other by reference, and is not repeated herein for the sake of brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
Claims (14)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311436504.3A CN119964191A (en) | 2023-10-30 | 2023-10-30 | A gesture recognition method and its device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119964191A true CN119964191A (en) | 2025-05-09 |
Family
ID=95588324
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311436504.3A Pending CN119964191A (en) | 2023-10-30 | 2023-10-30 | A gesture recognition method and its device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119964191A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107609383B (en) | 3D face identity authentication method and device | |
CN107748869B (en) | 3D face identity authentication method and device | |
CN108829900B (en) | Face image retrieval method and device based on deep learning and terminal | |
CN112597941A (en) | Face recognition method and device and electronic equipment | |
CN112818722B (en) | Modular dynamic configurable living body face recognition system | |
Seow et al. | Neural network based skin color model for face detection | |
CN104599287B (en) | Method for tracing object and device, object identifying method and device | |
WO2020103700A1 (en) | Image recognition method based on micro facial expressions, apparatus and related device | |
US20170270355A1 (en) | Method and Apparatus for Pattern Tracking | |
CN106897658A (en) | The discrimination method and device of face live body | |
CN108229375B (en) | Method and device for detecting face image | |
CN112257617B (en) | Multi-modal target recognition method and system | |
CN118411602A (en) | An improved YOLOv8 forest fire detection method | |
KR20220160303A (en) | System for Analysing and Generating Gaze Detection Data by Gaze Detection of User | |
CN114863405B (en) | A clothing identification method, device, terminal and storage medium | |
CN119152581B (en) | Pedestrian re-identification method, device and equipment based on multi-mode semantic information | |
Yuan et al. | Ear detection based on CenterNet | |
WO2025077282A1 (en) | Liveness detection method and apparatus, computer device, and storage medium | |
CN113657155A (en) | Behavior detection method and device, computer equipment and storage medium | |
CN114694243A (en) | Fall detection method and device, electronic equipment and storage medium | |
CN119964191A (en) | A gesture recognition method and its device, equipment and storage medium | |
CN104751144A (en) | Frontal face quick evaluation method for video surveillance | |
CN115116136A (en) | A kind of abnormal behavior detection method, device and medium | |
CN114511877A (en) | Behavior recognition method and device, storage medium and terminal | |
CN113822222A (en) | Human face anti-cheating method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||