
WO2018033154A1 - Gesture control method, device, and electronic apparatus - Google Patents

Gesture control method, device, and electronic apparatus

Info

Publication number
WO2018033154A1
Authority
WO
WIPO (PCT)
Prior art keywords
gesture
video image
human hand
business object
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/098182
Other languages
French (fr)
Chinese (zh)
Inventor
钱晨
栾青
刘文韬
李全全
闫俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201610694510.2A external-priority patent/CN107340852A/en
Priority claimed from CN201610707579.4A external-priority patent/CN107341436B/en
Priority claimed from CN201610696340.1A external-priority patent/CN107368182B/en
Application filed by Beijing Sensetime Technology Development Co Ltd
Publication of WO2018033154A1


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer

Definitions

  • the present application relates to information processing technologies, and in particular, to a gesture control method, apparatus, and electronic device.
  • Internet video has become an important business traffic portal and is considered a premium resource for ad placement.
  • Existing video advertisements are mainly inserted as fixed-duration spots before a video plays or at set points during playback, or are placed at a fixed position in and around the video playback area.
  • the embodiment of the present application provides a solution for gesture control.
  • a gesture control method includes: performing gesture detection on a currently played video image; when the detected gesture matches a predetermined gesture, determining a presentation location of a business object to be displayed in the video image; and drawing the business object at the presentation location by means of computer graphics.
  • a gesture control apparatus includes: a gesture detection module configured to perform gesture detection on a currently played video image; a presentation location determining module configured to determine, when the gesture matches the predetermined gesture, a presentation location of the business object to be displayed in the video image; and a business object rendering module configured to draw the business object at the presentation location by means of computer graphics.
  • an electronic device includes: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus.
  • the memory is for storing at least one executable instruction that causes the processor to perform the operations of the steps in the gesture control method of any of the above-described embodiments of the present application.
  • another electronic device includes: a processor and the gesture control apparatus described above; when the processor runs the gesture control apparatus, the units in the gesture control apparatus described in any of the above embodiments of the present application are operated.
  • a computer program comprises computer readable code; when the computer readable code runs on a device, a processor in the device executes instructions for the steps in the gesture control method of any of the above embodiments of the present application.
  • a computer readable storage medium stores computer readable instructions; when the instructions are executed, the operations of the steps in the gesture control method of any of the above embodiments of the present application are implemented.
  • in the embodiments of the present application, human hand and gesture detection are performed on the currently played video image, the presentation position corresponding to the detected gesture is determined, and the business object to be displayed is then drawn at that position in the video image by means of computer graphics.
  • when the business object is used to display an advertisement, on the one hand, drawing the business object at the determined display position by computer graphics combines the business object with the video playback, so that no additional advertising video data independent of the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the gestures in the video image, preserving the main image and motion of the video subject (such as the anchor) in the video image.
  • FIG. 1 is a flow chart of an embodiment of a gesture control method of the present application
  • FIG. 2 is a flowchart of an embodiment of a method for acquiring a first convolutional network model and a second convolutional network model in the embodiment of the present application;
  • FIG. 3 is a flowchart of another embodiment of a method for acquiring a first convolutional network model and a second convolutional network model in the embodiment of the present application;
  • FIG. 4 is a flow chart of another embodiment of a gesture control method of the present application.
  • FIG. 5 is a flowchart of still another embodiment of the gesture control method of the present application.
  • FIG. 6 is a structural block diagram of an embodiment of a gesture control apparatus of the present application.
  • FIG. 7 is a structural block diagram of another embodiment of the gesture control apparatus of the present application.
  • FIG. 8 is a schematic structural diagram of an embodiment of an electronic device of the present application.
  • FIG. 9 is a schematic structural diagram of another embodiment of an electronic device according to the present application.
  • Embodiments of the present application can be applied to electronic devices such as terminal devices, computer systems, servers, etc., which can operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments including any of the above, and the like.
  • Electronic devices such as terminal devices, computer systems, servers, etc., can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.
  • FIG. 1 is a flow chart of an embodiment of a gesture control method of the present application.
  • the gesture control method of various embodiments of the present application can be exemplarily executed by an electronic device such as a computer system, a terminal device, a server, or the like.
  • the gesture control method of this embodiment includes:
  • in step S110, gesture detection is performed on the currently played video image.
  • the video image may be an image in a live broadcast video being broadcasted, a video image in a recorded video, or a video image being recorded.
  • the gestures may include, but are not limited to, one or any combination of the following: wave, scissors, fist, hand, palm closing or opening, heart hand, applause, thumbs up, pistol pose, V-sign, and OK sign.
  • a live video is taken as an example.
  • there are multiple live video platforms, such as the Huajiao live broadcast platform and the YY live broadcast platform; each live broadcast platform includes multiple live broadcast rooms.
  • Each live room will include at least one anchor.
  • using an electronic device (such as a mobile phone, a tablet, or a PC), the anchor can broadcast live video to fans in the live room; the live video includes multiple video images.
  • the subject in the above video image is usually a single main character (i.e., the anchor) against a simple background, and the anchor often occupies a larger area of the video image.
  • a business object may be, for example, an advertisement or the like.
  • the video image in the embodiments of the present application may also be a video image in a short video that has been recorded.
  • the user may use the electronic device to play the short video.
  • during playback, the electronic device may acquire each frame, each key frame, or each sampled frame of the video as a video image to be processed.
  • in the case where the video image is one being recorded, during the recording process the electronic device can acquire each frame, each key frame, or each sampled frame of the recorded video as a video image to be processed.
  • a mechanism for performing human hand detection on a video image and gesture detection within the human hand candidate region where the hand is located may be provided in the electronic device that plays the video image or in the electronic device used by the user (e.g., the anchor).
  • through this mechanism, the currently played video image (i.e., the video image to be processed) can be detected to determine whether it contains the user's hand information; if so, the video image is acquired and processed according to the subsequent embodiments of the present application; if not, the video image may be discarded or left unprocessed, and the next frame of video image may be acquired to continue the above processing.
  • the hand information may include, for example, but not limited to, a finger state and position, a state and position of the palm, a closing and opening of the hand, and the like.
  • a human hand candidate region in which a human hand is located may be detected from the video image, where the human hand candidate region may be the minimum rectangular region in the video image that covers the entire human hand, or a region of another shape (such as an ellipse).
  • An optional process may be: the electronic device acquires the currently played video image as the video image to be processed and extracts an image containing the human hand candidate region from it using a preset mechanism; the image of the human hand candidate region is then analyzed by a preset mechanism to extract feature data of each part (fingers, palm, etc.) in the candidate region, and analysis of the feature data determines whether the gesture in the human hand candidate region of the video image belongs to any of gestures such as waving, scissors, fist, palm closing or opening.
  • the presentation position of the business object can be restricted by the hand position.
  • the hand position may be the center position of the human hand candidate area, or a coordinate position determined from several edge positions of the candidate area, such as those of a rectangular or elliptical area. For example, after the area where the hand is located is determined in the video image, the human hand candidate area is analyzed, and its center position is determined as the hand position.
  • in the case where the human hand candidate area is a rectangular area, the midpoint of the rectangle's diagonal can be selected as the hand position, so that a hand position determined from the candidate area of the hand is obtained.
  • alternatively, several edge positions of the rectangular or elliptical candidate area may be used as the hand position; the corresponding processing is analogous to that for the center position and is not repeated here.
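  • as an illustration of the computation above, the following is a minimal sketch, under the assumption that the candidate area is given as an (x_min, y_min, x_max, y_max) box; the helper name is hypothetical and not from this application:

```python
# Minimal sketch: the hand position taken as the center of a rectangular
# hand candidate area, i.e. the midpoint of the rectangle's diagonal.
def hand_position(box):
    x_min, y_min, x_max, y_max = box  # assumed box format
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

# Example: a candidate box anchored at (40, 60) with size 100x120
print(hand_position((40, 60, 140, 180)))  # -> (90.0, 120.0)
```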
  • step S110 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a gesture detection module 601 that is executed by the processor.
  • in step S120, when it is detected that the gesture matches the predetermined gesture, the presentation position of the business object to be displayed in the video image is determined.
  • the business object to be displayed is an object created according to a certain business requirement, and may include, for example, but not limited to, advertisement, entertainment, weather forecast, traffic forecast, pet, and the like.
  • the service object may include any one or more of the following: a video, an image, an effect including semantic information, and the like.
  • the special effect including semantic information may include, but is not limited to, at least one of the following special effects carrying advertisement information: two-dimensional sticker effects, three-dimensional effects, particle effects, and the like.
  • the presentation position may be a center position of a designated area in the video image, or may be a coordinate position or the like of a plurality of edge positions in the specified area.
  • the predetermined gestures in the embodiments of the present application may include, but are not limited to, one or any combination of the following: wave, scissors, fist, hand, palm closing or opening, heart hand, applause, thumbs up, pistol pose, V-sign, and OK sign.
  • feature data of a plurality of different gestures may be pre-stored, and different gestures are marked correspondingly to distinguish the meaning represented by each gesture.
  • the human hand, the human hand candidate area where it is located, and the gesture in that area can be detected from the video image to be processed, and the detected gesture can be compared with each pre-stored gesture; if the plurality of pre-stored gestures include one identical to the detected gesture, it may be determined that the detected gesture matches the corresponding predetermined gesture.
  • the matching result may be determined by calculation.
  • a matching algorithm may be set to calculate the matching degree between the feature data of any two gestures.
  • the matching algorithm may be applied to the feature data of the detected gesture and the feature data of each pre-stored gesture to obtain a matching degree value between the two.
  • if the maximum matching degree value exceeds a predetermined matching threshold, the pre-stored gesture corresponding to that maximum value is determined to match the detected gesture; if the maximum matching degree value does not exceed the predetermined matching threshold, the matching fails, that is, the detected hand gesture is not a predetermined gesture, and the processing of step S110 described above may continue on subsequent video images.
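  • a minimal sketch of such a matching calculation follows; the application does not fix a particular matching algorithm, so cosine similarity over fixed-length gesture feature vectors is used here purely as an illustrative matching-degree measure, with random placeholders standing in for the pre-stored feature data:

```python
# Minimal sketch: compare detected gesture features against pre-stored
# gesture feature data and apply a predetermined matching threshold.
import numpy as np

rng = np.random.default_rng(0)
PRESTORED = {name: rng.normal(size=128)          # placeholder feature data
             for name in ("wave", "scissors", "fist", "heart_hand")}

def match_gesture(features, prestored=PRESTORED, threshold=0.8):
    best_name, best_score = None, -1.0
    for name, ref in prestored.items():
        score = float(np.dot(features, ref) /
                      (np.linalg.norm(features) * np.linalg.norm(ref)))
        if score > best_score:
            best_name, best_score = name, score
    # below the threshold the matching fails: not a predetermined gesture
    return best_name if best_score >= threshold else None
```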
  • optionally, the meaning represented by the matched gesture may be determined first, and a display position related or corresponding to that meaning may be selected among a plurality of preset display positions as the presentation location of the business object to be displayed in the video image.
  • alternatively, a display position related to both the meaning and the hand position may be selected among the plurality of preset display positions as the presentation position of the business object to be displayed in the video image.
  • for example, in a live video broadcast, when a particular gesture of the anchor (such as the heart-hand) is detected, the upper region of the hand candidate region can be selected as the presentation location associated with or corresponding to it; alternatively, the palm area or its background area may be selected as the presentation position associated with or corresponding to the gesture.
  • step S120 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a presentation location determining module 602 executed by the processor.
  • in step S130, the business object is drawn at the presentation position by means of computer graphics.
  • for example, the corresponding business object may be drawn by computer graphics in the upper area of the anchor's hand candidate region in the video image. If a fan is interested in the business object, the fan can click the area where the business object is located; the fan's electronic device can then obtain the network link corresponding to the business object and, through that link, enter a page related to the business object to obtain resources related to it.
  • drawing the business object by computer graphics may be implemented by appropriate computer graphics image drawing or rendering, for example, but not limited to, drawing based on an Open Graphics Library (OpenGL) graphics drawing engine.
  • OpenGL defines a professional, cross-language, cross-platform graphics programming interface specification. It is hardware-independent and can conveniently draw 2D or 3D graphics images.
  • based on it, 2D stickers, 3D effects, particle effects, and the like can be drawn.
  • the present application is not limited to drawing methods based on the OpenGL graphics rendering engine; other methods may be adopted. For example, drawing methods based on the Unity graphics engine or the Open Computing Language (OpenCL) are also applicable to the embodiments of the present application.
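  • the sketch below illustrates the general idea of drawing a business object at a presentation position; it uses plain NumPy alpha blending of a 2D sticker onto a frame instead of an OpenGL/Unity/OpenCL engine, and the function and argument names are assumptions for illustration:

```python
# Minimal sketch: composite a 2D sticker "business object" onto a video
# frame at a presentation position via alpha blending.
import numpy as np

def draw_business_object(frame, sticker_rgba, top_left):
    """frame: HxWx3 uint8; sticker_rgba: hxwx4 uint8; top_left: (row, col).
    Assumes the sticker lies entirely within the frame."""
    r, c = top_left
    h, w = sticker_rgba.shape[:2]
    region = frame[r:r + h, c:c + w].astype(np.float32)
    rgb = sticker_rgba[..., :3].astype(np.float32)
    alpha = sticker_rgba[..., 3:4].astype(np.float32) / 255.0
    frame[r:r + h, c:c + w] = (alpha * rgb + (1 - alpha) * region).astype(np.uint8)
    return frame
```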
  • step S130 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a business object rendering module 603 executed by the processor.
  • the gesture control method performs human hand and gesture detection on the currently played video image, determines the presentation position corresponding to the detected gesture, and then draws the business object to be displayed at that position in the video image by computer graphics.
  • when the business object is used for displaying an advertisement, on the one hand, drawing the business object to be displayed at the determined display position by computer graphics combines the business object with the video playback, so that no additional advertising video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the gestures in the video image, preserving the main image and motion of the video subject (such as the anchor) in the video image. It can add interest to the video image while avoiding disturbing the user's normal viewing, which helps reduce the user's aversion to the business object displayed in the video image, can attract viewers' attention to a certain extent, and improves the influence of the business object.
  • the process of performing gesture detection on the currently played video image in step S110 of the embodiment shown in FIG. 1 may be implemented by a corresponding feature extraction algorithm, or by a neural network model such as a convolutional network model.
  • FIG. 2 is a flow chart of an embodiment of a method for acquiring a first convolutional network model and a second convolutional network model in the embodiment of the present application.
  • taking the convolutional network model as an example, to perform human hand candidate region detection and gesture detection on the video image, a first convolutional network model for detecting the human hand candidate region in an image and a second convolutional network model for detecting the gesture in the human hand candidate region may be pre-trained. Referring to FIG. 2, the method for acquiring the first convolutional network model and the second convolutional network model of this embodiment includes:
  • in step S210, the first convolutional network model is trained according to the sample image containing human hand annotation information, and the prediction information of the first convolutional network model for the human hand candidate region of the sample image is obtained.
  • the human hand annotation information may include, for example, but not limited to, annotation information of a human hand area and/or annotation information of a gesture.
  • the annotation information of the human hand area may include coordinate information of a location or a range of the human hand area, and the annotation information of the gesture may include specific type information of the gesture.
  • the specific forms of the labeling information of the human hand area and of the gesture are not limited in this embodiment.
  • the sample image containing the human hand annotation information may be a video image from an image acquisition device, composed frame by frame, or it may be a single image; the source and acquisition route of the sample images containing hand annotation information are not limited in this embodiment.
  • An annotation operation can be performed in the sample image, for example, a plurality of human hand candidate regions can be marked in the sample image.
  • the human hand candidate area is the same as the human hand candidate area in the above embodiment of the present application.
  • the prediction information of the human hand candidate region may include: location information of the region where the human hand is located in the sample image, for example coordinate point information or pixel point information; integrity information of the human hand in that region, for example whether the region includes a complete human hand or only one finger; and gesture information in that region, for example the gesture type.
  • the content of the prediction information of the candidate region of the human hand is not limited in this embodiment.
  • the sample image may be an image that satisfies a preset resolution condition.
  • the above preset resolution condition may be: the longest side of the image does not exceed 640 pixels, the shortest side does not exceed 480 pixels, and the like.
  • the labeled human hand candidate area may be a minimum rectangular area or an elliptical area in the image that can cover the whole hand.
  • the first convolutional network model may include: a first input layer, a first output layer, and a plurality of first convolutional layers, wherein the first input layer is used to input an image, the plurality of first convolutional layers are used to detect the image to obtain a human hand candidate region, and the human hand candidate region is output through the first output layer.
  • the network parameters of each layer in the first convolutional network model and the number of layers of the first convolutional layer may be set according to a preset rule, or may be randomly set, and which setting method may be determined according to actual needs.
  • the first convolutional network model processes the sample image using the plurality of first convolutional layers, that is, performs feature extraction on the sample image, and thereby obtains the human hand candidate region in the sample image: the first input layer receives a sample image, the first convolutional layers extract features of the sample image, the human hand candidate region in the sample image is determined from the extracted features, and the result is output through the first output layer.
  • the convolutional network model is trained to obtain the first convolutional network model.
  • the first input layer parameter, the first output layer parameter and the plurality of first convolution layer parameters may be trained, and then the first convolutional network model is constructed according to the obtained parameters.
  • the first convolutional network model can be trained using sample images containing human hand annotation information, so that the trained first convolutional network model is more accurate; when selecting sample images, samples covering various cases can be chosen, including sample images labeled with human hand information as well as sample images not labeled with human hand information.
  • the first convolutional network model may be a Region Proposal Network (RPN). This embodiment is described by way of example. In the actual application, the first convolutional network model is not limited to this. For example, it may be another two-class or more classified convolutional neural network (CNN), a multi-box network (Multi-Box Network) or an end-to-end real-time target detection system (YOLO).
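  • as a concrete illustration only, a sketch under assumed layer sizes rather than the actual network of this application, a first, RPN-style model can be written in PyTorch as a feature extractor plus a 1x1 convolutional output layer that scores each spatial location as hand / not-hand:

```python
# Minimal PyTorch sketch of a first, RPN-style convolutional network:
# shared feature extraction layers followed by a small "output layer"
# producing per-location hand / not-hand scores.
import torch
import torch.nn as nn

class FirstConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # feature extraction layers (shared later with the second model)
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.score = nn.Conv2d(128, 2, kernel_size=1)  # hand / not-hand

    def forward(self, x):
        f = self.features(x)
        return f, self.score(f)  # features plus hand-region proposal scores

x = torch.randn(1, 3, 480, 640)  # within the stated resolution limits
features, scores = FirstConvNet()(x)
```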
  • step S210 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a human hand area determination module 604 that is executed by the processor.
  • in step S220, the prediction information of the human hand candidate region is corrected.
  • the prediction information of the candidate region of the sample image obtained by training the first convolutional network model is a rough judgment result, and there may be a certain error rate.
  • since the prediction information of the human hand candidate region is used as input for training the second convolutional network model in a subsequent step, the rough result obtained from training the first convolutional network model can be corrected before the second convolutional network model is trained.
  • the correction may optionally be performed manually, or by introducing another convolutional network model to filter out erroneous results; the purpose of the correction is to ensure that the input information of the second convolutional network model is accurate, thereby improving the accuracy of the trained second convolutional network model. The correction process used is not limited here.
  • step S220 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a modification module 605 executed by the processor.
  • in step S230, the second convolutional network model is trained based on the corrected prediction information of the human hand candidate region and the sample image.
  • the second convolutional network model and the first convolutional network model share the feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training process of the second convolutional network model.
  • the second convolutional network model may include: a second input layer, a second output layer, a plurality of second convolutional layers, and a plurality of fully connected layers.
  • the second convolutional layers are used for feature extraction, and the fully connected layers act as a classifier over the features extracted by the second convolutional layers. The gesture detection result of the sample image is obtained as follows: the second input layer receives the human hand candidate region, the second convolutional layers extract features of the candidate region, and the fully connected layers perform classification according to those features, determining whether the sample image includes a human hand and, when it does, the gesture of the candidate region; the classification result is finally output through the second output layer.
  • since both the first convolutional network model and the second convolutional network model include convolutional layers, to facilitate model training and reduce the amount of computation, the feature extraction layers of the two convolutional network models can be set to the same network parameters, that is, the second convolutional network model shares the feature extraction layer with the first, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model.
  • when training the second convolutional network model, the network parameters of its input layer and classification layer may be trained, while the network parameters of its feature extraction layer are taken from the feature extraction layer of the first convolutional network model; the second convolutional network model is then constructed from the network parameters of the input layer, the classification layer, and the feature extraction layer.
  • the second convolutional network model may be trained using the corrected prediction information of the candidate region together with the sample image, so that the trained second convolutional network model is more accurate; when selecting sample images, samples covering a variety of cases may be chosen, including sample images labeled with gestures as well as sample images not labeled with gestures.
  • the sample image in this embodiment may be one that satisfies the above resolution conditions or other resolution conditions.
  • in this embodiment, the first convolutional network model is trained according to the sample image containing human hand annotation information, and the prediction information of the first convolutional network model for the human hand candidate region of the sample image is obtained; the prediction information of the human hand candidate region is corrected; and the second convolutional network model is trained according to the corrected prediction information of the human hand candidate region and the sample image.
  • the first convolutional network model and the second convolutional network model have the following relationship: the first convolutional network model and the second convolutional network model share the feature extraction layer, and are maintained during the training of the second convolutional network model The parameters of the feature extraction layer are unchanged.
  • the rough judgment result obtained by training the first convolutional network model is corrected, and the corrected prediction information and the sample image of the candidate region are used as the input of the second convolutional network model.
  • the accuracy of training the second convolutional network model can be improved while ensuring that the input information of the second convolutional network model is accurate.
  • since the first convolutional network model and the second convolutional network model share the feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model, the feature extraction layer of the first convolutional network model can be used directly as the feature extraction layer of the second. This facilitates training the second convolutional network model and helps reduce the amount of computation required to train it.
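  • continuing the hypothetical sketch above, sharing the feature extraction layer and keeping its parameters unchanged while training the second, gesture-classifying model might look like this in PyTorch (layer sizes again assumed):

```python
# Minimal PyTorch sketch: the second model reuses FirstConvNet.features
# and freezes those parameters, so only its classifier layers are trained.
import torch
import torch.nn as nn

class SecondConvNet(nn.Module):
    def __init__(self, shared_features, num_gestures):
        super().__init__()
        self.features = shared_features            # shared feature extraction layer
        self.classifier = nn.Sequential(           # fully connected classifier
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_gestures + 1),       # +1 for "no/undefined gesture"
        )

    def forward(self, crop):                       # crop: a hand candidate region
        return self.classifier(self.features(crop))

first = FirstConvNet()                             # from the earlier sketch
second = SecondConvNet(first.features, num_gestures=10)
for p in second.features.parameters():
    p.requires_grad = False                        # keep shared parameters unchanged
optimizer = torch.optim.SGD(
    (p for p in second.parameters() if p.requires_grad), lr=0.01)
```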
  • the second convolutional neural network may be, for example, a Fast Region-based Convolutional Neural Network (FRCNN).
  • this embodiment is described only by way of example; in practical applications the second convolutional neural network is not limited to this and may, for example, be another two-class or multi-class convolutional neural network.
  • step S230 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a convolutional model training module 606 executed by the processor.
  • the first and second convolutional network models obtained by training facilitate subsequent detection of the hand and gesture in the currently played video image and determination of the display position corresponding to the detected gesture; the business object to be displayed is then drawn at that display position in the video image by computer graphics.
  • drawing the business object to be displayed at the determined display position by computer graphics combines the business object closely with the gestures in the video image while preserving the main image and motion of the video subject (such as the anchor) in the video image; this can add interest to the video image, avoid disturbing the user's normal viewing, help reduce the user's aversion to the business object displayed in the video image, attract viewers' attention to a certain extent, and improve the influence of the business object.
  • optionally, correcting the prediction information of the candidate region may include: inputting a plurality of supplementary negative sample images and the prediction information of the candidate region into a third convolutional neural network for classification, so that the negative samples among the candidate regions are filtered out and the corrected prediction information of the candidate region is obtained.
  • the supplementary negative sample image may be, for example, a blank sample image without a human hand, a sample image containing a hand-like region but not labeled as containing a human hand, or another image without a human hand.
  • the supplementary negative sample images may be input only into the third convolutional neural network, without being input into the first or second convolutional neural network, and the supplementary set may contain only negative sample images and no positive sample images.
  • the number of supplementary negative sample images input to the third convolutional neural network may differ from the number of human hand candidate regions in the prediction information by an amount within a predetermined allowable range, where the predetermined allowable range may be set according to the actual situation, for example to the range 3-5, including 3, 4, and 5. For example, if the number of candidate regions in the prediction information is 5, the number of supplementary negative samples may be 8, 9, or 10.
  • alternatively, the number of supplementary negative sample images input to the third convolutional neural network may be equal to the number of human hand candidate regions in the prediction information; for example, if the number of candidate regions in the prediction information is 5, the number of supplementary negative samples is also 5.
  • the third convolutional neural network is used to correct the prediction information of the candidate region of the human hand obtained by training the first convolutional neural network.
  • the negative samples in the candidate region of the human hand can be filtered out, that is, the non-human hand region in the candidate region of the human hand is filtered out, and the predicted information of the corrected candidate region is obtained, so that the predicted information of the corrected candidate region is more accurate.
  • the third convolutional neural network may be, for example, FRCNN, or other two-class or multi-class convolutional neural networks.
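  • a minimal sketch of this correction step follows; the third network here stands for any two-class hand / not-hand classifier, and the threshold is an illustrative assumption:

```python
# Minimal sketch: the third classifier scores each proposed hand candidate
# region, and regions classified as negatives (not hands) are filtered out.
import torch

def correct_predictions(third_net, crops, hand_threshold=0.5):
    """crops: (N, 3, H, W) tensor of candidate-region crops.
    Returns indices of candidates kept after filtering negatives."""
    with torch.no_grad():
        logits = third_net(crops)                  # (N, 2): [not-hand, hand]
        p_hand = torch.softmax(logits, dim=1)[:, 1]
    keep = (p_hand >= hand_threshold).nonzero(as_tuple=True)[0]
    return keep  # corrected prediction info keeps only hand candidates
```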
  • FIG. 3 is a flowchart of another embodiment of a method for acquiring a first convolutional network model and a second convolutional network model in the embodiment of the present application.
  • the method for obtaining the first convolutional network model and the second convolutional network model of the embodiment includes:
  • S310 Train the first convolutional neural network according to the sample image containing the human hand annotation information, and obtain prediction information of the first convolutional neural network for the human hand candidate region of the sample image.
  • the sample image may be an image in red, green, and blue (RGB) format, or an image in another format, for example a color difference component (YUV) format image; this is not limited in this application.
  • the sample images in the embodiments of the present application may be obtained using an image acquisition device; because hardware parameters, settings, and the like differ between image acquisition devices, the acquired images may not satisfy the preset resolution conditions, in which case the collected image is scaled to obtain a sample image.
  • the first convolutional neural network may include: a first input layer, a first feature extraction layer, and a first classification output layer, wherein the first classification output layer is used to predict whether each of a plurality of candidate regions of the sample image is a human hand candidate region.
  • each layer included in the first convolutional neural network is divided functionally.
  • the first feature extraction layer may consist of convolutional layers alone, or of convolutional layers together with nonlinear transformation layers and pooling layers; the output of the first classification output layer can be understood as a two-class result, which may be implemented by a convolutional layer but is not limited to such an implementation.
  • when training the first convolutional neural network, the first input layer parameters, the first feature extraction layer parameters, and the first classification output layer parameters may be trained, and the first convolutional neural network is then constructed from the obtained parameters.
  • training the first convolutional neural network with the sample image can be understood as: training the initial model of the first convolutional neural network with the sample image to obtain a final first convolutional neural network.
  • the gradient descent method and the backpropagation algorithm can be used for training.
  • the initial model of the first convolutional neural network may be determined by factors such as the manually set number of convolutional layers and the number of neurons in each convolutional layer; the number of convolutional layers, the number of neurons, and so on can be chosen based on actual needs.
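  • a minimal sketch of such training with gradient descent and backpropagation follows, reusing the hypothetical FirstConvNet above; the data loader, labels, and loss are illustrative assumptions:

```python
# Minimal sketch: train the first network with SGD (gradient descent)
# and backpropagation over annotated sample images.
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=0.01):
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # gradient descent
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:  # loader yields (images, hand/no-hand labels)
            opt.zero_grad()
            _, scores = model(images)                  # per-location scores
            loss = loss_fn(scores.mean(dim=(2, 3)), labels)
            loss.backward()                            # backpropagation
            opt.step()
```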
  • step S310 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a human hand area determination module 604 that is executed by the processor.
  • the second convolutional neural network may include: a second input layer, a second feature extraction layer, and a second classification output layer, wherein the second classification output layer is configured to output the gesture detection result of the sample image.
  • the second feature extraction layer is similar to the first feature extraction layer, and details are not described herein again.
  • the output result of the second classification output layer may be understood as a result of multi-classification, which may be specifically implemented by a fully connected layer, but is not limited to being implemented by a fully connected layer.
  • the training of the second feature extraction layer of the second neural network may be omitted; that is, the first convolutional neural network and the second convolutional neural network are jointly trained and share the feature extraction layer, which helps improve the training speed of the convolutional neural networks.
  • the gesture detection result may include any one or more of the following predetermined gesture types: wave, scissors, fist, hand, thumbs up, pistol, OK hand, palm open, palm closed, etc.
  • it may optionally also include a non-predetermined gesture type.
  • the non-predetermined gesture type can be understood as a gesture type other than the predetermined gesture types described above, or a case indicating "no gesture"; including it can further improve the gesture classification accuracy of the second convolutional neural network.
  • step S320 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a parameter replacement module executed by the processor.
  • the human hand gesture in the sample image can be calibrated again, that is, the hand gesture is re-labeled as open, closed, and so on; the initial model of the second convolutional neural network is then trained on this calibration result together with the above prediction information to obtain the final second convolutional neural network.
  • step S330 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a second training module executed by the processor.
  • optionally, the second convolutional neural network parameters may be trained according to the prediction result of the human hand candidate region and the sample image, with the second feature extraction layer parameters kept unchanged during training.
  • optionally, the prediction information of the human hand candidate region is corrected first, and the second convolutional neural network parameters are then trained according to the corrected prediction information of the human hand candidate region and the sample image.
  • optionally, the plurality of supplementary negative sample images and the prediction information of the above candidate region may be input into the third convolutional neural network for classification, so as to filter the negative samples among the hand candidate regions and obtain the corrected prediction information of the candidate region.
  • the supplementary negative sample image is only used as an input of the third convolutional neural network, and is not used as an input of the first convolutional neural network and the second convolutional neural network, and the supplementary negative sample image may be It is a blank image without a hand, or it can be an image containing a region similar to a hand (not a hand) but not labeled as containing a hand.
  • the correction of the prediction information of the candidate region may also be performed manually by a labeler, which is not limited in this application.
  • if the prediction information obtained directly from the first convolutional neural network were used to train the second convolutional neural network, the accuracy would be poor; compared with the prediction information obtained from the first convolutional neural network, the corrected prediction information of the candidate region is considerably more accurate, so training the second convolutional neural network with the corrected prediction information and the sample image yields a second convolutional neural network with a higher accuracy rate.
  • in this embodiment, the prediction information of the human hand candidate region is corrected, and the second convolutional neural network parameters are then trained according to the corrected prediction information of the candidate region and the sample image, thereby improving the accuracy of the trained second convolutional neural network.
  • FIG. 4 shows another embodiment. The gesture control method of this embodiment includes: in step S410, acquiring the currently played video image.
  • for the content of step S410, refer to the related content in step S110 of the embodiment shown in FIG. 1; details are not described herein again.
  • the human hand candidate region corresponding to the hand information may be determined by the video image and the pre-trained convolutional network model, and the gesture of the hand is detected in the human hand candidate region.
  • for the corresponding processing, refer to the following steps S420 to S440.
  • in step S420, the video image is detected using the pre-trained first convolutional network model, and the first feature information of the video image and the prediction information of the human hand candidate region are obtained.
  • the first feature information includes hand feature information
  • the first convolutional network model may be used to detect whether a plurality of candidate regions of the image segmentation are human hand candidate regions.
  • the acquired video image containing hand information may be input into the first convolutional network model; through the network parameters of the first convolutional network model, the video image undergoes feature extraction, mapping, transformation, and similar processing to perform human hand candidate region detection, obtaining the human hand candidate region included in the video image.
  • in step S430, the first feature information and the prediction information of the candidate region are used as second feature information input to the pre-trained second convolutional network model, and the second convolutional network model performs gesture detection on the video image according to the second feature information, obtaining the gesture detection result of the video image.
  • the second convolutional network model and the first convolutional network model share a feature extraction layer.
  • the gesture may include, but is not limited to, any one or more of the following: wave, scissors, fist, hand, clapping, palm open, palm closed, thumbs up, pistol pose, V-sign, and OK hand.
  • steps S410-S430 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a gesture detection module 601 executed by the processor.
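  • putting steps S420-S430 together, a two-stage inference pass can be sketched as follows; regions_from_scores and crop_fn are hypothetical helpers, and the gesture list and threshold are assumptions:

```python
# Minimal sketch: stage 1 proposes hand candidate regions from the frame,
# stage 2 classifies the gesture in each candidate region.
import torch

GESTURES = ["wave", "scissors", "fist", "palm_open", "palm_closed",
            "thumbs_up", "pistol", "v_sign", "ok", "heart", "none"]

def detect_gestures(first_net, second_net, frame, crop_fn, score_thresh=0.5):
    """frame: (1, 3, H, W) tensor of the currently played video image."""
    with torch.no_grad():
        features, scores = first_net(frame)            # stage 1: propose regions
        boxes = regions_from_scores(scores, score_thresh)  # hypothetical helper
        results = []
        for box in boxes:
            crop = crop_fn(frame, box)                 # stage 2: classify gesture
            gesture = GESTURES[second_net(crop).argmax(dim=1).item()]
            if gesture != "none":
                results.append((box, gesture))
    return results
```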
  • in step S440, when it is detected that the gesture matches the predetermined gesture, the feature points of the hand in the human hand candidate region corresponding to the detected gesture are extracted.
  • for a video image including hand information, the hand includes certain feature points, such as fingertips, palm points, hand contour points, and the like.
  • the detection of the human hand in the video image and the determination of the feature point can be implemented in any suitable related art, which is not limited in the embodiment of the present application.
  • for example, linear feature extraction methods such as principal component analysis (PCA), linear discriminant analysis (LDA), and independent component analysis (ICA) may be used, as may nonlinear feature extraction methods such as kernel principal component analysis (Kernel PCA) and manifold learning.
  • the trained neural network model can also be used to extract the feature points of the hand, such as the convolutional network model in the embodiment of the present application.
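  • as an illustration of one of the named methods, a minimal PCA feature-extraction sketch with scikit-learn follows; the flattened hand-crop input is an assumption, since the application does not specify the input representation:

```python
# Minimal sketch: linear feature extraction with PCA over hand-region crops.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
patches = rng.normal(size=(200, 64 * 64))  # 200 flattened 64x64 hand crops
pca = PCA(n_components=32)
features = pca.fit_transform(patches)      # low-dimensional hand features
print(features.shape)                      # (200, 32)
```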
  • for example, during a live broadcast, the human hand is detected from the live video image and the feature points of the hand are determined; during the playback of a recorded video, the human hand is detected from the played video image and the feature points of the hand are determined; and during the recording of a video, the human hand is detected from the recorded video image and the feature points of the hand are determined.
  • in step S450, the presentation position of the business object to be displayed in the video image is determined according to the feature points of the hand.
  • one or more presentation positions of the business object to be displayed in the video image may be determined based on the feature points of the hand.
  • the optional implementation may include, for example:
  • optionally, a pre-trained third convolutional network model for detecting the presentation position of the business object from the video image is used, and the presentation position of the business object to be displayed corresponding to the hand position is determined in the video image.
  • a convolutional network model (i.e., the third convolutional network model) can be trained in advance so that the trained third convolutional network model has the function of determining the presentation position of the business object in the video image; alternatively, a third convolutional network model already trained by a third party to have this function may be used directly.
  • the training on the business object is taken as an example here, but those skilled in the art should understand that the third convolutional network model can also be trained on the hand while being trained on the business object, achieving joint training of hands and business objects.
  • an optional training method includes the following process:
  • the feature vector includes location information and/or confidence information of the business object in the business object sample image.
  • the confidence information of the business object indicates the probability that the business object can achieve the effect (such as being focused or clicked or viewed) when the current location is displayed.
  • the probability may be set according to statistical analysis of historical data, according to the results of simulation experiments, or according to human experience.
  • according to actual needs, only the location information of the business object may be trained, or only the confidence information of the business object may be trained, or both may be trained.
  • the third convolutional network model after training can be used to determine the location information and confidence information of the business object more effectively and accurately, so as to provide a basis for the display of the business object.
  • the third convolutional network model is trained by using at least one sample image.
  • the third convolutional network model may be trained using business object sample images that include the business object; those skilled in the art will understand that the business object sample images to be trained may include hand information in addition to the business object.
  • the business object in the business object sample image in the embodiments of the present application may be pre-labeled with location information, or confidence information, or both. Of course, in practical applications this information can also be obtained through other means. By labeling the corresponding information on the business object in advance, data processing overhead and the number of interactions can be effectively reduced, improving data processing efficiency.
  • a business object sample image having location information and/or confidence information of the business object is used as a training sample, and feature vector extraction is performed to obtain a feature vector including location information and/or confidence information of the business object.
  • the third convolutional network model can also be used to train on the hand and the business object simultaneously.
  • the feature vector of the business object sample image also includes the characteristics of the hand.
  • the acquired feature vector convolution result includes location information and/or confidence information of the service object.
  • the feature vector convolution result also contains hand information.
  • the number of times of convolution processing on the feature vector can be set according to actual needs. That is, in the third convolutional network model, the number of layers of the convolution layer can be set according to actual needs, and details are not described herein again.
  • the feature vector convolution result is the result of feature extraction on the feature vector, and it can effectively characterize the hand in the video image.
  • when the feature vector includes both the location information and the confidence information of the business object, that is, when both the location information and the confidence information of the business object are trained, the feature vector convolution result is shared in the subsequent convergence condition judgments, and no repeated processing or calculation is needed, which helps reduce the resource cost of data processing and improves data processing speed and efficiency.
  • the convergence condition is appropriately set by a person skilled in the art according to actual needs.
  • when the information satisfies the convergence condition, the network parameters in the third convolutional network model can be considered appropriate; when the information cannot satisfy the convergence condition, the network parameters in the third convolutional network model can be considered inappropriate and need to be adjusted; the adjustment may be an iterative process that continues until the result of convolving the feature vector with the adjusted network parameters satisfies the convergence condition.
  • the convergence condition may be set according to a preset standard location and/or a preset standard confidence; for example, the distance between the location indicated by the location information of the business object in the feature vector convolution result and the preset standard location satisfying a certain threshold may serve as the convergence condition for the location information of the business object, and the difference between the confidence indicated by the confidence information of the business object in the feature vector convolution result and the preset standard confidence satisfying a certain threshold may serve as the convergence condition for the confidence information of the business object, and the like.
  • a preset standard location and/or a preset standard confidence for example, a location indicated by the location information of the service object in the feature vector convolution result and a preset The distance between the standard positions satisfies a certain threshold as a convergence condition of the location information of the service object; the difference between the confidence level indicated by the confidence information of the service object in the feature vector convolution result and the preset standard confidence satisfies a certain threshold.
  • the preset standard location may be an average location obtained by averaging the location of the service object in the sample image of the business object to be trained; the preset standard confidence may be a sample image of the business object to be trained.
  • the confidence level of the business objects in the average confidence obtained after averaging processing. Since the sample image is a sample to be trained and the amount of data is large, the standard position and/or the standard confidence can be set according to the position and/or confidence of the business object in the sample image of the business object to be trained, so that the standard position and the standard position are set. Standard confidence is also more objective and accurate.
  • For example, a third distance between the confidence indicated by the confidence information and the preset standard confidence may be calculated, and whether the confidence information of the corresponding business object satisfies the convergence condition is determined according to the third distance. Optionally, when calculating the distances, the Euclidean distance may be adopted; it is simple to implement and can effectively indicate whether the convergence condition is satisfied. However, the embodiments of the present application are not limited thereto; other methods such as the Mahalanobis distance or the Bhattacharyya distance may also be adopted.
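As an illustration only (not part of the claimed embodiments), the convergence check described above can be sketched as follows; the function and threshold values are hypothetical assumptions, with the Euclidean distance used for the position and an absolute difference used for the confidence.

```python
import numpy as np

def converged(pred_pos, std_pos, pred_conf, std_conf,
              pos_threshold=5.0, conf_threshold=0.05):
    """Hypothetical convergence check: the position and confidence taken from
    the feature vector convolution result are compared with the preset
    standard position/confidence; the thresholds are illustrative values."""
    # Euclidean distance between the indicated position and the standard position.
    pos_distance = np.linalg.norm(np.asarray(pred_pos, float) - np.asarray(std_pos, float))
    # "Third distance": difference between indicated and standard confidence.
    conf_distance = abs(pred_conf - std_conf)
    return pos_distance <= pos_threshold and conf_distance <= conf_threshold

# Example: a predicted ad position close to the standard position converges.
print(converged((101.0, 52.0), (100.0, 50.0), 0.93, 0.95))  # True
```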
  • Optionally, as mentioned above, the preset standard position is an average position obtained by averaging the positions of the business objects in the business object sample images to be trained; and/or the preset standard confidence is an average confidence obtained by averaging the confidence levels of the business objects in the business object sample images to be trained.
  • If the convergence condition is satisfied, for example, the distance between the position indicated by the location information of the business object in the feature vector convolution result and the preset standard position satisfies a certain threshold, and the difference between the confidence indicated by the confidence information of the business object in the feature vector convolution result and the preset standard confidence satisfies a certain threshold, the training of the convolutional network model is completed. If the convergence condition is not satisfied, for example, the distance between the indicated position and the preset standard position does not satisfy the threshold, and/or the difference between the indicated confidence and the preset standard confidence does not satisfy the threshold, the network parameters of the third convolutional network model are adjusted according to the location information and/or confidence information of the corresponding business object in the feature vector convolution result, and the third convolutional network model is iteratively trained with the adjusted network parameters.
  • After the training, the third convolutional network model can perform feature extraction and classification on the presentation position of a business object relative to the hand, and thus has the function of determining the presentation position of the business object in a video image. In addition, when the business object has multiple candidate presentation positions, the third convolutional network model may also rank the presentation effects of these positions and determine the final presentation position based on this ranking of advantages and disadvantages. When a business object is displayed in a subsequent application, a valid presentation position can therefore be determined based on the current image in the video, as sketched below.
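A minimal sketch of such ranking, assuming the model outputs a confidence score per candidate position (the names and values here are hypothetical, not from the embodiments):

```python
def rank_positions(candidates):
    """Sort candidate presentation positions best-first by the confidence
    the model assigns to each; `candidates` holds ((x, y), confidence) pairs."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)

# Example: three candidate ad slots; the highest-confidence slot is chosen
# as the final presentation position.
candidates = [((120, 80), 0.35), ((40, 200), 0.88), ((300, 60), 0.52)]
ranked = rank_positions(candidates)
print(ranked[0][0])  # (40, 200)
```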
  • In addition, before training the third convolutional network model with the business object sample images, the sample images may be pre-processed, including: acquiring a plurality of business object sample images, where each business object sample image contains labeling information of a business object; determining the position of the business object according to the labeling information, and judging whether the distance between the determined position of the business object and a preset position is less than or equal to a set threshold; and determining the business object sample images whose business objects satisfy this condition as the business object sample images to be trained.
  • The preset position and the set threshold may be appropriately set by a person skilled in the art in any suitable manner, for example, according to the statistical analysis result of the data, a related distance calculation formula, or manual experience, which is not limited by the embodiments of the present application. In this way, sample images that do not meet the conditions can be filtered out, ensuring the accuracy of the training result.
  • Through the above process, the training of the third convolutional network model is realized, and the trained third convolutional network model can be used to determine the presentation position of a business object in a video image.
  • For example, during a live video broadcast, the position at which the business object is to be displayed may be indicated, for instance the forehead position of the anchor, and the live application is controlled to display the business object at that position; or, if the anchor clicks on a business object to indicate its display, the presentation position of the business object can be determined directly from the live video image using the third convolutional network model.
  • In another optional implementation, the presentation position in the video image of the business object to be displayed corresponding to the hand position is determined according to the feature points of the hand and the type of the business object to be displayed; alternatively, the presentation position of the business object to be displayed may be determined according to set rules.
  • The determined presentation position of the business object to be displayed includes, for example, any one or more of the following: the palm area of the person in the video image, the area above the palm, the area below the palm, the background area of the palm, the body area other than the hand, the background area in the video image, an area within a set range centered on the area where the hand is located in the video image, a preset area in the video image, and the like.
  • Accordingly, the presentation position of the business object to be displayed in the video image can be determined. For example, the center point of the presentation area corresponding to the presentation position is used as the center point at which the business object is displayed; or a certain coordinate position in the presentation area corresponding to the presentation position is determined as the center point of the presentation position, and so on, which is not limited by the embodiments of the present application.
  • In an optional implementation, when determining the presentation position of the business object to be displayed in the video image, the presentation position is determined not only according to the feature points of the hand but also according to the type of the business object to be displayed. In the embodiments of the present application, the types of business objects include, for example but not limited to, any one or more of the following: forehead patch type, cheek patch type, chin patch type, virtual hat type, virtual clothing type, virtual makeup type, virtual headwear type, virtual hair accessory type, and virtual jewelry type; in addition, types such as the virtual bottle cap type, the virtual cup type, and the text type may also be included.
  • According to the type of the business object, an appropriate presentation position can be selected for the business object with reference to the feature points and the position of the hand. When there are multiple candidate positions, at least one presentation position may be selected from them as the presentation position in the video image of the business object to be displayed. For example, a text type business object can be displayed in the background area, in the palm area of the person, or in the area above the hand, as in the sketch below.
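For illustration, a type-to-region preference table might look like the following; the type and region names are hypothetical stand-ins for the areas listed above, not identifiers from the embodiments:

```python
# Hypothetical preference table from business object type to presentation regions.
TYPE_TO_REGIONS = {
    "text": ["background", "palm", "above_hand"],
    "2d_sticker": ["palm", "above_hand"],
    "3d_effect": ["around_hand", "background"],
}

def candidate_regions(object_type, available_regions):
    """Return the preferred regions for this object type that are actually
    available in the current frame, in preference order."""
    preferred = TYPE_TO_REGIONS.get(object_type, ["background"])
    return [r for r in preferred if r in available_regions]

print(candidate_regions("text", {"palm", "background"}))  # ['background', 'palm']
```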
  • In another optional implementation, the correspondence between gestures and presentation positions may be stored in advance. When it is determined that the detected gesture matches a corresponding predetermined gesture, the target presentation position corresponding to the predetermined gesture is acquired from the pre-stored correspondence between gestures and presentation positions, and is used as the presentation position in the video image of the business object to be displayed.
  • In this implementation, the gesture is not necessarily related to the presentation position: the gesture is only a way to trigger the presentation of the business object, and there is no necessary relationship between the presentation position and the human hand. That is, the business object can be displayed in a certain area of the hand, or in areas other than the hand, such as the background area of the video image, as illustrated by the lookup sketch below.
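A minimal sketch of the pre-stored gesture-to-position correspondence, assuming a simple dictionary; the keys and region names are hypothetical:

```python
# Hypothetical pre-stored correspondence between predetermined gestures
# and target presentation positions.
GESTURE_TO_POSITION = {
    "wave": "background",
    "palm_open": "palm",
    "ok_sign": "above_hand",
}

def target_position(detected_gesture):
    """Look up the target presentation position for a matched gesture;
    None means the gesture is not one of the predetermined gestures."""
    return GESTURE_TO_POSITION.get(detected_gesture)

print(target_position("wave"))  # 'background'
```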
  • In addition, the same gesture may trigger the display of different business objects. For example, if the anchor performs two wave gestures in succession, the first gesture may trigger a two-dimensional sticker special effect and the second gesture a three-dimensional special effect, and the content of the corresponding advertisements may be the same or different.
  • In an optional example, steps S440-S450 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by the presentation position determining module 602 run by the processor.
  • In step S460, the business object to be displayed is drawn at the presentation position in a computer drawing manner. For example, when the business object is a sticker, it can be used for advertisement display. When drawing the business object, the business object may be scaled, rotated, and so on according to the coordinates of the presentation position, and then drawn by a corresponding drawing method, such as a graphics rendering engine based on the Open Graphics Library (OpenGL).
  • Advertisements can also be displayed as 3D special effects, for example displaying the text or logo (LOGO) of an advertisement through a particle special effect, or displaying the name of a product through a virtual bottle cap type two-dimensional sticker special effect, to attract viewers and improve the efficiency of advertisement display.
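A minimal sketch of the scale/rotate-then-draw step, using OpenCV's affine warp as a stand-in for the OpenGL graphics rendering engine mentioned above (the paste-in is simplified and assumes the sticker stays inside the frame):

```python
import cv2
import numpy as np

def draw_sticker(frame, sticker, center, scale=1.0, angle=0.0):
    """Rotate/scale a 2D sticker about its own center on a fixed-size
    canvas, then paste it so its center lands on the presentation position."""
    h, w = sticker.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    warped = cv2.warpAffine(sticker, m, (w, h))
    x0, y0 = int(center[0] - w / 2), int(center[1] - h / 2)
    frame[y0:y0 + h, x0:x0 + w] = warped
    return frame

frame = np.zeros((480, 640, 3), np.uint8)        # stand-in video frame
sticker = np.full((64, 64, 3), 255, np.uint8)    # stand-in 2D sticker
out = draw_sticker(frame, sticker, center=(320, 240), scale=0.8, angle=30.0)
```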
  • In an optional example, step S460 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by the business object drawing module 603 run by the processor.
  • The gesture control method provided by the embodiments of the present application triggers the display of a business object by a gesture. On the one hand, the business object to be displayed is drawn at the determined presentation position in a computer drawing manner, and is thus combined with the video playback, so that no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client. On the other hand, the business object is closely combined with the gestures in the video image, which preserves the main image and motion of the video subject (such as the anchor) in the video image, adds interest to the video image, and avoids disturbing the user's normal viewing, which helps reduce the user's aversion to the business objects displayed in the video image and can, to a certain extent, attract the attention of the audience and enhance the influence of the business object.
  • FIG. 5 is a flow chart of still another embodiment of the gesture control method of the present application. As shown in FIG. 5, the gesture control method of this embodiment includes:
  • In step S501, the first convolutional network model is trained according to the sample image containing the human hand annotation information, and the prediction information of the first convolutional network model for the human hand candidate region of the sample image is obtained.
  • In an optional example, step S501 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by the human hand area determining module 604 run by the processor.
  • In step S502, the prediction information of the human hand candidate region is corrected.
  • In an optional example, step S502 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by the correction module 605 run by the processor.
  • In step S503, the second convolutional network model is trained based on the corrected prediction information of the human hand candidate region and the sample image. The second convolutional network model and the first convolutional network model share a feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model.
  • In an optional example, step S503 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by the convolution model training module 606 run by the processor.
  • In step S504, a feature vector of the business object sample image to be trained is acquired. The feature vector contains the location information and/or confidence information of the business object in the business object sample image, as well as the feature vector corresponding to the gesture. Optionally, the business object sample image to be trained may be the sample image containing the human hand annotation information described above.
  • Before training, the business object sample images may be pre-processed to filter out unqualified sample images. Each business object sample image includes a business object, and the business object is labeled with location information and confidence information; optionally, the location information of the center point of the business object is used as the location information of the business object. The sample images are then filtered according to the location information of the business object: after obtaining the coordinates of the position indicated by the location information, the coordinates are compared with the preset position coordinates for the business object of that type, and the position variance of the two is calculated. If the position variance is less than or equal to a set threshold, the business object sample image may be used as a sample image to be trained; if it is greater than the set threshold, the sample image is filtered out.
  • The preset position coordinates and the set threshold may be appropriately set by a person skilled in the art according to the actual situation. For example, the set threshold may be 1/20 to 1/5 of the length or width of the image; optionally, it may be 1/10 of the length or width of the image.
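A sketch of this filtering step, using the distance between the labeled center point and the preset position as a proxy for the "position variance" described above; the field names and the 1/10 default ratio are illustrative assumptions:

```python
def filter_samples(samples, preset_xy, image_w, image_h, ratio=0.1):
    """Keep sample images whose labeled business object center is within
    `ratio` of the image size from the preset position; others are filtered
    out. Each sample is a dict with an 'xy' field: the labeled center point."""
    threshold = ratio * max(image_w, image_h)  # e.g. 1/10 of the image size
    kept = []
    for s in samples:
        dx, dy = s["xy"][0] - preset_xy[0], s["xy"][1] - preset_xy[1]
        if (dx * dx + dy * dy) ** 0.5 <= threshold:
            kept.append(s)
    return kept

samples = [{"xy": (100, 100)}, {"xy": (400, 50)}]
print(len(filter_samples(samples, preset_xy=(110, 95), image_w=640, image_h=480)))  # 1
```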
  • In addition, the positions and confidence levels of the business objects in the determined business object sample images to be trained may be averaged to obtain an average position and an average confidence, which may subsequently be used as a basis for determining the convergence condition.
  • The business object sample images used for training in this embodiment are labeled with the coordinates of the advertisement positions and the confidence of the advertisement slots. The advertisement positions can be labeled on the hand, on the foreground and background, and so on, so that joint training on the hand feature points and the foreground/background advertisement positions can be realized; compared with a scheme trained on the hand alone, this helps save computing resources.
  • The magnitude of the confidence indicates the probability that an advertisement slot is a good one; for example, if the advertisement slot is mostly occluded, the confidence is low.
  • In step S505, the feature vector is convolved to obtain a feature vector convolution result.
  • In step S506, it is determined whether the location information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the convergence condition.
  • In step S507, if the convergence conditions in step S506 are satisfied, the training of the third convolutional network model is completed; if the convergence conditions in step S506 are not all satisfied, the network parameters of the third convolutional network model are adjusted according to the location information and/or confidence information of the corresponding business object in the feature vector convolution result, and the third convolutional network model is iteratively trained with the adjusted network parameters until the location information and/or confidence information of the business object after iterative training satisfies the convergence condition.
  • Through the above process, the trained third convolutional network model can be obtained, as in the toy training loop sketched below.
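The iterate-until-convergence flow of steps S506-S507 can be illustrated with a deliberately tiny toy model; everything below (the parameter vector, the learning rate, the threshold) is a hypothetical stand-in for the real network and its update rule:

```python
import numpy as np

class TinyPositionModel:
    """Toy stand-in for the third convolutional network model: a single
    2-D parameter vector that predicts the ad position."""
    def __init__(self):
        self.pos = np.zeros(2)       # the "network parameters"

    def forward(self):
        return self.pos              # the "feature vector convolution result"

def train(model, std_pos, threshold=1.0, lr=0.5, max_iters=1000):
    """Check the convergence condition; while unsatisfied, adjust the
    parameters from the result, mirroring the loop in steps S506-S507."""
    std_pos = np.asarray(std_pos, float)
    for _ in range(max_iters):
        pred = model.forward()
        if np.linalg.norm(pred - std_pos) <= threshold:  # convergence condition
            break                                        # training completed
        model.pos += lr * (std_pos - pred)               # parameter adjustment
    return model

model = train(TinyPositionModel(), std_pos=(100.0, 50.0))
print(model.forward())  # converges near the preset standard position
```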
  • For the structure of the third convolutional network model, reference may be made to the structure of the first convolutional network model or the second convolutional network model in the embodiment shown in FIG. 2 or FIG. 3 of the present application, and details are not described herein again.
  • The first convolutional network model, the second convolutional network model, and the third convolutional network model obtained by the above training may then perform corresponding processing on the video image, which may specifically include the following steps S508 to S513.
  • In an optional example, steps S504-S507 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a third training module in the gesture control apparatus run by the processor.
  • In step S508, the currently played video image is acquired.
  • In step S509, the video image is detected by using the pre-trained first convolutional network model, and the first feature information of the video image and the prediction information of the human hand candidate region are obtained.
  • In step S510, the first feature information and the prediction information of the human hand candidate region are used as the second feature information of the pre-trained second convolutional network model, and the second convolutional network model performs gesture detection on the video image according to the second feature information to obtain the gesture detection result of the video image.
  • In this embodiment, the gesture in the human hand candidate region may be determined in a probabilistic manner. For example, taking the palm open gesture and the palm closed gesture as an example, when the probability of the palm open gesture is high, the video image can be considered to contain a human hand with the palm open gesture; when the probability of the palm closed gesture is high, the video image can be considered to contain a human hand with the palm closed gesture. Accordingly, the output of the second convolutional network model may include: the probability that the human hand candidate region does not contain a human hand, the probability that the human hand candidate region contains a palm open gesture, the probability that the human hand candidate region contains a palm closed gesture, and the like.
  • That is, the second convolutional network model obtains the gesture detection result for the video image according to the features of the human hand candidate region and the various predetermined gestures. The second convolutional network model may directly use the first features of the video image extracted by the plurality of first convolutional layers as the second features of the human hand candidate region extracted by the plurality of second convolutional layers, and then classify the human hand candidate region through a plurality of fully connected layers according to these second features to obtain the gesture detection result for the video image, thereby saving computation and improving detection speed, as sketched below.
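A minimal numeric sketch of this feature reuse, assuming the shared first-stage features feed a single fully connected layer plus softmax over {no hand, palm open, palm closed}; the weights here are random placeholders:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_gesture(shared_features, w, b, classes):
    """Score a human hand candidate region: the features extracted by the
    first network are reused directly, then a fully connected layer plus
    softmax yields a probability per predetermined gesture class."""
    logits = shared_features @ w + b
    return dict(zip(classes, softmax(logits)))

rng = np.random.default_rng(0)
features = rng.standard_normal(128)                # shared first-stage features
w, b = rng.standard_normal((128, 3)), np.zeros(3)  # placeholder classifier
classes = ["no_hand", "palm_open", "palm_closed"]
print(classify_gesture(features, w, b, classes))
```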
  • In an optional example, steps S508-S510 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by the gesture detection module 601 run by the processor.
  • In step S511, when it is determined that the gesture of the detected hand matches a corresponding predetermined gesture, the feature points of the hand in the human hand candidate region corresponding to the detected gesture are extracted.
  • In step S512, according to the feature points of the hand, a third convolutional network model for determining the presentation position of a business object in a video image is used to determine, in the video image, the presentation position of the business object to be displayed corresponding to the hand position.
  • In an optional example, steps S511-S512 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by the presentation position determining module 602 run by the processor.
  • In step S513, the business object to be displayed is drawn at the presentation position in a computer drawing manner.
  • In an optional example, step S513 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by the business object drawing module 603 run by the processor.
  • By means of this embodiment, the business object is closely combined with the gestures in the video image, which preserves the main image and motion of the video subject (such as the anchor) in the video image, adds interest to the video image, and avoids disturbing the user's normal viewing, which helps reduce the user's aversion to the business objects displayed in the video image and can, to a certain extent, attract the attention of the audience and enhance the influence of the business object. In addition to advertising, business objects can be widely applied to other fields, such as education, consulting, and services, providing entertainment, appreciation, and other business information to enhance interaction and improve the user experience.
  • Any gesture control method provided by the embodiments of the present application may be performed by any suitable device having data processing capability, including but not limited to a terminal device, a server, and the like. Alternatively, any gesture control method provided by the embodiments of the present application may be executed by a processor; for example, the processor executes a corresponding instruction stored in a memory to perform any gesture control method mentioned in the embodiments of the present application. Details are not described below.
  • FIG. 6 is a structural block diagram of an embodiment of a gesture control apparatus of the present application.
  • the gesture control apparatus of the embodiments of the present application can be used to implement the foregoing gesture control method embodiments of the present application.
  • Referring to FIG. 6, the gesture control apparatus of this embodiment includes a gesture detection module 601, a presentation position determining module 602, and a business object drawing module 603.
  • the gesture detection module 601 is configured to perform gesture detection on the currently played video image.
  • the presentation location determining module 602 is configured to determine a presentation location of the business object to be displayed in the video image when the gesture is detected to match the predetermined gesture.
  • the business object drawing module 603 is configured to draw a business object by using a computer drawing manner at the presentation position.
  • The gesture control apparatus of this embodiment performs human hand candidate region and gesture detection on the currently played video image containing hand information, and matches the detected gesture with a corresponding predetermined gesture; when the two match, the presentation position of the business object to be displayed in the video image is determined using the position of the hand. On the one hand, the business object to be displayed is drawn at the determined presentation position in a computer drawing manner and is combined with video playback, so that no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client. On the other hand, the business object is closely combined with the gestures in the video image, which preserves the main image and motion of the video subject (such as the anchor) in the video image, adds interest to the video image, and avoids disturbing the user's normal viewing, which helps reduce the user's aversion to the business objects displayed in the video image and can, to a certain extent, attract the attention of the audience and enhance the influence of the business object.
  • FIG. 7 is a structural block diagram of another embodiment of the gesture control apparatus of the present application.
  • In this embodiment, the presentation position determining module 602 includes: a feature point extraction unit, configured to extract the feature points of the hand in the human hand candidate region corresponding to the detected gesture; and a presentation position determining unit, configured to determine, according to the feature points of the hand, the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
  • Optionally, the presentation position determining module 602 is configured to determine, according to the feature points of the hand and the type of the business object to be displayed, the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
  • Optionally, the presentation position determining module 602 is configured to determine, according to the feature points of the hand and the type of the business object to be displayed, multiple presentation positions in the video image of the business object to be displayed corresponding to the detected gesture, and select at least one presentation position from the multiple presentation positions as the presentation position of the business object to be displayed.
  • Optionally, the presentation position determining module 602 is configured to, when it is determined that the detected gesture matches a corresponding predetermined gesture, determine the presentation position corresponding to the predetermined gesture as the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
  • Optionally, the presentation position determining module 602 is configured to acquire, from the pre-stored correspondence between gestures and presentation positions, the target presentation position corresponding to the predetermined gesture as the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
  • Optionally, the business object may include a video, an image, or a special effect containing semantic information; the video image may include a still image or a live video image. Optionally, the special effect containing semantic information includes any one or more of the following special effects containing advertisement information: a two-dimensional sticker special effect, a three-dimensional special effect, a particle special effect, and the like.
  • Optionally, the presentation position may include any one or more of the following regions: the hair region, forehead region, cheek region, or chin region of the person in the video image, a body region other than the head, the background region in the video image, an area within a set range centered on the area where the hand is located in the video image, a predetermined area in the video image, and the like.
  • the type of the business object may include any one or more of the following types: forehead patch type, cheek patch type, chin patch type, virtual hat type, virtual clothing type, virtual makeup type, virtual headdress type, virtual Hair accessory type, virtual jewelry type.
  • Optionally, the gesture or the predetermined gesture may include any one or more of the following: waving, scissors hand, fist, palm-up hand, heart-shaped hand, applause, palm open, palm closed, thumbs up, pistol pose, V sign, and OK sign.
  • Optionally, the gesture detection module 601 is configured to: detect the video image by using the pre-trained first convolutional network model to obtain the first feature information of the video image and the prediction information of the human hand candidate region, where the first feature information includes hand feature information; use the first feature information and the prediction information of the human hand candidate region as the second feature information of the pre-trained second convolutional network model; and perform gesture detection on the video image by using the second convolutional network model according to the second feature information to obtain the gesture detection result of the video image, where the second convolutional network model and the first convolutional network model share the feature extraction layer. A stub sketch of this two-stage flow follows.
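The class and method names below are hypothetical interfaces used only to illustrate the flow, not the module's actual API:

```python
class FirstNet:
    """Stub stand-in for the first convolutional network model (proposals)."""
    def extract(self, frame):
        features = [0.1, 0.7, 0.2]           # stand-in shared feature info
        candidates = [(40, 40, 120, 120)]    # one human hand candidate box
        return features, candidates

class SecondNet:
    """Stub stand-in for the second model, sharing the feature extraction layer."""
    def classify(self, features, region):
        return "palm_open", 0.91             # stand-in gesture result

def detect_gestures(frame, first_net, second_net):
    """First network yields feature info plus human hand candidate regions;
    the second network reuses those shared features to classify each region."""
    features, candidates = first_net.extract(frame)
    return [(r, *second_net.classify(features, r)) for r in candidates]

print(detect_gestures(frame=None, first_net=FirstNet(), second_net=SecondNet()))
```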
  • Referring to FIG. 7, in an optional example, the apparatus further includes: a human hand area determining module 604, configured to train the first convolutional network model according to the sample image containing the human hand annotation information and obtain the prediction information of the first convolutional network model for the human hand candidate region of the sample image; a correction module 605, configured to correct the prediction information of the human hand candidate region; and a convolution model training module 606, configured to train the second convolutional network model according to the corrected prediction information of the human hand candidate region and the sample image, where the second convolutional network model and the first convolutional network model share the feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model.
  • In another optional example, the apparatus further includes: a human hand area determining module 604, configured to train the first convolutional neural network according to the sample image containing the human hand annotation information and obtain the prediction information of the first convolutional neural network for the human hand candidate region of the sample image; a parameter replacement module, configured to replace the second feature extraction layer parameters of the second convolutional neural network for gesture detection with the first feature extraction layer parameters of the trained first convolutional neural network; and a second training module, configured to train the parameters of the second convolutional neural network according to the prediction information of the human hand candidate region and the sample image, keeping the second feature extraction layer parameters unchanged during the training.
  • Optionally, the second training module may include: a correction module 605, configured to correct the prediction information of the human hand candidate region; and a convolution model training module 606, configured to train the parameters of the second convolutional neural network according to the corrected prediction information of the human hand candidate region and the sample image, keeping the second feature extraction layer parameters unchanged during the training.
  • the human hand annotation information may include annotation information of the human hand area and/or annotation information of the gesture.
  • Optionally, the first convolutional neural network may include: a first input layer, a first feature extraction layer, and a first classification output layer, where the first classification output layer is used to predict whether multiple candidate regions of the sample image are human hand candidate regions.
  • the second convolutional neural network may include: a second input layer, a second feature extraction layer, and a second classification output layer, where the second classification output layer is configured to output a gesture detection result of the sample image.
  • Optionally, the correction module 605 is configured to input multiple supplementary negative sample images and the prediction information of the human hand candidate region into the third convolutional neural network for classification, so as to filter the negative samples in the human hand candidate regions and obtain the corrected prediction information of the human hand candidate region.
  • Optionally, the difference between the number of human hand candidate regions in the prediction information of the human hand candidate region and the number of supplementary negative sample images is within a predetermined allowable range; for example, the number of human hand candidate regions in the prediction information is equal to the number of supplementary negative sample images, as in the batch-building sketch below.
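A sketch of assembling such a balanced correction batch (a simple list pairing with hypothetical data; the string flags only mark which entries are the supplementary negatives):

```python
def build_correction_batch(hand_candidates, negative_pool):
    """Pair the predicted human hand candidate regions with an equal number
    of supplementary negative sample images, so the third network can filter
    false positives against known negatives."""
    negatives = negative_pool[:len(hand_candidates)]   # equal counts
    return [(c, "candidate") for c in hand_candidates] + \
           [(n, "negative") for n in negatives]

candidates = ["region_a", "region_b", "region_c"]
pool = ["neg_1", "neg_2", "neg_3", "neg_4"]
print(build_correction_batch(candidates, pool))
```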
  • Optionally, the first convolutional neural network may comprise an RPN; the second convolutional neural network may comprise an FRCNN; and the third convolutional neural network may also comprise an FRCNN.
  • Optionally, the presentation position determining module 602 is configured to determine, according to the gesture and a pre-trained third convolutional network model for determining the presentation position of a business object from a video image, the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
  • FIG. 8 is a schematic structural diagram of an embodiment of an electronic device of the present application. Referring to FIG. 8, the electronic device may include: a processor 802, a communication interface (Communications Interface) 804, a memory 806, and a communication bus 808, where:
  • Processor 802, communication interface 804, and memory 806 complete communication with one another via communication bus 808.
  • the communication interface 804 is configured to communicate with network elements of other devices, such as other clients or servers.
  • The processor 802 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present application, or a graphics processing unit (GPU).
  • The one or more processors included in the terminal device may be processors of the same type, such as one or more CPUs or one or more GPUs, or may be processors of different types, such as one or more CPUs and one or more GPUs.
  • The memory 806 is configured to store at least one executable instruction that causes the processor 802 to perform the operations corresponding to the gesture control method in any of the above embodiments of the present application.
  • the memory 806 may include a high speed random access memory (RAM), and may also include a non-volatile memory such as at least one disk memory.
  • FIG. 9 is a schematic structural diagram of another embodiment of an electronic device of the present application.
  • Referring to FIG. 9, the electronic device includes one or more processors, a communication unit, and the like, for example, one or more central processing units (CPUs) 901 and/or one or more graphics processing units (GPUs) 913. The processor may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 902 or executable instructions loaded from a storage portion 908 into a random access memory (RAM) 903.
  • The communication portion 912 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the read-only memory 902 and/or the random access memory 903 to execute executable instructions, connect to the communication portion 912 through the bus 904, and communicate with other target devices via the communication portion 912, thereby performing the operations corresponding to any gesture control method provided by the embodiments of the present application, for example: performing gesture detection on the currently played video image; determining, when it is detected that the gesture matches a predetermined gesture, the presentation position of the business object to be displayed in the video image; and drawing the business object at the presentation position in a computer drawing manner.
  • In addition, the RAM 903 may store various programs and data required for the operation of the device.
  • the CPU 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904.
  • ROM 902 is an optional module.
  • The RAM 903 stores executable instructions, or executable instructions are written into the ROM 902 at runtime; the executable instructions cause the processor 901 to perform the operations corresponding to the gesture control method described above.
  • An input/output (I/O) interface 905 is also coupled to bus 904.
  • The communication portion 912 may be integrated, or may be provided with multiple sub-modules (for example, multiple IB network cards) linked on the bus.
  • The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output portion 907 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like; a storage portion 908 including a hard disk and the like; and a communication portion 909 including a network interface card such as a LAN card or a modem. The communication portion 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it is installed into the storage portion 908 as needed.
  • It should be noted that the architecture shown in FIG. 9 is only an optional implementation. During practice, the number and types of the components in FIG. 9 may be selected, reduced, increased, or replaced according to actual needs. Different functional components may also be arranged separately or integrated; for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU; the communication portion may be arranged separately, or may be integrated on the CPU or the GPU; and so on.
  • In particular, according to the embodiments of the present application, the process described above with reference to the flowcharts may be implemented as a computer program product. For example, an embodiment of the present application includes a computer program product, which includes a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for performing the methods shown in the flowcharts, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: an instruction for performing gesture detection on the currently played video image; an instruction for determining, when it is detected that the gesture matches a predetermined gesture, the presentation position of the business object to be displayed in the video image; and an instruction for drawing the business object at the presentation position in a computer drawing manner.
  • In addition, the embodiments of the present application further provide a computer program, including computer-readable code; when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the steps of the gesture control method in any embodiment of the present application.
  • The embodiments of the present application further provide a computer-readable storage medium for storing computer-readable instructions; when the instructions are executed, the operations of the steps in the gesture control method of any embodiment of the present application are implemented.
  • the methods and apparatus of the present application may be implemented in a number of ways.
  • the methods and apparatus of the present application can be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware.
  • In addition, the components/steps described in the embodiments of the present application may be split into more components/steps, or two or more components/steps or partial operations of components/steps may be combined into new components/steps, according to the needs of implementation, to achieve the objectives of the embodiments of the present application.
  • the above-described sequence of steps for the method is for illustrative purposes only, and the steps of the method of the present application are not limited to the order specifically described above unless otherwise specifically stated.
  • The methods according to the present application can also be implemented as programs recorded in a recording medium, these programs including machine-readable instructions for implementing the methods according to the present application. Thus, the present application also covers a recording medium storing programs for executing the methods according to the present application.
  • The above methods according to the present application can be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk), or implemented as computer code that is downloaded through a network, originally stored in a remote recording medium or a non-transitory machine-readable medium, and to be stored in a local recording medium, so that the methods described herein can be processed by software stored in a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or an FPGA).
  • It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (for example, a RAM, a ROM, or a flash memory) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, the processor, or the hardware, the processing method described herein is implemented. Moreover, when a general-purpose computer accesses code for implementing the processing shown herein, the execution of the code converts the general-purpose computer into a special-purpose computer for performing the processing shown herein.


Abstract

A gesture control method, device, and electronic apparatus. The method comprises: detecting a gesture in a video image being played (S110); upon detecting that the gesture matches a predefined gesture, determining a display location of a service object to be displayed in the video image (S120); and drawing, using computer graphics, the service object at the display location (S130). The technical solution helps save network resources and/or system resources of a user terminal, adds entertainment value to video images, and avoids distracting a user from normal video viewing, thereby reducing the user's negative response to service objects displayed in video images, attracting viewer attention, and increasing the influence of the service object.

Description

Gesture control method, device and electronic device

This application claims priority to the Chinese patent application No. CN201610696340.1, entitled "Gesture detection network training, gesture detection, and gesture control method and device", filed with the China Patent Office on August 19, 2016; the Chinese patent application No. CN201610707579.4, entitled "Gesture detection network training, gesture detection and control method, system, and terminal", filed with the China Patent Office on August 19, 2016; and the Chinese patent application No. CN201610694510.2, entitled "Gesture control method, device, and terminal device", filed with the China Patent Office on August 19, 2016, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to information processing technologies, and in particular, to a gesture control method, apparatus, and electronic device.

Background

With the development of Internet technology, people are increasingly using the Internet to watch video, and Internet video offers business opportunities for many new businesses. Internet video has become an important business traffic portal and is considered a premium resource for ad placement.

Existing video advertisements are mainly inserted into a fixed-time advertisement before the video is played, or at a certain time of the video playback, or placed in a fixed position in the area where the video is played and its surrounding area.

Summary of the Invention

The embodiments of the present application provide a solution for gesture control.

According to an aspect of the embodiments of the present application, a gesture control method includes: performing gesture detection on a currently played video image; determining, when detecting that the gesture matches a predetermined gesture, a presentation location of a business object to be displayed in the video image; and drawing the business object in a computer drawing manner at the presentation location.

According to another aspect of the embodiments of the present application, a gesture control apparatus includes: a gesture detection module, configured to perform gesture detection on a currently played video image; a presentation location determining module, configured to determine, when the gesture is detected to match a predetermined gesture, a presentation location of a business object to be displayed in the video image; and a business object drawing module, configured to draw the business object by using a computer drawing manner at the presentation location.

According to still another aspect of the embodiments of the present application, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus; and the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the steps in the gesture control method according to any of the above embodiments of the present application.

According to still another aspect of the embodiments of the present application, another electronic device is provided, including:

a processor and the gesture control apparatus according to any of the above embodiments of the present application;

where, when the processor runs the gesture control apparatus, the units in the gesture control apparatus according to any of the above embodiments of the present application are run.

According to still another aspect of the embodiments of the present application, a computer program is provided, including computer-readable code; when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the steps of the gesture control method according to any of the above embodiments of the present application.

According to yet another aspect of the embodiments of the present application, a computer-readable storage medium is provided for storing computer-readable instructions; when the instructions are executed, the operations of the steps in the gesture control method according to any of the above embodiments of the present application are implemented.

According to the gesture control solutions provided by the embodiments of the present application, human hand and gesture detection is performed on the currently played video image, the presentation position corresponding to the detected gesture is determined, and the business object to be displayed is then drawn at this presentation position of the video image in a computer drawing manner. When the business object is used to display an advertisement, on the one hand, the business object to be displayed is drawn at the determined presentation position in a computer drawing manner and is combined with the video playback, so that no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client; on the other hand, the business object is closely combined with the gestures in the video image, which preserves the main image and motion of the video subject (such as the anchor) in the video image, adds interest to the video image, and avoids disturbing the user's normal viewing, which helps reduce the user's aversion to the business objects displayed in the video image and can, to a certain extent, attract the attention of the audience and enhance the influence of the business object.

The technical solutions of the present application are further described in detail below through the accompanying drawings and embodiments.

Brief Description of the Drawings

The accompanying drawings, which constitute a part of the specification, describe the embodiments of the present application and, together with the description, serve to explain the principles of the present application.

The present application can be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:

FIG. 1 is a flowchart of an embodiment of the gesture control method of the present application;

FIG. 2 is a flowchart of an embodiment of a method for obtaining the first convolutional network model and the second convolutional network model in the embodiments of the present application;

FIG. 3 is a flowchart of another embodiment of a method for obtaining the first convolutional network model and the second convolutional network model in the embodiments of the present application;

FIG. 4 is a flowchart of another embodiment of the gesture control method of the present application;

FIG. 5 is a flowchart of still another embodiment of the gesture control method of the present application;

FIG. 6 is a structural block diagram of an embodiment of the gesture control apparatus of the present application;

FIG. 7 is a structural block diagram of another embodiment of the gesture control apparatus of the present application;

FIG. 8 is a schematic structural diagram of an embodiment of an electronic device of the present application;

FIG. 9 is a schematic structural diagram of another embodiment of an electronic device of the present application.

Detailed Description

Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present application.

Meanwhile, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual proportional relationships.

The following description of at least one exemplary embodiment is merely illustrative and in no way serves as any limitation on the present application or its application or use.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods, and apparatus should be considered as part of the specification.

It should be noted that similar reference numerals and letters indicate similar items in the following figures; therefore, once an item is defined in one figure, it does not need to be further discussed in the subsequent figures.

Those skilled in the art can understand that the terms "first", "second", and the like in the embodiments of the present application are only used to distinguish different steps, devices, modules, or the like, and represent neither any specific technical meaning nor a necessary logical order between them.

The embodiments of the present application may be applied to electronic devices such as terminal devices, computer systems, and servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems, and the like.

Electronic devices such as terminal devices, computer systems, and servers may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and the like, which perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media including storage devices.

FIG. 1 is a flowchart of an embodiment of a gesture control method of the present application. The gesture control method of the embodiments of the present application may be executed, by way of example, by an electronic device such as a computer system, a terminal device, or a server. Referring to FIG. 1, the gesture control method of this embodiment includes:

In step S110, gesture detection is performed on a currently played video image.

In the embodiments of the present application, the video image may be an image in a live video being broadcast, a video image in a recorded video, a video image in the process of being recorded, and so on. In the embodiments of the present application, the gestures may include, but are not limited to, one or any combination of the following: waving, a scissors hand, a clenched fist, an upturned palm, closing or opening of the palm, a heart-shaped hand, applause, a thumbs-up, a pistol pose, a V sign, an OK sign, and the like.

In one optional example of the embodiments of the present application, a live video is taken as an example. At present, there are multiple live video platforms, such as the Huajiao live platform and the YY live platform. Each live platform includes multiple live rooms, and each live room includes at least one anchor. The anchor can broadcast video to the fans in his or her live room through the camera of an electronic device (such as a mobile phone, a tablet, or a PC), and the live video includes multiple video images. The subject in such a video image is usually a main character (i.e., the anchor) against a simple background, and the anchor often occupies a large area of the video image. When a business object (such as an advertisement) needs to be inserted during the live video, the video image in the current live video can be obtained.

In addition, the video image in the embodiments of the present application may also be a video image in a short video that has already been recorded. In this case, the user may play the short video on his or her electronic device, and during playback the electronic device may obtain each frame, each key frame, or each sampled frame of the video as the video image to be processed.

In addition, in the embodiments of the present application, where the video image is a video image in the process of being recorded, during recording the electronic device may obtain each recorded frame, each key frame, or each sampled frame of the video as the video image to be processed.

In the embodiments of the present application, a mechanism for performing human-hand detection on a video image and gesture detection in the human-hand candidate region where the hand is located may be provided in the electronic device that plays the video image or in the electronic device used by the user (e.g., the anchor). Through this mechanism, the currently played video image (i.e., the video image to be processed) can be detected to determine whether the video image to be processed includes the user's hand information. If it does, the video image is obtained for processing by the subsequent embodiments of the present application; if it does not, the video image may be discarded or left unprocessed, and the next frame of video image may be obtained to continue the above processing. In the embodiments of the present application, the hand information may include, for example, but is not limited to, the states and positions of the fingers, the state and position of the palm, the closing and opening of the hand, and the like.

For a video image containing hand information (i.e., a human hand), the human-hand candidate region where the hand is located may be detected from the video image, where the human-hand candidate region may be the smallest rectangular region in the video image that covers the entire hand, or a region of another shape (such as an ellipse). One optional process may be as follows: the electronic device obtains a frame of video image currently being played as the video image to be processed, crops out an image including the human-hand candidate region from the video image through a preset mechanism, and then analyzes the image of the human-hand candidate region and performs feature extraction through a preset mechanism to obtain feature data of the various parts (including the fingers, the palm, and so on) in the human-hand candidate region. By analyzing the feature data, it is determined which of the gestures, such as waving, a scissors hand, a clenched fist, an upturned palm, or closing or opening of the palm, the gesture in the human-hand candidate region of the video image belongs to.

In addition, in order that the presentation position of the business object to be displayed in the video image can subsequently be determined accurately and quickly, the presentation position of the business object may be constrained by the hand position. The hand position may be the center position of the above human-hand candidate region, or a coordinate position determined by multiple edge positions of the rectangular or elliptical region of the human-hand candidate region, and so on. For example, after the region where the hand is located is determined in the video image, the human-hand candidate region may be analyzed to determine its center position as the hand position. For example, if the human-hand candidate region is a rectangular region, the diagonal of the rectangular region may be computed and the midpoint of the diagonal taken as the hand position, thereby obtaining a hand position determined based on the human-hand candidate region. Besides using the center position of the human-hand candidate region as the hand position, multiple edge positions of the rectangular or elliptical region of the human-hand candidate region may also be used as the hand position; the corresponding processing may refer to the above description of using the center position as the hand position, and details are not repeated here.
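As a minimal illustration of this computation (Python is used here purely for illustration; the embodiments are not limited to any particular language), the sketch below takes a rectangular candidate region given as (x_min, y_min, x_max, y_max) pixel coordinates, an assumed convention, and returns the midpoint of its diagonal as the hand position:

```python
# A minimal sketch: the hand position as the midpoint of the diagonal
# of a rectangular human-hand candidate region. The (x_min, y_min,
# x_max, y_max) convention is an assumption made for illustration.

def hand_position(box):
    """Return the center of a rectangular candidate region."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

print(hand_position((120, 80, 320, 300)))  # -> (220.0, 190.0)
```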

In an optional example, step S110 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a gesture detection module 601 run by the processor.

In step S120, when it is detected that the gesture matches a predetermined gesture, the presentation position of the business object to be displayed in the video image is determined.

In the embodiments of the present application, the business object to be displayed is an object created according to certain business requirements, and may include, for example, but is not limited to, information relating to advertising, entertainment, weather forecasts, traffic forecasts, pets, and the like. In the embodiments of the present application, the business object may include any one or more of the following: a video, an image, a special effect containing semantic information, and so on. By way of example, the special effect containing semantic information may include, but is not limited to, at least one or any combination of the following forms of special effects containing advertising information: two-dimensional sticker effects, three-dimensional effects, particle effects, and the like. The presentation position may be the center position of a designated region in the video image, or may be the coordinate positions of multiple edge positions of the designated region, and so on. The predetermined gestures in the embodiments of the present application may include, for example, but are not limited to, one or any combination of the following: waving, a scissors hand, a clenched fist, an upturned palm, closing or opening of the palm, a heart-shaped hand, applause, a thumbs-up, a pistol pose, a V sign, an OK sign, and the like.

In an optional example of the embodiments of the present application, feature data of multiple different gestures may be stored in advance, and the different gestures may be labeled accordingly to distinguish the meaning represented by each gesture. Through the processing of the above step S110, the human hand, the human-hand candidate region where the hand is located, and the gesture in the human-hand candidate region can be detected from the video image to be processed, and the detected hand gesture can be compared one by one with the pre-stored gestures. If the multiple pre-stored gestures include a gesture identical to the detected hand gesture, it can be determined that the detected gesture matches the corresponding predetermined gesture.

In order to improve matching accuracy, the above matching result may be determined by computation. For example, a matching algorithm may be configured to compute the degree of match between the feature data of any two gestures. For example, a matching computation may be performed using the feature data of the detected gesture and the feature data of a pre-stored gesture to obtain a matching-degree value between the two. In the above manner, the matching-degree values between the detected gesture and the feature data of each pre-stored gesture are computed, and the maximum matching-degree value is selected from the obtained values. If the maximum matching-degree value exceeds a predetermined matching threshold, it can be determined that the pre-stored gesture corresponding to the maximum matching-degree value matches the detected hand gesture. If the maximum matching-degree value does not exceed the predetermined matching threshold, the matching fails, that is, the detected hand gesture is not a predetermined gesture; in this case, the processing of the above step S110 may continue to be performed on subsequent video images.
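The following is a hedged sketch of this matching step. It assumes that each gesture's feature data is a fixed-length vector and that cosine similarity serves as the matching-degree value; the threshold value, the template store, and all names are illustrative assumptions rather than details fixed by the embodiments:

```python
# A sketch of matching a detected gesture against pre-stored gesture
# templates. Cosine similarity as the matching degree and the value of
# MATCH_THRESHOLD are assumptions for illustration.
import numpy as np

MATCH_THRESHOLD = 0.8  # predetermined matching threshold (assumed value)

def matching_degree(a, b):
    """Cosine similarity between two feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_gesture(detected_features, stored_gestures):
    """Return the best-matching predetermined gesture label, or None."""
    scores = {label: matching_degree(detected_features, feats)
              for label, feats in stored_gestures.items()}
    best_label = max(scores, key=scores.get)
    # Accept the match only if the maximum degree exceeds the threshold.
    return best_label if scores[best_label] > MATCH_THRESHOLD else None
```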

In addition, when it is determined that the detected gesture matches the corresponding predetermined gesture, the meaning represented by the matched hand gesture may first be determined, and a presentation position related or corresponding to that meaning may be selected from multiple preset presentation positions as the presentation position of the business object to be displayed in the video image. In addition, where the hand position has been determined in the processing of the above step S110, a presentation position related or corresponding to the gesture's meaning and the hand position may also be selected from the multiple preset presentation positions as the presentation position of the business object to be displayed in the video image. For example, taking a live video as an example, when it is detected that the anchor makes an upturned-palm gesture, the upper region of the human-hand candidate region may be selected as the related or corresponding presentation position. As another example, when a waving gesture of the anchor is detected, the palm region or its background region may be selected as the related or corresponding presentation position.
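A table-driven selection along these lines might look as follows; the gesture labels, offsets, and fallback behavior are illustrative assumptions only:

```python
# An illustrative sketch of selecting a presentation position from the
# gesture meaning and the hand position. The offsets and labels are
# assumptions for illustration, not values fixed by the embodiments.

def presentation_position(gesture, hand_pos, box):
    x, y = hand_pos
    x_min, y_min, x_max, y_max = box
    if gesture == "palm_up":     # upturned palm: region above the hand
        return (x, y_min - (y_max - y_min) // 2)
    if gesture == "wave":        # waving: use the palm region itself
        return (x, y)
    return (x, y)                # default: centered on the hand
```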

In an optional example, step S120 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a presentation position determining module 602 run by the processor.

In step S130, the business object is drawn at the presentation position by means of computer drawing.

For example, taking a live video as an example, when it is detected that the anchor makes an upturned-palm gesture, a corresponding business object, such as a picture advertisement bearing a predetermined product logo, may be drawn by means of computer drawing in the region above the palm in the anchor's human-hand candidate region in the video image. If a fan is interested in the business object, he or she can click on the region where the business object is located; the fan's electronic device can then obtain the network link corresponding to the business object and, through the network link, enter a page related to the business object, from which resources related to the business object can be obtained.

In an optional example of the embodiments of the present application, the business object may be drawn by means of computer drawing, which may be implemented by appropriate computer graphics rendering, including, for example, but not limited to, drawing based on an Open Graphics Library (OpenGL) graphics rendering engine. OpenGL defines a professional, cross-programming-language, cross-platform graphics programming interface specification. It is hardware-independent and can conveniently draw 2D or 3D graphics. With an OpenGL graphics rendering engine, not only can 2D effects such as 2D stickers be drawn, but 3D effects, particle effects, and the like can also be drawn. However, the present application is not limited to drawing based on an OpenGL graphics rendering engine; other approaches may also be adopted. For example, drawing based on engines such as the Unity game engine or the Open Computing Language (OpenCL) is likewise applicable to the embodiments of the present application.
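As one hedged example of such drawing, the sketch below uses the legacy fixed-function OpenGL API (via the PyOpenGL bindings) to draw the business object as a textured quad centered on the presentation position. It assumes a current GL context and an already-uploaded texture `tex_id` holding the business object image; a production renderer would more likely use shaders and vertex buffers:

```python
# A minimal sketch of drawing a 2D sticker at the presentation position
# with legacy OpenGL via PyOpenGL. A GL context and the texture `tex_id`
# are assumed to exist already.
from OpenGL.GL import (glBindTexture, glEnable, glBegin, glEnd,
                       glTexCoord2f, glVertex2f,
                       GL_TEXTURE_2D, GL_QUADS)

def draw_sticker(tex_id, cx, cy, w, h):
    """Draw a textured quad of size w x h centered at (cx, cy)."""
    glEnable(GL_TEXTURE_2D)
    glBindTexture(GL_TEXTURE_2D, tex_id)
    glBegin(GL_QUADS)
    glTexCoord2f(0.0, 0.0); glVertex2f(cx - w / 2, cy - h / 2)
    glTexCoord2f(1.0, 0.0); glVertex2f(cx + w / 2, cy - h / 2)
    glTexCoord2f(1.0, 1.0); glVertex2f(cx + w / 2, cy + h / 2)
    glTexCoord2f(0.0, 1.0); glVertex2f(cx - w / 2, cy + h / 2)
    glEnd()
```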

In an optional example, step S130 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a business object drawing module 603 run by the processor.

The gesture control method provided by the embodiments of the present application performs human-hand and gesture detection on the currently played video image, determines the presentation position corresponding to the detected gesture, and then draws the business object to be displayed at that presentation position in the video image by means of computer drawing. When the business object is used to display an advertisement, on the one hand, the business object to be displayed is drawn by means of computer drawing at the determined presentation position; the business object is combined with video playback, so no additional advertising video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client. On the other hand, the business object is closely combined with the gesture in the video image, which preserves the main image and actions of the video subject (such as the anchor) in the video image, adds interest to the video image, and avoids disturbing the user's normal viewing of the video; this helps reduce the user's aversion to the business object displayed in the video image, attracts the viewer's attention to a certain extent, and improves the influence of the business object.

In an optional example of the embodiments of the present application, the gesture detection performed on the currently played video image in step S110 of the embodiment shown in FIG. 1 above may be implemented with a corresponding feature extraction algorithm, or with a neural network model such as a convolutional network model. FIG. 2 is a flowchart of an embodiment of a method for obtaining a first convolutional network model and a second convolutional network model in an embodiment of the present application. In this embodiment, a convolutional network model is taken as an example for detecting the human-hand candidate region where the hand is located and the gesture in the video image: a first convolutional network model for detecting human-hand candidate regions in an image and a second convolutional network model for detecting gestures from the human-hand candidate regions may be trained in advance. Referring to FIG. 2, the method for obtaining the first convolutional network model and the second convolutional network model of this embodiment includes:

In step S210, the first convolutional network model is trained according to sample images containing human-hand annotation information, obtaining prediction information of the first convolutional network model for the human-hand candidate regions of the sample images.

In the embodiments of the present application, the human-hand annotation information may include, for example, but is not limited to, annotation information of the human-hand region and/or annotation information of the gesture. The annotation information of the human-hand region may include coordinate information of the location or extent of the human-hand region, and the annotation information of the gesture may include the specific type of the gesture, and so on. This embodiment places no limitation on the annotation information of the human-hand region or the annotation information of the gesture.

In the embodiments of the present application, the sample image containing human-hand annotation information may be a video image from an image capture device, composed of frame-by-frame images, or may be a single frame image or a single picture; it may also come from other devices. This embodiment places no limitation on the source of, or the way of obtaining, sample images containing human-hand annotation information. An annotation operation may be performed on a sample image; for example, multiple human-hand candidate regions may be annotated in the sample image. In the embodiments of the present application, the human-hand candidate region is the same as the human-hand candidate region in the above embodiments of the present application.

In the embodiments of the present application, the prediction information of the human-hand candidate region may include: position information of the region where the hand is located in the sample image, for example, coordinate point information or pixel point information; completeness information of the hand in the region, for example, whether the region includes a complete hand or only a single finger; gesture information in the region, for example, the gesture type; and so on. This embodiment places no limitation on the content of the prediction information of the human-hand candidate region.

In the embodiments of the present application, since a larger image resolution means a larger amount of data, so that subsequent human-hand candidate region and gesture detection requires more computing resources and runs more slowly, in an optional implementation of the present application the sample image may be an image satisfying a preset resolution condition. For example, the preset resolution condition may be: the longest side of the image does not exceed 640 pixels, the shortest side does not exceed 480 pixels, and so on.

After the sample images are obtained, the information of the human-hand candidate regions and gestures may be annotated in the sample images, for example by manual annotation or machine annotation, to obtain multiple sample images annotated with human-hand candidate regions. The annotated human-hand candidate region may be the smallest rectangular or elliptical region in the image that covers the whole hand.

In an optional example of the embodiments of the present application, the first convolutional network model may include: a first input layer, a first output layer, and multiple first convolutional layers, where the first input layer is used to input an image, the multiple first convolutional layers are used to detect the image to obtain human-hand candidate regions, and the human-hand candidate regions are output through the first output layer. The network parameters of each layer in the first convolutional network model and the number of first convolutional layers may be set according to preset rules or set randomly; which setting method to adopt may be determined according to actual needs.

By way of example, when the first convolutional network model processes a sample image using the multiple first convolutional layers, feature extraction is performed on the sample image. When the first convolutional network model obtains the human-hand candidate regions in the sample image, the sample image is obtained through the first input layer, the features of the sample image are then extracted through the first convolutional layers, the human-hand candidate regions in the sample image are determined in combination with the extracted features, and the result is output through the first output layer.
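An illustrative sketch of such a structure is given below using PyTorch (an arbitrary choice; the embodiments do not prescribe a framework): an input is passed through stacked convolutional feature-extraction layers, and a 1x1 convolutional output layer scores each spatial location as hand or background, in the spirit of the RPN mentioned further below. All layer sizes are assumptions:

```python
# An illustrative sketch of the first convolutional network model:
# input layer -> convolutional feature extraction -> output layer that
# scores spatial locations as hand / background. Sizes are assumed.
import torch
import torch.nn as nn

class FirstConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extraction layers (later shared with the second model).
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 1x1 conv output layer: 2 channels = hand / background scores.
        self.output = nn.Conv2d(64, 2, 1)

    def forward(self, x):
        feats = self.features(x)
        return feats, self.output(feats)
```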

The annotation information of the region where the hand is located in the sample image is obtained, and with this annotation information as the training basis, the sample images are input into the initial first convolutional network model; the first convolutional network model may be trained using gradient descent and back-propagation to obtain the first convolutional network model. When training the first convolutional network model, the first input layer parameters, the first output layer parameters, and the multiple first convolutional layer parameters may be trained first, and the first convolutional network model is then constructed according to the obtained parameters.
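Continuing the FirstConvNet sketch above, a schematic gradient-descent/back-propagation training loop might look as follows; the dummy data, the per-location target format, and the choice of cross-entropy loss are assumptions made for illustration:

```python
# A schematic training loop for the first model using gradient descent
# and back-propagation, reusing the FirstConvNet sketch above.
import torch
import torch.nn.functional as F

model = FirstConvNet()
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Dummy batch standing in for annotated sample images: after two 2x
# poolings, a 64x64 input yields a 16x16 score map.
images = torch.randn(4, 3, 64, 64)
targets = torch.randint(0, 2, (4, 16, 16))   # 1 = hand, 0 = background

for step in range(100):
    _, scores = model(images)                # (N, 2, 16, 16)
    loss = F.cross_entropy(scores, targets)  # per-location classification
    opt.zero_grad()
    loss.backward()                          # back-propagation
    opt.step()                               # gradient descent update
```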

By way of example, the first convolutional network model may be trained using sample images containing human-hand annotation information. To make the trained first convolutional network model more accurate, sample images covering a variety of situations may be selected; the sample images may include sample images annotated with human-hand information, and may also include sample images not annotated with human-hand information.

In the embodiments of the present application, the first convolutional network model may be a Region Proposal Network (RPN). This embodiment is described with this as an example; in practical applications, the first convolutional network model is not limited to this. For example, it may also be another two-class or multi-class convolutional neural network (CNN), a Multi-Box Network, an end-to-end real-time object detection system (YOLO), or the like.

In an optional example, step S210 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a human-hand region determining module 604 run by the processor.

In step S220, the prediction information of the human-hand candidate regions is corrected.

In this embodiment, the prediction information of the human-hand candidate regions of the sample images obtained by training the first convolutional network model is a rough judgment result, and a certain error rate may exist. Since this prediction information serves in subsequent steps as an input for training the second convolutional network model, the rough judgment result obtained by training the first convolutional network model may be corrected before the second convolutional network model is trained.

An optional correction process may be manual correction, or the introduction of another convolutional network model to filter out erroneous results, and so on. The purpose of the correction is to improve the accuracy of training the second convolutional network model while ensuring that the input information of the second convolutional network model is accurate. This embodiment places no limitation on the correction process adopted.

In an optional example, step S220 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a correction module 605 run by the processor.

In step S230, the second convolutional network model is trained according to the corrected prediction information of the human-hand candidate regions and the sample images.

The second convolutional network model and the first convolutional network model share a feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model.

In an optional example of the embodiments of the present application, the second convolutional network model may include: a second input layer, a second output layer, multiple second convolutional layers, and multiple fully connected layers. The second convolutional layers are used for feature extraction, and the fully connected layers act as a classifier, classifying the features extracted by the second convolutional layers. When the second convolutional network model obtains the gesture detection result for the sample image, the human-hand candidate region is obtained through the second input layer, the features of the human-hand candidate region are then extracted through the second convolutional layers, and the fully connected layers perform classification according to the features of the human-hand candidate region, determining whether the sample image contains a human hand and, if so, determining the human-hand candidate region and the hand gesture; finally, the classification result is output through the second output layer. Since both the first convolutional network model and the second convolutional network model contain convolutional layers, in order to facilitate model training and reduce the amount of computation, the network parameters of the feature extraction layers in the two convolutional network models may be set to the same network parameters; that is, the second convolutional network model and the first convolutional network model share the feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model.

Based on this, in this embodiment, when the second convolutional network model is obtained through training, the network parameters of the input layer and the network parameters of the classification layer may be trained first, the network parameters of the feature extraction layer of the first convolutional network model are then taken as the network parameters of the feature extraction layer of the second convolutional network model, and the second convolutional network model is then constructed according to the network parameters of the input layer, the classification layer, and the feature extraction layer.
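Continuing the FirstConvNet sketch above, this construction, reusing the first model's feature extraction layers and keeping their parameters fixed during training, might be expressed as follows; the pooled input size, the fully connected layer widths, and the gesture class count are assumptions:

```python
# An illustrative sketch of the second model built around the first
# model's feature extraction layers, with those shared parameters
# frozen during training.
import torch.nn as nn

class SecondConvNet(nn.Module):
    def __init__(self, shared_features, num_gestures=10):
        super().__init__()
        self.features = shared_features           # shared with FirstConvNet
        self.pool = nn.AdaptiveAvgPool2d((7, 7))  # fixed-size input for FC
        self.classifier = nn.Sequential(          # fully connected layers
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 256), nn.ReLU(),
            nn.Linear(256, num_gestures),
        )

    def forward(self, region):
        return self.classifier(self.pool(self.features(region)))

first = FirstConvNet()
second = SecondConvNet(first.features)
for p in second.features.parameters():
    p.requires_grad = False                       # keep shared layer fixed
```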

In the embodiments of the present application, the second convolutional network model may be trained using the corrected prediction information of the human-hand candidate regions and the sample images. To make the trained second convolutional network model more accurate, sample images covering a variety of situations may be selected; the sample images may include sample images annotated with gestures, and may also include sample images not annotated with gestures.

In addition, the sample images in this embodiment may be sample images satisfying the above resolution condition or other resolution conditions.

With the gesture control method provided by this embodiment, two convolutional network models are trained separately: the first convolutional network model is trained according to sample images containing human-hand annotation information, obtaining the prediction information of the first convolutional network model for the human-hand candidate regions of the sample images; the prediction information of the human-hand candidate regions is corrected; and the second convolutional network model is trained according to the corrected prediction information of the human-hand candidate regions and the sample images. The first convolutional network model and the second convolutional network model are related as follows: they share a feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model.

Before the second convolutional network model is trained, the rough judgment result obtained by training the first convolutional network model is corrected, and the corrected prediction information of the human-hand candidate regions and the sample images are then used as the input of the second convolutional network model. This improves the accuracy of training the second convolutional network model while ensuring that its input information is accurate.

In addition, the first convolutional network model and the second convolutional network model share a feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model. The feature extraction layer of the second convolutional network model can directly reuse that of the first convolutional network model, which facilitates training the second convolutional network model and helps reduce the amount of computation required to train it.

In the embodiments of the present application, the second convolutional neural network may be, for example, a Fast Region-based Convolutional Neural Network (FRCNN). This embodiment is described with this as an example only; in practical applications, the second convolutional neural network is not limited to this. For example, it may also be another two-class or multi-class convolutional neural network.

In an optional example, step S230 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a convolutional model training module 606 run by the processor.

In this embodiment, the first convolutional network model and the second convolutional network model obtained through training facilitate subsequent human-hand and gesture detection on the currently played video image and determination of the presentation position corresponding to the detected gesture, so that the business object to be displayed can be drawn by means of computer drawing at that presentation position in the video image. When the business object is used to display an advertisement, on the one hand, the business object to be displayed is drawn by means of computer drawing at the determined presentation position; the business object is combined with video playback, so no additional advertising video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client. On the other hand, the business object is closely combined with the gesture in the video image, which preserves the main image and actions of the video subject (such as the anchor), adds interest to the video image, and avoids disturbing the user's normal viewing of the video; this helps reduce the user's aversion to the business object displayed in the video image, attracts the viewer's attention to a certain extent, and improves the influence of the business object.

In an optional example of the embodiments of the present application, correcting the prediction information of the human-hand candidate regions may include: inputting multiple supplementary negative sample images and the prediction information of the human-hand candidate regions into a third convolutional neural network for classification, so as to filter out the negative samples in the human-hand candidate regions and obtain the corrected prediction information of the human-hand candidate regions.

In the embodiments of the present application, the supplementary negative sample image may be, for example, a blank sample image containing no human hand, a sample image containing a region that resembles a human hand but is annotated as not being a human hand, an image with no human hand, and so on. The supplementary negative sample images may be input only into the third convolutional neural network, not into the first or second convolutional neural networks, and they may consist only of negative sample images, with no positive sample images.

In the embodiments of the present application, the difference between the number of supplementary negative sample images input into the third convolutional neural network and the number of human-hand candidate regions in the prediction information may fall within a predetermined allowable range, where the predetermined allowable range may be set according to the actual situation, for example, a range of 3 to 5, including 3, 4, and 5. For example, if the number of human-hand candidate regions in the prediction information is 5, the number of supplementary negative sample images may be 8, 9, or 10. When the predetermined allowable range is set to 0, the number of supplementary negative sample images input into the third convolutional neural network is equal to the number of human-hand candidate regions in the prediction information; for example, if the number of human-hand candidate regions in the prediction information is 5, the number of supplementary negative sample images is also 5.

In the embodiments of the present application, the third convolutional neural network is used to correct the prediction information of the human-hand candidate regions obtained by training the first convolutional neural network. It can filter out the negative samples in the human-hand candidate regions, that is, filter out the non-hand regions among the human-hand candidate regions, obtaining corrected prediction information of the human-hand candidate regions so that the corrected prediction information is more accurate. The third convolutional neural network may be, for example, an FRCNN, or another two-class or multi-class convolutional neural network.
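Schematically, the filtering performed by the third network might look as follows, assuming `third_net` returns, for each cropped candidate region, the probability that it contains a hand; the helper names and threshold are illustrative:

```python
# A schematic sketch of the correction step: a third (binary) classifier
# scores each predicted candidate region, and regions judged not to
# contain a hand are filtered out. `third_net` is an assumed model.
import torch

def filter_candidates(third_net, region_crops, boxes, keep_prob=0.5):
    """Keep only candidate boxes whose crops the classifier accepts."""
    with torch.no_grad():
        probs = third_net(region_crops)  # (N,) hand probabilities
    return [box for box, p in zip(boxes, probs.tolist()) if p >= keep_prob]
```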

FIG. 3 is a flowchart of another embodiment of a method for obtaining a first convolutional network model and a second convolutional network model in an embodiment of the present application. As shown in FIG. 3, the method for obtaining the first convolutional network model and the second convolutional network model of this embodiment includes:

S310: Train the first convolutional neural network according to sample images containing human-hand annotation information, obtaining prediction information of the first convolutional neural network for the human-hand candidate regions of the sample images.

In the embodiments of the present application, the sample image may be an image in red-green-blue (RGB) format, or an image in another format, for example, an image in a color-difference component (YUV) format, and so on; the present application places no limitation on this.

In addition, the sample images in the embodiments of the present application may be obtained through an image capture device. In practical applications, because image capture devices differ in hardware parameters, settings, and so on, the captured images may not satisfy the above preset resolution condition. To obtain target images satisfying the preset resolution condition, in an optional implementation of the present application, after the image capture device captures an image, the captured image may further be scaled to obtain the sample image.

In one implementation of the present application, the first convolutional neural network may include: a first input layer, a first feature extraction layer, and a first classification output layer, where the first classification output layer is used to predict whether each of multiple candidate regions of the sample image is a human-hand candidate region.

It should be noted that the layers contained in the first convolutional neural network are divided functionally. By way of example, the first feature extraction layer may consist of convolutional layers, of convolutional layers and nonlinear transformation layers, or of convolutional layers, nonlinear transformation layers, and pooling layers. The output of the first classification output layer can be understood as a binary classification result, which may specifically be implemented by a convolutional layer, but is not limited to being implemented by a convolutional layer.

In one optional example, when training the first convolutional neural network, the first input layer parameters, the first feature extraction layer parameters, and the first classification output layer parameters may be trained first, and the first convolutional neural network is then constructed according to the obtained parameters.

By way of example, training the first convolutional neural network with the sample images can be understood as: training an initial model of the first convolutional neural network with the sample images to obtain the final first convolutional neural network. When the initial model of the first convolutional neural network is trained with the sample images, gradient descent and back-propagation may be used for training.

The initial model of the first convolutional neural network may be determined according to factors such as a manually set number of convolutional layers and the number of neurons in each convolutional layer; the number of convolutional layers, the number of neurons, and so on may be determined according to actual needs.

In an optional example, step S310 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a human-hand region determining module 604 run by the processor.

S320: Replace the second feature extraction layer parameters of the second convolutional neural network used for detecting gestures with the first feature extraction layer parameters of the trained first convolutional neural network.

In one implementation of the present application, the second convolutional neural network may include: a second input layer, a second feature extraction layer, and a second classification output layer, where the second classification output layer is used to output the gesture detection result of the sample image.

It should be noted that the second feature extraction layer is similar to the first feature extraction layer described above and is not described again here. The output of the second classification output layer can be understood as a multi-class classification result, which may specifically be implemented by a fully connected layer, but is not limited to being implemented by a fully connected layer.

In this embodiment, using the first feature extraction layer parameters of the trained first convolutional neural network as the second feature extraction layer parameters makes it possible to omit training the second feature extraction layer of the second neural network. In other words, the first convolutional neural network and the second convolutional neural network are jointly trained and share the feature extraction layer, which helps speed up the training of the convolutional neural networks.

By way of example, the gesture detection result may include any one or more of the following predetermined gesture types: waving, a scissors hand, a clenched fist, an upturned palm, a thumbs-up, a pistol hand, an OK hand, a heart-shaped hand, open, closed, and so on. In addition, it may optionally include a non-predetermined gesture type. The non-predetermined gesture type can be understood as a gesture type other than the above predetermined gesture types, or as representing a "no gesture" situation; this can further improve the gesture classification accuracy of the second convolutional neural network.

It should be noted that the present application is described only with the above gesture types as examples; in practice, the predetermined gesture types are not limited to the above.

In an optional example, step S320 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a parameter replacement module run by the processor.

S330: Train the second convolutional neural network parameters according to the prediction information of the human-hand candidate regions and the sample images, keeping the second feature extraction layer parameters unchanged during training.

When training the second convolutional neural network according to the prediction information of the human-hand candidate regions and the sample images, the hand gestures in the sample images may be calibrated again, that is, the hand gestures are calibrated as an open state, a closed state, and so on, and an initial model of the second convolutional neural network is trained based on this calibration result and the above prediction information to obtain the final second convolutional neural network.

In an optional example, step S330 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a second training module run by the processor.

In an optional example of the embodiments of the present application, the second convolutional neural network parameters may be trained according to the prediction results of the human-hand candidate regions and the sample images, keeping the second feature extraction layer parameters unchanged during training, in the following manner:

Correct the prediction information of the human-hand candidate regions; then train the second convolutional neural network parameters according to the corrected prediction information of the human-hand candidate regions and the sample images.

By way of example, when correcting the prediction information of the human-hand candidate regions, multiple supplementary negative sample images and the prediction information of the human-hand candidate regions may be input into the third convolutional neural network for classification, so as to filter out the negative samples among the human-hand candidate regions and obtain the corrected prediction information of the human-hand candidate regions.

It should be noted that the supplementary negative sample images serve only as input to the third convolutional neural network, not as input to the first or second convolutional neural networks. In addition, a supplementary negative sample image may be a blank image with no hand, or an image containing a hand-like (but non-hand) region that is not calibrated as containing a hand.

In addition, in the embodiments of the present application, the prediction information of the human-hand candidate regions may also be corrected manually by annotators; the present application places no limitation on this.

Since the prediction results of the first convolutional neural network may contain large errors, that is, training the second convolutional neural network with the prediction information obtained from the first convolutional neural network yields poorer accuracy, and since the corrected prediction information of the human-hand candidate regions is much more accurate than the prediction information obtained from the first convolutional neural network, a second convolutional neural network trained with the corrected prediction information of the human-hand candidate regions and the sample images has higher accuracy.

In this embodiment, the prediction information of the human-hand candidate regions is corrected, and the second convolutional neural network parameters are then trained according to the corrected prediction information of the human-hand candidate regions and the sample images, which improves the accuracy of the trained second convolutional neural network.

FIG. 4 is a flowchart of another embodiment of a gesture control method of the present application. As shown in FIG. 4, the gesture control method of this embodiment includes: in step S410, obtaining a currently played video image.

其中,上述步骤S410的步骤内容可以参见上述图1所示实施例步骤S110中的相关内容,在此不再赘述。For the content of the step S410, refer to the related content in step S110 of the embodiment shown in FIG. 1 , and details are not described herein again.

本实施例中,可以通过视频图像和预先训练的卷积网络模型确定手部信息对应的人手候选区域,并在人手候选区域检测手部的手势,相应的处理参见下述步骤S420~步骤S440。In this embodiment, the human hand candidate region corresponding to the hand information may be determined by the video image and the pre-trained convolutional network model, and the gesture of the hand is detected in the human hand candidate region. For the corresponding processing, refer to the following steps S420 to S440.

In step S420, the video image is detected using the pre-trained first convolutional network to obtain first feature information of the video image and prediction information of the human hand candidate regions.

The first feature information includes hand feature information, and the first convolutional network model may be used to detect whether the multiple candidate regions into which the image is divided are human hand candidate regions.

In this embodiment, the acquired video image containing hand information may be input into the first convolutional network model, and processing such as feature extraction, mapping, and transformation may be applied to the video image through the network parameters of the first convolutional network model, so as to perform human hand candidate region detection on the video image and obtain the human hand candidate regions contained in the video image. For the prediction information of the human hand candidate regions, reference may be made to the description in the foregoing embodiments, and details are not repeated here.

In step S430, the first feature information and the prediction information of the human hand candidate regions are taken as second feature information of the pre-trained second convolutional network model, and the second convolutional network model performs gesture detection on the video image according to the second feature information to obtain a gesture detection result for the video image.

The second convolutional network model and the first convolutional network model share a feature extraction layer. The gesture may include, but is not limited to, any one or more of the following: waving, scissors hand, clenched fist, supporting hand, clapping, palm open, palm closed, thumbs up, pistol pose, V sign, and OK sign.

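The two-stage arrangement of steps S420 and S430 can be pictured with the following PyTorch-style sketch; it is a minimal illustration under stated assumptions, not the patented networks themselves. The layer sizes are arbitrary, and the gesture count of eleven merely mirrors the example list above. A candidate-region head and a gesture-classification head reuse one shared feature extraction trunk:

    import torch
    import torch.nn as nn

    class SharedBackboneDetector(nn.Module):
        """First network proposes hand regions; second network classifies
        gestures. Both reuse the same feature extraction layers."""
        def __init__(self, num_gestures=11):
            super().__init__()
            self.features = nn.Sequential(        # shared feature extraction layers
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((8, 8)),
            )
            self.region_head = nn.Linear(64 * 8 * 8, 4)   # one candidate box (x1,y1,x2,y2)
            self.gesture_head = nn.Linear(64 * 8 * 8, num_gestures + 1)  # +1 for "no hand"

        def forward(self, image):
            feat = self.features(image).flatten(1)   # first feature information
            boxes = self.region_head(feat)           # candidate-region prediction
            gesture_logits = self.gesture_head(feat) # gesture detection result
            return boxes, gesture_logits

Because both heads read the same features, the backbone is computed once per frame, which is the motivation for sharing the feature extraction layer in the first place.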
For the processing of step S440, reference may be made to the related content in the foregoing embodiments of the present application, and details are not repeated here.

In an optional example, steps S410 to S430 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by the gesture detection module 601 run by the processor.

In step S440, when it is detected that the gesture matches a predetermined gesture, feature points of the hand within the human hand candidate region corresponding to the detected gesture are extracted.

In the embodiments of the present application, for a video image containing hand information, the hand contains certain feature points, such as feature points of the fingers, the palm, and the hand contour. Detecting the human hand in the video image and determining its feature points may be implemented in any suitable manner in the related art, which is not limited in the embodiments of the present application. For example, linear feature extraction methods such as principal component analysis (PCA), linear discriminant analysis (LDA), and independent component analysis (ICA) may be used; nonlinear feature extraction methods such as kernel principal component analysis (Kernel PCA) and manifold learning may also be used; alternatively, a trained neural network model, such as the convolutional network model in the embodiments of the present application, may be used to extract the feature points of the hand.

Taking live video as an example, during a live video broadcast, the human hand is detected from the live video image and the feature points of the hand are determined; as another example, during playback of a recorded video, the human hand is detected from the played video image and the feature points of the hand are determined; as yet another example, during the recording of a video, the human hand is detected from the recorded video image and the feature points of the hand are determined; and so on.

In step S450, the presentation position of the business object to be displayed in the video image is determined according to the feature points of the hand.

In implementation, after the feature points of the hand are determined, one or more presentation positions of the business object to be displayed in the video image may be determined based on those feature points.

In this embodiment, when determining the presentation position of the business object to be displayed in the video image according to the feature points of the hand, optional implementations may include, for example:

Mode one: according to the feature points of the hand, a pre-trained third convolutional network model for detecting the presentation position of the business object from the video image is used to determine, in the video image, the presentation position of the business object to be displayed corresponding to the hand position. Mode two: according to the feature points of the hand and the type of the business object to be displayed, the presentation position in the video image of the business object to be displayed corresponding to the detected gesture is determined.

The two modes are described below by way of example.

Mode One

When mode one is used to determine the presentation position of the business object to be displayed in the video image, a convolutional network model, namely the third convolutional network model, may be trained in advance; the trained third convolutional network model has the function of determining the presentation position of the business object in the video image. Alternatively, a convolutional network model that has already been trained by a third party and has the function of determining the presentation position of the business object in the video image may be used directly.

It should be noted that, in this embodiment, the training of the business object is taken as an example for description, but those skilled in the art should understand that the third convolutional network model may also train the hand while training the business object, realizing joint training of the hand and the business object.

When training the third convolutional network model, an optional training method includes the following process:

(1) Obtain a feature vector of a business object sample image to be trained.

The feature vector contains position information and/or confidence information of the business object in the business object sample image. The confidence information of a business object indicates the probability that the business object, when displayed at the current position, achieves a desired effect (such as being noticed, clicked, or viewed); this probability may be set according to statistical analysis of historical data, according to the results of simulation experiments, or according to manual experience. In practical applications, according to actual needs, only the position information of the business object may be trained, only the confidence information may be trained, or both may be trained. Training both enables the trained third convolutional network model to determine the position information and confidence information of the business object more effectively and accurately, so as to provide a basis for displaying the business object.

The third convolutional network model is trained with at least one sample image. In the embodiments of the present application, the third convolutional network model may be trained using business object sample images containing business objects; those skilled in the art should understand that, in addition to business objects, the business object sample images used for training may also contain hand information. In addition, the business objects in the business object sample images in the embodiments of the present application may be pre-annotated with position information, confidence information, or both. Of course, in practical applications, this information may also be obtained by other means. By annotating the business objects with the corresponding information in advance, the data and the number of interactions required for data processing can be effectively reduced, improving data processing efficiency.

A business object sample image carrying the position information and/or confidence information of the business object is taken as a training sample, and feature vector extraction is performed on it to obtain a feature vector containing the position information and/or confidence information of the business object.

Optionally, the third convolutional network model may be used to train the hand and the business object simultaneously; in this case, the feature vector of the business object sample image also contains hand features.

The extraction of the feature vector may be implemented in any appropriate manner in the related art, and details are not repeated here.

(2) Perform convolution processing on the feature vector to obtain a feature vector convolution result.

In this embodiment, the obtained feature vector convolution result contains the position information and/or confidence information of the business object. In the case of joint training of the hand and the business object, the feature vector convolution result also contains hand information.

The number of convolution operations applied to the feature vector may be set according to actual needs; that is, in the third convolutional network model, the number of convolutional layers may be configured according to actual needs, and details are not repeated here.

The feature vector convolution result is the result of feature extraction performed on the feature vector, and it can effectively characterize the hand features in the video image.

In the embodiments of the present application, when the feature vector contains both the position information and the confidence information of the business object, that is, when both the position information and the confidence information of the business object are trained, the feature vector convolution result is shared by the subsequent convergence condition judgments, so that no repeated processing or computation is required, which helps reduce the resource consumption caused by data processing and improves data processing speed and efficiency.

(3) Judge whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies a convergence condition.

The convergence condition is set appropriately by those skilled in the art according to actual needs. When the information satisfies the convergence condition, the network parameters of the third convolutional network model may be considered appropriately set; when the information does not satisfy the convergence condition, the network parameters may be considered inappropriately set and in need of adjustment. This adjustment may be an iterative process, continuing until the result of convolving the feature vector with the adjusted network parameters satisfies the convergence condition.

In an optional manner, the convergence condition may be set according to a preset standard position and/or a preset standard confidence. For example, the convergence condition for the position information of the business object may be that the distance between the position indicated by the position information of the business object in the feature vector convolution result and the preset standard position satisfies a certain threshold; the convergence condition for the confidence information of the business object may be that the difference between the confidence indicated by the confidence information of the business object in the feature vector convolution result and the preset standard confidence satisfies a certain threshold; and so on.

Optionally, the preset standard position may be an average position obtained by averaging the positions of the business objects in the business object sample images to be trained; the preset standard confidence may be an average confidence obtained by averaging the confidences of the business objects in the business object sample images to be trained. Since the sample images are the samples to be trained and the amount of data is large, the standard position and/or standard confidence may be set according to the positions and/or confidences of the business objects in the business object sample images to be trained, which makes the standard position and standard confidence thus set more objective and accurate.

When judging whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the convergence condition, an optional approach includes:

obtaining the position information of the corresponding business object in the feature vector convolution result, computing the Euclidean distance between the position indicated by that position information and the preset standard position to obtain a first distance between the two, and judging according to the first distance whether the position information of the corresponding business object satisfies the convergence condition;

and/or,

obtaining the confidence information of the corresponding business object in the feature vector convolution result, computing the Euclidean distance between the confidence indicated by that confidence information and the preset standard confidence to obtain a third distance between the two, and judging according to the third distance whether the confidence information of the corresponding business object satisfies the convergence condition. Using the Euclidean distance is simple to implement and can effectively indicate whether the convergence condition is satisfied. However, the embodiments of the present application are not limited thereto; other measures such as the Mahalanobis distance or the Bhattacharyya distance may also be used.

Optionally, as described above, the preset standard position is an average position obtained by averaging the positions of the business objects in the business object sample images to be trained; and/or the preset standard confidence is an average confidence obtained by averaging the confidences of the business objects in the business object sample images to be trained.

(4) If the convergence condition is satisfied, for example, the distance between the position indicated by the position information of the business object in the feature vector convolution result and the preset standard position satisfies a certain threshold, and the difference between the confidence indicated by the confidence information of the business object in the feature vector convolution result and the preset standard confidence satisfies a certain threshold, the training of the convolutional network model is completed. If the convergence condition is not satisfied, for example, the distance between the position indicated by the position information of the business object in the feature vector convolution result and the preset standard position does not satisfy a certain threshold, and/or the difference between the confidence indicated by the confidence information of the business object and the preset standard confidence does not satisfy a certain threshold, the network parameters of the third convolutional network model are adjusted according to the position information and/or confidence information of the corresponding business object in the feature vector convolution result, and the third convolutional network model is iteratively trained with the adjusted network parameters until the position information and/or confidence information of the business object after iterative training satisfies the convergence condition.

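A minimal numeric sketch of the convergence test in steps (3) and (4) might look as follows, assuming two-dimensional positions and scalar confidences; the threshold values and helper names are illustrative only, not values from the patent:

    import math

    def converged(pred_pos, std_pos, pred_conf, std_conf,
                  pos_threshold=5.0, conf_threshold=0.05):
        """Return True when both the position and confidence branches converge.

        first_distance: Euclidean distance between the predicted position and
        the preset standard position. third_distance: distance between the
        predicted confidence and the preset standard confidence, as described
        in the text above.
        """
        first_distance = math.hypot(pred_pos[0] - std_pos[0],
                                    pred_pos[1] - std_pos[1])
        third_distance = abs(pred_conf - std_conf)
        return first_distance <= pos_threshold and third_distance <= conf_threshold

    # Standard position/confidence as averages over the training annotations:
    positions = [(100, 120), (104, 118), (98, 125)]
    confidences = [0.8, 0.75, 0.9]
    std_pos = (sum(p[0] for p in positions) / len(positions),
               sum(p[1] for p in positions) / len(positions))
    std_conf = sum(confidences) / len(confidences)

When converged() returns False, the network parameters would be adjusted and training repeated, mirroring the iterative loop described in step (4).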
Through the above training of the third convolutional network model, the third convolutional network model can perform feature extraction and classification on the presentation positions of business objects displayed based on the hand, and thus has the function of determining the presentation position of a business object in a video image. When there are multiple presentation positions, through the above training of the business object confidence, the third convolutional network model can also determine the ranking of the display effects among the multiple presentation positions, and determine the final presentation position based on that ranking. When a business object is displayed in a subsequent application, an effective presentation position can be determined from the current image in the video.

In addition, before the above training of the third convolutional network model, the business object sample images may be pre-processed, including: obtaining a plurality of business object sample images, each containing annotation information of a business object; determining the position of the business object according to the annotation information, and judging whether the distance between the determined position of the business object and a preset position is less than or equal to a set threshold; and determining the business object sample images whose business objects satisfy this condition as the business object sample images to be trained. Both the preset position and the set threshold may be set appropriately by those skilled in the art in any suitable manner, for example according to statistical analysis of data, a relevant distance calculation formula, or manual experience, which is not limited in the embodiments of the present application.

By pre-processing the business object sample images in advance, sample images that do not meet the conditions can be filtered out, ensuring the accuracy of the training results.

The training of the third convolutional network model is accomplished through the above process, and the trained third convolutional network model can be used to determine the presentation position of a business object in a video image. For example, during a live video broadcast, if the anchor clicks a business object to indicate that the business object is to be displayed, then after the third convolutional network model obtains the hand feature points of the anchor in the live video image, it can indicate the position at which to display the business object, such as the anchor's forehead position, and control the live application to display the business object at that position; alternatively, during a live video broadcast, if the anchor clicks a business object to indicate that the business object is to be displayed, the third convolutional network model can determine the presentation position of the business object directly from the live video image.

Mode Two

The presentation position of the business object to be displayed corresponding to the hand position is determined in the video image according to the feature points of the hand and the type of the business object to be displayed.

In implementation, after the feature points of the hand are obtained, the presentation position of the business object to be displayed may be determined according to set rules. The determined presentation position of the business object to be displayed includes, for example, any one or more of the following: the palm region of a person in the video image, the region above the palm, the region below the palm, the background region of the palm, a body region other than the hand, the background region in the video image, a region within a set range centered on the region where the hand is located in the video image, a preset region in the video image, and so on.

After the presentation position is determined, the presentation position of the business object to be displayed in the video image can be fixed. For example, the business object may be displayed with the center point of the presentation region corresponding to the presentation position as the center point of the business object's display; as another example, a certain coordinate position in the presentation region corresponding to the presentation position may be determined as the center point of the presentation position, and so on, which is not limited in the embodiments of the present application.

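By way of illustration only, the rule-based placement just described could be sketched as follows, assuming the hand feature points are available as pixel coordinates; the region names and the offset value are hypothetical:

    def presentation_center(hand_points, region="above_palm", offset=40):
        """Pick a display center from hand feature points under a simple rule.

        hand_points: list of (x, y) feature points of the detected hand.
        region: one of "palm", "above_palm", "below_palm" in this sketch.
        """
        xs = [p[0] for p in hand_points]
        ys = [p[1] for p in hand_points]
        cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)   # palm center estimate
        if region == "above_palm":
            return cx, min(ys) - offset                  # above the topmost point
        if region == "below_palm":
            return cx, max(ys) + offset
        return cx, cy                                    # default: palm region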
In an optional implementation, when determining the presentation position of the business object to be displayed in the video image, the determination is made not only according to the feature points of the hand but also according to the type of the business object to be displayed. In the embodiments of the present application, the type of the business object includes, for example but not limited to, any one or more of the following: forehead patch type, cheek patch type, chin patch type, virtual hat type, virtual clothing type, virtual makeup type, virtual headwear type, virtual hair accessory type, and virtual jewelry type; in addition, it may also include a virtual bottle cap type, a virtual cup type, a text type, and so on.

In addition, depending on the type of the business object, an appropriate presentation position may be selected for the business object with the feature points of the hand and the hand position as references.

Furthermore, in the case where multiple presentation positions of the business object to be displayed in the video image are obtained according to the feature points of the hand and the type of the business object to be displayed, at least one presentation position may be selected from the multiple presentation positions as the presentation position of the business object to be displayed in the video image. For example, a text-type business object may be displayed in the background region, or in the palm region of the person, the region above the hand, and so on.

In addition, a correspondence between gestures and presentation positions may be stored in advance; when it is determined that a detected gesture matches the corresponding predetermined gesture, the target presentation position corresponding to the predetermined gesture may be obtained from the pre-stored correspondence between gestures and presentation positions as the presentation position of the business object to be displayed in the video image. It should be noted that, although the above correspondence between gestures and presentation positions may exist, there is no necessary relationship between the gesture and the presentation position; the gesture is merely one way to trigger the display of the business object, and the presentation position is not necessarily tied to the human hand either. That is, the business object may be displayed in some region of the hand, or in a region other than the hand, such as the background region of the video image. In addition, the same gesture may trigger the display of different business objects; for example, if the anchor makes a waving gesture twice in succession, the first gesture may display a two-dimensional sticker effect and the second gesture may display a three-dimensional effect, and the content, such as advertisements, corresponding to the two effects may be the same or different.

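A trivial sketch of such a pre-stored correspondence follows; the gesture names and region labels are illustrative assumptions, not a mapping taken from the patent:

    # Pre-stored gesture -> presentation-position correspondence.
    GESTURE_TO_POSITION = {
        "palm_open": "above_palm",
        "thumbs_up": "background",
        "ok_sign": "preset_region",
    }

    def target_position(detected_gesture, default="background"):
        """Look up the target presentation position for a matched gesture."""
        return GESTURE_TO_POSITION.get(detected_gesture, default)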
In an optional example, steps S440 to S450 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by the presentation position determining module 602 run by the processor.

In step S460, the business object to be displayed is drawn at the presentation position by means of computer graphics.

When the business object is a two-dimensional sticker effect containing semantic information, the sticker may be used for advertisement placement and display. Before the business object is drawn, related information of the business object, such as its identifier and size, may first be obtained. After the presentation position is determined, the business object may be adjusted by scaling, rotation, and the like according to the coordinates of the presentation position, and then drawn by a corresponding drawing method, such as the drawing method of an Open Graphics Library (OpenGL) graphics rendering engine. In some cases, the advertisement may also be displayed in the form of three-dimensional effects, for example displaying the text or logo (LOGO) of the advertisement through particle effects. For example, displaying the name of a product through a two-dimensional sticker effect of the virtual bottle cap type attracts viewers and improves the efficiency of advertisement placement and display.

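As a rough sketch of the drawing step, the following uses plain NumPy alpha blending in place of an OpenGL rendering engine; the RGBA layout and the assumption that the sticker lies fully inside the frame are simplifications made here, not requirements of the patent:

    import numpy as np

    def draw_sticker(frame, sticker_rgba, center):
        """Alpha-blend a 2-D sticker onto a video frame at the presentation position.

        frame: HxWx3 uint8 video image; sticker_rgba: hxwx4 uint8 with alpha.
        center: (x, y) center of the presentation position. Assumes the sticker
        region lies entirely within the frame (no boundary clipping here).
        """
        h, w = sticker_rgba.shape[:2]
        x0 = int(center[0] - w // 2)
        y0 = int(center[1] - h // 2)
        roi = frame[y0:y0 + h, x0:x0 + w].astype(np.float32)
        rgb = sticker_rgba[:, :, :3].astype(np.float32)
        alpha = sticker_rgba[:, :, 3:4].astype(np.float32) / 255.0   # per-pixel alpha
        frame[y0:y0 + h, x0:x0 + w] = (alpha * rgb + (1 - alpha) * roi).astype(np.uint8)
        return frame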
In an optional example, step S460 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by the business object drawing module 603 run by the processor.

In the gesture control method provided by the embodiments of the present application, the display of a business object is triggered by a gesture. When the business object is used to display an advertisement, on the one hand, the business object to be displayed is drawn by means of computer graphics at the determined presentation position; the business object is thus combined with video playback, so that no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client. On the other hand, the business object is closely combined with the gestures in the video image, which preserves the main image and actions of the video subject (such as an anchor) in the video image, adds interest to the video image, avoids disturbing the user's normal viewing of the video, helps reduce the user's aversion to the business objects displayed in the video image, can attract viewers' attention to a certain extent, and improves the influence of the business object.

FIG. 5 is a flowchart of yet another embodiment of the gesture control method of the present application. As shown in FIG. 5, the gesture control method of this embodiment includes:

In step S501, the first convolutional network model is trained according to sample images containing human hand annotation information, and prediction information of the first convolutional network model for the human hand candidate regions of the sample images is obtained.

In an optional example, step S501 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by the human hand region determination module 604 run by the processor.

In step S502, the prediction information of the human hand candidate regions is corrected.

In an optional example, step S502 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by the correction module 605 run by the processor.

In step S503, the second convolutional network model is trained according to the corrected prediction information of the human hand candidate regions and the sample images.

The second convolutional network model and the first convolutional network model share a feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model.

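Keeping the shared feature extraction layer fixed while training the second network could be expressed with the following PyTorch-style sketch, under the assumption that the shared trunk is an nn.Module attribute named features on a detector object, as in the earlier sketch:

    import torch

    # Freeze the shared feature extraction layer before training the second network.
    for param in detector.features.parameters():
        param.requires_grad = False        # shared parameters stay unchanged

    # Only the remaining (e.g., gesture-head) parameters are updated by the optimizer.
    optimizer = torch.optim.SGD(
        (p for p in detector.parameters() if p.requires_grad), lr=0.01)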
In an optional example, step S503 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by the convolutional model training module 606 run by the processor.

For the content of the above steps S501 to S503, reference may be made to the related content in the foregoing embodiments of the present application, and details are not repeated here.

In step S504, a feature vector of a business object sample image to be trained is obtained.

The feature vector contains the position information and/or confidence information of the business object in the business object sample image, as well as the feature vector corresponding to the gesture. The business object sample image to be trained may be one of the above sample images containing human hand annotation information.

In this embodiment, some of the business object sample images may not meet the training standards of the third convolutional network model, and these sample images may be filtered out by pre-processing the business object sample images.

In this embodiment, a business object sample image contains a business object, and the business object is annotated with position information and confidence information. In an optional implementation, the position information of the center point of the business object is taken as the position information of the business object. In this step, the sample images are filtered according to the position information of the business object. After the coordinates of the position indicated by the position information are obtained, those coordinates are compared with the preset position coordinates of that type of business object, and the position variance of the two is computed. If the position variance is less than or equal to a set threshold, the business object sample image may serve as a sample image to be trained; if the position variance is greater than the set threshold, the business object sample image is filtered out. The preset position coordinates and the set threshold may be set appropriately by those skilled in the art according to the actual situation; for example, since the images generally used for training the third convolutional network model have the same size, the threshold may be set to 1/20 to 1/5 of the image length or width, and optionally to 1/10 of the image length or width.

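A simple sketch of this filtering step follows; the data layout, the use of Euclidean distance as the deviation measure, and the helper names are illustrative assumptions, while the 1/10 default ratio mirrors the optional value in the text above:

    def filter_samples(samples, preset_xy, image_size, ratio=0.1):
        """Keep samples whose annotated business-object center stays near preset_xy.

        samples: list of dicts like {"image": ..., "center": (x, y), "confidence": c}.
        image_size: (width, height); threshold = ratio * max(width, height).
        """
        threshold = ratio * max(image_size)
        kept = []
        for s in samples:
            dx = s["center"][0] - preset_xy[0]
            dy = s["center"][1] - preset_xy[1]
            if (dx * dx + dy * dy) ** 0.5 <= threshold:   # deviation test
                kept.append(s)
        return kept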
In addition, the positions and confidences of the business objects in the determined business object sample images to be trained may be averaged to obtain an average position and an average confidence, which can serve as the basis for subsequently determining the convergence condition.

Taking a two-dimensional sticker effect as an example of the business object, the business object sample images used for training in this embodiment are annotated with the coordinates of the advertisement position and the confidence of that advertisement slot. The advertisement position may be annotated on the hand, on the foreground and background, and elsewhere, so that joint training of the advertisement slots at the hand feature points, the foreground and background, and other places can be realized; compared with a scheme trained separately on the hand alone, this scheme helps save computational resources. The magnitude of the confidence indicates the probability that the advertisement slot is a good one; for example, if the advertisement slot is heavily occluded, the confidence is low.

In step S505, convolution processing is performed on the feature vector to obtain a feature vector convolution result.

In step S506, it is judged whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the convergence condition.

In step S507, if the convergence conditions in step S506 are all satisfied, the training of the third convolutional network model is completed; if the convergence conditions in step S506 are not satisfied, or not all satisfied, the network parameters of the third convolutional network model are adjusted according to the position information and/or confidence information of the corresponding business object in the feature vector convolution result, and the third convolutional network model is iteratively trained with the adjusted network parameters until the position information and/or confidence information of the business object after iterative training satisfies the corresponding convergence condition.

For the processing of the above steps S504 to S507, reference may be made to the related content in the foregoing embodiments of the present application, and details are not repeated here.

The trained third convolutional network model can be obtained through the processing of the above steps S504 to S507. For the structure of the third convolutional network model, reference may be made to the structure of the first or second convolutional network model in the embodiments shown in FIG. 2 or FIG. 3 of the present application, and details are not repeated here.

The first, second, and third convolutional network models obtained through the above training can perform the corresponding processing on video images, which may specifically include the following steps S508 to S513.

In an optional example, steps S504 to S507 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by a third training module in the gesture control apparatus run by the processor.

In step S508, a currently played video image is acquired.

In step S509, the video image is detected using the pre-trained first convolutional network to obtain first feature information of the video image and prediction information of the human hand candidate regions.

In step S510, the first feature information and the prediction information of the human hand candidate regions are taken as second feature information of the pre-trained second convolutional network model, and the second convolutional network model performs gesture detection on the video image according to the second feature information to obtain a gesture detection result for the video image.

In the case where it is determined, after human hand candidate region detection, that the video image contains a human hand, the gesture in the human hand candidate region may be determined in the form of probabilities. For example, taking the palm-open gesture and the palm-closed gesture as examples, when the probability of the palm-open gesture is high, the video image may be considered to contain a human hand making the palm-open gesture; when the probability of the palm-closed gesture is high, the video image may be considered to contain a human hand making the palm-closed gesture.

Furthermore, in an optional implementation of the embodiments of the present application, the output of the second convolutional network model may include: the probability that a human hand candidate region contains no human hand, the probability that a human hand candidate region contains a hand making the palm-open gesture, the probability that a human hand candidate region contains a hand making the palm-closed gesture, and so on.

To improve detection speed, in the case where the parameters of the first convolutional layers are consistent with those of the second convolutional layers, when the second convolutional network model obtains the gesture detection result for the video image according to the human hand candidate regions and the features of the various predetermined gestures, it may directly take the first features of the video image extracted by the multiple first convolutional layers as the second features of the human hand candidate regions extracted by the multiple second convolutional layers, and then, according to those second features, classify the human hand candidate regions through multiple fully connected layers to obtain the gesture detection result for the video image, which saves computation and improves detection speed.

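The probabilistic output and feature reuse described above could be sketched as follows, assuming a detector like the earlier shared-backbone sketch whose gesture head happens to be sized to this illustrative three-class set ("no hand", palm open, palm closed):

    import torch

    CLASSES = ["no_hand", "palm_open", "palm_closed"]   # illustrative class set

    def classify_candidate(detector, image):
        """Reuse the first network's features and return per-gesture probabilities."""
        with torch.no_grad():
            feat = detector.features(image).flatten(1)   # features computed once
            logits = detector.gesture_head(feat)         # fully connected classification
            probs = torch.softmax(logits, dim=1)[0]
        return {name: float(p) for name, p in zip(CLASSES, probs)}

The gesture with the highest probability would then be matched against the predetermined gestures, as in the palm-open/palm-closed example above.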
In an optional example, steps S508 to S510 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by the gesture detection module 601 run by the processor.

In step S511, when it is determined that the detected hand gesture matches the corresponding predetermined gesture, the feature points of the hand within the human hand candidate region corresponding to the detected gesture are extracted.

In step S512, according to the feature points of the hand, the pre-trained third convolutional network model for determining the presentation position of the business object in the video image is used to determine, in the video image, the presentation position of the business object to be displayed corresponding to the hand position.

In an optional example, steps S511 to S512 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by the presentation position determining module 602 run by the processor.

In step S513, the business object to be displayed is drawn at the presentation position by means of computer graphics.

In an optional example, step S513 may be performed by a processor invoking corresponding instructions stored in a memory, or may be performed by the business object drawing module 603 run by the processor.

With the rise of Internet live streaming and short video sharing, more and more videos appear in the form of live broadcasts or short videos. Such videos often feature people as the protagonists (a single person or a small number of people), with people against simple backgrounds as the main scenes, and viewers mainly watch them on mobile terminals such as mobile phones. In this case, for the placement of certain business objects (such as advertisement placement), the solution provided by this embodiment can detect the video images during video playback in real time and give advertisement placement positions with good effect, without affecting the user's viewing experience, so the placement effect is better. Drawing the business object to be displayed at the presentation position by means of computer graphics combines the business object with video playback, so that no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client. In addition, the business object is closely combined with the gestures in the video image, which preserves the main image and actions of the video subject (such as an anchor) in the video image, adds interest to the video image, avoids disturbing the user's normal viewing of the video, helps reduce the user's aversion to the business objects displayed in the video image, can attract viewers' attention to a certain extent, and improves the influence of the business object. It can be understood that, besides advertising, the placement of business objects can also be widely applied to other fields, such as education, consulting, and services, where interaction effects and user experience can be improved by placing entertaining, appreciative, and other business information.

Any of the image processing methods provided by the embodiments of the present application may be performed by any appropriate device having data processing capability, including but not limited to a terminal device, a server, and the like. Alternatively, any of the image processing methods provided by the embodiments of the present application may be executed by a processor; for example, the processor executes any of the image processing methods mentioned in the embodiments of the present application by invoking corresponding instructions stored in a memory. This will not be repeated below.

Those of ordinary skill in the art will understand that all or part of the steps implementing the above method embodiments may be accomplished by hardware related to program instructions; the aforementioned program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

FIG. 6 is a structural block diagram of an embodiment of the gesture control apparatus of the present application. The gesture control apparatus of the embodiments of the present application can be used to implement the above gesture control method embodiments of the present application. Referring to FIG. 6, the gesture control apparatus of this embodiment includes: a gesture detection module 601, a presentation position determining module 602, and a business object drawing module 603.

The gesture detection module 601 is configured to perform gesture detection on the currently played video image.

The presentation position determining module 602 is configured to determine the presentation position of the business object to be displayed in the video image when it is detected that the gesture matches a predetermined gesture.

The business object drawing module 603 is configured to draw the business object at the presentation position by means of computer graphics.

The gesture control apparatus provided by this embodiment performs human hand candidate region detection and gesture detection on the currently played video image containing hand information, matches the detected gesture against the corresponding predetermined gesture, and, when the two match, determines the presentation position of the business object to be displayed in the video image from the hand position. When the business object is used to display an advertisement, on the one hand, the business object to be displayed is drawn by means of computer graphics at the determined presentation position; the business object is thus combined with video playback, so that no additional advertisement video data unrelated to the video needs to be transmitted over the network, which helps save network resources and/or system resources of the client. On the other hand, the business object is closely combined with the gestures in the video image, which preserves the main image and actions of the video subject (such as an anchor) in the video image, adds interest to the video image, avoids disturbing the user's normal viewing of the video, helps reduce the user's aversion to the business objects displayed in the video image, can attract viewers' attention to a certain extent, and improves the influence of the business object.

图7是本申请手势控制装置另一实施例的结构框图。如图7所示,与图6所示实施例相比,本实施例的手势控制装置中,展现位置确定模块602包括:特征点提取单元,用于提取与检测到的手势相应的人手候选区域内手部的特征点;展现位置确定单元,用于根据手部的特征点,确定与检测到的手势相应的待显示的业务对象在视频图像中的展现位置。FIG. 7 is a structural block diagram of another embodiment of the gesture control apparatus of the present application. As shown in FIG. 7 , in comparison with the embodiment shown in FIG. 6 , in the gesture control apparatus of the present embodiment, the presentation position determining module 602 includes: a feature point extraction unit, configured to extract a human hand candidate region corresponding to the detected gesture. a feature point of the inner hand; a presentation position determining unit, configured to determine, according to the feature point of the hand, a display position of the business object to be displayed corresponding to the detected gesture in the video image.

可选地,展现位置确定模块602,用于根据手部的特征点和待显示的业务对象的类型,确定与检测到的手势相应的待显示的业务对象在视频图像中的展现位置。Optionally, the presentation location determining module 602 is configured to determine, according to the feature point of the hand and the type of the business object to be displayed, a presentation location of the business object to be displayed corresponding to the detected gesture in the video image.

可选地,展现位置确定模块602,用于根据手部的特征点和待显示的业务对象的类型,确定与检测到的手势相应的待显示的业务对象在视频图像中的多个展现位置;从多个展现位置中选择至少一个展 现位置。Optionally, the presentation location determining module 602 is configured to determine, according to the feature point of the hand and the type of the business object to be displayed, a plurality of presentation locations of the business object to be displayed corresponding to the detected gesture in the video image; Select at least one exhibition from multiple presentation locations Current location.

可选地,展现位置确定模块602,用于当确定检测到的手势与对应的预定手势相匹配时,确定与预定手势相应的待显示的业务对象在视频图像中的展现位置作为与检测到的手势相应的待显示的业务对象在视频图像中的展现位置。Optionally, the presentation location determining module 602 is configured to determine, when the determined gesture is matched with the corresponding predetermined gesture, a presentation position of the business object to be displayed corresponding to the predetermined gesture in the video image as the detected The position of the corresponding business object to be displayed in the video image.

可选地,展现位置确定模块602,用于从预先存储的手势与展现位置的对应关系中,获取预定手势对应的目标展现位置作为与检测到的手势相应的待显示的业务对象在视频图像中的展现位置。Optionally, the presentation location determining module 602 is configured to obtain, from a correspondence between the pre-stored gesture and the presentation location, a target presentation location corresponding to the predetermined gesture as a business object to be displayed corresponding to the detected gesture in the video image. Show position.

Optionally, the business object may include a video, an image, or a special effect containing semantic information, and the video image may include a static image or a live-streaming video image.

Optionally, the special effect containing semantic information includes advertisement information in any one or more of the following forms: a two-dimensional sticker effect, a three-dimensional effect, a particle effect, and the like.

Optionally, the presentation position may include any one or more of the following regions: the hair region, forehead region, cheek region, or chin region of a person in the video image; a body region other than the head; a background region in the video image; a region within a set range centered on the region where the hand is located in the video image; a preset region in the video image; and the like.

Optionally, the type of the business object may include any one or more of the following: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, and a virtual jewelry type.

Optionally, the gesture or the predetermined gesture may include any one or more of: waving, a scissors hand, a fist, an upturned palm, a heart-shaped hand, applause, an open palm, a closed palm, a thumbs-up, a pistol pose, a V sign, and an OK sign.

Optionally, the gesture detection module 601 is configured to detect the video image using a pre-trained first convolutional network to obtain first feature information of the video image and prediction information of human-hand candidate regions, the first feature information including hand feature information; and to use the first feature information and the prediction information of the human-hand candidate regions as second feature information of a pre-trained second convolutional network model, and to perform gesture detection on the video image with the second convolutional network model according to the second feature information, obtaining a gesture detection result of the video image; the second convolutional network model and the first convolutional network model share a feature extraction layer.
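
To make the shared-feature-extraction structure concrete, the following PyTorch sketch is one possible reading of it; the layer sizes, the 7x7 pooling, and the per-location hand score standing in for the candidate-region predictor are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class GestureDetector(nn.Module):
    def __init__(self, num_gestures=12):
        super().__init__()
        # Feature extraction layers shared by the first and second networks.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # First network: a per-location hand/non-hand score standing in for
        # the hand candidate region prediction.
        self.hand_score = nn.Conv2d(128, 1, 1)
        # Second network: gesture classifier over pooled candidate features.
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 7 * 7, 256), nn.ReLU(),
            nn.Linear(256, num_gestures),
        )

    def forward(self, image, candidate_boxes):
        feats = self.backbone(image)                      # first feature information
        hand_map = torch.sigmoid(self.hand_score(feats))  # hand-region prediction
        # Pool the shared features inside each candidate box (x1, y1, x2, y2
        # in image coordinates); the backbone above has a stride of 4.
        pooled = roi_align(feats, candidate_boxes, output_size=(7, 7),
                           spatial_scale=0.25)
        return hand_map, self.classifier(pooled)

# Example: one 224x224 frame with a single candidate box.
frame = torch.randn(1, 3, 224, 224)
boxes = [torch.tensor([[32.0, 32.0, 96.0, 96.0]])]
hand_map, gesture_logits = GestureDetector()(frame, boxes)
```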

Optionally, referring again to FIG. 7, still another embodiment of the gesture control apparatus of the present application may further include: a human-hand region determining module 604, configured to train the first convolutional network model according to sample images containing human-hand annotation information, to obtain prediction information of the first convolutional network model for human-hand candidate regions of the sample images; a correction module 605, configured to correct the prediction information of the human-hand candidate regions; and a convolution model training module 606, configured to train the second convolutional network model according to the corrected prediction information of the human-hand candidate regions and the sample images, where the second convolutional network model and the first convolutional network model share a feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model.

Alternatively, still another embodiment based on the embodiment shown in FIG. 6 of the present application may further include: a human-hand region determining module 604, configured to train a first convolutional neural network according to sample images containing human-hand annotation information, to obtain prediction information of the first convolutional neural network for human-hand candidate regions of the sample images; a parameter replacement module, configured to replace the second feature extraction layer parameters of a second convolutional neural network used for detecting gestures with the first feature extraction layer parameters of the trained first convolutional neural network; and a second training module, configured to train the second convolutional neural network parameters according to the prediction information of the human-hand candidate regions and the sample images, keeping the second feature extraction layer parameters unchanged during training. Optionally, the second training module may include: a correction module 605, configured to correct the prediction information of the human-hand candidate regions; and a convolution model training module 606, configured to train the second convolutional neural network parameters according to the corrected prediction information of the human-hand candidate regions and the sample images, keeping the second feature extraction layer parameters unchanged during training.
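
The two training schemes just described (sharing the feature extraction layer, or replacing the second network's feature-layer parameters with the first network's) both come down to copying parameters across and then holding them fixed. A minimal PyTorch sketch under that reading follows; the module names and layer shapes are hypothetical.

```python
import torch.nn as nn

def make_feature_layers():
    # Stand-in for the feature extraction layers shared by both networks.
    return nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())

first_features = make_feature_layers()    # trained as part of the first network
second_features = make_feature_layers()   # feature layers of the gesture network

# Parameter replacement: copy the trained parameters across, then freeze them
# so the replaced layer stays unchanged while the rest of the second network
# is trained.
second_features.load_state_dict(first_features.state_dict())
for p in second_features.parameters():
    p.requires_grad_(False)
```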

Optionally, the human-hand annotation information may include annotation information of the human-hand region and/or annotation information of the gesture.

Optionally, the first convolutional neural network may include a first input layer, a first feature extraction layer, and a first classification output layer, the first classification output layer being configured to predict whether a plurality of candidate regions of a sample image are human-hand candidate regions.

Optionally, the second convolutional neural network may include a second input layer, a second feature extraction layer, and a second classification output layer, the second classification output layer being configured to output the gesture detection result of a sample image.

Optionally, the correction module 605 is configured to input a plurality of supplementary negative sample images and the prediction information of the human-hand candidate regions into a third convolutional neural network for classification, so as to filter out negative samples in the human-hand candidate regions and obtain corrected prediction information of the human-hand candidate regions.
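
A hedged Python sketch of this correction step follows, assuming the third network outputs a single hand/non-hand logit per candidate crop; the function name, threshold, and tensor layout are illustrative assumptions.

```python
import torch

def correct_candidates(third_net, crops, boxes, threshold=0.5):
    """Keep only the candidate boxes the third network still classifies as hands.

    crops: [K, 3, H, W] image patches cut from the candidate regions;
    boxes: [K, 4] candidate-box coordinates; third_net outputs one logit per crop.
    """
    with torch.no_grad():
        scores = torch.sigmoid(third_net(crops)).squeeze(1)  # P(hand) per crop
    keep = scores > threshold  # filter out candidates classified as negatives
    return boxes[keep]
```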

Optionally, the difference between the number of human-hand candidate regions in the prediction information of the human-hand candidate regions and the number of supplementary negative sample images is within a predetermined allowable range. For example, the number of human-hand candidate regions in the prediction information may be equal to the number of supplementary negative sample images.

Optionally, the first convolutional neural network may include a region proposal network (RPN), and the second convolutional neural network may include a fast region-based convolutional neural network (FRCNN).

Optionally, the third convolutional neural network may include an FRCNN.

Optionally, the presentation position determining module 602 is configured to determine, through the detected gesture and a pre-trained third convolutional network model for detecting presentation positions of business objects from video images, the presentation position of the business object to be displayed corresponding to the detected gesture.

FIG. 8 is a structural block diagram of an embodiment of an electronic device of the present application. The embodiments of the present application do not limit the specific implementation of the electronic device. As shown in FIG. 8, the electronic device may include: a processor 802, a communications interface 804, a memory 806, and a communication bus 808, where:

the processor 802, the communications interface 804, and the memory 806 communicate with one another via the communication bus 808.

The communications interface 804 is configured to communicate with network elements of other devices, such as other clients or servers.

The processor 802 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present application, or a graphics processing unit (GPU). The one or more processors included in the terminal device may be processors of the same type, such as one or more CPUs or one or more GPUs, or processors of different types, such as one or more CPUs together with one or more GPUs.

The memory 806 is configured to store at least one executable instruction that causes the processor 802 to perform operations corresponding to the gesture control method of any of the above embodiments of the present application. The memory 806 may include a high-speed random access memory (RAM), and may also include a non-volatile memory, for example, at least one disk memory.

FIG. 9 is a schematic structural diagram of an embodiment of an electronic device of the present application, suitable for implementing the terminal device or server of the embodiments of the present application. As shown in FIG. 9, the electronic device includes one or more processors, a communication part, and the like. The one or more processors include, for example, one or more central processing units (CPUs) 901 and/or one or more graphics processing units (GPUs) 913, and the processor may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 902 or executable instructions loaded from a storage section 908 into a random access memory (RAM) 903. The communication part 912 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card. The processor may communicate with the ROM 902 and/or the RAM 903 to execute the executable instructions, connect to the communication part 912 through a bus 904, and communicate with other target devices via the communication part 912, thereby completing operations corresponding to any gesture control method provided by the embodiments of the present application, for example: performing gesture detection on a currently played video image; when it is detected that the gesture matches a predetermined gesture, determining the presentation position of the business object to be displayed in the video image; and drawing the business object at the presentation position by computer graphics.

In addition, the RAM 903 may further store various programs and data required for the operation of the apparatus. The CPU 901, the ROM 902, and the RAM 903 are connected to one another through the bus 904. When the RAM 903 is present, the ROM 902 is an optional module. The RAM 903 stores executable instructions, or writes executable instructions into the ROM 902 at runtime, and the executable instructions cause the processor 901 to perform operations corresponding to the gesture control method described above. An input/output (I/O) interface 905 is also connected to the bus 904. The communication part 912 may be integrated, or may be configured with a plurality of sub-modules (for example, a plurality of IB network cards) connected to the bus.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a cathode-ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read therefrom is installed into the storage section 908 as needed.

It should be noted that the architecture shown in FIG. 9 is only one optional implementation. In specific practice, the number and types of the components in FIG. 9 may be selected, deleted, added, or replaced according to actual needs. Different functional components may also be implemented in separate or integrated configurations; for example, the GPU and the CPU may be provided separately, or the GPU may be integrated on the CPU, and the communication part may be provided separately or integrated on the CPU or the GPU, and so on. These alternative implementations all fall within the protection scope disclosed by the present application.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product that includes a computer program tangibly embodied on a machine-readable medium; the computer program contains program code for executing the method shown in the flowchart, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: an instruction for performing gesture detection on a currently played video image; an instruction for determining, when it is detected that the gesture matches a predetermined gesture, the presentation position of the business object to be displayed in the video image; and an instruction for drawing the business object at the presentation position by computer graphics.

In addition, an embodiment of the present application further provides a computer program including computer-readable code containing computer operating instructions; when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the steps of the gesture control method of any embodiment of the present application.

In addition, an embodiment of the present application further provides a computer-readable storage medium configured to store computer-readable instructions; when the instructions are executed, the operations of the steps of the gesture control method of any embodiment of the present application are implemented.

In the embodiments of the present application, for the specific implementation of each step when the computer program or the computer-readable instructions are executed, reference may be made to the corresponding descriptions of the corresponding steps and modules in the above embodiments, and details are not repeated here. Those skilled in the art may clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, reference may be made to the corresponding process descriptions in the foregoing method embodiments, and details are not repeated here.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another. Since the apparatus, device, program, and storage medium embodiments substantially correspond to the method embodiments, they are described relatively simply, and for related parts reference may be made to the description of the method embodiments.

The methods and apparatuses of the present application may be implemented in many ways. For example, the methods and apparatuses of the present application may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. According to the needs of the implementation, the components/steps described in the embodiments of the present application may be split into more components/steps, and two or more components/steps or partial operations of components/steps may be combined into new components/steps, so as to achieve the purposes of the embodiments of the present application. The above order of the steps of the method is for illustration only, and the steps of the method of the present application are not limited to the order specifically described above unless otherwise specifically stated. In addition, in some embodiments, the present application may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present application. Thus, the present application also covers a recording medium storing a program for executing the methods according to the present application.

It should be pointed out that, according to the needs of the implementation, the steps/components described in the present application may be split into more steps/components, and two or more steps/components or partial operations of steps/components may be combined into new steps/components, so as to achieve the purposes of the present application.

The above methods according to the present application may be implemented in hardware or firmware, or implemented as software or computer code that can be stored in a recording medium (such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disc), or implemented as computer code downloaded over a network that is originally stored in a remote recording medium or a non-transitory machine-readable medium and is to be stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or an FPGA). It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes a storage component (for example, a RAM, a ROM, or a flash memory) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the processing methods described herein are implemented. In addition, when a general-purpose computer accesses code for implementing the processing shown herein, the execution of the code converts the general-purpose computer into a dedicated computer for executing the processing shown herein.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of changes or replacements within the technical scope disclosed in the present application, and such changes or replacements shall all be covered by the protection scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (50)

1. A gesture control method, comprising: performing gesture detection on a currently played video image; when it is detected that the gesture matches a predetermined gesture, determining a presentation position of a business object to be displayed in the video image; and drawing the business object at the presentation position by computer graphics.
2. The method according to claim 1, wherein the determining a presentation position of the business object to be displayed in the video image comprises: extracting feature points of a hand in a human-hand candidate region corresponding to the detected gesture; and determining, according to the feature points of the hand, the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
3. The method according to claim 2, wherein the determining, according to the feature points of the hand, the presentation position in the video image of the business object to be displayed corresponding to the detected gesture comprises: determining, according to the feature points of the hand and the type of the business object to be displayed, the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
4. The method according to claim 3, wherein the determining, according to the feature points of the hand and the type of the business object to be displayed, the presentation position in the video image of the business object to be displayed corresponding to the detected gesture comprises: determining, according to the feature points of the hand and the type of the business object to be displayed, a plurality of presentation positions in the video image of the business object to be displayed corresponding to the detected gesture; and selecting at least one presentation position from the plurality of presentation positions as the presentation position in the video image of the business object to be displayed.
5. The method according to claim 1, wherein the determining a presentation position of the business object to be displayed in the video image comprises: obtaining, from a pre-stored correspondence between gestures and presentation positions, a target presentation position corresponding to the predetermined gesture as the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
6. The method according to claim 1, wherein the business object comprises a special effect containing semantic information, and the video image comprises a live-streaming video image.
7. The method according to claim 6, wherein the special effect containing semantic information comprises advertisement information in at least one of the following forms: a two-dimensional sticker effect, a three-dimensional effect, and a particle effect.
8. The method according to any one of claims 1-7, wherein the presentation position comprises any one or more of the following: a hair region, a forehead region, a cheek region, or a chin region of a person in the video image; a body region other than the head; a background region in the video image; a region within a set range centered on the region where the hand is located in the video image; and a preset region in the video image.
9. The method according to any one of claims 1-8, wherein the type of the business object comprises any one or more of the following: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, and a virtual jewelry type.
10. The method according to any one of claims 1-9, wherein the predetermined gesture comprises any one or more of the following: waving, a scissors hand, a fist, an upturned palm, a heart-shaped hand, applause, an open palm, a closed palm, a thumbs-up, a pistol pose, a V sign, and an OK sign.
11. The method according to any one of claims 1-10, wherein the performing gesture detection on a currently played video image comprises: detecting the video image using a pre-trained first convolutional network to obtain first feature information of the video image and prediction information of human-hand candidate regions, the first feature information comprising hand feature information; and using the first feature information and the prediction information of the human-hand candidate regions as second feature information of a pre-trained second convolutional network model, and performing gesture detection on the video image with the second convolutional network model according to the second feature information to obtain a gesture detection result of the video image, wherein the second convolutional network model and the first convolutional network model share a feature extraction layer.
12. The method according to claim 11, wherein before the performing gesture detection on a currently played video image, the method further comprises: training the first convolutional network model according to sample images containing human-hand annotation information, to obtain prediction information of the first convolutional network model for human-hand candidate regions of the sample images; correcting the prediction information of the human-hand candidate regions; and training the second convolutional network model according to the corrected prediction information of the human-hand candidate regions and the sample images, wherein the second convolutional network model and the first convolutional network model share a feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model.
13. The method according to claim 11, further comprising: training a first convolutional neural network according to sample images containing human-hand annotation information, to obtain prediction information of the first convolutional neural network for human-hand candidate regions of the sample images; replacing second feature extraction layer parameters of a second convolutional neural network used for detecting gestures with first feature extraction layer parameters of the trained first convolutional neural network; and training the second convolutional neural network parameters according to the prediction information of the human-hand candidate regions and the sample images, keeping the second feature extraction layer parameters unchanged during training.
14. The method according to claim 13, wherein the training the second convolutional neural network parameters according to the prediction information of the human-hand candidate regions and the sample images comprises: correcting the prediction information of the human-hand candidate regions; and training the second convolutional neural network parameters according to the corrected prediction information of the human-hand candidate regions and the sample images.
15. The method according to any one of claims 12-14, wherein the human-hand annotation information comprises annotation information of the human-hand region and/or annotation information of the gesture.
16. The method according to any one of claims 12-15, wherein the first convolutional neural network comprises a first input layer, a first feature extraction layer, and a first classification output layer, the first classification output layer being configured to predict whether a plurality of candidate regions of a sample image are human-hand candidate regions.
17. The method according to any one of claims 12-16, wherein the second convolutional neural network comprises a second input layer, a second feature extraction layer, and a second classification output layer, the second classification output layer being configured to output the gesture detection result of a sample image.
18. The method according to claim 12 or any one of claims 14-17, wherein the correcting the prediction information of the human-hand candidate regions comprises: inputting a plurality of supplementary negative sample images and the prediction information of the human-hand candidate regions into a third convolutional neural network for classification, so as to filter out negative samples in the human-hand candidate regions and obtain corrected prediction information of the human-hand candidate regions.
19. The method according to claim 18, wherein the difference between the number of human-hand candidate regions in the prediction information of the human-hand candidate regions and the number of the supplementary negative sample images is within a predetermined allowable range.
20. The method according to claim 19, wherein the number of human-hand candidate regions in the prediction information of the human-hand candidate regions is equal to the number of the supplementary negative sample images.
21. The method according to any one of claims 12-20, wherein the first convolutional neural network comprises a region proposal network (RPN), and/or the second convolutional neural network comprises a fast region-based convolutional neural network (FRCNN).
22. The method according to any one of claims 18-21, wherein the third convolutional neural network comprises an FRCNN.
23. The method according to any one of claims 1-22, wherein the determining a presentation position of the business object to be displayed in the video image comprises: determining, through the detected gesture and a pre-trained third convolutional network model for detecting presentation positions of business objects from video images, the presentation position of the business object to be displayed corresponding to the detected gesture.
24. A gesture control apparatus, comprising: a gesture detection module, configured to perform gesture detection on a currently played video image; a presentation position determining module, configured to determine, when it is detected that the gesture matches a predetermined gesture, a presentation position of a business object to be displayed in the video image; and a business object drawing module, configured to draw the business object at the presentation position by computer graphics.
25. The apparatus according to claim 24, wherein the presentation position determining module comprises: a feature point extraction unit, configured to extract feature points of a hand in a human-hand candidate region corresponding to the detected gesture; and a presentation position determining unit, configured to determine, according to the feature points of the hand, the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
26. The apparatus according to claim 25, wherein the presentation position determining unit is configured to determine, according to the feature points of the hand and the type of the business object to be displayed, the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
27. The apparatus according to claim 26, wherein the presentation position determining unit is configured to determine, according to the feature points of the hand and the type of the business object to be displayed, a plurality of presentation positions in the video image of the business object to be displayed corresponding to the detected gesture, and to select at least one presentation position from the plurality of presentation positions.
28. The apparatus according to claim 24, wherein the presentation position determining module is configured to obtain, from a pre-stored correspondence between gestures and presentation positions, a target presentation position corresponding to the predetermined gesture as the presentation position in the video image of the business object to be displayed corresponding to the detected gesture.
29. The apparatus according to claim 24, wherein the business object comprises a special effect containing semantic information, and the video image comprises a live-streaming video image.
30. The apparatus according to claim 29, wherein the special effect containing semantic information comprises advertisement information in any one or more of the following forms: a two-dimensional sticker effect, a three-dimensional effect, and a particle effect.
31. The apparatus according to any one of claims 24-30, wherein the presentation position comprises any one or more of the following: a hair region, a forehead region, a cheek region, or a chin region of a person in the video image; a body region other than the head; a background region in the video image; a region within a set range centered on the region where the hand is located in the video image; and a preset region in the video image.
32. The apparatus according to any one of claims 24-31, wherein the type of the business object comprises any one or more of the following: a forehead patch type, a cheek patch type, a chin patch type, a virtual hat type, a virtual clothing type, a virtual makeup type, a virtual headwear type, a virtual hair accessory type, and a virtual jewelry type.
33. The apparatus according to any one of claims 24-32, wherein the predetermined gesture comprises any one or more of the following: waving, a scissors hand, a fist, an upturned palm, a heart-shaped hand, applause, an open palm, a closed palm, a thumbs-up, a pistol pose, a V sign, and an OK sign.
34. The apparatus according to any one of claims 24-33, wherein the gesture detection module is configured to detect the video image using a pre-trained first convolutional network to obtain first feature information of the video image and prediction information of human-hand candidate regions, the first feature information comprising hand feature information; and to use the first feature information and the prediction information of the human-hand candidate regions as second feature information of a pre-trained second convolutional network model, and to perform gesture detection on the video image with the second convolutional network model according to the second feature information to obtain a gesture detection result of the video image, wherein the second convolutional network model and the first convolutional network model share a feature extraction layer.
35. The apparatus according to claim 34, further comprising: a human-hand region determining module, configured to train the first convolutional network model according to sample images containing human-hand annotation information, to obtain prediction information of the first convolutional network model for human-hand candidate regions of the sample images; a correction module, configured to correct the prediction information of the human-hand candidate regions; and a convolution model training module, configured to train the second convolutional network model according to the corrected prediction information of the human-hand candidate regions and the sample images, wherein the second convolutional network model and the first convolutional network model share a feature extraction layer, and the parameters of the feature extraction layer are kept unchanged during the training of the second convolutional network model.
36. The apparatus according to claim 34, further comprising: a human-hand region determining module, configured to train a first convolutional neural network according to sample images containing human-hand annotation information, to obtain prediction information of the first convolutional neural network for human-hand candidate regions of the sample images; a parameter replacement module, configured to replace second feature extraction layer parameters of a second convolutional neural network used for detecting gestures with first feature extraction layer parameters of the trained first convolutional neural network; and a second training module, configured to train the second convolutional neural network parameters according to the prediction information of the human-hand candidate regions and the sample images, keeping the second feature extraction layer parameters unchanged during training.
37. The apparatus according to claim 36, wherein the second training module comprises: a correction module, configured to correct the prediction information of the human-hand candidate regions; and a convolution model training module, configured to train the second convolutional neural network parameters according to the corrected prediction information of the human-hand candidate regions and the sample images, keeping the second feature extraction layer parameters unchanged during training.
38. The apparatus according to any one of claims 35-37, wherein the human-hand annotation information comprises annotation information of the human-hand region and/or annotation information of the gesture.
39. The apparatus according to any one of claims 34-38, wherein the first convolutional neural network comprises a first input layer, a first feature extraction layer, and a first classification output layer, the first classification output layer being configured to predict whether a plurality of candidate regions of a sample image are human-hand candidate regions.
40. The apparatus according to any one of claims 34-39, wherein the second convolutional neural network comprises a second input layer, a second feature extraction layer, and a second classification output layer, the second classification output layer being configured to output the gesture detection result of a sample image.
41. The apparatus according to claim 35 or any one of claims 37-40, wherein the correction module is configured to input a plurality of supplementary negative sample images and the prediction information of the human-hand candidate regions into a third convolutional neural network for classification, so as to filter out negative samples in the human-hand candidate regions and obtain corrected prediction information of the human-hand candidate regions.
42. The apparatus according to claim 41, wherein the difference between the number of human-hand candidate regions in the prediction information of the human-hand candidate regions and the number of the supplementary negative sample images is within a predetermined allowable range.
43. The apparatus according to claim 42, wherein the number of human-hand candidate regions in the prediction information of the human-hand candidate regions is equal to the number of the supplementary negative sample images.
44. The apparatus according to any one of claims 34-43, wherein the first convolutional neural network comprises a region proposal network (RPN), and/or the second convolutional neural network comprises a fast region-based convolutional neural network (FRCNN).
45. The apparatus according to any one of claims 34-44, wherein the third convolutional neural network comprises an FRCNN.
46. The apparatus according to any one of claims 24-45, wherein the presentation position determining module is configured to determine, through the gesture and a pre-trained third convolutional network model for detecting presentation positions of business objects from video images, the presentation position of the business object to be displayed corresponding to the detected gesture.
47. An electronic device, comprising: a processor, a memory, a communications interface, and a communication bus, wherein the processor, the memory, and the communications interface communicate with one another through the communication bus; and the memory is configured to store at least one executable instruction that causes the processor to perform the operations of the steps of the gesture control method according to any one of claims 1-23.
48. An electronic device, comprising: a processor and the gesture control apparatus according to any one of claims 24-46, wherein, when the processor runs the gesture control apparatus, the units in the gesture control apparatus according to any one of claims 24-46 are run.
49. A computer program, comprising computer-readable code, wherein, when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the steps of the gesture control method according to any one of claims 1-23.
50. A computer-readable storage medium, configured to store computer-readable instructions, wherein, when the instructions are executed, the operations of the steps of the gesture control method according to any one of claims 1-23 are implemented.
PCT/CN2017/098182 2016-08-19 2017-08-19 Gesture control method, device, and electronic apparatus Ceased WO2018033154A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN201610696340.1 2016-08-19
CN201610694510.2A CN107340852A (en) 2016-08-19 2016-08-19 Gestural control method, device and terminal device
CN201610707579.4A CN107341436B (en) 2016-08-19 2016-08-19 Gesture detection network training, gesture detection and control method, system and terminal
CN201610694510.2 2016-08-19
CN201610696340.1A CN107368182B (en) 2016-08-19 2016-08-19 Gesture detection network training, gesture detection and gesture control method and device
CN201610707579.4 2016-08-19

Publications (1)

Publication Number Publication Date
WO2018033154A1 true WO2018033154A1 (en) 2018-02-22

Family

ID=61196400

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/098182 Ceased WO2018033154A1 (en) 2016-08-19 2017-08-19 Gesture control method, device, and electronic apparatus

Country Status (1)

Country Link
WO (1) WO2018033154A1 (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034397A (en) * 2018-08-10 2018-12-18 腾讯科技(深圳)有限公司 Model training method, device, computer equipment and storage medium
CN110210426A (en) * 2019-06-05 2019-09-06 中国人民解放军国防科技大学 Method for estimating hand posture from single color image based on attention mechanism
CN110362210A (en) * 2019-07-24 2019-10-22 济南大学 The man-machine interaction method and device of eye-tracking and gesture identification are merged in Virtual assemble
CN110414313A (en) * 2019-06-06 2019-11-05 平安科技(深圳)有限公司 Abnormal behaviour alarm method, device, server and storage medium
CN110796096A (en) * 2019-10-30 2020-02-14 北京达佳互联信息技术有限公司 Training method, device, equipment and medium for gesture recognition model
CN111061369A (en) * 2019-12-13 2020-04-24 腾讯科技(深圳)有限公司 Interaction method, device, equipment and storage medium
CN111221406A (en) * 2018-11-23 2020-06-02 杭州萤石软件有限公司 Information interaction method and device
CN111539947A (en) * 2020-04-30 2020-08-14 上海商汤智能科技有限公司 Image detection method, training method of related model, related device and equipment
CN111860346A (en) * 2020-07-22 2020-10-30 苏州臻迪智能科技有限公司 Dynamic gesture recognition method and device, electronic equipment and storage medium
CN112173497A (en) * 2020-11-10 2021-01-05 珠海格力电器股份有限公司 Control method and device of garbage collection equipment
CN112560787A (en) * 2020-12-28 2021-03-26 深研人工智能技术(深圳)有限公司 Pedestrian re-identification matching boundary threshold setting method and device and related components
CN112580596A (en) * 2020-12-30 2021-03-30 网易(杭州)网络有限公司 Data processing method and device
CN112911393A (en) * 2018-07-24 2021-06-04 广州虎牙信息科技有限公司 Part recognition method, device, terminal and storage medium
CN113033256A (en) * 2019-12-24 2021-06-25 武汉Tcl集团工业研究院有限公司 Training method and device for fingertip detection model
CN113326733A (en) * 2021-04-26 2021-08-31 吉林大学 Eye movement point data classification model construction method and system
CN113658298A (en) * 2018-05-02 2021-11-16 北京市商汤科技开发有限公司 Method and device for generating special-effect program file package and special effect
CN114167978A (en) * 2021-11-11 2022-03-11 广州大学 A human-computer interaction system mounted on a construction robot
CN114217728A (en) * 2021-11-26 2022-03-22 广域铭岛数字科技有限公司 Control method, system, equipment and storage medium for visual interactive interface
CN114333056A (en) * 2021-12-29 2022-04-12 北京淳中科技股份有限公司 Gesture control method, system, equipment and storage medium
CN114626024A (en) * 2022-05-12 2022-06-14 北京吉道尔科技有限公司 Internet infringement video low-consumption detection method and system based on block chain
CN115131871A (en) * 2021-03-25 2022-09-30 华为技术有限公司 Gesture recognition system and method and computing device
CN116152931A (en) * 2023-04-23 2023-05-23 深圳未来立体教育科技有限公司 Gesture recognition method and VR system
CN116263622A (en) * 2021-12-13 2023-06-16 北京字跳网络技术有限公司 Gesture recognition method, device, electronic device, medium and program product
CN116775924A (en) * 2023-06-19 2023-09-19 维沃移动通信有限公司 Image display control method, device and equipment
CN117058585A (en) * 2023-08-14 2023-11-14 广州商研网络科技有限公司 Target detection method and device, equipment and medium thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030063133A1 (en) * 2001-09-28 2003-04-03 Fuji Xerox Co., Ltd. Systems and methods for providing a spatially indexed panoramic video
CN101770283A (en) * 2009-01-05 2010-07-07 联想(北京)有限公司 Method and computer for generating feedback effect for touch operation
CN103902174A (en) * 2012-12-26 2014-07-02 联想(北京)有限公司 Display method and equipment
CN103984478A (en) * 2014-04-25 2014-08-13 广州市久邦数码科技有限公司 Dynamic icon displaying method and system
CN105867599A (en) * 2015-08-17 2016-08-17 乐视致新电子科技(天津)有限公司 Gesture control method and device

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658298A (en) * 2018-05-02 2021-11-16 北京市商汤科技开发有限公司 Method and device for generating special-effect program file package and special effect
CN112911393B (en) * 2018-07-24 2023-08-01 广州虎牙信息科技有限公司 Method, device, terminal and storage medium for identifying part
CN112911393A (en) * 2018-07-24 2021-06-04 广州虎牙信息科技有限公司 Part recognition method, device, terminal and storage medium
CN109034397B (en) * 2018-08-10 2023-04-07 腾讯科技(深圳)有限公司 Model training method and device, computer equipment and storage medium
CN109034397A (en) * 2018-08-10 2018-12-18 腾讯科技(深圳)有限公司 Model training method, device, computer equipment and storage medium
CN111221406B (en) * 2018-11-23 2023-10-13 杭州萤石软件有限公司 Information interaction method and device
CN111221406A (en) * 2018-11-23 2020-06-02 杭州萤石软件有限公司 Information interaction method and device
CN110210426B (en) * 2019-06-05 2021-06-08 中国人民解放军国防科技大学 An attention-based approach to hand pose estimation from a single color image
CN110210426A (en) * 2019-06-05 2019-09-06 中国人民解放军国防科技大学 Method for estimating hand posture from single color image based on attention mechanism
CN110414313B (en) * 2019-06-06 2024-02-13 平安科技(深圳)有限公司 Abnormal behavior alarming method, device, server and storage medium
CN110414313A (en) * 2019-06-06 2019-11-05 平安科技(深圳)有限公司 Abnormal behaviour alarm method, device, server and storage medium
CN110362210A (en) * 2019-07-24 2019-10-22 济南大学 The man-machine interaction method and device of eye-tracking and gesture identification are merged in Virtual assemble
CN110362210B (en) * 2019-07-24 2022-10-11 济南大学 Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly
CN110796096A (en) * 2019-10-30 2020-02-14 北京达佳互联信息技术有限公司 Training method, device, equipment and medium for gesture recognition model
CN110796096B (en) * 2019-10-30 2023-01-24 北京达佳互联信息技术有限公司 Training method, device, equipment and medium for gesture recognition model
CN111061369A (en) * 2019-12-13 2020-04-24 腾讯科技(深圳)有限公司 Interaction method, device, equipment and storage medium
CN113033256A (en) * 2019-12-24 2021-06-25 武汉Tcl集团工业研究院有限公司 Training method and device for fingertip detection model
CN113033256B (en) * 2019-12-24 2024-06-11 武汉Tcl集团工业研究院有限公司 A training method and device for fingertip detection model
CN111539947A (en) * 2020-04-30 2020-08-14 上海商汤智能科技有限公司 Image detection method, training method of related model, related device and equipment
CN111539947B (en) * 2020-04-30 2024-03-29 上海商汤智能科技有限公司 Image detection method, related model training method, related device and equipment
CN111860346A (en) * 2020-07-22 2020-10-30 苏州臻迪智能科技有限公司 Dynamic gesture recognition method and device, electronic equipment and storage medium
CN112173497A (en) * 2020-11-10 2021-01-05 珠海格力电器股份有限公司 Control method and device of garbage collection equipment
CN112560787A (en) * 2020-12-28 2021-03-26 深研人工智能技术(深圳)有限公司 Pedestrian re-identification matching boundary threshold setting method and device and related components
CN112580596A (en) * 2020-12-30 2021-03-30 网易(杭州)网络有限公司 Data processing method and device
CN112580596B (en) * 2020-12-30 2024-02-27 杭州网易智企科技有限公司 Data processing method and device
CN115131871A (en) * 2021-03-25 2022-09-30 华为技术有限公司 Gesture recognition system and method and computing device
CN113326733B (en) * 2021-04-26 2022-07-08 吉林大学 A method and system for constructing an eye-tracking data classification model
CN113326733A (en) * 2021-04-26 2021-08-31 吉林大学 Eye movement point data classification model construction method and system
CN114167978A (en) * 2021-11-11 2022-03-11 广州大学 A human-computer interaction system mounted on a construction robot
CN114217728A (en) * 2021-11-26 2022-03-22 广域铭岛数字科技有限公司 Control method, system, equipment and storage medium for visual interactive interface
CN116263622A (en) * 2021-12-13 2023-06-16 北京字跳网络技术有限公司 Gesture recognition method, device, electronic device, medium and program product
CN114333056A (en) * 2021-12-29 2022-04-12 北京淳中科技股份有限公司 Gesture control method, system, equipment and storage medium
CN114626024A (en) * 2022-05-12 2022-06-14 北京吉道尔科技有限公司 Internet infringement video low-consumption detection method and system based on block chain
CN116152931B (en) * 2023-04-23 2023-07-07 深圳未来立体教育科技有限公司 Gesture recognition method and VR system
CN116152931A (en) * 2023-04-23 2023-05-23 深圳未来立体教育科技有限公司 Gesture recognition method and VR system
CN116775924A (en) * 2023-06-19 2023-09-19 维沃移动通信有限公司 Image display control method, device and equipment
CN117058585A (en) * 2023-08-14 2023-11-14 广州商研网络科技有限公司 Target detection method and device, equipment and medium thereof

Similar Documents

Publication Title
WO2018033154A1 (en) Gesture control method, device, and electronic apparatus
US10776970B2 (en) Method and apparatus for processing video image and computer readable medium
US11037348B2 (en) Method and apparatus for displaying business object in video image and electronic device
WO2018033155A1 (en) Video image processing method, apparatus and electronic device
WO2018033143A1 (en) Video image processing method, apparatus and electronic device
US11182591B2 (en) Methods and apparatuses for detecting face, and electronic devices
US11922661B2 (en) Augmented reality experiences of color palettes in a messaging system
US12073524B2 (en) Generating augmented reality content based on third-party content
US12118601B2 (en) Method, system, and non-transitory computer-readable medium for analyzing facial features for augmented reality experiences of physical products in a messaging system
US20240161179A1 (en) Identification of physical products for augmented reality experiences in a messaging system
US12165242B2 (en) Generating augmented reality experiences with physical products using profile information
US11044295B2 (en) Data processing method, apparatus and electronic device
EP4128026A1 (en) Identification of physical products for augmented reality experiences in a messaging system
CN112088377A (en) Real-time object detection and tracking
CN108229276A (en) Neural metwork training and image processing method, device and electronic equipment
CN107770602B (en) Video image processing method and device and terminal equipment
CN107770603B (en) Video image processing method and device and terminal equipment
US20260030784A1 (en) Augmented reality experiences of color palettes in a messaging system

Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 17841126
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 17841126
    Country of ref document: EP
    Kind code of ref document: A1