WO2018128996A1 - System and method for facilitating dynamic avatar based on real-time facial expression detection - Google Patents
- Publication number
- WO2018128996A1 (PCT/US2018/012105)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- facial
- emoji
- user
- face
- expression
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2210/00—Indexing scheme for image generation or computer graphics
- G06T2210/44—Morphing
Definitions
- the present application relates to a system and method for dynamically displaying an emoji corresponding to a user's facial expression and head position.
- One embodiment of the present invention provides a system for dynamically generating and displaying emojis.
- the system receives a sequence of images capturing a user's head and face. For each image, the system determines the user's head position in the image; applies a machine-learning technique, which comprises a convolutional neural network (CNN), to the image to determine a facial-expression class associated with the user's face; and generates an emoji image based on the user's head position and the determined facial-expression class.
- the system applies a morph-target animation technique to a sequence of emoji images generated for the sequence of images to generate an animated emoji and displays the animated emoji.
- the animated emoji mimics the facial expression of the user at the specific angle at which the user's face is located.
- determining the user's head position comprises determining facial landmark points in the image using a shape modeling algorithm.
- applying the machine-learning technique further comprises using multiple specialized convolutional neural networks.
- a respective specialized convolutional neural network is configured to generate a facial-expression-detection output based on a portion of the user's face.
- the method further comprises aggregating facial-expression-detection outputs from the multiple specialized convolutional neural networks by applying weights to the facial-expression-detection outputs from the multiple specialized convolutional neural networks.
- the determined facial-expression class is selected from a group consisting of: neutral, happy, sad, fear, angry, surprise, disgust, tongue out, kiss, wink, eyebrow raise, and nose wrinkle.
- generating the emoji image comprises mapping the determined facial-expression class to a pre-stored emoji image and modifying the pre-stored emoji image based on the user's head position.
- receiving the sequence of images comprises receiving a live video feed from a camera associated with a computer and sampling frames from the received live video feed.
- FIG. 1 shows an exemplary system for detecting facial expressions, according to one embodiment.
- FIG. 2 shows an exemplary system for detecting facial expressions from a video feed, according to one embodiment.
- FIG. 3 illustrates an exemplary system for detecting facial expressions, according to one embodiment.
- FIG. 4 illustrates an exemplary CNN system for outputting a facial-expression class, according to one embodiment.
- FIG. 5 illustrates a system for generating and displaying emojis based on a user's facial expression, according to one embodiment.
- FIGs. 6A-6B illustrate an exemplary 3D object in different states.
- FIGs. 7A-7F show an exemplary morph-target animation, according to one embodiment.
- FIGs. 8A-8D illustrate different emoji states.
- FIG. 9 presents a flowchart illustrating an exemplary process for displaying a dynamic emoji based on a user's facial expression, according to one embodiment.
- FIG. 10 illustrates a system for generating animated emojis based on a user's facial expression, according to one embodiment.
- FIG. 11 illustrates a system for generating real time animated avatars based on a user's facial expression, according to one embodiment.
- FIG. 12 illustrates an exemplary scenario for user sentiment analysis, according to one embodiment.
- FIG. 13A illustrates a scenario for aggregating and displaying sentiments of a large number of users, according to one embodiment.
- FIG. 13B shows exemplary viewer sentiment analytics data provided to a content producer, according to one embodiment.
- FIG. 14 presents a flowchart illustrating an exemplary process of aggregating viewer sentiment analytics, according to one embodiment.
- FIG. 15 illustrates an exemplary computer and communication system for displaying dynamic emojis, according to one embodiment.
- Embodiments of the present invention provide a system and method for dynamically displaying emojis that correspond to a user's facial expression and head position. More specifically, the system can capture live video or images of a user's face and analyze the video or images using artificial intelligence (e.g., trained convolutional neural networks) in order to detect the user's emotions and facial movements in real time. In addition to detecting facial expressions, the system can also track the orientation of the user's head by detecting and tracking a number of facial landmark points (e.g., the nose or eyes). The system can then use a morph-target animation technology to display a dynamic or live emoji based on both the detected head orientations and facial expressions.
Facial Expression Detection Based on Landmark Points
- Facial expression is a visible manifestation of the affective state, cognitive activity, intention, personality, and psychopathology of a person. Facial expressions, and other gestures, convey non-verbal communication cues in face-to-face interactions. In online communications that do not involve face-to-face interactions, recognizing users' facial expressions and displaying emojis that can reflect the users' facial expressions make such communication more lively without burdening the user with manual input of emojis. Moreover, facial expressions can play an important role wherever humans interact with machines.
- Automatic recognition of facial expressions may act as a component of natural human machine interfaces (some variants of which are called perceptual interfaces or conversational interfaces). Such interfaces would enable the automated provision of services that require a good appreciation of the emotional state of the service user. Some robots can also benefit from the ability to recognize expressions. For example, a therapy robot caring for sick or disabled individuals may deliver care more effectively if it can recognize the emotional state of the patient being cared for.
- One typical approach for detecting user emotions in real time can involve first detecting a face, fitting a general face pattern containing facial landmarks (e.g., nose, eyes, eyebrows, etc.) on the detected face, adjusting landmark positions based on the detected face, tracking movements of the landmarks, and determining user emotions based on the movements of the landmarks.
- the first step in facial expression analysis is to detect the face in the given image or video sequence. Locating the face within an image is called face detection or face localization, whereas locating the face and tracking it across the different frames of a video sequence is called face tracking.
- One popular face-detection method is the Viola-Jones algorithm, which can provide robust, real-time detection of human faces. Briefly, the Viola-Jones algorithm works by repeatedly scanning an image to find a familiar pattern. For example, if a few dark pixels have been found and assumed to be a human eye, it is predicted that another similar pattern can be found on either the left or right side.
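For illustration, below is a minimal sketch of this kind of face detection using OpenCV's Haar-cascade implementation of the Viola-Jones approach; the cascade file name and parameter values are OpenCV conventions, not part of the patent.

```python
# A sketch of Viola-Jones-style face detection via OpenCV's Haar cascades.
import cv2

def detect_faces(image_bgr):
    """Return a list of (x, y, w, h) bounding boxes for detected faces."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # scaleFactor and minNeighbors trade off speed against robustness.
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Usage: faces = detect_faces(cv2.imread("frame.jpg"))
```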
- the next operation is to find individual facial features, such as nose, lips, mouth, etc.
- Active shape models (ASMs) are statistical models of the shape of objects, which iteratively deform to fit an example of the object in a new image.
- the ASM algorithm works by first manually marking facial features (e.g., nose, lips, eyes, etc.) on many sample face images, and then forming a general face pattern based on the average locations of the markings.
- This general face pattern is sometimes called the "universal face model," and the markings can be called "facial landmarks" or "landmark points." Once a face is detected, the general face pattern can be placed on top of the face in order to create a model for that particular face.
- the general face pattern does not fit every face, and adjustments to the landmark positions are needed.
- the image of the detected face can be analyzed to extract certain facial features, and landmark points corresponding to those facial features can be adjusted accordingly to better fit the face. For example, if a detected face has a very narrow chin, when the general face pattern is placed on the face, the landmark points on the perimeter of the chin will not align to the chin line of the detected face. To make the model fit, the landmark points on the chin line will be moved to locations of the chin of the detected face. Note that the location of the chin of the detected face can be determined by scanning the image to find sharp contrasts in the color of pixels.
- the facial expression can be extracted based on the relative positions of the landmark points. For example, if the two corners of a mouth are closer to the eyes than the center portion of the mouth, one may conclude that the face indicates a happy emotion.
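A toy sketch of this kind of landmark-based heuristic is shown below; the landmark names and coordinates are hypothetical, and a real classifier would consider many more points.

```python
# Infer a "happy" expression from relative landmark positions: mouth corners
# sitting higher (closer to the eyes) than the mouth center suggests a smile.
def looks_happy(landmarks):
    left, right = landmarks["mouth_left"], landmarks["mouth_right"]
    center = landmarks["mouth_center"]
    # Image coordinates grow downward, so "higher" means a smaller y value.
    corner_y = (left[1] + right[1]) / 2.0
    return corner_y < center[1]

print(looks_happy({"mouth_left": (40, 98), "mouth_right": (80, 98),
                   "mouth_center": (60, 105)}))  # True
```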
- FIG. 1 shows an exemplary system for detecting facial expressions, according to one embodiment.
- facial-expression-detection module 100 can include a face-detection module 102, an ASM module 104, and a classifier 106.
- a captured image can be sent to face-detection module 102, which can run a face-detection algorithm, such as the Viola-Jones algorithm, to detect one or more faces included in the captured image.
- the face portion of the image can then be sent to ASM module 104, which can use the ASM algorithm to model the detected face and adjust facial landmark points based on the input image.
- ASM module 104 can output 2D or 3D coordinates of various facial features, e.g., eyes, nose, lips, etc. In some embodiments, ASM module 104 can output a generic 3D model of the face with landmark points of the face adjusted based on the input image.
- the output of ASM module 104 can be sent to classifier 106 as input.
- Classifier 106 has been previously trained using annotated sample data similar to its input. For example, if the input of classifier 106 includes 2D or 3D coordinates, classifier 106 can be previously trained using data comprising 2D or 3D coordinates of known facial features. Similarly, if the input of classifier 106 includes a face model with modified landmark points, classifier 106 can be previously trained using data comprising face models. Based on its input, classifier 106 can output an emotion indicator or facial-expression class selected from the aforementioned discrete emotion classes, such as neutral emotion, happiness, sadness, surprise, fear, anger, disgust, etc.
- the facial-expression classes can also include facial expressions defined by positions of one or more facial features, e.g., lips, eyes, eyebrows, etc.
- a tongue-out face can belong to a particular facial-expression class.
- an eye-wink face can belong to a different facial-expression class.
- Running the Viola-Jones algorithm to detect a face in an image and then placing landmark points on the detected face needs to be done continuously across the frames of a video feed. In some cases, only a subset of the frames of the video feed is processed, because processing every frame increases the need for processing power.
- As the head moves, the landmark points will also move, albeit with some error, to the new locations. For example, if a landmark point is placed at the tip of the nose, as the head rotates, the coordinates of the tip of the nose change, and the next time the placement process is performed, the landmark point for the tip of the nose will be moved to the new coordinates.
- the software module that detects a face in frames of a video feed and subsequently places the landmark points at the designated locations can be called a "face tracker.”
- the input is a still image, and the output is a single emotion or facial-expression class based on the faces included in the still image.
- the input can be live video feed or a video clip, in which a face's expression may change continuously, and the facial-expression-detection system needs to be able to detect the changing facial expression.
- the facial-expression-detection system can treat the video input as a sequence of individual still images and perform the facial-expression-detection operation on each and every image.
- the facial-expression-detection system can sample images from the video input.
- For example, out of the 30 frames captured in one second, the facial-expression-detection system can evenly extract five or ten images to process. This can reduce the amount of computation needed for detecting facial expressions. A higher sampling rate (e.g., 20 frames per second) can result in a more accurate detection result.
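As a rough illustration of such frame sampling, the sketch below keeps every third frame of a stream read with OpenCV's VideoCapture; the function name and sampling interval are illustrative, not taken from the patent.

```python
# Sample frames at a reduced rate: keeping every third frame of a 30 fps
# stream yields roughly 10 frames per second for expression detection.
import cv2

def sample_frames(path, keep_every=3):
    cap = cv2.VideoCapture(path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % keep_every == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```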
- the facial-expression-detection system can include a face-tracking module that tracks the movements of the landmark points, and the classifier can then detect facial expressions or changes of facial expressions based on movements of the landmarks.
- Recognizing emotion from a video feed is more computationally intensive than from a still image, because one second of video contains 30 frames, and the aforementioned operations, including face detection, fitting the general face pattern to the detected face, and adjusting the landmark points, need to be performed on each and every frame.
- the movements of the landmark points can also be tracked.
- the facial expression and/or changes of the facial expression can be determined based on the tracked movements of the landmark points. For example, if it is detected that the two corners of the mouth are moving upward, one may conclude that the video feed shows a laughing face.
- FIG. 2 shows an exemplary system for detecting facial expressions from a video feed, according to one embodiment.
- facial-expression-detection module 200 can include image sampler 202, face-detection module 204, face tracker 206, and classifier 208.
- image sampler 202 samples a received video stream at a predetermined sample rate.
- the predetermined sample rate is at least 50%.
- the sample images can be sent to face-detection module 204, which can detect faces in each image.
- the detected faces can then be sent to face tracker 206, which can identify and track landmark points on the detected faces.
- Classifier 208 can take the output of face tracker 206 as input and output a sequence of emotion classes. For example, if the output of face tracker 206 indicates that the corners of the mouth move upward from one frame to the next, classifier 208 can output an emotion class as happy for the time instance corresponding to the frames.
- the process of fitting a general face pattern to a detected face is not precise, and it is quite possible that certain landmark points do not end up at the coordinates where they need to be. As a result, it can be difficult to achieve a desired accuracy of facial expression detection by tracking landmark points. Note that the entire process of "face tracking" suffers from two levels of error that are inherent in the processes involved. First, fitting the landmark points on a face is an imprecise process. As noted above, the universal face model is a one-size-fits-all model, and it further needs adjustment for a particular face in an image. These adjustments introduce an error rate in the coordinates of the landmark points.
- During the adjustment, the image is scanned and, based on certain imperfect assumptions, the landmark points of the universal face model are moved to fit the face in the image. For example, when the algorithm looks for a sharp contrast in the shade of the image pixels to determine where the jaw line is, the location of this contrast may be erroneously affected by the ambient lighting environment. Therefore, the algorithm may place the landmark points in a wrong location. Second, this problem is further exacerbated when this is done on a video feed.
- the rate of change of the coordinates of facial elements in subsequent frames may be large enough that the landmark points cannot be placed on the new coordinates at that speed and hence the process breaks down, and the face tracker loses its ability to effectively track the face.
- This problem can be appreciated by observing mobile applications that place augmented reality elements on the face (such as hat, ear, glasses, etc.). When the user turns his/her head rapidly in front of the camera, these objects fall off and for a duration of time, do not stay in the proper coordinates.
- Such applications use face trackers to detect facial elements and track them to place the objects on the face.
- Landmark points are a relic of an era in the field of computer vision when processors were not powerful enough to process an entire image of the face.
- Instead, only certain discrete "landmark" points were located and tracked, and the process by which this was done is prone to error, as noted above. Therefore, detecting and determining human facial expression and emotion with this technique is highly imprecise.
- One way to remedy this problem is to utilize existing powerful processors and track several additional landmark points on the face.
- a depth sensor may be used in addition to a camera to determine the location of the landmark points in the Z-coordinate in a 3D space.
- Tracking additional landmark points alleviates the noted shortcomings as additional data points may be used to reduce the error rate.
- Moreover, the traditional face-tracking technology, which uses landmark points, is entirely incapable of detecting several facial expressions. For example, if we want to detect when a person sticks his or her tongue out, a face tracker cannot be of use, since the tongue is not always visible and hence it is not possible to track it with landmark points.
- the disclosed solution can harvest the improved computation power in modern computing systems, such as graphics processing units (GPUs). More specifically, instead of tracking the limited number of discrete landmark points, the entire face can be digested with every pixel being tracked.
- Artificial intelligence (AI) can be used to analyze images or video feeds of human faces in order to detect facial expressions.
- For example, convolutional neural networks (CNNs) can be trained using an image library comprising human faces.
- a CNN can be trained to learn happy faces (including smiles and laughs), sad faces, angry faces, surprised faces, fearful or disgusted faces, and stares.
- the CNN can also be trained to learn other types of facial expressions, such as wink, kiss, tongue out, nose wrinkle, raising one or both eyebrows, etc.
- the CNN can take as input an image of a face and return the probability of the image belonging to one of the classes of emotions, such as neutral emotion, happiness, sadness, surprise, fear, anger, disgust, etc.
- the neural network can return probabilities of a face image belonging to each of those classes, for example: 0.7 for happiness, 0.2 for anger, 0.1 for surprise, 0.0 for sadness, 0.0 for neutral, 0.0 for disgust, 0.0 for fear.
- the CNN may also track the movements of certain facial features, such as the tongue and eyebrows, for example, tongue out to the left or a raised left eyebrow.
- CNNs are generally designed to work with color images, but considering a gray-scale image for the sake of simplicity and ease of explanation, as an example, the input to the CNN can be the values and positions of each pixel. More specifically, in a gray-scale image, each pixel is represented by a number belonging to the interval [0.0, 1.0], where 0.0 denotes white, 1.0 denotes black, and all other numbers denote a certain shade of gray.
- An image can be represented by a two-dimensional matrix. For ease of calculation, such a matrix can be converted into an n-by-1 vector, where consecutive rows of the matrix are appended one after another, and the n-by-1 vector becomes the input to the CNN.
- a neural network can be defined by its architecture (number of input neurons, number and size of layers) and by the weight factors w. Finding the weight factors w involves a process known as "training" of the neural network. In order to train the neural network, a set of images with annotated emotions (which can be manually generated) may be needed. The size of this set should be correlated with the architecture of the neural network. Deep neural networks consisting of many layers may require a dataset on the order of a million images.
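For concreteness, a minimal sketch of such a CNN classifier is shown below in PyTorch; the layer sizes and the assumed 64×64 grayscale input are illustrative choices, not the architecture described in the patent, while the class list follows the emotion classes named in the text.

```python
# A small CNN that maps a grayscale face image to probabilities over
# seven emotion classes; training would use annotated face images.
import torch
import torch.nn as nn

CLASSES = ["neutral", "happy", "sad", "surprise", "fear", "anger", "disgust"]

class EmotionCNN(nn.Module):
    def __init__(self, num_classes=len(CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):                      # x: (batch, 1, 64, 64)
        h = self.features(x).flatten(1)
        return torch.softmax(self.classifier(h), dim=1)

probs = EmotionCNN()(torch.rand(1, 1, 64, 64))  # one probability per class
```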
- Alternatively, each AI can be trained on the narrower task of detecting emotion based on a portion of the face, e.g., the eyes, nose, lips, or tongue.
- a mouth AI can be trained to output probabilities of a face image belonging to each of the emotion classes based on values of pixels in the mouth area.
- an eye AI can be trained to output probabilities of a face image belonging to each of the emotion classes based on values of pixels in the eye area. This can significantly reduce the size of the training set.
- the same images can be used to train different specialized AIs, with each individual AI receiving the relative portion of each image. For example, the mouth AI may only look at the mouth area of a training image, whereas the eye AI may only look at the eyes of the training image.
- FIG. 3 illustrates an exemplary system for detecting facial expressions, according to one embodiment.
- facial-expression-detection AI system 300 can include a number of specialized AI subsystems, such as eyebrow AI 302, eye AI 304, nose AI 306, mouth AI 308, etc.
- Facial-expression-detection AI system 300 can also include a pre-processing module 310, which can pre-process the to-be-analyzed image.
- pre-processing module 310 can use the aforementioned Viola-Jones and ASM algorithms to detect a face included in the image and locate landmark points on the detected face.
- preprocessing module 310 can divide the image of the face into multiple portions, each portion representing a facial feature, e.g., eyebrows, eyes, nose, mouth, etc. Each portion of the image can then be sent to the corresponding AI subsystem for processing.
- "AI" may be used as a generalized term to broadly refer to a neural network.
- the outputs from the specialized AIs can be sent to a post-processing module 320, which aggregates the results returned from the various AI subsystems and generates a final output indicating an emotion class of the face included in the image.
- In aggregating the results, various weight functions (e.g., w_eyebrow, w_mouth, etc.) can be applied to the outputs from the different AI subsystems.
- the weight functions can also be obtained via training of entire facial-expression-detection AI system 300.
- In some embodiments, an expert (e.g., a psychologist) can assign the weight functions for the different AIs. For example, genuine smiles often involve movements of muscles around the eyes; hence, the eye AI may be assigned a larger weight function when its output indicates a happy face.
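A small sketch of this weighted aggregation is shown below; the region names, probability vectors, and weight values are illustrative only.

```python
# Each specialized network returns a probability vector over the same
# emotion classes; a per-region weight scales its contribution before the
# combined vector is renormalized into probabilities.
import numpy as np

def aggregate(outputs, weights):
    """outputs/weights are dicts keyed by region, e.g. 'eye', 'mouth'."""
    total = sum(weights[r] * np.asarray(outputs[r], dtype=float)
                for r in outputs)
    return total / total.sum()

combined = aggregate(
    {"eye":   [0.6, 0.3, 0.1],          # e.g. [happy, sad, neutral]
     "mouth": [0.8, 0.1, 0.1]},
    {"eye": 1.5, "mouth": 1.0})         # eye AI weighted more for smiles
print(combined)
```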
- the training data may be augmented in order to include scenarios that may not be included in the gathered data.
- the gathered training data can be pre-processed such that certain facial features can be modified to account for relatively rare facial expressions (e.g., someone making a weird-looking face).
- a specially developed algorithm can be used to pre-process training images to modify certain features.
- For example, a specially developed algorithm can be applied to 2D pictures of faces to rotate training samples, simulating images that would otherwise have to be taken with a camera at different angles. It can be appreciated that such an operation reduces the need to gather additional samples.
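A brief sketch of such rotation-based augmentation, using OpenCV and illustrative angles (not values from the patent), is shown below.

```python
# Rotate 2D face images to simulate photographs taken at different camera
# angles, so fewer real samples need to be gathered.
import cv2

def rotated_variants(image, angles=(-15, -7.5, 7.5, 15)):
    h, w = image.shape[:2]
    center = (w / 2, h / 2)
    variants = []
    for angle in angles:
        m = cv2.getRotationMatrix2D(center, angle, 1.0)
        variants.append(cv2.warpAffine(image, m, (w, h)))
    return variants
```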
- AI layering may be advantageous depending on the particular application to increase the accuracy of the system.
- AI layering is a novel system architecture. Note that architecture of a neural network is different from architecture of a system.
- the AI layering system architecture may be built using known neural network architectures such as MobileNet which is a neural network architecture optimized for mobile devices.
- FIG. 4 shows a specific example of an AI layering system, according to an embodiment.
- the specific embodiment shown in FIG. 4 is illustrative and not restrictive.
- the system of FIG. 4 is directed toward a system architecture for AI layering applied to facial expression detection.
- FIG. 4 shows an input 402 and three CNNs (404, 406 and 408), and the layering module 410.
- CNN 404 has seven classes (404A-404G), each of which is for one facial expression: smile, laugh, neutral, kiss, tongue out, shock, and sad.
- CNN 406 has six classes (406A-406F).
- the classes for CNN 406 are similar to CNN 404 except it lacks the sad (404G) class.
- CNN 408 has five classes (408A-408E).
- the classes for CNN 408 are similar to CNN 406 except it lacks the shock (406F) class.
- the aim of the AI layering system of FIG. 4 is to detect and classify the seven facial expressions that are the classes of CNN 404.
- it uses two additional CNNs with fewer classes, in the manner described below, to increase the accuracy of the overall system.
- an image is received from input 402 and fed into all three CNNs 404, 406 and 408.
- If CNN 404 classifies that image as neutral, layering module 410 takes that result as the final result.
- If CNN 404 classifies the image as smile, layering module 410 will not make a determination until the classification result from CNN 406 is also received.
- For any other class, layering module 410 will consider the result of all three networks to make a decision.
- the performance of the neural networks can be evaluated by an "evaluation set.”
- the evaluation set is a smaller data set compared to the training set which also includes different samples compared to the training set.
- In this example, the evaluation set has revealed that the classification result for neutral is quite accurate, with a high enough probability. This may be because the vast majority of humans show similar facial characteristics when they have no facial expression.
- CNN 404 may also have a good classification capability for smile, but not as good as for neutral. In this case, CNN 406 also classifies smile. Given that CNN 406 has one fewer class, and consequently a smaller number of layers, it can be more accurate at classifying smile.
- When both CNN 404 and CNN 406 classify an image as smile, layering module 410 will determine that result to be conclusive. On the other hand, if an image is classified as any other class by CNN 404, layering module 410 considers the results of all three CNNs that classify that facial expression. Generally, layering module 410 assigns more weight to the result of a CNN with a smaller number of classes, since as the number of classes is reduced, the accuracy of the neural network appears to increase.
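One way the layering decision described above might be sketched is shown below; the confidence values, vote weights, and function shape are assumptions for illustration, not the patent's algorithm.

```python
# Accept the seven-class network's "neutral" directly, require agreement
# from the six-class network for "smile", and otherwise take a vote that
# weights the networks with fewer classes more heavily.
def layered_decision(pred7, pred6, pred5):
    """Each argument is (label, confidence) from one CNN."""
    if pred7[0] == "neutral":
        return "neutral"
    if pred7[0] == "smile" and pred6[0] == "smile":
        return "smile"
    votes = {}
    for (label, conf), weight in zip((pred7, pred6, pred5), (1.0, 1.5, 2.0)):
        votes[label] = votes.get(label, 0.0) + weight * conf
    return max(votes, key=votes.get)

print(layered_decision(("kiss", 0.55), ("tongue out", 0.6), ("kiss", 0.7)))
```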
- One advantage of this system architecture is that the neural networks can be trained with smaller sample sizes.
- One issue in training neural networks is access to a large enough dataset to properly train the neural network.
- As the number of classes to be detected grows, the need for a larger dataset may also increase.
- The architecture of FIG. 4 can be used to achieve the same result with a smaller number of samples, which is tremendously advantageous in terms of both time and cost savings.
- There are many applications of facial expression detection. As discussed previously, it can be used to assist or enhance the human-machine interface.
- the system can display emojis that reflect the real emotional status or sentimental mode of the communication parties.
- the system may include an emoji database, which can include a set of pre- generated emojis corresponding to various known emotional states.
- a happy-face emoji can correspond to the user being happy
- a sad-face emoji can correspond to the user being sad.
- various facial movements such as kissing, tongue out, nose wrinkling, winking, etc., can also have their corresponding emojis.
- When the facial-expression-detection module detects that the user is amused while communicating online (e.g., writing an online post or texting a friend), the system can generate and display a smiley-face emoji, without the need for manual input.
- Similarly, when the user winks, the system can generate a wink emoji.
- the user can make an angry face, and the system can display an angry face emoji.
- the system can also perform other types of operation in response to the user's facial expressions.
- a smartphone can be configured in such a way that when a user winks his right eye twice, a currently active window can be closed.
- Other facial expressions or combinations of facial expressions can be used to trigger different operations.
- the facial-expression-detection module can be trained to recognize these different facial movements and execute predetermined operations according to detection of facial movements.
- displaying the emoji can also involve determining the user's head position/orientation with respect to his body, and the displayed emoji can reflect not only the user's sentiment but also his head position, such as tilting to the left or right, chin up or down, etc. For example, if the user tilts his head to the left and smiles at the camera, the system can display a left-tilted smiley.
- the system can detect a user's head position/orientation based on facial landmark points outputted by a face-modeling module, such as an ASM module. For example, by tracking the positions of the landmark points, the system can determine the movements of the head, such as shaking or nodding.
- the system can then combine the detected head position/orientation with the detected facial expression in order to generate and display an emoji that can reflect both the user's head position and facial expression.
- the system may map a detected facial expression and head position to an existing emoji.
- the system needs to generate, beforehand, different emojis reflecting different head positions of a particular facial expression.
- the system may have generated a left-tilting happy face and a right-tilting happy face.
- the system will display a corresponding happy face.
- the system may need to generate a new emoji by modifying an existing emoji based on the detected head position.
- FIG. 5 illustrates a system for generating and displaying emojis based on a user's facial expression, according to one embodiment.
- system 500 can include face- detection module 502, head-position-detection module 504, facial-expression-detection module 506, emoji database 508, emoji-generation module 510, and emoji-display module 512.
- The detected face (e.g., the portion of the image that includes the face) can then be sent to head-position-detection module 504 and facial-expression-detection module 506.
- Head-position-detection module 504 can detect the position of the user's head using various face-modeling technologies, such as ASM.
- the ASM algorithm can output positions of facial landmark points, such as chin, nose, eyes, etc. These landmark point positions can then be sent to head-position-detection module 504, which can then determine, based on the relative locations of the facial landmarks (e.g., the relative locations between the nose and the chin), the position/orientation of the head.
- Facial-expression-detection module 506 receives the detected face and outputs an emotion and/or facial-expression class.
- facial-expression-detection module 506 can include a CNN, which has been trained beforehand to recognize certain emotional states based on facial expressions, including neutral, happy, sad, fearful, angry, surprised, disgusted, etc.
- sub-categories are also possible.
- facial expressions corresponding to the happy emotion may include sub-categories such as smile and laugh.
- the CNN can be trained to recognize certain facial features and their positions.
- the CNN can be trained to recognize the tongue and track its position, such that it can detect when the user sticks out his tongue.
- the CNN can be trained to recognize a winked eye or a wrinkled nose.
- Emoji-generation module 510 can generate an emoji based on the outputs from head-position-detection module 504 and facial-expression-detection module 506.
- emoji-generation module 510 can retrieve or map an emoji stored in emoji database 508 based on the detected head position and facial expression.
- emoji-generation module 510 can modify the retrieved or mapped emoji based on the detected head position.
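A small sketch of how such generation by emoji-generation module 510 might look is shown below; the database mapping, file names, and the use of a simple 2D rotation to reflect head tilt are hypothetical illustrations.

```python
# Look up a pre-stored emoji image for the detected expression class and
# rotate it to match the detected head roll angle.
import cv2

EMOJI_DB = {"happy": "emoji_happy.png", "sad": "emoji_sad.png"}

def generate_emoji(expression, roll_degrees):
    emoji = cv2.imread(EMOJI_DB[expression], cv2.IMREAD_UNCHANGED)
    h, w = emoji.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), roll_degrees, 1.0)
    return cv2.warpAffine(emoji, m, (w, h))   # tilted to match the head
```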
- Emoji-display module 512 can display the emoji generated by emoji-generation module 510. Displaying emojis that correspond to the users' facial expressions gives users participating in online communications the ability to "see" the facial expressions of their communication partners, making the online communication experience more similar to face-to-face communication.
- the system can display an animated emoji that changes its shape dynamically according to the real time changing facial expression of the user.
- the facial-expression-detection module can detect, based on an input video feed or stream, a sequence of emotion classes or facial expressions. For example, a user's facial expression may change gradually from a neutral expression to a smile, and then to a laugh, and the facial- expression-detection module can output an emotion class sequence reflecting such a change.
- the facial-expression-detection module can output the emotion class at different rates.
- the image-sampling rate is at least 50%.
- a sequence of emojis can be generated.
- Because the emotion classes are discrete, the generated emoji sequence can also be discrete.
- For example, the emotion class for a certain frame can be neutral, while the emotion class for the next frame can be happy; the two consecutive emojis corresponding to these two frames will then be a neutral-face (or dull-face) emoji followed by a happy-face emoji.
- a morph-target animation technique is used to animate the emojis based on detected facial expressions.
- Morph-target animation is a 3D computer animation technique, which can be used to animate a mesh (e.g., a model of a face) from one state to another.
- One mesh can depict a state of an object in a 3D space in one form, and another mesh can depict the same object in the same 3D space in another form (e.g., different expressions of the same face).
- In other words, a mesh can have different states, each representing a different form of the object in 3D space.
- FIG. 6A shows an exemplary 3D object in a first state and FIG. 6B shows the same object in another state.
- Morph-target animation can be used to efficiently animate the change of states from what is shown in FIG. 6A to what is shown in FIG. 6B.
- FIG. 6B is known as the morph-target.
- During the animation, the position of each mesh vertex is calculated using interpolation between its positions in the starting state and in the morph target.
- Interpolation allows for creation of a smooth animation between the two states.
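A minimal sketch of this per-vertex interpolation is shown below; the vertex arrays are illustrative.

```python
# Each vertex of the displayed mesh is a linear blend of its position in
# the start state and in the morph target, controlled by a weight t that
# runs from 0.0 to 1.0 over the animation.
import numpy as np

def interpolate(start_vertices, target_vertices, t):
    start = np.asarray(start_vertices, dtype=float)
    target = np.asarray(target_vertices, dtype=float)
    return (1.0 - t) * start + t * target

# Halfway (t=0.5) between a neutral mouth and a smiling mouth.
print(interpolate([[0, 0, 0], [1, 0, 0]], [[0, 0.2, 0], [1, 0.4, 0]], 0.5))
```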
- FIGs. 7A-7F show an exemplary morph-target animation, according to one embodiment.
- a user's facial expressions are mapped to an emoji drawing.
- Elements 701A-701F show the result of head-position-detection module 504 (shown in FIG. 5),
- elements 702A-702F show one exemplary layer of the CNN that is used by the facial-expression-detection module 506 (shown in FIG. 5)
- elements 703A-703F show the result of the emoji-generation module 510 (shown in FIG. 5).
- FIG. 7A shows a neutral face expression and a corresponding neutral emoji
- FIG. 7B shows a smiling face and a corresponding smiling emoji
- FIG. 7C shows a laughing face and a corresponding laughing emoji
- FIG. 7D shows a laughing emoji with tearing eyes
- FIG. 7E shows an emoji with a shock face
- FIG. 7F shows an emoji with a tongue out.
- Morph-target animation can be used to transition from the emoji shown in FIG. 7A to the emoji shown in FIGs. 7B, 7C, 7D, 7E and 7F.
- the system can decide which morph target represents the end state of an animation.
- the emoji shown in FIG. 7C will be the end state of the animation.
- Morph-target animation works by modifying the influence or weight of each of the two states (the starting state, which is FIG. 7A, and the end state, which is FIG. 7C).
- At the start of the animation, the influence or weight of the emoji shown in FIG. 7A is 100% and the influence or weight of the emoji shown in FIG. 7C is 0%.
- As the animation progresses, the influence of the emoji in FIG. 7A decreases and the influence of the emoji in FIG. 7C increases.
- The emoji in FIG. 7B can be designated as 50% of the transition from the start state of FIG. 7A to the end state of FIG. 7C. Any transition in between can then be a blend of the discrete states to present a smooth transition.
- In other words, between the start state of FIG. 7A and the end state of FIG. 7C, we can designate the emoji shown in FIG. 7B as the middle state.
- tracking these weights can also be used to detect facial expressions. More specifically, instead of using CNN for detecting facial expressions, one can use a face tracker (e.g., an ASM module) to track the positions of facial landmarks and map the facial landmark positions to emojis. For example, in FIGs. 7A-7C, the landmark points shown in the left drawings can be mapped to an emoji shown in the right drawings.
- FIGs. 8A-8D illustrate different emoji states.
- FIG. 8A shows a starting state, which is a neutral state.
- a person might instantaneously transform their face from the neutral state shown in FIG. 8A to any one of the emotions depicted in FIGs. 8B-8D.
- Real-time changes in positions of facial landmark points can result in changes of the weight factors in the morph data.
- morph data (e.g., the vertices and the weights) can represent not only the distance until an end state (or a morph target) but a probability for various possible end states.
- This use of morph data as a probabilistic indicator of the end state (which can indicate an emotion state or facial expression) can be used to understand the emotional gestures of a face.
- the face tracker can start outputting changed coordinates for various points of the face in response to moving facial muscles; and the morph data can represent the probability of an end state at any given time.
- the facial landmark points can change to result in the morph data indicating a 70% shock face (as shown in FIG. 8C) and a 30% smile (as shown in FIG. 8D).
- the landmark points can be mapped to an emoji, which can be a blend of various end states to illustrate the current state of the coordinates of the landmark points coming from the face tracker.
- an optimization algorithm may be used to determine the proper weight of one or more morph targets.
- the optimization algorithm may determine that the proper morph data includes 0.7 "smile” morph target and 0.3 "shock” morph target.
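A small sketch of blending several morph targets by weight, in the spirit of the smile/shock example above, is shown below; the meshes and weights are illustrative.

```python
# The displayed mesh is the neutral mesh plus the weighted offsets of each
# morph target from neutral (shapes: N vertices x 3 coordinates).
import numpy as np

def blend(neutral, targets, weights):
    neutral = np.asarray(neutral, dtype=float)
    out = neutral.copy()
    for name, w in weights.items():
        out += w * (np.asarray(targets[name], dtype=float) - neutral)
    return out

neutral = [[0.0, 0.0, 0.0]]
targets = {"smile": [[1.0, 0.0, 0.0]], "shock": [[0.0, 1.0, 0.0]]}
print(blend(neutral, targets, {"smile": 0.7, "shock": 0.3}))  # [[0.7 0.3 0.]]
```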
- The percentages of various end states (such as smile, shock, etc.) can be used as an output to provide a better understanding of facial emotions in real time.
- A face tracker tracking the movement of the head, combined with a CNN for detecting facial expressions and morph-target animation for smooth transitions of emojis for each detected expression, results in a 3D emoji that not only changes in real time based on the facial expression of a user, but also, as shown in FIGs. 7D-7F, does so at whatever angle the user's face is located in 3D space.
- FIG. 9 presents a flowchart illustrating an exemplary process for displaying a dynamic emoji based on a user's facial expression, according to one embodiment.
- the system receives a video that captures a user's face (operation 902).
- the video can be from a live video feed (e.g., from a camera of a smartphone) or a pre-recorded video clip.
- the system can sample the video at a predetermined rate to extract frames from the video (operation 904).
- the video can include at least 30 frames per second, and the sample rate can be between 50% and 100%.
- the system can then perform face-detection operations to detect faces included in each extracted frame (operation 906).
- the system can then detect the position/orientation of the head (operation 908) and at the same time detect the facial expression (operation 910).
- detecting the head position/orientation can involve modeling the face (e.g., using an ASM algorithm) to determine facial landmark points. For example, by tracking facial landmark points from frame to frame, the system can determine the movement of the head. As the user's head position/orientation changes from frame to frame, the detected head position/orientation changes accordingly.
- detecting the facial expression can involve a machine- learning technique, such as using a trained CNN as a classifier. More specifically, the CNN can receive, as input, pixels from the entire face (similar to examples shown in elements 702A-702F in FIGs. 7A-7F, respectively) to generate an output, which can indicate the probabilities of each discrete emotion class, e.g., neutral, happy, sad, angry, fearful, surprised, disgusted, etc.
- the CNN may also generate an output indicating the facial-expression class based on position or status of certain facial features, e.g., kissing, tongue out, eye wink, nose wrinkle, eyebrow raise, etc.
- the CNN can directly take, as input, the entire image of a video frame.
- the CNN can detect the face and the facial expression from a raw image directly. Because the CNN is receiving a sequence of frames as input, the CNN output can also include a sequence of detected facial expressions.
- an animated emoji can be displayed based on the detected head position/orientation sequence and the detected facial expression sequence (operation 912).
- a morph-target animation technique can be used to display the animated emoji. More specifically, the head positions/orientations and facial expressions of adjacent sampled frames can be treated as the starting and ending emoji states. The system performs the interpolation on vertices between the two states to achieve a smooth transition. As a result, the displayed emoji can mimic the head and facial movements of the user. For example, if the user is frowning while shaking his head, the displayed animated emoji will show a head-turning frowning face.
- the animated emoji will be a head-nodding happy face.
- the head movements, including range and frequency, of the displayed emoji will follow the head movements of the user. In other words, if the user's head is moving fast, the displayed emoji will move fast; and if the user's head is moving slowly, the displayed emoji will move slowly.
- the orientation of the emojis in 3D space (elements 703A-703F) shown in FIGs. 7A-7F corresponds to the orientation of the facial landmark points (elements 701A-701F) shown in FIGs. 7A-7F.
- the animated emoji can be displayed on the user device recording the face to indicate to the user the facial-expression-detection result.
- the animated emoji can be displayed on a remote device associated with a party communicating with the user to provide visual cues for the communication. For example, when a local user is chatting online using a messaging service with a remote user, the remote user can see not only text sent by the local user but also an animated emoji indicating in real time the emotional status or facial expression of the local user. This can greatly enhance the experience of a text-based communication.
- FIG. 10 illustrates a system for generating animated emojis based on a user's facial expression, according to one embodiment.
- Dynamic-emoji-generation system 1000 can include frame- sampling module 1002, optional face-detection module 1004, head-position-detection module 1006, facial-expression-detection module 1008, emoji database 1010, emoji-animation module 1012, and emoji-display module 1014.
- dynamic-emoji-generation system 1000 can be part of a mobile computing device (e.g., a smartphone) implemented using GPUs.
- Frame-sampling module 1002 can be responsible for extracting frames from a received video at a predetermined sampling rate. In some embodiments, the sampling rate can be between 50% and 100%. Face-detection module 1004 can be responsible for detecting faces within the extracted frames. In some embodiments, face-detection module 1004 can use the Viola-Jones algorithm for face detection. Face-detection module 1004 can be optional. In some embodiments, the extracted frames can be sent directly to head-position-detection module 1006 and facial-expression-detection module 1008.
- Head-position-detection module 1006 can be responsible for detecting and tracking the position/orientation of the head. In some embodiments, such detection and tracking can be done by identifying and tracking facial landmark points. More specifically, a 3D face-modeling technique (e.g., the ASM algorithm) can be used for identifying and tracking those facial landmark points. Head-position-detection module 1006 can determine the head position/orientation based on the output of the ASM algorithm.
- Head-position-detection module 1006 may use the output of face-detection module 1004 as input. Alternatively, head-position-detection module 1006 can use the entire raw frame as input.
- Facial-expression-detection module 1008 can be responsible for detecting facial expressions and the emotional state of the face.
- artificial intelligence e.g., a trained CNN
- the CNN can take each sampled frame as input, and generate an output, which can indicate the probabilities of each discrete emotion class, e.g., neutral, happy, sad, angry, fearful, surprised, disgusted, etc., for each frame.
- the CNN may also generate an output indicating the position or status of certain facial features, e.g., kissing, tongue out, eye wink, nose wrinkle, eyebrow raise, etc.
- Facial-expression-detection module 1008 may use the output of face-detection module 1004 as input. Alternatively, facial-expression-detection module 1008 can use pixels within the entire raw frame as input. In some embodiments, head-position-detection module 1006 and facial-expression-detection module 1008 can work concurrently.
- Emoji database 1010 can store a plurality of emojis, each of which can be designed to represent a certain type of human emotion, such as neutral, happy, sad, surprised, fearful, angry, disgusted, etc. Certain emojis can also be used to mimic certain movements of facial features, such as kissing, tongue out, eye wink, nose wrinkle, eyebrow raise, etc.
- emoji database 1010 may also store 3D emojis corresponding to different head position/orientations.
- emoji database 1010 may store 3D happy faces with different tilting angles.
- Emoji-animation module 1012 can be responsible for performing animations of 3D emojis based on the detected head position/orientation and facial expression.
- emoji-animation module 1012 can apply a morph-target animation technique. More specifically, emoji-animation module 1012 can select an emoji from emoji database 1010 based on the detected head position/orientation and facial expression of a frame, treat the selected emoji as a morph target, and use interpolation between the morph target and an emoji selected for a previous frame to achieve a smooth animation effect.
- Emoji-display module 1014 can be responsible for displaying the animated emoji. Emoji-display module 1014 may be the screen of a mobile device, for example.
- the emojis are drawn as humanoid faces with facial features such as eyes and mouth.
- additional facial features such as eyebrows, eyelashes, nose, tongue, can also be included in an emoji.
- other types of emojis that can express emotions or human facial expressions can also be used, such as avatars, cartoonish animals, or objects with human-like faces.
- the scope of the instant application is not limited by the design of the emoji, as long as the emoji can be used to indicate the user's emotional state or facial expression.
- FIG. 11 illustrates a system for generating real time animated avatars based on a user's facial expression, according to one embodiment.
- System 1100 can include a camera 1101, a frame- sampling module 1102, a face-detection module 1104, a face tracker 1106, a facial- expression-detection module 1108, an avatar database 1110, an avatar animation module 1112, and a display module 1114.
- Camera 1101 captures images (e.g., video images) that include a user's face.
- Frame-sampling module 1102 samples frames from the video feed.
- Face-detection module 1104 detects faces in the sampled images, and face tracker 1106 tracks the face from frame to frame and detects the user's head position based on the face-tracking result.
- Facial-expression-detection module 1108 can include a number of CNN modules (e.g., CNN modules 1108A-1108F).
- Avatar database 1110 stores avatar images with different facial expressions.
- Avatar animation module 1112 can generate an animated avatar based on outputs of face tracker 1106 and facial-expression-detection module 1108. Display module 1114 displays the animated avatar.
- detecting facial expressions based on a CNN requires obtaining a training set of images that are representative of the facial expressions to be detected.
- the accuracy of the output of a neural network in general is related to the size and quality of samples that are used for training (i.e. the training data set).
- the training set for each class of facial expression needs to include samples from all various conditions and orientations.
- The system shown in FIG. 11 attempts to address this issue by utilizing several convolutional neural networks (CNNs), each optimized and trained to detect facial expressions at a particular orientation.
- More specifically, facial-expression-detection module 1108 includes several CNNs (1108A-1108F). Each of these CNNs can be particularly trained for one facial expression at a specific orientation. For example, CNN 1108A can be trained to detect a particular facial expression at a particular head orientation.
- Face tracker 1106 can inform facial-expression-detection module 1108 of the orientation of the head. Then, based on the detected head orientation, the captured images are fed into the appropriate CNN for processing. This mechanism increases the accuracy of facial-expression-detection module 1108.
- the total number of the samples that are needed to train individual CNNs in facial expression detection module 1108 may be lower than the total number of samples that are needed to train one CNN that detects the facial expression.
- the aforementioned techniques, systems and architectures may be used by a droid or a robot to determine facial expressions of humans.
- a robot can use the above-noted techniques with respect to FIG. 11, up to the point where facial-expression-detection module 1108 receives input from face tracker 1106, to determine the facial expression of a human within the robot's line of sight (assuming the robot is equipped with one or more cameras).
- the ability to detect facial expressions and to infer user sentiment based on the facial expressions can have other applications. For example, it can provide analytics information on viewer sentiment after one or more users have viewed a piece of media content.
- FIG. 12 illustrates an exemplary scenario for user sentiment analysis, according to one embodiment.
- Viewer 1208 can use a computing device, such as smartphone 1202, to view media content 1203, such as a post, a picture, a video clip, or an audio clip, which can be posted via a social media application.
- camera 1204 on smartphone 1202 can be activated to capture viewer 1208's facial expressions.
- a facial-expression-detection module installed on smartphone 1202 can then analyze the captured facial expression information and determine viewer 1208's sentiment while viewer 1208 is viewing content 1203. The viewer sentiment can be sent to the author of content 1203 to provide feedback.
- a "wire frame” image (similar to the images shown on the far left side of FIGs. 7A-7F) of the detected viewer facial expression can be displayed such that the viewer can see what is being captured. This can be helpful because some viewers might object to their faces being recorded while viewing the content. On the other hand, capturing or recording a "wire frame” representation of the viewer's facial expression can be much more agreeable to most viewers.
- the system can display a dynamic "live” emoji that can change its look dynamically with the viewer's facial expression. For example, the system can display a "wire frame” representation of the viewer's face, and also display a dynamic emoji the expression of which tracks the movement of the "wire frame” in real time.
- the system can also display a morphed or distorted image of the user's face (similar to the images shown in the middle of FIGs. 7A-7F), which can reflect the user's facial expression without revealing the user's identity.
- the system can track a viewer's line of vision based on the relations among the facial landmark points. Note that determining the facial landmark points can be part of the facial-expression-detection operation. Based on the viewer's line of vision, the system can determine the viewer's sentiment accordingly.
- If the viewer's line of vision wanders away from the screen, the system can determine that the viewer's sentiment is "uninterested" or "does not care." Furthermore, the system can track the viewer's line of vision and analyze this information with respect to time duration. In one embodiment, the system can measure the amount of time the viewer's line of vision remains on the screen, and provide this information as part of the analytics information to the content producer. In other words, the system can provide literally the "eyeball time" a piece of content receives from a viewer.
- the system can detect how the viewer's sentiment changes with different parts of the content. For example, the system can detect that a viewer's sentiment changes from "indifferent” to "amused” at a given point of a funny video clip, and provide this sentiment-change information (e.g., "viewer sentiment changes to "amused” at 0:42 of the clip") to the content producer.
- the system can represent viewer 1208's sentiment with an image, such as an icon or emoji, in addition to, or in place of text and data-based information. For example, when the system determines that viewer 1208 is amused, the system can represent that sentiment with smiley-face emoji 1206. When the system determines that viewer 1208 is annoyed or frustrated, the system can represent that sentiment with an angry-face emoji. When the system determines that viewer 1208 is indifferent or impatient, the system can represent that sentiment with a bored-face emoji. In general, the system can use one of many emojis to represent a viewer's sentiments.
- a number of viewer responses can be detected.
- the sentiments corresponding to these responses can be determined, represented in emojis, and aggregated to indicate the general-mass response to a piece of content.
- FIG. 13 A illustrates a scenario for aggregating and displaying sentiments of a large number of users, according to one embodiment.
- smartphone 1302 sends a piece of content to a number of user devices 1304, 1306, 1308, and 1310. These user devices can display the content to their respective viewers.
- the cameras on these user devices can capture their respective viewers' facial expressions.
- a facial-expression-detection module installed on a user device can then determine the respective viewer's sentiments in response to viewing the content.
- information indicating viewer sentiments determined by facial-expression-detection modules of the user devices can be transmitted to server 1312, which can in turn aggregate the collected viewer sentiment information and transmit the aggregated sentiment information to smartphone 1302 (see the aggregation sketch following this list).
- smartphone 1302 can display this aggregate sentiment or facial expression information using emojis.
- smartphone 1302 displays that, in response to a certain piece of content, there are 5,531 viewers who laughed, 7,176 viewers who smiled, 3,024 viewers who were indifferent, and 701 viewers who disliked the content.
- server 1312 can be optional.
- the user devices can send information indicating viewer sentiment directly to smartphone 1302, which can perform the aggregation locally.
- FIG. 13B shows exemplary viewer sentiment analytics data provided to a content producer, according to one embodiment.
- the display of a smartphone can provide several types of viewer sentiment analytics information to the content producer.
- field 1320 can display a number of icons for content the producer has previously posted, and allow the producer to swipe through them to select a particular piece of content for which sentiment analytics is provided.
- Field 1322 can allow the user to select a time frame over which the sentiment analytics data is aggregated.
- Field 1324 can present the aggregated viewer sentiment analytics information.
- the aggregated viewer sentiment is displayed with various emoji icons, each representing a particular sentiment, with a corresponding number of viewers who expressed this sentiment next to the emoji icon.
- Field 1326 can show the average viewing time calculated from all viewers. In one embodiment, each viewer's viewing time is calculated based on the duration of time for which his line of vision intersects with the viewing device's screen.
- the content producer can scroll down the screen to access more sentiment analytics data, as shown in the additional screenshots in FIG. 13B.
- Field 1328 can show the total number of views, total number of screenshots viewers have taken of the content, and total number of reposts or shares viewers have generated based on this content.
- Field 1330 can show when the latest view has occurred and the latest viewer sentiment.
- Field 1332 can show avatars of the latest individual viewers who have viewed the content.
- Field 1334 can show the number of viewers during each hour of the day, and allow the user to select which day to display by selecting a day from a menu.
- Field 1336 can show the gender distribution of all the viewers. Note that it is possible for the facial-expression-detection module to determine a viewer's gender based on analysis of the facial features.
- field 1338 can show the locations of viewers, both in a text format and a highlighted-map format.
- FIG. 14 presents a flowchart illustrating an exemplary process of aggregating viewer sentiment analytics, according to one embodiment.
- a source device, which can be a smartphone or another type of computing device, sends a piece of content to a number of viewer devices.
- a respective viewer device plays the content to a viewer and detects the viewer's facial expression (operation 1404). Based on the detected viewer expression, the viewer device can determine the viewer's sentiment. Subsequently, the viewer device can transmit the viewer sentiment information to a server (operation 1406).
- the server can aggregate the general-mass viewer sentiment information received from multiple viewer devices (operation 1408), and transmit the aggregated viewer sentiment information to the source device (operation 1410).
- the source device can then display the aggregated viewer sentiment information using different emojis (operation 1412).
- the source device can display different emojis (e.g., happy emoji, sad emoji, etc.) indicating different user sentiments and display the statistics next to each emoji.
- One embodiment includes a non-transitory computer-readable storage medium storing instructions that, when executed by a computing device, cause the computing device to perform a method for dynamically generating and displaying emojis, the method comprising:
- the machine-learning technique comprises a convolutional neural network
- determining the user's head position comprises determining facial landmark points in the image using a shape modeling algorithm.
- applying the machine-learning technique further comprises using multiple specialized convolutional neural networks, wherein a respective specialized convolutional neural network is configured to generate a facial-expression-detection output based on a portion of the user's face.
- the method further comprises:
- aggregating facial-expression-detection outputs from the multiple specialized convolutional neural networks, wherein aggregating the facial-expression-detection outputs comprises applying weights to the outputs from the multiple specialized convolutional neural networks.
- generating the emoji image comprises:
- mapping the determined facial-expression class to a pre-stored emoji image; and modifying the pre-stored emoji image based on the user's head position.
- receiving the sequence of images comprises: receiving a live video feed from a camera associated with the computer; and sampling frames from the received live video feed.
- FIG. 15 illustrates an exemplary computer and communication system for displaying dynamic emojis, according to one embodiment.
- system 1500 includes a processor 1510, a memory 1520, and a storage 1530.
- Storage 1530 typically stores instructions that can be loaded into memory 1520 and executed by processor 1510 to perform the methods mentioned above. As a result, system 1500 can perform the functions described above.
- the instructions in storage 1530 can implement a face- detection module 1532, a head-position-detection module 1534, a facial-expression-detection module 1536, and an emoji-animation module 1538, all of which can be in communication with each other through various means.
- Face-detection module 1532 can detect a face from an input image or an input video frame.
- Head-position-detection module 1534 detects the position/orientation of the head associated with the face.
- Facial-expression-detection module 1536 determines an emotion class or certain facial expressions based on the detected face.
- Emoji-animation module 1538 generates an animated emoji based on the detected head position/orientation and the facial expressions.
- modules 1532, 1534, 1536, and 1538 can be partially or entirely implemented in hardware and can be part of processor 1510. Further, in some embodiments, the system may not include a separate processor and memory. Instead, in addition to performing their specific tasks, modules 1532, 1534, 1536, and 1538, either separately or in concert, may be part of general- or special-purpose computation engines.
- System 1500 can be coupled to an optional camera 1550 and a display-and-input module 1540 (e.g., a touchscreen display module), which can further include display 1580, keyboard 1560, and pointing device 1570.
- System 1500 can also be coupled via one or more network interfaces to network 1582.
- the data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system.
- the computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed.
- the methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above.
- a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium.
- modules or apparatus may include, but are not limited to, an application- specific integrated circuit (ASIC) chip, a field-programmable gate array (FPGA), a dedicated or shared processor that executes a particular software module or a piece of code at a particular time, and/or other programmable-logic devices now known or later developed.
- When activated, the hardware modules or apparatus perform the methods and processes included within them.
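As a rough illustration of the “eyeball time” measurement and the sentiment-change tracking referenced earlier in this list, the following Python sketch accumulates on-screen viewing time from per-frame gaze checks and records the content timestamps at which the detected sentiment changes; all class, field, and function names here are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ViewingAnalytics:
    """Per-viewer analytics accumulated while content is playing (hypothetical structure)."""
    eyeball_time_s: float = 0.0                       # total time the line of vision stayed on screen
    sentiment_changes: List[Tuple[float, str]] = field(default_factory=list)
    _last_sentiment: Optional[str] = None

    def update(self, timestamp_s: float, frame_period_s: float,
               gaze_on_screen: bool, sentiment: str) -> None:
        # Accumulate "eyeball time" only while the viewer's line of vision
        # intersects the device screen.
        if gaze_on_screen:
            self.eyeball_time_s += frame_period_s
        # Record the content timestamp whenever the detected sentiment changes,
        # e.g. "indifferent" -> "amused" at 0:42 of the clip.
        if sentiment != self._last_sentiment:
            self.sentiment_changes.append((timestamp_s, sentiment))
            self._last_sentiment = sentiment

# Example: a viewer looks away briefly, then becomes amused around t = 42 s.
analytics = ViewingAnalytics()
frames = [(41.8, True, "indifferent"), (41.9, False, "indifferent"), (42.0, True, "amused")]
for t, on_screen, mood in frames:
    analytics.update(t, frame_period_s=0.1, gaze_on_screen=on_screen, sentiment=mood)
print(analytics.eyeball_time_s, analytics.sentiment_changes)
```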
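Similarly, the aggregation of per-viewer sentiments into emoji counts (whether performed by server 1312 or locally on smartphone 1302) might look like the sketch below; the sentiment labels and emoji glyphs are placeholders chosen for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical mapping from detected sentiment classes to display emojis.
SENTIMENT_TO_EMOJI: Dict[str, str] = {
    "laughed": "😂",
    "smiled": "🙂",
    "indifferent": "😐",
    "disliked": "😠",
}

def aggregate_sentiments(reports: Iterable[str]) -> Dict[str, int]:
    """Count how many viewers expressed each sentiment.

    `reports` holds one sentiment label per viewer device, as transmitted either
    to a server or directly to the content producer's device.
    """
    return dict(Counter(reports))

def render_summary(counts: Dict[str, int]) -> str:
    """Format the aggregate as 'emoji count' pairs for display next to each icon."""
    return "  ".join(f"{SENTIMENT_TO_EMOJI.get(s, '?')} {n}" for s, n in counts.items())

# Example corresponding to the figure: 5,531 laughed, 7,176 smiled, and so on.
counts = {"laughed": 5531, "smiled": 7176, "indifferent": 3024, "disliked": 701}
print(render_summary(counts))
```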
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Human Computer Interaction (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Biodiversity & Conservation Biology (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Processing Or Creating Images (AREA)
- Image Analysis (AREA)
Abstract
A system is provided for dynamically generating and displaying emojis. During operation, the system receives a sequence of images capturing a user's head and face. For each image, the system determines the user's head position in the image; applies a machine-learning technique, which comprises a convolutional neural network (CNN), on the image to determine a facial-expression class associated with the user's face; and generates an emoji image based on the user's head position and the determined facial-expression class. The system then applies a morph-target animation technique to a sequence of emoji images generated for the sequence of images to generate an animated emoji and displays the animated emoji.
Description
SYSTEM AND METHOD FOR FACILITATING DYNAMIC AVATAR BASED ON REAL-TIME FACIAL EXPRESSION DETECTION
BACKGROUND
Field
[0001] The present application relates to a system and method for dynamically displaying an emoji corresponding to a user's facial expression and head position.
Related Art
[0002] The rapid development of mobile technologies has fundamentally changed the ways people communicate with each other. In addition to conventional phone calls and text messages, more and more people are relying on social networking apps (e.g., Facebook® and Instagram®) to stay connected. In addition to text, many social networking apps allow users to send voice messages, pictures, and videos.
[0003] In order to bring emotions into text-based messages or online postings, emoticons, and later emojis, have been widely adopted by users. Smiles or smiling faces have been very popular among users to express their emotions, either happy or sad, when communicating electronically. However, traditional emoji use requires the user to either select from a dropdown menu or use a combination of keystrokes in order to input an emoji. Such a process can be cumbersome. Moreover, currently available emojis may not be able to accurately reflect users' emotions or expressions in real time while they are communicating with friends or consuming content.
SUMMARY
[0004] One embodiment of the present invention provides a system for dynamically generating and displaying emojis. During operation, the system receives a sequence of images capturing a user's head and face. For each image, the system determines the user's head position in the image; applies a machine-learning technique, which comprises a convolutional neural network (CNN), on the image to determine a facial-expression class associated with the user's face; and generates an emoji image based on the user's head position and the determined facial-expression class. The system then applies a morph-target animation technique to a sequence of emoji images generated for the sequence of images to generate an animated emoji and displays the animated emoji. The animated emoji mimics the facial expression of the user at the specific angle where the user's face is located.
[0005] In a variation on this embodiment, determining the user's head position comprises determining facial landmark points in the image using a shape modeling algorithm.
[0006] In a variation on this embodiment, applying the machine-learning technique further comprises using multiple specialized convolutional neural networks. A respective specialized convolutional neural network is configured to generate a facial-expression-detection output based on a portion of the user's face.
[0007] In a further variation, the method further comprises aggregating facial-expression-detection outputs from the multiple specialized convolutional neural networks by applying weights to the facial-expression-detection outputs from the multiple specialized convolutional neural networks.
[0008] In a variation on this embodiment, the determined facial-expression class is selected from a group consisting of: neutral, happy, sad, fear, angry, surprise, disgust, tongue out, kiss, wink, eyebrow raise, and nose wrinkle.
[0009] In a variation on this embodiment, generating the emoji image comprises mapping the determined facial-expression class to a pre-stored emoji image and modifying the pre-stored emoji image based on the user's head position.
[0010] In a variation on this embodiment, receiving the sequence of images comprises receiving a live video feed from a camera associated with the computer and sampling frames from the received live video feed.
BRIEF DESCRIPTION OF THE FIGURES
[0011] FIG. 1 shows an exemplary system for detecting facial expressions, according to one embodiment.
[0012] FIG. 2 shows an exemplary system for detecting facial expressions from a video feed, according to one embodiment.
[0013] FIG. 3 illustrates an exemplary system for detecting facial expressions, according to one embodiment.
[0014] FIG. 4 illustrates an exemplary CNN system for outputting a facial-expression class, according to one embodiment.
[0015] FIG. 5 illustrates a system for generating and displaying emojis based on a user's facial expression, according to one embodiment.
[0016] FIGs. 6A-6B illustrate an exemplary 3D object in different states.
[0017] FIGs. 7A-7F show an exemplary morph-target animation, according to one embodiment.
[0018] FIGs. 8A-8D illustrate different emoji states.
[0019] FIG. 9 presents a flowchart illustrating an exemplary process for displaying a dynamic emoji based on a user's facial expression, according to one embodiment.
[0020] FIG. 10 illustrates a system for generating animated emojis based on a user's facial expression, according to one embodiment.
[0021] FIG. 11 illustrates a system for generating real time animated avatars based on a user's facial expression, according to one embodiment.
[0022] FIG. 12 illustrates an exemplary scenario for user sentiment analysis, according to one embodiment.
[0023] FIG. 13A illustrates a scenario for aggregating and displaying sentiments of a large number of users, according to one embodiment.
[0024] FIG. 13B shows exemplary viewer sentiment analytics data provided to a content producer, according to one embodiment.
[0025] FIG. 14 presents a flowchart illustrating an exemplary process of aggregating viewer sentiment analytics, according to one embodiment.
[0026] FIG. 15 illustrates an exemplary computer and communication system for displaying dynamic emojis, according to one embodiment.
[0027] In the figures, like reference numerals refer to the same figure elements.
DETAILED DESCRIPTION
[0028] The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Overview
[0029] Embodiments of the present invention provide a system and method for dynamic displaying of emojis that correspond to a user's facial expression and head position. More specifically, the system can capture live video or images of a user's face and analyze the video or images using artificial intelligence (e.g., trained convolutional neural networks) in order to detect the user's emotions and facial movements in real time. In addition to detecting facial
expressions, the system can also track the orientation of the user's head by detecting and tracking
a number of facial landmark points (e.g., the nose or eyes). The system can then use a morph-target animation technology to display a dynamic or live emoji based on both the detected head orientations and facial expressions.
Facial Expression Detection Based on Landmark Points
[0030] Face detection and recognition technologies have been around for decades and have found applications in many areas, including many commercial and law enforcement applications. In addition to recognizing faces and identifying certain individuals based on their faces, the ability to recognize facial expressions can also be important. Facial expression is a visible manifestation of the affective state, cognitive activity, intention, personality, and psychopathology of a person. Facial expressions, and other gestures, convey non-verbal communication cues in face-to-face interactions. In online communications that do not involve face-to-face interactions, recognizing users' facial expressions and displaying emojis that can reflect the users' facial expressions make such communication more lively without burdening the user with manual input of emojis. Moreover, facial expressions can play an important role wherever humans interact with machines. Automatic recognition of facial expressions may act as a component of natural human machine interfaces (some variants of which are called perceptual interfaces or conversational interfaces). Such interfaces would enable the automated provision of services that require a good appreciation of the emotional state of the service user. Some robots can also benefit from the ability to recognize expressions. For example, a therapy robot caring for sick or disabled individuals may deliver care more effectively if it can recognize the emotional state of the patient being cared for.
[0031] Various technologies can be used for automatic facial expression recognition. One typical approach for detecting user emotions in real time can involve first detecting a face, fitting a general face pattern containing facial landmarks (e.g., nose, eyes, eyebrows, etc.) on the detected face, adjusting landmark positions based on the detected face, tracking movements of the landmarks, and determining user emotions based on the movements of the landmarks.
[0032] More specifically, the first step in facial expression analysis is to detect the face in the given image or video sequence. Locating the face within an image is called face detection or face localization, whereas locating the face and tracking it across the different frames of a video sequence is called face tracking. One popular face detecting method is the Viola-Jones algorithm, which can provide robust, real-time detection of human faces. Briefly, the Viola-Jones algorithm works by repeatedly scanning an image to find a familiar pattern. For example, if a few dark pixels have been found and assumed to be a human eye, it is predicted that another similar pattern can be found on either the left or right side.
[0033] Once the human face is detected, the next operation is to find individual facial features, such as the nose, lips, mouth, etc. Various algorithms, such as the Active Shapes Model (ASM) algorithm, can be used in this operation. ASMs are statistical models of the shape of objects which iteratively deform to fit an example of the object in a new image. The ASM algorithm works by first manually marking facial features (e.g., nose, lips, eyes, etc.) on many sample face images, and then forming a general face pattern based on the average locations of the markings. This general face pattern is sometimes called the "universal face model" and the markings can be called "facial landmarks" or "landmark points." Once a face is detected, the general face pattern can be placed on top of the face in order to create a model for that particular face.
[0034] However, the general face pattern does not fit every face, and adjustments to the landmark positions are needed. To do so, the image of the detected face can be analyzed to extract certain facial features, and landmark points corresponding to those facial features can be adjusted accordingly to better fit the face. For example, if a detected face has a very narrow chin, when the general face pattern is placed on the face, the landmark points on the perimeter of the chin will not align to the chin line of the detected face. To make the model fit, the landmark points on the chin line will be moved to locations of the chin of the detected face. Note that the location of the chin of the detected face can be determined by scanning the image to find sharp contrasts in the color of pixels. Once the landmark points of a face have been determined, the facial expression can be extracted based on the relative positions of the landmark points. For example, if the two corners of a mouth are closer to the eyes than the center portion of the mouth, one may conclude that the face indicates a happy emotion.
[0035] FIG. 1 shows an exemplary system for detecting facial expressions, according to one embodiment. In FIG. 1, facial-expression-detection module 100 can include a face-detection module 102, an ASM module 104, and a classifier 106. During operation, a captured image can be sent to face-detection module 102, which can run a face-detection algorithm, such as the Viola-Jones algorithm, to detect one or more faces included in the captured image. The face portion of the image can then be sent to ASM module 104, which can use the ASM algorithm to model the detected face and adjust facial landmark points based on the input image. In some embodiments, ASM module 104 can output 2D or 3D coordinates of various facial features, e.g., eyes, nose, lips, etc. In some embodiments, ASM module 104 can output a generic 3D model of the face with landmark points of the face adjusted based on the input image.
[0036] The output of ASM module 104 can be sent to classifier 106 as input. Classifier 106 has been previously trained using annotated sample data similar to its input. For example, if the input of classifier 106 includes 2D or 3D coordinates, classifier 106 can be previously trained
using data comprising 2D or 3D coordinates of known facial features. Similarly, if the input of classifier 106 includes a face model with modified landmark points, classifier 106 can be previously trained using data comprising face models. Based on its input, classifier 106 can output an emotion indicator or facial-expression class selected from the aforementioned discrete emotion classes, such as neutral emotion, happiness, sadness, surprise, fear, anger, disgust, etc. Moreover, the facial-expression classes can also include facial expressions defined by positions of one or more facial features, e.g., lips, eyes, eyebrows, etc. For example, a tongue-out face can belong to a particular facial-expression class. Similarly, an eye-wink face can belong to a different facial-expression class.
[0037] To determine the facial expression in a video feed, applying the Viola-Jones algorithm to detect a face in an image and then placing landmark points on the detected face need to be done continuously for the frames of the video feed. In some cases, only a subset of the frames of the video feed is processed, because processing every frame increases the need for processing power. When the landmark points are continuously placed on frames of a video feed, as the head moves or rotates, the landmark points will also move, albeit with some error, to the new location. For example, if a landmark point is placed at the tip of the nose, as the head rotates, the coordinates where the tip of the nose is located change, and the next time the processes are performed to place the landmark points, the landmark point for the tip of the nose will be moved to a new coordinate. Therefore, when the processes to detect the face and place the landmark points are performed for frames of a video feed, the landmark points "track" the movement of the head. For this reason, the software module that detects a face in frames of a video feed and subsequently places the landmark points at the designated locations can be called a "face tracker."
[0038] In the examples shown in FIG. 1, the input is a still image, and the output is a single emotion or facial-expression class based on faces included in the still image. In many applications, the input can be a live video feed or a video clip, in which a face's expression may change continuously, and the facial-expression-detection system needs to be able to detect the changing facial expression. The facial-expression-detection system can treat the video input as a sequence of individual still images and perform the facial-expression-detection operation on each and every image. Alternatively, the facial-expression-detection system can sample images from the video input. For example, instead of processing all 30 images in one second of the video, the facial-expression-detection system can evenly extract five or ten images from the 30 images to process. This can reduce the amount of computation needed for detecting facial expressions. A higher sampling rate (e.g., 20 frames per second) can result in a more accurate detection result. In some embodiments, the facial-expression-detection system can include a face-tracking module
that tracks the movements of the landmark points, and the classifier can then detect facial expressions or changes of facial expressions based on movements of the landmarks.
[0039] Recognizing emotion from a video feed is more computationally intensive than from a still image, because one second of video contains 30 frames, and the aforementioned operations, including face detection, fitting the general face pattern to the detected face, and adjusting the landmark points, need to be performed on each and every frame. By doing so, not only can the landmark points on a face be determined, but the movements of the landmark points can also be tracked. The facial expression and/or changes of the facial expression can be determined based on the tracked movements of the landmark points. For example, if it is detected that the two corners of the mouth are moving upward, one may conclude that the video feed shows a laughing face.
[0040] FIG. 2 shows an exemplary system for detecting facial expressions from a video feed, according to one embodiment. In FIG. 2, facial-expression-detection module 200 can include image sampler 202, face-detection module 204, face tracker 206, and classifier 208.
[0041] During operation, image sampler 202 samples a received video stream at a predetermined sample rate. In some embodiments, to ensure accuracy, the predetermined sample rate is at least 50%. The sample images can be sent to face-detection module 204, which can detect faces in each image. The detected faces can then be sent to face tracker 206, which can identify and track landmark points on the detected faces. Classifier 208 can take the output of face tracker 206 as input and output a sequence of emotion classes. For example, if the output of face tracker 206 indicates that the corners of the mouth move upward from one frame to the next, classifier 208 can output an emotion class as happy for the time instance corresponding to the frames.
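A minimal sketch of this sampling-then-tracking flow is given below; `detect_face`, `track_landmarks`, and `classify` are stand-ins for whatever concrete implementations the system uses (e.g., a Viola-Jones detector, an ASM-based tracker, and a trained classifier), so their signatures are assumptions rather than part of the disclosure.

```python
def sample_frames(frames, sample_rate=0.5):
    """Keep roughly `sample_rate` of the incoming frames (e.g., every other frame)."""
    stride = max(1, round(1.0 / sample_rate))
    return frames[::stride]

def detect_expressions(frames, detect_face, track_landmarks, classify):
    """FIG. 2-style flow: image sampler -> face detection -> face tracker -> classifier."""
    emotions = []
    previous_landmarks = None
    for frame in sample_frames(frames):
        face = detect_face(frame)                              # e.g., Viola-Jones detector
        if face is None:
            continue
        landmarks = track_landmarks(face, previous_landmarks)  # e.g., ASM-based tracker
        # The classifier can use landmark movement between frames, e.g. mouth
        # corners moving upward -> "happy" for the corresponding time instance.
        emotions.append(classify(landmarks, previous_landmarks))
        previous_landmarks = landmarks
    return emotions
```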
[0042] As one can see, the process of fitting a general face pattern to a detected face is not precise, and it is quite possible that certain landmark points do not end up at the coordinates where they need to be. As a result, it can be difficult to achieve a desired accuracy of facial expression detection by tracking landmark points. Note that the entire process of "face tracking" suffers from two levels of error that are inherent in the processes involved. First, fitting the landmark points on a face is an imprecise process. As noted above, the universal face model is a one-size-fits-all model, and it further needs adjustment for a particular face in an image. These adjustments introduce an error rate in the coordinates of the landmark points. Also, as noted above, when a universal model is placed on a face, the image is scanned and, based on certain imperfect assumptions, the landmark points of the universal face model are moved to fit the face in the image. For example, when the algorithm looks for a sharp contrast in the shade of the image pixels to determine where the jaw line is, the location of this contrast may be erroneously affected by the ambient lighting environment. Therefore, the algorithm may place the landmark points in a wrong location. Second, this problem is further exacerbated when this is done in a video feed. When this process is performed on a series of frames in rapid succession, the rate of change of the coordinates of facial elements in subsequent frames may be large enough that the landmark points cannot be placed on the new coordinates at that speed; the process breaks down, and the face tracker loses its ability to effectively track the face. This problem can be appreciated by observing mobile applications that place augmented-reality elements on the face (such as a hat, ears, or glasses). When the user turns his/her head rapidly in front of the camera, these objects fall off and, for a duration of time, do not stay in the proper coordinates. Such applications use face trackers to detect facial elements and track them to place the objects on the face.
[0043] To better understand the limitations of landmark points, we must first consider their history. Landmark points are a relic of an era in the field of computer vision when the processing power of processors was not sufficient for processing an entire image of the face. In light of this limitation, only certain discrete "landmark" points were located and tracked, and the process by which this was done is prone to error, as noted above. Therefore, detecting and determining human facial expression and emotion with this technique is highly imprecise. One way to remedy this problem is to utilize existing powerful processors and track several additional landmark points on the face. In addition, a depth sensor may be used along with a camera to determine the location of the landmark points along the Z-coordinate in 3D space. Tracking additional landmark points alleviates the noted shortcomings, as the additional data points may be used to reduce the error rate. However, traditional face-tracking technology that uses landmark points is entirely incapable of detecting several facial expressions. For example, if we want to detect when a person sticks his or her tongue out, a face tracker cannot be of use, since the tongue is not always visible and hence it is not possible to track it with landmark points.
Facial Expression Detection Based on Optimized Convolutional Neural Networks
[0044] To improve the accuracy of the facial expression detection system, the disclosed solution can harness the improved computation power in modern computing systems, such as graphics processing units (GPUs). More specifically, instead of tracking the limited number of discrete landmark points, the entire face can be processed, with every pixel being taken into account. In some
embodiments, artificial intelligence (AI) can be used to analyze images or video feeds of human faces in order to detect facial expressions.
[0045] Among many machine-learning technologies, convolutional neural networks (CNNs) have been successfully applied to analyzing visual imagery. Compared to other image classification algorithms, CNNs use relatively little pre-processing. This means that the network learns the filters that in traditional algorithms were hand-engineered. In some embodiments, one or more CNNs can be trained using an image library comprising human faces. For example, a CNN can be trained to learn happy faces (including smiles and laughs), sad faces, angry faces, surprised faces, fearful or disgusted faces, and stares. In some embodiments, the CNN can also be trained to learn other types of facial expressions, such as wink, kiss, tongue out, nose wrinkle, raising one or both eyebrows, etc.
[0046] During operation, the CNN can take as input an image of a face and return the probability of the image belonging to one of the classes of emotions, such as neutral emotion, happiness, sadness, surprise, fear, anger, disgust, etc. The neural network can return probabilities of a face image belonging to each of those classes, for example: 0.7 for happiness, 0.2 for anger, 0.1 for surprise, 0.0 for sadness, 0.0 for neutral, 0.0 for disgust, 0.0 for fear. In addition, the CNN may also track the movements of certain facial features, such as the tongue and eyebrows, for example, tongue out to the left or raised left eyebrow.
[0047] CNNs are generally designed to work with color images, but considering a gray-scale image for the sake of simplicity and ease of explanation, as an example, the input to the CNN can be the values and positions of each pixel. More specifically, in a gray-scale image, each pixel is represented by a number belonging to the interval [0.0, 1.0], where 0.0 denotes white, 1.0 denotes black, and all other numbers denote a certain shade of gray. An image can be represented by a two-dimensional matrix. For ease of calculation, such a matrix can be converted into an n-by-1 vector, where consecutive rows of the matrix are appended one after another, and the n-by-1 vector becomes the input to the CNN.
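To make the pixel-to-vector conversion and the probability-style output concrete, here is a small NumPy sketch; the class ordering and the example probabilities are illustrative assumptions, not values taken from the disclosure.

```python
import numpy as np

EMOTION_CLASSES = ["happiness", "anger", "surprise", "sadness", "neutral", "disgust", "fear"]

def image_to_input_vector(gray_image: np.ndarray) -> np.ndarray:
    """Normalize an 8-bit gray-scale image so that 0.0 denotes white and 1.0
    denotes black, then flatten it row by row into an n-by-1 input vector."""
    normalized = 1.0 - gray_image.astype(np.float32) / 255.0
    return normalized.reshape(-1, 1)

# A 2x2 toy image and a hypothetical probability vector returned by a trained CNN:
toy = np.array([[0, 255], [128, 64]], dtype=np.uint8)
x = image_to_input_vector(toy)                                     # shape (4, 1)
probabilities = np.array([0.7, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0])
predicted_class = EMOTION_CLASSES[int(np.argmax(probabilities))]   # -> "happiness"
```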
[0048] A neural network can be defined by its architecture (number of input neurons, number and size of layers) and by the weight factors w. Finding the weight factors w involves a process known as "training of a neural network." In order to train the neural network, a set of images with annotated emotions (which can be manually generated) may be needed. The size of this set should be correlated with the architecture of the neural network. Deep neural networks consisting of many layers may require a dataset on the order of a million images.
[0049] To reduce the number of images needed for training, in some embodiments, multiple specialized AIs can work together, each implementing a CNN trained for a particular task. More specifically, each AI can be trained with a narrower task of detecting emotion based
on a portion of the face, e.g., eyes, nose, lip, tongue, etc. For example, a mouth AI can be trained to output probabilities of a face image belonging to each of the emotion classes based on values of pixels in the mouth area. Similarly, an eye AI can be trained to output probabilities of a face image belonging to each of the emotion classes based on values of pixels in the eye area. This can significantly reduce the size of the training set. Note that the same images can be used to train different specialized AIs, with each individual AI receiving the relative portion of each image. For example, the mouth AI may only look at the mouth area of a training image, whereas the eye AI may only look at the eyes of the training image.
[0050] FIG. 3 illustrates an exemplary system for detecting facial expressions, according to one embodiment. In FIG. 3, facial-expression-detection AI system 300 can include a number of specialized AI subsystems, such as eyebrow AI 302, eye AI 304, nose AI 306, mouth AI 308, etc. Facial-expression-detection AI system 300 can also include a pre-processing module 310, which can pre-process the to-be-analyzed image. In some embodiments, pre-processing module 310 can use the aforementioned Viola-Jones and ASM algorithms to detect a face included in the image and locate landmark points on the detected face. Based on the landmark points, pre-processing module 310 can divide the image of the face into multiple portions, each portion representing a facial feature, e.g., eyebrows, eyes, nose, mouth, etc. Each portion of the image can then be sent to the corresponding AI subsystem for processing. It will be understood by those skilled in the art that the term "AI" may be used as a generalized term to broadly refer to a neural network.
[0051] The outputs from the specialized AIs can be sent to a post-processing module 320, which aggregates the results returned from the various AI subsystems and generates a final output indicating an emotion class of the face included in the image. In some embodiments, various weight functions (e.g., w_eyebrow, w_mouth, etc.) can be applied to the outputs of the AI subsystems in order for post-processing module 320 to generate a final output. Note that the weight functions can also be obtained via training of the entire facial-expression-detection AI system 300.
Alternatively, an expert (e.g., a psychologist) can assign weight functions for the different AIs. For example, genuine smiles often involve movements of muscles around the eyes, hence, the eye AI may be assigned a larger weight function when its output indicates a happy face.
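A weighted aggregation of the specialized subsystems' outputs, as performed by a post-processing module like module 320, could be sketched as follows; the class list, region names, and example weights (e.g., favoring the eye AI for smiles) are assumptions chosen for illustration.

```python
import numpy as np

EMOTION_CLASSES = ["neutral", "happy", "sad", "surprise", "fear", "anger", "disgust"]

def aggregate_region_outputs(region_probs: dict, region_weights: dict) -> str:
    """Combine the per-region probability vectors produced by the eyebrow, eye,
    nose, and mouth AIs into one facial-expression class via a weighted sum."""
    combined = np.zeros(len(EMOTION_CLASSES))
    for region, probs in region_probs.items():
        combined += region_weights.get(region, 1.0) * np.asarray(probs, dtype=float)
    combined /= combined.sum()
    return EMOTION_CLASSES[int(np.argmax(combined))]

# Example: the eye AI is weighted more heavily because genuine smiles involve the eyes.
region_probs = {
    "eyes":  [0.1, 0.7, 0.0, 0.1, 0.0, 0.1, 0.0],
    "mouth": [0.2, 0.6, 0.0, 0.1, 0.0, 0.1, 0.0],
}
region_weights = {"eyes": 1.5, "mouth": 1.0}
print(aggregate_region_outputs(region_probs, region_weights))  # -> "happy"
```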
[0052] In order to achieve a better training outcome, in some embodiments, the training data may be augmented in order to include scenarios that may not be included in the gathered data. In some embodiments, the gathered training data can be pre-processed such that certain facial features can be modified to account for relatively rare facial expressions (e.g., someone making a weird-looking face). A specially developed algorithm can be used to pre-process training images to modify certain features.
[0053] In one embodiment, a specially developed algorithm can be applied to 2D pictures of faces to rotate the training samples, producing images that would otherwise have had to be taken with a camera at different angles. It can be appreciated that such an operation reduces the need to gather additional samples.
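As one possible stand-in for such an augmentation step, the short sketch below rotates a 2D face picture with Pillow to synthesize views at several angles; the angle set is an arbitrary choice and the use of Pillow is an assumption, not part of the disclosed algorithm.

```python
from PIL import Image

def augment_with_rotations(face_image: Image.Image, angles=(-20, -10, 10, 20)):
    """Synthesize extra training samples by rotating a 2D face picture, standing in
    for views that would otherwise have required re-shooting the subject with the
    camera at different angles."""
    return [face_image.rotate(angle, expand=True) for angle in angles]
```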
[0054] In one embodiment, multiple AIs may be utilized where each of them shares a degree of similarity and differs from the others in the group by having one or more additional classes. This approach, which can be referred to as "AI layering," may be advantageous depending on the particular application to increase the accuracy of the system. AI layering is a novel system architecture. Note that the architecture of a neural network is different from the architecture of a system. For example, the AI layering system architecture may be built using known neural network architectures such as MobileNet, which is a neural network architecture optimized for mobile devices.
[0055] A specific example of an AI layering system is provided below. It will be understood by those skilled in the art, that this example is illustrative and does not limit the scope of the invention.
[0056] FIG. 4 shows a specific example of an AI layering system, according to an embodiment. The specific embodiment shown in FIG. 4 is illustrative and not restrictive. For ease of explanation, the system of FIG. 4 is directed toward a system architecture for
classification of facial expressions. Those skilled in the art will appreciate that this architecture can be used for other types of classifications. FIG. 4 shows an input 402 and three CNNs (404, 406 and 408), and the layering module 410. CNN 404 has seven classes (404A-404G), each of which corresponds to one facial expression: smile, laugh, neutral, kiss, tongue out, shock, and sad. CNN 406 has six classes (406A-406F). The classes for CNN 406 are similar to those of CNN 404, except it lacks the sad (404G) class. CNN 408 has five classes (408A-408E). The classes for CNN 408 are similar to those of CNN 406, except it lacks the shock (406F) class.
[0057] Generally, as the number of classes for a neural network increases, the accuracy of the network is challenged. The aim of the AI layering system of FIG. 4 is to detect and classify the seven facial expressions covered by CNN 404. However, it uses two additional CNNs with smaller numbers of classes in the manner described below to increase the accuracy of the overall system. Assume that an image is received from input 402 and fed into all three CNNs 404, 406 and 408. Suppose that CNN 404 classifies that image as neutral. In that case, layering module 410 takes that result as the final result. Now, assume that another image is fed into the CNNs and this time the image is classified as smile by CNN 404. In this case, layering module 410 will not make a determination until the classification result from CNN 406 is also received.
Furthermore, assume that another image is fed into the CNNs and this time CNN 404 classifies
that image as shock. In this case, layering module 410 will consider the result of all three networks to make a decision.
[0058] The performance of the neural networks can be evaluated with an "evaluation set." The evaluation set is a smaller data set than the training set and includes samples different from those in the training set. In the example shown in FIG. 4, the evaluation set has revealed that the classification result for neutral (class 404G) is quite accurate, with a high enough probability. This may be because the vast majority of humans show similar facial characteristics when they have no facial expression. Furthermore, CNN 404 may also have a good classification capability for smile (class 404F), but not as good as for neutral (class 404G). In this case, CNN 406 also classifies smile (class 406F). Given that CNN 406 has one fewer class, and consequently a smaller number of layers, it can be more accurate for classifying smile. Therefore, when both CNN 404 and CNN 406 classify an image as smile, layering module 410 will determine that result to be conclusive. On the other hand, if an image is classified as any other class by CNN 404, layering module 410 considers the results of all three CNNs that classify that facial expression. Generally, layering module 410 assigns more weight to the result of a CNN with a smaller number of classes, since as the number of classes is reduced, the accuracy of the neural network appears to increase.
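The decision rule of layering module 410 could be sketched roughly as follows; the exact agreement checks and the weights given to the smaller networks are assumptions, since the disclosure only outlines the general behavior.

```python
def layered_decision(pred_7: str, pred_6: str, pred_5: str) -> str:
    """Combine the outputs of the 7-, 6-, and 5-class CNNs of FIG. 4 (sketch):
    - 'neutral' from the 7-class CNN is accepted directly;
    - 'smile' is accepted only when the 6-class CNN agrees;
    - any other class is resolved by consulting all three CNNs, giving more
      weight to the networks with fewer classes."""
    if pred_7 == "neutral":
        return pred_7
    if pred_7 == "smile" and pred_6 == "smile":
        return "smile"
    votes = {}
    for prediction, weight in ((pred_7, 1.0), (pred_6, 1.5), (pred_5, 2.0)):
        votes[prediction] = votes.get(prediction, 0.0) + weight
    return max(votes, key=votes.get)

# Example: the 7- and 6-class CNNs agree on "shock", outvoting the 5-class CNN.
print(layered_decision("shock", "shock", "tongue out"))  # -> "shock"
```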
[0059] One advantage of this system architecture is that the neural networks can be trained with smaller sample sizes. One issue in training neural networks is access to a large enough dataset to properly train the neural network. As the number of classes and layers increases, the need for a larger dataset also increases. However, the architecture of FIG. 4 can be used to achieve the same result with a smaller number of samples, which is tremendously advantageous in terms of both time and cost savings.
Displaying Dynamic Emoji Based on Facial Expression Detection
[0061] There are many applications of facial expression detection. As discussed previously, it can be used to assist or enhance the human-machine interface. In addition, during online human-to-human communication (e.g., online chatting or online posting), the system can display emojis that reflect the real emotional status or sentimental mode of the communication parties. For example, the system may include an emoji database, which can include a set of pre- generated emojis corresponding to various known emotional states. A happy-face emoji can correspond to the user being happy, and a sad-face emoji can correspond to the user being sad. Moreover, various facial movements, such as kissing, tongue out, nose wrinkling, winking, etc., can also have their corresponding emojis. This allows for a hands-free emoji input. More
specifically, this allows a user to enter an emoji by activating the facial-expression-detection module and facing toward a camera associated with the computing device for communication, e.g., a smartphone or laptop computer. For example, when the facial-expression-detection module detects that the user is amused while communicating online (e.g., writing an online post or texting his friend), the system can generate and display a smiley-face emoji, without the need for a manual input. Similarly, when the user winks, the system can generate a wink emoji. In a further example, the user can make an angry face, and the system can display an angry face emoji. In addition to displaying emojis, the system can also perform other types of operation in response to the user's facial expressions. For example, a smartphone can be configured in such a way that when a user winks his right eye twice, a currently active window can be closed. Other facial expressions or combinations of facial expressions can be used to trigger different operations. The facial-expression-detection module can be trained to recognize these different facial movements and execute predetermined operations according to detection of facial movements.
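A hands-free mapping from detected facial-expression classes to emoji input, together with a gesture-triggered command such as a double wink closing the active window, might look like the sketch below; the emoji table, the callback names, and the double-wink rule are illustrative assumptions rather than defined behavior.

```python
# Hypothetical emoji table and gesture rule; the actual bindings are configurable.
EXPRESSION_TO_EMOJI = {
    "happy": "🙂", "laugh": "😂", "sad": "😢", "angry": "😠",
    "wink": "😉", "kiss": "😘", "tongue_out": "😛",
}

def on_expression_detected(expression: str, recent_events: list,
                           insert_emoji, close_active_window) -> None:
    """Hands-free input: insert the emoji matching the detected expression and
    treat two consecutive winks as a command to close the active window."""
    if expression in EXPRESSION_TO_EMOJI:
        insert_emoji(EXPRESSION_TO_EMOJI[expression])
    recent_events.append(expression)
    if recent_events[-2:] == ["wink", "wink"]:
        close_active_window()
        recent_events.clear()

# Example wiring with print-based placeholders:
events: list = []
for _ in range(2):
    on_expression_detected("wink", events, insert_emoji=print,
                           close_active_window=lambda: print("window closed"))
```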
[0062] In addition to the emotional status of the user, in some embodiments, displaying the emoji can also involve determining the user's head position/orientation with respect to his body, and the displayed emoji can reflect not only the user's sentiment but also his head position, such as tilting to the left or right, chin up or down, etc. For example, if the user tilts his head to the left and smiles at the camera, the system can display a left-tilted smiley. Various techniques can be used to detect the user's head position/orientation. In some embodiments, the system can detect a user's head position/orientation based on facial landmark points outputted by a face-modeling module, such as an ASM module. For example, by tracking the positions of the landmark points, the system can determine the movements of the head, such as shaking or nodding.
[0063] The system can then combine the detected head position/orientation with the detected facial expression in order to generate and display an emoji that can reflect both the user's head position and facial expression. In some embodiments, the system may map a detected facial expression and head position to an existing emoji. In this scenario, the system needs to generate, beforehand, different emojis reflecting different head positions of a particular facial expression. For example, the system may have generated a left-tilting happy face and a right-tilting happy face. Depending on the detected head position, the system will display a corresponding happy face. In some embodiments, the system may need to generate a new emoji by modifying an existing emoji based on the detected head position. For example, if the system detects that the user's head is tilting to the left by 30°, the system may modify an existing forward-facing emoji to achieve the effect of 30° left tilting.
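A minimal sketch of emoji generation that accounts for head position is shown below, assuming Pillow images for the pre-stored emojis and reducing head position to a single in-plane tilt angle; a full implementation would handle richer orientations (chin up/down, left/right turn) or additional pre-stored variants.

```python
from PIL import Image

def generate_emoji(expression: str, head_tilt_degrees: float,
                   emoji_db: dict) -> Image.Image:
    """Map the detected facial-expression class to a pre-stored emoji image and
    adjust it for the detected head position (here, a simple in-plane rotation)."""
    base = emoji_db[expression]          # e.g., a forward-facing happy-face emoji
    # A 30-degree tilt of the user's head is rendered as a 30-degree rotation
    # of the emoji image (positive angles rotate counter-clockwise in PIL).
    return base.rotate(head_tilt_degrees)
```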
[0064] FIG. 5 illustrates a system for generating and displaying emojis based on a user's facial expression, according to one embodiment. In FIG. 5, system 500 can include face- detection module 502, head-position-detection module 504, facial-expression-detection module 506, emoji database 508, emoji-generation module 510, and emoji-display module 512. During operation, an image can be sent to face-detection module 502. The detected face (e.g., the portion of the image that includes the face) can be sent to both head-position-detection module 504 and facial-expression-detection module 506.
[0065] In some embodiments, head-position-detection module 504 can detect the position of the user's head using various face-modeling technologies, such as ASM. For example, the ASM algorithm can output positions of facial landmark points, such as chin, nose, eyes, etc. These landmark point positions can then be sent to head-position-detection module 504, which can then determine, based on the relative locations of the facial landmarks (e.g., the relative locations between the nose and the chin), the position/orientation of the head.
[0066] Facial-expression-detection module 506 receives the detected face and outputs an emotion and/or facial-expression class. In some embodiments, facial-expression-detection module 506 can include a CNN, which has been trained beforehand to recognize certain emotional states based on facial expressions, including neutral, happy, sad, fearful, angry, surprised, disgusted, etc. Moreover, sub-categories are also possible. For example, facial expressions corresponding to the happy emotion may include sub-categories such as smile and laugh. In addition, the CNN can be trained to recognize certain facial features and their positions. For example, the CNN can be trained to recognize the tongue and track its position, such that it can detect when the user is sticking out his tongue. Similarly, the CNN can be trained to recognize a winked eye or a wrinkled nose.
[0067] Emoji-generation module 510 can generate an emoji based on the outputs from head-position-detection module 504 and facial-expression-detection module 506. In some embodiments, emoji-generation module 510 can retrieve or map an emoji stored in emoji database 508 based on the detected head position and facial expression. In some embodiments, emoji-generation module 510 can modify the retrieved or mapped emoji based on the detected head position. Emoji-display module 512 can display the emoji generated by emoji-generation module 510. Displaying emojis that correspond to the users' facial expressions gives users participating in online communications the ability to "see" the facial expression of their communication partners, making the online communication experience more similar to face-to-face communication.
[0068] In some embodiments, the system can display an animated emoji that changes its shape dynamically according to the real time changing facial expression of the user. To do so,
the facial-expression-detection module can detect, based on an input video feed or stream, a sequence of emotion classes or facial expressions. For example, a user's facial expression may change gradually from a neutral expression to a smile, and then to a laugh, and the facial-expression-detection module can output an emotion class sequence reflecting such a change. Depending on the image-sampling rate, the facial-expression-detection module can output the emotion class at different rates. To ensure that the animated emoji accurately reflects the user's facial expression, in some embodiments, the image-sampling rate is at least 50%.
[0069] Once the sequence of emotion classes or facial expressions has been generated, a sequence of emojis can be generated. Note that, because the emotion classes are discrete, the emoji sequence can be discrete. For example, the emotion class for a certain frame can be neutral, and the emotion class for the next frame can be happy. This means that the two consecutive emojis corresponding to these two frames will be a neutral-face (or dull-face) emoji followed by a happy-face emoji. As one can see, displaying these emojis directly as discrete images cannot achieve the desired animation effect, where the emoji can mimic a human face with changing facial expressions. To achieve the desired animation effect, in some embodiments, a morph-target animation technique is used to animate the emojis based on detected facial expressions.
[0070] Morph-target animation (also referred to as shape interpolation) is a 3D computer animation technique, which can be used to animate a mesh from one state to another. A mesh (e.g., a model of a face) can be defined by the positions of all of its vertices. One mesh can depict a state of an object in a 3D space in one form and another mesh can depict the same object in the same 3D space in another form (e.g., different expressions of the same face). Stated differently, a mesh can have different states representing different representations of an object in 3D space.
[0071] FIG. 6A shows an exemplary 3D object in a first state and FIG. 6B shows the same object in another state. Morph-target animation can be used to efficiently animate the change of states from what is shown in FIG. 6A to what is shown in FIG. 6B. Here, FIG. 6B is known as the morph-target. In the process of animating the transition from FIG. 6A to FIG. 6B, for each frame of the animation, the position of each mesh vertex is calculated using
interpolation. Interpolation allows for creation of a smooth animation between the two states.
[0072] FIGs. 7A-7F show an exemplary morph-target animation, according to one embodiment. In FIGs. 7A-7F, a user's facial expressions are mapped to an emoji drawing. In FIGs. 7A-7F, elements 701A-701F show the result of head-position-detection module 504 (shown in FIG. 5), elements 702A-702F show one exemplary layer of the CNN that is used by the
facial-expression-detection module 506 (shown in FIG. 5), and elements 703A-703F show the result of the emoji-generation module 510 (shown in FIG. 5).
[0073] More specifically, FIG. 7A shows a neutral facial expression and a corresponding neutral emoji; FIG. 7B shows a smiling face and a corresponding smiling emoji; FIG. 7C shows a laughing face and a corresponding laughing emoji; FIG. 7D shows a laughing emoji with tearing eyes; FIG. 7E shows an emoji with a shocked face; FIG. 7F shows an emoji with a tongue out. Morph-target animation can be used to transition from the emoji shown in FIG. 7A to the emoji shown in FIGs. 7B, 7C, 7D, 7E and 7F. Depending on the detected emotion state or facial expression, the system can decide which morph target represents the end state of an animation.
[0074] In one example, if the facial-expression-detection has detected that the user's face changes from neutral to laugh (from FIG. 7A to FIG. 7C), the emoji shown in FIG. 7C will be the end state of the animation. Morph-target animation works by modifying the influence or weight of each of the two states (the starting state, which is FIG. 7A, and the end state, which is FIG. 7C). At the beginning, the influence or weight of the emoji shown in FIG. 7A is 100% and the influence or weight of the emoji shown in FIG. 7C is 0%. As time goes by, the influence of the emoji in FIG. 7A decreases and the influence of the emoji in FIG. 7C increases. Also, as we transition from the start state to the end state, we can animate the various parts of the emoji's face to show the state of the transition. In other words, the emoji in FIG. 7B can be designated as 50% of the transition from the start state of FIG. 7A to the end state of FIG. 7C. Any transition in between can then be a blend of the discrete states to present a smooth transition. In this example, as we transition from the emoji in FIG. 7A to the one in FIG. 7C, we can designate the emoji shown in FIG. 7B as the middle state. Moreover, we can blend the emojis shown in FIG. 7A and FIG. 7B to show any state in between.
[0075] The morph-target animation system works by looping through the set of mesh vertices. If there are only two morph targets, then for each vertex the following equation is calculated: vertex(i) = w_1 · vertex_target1(i) + w_2 · vertex_target2(i). Note that the sum of the weights has to be 1 (or 100%); therefore, the weight in this equation for each morph target can also be thought of as a percentage of that morph target.
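A direct transcription of this interpolation into Python might look like the following sketch; the tiny two-vertex meshes are placeholders for real emoji meshes.

```python
def blend_vertices(target_1, target_2, w1):
    """For two morph targets, compute vertex(i) = w1*target_1[i] + w2*target_2[i],
    with w1 + w2 = 1, for every vertex of the mesh."""
    w2 = 1.0 - w1
    return [tuple(w1 * a + w2 * b for a, b in zip(v1, v2))
            for v1, v2 in zip(target_1, target_2)]

# Animating from a neutral emoji mesh toward a laughing emoji mesh over five frames:
neutral_mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
laugh_mesh   = [(0.0, 0.2, 0.0), (1.0, 0.3, 0.1)]
for step in range(5):
    weight_of_neutral = 1.0 - step / 4   # start-state influence decays from 100% to 0%
    frame_mesh = blend_vertices(neutral_mesh, laugh_mesh, weight_of_neutral)
```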
[0076] In some embodiments, tracking these weights can also be used to detect facial expressions. More specifically, instead of using CNN for detecting facial expressions, one can use a face tracker (e.g., an ASM module) to track the positions of facial landmarks and map the facial landmark positions to emojis. For example, in FIGs. 7A-7C, the landmark points shown in the left drawings can be mapped to an emoji shown in the right drawings.
[0077] In many applications, the face tracker tracks the landmark points in real time, making it impossible to know in advance what the end state will be (i.e., the final morph target is unknown). FIGs. 8A-8D illustrate different emoji states. FIG. 8A shows a starting state, which is a neutral state. A person might instantaneously transform their face from the neutral state shown in FIG. 8A to any one of the emotions depicted in FIGs. 8B-8D. Real-time changes in the positions of the facial landmark points can result in changes of the weight factors in the morph data. In this scenario, the morph data (e.g., the vertices and the weights) can represent not only the distance to an end state (or a morph target) but also a probability for each possible end state. This use of morph data as a probabilistic indicator of the end state (which can indicate an emotion state or facial expression) can be used to understand the emotional gestures of a face. In practice, the face tracker can start outputting changed coordinates for various points of the face in response to moving facial muscles, and the morph data can represent the probability of each end state at any given time. For example, the facial landmark points can change such that the morph data indicates a 70% shock face (as shown in FIG. 8C) and a 30% smile (as shown in FIG. 8D). This can happen if the user starts to smile but suddenly opens his mouth. In other words, at any given time, the landmark points can be mapped to an emoji, which can be a blend of various end states that illustrates the current state of the landmark-point coordinates coming from the face tracker.
[0078] As noted above, in situations where changes in a user's facial expression are tracked, more than one morph target will be used, and an optimization algorithm may be used to determine the proper weight of each morph target. For example, the optimization algorithm may determine that the proper morph data includes a 0.7 "smile" morph target and a 0.3 "shock" morph target. Hence, by tracking the weights, one can know at any given time the percentages of the various end states (such as smile, shock, etc.), which also means that one can know, at any given point, the probability of each type of emotion associated with the facial expression. This can be used as an output to gain a better understanding of facial emotions in real time.
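As one possible way to realize the optimization step described above (not prescribed by the original text), the morph-target weights can be fit to the tracked landmarks with a constrained least-squares solve; the use of SciPy and the simplex constraint below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def estimate_morph_weights(tracked, targets):
    """Estimate per-target weights from tracked facial landmarks.

    tracked: (L, 2) array of current landmark coordinates from the face tracker.
    targets: dict mapping an expression name (e.g. 'smile', 'shock') to an
             (L, 2) array of landmark coordinates for that morph target.
    Returns a dict of weights that sum to 1 and can be read as probabilities.
    """
    names = list(targets)
    stacked = np.stack([targets[n].ravel() for n in names])  # (K, 2L)
    flat = tracked.ravel()

    def residual(w):
        return np.sum((w @ stacked - flat) ** 2)

    k = len(names)
    w0 = np.full(k, 1.0 / k)                                  # start from equal weights
    cons = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)  # weights sum to 1
    bounds = [(0.0, 1.0)] * k                                  # each weight in [0, 1]
    res = minimize(residual, w0, bounds=bounds, constraints=cons)
    return dict(zip(names, res.x))
```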
[0079] It can be appreciated that the combination of a face tracker that follows the movement of the head, a CNN that detects the facial expression, and morph-target animation that provides smooth transitions between emojis for each detected expression results in a 3D emoji that not only changes in real time based on the user's facial expression but, as shown in FIGs. 7D-7F, does so at any angle at which the user's face is located in 3D space.
[0080] FIG. 9 presents a flowchart illustrating an exemplary process for displaying a dynamic emoji based on a user's facial expression, according to one embodiment. During operation, the system receives a video that captures a user's face (operation 902). The video can be from a live video feed (e.g., from a camera of a smartphone) or a pre-recorded video clip. The system can sample the video at a predetermined rate to extract frames from the video (operation
904). In some embodiments, the video can include at least 30 frames per second, and the sample rate can be between 50% and 100%. The system can then perform face-detection operations to detect faces included in each extracted frame (operation 906). The system can then detect the position/orientation of the head (operation 908) and at the same time detect the facial expression (operation 910).
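The following sketch illustrates operations 902-906 under the assumption that OpenCV is used for video capture and a Viola-Jones-style Haar cascade is used for face detection; the 50% sampling rate is only an example from the range mentioned above.

```python
import cv2

def sample_frames(video_source=0, sample_rate=0.5):
    """Yield sampled frames and detected face boxes from a video feed.

    sample_rate: fraction of frames to keep (0.5 keeps every other frame),
    matching the 50%-100% range mentioned above.
    """
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    cap = cv2.VideoCapture(video_source)      # live camera or a recorded clip
    keep_every = max(1, round(1.0 / sample_rate))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % keep_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, 1.1, 5)  # (x, y, w, h) boxes
            yield frame, faces
        index += 1
    cap.release()
```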
[0081] In some embodiments, detecting the head position/orientation can involve modeling the face (e.g., using an ASM algorithm) to determine facial landmark points. For example, by tracking facial landmark points from frame to frame, the system can determine the movement of the head. As the user's head position/orientation changes from frame to frame, the detected head position/orientation changes accordingly.
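One common way to derive a head orientation from tracked 2D landmarks is a perspective-n-point fit against a generic 3D face model; the original text specifies only that a shape model such as ASM provides the landmark points, so the use of OpenCV's solvePnP and the six-point reference model below are illustrative assumptions.

```python
import numpy as np
import cv2

# Generic 3D reference points (nose tip, chin, eye corners, mouth corners)
# in an arbitrary model coordinate system; the values are illustrative.
MODEL_POINTS = np.array([
    [0.0, 0.0, 0.0],          # nose tip
    [0.0, -330.0, -65.0],     # chin
    [-225.0, 170.0, -135.0],  # left eye outer corner
    [225.0, 170.0, -135.0],   # right eye outer corner
    [-150.0, -150.0, -125.0], # left mouth corner
    [150.0, -150.0, -125.0],  # right mouth corner
], dtype=np.float64)

def head_pose(image_points, frame_width, frame_height):
    """Estimate head rotation and translation from six tracked 2D landmarks."""
    focal = frame_width  # rough approximation of the focal length
    camera_matrix = np.array([[focal, 0, frame_width / 2],
                              [0, focal, frame_height / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points, dtype=np.float64),
                                  camera_matrix, dist_coeffs)
    rotation, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix of the head
    return rotation, tvec
```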
[0082] In some embodiments, detecting the facial expression can involve a machine- learning technique, such as using a trained CNN as a classifier. More specifically, the CNN can receive, as input, pixels from the entire face (similar to examples shown in elements 702A-702F in FIGs. 7A-7F, respectively) to generate an output, which can indicate the probabilities of each discrete emotion class, e.g., neutral, happy, sad, angry, fearful, surprised, disgusted, etc.
Moreover, the CNN may also generate an output indicating the facial-expression class based on position or status of certain facial features, e.g., kissing, tongue out, eye wink, nose wrinkle, eyebrow raise, etc. In further embodiments, the CNN can directly take, as input, the entire image of a video frame. In other words, instead of relying on a face-detection module to detect a face within an image, the CNN can detect the face and the facial expression from a raw image directly. Because the CNN is receiving a sequence of frames as input, the CNN output can also include a sequence of detected facial expressions.
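A minimal sketch of the kind of classifier described above is given below in PyTorch; the 48x48 grayscale input, the layer sizes, and the seven emotion classes are illustrative assumptions rather than details of the disclosed system.

```python
import torch
import torch.nn as nn

EMOTIONS = ['neutral', 'happy', 'sad', 'angry', 'fearful', 'surprised', 'disgusted']

class ExpressionCNN(nn.Module):
    """Small CNN mapping a 48x48 grayscale face crop to emotion probabilities."""
    def __init__(self, num_classes=len(EMOTIONS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 12 -> 6
        )
        self.classifier = nn.Linear(128 * 6 * 6, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return torch.softmax(self.classifier(x), dim=1)  # per-class probabilities

# face_crop: a (1, 1, 48, 48) tensor holding one normalized face image
face_crop = torch.rand(1, 1, 48, 48)
probs = ExpressionCNN()(face_crop)
print(dict(zip(EMOTIONS, probs[0].tolist())))
```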
[0083] Subsequently, an animated emoji can be displayed based on the detected head position/orientation sequence and the detected facial expression sequence (operation 912). In some embodiments, a morph-target animation technique can be used to display the animated emoji. More specifically, the head positions/orientations and facial expressions of adjacent sampled frames can be treated as the starting and ending emoji states. The system performs interpolation on the vertices between the two states to achieve a smooth transition. As a result, the displayed emoji can mimic the head and facial movements of the user. For example, if the user is frowning while shaking his head, the displayed animated emoji will show a head-turning frowning face. Similarly, if the user is making a happy face while nodding his head, the animated emoji will be a head-nodding happy face. The head movements, including range and frequency, of the displayed emoji will follow the head movements of the user. In other words, if the user's head is moving fast, the displayed emoji will move fast; and if the user's head is moving slowly, the displayed emoji will move slowly. Note that the orientation of the emojis in 3D space (elements 703A-703F) shown in FIGs. 7A-7F corresponds to the orientation of the facial landmark points (elements 701A-701F) shown in FIGs. 7A-7F.
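One simple way to realize operation 912 (a sketch under assumed data structures, not the disclosed implementation) is to treat the head angles and expression weights of adjacent sampled frames as keyframes and interpolate between them at display rate; the linear blending of Euler angles below is a deliberate simplification.

```python
import numpy as np

def interpolate_states(start, end, steps):
    """Generate intermediate emoji states between two sampled frames.

    Each state is a dict with 'angles' (yaw, pitch, roll in degrees) and
    'weights' (per-expression morph-target weights that sum to 1).
    """
    for s in range(1, steps + 1):
        t = s / steps
        angles = (1 - t) * np.asarray(start['angles']) + t * np.asarray(end['angles'])
        weights = {k: (1 - t) * start['weights'].get(k, 0.0)
                      + t * end['weights'].get(k, 0.0)
                   for k in set(start['weights']) | set(end['weights'])}
        yield {'angles': angles, 'weights': weights}

# Example: the user turns the head slightly while going from neutral to happy
prev = {'angles': (0, 0, 0), 'weights': {'neutral': 1.0}}
curr = {'angles': (15, 5, 0), 'weights': {'happy': 1.0}}
for state in interpolate_states(prev, curr, steps=5):
    pass  # render the emoji mesh for this blended state
```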
[0084] In some embodiments, the animated emoji can be displayed on the user device recording the face to indicate to the user the facial-expression-detection result. In further embodiments, the animated emoji can be displayed on a remote device associated with a party communicating with the user to provide visual cues for the communication. For example, when a local user is chatting online with a remote user via a messaging service, the remote user can see not only text sent by the local user but also an animated emoji indicating in real time the emotional status or facial expression of the local user. This can greatly enhance the experience of a text-based communication.
[0085] FIG. 10 illustrates a system for generating animated emojis based on a user's facial expression, according to one embodiment. Dynamic-emoji-generation system 1000 can include frame-sampling module 1002, optional face-detection module 1004, head-position-detection module 1006, facial-expression-detection module 1008, emoji database 1010, emoji-animation module 1012, and emoji-display module 1014. In some embodiments, dynamic-emoji-generation system 1000 can be part of a mobile computing device (e.g., a smartphone) implemented using GPUs.
[0086] Frame-sampling module 1002 can be responsible for extracting frames from a received video at a predetermined sampling rate. In some embodiments, the sampling rate can be between 50% and 100%. Face-detection module 1004 can be responsible for detecting faces within the extracted frames. In some embodiments, face-detection module 1004 can use the Viola-Jones algorithm for face detection. Face-detection module 1004 can be optional. In some embodiments, the extracted frames can be sent directly to head-position-detection module 1006 and facial-expression-detection module 1008.
[0087] Head-position-detection module 1006 can be responsible for detecting and tracking the position/orientation of the head. In some embodiments, such detection and tracking can be done by identifying and tracking facial landmark points. More specifically, a 3D face-modeling technique (e.g., the ASM algorithm) can be used for identifying and tracking those facial landmark points. Head-position-detection module 1006 can determine the position/orientation of the head based on relative positions of the facial landmark points. Head-position-detection module 1006 may use the output of face-detection module 1004 as input. Alternatively, head-position-detection module 1006 can use the entire raw frame as input.
[0088] Facial-expression-detection module 1008 can be responsible for detecting facial expressions and the emotional state of the face. In some embodiments, artificial intelligence (e.g., a trained CNN) can be used to recognize a number of facial expressions and the emotional
status of the face based on the video. In further embodiments, the CNN can take each sampled frame as input, and generate an output, which can indicate the probabilities of each discrete emotion class, e.g., neutral, happy, sad, angry, fearful, surprised, disgusted, etc., for each frame. Moreover, the CNN may also generate an output indicating the position or status of certain facial features, e.g., kissing, tongue out, eye wink, nose wrinkle, eyebrow raise, etc. Facial-expression-detection module 1008 may use the output of face-detection module 1004 as input. Alternatively, facial-expression-detection module 1008 can use pixels within the entire raw frame as input. In some embodiments, head-position-detection module 1006 and facial-expression-detection module 1008 can work concurrently.
[0089] Emoji database 1010 can store a plurality of emojis, each of which can be designed to represent a certain type of human emotion, such as neutral, happy, sad, surprised, fearful, angry, disgusted, etc. Certain emojis can also be used to mimic certain movements of facial features, such as kissing, tongue out, eye wink, nose wrinkle, eyebrow raise, etc.
Moreover, emoji database 1010 may also store 3D emojis corresponding to different head positions/orientations. For example, emoji database 1010 may store 3D happy faces with different tilting angles.
[0090] Emoji-animation module 1012 can be responsible for performing animations of 3D emojis based on the detected head position/orientation and facial expression. In some embodiments, emoji-animation module 1012 can apply a morph-target animation technique. More specifically, emoji-animation module 1012 can select an emoji from emoji database 1010 based on the detected head position/orientation and facial expression of a frame, treat the selected emoji as a morph target, and use interpolation between the morph target and an emoji selected for a previous frame to achieve a smooth animation effect. Emoji-display module 1014 can be responsible for displaying the animated emoji. Emoji-display module 1014 may be the screen of a mobile device, for example.
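The following high-level sketch shows one way the modules of dynamic-emoji-generation system 1000 could pass data to one another; the class and method names are placeholders chosen for illustration and are not part of the original description.

```python
class DynamicEmojiPipeline:
    """Wires together the modules of FIG. 10 for each sampled frame."""

    def __init__(self, frame_sampler, face_detector, head_tracker,
                 expression_detector, emoji_db, animator, display):
        self.frame_sampler = frame_sampler              # module 1002
        self.face_detector = face_detector              # module 1004 (optional)
        self.head_tracker = head_tracker                # module 1006
        self.expression_detector = expression_detector  # module 1008
        self.emoji_db = emoji_db                        # database 1010
        self.animator = animator                        # module 1012
        self.display = display                          # module 1014

    def run(self, video):
        previous_emoji = None
        for frame in self.frame_sampler.sample(video):
            face = self.face_detector.detect(frame) if self.face_detector else frame
            pose = self.head_tracker.estimate(face)           # head position/orientation
            expression = self.expression_detector.classify(face)
            emoji = self.emoji_db.lookup(expression, pose)    # selected morph target
            animation = self.animator.morph(previous_emoji, emoji)
            self.display.show(animation)
            previous_emoji = emoji
```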
[0091] In the examples shown in FIGs. 7A-7F and 8A-8D, the emojis are drawn as humanoid faces with facial features such as eyes and a mouth. Although not shown, additional facial features, such as eyebrows, eyelashes, a nose, or a tongue, can also be included in an emoji. Moreover, other types of emojis that can express emotions or human facial expressions can also be used, such as avatars, cartoonish animals, or objects with human-like faces. The scope of the instant application is not limited by the design of the emoji, as long as the emoji can be used to indicate the user's emotional state or facial expression.
[0092] FIG. 11 illustrates a system for generating real time animated avatars based on a user's facial expression, according to one embodiment. System 1100 can include a camera 1101, a frame- sampling module 1102, a face-detection module 1104, a face tracker 1106, a facial-
expression-detection module 1108, an avatar database 1110, an avatar animation module 1112, and a display module 1114.
[0093] Camera 1101 captures images (e.g., video images) that include a user's face.
Frame-sampling module 1102 samples frames from the video feed. Face-detection module 1104 detects faces in the sampled images, and face tracker 1106 tracks the face from frame to frame and detects the user's head position based on the face-tracking result. Facial-expression-detection module 1108 can include a number of CNN modules (e.g., CNN modules 1108A-1108F). Avatar database 1110 stores avatar images with different facial expressions. Avatar animation module 1112 can generate an animated avatar based on the outputs of face tracker 1106 and facial-expression-detection module 1108. Display module 1114 displays the animated avatar.
[0094] As noted, detecting facial expressions based on a CNN requires obtaining a training set of images that are representative of the facial expressions to be detected. The accuracy of the output of a neural network is generally related to the size and quality of the samples used for training (i.e., the training data set). Furthermore, as the number of classifiers increases, the need for additional samples increases non-linearly and the difficulty of training the neural network also increases. In addition, given that one facial expression can be present at different angles (a smile while the face is down, a smile while the face is rotated to the left, etc.), the training set for each class of facial expression needs to include samples from all such conditions and orientations.
[0095] The system shown in FIG. 11 attempts to address this issue by utilizing several convolutional neural networks (CNNs), each optimized and trained to detect facial expressions at a particular orientation. As shown in FIG. 11, facial-expression-detection module 1108 includes several CNNs (1108A-1108F). Each of these CNNs can be trained specifically for facial expressions at a specific orientation. For example, CNN 1108A can be trained to detect facial expressions when the face is looking straight toward the camera, and CNN 1108B can be trained to detect facial expressions when the face is looking upward and to the right (similar to the orientation shown in element 703D in FIG. 7D). In this embodiment, face tracker 1106 can communicate the orientation of the head to facial-expression-detection module 1108. Then, based on the detected head orientation, the captured images are fed into the appropriate CNN for processing. This mechanism increases the accuracy of facial-expression-detection module 1108. In addition, the total number of samples needed to train the individual CNNs in facial-expression-detection module 1108 may be lower than the number of samples needed to train a single CNN that detects facial expressions at all orientations. As a result of the architecture of the system shown in FIG. 11, the accuracy of the facial-expression detection can be increased without requiring a large number of additional samples or increased difficulty in training the neural network.
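A sketch of the routing idea (with an assumed binning scheme and placeholder model objects) is shown below: the head orientation reported by the face tracker selects which orientation-specific CNN processes the face image.

```python
def orientation_bin(yaw_deg, pitch_deg):
    """Map continuous head angles to a coarse orientation label."""
    horizontal = 'left' if yaw_deg < -15 else 'right' if yaw_deg > 15 else 'center'
    vertical = 'down' if pitch_deg < -15 else 'up' if pitch_deg > 15 else 'level'
    return f'{vertical}_{horizontal}'

class OrientationAwareDetector:
    """Routes each face image to the CNN trained for its head orientation."""

    def __init__(self, models):
        # models: dict such as {'level_center': cnn_a, 'up_right': cnn_b, ...},
        # one trained network per orientation bin (CNNs 1108A-1108F in FIG. 11).
        self.models = models

    def classify(self, face_image, yaw_deg, pitch_deg):
        key = orientation_bin(yaw_deg, pitch_deg)
        model = self.models.get(key, self.models['level_center'])  # fallback bin
        return model(face_image)  # per-expression probabilities

# Toy usage with stub "models" that just return a fixed label
detector = OrientationAwareDetector({'level_center': lambda img: 'neutral',
                                     'up_right': lambda img: 'happy'})
print(detector.classify(face_image=None, yaw_deg=20, pitch_deg=20))  # routed to 'up_right'
```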
[0096] In one embodiment, the aforementioned techniques, systems, and architectures may be used by a droid or a robot to determine the facial expressions of humans. For example, referring to FIG. 11, a robot can use the above-noted techniques up to the point where facial-expression-detection module 1108 receives input from face tracker 1106 to determine the facial expression of a human within the robot's line of sight (assuming the robot is equipped with one or more cameras).

Sentiment Analytics Based on Facial Expression Detection
[0097] The ability to detect facial expressions and to infer user sentiment based on the facial expressions can have other applications. For example, it can provide analytics information on viewer sentiment after one or more users have viewed a piece of media content.
[0098] FIG. 12 illustrates an exemplary scenario for user sentiment analysis, according to one embodiment. In FIG. 12, viewer 1208 can use a computing device, such as smartphone
1202, to view media content 1203, such as a post, a picture, a video clip, or an audio clip, which can be posted via a social media application. While viewer 1208 is viewing content 1203, camera 1204 on smartphone 1202 can be activated to capture viewer 1208's facial expressions. A facial-expression-detection module installed on smartphone 1202 can then analyze the captured facial expression information and determine viewer 1208's sentiment while viewer 1208 is viewing content 1203. The viewer sentiment can be sent to the author of content 1203 to provide feedback.
[0099] In further embodiments, on the viewer device, a "wire frame" image (similar to the images shown on the far left side of FIGs. 7A-7F) of the detected viewer facial expression can be displayed such that the viewer can see what is being captured. This can be helpful because some viewers might object to their faces being recorded while viewing the content. On the other hand, capturing or recording a "wire frame" representation of the viewer's facial expression can be much more agreeable to most viewers. Furthermore, the system can display a dynamic "live" emoji that can change its look dynamically with the viewer's facial expression. For example, the system can display a "wire frame" representation of the viewer's face, and also display a dynamic emoji whose expression tracks the movement of the "wire frame" in real time. This way, the viewer can see what kind of feedback their facial expression is producing. Alternatively, the system can also display a morphed or distorted image of the user's face (similar to the images shown in the middle of FIGs. 7A-7F), which can reflect the user's facial expression without revealing the user's identity.
[00100] In some embodiments, the system can track a viewer's line of vision based on the relations among the facial landmark points. Note that determining the facial landmark points can be part of the facial-expression-detection operation. Based on the viewer's line of vision, the system can determine the viewer's sentiment accordingly. For example, if the system detects that the viewer's line of vision does not intersect the screen, the system can determine that the viewer's sentiment is "uninterested" or "does not care." Furthermore, the system can track the viewer's line of vision and analyze this information with respect to time duration. In one embodiment, the system can measure the amount of time the viewer's line of vision remains on the screen, and provide this information as part of the analytics information to the content producer. In other words, the system can provide literally the "eyeball time" a piece of content receives from a viewer.
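As a toy sketch of the "eyeball time" measurement (the original text does not prescribe a specific gaze-estimation method, so the gaze test is abstracted behind an assumed callback), viewing time can be accumulated only while the estimated line of vision intersects the screen:

```python
def measure_eyeball_time(frames, gaze_on_screen):
    """Accumulate the time during which the viewer's gaze stays on the screen.

    frames: iterable of (timestamp_in_seconds, facial_landmarks) pairs.
    gaze_on_screen: assumed helper that returns True when the landmarks imply
                    the line of vision intersects the screen.
    """
    total = 0.0
    previous_ts = None
    for ts, landmarks in frames:
        if previous_ts is not None and gaze_on_screen(landmarks):
            total += ts - previous_ts
        previous_ts = ts
    return total  # seconds of "eyeball time" reported to the content producer

# Toy usage with a stub gaze test that always reports the gaze on screen
print(measure_eyeball_time([(0.0, None), (0.5, None), (1.0, None)], lambda lm: True))  # 1.0
```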
[00101] In addition, the system can detect how the viewer's sentiment changes with different parts of the content. For example, the system can detect that a viewer's sentiment changes from "indifferent" to "amused" at a given point of a funny video clip, and provide this sentiment-change information (e.g., "viewer sentiment changes to "amused" at 0:42 of the clip") to the content producer.
[00102] Once the system determines viewer 1208's sentiment, the system can represent viewer 1208's sentiment with an image, such as an icon or emoji, in addition to, or in place of, text and data-based information. For example, when the system determines that viewer 1208 is amused, the system can represent that sentiment with smiley-face emoji 1206. When the system determines that viewer 1208 is annoyed or frustrated, the system can represent that sentiment with an angry-face emoji. When the system determines that viewer 1208 is indifferent or impatient, the system can represent that sentiment with a bored-face emoji. In general, the system can use one of many emojis to represent a viewer's sentiments.
[00103] In one embodiment, a number of viewer responses can be detected. The sentiments corresponding to these responses can be determined, represented in emojis, and aggregated to indicate the general-mass response to a piece of content. FIG. 13 A illustrates a scenario for aggregating and displaying sentiments of a large number of users, according to one embodiment. In FIG. 13A, smartphone 1302 sends a piece of content to a number of user devices 1304, 1306, 1308, and 1310. These user devices can display the content to their respective viewers. In response, the cameras on these user devices can capture their
corresponding viewers' facial expressions. A facial-expression-detection module installed on a user device can then determine the respective viewer's sentiments in response to viewing the content.
[00104] In one embodiment, information indicating viewer sentiments determined by the facial-expression-detection modules of the user devices can be transmitted to server 1312, which can in turn aggregate the collected viewer sentiment information and transmit the aggregated sentiment information to smartphone 1302. Subsequently, smartphone 1302 can display this aggregate sentiment or facial expression information using emojis. In the example in FIG. 13A, smartphone 1302 displays that, in response to a certain piece of content, there are 5,531 viewers who laughed, 7,176 viewers who smiled, 3,024 viewers who were indifferent, and 701 viewers who disliked the content.
[00105] Note that server 1312 can be optional. In one embodiment, the user devices can send information indicating viewer sentiment directly to smartphone 1302, which can perform the aggregation locally.
[00106] FIG. 13B shows exemplary viewer sentiment analytics data provided to a content producer, according to one embodiment. In this example, the display of a smartphone can provide several types of viewer sentiment analytics information to the content producer. For example, field 1320 can display a number of icons for content the producer has previously posted, and allows the producer to swipe through them to select a particular piece of content for which sentiment analytics is provided. Field 1322 can allow the user to select a time frame over which the sentiment analytics data is aggregated. Field 1324 can present the aggregated viewer sentiment analytics information. In one embodiment, as shown in this example, the aggregated viewer sentiment is displayed with various emoji icons, each representing a particular sentiment, with a corresponding number of viewers who expressed this sentiment next to the emoji icon. Field 1326 can show the average viewing time calculated from all viewers. In one embodiment, each viewer's viewing time is calculated based on the duration of time for which his line of vision intersects with the viewing device's screen.
[00107] The content producer can scroll down the screen to access more sentiment analytics data, as shown in the additional screenshots in FIG. 13B. Field 1328 can show the total number of views, total number of screenshots viewers have taken of the content, and total number of reposts or shares viewers have generated based on this content. Field 1330 can show when the latest view has occurred and the latest viewer sentiment. Field 1332 can show avatars of the latest individual viewers who have viewed the content.
[00108] Field 1334 can show the number of viewers during each hour of the day, and allow the user to select which day to display by selecting a day from a menu. Field 1336 can show the gender distribution of all the viewers. Note that it is possible for the facial-expression-detection module to determine a viewer's gender based on analysis of the facial features.
Furthermore, field 1338 can show the locations of viewers, both in a text format and a highlighted-map format.
[00109] FIG. 14 presents a flowchart illustrating an exemplary process of aggregating viewer sentiment analytics, according to one embodiment. During operation, a source device (which can be a smartphone or other types of computing device) can send a piece of content to multiple viewer devices (operation 1402). Then, a respective viewer device plays the content to a viewer and detects the viewer's facial expression (operation 1404). Based on the detected viewer expression, the viewer device can determine the viewer's sentiment. Subsequently, the viewer device can transmit the viewer sentiment information to a server (operation 1406).
[00110] In response, the server can aggregate the general-mass viewer sentiment information received from multiple viewer devices (operation 1408), and transmit the aggregated viewer sentiment information to the source device (operation 1410). The source device can then display the aggregated viewer sentiment information using different emojis (operation 1412). For example, the source device can display different emojis (e.g., happy emoji, sad emoji, etc.) indicating different user sentiments and display the statistics next to each emoji.
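A minimal sketch of the aggregation and display steps (operations 1408-1412) is shown below; the sentiment labels and the sentiment-to-emoji mapping are assumptions used only to make the example concrete.

```python
from collections import Counter

SENTIMENT_EMOJI = {'laughed': '😂', 'smiled': '🙂', 'indifferent': '😐', 'disliked': '👎'}

def aggregate_sentiments(reports):
    """Aggregate per-viewer sentiment labels into display-ready counts.

    reports: iterable of sentiment strings, one per viewer device,
    e.g. ['laughed', 'smiled', 'laughed', ...].
    """
    counts = Counter(reports)
    return [(SENTIMENT_EMOJI.get(s, '?'), s, n) for s, n in counts.most_common()]

# Example matching FIG. 13A's display of aggregated viewer reactions
summary = aggregate_sentiments(['laughed'] * 3 + ['smiled'] * 2 + ['disliked'])
for emoji, sentiment, count in summary:
    print(f'{emoji} {sentiment}: {count}')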
[00111] Several additional exemplary embodiments are described below:
[00112] One embodiment includes a non-transitory computer-readable storage medium storing instructions that when executed by a computing device cause the computing device to perform a method for dynamically generating and displaying emojis, the method comprising:
receiving, by a computer, a sequence of images capturing a user's head and face;
for each image,
determining the user's head position in the image;
applying a machine-learning technique on the image to determine a facial-expression class associated with the user's face, wherein the
machine-learning technique comprises a convolutional neural network
(CNN); and
generating an emoji image based on the user's head position
and the determined facial-expression class;
applying a morph-target animation technique to a sequence of
emoji images generated for the sequence of images to generate an
animated emoji; and
displaying the animated emoji.
[00113] In a variation of this embodiment, determining the user's head position comprises determining facial landmark points in the image using a shape modeling algorithm.
[00114] In a variation of this embodiment, applying the machine-learning technique further comprises using multiple specialized convolutional neural networks, wherein a respective specialized convolutional neural network is configured to generate a facial-expression-detection output based on a portion of the user's face.
[00115] In a further variation, the method further comprises:
aggregating facial-expression-detection outputs from the multiple specialized
convolutional neural networks, wherein aggregating the facial-expression-detection outputs comprises applying weights to the facial-expression-detection outputs from the multiple specialized convolutional neural networks.
[00116] In a variation of this embodiment, generating the emoji image comprises:
mapping the determined facial-expression class to a pre-stored emoji image; and modifying the pre-stored emoji image based on the user's head position.
[00117] In a variation of this embodiment, receiving the sequence of images comprises: receiving a live video feed from a camera associated with the computer; and
sampling frames from the received live video feed.
Computer and Communication System
[00118] FIG. 15 illustrates an exemplary computer and communication system for displaying dynamic emojis, according to one embodiment. In FIG. 15, system 1500 includes a processor 1510, a memory 1520, and a storage 1530. Storage 1530 typically stores instructions that can be loaded into memory 1520 and executed by processor 1510 to perform the methods mentioned above. As a result, system 1500 can perform the functions described above.
[00119] In one embodiment, the instructions in storage 1530 can implement a face-detection module 1532, a head-position-detection module 1534, a facial-expression-detection module 1536, and an emoji-animation module 1538, all of which can be in communication with each other through various means.
[00120] Face-detection module 1532 can detect a face from an input image or an input video frame. Head-position-detection module 1534 detects the position/orientation of the head associated with the face. Facial-expression-detection module 1536 determines an emotion class or certain facial expressions based on the detected face. Emoji-animation module 1538 generates an animated emoji based on the detected head position/orientation and the facial expressions.
[00121] In some embodiments, modules 1532, 1534, 1536, and 1538 can be partially or entirely implemented in hardware and can be part of processor 1510. Further, in some embodiments, the system may not include a separate processor and memory. Instead, in addition
to performing their specific tasks, modules 1532, 1534, 1536, and 1538, either separately or in concert, may be part of general- or special-purpose computation engines.
[00122] System 1500 can be coupled to an optional camera 1550 and a display-and-input module 1540 (e.g., a touchscreen display module), which can further include display 1580, keyboard 1560, and pointing device 1570. System 1500 can also be coupled via one or more network interfaces to network 1582.
[00123] The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed.
[00124] The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium.
[00125] Furthermore, methods and processes described herein can be included in hardware modules or apparatus. These modules or apparatus may include, but are not limited to, an application-specific integrated circuit (ASIC) chip, a field-programmable gate array (FPGA), a dedicated or shared processor that executes a particular software module or a piece of code at a particular time, and/or other programmable-logic devices now known or later developed. When the hardware modules or apparatus are activated, they perform the methods and processes included within them.
Claims
1. A computer-executable method for dynamically generating and displaying emojis, comprising:
receiving, by a computer, a sequence of images capturing a user's head and face;
for each image,
determining the user's head position in the image;
applying a machine-learning technique on the image to
determine a facial-expression class associated with the user's face,
wherein the machine-learning technique comprises a convolutional neural
network (CNN); and
generating an emoji image based on the user's head position and
the determined facial-expression class;
applying a morph-target animation technique to a sequence of emoji images generated for the sequence of images to generate an animated emoji; and
displaying the animated emoji;
wherein the animated emoji mimics the facial-expression of the user at a specific angle where the user's face is located.
2. The method of claim 1, wherein determining the user's head position comprises determining facial landmark points in the image using a shape modeling algorithm.
3. The method of claim 1, wherein applying the machine-learning technique further comprises using multiple specialized convolutional neural networks, wherein a respective specialized convolutional neural network is configured to generate a facial-expression-detection output based on a portion of the user's face.
4. The method of claim 3, further comprising:
aggregating facial-expression-detection outputs from the multiple specialized
convolutional neural networks, wherein aggregating the facial-expression-detection outputs comprises applying weights to the facial-expression-detection outputs from the multiple specialized convolutional neural networks.
5. The method of claim 1, wherein the determined facial-expression class is selected from a group consisting of:
neutral;
happy;
sad;
fearful;
angry;
surprised;
disgusted;
tongue out;
kiss;
wink;
eyebrow raise; and
nose wrinkle.
6. The method of claim 1, wherein generating the emoji image comprises:
mapping the determined facial-expression class to a pre-stored emoji image; and modifying the pre-stored emoji image based on the user's head position.
7. The method of claim 1, wherein receiving the sequence of images comprises: receiving a live video feed from a camera associated with the computer; and
sampling frames from the received live video feed.
8. A computer system, comprising:
a processor; and
a memory storing instructions that when executed by the processor cause the computer system to perform a method for dynamically generating and displaying emojis, the method comprising:
receiving a sequence of images capturing a user's head and face; for each image,
determining the user's head position in the image;
applying a machine-learning technique on the image to
determine a facial-expression class associated with the user's face, wherein the machine-learning technique comprises a
convolutional neural network (CNN); and
generating an emoji image based on the user's head position
and the determined facial-expression class;
applying a morph-target animation technique to a sequence of emoji images generated for the sequence of images to generate an
animated emoji; and
displaying the animated emoji.
9. The computer system of claim 8, wherein determining the user's head position comprises determining facial landmark points in the image using a shape modeling algorithm.
10. The computer system of claim 8, wherein applying the machine-learning technique further comprises using multiple specialized convolutional neural networks, wherein a respective specialized convolutional neural network is configured to generate a facial-expression- detection output based on a portion of the user's face.
11. The computer system of claim 10, wherein the method further comprises:
aggregating facial-expression-detection outputs from the multiple specialized
convolutional neural networks, wherein aggregating the facial-expression-detection outputs comprises applying weights to the facial-expression-detection outputs from the multiple specialized convolutional neural networks.
12. The computer system of claim 8, wherein the determined facial-expression class is selected from a group consisting of:
neutral;
happy;
sad;
fearful;
angry;
surprised;
disgusted;
tongue out;
kiss;
wink;
eyebrow raise; and
nose wrinkle.
13. The computer system of claim 8, wherein generating the emoji image comprises: mapping the determined facial-expression class to a pre-stored emoji image; and modifying the pre-stored emoji image based on the user's head position.
14. The computer system of claim 8, wherein receiving the sequence of images comprises:
receiving a live video feed from a camera associated with the computer system; and sampling frames from the received live video feed.
15. An electronic mobile device, comprising:
a digital camera;
a display;
a processor; and
a memory storing instructions that when executed by the processor cause the electronic mobile device to perform a method for dynamically generating and displaying a three-dimensional avatar, the method comprising:
displaying a real-time avatar, wherein the avatar is capable of tracking movement of a user's head and displaying the facial expression of the user on the display while the user's head is positioned in front of the digital camera.
16. The electronic mobile device of claim 15, wherein the avatar is an emoji.
17. The electronic mobile device of claim 15, wherein the avatar is a three-dimensional shaped avatar displayed on the screen.
18. The electronic mobile device of claim 15, wherein the electronic mobile device has a first side and a second side, and wherein the display and the digital camera are on the first side.
19. The electronic mobile device of claim 15, wherein the display is a touch screen.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762441957P | 2017-01-03 | 2017-01-03 | |
US62/441,957 | 2017-01-03 | ||
US201762441978P | 2017-01-04 | 2017-01-04 | |
US62/441,978 | 2017-01-04 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018128996A1 true WO2018128996A1 (en) | 2018-07-12 |
Family
ID=62790929
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2018/012105 WO2018128996A1 (en) | 2017-01-03 | 2018-01-02 | System and method for facilitating dynamic avatar based on real-time facial expression detection |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2018128996A1 (en) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109166164A (en) * | 2018-07-25 | 2019-01-08 | 维沃移动通信有限公司 | A kind of generation method and terminal of expression picture |
US10210648B2 (en) * | 2017-05-16 | 2019-02-19 | Apple Inc. | Emojicon puppeting |
CN110136231A (en) * | 2019-05-17 | 2019-08-16 | 网易(杭州)网络有限公司 | Expression implementation method, device and the storage medium of virtual role |
CN111464828A (en) * | 2020-05-14 | 2020-07-28 | 广州酷狗计算机科技有限公司 | Virtual special effect display method, device, terminal and storage medium |
CN111798551A (en) * | 2020-07-20 | 2020-10-20 | 网易(杭州)网络有限公司 | Virtual expression generation method and device |
US10839201B2 (en) | 2018-11-09 | 2020-11-17 | Akili Interactive Labs, Inc. | Facial expression detection for screening and treatment of affective disorders |
WO2020263261A1 (en) * | 2019-06-27 | 2020-12-30 | Hewlett-Packard Development Company, L.P. | Two image facial action detection |
CN112232116A (en) * | 2020-09-08 | 2021-01-15 | 深圳微步信息股份有限公司 | A facial expression recognition method, device and storage medium |
CN112684881A (en) * | 2019-10-17 | 2021-04-20 | 未来市股份有限公司 | Avatar facial expression generation system and avatar facial expression generation method |
CN112686232A (en) * | 2021-03-18 | 2021-04-20 | 平安科技(深圳)有限公司 | Teaching evaluation method and device based on micro expression recognition, electronic equipment and medium |
US11074753B2 (en) * | 2019-06-02 | 2021-07-27 | Apple Inc. | Multi-pass object rendering using a three- dimensional geometric constraint |
CN113223121A (en) * | 2021-04-30 | 2021-08-06 | 北京达佳互联信息技术有限公司 | Video generation method and device, electronic equipment and storage medium |
US11087520B2 (en) | 2018-09-19 | 2021-08-10 | XRSpace CO., LTD. | Avatar facial expression generating system and method of avatar facial expression generation for facial model |
EP3872694A1 (en) * | 2020-02-27 | 2021-09-01 | XRSpace CO., LTD. | Avatar facial expression generating system and method of avatar facial expression generation |
US11127181B2 (en) | 2018-09-19 | 2021-09-21 | XRSpace CO., LTD. | Avatar facial expression generating system and method of avatar facial expression generation |
CN113436299A (en) * | 2021-07-26 | 2021-09-24 | 网易(杭州)网络有限公司 | Animation generation method, animation generation device, storage medium and electronic equipment |
CN113689527A (en) * | 2020-05-15 | 2021-11-23 | 武汉Tcl集团工业研究院有限公司 | Training method of face conversion model and face image conversion method |
CN114550235A (en) * | 2022-01-17 | 2022-05-27 | 合肥的卢深视科技有限公司 | Attitude angle detection method, system, electronic device and storage medium |
WO2023081138A1 (en) * | 2021-11-08 | 2023-05-11 | Nvidia Corporation | Estimating facial expressions using facial landmarks |
US11657558B2 (en) | 2021-09-16 | 2023-05-23 | International Business Machines Corporation | Context-based personalized communication presentation |
US11816795B2 (en) | 2018-12-20 | 2023-11-14 | Sony Group Corporation | Photo-video based spatial-temporal volumetric capture system for dynamic 4D human face and body digitization |
CN119478161A (en) * | 2024-12-06 | 2025-02-18 | 广州趣丸网络科技有限公司 | Character expression driving method, device, storage medium and computer equipment |
US12250451B2 (en) | 2019-03-11 | 2025-03-11 | Nokia Technologies Oy | Conditional display of object characteristics |
CN120108016A (en) * | 2025-01-23 | 2025-06-06 | 深圳市品声科技有限公司 | Expression processing method, smart AI glasses and storage medium |
2018-01-02: WO PCT/US2018/012105 patent/WO2018128996A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100177116A1 (en) * | 2009-01-09 | 2010-07-15 | Sony Ericsson Mobile Communications Ab | Method and arrangement for handling non-textual information |
US20120130717A1 (en) * | 2010-11-19 | 2012-05-24 | Microsoft Corporation | Real-time Animation for an Expressive Avatar |
US20140055554A1 (en) * | 2011-12-29 | 2014-02-27 | Yangzhou Du | System and method for communication using interactive avatar |
US20150310263A1 (en) * | 2014-04-29 | 2015-10-29 | Microsoft Corporation | Facial expression tracking |
Non-Patent Citations (1)
Title |
---|
"Affdex Developer Portal", AFFECTIVA, 1 March 2016 (2016-03-01), pages 1 , 20 - 21, XP055507872, Retrieved from the Internet <URL:https://web.archive.org/web/20160301152422/http://developer.affectiva.com:80/metrics> [retrieved on 20180312] * |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10210648B2 (en) * | 2017-05-16 | 2019-02-19 | Apple Inc. | Emojicon puppeting |
US11120600B2 (en) | 2017-05-16 | 2021-09-14 | Apple Inc. | Animated representation of facial expression |
CN109166164A (en) * | 2018-07-25 | 2019-01-08 | 维沃移动通信有限公司 | A kind of generation method and terminal of expression picture |
CN109166164B (en) * | 2018-07-25 | 2023-04-07 | 维沃移动通信有限公司 | Expression picture generation method and terminal |
US11127181B2 (en) | 2018-09-19 | 2021-09-21 | XRSpace CO., LTD. | Avatar facial expression generating system and method of avatar facial expression generation |
US11087520B2 (en) | 2018-09-19 | 2021-08-10 | XRSpace CO., LTD. | Avatar facial expression generating system and method of avatar facial expression generation for facial model |
US10839201B2 (en) | 2018-11-09 | 2020-11-17 | Akili Interactive Labs, Inc. | Facial expression detection for screening and treatment of affective disorders |
US11816795B2 (en) | 2018-12-20 | 2023-11-14 | Sony Group Corporation | Photo-video based spatial-temporal volumetric capture system for dynamic 4D human face and body digitization |
US12250451B2 (en) | 2019-03-11 | 2025-03-11 | Nokia Technologies Oy | Conditional display of object characteristics |
US11837020B2 (en) | 2019-05-17 | 2023-12-05 | Netease (Hangzhou) Network Co., Ltd. | Expression realization method and device for virtual character, and storage medium |
CN110136231A (en) * | 2019-05-17 | 2019-08-16 | 网易(杭州)网络有限公司 | Expression implementation method, device and the storage medium of virtual role |
US11074753B2 (en) * | 2019-06-02 | 2021-07-27 | Apple Inc. | Multi-pass object rendering using a three- dimensional geometric constraint |
WO2020263261A1 (en) * | 2019-06-27 | 2020-12-30 | Hewlett-Packard Development Company, L.P. | Two image facial action detection |
US12093439B2 (en) | 2019-06-27 | 2024-09-17 | Hewlett-Packard Development Company, L.P. | Two image facial action detection |
CN112684881B (en) * | 2019-10-17 | 2023-12-01 | 未来市股份有限公司 | Avatar facial expression generation system and avatar facial expression generation method |
CN112684881A (en) * | 2019-10-17 | 2021-04-20 | 未来市股份有限公司 | Avatar facial expression generation system and avatar facial expression generation method |
EP3872694A1 (en) * | 2020-02-27 | 2021-09-01 | XRSpace CO., LTD. | Avatar facial expression generating system and method of avatar facial expression generation |
CN111464828A (en) * | 2020-05-14 | 2020-07-28 | 广州酷狗计算机科技有限公司 | Virtual special effect display method, device, terminal and storage medium |
CN113689527A (en) * | 2020-05-15 | 2021-11-23 | 武汉Tcl集团工业研究院有限公司 | Training method of face conversion model and face image conversion method |
CN113689527B (en) * | 2020-05-15 | 2024-02-20 | 武汉Tcl集团工业研究院有限公司 | Training method of face conversion model and face image conversion method |
CN111798551A (en) * | 2020-07-20 | 2020-10-20 | 网易(杭州)网络有限公司 | Virtual expression generation method and device |
CN111798551B (en) * | 2020-07-20 | 2024-06-04 | 网易(杭州)网络有限公司 | Virtual expression generation method and device |
CN112232116A (en) * | 2020-09-08 | 2021-01-15 | 深圳微步信息股份有限公司 | A facial expression recognition method, device and storage medium |
CN112686232A (en) * | 2021-03-18 | 2021-04-20 | 平安科技(深圳)有限公司 | Teaching evaluation method and device based on micro expression recognition, electronic equipment and medium |
CN113223121B (en) * | 2021-04-30 | 2023-10-10 | 北京达佳互联信息技术有限公司 | Video generation method, device, electronic equipment and storage medium |
CN113223121A (en) * | 2021-04-30 | 2021-08-06 | 北京达佳互联信息技术有限公司 | Video generation method and device, electronic equipment and storage medium |
CN113436299A (en) * | 2021-07-26 | 2021-09-24 | 网易(杭州)网络有限公司 | Animation generation method, animation generation device, storage medium and electronic equipment |
US11657558B2 (en) | 2021-09-16 | 2023-05-23 | International Business Machines Corporation | Context-based personalized communication presentation |
WO2023081138A1 (en) * | 2021-11-08 | 2023-05-11 | Nvidia Corporation | Estimating facial expressions using facial landmarks |
CN114550235A (en) * | 2022-01-17 | 2022-05-27 | 合肥的卢深视科技有限公司 | Attitude angle detection method, system, electronic device and storage medium |
CN119478161A (en) * | 2024-12-06 | 2025-02-18 | 广州趣丸网络科技有限公司 | Character expression driving method, device, storage medium and computer equipment |
CN120108016A (en) * | 2025-01-23 | 2025-06-06 | 深圳市品声科技有限公司 | Expression processing method, smart AI glasses and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2018128996A1 (en) | System and method for facilitating dynamic avatar based on real-time facial expression detection | |
US11928592B2 (en) | Visual sign language translation training device and method | |
US10607065B2 (en) | Generation of parameterized avatars | |
US11551393B2 (en) | Systems and methods for animation generation | |
Mao et al. | Using Kinect for real-time emotion recognition via facial expressions | |
US10235562B2 (en) | Emotion recognition in video conferencing | |
US10810797B2 (en) | Augmenting AR/VR displays with image projections | |
EP3338217B1 (en) | Feature detection and masking in images based on color distributions | |
JP2022528294A (en) | Video background subtraction method using depth | |
EP4341904A1 (en) | Realistic personalized style transfer in image processing | |
US11880957B2 (en) | Few-shot image generation via self-adaptation | |
WO2020103700A1 (en) | Image recognition method based on micro facial expressions, apparatus and related device | |
CN111432267A (en) | Video adjusting method and device, electronic equipment and storage medium | |
Chen et al. | Towards improving social communication skills with multimodal sensory information | |
CN116391209A (en) | Realistic audio-driven 3D avatar generation | |
JP2011081445A (en) | Facial expression recognition device, inter-personal feeling estimation device, facial expression recognizing method, inter-personal feeling estimating method, and program | |
US20250182368A1 (en) | Method and application for animating computer generated images | |
CN112200236A (en) | Training method of face parameter recognition model and face parameter recognition method | |
CN108399358B (en) | A method and system for displaying facial expressions in video chat | |
Purps et al. | Reconstructing facial expressions of hmd users for avatars in vr | |
CN118700178A (en) | A robot facial expression control system and method based on internal projection technology | |
US20090086048A1 (en) | System and method for tracking multiple face images for generating corresponding moving altered images | |
Vineetha et al. | Face expression detection using Microsoft Kinect with the help of artificial neural network | |
Samuel Mathew et al. | A Deep Learning Approach for Real-Time Analysis of Attendees’ Engagement in Public Events | |
Jeong et al. | Seamstalk: Seamless talking face generation via flow-guided inpainting |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18736266 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18736266 Country of ref document: EP Kind code of ref document: A1 |