
WO2025179077A1 - Systems and methods for performing speech detection using depth images - Google Patents

Systems and methods for performing speech detection using depth images

Info

Publication number
WO2025179077A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
speech detection
speech
depth
depth images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/016678
Other languages
French (fr)
Inventor
Yang Zhang
Xue Wang
Zixiong SU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of California Berkeley
University of California San Diego UCSD
Original Assignee
University of California Berkeley
University of California San Diego UCSD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of California Berkeley and University of California San Diego UCSD
Publication of WO2025179077A1


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06N 20/00 Machine learning
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/24 Speech recognition using non-acoustical features
    • G10L 15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Definitions

  • the disclosure relates generally to the field of automatic speech detection and more specifically to systems and methods for performing speech detection using depth information.
  • a smart watch is a wearable computer in the form factor of a watch.
  • many smart watches provide a touch screen user interface that enables the user to interact with a variety of applications that can execute on the smart watch.
  • Many smart watches also enable users to interact with applications via voice commands that are typically captured using a microphone.
  • Lip movements are an essential component of speech and are crucial in conveying information in people’s daily communication. Lip movements are typically the most apparent visual feature of speech and are controlled by the same articulatory organs that control audible speech, including lips, tongue, teeth, jaw, velum, larynx, and lungs, among which only lips, teeth, jaw, and tongue are visible for lipreading. This limited information poses a significant challenge for understanding and modeling lip movements during speech.
  • Systems and methods in accordance with various embodiments of the invention use depth sensors to capture depth information, which is then used to detect facial gesture commands, audible speech and/or silent speech.
  • Silent speech is a promising interaction modality for information flow from users to devices, such as smartphones and smart watches.
  • Silent speech provides a useful input modality because of its intuitiveness, efficacy, and ability to preserve the user’s privacy in public settings. Because of these merits, silent speech detection would be a valuable capability on typical user devices, which already possess powerful perceptual capabilities.
  • By enabling a device, such as (but not limited to) a smartwatch, to read silent speech, the device can act as an optical microphone that its wearer can rely on to use speech without leaking privacy-sensitive information in public.
  • Lip movements can vary significantly between individuals and even within individuals, depending on factors such as habits and emotions. Therefore, modeling lip movements for silent speech recognition can require a deep understanding of the nuances of individual speech patterns.
  • One embodiment of the invention is capable of performing speech detection, including silent speech detection, using: a depth sensor; a memory containing a speech detection application; and at least one processor configured by the speech detection application.
  • the speech detection application configures the at least one processor to: capture a sequence of depth images; identify and crop a region of interest from within each depth image in the sequence of depth images, where the cropped region of interest contains a mouth; detect at least one word by providing the sequence of cropped regions of interest to a machine learning model configured to receive a sequence of cropped regions of interest and output at least one detected word from a predetermined vocabulary.
  • the machine learning model configured to receive a sequence of cropped regions of interest comprises: a sequence-to-sequence speech recognition model; and an application layer, where the application layer receives inputs from the sequence-to-sequence speech recognition model and outputs at least one detected word; wherein the sequence-to-sequence speech recognition model is a machine learning model trained to receive information from a sequence of depth images and output a sequence of characters; and update a user interface of the speech detection system based upon a command corresponding to the detected at least one spoken word.
  • the speech detection application further configures the processor to extract foreground information from each depth image in the sequence of depth images.
  • the speech detection application further configures the processor to threshold each depth image in the sequence of depth images to mask background information.
  • the machine learning model configured to receive a sequence of cropped regions of interest includes an encoder comprising a 3D convolutional layer, a 3D batch normalization layer, and a 3D Max pooling layer.
  • the captured sequence of depth images includes depth images of silent speech.
  • the captured sequence of depth images includes depth images of audible speech.
  • the captured sequence of depth images includes depth images captured of a facial gesture and the at least one detected word is a command word associated with the facial gesture within the predetermined vocabulary.
  • a further embodiment again includes: capturing a sequence of depth images using a depth sensor in a speech detection system; cropping each of the sequence of depth images using the speech detection system; extracting temporal-spatial features from the sequence of cropped depth images using the speech detection system; detecting at least one word from the extracted temporal-spatial features; and updating the user interface of the speech detection system in response to the at least one detected word.
  • the speech detection system performs the thresholding prior to the cropping each of the sequence of depth images.
  • capturing a sequence of depth images using a depth sensor in a speech detection system comprises capturing a sequence of depth images of silent speech.
  • capturing a sequence of depth images using a depth sensor in a speech detection system comprises capturing a sequence of depth images of audible speech.
  • capturing a sequence of depth images using a depth sensor in a speech detection system includes capturing a sequence of depth images of a facial gesture.
  • detecting at least one word from the extracted temporal-spatial features includes detecting at least one command associated with the facial gesture.
  • the depth sensor is selected from a group consisting of a structured light camera, a time of flight camera, and a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
  • the depth sensor is a structured light camera.
  • the depth sensor is a time of flight camera.
  • the depth sensor is a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
  • sequences of images received by the sequence-to-sequence speech recognition model form a point cloud video.
  • the command word or phrase is provided to the system to trigger a response responsive to a user command.
  • the speech detection system is capable of performing continuous speech detection.
  • sequence-to-sequence speech recognition model is configured to generate outputs that are dependent on an application layer.
  • sequence-to-sequence speech recognition model is configured to generate outputs that are sentences.
  • sequence-to-sequence speech recognition model is configured to generate outputs that are command words.
  • the cropped region of interest is cropped to exclude a jaw captured in the sequence of depth images.
  • the memory further contains an intrinsic matrix for the depth sensor; and the information from the sequence of cropped depth images is obtained by transforming the sequence of cropped depth images into a sequence of point clouds using the intrinsic matrix for the depth sensor.
  • the point clouds in the sequence of point clouds are downsampled and normalized to capture the salient features of speech patterns.
  • a point cloud transformer is used to encode and preserve the structure of the points within the sequence of point clouds and to adaptively search for related or similar points across entire lip movements using self-attention.
  • downsampling is performed using farthest point sampling.
  • the at least one processor is further configured by the speech detection application to discard depths within the captured sequence of depth images that exceed a threshold.
  • the at least one processor is configured by the speech detection application to identify and crop a region of interest from within each depth image using an object detection model.
  • the at least one processor is further configured by the speech detection application to filter the sequence of depth images using a distance mask.
  • a speech detection method comprises: capturing a sequence of depth images using a depth sensor in a speech detection system; cropping each of the sequence of depth images using the speech detection system; extracting temporal-spatial features from the sequence of cropped depth images using the speech detection system; detecting at least one word from the extracted temporal-spatial features; and updating a user interface of the speech detection system in response to the at least one detected word.
  • a smart watch comprises: a housing; a watch band attached to the housing; a depth sensor mounted within the housing; a display mounted within the housing to form an enclosure; a memory contained within the enclosure, where the memory contains an operating system, a speech detection application and parameters defining a machine learning model; at least one processor contained within the enclosure, where the at least one processor is configured by the speech detection application to: capture a sequence of depth images using the depth sensor; identify and crop a region of interest from within each depth image in the sequence of depth images, where the cropped region of interest contains a mouth; detect at least one word by providing the sequence of cropped regions of interest to the machine learning model configured to receive a sequence of cropped regions of interest and output at least one detected word from a predetermined vocabulary, wherein the machine learning model comprises: a sequence-to-sequence speech recognition model; an application layer, where the application layer receives inputs from the sequence-to-sequence speech recognition model and outputs at least one detected word; wherein the sequence-to-sequence speech recognition model is a machine learning model trained to receive information from a sequence of depth images and output a sequence of characters.
  • Fig. 1 illustrates challenges of using a depth sensor incorporated within a wearable device to perform speech detection.
  • Fig. 2 conceptually illustrates a viseme-phoneme mapping.
  • Fig. 3A illustrates four facial regions compared with regard to viseme classification performance: 1) 50% of mouth area, 2) mouth region alone, 3) mouth with jaw area, and 4) whole face.
  • Fig. 3B illustrates viseme classification results for the four facial regions referenced in Fig. 3A.
  • Fig. 4 illustrates a process for detecting and responding to a speech command (either silent or spoken) using depth images in accordance with an embodiment of the invention.
  • Fig. 5A illustrates a raw depth map of a face.
  • Fig. 5B illustrates a thresholded depth map of the raw depth map referenced in Fig. 5A.
  • Fig. 6 illustrates a system for performing continuous silent speech detection, for detecting command words and/or to perform continuous speech detection from sequences of depth images in accordance with an embodiment of the invention.
  • Fig. 7 illustrates a table for a command corpus illustrating scenarios and corresponding commands.
  • Fig. 8A illustrates a process for training a user-dependent speech detection system utilizing incremental learning to customize a pre-trained model based upon additional training data examples of a specific user in accordance with an embodiment of the invention.
  • Fig. 8B illustrates a table detailing parameters of a 3D visual frontend and ResNet modules in accordance with an embodiment of the invention.
  • Fig. 9A illustrates a confusion matrix for detection of 27 commands with a user-dependent model in accordance with an embodiment of the invention.
  • Fig. 9B illustrates a confusion matrix for detection of 10 digits with a user-dependent model in accordance with an embodiment of the invention.
  • Fig. 10 illustrates the accuracy data of user-dependent and user-independent models in accordance with an embodiment of the invention.
  • Fig. 11 illustrates a speech detection system that uses a depth sensor to capture depth images that can be used to perform speech detection in accordance with an embodiment of the invention.
  • Fig. 12 illustrates a process for sequence-to-sequence speech recognition in accordance with an embodiment of the invention.
  • Fig. 13 illustrates an overview of a sequence-to-sequence speech recognition system in accordance with an embodiment of the invention.
  • a wearable device such as (but not limited to) a smart watch is capable of performing speech detection using depth information acquired using an integrated depth sensor such as (but not limited to) a structured light camera, a time of flight camera and/or a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
  • the user can interact with the wearable device using facial gestures, spoken commands and/or silent commands that are observed by the depth sensor.
  • Software can configure the wearable device to process the depth information captured by the depth sensor to perform speech detection.
  • a silent speech detection system performs speech detection from a sequence of depth images by identifying a region of interest (ROI) and extracting a depth map of the ROI for each image.
  • the depth maps of the ROIs can be provided to a model trained to detect words from a predetermined vocabulary.
  • the words are command words and/or the words form sequences of words that correspond to command phrases. Once a command word or phrase is detected, the command word and/or phrase can be provided to the system to trigger a response responsive to a user command.
  • the speech detection system is capable of performing continuous speech detection.
  • a silent speech detection system performs speech recognition using a general framework that can be adapted to perform specific tasks.
  • a sequence-to-sequence speech recognition model is utilized that is capable of directly processing point cloud videos as input data and generating outputs that are dependent on an application layer (e.g. sentences or command words.)
  • a data processing layer may be included as an application layer.
  • a heuristic layer may be included as an application layer.
  • Lip movements are an important component of speech and can be crucial in conveying information in people’s daily communication.
  • the concept of lip reading was first proposed in the 1950s, when researchers suggested that visual cues from the movements of speakers’ mouths could be used to aid in speech recognition.
  • lip movements are the most apparent feature and are controlled by the same articulatory organs that control audible speech, including lips, tongue, teeth, jaw, velum, larynx, and lungs, among which only lips, teeth, jaw, and tongue are visible for lipreading. This limited information can pose a significant challenge for understanding and modeling lip movements during speech.
  • Silent speech detection can be useful in a variety of contexts including (but not limited to) applications in which a user wishes to provide a command to a device discreetly (e.g. a smart phone or a smart watch in a crowded environment) or in a noisy environment. Silent speech detection can increase accessibility of devices and/or applications for users with damaged vocal cords and/or the hearing impaired.
  • Previously proposed approaches to visual speech detection have typically used color (RGB) cameras to "read" lip movements in an attempt to achieve silent speech detection or improve voice recognition performance.
  • Systems and methods in accordance with a number of embodiments of the invention use depth sensing as a robust approach to acquire high-fidelity information to reconstruct user speech. Depth sensing is inherently less sensitive to ambient factors such as lighting conditions and background color which pose challenges to conventional sensors (e.g., RGB cameras). More importantly, depth sensing can yield more equitable interaction systems as they can be less affected by skin tones than conventional RGB imaging approaches.
  • a challenge that can be encountered when performing speech detection based upon depth information is motion of the speaker relative to the depth sensor.
  • a speaker’s head may move during the capture of the speech and/or the speaker’s arm may move during the capture of the speech.
  • a user may provide a speech command while walking.
  • Fig. 1 illustrates the challenges of using a depth sensor incorporated within a wearable device to perform speech detection.
  • a smart watch incorporating a depth sensor is simulated using an iPhone incorporating Apple’s TrueDepth depth sensor, which is capable of capturing depth maps.
  • the captured depth maps of a user’s mouth are typically captured from a view below the user’s mouth and looking upward toward the user’s mouth.
  • speech detection systems in accordance with many embodiments of the invention are capable of performing speech detection based upon depth maps captured from this viewpoint and/or when a user is in motion using techniques including (but not limited to) data augmentation in order to simulate depth maps captured from a variety of viewpoints.
  • Systems and methods for performing speech detection using depth information in accordance with various embodiments of the invention are discussed further below.
  • Systems and methods in accordance with many embodiments of the invention capture depth information with respect to a user’s face and utilize an ROI within the captured depth information to perform silent speech detection.
  • Speech production is a complex process involving multiple articulators, such as the lips and jaw, and coordinated muscle movements. These articulators can work together to shape the airflow coming from the lungs, generating different sounds and forming words and sentences.
  • Previous studies on silent speech recognition using RGB images as input have typically focused on the mouth/lips region as the primary area of interest. However, the extraoral regions of faces may also contribute to speech recognition.
  • the accuracy with which silent speech detection can be performed can be impacted by the specific ROI that is utilized as the input to a silent speech detection process.
  • an ROI corresponding to at least a portion of the user’s mouth is identified and utilized to perform silent speech detection.
  • an ROI that includes the entirety of the user’s mouth is identified and utilized to perform silent speech detection.
  • an ROI that is sufficiently large so as to include the entirety of the user’s mouth, but sufficiently small so as to exclude the user’s jaw is utilized to perform silent speech detection.
  • any of a variety of factors can be utilized in the selection of an ROI within a depth image for the purpose of performing silent speech detection in accordance with various embodiments of the invention and a number of those factors are discussed in detail below.
  • Visemes are the shapes of the mouth at the apex of a given phoneme, a widely utilized token in voice-based speech recognition systems. Visemes are often considered as visual representations of phonemes.
  • coarticulation of phonemes/visemes in speech presents a challenge for modeling lip movements accurately. This coarticulation is the overlapping of phonemes/visemes that occurs when people speak, which can cause subtle variations in lip movements that make it difficult to isolate individual visemes.
  • the mapping of phonemes to visemes is not a straightforward process, as there is no universally accepted convention for defining visemes or mapping phonemes to visemes. The number of phonemes is much greater than the number of visemes, making it difficult to establish a one-to-one correspondence between phonemes and visemes.
  • Viseme detection accuracy can be employed as an evaluation metric to evaluate different regions of interest within captured depth information for performing silent speech recognition. Specifically, information degradation can be observed as the ROI includes an increasing area of the surroundings around users’ lips.
  • a data-driven viseme-phoneme conversion can be utilized to evaluate different regions of interest.
  • a number of visemes are defined (e.g. 14 visemes), among which one is designated for silence.
  • a total of 39 phonemes have been identified to represent the diverse range of sounds in American English in the Carnegie Mellon University Pronouncing Dictionary, which is widely used to transcribe the sounds of American English and has practical applications in speech recognition.
  • the selected visemes can be mapped to phonemes for the purposes of evaluating potential speech detection performance.
  • A mapping of the phonemes from the Carnegie Mellon University Pronouncing Dictionary to a set of 14 visemes and the corresponding viseme distribution is shown in Fig. 2.
  • a study was performed recruiting 12 native English speakers, where each user was asked to read a sentence list once in a typical conversational style, allowing collection of data on different visemes in a more natural setting.
  • the depth frames as well as the speaking audio were collected using an iPhone (12 mini) TrueDepth camera and microphone, respectively.
  • the iPhone was placed on a tabletop in front of seated participants in a quiet lab environment.
  • the CMU dictionary was utilized to translate each sentence into a sequence of phonemes.
  • a transformer-based phonetic aligner was employed to perform forced alignment of the phonemes with the recorded audio signal of the sentence and locate the temporal boundaries of each phoneme in this sequence. This enabled extraction and labelling of the depth frames that corresponded to the duration of each phoneme.
  • the viseme-phoneme mapping conceptually illustrated in Fig. 2 was then utilized to group the phonemes into their corresponding visemes. By following this procedure, each depth frame was accurately labelled with its corresponding viseme in order to prepare the data for further analysis, namely viseme detection and ROI selection.
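  • As a hedged illustration of this labelling step, the Python sketch below assigns a viseme label to each depth frame from forced-aligned phoneme intervals. The function name, the input formats, and the small phoneme-to-viseme excerpt are illustrative assumptions and not the full mapping of Fig. 2.

```python
# Hypothetical excerpt of a phoneme-to-viseme mapping; the full mapping of the
# 39 CMU phonemes to 14 visemes (Fig. 2) would be loaded from the study data.
PHONEME_TO_VISEME = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "SIL": "silence",
}

def label_depth_frames(frame_timestamps, phoneme_intervals):
    """Assign a viseme label to every depth frame.

    frame_timestamps: capture time of each depth frame, in seconds.
    phoneme_intervals: (start_s, end_s, phoneme) tuples from forced alignment.
    Frames that fall outside every interval are labelled as silence.
    """
    labels = []
    for t in frame_timestamps:
        label = "silence"
        for start, end, phoneme in phoneme_intervals:
            if start <= t < end:
                label = PHONEME_TO_VISEME.get(phoneme, "unknown")
                break
        labels.append(label)
    return labels
```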
  • In order to evaluate the potential information content of depth information obtained with respect to different regions of interest of a user’s face, viseme classification performance from different facial regions can be compared. Referring to Fig. 3A, four facial regions are compared: 1) 50% of mouth area, 2) mouth region alone, 3) mouth with jaw area, and 4) whole face.
  • cropped depth frames can be transformed into point clouds using the intrinsic matrix of the depth sensor. This can enable a precise representation of the 3D shape of the user’s facial landmarks, as the point cloud data contains information about the spatial location of each point on the face, which can facilitate subsequent viseme classification.
  • a point cloud classification model such as (but not limited to) PointNet, can be utilized to perform the viseme detection.
  • Fig. 3B shows viseme classification results for the four facial regions referenced above with reference to Fig. 3A. As can be seen, the mouth region alone provides the most informative features during speech with the highest accuracy of 92.33% in viseme classification.
  • User devices in accordance with various embodiments of the invention can incorporate operating systems that enable user interactions with various software applications via spoken and/or silent speech.
  • speech is detected using depth images either alone or in combination with additional sensing modalities.
  • the process 400 includes capturing (402) a sequence of depth images. In many embodiments, foreground information in the depth images can be isolated. In several embodiments, the speaker is assumed to be within a threshold distance of the depth sensor and all depth samples with distances beyond the threshold distance are discarded. As can readily be appreciated, any of a variety of different techniques can be utilized to isolate a speaker from depth images as appropriate to the requirements of specific applications.
  • the process 400 can also include cropping (404) the depth images to a specific ROI. As noted above, different regions of a speaker’s face have different informational values with respect to silent speech detection. Accordingly, cropping the depth images to a specific ROI can improve the ability of the process to accurately detect speech.
  • the sequence of cropped depth information (e.g. point clouds or depth maps) can then be used to extract (410) a set of temporal-spatial features that can be utilized to detect (412) words from a predetermined vocabulary of words.
  • the extraction of spatial-temporal features and the classification of the spatial-temporal features to detect (412) words can involve the utilization of machine learning models that are specifically trained for this purpose.
  • the operating system of the device can update (414) the user interface of the device in response to the detected command word(s).
  • any of a variety of processes can be utilized to detect speech according to the requirements of specific applications in accordance with various embodiments of the invention. Specific factors that can be considered in designing such a process and various processes tailored to specific applications such as (but not limited to) command word detection, continuous speech detection, user-independent speech detection, and user-dependent speech detection are discussed further below.
  • a machine learning model that can learn how to detect the pose of a user’s lips from depth data is utilized to perform speech detection based upon depth images.
  • systems and methods in accordance with embodiments of the invention are not limited to the use of a specific machine learning model and the specific machine learning model or ensemble of machine learning models that are used to detect the pose of informative regions of a user’s face during speech detection are typically determined based upon the requirements of specific applications.
  • a number of relevant machine learning models that can be utilized to perform speech detection using depth information in various embodiments of the invention are discussed in detail below.
  • a user’s lips can be identified within depth data by utilizing transfer learning to fine-tune a pre-trained object detection model to enable the use of the model to detect lips in a depth image.
  • an object detection model is utilized such as (but not limited to) the YOLOv7 model described in Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. ”YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
  • depth information can be converted into images.
  • common forms of depth information such as (but not limited to) depth maps, which are often represented in a 16-bit floating-point format, have a high dynamic range. Directly mapping them to 8-bit integers can cause substantial precision loss.
  • the depth map can be filtered using a distance mask for background subtraction.
  • an appropriate upper threshold for the distance mask can be determined based upon the distance from which the user is likely to be observed in a particular application.
  • an appropriate distance was determined by applying a face detector to an RGB image registered with the captured depth images.
  • a face detector such as (but not limited to) the MediaPipe face detector can be used to detect faces from the RGB image and calculate the distance to the detected face using the depth information from the depth image.
  • the distance between the user’s face and the camera can vary significantly.
  • a distance threshold of 0.5 m can be used to extract the foreground in smart watch applications, which contains the user’s face, before normalizing the image between 0 and 255.
  • the specific depth threshold that is utilized in a given application will depend upon the requirements of that application.
  • Figs. 5A and 5B show thresholding of a depth map and the significantly improved resolution of the converted depth image that results, making it easier to detect the lips.
  • the specific processes that can be utilized for normalization of depth images and/or thresholding depth information depend upon the requirements of specific applications.
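  • The Python sketch below illustrates one plausible implementation of the thresholding and normalization described above: depths beyond a configurable distance (e.g. 0.5 m) are masked out and the remaining foreground is rescaled to an 8-bit image. The function name and the per-frame min-max rescaling are assumptions; the exact normalization used in a given embodiment may differ.

```python
import numpy as np

def depth_to_8bit(depth_m: np.ndarray, max_distance_m: float = 0.5) -> np.ndarray:
    """Mask depths beyond max_distance_m and rescale the remaining foreground
    depths to an 8-bit image in the range 0-255."""
    foreground = np.where((depth_m > 0) & (depth_m <= max_distance_m), depth_m, 0.0)
    valid = foreground > 0
    out = np.zeros(depth_m.shape, dtype=np.uint8)
    if not np.any(valid):
        return out  # nothing within the distance threshold
    d_min, d_max = foreground[valid].min(), foreground[valid].max()
    scale = 255.0 / max(d_max - d_min, 1e-6)
    out[valid] = ((foreground[valid] - d_min) * scale).astype(np.uint8)
    return out
```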
  • a pre-trained YOLOv7 model can be trained by transferring the knowledge from an RGB-based face detector. Specifically, ground truth information for the depth images can be obtained using lip bounding boxes predicted from registered RGB images using a face detector such as, but not limited to, MediaPipe. In this way a model can be trained to determine a rectangular ROI with a width of 1.2 x average_lip_width and a height of 1.5 x average_lip_height measured across each utterance given the results from the validation study. As noted above, the specific dimensions utilized for the ROI are largely dependent upon the requirements of a given application.
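  • The sketch below shows how a cropped ROI could be derived from a predicted lip bounding box scaled by the 1.2x width and 1.5x height factors mentioned above. The function signature and box format are hypothetical.

```python
import numpy as np

def crop_lip_roi(depth_image: np.ndarray, lip_box,
                 width_scale: float = 1.2, height_scale: float = 1.5) -> np.ndarray:
    """Crop a mouth ROI from a depth image.

    lip_box: (x_center, y_center, lip_width, lip_height) in pixels, e.g. the
    average lip bounding box predicted across an utterance.
    """
    x_c, y_c, lip_w, lip_h = lip_box
    half_w, half_h = 0.5 * width_scale * lip_w, 0.5 * height_scale * lip_h
    h, w = depth_image.shape[:2]
    x0, x1 = max(int(x_c - half_w), 0), min(int(x_c + half_w), w)
    y0, y1 = max(int(y_c - half_h), 0), min(int(y_c + half_h), h)
    return depth_image[y0:y1, x0:x1]
```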
  • speech detection is performed to detect specific command words or phrases.
  • references to detection of command words herein should be understood as encompassing both the detection of command words and the detection of command phrases.
  • Identified point cloud features observed during speech corresponding to a single command can be fed into a machine learning model, which can serve as the final classifier, to convert the point cloud features into class predictions.
  • an end-to-end command classification system is utilized that is designed to accurately classify spoken commands using only the depth images of the user’s lips, making it a useful tool for a wide range of applications.
  • a variety of silent speech detection applications involve the use of a vocabulary or corpus of command words.
  • speech detection systems use a vocabulary that includes a corpus of 10 digits and 27 commands determined to be common in many users’ daily lives.
  • the ten-digit corpus includes digits from zero to nine, while the 27-command corpus is based on the commands listed in Table 1 shown in Fig. 7.
  • These two distinct corpora provide a dataset that can be utilized to accurately perform silent speech recognition of command words. By incorporating a range of commonly used digits and commands, the dataset provides a realistic representation of the types of speech input that users are likely to encounter in everyday interactions with voice assistant systems.
  • the aim of this set of commands is to provide context for situations in which individuals engage with a voice assistant to perform tasks such as operating a smartphone and/or controlling smart home devices.
  • the specific vocabulary that is supported by silent speech detection systems in accordance with various embodiments of the invention is largely dependent upon the requirements of specific applications.
  • systems and methods for performing speech detection in accordance with various embodiments of the invention can be utilized to decode continuous speech into word sequences directly, which can allow the user to input sentences of any combinations of words in a predetermined vocabulary.
  • a system for performing continuous silent speech detection, for detecting command words and/or to perform continuous speech detection from sequences of depth images in accordance with an embodiment of the invention is illustrated in Fig. 6.
  • a sequence of depth images is captured and normalized to isolate foreground depth measurements.
  • a lip detection model can be used to identify a mouth ROI within the normalized images and the depth images cropped to the identified ROIs.
  • a sequence of cropped ROI depth maps 602 is provided to a pre-trained model 604, such as (but not limited to) an AV-HuBERT model as described in Shi, Bowen, Abdelrahman Mohamed, and Wei-Ning Hsu.
  • the model uses a Convolutional Neural Network (CNN) to encode 606 the input image sequence into high-dimensional vectors, and uses self-attention layers to capture contextual information.
  • a self-attention decoder 608 model is used that is trained to continuously output token sequences 610 from a known vocabulary with a CTC Loss (Connectionist Temporal Classification Loss). The token sequences can then be mapped against a vocabulary of command words to provide user inputs to software executing on the device. The token sequences can also be utilized to perform continuous speech detection.
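  • A minimal sketch of greedy CTC decoding of this kind is shown below: the per-frame argmax tokens are collapsed by merging repeats and dropping blanks, producing a character sequence that can then be matched against a command vocabulary. The blank index, function name, and token mapping are assumptions.

```python
import numpy as np

BLANK_ID = 0  # index of the CTC blank token (an assumption)

def ctc_greedy_decode(log_probs: np.ndarray, id_to_token: dict) -> str:
    """Greedy CTC decoding: take the argmax token per frame, merge repeated
    tokens, and drop blanks. log_probs has shape (time, vocab_size)."""
    best_path = log_probs.argmax(axis=-1)
    chars, prev = [], None
    for idx in best_path:
        if idx != prev and idx != BLANK_ID:
            chars.append(id_to_token[int(idx)])
        prev = idx
    return "".join(chars)
```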
  • transfer learning is utilized to train a model that can perform speech detection of continuous speech based upon depth images by leveraging a pre-trained RGB-based visual speech recognition model, such as (but not limited to) an AV-HuBERT model.
  • transfer learning is not required and the model is directly trained using a training data corpus of annotated depth image sequences.
  • data augmentation techniques are used to extend the training dataset used to train a model employed by a silent speech detection system to make the resulting system more robust to variations in pose, such as (but not limited to) relative position between a user’s face and a wrist worn device that incorporates the depth sensor.
  • the training data set can be augmented by applying 3D transformations to a point cloud 622 estimated from a depth map 620 captured by a depth sensor. Additional depth maps can then be obtained by projecting the point cloud 622 back to a depth map 624 from a new viewpoint.
  • the transformations are namely translation, scale, and rotation with random factors (e.g. up to 0.02, 0.05, and 0.2, respectively).
  • Relatively smaller maximum values for translation and scale transformations can be utilized to avoid noticeable artifacts due to factors including (but not limited to) occlusions that occur as a result of the new viewpoint.
  • the same transformation parameters can be applied for each utterance.
  • These augmentations can be adopted to improve the generalizability of the trained model by preventing it from over-fitting on the training set.
  • the data augmentation can make the model more robust to changes in the relative orientation of and/or distance of the depth camera relative to the user.
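  • The sketch below illustrates one way such augmentation could be implemented: a random translation, scale, and rotation (interpreted here as a rotation about a single axis, which is an assumption) are drawn within the stated maximum factors and applied to an utterance’s point clouds, with the same sampled parameters reused for every frame of that utterance.

```python
import numpy as np

def augment_point_cloud(points: np.ndarray, max_translation: float = 0.02,
                        max_scale: float = 0.05, max_rotation: float = 0.2) -> np.ndarray:
    """Apply one randomly sampled translation, scale, and rotation to an
    (N, 3) point cloud. The same sampled parameters would be reused for
    every frame of the same utterance."""
    t = np.random.uniform(-max_translation, max_translation, size=3)
    s = 1.0 + np.random.uniform(-max_scale, max_scale)
    theta = np.random.uniform(-max_rotation, max_rotation)  # assumed radians
    c, si = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0.0, si], [0.0, 1.0, 0.0], [-si, 0.0, c]])  # about the y-axis
    return (points @ rot.T) * s + t
```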
  • speech detection models are trained in a manner that makes them user-dependent.
  • the speech detection system can identify a user and select one or more machine learning models that have been specifically trained using training data including depth images sequences of that specific user.
  • a process for training a user-dependent speech detection system in accordance with an embodiment of the invention is conceptually illustrated in Figs. 8A and 8B. The process utilizes incremental learning to customize a pre-trained model based upon additional training data examples of a specific user. In this way, a very small number of incremental instances of training data can be utilized to significantly improve the performance of a speech detection model configured to detect words from input depth images.
  • the model takes an 88x88 depth image as input.
  • the visual feature is encoded using a 3D convolutional layer, a 3D batch normalization layer, and a 3D max pooling layer.
  • the tensor is flattened into a single-dimensional vector through a 2D average pooling layer and fed into a modified ResNet-18.
  • Each Residual Block repeats a series of operations of 2D Conv, BatchNorm, and ReLU twice.
  • the parameters of the 3D visual frontend and the ResNet modules are detailed in the table shown in Fig. 8B.
  • a transformer module can be used for sequence modeling to decode the images into words.
  • in the illustrated embodiment, a dropout of p = 0.1 can be used after the self-attention block within each transformer layer, and each transformer layer is dropped at a rate of 0.1.
  • the model can be trained with a connectionist temporal classification (CTC) loss, where the features are projected to the probabilities of each token in the vocabulary.
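  • A hedged PyTorch sketch of a 3D visual frontend of the kind described above is shown below. The kernel sizes, strides, and channel counts are illustrative assumptions and are not the exact parameters detailed in Fig. 8B.

```python
import torch
import torch.nn as nn

class VisualFrontend3D(nn.Module):
    """3D visual frontend sketch: 3D convolution, 3D batch normalization, and
    3D max pooling over a sequence of 88x88 depth crops, followed by per-frame
    spatial average pooling into one feature vector per frame."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv3d(1, out_channels, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.bn = nn.BatchNorm3d(out_channels)
        self.pool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                                 padding=(0, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time, 88, 88) sequence of cropped depth images
        x = self.pool(torch.relu(self.bn(self.conv(x))))
        # average over the spatial dimensions -> (batch, time, channels),
        # ready to be fed into a ResNet/transformer backend
        return x.mean(dim=(-2, -1)).transpose(1, 2)
```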
  • any of a variety of depth image data processing pipelines incorporating any of a variety of different machine learning models can be utilized to perform speech detection, including (but not limited to) the silent speech detection of command words and/or continuous speech, as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
  • silent speech recognition can be performed using a general framework that can be adapted to perform specific tasks.
  • the specific tasks are implemented through application layers.
  • a sequence-to-sequence speech recognition model is utilized that is capable of directly processing point cloud videos as input data and generating outputs that are dependent on the application layer (e.g. sentences or command words).
  • Fig. 13 provides an overview of an embodiment of a Sequence-to-Sequence Speech Recognition system.
  • the process 1200 includes converting (1202) a depth image to a point cloud.
  • depth data is converted into point clouds.
  • a data processing application layer converts depth data into point clouds.
  • a transformation is employed that maps each pixel in a depth image to its corresponding 3D spatial coordinates.
  • the transformation includes the depth value of the pixel, which is defined as the distance between the pixel and the camera’s optical origin.
  • the transformation relies on the camera’s intrinsic parameters to convert each pixel from the depth data into a point in the 3D point cloud.
  • the 3D structure of a point cloud can be used to identify per-point features.
  • normal vectors are calculated and can serve as features representing the local geometric properties and orientation at an individual point on a 3D surface.
  • any of a variety of different techniques can be utilized to convert a depth image to a point cloud as appropriate to the requirements of specific applications.
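  • A minimal sketch of the pinhole back-projection described above is shown below, assuming a standard 3x3 intrinsic matrix; the function name is hypothetical.

```python
import numpy as np

def depth_to_point_cloud(depth_m: np.ndarray, intrinsics: np.ndarray) -> np.ndarray:
    """Back-project a depth map to a 3D point cloud with the pinhole model.

    depth_m: (H, W) depth in metres, 0 for invalid pixels.
    intrinsics: 3x3 camera matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    Returns an (N, 3) array of points in the camera coordinate frame.
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    v, u = np.nonzero(depth_m > 0)   # pixel rows (v) and columns (u)
    z = depth_m[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```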
  • the process 1200 can also include standardizing point clouds (1204).
  • the sequence-to-sequence speech recognition model forms a general framework for performing standardization of point clouds.
  • the general framework performs standardization of point clouds through a data processing application layer.
  • the number of points in each frame is randomly sampled.
  • a number of points in each frame is randomly sampled (e.g. 1024).
  • the lips’ point cloud of each frame is normalized within a unit ball.
  • the centroid of the point clouds of each frame is calculated.
  • the point clouds are relocated so that the centroid is positioned at the origin of the coordinate system.
  • an affine transformation is utilized to account for varying face orientation with respect to devices.
  • a transformation network (TNet) is used to predict the affine transformation matrix.
  • the transformation network is inspired by the model PointNet.
  • each frame of point clouds is fed into TNet and rotated independently using the affine transformation matrix output by TNet.
  • the TNet model can be trained together with a model used to perform spatio-temporal feature extraction (1206) and a model that can be used to perform sentence decoding (1208). While specific processes for normalizing point clouds are described above, any of a variety of different techniques can be utilized to standardize point clouds as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
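  • The sketch below illustrates the standardization steps described above (random sampling to a fixed number of points, centering on the centroid, and normalizing within a unit ball); it omits the TNet affine transformation, and the function name and default point count are assumptions.

```python
import numpy as np

def standardize_point_cloud(points: np.ndarray, num_points: int = 1024) -> np.ndarray:
    """Randomly sample a fixed number of points, move the centroid to the
    origin, and rescale the cloud so that it fits within a unit ball."""
    idx = np.random.choice(len(points), num_points, replace=len(points) < num_points)
    sampled = points[idx] - points[idx].mean(axis=0)   # centroid at the origin
    radius = np.linalg.norm(sampled, axis=1).max()
    return sampled / max(radius, 1e-8)                 # normalize within a unit ball
```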
  • the process 1200 can also include extracting spatio-temporal features (1206).
  • the general framework utilizes a data processing application layer to perform the extraction of spatio-temporal features.
  • temporal information may be included in point clouds by introducing another dimension t.
  • a time series of point cloud frames are fed into a point 4D convolutional layer for feature extraction.
  • the point cloud 4D convolution is optimized by downsampling each frame in the point cloud videos with a spatial subsampling range.
  • the spatial subsampling range uses the Farthest Point Sampling (FPS) method.
  • anchor points are identified in each frame.
  • 64 anchor points are identified in each frame.
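  • A hedged sketch of farthest point sampling, which could be used to select the anchor points described above, is shown below; the function name and the choice of a random initial point are assumptions.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, num_anchors: int = 64) -> np.ndarray:
    """Select num_anchors well-spread anchor points from an (N, 3) point cloud
    by repeatedly picking the point farthest from those already selected."""
    n = len(points)
    chosen = np.zeros(num_anchors, dtype=np.int64)
    chosen[0] = np.random.randint(n)                 # arbitrary starting point
    distances = np.full(n, np.inf)
    for i in range(1, num_anchors):
        diff = points - points[chosen[i - 1]]
        distances = np.minimum(distances, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(distances.argmax())          # farthest remaining point
    return points[chosen]
```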
  • anchor points are used as centroids to define a spatio-temporal local region.
  • the spatio-temporal local region defines the searching area within the current frame.
  • a point 4D convolutional layer decodes local regions into a feature vector.
  • the point 4D convolutional layer decodes local regions into a feature vector using an equation and a multilayer perceptron.
  • the feature vectors, together with their anchor points, are passed through a max pooling layer to form the features for the next step.
  • a video-level spatio-temporal transformer may be employed to search and merge feature vectors extracted from the Point 4D convolution across the whole point cloud videos.
  • a global feature representation of the utterance video is generated by appending a max pooling layer right after the transformer, effectively combining the localized features into global features.
  • the global features are fed into two bi-directional Gated Recurrent Unit (Bi-GRU) layers.
  • the output from the Bi-GRU layer undergoes processing performed by a softmax layer to produce probabilities for each token.
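  • A hedged PyTorch sketch of a decoding head of this kind is shown below: two bidirectional GRU layers over the global per-frame features, followed by a linear projection and a (log-)softmax over the token vocabulary, as would be used with a CTC loss. The feature, hidden, and vocabulary sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiGRUDecoderHead(nn.Module):
    """Decoding head sketch: two bidirectional GRU layers over the global
    per-frame features, then a linear projection and log-softmax producing
    per-token probabilities suitable for a CTC loss."""
    def __init__(self, feature_dim: int = 512, hidden_dim: int = 256,
                 vocab_size: int = 29):  # e.g. 26 letters + space + apostrophe + blank
        super().__init__()
        self.gru = nn.GRU(feature_dim, hidden_dim, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim)
        out, _ = self.gru(features)
        return torch.log_softmax(self.proj(out), dim=-1)
```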
  • the tokens are English characters.
  • the time series of point cloud frames are directly decoded into English characters.
  • an application layer is implemented that performs the decoding of point cloud frames into English characters.
  • the English characters include the space character.
  • a mapping is established between depth data of speech and textual information including sentences and command words.
  • a heuristic layer with simple rules that consolidate repeated adjacent characters is deployed.
  • the Connectionist Temporal Classification (CTC) Loss is employed to eliminate the need for precise viseme alignments and to simplify the training process.
  • the global feature vector is passed through a softmax layer, and the output sequence is generated by selecting the character sequence with the highest probability.
  • any of a variety of different techniques can be utilized to decode sentences from point cloud frames as appropriate to the requirements of specific applications.
  • the process 1200 can also include recognizing commands (1210).
  • a heuristic layer is implemented for command recognition, mapping character sequences to command words within a predefined command set.
  • the predicted character sequences are compared with the commands that exist in the command set based on the sequence matching algorithm ‘gestalt pattern matching.’
  • gestalt pattern matching is recursively applied to the segments of the sequences and yields the best-matched command as the final output.
  • any of a variety of different techniques can be utilized to recognize commands from point cloud frames as appropriate to the requirements of specific applications.
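  • As a hedged illustration of this matching step, the sketch below maps a decoded character sequence to the closest entry in a predefined command set using Python's difflib.SequenceMatcher, whose similarity ratio is based on gestalt (Ratcliff-Obershelp) pattern matching. The example command set is hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical command set; a real system would use its predefined corpus.
COMMAND_SET = ["volume up", "volume down", "open camera", "set an alarm"]

def match_command(predicted_text: str, commands=COMMAND_SET) -> str:
    """Map a decoded character sequence to the closest command in the command
    set using difflib.SequenceMatcher (Ratcliff-Obershelp / gestalt matching)."""
    best_ratio, best_command = max(
        (SequenceMatcher(None, predicted_text, command).ratio(), command)
        for command in commands
    )
    return best_command
```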
  • a speech detection system that uses a depth sensor to capture depth images that can be used to perform speech detection in accordance with an embodiment of the invention is illustrated in Fig. 11.
  • the speech detection system 1100 includes at least one processor 1110, a depth sensor 1112, a network interface 1114, and memory 1116.
  • speech detection system may exclude certain components and/or include other components that are omitted for brevity without departing from this invention.
  • speech detection systems in accordance with various embodiments of the invention can be implemented on almost any computing device that includes a depth sensor.
  • the at least one processor 1110 can include (but is not limited to) a processor, microprocessor, controller, or a combination of processors, microprocessor, and/or controllers that performs instructions stored in the memory 1116 to manipulate data stored in the memory.
  • Processor instructions can configure the at least one processor 1110 to perform processes in accordance with certain embodiments of the invention.
  • processor instructions can be stored on a non-transitory machine readable medium.
  • the at least one processor 1110 is configured to communicate with a depth sensor 1112, such as (but not limited to) a structured light camera, a time of flight camera and/or a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
  • the at least one processor 1110 is also configured to communicate with sensors capable of capturing sensor information using any of a variety of sensing modalities appropriate to the requirements of specific applications including (but not limited to) one or more microphones and/or one or more color cameras.
  • the network interface 1114 can be used to gather inputs and/or provide outputs.
  • the speech detection system 1100 can utilize the network interface 1114 to transmit and receive data over a network based upon the instructions performed by processor 1110.
  • Network interfaces in accordance with many embodiments of the invention can be used to enable speech commands detected locally by the speech detection system to initiate processes on remote servers.
  • Memory 1116 may include a speech detection application 1118, model parameters 1120 and a vocabulary 1122 which can be used to implement speech detection processes in accordance with various embodiments of the invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

Systems and methods for performing speech detection using depth images in accordance with embodiments of the invention are illustrated. One embodiment of the invention includes: a depth sensor; a memory containing a speech detection application; and at least one processor configured by the speech detection application. The speech detection application configures the at least one processor to: capture a sequence of depth images; identify and crop a region of interest from within each depth image, where the cropped region of interest contains a mouth; detect at least one word by providing the sequence of cropped regions of interest to a machine learning model configured to receive a sequence of cropped regions of interest and output at least one detected word from a predetermined vocabulary; and update a user interface of the speech detection system based upon a command corresponding to the detected at least one spoken word.

Description

Systems and Methods for Performing Speech Detection using Depth Images
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The current application claims priority under 35 U.S.C. 119(e) to U.S. Provisional Patent Application Serial No. 63/555,856, entitled “Systems and Methods for Performing Speech Detection using Depth Images”, filed February 20, 2024. The disclosure of U.S. Provisional Patent Application Serial No. 63/555,856 is hereby incorporated by reference in its entirety.
FIELD OF THE INVENTION
[0002] The disclosure relates generally to the field of automatic speech detection and more specifically to systems and methods for performing speech detection using depth information.
BACKGROUND
[0003] A smart watch is a wearable computer in the form factor of a watch. In addition to telling time, many smart watches provide a touch screen user interface that enables the user to interact with a variety of applications that can execute on the smart watch. Many smart watches also enable users to interact with applications via voice commands that are typically captured using a microphone.
[0004] Human speech is a complex, multi-dimensional process involving intricate movements of lips, tongue, and other facial muscles. Lip movements are an essential component of speech and are crucial in conveying information in people’s daily communication. Lip movements are typically the most apparent visual feature of speech and are controlled by the same articulatory organs that control audible speech, including lips, tongue, teeth, jaw, velum, larynx, and lungs, among which only lips, teeth, jaw, and tongue are visible for lipreading. This limited information poses a significant challenge for understanding and modeling lip movements during speech.
SUMMARY OF DISCLOSURE
[0005] Systems and methods in accordance with various embodiments of the invention use depth sensors to capture depth information, which is then used to detect facial gesture commands, audible speech and/or silent speech. Silent speech is a promising interaction modality for information flow from users to devices, such as smartphones and smart watches. Silent speech provides a useful input modality because of its intuitiveness, efficacy, and ability to preserve the user’s privacy in public settings. Because of these merits, silent speech detection would be a valuable capability on typical user devices, which already possess powerful perceptual capabilities. By enabling a device, such as (but not limited to) a smartwatch, to read silent speech, the device can act as an optical microphone that its wearer can rely on to use speech without leaking privacy-sensitive information in public.
[0006] Lip movements can vary significantly between individuals and even within individuals, depending on the factors such as habits and emotions. Therefore, modeling lip movements for silent speech recognition can require a deep understanding of the nuances of individual speech patterns.
[0007] One embodiment of the invention is capable of performing speech detection, including silent speech detection, using: a depth sensor; a memory containing a speech detection application; and at least one processor configured by the speech detection application. The speech detection application configures the at least one processor to: capture a sequence of depth images; identify and crop a region of interest from within each depth image in the sequence of depth images, where the cropped region of interest contains a mouth; detect at least one word by providing the sequence of cropped regions of interest to a machine learning model configured to receive a sequence of cropped regions of interest and output at least one detected word from a predetermined vocabulary. The machine learning model configured to receive a sequence of cropped regions of interest comprises: a sequence-to-sequence speech recognition model; and an application layer, where the application layer receives inputs from the sequence-to-sequence speech recognition model and outputs at least one detected word; wherein the sequence-to-sequence speech recognition model is a machine learning model trained to receive information from a sequence of depth images and output a sequence of characters; and update a user interface of the speech detection system based upon a command corresponding to the detected at least one spoken word.
[0008] In a further embodiment, the speech detection application further configures the processor to extract foreground information from each depth image in the sequence of depth images.
[0009] In another embodiment, the speech detection application further configures the processor to threshold each depth image in the sequence of depth images to mask background information.
[0010] In a still further embodiment, the machine learning model configured to receive a sequence of cropped regions of interest includes an encoder comprising a 3D convolutional layer, a 3D batch normalization layer, and a 3D max pooling layer.
[0011] In still another embodiment, the captured sequence of depth images includes depth images of silent speech.
[0012] In a yet further embodiment, the captured sequence of depth images includes depth images of audible speech.
[0013] In yet another embodiment, the captured sequence of depth images includes depth images captured of a facial gesture and the at least one detected word is a command word associated with the facial gesture within the predetermined vocabulary.
[0014] A further embodiment again includes: capturing a sequence of depth images using a depth sensor in a speech detection system; cropping each of the sequence of depth images using the speech detection system; extracting temporal-spatial features from the sequence of cropped depth images using the speech detection system; detecting at least one word from the extracted temporal-spatial features; and updating a user interface of the speech detection system in response to the at least one detected word.
[0015] Another embodiment again also includes thresholding depth samples within each of the sequence of depth images using the speech detection system.
[0016] In a further additional embodiment, the speech detection system performs the thresholding prior to cropping each of the sequence of depth images.
[0017] In another additional embodiment, capturing a sequence of depth images using a depth sensor in a speech detection system comprises capturing a sequence of depth images of silent speech.

[0018] In a still yet further embodiment, capturing a sequence of depth images using a depth sensor in a speech detection system comprises capturing a sequence of depth images of audible speech.
[0019] In still yet another embodiment, capturing a sequence of depth images using a depth sensor in a speech detection system includes capturing a sequence of depth images of a facial gesture. In addition, detecting at least one word from the extracted temporal-spatial features includes detecting at least one command associated with the facial gesture.
[0020] In another embodiment, the depth sensor is selected from a group consisting of a structured light camera, a time of flight camera, and a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
[0021] In a further embodiment, the depth sensor is a structured light camera.
[0022] In a still yet further embodiment, the depth sensor is a time of flight camera.
[0023] In still yet another embodiment, the depth sensor is a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
[0024] In another embodiment, the sequences of images received by the sequence- to-sequence speech recognition model form a point cloud video.
[0025] In a further embodiment, once a command word or phrase is detected, the command word or phrase is provided to the system to trigger a response responsive to a user command.
[0026] In still yet a further embodiment, the speech detection system is capable of performing continuous speech detection.
[0027] In still a further embodiment, the sequence-to-sequence speech recognition model is configured to generate outputs that are dependent on an application layer.
[0028] In another further embodiment, the sequence-to-sequence speech recognition model is configured to generate outputs that are sentences.
[0029] In another embodiment, the sequence-to-sequence speech recognition model is configured to generate outputs that are command words.
[0030] In a further embodiment, the cropped region of interest is cropped to exclude a jaw captured in the sequence of depth images.

[0031] In yet another embodiment, the memory further contains an intrinsic matrix for the depth sensor; and the information from the sequence of cropped depth images is obtained by transforming the sequence of cropped depth images into a sequence of point clouds using the intrinsic matrix for the depth sensor.
[0032] In still another embodiment, the point clouds in the sequence of point clouds are down sampled and normalized to capture the salient features of speech patterns.
[0033] In another embodiment, a point cloud transformer is used to encode and preserve the structure of the points within the sequence of point clouds and to adaptively search for related or similar points across entire lip movements using self-attention.
[0034] In yet another embodiment, downsampling is performed using farthest sampling.
[0035] In a further embodiment, the at least one processor is further configured by the speech detection application to discard depths within the captured sequence of depth images that exceed a threshold.
[0036] In still another embodiment, the at least one processor is further configured by the speech detection application to identify and crop a region of interest from within each depth image using an object detection model.
[0037] In yet another embodiment, the at least one processor is further configured by the speech detection application to filter the sequence of depth images using a distance mask.
[0038] In a further embodiment, a speech detection method comprises: capturing a sequence of depth images using a depth sensor in a speech detection system; cropping each of the sequence of depth images using the speech detection system; extracting temporal-spatial features from the sequence of cropped depth images using the speech detection system; detecting at least one word from the extracted temporal-spatial features; and updating a user interface of the speech detection system in response to the at least one detected word.
[0039] In yet another embodiment, a smart watch comprises: a housing; a watch band attached to the housing; a depth sensor mounted within the housing; a display mounted within the housing to form an enclosure; a memory contained within the enclosure, where the memory contains an operating system, a speech detection application and parameters defining a machine learning model; at least one processor contained within the enclosure, where the at least one processor is configured by the speech detection application to: capture a sequence of depth images using the depth sensor; identify and crop a region of interest from within each depth image in the sequence of depth images, where the cropped region of interest contains a mouth; detect at least one word by providing the sequence of cropped regions of interest to the machine learning model configured to receive a sequence of cropped regions of interest and output at least one detected word from a predetermined vocabulary, wherein the machine learning model comprises: a sequence-to-sequence speech recognition model; an application layer, where the application layer receives inputs from the sequence-to-sequence speech recognition model and outputs at least one detected word; wherein the sequence-to- sequence speech recognition model is a machine learning model trained to receive point cloud videos and output a sequence of characters; update a user interface of the speech detection system based upon a command corresponding to the at least one detected word. In still another embodiment, the smart watch’s operating system configures the at least one processor to enable user interactions with various software applications via spoken or silent speech.
BRIEF DESCRIPTION OF THE DRAWINGS
[0040] The description and claims will be more fully understood with reference to the following figures and data graphs, which are presented as exemplary embodiments of the invention and should not be construed as a complete recitation of the scope of the invention.
[0041] Fig. 1 illustrates challenges of using a depth sensor incorporated within a wearable device to perform speech detection.
[0042] Fig. 2 conceptually illustrates a viseme-phoneme mapping.
[0043] Fig. 3A illustrates four facial regions compared with regard to viseme classification performance: 1) 50% of mouth area, 2) mouth region alone, 3) mouth with jaw area, and 4) whole face.
[0044] Fig. 3B illustrates viseme classification results for the four facial regions referenced in Fig. 3A.

[0045] Fig. 4 illustrates a process for detecting and responding to a speech command (either silent or spoken) using depth images in accordance with an embodiment of the invention.
[0046] Fig. 5A illustrates a raw depth map of a face.
[0047] Fig. 5B illustrates a thresholded depth map of the raw depth map referenced in Fig. 5A.
[0048] Fig. 6 illustrates a system for performing continuous silent speech detection, for detecting command words and/or to perform continuous speech detection from sequences of depth images in accordance with an embodiment of the invention.
[0049] Fig. 7 illustrates a table for a command corpus illustrating scenarios and corresponding commands.
[0050] Fig. 8A illustrates a process for training a user-dependent speech detection system utilizing incremental learning to customize a pre-trained model based upon additional training data examples of a specific user in accordance with an embodiment of the invention.
[0051] Fig. 8B illustrates a table detailing parameters of a 3D visual frontend and ResNet modules in accordance with an embodiment of the invention.
[0052] Fig. 9A illustrates a confusion matrix of 27 commands detection with a user-dependent model in accordance with an embodiment of the invention.
[0053] Fig. 9B illustrates a confusion matrix of 10 digits detection with a user-dependent model in accordance with an embodiment of the invention.
[0054] Fig. 10 illustrates the accuracy data of user-dependent and user-independent models in accordance with an embodiment of the invention.
[0055] Fig. 11 illustrates a speech detection system that uses a depth sensor to capture depth images that can be used to perform speech detection in accordance with an embodiment of the invention.
[0056] Fig. 12 illustrates a process for sequence-to-sequence speech recognition in accordance with an embodiment of the invention.
[0057] Fig. 13 illustrates an overview of a sequence-to-sequence speech recognition system in accordance with an embodiment of the invention.

DETAILED DESCRIPTION
[0058] Turning now to the drawings, systems and methods for performing speech detection using depth information in accordance with various embodiments of the invention are illustrated. In several embodiments, speech (including silent speech) can be detected using depth information captured by a depth camera alone. In a number of embodiments, a wearable device such as (but not limited to) a smart watch is capable of performing speech detection using depth information acquired using an integrated depth sensor such as (but not limited to) a structured light camera, a time of flight camera and/or a multiview stereo system incorporating an IR projector and stereo near-IR cameras. The user can interact with the wearable device using facial gestures, spoken commands and/or silent commands that are observed by the depth sensor. Software can configure the wearable device to process the depth information captured by the depth sensor to perform speech detection.
[0059] In a number of embodiments, a silent speech detection system performs speech detection from a sequence of depth images by identifying a region of interest (ROI) and extracting a depth map of the ROI for each image. The depth maps of the ROIs can be provided to a model trained to detect words from a predetermined vocabulary. In many embodiments, the words are command words and/or the words form sequences of words that correspond to command phrases. Once a command word or phrase is detected, the command word and/or phrase can be provided to the system to trigger a response responsive to a user command. In a number of embodiments, the speech detection system is capable of performing continuous speech detection.
[0060] In further embodiments, a silent speech detection system performs speech recognition using a general framework that can be adapted to perform specific tasks. In many embodiments, a sequence-to-sequence speech recognition model is utilized that is capable of directly processing point cloud videos as input data and generating outputs that are dependent on an application layer (e.g., sentences or command words). In several embodiments, a data processing layer may be included as an application layer. In many embodiments, a heuristic layer may be included as an application layer.
[0061] Lip movements are an important component of speech and can be crucial in conveying information in people’s daily communication. The concept of lip reading was first proposed in the 1950s, when researchers suggested that visual cues from the movements of speakers’ mouths could be used to aid in speech recognition. In visual speech, lip movements are the most apparent feature and are controlled by the same articulatory organs that control audible speech, including lips, tongue, teeth, jaw, velum, larynx, and lungs, among which only lips, teeth, jaw, and tongue are visible for lipreading. This limited information can pose a significant challenge for understanding and modeling lip movements during speech.
[0062] Silent speech detection can be useful in a variety of contexts including (but not limited to) applications in which a user wishes to provide a command to a device discreetly (e.g. a smart phone or a smart watch in a crowded environment) or in a noisy environment. Silent speech detection can increase accessibility of devices and/or applications for users with damaged vocal cords and/or the hearing impaired. Previously proposed approaches to visual speech detection have typically used color (RGB) cameras to "read" lip movements in an attempt to achieve silent speech detection or improve voice recognition performance. Systems and methods in accordance with a number of embodiments of the invention use depth sensing as a robust approach to acquire high-fidelity information to reconstruct user speech. Depth sensing is inherently less sensitive to ambient factors such as lighting conditions and background color, which pose challenges to conventional sensors (e.g., RGB cameras). More importantly, depth sensing can yield more equitable interaction systems as it can be less affected by skin tones than conventional RGB imaging approaches.
[0063] While the discussion that follows primarily focuses on the detection of speech using depth information, systems and methods in accordance with various embodiments of the invention are not limited to the use of depth information to perform speech detection and the methods described herein can be utilized in combination with information obtained from other sensors such as (but not limited to) microphones and/or conventional cameras (e.g. RGB cameras, B/W cameras, and/or near-IR cameras) to perform speech detection and/or voice recognition. Furthermore, speech detection systems configured to perform (silent) speech detection described herein can be adapted for use in any language. The specific configuration and/or sensing modalities that are utilized by a speech detection system to perform speech detection in accordance with various embodiments of the invention are largely dependent upon the requirements of specific applications.
[0064] A challenge that can be encountered when performing speech detection based upon depth information is motion of the speaker relative to the depth sensor. A speaker’s head may move during the capture of the speech and/or the speaker’s arm may move during the capture of the speech. For example, a user may provide a speech command while walking. Furthermore, Fig. 1 illustrates the challenges of using a depth sensor incorporated within a wearable device to perform speech detection. In the illustrated example, a smart watch incorporating a depth sensor is simulated using an iPhone incorporating Apple’s TrueDepth depth sensor, which is capable of capturing depth maps. When the depth sensor is wrist mounted, the captured depth maps of a user’s mouth are typically captured from a view below the user’s mouth and looking upward toward the user’s mouth. As is discussed in detail below, speech detection systems in accordance with many embodiments of the invention are capable of performing speech detection based upon depth maps captured from this viewpoint and/or when a user is in motion using techniques including (but not limited to) data augmentation in order to simulate depth maps captured from a variety of viewpoints. Systems and methods for performing speech detection using depth information in accordance with various embodiments of the invention are discussed further below.
Identifying a Region of Interest
[0065] Systems and methods in accordance with many embodiments of the invention capture depth information with respect to a user’s face and utilize a ROI within the captured depth information to perform silent speech detection. Speech production is a complex process involving multiple articulators, such as the lips and jaw, and coordinated muscle movements. These articulators can work together to shape the airflow coming from the lungs, generating different sounds and forming words and sentences. Previous studies on silent speech recognition using RGB images as input have typically focused on the mouth/lips region as the primary area of interest. However, the extraoral regions of faces may also contribute to speech recognition.

[0066] As is discussed further below, the accuracy with which silent speech detection can be performed can be impacted by the specific ROI that is utilized as the input to a silent speech detection process. In many embodiments, an ROI corresponding to at least a portion of the user’s mouth is identified and utilized to perform silent speech detection. In several embodiments, an ROI that includes the entirety of the user’s mouth is identified and utilized to perform silent speech detection. In certain embodiments, an ROI that is sufficiently large so as to include the entirety of the user’s mouth, but sufficiently small so as to exclude the user’s jaw, is utilized to perform silent speech detection. As can readily be appreciated, any of a variety of factors can be utilized in the selection of an ROI within a depth image for the purpose of performing silent speech detection in accordance with various embodiments of the invention and a number of those factors are discussed in detail below.
[0067] One popular approach to capturing the nuanced signals of lip movements is to use visemes, which are the shapes of the mouth at the apex of a given phoneme, a widely utilized token for voice-based speech recognition systems. Visemes are often considered as visual representations of phonemes. However, coarticulation of phonemes/visemes in speech presents a challenge for modeling lip movements accurately. This coarticulation is the overlapping of phonemes/visemes that occurs when people speak, which can cause subtle variations in lip movements that make it difficult to isolate individual visemes. Additionally, the mapping of phonemes to visemes is not a straightforward process as there is no universally accepted convention for defining visemes or mapping phonemes to visemes. The number of phonemes is much larger than the number of visemes, making it difficult to establish a one-to-one correspondence between phonemes and visemes.
[0068] Viseme detection accuracy can be employed as an evaluation metric to evaluate different regions of interest within captured depth information for performing silent speech recognition. Specifically, information degradation can be observed as the ROI includes an increasing area of the surroundings around users’ lips. A data-driven viseme-phoneme conversion can be utilized to evaluate different regions of interest. In several embodiments, a number of visemes are defined (e.g. 14 visemes), among which one is designated for silence. A total of 39 phonemes have been identified to represent the diverse range of sounds in American English in the Carnegie Mellon University Pronouncing Dictionary, which is widely used to transcribe the sound of American English and has practical applications in speech recognition. The selected visemes can be mapped to phonemes for the purposes of evaluating potential speech detection performance.
[0069] A mapping of the phonemes from the Carnegie Mellon University Pronouncing Dictionary to a set of 14 visemes and the corresponding viseme distribution is shown in Fig. 2. To evaluate the reliability of different regions of interest, a study was performed recruiting 12 native English speakers, where each user was asked to read a sentence list once in a typical conversational style, allowing collection of data on different visemes in a more natural setting. The depth frames as well as the speaking audio were collected using an iPhone (12 mini) TrueDepth camera and microphone, respectively. The iPhone was placed on a tabletop in front of seated participants in a quiet lab environment. In order to accurately label the depth frames with specific visemes, the CMU dictionary was utilized to translate each sentence into a sequence of phonemes. Subsequently, a transformer-based phonetic aligner was employed to perform forced alignment of the phonemes with the recorded audio signal of the sentence and locate the temporal boundaries of each phoneme in this sequence. This enabled extraction and labelling of the depth frames that corresponded to the duration of each phoneme. The viseme-phoneme mapping conceptually illustrated in Fig. 2 was then utilized to group the phonemes into their corresponding visemes. By following this procedure, each depth frame was accurately labelled with its corresponding viseme in order to prepare the data for further analysis, namely viseme detection and ROI selection.
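For purposes of illustration only, the following Python sketch shows one way that per-frame viseme labels could be derived from forced-alignment output of the kind described above. The function name, the frame rate, the illustrative VISEME_MAP entries, and the assumption that alignments arrive as (phoneme, start, end) tuples are hypothetical and do not represent the specific tooling used in the study.

```python
# Sketch: label depth frames with visemes from forced-alignment output.
# Assumes the aligner has already produced (phoneme, start_s, end_s) tuples.

VISEME_MAP = {"AA": "V1", "AE": "V1", "B": "V5", "M": "V5", "P": "V5"}  # illustrative subset

def label_frames(alignments, num_frames, fps=30.0, silence="SIL"):
    """Return one viseme label per depth frame."""
    labels = [silence] * num_frames
    for phoneme, start_s, end_s in alignments:
        viseme = VISEME_MAP.get(phoneme, silence)
        first = int(start_s * fps)
        last = min(int(end_s * fps), num_frames - 1)
        for i in range(first, last + 1):
            labels[i] = viseme
    return labels
```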
[0070] In order to evaluate the potential information content of depth information obtained with respect to different regions of interest of a user’s face, viseme classification performance from different facial regions can be compared. Referring to Fig. 3A, four facial regions are compared: 1) 50% of mouth area, 2) mouth region alone, 3) mouth with jaw area, and 4) whole face.
[0071] To convert depth data into a more descriptive format, cropped depth frames can be transformed into point clouds using the intrinsic matrix of the depth sensor. This can enable a precise representation of the 3D shape of the user’s facial landmarks, as the point cloud data contains information about the spatial location of each point on the face, which can facilitate subsequent viseme classification. For the purpose of classification, a point cloud classification model, such as (but not limited to) PointNet, can be utilized to perform the viseme detection.
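As a non-limiting sketch of the depth-to-point-cloud conversion described above, the following Python function back-projects a depth map through a pinhole camera model using the depth sensor's intrinsic parameters. The function name and the assumption that depth values are supplied in meters are illustrative.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an HxW depth map (meters) into an Nx3 point cloud
    using the pinhole intrinsics of the depth sensor."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    valid = z > 0                      # drop pixels with no depth return
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)
```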
[0072] Fig. 3B shows viseme classification results for the four facial regions referenced above with reference to Fig. 3A. As can be seen, the mouth region alone provides the most informative features during speech with the highest accuracy of 92.33% in viseme classification.
[0073] While specific regions of interest are identified above, it should be appreciated that the specific ROI that is utilized in a given application may vary. Furthermore, systems and methods in accordance with a number of embodiments of the invention can alter the ROI utilized during silent speech detection based upon the performance of a specific silent speech detection system and/or with respect to a particular user. The manner in which depth information from a selected ROI of interest can be utilized to perform speech detection in accordance with various embodiments of the invention is discussed further below.
Speech Detection from Depth Images
[0074] User devices in accordance with various embodiments of the invention, such as (but not limited to) smart watches, can incorporate operating systems that enable user interactions with various software applications via spoken and/or silent speech. In several embodiments, speech is detected using depth images either alone or in combination with additional sensing modalities.
[0075] A process for detecting and responding to a speech command (either silent or spoken) using depth images in accordance with an embodiment of the invention is illustrated in Fig. 4. The process 400 includes capturing (402) a sequence of depth images. In many embodiments, foreground information in the depth images can be isolated. In several embodiments, the speaker is assumed to be within a threshold distance of the depth sensor and all depth samples with distances beyond the threshold distance are discarded. As can readily be appreciated, any of a variety of different techniques can be utilized to isolate a speaker from depth images as appropriate to the requirements of specific applications.

[0076] The process 400 can also include cropping (404) the depth images to a specific ROI. As noted above, different regions of a speaker’s face have different informational value with respect to silent speech detection. Accordingly, cropping the depth images to a specific ROI can improve the ability of the process to accurately detect speech.
[0077] In some embodiments, the cropped depth images can then be converted (406) to point clouds. In many embodiments, the point clouds can be downsampled (408) to reduce the amount of data and/or number of computations that must be performed in subsequent processing steps. In other embodiments, speech detection is performed directly from cropped depth maps.
[0078] The sequence of cropped depth information (e.g. point clouds or depth maps) can then be used to extract (410) a set of temporal-spatial features that can be utilized to detect (412) words from a predetermined vocabulary of words. As is discussed in detail below, the extraction of spatial-temporal features and the classification of the spatial- temporal features to detect (412) words can involve the utilization of machine learning models that are specifically trained for this purpose.
[0079] In embodiments in which the words are command words, the operating system of the device can update (414) the user interface of the device in response to the detected command word(s).
[0080] While various processes are described above for performing speech detection from depth images and/or updating the user interface of a device including a depth sensor based upon spoken and/or silent speech, any of a variety of processes can be utilized to detect speech according to the requirements of specific applications in accordance with various embodiments of the invention. Specific factors that can be considered in designing such a process and various processes tailored to specific applications such as (but not limited to) command word detection, continuous speech detection, user-independent speech detection, and user-dependent speech detection are discussed further below.
Region of Interest Segmentation
[0081] In many embodiments, a machine learning model is utilized to perform speech detection based upon depth images that can learn how to detect the pose of a user’s lips based upon depth data. As can readily be appreciated, systems and methods in accordance with embodiments of the invention are not limited to the use of a specific machine learning model and the specific machine learning model or ensemble of machine learning models that are used to detect the pose of informative regions of a user’s face during speech detection are typically determined based upon the requirements of specific applications. A number of relevant machine learning models that can be utilized to perform speech detection using depth information in various embodiments of the invention are discussed in detail below.
[0082] In a number of embodiments, a user’s lips can be identified within depth data by utilizing transfer learning to fine-tune a pre-trained object detection model to enable the use of the model to detect lips in a depth image. In several embodiments, an object detection model is utilized such as (but not limited to) the YOLOv7 model described in Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. ”YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7464- 7475, 2023, the disclosure of which including the disclosure related to the implementation of the model YOLOv7 model and its training is incorporated herein by reference in its entirety. In order to fit depth data to the YOLOv7 model, depth information can be converted into images. However, common forms of depth information such as (but not limited to) depth maps, which are often represented in a 16-bit floating-point format, have a high dynamic range. Directly mapping them to 8-bit integers can cause substantial precision loss. In several embodiments, the depth map can be filtered using a distance mask for background subtraction. In certain embodiments, an appropriate upper threshold for the distance mask can be determined based upon the distance from which the user is likely to be observed in a particular application.
[0083] In one instance, an appropriate distance was determined by applying a face detector to an RGB image registered with the captured depth images. A face detector such as (but not limited to) the MediaPipe face detector can be used to detect faces from the RGB image and calculate the distance to the detected face using the depth information from the depth image. In embodiments that utilize a smartwatch to capture depth information, the distance between the user’s face and the camera can vary significantly. During a series of experiments, the distance to the user’s face was typically less than 0.48 m with a mean value of 0.34 m (SD=0.03). Therefore, a distance threshold of 0.5 m can be used to extract the foreground in smart watch applications, which contains the user’s face, before normalizing the image between 0 and 255. As noted above, the specific depth threshold that is utilized in a given application will depend upon the requirements of that application.
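A minimal sketch of this foreground extraction and 8-bit conversion, assuming depth values in meters and a 0.5 m threshold, is shown below; the exact normalization used in a given implementation may differ.

```python
import numpy as np

def depth_to_8bit(depth_m, max_distance_m=0.5):
    """Mask background beyond max_distance_m and rescale the remaining
    depths to the 0-255 range expected by an image-based lip detector."""
    fg = np.where((depth_m > 0) & (depth_m < max_distance_m), depth_m, 0.0)
    if fg.max() <= 0:
        return np.zeros_like(fg, dtype=np.uint8)
    return (fg / fg.max() * 255.0).astype(np.uint8)
```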
[0084] Figs. 5A and 5B show thresholding of a depth map and the significantly improved resolution of the converted depth image that results, making it easier to detect the lips. As can readily be appreciated, the specific processes that can be utilized for normalization of depth images and/or thresholding depth information depend upon the requirements of specific applications.
[0085] In several embodiments, a pre-trained YOLOv7 model can be trained by transferring the knowledge from an RGB-based face detector. Specifically, ground truth information for the depth images can be obtained using lip bounding boxes predicted from registered RGB images using a face detector such as, but not limited to, MediaPipe. In this way a model can be trained to determine a rectangular ROI with a width of 1.2 x average_lip_width and a height of 1.5 x average_lip_height measured across each utterance given the results from the validation study. As noted above, the specific dimensions utilized for the ROI are largely dependent upon the requirements of a given application.
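For illustration, the ROI sizing described above could be computed from per-frame lip bounding boxes as in the following sketch; the box format (center x, center y, width, height), the function name, and the scale factors passed as defaults are assumptions mirroring the values given above.

```python
def roi_from_lip_boxes(lip_boxes, width_scale=1.2, height_scale=1.5):
    """Derive one fixed crop size per utterance from per-frame lip boxes,
    where each box is (cx, cy, w, h) in pixels."""
    avg_w = sum(b[2] for b in lip_boxes) / len(lip_boxes)
    avg_h = sum(b[3] for b in lip_boxes) / len(lip_boxes)
    return width_scale * avg_w, height_scale * avg_h
```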
[0086] In several embodiments, the scale of the ROI is a fixed value for each utterance, which can be averaged among all frames. In a number of embodiments, a linear interpolation method can be used to estimate the ROI in frames where the lip detection model failed to find lips. Processes for utilizing lips detected within ROIs to perform both command word and continuous speech detection from depth images in accordance with various embodiments of the invention are discussed further below.
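One simple way to realize the linear interpolation described above is sketched below; it assumes at least one frame in the utterance has a successful lip detection and that missed detections are represented as None.

```python
import numpy as np

def interpolate_roi_centers(centers):
    """Fill frames where lip detection failed (entries of None) by linearly
    interpolating the ROI center between the nearest detected frames."""
    idx = np.arange(len(centers))
    known = [i for i, c in enumerate(centers) if c is not None]
    xs = np.interp(idx, known, [centers[i][0] for i in known])
    ys = np.interp(idx, known, [centers[i][1] for i in known])
    return list(zip(xs, ys))
```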
Command Word Recognition
[0087] In a number of embodiments, speech detection is performed to detect specific command words or phrases. In the interests of brevity, references to detection of command words herein should be understood as encompassing both the detection of command words and the detection of command phrases.
[0088] In several embodiments, command classification is performed using a machine learning model that is configured to receive cropped depth images of lips as inputs. In a number of embodiments, lip movements are highlighted by converting the cropped depth image into point clouds using an intrinsic matrix to map each pixel to a point in 3D space. In many embodiments, the generated point cloud can be down sampled using farthest sampling and normalization to reduce the computational load and to capture the salient features of speech patterns. To extract temporal-spatial features from the point clouds, a point cloud transformer can be used to encode and preserve the structure of the points and adaptively search for related or similar points across the entire lip movements using self-attention. In a number of embodiments, a point cloud transformer such as (but not limited to) the Point Spatio-Temporal Transformer can be utilized. As can readily be appreciated, the specific point cloud transformer that is utilized is largely dependent upon the requirements of a specific application.
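The farthest sampling referenced above can be implemented greedily, as in the following illustrative sketch operating on an Nx3 point array; the starting index and the target point count are arbitrary choices rather than required values.

```python
import numpy as np

def farthest_point_sample(points, k):
    """Greedy farthest-point sampling of k points from an Nx3 array."""
    n = points.shape[0]
    if n <= k:
        return points
    selected = np.zeros(k, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = 0                      # start from an arbitrary point
    for i in range(1, k):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(dist))
    return points[selected]
```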
[0089] Identified point cloud features observed during speech corresponding to a single command can be fed into a machine learning model, which can serve as the final classifier, to convert the point cloud features into class predictions. In many embodiments, an end-to-end command classification system is utilized that is designed to accurately classify spoken commands using only the depth images of the user’s lips, making it a useful tool for a wide range of applications.
Corpus Selection
[0090] A variety of silent speech detection applications involve the use of a vocabulary or corpus of command words. In several embodiments, speech detection systems use a vocabulary that includes a corpus of 10 digits and 27 commands determined to be common in many users’ daily lives. The ten-digit corpus includes digits from zero to nine, while the 27-command corpus is based on the commands listed in Table 1 shown in Fig. 7. These two distinct corpora provide a dataset that can be utilized to accurately perform silent speech recognition of command words. By incorporating a range of commonly used digits and commands, the dataset provides a realistic representation of the types of speech input that users are likely to encounter in everyday interactions with voice assistant systems. The aim of this set of commands is to provide context for situations in which individuals engage with a voice assistant to perform tasks such as operating a smartphone and/or controlling smart home devices. As can readily be appreciated, the specific vocabulary that is supported by silent speech detection systems in accordance with various embodiments of the invention is largely dependent upon the requirements of specific applications.
[0091] In several embodiments, speech detection systems can also perform facial gesture detection. A facial gesture can be any distinctive pose of a user’s face or a portion of a user’s face. In several embodiments, speech detection systems are trained using processes similar to those described above to detect facial poses including (but not limited to) poses involving a user’s tongue extended out of the mouth, stretched to the left, right, up or down. As can readily be appreciated, speech detection systems in accordance with various embodiments of the invention can be trained to detect any of a variety of facial gestures appropriate to the requirements of specific applications. Furthermore, facial gesture detection systems can be implemented in a similar manner to the speech detection systems described herein but trained to detect facial gestures (but not speech) from depth images.
Sentence Recognition
[0092] In addition to detecting command words, systems and methods for performing speech detection in accordance with various embodiments of the invention can be utilized to decode continuous speech into word sequences directly, which can allow the user to input sentences of any combinations of words in a predetermined vocabulary.
[0093] A system for performing continuous silent speech detection, for detecting command words and/or to perform continuous speech detection from sequences of depth images in accordance with an embodiment of the invention is illustrated in Fig. 6. A sequence of depth images is captured and normalized to isolate foreground depth measurements. A lip detection model can be used to identify a mouth ROI within the normalized images and the depth images cropped to the identified ROIs.

[0094] In several embodiments, a sequence of cropped ROI depth maps 602 is provided to a pre-trained model 604, such as (but not limited to) an AV-HuBERT model as described in Shi, Bowen, Abdelrahman Mohamed, and Wei-Ning Hsu. "Learning lip-based audio-visual speaker embeddings with av-hubert." arXiv preprint arXiv: 2205.07180 (2022), the disclosure of which including the disclosure related to the AV-HuBERT model and its training is incorporated herein by reference in its entirety. In several embodiments, the model is trained using appropriate training data such as (but not limited to) the LRS3 and the Voxforge datasets in an auto-regression manner to learn visual speech representations.
[0095] In the illustrated embodiment, the model uses a Convolutional Neural Network (CNN) to encode 606 the input image sequence into high-dimensional vectors, and uses self-attention layers to capture contextual information. In a number of embodiments, a self-attention decoder 608 model is used that is trained to continuously output token sequences 610 from a known vocabulary with a CTC Loss (Connectionist Temporal Classification Loss). The token sequences can then be mapped against a vocabulary of command words to provide user inputs to software executing on the device. The token sequences can also be utilized to perform continuous speech detection.
[0096] In several embodiments, transfer learning is utilized to train a model that can perform speech detection of continuous speech based upon depth images by leveraging a pre-trained RGB-based visual speech recognition model, such as (but not limited to) an AV-HuBERT model. In many embodiments, transfer learning is not required and the model is directly trained using a training data corpus of annotated depth image sequences.
[0097] In many embodiments, data augmentation techniques are used to extend the training dataset used to train a model employed by a silent speech detection system to make the resulting system more robust to variations in pose, such as (but not limited to) the relative position between a user’s face and a wrist worn device that incorporates the depth sensor. In several embodiments, the training data set can be augmented by applying 3-D transformations to a point cloud 622 estimated from a depth map 620 captured by a depth sensor. Additional depth maps can then be obtained by projecting the point cloud 622 back to a depth map 624 from a new viewpoint. The transformations are namely translation, scale, and rotation with random factors (e.g. up to 0.02, 0.05, and 0.2, respectively). Relatively smaller maximum values for translation and scale transformations can be utilized to avoid noticeable artifacts due to factors including (but not limited to) occlusions that occur as a result of the new viewpoint. To ensure consistency across frames within the same utterance, the same transformation parameters can be applied for each utterance. These augmentations can be adopted to improve the generalizability of the trained model by preventing it from over-fitting on the training set. In addition, the data augmentation can make the model more robust to changes in the relative orientation of and/or distance of the depth camera relative to the user.
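As a hedged illustration of this augmentation, the following sketch applies a small random rotation, scale, and translation to the point cloud recovered from a depth map and re-projects it through the same intrinsics. It reuses the depth_to_point_cloud sketch above, uses simple nearest-pixel splatting without occlusion handling, and its parameter ranges mirror the example factors given above; in practice the same random parameters would be reused for every frame of an utterance.

```python
import numpy as np

def augment_depth(depth, fx, fy, cx, cy, max_t=0.02, max_s=0.05, max_r=0.2):
    """Augment a depth map by applying a small random rigid/scale transform
    to its point cloud and re-projecting from the new viewpoint."""
    pts = depth_to_point_cloud(depth, fx, fy, cx, cy)      # sketch defined earlier
    angle = np.random.uniform(-max_r, max_r)               # rotation about vertical axis
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    scale = 1.0 + np.random.uniform(-max_s, max_s)
    shift = np.random.uniform(-max_t, max_t, size=3)
    pts = scale * (pts @ rot.T) + shift
    # project back to a depth map from the new viewpoint
    out = np.zeros_like(depth)
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    ok = (pts[:, 2] > 0) & (u >= 0) & (u < depth.shape[1]) & (v >= 0) & (v < depth.shape[0])
    out[v[ok], u[ok]] = pts[ok, 2]
    return out
```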
User-dependent Speech Detection
[0098] In many embodiments, speech detection models are trained in a manner that makes them user-dependent. For example, the speech detection system can identify a user and select one or more machine learning models that have been specifically trained using training data including depth image sequences of that specific user. A process for training a user-dependent speech detection system in accordance with an embodiment of the invention is conceptually illustrated in Figs. 8A and 8B. The process utilizes incremental learning to customize a pre-trained model based upon additional training data examples of a specific user. In this way, a very small number of incremental instances of training data can be utilized to significantly improve the performance of a speech detection model configured to detect words from input depth images.
[0099] In several embodiments, the model takes an 88x88 depth image as input. The visual feature is encoded using a 3D convolutional layer, a 3D batch normalization layer, and a 3D max pooling layer. Then, the tensor is flattened into a single-dimensional vector through a 2D average pooling layer and fed into a modified ResNet-18. Each Residual Block repeats a series of operations of 2D Conv, BatchNorm, and ReLU twice. The parameters of the 3D visual frontend and the ResNet modules are detailed in the table shown in Fig. 8B. Next, a transformer module can be used for sequence modeling to decode the images into words. In the embodiment illustrated in Fig. 8A there are, in total, 24 transformer blocks with embedding dimensions of 1024/4096/16. A dropout of p = 0.1 can be used after the self-attention block within each transformer layer, and each transformer layer is dropped at a rate of 0.1. Furthermore, the model can be trained with a connectionist temporal classification (CTC) loss, where the features are projected to the probabilities of each token in the vocabulary.
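A simplified PyTorch sketch of such a 3D visual frontend is shown below. The channel counts, kernel sizes, and the stand-in pooling used in place of the full modified ResNet-18 trunk are illustrative assumptions, not the trained parameters detailed in Fig. 8B.

```python
import torch
import torch.nn as nn

class VisualFrontend(nn.Module):
    """Sketch of a 3D visual frontend: a 3D conv/BN/max-pool stem over the
    (time, height, width) volume, followed by a per-frame 2D trunk."""

    def __init__(self, out_dim=512):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # a modified ResNet-18 trunk applied frame by frame would follow here;
        # a global average pool stands in for it in this sketch
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(64, out_dim)

    def forward(self, x):                       # x: (batch, 1, frames, 88, 88)
        x = self.stem(x)                        # (batch, 64, frames, H', W')
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.pool(x).flatten(1)             # (batch*frames, 64)
        x = self.proj(x)                        # per-frame feature vectors
        return x.view(b, t, -1)
```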
[00100] While specific models are described above with reference to Figs. 8A and 8B, any of a variety of machine learning models using any combination of layers as appropriate to the requirements of specific applications can be utilized in accordance with various embodiments of the invention.
[00101] The confusion matrices shown in Figs. 9A and 9B and the accuracy data shown in Fig. 10 for both user-dependent and user-independent models demonstrate that the user-dependent model performs significantly better than the user-independent model. Specifically, the recognition accuracy for commands and digits using the user-dependent model was 82.24%, with digit detection accuracy reaching 85.74%. In contrast, the user-independent model was only about half as accurate. These results indicate that even a small amount of data collected from previously unseen users can significantly enhance the performance of silent speech detection systems and methods in accordance with various embodiments of the invention. Moreover, the confusion matrix indicates that certain commands are more likely to be confused than others, as they have similar character sequences, pronunciation, and lip movements. For example, "Test" and "Next" as well as "Turn on" and "Turn off" were frequently confused. Additionally, "Stop" and "Pause" were often mistaken for each other, likely due to the prolonged period required for the lip shapes to form a round shape. Accordingly, selection of the command words included within a particular vocabulary can further enhance performance of silent speech detection systems in accordance with various embodiments of the invention.
[00102] While specific depth image data processing pipelines, machine learning models, training techniques, training data sets, and/or approaches to data augmentation are described above with reference to Figs. 4 - 10, any of a variety of depth image data processing pipelines incorporating any of a variety of different machine learning models can be utilized to perform speech detection, including (but not limited to) the silent speech detection of command words and/or continuous speech, as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
Sequence-to-Sequence Speech Recognition
[00103] As described above, systems and methods in accordance with various embodiments of the invention can use different frameworks for specific tasks. In a number of embodiments, silent speech recognition can be performed using a general framework that can be adapted to perform specific tasks. In several embodiments, the specific tasks are implemented through application layers. In many embodiments, a sequence-to- sequence speech recognition model is utilized that is capable of directly processing point cloud videos as input data and generating outputs that are dependent on the application layer (e g. sentences or command words). Fig. 13 provides an overview of an embodiment of a Sequence-to-Sequence Speech Recognition system.
[00104] A process for sequence-to-sequence speech recognition in accordance with an embodiment of the invention is illustrated in Fig. 12. The process 1200 includes converting (1202) a depth image to a point cloud. In many embodiments, depth data is converted into point clouds. In numerous embodiments, a data processing application layer converts depth data into point clouds. In several embodiments, a transformation is employed that maps each pixel in a depth image to its corresponding 3D spatial coordinates. In further embodiments, the transformation includes the depth value of the pixel, which is defined as the distance between the pixel and the camera’s optical origin. In further embodiments, the transformation relies on the camera’s intrinsic parameters to convert the pixel from the depth data into the point in 3D point clouds. In certain embodiments, the 3D structure of a point cloud can be used to identify per-point features. In many embodiments, normal vectors are calculated and can serve as features to represent local geometric properties and orientations at an individual point on a 3D surface. As can readily be appreciated, any of a variety of different techniques can be utilized to convert a depth image to a point cloud as appropriate to the requirements of specific applications.
[00105] The process 1200 can also include standardizing point clouds (1204). In several embodiments, the sequence-to-sequence speech recognition model forms a general framework for performing standardization of point clouds. In several embodiments, the general framework performs standardization of point clouds through a data processing application layer. In many embodiments, the number of points in each frame is randomly sampled. In several embodiments, a number of points in each frame is randomly sampled (e.g. 1024). In further embodiments, the lips’ point cloud of each frame is normalized within a unit ball. In many embodiments, the centroid of the point clouds of each frame is calculated. In further embodiments, the point clouds are relocated so that the centroid is positioned at the origin of the coordinate system. In several embodiments, an affine transformation is utilized to account for varying face orientation with respect to devices. In many embodiments, a transformation network (TNet) is used to predict the affine transformation matrix. In several embodiments, the transformation network is inspired by the model PointNet. In many embodiments, each frame of point clouds is fed into TNet and rotated independently using the affine transformation matrix output by TNet. As is discussed below, the TNet model can be trained together with a model used to perform spatio-temporal feature extraction (1206) and a model that can be used to perform sentence decoding (1208). While specific processes for normalizing point clouds are described above, any of a variety of different techniques can be utilized to standardize point clouds as appropriate to the requirements of specific applications in accordance with various embodiments of the invention.
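The per-frame standardization described above (random sampling to a fixed point count, centering at the centroid, and scaling into a unit ball) could be sketched as follows. The 1024-point default matches the example above, while the sampling strategy for frames with fewer points and the function name are assumptions; the learned TNet alignment is not reproduced here.

```python
import numpy as np

def standardize_frame(points, num_points=1024):
    """Randomly sample a fixed number of points, move the centroid to the
    origin, and scale the frame to fit inside a unit ball."""
    idx = np.random.choice(points.shape[0], num_points,
                           replace=points.shape[0] < num_points)
    sampled = points[idx]
    centered = sampled - sampled.mean(axis=0)
    radius = np.linalg.norm(centered, axis=1).max()
    return centered / radius if radius > 0 else centered
```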
[00106] The process 1200 can also include extracting spatio-temporal features (1206). In several embodiments, the general framework utilizes a data processing application layer to perform the extraction of spatio-temporal features. In many embodiments, temporal information may be included in point clouds by introducing another dimension t. In several embodiments, a time series of point cloud frames are fed into a point 4D convolutional layer for feature extraction. In further embodiments, the point cloud 4D convolution is optimized by downsampling each frame in the point cloud videos with a spatial subsampling range. In further embodiments, the spatial subsampling range uses the Farthest Point Sampling (FPS) method. In several embodiments, anchor points are identified in each frame. In further embodiments, 64 anchor points are identified in each frame. As can readily be appreciated, the specific number of anchor points is largely dependent on the requirements of a particular application. In many embodiments, anchor points are used as centroids to define a spatio-temporal local region. In several embodiments, the spatio-temporal local region defines the searching area within the current frame. In further embodiments, a point 4D convolutional layer decodes local regions into a feature vector. In many embodiments, the point 4D convolutional layer decodes local regions into a feature vector using an equation and a multilayer perceptron. In further embodiments still, the feature vectors are appended to their anchor points and passed into a max pooling layer as features for the next step. In many embodiments, a video-level spatio-temporal transformer may be employed to search and merge feature vectors extracted from the Point 4D convolution across the whole point cloud videos. In further embodiments, a global feature representation of the utterance video is generated by appending a max pooling layer right after the transformer, effectively combining the localized features into global features. In many embodiments, the global features are fed into two bi-directional Gated Recurrent Unit (Bi-GRU) layers. In several embodiments, the output from the Bi-GRU layer undergoes processing performed by a softmax layer to produce probabilities for each token. In further embodiments, the tokens are English characters. While specific processes and machine learning models are described above for performing spatio-temporal feature extraction, it should be readily appreciated that any of a variety of different techniques can be utilized to extract spatio-temporal features as appropriate to the requirements of specific applications and that the processes disclosed herein are not limited to recognition of spoken English words. In several embodiments, the time series of point cloud frames are directly decoded into English characters. In many embodiments, an application layer is implemented that performs the decoding of point cloud frames into English characters. In several further embodiments, the English characters include the space character. In many embodiments, a mapping is established between depth data of speech and textual information including sentences and command words. In further embodiments, a heuristic layer with simple rules that consolidate repeated adjacent characters is deployed. In several embodiments, the Connectionist Temporal Classification (CTC) Loss is employed to eliminate the need for precise viseme alignments and to simplify the training process.
In many embodiments, the global feature vector is passed through a softmax layer, and the output sequence is generated by selecting the character sequence with the highest probability. As can readily be appreciated, any of a variety of different techniques can be utilized to decode sentences from point cloud frames as appropriate to the requirements of specific applications.
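A minimal sketch of greedy CTC decoding with the repeated-character consolidation described above is shown below. It assumes the blank token occupies index 0 of the character set and that per-frame probabilities (or log-probabilities) are supplied as a time-by-token array; beam search and language-model rescoring are omitted.

```python
import numpy as np

def ctc_greedy_decode(log_probs, charset, blank=0):
    """Greedy CTC decoding: take the most probable token per frame,
    collapse repeated adjacent tokens, then drop blanks."""
    best = np.argmax(log_probs, axis=-1)        # (time,) token indices
    decoded, prev = [], blank
    for tok in best:
        if tok != prev and tok != blank:
            decoded.append(charset[tok])        # charset[0] is an unused blank placeholder
        prev = tok
    return "".join(decoded)
```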
[00107] The process 1200 can also include recognizing commands (1210). In many embodiments, a heuristic layer is implemented for command recognition, mapping character sequences to command words within a predefined command set. In further embodiments, once predicted character sequences are obtained from the pipeline, the predicted character sequences are compared with the commands that exist in the command set based on the sequence matching algorithm ‘gestalt pattern matching.’ In further still embodiments, gestalt pattern matching is recursively applied to the segments of the sequences and yields the best-matched command as the final output. As can readily be appreciated, any of a variety of different techniques can be utilized to recognize commands from point cloud frames as appropriate to the requirements of specific applications.
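For illustration, a decoded character sequence could be mapped to the closest command using Python's difflib, whose SequenceMatcher implements Ratcliff/Obershelp gestalt pattern matching; the recursive segment-level matching described above is not reproduced in this simplified sketch, and the function name is an assumption.

```python
import difflib

def match_command(predicted, command_set):
    """Map a decoded character sequence to the closest command in the set."""
    best, best_score = None, 0.0
    for command in command_set:
        score = difflib.SequenceMatcher(None, predicted.lower(), command.lower()).ratio()
        if score > best_score:
            best, best_score = command, score
    return best
```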
Speech Detection Systems
[00108] A speech detection system that uses a depth sensor to capture depth images that can be used to perform speech detection in accordance with an embodiment of the invention is illustrated in Fig. 11. In many embodiments, the speech detection system 1100 includes at least one processor 1110, a depth sensor 1112, a network interface 1114, and memory 1116. One skilled in the art will recognize that a speech detection system may exclude certain components and/or include other components that are omitted for brevity without departing from this invention. Indeed, speech detection systems in accordance with various embodiments of the invention can be implemented on almost any computing device that includes a depth sensor.
[00109] The at least one processor 1110 can include (but is not limited to) a processor, microprocessor, controller, or a combination of processors, microprocessors, and/or controllers that performs instructions stored in the memory 1116 to manipulate data stored in the memory. Processor instructions can configure the at least one processor 1110 to perform processes in accordance with certain embodiments of the invention. In various embodiments, processor instructions can be stored on a non-transitory machine readable medium.

[00110] In several embodiments, the at least one processor 1110 is configured to communicate with a depth sensor 1112, such as (but not limited to) a structured light camera, a time of flight camera and/or a multiview stereo system incorporating an IR projector and stereo near-IR cameras. In many embodiments, the at least one processor 1110 is also configured to communicate with sensors capable of capturing sensor information using any of a variety of sensing modalities appropriate to the requirements of specific applications including (but not limited to) one or more microphones and/or one or more color cameras.
[00111] In a variety of embodiments, the network interface 1114 can be used to gather inputs and/or provide outputs. The speech detection system 1100 can utilize the network interface 1114 to transmit and receive data over a network based upon the instructions performed by processor 1110. Network interfaces in accordance with many embodiments of the invention can be used to enable speech commands detected locally by the speech detection system to initiate processes on remote servers.
[00112] Memory 1116 may include a speech detection application 1118, model parameters 1120 and a vocabulary 1122 which can be used to implement speech detection processes in accordance with various embodiments of the invention.
[00113] Although specific examples of speech detection systems are described above with reference to Fig. 11 , any of a variety of speech detection systems can be utilized to perform processes for detecting speech from sequences of depth images as appropriate to the requirements of specific applications in accordance with embodiments of the invention.
[00114] Although specific speech detection systems and methods are discussed above, many different systems and methods for performing silent speech detection using depth images can be implemented in accordance with many different embodiments of the invention. It is therefore to be understood that the present invention may be practiced in ways other than specifically described, without departing from the scope and spirit of the present invention. Thus, embodiments of the present invention should be considered in all respects as illustrative and not restrictive. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.
[00115] Additional disclosure can be found in the manuscript filed herewith, which is incorporated by reference in its entirety. The references to additional published works made in the footnotes are incorporated by reference in their entireties. Although the description above contains many specificities, these should not be construed as limiting the scope of the invention but as merely providing illustrations of some of the presently preferred embodiments of the invention. Various other embodiments are possible within its scope. Accordingly, the scope of the invention should be determined not by the embodiments illustrated, but by the appended claims and their equivalents.

Claims

WHAT IS CLAIMED IS:
1. A speech detection system, comprising:
   a depth sensor;
   a memory containing a speech detection application; and
   at least one processor configured by the speech detection application to:
      capture a sequence of depth images;
      identify and crop a region of interest from within each depth image in the sequence of depth images, where the cropped region of interest contains a mouth;
      detect at least one word by providing the sequence of cropped regions of interest to a machine learning model configured to receive a sequence of cropped regions of interest and output at least one detected word from a predetermined vocabulary, wherein the machine learning model comprises:
         a sequence-to-sequence speech recognition model; and
         an application layer, where the application layer receives inputs from the sequence-to-sequence speech recognition model and outputs at least one detected word;
      wherein the sequence-to-sequence speech recognition model is a machine learning model trained to receive information from a sequence of depth images and output a sequence of characters; and
      update a user interface of the speech detection system based upon a command corresponding to the at least one detected word.
2. The speech detection system of claim 1, wherein the speech detection application further configures the processor to extract foreground information from each depth image in the sequence of depth images.
3. The speech detection system of any of claims 1 and 2, wherein the speech detection application further configures the processor to threshold each depth image in the sequence of depth images to mask background information.
4. The speech detection system of any of the above claims, wherein the captured sequence of depth images includes depth images of silent speech.
5. The speech detection system of any of the above claims, wherein the captured sequence of depth images includes depth images of audible speech.
6. The speech detection system of any of the above claims, wherein the captured sequence of depth images includes depth images captured of a facial gesture and the at least one detected word is a command word associated with the facial gesture within the predetermined vocabulary.
7. The speech detection system of any of the above claims, wherein the depth sensor is selected from the group consisting of a structured light camera, a time of flight camera, and a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
8. The speech detection system of any of the above claims, wherein the depth sensor is a structured light camera.
9. The speech detection system of any of the above claims, wherein the depth sensor is a time of flight camera.
10. The speech detection system of any of the above claims, wherein the depth sensor is a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
11. The speech detection system of any of the above claims, wherein the sequences of images received by the sequence-to-sequence speech recognition model form a point cloud video.
12. The speech detection system of any of the above claims, wherein once a command word or phrase is detected, the command word or phrase is provided to the system to trigger a response responsive to a user command.
13. The speech detection system of any of the above claims, wherein the speech detection system is capable of performing continuous speech detection.
14. The speech detection system of any of the above claims, wherein the sequence-to-sequence speech recognition model is configured to generate outputs that are dependent on an application layer.
15. The speech detection system of any of the above claims, wherein the outputs are sentences.
16. The speech detection system of any of the above claims, wherein the outputs are command words.
17. The speech detection system of any of the above claims, wherein the cropped region of interest is cropped to exclude a jaw captured in the sequence of depth images.
18. The speech detection system of any of the above claims, wherein: the memory further contains an intrinsic matrix for the depth sensor; and the information from the sequence of cropped depth images is obtained by transforming the sequence of cropped depth images into a sequence of point clouds using the intrinsic matrix for the depth sensor.
19. The speech detection system of claim 18, wherein the point clouds in the sequence of point clouds are down sampled and normalized to capture the salient features of speech patterns.
20. The speech detection system of any of claims 18 and 19, wherein a point cloud transformer is used to encode and preserve the structure of the points within the sequence of point clouds and to adaptively search for related or similar points across entire lip movements using self-attention.
21. The speech detection system of any of claims 18 to 20, wherein the downsampling is performed using farthest sampling.
22. The speech detection system of any of the above claims, wherein the at least one processor is further configured by the speech detection application to discard depths within the captured sequence of depth images that exceed a threshold.
23. The speech detection system of any of the above claims, wherein the at least one processor is configured by the speech detection application to identify and crop a region of interest from within each depth image using an object detection model.
24. The speech detection system of any of the above claims, wherein the at least one processor is further configured by the speech detection application to filter the sequence of depth images using a distance mask.
25. A speech detection method comprising:
   capturing a sequence of depth images using a depth sensor in a speech detection system;
   cropping each of the sequence of depth images using the speech detection system;
   extracting temporal-spatial features from the sequence of cropped depth images using the speech detection system;
   detecting at least one word from the extracted temporal-spatial features; and
   updating a user interface of the speech detection system in response to the at least one detected word.
26. The speech detection method of claim 25, further comprising thresholding depth samples within each of the sequence of depth images using the speech detection system.
27. The speech detection method of any of claims 25 and 26, wherein the speech detection system performs the thresholding prior to cropping each of the sequence of depth images.
28. The speech detection method of any of claims 25 to 27, wherein capturing a sequence of depth images using a depth sensor comprises capturing a sequence of depth images of silent speech.
29. The speech detection method of any of claims 25 to 28, wherein capturing a sequence of depth images using a depth sensor comprises capturing a sequence of depth images of audible speech.
30. The speech detection method of any of claims 25 to 29, wherein: capturing a sequence of depth images using a depth sensor comprises capturing a sequence of depth images of a facial gesture; and detecting at least one word from the extracted temporal-spatial features comprises detecting at least one command associated with the facial gesture.
31. The speech detection method of any of claims 25 to 30, wherein the depth sensor is selected from the group consisting of a structured light camera, a time of flight camera, and a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
32. The speech detection method of any of claims 25 to 31, wherein the depth sensor is a structured light camera.
33. The speech detection method of any of claims 25 to 32, wherein the depth sensor is a time of flight camera.
34. The speech detection method of any of claims 25 to 33, wherein the depth sensor is a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
35. The speech detection method of any of claims 25 to 34, wherein the sequences of images received by the sequence-to-sequence speech recognition model form a point cloud video.
36. The speech detection method of any of claims 25 to 35, wherein once a command word or phrase is detected, the command word or phrase is provided to the system to trigger a response responsive to a user command.
37. The speech detection method of any of claims 25 to 36, wherein the speech detection system is capable of performing continuous speech detection.
38. The speech detection method of any of claims 25 to 37, wherein the sequence-to-sequence speech recognition model is configured to generate outputs that are dependent on an application layer.
39. The speech detection method of any of claims 25 to 38, wherein the outputs are sentences.
40. The speech detection method of any of claims 25 to 39, wherein the outputs are command words.
41. The speech detection method of any of claims 25 to 40, wherein the cropped region of interest is cropped to exclude a jaw captured in the sequence of depth images.
42. The speech detection method of any of claims 25 to 41, wherein: the depth sensor contains an intrinsic matrix; and the information from the sequence of cropped depth images is obtained by transforming the sequence of cropped depth images into a sequence of point clouds using the intrinsic matrix for the depth sensor.
43. The speech detection method of claim 42, wherein the point clouds in the sequence of point clouds are down sampled and normalized to capture the salient features of speech patterns.
44. The speech detection method of any of claims 42 and 43, wherein a point cloud transformer is used to encode and preserve the structure of the points within the sequence of point clouds and to adaptively search for related or similar points across entire lip movements using self-attention.
45. The speech detection method of any of claims 42 to 44, wherein the downsampling is performed using farthest sampling.
46. The speech detection method of any of claims 25 to 45, wherein the at least one processor is further configured by the speech detection application to discard depths within the captured sequence of depth images that exceed a threshold.
47. The speech detection method of any of claims 25 to 46, wherein the at least one processor is configured by the speech detection application to identify and crop a region of interest from within each depth image using an object detection model.
48. The speech detection method of any of claims 25 to 47, wherein the at least one processor is further configured by the speech detection application to filter the sequence of depth images using a distance mask.
49. A smart watch, comprising:
   a housing;
   a watch band attached to the housing;
   a depth sensor mounted within the housing;
   a display mounted within the housing to form an enclosure;
   a memory contained within the enclosure, where the memory contains an operating system, a speech detection application and parameters defining a machine learning model;
   at least one processor contained within the enclosure, where the at least one processor is configured by the speech detection application to:
      capture a sequence of depth images using the depth sensor;
      identify and crop a region of interest from within each depth image in the sequence of depth images, where the cropped region of interest contains a mouth;
      detect at least one word by providing the sequence of cropped regions of interest to the machine learning model configured to receive a sequence of cropped regions of interest and output at least one detected word from a predetermined vocabulary, wherein the machine learning model comprises:
         a sequence-to-sequence speech recognition model;
         an application layer, where the application layer receives inputs from the sequence-to-sequence speech recognition model and outputs at least one detected word;
      wherein the sequence-to-sequence speech recognition model is a machine learning model trained to receive point cloud videos and output a sequence of characters;
      update a user interface of the speech detection system based upon a command corresponding to the at least one detected word.
50. The smart watch of claim 49, wherein the speech detection application further configures the processor to extract foreground information from each depth image in the sequence of depth images.
51. The smart watch of any of claims 49 and 50, wherein the speech detection application further configures the processor to threshold each depth image in the sequence of depth images to mask background information.
52. The smart watch of any of claims 49 to 51 , wherein the captured sequence of depth images includes depth images of silent speech.
53. The smart watch of any of claims 49 to 52, wherein the captured sequence of depth images includes depth images of audible speech.
54. The smart watch of any of claims 49 to 53, wherein the captured sequence of depth images includes depth images captured of a facial gesture and the at least one detected word is a command word associated with the facial gesture within the predetermined vocabulary.
55. The smart watch of any of claims 49 to 54, wherein the operating system configures the at least one processor to enable user interactions with various software applications via spoken or silent speech.
56. The smart watch of any of claims 49 to 55, wherein the at least one processor is further configured by the speech detection application to discard depths within the captured sequence of depth images that exceed a threshold.
57. The smart watch of any of claims 49 to 56, wherein the depth sensor is selected from the group consisting of a structured light camera, a time of flight camera and a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
58. The smart watch of any of claims 49 to 57, wherein the depth sensor is a structured light camera.
59. The smart watch of any of claims 49 to 58, wherein the depth sensor is a time of flight camera.
60. The smart watch of any of claims 49 to 59, wherein the depth sensor is a multiview stereo system incorporating an IR projector and stereo near-IR cameras.
61. The smart watch of any of claims 49 to 60, wherein the sequences of images received by the sequence-to-sequence speech recognition model form a point cloud video.
62. The smart watch of any of claims 49 to 61, wherein once a command word or phrase is detected, the command word or phrase is provided to the system to trigger a response responsive to a user command.
63. The smart watch of any of claims 49 to 62, wherein the speech detection application is capable of performing continuous speech detection.
64. The smart watch of any of claims 49 to 63, wherein the sequence-to-sequence speech recognition model is configured to generate outputs that are dependent on an application layer.
65. The smart watch of any of claims 49 to 64, wherein the outputs are sentences.
66. The smart watch of any of claims 49 to 65, wherein the outputs are command words.
67. The smart watch of any of claims 49 to 66, wherein the cropped region of interest is cropped to exclude a jaw captured in the sequence of depth images.
68. The smart watch of any of claims 49 to 67, wherein: the memory further contains an intrinsic matrix for the depth sensor; and the information from the sequence of cropped depth images is obtained by transforming the sequence of cropped depth images into a sequence of point clouds using the intrinsic matrix for the depth sensor.
69. The smart watch of claim 68, wherein the point clouds in the sequence of point clouds are down sampled and normalized to capture the salient features of speech patterns.
70. The smart watch of any of claims 68 and 69, wherein a point cloud transformer is used to encode and preserve the structure of the points within the sequence of point clouds and to adaptively search for related or similar points across entire lip movements using self-attention.
71. The smart watch of any of claims 68 to 70, wherein the downsampling is performed using farthest sampling.
72. The smart watch of any of claims 49 to 71, wherein the at least one processor is configured by the speech detection application to identify and crop a region of interest from within each depth image using an object detection model.
73. The smart watch of any of claims 49 to 72, wherein the at least one processor is further configured by the speech detection application to filter the sequence of depth images using a distance mask.
PCT/US2025/016678 2024-02-20 2025-02-20 Systems and methods for performing speech detection using depth images Pending WO2025179077A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202463555856P 2024-02-20 2024-02-20
US63/555,856 2024-02-20

Publications (1)

Publication Number Publication Date
WO2025179077A1 true WO2025179077A1 (en) 2025-08-28

Family

ID=96847724

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/016678 Pending WO2025179077A1 (en) 2024-02-20 2025-02-20 Systems and methods for performing speech detection using depth images

Country Status (1)

Country Link
WO (1) WO2025179077A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210134312A1 (en) * 2019-11-06 2021-05-06 Microsoft Technology Licensing, Llc Audio-visual speech enhancement
US20230077010A1 (en) * 2020-05-15 2023-03-09 Cornell University Wearable facial movement tracking devices
US20230253009A1 (en) * 2018-05-04 2023-08-10 Google Llc Hot-word free adaptation of automated assistant function(s)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25758818

Country of ref document: EP

Kind code of ref document: A1