WO2022174075A1

WO2022174075A1 - Systems and methods for computer vision based detection and alerting

Info

Publication number: WO2022174075A1
Application number: PCT/US2022/016176
Authority: WO
Inventors: Bernardino Eduardo MENDEZ IV; Cory Perry
Original assignee: Retinus LLC
Current assignee: Retinus LLC
Priority date: 2021-02-12
Filing date: 2022-02-11
Publication date: 2022-08-18
Anticipated expiration: 2023-08-12

Abstract

A computer vision system can include one or more processors configured to receive at least one image from a sensor, detect an object from the at least one image, determine that the object corresponds to an event condition, cause at least one of a depth estimation or a pose estimation responsive to determining that the object corresponds to an event condition, and output an indication of the object and the event condition based on the at least one of the depth estimation or the pose estimation. This computer vision system is also capable of detecting, ingesting and processing language and text from physical or digital sources.

Description

SYSTEMS AND METHODS FOR COMPUTER VISION BASED DETECTION AND

ALERTING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application claims the benefit of and priority to U.S. Provisional Application No. 63/148,903, filed February 12, 2021, the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND

[0002] The present application relates generally to the field of object detection, and more particularly to computer vision-based detection and alerting of objects and events.

SUMMARY

[0003] At least one aspect relates to a system. The system can include one or more processors configured to receive at least one image from a sensor, detect an object from the at least one image, determine that the object corresponds to an event condition, cause at least one of a depth estimation or a pose estimation responsive to determining that the object corresponds to an event condition, and output an indication of the object and the event condition based on the at least one of the depth estimation or the pose estimation.

[0004] At least one aspect relates to a method. The method can include receiving, by one or more processors from at least one sensor, at least one image; detecting, by the one or more processors from the at least one image, an object; determining, by the one or more processors, that the object corresponds to an event condition; causing, by the one or more processors responsive to determining that the object corresponds to the event condition, at least one of a depth estimation of the object or a pose estimation of the object; and outputting, by the one or more processors, an indication of the object and the event condition based on the at least one of the depth estimation or the pose estimation.

[0005] At least one aspect relates to a system. The system can include one or more processors configured to detect a plurality of characters from at least one image; and apply, responsive to detecting the plurality of characters, a natural language detector to the plurality of characters to generate a sentiment from the at least one image. [0006] At least one aspect relates to a method. The method can include detecting, by one or more processors, a plurality of characters from at least one image; and applying, by the one or more processors responsive to detecting the plurality of characters, a natural language detector to the plurality of characters to generate a sentiment from the at least one image.

[0007] Those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the devices and/or processes described herein, as defined solely by the claims, will become apparent in the detailed description set forth herein and taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] The accompanying drawings are not intended to be drawn to scale. Like reference numbers and designations in the various drawings indicate like elements. For purposes of clarity, not every component can be labeled in every drawing. In the drawings:

[0009] FIG. 1 depicts an example of a computer vision system.

[0010] FIG. 2 depicts an example of an image processing system.

[0011] FIG. 3 depicts an example of a pose estimation system.

[0012] FIG. 4 depicts an example of an object detection system.

[0013] FIG. 5 depicts an example of a language processing system.

[0014] FIG. 6 depicts an example of a method of computer vision based detection and alerting.

DETAILED DESCRIPTION

[0015] Systems and methods as described herein can process, analyze, and generate actionable insights from visual data inputs in the form of images or videos. Various machine learning/artificial intelligence (AI) models can be trained based on processing such inputs at the pixel level to learn to perform computer vision functions. Multiple layers of detection algorithms can be operated to enable accuracy and provide maximum data, including for object detection and text and language processing. Such operations can be performed to detect and generate insights regarding objects and events such as vehicles, person movement and actions, identification of text and semantic concepts associated with text, weapons including pistols, rifles, shotguns, and knives, bomb making equipment, chemical laboratory equipment, and electronics and manufacturing equipment. Various detection and analysis or insight generation operations can be performed in parallel and/or at different rates, allowing for timely delivery of output and updating or confirmation of outputs as further data is detected and processed. Language and text from physical or digital sources can be detected, ingested, and processed to generate insights, monitor events, and trigger alerts.

[0016] For example, object classification can be performed to classify objects in images and videos based on defined categories. Object classification can be performed to classify weapons including pistols, rifles, shotguns, knives, bomb making equipment, chemical laboratory equipment, and electronics and manufacturing equipment. Object identification can be performed to identify specific objects from images and video, such as an exact make and model of a firearm, vehicle, or words on a label. Object tracking can be performed to identify specific objects such as guns or bomb making equipment and track movement of the objects throughout the camera/sensor scene. Depth estimation can be performed to estimate distances of objects with distance measurements based on single and multiple camera inputs and mathematical estimations, such as to provide possible weapon and threat distances. Pose estimation can be performed to predict field of fire for identified weapons and threats, such as to identify and detect directionality of limbs and the positions of head, arms, and legs in order to read movements and estimate direction of travel, direction of threat, and direction of weapons.

Various such operations can be performed and combined as described herein to enable real-time or near real-time detection of objects and events and generation of alerts regarding such objects and events with sufficient accuracy and reduced size, weight, and power considerations. For example, person and weapon detection can be performed on a continuous basis, and distance and pose estimation can be performed responsive to detecting a weapon, allowing for more rapid and accurate image analysis and timely delivery of alerts and other insights regarding the detected objects and events.

[0017] Systems and methods in accordance with the present disclosure can implement optical character recognition (OCR) for detection and translation of real world labels as well as database, application, document, and network data extraction. For example, a label reader can be used at a chemical site to enable fast on-site translation of bulk chemical containers for proper storage and maintenance. OCR can be performed to scan digital images and then recognize and retain the text from these images and store the text into a dataset for future data integration. Natural Language Processing (NLP) can be implemented to further categorize detected text, such as by using algorithm mathematical weights. NLP algorithms can provide the ability to estimate sentiment analysis based on text. By integrating language processing functions including the OCR and NLP processes described herein, computer-based event evaluation and alerting can be rapidly and accurately performed on a broader range of objects and situations.

[0018] FIG. 1 depicts an example of a computer vision system 100. The computer vision system 100 can perform object detection, such as person and weapon detection, object tracking, text detection and ingestion, distance or depth estimation, and pose estimation, which can be used to generate alerts to present to a user. For example, one or more components of the computer vision system 100 can be coupled to or implemented using a vehicle, a body camera, a portable electronic device, or an augmented reality display, such as to enable threat detection and alerting for an environment around the user.

[0019] The computer vision system 100 can include at least one sensor 104, such as at least one image capture device 104 (e.g., a camera). The image capture device 104 can output one or more images, including photos or videos. The image capture device 104 can have low latency and high resolution. The image capture device 104 can have a rugged housing, such as a waterproof housing. The images can represent a field of view of the image capture device 104. The images can include a plurality of pixels to which image data, such as intensity and color data, is assigned. The images can be color or black and white images. The image capture device 104 can be a night vision camera.

[0020] The at least one image capture device 104 can include multiple image capture devices 104, such as multiple image capture devices 104 mounted to a vehicle. For example, four image capture devices 104 can be mounted to a vehicle as depicted in FIG. 1; greater or fewer image capture devices 104 can be used. The image capture devices 104 can be calibrated based on respective fields of view to enable proper interpretation of information represented by the images detected by the image capture devices 104. For example, the fields of view of each image capture device 104 can be identified in a frame of reference to enable downstream image processing functions to determine where each field of view is posed relative to each other field of view. The image capture devices 104 can be provided identifiers corresponding to their locations (e.g., front, left, right, rear) to facilitate the calibration.

[0021] The at least one sensor 104 can include various sensors, such as infrared cameras or LIDAR devices. The infrared cameras and LIDAR devices can output data in a format analogous to the images generated by the image capture device 104, such as in a format in which pixels are arranged in rows and columns and assigned values (e.g., intensity, color) representative of the information detected by such devices. The at least one sensor 104 can include, for example, a body -worn camera or vehicle dashboard camera.

[0022] The computer vision system 100 can include a network hub 108. The network hub 108 can receive and transmit at least one of wired or wireless communications between various components of the computer vision system 100. For example, the network hub 108 can receive images outputted by the image capture devices 104 using wired or wireless communications protocols. The computer vision system 100 can include communications electronics coupled with the network hub 108 to receive and transmit data with remote devices; the communication electronics can include components such as wired and wireless communications electronics, such as radio frequency (RF), WiFi, or cellular (e.g., 4G/5G LTE) transceivers.

[0023] The computer vision system 100 can include processing circuitry 112. The processing circuitry 112 can be implemented as an edge compute device, such as a compact, self-contained, scalable device. At least a portion of the processing circuitry 112 can be implemented using a graphics processing unit (GPU). The processing circuitry 112 can include at least one processor and memory. The processor can be a general purpose or specific purpose processor, an application specific integrated circuit (ASIC), one or more field programmable gate arrays (FPGAs), a group of processing components, or other suitable processing components. The processor is configured to execute computer code or instructions stored in memory or received from other computer readable media (e.g., CDROM, network storage, a remote server, etc.). Memory can include one or more devices (e.g., memory units, memory devices, storage devices, etc.) for storing data and/or computer code for completing and/or facilitating the various processes described in the present disclosure. Memory can include random access memory (RAM), read-only memory (ROM), hard drive storage, temporary storage, non-volatile memory, flash memory, optical memory, or any other suitable memory for storing software objects and/or computer instructions. Memory can include database components, object code components, script components, or any other type of information structure for supporting the various activities and information structures described in the present disclosure. Memory can be communicably connected to processor via processing circuitry and may include computer code for executing (e.g., by processor) one or more processes described herein. When processor executes instructions stored in memory, processor generally configures the processing circuitry 112 to complete such activities. The processing circuitry 112 can be implemented using distributed hardware units that may communicate through wired or wireless connections (e.g., communicate using the network hub 108). For example, the processing circuitry 112 can be at least partially implemented using a cloud computing device in communication with the network hub 108 via a wireless communications link. The processing circuitry 112 can include various functions, code, algorithms, models, operations, routines, logic, or instructions to implement various components of the processing circuitry 112 using the at least one processor of the processing circuitry 112. The processing circuitry 112 can perform various operations using parallel processing and other computational task scheduling processes.

[0024] The processing circuitry 112 can include a video decoder 124. The video decoder 124 can receive at least one image from the image capture device 104, such as to receive a stream of image frames (e.g., video stream). The video decoder 124 can process the image frame, such as to assign a timestamp to, filter, or modify a resolution of the image frame. The video decoder 124 can resize or scale the image frames. The video decoder 124 can determine an indication of a complexity (e.g., image complexity, such as entropy) of the scene represented by the image frame, and adjust a resolution of the image based on the complexity, such as to reduce the resolution for relatively low complexity image frames, which can reduce computational and network communication requirements for processing the image frames without reducing the fidelity of the image frames.

[0025] The processing circuitry 112 can include at least one machine learning model 128 (e.g., artificial intelligence model). The machine learning model 128 can receive the image frame from the video decoder 124 (e.g., in response to processing of the image frame by the video decoder 124). The machine learning model 128 can be trained to generate outputs responsive to receiving the image frame. For example, the machine learning models 128 can include models trained to detect one or more objects represented by the image frame, and output an indication of the detected objects; detect characters or text, such as from labels; detect depth (e.g., distance) data; detect poses of persons; detect events such as threats or directions of threats (e.g. based on detecting poses of persons and directions of weapons held by persons). The machine learning models 128 can be trained using training data particular to specific use cases or environments, facilitating more rapid and accurate deployment of the machine learning models 128.

[0026] The machine learning models 128 can be trained using training data assigned labels at a plurality of levels of abstraction for a particular use case. For example, for object detection, the machine learning models 128 can be trained using at least a first subset of training samples labeled for features of objects, such as anatomical features (e.g., hands, legs, arms, shoulders, heads) and a second subset labeled for classes of objects (e.g., person, vehicle, weapon, text). When used to generate outputs responsive to input image data (e.g., in operation after having been trained), the input image data can be provided to one or more models of the machine learning models 128 in parallel, in series, or various combinations or feedback loops thereof; for example, the input image data can be provided to a first machine learning model 128 trained at a first level of abstraction (e.g., anatomical feature detection), the output of which can be provided to a second machine learning model 128 trained at a second level of abstraction (e.g., object classification).

[0027] The training data can include relatively greater training samples (e.g., labeled image frames) selected to improve model accuracy, such as training samples associated with clear positive cases (e.g., particular weapons, distances, and poses of persons associated with confidence of a threat being present) and negative cases (e.g., clear examples of objects that are not threats). For example, one or more models of the machine learning models 128 can be trained using training data that includes a plurality of training samples. Each training sample can include image data and at least one label assigned to the image data. The at least one label can include, for example, an identifier of one or more objects or features of objects represented by the image data (including text information). The at least one label can include a class of the one or more objects, features of objects, or text information. For example, for chemical label detection, the machine learning models 128 can be trained using training samples labeled with at least one of characters on a chemical label (e.g., the characters H2SO4 for sulfuric acid), language representing the characters (e.g., the text string “sulfuric acid”), or one or more classes of the chemical represented by the chemical label (e.g., “acid”). [0028] The at least one label can include an indication of an event condition being satisfied (e.g., positive case) or not being satisfied (e.g., negative case), such as an object or person being in a particular pose or orientation. The at least one label can include or be assigned a confidence score. The confidence score can indicate whether the training sample is a clear case, such as if there is a high confidence that the identifier for the object is accurate, that the text represented in the image data has been correctly labeled, that the object is in the particular pose or otherwise satisfies (or does not satisfy) the event condition, among other such examples. The confidence score can be a numerical score. The confidence score can be a particular label of a plurality of predefined labels (e.g., high confidence condition not satisfied; low confidence condition not satisfied; low confidence condition satisfied; high confidence condition satisfied). The training samples can be labeled by an auto-labeler (e.g., one or more computer vision algorithms, models, feature extractors, object detectors, or combinations thereof) or a human labeler. The computer vision system 100 can train the machine learning models 128 by at least one of (1) receiving training data in which at least a subset of the training samples satisfies a clear case condition (e.g., has a confidence score that satisfies a threshold indicative of being a clear positive or clear negative case; has been labeled to be a clear case or without a specific confidence score; has been labeled to have high confidence), the portion of the subset being, for example, at least twenty percent, at least thirty percent, or at least fifty percent of the total number of training samples; or (2) selecting, from amongst the training samples, at least an amount of training samples that satisfy the clear case condition to form the subset of, at least twenty percent, at least thirty percent, or at least fifty percent of the total number of training samples. Each model of the one or more machine learning models 128 can be trained using training data specific to particular domains or use cases. As such, the computer vision system 100 can generate the one or more machine learning models 128 to more accurately and rapidly perform object and text detection, including for particular use cases, by being more particularly trained on high- confidence training samples.

[0029] The machine learning models 128 can include at least one neural network, such as a convolutional neural network. The neural network can include residual connections. The neural network can perform feature extraction. The neural network can be trained to perform operations such as object detection, depth estimation, and pose estimation. The machine learning models 128 can include models (or model backbones) such as MobileNetV2, RetinaNet, and EfficientNet. The machine learning models 128 can be trained using training data generated from video frames, photos, and web-based sources, and which can be augmented through operations such as orientation, rotation, brightness, exposure, blur, and cutout modifications to facilitate more effective training and in turn more effective operation of the machine learning models 128. Various preprocessing operations can be performed, such as greyscale conversion, re-sizing, and pixel count modification (e.g., resolution modification).

[0030] The processing circuitry 112 can include a video compiler 132. The video compiler 132 can be used to generate image data (e.g., video data) for presentation by a user interface 136. For example, the video compiler 132 can receive the image frame and the indication of the detected one or more objects, and generate an output frame for presentation by the user interface 136.

[0031] The video decoder 124 can transmit the processed image frame to the video compiler 132, such as bypass the machine learning model 128 (e.g., the video decoder 124 can transmit the processed image frame to the video compiler 132 and not to the machine learning model 128). For example, the video decoder 124 can monitor a performance characteristic of generation of output frames by the video compiler 132 or presentation of the output frames by the user interface 136, and transmit the processed image frame to the video compiler 132 responsive to the performance characteristic not satisfying a threshold condition. The performance characteristic can be a frame rate, latency, or other characteristic representative of the effectiveness of the presentation of frames to a user. By selectively bypassing the machine learning model 128, the video decoder 124 can facilitate ensuring that the presentation of frames to the user satisfies performance targets while other frames are still used to generate information for presentation using the machine learning model 128.

[0032] The user interface 136 can include a user input device, a display device, and an audio output device. The user input device can include any component that facilitates a user interaction with the user interface. For example, the user input device can include a keyboard, a mouse, a trackpad, a touchscreen, or an audio input device. The display device can present an image or series of images for viewing. The audio output device can include any component that emits a sound. For example, the audio output device can be a speaker or headphones. [0033] The user interface 136 can present outputs such as text information and alert information, such as alert location, unit name, camera number and estimated depth, direction and object detection. Various such alert and event information can be transmitted (e.g., using communication electronics coupled with the network hub 108) to a remote device, such as a device of a command or dispatch center, or a remote database.

[0034] FIG. 2 depicts an example of an image processing system 200. The image processing system 200 can be implemented using the computer vision system 100, such as by using the processing circuitry 112. For example, the image processing system 200 can be implemented by the processing circuitry 112 to receive image frames from sensors 104 and process the image frames, such as to operate the machine learning models 128 using the image frames as input. Various features of the image processing system 200 (e.g., object detection, depth estimation) may be implemented using components that do not rely on machine learning, such as template matching of extracted features or LIDAR or SONAR processing.

[0035] The image processing system 200 can include an object classifier 204. The object classifier 204 can receive the image frame and output an indication of an object class of an object detected from the image frame. The object classifier 204 can include a machine learning model (e.g., machine learning model 128) trained to generate the indication of the object class using training data that includes image frames labeled with corresponding object classes.

[0036] The image processing system 200 can include an object identifier 208. The object identifier 208 can incorporate features of the object classifier 204. For example, the object identifier 208 can include a machine learning model trained to generate an identifier of an object from an image frame based on training data that includes image frames labeled with identifiers of objects represented by the images.

[0037] At least one of the object classifier 204 or the object identifier 208 can be operated to perform object detection. For example, one or more image frames can be received and provided as input to the at least one of the object classifier 204 or the object identifier 208, which can output an indication of a detected object responsive to processing the input (e.g., by applying the image frames as an input to the trained machine learning model). Identified objects can be assigned markers, such as bounding boxes (which can also be used to crop image frames in order to output a cropped frame that includes the identified object and does not include image data outside of the bounding box to facilitate downstream image processing regarding the identified object).

[0038] The image processing system 200 can include an object tracker 212. The object tracker 212 can receive an indication of an object identified from the image frames (e.g., from the object identifier 208) and generate an object track of the identified object. For example, the object tracker 212 can receive an indication of an object identified in a first frame, use the indication to identify the object in a second frame (e.g., a second frame corresponding to image data captured subsequent to the first frame), and maintain an object track data structure mapping the indication of the identified object to the first frame and the second frame (e.g., by indicating one or more pixels of each frame associated with the identified object).

[0039] The image processing system 200 can include a depth estimator 216. The depth estimator 216 can determine a distance of an object (e.g., an object identified using the at least one of the object classifier 204 or the object identifier 208) represented by an image frame, including based on the image frame being from a single camera. For example, the depth estimator 216 can extract features of objects from the image frame and generate a heat map indicative of distances of the extracted features. The depth estimator 216 can determine the distances by comparing two image frames (e.g., images from separate cameras; images from a single camera at different points in time) and using one or more parameters of the sensor 104 that detected the image frames, such as a focal length of the sensor 104. For example, the depth estimator 216 can determine a distance between one or more pixels representing a particular feature in the two image frames, and use the focal length to determine a depth to the particular feature based on the distance and the focal length. The depth estimator 216 can generate the heat map by mapping depth for at least a subset of pixels of image frames to color data and/or brightness data, which can be assigned to the respective pixels of the subset. The mapping can include, for example, at least one of higher brightness or colors of higher wavelength (e.g., more red) for lower distances, and vice versa.

[0040] The image processing system 200 can include a pose estimator 220. The pose estimator 220 can receive an image frame representing a person (e.g., a person identified by the object identifier 208, and which may be tracked by the object tracker 212) and determine a pose of the person based on the image frame. For example, the pose estimator 220 can determine a pose of the person indicative of at least one of a position or an orientation of a head, arms, legs (e.g., various limbs) of the person. The pose estimator 220 can determine a direction of a pose of a limb, such as by identifying one or more features of the limb (e.g., determining that a hand is closer than an elbow to the camera can indicate the direction of the limb). The pose estimator 220 can include a machine learning model (e.g., one or more of machine learning models 124) trained to output the pose based on training data including image frames representing persons oriented in various poses (which may be labeled with the pose, such as a type of pose). As noted above, the pose estimator 220 (e.g., one or more machine learning models 128 of the pose estimator 220) can be trained using training data at various levels of abstraction associated with pose detection, such as for detecting anatomical features, detecting spatial relationships between anatomical features indicative of appendages (e.g., hand-elbow relationship indicating where the arm is pointing), detecting directions (e.g., pointing, lifting, movement) associated with

[0041] For example, the pose estimator 220 can process the image frame to identify one or more nodes of the person (e.g., using object detection models of the machine learning models 128). The pose estimator 220 can generate one or more vectors between the one or more nodes, the vectors representative of appendages of a person. The pose estimator 220 can provide the one or more vectors as input to a model 128 trained to output poses responsive to receiving the vectors (or trained to output poses responsive to receiving the nodes). The pose estimator 220 can output the pose as a class of a pose, such as running, walking, jumping. The pose estimator 220 can output poses (or classes of poses) specific to a class of the object, such as running, walking, or jumping responsive to detecting the pose for an object classified as a person, driving or turning responsive to detecting the pose for an object classified as a vehicle, or galloping or jumping over a barrel responsive to detecting the pose for an object classified as a horse.

[0042] The pose estimator 220 can use the object tracker 212 to track movement of at least one of the nodes or the vectors over time (e.g., from at least a first image frame to a second image frame), and use the movement to determine the pose. For example, the pose estimator 220 can be trained using image and/or video data labeled with poses and representing movement in the image and/or video data, enabling the pose estimator 220 to detect poses in images having similar movement. [0043] The pose estimator 220 can use the pose to determine a direction of movement (e.g., direction of travel). For example, the pose estimator 220 can determine the pose of the user for at least one image frame (e.g., for multiple image frames based on a track maintained by the object tracker 212) and based on the pose determine a direction of movement of the person. Specific poses and the movement of poses can be used to estimate direction, such as to detect the knee of a person moving up and towards a side of the video data (e.g., from multiple image frames), and generate a direction based on one or more vectors representing pixels of the knee over the course of the video data.

[0044] The pose estimator 220 can determine that an object of a particular class is associated (e.g., close enough to be held) by the person for which the pose is identified, and determine or predict an event based on the pose and the identified object. For example, the pose estimator 220 can determine that the person is holding a weapon, and determine at least one of a field of fire, a direction of a threat, or a direction of a weapon based on the identified pose and the identified weapon (e.g., including based on an orientation of the weapon). One or more machine learning models 124 of the pose estimator 220 can be trained to determine various such events from the detected weapon and pose data.

[0045] The pose estimator 220 can include multiple layers (e.g., multiple layers of models), such as to perform multiple pass pose detection. For example, as depicted in FIG. 3, the pose estimator 220 can be implemented using a pose estimation system 300, which can receive image frames at an initial model backbone, which can perform initial feature extraction, and provide extracted features to a plurality of passes of a pose detection block in order to generate output (e.g., a point cloud). As depicted in FIG. 3, the passes can include both sequential and skip layers.

[0046] At least one of the depth estimator 216 or the pose estimator 220 can be operated responsive to a trigger condition (which can be an example of an event condition). For example, the trigger condition can be detection of a weapon by the at least one of the object classifier 204 of the object identifier 208. As such, the image processing system 200 can perform object detection on a continuous or regular basis (e.g., for each frame received from the video decoder 124 or for a subset of the image frames received from the video decoder 124, such as one of every two, five, or ten image frames), and trigger operation of the at least one of the depth estimator 216 or the pose estimator 220 responsive to detecting the object. This can reduce computational requirements of the image processing system 200 while maintaining sufficient generation of insights regarding the video stream. For example, the image processing system 200 can operate at least one of the object classifier 204 or the object identifier 208 for a relatively high fraction of image frames (e.g., at least fifty percent of the image frames; at least eighty percent of the image frames; every image frame), and operate at least one of the depth estimator 216 or the pose estimator 220 responsive to output of the at least one of the object classifier 204 or the object identifier 208 satisfying the trigger condition (e.g., detecting a particular object that is of a particular class or has a particular identifier, such as detecting an object that is a weapon).

[0047] The image processing system 200 can use the output of at least one of the depth estimator 216 or the pose estimator 220 to evaluate an event condition. The event condition can be, for example, a person or object moving into a particular area, such as an alarm region or a threshold distance from the at least one sensor 104 (which may in proximity to a user); the person being in range of a device to be operated (e.g., a projectile, net, or wrap deployment device); or various other events associated with locations, poses, or movement of people; identities or classes of objects or materials (e.g., chemical labels); or combinations thereof. For example, with respect to event conditions associated with pose and movement of the person, the image processing system 200 can retrieve a threshold distance or region relative to the at least one sensor 104 (which may correspond to an target or optimal distance from a device to be operated based on the event condition, such as a device to deploy a projectile, net, or wrap), and at least one of determine whether the person is within the threshold distance or region at a current point in time or predict, based on a direction and speed of movement of the person, a point in time at which the person will be within the threshold distance or region, to generate output indicative of the event condition based on the determination. For example, the image processing system 200 can determine that the person is outside of a range of the projectile, net, or wrap deployment device and generate output indicating a direction and distance of movement for a user to perform so that the predicted position of the person is within the range of the projectile, net, or wrap deployment device; or, responsive to determining that the person is in the range (or will be in the range accounting for a time of deployment of the projectile, net, or wrap), generate output indicating clearance or proper timing for deploying the wrap. Various such outputs can be provided, for example, using user interface 136, or transmitted to a remote device. [0048] FIG. 4 depicts an example of an object detection system 400 that can be operated by various devices and systems described herein, such as the computer vision system 100 or the image processing system 200. The object detection system 400 can use various machine learning models sequentially and in parallel to generate more accurate and timely object detection and alerting outputs. For example, the object detection system 400 can receive a first input 404 of an image frame having a relatively high resolution (which may be a native resolution of a sensor source from which the image frame is received, or an image frame downsized by a first amount) and a second input 408 of a second image frame having a relatively low resolution less than the resolution of the first image frame. The object detection system 400 can provide the first input to a first model, such as a high resolution detector 412, and the second input to a second model, such as a low resolution detector 416, in order to cause the first and second models to perform respective feature extractions, such as sets of pixels representative of features that the models have been trained to identify (e.g., lines or other shapes representative of objects; objects). For example, the detectors 412, 416 can process the respective image frames to each generate features of objects or candidate object identifiers or classes.

[0049] The object detection system 400 can perform the generation of the second image frame, such as a scale by which the resolution of the second image frame is reduced relative to the first image frame (or a target resolution of the second image frame) based on various parameters, such as at least one of a predetermined scaling factor (which may be specific to a type of sensor 104 from which the image frames are received), an identifier or class of a previously detected object (which can allow the object detection system 400 to dynamically calibrate the cross resolution detection process to the environment), or a processing usage factor.

[0050] Responsive to extracting features using the first and second models 412, 416, the object detection system 400 can operate a cross-resolution feature extractor 420, such as to identify features from the initially extracted sets of features detected by the first and second models 412, 416. Responsive to output of the cross-resolution feature extraction, the features can be merged by a feature merger 424, such as using a multiple pass feature merging model (e.g., Mobile DenseNet). The output of the multiple pass feature merging model 424 can be upscaled 428 for storage or presentation by various devices, such as a remote database (e.g., cloud database) or local hardware (e.g., hardware on-site at a vehicle). The use of multiple resolutions can enable the models to more effectively detect features under varying input conditions (e.g., varying camera hardware or environmental effects such as background light or obscurants). The use of multiple passes can facilitate more accurate feature detection, such as to more effectively address outlier features.

[0051] For example, the object detection system 400 can evaluate a confidence associated with outputs of one or both of the high resolution detector 412 and low resolution detector 416, and output an indication of at least one of an object identifier or an object class responsive to the confidence satisfying a confidence threshold. As such, where either detector 412, 416 detects information with high confidence, the detection information can be rapidly outputted (e.g., assigned to an image frame to be rendered as display output) and accurately provide information to a user. The object detection system 500 can continue to evaluate the other of the outputs of the detectors 412, 416 (e.g., where the other output did not meet a corresponding confidence threshold), and validate or modify the previous output as further information is detected over subsequent image frames.

[0052] FIG. 5 depicts an example of a language processing system 500. The language processing system 500 can be implemented using features of and/or as a component of the computer vision system 100, the image processing system 200, the object detection system 400, or various combinations thereof. The language processing system 500 can be used to perform OCR, NLP, and various combinations thereof to automatically generate accurate, timely insights regarding text information detected using computer vision, including to generate meaningful detections of objects or materials based on the text information. The language processing system 500 can be implemented in a same image processing pipeline as various other processes described herein; for example, images that are processed to detect objects can simultaneously (e.g., in parallel) be processed to detect text information (e.g., using OCR), to which NLP can be applied to generate NLP output, which can be used to evaluate event conditions associated with the NLP output. For example, the language processing system 500 can apply OCR to chemical labels to detect identifiers of chemicals from the text information represented by the labels, apply NLP to the identifiers to detect one or more candidate chemical processes being performed or capable of being performed using the detected chemicals, and evaluate an event condition (e.g., detect a dangerous chemical process or other target chemical process) based on the one or more candidate chemical processes. The system 500 can operate rapidly and accurately using various processes described herein, including based on generating particular relationships amongst stored data and keywords to allow for fast, accurate data retrieval. The system 500 can monitor particular data sources to generate alerts responsive to alert criteria being triggered.

[0053] For example, the system 500 can include an OCR processor 504. The OCR processor 504 can receive image data (e.g., from sensor 104) and generate one or more characters responsive to receiving the image data. The OCR processor 504 can include, for example, one or more models trained to detect text information, such as characters. The OCR processor 504 can ingest data from various sources, including but not limited to images, labels, signs, and text data from sources such as documents, email, financial data, social networks, social media profiles, mobile applications, servers, databases, and spreadsheets.

[0054] For example, the system 500 can include a data retriever 506. The data retriever 506 can include various scripts, functions, algorithms, crawlers, indexers, logic, code, instructions, or combinations thereof that can retrieve text data or images of text data from various sources, including but not limited to images, labels, signs, and text data from sources such as documents, email, financial data, social networks, social media profiles, mobile applications, servers, databases, and spreadsheets, and store the retrieved data in the database 508. As an example, the data retriever 506 can be a crawler script that can search and store data including social media and website URLs using links from initial pages (e.g., based on keyword searches). The data retriever 506 can retrieve data including images, video, and other content from the searched data, such as to link the retrieved data to the searched data and to the keywords used to perform the searches. The OCR processor 504 (and/or NLP 512) can process the retrieved data to detect text information and sentiment from the retrieved data.

[0055] The OCR processor 504 can store the ingested data (e.g., text data such as characters and words extracted from the ingested data) in one or more databases 508. The text data can be arranged in a structured format, such as to indicate spatial relationships amongst characters and words, including relationships such as sentences, paragraphs, identifiers of documents, web pages, or social media profiles from which the text data is retrieved, or other such information indicating relationships amongst the characters and words of the text data. For example, for a particular keyword of the text data, the database 508 can assign one or more links from the keyword to sources of the text data having the keyword, such as links to documents or social media profiles.

[0056] The system 500 can include a natural language processor (NLP) 512. The NLP 512 can process text data in the database 508. The NLP 512 can function together with the OCR processor 504 to process unstructured data such as documents, email, financial data, social networks, mobile applications, servers, databases, and spreadsheets, including to detect sentiment from the unstructured data. The NLP 512 can use various language techniques on the text information in the database 508, such as distribution semantics, to determine which words appear in similar sentences. For example, the NLP 512 can generate count vectors to count the number of a times a word appears next to other words. The quantities, relationships and interactions between words allow the NLP to take on a more conversational understanding of the text.

[0057] The NLP 512 can include one or more machine learning models 128 trained to generate outputs based on text data, such as to operate as a sequence to sequence network, such as an encoder-decoder model (e.g., a model including an encoder configured to flag text to remember and how it was used, allowing the decoder to predict correct word choice for language processing). For example, the NLP 512 can include an encoder 516 including a plurality of recurrent neural networks (RNNs) 518. The encoder 516 can use the RNNs 518 to detect sentence understanding by seeing which words are grouped together, how they were grouped together and other grammatical properties that can be connected to meaning, to generate an encoder vector 520. The encoder vector 520 can be a single shared representation per sentence.

[0058] The NLP 512 can include a decoder 524. The decoder 524 can receive the encoder vector 520 generated by the encoder 516 and output a score corresponding to a model prediction by the decoder 524. For example, the score can correspond to a word predicted by the decoder 524 based on the encoder vector 520. For example, the highest scoring word from the decoder 524 can become the prediction of the NLP 512. By training the NLP 512 using the most statistically correct words to predict, the NLP 512 also learns based on the decoder prediction layer and the model changes the words to numerical values with vector representation.

[0059] The NLP 512 can use the vector representation of words to generate a graphical representation of relationships between text data, such as a plot to visualize the representations as nodes in a graph based on weighted results. For example, the NLP 512 can assign word data representations to nodes of the graph, where the nodes are arranged using the vectors representative of relationships between text data and sentiment underlying the text data.

[0060] As such, the system 500 can use the OCR processor 504 and NLP 512 to performing sentiment analysis, including by collecting text data representative of sentiment through search, and using the NLP 512 to detect relationships amongst the text data to map sentiment to particular text (e.g., keywords). The system 500 can group text data based on predefined keywords, enabling the NLP 512 to detect sentiment responsive to a topic represented by a keyword. The system 500 can use the keywords to adjust the NLP 512 for future iterations. This can enable the system 500 to use the NLP 512 to identify sentiments and keywords linking correlated topics and output suggested sentiment search criteria based on conversational center of gravity within a network of individuals. For example, the NLP 512 can receive, as input, text data in the database 508 having relationship information based on how the text data is ingested (e.g., count vectors; relationships between words, such as distances, in the same document; documents or webpages from which the text data is retrieved). The NLP 512 can be trained, using the input, to generate output to predict words from the input, such as to predict words (which can represent sentiment) that are highly correlated with the input. The NLP 512 can thus map inputs, such as keywords, to sources of particular sentiments, such as documents, webpages, or social media profiles.

[0061] The system 500 can use the NLP 512 to perform translation. For example, the system 500 can implement a translation process by using a bilingual text database of the database 508 and a combination of monolingual datasets to process particular text through statistical or neural models associated with the NLP 512 (e.g., depending on the language and accuracy desired).

Each type of translation can be further tuned based on dataset growth.

[0062] FIG. 6 depicts an example of a method 600 of computer- vision based detection and alerting. The method 600 can be performed using various systems and devices described herein, including the computer vision system 100, image processing system 200, pose estimation system 300, object detection system 400, and language processing system 500. Various steps or combinations thereof of the method 600 can be performed in parallel, in series, in feedback loops, and/or in communication with one another. Various steps or combinations thereof can be performed at the same or different rates, such as rates synchronous or asynchronous with a frame rate at which image data is received. Various steps or combinations thereof can be repeated or bypassed, including depending on identities or classes of objects (including text data) detected. For example, OCR 615 can be on input data prior to or in parallel with object identification 610 (e.g., object identification, detection, and/or classification); or, as depicted in FIG. 6, OCR 615 can be performed responsive to object identification 610. Various steps or combinations thereof can be performed for machine learning model training as well as runtime use of machine learning models, and data received in runtime processes can be used to update or otherwise further train the machine learning models.

[0063] At 605, input data is received. The input data can include image data received from one or more sensors, including cameras, body cameras, vehicle/dashboard cameras, infrared cameras, and LIDAR devices. The input data can be received as a stream of image frames. The input data can represent a real-world environment around the one or more sensors, such as an environment in which objects and text (e.g., labels of objects) are present. The input data can include text data from physical or digital sources, such as documents, webpages, or social media profiles.

[0064] At 610, an object is identified. The object can be identified by performing one or more computer vision processes, such as by applying the input data as input to one or more machine learning models trained to perform object identification and detection. Identifying the object can include determining at least one of an identifier of the object (e.g., make, model) or a class of the object (e.g., person, vehicle, animal, weapon). Identifying the object can include determining that the object includes text information, such as by determining that the image data includes a label or other representation of text, or that the input data is received from a text source.

[0065] At 615, responsive to determining that the object (e.g., image data representative of an object having a label; text information received from a text source) includes text information, OCR can be performed on the text information. Performing OCR can include detecting one or more characters representing the text information, along with relationship information associated with the characters, including word, sentence, paragraph, or other document structures, as well as indicators of the source of the text information, such as a particular environment in which the text information is present (which can correspond to the sensor from which the input data is received). The text information and relationship information can be stored together in a database.

[0066] At 620, NLP is performed on the text information retrieved by OCR, such as to detect context or sentiment associated with the text information. The NLP can be performed using one or more machine learning models, such as a sequence-to-sequence model, trained to predict text information (e.g., words from the database) responsive to receiving input text information. For example, the NLP can be performed to predict a chemical process being performed based on the text information (e.g., by using the text information stored from OCR to identify a plurality of chemicals, and using the NLP to identify a process having at least a threshold confidence of association with the plurality of chemicals).

[0067] At 625, and event condition can be identified. The event condition can be, for example, a threat, a movement into a particular area, a type of object, a type of process predicted by NLP, or various other events associated objects and/or text. For example, response to identifying an object to be of a particular class (e.g., person), a particular event condition can be identified as being associated with the particular class (e.g., movement of people into a restricted area or into range of a deployment device). Responsive to the NLP indicating a particular process, such as a chemical process, the event condition can be identified as being associated with the particular process.

[0068] At 630, depth estimation is performed responsive to identifying the event condition.

For example, one or more image frames of the input data in which the object is present can be processed to detect depth associated with one or more pixels representative of the object. The depth estimation can be performed to generate a heat map. By performing depth estimation responsive to identifying the event condition (e.g., detecting a person, which may trigger evaluation of whether the person is predicted to enter a restricted area), computational requirements associated with depth estimation can be reduced while still ensuring timely, accurate delivery of insights relating to the event condition. [0069] At 635, pose estimation is performed responsive to identifying the event condition. For example, one or more image frames of the input data in which the object is present can be processed to detect nodes of anatomic features, such as nodes corresponding to hands, elbows, shoulders, or other joints or anatomical features. Representations of appendages, such as vectors, can be assigned between nodes. Responsive to generating the nodes and/or vectors, a pose of the person can be predicted. Tracking can be performed on the poses over multiple image frames (e.g., by comparing positions of pixels associated with nodes and/or vectors) to determine direction (and speed) of movement of the person.

[0070] At 640, an output of an indication of the detection (e.g., detected text; detected sentiment or context from text; detected object) and the event condition can be outputted. The output can include an identifier of the object (e.g., make and model of a weapon or vehicle) and an indication of whether the event condition is satisfied based on the object (e.g., object is a weapon and pointed towards camera; person has been detected and is moving into a restricted area or range of deployment of a deployment device). The output can indicate a sentiment or context of text information, such as one or more entities or documents retrieved by a keyword search, or a particular process identified based on the text information.

[0071] The construction and arrangement of the systems and methods as depicted in the various embodiments are illustrative only. Although only example embodiments have been described in detail in this disclosure, many modifications are possible (e.g., variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters, mounting arrangements, use of materials, colors, orientations, etc.). For example, the position of elements can be reversed or otherwise varied and the nature or number of discrete elements or positions can be altered or varied. Accordingly, such modifications are intended to be included within the scope of the present disclosure. The order or sequence of any process or method steps can be varied or re-sequenced according to various embodiments. Other substitutions, modifications, changes, and omissions can be made in the design, operating conditions and arrangement of the embodiments without departing from the scope of the present disclosure.

[0072] References to “or” may be construed as inclusive so that any terms described using “or” may indicate any of a single, more than one, and all of the described terms. References to at least one of a conjunctive list of terms may be construed as an inclusive OR to indicate any of a single, more than one, and all of the described terms. A reference to “at least one of ‘A’ and ‘B’” can include only ‘A’, only ‘B’, as well as both ‘A’ and ‘B’. Such references used in conjunction with “comprising” or other open terminology can include additional items.

[0073] The present disclosure contemplates methods, systems and program products on any machine-readable media for accomplishing various operations. The embodiments of the present disclosure may be implemented using existing computer processors, or by a special purpose computer processor for an appropriate system, incorporated for this or another purpose, or by a hardwired system. Embodiments within the scope of the present disclosure include program products comprising machine-readable media for carrying or having machine-executable instructions or data structures stored thereon. Such machine-readable media can be any available media that can be accessed by a general purpose or special purpose computer or other machine with a processor. By way of example, such machine-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. Combinations of the above are also included within the scope of machine-readable media. Machine-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing machines to perform a certain function or group of functions.

[0074] Although the figures show a specific order of method steps, the order of the steps may differ from what is depicted. Also two or more steps may be performed concurrently or with partial concurrence. Such variation will depend on the software and hardware systems chosen and on designer choice. All such variations are within the scope of the disclosure. Likewise, software implementations could be accomplished with standard programming techniques with rule based logic and other logic to accomplish the various connection steps, processing steps, comparison steps and decision steps.

Claims

WHAT IS CLAIMED IS:

1. A method of evaluating event conditions based on object detection, comprising: receiving, by one or more processors from at least one sensor, at least one image; detecting, by the one or more processors from the at least one image, an object; determining, by the one or more processors, that the object corresponds to an event condition; causing, by the one or more processors responsive to determining that the object corresponds to the event condition, at least one of a depth estimation of the object or a pose estimation of the object; and outputting, by the one or more processors, an indication of the object and the event condition based on the at least one of the depth estimation or the pose estimation.

2. The method of claim 1, wherein detecting the object comprises: retrieving, by the one or more processors from the at least one image, a first image having a first value of an image parameter; retrieving, by the one or more processors from the at least one image, a second image having a second value of the image parameter; generating, by the one or more processors, a first candidate identifier of the object using the first image; generating, by the one or more processors, a second candidate identifier of the object using the second image; and determining, by the one or more processors, the indication of the object responsive to the first candidate identifier and the second candidate identifier.

3. The method of claim 2, wherein retrieving the second image comprises generating, by the one or more processors, the second image by at least one of reducing a resolution of the first image or applying a spatial filter to the first image.

4. The method of any of claims 1 through 3, wherein the indication, the first candidate identifier, and the second candidate identifier comprise at least one of a make of the object, a model of the object, or a class of the object.

5. The method of claim 3 or claim 4, comprising: comparing, by the one or more processors, a first confidence associated with the first candidate identifier to a first threshold; and outputting, by the one or processors, the indication of the object based on the first candidate identifier responsive to the first confidence satisfying the first threshold.

6. The method of claim 5, comprising: updating, by the one or more processors, a second confidence associated with the second candidate identifier based on a third image detected subsequent to the first image; and validating, by the one or more processors, the indication of the object responsive to the second confidence satisfying a second threshold; or modifying, by the one or more processors, the indication of the object responsive to the second confidence not satisfying the second threshold.

7. The method of any of claims 1 through 6, wherein the event condition corresponds to an event associated with at least one of an identifier of the object or a class of the object.

8. The method of any of claims 1 through 7, wherein causing the depth estimation comprises at least one of comparing a first image of the at least one image with a second image of the at least one image, the first image received from a first sensor of the at least one sensor, the second image received from a second sensor of the at least one sensor.

9. The method of any of claims 1 through 8, wherein performing the pose estimation comprises providing image data representing the object to one or more models trained using training samples labeled with pose information.

10. The method of any of claims 1 through 9, wherein performing the pose estimation comprises identifying one or more nodes of the object from the at least one image, detecting at least one vector between the one or more nodes, the at least one vector corresponding to an appendage, and assigning a pose to the object based on the at least one vector.

11. The method of any of claims 1 through 10, comprising detecting, by the one or more processors from the image, one or more characters representing text information; and at least one of outputting an indication of the text information or evaluating the event condition based on the text information.

12. The method of any of claims 1 through 11, comprising: applying, by the one or more processors, natural language processing to text information detected from the at least one image to at least one of detect a sentiment of the text information or evaluate the event condition based on the text information.

13. The method of any of claims 1 through 12, comprising: performing, by the one or more processors, optical character recognition to detect text information from the at least one image and natural language processing on the text information to detect a sentiment of the text information.

14. The method of any of claims 1 through 13, wherein performing the pose estimation comprises performing a plurality of pose estimation passes.

15. The method of any of claims 1 through 14, further comprising evaluating the event condition by providing, by the one or more processors, a pose from pose estimation of the object as input to a model representing criteria for operating a device based on the pose.

16. A system, comprising: one or more processors configured to: receive, from at least one sensor, at least one image; detect, from the at least one image, an object; determine that the object corresponds to an event condition; cause, responsive to determining that the object corresponds to the event condition, at least one of a depth estimation of the object or a pose estimation of the object; and output an indication of the object and the event condition based on the at least one of the depth estimation or the pose estimation.

17. The system of claim 16, wherein the one or more processors are configured to detect the object by: retrieving, from the at least one image, a first image having a first value of an image parameter; retrieving, from the at least one image, a second image having a second value of the image parameter; generating a first candidate identifier of the object using the first image; generating a second candidate identifier of the object using the second image; and determining the indication of the object responsive to the first candidate identifier and the second candidate identifier.

18. The system of claim 17, wherein the one or more processors are configured to retrieve the second image by generating the second image by at least one of reducing a resolution of the first image or applying a spatial filter to the first image.

19. The system of any of claims 15 through 17, wherein the indication, the first candidate identifier, and the second candidate identifier comprise at least one of a make of the object, a model of the object, or a class of the object.

20. The system of claim 18 or claim 19, wherein the one or more processors are configured to: compare a first confidence associated with the first candidate identifier to a first threshold; and output the indication of the object based on the first candidate identifier responsive to the first confidence satisfying the first threshold.

21. The system of claim 20, wherein the one or more processors are configured to: update a second confidence associated with the second candidate identifier based on a third image detected subsequent to the first image; and validate the indication of the object responsive to the second confidence satisfying a second threshold; or modify the indication of the object responsive to the second confidence not satisfying the second threshold.

22. The system of any of claims 16 through 21, wherein the event condition corresponds to an event associated with at least one of an identifier of the object or a class of the object.

23. The system of any of claims 16 through 22, wherein the one or more processors are configured to cause the depth estimation by at least one of comparing a first image of the at least one image with a second image of the at least one image, the first image received from a first sensor of the at least one sensor, the second image received from a second sensor of the at least one sensor.

24. The system of any of claims 16 through 23, wherein the one or more processors are configured to perform the pose estimation by providing image data representing the object to one or more models trained using training samples labeled with pose information.

25. The system of any of claims 16 through 24, wherein the one or more processors are configured to perform the pose estimation by identifying one or more nodes of the object from the at least one image, detecting at least one vector between the one or more nodes, the at least one vector corresponding to an appendage, and assigning a pose to the object based on the at least one vector.

26. The system of any of claims 16 through 25, wherein the one or more processors are configured to: detect, from the at least one image, one or more characters representing text information; and at least one of output an indication of the text information or evaluate the event condition based on the text information.

27. The system of any of claims 16 through 26, wherein the one or more processors are configured to: apply natural language processing to text information detected from the at least one image to at least one of detect a sentiment of the text information or evaluate the event condition based on the text information.

28. The system of any of claims 16 through 27, wherein the one or more processors are configured to: perform optical character recognition to detect text information from the at least one image and natural language processing on the text information to detect a sentiment of the text information.

29. The system of any of claims 16 through 28, wherein the one or more processors are configured to perform the pose estimation by performing a plurality of pose estimation passes.

30. The system of any of claims 16 through 29, wherein the one or more processors are configured to evaluate the event condition by providing a pose from pose estimation of the object as input to a model representing criteria for operating a device based on the pose.

31. A method of generating sentiment from text data, comprising: detecting, by one or more processors, a plurality of characters from at least one image; and applying, by the one or more processors responsive to detecting the plurality of characters, a natural language detector to the plurality of characters to generate a sentiment from the at least one image.

32. A system, comprising: one or more processors configured to: detect a plurality of characters from at least one image; and apply, responsive to detecting the plurality of characters, a natural language detector to the plurality of characters to generate a sentiment from the at least one image.