WO2025057684A1

WO2025057684A1 - Image processing system and information processing system

Info

Publication number: WO2025057684A1
Application number: PCT/JP2024/029759
Authority: WO
Inventors: Genta MATSUKAWA; Toshinari AKASE; Kiyohiko IWAMURA; Ryota Murakami
Original assignee: Sony Semiconductor Solutions Corp
Current assignee: Sony Semiconductor Solutions Corp
Priority date: 2023-09-12
Filing date: 2024-08-22
Publication date: 2025-03-20
Anticipated expiration: 2026-03-12
Also published as: JP2025040847A

Abstract

An image processing system, comprising an image sensor configured to capture image data; and circuitry configured to acquire the image data captured by the image sensor; extract, by a trained machine learning model, feature data from the image data, the trained machine learning model having a plurality of intermediate layers; combine the feature data from two or more of the plurality of intermediate layers; and transmit the combined feature data.

Description

IMAGE PROCESSING SYSTEM AND INFORMATION PROCESSING SYSTEM

The present technique relates to an imaging device, a transmission method, an information processing device, an information processing method, and a feature distribution method, and more particularly to technique of improving inference performance when using an AI model that has been subjected to a compression process such as quantization.

As described in PTL 1 listed below, for example, an imaging device that performs an inference process on a captured image by using an AI (Artificial Intelligence) model are known.
This type of imaging device need to perform an inference process with limited resources, because of which it is required to apply a compression process such as quantization and pruning to the AI model.

However, applying a compression process to the AI model results in a degradation of inference accuracy or a loss of versatility (robustness of the inference performance with regard to variations in the operating environment).

To address this issue, PTL 2 listed below proposes an approach in which an AI model that implements a predetermined inference task is divided into an upstream model that extracts features from captured images and a downstream model that performs an inference process on the basis of the extracted features, with the upstream model alone being subjected to a compression process so as to be implemented in the imaging device, while the downstream model is implemented in a device downstream of the imaging device.
This approach allows the use of an uncompressed model as the downstream model, so that degradation of the inference performance can be suppressed. While it would be possible to implement a single uncompressed model that performs processing of extracting features as well as obtaining inference results in the downstream device to perform an inference process on captured images, the above approach eliminates the need to transmit captured images from the imaging device to the downstream device, and allows only the smaller size feature data to be transmitted. Thus the amount of data that needs to be transmitted to the downstream device for implementing the inference process can be reduced.

WO 2023/067952 WO 2022/269884

Summary

However, due to the compression process applied to the upstream model in the approach according to PTL 2, the feature data, on which the downstream model is based, tends to be lacking in information. A lack of information in the feature data to be input to the downstream model will result in a degradation of the inference performance of the downstream model (degradation of inference accuracy or versatility).

The present technique has been made in view of the circumstances described above, with an aim to improve the inference performance when a downstream model performs inference on the basis of features extracted by an upstream feature extraction model, while allowing the use of a compressed model as the feature extraction model.

An information processing system, comprising a controller; and an image sensor configured to capture image data, the image sensor including first circuitry configured to acquire the image data captured by the image sensor, extract, by a trained machine learning model, feature data from the image data, the trained machine learning model having a plurality of intermediate layers, combine the feature data from two or more of the plurality of intermediate layers, and
transmit the combined feature data to the controller.

An information processing system, comprising a controller; an image sensor configured to capture image data, the image sensor including first circuitry configured to acquire the image data captured by the image sensor, extract, by a trained machine learning model, feature data from the image data, the trained machine learning model having a plurality of intermediate layers, combine the feature data from two or more of the plurality of intermediate layers, and
transmit the combined feature data to the controller; and one or more electronic devices, wherein the controller includes second circuitry configured to
distribute the combined feature data received from the image sensor to the one or more electronic devices.

Fig. 1 is a block diagram illustrating a schematic configuration example of an inference system built with an imaging device and an information processing device as an embodiment. Fig. 2 is a block diagram illustrating a configuration example of the imaging device as an embodiment. Fig. 3 is a block diagram illustrating a configuration example of the information processing device as an embodiment. Fig. 4 is an explanatory diagram of a configuration example for implementing an inference technique as an embodiment. Fig. 5 is a diagram for explaining a first-half layer region and a second-half layer region in an intermediate layer region of a feature extraction model. Fig. 6 is a diagram for explaining a configuration example for implementing the inference technique in a first application example. Fig. 7 is a diagram illustrating an example of a heat map image. Fig. 8 is a diagram for explaining a configuration example for implementing the inference technique in a second application example. Fig. 9 is an explanatory diagram of an overlapping process in the second application example. Fig. 10 is an explanatory diagram illustrating a configuration example of a feature distribution system.

The embodiments according to the present technique will be described below in the following order with reference to the accompanying drawings.
<1. Inference System as One Embodiment>
1-1. System Overview
1-2. Configuration Example of Imaging Device
1-3. Configuration Example of Information Processing Device
<2. Inference Technique as One Embodiment>
<3. Application Examples>
3-1. First Application Example
3-2. Second Application Example
<4. Variation Examples>
<5. Summary of Embodiments>
<6. Present Technique>

<1. Inference System as One Embodiment>
1-1. System Overview
Fig. 1 is a block diagram illustrating a schematic configuration example of an inference system 100 built with an imaging device and an information processing device as one embodiment according to the present technique.
As shown, the inference system 100 is built with an imaging device 1 and an information processing device 2. In this example, these imaging device 1 and information processing device 2 are configured to be capable of wired or wireless data communications.
In the inference system 100, the imaging device 1 is one embodiment of an imaging device according to the present technique, and the information processing device 2 is one embodiment of an information processing device according to the present technique.

The imaging device 1 has an imaging unit (imaging unit 41 to be described later), which includes a plurality of pixels each having a light-receiving element and which obtains captured images, and performs imaging of objects.

gImaging h herein broadly means acquisition of image data that captures an object. Image data here generally refers to data consisting of plural sets of pixel data. Pixel data is not only data that indicates the amount of light received from an object by gray scale values of a predetermined number of levels, but can also be various forms of data relating to the object including, for example, data that indicates distances to the object, data that indicates polarization information, data that indicates temperatures, and so on.
In other words, gimage data h obtained by gimaging h includes data as a gray scale image indicating the gray scale values of light received in respective pixels, data as a distance image indicating information of distances to the object from respective pixels, data as a polarization image indicating polarization information of each pixel, data as a thermal image indicating temperature information of each pixel, and so on.
The gimage data h can also include an event image obtained by a so-called EVS (Event-based Vision Sensor). The event here means a change by a certain amount or more in the amount of received light, and an gevent image h is an image indicating information as to whether or not an event occurred in each pixel.
The sensor for the imaging included in the imaging device 1 will be described below as being configured as a gray scale sensor that obtains gray scale images, as one example, similarly to common image sensors. To name but few, CCD (Charge Coupled Device) image sensors and CMOS (Complementary Metal Oxide Semiconductor) image sensors are examples of gray scale sensors.

The information processing device 2 is configured as an information processing device equipped with a microcomputer including a CPU (Central Processing Unit), a ROM (Read Only Memory), and a RAM (Random Access Memory).

The inference system 100 in this embodiment here performs an inference process targeting images captured by the imaging device 1 (as input images) using an AI (Artificial Intelligence) model. The inference system 100 may find use in various monitoring applications, for example, of an object. Examples of monitoring applications include indoor monitoring of stores, offices, residences and so on, outdoor surveillance of parking lots, streets and so on (including traffic monitoring), production line monitoring in FA (Factory Automation) or IA (Industrial Automation), vehicle interior monitoring, and so on.

For a monitoring application in a store, for example, a plurality of imaging devices 1 may be installed at respective predetermined locations in the store, to be able to determine the customer demographics (such as gender and age), their behavior (lines of movement) in the store, etc., using an inference function of an AI model. For a traffic monitoring application, imaging devices 1 may be installed at respective locations near a street, to be able to determine (or acquire information such as) the numbers (vehicle numbers), colors, and types, etc., of passing vehicles, using an inference function of an AI model. Imaging devices 1 may also be installed in a parking lot such as to be able to monitor each of the parked vehicles for surveillance of a suspicious individual performing a suspicious act around any of the vehicles. If suspicious individuals are detected, their presence may be notified, or their attributes (such as gender, age, clothing) may be determined and reported.

In FA or IA applications, workers f movements can be monitored, or anomalies in products can be detected.

One option here is to adopt a configuration in which a downstream device (computer device) with more abundant resources than the imaging device receives captured images from the imaging device and performs an inference process on the captured images. However, such a configuration would increase the amount of data communication between the imaging device and the downstream device for implementing the inference process. Moreover, in applications where the target object is human or where vehicle numbers are to be determined, such a configuration would be more prone to image data leakage, which could lead to a higher risk of privacy infringement.
If the imaging device were to perform the inference process, there would be no need to transmit image data to the downstream device, so that the risk of image data leakage could be avoided. Moreover, the amount of data communication could be reduced, because it would only be necessary to transmit inference result data, which is of a smaller size than images.

However, given the limited resources the imaging device has as mentioned above, the AI model implemented in the imaging device would still require a compression process such as quantization or pruning, which would make it difficult to enhance the inference performance. In other words, it would make it difficult to improve the inference accuracy or versatility (robustness of the inference performance to variations in the operating environment).

Therefore, similarly to PTL 2 mentioned above, this embodiment adopts the following technique, i.e., an AI model that performs a predetermined inference task is divided into an upstream model that extracts features from a captured image (hereinafter referred to as gupstream model 1a h), and a downstream model that performs an inference process based on the extracted features (hereinafter referred to as gdownstream model 2a h). A compression process is applied only to the upstream model 1a that is to be implemented in the imaging device 1. The downstream model 2a is implemented in a device downstream of the imaging device 1, i.e., the information processing device 2.

This approach allows an uncompressed model to be used as the downstream model 2a, minimizing the degradation in inference performance. The above configuration eliminates the need to transmit captured images from the imaging device 1 to the information processing device 2 and allows only the smaller size feature data to be transmitted, thus enabling a reduction in the amount of data that needs to be transmitted to the information processing device 2 for implementing the inference process. Further, the risk of image data leakage can be avoided, for better protection of privacy.

1-2. Configuration Example of Imaging Device
Fig. 2 is a block diagram illustrating a configuration example of the imaging device 1.
As shown, the imaging device 1 includes an image sensor 10, as well as an imaging optical system 11, an optical system driver 12, a controller 13, a memory unit 14, a communication unit 15, and a sensor unit 16. The image sensor 10, controller 13, memory unit 14, communication unit 15, and sensor unit 16 are connected via a bus 17, to be capable of mutual data communications.

The imaging optical system 11 includes lenses such as a cover lens, a zoom lens, and a focus lens, and an iris mechanism. The imaging optical system 11 guides and collects the light from the object (incident light) to a light-receiving surface of the image sensor 10.

The optical system driver 12 collectively illustrates the respective drivers of the zoom lens and focus lens in the imaging optical system 11, and of the iris mechanism. Specifically, the optical system driver 12 includes respective actuators for driving these zoom lens, focus lens, and iris mechanism, and drive circuits for the actuators.

The controller 13 is configured with a microcomputer that has a CPU, a ROM, and a RAM, for example, and performs overall control of the imaging device 1 by the CPU executing various types of processing in accordance with a program stored in the ROM, or a program loaded to the RAM.

The controller 13 gives instructions to the optical system driver 12 to drive the zoom lens, focus lens, iris mechanism, and so on. In accordance with these instructions to drive these parts, the optical system driver 12 actually moves the focus lens or the zoom lens, and opens or closes the blades of the iris mechanism.

The controller 13 also manages the writing and reading of various types of data to and from the memory unit 14.
The memory unit 14 is a non-volatile memory device such as an HDD (Hard Disk Drive) or flash memory device, for example, and used for storage of data that is used by the controller 13 in executing various operations. The memory unit 14 can also be used as a storage (recording) destination of image data output from the image sensor 10.

The controller 13 performs various data communications with external devices through the communication unit 15. The communication unit 15 in this example is configured to be able to perform data communication at least with the information processing device 2.

The sensor unit 16 collectively represents other sensors than the image sensor 10 in the imaging device 1. Examples of the sensors included in the sensor unit 16 may include GNSS (Global Navigation Satellite System) sensors or altitude sensors for detecting the location or altitude of the imaging device 1, temperature sensors for detecting the ambient temperature, and motion sensors such as acceleration sensors or angular rate sensors for detecting the motion of the imaging device 1, for example.

The image sensor 10 is configured as a solid-state image sensor such as a CCD or CMOS, for example, and includes an imaging unit 41, an image signal processor 42, an in-sensor controller 43, a feature extractor 44, a memory unit 45, a computer vision processor 46, and a communication interface (I/F) 47 as illustrated, which are capable of mutual data communications via a bus 48.

The imaging unit 41 includes a pixel array unit with a two-dimensional array of pixels, each containing a photoelectric converter such as a photodiode, and a read-out circuit that reads out electrical signals obtained by photoelectric conversion from each of the pixels in the pixel array unit.
This read-out circuit executes a CDS (Correlated Double Sampling) process and an AGC (Automatic Gain Control) process, for example, to the electrical signals obtained by photoelectric conversion, and also performs an A/D (Analog/Digital) conversion process.

The image signal processor 42 performs preprocessing, a synchronization process, a YC generation process, a resolution conversion process, etc., on the captured image signal as digital data after the A/D conversion process. In the preprocessing, a clamping process in which the black level of R (red), G (green), and B (blue) is clamped to a predetermined level, and a correction process for each of the RGB channels are performed on the captured image signals. In the synchronization process, a color separation process is performed so that image data in each pixel contains all of the RGB color components. In the case of using a Bayer-pattern color filter, for example, demosaicing is performed as a color separation process. In the YC generation process, luminance (Y) signals and color (C) signals are generated (separated) from the RGB image data. In the resolution conversion process, a resolution conversion process is performed to the image data that has been subjected to various signal processing operations.

The in-sensor controller 43 is configured with a microcomputer that is equipped with a CPU, a ROM, and a RAM, for example, and controls the overall operation of the image sensor 10. For example, the in-sensor controller 43 gives instructions to the imaging unit 41 and controls execution of the imaging operation. The in-sensor controller also controls the execution of operations by the image signal processor 42.

The feature extractor 44 is configured with a CPU and a programmable arithmetic processing unit such as an FPGA (Field Programmable Gate Array) or DSP (Digital Signal Processor), for example, and extracts features, using an AI model, from a target that is an image captured by the imaging unit 41. The AI model (feature extraction model) used by the feature extractor 44 for feature extraction has a neural network architecture including a plurality of intermediate layers. Specifically, the model has a network architecture designed as a DNN (Deep Neural Network).
The feature extraction model used by the feature extractor 44 corresponds to the upstream model 1a mentioned above, which is an AI model to which a compression process that at least includes quantization has been applied. One example of quantization is a technique that converts data types from floating-point numbers such as float32 to integer numbers such as uint8 to reduce the amount of information to be handled.

For the sake of clarity, the feature extraction model here is an AI model trained by machine learning to be able to extract features from an input image.

The memory unit 45 is configured by a volatile memory, and used to retain (temporarily store) the data required for the feature extractor 44 to perform the feature extraction process. Specifically, the memory unit is used for retaining information that is necessary for building an AI model as the feature extraction model. This gnecessary information h includes filter coefficients used when the feature extraction model performs a convolution operation, and data indicating the network architecture of the neural network.
Hereinafter, data containing this gnecessary information h will be referred to as gAI model data h in the sense that it is information used for building an AI model.

The memory unit 45 can also be used for retaining data used by the feature extractor 44 in the feature extraction process, and for retaining captured image data that has been processed in the image signal processor 42.

The computer vision processor 46 applies rule-based image processing to the captured image data. Super-resolution can be named as one example of the rule-based image processing here.

The communication interface 47 is an interface that performs communication with various units outside the image sensor 10, such as the controller 13 and memory unit 14 that are connected via the bus 17. The communication interface 47 performs communication, for example, for acquiring data from outside, based on the control by the in-sensor controller 43, to implement the AI model used by the feature extractor 44.
The communication interface 47 also allows the feature data extracted by the feature extractor 44 to be output from the image sensor 10.

1-3. Configuration Example of Information Processing Device
Fig. 3 is a block diagram illustrating a configuration example of the information processing device 2 shown in Fig. 1. As shown, the information processing device 2 includes a CPU 21. The CPU 21 functions as an arithmetic processing unit that performs various operations, and executes various types of processing in accordance with a program stored in a ROM 22, or a program loaded to a RAM 23 from a memory unit 29.
The RAM 23 also stores data necessary for the CPU 21 to execute various operations as required.
The CPU 21, ROM 22, and RAM 23 are connected to each other via a bus 33.

An inference processor 24 is connected to the bus 33.
The inference processor 24 is configured with a CPU and a programmable arithmetic processing unit such as an FPGA or DSP, for example, and performs a predetermined inference process using an AI model based on features extracted by the feature extraction model as the upstream model 1a mentioned above as input data.
The AI model the inference processor 24 uses for the inference process corresponds to the downstream model 2a mentioned above.

Specific examples of the inference process the inference processor 24 performs using an AI model that is the downstream model 2a here include: an object detection process that specifies bounding boxes and categories (classes) of objects such as the one known as an SSD (Single Shot Detector); a classification process that classifies humans, for example, as objects based on some attributes such as age group or gender, or that classifies motion patterns; and an anomaly detection process that is performed using a PatchCore model or the like based on anomaly scores of the objects.

An AI model that has been trained by machine learning to be able to perform one of these inference processes, for example, is used as the downstream model 2a.

To the bus 33 of the information processing device 2 is also connected an input/output interface (I/F) 25.
An input unit 26 including an operator and an operating device is connected to the input/output interface 25. Possible examples of the input unit 26 include various operators and operating devices such as a keyboard, mouse, key, dial, touchscreen, touch pad, remote controller, and so on.
An operation is detected by the input unit 26, and the signal in accordance with the detected operation is interpreted by the CPU 21.

To the input/output interface 25 are also connected, either integrally or as separate parts, a display unit 27 composed of an LCD (Liquid Crystal Display) or organic EL (Electro-Luminescence) panel, and an audio output unit 28 including a speaker or the like.
The display unit 27 is used for displaying various types of information, and configured as a display device provided in the housing of the information processing device 2, for example, or as a separate display device connected to the information processing device 2.

The display unit 27 displays various types of information on a display screen based on instructions from the CPU 21. For example, the display unit 27 displays various operation menus, icons, messages, etc., i.e., provides a GUI (Graphical User Interface) based on instructions from the CPU 21. The display unit 27 also displays images as specified by a user operation, for example, based on instructions from the CPU 21.

To the input/output interface 25 may also be connected the memory unit 29 composed of an HDD or a solid-state memory, and a communication unit 30 composed of a modem, as the case may be.

The communication unit 30 performs the processing for communication via a transmission line such as Internet, wired/wireless communication with various types of equipment, and communication via buses. The communication unit 30 in this embodiment, in particular, is configured to be able to perform data communication with the imaging device 1.

To the input/output interface 25 is also connected a drive 31 as required, for a removable recording medium 32 such as a magnetic disc, optical disc, magneto-optical disc, or a semiconductor memory to be mounted as needed.

The drive 31 allows data files such as programs to be used for various operations to be read from a removable recording medium 32. The read data files are stored in the memory unit 29, or images and audio contained in the data files are output by the display unit 27 or the audio output unit 28. Computer programs and the like read from the removable recording medium 32 are installed in the memory unit 29 as required.

With the information processing device 2 having the hardware configuration described above, software for the processing according to this embodiment, for example, can be installed through network communication by the communication unit 30 or via a removable recording medium 32. Alternatively, such software may be stored in the ROM 22 or memory unit 29 in advance.

The information processing device 2 is not limited to the configuration with a single computer device as shown in Fig. 3 but may adopt a configuration in which plural computer devices are integrated as a system. The plurality of computer devices may be integrated into a system through a LAN (Local Area Network) or the like, or may each be located in remote areas and connected via a VPN (Virtual Private Network) or the like, using Internet, for example. The plurality of computer devices may include those as a group of servers (cloud) made available by cloud computing services.

<2. Inference Technique as One Embodiment>
Now, this embodiment adopts the technique in which an AI model that performs a predetermined inference task is divided into an upstream model 1a that extracts features from a captured image, and a downstream model 2a that performs an inference process based on the extracted features, with a compression process being applied only to the upstream model 1a. With this approach, however, the feature data on which the downstream model 2a bases its inference tends to be lacking in information, which could result in a degradation in the inference performance of the downstream model 2a.

Therefore, this embodiment adopts a technique in which the downstream model 2a is provided with the features of a plurality of intermediate layers in the feature extraction model as input data, to compensate for the lack of information in the feature data that is the data input to the downstream model 2a.

Fig. 4 is an explanatory diagram of a configuration example for implementing the inference technique as one embodiment. In the imaging device 1, the feature extractor 44 performs a feature extraction process targeting the images captured by the imaging unit 41 as input data, using the upstream model 1a that is a feature extraction model.

The imaging device 1 of this embodiment carries out a process of transmitting the feature data acquired from a plurality of intermediate layers in the upstream model 1a during the execution of this feature extraction process to the information processing device 2 via the communication unit 15.
Specifically, in this example, the in-sensor controller 43 acquires the feature data that is obtained from a plurality of intermediate layers in the upstream model 1a, and combines the acquired feature data to generate one set of feature data Dc.

In this example here, the feature extraction model as the upstream model 1a includes four or more intermediate layers. Specifically, in this example, the feature extraction model includes six intermediate layers L1 to L6 downstream of the input layer Li.
In this example, the plural intermediate layers, from which the features are to be transmitted to the information processing device 2 as the downstream device (sources of the features), are intermediate layers in a first-half layer region of the intermediate layer region of the feature extraction model.

Referring now to Fig. 5, the first-half layer region and second-half layer region in the intermediate layer region of the feature extraction model will be explained.
Fig. 5A and Fig. 5B respectively show a first-half layer region and a second-half layer region in a six-layer intermediate layer region and a five-layer intermediate layer region of a feature extraction model.
When the total number of intermediate layers is an even number of four or more, the intermediate layers in the intermediate layer region can be split into a first half and a second half as shown in Fig. 5A. In this case, the equally divided first half region and the second half region of the intermediate layers correspond to the first-half layer region and second-half layer region, respectively.
On the other hand, when the total number of intermediate layers is an odd number of four or more, the intermediate layers in the intermediate layer region cannot be split into a first half and a second half, as shown in Fig. 5B. In this case, the intermediate layer located at the center of the intermediate layer region is considered as the divider between the first half and the second half, and the intermediate layer region before the center intermediate layer is defined as the first-half layer region, and the intermediate layer region after the center intermediate layer is defined as the second-half layer region.

The features extracted from preceding (upper) intermediate layers in a feature extraction model tend to contain more information.
Therefore, by using the intermediate layers in the first-half layer region as the intermediate layers from which the features are to be transmitted, the amount of information in the feature data to be transmitted to the downstream model can be augmented, which allows the inference performance to be improved.

In this example, the plural intermediate layers, from which the features are to be transmitted to the information processing device 2, are continuous intermediate layers (see Fig. 4).

As described above, the features extracted from upper intermediate layers tend to contain more information. Therefore, by using a plurality of continuous intermediate layers as the intermediate layers from which the features are to be transmitted as described above, it is possible to compensate for the lack of information in the feature data caused by compression as much as possible.
Accordingly, the amount of information in the feature data to be transmitted to the downstream model can be augmented, which allows the inference performance to be improved.

Furthermore, in this example, the plural intermediate layers, from which the features are to be transmitted to the information processing device 2, are two intermediate layers.
Specifically, in this example, the features obtained from uppermost located intermediate layer L1 and intermediate layer L2 are transmitted to the information processing device 2.
By using two target intermediate layers, the amount of data transmitted to the information processing device 2 can be reduced as much as possible, in the case of adopting the configuration of transmitting the features of a plurality of intermediate layers to compensate for the lack of information caused by compression.

In Fig. 4, the in-sensor controller 43 combines the feature data acquired from a plurality of intermediate layers to generate one set of feature data Dc as described above. In a feature extraction model with DNN, the lower the intermediate layer, the lower the resolution. Therefore, the data combination in this case is carried out with the resolutions of the feature data of the respective intermediate layers being made uniform. Specifically, in this example, the resolutions are first matched by increasing the feature data of the intermediate layer L2 with a lower resolution, before combining it with the feature data of the intermediate layer L1 with a higher resolution, to generate one set of feature data Dc.

Such matching of resolutions in combining the feature data acquired from a plurality of intermediate layers is only one example. Other combination techniques may be adopted, such as combining the feature data as is.

In the imaging device 1, the in-sensor controller 43 transmits the generated feature data Dc to the controller 13 outside the image sensor 10.

The controller 13 has the function as a transmission processor F1.
The transmission processor F1 performs the process of transmitting feature data Dc (i.e., features of a plurality of intermediate layers) received from the in-sensor controller 43 to an external device.
Specifically, the transmission processor F1 performs a process of causing the communication unit 15 to transmit the feature data Dc to the information processing device 2.

In the information processing device 2, the communication unit 30 receives the feature data Dc transmitted from the imaging device 1 as described above. Namely, the communication unit 30 functions as a receiving unit that receives features of a plurality of intermediate layers in the feature extraction model as the upstream model 1a.

In the information processing device 2, the inference processor 24 uses the feature data Dc thus received by the communication unit 30 as input data, and performs the inference process using the downstream model 2a.

Now, as described above, this embodiment adopts the configuration in which the imaging device 1 transmits the features of the plurality of intermediate layers in the feature extraction model as the upstream model 1a to the information processing device 2. Increasing the number of intermediate layers from which the features are acquired does not only provide the simple effect of adding up the amount of information, but also includes the following factor that contributes to the compensation for the lack of information in the features input to the downstream model 2a. Namely, making the number of feature source intermediate layers plural allows a set of weighted features extracted from an intermediate layer (features from one perspective) as well as another set of weighted features extracted from another intermediate layer (features from another perspective) to be given to the downstream model 2a as input data. The lack of information in the features is compensated for in this sense, too, that a feature from another perspective can also be added.

<3. Application Examples>
3-1. First Application Example
Next, specific application examples of the inference system 100 as one embodiment will be described. A first application example will be described first with reference to Fig. 6 and Fig. 7.

Fig. 6 is a diagram for explaining a configuration example for implementing the inference technique in a first application example. The first application example envisages an application for monitoring purposes in FA (IA). Specifically, in this application example, a product anomaly detection process is performed as the inference process.

In this case, an AI model for anomaly detection is implemented as a downstream model 2a in the inference processor 24 of the information processing device 2. Specifically, this example will show one case where a PatchCore model 50 is used as the AI model for anomaly detection.

As shown, the PatchCore model 50 includes the functions as a mapping unit 51 and a nearest neighbor search unit 52.
The mapping unit 51 performs feature mapping. Specifically, in this example, the feature data Dc received from the imaging device 1 is used as input data and subjected to feature mapping. The nearest neighbor search unit 52 carries out nearest neighbor search to calculate an anomaly score for each patch (block of a predetermined number of pixels) of the feature map of a detection target product provided by the mapping unit 51, based on a feature map of a normal product prepared in advance. The anomaly score here refers to an evaluation score indicative of the degree of anomaly relative to a normal product as the reference.

The inference processor 24 in this case includes the functions as an anomaly score calculator 53 and a heat map generator 54.
The anomaly score calculator 53 calculates an overall anomaly score indicative of the overall degree of anomaly of the detection target product based on the anomaly score of each patch output by the nearest neighbor search unit 52. The heat map generator 54 generates a heat map image showing the distribution of anomaly scores of the detection target product based on the anomaly score of each patch output by the nearest neighbor search unit 52. For example, an image showing a distribution of anomaly scores by differences in color or luminance, as shown in Fig. 7, may be generated as a heat map image.

The CPU 21 in the information processing device 2 can determine whether or not an anomalous product has been detected based on the overall anomaly score calculated by the anomaly score calculator 53, and can perform a process of notifying the user when an anomalous product is detected. Examples of such processing include displaying predetermined information on the display unit 27, or transmitting notification information to a predetermined external device.

The CPU 21 causes the display unit 27 to display the heat map image of the detection target product generated by the heat map generator 54 for presentation to the user such as an operator or manager.
Presenting the heat map image to the user can assist the user in determining the cause of the anomaly when an anomalous product is detected. Namely, the time required for determining the cause can be shortened. This is preferable for example in the case where the production line is designed to stop when an anomalous product is detected until an event that is causing the anomaly is resolved, because the time during which the line is halted can be shortened.

When using the PatchCore model 50, an existing trained model that has been trained by machine learning with ImageNet training datasets, for example, can be used as the feature extraction model as the upstream model 1a.
For the sake of clarity, the PatchCore model 50 in this embodiment is trained by machine learning (including the process of preparing the feature maps of normal products), using the feature data Dc acquired from a plurality of intermediate layers in the upstream model 1a as the input data for the training.

It should be noted here that it is difficult to train an AI model used for an inference process such as to be versatile in any and all operating environments of the imaging device 1, and sometimes a degradation can happen in the inference accuracy due to a variation in the operating environment. In the case where a process of detecting anomaly in objects using the PatchCore model 50 or the like, in particular, when for example the production line that is the target of imaging is processing products that come from nature such as vegetables (e.g., peeling onions), an event involving a large drop in the inference performance for an unknown reason has been confirmed to occur.

Accordingly, this example adopts a technique in which the upstream model 1a (AI model) used by the feature extractor 44 is switched to another one based on the inference result provided by the inference processor 24.
Specifically, in this example, the switching of upstream models 1a is performed between a plurality of AI models with different network architectures. More specifically, in this example, the upstream models 1a are a feature extraction model that has the MobileNet network architecture, and a feature extraction model that has the RepGhostNet network architecture, and are switched from one to another.

To implement such switching of the upstream models 1a, in the information processing device 2 of the first application example, the CPU 21 has the function as a switching instruction processor F3. The in-sensor controller 43 of the imaging device 1 has the function as a first switching processor F2.

In the information processing device 2, the switching instruction processor F3 gives an instruction for switching the upstream models 1a (feature extraction models) in the imaging device 1 based on an inference result provided by the inference processor 24. Here, whether the feature extraction models are to be switched is determined based on an evaluation index of the accuracy of the inference process. Specifically, the switching instruction processor F3 in this example determines the proportion of regions where the anomaly score is a predetermined value or more, specifically, the proportion of pixels where the anomaly score is a predetermined value or more in the entire image, as an accuracy evaluation score based on the heat map image generated by the heat map generator 54, and determines whether or not the upstream models 1a are to be switched based on this accuracy evaluation score.
The switching instruction processor F3, when the determination result affirms that the upstream models 1a should be switched, gives an instruction for switching the upstream models 1a to the imaging device 1 via the communication unit 30. When the determination result denies that the upstream models 1a should be switched, the switching instruction processor F3 does not give the switching instruction.

While the accuracy evaluation score is calculated based on a heat map image as one example above, various other techniques are possible for calculating the accuracy evaluation score. Other calculation techniques may be adopted, such as, for example, calculating the number or rate of detection per unit time in the detection of anomalous products based on the overall anomaly score mentioned above, as the accuracy evaluation score.

In the imaging device 1, the memory unit 45 stores first model data MD1 and second model data MD2 to allow the switching of upstream models 1a. The first model data MD1 is AI model data for building a feature extraction model according to a predetermined network architecture, and the second model data MD2 is AI model data for building a feature extraction model according to another network architecture. Specifically, in this example, the first model data MD1 is AI model data for building a feature extraction model with one of MobileNet and RepGhostNet, and the second model data MD2 is AI model data for building a feature extraction model with the other one of MobileNet and RepGhostNet.

In the in-sensor controller 43, the first switching processor F2 performs the process of switching feature extraction models used by the feature extractor 44. In this example, the switching instruction given by the switching instruction processor F3 is input to the in-sensor controller 43 via the communication unit 15 and controller 13 of the imaging device 1, and the first switching processor F2 performs the process of switching the feature extraction models used by the feature extractor 44 in accordance with this input switching instruction.

Specifically, as the process of switching the feature extraction models, the switching instruction processor F3 performs, when the switching instruction is given while the feature extractor 44 is performing a feature extraction process using a feature extraction model based on the first model data MD1, a setting process for setting the feature extractor 44 such as to build a feature extraction model with another network architecture based on the second model data MD2 stored in the memory unit 45. When, on the other hand, the switching instruction is given while the feature extractor 44 is performing a feature extraction process using a feature extraction model based on the second model data MD2, the first switching processor F2 performs a setting process for setting the feature extractor 44 such as to build a feature extraction model with another network architecture based on the first model data MD1 stored in the memory unit 45.
The feature extraction model being used by the feature extractor 44 is thus switched to an AI model having another network architecture in response to the switching instruction from the information processing device 2.

The switching of the feature extraction models as described above allows the selective use of a feature extraction model in accordance with the operating environment of the imaging device 1, which helps to improve the inference accuracy and versatility, i.e., to improve the inference performance.

While MobileNet and RepGhostNet were named as examples of feature extraction models having different network architecture above, these are merely examples given for illustrative purposes. Other feature extraction models such as ShuffleNet, for example, can also be used as one of the switching target feature extraction models.

Furthermore, as an alternative to the example above wherein the first switching processor F2 switches AI models having different network architectures from one to another, the feature extraction models switched in the process may be a plurality of feature extraction models trained by machine learning with different training datasets.

This allows the selective use of a feature extraction model in accordance with the operating environment. For example, a feature extraction model that has been trained by machine learning on a training dataset for daytime applications can be applied for daytime use, and a feature extraction model that has been trained by machine learning on a training dataset for nighttime applications can be applied for nighttime use.
Thus the inference performance can be improved.

In this case, the plurality of switching target feature extraction models may all have the same network architecture, or include at least some models having a different network architecture.

In the above example, the AI model data, depicted as the first model data MD1 and second model data MD2 as examples, i.e., the AI model data for building switching target feature extraction models, is stored in the memory unit 45 of the image sensor 10. Instead, the AI model data may be stored in other memories outside the image sensor 10 within the imaging device 1, such as the memory unit 14, for example.
Alternatively, such AI model data can be acquired from an external device as required, instead of being stored in the imaging device 1 in advance.

3-2. Second Application Example
A second application example will be described with reference to Fig. 8 and Fig. 9.
Fig. 8 is a diagram for explaining a configuration example for implementing the inference technique in the second application example. The second application example envisages an application to the monitoring of workers in a factory. Specifically, this is an application example where a process of classifying the operation contents of workers is performed as the inference process.

In this case, an AI model for implementing the classification process for classifying operation contents is implemented as the downstream model 2a in the inference processor 24 of the information processing device 2. For the sake of clarity, the classification process here refers to a process of specifying which of a plurality of categories an object (here, movement of humans as workers) falls under.

The downstream model 2a used in this case, i.e., the AI model that performs the operation content classification process, is an AI model that has been trained by machine learning using the feature data Dc acquired from a plurality of intermediate layers in the upstream model 1a as the input data for the training.

In this example, there are provided overlap periods in the classification process performed by the inference processor 24. Namely, during this overlap period, the classification process is executed in duplicate on the feature data Dc acquired from the same period.

Fig. 9 is a diagram explaining this aspect.
In the drawing, plural sets of feature data Dc obtained during a period of several hundred frames are indicated as feature data group.
In this example, these feature data groups are input to the inference processor 24 such that there is an overlap period between a temporally preceding feature data group. Accordingly, the inference processor 24 performs the classification process in duplicate on some feature data Dc acquired from the same period in a fixed cycle. Namely, there are cyclic overlap periods in the process.
Providing such cyclic overlap periods in the process can help to prevent a period of lapse in the classification process from happening even when the classification process is delayed for some reason.

In Fig. 8, the CPU 21 in this case has the function as a classification results combiner F4 as shown.
The classification results combiner F4 performs the process of inputting the classification results information of the operation contents provided by the inference processor 24, and of combining the classification results in the chronological order. In this process, if any of the classification results information from the same period is in duplicate, resulting from the process overlap periods as described above, one of the classification results information is selected to be combined with other classification results.

The CPU 21 in this example also has the function as an operation error detector F5. The operation error detector F5 performs an operation error detection process based on the classification results information according to the chronological order of the operation contents obtained by the classification results combiner F4, and correct order information that indicates the correct order of operation contents. Specifically, the operation error detector F5 determines whether or not the order of the operation contents indicated in the classification results information according to the chronological order of the operation contents obtained by the classification results combiner F4 is different from the order of operation contents indicated in the correct order information. If the orders are different, a detection result is obtained affirming that an operation error has been detected, while if the orders are the same, a detection result is obtained affirming that no operation error has been detected.

When the operation error detector F5 has detected an operation error, the CPU 21 can perform a process of notifying the user thereof. Such a notification could be a visual notification made with the use of the display unit 27, for example, or an auditory notification made with the use of the audio output unit 28.

In the second application example here, too, similarly to the first application example, a degradation of the inference accuracy can occur, resulting from a change in the operating environment.
In the case where operation contents of workers are classified as in the second application example, in particular, the motions of workers performing the same contents of work may largely differ depending on whether the workers are beginners or experts, because of which there is a possibility that the inference accuracy may drop due to a change in the workers.

Accordingly, this example adopts a technique in which the AI model used by the inference processor 24 is switched to another in accordance with an attribute of the object. Specifically, in this example, the AI models used by the inference processor 24 are switched depending on whether the worker is a beginner or an expert.

In the information processing device 2 of this example, the memory unit 29 stores the AI model data for building respective AI models to allow the AI models described above to be switched from one another. Specifically, in this example, the memory unit 29 contains beginner model data MD3 that is AI model data for building an AI model (hereinafter referred to as gbeginner model h) trained by machine learning to be able to perform a process of classifying the operation contents of a beginner, and expert model data MD4 that is AI model data for building an AI model (hereinafter referred to as gexpert model h) trained by machine learning to be able to perform a process of classifying the operation contents of an expert.

For the training of the beginner model, for example, captured images of a person corresponding to the beginner as the object may be used as the input image for the training, and for the training of the expert model, for example, captured images of a person corresponding to the expert as the object may be used as the input image for the training.

The CPU 21 in the information processing device 2 in this example has the function as a second switching processor F6.
The second switching processor F6 performs the process of switching the AI models used by the inference processor 24, specifically, the process of switching the AI models used by the inference processor 24 in accordance with a switching instruction from outside.
In this example here, an instruction is given, as the switching instruction from outside, to specify the AI model to be switched to. In this example, one of the beginner model and expert model is to be specified as the AI model to be switched to. In this case, the target AI model can be specified by a user fs operation input. Alternatively, if an external device of the information processing device 2 is able to determine which of a beginner or an expert the worker who is being captured by the imaging device 1 is, this external device may specify one of the beginner model and the expert model. In another alternative configuration that can be adopted, an AI model that classifies the attributes of workers who are the imaging target of the imaging device 1 may be implemented in the information processing device 2 or in a different device from the information processing device 2, and the CPU 21 may determine which of the beginner model or the expert model should be selected, based on the classification results by the AI model.

When the beginner model is to be selected as instructed by the switching instruction mentioned above, the second switching processor F6 performs a setting process for setting the inference processor 24 such as to build an AI model as the beginner model based on the beginner model data MD3 stored in the memory unit 29. On the other hand, when the expert model is to be selected as instructed by the switching instruction mentioned above, a setting process is implemented for setting the inference processor 24 such as to build an AI model as the expert model based on the expert model data MD4 stored in the memory unit 29.
Thus the AI models used by the inference processor 24 are switched in accordance with the worker attributes.

In the above example, the AI model data, depicted above as the beginner model data MD3 and expert model data MD4 as examples, i.e., the AI model data for building the switching target AI models, is stored in the information processing device 2. Instead, the AI model data may be acquired from an external device as required.

<4. Variation Examples>
The embodiment is not limited to the specific examples described above and can adopt various configurations as variation examples.
For example, while the feature data Dc acquired from a plurality of intermediate layers in the upstream model 1a is transmitted to a single downstream device as described in the examples above, the feature data Dc may be distributed to a plurality of devices in an alternative configuration.

Fig. 10 illustrates a configuration example of a distribution system that distributes the feature data Dc to a plurality of devices.
In the illustrated example, the feature data Dc transmitted by the imaging device 1 is received by an information processing device or a PLC (Programmable Logic Controller) 60, for example, installed in a facility such as a factory, and this PLC 60 distributes the received feature data Dc to a plurality of external devices such as a cloud server 61, a fog server 62, and a PC (Personal Computer) 63.
In the illustrated example, these cloud server 61, fog server 62, and PC 63 each include a downstream model 2a to be able to perform a predetermined inference process, using the feature data Dc as input data.

The above configuration in which the feature data Dc is distributed to a plurality of devices (downstream devices) makes it possible, for example, to cause plural distributed downstream devices to perform inference on divided sections of a video image, or to cause plural downstream devices to perform different inference processes in parallel on the same captured image.
Thus the inference process can be made more efficient.

While the feature extraction model as the upstream model 1a is implemented in the image sensor 10 in the examples described above, the feature extraction model as the upstream model 1a may be implemented in other parts outside the image sensor 10 within the imaging device 1.

The application examples shown above are merely examples. The present technique is preferably applicable to a wide range of cases where an inference process is performed using an AI model.

<5. Summary of Embodiments>
As has been described above, an imaging device (imaging device 1) as one embodiment includes: an imaging unit (imaging unit 41) that includes a plurality of pixels each having a light-receiving element and that obtains a captured image; a feature extractor (feature extractor 44) that performs feature extraction targeting an image captured by the imaging unit, using a feature extraction model that has a neural network architecture including a plurality of intermediate layers and that performs feature extraction on an input image; and a transmission processor (transmission processor F1) that performs a process of transmitting features of a plurality of intermediate layers in the feature extraction model to an external device.
The above configuration of transmitting the features of a plurality of intermediate layers in the feature extraction model from the imaging device makes it possible to compensate for the lack of information in the feature data on which the downstream model bases its inference when the feature extraction model is compressed. In other words, the amount of information in the feature data that is the basis of the inference can be augmented, so that the inference performance of the downstream model can be improved.
Thus it is possible to improve the inference performance when a downstream model performs inference based on features extracted by an upstream feature extraction model, while allowing the use of a compressed model as the feature extraction model.

In the imaging device as one embodiment, the feature extraction model has four or more intermediate layers, and the plural intermediate layers are intermediate layers in a first-half layer region of an intermediate layer region in the feature extraction model.
The features extracted from preceding (upper) intermediate layers in a feature extraction model tend to contain more information.
Therefore, by using the intermediate layers in the first-half layer region as the intermediate layers from which the features are to be transmitted, the amount of information in the feature data to be transmitted to the downstream model can be augmented, which allows the inference performance to be improved.
The lower the intermediate layers, the higher the versatility of the obtained feature. Therefore, it could be desirable to use the feature of a lower intermediate layer for the purpose of improving the versatility. On the other hand, depending on the degree of compression, the feature could definitively lack information. In this embodiment, the versatility can be guaranteed by the uncompressed downstream AI model, and therefore the features of intermediate layers in the first-half layer region are used to ensure a sufficient amount of input information.

In the imaging device as one embodiment, the plural intermediate layers are continuous intermediate layers.
As described above, the features extracted from upper intermediate layers tend to contain more information. Therefore, by using a plurality of continuous intermediate layers as the intermediate layers from which the features are to be transmitted as described above, it is possible to compensate for the lack of information in the feature data caused by compression as much as possible.
Accordingly, the amount of information in the feature data to be transmitted to the downstream model can be augmented, which allows the inference performance to be improved.

Furthermore, in the imaging device as one embodiment, the transmission processor performs a process of transmitting features obtained from each of two intermediate layers in the feature extraction model to an external device.
This way, the amount of data transmitted to a downstream device can be reduced as much as possible, in the case of adopting the configuration of transmitting the features of a plurality of intermediate layers to compensate for the lack of information in the data input to the downstream model.

The imaging device as one embodiment further includes a first switching processor that performs a process of switching feature extraction models used by the feature extractor.
This allows the selective use of a feature extraction model in accordance with the operating environment of the imaging device.
Thus the inference accuracy and versatility, i.e., the inference performance, can be improved.

Furthermore, in the imaging device as one embodiment, the first switching processor performs a process of switching between a plurality of feature extraction models with different network architectures as the process of switching feature extraction models.
Therefore, in an operating environment where a feature extraction model with one network architecture fails to provide a preferable inference performance, the feature extraction model can be switched to another one with another network architecture.
The inference performance could be improved by switching the model to another feature extraction model with another network architecture. Therefore, the inference performance can be improved as compared to the case where the feature extraction models are not switched.

Furthermore, in the imaging device as one embodiment, the first switch processor performs a process of switching between a plurality of feature extraction models trained by machine learning with different training datasets as the process of switching feature extraction models.
This allows the selective use of a feature extraction model in accordance with the operating environment. For example, a feature extraction model that has been trained by machine learning on a training dataset for daytime applications can be applied for daytime use, and a feature extraction model that has been trained by machine learning on a training dataset for nighttime applications can be applied for nighttime use.
Thus the inference performance can be improved.

A transmission method as one embodiment is implemented in an imaging device, and includes the step of performing a process of transmitting features of a plurality of intermediate layers in a feature extraction model to an external device, the imaging device including: an imaging unit that includes a plurality of pixels each having a light-receiving element and that obtains a captured image; and a feature extractor that performs feature extraction targeting an image captured by the imaging unit, using the feature extraction model that has a neural network architecture including the plurality of intermediate layers and that performs feature extraction on an input image.
This transmission method can also provide the effects and advantages similar to those of the imaging device described above as one embodiment.

An information processing device as one embodiment includes: a receiving unit (communication unit 30) that receives features of a plurality of intermediate layers in a feature extraction model transmitted from an imaging device (imaging device 1); and an inference processor (inference processor 24) that performs a predetermined inference process using an AI model based on the features of the plurality of intermediate layers received by the receiving unit as input data, the imaging device including: an imaging unit that includes a plurality of pixels each having a light-receiving element and that obtains a captured image; and a feature extractor that performs feature extraction targeting an image captured by the imaging unit, using the feature extraction model that has a neural network architecture including the plurality of intermediate layers and that performs feature extraction on an input image.
Thus, when a downstream model (AI model) performs an inference process based on the features extracted by an upstream feature extraction model, and when the feature extraction model is compressed, it is possible to compensate for the lack of information in the feature data on which the downstream model bases its inference.
Accordingly, it is possible to improve the inference performance when a downstream model performs inference based on features extracted by an upstream feature extraction model, while allowing the use of a compressed model as the feature extraction model.

The information processing device as one embodiment further includes a switching instruction processor (switching instruction processor F3) that gives an instruction for switching the feature extraction models in the imaging device based on an inference result provided by the inference processor.
Since the operating environment of the imaging device can be estimated based on an inference result provided by the inference processor, the above configuration allows the selective use of a feature extraction model in accordance with the operating environment of the imaging device.
Thus the inference accuracy and versatility, i.e., the inference performance, can be improved.

Moreover, the information processing device as one embodiment further includes a second switching processor (second switching processor F6) that performs a process of switching AI models used by the inference processor.
This allows the selective use of a downstream AI model in accordance with the operating environment of the imaging device.
Thus the inference accuracy and versatility, i.e., the inference performance, can be improved.

Furthermore, in the information processing device as one embodiment, the inference process performed by the inference processor is an anomaly detection process using the AI model as a PatchCore model. Accurate mapping of features is essential for improving the accuracy of anomaly detection with the use of a PatchCore model. A lack of information in the features to be mapped will lead to a degradation in the mapping accuracy, which in turn will lower the accuracy of anomaly detection. In the case where an anomaly detection process is performed using a PatchCore model as the downstream AI model, the configuration in which the features of a plurality of intermediate layers in the upstream feature extraction model are input to the downstream model as in this embodiment can help to improve the accuracy of mapping in the PatchCore model, which can in turn improve the accuracy of anomaly detection.

In the information processing device as one embodiment, the inference processor performs a process of classifying objects as the inference process.
The classification process here refers to a process of specifying which of a plurality of categories an object falls under.
When the downstream AI model performs a process of classifying objects, the above configuration can help to improve the performance of the classification process.

An information processing method as one embodiment is implemented by an information processing device and includes the steps of: receiving features of a plurality of intermediate layers in a feature extraction model transmitted from an imaging device; and performing a predetermined inference process using an AI model based on the received features of the plurality of intermediate layers as input data, the imaging device including: an imaging unit that includes a plurality of pixels each having a light-receiving element and that obtains a captured image; and a feature extractor that performs feature extraction targeting an image captured by the imaging unit, using the feature extraction model that has a neural network architecture including the plurality of intermediate layers and that performs feature extraction on an input image.
This information processing method can provide the effects and advantages similar to those of the information processing device described above as one embodiment.

A feature distribution method as one embodiment is implemented by an information processing device, and includes the step of distributing features of a plurality of intermediate layers in a feature extraction model to a plurality of external devices, the features being obtained in an imaging device (imaging device 1) including: an imaging unit that includes a plurality of pixels each having a light-receiving element and that obtains a captured image; and a feature extractor that performs feature extraction targeting an image captured by the imaging unit, using the feature extraction model that has a neural network architecture including the plurality of intermediate layers and that performs feature extraction on an input image.
In a case where it is attempted to improve inference performance by transmitting features of a plurality of intermediate layers in a feature extraction model, the method makes it possible, for example, to cause plural distributed downstream devices to perform inference on divided sections of a video image, or to cause plural downstream devices to perform different inference processes in parallel on the same captured image.
Thus the inference process can be made more efficient.

The advantages described herein are only examples and there may be other advantages.

<6. Present Technique>
The present technique may also adopt the following configurations.
(1)
An image processing system, comprising:
an image sensor configured to capture image data; and
circuitry configured to
acquire the image data captured by the image sensor;
extract, by a trained machine learning model, feature data from the image data, the trained machine learning model having a plurality of intermediate layers;
combine the feature data from two or more of the plurality of intermediate layers; and
transmit the combined feature data.
(2)
The image processing system of (1), wherein the plurality of intermediate layers includes four or more intermediate layers.
(3)
The image processing system of (2), wherein the plurality of intermediate layers is six intermediate layers.
(4)
The image processing system of (2), wherein the circuitry is further configured to
in a case that a number of the plurality of intermediate layers is an even number of intermediate layers, equally split the plurality of intermediate layers into a first-half layer region and a second-half layer region.
(5)
The image processing system of (4), wherein the plurality of intermediate layers from which the feature data is extracted are intermediate layers from the first-half layer region of the plurality of intermediate layers.
(6)
The image processing system of (2), wherein the circuitry is further configured to
in a case that a number of the plurality of intermediate layers is an odd number of intermediate layers, split the plurality of intermediate layers, wherein a first-half layer region includes each intermediate layer before a center intermediate layer, wherein a second-half layer region includes each intermediate layer after the center intermediate layer.
(7)
The image processing system of (5), wherein the plurality of intermediate layers from which the feature data is extracted are continuous intermediate layers.
(8)
The image processing system of (7), wherein the plurality of intermediate layers from which the feature data is extracted are a first intermediate layer and a second intermediate layer.
(9)
The image processing system of (8), wherein the circuitry is further configured to
match a resolution of the second intermediate layer with the first intermediate layer before combing the feature data.
(10)
The image processing system of (9), wherein the circuitry is further configured to
increase the second intermediate layer feature data to match the resolution of the first intermediate layer.
(11)
An information processing system, comprising:
a controller; and
an image sensor configured to capture image data, the image sensor including first circuitry configured to
acquire the image data captured by the image sensor,
extract, by a trained machine learning model, feature data from the image data, the trained machine learning model having a plurality of intermediate layers,
combine the feature data from two or more of the plurality of intermediate layers, and
transmit the combined feature data to the controller.
(12)
The information processing system of (11), wherein the controller is installed in a factory.
(13)
The information processing system of (11), wherein the controller includes second circuitry configured to
distribute the combined feature data received from the image sensor to a plurality of external devices, wherein each of the plurality of external devices includes an artificial intelligence model configured to perform a predetermined inference process using the combined feature data as input.
(14)
The information processing system (11), wherein the controller is a programmable logic controller.
(15)
An information processing system, comprising:
a controller;
an image sensor configured to capture image data, the image sensor including first circuitry configured to
acquire the image data captured by the image sensor,
extract, by a trained machine learning model, feature data from the image data, the trained machine learning model having a plurality of intermediate layers,
combine the feature data from two or more of the plurality of intermediate layers, and
transmit the combined feature data to the controller; and
one or more electronic devices,
wherein the controller includes second circuitry configured to
distribute the combined feature data received from the image sensor to the one or more electronic devices.
(16)
The information processing system of (15), wherein each of the one or more electronic devices includes a trained artificial intelligence model configured to perform a predetermined inference process using the combined feature data as input.
(17)
The information processing system of (16), wherein a plurality of the one or more electronic devices are configured to perform inference, using the respective artificial intelligence model, on divided sections of a video image.
(18)
The information processing system of (16), wherein a plurality of the one or more electronic devices are configured to perform different inference processes, using the respective artificial intelligence model, in parallel on a same captured image.
(19)
The information processing system of (16), wherein the trained artificial intelligence model is trained using the combined feature data as input for training.
(20)
The information processing system of (15), wherein an electronic device of the one or more electronic devices includes
a memory configured to store artificial intelligence model data for building respective artificial intelligence models; and
third circuitry configured to
determine which of a first type of factory worker or a second type of factory worker is being captured by the image sensor, and
switch between a plurality of artificial intelligence models based on the determination,
wherein a first artificial intelligence model is trained by machine learning to classify operation contents of the first type of factory worker, and
wherein a second artificial intelligence model is trained by machine learning to classify operation contents of the second type of factory worker.
(21)
An imaging device including:
an imaging unit that includes a plurality of pixels each having a light-receiving element and that obtains a captured image;
a feature extractor that performs feature extraction targeting an image captured by the imaging unit, by using a feature extraction model that has a neural network architecture including a plurality of intermediate layers and that performs feature extraction on an input image; and
a transmission processor that performs a process of transmitting features in a plurality of intermediate layers in the feature extraction model to an external device.
(22)
The imaging device according to (21) above, wherein
the feature extraction model includes four or more intermediate layers, and
the plurality of intermediate layers are intermediate layers in a first-half layer region of an intermediate layer region in the feature extraction model.
(23)
The imaging device according to (22) above, wherein the plurality of intermediate layers are continuous intermediate layers.
(24)
The imaging device according to any one of (21) to (23) above, wherein the transmission processor performs a process of transmitting features obtained from each of two intermediate layers in the feature extraction model to an external device.
(25)
The imaging device according to any one of (21) to (24) above, further including a first switching processor that performs a process of switching feature extraction models used by the feature extractor.
(26)
The imaging device according to (25) above, wherein the first switching processor performs as the process of switching feature extraction models a process of switching between a plurality of feature extraction models with different network architectures.
(27)
The imaging device according to (25) or (26) above, wherein the first switching processor performs as the process of switching feature extraction models a process of switching between a plurality of feature extraction models trained by machine learning with different training datasets.
(28)
A transmission method implemented in an imaging device including an imaging unit that includes a plurality of pixels each having a light-receiving element and that obtains a captured image; and a feature extractor that performs feature extraction targeting an image captured by the imaging unit, by using the feature extraction model that has a neural network architecture including a plurality of intermediate layers and that performs feature extraction on an input image, the method including performing a process of transmitting features of the plurality of intermediate layers in a feature extraction model to an external device.
(29)
An information processing device including:
a receiving unit that receives features of a plurality of intermediate layers in a feature extraction model transmitted from an imaging device; and
an inference processor that performs a predetermined inference process by using an AI model based on the features of the plurality of intermediate layers received by the receiving unit as input data, the imaging device including: an imaging unit that includes a plurality of pixels each having a light-receiving element and that obtains a captured image; and a feature extractor that performs feature extraction targeting an image captured by the imaging unit, by using the feature extraction model that has a neural network architecture including the plurality of intermediate layers and that performs feature extraction on an input image.
(30)
The information processing device according to (29) above, further including a switching instruction processor that issues an instruction for switching feature extraction models in the imaging device on the basis of an inference result provided by the inference processor.
(31)
The information processing device according to (29) or (30) above, further including a second switching processor that performs a process of switching AI models used by the inference processor.
(32)
The information processing device according to any one of (29) to (31) above, wherein the inference process performed by the inference processor is an anomaly detection process using the AI model that is a PatchCore model.
(33)
The information processing device according to any one of (29) to (31) above, wherein the inference processor performs as the inference process a process of classifying objects.
(34)
An information processing method implemented by an information processing device, the method including:
receiving features of a plurality of intermediate layers in a feature extraction model transmitted from an imaging device; and
performing a predetermined inference process by using an AI model on the basis of the received features of the plurality of intermediate layers as input data, the imaging device including: an imaging unit that includes a plurality of pixels each having a light-receiving element and that obtains a captured image; and a feature extractor that performs feature extraction targeting an image captured by the imaging unit, by using the feature extraction model that has a neural network architecture including the plurality of intermediate layers and that performs feature extraction on an input image.
(35)
A feature distribution method implemented by an information processing device, the method including distributing features of a plurality of intermediate layers in a feature extraction model to a plurality of external devices, the features being obtained in an imaging device including: an imaging unit that includes a plurality of pixels each having a light-receiving element and that obtains a captured image; and a feature extractor that performs feature extraction targeting an image captured by the imaging unit, by using the feature extraction model that has a neural network architecture including the plurality of intermediate layers and that performs feature extraction on an input image.

1 Imaging device
1a Upstream model
2 Information processing device
2a Downstream model
10 Image sensor
11 Imaging optical system
13 Controller
14 Memory unit
15 Communication unit
41 Imaging unit
43 In-sensor controller
44 Feature extractor
45 Memory unit
21 CPU
29 Memory unit
30 Communication unit
Li Input layer
L1 to L6 Intermediate layer
Dc Feature data
F1 Transmission processor
F2 First switching processor
F3 Switching instruction processor
MD1 First model data
MD2 Second model data
50 PatchCore model
51 Mapping unit
52 Nearest neighbor search unit
53 Anomaly score calculator
54 Heat map generator
F4 Classification results combiner
F5 Operation error detector
F6 Second switching processor
MD3 Beginner model data
MD4 Expert model data

Claims

An image processing system, comprising:
an image sensor configured to capture image data; and
circuitry configured to
acquire the image data captured by the image sensor;
extract, by a trained machine learning model, feature data from the image data, the trained machine learning model having a plurality of intermediate layers;
combine the feature data from two or more of the plurality of intermediate layers; and
transmit the combined feature data.
The image processing system of claim 1, wherein the plurality of intermediate layers includes four or more intermediate layers.
The image processing system of claim 2, wherein the plurality of intermediate layers is six intermediate layers.
The image processing system of claim 2, wherein the circuitry is further configured to
in a case that a number of the plurality of intermediate layers is an even number of intermediate layers, equally split the plurality of intermediate layers into a first-half layer region and a second-half layer region.
The image processing system of claim 4, wherein the plurality of intermediate layers from which the feature data is extracted are intermediate layers from the first-half layer region of the plurality of intermediate layers.
The image processing system of claim 2, wherein the circuitry is further configured to
in a case that a number of the plurality of intermediate layers is an odd number of intermediate layers, split the plurality of intermediate layers, wherein a first-half layer region includes each intermediate layer before a center intermediate layer, wherein a second-half layer region includes each intermediate layer after the center intermediate layer.
The image processing system of claim 5, wherein the plurality of intermediate layers from which the feature data is extracted are continuous intermediate layers.
The image processing system of claim 7, wherein the plurality of intermediate layers from which the feature data is extracted are a first intermediate layer and a second intermediate layer.
The image processing system of claim 8, wherein the circuitry is further configured to
match a resolution of the second intermediate layer with the first intermediate layer before combing the feature data.
The image processing system of claim 9, wherein the circuitry is further configured to
increase the second intermediate layer feature data to match the resolution of the first intermediate layer.
An information processing system, comprising:
a controller; and
an image sensor configured to capture image data, the image sensor including first circuitry configured to
acquire the image data captured by the image sensor,
extract, by a trained machine learning model, feature data from the image data, the trained machine learning model having a plurality of intermediate layers,
combine the feature data from two or more of the plurality of intermediate layers, and
transmit the combined feature data to the controller.
The information processing system of claim 11, wherein the controller is installed in a factory.
The information processing system of claim 11, wherein the controller includes second circuitry configured to
distribute the combined feature data received from the image sensor to a plurality of external devices, wherein each of the plurality of external devices includes an artificial intelligence model configured to perform a predetermined inference process using the combined feature data as input.
The information processing system of claim 11, wherein the controller is a programmable logic controller.
An information processing system, comprising:
a controller;
an image sensor configured to capture image data, the image sensor including first circuitry configured to
acquire the image data captured by the image sensor,
extract, by a trained machine learning model, feature data from the image data, the trained machine learning model having a plurality of intermediate layers,
combine the feature data from two or more of the plurality of intermediate layers, and
transmit the combined feature data to the controller; and
one or more electronic devices,
wherein the controller includes second circuitry configured to
distribute the combined feature data received from the image sensor to the one or more electronic devices.
The information processing system of claim 15, wherein each of the one or more electronic devices includes a trained artificial intelligence model configured to perform a predetermined inference process using the combined feature data as input.
The information processing system of claim 16, wherein a plurality of the one or more electronic devices are configured to perform inference, using the respective artificial intelligence model, on divided sections of a video image.
The information processing system of claim 16, wherein a plurality of the one or more electronic devices are configured to perform different inference processes, using the respective artificial intelligence model, in parallel on a same captured image.
The information processing system of claim 16, wherein the trained artificial intelligence model is trained using the combined feature data as input for training.
The information processing system of claim 15, wherein an electronic device of the one or more electronic devices includes
a memory configured to store artificial intelligence model data for building respective artificial intelligence models; and
third circuitry configured to
determine which of a first type of factory worker or a second type of factory worker is being captured by the image sensor, and
switch between a plurality of artificial intelligence models based on the determination,
wherein a first artificial intelligence model is trained by machine learning to classify operation contents of the first type of factory worker, and
wherein a second artificial intelligence model is trained by machine learning to classify operation contents of the second type of factory worker.