The complexity has merely moved

Posted Oct 3, 2024 16:04 UTC (Thu) by atnot (guest, #124910)
Parent article: Coping with complex cameras

A bit of context, as the framing of cameras becoming more "complex" is, I think, kind of misleading.

What has happened instead is that while in the past, the camera module contained a CPU and ISP with firmware that did all the processing, with smartphones this has increasingly been integrated into the main SoC instead to save cost and power and to increase flexibility. So instead of the camera module sending a finished image over a (relatively) low-speed bus, image sensors are now directly attached to the CPU and deliver raw sensor data to an on-board ISP. This means that what used to be the camera firmware needs to move to the main CPU too. This is not a problem under Android, since they can just ship that "firmware" as a blob on the vendor partition and use standard system APIs to get a completed image. But for v4l2, which is fundamentally built around directly passing images through from the device to applications, that's a problem.
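To make that concrete, here's a rough sketch (the device path is an assumption and the code is only illustrative, not anything from the article): enumerating the formats of a video node shows the difference, since a classic UVC webcam typically advertises finished formats such as YUYV or MJPEG, while a raw sensor node behind an SoC ISP advertises Bayer formats.

/* Minimal sketch: list the pixel formats a V4L2 video node offers.
 * A UVC webcam usually reports finished formats (YUYV, MJPEG), while
 * a raw sensor feeding an SoC ISP reports Bayer formats (e.g. SRGGB10).
 * The device path is an assumption; adjust as needed. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

int main(void)
{
    int fd = open("/dev/video0", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct v4l2_fmtdesc fmt;
    memset(&fmt, 0, sizeof(fmt));
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;

    /* VIDIOC_ENUM_FMT walks the list of supported formats. */
    while (ioctl(fd, VIDIOC_ENUM_FMT, &fmt) == 0) {
        printf("%u: %.4s (%s)\n", fmt.index,
               (char *)&fmt.pixelformat, (char *)fmt.description);
        fmt.index++;
    }

    close(fd);
    return 0;
}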

So it's not that cameras have gotten more complex. They're still doing exactly the same thing. It's just that where that code runs has changed and existing interfaces are not equipped to deal with that.



The complexity has merely moved

Posted Oct 3, 2024 16:18 UTC (Thu) by laurent.pinchart (subscriber, #71290)

That's a good summary, thank you. It became more complex only from a Linux kernel and userspace point of view.

V4L2 was originally modelled as a high-level API meant to be used directly by applications, with an abstraction level designed for TV capture cards and webcams, where all the processing was handled internally by hardware and firmware. It has evolved with the introduction of the Media Controller and V4L2 subdev APIs to support a much lower level of abstraction. These evolutions were upstreamed nearly 15 years ago (time flies) by Nokia. Unfortunately, due to Nokia's demise, the userspace part of the framework never saw the light of day. Fast forward to 2018, the libcamera project was announced to fill that wide gap and be the "Mesa of cameras". We now have places for all of this complex code to live, but the amount of work to support a new platform is significantly larger than it used to be.

The complexity has merely moved

Posted Oct 3, 2024 16:34 UTC (Thu) by laurent.pinchart (subscriber, #71290) (5 responses)

> So it's not that cameras have gotten more complex. They're still doing exactly the same thing. It's just that where that code runs has changed and existing interfaces are not equipped to deal with that.

I think it's a bit more complicated than this. With the processing moving to the main SoC, it became possible for the main OS to have more control over that processing. I think this has led to more complex processing being implemented that would have been more difficult (or more costly) to do if everything had remained on the camera module side. Advanced HDR processing is one such feature, for instance, and NPU-assisted algorithms such as automatic white balance are another example. Some of this would probably still have happened without the processing shifting to the main SoC, but probably at a different pace and possibly in a different direction.

In any case, the situation we're facing is that cameras now look much more complex from a Linux point of view, due to a combination of processing moving from the camera module to the main SoC, and processing itself getting more complex (regardless of whether or not the latter is a partial consequence of the former).

The complexity has merely moved

Posted Oct 3, 2024 17:37 UTC (Thu) by intelfx (subscriber, #130118) (4 responses)

> due to a combination of processing moving from the camera module to the main SoC, and processing itself getting more complex (regardless of whether or not the latter is a partial consequence of the former).

Forgive me if I'm wrong, but I always thought it was mostly the other way around, no? I.e., due to R&D advances and market pressure, image processing started to get increasingly complex (enter computational photography), and at some point during that process it became apparent that it's just all-around better to get rid of the separate processing elements in the camera unit and push things into the SoC instead.

Thus, a few years later, this trend is finally trickling down to general-purpose computers and thus Linux proper.

Or am I misunderstanding how things happened?

The complexity has merely moved

Posted Oct 3, 2024 18:49 UTC (Thu) by laurent.pinchart (subscriber, #71290) (1 responses)

I think it goes both ways. I can't tell exactly when this transition started and what was the trigger, but once processing moved to the main SoC, it opened the door to more possibilities that people may not have thought of otherwise. In turn, that probably justified the transition, accelerating the move.

There are financial arguments too: in theory, at least if development costs and the economic impact on users are ignored, this architecture is supposed to be cheaper. A more detailed historical study would be interesting. Of course it won't change the situation we're facing today.

The complexity has merely moved

Posted Oct 4, 2024 19:00 UTC (Fri) by Wol (subscriber, #4433)

> I think it goes both ways. I can't tell exactly when this transition started and what was the trigger, but once processing moved to the main SoC, it opened the door to more possibilities that people may not have thought of otherwise. In turn, that probably justified the transition, accelerating the move.

This seems to be pretty common across a lot of technology. It starts with general purpose hardware, moves to special-purpose hardware because it's faster/cheaper, the special purpose hardware becomes obsolete / ossified, it moves back to general purpose hardware, rinse and repeat. Just look at routers or modems ...

Cheers,
Wol

The complexity has merely moved

Posted Oct 3, 2024 19:55 UTC (Thu) by excors (subscriber, #95769) (1 responses)

From what I vaguely remember, back when Android only supported Camera HAL1 (which used the old-fashioned webcam model: the application calls setParameters(), startPreview(), takePicture(), and eventually gets the JPEG in a callback and can start the process again), there was already a trend to put the ISP on the main SoC. It's almost always cheaper to have fewer chips on your PCB, plus you can use the same ISP hardware for both front and rear cameras (just reconfigure the firmware when switching between them), and you can use the same RAM for the ISP and for games (since you're not going to run both at the same time), etc., so there are significant cost and power-efficiency benefits.

The early SoC ISPs were quite slow and/or bad so they'd more likely be used for the front camera, while the higher-quality rear camera might have a discrete ISP, but eventually the SoCs got good enough to replace discrete ISPs on at least the lower-end phones. (And when they did have a discrete ISP it would probably be independent of the camera sensor, because the phone vendor wants the flexibility to pick the best sensor and the best ISP for their requirements, not have them tied together into a single module by the camera vendor.)

Meanwhile phone vendors added proprietary extensions to HAL1 to support more sophisticated camera features (burst shot, zero shutter lag, HDR, etc), because cameras were a big selling point and a great way to differentiate from competitors. They could implement that in their app/HAL/drivers/firmware however they wanted (often quite hackily), whether it was an integrated or discrete ISP.

Then Google developed Camera HAL2/HAL3 (based around a pipeline of synchronised frame requests/results) to support those features properly, putting much more control on the application side and standardising a lower-level interface to the hardware. I'm guessing that was harder to implement on some discrete ISPs that weren't flexible enough, whereas integrated ones typically depended more heavily on the CPU, so they were already more flexible. It also reduced the demand for fancy new features in ISP hardware, since the application could now take responsibility for them (using the CPU/GPU/NPU to manipulate the images efficiently enough, and adopting new algorithms without the many-year lag it takes to implement a new feature in dedicated hardware).

It was a slow transition though - Android still supported HAL1 in new devices for many years after that, because phone vendors kept using cameras that wouldn't or couldn't support HAL3.

So, I don't think there's a straightforward causality or a linear progression; it's a few different things happening in parallel, spread out inconsistently across more than a decade, applying pressure towards the current architecture. And (from my perspective) that was all driven by Android, and now Linux is having to catch up so it can run on the same hardware.

The complexity has merely moved

Posted Oct 3, 2024 20:11 UTC (Thu) by laurent.pinchart (subscriber, #71290)

> And (from my perspective) that was all driven by Android, and now Linux is having to catch up so it can run on the same hardware.

It started before Android as far as I know. Back in the Nokia days, the FrankenCamera project (https://graphics.stanford.edu/papers/fcam/) developed an implementation of computational photography on a Nokia N900 phone (running Linux). A Finnish student at Stanford University participated in the project and later joined Google, where he participated in the design of the HAL3 API based on the FCam design.

The complexity has merely moved

Posted Oct 4, 2024 10:27 UTC (Fri) by mchehab (subscriber, #41156) (1 responses)

> So it's not that cameras have gotten more complex. They're still doing exactly the same thing. It's just that where that code runs has changed and existing interfaces are not equipped to deal with that.

Not entirely true, as ISP-based processing allows more complex processing pipelines, with things like face recognition, more complex algorithms and extra steps to enhance image quality.

During the libcamera discussions, we referred to the entire set as simply the "V4L2 API", but there are actually three different APIs used to control complex camera hardware: the "standard" V4L2 API, the media controller API, and the V4L2 sub-device API. Currently, on complex cameras (a rough sketch of how each is reached follows the list):

1. input and capture (output) devices are controlled via V4L2 API (enabled via config VIDEO_DEV);
2. the pipeline is controlled via the media controller API (enabled via config MEDIA_CONTROLLER);
3. each element of the pipeline is individually controlled via the V4L2 sub-device API (enabled via config VIDEO_V4L2_SUBDEV_API).
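Just as an illustration (the /dev paths here are assumptions about a typical setup, not something taken from the discussion), each of the three APIs is reached through a different device node:

/* Minimal sketch of the three API families on a complex camera.
 * The device paths are assumptions; real systems enumerate them. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>
#include <linux/media.h>
#include <linux/v4l2-subdev.h>

int main(void)
{
    /* (1) "standard" V4L2 API on a video capture node */
    int vfd = open("/dev/video0", O_RDWR);
    struct v4l2_capability cap;
    if (vfd >= 0 && ioctl(vfd, VIDIOC_QUERYCAP, &cap) == 0)
        printf("video node: %s\n", (char *)cap.card);

    /* (2) media controller API on the media device */
    int mfd = open("/dev/media0", O_RDWR);
    struct media_device_info info;
    if (mfd >= 0 && ioctl(mfd, MEDIA_IOC_DEVICE_INFO, &info) == 0)
        printf("media device: %s\n", info.model);

    /* (3) V4L2 sub-device API on one element of the pipeline */
    int sfd = open("/dev/v4l-subdev0", O_RDWR);
    struct v4l2_subdev_format fmt;
    memset(&fmt, 0, sizeof(fmt));
    fmt.which = V4L2_SUBDEV_FORMAT_ACTIVE;
    fmt.pad = 0;
    if (sfd >= 0 && ioctl(sfd, VIDIOC_SUBDEV_G_FMT, &fmt) == 0)
        printf("subdev pad 0: %ux%u, mbus code 0x%x\n",
               fmt.format.width, fmt.format.height, fmt.format.code);

    if (vfd >= 0) close(vfd);
    if (mfd >= 0) close(mfd);
    if (sfd >= 0) close(sfd);
    return 0;
}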

Following the discussions, it seems that we could benefit from having new ioctl(s) for (1) to reduce the number of ioctl calls for memory-to-memory sub-devices and to simplify ISP processing, perhaps as part of the sub-device API.

For (3), we may need to add something to pass calibration data.

Yet, the final node of the pipeline (the capture device) is the same, and can be completely mapped using the current V4L2 API: a video stream, usually compressed with a codec (mpeg, h-264, ...), with a known video resolution, frame rate and a fourcc identifying the output video format (bayer, yuv, mpeg, h-264, ...).
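For illustration only (the resolution and NV12 fourcc below are assumptions, just one possible choice), configuring that final capture node is an ordinary VIDIOC_S_FMT call regardless of which format the pipeline produces:

/* Minimal sketch: set the format on the final capture node.
 * Width, height and fourcc are illustrative assumptions. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

int set_capture_format(int vfd)
{
    struct v4l2_format fmt;
    memset(&fmt, 0, sizeof(fmt));
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width = 1920;
    fmt.fmt.pix.height = 1080;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_NV12;  /* any fourcc works the same way */
    fmt.fmt.pix.field = V4L2_FIELD_NONE;

    if (ioctl(vfd, VIDIOC_S_FMT, &fmt) < 0)
        return -1;

    /* The driver may adjust the request; the returned values are
     * what the capture node will actually deliver. */
    printf("capture: %ux%u, fourcc %.4s\n", fmt.fmt.pix.width,
           fmt.fmt.pix.height, (char *)&fmt.fmt.pix.pixelformat);
    return 0;
}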

Most of the vendor-specific "magic" happens at the intermediate nodes inside the pipeline. Typically, modern cameras produce two different outputs: a video stream and a metadata stream. The metadata is used by vendor-specific 3A algorithms (auto focus, auto exposure and auto white balance), among others. The userspace component (libcamera) needs to use such metadata to produce a set of changes to be applied to the next frames by the ISP. It also uses a set of vendor-specific settings that are related to the hardware attached to the ISP, including the camera sensor and lens. Those are calibrated by the hardware vendor.
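As a purely hypothetical sketch of that feedback loop (the structures and the gray-world white balance below are simplified illustrations, not libcamera's or any vendor's actual algorithm):

/* Hypothetical sketch of the per-frame 3A feedback loop described
 * above. Everything here is simplified for illustration. */
#include <stdio.h>

struct isp_stats {            /* simplified ISP statistics */
    double mean_red, mean_green, mean_blue;
};

struct isp_params {           /* simplified ISP parameters */
    double red_gain, blue_gain;
};

/* Gray-world white balance: scale red and blue so their averages
 * match green. Real implementations also use the vendor tuning data. */
static void compute_awb(const struct isp_stats *s, struct isp_params *p)
{
    p->red_gain  = s->mean_green / s->mean_red;
    p->blue_gain = s->mean_green / s->mean_blue;
}

int main(void)
{
    /* Pretend statistics dequeued from the ISP's metadata stream. */
    struct isp_stats stats = { .mean_red = 80.0, .mean_green = 100.0,
                               .mean_blue = 120.0 };
    struct isp_params params;

    compute_awb(&stats, &params);

    /* In the real loop these gains would be queued back to the ISP
     * and applied to one of the next frames. */
    printf("red gain %.2f, blue gain %.2f\n",
           params.red_gain, params.blue_gain);
    return 0;
}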

The main focus of the complex camera discussions is around those intermediate nodes.

As I said during the libcamera discussions, from my perspective as the Media subsystem maintainer, I don't care how the calibration data was generated. This is something that IMO we can't contribute much to, as it would require a specialized lab to test the ISP+sensor+lens with different light conditions and different environments (indoor, outdoor, different focus settings, etc.). I do care, however, about not allowing execution of binary blobs sent from userspace in the Kernel.

By the way, even cameras that have their own CPUs and use just V4L2 API without the media controller have calibration data. Those are typically used during device initialization as a series of register values inside driver tables. While we want to know what each register contains (so we strongly prefer to have those registers mapped with #define macros), it is not mandatory to have all of them documented.
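For illustration, such a table typically looks like the following; the register names, addresses and values here are made up, not taken from any real sensor datasheet:

/* Hypothetical sensor init table; names, addresses and values are
 * invented for illustration only. */
#include <stdint.h>

#define SENSOR_REG_PLL_DIV       0x0304   /* hypothetical */
#define SENSOR_REG_ANALOG_GAIN   0x3060   /* hypothetical */
#define SENSOR_REG_EXPOSURE_HI   0x3500   /* hypothetical */
#define SENSOR_REG_EXPOSURE_LO   0x3501   /* hypothetical */

struct sensor_reg {
    uint16_t addr;
    uint8_t  val;
};

/* Written to the sensor over I2C during probe/initialization. */
static const struct sensor_reg sensor_init_table[] = {
    { SENSOR_REG_PLL_DIV,      0x02 },
    { SENSOR_REG_ANALOG_GAIN,  0x10 },
    { SENSOR_REG_EXPOSURE_HI,  0x03 },
    { SENSOR_REG_EXPOSURE_LO,  0xe8 },
};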

In the past, our efforts were to ensure that the Kernel drivers are fully open source. Now that we have libcamera, the requirement is that the driver (userspace+kernel) be open source. The Kernel doesn't need to know how configuration data passed from userspace was calculated, provided that such calculation is part of libcamera.

The complexity has merely moved

Posted Oct 8, 2024 18:31 UTC (Tue) by laurent.pinchart (subscriber, #71290)

> Not entirely true, as an ISP-based processing allow more complex processing pipelines with things like face recognition, more complex algorithms and extra steps to enhance image's quality.

atnot's point is that ISPs were there already, just hidden by the webcam firmware, and that's largely true. We have USB webcams today that handle face detection internally and enhance the image quality in lots of ways (we also have lots of cheap webcams with horrible image quality, of course).

> For (3), we may need to add something to pass calibration data.

It's unfortunately more complicated than that (I stopped counting the number of times I've said this). "Calibration" or "tuning" data is generally not something that is passed directly to drivers, but needs to be processed by userspace based on dynamically changing conditions. For instance, the lens shading compensation data, when expressed as a table, needs to be resampled based on the camera sensor field of view. Lots of tuning data never makes it to the device but only influences userspace algorithms.
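As a hypothetical sketch of what that resampling involves (the table layout, function names and bilinear interpolation are illustrative assumptions, not a real ISP format or libcamera code):

/* Hypothetical sketch: resample a lens shading gain table when the
 * sensor is cropped to a narrower field of view. Assumes output
 * dimensions of at least 2x2. */
#include <stddef.h>

/* Sample the full-frame gain grid at normalized coordinates (u, v)
 * in [0, 1] using bilinear interpolation. */
static double lsc_sample(const double *grid, size_t w, size_t h,
                         double u, double v)
{
    double x = u * (w - 1), y = v * (h - 1);
    size_t x0 = (size_t)x, y0 = (size_t)y;
    size_t x1 = x0 + 1 < w ? x0 + 1 : x0;
    size_t y1 = y0 + 1 < h ? y0 + 1 : y0;
    double fx = x - x0, fy = y - y0;

    double top = grid[y0 * w + x0] * (1 - fx) + grid[y0 * w + x1] * fx;
    double bot = grid[y1 * w + x0] * (1 - fx) + grid[y1 * w + x1] * fx;
    return top * (1 - fy) + bot * fy;
}

/* Build the table the ISP needs for a crop described in normalized
 * full-frame coordinates (crop_x, crop_y, crop_w, crop_h). */
static void lsc_resample(const double *full, size_t fw, size_t fh,
                         double crop_x, double crop_y,
                         double crop_w, double crop_h,
                         double *out, size_t ow, size_t oh)
{
    for (size_t j = 0; j < oh; j++) {
        for (size_t i = 0; i < ow; i++) {
            double u = crop_x + crop_w * i / (double)(ow - 1);
            double v = crop_y + crop_h * j / (double)(oh - 1);
            out[j * ow + i] = lsc_sample(full, fw, fh, u, v);
        }
    }
}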

> Yet, the final node of the pipeline (the capture device) is the same, and can be completely mapped using the current V4L2 API: a video stream, usually compressed with a codec (mpeg, h-264, ...) with a known video resolution, frame rate and a fourcc identifying the output video format (bayer, yuv, mpeg, h-264, ...).

There is usually no video encoder in the camera pipeline when using an ISP in the main SoC. Encoding is performed by a separate codec.

> By the way, even cameras that have their own CPUs and use just V4L2 API without the media controller have calibration data. Those are typically used during device initialization as a series of register values inside driver tables. While we want to know what each register contains (so we strongly prefer to have those registers mapped with #define macros), it is not mandatory to have all of them documented.

Those are usually not calibration or tuning data, as they are not specific to particular camera instances.

