WO2026019419A1 - Systems and methods for extending a depth-of-field based on focus stack fusion
- Publication number: WO2026019419A1 (PCT/US2024/038165)
- Authority: WIPO (PCT)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
An example method includes determining that a portion of a scene in an image frame being captured at a first focal length is out of focus. The method also includes capturing one or more first image frames at the first focal length and one or more additional image frames at a second focal length to focus on the portion of the scene. The method additionally includes providing the one or more first image frames and the one or more additional image frames as input to a machine learning (ML) model, the ML model having been trained to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The method also includes receiving the predicted image from the ML model.
Description
SYSTEMS AND METHODS FOR EXTENDING A DEPTH-OF-FIELD BASED ON FOCUS STACK FUSION
BACKGROUND
[0001] Many modern computing devices, including mobile phones, personal computers, and tablets, include image capture devices, such as still and/or video cameras. The image capture devices can capture images, such as images that include people, animals, landscapes, and/or objects. Such objects may appear at different depths in the image.
SUMMARY
[0002] This application generally relates to focus stack fusion. Mobile device cameras generally have a limited f-number and a shallow depth of field (DoF). The term “depth of field,” as used herein, generally refers to the amount of an image (e.g., front-to-back) that appears to be in sharp focus. Some images may have a shallow DoF where the foreground, or a portion thereof, is in focus, whereas the background is out of focus. Other images may have a larger DoF where both the foreground and the background are in focus. Some cameras may have a limited DoF due to a main (e.g., wide) camera's larger aperture, which allows only a small portion of the image to be in focus and leaves the rest of the image blurred.
[0003] For example, faces and other elements outside a focal plane may be blurred. This can be a challenge for capturing sharp images (e.g., portraits, photos of food items, documents, landscapes, and so forth). For example, in group portraits, subjects positioned at varying distances may exhibit blur, particularly those outside the DoF of the focused subject. As another example, in macro photography, close-up images may have a narrow portion of the image that is in sharp focus. Also, for example, in landscape photography, it may be challenging to achieve full front-to-back sharpness. As another example, in text and/or document capture, it may be challenging to ensure that all the text is legible. Focus stack fusion can address the challenges posed by the limited f-number and the shallow DoF by capturing multiple images at different focal distances and combining them into a single image with an extended DoF. This can result in a high-quality image with multiple points of focus.
[0004] In one aspect, a computer-implemented method is provided. The method includes determining that a portion of a scene in an image frame being captured at a first focal length is out of focus. The method also includes capturing one or more first image frames at the first focal length and one or more additional image frames at a second focal length to focus on the
portion of the scene. The method additionally includes providing the one or more first image frames and the one or more additional image frames as input to a machine learning (ML) model, the ML model having been trained to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The method also includes receiving the predicted image from the ML model.
[0005] In another aspect, a system is provided. The system may include one or more processors. The system may also include data storage, where the data storage has stored thereon computer-executable instructions that, when executed by the one or more processors, cause the system to carry out operations. The operations may include determining that a portion of a scene in an image frame being captured at a first focal length is out of focus. The operations may also include capturing one or more first image frames at the first focal length and one or more additional image frames at a second focal length to focus on the portion of the scene. The operations may additionally include providing the one or more first image frames and the one or more additional image frames as input to a machine learning (ML) model, the ML model having been trained to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The operations may also include receiving the predicted image from the ML model.
[0006] In another aspect, a computing device is provided. The device includes a primary camera and a secondary camera that share a common field of view. The device also includes one or more processors and data storage that has stored thereon computer-executable instructions that, when executed by the one or more processors, cause the computing device to carry out operations. The operations may include determining that a portion of a scene in an image frame being captured at a first focal length is out of focus. The operations may also include capturing one or more first image frames at the first focal length and one or more additional image frames at a second focal length to focus on the portion of the scene. The operations may additionally include providing the one or more first image frames and the one or more additional image frames as input to a machine learning (ML) model, the ML model having been trained to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The operations may also include receiving the predicted image from the ML model.
[0007] In another aspect, an article of manufacture is provided. The article of manufacture may include a non-transitory computer-readable medium having stored thereon program instructions that, upon execution by one or more processors of a computing device, cause the computing device to carry out operations. The operations may include determining that a
portion of a scene in an image frame being captured at a first focal length is out of focus. The operations may also include capturing one or more first image frames at the first focal length and one or more additional image frames at a second focal length to focus on the portion of the scene. The operations may additionally include providing the one or more first image frames and the one or more additional image frames as input to a machine learning (ML) model, the ML model having been trained to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The operations may also include receiving the predicted image from the ML model.
[0008] In another aspect, a program is provided. The program, upon execution by one or more processors of a computing device, causes the computing device to carry out operations. The operations may include determining that a portion of a scene in an image frame being captured at a first focal length is out of focus. The operations may also include capturing one or more first image frames at the first focal length and one or more additional image frames at a second focal length to focus on the portion of the scene. The operations may additionally include providing the one or more first image frames and the one or more additional image frames as input to a machine learning (ML) model, the ML model having been trained to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The operations may also include receiving the predicted image from the ML model.
[0009] In another aspect, a computer-implemented method is provided. The method includes receiving training data comprising a plurality of pairs, each pair comprising an image frame and one or more associated focus stack image frames, wherein the one or more associated focus stack image frames are synthetically blurred versions of the image frame. The method also includes training, based on the training data, a machine learning (ML) model to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The method additionally includes providing the trained ML model.
[0010] In another aspect, a system is provided. The system may include one or more processors. The system may also include data storage, where the data storage has stored thereon computer-executable instructions that, when executed by the one or more processors, cause the system to carry out operations. The operations may include receiving training data comprising a plurality of pairs, each pair comprising an image frame and one or more associated focus stack image frames, wherein the one or more associated focus stack image frames are synthetically blurred versions of the image frame. The operations may also include training, based on the training data, a machine learning (ML) model to merge one or more focused
regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The operations may additionally include providing the trained ML model.
[0011] In another aspect, a computing device is provided. The device includes a primary camera and a secondary camera that share a common field of view. The device also includes one or more processors and data storage that has stored thereon computer-executable instructions that, when executed by the one or more processors, cause the computing device to carry out operations. The operations may include receiving training data comprising a plurality of pairs, each pair comprising an image frame and one or more associated focus stack image frames, wherein the one or more associated focus stack image frames are synthetically blurred versions of the image frame. The operations may also include training, based on the training data, a machine learning (ML) model to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The operations may additionally include providing the trained ML model.
[0012] In another aspect, an article of manufacture is provided. The article of manufacture may include a non-transitory computer-readable medium having stored thereon program instructions that, upon execution by one or more processors of a computing device, cause the computing device to carry out operations. The operations may include receiving training data comprising a plurality of pairs, each pair comprising an image frame and one or more associated focus stack image frames, wherein the one or more associated focus stack image frames are synthetically blurred versions of the image frame. The operations may also include training, based on the training data, a machine learning (ML) model to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The operations may additionally include providing the trained ML model.
[0013] In another aspect, a program is provided. The program, upon execution by one or more processors of a computing device, causes the computing device to carry out operations. The operations may include receiving training data comprising a plurality of pairs, each pair comprising an image frame and one or more associated focus stack image frames, wherein the one or more associated focus stack image frames are synthetically blurred versions of the image frame. The operations may also include training, based on the training data, a machine learning (ML) model to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The operations may additionally include providing the trained ML model.
[0014] The foregoing summary is illustrative only and is not intended to be in any way limiting. In addition to the illustrative aspects, embodiments, and features described above, further
aspects, embodiments, and features will become apparent by reference to the following detailed description and the accompanying drawings.
BRIEF DESCRIPTION OF THE FIGURES
[0015] Figure 1 is an illustration of front, right-side, and rear views of a digital camera device, in accordance with example embodiments.
[0016] Figure 2 is an example illustration of a focus stack fusion based image processing, in accordance with example embodiments.
[0017] Figure 3 is an example overview of focus stack fusion, in accordance with example embodiments.
[0018] Figure 4 is an example overview of training data generation, in accordance with example embodiments.
[0019] Figure 5 is an example overview of synthetic training data generation, in accordance with example embodiments.
[0020] Figure 6 is an example machine learning model for focus stack fusion, in accordance with example embodiments.
[0021] Figure 7 is an example illustration of blending, in accordance with example embodiments.
[0022] Figure 8 is an example overview of a focus stack fusion based image processing pipeline, in accordance with example embodiments.
[0023] Figure 9 is an example illustration of triggering for focus stack fusion, in accordance with example embodiments.
[0024] Figure 10 is an example illustration of focus stack fusion in landscape photography, in accordance with example embodiments.
[0025] Figure 11 is another example illustration of focus stack fusion in landscape photography, in accordance with example embodiments.
[0026] Figure 12 is another example illustration of focus stack fusion in landscape photography, in accordance with example embodiments.
[0027] Figure 13 is another example illustration of focus stack fusion in landscape photography, in accordance with example embodiments.
[0028] Figure 14 is an example illustration of focus stack fusion in portrait photography, in accordance with example embodiments.
[0029] Figure 15 is an example illustration of focus stack fusion in food photography, in accordance with example embodiments.
[0030] Figure 16 is another example illustration of focus stack fusion in food photography, in accordance with example embodiments.
[0031] Figure 17 is an example illustration of focus stack fusion in document photography, in accordance with example embodiments.
[0032] Figure 18 is another example illustration of focus stack fusion in document photography, in accordance with example embodiments.
[0033] Figure 19 is a diagram illustrating training and inference phases of a machine learning model, in accordance with example embodiments.
[0034] Figure 20 depicts a distributed computing architecture, in accordance with example embodiments.
[0035] Figure 21 is a block diagram of an example computing device, in accordance with example embodiments.
[0036] Figure 22 is a flowchart of a method, in accordance with example embodiments.
[0037] Figure 23 is another flowchart of a method, in accordance with example embodiments.
DETAILED DESCRIPTION
[0038] Example methods, devices, and systems are described herein. It should be understood that the words “example” and “exemplary” are used herein to mean “serving as an example, instance, or illustration.” Any embodiment or feature described herein as being an “example” or “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or features. Other embodiments can be utilized, and other changes can be made, without departing from the scope of the subject matter presented herein.
[0039] Thus, the example embodiments described herein are not meant to be limiting. Aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and designed in a wide variety of different configurations, all of which are contemplated herein.
[0040] Further, unless context suggests otherwise, the features illustrated in each of the figures may be used in combination with one another. Thus, the figures should be generally viewed as component aspects of one or more overall embodiments, with the understanding that not all illustrated features are necessary for each embodiment.
Overview
[0041] Focus stack fusion may take two or more images captured at different focal distances as input, and combine them to generate a single image with an extended DoF. For example, focus stack fusion may take two wide angle images captured at different focal distances as
input. As described herein, an end-to-end pipeline may be designed to include both machine learning (ML) and non-ML methods to implement the focus stack fusion. For example, focus stack fusion may involve a training data generation method that can use depth information to generate an unlimited number of inputs.
[0042] Various techniques may be used to generate depth information for an image. In some cases, depth information may be generated for the entire image (e.g., for the entire image frame). In other cases, depth information may only be generated for a certain area or areas in an image. For instance, depth information may only be generated when image segmentation is used to identify one or more objects in an image. Depth information may be determined specifically for the identified object or objects.
[0043] In embodiments, stereo imaging may be utilized to generate a depth map. In such embodiments, a depth map may be obtained by correlating left and right stereoscopic images to match pixels between the stereoscopic images. The pixels may be matched by determining which pixels are the most similar between the left and right images. Pixels correlated between the left and right stereoscopic images may then be used to determine depth information. For example, a disparity between the location of the pixel in the left image and the location of the corresponding pixel in the right image may be used to calculate the depth information using binocular disparity techniques. An image may be produced that contains depth information for a scene, such as information related to how deep or how far away objects in the scene are in relation to a camera's viewpoint.
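As an illustrative sketch, a depth map might be computed from a rectified stereo pair via binocular disparity as shown below, using OpenCV block matching and assuming a known focal length (in pixels) and baseline; the function name and parameter values are illustrative rather than taken from this disclosure.

```python
import numpy as np
import cv2

def stereo_depth_map(left_gray, right_gray, focal_px, baseline_m, num_disp=128, block=11):
    """Minimal sketch: correlate rectified left/right images with block matching,
    then convert the resulting disparity to depth via binocular disparity."""
    # Block matching finds, for each left pixel, the most similar right pixel
    # along the same scanline; the horizontal offset is the disparity.
    matcher = cv2.StereoBM_create(numDisparities=num_disp, blockSize=block)
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0

    # depth = f * B / disparity (larger disparity => closer object).
    depth = np.zeros_like(disparity)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```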
[0044] In some embodiments, a novel training solution may involve combining a UNet with synthetic focus stack training data on mobile devices to reduce domain gaps. Upon training, the ML model may be used to take two or more images captured at different focal distances as input, and combine them to generate a single image with an extended DoF. For example, the end-to-end pipeline (e.g., on mobile devices) may involve capturing, at different focus distances to simulate a focus stack, a main wide image with zero shutter lag (ZSL) frames and a reference wide image with a post shutter lag (PSL) frame, and combining them to generate a single image with an extended DoF. The end-to-end pipeline may be triggered, for example, upon detection of multiple faces in an image. For example, a current DoF and the depth of each face may be determined, and the pipeline may be triggered upon a determination that one or more of the multiple faces is out of the current DoF.
[0045] The training pipeline may involve training data generation. This may be achieved by capturing sharp images from a camera. Based on an image depth map, a synthetic blur may be added to the foreground or background of the image to synthetically generate a different
number of focus stacks. For training purposes, the blurred images may be utilized as a focus stack input and the sharp image may be used as a target. In some embodiments, a 5-stage UNet model may be trained with focus stack images as input. A VGG loss between the prediction and the target may be used for optimization. Also, for example, an L1 loss may be used to allow the model to learn to fuse the sharp regions from the inputs together.
[0046] In some embodiments, data augmentation may be used to enhance the model’s robustness and its ability to simulate real-world conditions. For example, random brightness and saturation variations may be introduced. This simulates the potential discrepancies that can occur between PSL frames (with extended exposure for noise reduction) and ZSL frames. The model is thus trained to effectively reconcile these differences during fusion. Also, for example, random sensor noise may be added to mimic noise characteristics of a single PSL frame (which may be noisier than a high dynamic range (HDR) ZSL fusion using 6 frames). Such controlled noise injection can ensure that the model learns to prioritize detail transfer while minimizing noise amplification.
[0047] The inference pipeline may involve capturing portrait images from phones at different focal distances. For example, wide ZSL frames may be captured at one focus distance and wide PSL frames may be captured at another focal distance. Generally, it may be challenging to apply traditional non-ML alignment methods, such as global homography with feature detection, to occlusion regions and/or motion scenes. To overcome this, the techniques described herein determine a motion flow between the main and auxiliary HDR images using a pre-trained cost volume network (PWC-Net) and/or Recurrent All-pairs Field Transforms (RAFT) model. The PWC-Net/RAFT model may be configured to generate a detailed, per-pixel motion flow map, accounting for subtle movements and potential occlusions. This ensures accurate alignment of the images, even in challenging scenarios. The reference focus stack images may be warped and aligned with the source image using the PWC-Net/RAFT model. Subsequently, two occlusion masks with the forward and backward flow may be determined and combined for later use.
[0048] Some embodiments may involve a fallback logic that may facilitate skipping of fusion at an early stage. For example, the fallback logic may be triggered in situations such as when the telephoto lens is determined to be defocused, the lighting is determined to be below a lighting threshold, the reprojection error (e.g., for a warped reference and/or source) is determined to be above an error threshold, and/or a base frame delta (e.g., for a reference and/or source) is determined to be above a delta threshold.
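For illustration, the fallback checks described above might be organized as in the following sketch; the specific threshold values and metric names are placeholders and are not specified by this disclosure.

```python
def should_skip_fusion(lens_defocused, scene_lux, reprojection_error, base_frame_delta,
                       lux_threshold=10.0, reproj_threshold=4.0, delta_threshold=0.25):
    """Hedged sketch of the fallback logic: return True to skip focus stack fusion.
    All thresholds are illustrative placeholders, not values from the source."""
    if lens_defocused:                         # telephoto lens could not reach focus
        return True
    if scene_lux < lux_threshold:              # lighting below the lighting threshold
        return True
    if reprojection_error > reproj_threshold:  # warped reference does not match source
        return True
    if base_frame_delta > delta_threshold:     # reference and source differ too much
        return True
    return False
```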
[0049] In some embodiments, after preprocessing an image, the source and reference may be provided to a pre-trained UNet model to output the inference result. Also, for example, the fusion image may be blended with the source image using one or more of an occlusion mask, a rejection mask, a subject mask, or a smooth mask. In some embodiments, a dynamic blending strategy may be used to ensure consistency across a face and a body.
[0050] System latency and/or memory usage may be reduced by restricting focus stack fusion to the face region instead of full-frame images. The end-to-end algorithm prioritizes operations within a face-focused region of interest (ROI). This targeted approach streamlines both the PSL frame HDR process and the primary focus stack fusion algorithm. Computational complexity may be managed by adjusting an input size of the ML fusion model (e.g., to 512 x 512). The algorithm leverages multi-threading, intelligently distributing independent computations across multiple threads for faster execution. The system pre-compiles models, caches OpenCL data, and stores model information during initial setup. Such proactive caching can significantly improve processing speed for subsequent shots. The fusion ML model may seamlessly offload its processing to an on-device tensor processing unit (TPU), resulting in a substantial increase in computational speed. The algorithm may strategically employ downsampling to reduce computational latency. By operating on smaller input sizes whenever feasible, processing speed may be significantly increased. To reduce memory footprint and latency, the algorithm can be configured to minimize data copies. Instead, the algorithm can leverage image references for a more efficient workflow.
Example Camera Systems
[0051] As image capture devices, such as cameras, become more popular, they may be employed as standalone hardware devices or integrated into various other types of devices. For instance, still and video cameras are now regularly included in wireless computing devices (e.g., mobile devices, such as mobile phones), tablet computers, laptop computers, video game interfaces, home automation devices, and even automobiles and other types of vehicles.
[0001] The physical components of a camera may include one or more apertures through which light enters, one or more recording surfaces for capturing the images represented by the light, and lenses positioned in front of each aperture to focus at least part of the image on the recording surface(s). The apertures may be of a fixed size or may be adjustable. In an analog camera, the recording surface may be a photographic film. In a digital camera, the recording surface may include an electronic image sensor (e.g., a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) sensor) to transfer and/or store captured images in a data storage unit (e.g., memory).
[0002] One or more shutters may be coupled to, or positioned near, the lenses or the recording surfaces. Each shutter may either be in a closed position, in which it blocks light from reaching the recording surface, or an open position, in which light is allowed to reach the recording surface. The position of each shutter may be controlled by a shutter button. For instance, a shutter may be in the closed position by default. When the shutter button is triggered (e.g., pressed), the shutter may change from the closed position to the open position for a period of time, known as the shutter cycle. During the shutter cycle, an image may be captured on the recording surface. At the end of the shutter cycle, the shutter may change back to the closed position.
[0003] Alternatively, the shuttering process may be electronic. For example, before an electronic shutter of a CCD image sensor is “opened,” the sensor may be reset to remove any residual signal in its photodiodes. While the electronic shutter remains open, the photodiodes may accumulate charge. When or after the shutter closes, these charges may be transferred to longer-term data storage. Combinations of mechanical and electronic shuttering may also be possible.
[0004] Regardless of type, a shutter may be activated and/or controlled by something other than a shutter button. For instance, the shutter may be activated by a softkey, a timer, or some other trigger. Herein, the term “capture” may refer to any mechanical and/or electronic shuttering process that results in one or more images being recorded, regardless of how the shuttering process is triggered or controlled.
[0005] The exposure of a captured image may be determined by a combination of the size of the aperture, the brightness of the light entering the aperture, and the length of the shutter cycle (also referred to as the shutter length, the exposure length, or the exposure time). Additionally, a digital and/or analog gain (e.g., based on an ISO setting) may be applied to the image, thereby influencing the exposure. In some embodiments, the term “exposure length,” “exposure time,” or “exposure time interval” may refer to the shutter length multiplied by the gain for a particular aperture size. Thus, these terms may be used somewhat interchangeably, and should be interpreted as possibly being a shutter length, an exposure time, and/or any other metric that controls the amount of signal response that results from light reaching the recording surface.
[0006] In some implementations or modes of operation, a camera may capture one or more still images each time image capture is triggered. In other implementations or modes of operation, a camera may capture a video image by continuously capturing images at a particular rate (e.g., 24 frames per second) as long as image capture remains triggered (e.g., while the shutter button is held down). Some cameras, when operating in a mode to capture a still image,
may open the shutter when the camera device or application is activated, and the shutter may remain in this position until the camera device or application is deactivated. While the shutter is open, the camera device or application may capture and display a representation of a scene on a viewfinder (sometimes referred to as displaying a “preview frame”). When image capture is triggered, one or more distinct payload images of the current scene may be captured.
[0052] Cameras, including digital and analog cameras, may include software to control one or more camera functions and/or settings, such as aperture size, exposure time, gain, and so on. Additionally, some cameras may include software that digitally processes images during or after image capture. While the description above refers to cameras in general, it may be particularly relevant to digital cameras. Digital cameras may be standalone devices (e.g., a DSLR camera) or may be integrated with other devices.
[0053] Either or both of a front-facing camera and a rear-facing camera may include or be associated with an ambient light sensor (ALS) that may continuously or from time to time determine the ambient brightness of a scene that the camera can capture. In some devices, the ALS can be used to adjust the display brightness of a screen associated with the camera (e.g., a viewfinder). When the determined ambient brightness is high, the brightness level of the screen may be increased to make the screen easier to view. When the determined ambient brightness is low, the brightness level of the screen may be decreased, also to make the screen easier to view as well as to potentially save power. Additionally, the ambient light sensor’s input may be used to determine an exposure time of an associated camera, or to help in this determination.
[0054] Figure 1 is an illustration of front, right-side, and rear views of a digital camera device 100, in accordance with example embodiments. Digital camera device 100 may be, for example, a mobile device (e.g., a mobile phone), a tablet computer, or a wearable computing device. However, other embodiments are possible. Digital camera device 100 may include various elements, such as a body 102, a front-facing camera 104, a multi-element display 106, a shutter button 108, and other buttons 110. Digital camera device 100 could further include one or more rear-facing cameras 112, 114. Front-facing camera 104 may be positioned on a side of body 102 typically facing a user while in operation, or on the same side as multi-element display 106. Rear-facing cameras 112, 114 may be positioned on a side of body 102 opposite front-facing camera 104. Referring to the cameras as front-facing and rear-facing is arbitrary, and digital camera device 100 may include multiple cameras positioned on various sides of body 102.
[0055] Multi-element display 106 could represent a cathode ray tube (CRT) display, a light-emitting diode (LED) display, a liquid crystal display (LCD), a plasma display, or any other
type of display known in the art. In some embodiments, multi-element display 106 may display a digital representation of the current image being captured by front-facing camera 104 and/or rear-facing cameras 112, 114, or an image that could be captured or was recently captured by either or both of these cameras. Thus, multi-element display 106 may serve as a viewfinder for either camera. Multi-element display 106 may also support touchscreen and/or presence-sensitive functions that may be able to adjust the settings and/or configuration of any aspect of digital camera device 100.
[0056] Multi-element display 106 may include additional features related to a camera application. For example, multiple modes may be available for a user, including a motion mode, portrait mode, video mode, video bokeh mode, and so forth. The camera application may be in camera mode and provide additional features, such as a reverse icon to activate reverse camera view, a trigger button to capture a previewed image, and a photo stream icon to access a database of captured images. Also, for example, a magnification ratio slider may be displayed and a user can move a virtual object along the magnification ratio slider to select a magnification ratio. In some embodiments, a user may use the multi-element display 106, also referred to herein as the display screen, to adjust the magnification ratio (e.g., by moving two fingers on the display screen in an outward motion away from each other), and the magnification ratio slider may automatically display the magnification ratio.
[0057] Front-facing camera 104 may include an image sensor and associated optical elements such as lenses. Front-facing camera 104 may offer zoom capabilities or could have a fixed focal length. In other embodiments, interchangeable lenses could be used with front-facing camera 104. Front-facing camera 104 may have a variable mechanical aperture and a mechanical and/or electronic shutter. Front-facing camera 104 also could be configured to capture still images, video images, or both. Further, front-facing camera 104 could represent a monoscopic, stereoscopic, or multiscopic camera. Rear-facing cameras 112, 114 may be similarly or differently arranged. Additionally, front-facing camera 104, rear-facing cameras 112, 114, or both, may be an array of one or more cameras.
[0058] Either or both of front-facing camera 104 and rear-facing cameras 112, 114 may include or be associated with an illumination component that provides a light field to illuminate a target object. For instance, an illumination component could provide flash or constant illumination of the target object (e.g., using one or more LEDs). An illumination component could also be configured to provide a light field that includes one or more of structured light, polarized light, and light with specific spectral content. Other types of light fields known and used to recover
three-dimensional (3D) models from an object are possible within the context of the embodiments herein.
[0059] In some digital camera devices 100, either or both of front-facing camera 104 and rear-facing cameras 112, 114 may include or be associated with an ambient light sensor that may continuously or from time to time determine the ambient brightness of a scene that the camera can capture. In some devices, the ambient light sensor can be used to adjust the display brightness of a screen associated with the camera (e.g., a viewfinder). When the determined ambient brightness is high, the brightness level of the screen may be increased to make the screen easier to view. When the determined ambient brightness is low, the brightness level of the screen may be decreased, also to make the screen easier to view as well as to potentially save power. Additionally, the ambient light sensor’s input may be used to determine an exposure time of an associated camera, or to help in this determination.
[0060] Digital camera device 100 could be configured to use multi-element display 106 and either front-facing camera 104 or rear-facing cameras 112, 114 to capture images of a target object (e.g., a subject within a scene). The captured images could be a plurality of still images or a video image (e.g., a series of still images captured in rapid succession with or without accompanying audio captured by a microphone). The image capture could be triggered by activating shutter button 108, pressing a softkey on multi-element display 106, or by some other mechanism. Depending upon the implementation, the images could be captured automatically at a specific time interval, for example, upon pressing shutter button 108, upon appropriate lighting conditions of the target object, upon moving digital camera device 100 a predetermined distance, or according to a predetermined capture schedule.
[0061] As noted above, the functions of digital camera device 100 (or another type of digital camera) may be integrated into a computing device, such as a wireless computing device, cell phone, tablet computer, laptop computer, and so on. For example, a camera controller may be integrated with the digital camera device 100 to control one or more functions of the digital camera device 100.
Example Focus Stack Fusion Pipelines
[0062] Figure 2 is an example illustration of a focus stack fusion based image processing, in accordance with example embodiments. Image 205 illustrates a partial view of a milk jug 205A and a cup 205B. Milk jug 205A is in the background and out of focus, while cup 205B is in the foreground and is in focus. Image 210 illustrates a partial view of a milk jug 210A and a cup 210B. Milk jug 210A is in the background and in focus, while cup 210B is in the foreground and is out of focus. Image 215 illustrates another partial view of a milk jug 215A and a cup
215B. Milk jug 215A is in the background and in focus, and cup 215B is in the foreground and is also in focus. As illustrated, image 215 can be generated by stacking together the sharper, in-focus portions of images 205 and 210.
[0063] Figure 3 is an example overview of focus stack fusion, in accordance with example embodiments. A plurality of focus stack images 305 may be captured at different focal lengths. Some embodiments involve receiving training data comprising a plurality of pairs, each pair comprising an image frame and one or more associated focus stack image frames, wherein the one or more associated focus stack image frames are synthetically blurred versions of the image frame. For example, a training pipeline may involve capturing sharp images from camera devices. Based on an image depth map, synthetic blur may be added to the foreground and/or the background of the image to synthetically generate a different number of focus stack images 305. The blurred images may be used as a focus stack input and the sharp image may be used as the target.
[0064] In the context of an inference pipeline, the plurality of focus stack images 305 may correspond to a main HDR output (e.g., main HDR output at step 835 of Figure 8) and an extended depth of field (EDoF) PSL HDR output (e.g., EDoF HDR output at step 840 of Figure 8).
[0065] In some embodiments, training data generation involves using depth information to generate an arbitrary number of inputs. In some embodiments, the training solution may combine a UNet with synthetic focus stack training data (e.g., on mobile devices) to reduce domain gaps.
[0066] An example focus stack fusion algorithm 310 is illustrated. Traditional non-ML alignment methods, such as global homography with feature detection, may struggle with occlusion regions and motion scenes. Such challenges may be overcome by determining precise motion flow between the main and auxiliary HDR images. In some embodiments, alignment 315 may involve coarse alignment with rotation, translation, and scaling. In some embodiments, alignment 315 may involve an optical flow based dense alignment (e.g., using PWC-Net). For example, motion flow between the main and auxiliary HDR images may be determined using a pre-trained PWC-Net/RAFT model that can generate a detailed, per-pixel motion flow map, accounting for subtle movements and potential occlusions. This can result in more accurate alignment of images, even in challenging scenarios. In some embodiments, alignment 315 may be performed by a joint training model for alignment and fusion.
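For illustration, once a dense flow field has been predicted (e.g., by a pre-trained PWC-Net or RAFT model), the reference image can be warped into the source frame's geometry. The sketch below treats the flow estimator as a black box and only applies a given flow with OpenCV remapping; the names and conventions are illustrative rather than taken from this disclosure.

```python
import numpy as np
import cv2

def warp_with_flow(reference_bgr, flow_src_to_ref):
    """Warp the reference image into the source frame's geometry using a dense
    per-pixel flow field (H x W x 2), e.g. predicted by PWC-Net or RAFT.
    The flow estimator is assumed to exist elsewhere; this only applies the flow."""
    h, w = flow_src_to_ref.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # For each source pixel (x, y), sample the reference at (x + u, y + v).
    map_x = (grid_x + flow_src_to_ref[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_src_to_ref[..., 1]).astype(np.float32)
    return cv2.remap(reference_bgr, map_x, map_y, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)
```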
[0067] In some embodiments, a dedicated ML fusion model may be trained to combine the in-focus regions from the aligned images. In some embodiments, the training may involve use of
synthetic data. The fusion model may be trained to learn to address potential color shifts, noise discrepancies, and differences in focus distance between the HDR inputs, resulting in a natural and consistently sharp output.
[0068] Figure 4 is an example overview of training data generation, in accordance with example embodiments. Some embodiments involve generating the one or more associated focus stack image frames by adding a synthetic blur to a foreground or a background of the image frame of a pair. For example, training data may be synthesized using scene depth information and an all-in-focus sharp image. To simulate real-world conditions, a Gaussian blur kernel with a randomized sigma range may be applied to introduce defocus effects. A focus map may be generated using depth information to distinguish the foreground and the background. Focus stacks may be generated by selectively applying a defocus blur to either the foreground or background, guided by the focus map.
[0069] As illustrated in Figure 4, an RGB image 405 may be used to generate a depth image 410. An example blurred image 415 is shown (e.g., generated by applying a Gaussian blur kernel with a randomized sigma range). A focus map 420 may be generated using depth information from depth image 410. A near focused image 425 and a far focused image 430 may be generated by selectively applying a defocus blur to the background and the foreground, respectively, guided by the focus map 420.
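As one hedged example, the following sketch generates a near-focused and a far-focused training frame from a sharp image and its depth map, in the spirit of Figure 4; the depth threshold, the blur sigma range, and the hard foreground/background split are simplifying assumptions.

```python
import numpy as np
import cv2

def synthesize_focus_stack(sharp_rgb, depth, depth_threshold=None, sigma_range=(1.0, 5.0)):
    """Hedged sketch: build a binary focus map from depth, then selectively blur
    the background (near-focused frame) or the foreground (far-focused frame)."""
    if depth_threshold is None:
        depth_threshold = np.median(depth)          # illustrative split point
    focus_map = (depth < depth_threshold).astype(np.float32)[..., None]  # 1 = foreground

    sigma = np.random.uniform(*sigma_range)         # randomized defocus strength
    blurred = cv2.GaussianBlur(sharp_rgb, ksize=(0, 0), sigmaX=sigma)

    near_focused = focus_map * sharp_rgb + (1.0 - focus_map) * blurred   # blur background
    far_focused = focus_map * blurred + (1.0 - focus_map) * sharp_rgb    # blur foreground
    return near_focused.astype(sharp_rgb.dtype), far_focused.astype(sharp_rgb.dtype)
```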
[0070] Figure 5 is an example overview of synthetic training data generation, in accordance with example embodiments. For example, the first image 505 for “Focus Stack 0” shows a defocus blur applied synthetically to the background portion indicated by an oval shaped bounded region 505A. Second image 510 for “Focus Stack 1” shows a defocus blur applied synthetically to the foreground portion indicated by an oval shaped bounded region 510A. Fusion 515 shows a sharp image where both background portion 515A and foreground portion 515B are in focus. Generally, a fusion zoom wide image may be used as ground truth, and a defocus blur may be synthetically applied according to a depth of the image. In some embodiments, a motion flow between images from Wide and Tele lens may be used to infer image depth. The number of focus stack images may be based on several factors, such as a number of objects of interest (e.g., a number of detected faces), a depth of the image, luminosity, and so forth.
[0071] Some embodiments involve training, based on the training data, a machine learning (ML) model to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). Referring again to Figure 3, in some embodiments, machine learning (ML) model 320 may be a neural network. In some
embodiments, the neural network may be a Residual UNet. The ML model may be trained on a VGG loss and an L1 loss. For example, a five (5)-stage Residual UNet model may be trained with focus stack images (e.g., from the synthetic training dataset) as input. The five (5)-stage Residual UNet model may include a 4-8-16-32-64 architecture. In some embodiments, the training of the ML model involves training the ML model based on a Visual Geometry Group (VGG) loss function. In some embodiments, the training of the ML model involves training the ML model based on an L1 loss function. The VGG loss (e.g., between the prediction and the target) and the L1 loss (e.g., between the prediction and the target) may be determined to allow the ML model to learn to fuse sharp regions from the focus stack images together.
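As an illustrative sketch, a combined L1 and VGG (perceptual) training loss might be set up as in the following PyTorch snippet; the chosen VGG layers, loss weights, and input normalization are assumptions rather than values specified by this disclosure.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FusionLoss(nn.Module):
    """Hedged sketch: L1 loss plus a VGG feature (perceptual) loss between the
    predicted fusion image and the sharp target. Inputs are assumed to be
    3-channel, ImageNet-normalized tensors; layer choice and weights are
    illustrative, not taken from the source."""
    def __init__(self, vgg_weight=0.1, l1_weight=1.0):
        super().__init__()
        # Use early VGG16 features as a fixed perceptual feature extractor.
        self.features = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
        for p in self.features.parameters():
            p.requires_grad = False
        self.l1 = nn.L1Loss()
        self.vgg_weight = vgg_weight
        self.l1_weight = l1_weight

    def forward(self, prediction, target):
        l1_term = self.l1(prediction, target)
        vgg_term = self.l1(self.features(prediction), self.features(target))
        return self.l1_weight * l1_term + self.vgg_weight * vgg_term
```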
[0072] Figure 6 is an example machine learning model 600 for focus stack fusion, in accordance with example embodiments. In some embodiments, ML model 600 may include encoder 610, residual layers 615 and decoder 620. A plurality of focus stack images 605 may be received as input for encoder 610. Encoder 610 may include four encoder blocks, 610A, 610B, 610C, and 610D. The first three encoder blocks 610A, 610B, and 610C of encoder 610 consist of a single convolution followed by downsampling by a factor of 2. Some embodiments may involve a blur pooling operation following the convolution operation. For example, a 3D convolution 630 may be applied to the plurality of focus stack images 605 received by the first encoder block 610A of encoder 610. The third encoder block 610C may apply a flatten operation 635 that flattens coefficients along the z-axis (to convert 3D coefficients to 2D coefficients). The output of encoder 610 may be provided to residual layers 615.
[0073] In some embodiments, residual layers 615 may include two blocks 615 A and 615B. A 2D convolution 640 may be applied by residual layers 615. The output of residual layers 615 may be provided to decoder 620.
[0074] In some embodiments, decoder 620 may include three decoder blocks 620A, 620B, and 620C that mirror the encoder blocks 610C, 610B, and 610A, respectively. Each decoder block of decoder 620 consists of a single convolution followed by upsampling by a factor of 2. Some embodiments may involve a bilinear upsampling operation following the convolution operation. Ground truth images may be used for supervised learning. In some embodiments, the upsampling may involve bicubic or nearest neighbor interpolation techniques.
[0075] In some embodiments, ML model 600 may involve skip connections. Example layers of ML model 600 can include, but are not limited to, input layers, convolutional layers, activation layers, pooling layers, and output layers. Convolutional layers can compute an output of neurons connected to local regions in the input (e.g., plurality of focus stack images 605).
In some cases, a bilinear upsampling followed by a convolution may be performed to apply a filter to a relatively small input to expand/upsample the relatively small input to become a larger output. Activation layers can determine whether or not an output of a preceding layer is “activated” or actually provided (e.g., provided to a succeeding layer). Pooling layers can downsample the input. For example, ML model 600 can use one or more pooling layers to downsample the input by a predetermined factor (e.g., a factor of two) in the horizontal and/or vertical dimensions. Output layers can provide an output (e.g., focus stack fusion 625) of ML model 600 to software and/or hardware interfacing with ML model 600, e.g., to hardware and/or software used to display focus stack fusion 625.
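For illustration, the following is a much-simplified 2D PyTorch sketch of a small encoder/residual/decoder fusion network with skip connections and a 4-8-16-32 style channel progression. The 3D convolution over the focus stack, the flatten step, blur pooling, and the full five-stage depth described above are omitted for brevity; here the focus stack is simply concatenated along the channel axis, so this is an assumption-laden reduction of the described model rather than the model itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))

class TinyFusionUNet(nn.Module):
    """Simplified sketch: encoder convs with downsampling, two residual blocks,
    and a mirrored decoder with bilinear upsampling and skip connections."""
    def __init__(self, num_stack=2, base=4):
        super().__init__()
        c1, c2, c3, c4 = base, base * 2, base * 4, base * 8
        self.enc1 = nn.Conv2d(3 * num_stack, c1, 3, padding=1)
        self.enc2 = nn.Conv2d(c1, c2, 3, padding=1)
        self.enc3 = nn.Conv2d(c2, c3, 3, padding=1)
        self.enc4 = nn.Conv2d(c3, c4, 3, padding=1)
        self.res = nn.Sequential(ResidualBlock(c4), ResidualBlock(c4))
        self.dec3 = nn.Conv2d(c4 + c3, c3, 3, padding=1)
        self.dec2 = nn.Conv2d(c3 + c2, c2, 3, padding=1)
        self.dec1 = nn.Conv2d(c2 + c1, c1, 3, padding=1)
        self.out = nn.Conv2d(c1, 3, 3, padding=1)

    def forward(self, stack):                          # stack: (B, 3*num_stack, H, W)
        e1 = F.relu(self.enc1(stack))
        e2 = F.relu(self.enc2(F.avg_pool2d(e1, 2)))    # downsample by a factor of 2
        e3 = F.relu(self.enc3(F.avg_pool2d(e2, 2)))
        e4 = F.relu(self.enc4(F.avg_pool2d(e3, 2)))
        b = self.res(e4)
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        d3 = F.relu(self.dec3(torch.cat([up(b), e3], dim=1)))
        d2 = F.relu(self.dec2(torch.cat([up(d3), e2], dim=1)))
        d1 = F.relu(self.dec1(torch.cat([up(d2), e1], dim=1)))
        return self.out(d1)                            # predicted extended-DoF image
```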
[0076] In some embodiments, ML model 600 may be an extended depth of field convolutional neural network (EDoF-CNN) such as an EDoF-CNN-3D, EDoF-CNN-Fast, EDoF-CNN-Pairwise, or EDoF-CNN-Max.
[0077] In some embodiments, one or more data augmentation techniques may be applied to enhance robustness and an ability to simulate real-world conditions. For example, random brightness and saturation variations may be introduced to the reference focus stack images. This approach can simulate potential discrepancies that may occur between EDoF PSL frames (with extended exposure for noise reduction) and ZSL frames. The model may be trained to effectively reconcile such differences during fusion. Also, for example, random sensor noise may be added to some (e.g., 50%) of the reference focus stack images. Such an approach may mimic noise characteristics of a single EDoF PSL frame, which may be noisier than an HDR ZSL fusion (e.g., using six (6) frames). Such controlled noise injection can enable the model to learn to prioritize detail transfer while minimizing noise amplification.
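As one hedged example, the augmentations described above might look like the following sketch; the brightness/saturation ranges, noise level, and 50% noise probability are illustrative assumptions.

```python
import numpy as np

def augment_reference(reference_rgb, noise_prob=0.5, rng=None):
    """Hedged sketch: random brightness/saturation shifts plus occasional
    sensor-like Gaussian noise on the reference frame (8-bit RGB assumed)."""
    rng = rng if rng is not None else np.random.default_rng()
    img = reference_rgb.astype(np.float32) / 255.0

    # Random brightness (global gain) and saturation (blend toward grayscale).
    img = np.clip(img * rng.uniform(0.8, 1.2), 0.0, 1.0)
    gray = img.mean(axis=-1, keepdims=True)
    sat = rng.uniform(0.8, 1.2)
    img = np.clip(gray + sat * (img - gray), 0.0, 1.0)

    # Add sensor-like noise to a random subset of references (e.g., ~50%).
    if rng.random() < noise_prob:
        img = np.clip(img + rng.normal(0.0, 0.02, img.shape).astype(np.float32), 0.0, 1.0)

    return (img * 255.0).astype(np.uint8)
```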
[0078] Referring again to Figure 3, to enhance system health, an 8-bit quantization fine-tuning stage may be incorporated after pre-training a floating-point (16-bit) ML model. The fine-tuning stage may leverage data augmentation to train ML model 320 to suppress noise transfer from the reference image to the output.
[0079] In some embodiments, blending 325 may blend the input and fusion image to generate a focus stack fusion image 330. In some embodiments, blending 325 may be applied in a subject mask region, for example, using techniques to identify and mitigate any potential artifacts including occlusion artifacts and warping artifacts.
[0080] Some embodiments involve blending the predicted image and the one or more first image frames based on the combination of the two occlusion masks. For example, blending 325 may involve combining occlusion, subject, rejection, and smooth masks into a single blending mask. This mask may be used for alpha blending between the input and fusion result,
creating a unified image. For example, an occlusion mask may be generated based on forward and backward flow consistency checks that can facilitate detection of occlusion artifacts. In some embodiments, blending 325 may involve using a fusion zoom rejection logic to generate a rejection map (described with reference to Figure 7). Generally speaking, the rejection map can analyze local spatial differences between the main and auxiliary images, for example, by utilizing a tiled base mean and standard deviation difference algorithm.
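For illustration, combining the individual masks into a single blending mask and alpha blending the fusion result with the source might be expressed as in the following sketch; the mask polarity convention (1 meaning "trust the fusion result") is an assumption.

```python
import numpy as np

def blend_fusion(source_rgb, fusion_rgb, occlusion, rejection, subject, smooth):
    """Hedged sketch: combine occlusion, rejection, subject, and smooth masks
    into one blending mask, then alpha-blend fusion and source. All masks are
    assumed to be float maps in [0, 1]."""
    # Only blend inside the subject region, away from occlusions/rejections,
    # with the smooth mask feathering the boundary.
    blend_mask = subject * smooth * (1.0 - occlusion) * (1.0 - rejection)
    blend_mask = blend_mask[..., None]                     # broadcast over RGB channels
    blended = (blend_mask * fusion_rgb.astype(np.float32)
               + (1.0 - blend_mask) * source_rgb.astype(np.float32))
    return blended.astype(source_rgb.dtype)
```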
[0081] In some embodiments, a segmenter (e.g., a one-shot segmenter) may be used to generate a subject mask that can accurately define a fusion region for an object (e.g., a face). In the event the image includes a portrait, a seamless transition between the face and body in the fused image may be generated by applying a smooth mask to a fusion boundary. For example, the smooth mask may be configured to gradually reduce a weight as an edge is approached, resulting in a smooth and natural blend between different regions.
[0082] In some embodiments, blending 325 may involve dynamic blending to maintain consistent sharpness across a face and a body. For example, an image blur may be analyzed by using a Fast Fourier Transform (FFT), and by determining a mean value of a magnitude spectrum obtained from the input image's inverse FFT. Higher values of the mean value may be indicative of sharper images. These values may be used to guide the dynamic blending algorithm in determining an optimal level of detail to recover for a seamless and uniformly sharp final image.
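As an illustrative sketch, an FFT-based sharpness measure of the kind described above might be computed as follows; the low-frequency cutoff radius and the logarithmic magnitude scaling are illustrative choices, not values from this disclosure.

```python
import numpy as np

def fft_sharpness(gray, low_freq_radius=30):
    """Hedged sketch of an FFT-based sharpness score: suppress low frequencies,
    invert the transform, and average the magnitude of the reconstruction.
    Higher values suggest a sharper image."""
    h, w = gray.shape
    spectrum = np.fft.fftshift(np.fft.fft2(gray.astype(np.float32)))

    # Zero out a low-frequency window around the spectrum center so only
    # high-frequency (detail) content remains.
    cy, cx = h // 2, w // 2
    spectrum[cy - low_freq_radius:cy + low_freq_radius,
             cx - low_freq_radius:cx + low_freq_radius] = 0

    recon = np.fft.ifft2(np.fft.ifftshift(spectrum))
    magnitude = 20 * np.log(np.abs(recon) + 1e-8)
    return float(np.mean(magnitude))
```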
[0083] Figure 7 is an example illustration of blending, in accordance with example embodiments. For example, a rejection region may be determined based on a rejection map that indicates misalignment between a base focus stack frame and other image frames. A source image 705 may be received. A source luma for source image 705 may be determined to generate source luma 710. A mean for the source luma may be determined to generate source luma mean 715. Likewise, a standard deviation for the source luma may be determined to generate source luma standard deviation 720.
[0084] A warped reference image may be received, as shown as warped reference 730. The warped reference image (e.g., warped reference 730) is a reference focus stack image that is warped to align with the main source focus stack image (e.g., source image 705). A luma for warped reference 730 may be determined to generate reference luma 735, and a mean for the reference luma may be determined to generate reference luma mean 740.
[0085] In some embodiments, a delta difference between source luma mean 715 and reference luma mean 740 may be determined to generate mean delta 745. For example, mean delta 745 may be determined as:
mean_delta = source_luma_mean - reference_luma_mean (Eqn. 1)
[0086] In some embodiments, source luma 710 and reference luma 735 may be used to determine an average L1 error, as indicated by average L1 error 725. For example, the average L1 error may be determined as:
average_L1_error = avg(|source_luma - reference_luma|) (Eqn. 2)
[0087] A difference between average L1 error 725 and mean delta 745 may be used to generate image 750. For example, the following relationship may be used:
average_L1_error - mean_delta (Eqn. 3)
[0088] A rejection map may be determined based on image 750 (the difference between average L1 error 725 and mean delta 745) and the standard deviation for the source luma (source luma standard deviation 720), as expressed by (Eqn. 4).
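As one hedged example, a tiled rejection map following Eqns. 1-3 might be sketched as follows; the per-tile formulation, the clamping at zero, and the normalization by the source standard deviation are assumptions about how (Eqn. 4) relates the quantities.

```python
import numpy as np

def rejection_map(source_luma, warped_reference_luma, tile=16, eps=1e-3):
    """Hedged sketch of a tiled rejection map: per tile, compare the mean-corrected
    L1 difference between source and warped reference against the source's local
    standard deviation. Higher values suggest misalignment to reject."""
    h, w = source_luma.shape
    out = np.zeros((h // tile, w // tile), dtype=np.float32)
    for ty in range(h // tile):
        for tx in range(w // tile):
            src = source_luma[ty * tile:(ty + 1) * tile,
                              tx * tile:(tx + 1) * tile].astype(np.float32)
            ref = warped_reference_luma[ty * tile:(ty + 1) * tile,
                                        tx * tile:(tx + 1) * tile].astype(np.float32)
            mean_delta = abs(src.mean() - ref.mean())       # Eqn. 1 (per tile)
            avg_l1 = np.abs(src - ref).mean()               # Eqn. 2
            residual = max(avg_l1 - mean_delta, 0.0)        # Eqn. 3
            out[ty, tx] = residual / (src.std() + eps)      # relate to source std (Eqn. 4)
    return out
```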
[0089] Some embodiments involve a fallback logic for scenes with a large motion. For example, due to a ZSL to EDoF PSL pipeline delay (e.g., of at least 6 frames), a difference in content between motion scenes can make fusion challenging. A fallback algorithm may be implemented that can leverage scene motion metadata and input image differences to detect significant motion discrepancies between ZSL and EDoF PSL frames.
[0090] Some embodiments involve a fallback logic for scenes with low light conditions. For example, scenes with extremely low light (e.g., lux < 10) may present challenges for autofocus and may introduce significant noise into the EDoF PSL frame, thereby complicating the image fusion process. A fallback algorithm may be implemented for extreme low light scenes based on scene digital gain and face signal-to-noise ratio (SNR).
[0091] An EDoF PSL frame may be configured to dynamically adjust a focal distance using voice coil motor (VCM) movement to precisely align with the target position. A VCM movement in a camera is a mechanical process that utilizes a magnetic field to move a coil
back and forth to adjust a position of the lens elements for autofocus (AF) settings. Simultaneously, ZSL frames may retain their existing focus.
[0092] Some embodiments involve maintaining color consistency between the one or more additional image frames and the one or more first image frames. For example, an EDoF PSL frame's auto white balance (AWB) may be calibrated to match that of the ZSL frames, thereby resulting in consistent color reproduction across the image.
[0093] In well-lit conditions, a single EDoF PSL frame may provide a clear image. The auto exposure (AE) settings may be synchronized with those of the primary ZSL frames. In darker scenes, to compensate for potential noise in low light, the EDoF PSL frame's exposure time (ET) may be selectively increased. However, the portrait's total exposure time (TET) may be maintained. The expansion ratio can be limited by the portrait's total gain, thereby maintaining an overall gain at or above one (1).
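For illustration, the low-light exposure adjustment described above reduces to simple arithmetic: lengthen the EDoF PSL exposure time while holding the total exposure time (time multiplied by gain) constant, and cap the expansion by the total gain so the remaining gain stays at or above one. In the sketch below the maximum expansion value is an illustrative assumption.

```python
def edof_psl_exposure(zsl_exposure_ms, zsl_total_gain, max_expansion=4.0):
    """Hedged sketch: lengthen the EDoF PSL exposure time while keeping total
    exposure (time x gain) fixed, capping the expansion by the total gain so the
    resulting gain stays at or above 1."""
    total_exposure = zsl_exposure_ms * zsl_total_gain      # TET to preserve
    expansion = min(max_expansion, zsl_total_gain)         # enforces gain >= 1
    psl_exposure_ms = zsl_exposure_ms * expansion
    psl_gain = total_exposure / psl_exposure_ms            # = zsl_total_gain / expansion
    return psl_exposure_ms, psl_gain
```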
[0094] One or more design optimizations may be implemented for the focus stack algorithm 310. For example, the focus stack algorithm 310 may prioritize operations within a face-focused region of interest (ROI). Such a targeted approach may streamline both the EDoF PSL frame HDR process and the focus stack fusion algorithm. As another example, computational complexity may be managed by constraining a size of the ML fusion model to approximately 512 x 512. Also, for example, the focus stack algorithm 310 may leverage multi-threading to intelligently distribute independent computations across multiple threads for faster execution.
[0095] In some embodiments, the focus stack algorithm 310 may utilize intelligent caching. For example, models may be pre-compiled, Open Computing Language (OpenCL) data may be cached, and model information may be stored during initial setup. Such proactive caching may significantly improve processing speeds for subsequent shots. In some embodiments, the fusion ML model may seamlessly offload processing steps to an on-device TPU, resulting in a substantial increase in computational speed. Also, for example, downsampling may be used to reduce computational latency. For example, processing speed may be significantly increased by operating on smaller input sizes whenever feasible. As another example, memory footprint and latency may be reduced by minimizing data copies. In some embodiments, image references may be used for an efficient workflow.
Example Inference Pipelines
[0096] As described herein, focus stack fusion can provide an end-to-end pipeline (e.g., on mobile devices) that takes main wide image with zero shutter lag (ZSL) frames and a reference wide image with post shutter lag (PSL) frame to generate an output image with an extended
DoF. In some embodiments, a wide ZSL may be fused with an ultra-wide ZSL. Also, for example, a wide ZSL may be fused with an ultra-wide PSL.
[0097] Some embodiments involve applying a cost volume network (PWC-Net) to align the one or more first image frames and the one or more additional image frames. Such embodiments involve determining, based on the aligned images, a combination of two occlusion masks with respective forward and backward flow. For the inference pipeline, wide ZSL frames may be captured at one focus distance and wide PSL frames may be captured at another focus distance. In some embodiments, a PWC-Net may be used to align the input images. For example, the reference focus stack images may be warped and aligned with the source image using PWC-Net. In some embodiments, two occlusion masks may be determined with a forward and backward flow and combined for later use.
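One conventional way to derive such occlusion masks from the forward and backward optical flow is a forward-backward consistency check, sketched below with NumPy and OpenCV; the threshold and function name are illustrative, and the flow fields are assumed to come from a network such as PWC-Net.

```python
import numpy as np
import cv2


def occlusion_mask(flow_fwd, flow_bwd, thresh=1.0):
    """Forward-backward consistency check on optical flow.

    flow_fwd maps source -> reference and flow_bwd maps reference -> source,
    both of shape (H, W, 2). A pixel is flagged as occluded/unreliable when
    the composed flow does not return close to its starting point.
    """
    h, w = flow_fwd.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    # Sample the backward flow at the points the forward flow lands on.
    map_x = (grid_x + flow_fwd[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_fwd[..., 1]).astype(np.float32)
    bwd_at_fwd = cv2.remap(flow_bwd.astype(np.float32), map_x, map_y,
                           cv2.INTER_LINEAR)
    round_trip = flow_fwd + bwd_at_fwd            # ~0 where the flows agree
    err = np.linalg.norm(round_trip, axis=-1)
    return (err > thresh).astype(np.float32)      # 1.0 = occluded / unreliable
```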
[0098] To avoid a color shift, the color of all images may be modified to be consistent with the source. For example, global mean and standard deviation color matching methods may be used to match the reference to the source.
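A minimal per-channel version of global mean and standard deviation color matching is sketched below; the [0, 1] image range and the epsilon guard are assumptions.

```python
import numpy as np


def match_color(reference, source, eps=1e-6):
    """Match the reference frame's global color statistics to the source.

    Both inputs are float arrays of shape (H, W, 3) in [0, 1]. Each channel
    of the reference is shifted and scaled so its mean and standard deviation
    match the source, which helps avoid a visible color shift after fusion.
    """
    ref_mean, ref_std = reference.mean(axis=(0, 1)), reference.std(axis=(0, 1))
    src_mean, src_std = source.mean(axis=(0, 1)), source.std(axis=(0, 1))
    matched = (reference - ref_mean) * (src_std / (ref_std + eps)) + src_mean
    return np.clip(matched, 0.0, 1.0)
```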
[0099] Fallback logic may be applied to bypass fusion at an early stage. For example, the fallback logic may be applied when a telephoto lens is defocused, the environmental lighting is lower than a desired threshold, the reprojection error (e.g., warped reference to source comparison) is large, and/or the base frame delta (e.g., between the source and the reference) exceeds a threshold delta.
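Collected together, those early-exit conditions could look like the following sketch; the parameter names and thresholds are illustrative placeholders rather than tuned values.

```python
def bypass_fusion(lens_defocused, scene_lux, reprojection_error, base_frame_delta,
                  lux_thresh=10.0, reproj_thresh=5.0, delta_thresh=0.15):
    """Early-exit fallback evaluated before running the fusion model.

    Mirrors the conditions described above: defocused lens, low ambient
    light, large warped-reference-to-source reprojection error, or a large
    delta between the source and reference base frames.
    """
    return (lens_defocused
            or scene_lux < lux_thresh
            or reprojection_error > reproj_thresh
            or base_frame_delta > delta_thresh)
```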
[00100] Upon completion of the afore-mentioned pre-processing tasks, the source and reference images may be input to the pre-trained UNet model to generate an inference result. In some embodiments, the final image may be blended with the source image using the occlusion masks.
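The final occlusion-guided blend can be as simple as the sketch below, which keeps source pixels wherever alignment was judged unreliable; in practice the mask would typically be feathered (e.g., Gaussian-blurred) to avoid seams. The function name is illustrative.

```python
import numpy as np


def blend_with_source(fused, source, occlusion):
    """Blend the model output back toward the source in unreliable regions.

    fused and source have shape (H, W, 3); occlusion has shape (H, W) with
    1.0 where alignment was unreliable, so the source pixel is kept there
    and the fused pixel is used elsewhere.
    """
    m = occlusion[..., None]          # broadcast the mask over color channels
    return m * source + (1.0 - m) * fused
```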
[00101] Figure 8 is an example overview of a focus stack fusion based image processing pipeline 800, in accordance with example embodiments. A camera application 805 and a native application 810 are shown. At step 815, a user may open camera application 805 (e.g., in default mode 1.0 x). At step 820, a trigger assessment may be performed (e.g., as described with reference to Figure 9). Some embodiments involve determining that a portion of a scene in an image frame being captured at a first focal length is out of focus. For example, image processing pipeline 800 may determine whether the scene and a current DoF are suitable for group portrait focus stack fusion. The DoF may vary depending on a type of photography, such as, for example, portrait, food, landscape, documents, and so forth. In some embodiments, the trigger assessment may be performed continually at periodic time intervals. In some embodiments, the trigger assessment may be performed when the user presses a shutter button.
[00102] In some embodiments, the capturing of the one or more first image frames and the one or more additional image frames involves determining one or more of a number of focus stack frames or a number of focal lengths. For example, upon determining that a focus stack algorithm is to be applied, focus stack fusion based image processing pipeline 800 may determine a number of focus stack frames and focal positions.
[00103] Some embodiments involve capturing one or more first image frames at the first focal length and one or more additional image frames at a second focal length to focus on the portion of the scene. At step 825, a subset of ZSL frames at different focus positions may be obtained. At step 830, an EDoF PSL frame may be obtained. Some embodiments involve performing high dynamic range (HDR) processing of a subset of the one or more first image frames and the one or more additional image frames. At step 835, the ZSL frames may undergo HDR processing to generate an HDR output, and the EDoF PSL frame may undergo HDR processing at step 840 to generate an EDoF HDR output. In some embodiments, the HDR processing may be performed in parallel.
[00104] The providing of the one or more first image frames and the one or more additional image frames may be based on the HDR-processed subset. For example, at step 845, a fusion controller of the camera application 805 may deliver the multiple HDR images to the native application 810. Some embodiments involve providing the one or more first image frames and the one or more additional image frames as input to a machine learning (ML) model, the ML model having been trained to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). For example, at step 850, a focus stack fusion algorithm may be applied (e.g., focus stack fusion algorithm 310 of Figure 3) to expand the DoF, resulting in a sharp, detailed group portrait. In some embodiments, the predicted image may be received from the ML model. In some embodiments, the predicted image may be returned to camera application 805 and/or saved to the user's photo gallery.
[00105] Figure 9 is an example illustration of triggering for focus stack fusion, in accordance with example embodiments. For illustration purposes, the triggering is shown with reference to face detection. However, the algorithm may be configured for other objects and/or regions of interest. The triggering algorithm may utilize camera metadata (e.g., focus distance, aperture, focal length, subject distance, and/or circle of confusion) and may apply the following relationship to determine a DoF for an image:
DoF = 2u²Nc / f²     (Eqn. 5)
[00106] where DoF is the depth of field, u is the distance to a subject (e.g., a face), N is the f-number, c is the circle of confusion, and f is the focal length. In some embodiments, the DoF may be adjusted by multiplying the near and far field limits by a fixed tuning threshold. The triggering algorithm may compare respective depths of one or more detected faces to the adjusted DoF near and far field limits. In the event one face is within the adjusted DoF and another face is outside of the adjusted DoF, the focus stack fusion pipeline may be triggered.
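As a worked illustration of Eqn. 5 and the trigger condition, the following sketch computes an approximate DoF and checks whether the detected face depths straddle the adjusted near/far limits; the symmetric near/far split, the tuning factor, and the example camera values are simplifying assumptions.

```python
def depth_of_field(u, n, c, f):
    """Approximate DoF from Eqn. 5: DoF = 2 * u**2 * N * c / f**2.

    u: subject distance, n: f-number, c: circle of confusion, f: focal
    length, all in consistent units (e.g., millimeters); valid when u >> f.
    """
    return 2.0 * u**2 * n * c / f**2


def should_trigger(face_depths, focus_distance, n, c, f, tuning=1.0):
    """Trigger fusion when one face is inside the adjusted DoF and another
    is outside it (the tuning factor scales the near/far limits)."""
    half = 0.5 * tuning * depth_of_field(focus_distance, n, c, f)
    near, far = focus_distance - half, focus_distance + half
    inside = [near <= d <= far for d in face_depths]
    return any(inside) and not all(inside)


# Example: f = 6.9 mm at f/1.9, c = 0.005 mm, focused at 1500 mm:
# DoF = 2 * 1500**2 * 1.9 * 0.005 / 6.9**2, roughly 898 mm.
print(should_trigger([1400.0, 2600.0], 1500.0, 1.9, 0.005, 6.9))  # True
```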
[00107] In some embodiments, determining that the portion of the scene is out of focus involves determining a depth of field (DoF) for the scene being captured. Such embodiments involve determining that a depth of the portion of the scene exceeds the DoF. For example, the first schematic diagram 900A illustrates a near field 905 and a far field 910 with reference to a camera lens (not shown) positioned toward the left of the first schematic diagram 900A. An adjusted DoF 915 may be determined as described above. A first face (e.g., sharp face 920) may be detected within adjusted DoF 915, and a second face (e.g., blurry face 925) may be detected outside adjusted DoF 915, in this example, further from the camera lens than far field 910. Accordingly, the focus stack fusion pipeline is triggered.
[00108] In some embodiments, determining that the portion of the scene is out of focus involves determining a depth of field (DoF) for the scene being captured. Such embodiments involve determining that a depth of the portion of the scene is less than the DoF. For example, the second schematic diagram 900B illustrates a near field 905 and a far field 910 with reference to a camera lens (not shown) positioned toward the left of the second schematic diagram 900B. An adjusted DoF 915 may be determined as described above. A first face (e.g., sharp face 920) may be detected within adjusted DoF 915, and a second face (e.g., blurry face 925) may be detected outside adjusted DoF 915, in this example, closer to the camera lens than near field 905. Accordingly, the focus stack fusion pipeline is triggered.
[00109] In some embodiments, the scene may include a plurality of regions of interest (ROIs), and the determining that the portion of the scene is out of focus involves determining a respective depth for each of the plurality of ROIs. Such embodiments involve determining that at least one ROI of the plurality of ROIs is out of focus. In some embodiments, the plurality of ROIs include a plurality of human faces. For example, objects other than faces may be classified using a machine learning model for object detection. A current focal distance and image blurriness score may be determined. Subsequently, focal distance and image blurriness score may be tuned to trigger the focus stack fusion algorithm. For example, when the image blurriness score exceeds a threshold blur value, and the focal distance indicates that the blurred
object is within or outside the adjusted DoF (based on the focal distance), the focus stack fusion algorithm may be triggered.
[00110] Figure 10 is an example illustration of focus stack fusion in landscape photography, in accordance with example embodiments. Image 1005 has a blurred foreground and a sharper background, and image 1010 has a blurred background and a sharper foreground. Fusion of images 1005 and 1010 results in image 1015 with a sharp foreground and background.
[00111] Figure 11 is another example illustration of focus stack fusion in landscape photography, in accordance with example embodiments. Image 1105 has a blurred foreground and a sharper background, and image 1110 has a blurred background and a sharper foreground. Fusion of images 1105 and 1110 results in image 1115 with a sharp foreground and background.
[00112] Figure 12 is another example illustration of focus stack fusion in landscape photography, in accordance with example embodiments. Image 1205 has a blurred foreground and a sharper background, and image 1210 has a blurred background and a sharper foreground. Fusion of images 1205 and 1210 results in image 1215 with a sharp foreground and background.
[00113] Figure 13 is another example illustration of focus stack fusion in landscape photography, in accordance with example embodiments. Image 1305 has a blurred foreground and a sharper background, and image 1310 has a blurred background and a sharper foreground. Fusion of images 1305 and 1310 results in image 1315 with a sharp foreground and background.
[00114] Figure 14 is an example illustration of focus stack fusion in portrait photography, in accordance with example embodiments. Image 1405 has a blurred face in the foreground and a sharper face in the background, and image 1410 has a blurred face in the background and a sharper face in the foreground. Fusion of images 1405 and 1410 results in image 1415 with sharp faces in the foreground and the background.
[00115] Figure 15 is an example illustration of focus stack fusion in food photography, in accordance with example embodiments. Image 1505 has a blurred foreground and a sharper background, and image 1510 has a blurred background and a sharper foreground. Fusion of images 1505 and 1510 results in image 1515 with a sharp foreground and background.
[00116] Figure 16 is another example illustration of focus stack fusion in food photography, in accordance with example embodiments. Image 1605 has a blurred foreground and a sharper background, and image 1610 has a blurred background and a sharper foreground.
Fusion of images 1605 and 1610 results in image 1615 with a sharp foreground and background.
[00117] Figure 17 is an example illustration of focus stack fusion in document photography, in accordance with example embodiments. Image 1705 has a blurred portion of the document in the foreground and a sharper portion of the document in the background, and image 1710 has a blurred portion of the document in the background and a sharper portion of the document in the foreground. Fusion of images 1705 and 1710 results in image 1715 with a sharp foreground and background.
[00118] Figure 18 is another example illustration of focus stack fusion in document photography, in accordance with example embodiments. Image 1805 has books with blurred titles in the background and books with sharper titles in the foreground, and image 1810 has books with blurred titles in the foreground and books with sharper titles in the background. Fusion of images 1805 and 1810 results in image 1815 with sharper titles in the foreground and the background.
Training Machine Learning Models for Generating Inferences/Predictions
[00119] FIG. 19 shows diagram 1900 illustrating a training phase 1902 and an inference phase 1904 of trained machine learning model(s) 1932, in accordance with example embodiments. Some machine learning techniques involve training one or more machine learning algorithms on an input set of training data to recognize patterns in the training data and provide output inferences and/or predictions about (patterns in the) training data. The resulting trained machine learning algorithm can be termed a trained machine learning model. For example, FIG. 19 shows training phase 1902 where machine learning algorithm(s) 1920 are being trained on training data 1910 to become trained machine learning model(s) 1932. Then, during inference phase 1904, trained machine learning model(s) 1932 can receive input data 1930 and one or more inference/prediction requests 1940 (perhaps as part of input data 1930) and responsively provide as an output one or more inferences and/or prediction(s) 1950.
[00120] As such, trained machine learning model(s) 1932 can include one or more models of machine learning algorithm(s) 1920. Machine learning algorithm(s) 1920 may include, but are not limited to: an artificial neural network (e.g., a herein-described convolutional neural network, a recurrent neural network, a Bayesian network, a hidden Markov model, a Markov decision process, a logistic regression function, a support vector machine, a suitable statistical machine learning algorithm, and/or a heuristic machine
learning system). Machine learning algorithm(s) 1920 may be supervised or unsupervised, and may implement any suitable combination of online and offline learning.
[00121] In some examples, machine learning algorithm(s) 1920 and/or trained machine learning model(s) 1932 can be accelerated using on-device coprocessors, such as graphics processing units (GPUs), tensor processing units (TPUs), digital signal processors (DSPs), and/or application specific integrated circuits (ASICs). Such on-device coprocessors can be used to speed up machine learning algorithm(s) 1920 and/or trained machine learning model(s) 1932. In some examples, trained machine learning model(s) 1932 can be trained, reside and execute to provide inferences on a particular computing device, and/or otherwise can make inferences for the particular computing device.
[00122] During training phase 1902, machine learning algorithm(s) 1920 can be trained by providing at least training data 1910 as training input using unsupervised, supervised, semisupervised, and/or reinforcement learning techniques. Unsupervised learning involves providing a portion (or all) of training data 1910 to machine learning algorithm(s) 1920 and machine learning algorithm(s) 1920 determining one or more output inferences based on the provided portion (or all) of training data 1910. Supervised learning involves providing a portion of training data 1910 to machine learning algorithm(s) 1920, with machine learning algorithm(s) 1920 determining one or more output inferences based on the provided portion of training data 1910, and the output inference(s) are either accepted or corrected based on correct results associated with training data 1910. In some examples, supervised learning of machine learning algorithm(s) 1920 can be governed by a set of rules and/or a set of labels for the training input, and the set of rules and/or set of labels may be used to correct inferences of machine learning algorithm(s) 1920.
[00123] Semi-supervised learning involves having correct results for part, but not all, of training data 1910. During semi-supervised learning, supervised learning is used for a portion of training data 1910 having correct results, and unsupervised learning is used for a portion of training data 1910 not having correct results. Reinforcement learning involves machine learning algorithm(s) 1920 receiving a reward signal regarding a prior inference, where the reward signal can be a numerical value. During reinforcement learning, machine learning algorithm(s) 1920 can output an inference and receive a reward signal in response, where machine learning algorithm(s) 1920 are configured to try to maximize the numerical value of the reward signal. In some examples, reinforcement learning also utilizes a value function that provides a numerical value representing an expected total of the numerical values provided by the reward signal over time. In some examples, machine learning algorithm(s)
1920 and/or trained machine learning model(s) 1932 can be trained using other machine learning techniques, including but not limited to, incremental learning and curriculum learning. [00124] In some examples, machine learning algorithm(s) 1920 and/or trained machine learning model(s) 1932 can use transfer learning techniques. For example, transfer learning techniques can involve trained machine learning model(s) 1932 being pre-trained on one set of data and additionally trained using training data 1910. More particularly, machine learning algorithm(s) 1920 can be pre-trained on data from one or more computing devices and a resulting trained machine learning model provided to a particular computing device, where the particular computing device is intended to execute the trained machine learning model during inference phase 1904. Then, during training phase 1902, the pre-trained machine learning model can be additionally trained using training data 1910, where training data 1910 can be derived from kernel and non-kernel data of the particular computing device. This further training of the machine learning algorithm(s) 1920 and/or the pre-trained machine learning model using training data 1910 of the particular computing device’s data can be performed using either supervised or unsupervised learning. Once machine learning algorithm(s) 1920 and/or the pre-trained machine learning model has been trained on at least training data 1910, training phase 1902 can be completed. The trained resulting machine learning model can be utilized as at least one of trained machine learning model(s) 1932.
[00125] In particular, once training phase 1902 has been completed, trained machine learning model(s) 1932 can be provided to a computing device, if not already on the computing device. Inference phase 1904 can begin after trained machine learning model(s) 1932 are provided to the particular computing device.
[00126] During inference phase 1904, trained machine learning model(s) 1932 can receive input data 1930 and generate and output one or more corresponding inferences and/or prediction(s) 1950 about input data 1930. As such, input data 1930 can be used as an input to trained machine learning model(s) 1932 for providing corresponding inference(s) and/or prediction(s) 1950 to kernel components and non-kernel components. For example, trained machine learning model(s) 1932 can generate inference(s) and/or prediction(s) 1950 in response to one or more inference/prediction requests 1940. In some examples, trained machine learning model(s) 1932 can be executed by a portion of other software. For example, trained machine learning model(s) 1932 can be executed by an inference or prediction daemon to be readily available to provide inferences and/or predictions upon request. Input data 1930 can include data from the particular computing device executing trained machine learning model(s)
1932 and/or input data from one or more computing devices other than the particular computing device.
[00127] Input data 1930 can include one or more first image frames captured at a first focal length and one or more additional image frames captured at a second focal length. For example, input data 1930 can include a main wide image with zero shutter lag (ZSL) frames and a reference wide image with post shutter lag (PSL) frame, where the frames are captured at different focal distances. Other types of input data are possible as well. Inference(s) and/or prediction(s) 1950 can include a focus stack fusion image with an extended DoF. Inference(s) and/or prediction(s) 1950 can include other output data produced by trained machine learning model(s) 1932 operating on input data 1930 (and training data 1910). In some examples, trained machine learning model(s) 1932 can use output inference(s) and/or prediction(s) 1950 as input feedback 1960. Trained machine learning model(s) 1932 can also rely on past inferences as inputs for generating new inferences.
[00128] Convolutional neural networks and/or deep neural networks used herein can be an example of machine learning algorithm(s) 1920. For example, machine learning algorithm(s) 1920 may include ML model 600. After training, the trained version of a convolutional neural network can be an example of trained machine learning model(s) 1932. In this approach, an example of the one or more inference/prediction requests 1940 can be a request to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF), and a corresponding example of inferences and/or prediction(s) 1950 can be the image with an extended DoF.
Example Data Network
[00129] FIG. 20 depicts a distributed computing architecture 2000, in accordance with example embodiments. Distributed computing architecture 2000 includes server devices 2008, 2010 that are configured to communicate, via network 2006, with programmable devices 2004a, 2004b, 2004c, 2004d, 2004e. Network 2006 may correspond to a local area network (LAN), a wide area network (WAN), a WLAN, a WWAN, a corporate intranet, the public Internet, or any other type of network configured to provide a communications path between networked computing devices. Network 2006 may also correspond to a combination of one or more LANs, WANs, corporate intranets, and/or the public Internet.
[00130] Although FIG. 20 only shows five programmable devices, distributed application architectures may serve tens, hundreds, or thousands of programmable devices. Moreover, programmable devices 2004a, 2004b, 2004c, 2004d, 2004e (or any additional programmable devices) may be any sort of computing device, such as a mobile computing
device, desktop computer, wearable computing device, head-mountable device (HMD), network terminal, and so on. In some examples, such as illustrated by programmable devices 2004a, 2004b, 2004c, 2004e, programmable devices can be directly connected to network 2006. In other examples, such as illustrated by programmable device 2004d, programmable devices can be indirectly connected to network 2006 via an associated computing device, such as programmable device 2004c. In this example, programmable device 2004c can act as an associated computing device to pass electronic communications between programmable device 2004d and network 2006. In other examples, such as illustrated by programmable device 2004e, a computing device can be part of and/or inside a vehicle, such as a car, a truck, a bus, a boat or ship, an airplane, etc. In other examples not shown in FIG. 20, a programmable device can be both directly and indirectly connected to network 2006.
[00131] Server devices 2008, 2010 can be configured to perform one or more services, as requested by programmable devices 2004a-2004e. For example, server device 2008 and/or 2010 can provide content to programmable devices 2004a-2004e. The content can include, but is not limited to, web pages, hypertext, scripts, binary data such as compiled software, images, audio, and/or video. The content can include compressed and/or uncompressed content. The content can be encrypted and/or unencrypted. Other types of content are possible as well.
[00132] As another example, server devices 2008 and/or 2010 can provide programmable devices 2004a-2004e with access to software for database, search, computation, graphical, audio, video, World Wide Web/Internet utilization, and/or other functions. Many other examples of server devices are possible as well.
Computing Device Architecture
[00133] Figure 21 is a block diagram of an example computing device 2100, in accordance with example embodiments. In particular, computing device 2100 shown in Figure 21 can be configured to perform at least one function described herein, including methods 2200, and/or 2300.
[00134] Computing device 2100 may include a user interface module 2101, a network communications module 2102, one or more processors 2103, data storage 2104, one or more cameras 2118, one or more sensors 2120, and power system 2122, all of which may be linked together via a system bus, network, or other connection mechanism 2105.
[00135] User interface module 2101 can be operable to send data to and/or receive data from external user input/output devices. For example, user interface module 2101 can be configured to send and/or receive data to and/or from user input devices such as a touch screen,
a computer mouse, a keyboard, a keypad, a touch pad, a trackball, a joystick, a voice recognition module, and/or other similar devices. User interface module 2101 can also be configured to provide output to user display devices, such as one or more cathode ray tubes (CRT), liquid crystal displays, light emitting diodes (LEDs), displays using digital light processing (DLP) technology, printers, light bulbs, and/or other similar devices, either now known or later developed. User interface module 2101 can also be configured to generate audible outputs, with devices such as a speaker, speaker jack, audio output port, audio output device, earphones, and/or other similar devices. User interface module 2101 can further be configured with one or more haptic devices that can generate haptic outputs, such as vibrations and/or other outputs detectable by touch and/or physical contact with computing device 2100. In some examples, user interface module 2101 can be used to provide a graphical user interface (GUI) for utilizing computing device 2100.
[00136] Network communications module 2102 can include one or more devices that provide one or more wireless interfaces 2107 and/or one or more wireline interfaces 2108 that are configurable to communicate via a network. Wireless interface(s) 2107 can include one or more wireless transmitters, receivers, and/or transceivers, such as a Bluetooth™ transceiver, a Zigbee® transceiver, a Wi-Fi™ transceiver, a WiMAX™ transceiver, an LTE™ transceiver, and/or other type of wireless transceiver configurable to communicate via a wireless network. Wireline interface(s) 2108 can include one or more wireline transmitters, receivers, and/or transceivers, such as an Ethernet transceiver, a Universal Serial Bus (USB) transceiver, or similar transceiver configurable to communicate via a twisted pair wire, a coaxial cable, a fiberoptic link, or a similar physical connection to a wireline network.
[00137] In some examples, network communications module 2102 can be configured to provide reliable, secured, and/or authenticated communications. For each communication described herein, information for facilitating reliable communications (e.g., guaranteed message delivery) can be provided, perhaps as part of a message header and/or footer (e.g., packet/message sequencing information, encapsulation headers and/or footers, size/time information, and transmission verification information such as cyclic redundancy check (CRC) and/or parity check values). Communications can be made secure (e.g., be encoded or encrypted) and/or decrypted/decoded using one or more cryptographic protocols and/or algorithms, such as, but not limited to, Data Encryption Standard (DES), Advanced Encryption Standard (AES), a Rivest-Shamir-Adleman (RSA) algorithm, a Diffie-Hellman algorithm, a secure sockets protocol such as Secure Sockets Layer (SSL) or Transport Layer Security (TLS), and/or Digital Signature Algorithm (DSA). Other cryptographic protocols and/or algorithms can be used as well or in addition to those listed herein to secure (and then decrypt/decode) communications.
[00138] One or more processors 2103 can include one or more general purpose processors (e.g., central processing unit (CPU), etc.), and/or one or more special purpose processors (e.g., digital signal processors, tensor processing units (TPUs), graphics processing units (GPUs), application specific integrated circuits, etc.). One or more processors 2103 can be configured to execute computer-readable instructions 2106 that are contained in data storage 2104 and/or other instructions as described herein.
[00139] Data storage 2104 can include one or more non-transitory computer-readable storage media that can be read and/or accessed by at least one of one or more processors 2103. The one or more computer-readable storage media can include volatile and/or non-volatile storage components, such as optical, magnetic, organic or other memory or disc storage, which can be integrated in whole or in part with at least one of one or more processors 2103. In some examples, data storage 2104 can be implemented using a single physical device (e.g., one optical, magnetic, organic or other memory or disc storage unit), while in other examples, data storage 2104 can be implemented using two or more physical devices.
[00140] Data storage 2104 can include computer-readable instructions 2106 and perhaps additional data. In some examples, data storage 2104 can include storage required to perform at least part of the herein-described methods, scenarios, and techniques and/or at least part of the functionality of the herein-described devices and networks. In particular, computer-readable instructions 2106 can include instructions that, when executed by processor(s) 2103, enable computing device 2100 to provide for some or all of the functionality described herein. [00141] In some embodiments, computer-readable instructions 2106 can include instructions that, when executed by processor(s) 2103, enable computing device 2100 to carry out operations. The operations may include determining that a portion of a scene in an image frame being captured at a first focal length is out of focus. The operations may also include capturing one or more first image frames at the first focal length and one or more additional image frames at a second focal length to focus on the portion of the scene. The operations may additionally include providing the one or more first image frames and the one or more additional image frames as input to a machine learning (ML) model, the ML model having been trained to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The operations may also include receiving the predicted image from the ML model.
[00142] In some embodiments, computer-readable instructions 2106 can include instructions that, when executed by processor(s) 2103, enable computing device 2100 to carry out operations. The operations may include receiving training data comprising a plurality of pairs, each pair comprising an image frame and one or more associated focus stack image frames, wherein the one or more associated focus stack image frames are synthetically blurred versions of the image frame. The operations may also include training, based on the training data, a machine learning (ML) model to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). The operations may additionally include providing the trained ML model.
[00143] In some examples, computing device 2100 can include focus stack fusion module 2112. Focus stack fusion module 2112 can be configured to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF). Also, for example, focus stack fusion module 2112 can be configured to train a ML model to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF).
[00144] In some examples, computing device 2100 can include one or more cameras 2118. Camera(s) 2118 can include one or more image capture devices, such as still and/or video cameras, equipped to capture light and record the captured light in one or more images; that is, camera(s) 2118 can generate image(s) of captured light. The one or more images can be one or more still images and/or one or more images utilized in video imagery. Camera(s) 2118 can capture light and/or electromagnetic radiation emitted as visible light, infrared radiation, ultraviolet light, and/or as one or more other frequencies of light. Camera(s) 2118 can include a wide camera, a tele camera, an ultrawide camera, and so forth. Also, for example, camera(s) 2118 can be front-facing or rear-facing cameras with reference to computing device 2100. Camera(s) 2118 can include camera components such as, but are not limited to, an aperture, shutter, recording surface (e.g., photographic film and/or an image sensor), lens, and/or shutter button. The camera components may be controlled at least in part by software executed by one or more processors 2103.
[00145] In some examples, computing device 2100 can include one or more sensors 2120. Sensors 2120 can be configured to measure conditions within computing device 2100 and/or conditions in an environment of computing device 2100 and provide data about these conditions. For example, sensors 2120 can include one or more of: (i) sensors for obtaining data about computing device 2100, such as, but not limited to, a thermometer for measuring a temperature of computing device 2100, a battery sensor for measuring power of one or more
batteries of power system 2122, and/or other sensors measuring conditions of computing device 2100; (ii) an identification sensor to identify other objects and/or devices, such as, but not limited to, a Radio Frequency Identification (RFID) reader, proximity sensor, one-dimensional barcode reader, two-dimensional barcode (e.g., Quick Response (QR) code) reader, and a laser tracker, where the identification sensors can be configured to read identifiers, such as RFID tags, barcodes, QR codes, and/or other devices and/or object configured to be read and provide at least identifying information; (iii) sensors to measure locations and/or movements of computing device 2100, such as, but not limited to, a tilt sensor, a gyroscope, an accelerometer, a Doppler sensor, a GPS device, a sonar sensor, a radar device, a laser-displacement sensor, and a compass; (iv) an environmental sensor to obtain data indicative of an environment of computing device 2100, such as, but not limited to, an infrared sensor, an optical sensor, a light sensor (e.g., an ambient light sensor), a biosensor, a capacitive sensor, a touch sensor, a temperature sensor, a wireless sensor, a radio sensor, a movement sensor, a microphone, a sound sensor, an ultrasound sensor and/or a smoke sensor; and/or (v) a force sensor to measure one or more forces (e.g., inertial forces and/or G-forces) acting about computing device 2100, such as, but not limited to one or more sensors that measure: forces in one or more dimensions, torque, ground force, friction, and/or a zero moment point (ZMP) sensor that identifies ZMPs and/or locations of the ZMPs. Many other examples of sensors 2120 are possible as well.
[00146] Power system 2122 can include one or more batteries 2124 and/or one or more external power interfaces 2126 for providing electrical power to computing device 2100. Each battery of the one or more batteries 2124 can, when electrically coupled to the computing device 2100, act as a source of stored electrical power for computing device 2100. One or more batteries 2124 of power system 2122 can be configured to be portable. Some or all of one or more batteries 2124 can be readily removable from computing device 2100. In other examples, some or all of one or more batteries 2124 can be internal to computing device 2100, and so may not be readily removable from computing device 2100. Some or all of one or more batteries 2124 can be rechargeable. For example, a rechargeable battery can be recharged via a wired connection between the battery and another power supply, such as by one or more power supplies that are external to computing device 2100 and connected to computing device 2100 via the one or more external power interfaces. In other examples, some or all of one or more batteries 2124 can be non-rechargeable batteries.
[00147] One or more external power interfaces 2126 of power system 2122 can include one or more wired-power interfaces, such as a USB cable and/or a power cord, that enable wired electrical power connections to one or more power supplies that are external to
computing device 2100. One or more external power interfaces 2126 can include one or more wireless power interfaces, such as a Qi wireless charger, that enable wireless electrical power connections to one or more external power supplies. Once an electrical power connection is established to an external power source using one or more external power interfaces 2126, computing device 2100 can draw electrical power from the external power source via the established electrical power connection. In some examples, power system 2122 can include related sensors, such as battery sensors associated with the one or more batteries or other types of electrical power sensors.
Example Methods of Operation
[00149] Figure 22 is a flowchart of a method, in accordance with example embodiments. Method 2200 may include various blocks or steps. The blocks or steps may be carried out individually or in combination. The blocks or steps may be carried out in any order and/or in series or in parallel. Further, blocks or steps may be omitted or added to method 2200.
[00150] The blocks of method 2200 may be carried out by various elements of computing device 2100 as illustrated and described in reference to Figure 21.
[00151] Block 2210 involves determining that a portion of a scene in an image frame being captured at a first focal length is out of focus.
[00152] Block 2220 involves capturing one or more first image frames at the first focal length and one or more additional image frames at a second focal length to focus on the portion of the scene.
[00153] Block 2230 involves providing the one or more first image frames and the one or more additional image frames as input to a machine learning (ML) model, the ML model
having been trained to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF).
[00154] Block 2240 involves receiving the predicted image from the ML model.
[00155] In some embodiments, determining that the portion of the scene is out of focus involves determining a depth of field (DoF) for the scene being captured. Such embodiments involve determining that a depth of the portion of the scene exceeds the DoF.
[00156] In some embodiments, determining that the portion of the scene is out of focus involves determining a depth of field (DoF) for the scene being captured. Such embodiments involve determining that a depth of the portion of the scene is less than the DoF.
[00157] In some embodiments, the one or more first image frames may be captured with zero-shutter-lag (ZSL).
[00158] In some embodiments, the one or more additional image frames may be captured with positive-shutter-lag (PSL).
[00159] In some embodiments, the scene may include a plurality of regions of interest (ROIs), and the determining that the portion of the scene is out of focus involves determining a respective depth for each of the plurality of ROIs. Such embodiments involve determining that at least one ROI of the plurality of ROIs is out of focus. In some embodiments, the plurality of ROIs include a plurality of human faces.
[00160] In some embodiments, the capturing of the one or more first image frames and the one or more additional image frames involves determining one or more of a number of focus stack frames or a number of focal lengths.
[00161] In some embodiments, the ML model may be a convolutional neural network (CNN) based on a U-Net architecture.
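As a non-limiting sketch (using PyTorch, and not intended to reproduce the specific ML model described above), a small U-Net-style CNN for fusing a source/reference pair into a single extended-DoF image could be organized as follows; the layer widths, depth, and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TinyFusionUNet(nn.Module):
    """Minimal U-Net-style fusion network (illustrative only).

    Input: source and reference frames concatenated on the channel axis
    (two RGB frames -> 6 channels). Output: a 3-channel extended-DoF image.
    """

    def __init__(self, in_ch=6, base=16):
        super().__init__()

        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

        self.enc1 = block(in_ch, base)
        self.enc2 = block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = block(base * 2, base)
        self.out = nn.Conv2d(base, 3, 1)

    def forward(self, x):
        e1 = self.enc1(x)                      # skip connection 1
        e2 = self.enc2(self.pool(e1))          # skip connection 2
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.out(d1))


# Example: a 512 x 512 crop of a source/reference pair.
frames = torch.rand(1, 6, 512, 512)
print(TinyFusionUNet()(frames).shape)          # torch.Size([1, 3, 512, 512])
```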
[00162] Some embodiments involve performing high dynamic range (HDR) processing of a subset of the one or more first image frames and the one or more additional image frames. The providing of the one or more first image frames and the one or more additional image frames may be based on the HDR-processed subset.
[00163] Some embodiments involve applying a cost volume network (PWC-Net) to align the one or more first image frames and the one or more additional image frames. Such embodiments involve determining, based on the aligned images, a combination of two occlusion masks with respective forward and backward flow.
[00164] Some embodiments involve blending the predicted image and the one or more first image frames based on the combination of the two occlusion masks.
[00165] Some embodiments involve maintaining color consistency between the one or more additional image frames and the one or more first image frames.
[00166] Figure 23 is another flowchart of a method, in accordance with example embodiments. Method 2300 may include various blocks or steps. The blocks or steps may be carried out individually or in combination. The blocks or steps may be carried out in any order and/or in series or in parallel. Further, blocks or steps may be omitted or added to method 2300. [00167] The blocks of method 2300 may be carried out by various elements of computing device 2100 as illustrated and described in reference to Figure 21.
[00168] Block 2310 involves receiving training data comprising a plurality of pairs, each pair comprising an image frame and one or more associated focus stack image frames, wherein the one or more associated focus stack image frames are synthetically blurred versions of the image frame.
[00169] Block 2320 involves training, based on the training data, a machine learning (ML) model to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF).
[00170] Block 2330 involves providing the trained ML model.
[00171] In some embodiments, the ML model may be a convolutional neural network
(CNN) based on a U-Net architecture.
[00172] In some embodiments, the training of the ML model involves training the ML model based on a Visual Geometry Group (VGG) loss function.
[00173] In some embodiments, the training of the ML model involves training the ML model based on an L1 loss function.
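An illustrative combination of the two loss terms described in the preceding paragraphs is sketched below using PyTorch and torchvision (assuming a recent torchvision where pretrained weights are selected via the weights argument); the chosen VGG layers and the weighting factor are assumptions, and a full implementation would also apply ImageNet normalization before the VGG features.

```python
import torch.nn as nn
import torchvision


class FusionLoss(nn.Module):
    """L1 reconstruction loss plus a VGG-16 perceptual term (illustrative)."""

    def __init__(self, vgg_weight=0.1):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16]
        for p in vgg.parameters():
            p.requires_grad = False        # VGG is a frozen feature extractor
        self.vgg = vgg.eval()
        self.l1 = nn.L1Loss()
        self.vgg_weight = vgg_weight

    def forward(self, predicted, target):
        loss = self.l1(predicted, target)
        # Perceptual term computed on early VGG-16 feature maps (up to relu3_3).
        loss = loss + self.vgg_weight * self.l1(self.vgg(predicted), self.vgg(target))
        return loss
```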
[00174] Some embodiments involve generating the one or more associated focus stack image frames by adding a synthetic blur to a foreground or a background of the image frame of a pair.
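One way to synthesize such focus stack training frames from a single all-in-focus image and a foreground mask is sketched below; the Gaussian defocus approximation, the sigma value, and the helper name are assumptions.

```python
import cv2
import numpy as np


def make_focus_stack_pair(sharp, fg_mask, sigma=5.0):
    """Create two synthetically defocused views of one all-in-focus frame.

    sharp: float image of shape (H, W, 3); fg_mask: float mask of shape
    (H, W) with 1.0 on the foreground (e.g., from a segmentation or depth
    model). Returns one frame with a blurred background and one with a
    blurred foreground; the original sharp frame is the ground-truth target.
    """
    blurred = cv2.GaussianBlur(sharp, (0, 0), sigma)   # kernel size derived from sigma
    m = fg_mask[..., None]
    background_blurred = m * sharp + (1.0 - m) * blurred
    foreground_blurred = m * blurred + (1.0 - m) * sharp
    return background_blurred, foreground_blurred
```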
[00175] The particular arrangements shown in the Figures should not be viewed as limiting. It should be understood that other embodiments may include more or less of each element shown in a given Figure. Further, some of the illustrated elements may be combined or omitted. Yet further, an illustrative embodiment may include elements that are not illustrated in the Figures.
[00176] A step or block that represents a processing of information can correspond to circuitry that can be configured to perform the specific logical functions of a herein-described method or technique. Alternatively or additionally, a step or block that represents a processing of information can correspond to a module, a segment, or a portion of program code (including
related data). The program code can include one or more instructions executable by a processor for implementing specific logical functions or actions in the method or technique. The program code and/or related data can be stored on any type of computer readable medium such as a storage device including a disk, hard drive, or other storage medium.
[00177] The computer readable medium can also include non-transitory computer readable media such as computer-readable media that store data for short periods of time like register memory, processor cache, and random access memory (RAM). The computer readable media can also include non-transitory computer readable media that store program code and/or data for longer periods. Thus, the computer readable media may include secondary or persistent long-term storage, like read only memory (ROM), optical or magnetic disks, compact disc read only memory (CD-ROM), for example. The computer readable media can also be any other volatile or non-volatile storage systems. A computer readable medium can be considered a computer readable storage medium, for example, or a tangible storage device.
[00178] While various examples and embodiments have been disclosed, other examples and embodiments will be apparent to those skilled in the art. The various disclosed examples and embodiments are for purposes of illustration and are not intended to be limiting, with the true scope being indicated by the following claims.
Claims
1. A computer-implemented method, comprising: determining that a portion of a scene in an image frame being captured at a first focal length is out of focus; capturing one or more first image frames at the first focal length and one or more additional image frames at a second focal length to focus on the portion of the scene; providing the one or more first image frames and the one or more additional image frames as input to a machine learning (ML) model, the ML model having been trained to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF); and receiving the predicted image from the ML model.
2. The computer-implemented method of claim 1, wherein the determining that the portion of the scene is out of focus comprises: determining a depth of field (DoF) for the scene being captured; and determining that a depth of the portion of the scene exceeds the DoF.
3. The computer-implemented method of claim 1, wherein the determining that the portion of the scene is out of focus comprises: determining a depth of field (DoF) for the scene being captured; and determining that a depth of the portion of the scene is less than the DoF.
4. The computer-implemented method of any of claims 1-3, wherein the one or more first image frames is captured with zero-shutter-lag (ZSL).
5. The computer-implemented method of any of claims 1-4, wherein the one or more additional image frames is captured with positive-shutter-lag (PSL).
6. The computer-implemented method of any of claims 1-5, wherein the scene comprises a plurality of regions of interest (ROIs), and wherein the determining that the portion
of the scene is out of focus comprises: determining a respective depth for each of the plurality of ROIs; and determining that at least one ROI of the plurality of ROIs is out of focus.
7. The computer-implemented method of claim 6, wherein the plurality of ROIs comprise a plurality of human faces.
8. The computer-implemented method of any of claims 1-7, wherein the capturing of the one or more first image frames and the one or more additional image frames further comprises: determining one or more of a number of focus stack frames or a number of focal lengths.
9. The computer-implemented method of any of claims 1-8, wherein the ML model is a convolutional neural network (CNN) based on a U-Net architecture.
10. The computer-implemented method of any of claims 1-9, further comprising: performing high dynamic range (HDR) processing of a subset of the one or more first image frames and the one or more additional image frames, and wherein the providing of the one or more first image frames and the one or more additional image frames is based on the HDR-processed subset.
11. The computer-implemented method of any of claims 1-10, further comprising: applying a cost volume network (PWC-Net) to align the one or more first image frames and the one or more additional image frames; and determining, based on the aligned images, a combination of two occlusion masks with respective forward and backward flow.
12. The computer-implemented method of claim 11, further comprising: blending the predicted image and the one or more first image frames based on the combination of the two occlusion masks.
13. The computer-implemented method of any of claims 1-12, further comprising: maintaining color consistency between the one or more additional image frames and the one or more first image frames.
14. A computer-implemented method, comprising: receiving training data comprising a plurality of pairs, each pair comprising an image frame and one or more associated focus stack image frames, wherein the one or more associated focus stack image frames are synthetically blurred versions of the image frame; training, based on the training data, a machine learning (ML) model to merge one or more focused regions in a plurality of input images to predict an output image with an extended depth of field (DoF); and providing the trained ML model.
15. The computer-implemented method of claim 14, wherein the ML model is a convolutional neural network (CNN) based on a U-Net architecture.
16. The computer-implemented method of any of claims 14 or 15, further comprising: training the ML model based on a Visual Geometry Group (VGG) loss function.
17. The computer-implemented method of any of claims 14-16, further comprising: training the ML model based on an L1 loss function.
18. The computer-implemented method of any of claims 14-17, further comprising: generating the one or more associated focus stack image frames by adding a synthetic blur to a foreground or a background of the image frame of a pair.
19. A computing device, comprising: one or more processors; and data storage, wherein the data storage has stored thereon computer-executable instructions that, when executed by the one or more processors, cause the computing device to carry out functions that comprise the computer-implemented method of any one of claims 1-18.
20. An article of manufacture comprising one or more non-transitory computer readable media having computer-readable instructions stored thereon that, when executed by one or more processors of a computing device, cause the computing device to carry out
functions that comprise the computer-implemented method of any one of claims 1-18.
21. A program that, when executed by one or more processors of a computing device, causes the computing device to carry out functions that comprise the computer-implemented method of any one of claims 1-18.
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2026019419A1 true WO2026019419A1 (en) | 2026-01-22 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10455141B2 (en) | Auto-focus method and apparatus and electronic device | |
| JP5871862B2 (en) | Image blur based on 3D depth information | |
| CN109565551B (en) | Synthesizing images aligned to a reference frame | |
| CN103052960B (en) | Object detection and identification under situation out of focus | |
| CN116324878A (en) | Segmentation for Image Effects | |
| CN107409166A (en) | Panning lens automatically generate | |
| CN102265627A (en) | Image data obtaining method and image data obtaining device | |
| WO2014008320A1 (en) | Systems and methods for capture and display of flex-focus panoramas | |
| EP4402890A1 (en) | Low-power fusion for negative shutter lag capture | |
| CN105684440A (en) | Method and apparatus for enhanced digital imaging | |
| CN118140244A (en) | Temporal filter weight calculation | |
| US20240303841A1 (en) | Monocular image depth estimation with attention | |
| WO2026019419A1 (en) | Systems and methods for extending a depth-of-field based on focus stack fusion | |
| CN118301471A (en) | Image processing method, apparatus, electronic device, and computer-readable storage medium | |
| US12452529B2 (en) | Method and apparatus with hyperlapse video generation | |
| US12530752B2 (en) | View synthesis from images and/or video using model-based inpainting | |
| US20260032337A1 (en) | Method and apparatus with hyperlapse video generation | |
| US20250232530A1 (en) | Volumetric feature fusion based on geometric and similarity cues for three-dimensional reconstruction | |
| WO2026019420A1 (en) | Systems and methods for adjusting automatic image capture settings using a saliency-based region of interest | |
| US20250259269A1 (en) | High dynamic range (hdr) image generation with multi-domain motion correction | |
| CN114830176B (en) | Method and apparatus for asymmetric normalized correlation layers for deep neural network feature matching | |
| Anthes | Smarter photography | |
| WO2025076168A1 (en) | Systems and methods for temporal dual-depth machine learning based phase-difference autofocus (pdaf) model | |
| Pham | Integrating a Neural Network for Depth from Defocus with a Single MEMS Actuated Camera | |
| WO2025144842A1 (en) | Usage-based automatic camera mode switching |