GB2622238A - A method and device for personalised image segmentation and processing - Google Patents
- Publication number
- GB2622238A GB2622238A GB2213082.7A GB202213082A GB2622238A GB 2622238 A GB2622238 A GB 2622238A GB 202213082 A GB202213082 A GB 202213082A GB 2622238 A GB2622238 A GB 2622238A
- Authority
- GB
- United Kingdom
- Prior art keywords
- user
- segment
- segmentation
- image
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/30—Scenes; Scene-specific elements in albums, collections or shared content, e.g. social network photos or video
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/778—Active pattern-learning, e.g. online learning of image or video features
- G06V10/7784—Active pattern-learning, e.g. online learning of image or video features based on feedback from supervisors
- G06V10/7788—Active pattern-learning, e.g. online learning of image or video features based on feedback from supervisors the supervisor being a human, e.g. interactive learning with a human teacher
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
Embodiments of the present techniques provide a method and device for personalised segmentation of an image, and relate to a computer-implemented method/device/program for processing an input image on a user device by receiving, at the user device, an input image and obtaining, using the user device, a segmentation map corresponding to the input image. The segmentation map, possibly generated using a segmentation model, divides the input image into a plurality of segments, from which at least one segment is selected using a recommendation model stored on the user device. The user device then uses the at least one selected segment to generate a processed image. The recommendation model is a personalised machine learning model which has been trained on the user device using user preferences derived from images stored on the user device.
Description
A Method and Device for Personalised Image Segmentation and Processing
Field
[1] The present techniques generally relate to a method for segmenting and processing images based on the image segmentation. Specifically, the present techniques provide a method for personalising image segmentation and a personalised recommendation method for selecting a region of the segmented image.
Background
[2] Semantic image segmentation is a technique that segments an image into different regions based on their semantic meaning, for example, recognising which pixels contain a person and which pixels contain background in an image. Some of the most successful techniques for achieving image segmentation involve using machine learning, ML, or deep learning, DL, techniques, such as convolutional neural networks (CNNs) and transformers. These techniques usually create a segmentation map that indicates the different regions of the image. Segmentation maps can then be used to process only specific parts of an image. Which parts of the image are processed in which way is most often defined by a user's choice, for example, a user may wish to "erase" a specific part of an image.
[3] While there are techniques for segmenting images, these usually do not take into account a user's preferences. That is, they simply provide a generic segmentation map that does not recognise that different users have different habits and preferences when it comes to which regions of an image they wish to modify. Current segmentation processes are not personalised to a user's preference and hence, cannot provide personalised segmentation maps or segmentation recommendations.
[4] The applicant has therefore identified the need for improved techniques for personalised image segmentation.
Summary
[5] In a first approach of the present techniques, there is provided a computer-implemented method for processing an input image on a user device. The method comprises receiving an input image and obtaining a segmentation map corresponding to the input image. The segmentation map divides the input image into a plurality of segments (or segmented regions). The method further comprises selecting, using a recommendation model stored on the user device, at least one segment from the plurality of segments of the input image and processing the at least one selected segment in the input image to generate a processed image. The processed image may then be output. In this approach, the recommendation model is a personalised machine learning model which has been trained on the user device using user preferences derived from images stored on the user device. All of the method may be carried out locally on the user device using user preferences and/or user images which are stored locally. This results in enhanced data security for the user's data because it is not necessary for the data to leave the user device.
[6] The personalisation of the recommendation model should also result in an improved user experience when editing an input image because the recommendation model is more likely to automatically select a segment which a user has previously edited in similar photos. The recommendation model may have been trained by accessing a plurality of images stored on the user device; analysing each image of the plurality of images to determine whether any segment of the image has been edited; and when it is determined that an image has been edited, storing information relating to the edited segment as a user preference. The analysis to determine whether there is any editing may be done using any suitable technique. For example, typically when a user modifies an image on their user device, there is a log specifying when the user modified which image and what modifications they made. As an alternative to a log, metadata for the image may be used to store information about editing of an image. The information in the log or metadata may be used to determine whether there is any editing of the image. The information may also be stored in the form of a flag.
[7] The training may further comprise analysing the plurality of images to determine a class at least for each edited segment and optionally for each segment in the image. A class may be a label, e.g. dog, cat, people, which shows regions of the image whose pixels belong to the semantic class indicated by the label. The classes may be pre-determined and may be fine-tuned (i.e. increased in number or otherwise updated) by the training. Analysing the plurality of images may comprise counting the number of photos in which the edited segment belongs to one of the classes. The training may further comprise analysing the plurality of images to determine a context for each image. A context may also be a label, like a class. Typically, there are fewer context labels than class labels, and the context labels are pre-determined and unchanging. The context may be stored as a user preference. Analysing the plurality of images may comprise counting the number of photos in which the edited segment belongs to one of the classes and the image has a particular context. The stored user preferences may thus include one or more of a list of classes, a count of the number of photos in which the edited segment belongs to one of the listed classes, and a context count separating the count by context (in other words, the sum of the context counts for a particular class across all contexts totals the count for that class).
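As an illustration only, the counting described above might look like the following Python sketch; the edit-log format and its field names are assumptions made for this example, not part of the present techniques:

```python
from collections import defaultdict

def build_preference_counts(edit_log):
    """Derive per-class edit counts and per-context counts from a hypothetical
    on-device edit log. Each entry records the class of an edited segment and
    the context of the image it came from; the context counts for a class
    sum to that class's total count."""
    counts = defaultdict(int)
    context_counts = defaultdict(lambda: defaultdict(int))
    for entry in edit_log:
        cls, ctx = entry["segment_class"], entry["context"]
        counts[cls] += 1
        context_counts[cls][ctx] += 1
    return dict(counts), {c: dict(v) for c, v in context_counts.items()}

# Toy log: the user edited 'person' segments in a beach photo and a city photo,
# and a 'car' segment in a city photo.
log = [
    {"segment_class": "person", "context": "beach"},
    {"segment_class": "person", "context": "city"},
    {"segment_class": "car", "context": "city"},
]
counts, ctx_counts = build_preference_counts(log)
```

In a real implementation the entries would be derived from the device's edit log or image metadata as described above; only the grouping-and-counting principle is shown here.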
[8] The method may further comprise calculating, for each of the listed classes, a weight which is a function of at least the count. For example, the weight may be the count divided by the sum of all counts. Alternatively, the weight may be a function of the context count and optionally the count. Each listed class and optionally each context within each class may be ranked according to its weight. The weights may be used in the training of the recommendation model. For example, selecting at least one segment may comprise analysing the input image to determine a class for each of the plurality of segments, identifying the determined classes in the input image which match one of the listed classes and selecting the at least one segment with a class which matches the listed class having the highest calculated weight. Thus, the user preferences for editing stored photos have been used to improve the recommendation of a segment in the new input photo.
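For example, the weight calculation and segment selection described above might be sketched as follows, using the simple count-divided-by-sum-of-counts weighting mentioned as one option (the function names are illustrative):

```python
def compute_weights(edit_counts):
    """Weight for each listed class: its count divided by the sum of all
    counts, one possible choice of the weighting function described above."""
    total = sum(edit_counts.values())
    return {cls: n / total for cls, n in edit_counts.items()} if total else {}

def recommend_segment(segment_classes, edit_counts):
    """From the classes detected in the input image, pick the one matching
    the listed class with the highest weight (None if nothing matches)."""
    weights = compute_weights(edit_counts)
    candidates = [c for c in segment_classes if c in weights]
    return max(candidates, key=weights.get) if candidates else None

counts = {"person": 6, "car": 3, "dog": 1}       # stored user preferences
picked = recommend_segment(["sky", "car", "dog"], counts)
# 'car' outranks 'dog' (weight 0.3 vs 0.1), so 'car' is recommended
```

A context-dependent variant would replace `edit_counts` with the per-context counts for the input image's detected context.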
[9] The methods above describe how the recommendation model may be trained before being used on the user device. Additionally or alternatively, the recommendation model may be personalised at inference time (i.e. when being used to process an image) by using user feedback. The method may, before the processing step, comprise outputting the selected at least one segment to a user for feedback; receiving user feedback on the selected at least one segment and updating the recommendation model using the received user feedback. The user feedback may be explicit, e.g. an indication of a new selected segment, or implicit, e.g. rejection or approval of the recommendation.
[10] When a user indicates approval of the selection, the at least one selected segment may be used to process the input image. The user feedback may also optionally be stored as a user preference. When a user indicates rejection of the selected at least one segment, the method may comprise seeking further feedback from the user. The further user feedback may be selected from a request for the recommendation model to select a new segment and a selection of at least one segment by the user.
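A minimal sketch of how approval or rejection feedback might update the stored preference counts at inference time; the feedback encoding (`"approve"`/`"reject"`) is an assumption for illustration:

```python
def update_preferences(edit_counts, recommended_class, feedback,
                       corrected_class=None):
    """Update the stored per-class counts from inference-time feedback.
    Approval reinforces the recommended class; a rejection accompanied by
    the user's own selection reinforces the class the user chose instead."""
    if feedback == "approve":
        edit_counts[recommended_class] = edit_counts.get(recommended_class, 0) + 1
    elif feedback == "reject" and corrected_class is not None:
        edit_counts[corrected_class] = edit_counts.get(corrected_class, 0) + 1
    return edit_counts

prefs = {"person": 6, "car": 3}
update_preferences(prefs, "person", "approve")        # reinforce 'person'
update_preferences(prefs, "person", "reject", "dog")  # user picked 'dog' instead
```

A rejection without a corrected selection leaves the counts unchanged, matching the case where the user simply asks the model to propose a new segment.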
[11] Obtaining the segmentation map may comprise using any suitable technique to segment the image into semantically consistent segments, i.e. regions (segments) whose pixels belong to the same semantic class (e.g. when there is a class for background, all pixels belonging to a background will be in the background segment(s)). Semantic segmentation may be done using deep learning, e.g. as described in "Fully Convolutional Networks for Semantic Segmentation" by Long et al. published in the Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR) conference (2015) or "Segnet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation" by Badrinarayanan et al. published in the Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR) conference (2015). Alternatively, segmentation may be done using transformers, e.g. as described in "Segmenter: Transformer for Semantic Segmentation" by Strudel et al. published in the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2021 or "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" by Liu et al. published in the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2021. For example, a segmentation model may be used to generate the segmentation map. In a similar manner to the recommendation model, the user experience may be enhanced by personalising the segmentation model. Thus, the segmentation model may be a personalised machine learning model which has been trained on the user device. The segmentation model may be trained before use using user preferences derived from images stored on the user device and/or trained whilst in use. The segmentation model may be trained using zero-shot learning (e.g. no labels), few-shot learning (e.g. a few labels are available) or continual learning (e.g. the user provides explicit feedback).
[12] The methods above describe how the segmentation model may be trained before being used on the user device. Additionally or alternatively, the segmentation model may be personalised at inference time (i.e. when being used to process an image) by using user feedback. The method may, before the processing step, comprise outputting the segmentation map to a user for feedback; receiving user feedback on the segmentation map and updating the segmentation model using the received user feedback. The user feedback may be explicit, e.g. an indication of a new segmentation map, or implicit, e.g. rejection or approval of the segmentation map. It may be possible to update the segmentation model simultaneously with the recommendation model using the user feedback received in response to outputting the selected at least one segment. In other words, the outputting of the segmentation map may not be needed in some arrangements.
[13] When a user indicates approval of the segmentation map, the segmentation map may be used to select the at least one segment in the input image as described above. The user feedback may also optionally be stored as a user preference. When a user indicates rejection of the segmentation map, the method may comprise seeking further feedback from the user. The further user feedback may be selected from a request for the segmentation model to generate a new segmentation map and creation of a segmentation map by the user. Both the personalisation before use and during use may be used to fine-tune the segmentation map. In other words, the personalisation may generate semantic classes which reflect the user preferences and/or user feedback.
[14] Processing the at least one selected segment in the input image to generate a processed image may comprise one or more of erasing the at least one selected segment, applying a filter to the at least one selected segment, providing a tag for the at least one selected segment to be output with the processed image, and highlighting the at least one selected segment. The processing may be done using any suitable application and may be integrated with photo storage applications (e.g. Samsung Photo Gallery, Samsung stories) on a user device.
[15] In a related approach of the present techniques, there is provided a user device, such as a personal computer or a laptop, or a more resource-constrained device, such as a smartphone, tablet computer and/or other mobile device, which is configured to implement the methods described above.
[16] In a related approach of the present techniques, there is provided a computer-readable storage medium comprising instructions which, when executed by a processor, causes the processor to carry out any of the methods described herein.
[17] As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
[18] Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
[19] Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
[20] Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
[21] The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
[22] It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
[23] In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
[24] The method described above may be wholly or partly performed on an apparatus, i.e. an electronic device, using a machine learning or artificial intelligence model. The model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing. The artificial intelligence model may be obtained by training. Here, "obtained by training" means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
[25] As mentioned above, the present techniques may be implemented using an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
[26] The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation of a previous layer and an operation of a plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), deep Q-networks, and transformer or visual transformer networks.
[27] The learning algorithm is a method for training a predetermined target device (for example, a user device) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
Brief description of the drawings
[28] Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings, in which:
[29] Figure 1 is a flowchart showing an image processing process.
[30] Figure 2 is a flowchart showing the process of generating a segmentation map which is used in Figure 1.
[31] Figure 3a shows an example of an input image and a segmentation map generated based on the input image;
[32] Figure 3b illustrates a fine-tuned version of the segmentation map of Figure 3a.
[33] Figures 3c and 3d are the label space domains for the segmentation maps of Figures 3a and 3b, respectively.
[34] Figure 4 is a flowchart of a process for updating the segmentation model used in Figure 2.
[35] Figure 5 is a flowchart of the process showing how a region of an input image may be selected using a recommendation model.
[36] Figure 6a shows an input image with user selection input.
[37] Figure 6b is a segmentation map of the input image of Figure 6a.
[38] Figure 6c is an output image showing the selected region to be edited.
[39] Figures 6d and 6e show the results of user selection using prior art techniques.
[40] Figure 7 is a flowchart of a process for updating the recommendation model used in Figure 5.
[41] Figure 8 shows a block diagram of an apparatus for performing the image processing methods described above.
Detailed description of the drawings
[42] Broadly speaking, embodiments of the present techniques provide a method and user device for personalised segmentation of an image. In particular, the present application relates to techniques for personalising a segmentation model for creating a segmentation map of an input image, and/or personalising a recommendation model for recommending regions of the input image to the user for subsequent processing. The segmentation and/or recommendation models may be machine learning models which are trained on the user device using data including images and user preferences which are stored on the user device. The subsequent processing may include enhancement of the input image, particularly in the recommended image region. For example, enhancement may include removal of elements of the input image, e.g. objects and/or scenery.
[43] Figure 1 is a flowchart showing an example of the overall image processing process which is applied by the user device to select a region of an input image and then process the selected region. In a first step S100, an input image is received. The received input image may be an image which is selected by a user from their photo gallery or an image which has been captured by the user using the camera of the user device. To select a region of the image, first, a segmentation map of the image is generated in step S102. The segmentation map segments the input image into a plurality of segments. Each segment may be semantically consistent; in other words, each segment may represent a different class, e.g. a different type of object within the image or scenery within the image. Thus, each segment has pixels which belong to the same semantic class.
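The relationship between a segmentation map and semantically consistent segments can be illustrated with a minimal Python sketch; the label map below is a toy example, not real model output:

```python
def segments_from_map(seg_map):
    """Group pixel coordinates by semantic class label. seg_map is a 2-D list
    of per-pixel class labels, a toy stand-in for the segmentation model's
    output; each resulting segment contains only pixels of one class."""
    segments = {}
    for y, row in enumerate(seg_map):
        for x, label in enumerate(row):
            segments.setdefault(label, []).append((y, x))
    return segments

seg_map = [
    ["sky",    "sky",    "sky"],
    ["person", "person", "sky"],
    ["grass",  "grass",  "grass"],
]
segments = segments_from_map(seg_map)
```

A real segmentation model would typically emit a per-pixel integer label array; the grouping principle into semantically consistent regions is the same.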
[44] The segmentation map may be generated by applying a segmentation model to the input image. The segmentation model may be a pretrained machine learning (ML) or deep learning (DL) model, such as a convolutional neural network (CNN) or a transformer network, or a combination of both a CNN and a transformer network. A machine learning segmentation model may be trained using user preferences and/or existing images on the user device so that the segmentation model is personalised to the user. For example, the segmentation model may be a model that has been pretrained on a server and is then deployed on the user device which may itself be a resource constrained device, such as a smartphone or tablet computer and/or other mobile device. Alternatively or additionally, the segmentation model may be optionally trained whilst editing the input image (i.e. on-the-fly) using implicit or explicit user feedback as described in Figure 2 and shown as an optional step S104. Alternatively, the segmentation model on the user device may be a standard or general segmentation model. For example, a non-personalised version of FCN model described in "Fully Convolutional Networks for Semantic Segmentation" by Long et al. published in the Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR) conference (2015) may be used. As another example, a non-personalised model of the DeepLab model described in "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs" by Chen et al. published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2017 may be used. Other known models may also be used.
[45] Using the segmentation map, at least one segment (or segmented region; the terms can be used interchangeably) of the image is then selected at step S106. Optionally, multiple segmented regions may be selected. This selection may comprise receiving a user input, or the at least one segmented region may be proposed automatically to the user by a recommendation model. Like the segmentation model, the recommendation model may be a pretrained machine learning (ML) or deep learning (DL) model, such as a convolutional neural network (CNN) or a transformer network, or a combination of both a CNN and a transformer network. The recommendation model and the segmentation model are not necessarily built using the same specific architecture, but both models can be built with convolutional layers, transformer layers and/or fully connected linear layers.
[46] The recommendation model may use Bayesian Inference or Reinforcement Learning. The recommendation model may, additionally or alternatively, be a frequency and/or context dependent model. As explained in more detail below, such a machine learning recommendation model may be trained using user preferences so that the recommendation model is personalised to the user. Alternatively or additionally, the recommendation model may be optionally trained whilst editing the input image (i.e. on-the-fly) using implicit or explicit user feedback as described in Figure 5 and shown as an optional step S108. When the segmentation model is not updated using feedback at step S104, the user feedback in relation to the recommended selected region may be used to simultaneously train both the segmentation and the recommendation models. Alternatively, the recommendation model may not be personalised to an individual user, for example by training the model on data from a set of users. Such a recommendation model may be considered to be personalised for the set of users.
[47] A processed image is then generated using a target application or target use case at step S110 and the processed image is output at step S112. The processed image may be stored on the user device and information regarding the processing, e.g. when the editing occurred and what was edited, may be stored with the processed image, e.g. in a log or in metadata.
[48] Generating the processed image may involve processing the selected segmented region(s). For example, the selected segmented region may be edited by erasing or removing the selected segmented region(s) from the input image and replacing the removed region(s) by an approximation of the background behind the removed selected region. For example, this may be useful if a user took an image of themselves, and another person entered the frame at the time the image was taken. The user may then wish to remove that other person from their image. Erasing or removing a selected region may be achieved using any suitable technique. For example, erasing or removal may simply be done by masking each of the pixels in the selected region. Masking the pixels leaves a hole in the image which may then be filled using known techniques such as inpainting. These inpainting techniques may use ML or DL models. For example, DL techniques may include CNNs and Transformers. Merely as an example, one inpainting technique is described in "Resolution-robust Large Mask Inpainting with Fourier Convolutions" by Suvorov et al. published in the Winter Conference on Applications of Computer Vision (WACV) in 2022.
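Merely as an illustration of the erase-and-fill step described above, the following hypothetical Python sketch masks the pixels of a selected segment and fills the hole with a naive background approximation (the mean of the unmasked pixels). A practical implementation would use a learned inpainting model such as those cited; the function name and data layout are illustrative assumptions, not part of the claimed method.

```python
def erase_and_fill(image, mask):
    """image: 2D list of grey values; mask: 2D list of 0/1 (1 = selected segment)."""
    # Collect the background (unmasked) pixel values.
    background = [image[y][x]
                  for y in range(len(image))
                  for x in range(len(image[0]))
                  if not mask[y][x]]
    # Naive background estimate: the mean of the remaining pixels.
    fill = sum(background) / len(background)
    # Replace every masked pixel with the fill value, leaving others intact.
    return [[fill if mask[y][x] else image[y][x]
             for x in range(len(image[0]))]
            for y in range(len(image))]
```

In use, the mask would be derived from the segmentation map, so that exactly one semantically consistent segment is erased.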
[49] Additionally, or alternatively, processing the selected segmented region(s) may involve editing by applying a filter to the selected region(s). The filter may be a semantically aware filter; that is, the filter may be different for different parts of the image depending on what they show. For example, if one selected region contains a person and another selected region contains greenery, a different filter may be applied to the person's face in the first selected region than to the greenery in the second selected region. The filter that is applied to the person's face may, for example, brighten the face if it was shaded and difficult to see, and the filter applied to the greenery may make the greenery look more vivid.
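Merely as an illustration, a semantically aware filter might be sketched as a mapping from segment class to a per-pixel adjustment, as below. The class names, filter functions and data layout are hypothetical assumptions chosen for illustration only.

```python
# Hypothetical per-class filters: brighten people, make greenery more vivid.
FILTERS = {
    "person": lambda v: min(v + 30, 255),        # brighten shaded faces
    "greenery": lambda v: min(v * 12 // 10, 255),  # boost vegetation by ~20%
}

def apply_semantic_filters(pixels, labels):
    """pixels: flat list of grey values; labels: per-pixel class names
    taken from the segmentation map. Unknown classes pass through unchanged."""
    return [FILTERS.get(lbl, lambda v: v)(v) for v, lbl in zip(pixels, labels)]
```

Because the filter is looked up from the segment label, each semantically distinct region receives its own adjustment, as described in the paragraph above.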
[50] Additionally, or alternatively, the selected region(s) may be processed to improve an image search function on the user's device. For example, the user may provide a tag or label for the selected region(s) which may be output in step S112 as part of the processed image. An image search function may then be used to find this particular tag, for example, object, pet and/or person in other images.
[51] Figure 2 is a flowchart showing more detail of the process for generating the segmentation map and includes examples of how the segmentation model can be personalised on the fly. As in Figure 1, first, an input image is received at step S200 and a segmentation map of the input image is generated using the segmentation model at step S202. The segmentation model may be a ML or DL model, for example, a CNN. The segmentation model may also be personalised by using the user's decisions and reactions to a proposed segmentation to further train or finetune the segmentation model in such a way that it learns the user's segmentation preferences and thus creates more relevant segmentation maps for each user.
[52] One method for personalising the segmentation model is suggested in Figure 2. For example, after the segmentation map for the input image is created, the segmentation map is presented to the user, for example on a display on the user device. An example of an input image 300 and a segmentation map 302 are shown in Figure 3a. The input image 300 shows three people walking across a grassy outdoor space. By applying the segmentation model, the image is segmented into several different semantic groups or classes, each of which is represented with a different colour. A first class 304 represents the people in the image and in this example, each of the people is tagged by a user label shown by A, B and C so that there are three separate sub-classes each having a different colour. The vegetation (both grass and trees) in the image is represented by a second class 307 and the sky is represented by a third class 310.
[53] Returning to Figure 2, the user device may then determine whether or not the user approves or disapproves the proposed segmentation map at step S204. For example, the segmentation map may be presented in a user interface with an option for the user to indicate that they approve or disapprove the segmentation map. In the example shown in Figure 3a, the segmentation map segments all vegetation into a single region, but the user may wish to only edit the bushes/trees and will thus indicate that they disapprove of (i.e. reject) the proposed segmentation map. As another example, the segmentation map of Figure 3a segments the group of people into three separate regions, which allows the user to select only one of these persons. The user may then approve (i.e. accept) the proposed segmentation map. This approval or disapproval is used to update and personalise the segmentation model and improve future segmentation maps that are presented to this specific user as described below.
[54] When the user device determines that the user approves the proposed segmentation map, the process proceeds to step S212 and the segmentation model is updated with this information. The updating may be done using any appropriate training, e.g. reinforcement learning, which uses the user preference.

[55] When the user device determines that the user disapproves of the proposed segmentation map, this information is also used to update the segmentation model. In this example, there are two methods for using the user's input to update the segmentation model. In a first method, the next step is to receive a request from the user for a different segmentation map at step S206. As shown at step S208, the next step is to adjust one or more parameters of the segmentation model so that the method can revert to step S202 and will then generate a new segmentation map. For example, the one or more parameters may be adjusted by changing the random seed which is used to generate the segmentation map. That is, the pseudorandom number generator used in the programming is initialised using a different number, which in turn results in a different string of pseudorandom numbers being generated. This may change the output of the personalised segmentation model, and hence the segmentation map that is proposed to the user. Additionally, or alternatively, a dropout configuration of the deep learning network of the segmentation model may be changed, resulting in a different segmentation map from the same model. In a dropout configuration, at every inference the active neurons are changed and thus the final prediction and/or prediction confidence also changes. This is described, for example, in "Unsupervised domain adaptation for speech recognition via uncertainty driven self-training" by Khurana et al. published in the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) in 2021. The user may then approve or disapprove the new segmentation map as described above.
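The re-seeding and dropout ideas above can be illustrated with a toy sketch: re-initialising the pseudorandom generator with a different seed changes which "neurons" a dropout mask keeps active, which in a real network would yield a different segmentation map from the same model. This is a hypothetical illustration, not the claimed implementation.

```python
import random

def dropout_mask(n_units, p_drop, seed):
    """Return a 0/1 keep-mask over n_units neurons, deterministic per seed."""
    rng = random.Random(seed)  # re-seeding changes the whole pseudorandom sequence
    return [0 if rng.random() < p_drop else 1 for _ in range(n_units)]
```

With the same seed the mask (and hence the stochastic prediction) is reproducible; changing the seed at step S208 produces a different active-neuron configuration and thus, in general, a different proposed segmentation.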
The information that the user has not approved the segmentation map can also be used to update the segmentation model itself as shown by the dotted line to step S212.
[56] As an alternative to the user request for a new segmentation map which may be considered to be implicit feedback which can be used to update the segmentation model, the user may generate an improved segmentation map by providing explicit feedback on the segmentation map at step S210. The explicit feedback may include a different segmentation map which has been prepared by the user and/or labels for one or more segmented regions on the proposed segmentation map which would improve the segmentation map. For example, the proposed segmentation map shown in Figure 3a could be annotated by the user as shown in Figure 3b to create an improved segmentation map 312 which shows the additional sub-classes for greenery to separate the segments 306 showing grass from the segments 308 showing the bushes or trees. The other segmented regions in the segmentation map 302 are the same, e.g. there are segmented regions 304 for the people and a separate segmented region 310 for the sky.
[57] Returning to Figure 2, the next step is to update the segmentation model at step S212. The segmentation model may be updated using the explicit feedback from step S210 or using the implicit feedback from steps S204 or S206. The updating is to obtain personalised segmentation models which are specific to individual users. The personalisation may be done using zero-shot techniques in which there is no labelled data from the user, few-shot techniques in which there is a limited amount of labelled data, and/or using continual/incremental learning. For example, when using zero-shot techniques, the general classes may be split into micro-classes which are of interest to a user. When a segmentation map is rejected by the user, the seed (or other) parameters may be adjusted to provide a different combination of micro-classes when generating the new segmentation map. The personalised segmentation model is learned from the implicit user feedback, e.g. rejection or approval of the generated map. In the few-shot case, the explicit user feedback from step S210 may include new classes to improve the segmentation model.
[58] Updating the segmentation model can be achieved in multiple ways. For example, the image domain, i.e. image characteristics that are style-related such as whether the image is a countryside image or an image taken in a city, may be adapted by unsupervised domain adaptation techniques. In other words, we solve an unsupervised adaptation problem using the style extracted from the images. We determine the target style from one or from a few samples. We convert all the images to the target style by aligning the style-related parameters (for example, the coefficients of the low frequencies of the Fourier transform). The resulting model can solve the task better on the target style than a model trained naively with no style alignment. Generally, techniques based on the Fourier transform have proved to be very lightweight, efficient and effective in many contexts. One example of a suitable technique is described in "FDA: Fourier Domain Adaptation for Semantic Segmentation" by Yang et al. published in the Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR) conference (2020). Alternatively, we could use techniques based on batch normalisation as described in "Revisiting Batch Normalization for Practical Domain Adaptation" by Li et al. published in the Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR) conference (2016) or instance normalisation as described in "Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization" by Huang et al. published in the Proceedings of the International Conference on Computer Vision (ICCV) in 2017.
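The style-alignment idea above can be sketched in miniature. The toy below aligns first- and second-order statistics (mean and standard deviation) of source features to a target style, in the spirit of the instance-normalisation technique cited; it is a 1-D illustration under assumed data, not the Fourier-domain method itself.

```python
def align_style(source, target):
    """Normalise source features, then re-scale to the target style's statistics."""
    def mean_std(xs):
        m = sum(xs) / len(xs)
        var = sum((x - m) ** 2 for x in xs) / len(xs)
        return m, var ** 0.5
    m_s, s_s = mean_std(source)
    m_t, s_t = mean_std(target)
    # Whiten the source, then apply the target mean and spread.
    return [(x - m_s) / s_s * s_t + m_t for x in source]
```

After alignment, the source features carry the target style's statistics, so a model tuned on the target style behaves consistently on the converted images.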
[59] As an alternative or in addition to adapting the image domain, the label space domain, i.e. the classes used to label different image regions, may be updated as schematically illustrated in Figures 3c and 3d, which show feature vector spaces before and after an update to a segmentation model. Figure 3c shows the class representations of the segmentation model which generates the segmentation map shown in Figure 3a, i.e. before an update. In this example, there are five different class representations, each associated with clusters 320, 322, 324, 326, 328 of feature vectors representing regions in the input image. The largest cluster 320 corresponds to the vegetation class and the smallest cluster 322 corresponds to the sky. Each of the people shown in the image is represented by a separate cluster 324, 326, 328 having four image regions. Figure 3c shows that the feature vectors for each of the classes are clustered so that the classes are distinguishable from one another. However, the feature vectors for both the grass and the bushes in the greenery class are clustered together and cannot be separated from each other.
[60] Figure 3d shows the class representations of the fine-tuned segmentation model which generates the segmentation map shown in Figure 3b, i.e. after the update step. There are now six different class representations, each associated with a cluster of feature vectors in the input image. When compared to Figure 3c, the vegetation cluster 320 has been separated into two separate clusters 320a, 320b for bushes/trees and grasses respectively. As shown, the feature vectors for each of the two new classes are more spaced apart than the feature vectors for the combined class representation. A suitable technique for obtaining this separation is contrastive learning, which is a known technique for disentangling latent space class representations to accommodate new classes and separating their representation from previously seen ones. An early paper on this technique is "A Simple Framework for Contrastive Learning of Visual Representations" by Chen et al. published in the Proceedings of the International Conference on Machine Learning (ICML) in 2020. Its application to few-shot learning is described for example in "Supervised Momentum Contrastive Learning for Few-shot Classification" by Majumder et al. published in the Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR) conference (2021). The other clusters 322, 324, 326 and 328 are unaffected by the fine-graining to disentangle the separate grass and trees classes from the vegetation class representation.
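Merely as a toy illustration of the repulsive effect of contrastive learning described above: one update step pushes two class prototypes (e.g. for the new grass and bushes classes) apart until they are separated by a margin. The 1-D prototypes, margin and learning rate are illustrative assumptions; real systems apply a contrastive loss over high-dimensional feature vectors.

```python
def contrastive_step(proto_a, proto_b, margin, lr=0.5):
    """Push two class prototypes apart if they are closer than the margin."""
    gap = proto_b - proto_a
    if abs(gap) < margin:  # representations entangled: apply a repulsive update
        push = lr * (margin - abs(gap)) / 2
        sign = 1 if gap >= 0 else -1
        proto_a -= sign * push
        proto_b += sign * push
    return proto_a, proto_b
```

Iterating such steps disentangles the two new class representations, as in the transition from Figure 3c to Figure 3d, while prototypes already beyond the margin are left unchanged.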
[61] Returning to Figure 2, the final step in Figure 2 is to output the segmentation map. It will be appreciated that the outputting step can be done before or simultaneously with the updating step S212. The output segmentation map will be the segmentation map which is generated in step S202, when this is approved by the user, or the improved segmentation map received from the user at step S210.
[62] Figure 2 shows on-the-fly personalisation of the segmentation model. In addition to this training, there may be additional training of the segmentation model using the user's photos which are stored on the user device. A number of different learning techniques may be used. Figure 4 illustrates one method of personalising the segmentation model which uses zero-shot learning and does not require any labelled data. For zero-shot learning, the user's past behaviour in choosing a segmented region for processing is used to personalise the segmentation model. In a first step S400, a pre-trained segmentation model is received from a server or similar device which is remote (i.e. separate) from the user device. An image which is stored on the user device is then accessed at step S402. The image is analysed to determine whether any region of the image has been edited (e.g. removed or enhanced by filtering etc.) at step S404. When a region of the image has been edited, this may be stored as a user preference at step S406. Similarly, if no regions have been edited, this may also be stored as a user preference at step S406.
[63] The process may then determine whether there are any more images to be processed at step S408 and if so, loop back to step S402 to access another photo and repeat the analysing and storing steps. When a plurality of images (e.g. between two and ten images, such as five images) have been processed, the user preferences can be used to update the segmentation model.
[64] There are various ways to implement the update. One method, using the zero-shot case, is to compute at step S410 an unsupervised clustering of the embeddings for any edited regions (e.g. removed objects). Classical methods such as k-means clustering, mixture of Gaussians or hierarchical clustering may be used for the clustering. Mixture of Gaussians is a family of algorithms and is described for example in "A robust EM clustering algorithm for Gaussian mixture models" by Yang et al. published in the Elsevier Pattern Recognition (PR) journal in 2012. Hierarchical clustering is a family of algorithms and is described for example in "A study of Hierarchical Clustering Algorithms" by Patel et al. published in the International Conference on Computing for Sustainable Global Development (2015). As an alternative, various techniques based on word embedding may be used. For example, word embedding may be exploited as described in "Zero-Shot Semantic Segmentation" by Bucher et al. published in the Proceedings of the Neural Information Processing Systems (NeurIPS) conference (2019). Variational mapping may be used with word embedding as described in "Zero-Shot Semantic Segmentation via Variational Mapping" by Kato et al. published in the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshop (2019). Principal component analysis (PCA) may be used with word embedding as described in "Unsupervised Learning on Neural Network Outputs: with Application in Zero-shot Learning" by Lu published in the Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) in 2016.
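The clustering step S410 can be sketched with a minimal k-means routine over 1-D embeddings. The data, initial centroids and dimensionality are illustrative assumptions; in practice the embeddings of the edited regions would be high-dimensional feature vectors and any of the cited clustering families could be used.

```python
def kmeans(points, centroids, iters=10):
    """Minimal 1-D k-means: alternate nearest-centroid assignment and mean update."""
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids
```

Each resulting cluster then receives a customised class label at step S412, e.g. a cluster of embeddings of a frequently removed pet.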
[65] The next step S412 is then to assign customised class labels (identifications) to the classes which are identified by the clustering. For example, if the user always removes their dog from images, the zero-shot learning algorithm will take this into account and provide a separate class corresponding to the user's dog without being explicitly told that this is the user's dog. Similarly, other categories which are of specific interest to the specific user may be assigned class labels (e.g. a particular dog may be assigned the label "dog 1", a different dog may be assigned "dog 2", a particular friend may be assigned "friend 1" and so on). These assigned customised class labels may optionally be generated by the user, e.g. as a user-specified tag, but this is not required.
[66] As an alternative, the label space domain may also be adapted using few-shot learning, meaning that there exist some user-provided labels for regions that the user previously edited, e.g. the labels A, B and C for the people in the image. An overview of few-shot learning is described for example in "Prototypical Networks for Few-Shot Learning" by Snell et al. published in the Proceedings of the Neural Information Processing Systems (NeurIPS) conference (2017). At step S414, semantic coherence between the known classes and the historically edited regions is then calculated. Semantic coherence may be computed using any suitable machine learning algorithms, including semantic co-segmentation algorithms and out-of-distribution detectors. Example algorithms are described in "Semantically Coherent Co-segmentation and Reconstruction of Dynamic Scenes" by Mustafa et al. published in the Proceedings of the Computer Vision and Pattern Recognition (CVPR) conference (2017), "Entropy Maximization and Meta Classification for Out-of-Distribution Detection in Semantic Segmentation" by Chan et al. published in the Proceedings of the International Conference on Computer Vision (ICCV) 2021 and "Detection and Retrieval of Out-of-Distribution Objects in Semantic Segmentation" by Oberdiek et al. published in the Proceedings of the Computer Vision and Pattern Recognition (CVPR) conference as a workshop paper (2020). If it is determined that the edited region does not correspond to a known class (i.e. the semantic coherence is low, according to a given threshold), a new class is added as shown at step S416.
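The coherence check of steps S414/S416 can be sketched as follows: an edited region's embedding is compared with each known class prototype, and when the best similarity falls below a threshold a new class is created. Cosine similarity stands in here for whichever coherence measure is used; the prototype names, data and threshold are illustrative assumptions.

```python
def assign_or_add(embedding, prototypes, threshold):
    """Return the best-matching class name, adding a new class if coherence is low."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv)
    sims = {name: cos(embedding, p) for name, p in prototypes.items()}
    best = max(sims, key=sims.get)
    if sims[best] < threshold:  # low semantic coherence: treat as a new class
        new_name = f"new_class_{len(prototypes)}"
        prototypes[new_name] = list(embedding)
        return new_name
    return best
```

A region matching a known class keeps that class; an out-of-distribution region seeds a new prototype, mirroring the addition of a class at step S416.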
[67] Figure 4 shows two options for updating the segmentation model using implicit feedback, but it may also be possible to use continual or incremental learning in instances in which the user provides explicit feedback on the stored images. In general, this is the case in which we have ground truth data provided by the user. This approach also alleviates catastrophic forgetting of previous classes. Explicit feedback may be used as ground truth. Continual learning may be achieved using knowledge distillation (at either input, feature or output levels), feature level regularisation (contrastive, sparsity, or prototype matching constraints), architectural updates and/or using pseudo-labelling of images. Knowledge distillation is described, for example, in "Incremental Learning Techniques for Semantic Segmentation" by Michieli et al. published in the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2019 or in "PLOP: Learning without Forgetting for Continual Semantic Segmentation" by Douillard et al. published in the Proceedings of the Computer Vision and Pattern Recognition (CVPR) conference (2021). Feature level regularisation is described for example in "Continual Semantic Segmentation via Repulsion-Attraction of Sparse and Disentangled Latent Representations" by Michieli et al. published in the Proceedings of the Computer Vision and Pattern Recognition (CVPR) conference (2021). An example of pseudo labelling of images is described in "RECALL: Replay-Based Continual Learning in Semantic Segmentation" by Maracani et al. published in the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) 2021.
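The output-level knowledge distillation mentioned above can be sketched as a cross-entropy penalty between the old model's (soft) predictions and the updated model's predictions, which discourages forgetting of previously learned classes. The toy probability vectors are illustrative assumptions; a real system would apply this over per-pixel class distributions.

```python
import math

def distillation_loss(old_probs, new_probs, eps=1e-9):
    """Cross-entropy of the new model's outputs against the old model's soft targets."""
    return -sum(p * math.log(q + eps) for p, q in zip(old_probs, new_probs))
```

The loss is near zero when the new model reproduces the old model's predictions and grows as the new model drifts from them, so it can be summed with the loss on the newly labelled classes during the continual update.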
[68] Figure 5 is a flowchart showing more detail of the process for selecting the region of the input image to be edited and includes examples of how the recommendation model can be personalised. As in Figure 1, first an input image is received at step S500 and a segmentation map of the input image is obtained at step S502. The segmentation map may be obtained using any segmentation model, including a segmentation model which has been personalised as described above. To make a recommendation of a region selection that is personalised to a specific user, the recommendation model first needs to know the user's preferences. Thus, there is an optional step to determine if the recommendation model has been personalised at step S504.
[69] When the recommendation model has not yet been personalised, there is an option to allow user personalisation on the fly (i.e. while processing an input image). For example, the user device may receive a selection input from the user for the image at step S506. Ideally, the selection should be easy for the user to input and may, for example, be simply a single click on a pixel within the image to indicate a region of the image to be edited. Figure 6a shows an example input image 600 and a pixel 602 which has been selected within the image by the user, e.g. via a single touch on the screen.
[70] Returning to Figure 5, the next step is identifying the region(s) in the image which correspond to the user selection at step S508. The region(s) are identified using the segmentation map which was received at step S502. For example, Figure 6b shows an example segmentation map 604 corresponding to the input image of Figure 6a and a selected region 606 which is shown in black and which corresponds to the pixel selected in Figure 6a. The example segmentation map 604 shown may also be output to the user as the processed image. The use of the segmentation map ensures that the selected region 606 is semantically consistent.
[71] Merely as examples for comparison, Figures 6d and 6e show outputs generated by applications implemented on existing devices. In both of these examples, the user indicates a selection by drawing a line around the region to be edited. This user input is shown with the continuous line. Such a selection is more onerous for the user than the one-click user input suggested above. The region selection by the user device is shown by the dotted line. Typically, the devices use edge detection or similar techniques to identify the region to be edited. In the example of Figure 6d, the dotted line captures part of the grass, as well as part of the person (shoes, shorts, arms etc). In Figure 6e, a different region is surrounded by the dotted line and in this case the person's face, one arm, legs and shorts have been proposed as a selection by the user device. In both examples shown in Figures 6d and 6e, the regions selected automatically by the device (based on the initial user selection) are not semantically consistent because they incorporate parts of different objects. For example, Figure 6d captures a mix of a person and vegetation but not the whole of either the person or the vegetation, and Figure 6e captures parts of the person but not all similar parts, e.g. both arms are not selected. The precise algorithms which are implemented on these devices are not known but they may use a version of Lasso selection and an example of this technique is described in "ICE-Lasso: An enhanced form of Lasso selection" by Dehmeski et al. published in the IEEE International Conference on Science and Technology for Humanity (2009).
[72] Returning to Figure 5, this user selection of a region can be stored as a user preference at step S510 and the recommendation model can be updated with this user selection at step S512. The final step is to output an amended input image 610 which shows the selected region. Figure 6c shows an example output image in which the selected region 608 which was identified from the segmentation map is highlighted to the user.
[73] It will be appreciated that a few iterations (e.g. between 1 and 5 iterations) of steps S506 to S512 may be sufficient for the recommendation model to be sufficiently personalised. Alternatively, the recommendation model may be personalised using stored images as described below. When the recommendation model is ready to be used, the selection of the region may be made automatically by the recommendation model at step S514. In other words, a recommended selected region of the image is output to the user at step S514. This output may, for example, be a visual representation showing which part of the image has been selected. For example, the visual representation may be the input image with the selected region indicated or highlighted, such as shown in Figure 6c.
[74] The recommendation may be presented with a user interface which allows the user to provide feedback on the recommendation, e.g. to indicate approval or rejection of the recommended selected region. Thus the user device determines whether or not approval is received at step S516. When the user device receives an indication that the user approves the selected region, this is taken into account by updating the recommendation model at step S512 with the implicit positive feedback. The amended image with the approved selected segmented region is then output as described above at step S522.
[75] When the user disapproves of the selected region at step S516, Figure 5 illustrates two alternative methods for incorporating the negative feedback from the user. For example, the user may be presented with an option, e.g. on the user interface, to request another recommendation. When the user device receives a request for another recommendation at step S518, there may be an adjustment to the recommendation model, e.g. to the seed parameters at step S520, so that the method can revert to step S514 and another region of the image is recommended to the user. When the user asks for a different recommendation, as indicated by the dotted arrow, the implicit feedback from the user may be used to update the recommendation model at step S512. There is no output amended image until there is approval from the user.
[76] As an alternative to requesting another recommendation, the user may be prompted to provide their own region selection. In this instance, the method may loop to step S506 in which the user's region selection is received. The amended image showing the region selected by the user is then output in the final step S522 after repeating the identifying, storing and updating steps S508, S510 and S512 described above.
[77] Figure 5 shows on-the-fly personalisation of the recommendation model. In addition to this training, there may be additional training of the recommendation model using the user's photos which are stored on the user device. A number of different learning techniques may be used. Figure 7 illustrates one method of personalising the recommendation model. In a first step S700, a pre-trained recommendation model is received from a server or similar device which is remote (i.e. separate) from the user device. An image which is stored on the user device is then accessed at step S702. The image is analysed to determine whether any region of the image has been edited (e.g. removed or enhanced by filtering etc.) at step S704. When a region of the image has been edited, information in relation to the edited region may be stored as a user preference at step S706. Similarly, if no regions have been edited, this may also be stored as a user preference at step S706. The process may then determine whether there are any more images to be processed at step S708 and if so, loop back to step S702 to access another photo and repeat the analysing and storing steps. When a sufficient number of images (e.g. between two and ten, such as five images) have been processed, the user preferences can be used to update the recommendation model at step S710.
[78] The recommendation model may be an artificial intelligence, AI, model that uses, for example, Bayesian inference alone or in conjunction with reinforcement learning to model the most likely region selection patterns. One way to use Bayesian inference is to consider the frequency of selection for particular regions and to calculate an associated weight which is a function of the frequency of selection. As shown in Table 1 below, the type/class of selected region, a count for a selection and a weight could be the information which is stored as the user preferences.
Class types          Counter   Weight
Removal of my dog    2         2/11
Removal of my cat    4         4/11
Removal of people    5         5/11
Table 1
[79] In this example, the user preferences comprise a weighted list of categories/classes into which the objects the user previously selected fall. A running counter measures how often the user decided to select a region containing a dog, a cat or a person. As explained above, the classes may also be personalised and tailored to each specific user. The different classes may therefore also be as specific as indicating that a dog is the user's dog. This may either be achieved by analysing the user's images and concluding that a specific dog is often shown in them, or by the user providing their own labels.
[80] In this example, the weight for each category is simply the count for that category divided by the sum of all counted classes. When making a recommendation of a selected region using this recommendation model, the input image is segmented using the segmentation map to identify any objects or regions falling into the classes stored in the user preferences. When more than one region is identified, the region having the highest weight is recommended to the user. In other words, for the example above, when the user's dog and cat are identified in the input image, the region corresponding to the cat would be recommended to the user as a selected region rather than the region corresponding to the dog.
[81] In the example above, a simple weight function based on frequency of selection is calculated. As an alternative, Table 2 shows an example in which the frequency of selection is considered together with the context of the image, e.g. whether the image is taken indoors or outdoors. The recommendation model may thus be trained to recognise that regions including the user's dog are more often selected in an image taken inside than outside, and this may influence the regions that are output to the user as a recommendation. The contexts can be decided a priori or automatically assessed by unsupervised clustering over the stored samples in a similar manner to that explained in relation to step S410 in Figure 4. The aim of context correction is to ensure that when there is a match between the context of the input image and the stored images, the weight will be high, and low otherwise. For example, a user might want to remove people only when they appear in landscape pictures and not in pictures of birthday parties. The context is thus important when making a recommendation.
| Class types | Overall counter | Outdoor counter | Indoor counter | Weight_outdoor | Weight_indoor |
|---|---|---|---|---|---|
| Removal of my dog | 2 | 2 | 0 | (2/11)*(2/2) | (2/11)*(0/2) |
| Removal of my cat | 4 | 3 | 1 | (4/11)*(3/4) | (4/11)*(1/4) |
| Removal of people | 5 | 4 | 1 | (5/11)*(4/5) | (5/11)*(1/5) |
Table 2
[082] For the user preferences stored in the example of table 2, the weight may be a function of the context and the frequency of selection. In other words, the weight for each class i in each context j may be calculated using:

Weight_class_i_context_j = func(i, j, counter)

The function can be decomposed into two functions as:

Weight_class_i_context_j = func1(i, counter) * func2(i, j, counter)

Merely as an example, the two functions can be characterised as follows. Func1 can compute the overall count for a category (sum_j(counter[i,j]), assuming counter to be a matrix of size num_classes * num_contexts, where classes are indexed by i and contexts are indexed by j) divided by the sum of all counted classes (sum(counter)):

func1(i, counter) = sum_j(counter[i,j]) / sum(counter)

In the example above, counter is the following matrix:

counter = [ [2, 0], [3, 1], [4, 1] ]

Func2 can compute the count for a category in a particular context (counter[i,j]) divided by the overall count for that category (sum_j(counter[i,j])), e.g.

func2(i, j, counter) = counter[i,j] / sum_j(counter[i,j])

In this example case, the final equation simplifies as

Weight_class_i_context_j = func1(i, counter) * func2(i, j, counter) = counter[i,j] / sum(counter)

When changing func1 and/or func2 the simplification might no longer hold.
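As a sketch (function and variable names follow the text, the implementation details are illustrative), the decomposition can be checked numerically against table 2:

```python
def func1(i, counter):
    """Overall frequency of class i: its total count divided by the grand total."""
    total = sum(sum(row) for row in counter)
    return sum(counter[i]) / total

def func2(i, j, counter):
    """Fraction of class-i selections that occurred in context j."""
    return counter[i][j] / sum(counter[i])

def weight(i, j, counter):
    return func1(i, counter) * func2(i, j, counter)

# Counter matrix from table 2: rows = (my dog, my cat, people), columns = (outdoor, indoor)
counter = [[2, 0], [3, 1], [4, 1]]

# Reproduces the table entries, e.g. Weight_outdoor for "my dog" = (2/11)*(2/2)
assert abs(weight(0, 0, counter) - (2 / 11) * (2 / 2)) < 1e-12

# ...and the product simplifies to counter[i][j] / sum(counter) for every entry
assert all(
    abs(weight(i, j, counter) - counter[i][j] / 11) < 1e-12
    for i in range(3) for j in range(2)
)
```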
[083] When making a recommendation of a selected region using this recommendation model, the input image is segmented using the segmentation map to identify any objects or regions falling into the classes stored in the user preferences and the context of the image is identified. When more than one region is identified, the region having the highest weight for that particular context is recommended to the user. In other words, for the example above, when the user's dog and cat are identified in an input image which is taken outdoors, the region corresponding to the cat would be recommended to the user as a selected region rather than the region corresponding to the dog.
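A minimal sketch of this context-aware selection (the class/context index mappings and function names are illustrative assumptions, not from the specification):

```python
def recommend_with_context(segment_classes, context, class_index, context_index, counter):
    """Pick the detected segment class with the highest weight for the image's context.

    Uses the simplified weight counter[i][j] / sum(counter) from paragraph [082].
    """
    j = context_index[context]
    total = sum(sum(row) for row in counter)
    best, best_w = None, -1.0
    for cls in segment_classes:
        if cls not in class_index:
            continue  # segment class not in the stored user preferences
        w = counter[class_index[cls]][j] / total
        if w > best_w:
            best, best_w = cls, w
    return best

class_index = {"my dog": 0, "my cat": 1, "people": 2}
context_index = {"outdoor": 0, "indoor": 1}
counter = [[2, 0], [3, 1], [4, 1]]  # counts from table 2

# An outdoor image containing the user's dog and cat: the cat (3/11) beats the dog (2/11)
print(recommend_with_context({"my dog", "my cat"}, "outdoor", class_index, context_index, counter))
```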
[084] The Bayesian inference described above (e.g. with/without context adjustment) can be used together with reinforcement learning to further improve the personalisation of the recommendation model. For example, the implicit user feedback received at step S516, when determining whether or not the user approves the recommendation, may be used to adjust the weight score. A user-centric loss function can be used to take into account the user preferences, for example by injecting user preferences/rankings into the loss function. An example is described in "A Semantic Loss Function for Deep Learning with Symbolic Knowledge" by Xu et al., published in the Proceedings of the International Conference on Machine Learning (ICML) in 2018. In other words, preference ranking can be included in model training directly (for example, via the loss function) or via a regularisation term (for example, by learning preferences with a separate head and using the learned preferences to regularise the model training).
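Merely as an illustrative sketch of one way preference weights could be injected into a loss (this specific form is an assumption, not the semantic loss of Xu et al.): each sample's cross-entropy can be scaled by the user's preference weight for its target class, so classes the user cares about contribute more to training.

```python
import numpy as np

def preference_weighted_loss(logits, targets, class_weights):
    """Mean cross-entropy where each sample is scaled by the user's
    preference weight for its target class (hypothetical user-centric loss)."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]        # per-sample negative log-likelihood
    return float((nll * class_weights[targets]).mean())

logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
targets = np.array([0, 1])
prefs = np.array([2 / 11, 4 / 11, 5 / 11])  # e.g. normalised selection counts from table 2
loss = preference_weighted_loss(logits, targets, prefs)
```

With `class_weights` set to all ones, this reduces to the ordinary mean cross-entropy, so the preference term acts purely as a per-class re-weighting.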
[085] Figures 2 and 5 show separate on-the-fly personalisation of the segmentation model and the recommendation model respectively. As indicated in Figure 1, it is possible to simultaneously update both models using implicit or explicit feedback received from the user in response to the recommendation of a selected region by the recommendation model. The recommendation model uses the segmentation model when making the recommendation, and thus the user's approval, rejection or selection of different regions can be used to update the segmentation model as well as the recommendation model. For example, if the user approves, nothing changes within the models; the region is saved into memory and used as a ground truth for both the segmentation and recommendation systems. If the user selects a different region, this selection is used as a ground truth for both systems and saved into memory. If the user rejects, the region is not saved into memory, to avoid propagating errors to future unseen samples.

[086] Figure 8 shows a block diagram of an apparatus 800 for performing the image processing methods described above. The apparatus 800 may be any type of user device, such as a personal computer or a laptop, and may be a more resource-constrained device, such as a smartphone, tablet computer and/or other mobile device. The apparatus 800 comprises the standard components of such devices, including, as shown, a communication module 810, a display 814 for displaying information such as a segmentation map or selected region of the input image to the user, and a user interface 816 for receiving user input, such as a mouse, keyboard, voice recognition input device, touch sensitive screen or any similar device. The display 814 may comprise any suitable display screen, e.g. LCD or LED, which may also be touch sensitive to allow user input.
The apparatus also optionally comprises a camera 812 for taking input images that may then be processed as described above. The communication module 810 may communicate using any suitable communication protocol, e.g. wireless communication, hypertext transfer protocol (HTTP), message queuing telemetry transport (MQTT), a wireless mobile telecommunication protocol, radio frequency identification (RFID), near field communication (NFC), ZigBee, Thread, Bluetooth, Bluetooth LE, IPv6 over Low Power Wireless Personal Area Networks (6LoWPAN), Constrained Application Protocol (CoAP) or a wired communication. There may be other standard components which are omitted from the Figure for ease of reference.
[087] The apparatus comprises a processor or processors 802 coupled to memory 804, a recommendation module 806 and a segmentation module 808 for performing the methods described above. The memory may be any suitable form of memory, including volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory, such as Flash, read only memory (ROM) or electrically erasable programmable ROM (EEPROM), for storing data, programs or instructions, for example.
[088] Using the communication module 810, the apparatus 800 may, for example, receive a pretrained segmentation and/or recommendation model from a server 850 which may be remote from the apparatus. Each of these models may then be personalised, for example on the fly as described above and/or using user preferences 822 and/or user images 824 which are stored locally in the memory on the apparatus 800. As shown in this example, the personalised recommendation model 826 may then be stored in the recommendation module 806 and the personalised segmentation model 828 may then be stored in the segmentation module 808. The memory is shown separately from the segmentation and recommendation modules, but it will be appreciated that these may be combined.
[089] By using the user's photos to personalise one or both of the recommendation and segmentation models, the overall user experience is much improved. For example, the user generally needs to provide less supervision to achieve the desired photo editing and may generate higher quality edited images. Merely as an example, a first user may typically remove people from an image whereas a second, different user may typically remove background objects from an image. The methods described above can learn these user preferences to ensure that the image processing is personalised to the user.
[090] As shown in Figure 8, the user's photos may be stored locally on the user device. In such instances, when the photos are used to train one or both of the recommendation and segmentation models, the user's data remains on the user device. Thus, the data remains private and there is a reduced risk of data breach or data theft. The personalisation can thus also be targeted to users who do not typically share any data.

[091] The personalisation is done on the user device as explained above. Furthermore, the personalised models are both used locally by the user device to process the image. This may mean that there are lower server costs, because the training and processing are done locally. In other words, although there may be a server within the system, e.g. to provide the initial, pre-trained segmentation and recommendation models, when the image is being edited using the models there is no need for any connectivity between the server and the user device. For example, there is no need for the image being edited to be sent to the server from the user device.
[92] At least some of the example embodiments described herein may be constructed, partially or wholly, using dedicated special-purpose hardware. Terms such as 'component', 'module' or 'unit' used herein may include, but are not limited to, a hardware device, such as circuitry in the form of discrete or integrated components, a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks or provides the associated functionality. In some embodiments, the described elements may be configured to reside on a tangible, persistent, addressable storage medium and may be configured to execute on one or more processors. These functional elements may in some embodiments include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. Although the example embodiments have been described with reference to the components, modules and units discussed herein, such functional elements may be combined into fewer elements or separated into additional elements.
[93] Various combinations of optional features have been described herein, and it will be appreciated that described features may be combined in any suitable combination. In particular, the features of any one example embodiment may be combined with features of any other embodiment, as appropriate, except where such combinations are mutually exclusive. Throughout this specification, the term "comprising" or "comprises" means including the component(s) specified but not to the exclusion of the presence of others.
[94] Attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
[95] Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features. The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.
Claims (16)
- CLAIMS1. A computer-implemented method for processing an input image on a user device, the method comprising: receiving, at the user device, an input image; obtaining, using the user device, a segmentation map corresponding to the input image, wherein the segmentation map divides the input image into a plurality of segments; selecting, using a recommendation model stored on the user device, at least one segment from the plurality of segments of the input image; and processing, using the user device, the at least one selected segment in the input image to generate a processed image; wherein the recommendation model is a personalised machine learning model which has been trained on the user device using user preferences derived from images stored on the user device.
- 2. The method as claimed in claim 1, wherein the personalised machine learning model is trained by accessing a plurality of images stored on the user device; analysing each image of the plurality of images to determine whether any segment of the image has been edited; and when it is determined that an image has been edited, storing information relating to the edited segment as a user preference.
- 3. The method as claimed in claim 2, further comprising determining a class for each edited segment and wherein the stored user preferences comprise a list of the classes and a count counting the number of photos in which the edited segment belongs to one of the listed classes.
- 4. The method as claimed in claim 3, further comprising determining a context of each image of the plurality of images and wherein the stored user preferences comprise a context count counting the number of photos in which the edited segment belongs to one of the listed classes for each context.
- 5. The method as claimed in claim 3 or claim 4, further comprising calculating, for each of the listed classes, a weight which is a function of at least the count, and wherein selecting at least one segment comprises analysing the input image to determine a class for each of the plurality of segments, identifying the determined classes in the input image which match one of the listed classes and selecting the at least one segment with a class which matches the listed class having the highest calculated weight.
- 6. The method as claimed in any one of the preceding claims, further comprising: outputting the selected at least one segment to a user for feedback; receiving user feedback on the selected at least one segment and updating the recommendation model using the received user feedback.
- 7. The method as claimed in claim 6, wherein the user feedback is approval or rejection of the selected at least one segment, and wherein when user approval for the selected at least one segment is received, the method further comprises generating the processed image using the selected at least one segment and storing the user feedback as a user preference.
- 8. The method as claimed in claim 6 or claim 7, wherein the user feedback is approval or rejection of the selected at least one segment, and wherein when user rejection of the selected at least one segment is received, the method further comprises: requesting further user feedback which is selected from a request for the recommendation model to select a new segment and a selection of at least one segment by the user.
- 9. The method as claimed in any one of the preceding claims, wherein the segmentation map uses a segmentation model to separate the input image into a plurality of semantically consistent segments.
- 10. The method as claimed in claim 9, wherein the segmentation model is a personalised machine learning model which has been trained on the user device using user preferences derived from images stored on the user device.
- 11. The method as claimed in claim 10, wherein the segmentation model is trained using one of zero-shot learning, few-shot learning and continual learning.
- 12. The method as claimed in any one of claims 9 to 11, further comprising: outputting the segmentation map to a user for feedback; receiving user feedback on the segmentation map and updating the segmentation model using the received user feedback to personalise the segmentation model.
- 13. The method as claimed in any one of claims 9 to 12, further comprising fine-tuning the segmentation model using at least one of user preferences and user feedback.
- 14. The method as claimed in any one of the preceding claims, wherein processing the at least one selected segment in the input image to generate a processed image comprises one or more of erasing the at least one selected segment, applying a filter to the at least one selected segment, and providing a tag for the at least one selected segment to be output with the processed image.
- 15. A computer-readable storage medium comprising instructions which, when executed by a processor on a user device, causes the processor to carry out any of the methods of the preceding claims.
- 16. A user device for processing an input image, the user device comprising a processor; memory storing user preferences, images and executable instructions which when executed by the processor cause the processor to carry out any of the methods of claims 1 to 15.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2213082.7A GB2622238B (en) | 2022-09-07 | 2022-09-07 | A method and device for personalised image segmentation and processing |
| PCT/KR2023/010242 WO2024053846A1 (en) | 2022-09-07 | 2023-07-18 | A method and device for personalised image segmentation and processing |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2213082.7A GB2622238B (en) | 2022-09-07 | 2022-09-07 | A method and device for personalised image segmentation and processing |
Publications (3)
| Publication Number | Publication Date |
|---|---|
| GB202213082D0 GB202213082D0 (en) | 2022-10-19 |
| GB2622238A true GB2622238A (en) | 2024-03-13 |
| GB2622238B GB2622238B (en) | 2024-09-25 |
Family
ID=83933301
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB2213082.7A Active GB2622238B (en) | 2022-09-07 | 2022-09-07 | A method and device for personalised image segmentation and processing |
Country Status (1)
| Country | Link |
|---|---|
| GB (1) | GB2622238B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2635033A (en) * | 2023-09-29 | 2025-04-30 | Samsung Electronics Co Ltd | Method for personalising a machine learning model |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115861218A (en) * | 2022-11-28 | 2023-03-28 | 天翼云科技有限公司 | A road damage detection method based on image processing |
| CN115937634A (en) * | 2022-12-15 | 2023-04-07 | 浙江大学 | Classification method and device based on Transformer-based dynamic sparsification combined with prototype |
| CN116310523A (en) * | 2023-03-02 | 2023-06-23 | 中国人民解放军战略支援部队信息工程大学 | Grading method and system for the relationship between lumbar disc herniation and nerve root compression based on visual Transformer |
| CN116910233B (en) * | 2023-06-27 | 2025-04-04 | 西北工业大学 | A Text Summarization Assisted Generation Method Based on Contrastive Learning |
| CN117726808B (en) * | 2023-09-21 | 2025-05-30 | 书行科技(北京)有限公司 | A model generation method, image processing method and related equipment |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20170316256A1 (en) * | 2016-04-29 | 2017-11-02 | Google Inc. | Automatic animation triggering from video |
| US10777228B1 (en) * | 2018-03-22 | 2020-09-15 | Gopro, Inc. | Systems and methods for creating video edits |
| CN111832570A (en) * | 2020-07-02 | 2020-10-27 | 北京工业大学 | An image semantic segmentation model training method and system |
| CN112381831A (en) * | 2020-11-26 | 2021-02-19 | 南开大学 | Personalized image segmentation method and system based on semantic assistance between images |
| WO2021103731A1 (en) * | 2019-11-26 | 2021-06-03 | 华为技术有限公司 | Semantic segmentation method, and model training method and apparatus |
| CN114821053A (en) * | 2022-04-26 | 2022-07-29 | 中科领航智能科技(苏州)有限公司 | Image semi-supervised semantic segmentation method based on conservative aggressive collaborative learning |
2022
- 2022-09-07 GB GB2213082.7A patent/GB2622238B/en active Active
Also Published As
| Publication number | Publication date |
|---|---|
| GB202213082D0 (en) | 2022-10-19 |
| GB2622238B (en) | 2024-09-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| GB2622238A (en) | A method and device for personalised image segmentation and processing | |
| US12450870B2 (en) | Classifying image styles of images based on procedural style embeddings | |
| US10459975B1 (en) | Method and system for creating an automatic video summary | |
| US10621755B1 (en) | Image file compression using dummy data for non-salient portions of images | |
| Yao et al. | Semantic annotation of high-resolution satellite images via weakly supervised learning | |
| KR102607208B1 (en) | Neural network learning methods and devices | |
| JP2018018537A (en) | Threshold change device | |
| KR20190016367A (en) | Method and apparatus for recognizing an object | |
| US10783398B1 (en) | Image editor including localized editing based on generative adversarial networks | |
| US11250295B2 (en) | Image searching apparatus, classifier training method, and recording medium | |
| CN112749737A (en) | Image classification method and device, electronic equipment and storage medium | |
| CN114329028B (en) | Data processing method, device and computer readable storage medium | |
| US10353951B1 (en) | Search query refinement based on user image selections | |
| KR102801497B1 (en) | Computing apparatus and operating method thereof | |
| Kulkarni et al. | Spleap: Soft pooling of learned parts for image classification | |
| WO2025101496A1 (en) | Model fine-tuning for automated augmented reality | |
| Liu et al. | Learning dynamic hierarchical models for anytime scene labeling | |
| US20220138247A1 (en) | Text adjusted visual search | |
| KR102847307B1 (en) | Method for recommending user-personalized plants based on artificial intelligence and plant management system performing the same | |
| CN116975743A (en) | Industry information classification methods, devices, computer equipment and storage media | |
| CN117036765A (en) | Image classification model processing and image classification methods, devices and computer equipment | |
| Medikonda et al. | Enhanced hyperspectral image classification using multi-scale residual depthwise separable convolutional networks with advanced feature extraction and selection techniques | |
| Song et al. | Unsupervised remote sensing image classification with differentiable feature clustering by coupled transformer | |
| US20230316085A1 (en) | Method and apparatus for adapting a local ml model | |
| US20240096119A1 (en) | Depth Based Image Tagging |