
CN119052654A - Image processing method, model training method and related device - Google Patents

Image processing method, model training method and related device

Info

Publication number
CN119052654A
Authority
CN
China
Prior art keywords
image
scene
information corresponding
feature information
shooting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411556598.2A
Other languages
Chinese (zh)
Inventor
况佳臣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202411556598.2A priority Critical patent/CN119052654A/en
Publication of CN119052654A publication Critical patent/CN119052654A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/70Circuitry for compensating brightness variation in the scene
    • H04N23/71Circuitry for evaluating the brightness variation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/35Categorising the entire scene, e.g. birthday party or wedding scene
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/70Circuitry for compensating brightness variation in the scene
    • H04N23/76Circuitry for compensating brightness variation in the scene by influencing the image signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/95Computational photography systems, e.g. light-field imaging systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Studio Devices (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides an image processing method, a model training method and a related device. In the method, an electronic device inputs a first image into a multi-task learning model, and the multi-task learning model determines, from the feature information corresponding to the first image, the feature information corresponding to the shooting scene and the feature information corresponding to the shooting object. The multi-task learning model then outputs a scene classification result according to the feature information corresponding to the shooting scene, and outputs a target detection result according to the feature information corresponding to the shooting object. The electronic device determines a target exposure parameter according to the shooting scene indicated by the scene classification result and/or the shooting object indicated by the target detection result, acquires a second image based on the target exposure parameter, and displays the second image. In this way, the electronic device can determine the scene classification result and the target detection result required for automatic exposure with a single model, which reduces the computing load on the electronic device and improves its performance.

Description

Image processing method, model training method and related device
Technical Field
The application relates to the field of artificial intelligence, in particular to an image processing method, a model training method and a related device.
Background
Automatic exposure (Automatic Exposure, AE) is a technique widely used in photography and image capture: for example, an electronic device automatically adjusts exposure parameters (e.g., sensitivity, aperture size, exposure time) according to the ambient light through automatic exposure, so as to obtain an image or video with moderate brightness.
In general, the electronic device needs to determine exposure parameters for a specific shooting scene (e.g., snow scene, night scene, etc.) and a specific shooting object (e.g., face, etc.) in a corresponding manner, so the electronic device may determine the category of the shooting scene and the category of the shooting object before determining the exposure parameters.
However, determining a plurality of shooting scenes and a plurality of shooting objects separately affects the power consumption of the electronic device and the efficiency with which its memory is used, thereby reducing the usability of the electronic device.
Disclosure of Invention
With the image processing method, the model training method and the related device provided by the present application, the electronic device determines the scene classification result and the target detection result required for automatic exposure through a multi-task learning model, which reduces the computing load on the electronic device and improves its performance.
In a first aspect, the present application provides an image processing method, the method comprising:
Acquiring a first image, wherein the first image is acquired based on initial exposure parameters;
Inputting the first image into a multi-task learning model, and outputting a scene classification result and a target detection result through the multi-task learning model, wherein the scene classification result is used for representing a shooting scene corresponding to the first image, and the target detection result is used for representing a shooting object contained in the first image;
determining target exposure parameters according to the scene classification result and/or the target detection result;
and displaying a second image, wherein the second image is acquired based on the target exposure parameters.
The initial exposure parameter may be the exposure parameter used by default after the electronic device starts the camera application, an exposure parameter manually selected by the user and received by the electronic device, or an exposure parameter determined by the electronic device based on the brightness of the previous image frame.
In this method, the electronic device can determine the shooting scene and the shooting object corresponding to the first image through one multi-task learning model. Compared with the prior art, in which multiple algorithms are called to perform scene classification tasks and/or target detection tasks, running a single model generates less power consumption and occupies less memory, which improves the usability of the electronic device and shortens the automatic exposure processing time. Furthermore, having the electronic device determine the exposure parameters based on the output of one model simplifies the steps of determining the exposure parameters. When an erroneous result appears in the output of the multi-task learning model, the electronic device can apply subsequent algorithm updates to the multi-task learning model based on that erroneous result, so the automatic exposure effect is continuously improved. Meanwhile, compared with deploying many separate algorithms as in the prior art, deploying one multi-task learning model reduces cost and the maintenance burden of the algorithms.
In a possible implementation manner of the first aspect, the inputting the first image into a multi-task learning model, outputting, by the multi-task learning model, a scene classification result and a target detection result includes:
inputting the first image into the multi-task learning model, and determining feature information corresponding to the first image through the multi-task learning model, wherein the feature information corresponding to the first image comprises feature information corresponding to the shooting scene and the shooting object;
Determining feature information corresponding to the shooting scene and feature information corresponding to the shooting object according to the feature information corresponding to the first image;
Determining the scene classification result based on the feature information corresponding to the shooting scene through a first output layer in the multi-task learning model;
and determining the target detection result based on the characteristic information corresponding to the shooting object through a second output layer in the multi-task learning model.
In existing scene classification or target detection schemes, the feature information required by a task is generally extracted directly from the input image and the extracted features are then processed to obtain a result, so no feature decoupling is involved. In the embodiment of the present application, the feature information required by each task is decoupled from the feature information corresponding to the first image, so that the feature information corresponding to the shooting scene contains not only information related to the shooting scene but also information related to the shooting object; similarly, the feature information corresponding to the shooting object contains not only information related to the shooting object but also information related to the shooting scene. In other words, the feature information corresponding to the shooting scene and the feature information corresponding to the shooting object share feature information. The electronic device determines the scene classification result based on the feature information corresponding to the shooting scene together with the shared feature information; compared with determining the scene classification result based only on the feature information corresponding to the shooting scene, this strengthens the multi-task learning model's understanding of the first image, so a scene classification result with higher accuracy is output. Similarly, the electronic device can obtain a target detection result with higher accuracy based on the feature information corresponding to the shooting object and the shared feature information.
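As an illustration only, the shared-backbone structure with two output layers described above can be sketched in PyTorch-style Python; the class name MultiTaskAEModel, the submodule names and the way the decoupler returns two feature maps are assumptions made for the sketch, not details taken from the application.

```python
import torch.nn as nn

class MultiTaskAEModel(nn.Module):
    """Minimal sketch: one shared backbone, feature decoupling, two output layers."""
    def __init__(self, backbone, decoupler, scene_head, detect_head):
        super().__init__()
        self.backbone = backbone        # extracts feature information of the first image
        self.decoupler = decoupler      # splits shared features into scene / object features
        self.scene_head = scene_head    # first output layer: scene classification
        self.detect_head = detect_head  # second output layer: target detection

    def forward(self, first_image):
        shared_feat = self.backbone(first_image)               # features of the first image
        scene_feat, object_feat = self.decoupler(shared_feat)  # decoupled, still sharing context
        scene_result = self.scene_head(scene_feat)             # scene classification result
        detect_result = self.detect_head(object_feat)          # target detection result
        return scene_result, detect_result
```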
In a possible implementation manner of the first aspect, the determining, according to the feature information corresponding to the first image, feature information corresponding to the shooting scene and feature information corresponding to the shooting object includes:
determining a first weight and a second weight;
determining the characteristic information corresponding to the shooting object according to the characteristic information corresponding to the first image and the first weight;
And determining the feature information corresponding to the shooting scene according to the feature information corresponding to the first image and the second weight.
In the above method, the weight is used to characterize the importance or contribution of a factor (for example, feature information corresponding to a photographed scene or feature information corresponding to a photographed object) in addition to the percentage of the factor in the whole (for example, feature information corresponding to the first image). Therefore, the feature decoupling according to the first weight and the second weight in the embodiment of the present application may be understood as feature decoupling according to the importance degree of the feature information corresponding to the shooting scene or the feature information corresponding to the shooting object in the feature information corresponding to the first image. For example, taking the first weight as an example, the first weight is used to represent the importance degree of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image, and the feature information corresponding to the shooting object is determined according to the first weight and the feature information corresponding to the first image, so that the useful feature information (information related to the shooting object) can be enhanced, and useless feature information (feature information unrelated to the shooting object) can be suppressed, so that the association degree of the determined feature information of the shooting object and the shooting object is higher.
In a possible implementation manner of the first aspect, the feature information includes feature information in a channel dimension and feature information in a space dimension, and the determining the first weight and the second weight includes:
Performing first processing on the feature information corresponding to the first image in the channel dimension, and determining a first processing result, wherein the first processing result is used for representing the weight of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image in the channel dimension;
Performing second processing on the feature information corresponding to the first image in the space dimension, and determining a second processing result, wherein the second processing result is used for representing the weight of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image in the space dimension;
determining the first weight according to a first processing result and a second processing result, wherein the first weight is used for representing the weight of the characteristic information corresponding to the shooting object relative to the characteristic information corresponding to the first image in the channel dimension and the space dimension;
and determining the second weight according to the first weight.
For example, the first processing performed on the feature information corresponding to the first image in the channel dimension may be expressed as:
$M_c = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$
where $F$ is the feature information corresponding to the first image, $\mathrm{AvgPool}(F)$ is the result of average pooling of $F$, $\mathrm{MaxPool}(F)$ is the result of max pooling of $F$, $\mathrm{MLP}$ is a fully connected layer structure, $\sigma$ is the activation function, and $M_c$ is the first processing result.
For example, the second processing performed on the feature information corresponding to the first image in the spatial dimension may be expressed as:
$M_s = \sigma\big(f([\mathrm{AvgPool}(F);\,\mathrm{MaxPool}(F)])\big)$
where $F$ is the feature information corresponding to the first image, $\mathrm{AvgPool}(F)$ is the result of average pooling of $F$, $\mathrm{MaxPool}(F)$ is the result of max pooling of $F$, $f$ is a convolution operation performed on $\mathrm{AvgPool}(F)$ and $\mathrm{MaxPool}(F)$, $\sigma$ is the activation function, and $M_s$ is the second processing result.
For example, the first weight may be expressed as:
$W_1 = \mathrm{Avg}\big(\mathrm{Expand}(M_c) + \mathrm{Expand}(M_s)\big)$
where $F$ is the feature information corresponding to the first image, $W_1$ is the first weight, $M_c$ is the first processing result, and $M_s$ is the second processing result. $\mathrm{Expand}(\cdot)$ indicates that the first processing result and the second processing result are each expanded to the size corresponding to $F$; the expanded first processing result and the expanded second processing result are then accumulated, and an average operation is performed to obtain the first weight.
In this method, the feature information corresponding to the first image extracted by the electronic device includes feature information in the channel dimension and feature information in the spatial dimension. The electronic device can process the feature information corresponding to the first image in the channel dimension and in the spatial dimension separately, obtaining the weight of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image in the channel dimension, and the corresponding weight in the spatial dimension.
In one implementation, different channels may have different contributions to different features, e.g., some channels may contain more critical information (e.g., information related to the subject), while other channels may contain noise or redundant information (e.g., information unrelated to the subject). Therefore, the electronic device can determine the weight of the key information (for example, the information related to the shooting object) on each channel, and by giving a higher weight to the channel containing more key information and giving a lower weight to the channel containing less key information, the characteristics related to the shooting object can be enhanced in the channel dimension, and the characteristics unrelated to the shooting object can be suppressed. Similarly, the feature related to the subject can be enhanced in the spatial dimension, and the feature unrelated to the subject can be suppressed. The first weight and the second weight are determined based on the weight in the channel dimension and the weight in the space dimension, so that the aim of decoupling the characteristic information corresponding to the shooting object from the characteristic information corresponding to the first image is fulfilled.
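The channel and spatial weighting described above matches the familiar pattern of convolutional block attention (CBAM-style). The following is a minimal sketch under that assumption; the reduction ratio, the kernel size and the complementary second weight W2 = 1 - W1 are illustrative choices, not details confirmed by the application.

```python
import torch
import torch.nn as nn

class FeatureDecoupler(nn.Module):
    """Sketch of deriving the first/second weights from channel and spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # channel branch: MLP over average- and max-pooled channel descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        # spatial branch: convolution over the concatenated pooled maps
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat):                                    # feat: (N, C, H, W)
        n, c, h, w = feat.shape
        # first processing result: channel-wise weight M_c
        avg_c = self.mlp(feat.mean(dim=(2, 3)))
        max_c = self.mlp(feat.amax(dim=(2, 3)))
        m_c = torch.sigmoid(avg_c + max_c).view(n, c, 1, 1)
        # second processing result: spatial weight M_s
        avg_s = feat.mean(dim=1, keepdim=True)
        max_s = feat.amax(dim=1, keepdim=True)
        m_s = torch.sigmoid(self.conv(torch.cat([avg_s, max_s], dim=1)))
        # first weight: expand both results to the size of feat and average them
        w1 = (m_c.expand(n, c, h, w) + m_s.expand(n, c, h, w)) / 2
        w2 = 1.0 - w1                                           # assumed complementary second weight
        object_feat = feat * w1                                 # features of the shooting object
        scene_feat = feat * w2                                  # features of the shooting scene
        return scene_feat, object_feat
```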
In a possible implementation manner of the first aspect, the determining, by the first output layer in the multitasking learning model, the scene classification result based on the feature information corresponding to the shooting scene includes:
inputting the feature information corresponding to the shooting scene into the first output layer in the multi-task learning model, and outputting the scene classification result through the first output layer, wherein the scene classification result comprises a first judgment result and a category corresponding to the shooting scene, and the first judgment result is used for representing whether the shooting scene belongs to a preset scene or not.
Illustratively, the first output layer includes a scene discriminator and a scene classifier. The scene discriminator is used for judging whether the shooting scene corresponding to the first image is a preset scene or not, a first judging result is obtained, and the scene classifier is used for determining the category corresponding to the shooting scene.
Wherein the preset scene includes, but is not limited to, snow scenes, stages, grasslands, night scenes, and the like.
In this method, the electronic device outputs the first judgment result and the category corresponding to the shooting scene as two independent pieces of data, so when the electronic device determines the target exposure parameter based on the scene classification result, it can analyze the first judgment result first. For example, if the first judgment result is that the shooting scene does not belong to the preset scene, the electronic device skips analyzing the category corresponding to the shooting scene and analyzes the target detection result instead, which reduces the amount of data the electronic device processes and further speeds up automatic exposure.
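One possible shape for such a first output layer, given only as an assumption (a binary scene discriminator plus a scene classifier over pooled scene features):

```python
import torch.nn as nn

class SceneHead(nn.Module):
    """Sketch of a first output layer: scene discriminator + scene classifier."""
    def __init__(self, channels, num_scene_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.discriminator = nn.Linear(channels, 2)               # preset scene: yes / no
        self.classifier = nn.Linear(channels, num_scene_classes)  # e.g. snow, stage, grassland, night

    def forward(self, scene_feat):
        v = self.pool(scene_feat).flatten(1)
        return self.discriminator(v), self.classifier(v)          # first judgment result, scene category
```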
In a possible implementation manner of the first aspect, the determining a target exposure parameter according to the scene classification result and/or the target detection result includes:
And under the condition that the first judging result is that the shooting scene belongs to the preset scene, determining the target exposure parameters according to the category corresponding to the shooting scene.
In the above method, since all visible elements of the first image are part of the shooting scene, the shooting scene is related to the information of the entire first image. When the shooting scene belongs to a preset scene, the electronic device needs to adjust the overall brightness of the image, and in doing so it also synchronously adjusts the brightness of the shooting objects contained in the first image. The electronic device can therefore adjust the image brightness to a moderate state according to the category of the shooting scene without considering the category of the shooting object, which saves computing resources and speeds up automatic exposure.
In a possible implementation manner of the first aspect, the determining, by the second output layer in the multitasking learning model, the target detection result based on the feature information corresponding to the photographic subject includes:
And inputting the characteristic information corresponding to the shooting object into the second output layer in the multi-task learning model, and outputting the target detection result through the second output layer, wherein the target detection result comprises the category of the shooting object and the position information of the shooting object in the first image.
Illustratively, the second output layer includes a target detector that can detect the category of the photographic subject and the position of the photographic subject at the same time, following the design principle of the decoupling detection head.
In this method, the target detector follows the design principle of a decoupled detection head: part of its network structure is shared to simplify the detector and to balance the representation capability of its operators against the computational cost of the hardware in the electronic device. The output of the shared structure can then be used both to determine the category of the shooting object and to determine the position information of the shooting object in the first image, which reduces the computing load on the electronic device and saves processing time.
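As a sketch only, a decoupled detection head of the kind described typically shares a small stem and then splits into a classification branch and a localization branch; the layer widths and the 4-value box encoding are assumptions.

```python
import torch.nn as nn

class DetectHead(nn.Module):
    """Sketch of a second output layer: shared stem with decoupled class/box branches."""
    def __init__(self, channels, num_object_classes):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.cls_branch = nn.Conv2d(channels, num_object_classes, 1)  # category of the shooting object
        self.box_branch = nn.Conv2d(channels, 4, 1)                   # position (box) in the first image

    def forward(self, object_feat):
        shared = self.stem(object_feat)       # the shared structure feeds both branches
        return self.cls_branch(shared), self.box_branch(shared)
```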
In a possible implementation manner of the first aspect, the determining a target exposure parameter according to the scene classification result and/or the target detection result includes:
And when the first judgment result indicates that the shooting scene does not belong to the preset scene, and the category of the shooting object belongs to a preset category, determining the target exposure parameter according to the category of the shooting object and the position information of the shooting object in the first image.
In the above method, when the electronic device determines that the category of the shooting object belongs to the preset category, this indicates that the exposure parameters need to be adjusted. By determining the target exposure parameter based on the shooting object, the electronic device can adjust brightness in a targeted way, so that the second image acquired with the target exposure parameter has moderate brightness where the shooting object appears, which improves the exposure quality at that detail of the image.
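Putting the two branches together, the decision logic for the target exposure parameter can be illustrated in plain Python. The preset sets and the exposure multipliers below are hypothetical placeholders, not values from the application.

```python
# assumed preset scenes / object classes and illustrative exposure multipliers
SCENE_EXPOSURE = {"snow": 1.4, "stage": 0.7, "grassland": 1.1, "night": 2.0}
OBJECT_EXPOSURE = {"face": 1.2, "animal": 1.1, "plant": 1.0, "ball": 1.0}

def decide_target_exposure(scene_result, detect_result, initial_exposure):
    """Sketch of choosing the target exposure parameter from the two model outputs."""
    is_preset_scene, scene_class = scene_result     # first judgment result, scene category
    object_class, object_box = detect_result        # object category, position in the first image
    if is_preset_scene:
        # the shooting scene drives the overall brightness adjustment
        return initial_exposure * SCENE_EXPOSURE.get(scene_class, 1.0)
    if object_class in OBJECT_EXPOSURE:
        # the shooting object drives a targeted adjustment around object_box
        return initial_exposure * OBJECT_EXPOSURE[object_class]
    return initial_exposure                         # no boundary case: keep the current parameters
```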
In a second aspect, the present application provides a method for training a model, the method comprising:
Acquiring a training set of a multi-task learning model, wherein the training set comprises one or more of a training image, a scene classification label corresponding to the training image and a target detection label, wherein the scene classification label is used for representing a shooting scene corresponding to the training image, and the target detection label is used for representing a shooting object contained in the training image;
And training the multi-task learning model by taking the training image as the input of the multi-task learning model, taking the scene classification label as the output of a first output layer in the multi-task learning model and taking the target detection label as the output of a second output layer in the multi-task learning model, so as to obtain a trained multi-task learning model, wherein the multi-task learning model is used for determining shooting scenes corresponding to the image and shooting objects contained in the image.
In the method, the scene classification label is taken as the output of the first output layer, so that the capability of classifying the scenes of the multi-task learning model can be trained in a targeted manner. And taking the target detection label as the output of the second output layer, the target detection capability of the multi-task learning model can be trained in a targeted manner, so that the capability of the multi-task learning model for determining shooting scenes and shooting objects is improved.
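As an assumption about how such a training set might be organized in code (the field names are illustrative, not from the application):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrainingSample:
    """One entry of the multi-task training set (illustrative structure)."""
    image_path: str                          # training image
    scene_label: int                         # scene classification label (shooting scene category)
    object_labels: List[int]                 # target detection labels: category per shooting object
    object_boxes: List[Tuple[float, float, float, float]]  # position of each object in the image
```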
In a possible implementation manner of the second aspect, the training the multi-task learning model to obtain a trained multi-task learning model includes:
Determining feature information corresponding to the training image through the multi-task learning model, wherein the feature information corresponding to the training image comprises the shooting scene and the feature information corresponding to the shooting object;
determining third characteristic information and fourth characteristic information according to the characteristic information corresponding to the training image, wherein the third characteristic information is characteristic information corresponding to the shooting scene in the characteristic information corresponding to the training image, and the fourth characteristic information is characteristic information corresponding to the shooting object in the characteristic information corresponding to the training image;
And training the multi-task learning model based on the third characteristic information and the fourth characteristic information to obtain the trained multi-task learning model.
In the above method, the feature information corresponding to the image input into the multi-task learning model may include feature information that can be shared by a plurality of tasks, or may include feature information specific to a certain task, and decoupling the feature information corresponding to the training image into the third feature information and the fourth feature information is performed so that, on the basis of determining the feature information specific to each task of the plurality of tasks, the feature information corresponding to any one task of the plurality of tasks may also include feature information shared with other tasks. Therefore, any task of the plurality of tasks can enhance the understanding of the image based on the shared characteristic information on the basis of processing the characteristic information specific to the task, and the training effect on the multi-task learning model is improved.
In a possible implementation manner of the second aspect, the training the multi-task learning model based on the third feature information and the fourth feature information to obtain the trained multi-task learning model includes:
determining a scene classification result corresponding to the training image according to the third characteristic information through the first output layer;
Determining a target detection result corresponding to the training image according to the fourth characteristic information through the second output layer;
And training the multi-task learning model by combining the scene classification result corresponding to the training image and the scene classification label after comparison, and the target detection result corresponding to the training image and the target detection label after comparison to obtain the trained multi-task learning model.
In the method, the scene classification result output by the first output layer is compared with the scene classification label, and the target detection result output by the second output layer is compared with the target detection label, so that the parameters of the multi-task learning model can be optimized in a targeted manner, and the training effect is improved.
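A minimal training-step sketch, assuming (as is common in multi-task learning, though not stated by the application) that the two comparisons are implemented as separate losses and summed with weighting factors; the scene output is treated as a single logit tensor for brevity.

```python
import torch.nn as nn

def train_step(model, optimizer, image, scene_label, det_cls_label, det_box_label,
               scene_weight=1.0, det_weight=1.0):
    """One sketch training step: compare both output layers with their labels, update jointly."""
    scene_loss_fn = nn.CrossEntropyLoss()   # scene classification result vs. scene classification label
    cls_loss_fn = nn.CrossEntropyLoss()     # detected category vs. target detection label (class part)
    box_loss_fn = nn.SmoothL1Loss()         # detected position vs. target detection label (box part)

    scene_logits, (cls_logits, box_pred) = model(image)
    loss = (scene_weight * scene_loss_fn(scene_logits, scene_label)
            + det_weight * (cls_loss_fn(cls_logits, det_cls_label)
                            + box_loss_fn(box_pred, det_box_label)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```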
In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes one or more processors and one or more memories, where the one or more memories are coupled to the one or more processors, and the one or more memories are configured to store computer program code, where the computer program code includes computer instructions, and the one or more processors invoke the computer instructions to cause the electronic device to perform the image processing method and the training method of the model described in the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, the present application provides a chip or chip system comprising at least one processor and a communication interface, the communication interface and the at least one processor being interconnected by wires, the at least one processor being adapted to execute a computer program or instructions to perform the image processing method, the training method of the model described in the first aspect or any one of the possible implementations of the first aspect. The communication interface in the chip can be an input/output interface, a pin, a circuit or the like.
In one possible implementation, the chip or chip system described above in the embodiments of the present application further includes at least one memory, where the at least one memory stores instructions. The memory may be a storage unit within the chip, such as a register or a cache, or may be a storage unit outside the chip (e.g., a read-only memory, a random access memory, etc.).
In a fifth aspect, embodiments of the present application provide a computer storage medium storing a computer program which, when executed by a processor, causes the computer to perform an image processing method, a training method of a model as described in the first aspect or any one of the possible implementations of the first aspect.
In a sixth aspect, embodiments of the present application provide a computer program product which, when run on a communication device, causes the communication device to perform an image processing method, a training method of a model as described in the first aspect or any one of the possible implementations of the first aspect.
It should be appreciated that the description of technical features, aspects, benefits or similar language in the present application does not imply that all of the features and advantages may be realized with any single embodiment. Conversely, it should be understood that the description of features or advantages is intended to include, in at least one embodiment, the particular features, aspects, or advantages. Therefore, the description of technical features, technical solutions or advantageous effects in this specification does not necessarily refer to the same embodiment. Furthermore, the technical features, technical solutions and advantageous effects described in the present embodiment may also be combined in any appropriate manner. Those of skill in the art will appreciate that an embodiment may be implemented without one or more particular features, aspects, or benefits of a particular embodiment. In other embodiments, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments.
Drawings
The drawings used in the embodiments of the present application are described below.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 2 is a schematic software architecture of an electronic device according to an embodiment of the present application;
FIG. 3A is a system desktop of an electronic device according to an embodiment of the present application;
3B-3D are preview interfaces of a set of camera applications provided by embodiments of the present application;
FIG. 4A is a system desktop of an electronic device according to an embodiment of the present application;
FIGS. 4B and 4C are preview interfaces of another set of camera applications provided by embodiments of the present application;
Fig. 5 is a schematic flow chart of an image processing method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a multi-task learning model according to an embodiment of the present application;
FIG. 7 is a flow chart of feature decoupling provided by an embodiment of the present application;
FIG. 8 is a flowchart for determining a target exposure parameter according to an embodiment of the present application.
Detailed Description
The terminology used in the following embodiments of the application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this disclosure refers to and encompasses any or all possible combinations of one or more of the listed items.
The terms "first," "second," and the like are used below for descriptive purposes only and are not to be construed as indicating relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the application, unless otherwise indicated, "a plurality" means two or more.
In order to facilitate the clear description of the technical solutions of the embodiments of the present application, the following description will simply introduce some terms related to the embodiments of the present application.
1. Multi-task learning (Multi-Task Learning, MTL): multi-task learning is a machine learning method that improves the generalization ability and learning efficiency of a model by training multiple related tasks simultaneously. In multi-task learning, the model shares part of its parameters and feature representations so that knowledge can be transferred between different tasks. For example, in the embodiment of the application, the electronic device trains, through multi-task learning, a multi-task learning model capable of processing scene classification tasks and target detection tasks in parallel, so that the automatic exposure function can be realized with one multi-task learning model.
2. Scene classification: scene classification is an important task in the fields of computer vision and machine learning, whose purpose is to identify and classify the different scenes or environments presented in an image. This process involves analyzing features of the image using algorithms to determine the type of scene the image represents. For example, in the embodiment of the present application, the electronic device performs scene classification on the acquired image to obtain the shooting scene, so as to determine whether the shooting scene belongs to a preset scene; if so, the specific scene category can be determined based on the scene classification task.
3. Object recognition: object recognition is intended to detect and recognize specific objects in an image or video. It involves not only identifying the class of an object but often also determining the object's position in the image, and it further enables tracking and analysis of the object. For example, in the embodiment of the present application, the electronic device performs target recognition on the acquired image to obtain the shooting object, so as to determine whether the shooting object belongs to a preset category; if it does, the specific object category and the position information of the shooting object relative to the image can be determined based on the target recognition task.
4. Exposure convergence: in computer vision, exposure convergence describes how the exposure parameters of a camera are adjusted to obtain the best exposure effect in a dynamic environment. For example, in the embodiment of the present application, the electronic device determines a target brightness based on the shooting scene and/or the shooting object of an input image (e.g., the first image), determines the target exposure parameter according to the brightness of the input image and the target brightness, and then acquires the second image based on the target exposure parameter so that the brightness of the second image equals or approaches the target brightness, thereby achieving exposure convergence.
5. Intermediate gray: the brightness and color of an object are determined by the object's reflectance of light; for example, the reflectance of a pure black object is 0, that of a pure white object is 100%, and that of an intermediate gray object is 18%. Since intermediate gray is the average of all gray levels in the tonal spectrum, the average reflectance of objects in nature is generally considered to be 18% intermediate gray. The electronic device can therefore take 18% intermediate gray as the target brightness and adjust the brightness of the image toward it, so that the various shooting objects all present an average brightness and overexposure or underexposure of the image is reduced.
6. Exposure parameters: in photography and image capture, three major factors affect the brightness and quality of an image; they are called the three elements of exposure (or the exposure triangle): aperture, shutter speed, and sensitivity. By reasonably adjusting the aperture, shutter speed and sensitivity, the exposure level, sharpness and noise of the image can be effectively controlled, thereby improving imaging quality. In the embodiment of the application, the electronic device realizes automatic exposure by determining the exposure parameters.
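For background only (standard photography, not specific to this application), the combined effect of aperture and shutter speed is often summarized by the exposure value:

$$\mathrm{EV} = \log_2\frac{N^2}{t}$$

where $N$ is the aperture f-number and $t$ is the exposure time in seconds; increasing the sensitivity (ISO) brightens the image at a given EV.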
In order to facilitate understanding of the embodiments of the present application, the following first analyzes and proposes a technical problem to be solved by the present application.
With the popularity of electronic devices (e.g., smart terminal devices), capturing images with an electronic device has become part of users' daily lives. In order to improve image quality, the electronic device may determine exposure parameters (such as sensitivity, aperture size, exposure time, etc.) according to the ambient light through an automatic exposure (Automatic Exposure, AE) algorithm, so as to collect an image with moderate brightness based on those exposure parameters and reduce overexposure or underexposure of the image.
In general, the electronic device sets the exposure parameters with the aim of adjusting the brightness of the image to 18% intermediate gray. However, in some boundary cases it may be difficult for the electronic device to acquire an image of moderate brightness based on 18% intermediate gray, and the target brightness needs to be redetermined based on the current shooting scene and/or shooting object, so that the exposure parameters can be determined from the redetermined target brightness.
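As a simple illustration of this adjust-toward-target-brightness behaviour (a generic sketch, not the algorithm claimed by the application), one common approach scales the exposure by the ratio of the target luminance to the measured luminance until the two are close:

```python
def converge_exposure(measure_luma, exposure, target_luma=0.18, tol=0.02, max_steps=10):
    """Illustrative exposure-convergence loop toward a target brightness (e.g. 18% gray)."""
    for _ in range(max_steps):
        luma = measure_luma(exposure)              # mean luminance of a frame at this exposure
        if abs(luma - target_luma) <= tol:
            break                                  # converged: brightness is close to the target
        exposure *= target_luma / max(luma, 1e-6)  # brighten if too dark, darken if too bright
    return exposure
```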
For example, the boundary condition includes a case where a large-area dark area or a large-area bright area is included in the photographed scene. For example, in a shooting scene of a snowfield, a large area of bright area is included, at this time, the brightness of a picture collected by the electronic device is higher (i.e., higher than 18% of intermediate gray level), and the electronic device reduces the brightness of the picture by reducing the luminous flux, so that the brightness of the picture approaches to 18% of intermediate gray level (target brightness). As the electronics reduce the light flux, underexposure of the final acquired image may result, resulting in poor imaging. In order to solve the problem that the electronic device is difficult to acquire an image with moderate brightness based on 18% of intermediate gray scale in a preset scene (such as a snow scene), the target brightness can be redetermined based on the preset scene (for example, the target brightness of the intermediate gray scale higher than 18% is determined for the shooting scene of the snow scene), and then the exposure parameter suitable for the shooting scene is determined based on the target brightness, so as to reduce the situation of overexposure or underexposure of the image.
As another example, the boundary condition includes a case when the photographed object includes a preset target, which includes, but is not limited to, an object that may be of great interest to an electronic device such as a human face, an animal, a plant, a ball, etc. when photographing. Since it is explained that the electronic apparatus may need the above-mentioned preset target to exhibit a good imaging effect (e.g., an exposure effect with moderate brightness) when the photographic subject contains the above-mentioned preset target. However, the ratio of the area of the preset target in the total area of the picture may be smaller, and the brightness of the area of the large area except the preset target in the picture has a larger influence on the exposure parameters controlled by the electronic device, which may cause the exposure parameters determined by the electronic device to be unsuitable for the preset target, so that the exposure effect of the preset target in the acquired image is poor, and the photographing experience of the user is affected. In order to solve the problem that in the case that the shooting object includes a preset target, it is difficult to acquire an image with moderate brightness based on 18% of intermediate gray scale, the electronic device may generally redetermine the target brightness based on the shooting object (for example, the preset target), and then determine an exposure parameter suitable for the shooting object based on the target brightness, so as to reduce the situation of overexposure or underexposure of the image.
Therefore, before adjusting the exposure parameters, the electronic device may identify the shooting scene and the shooting object, if the current shooting scene and/or the shooting object belong to the boundary conditions, the electronic device may determine the target brightness based on the identified shooting scene and/or the shooting object, so as to determine the exposure parameters suitable for the shooting scene and/or the shooting object based on the target brightness, thereby reducing the situation of overexposure or underexposure of the image and obtaining the image with moderate brightness.
Based on the continuous development of modern computer vision and image processing fields, a plurality of algorithms related to scene classification and object detection are preset in electronic equipment. For example, in the field of scene classification, a snow scene classification algorithm for identifying a shooting scene, such as a snow scene, a stage classification algorithm for identifying a shooting scene, such as a stage, and the like are included. Also for example, in the field of object detection, an intelligent face mask algorithm for detecting a face, a flower detection algorithm for identifying flowers, and the like are included. It can be seen that the above algorithms are all directed to recognition or detection of a single task, for example, an intelligent face mask algorithm is used to detect faces, and a snow scene classification algorithm is used to determine whether a shooting scene is a snow scene. This is because the above algorithm may be generated during the development process in the image processing field, for example, when a face in an image needs to be detected, relevant manufacturers perform feature mining based on the face, then train a face detection model, and finally deploy the trained face detection model online, so that the electronic device can call the face detection model to realize the face detection function. Therefore, with the development of the image processing field, a plurality of algorithms for a single task have been designed.
Generally, in an automatic exposure scheme of an electronic device, the electronic device can call one or more algorithms of the multiple algorithms as required to perform a scene classification task and/or an object detection task, which has a certain flexibility. However, when the number of algorithms called by the electronic device is large, power consumption generated by running a plurality of algorithms simultaneously by the electronic device is large, and memory occupied by running a plurality of algorithms simultaneously is large, which may cause the use performance of the electronic device to be reduced, thereby affecting the processing time of the electronic device when the electronic device performs automatic exposure. In addition, if the electronic device invokes a plurality of algorithms, the algorithms may output corresponding results at different times, which affects the duration of determining the exposure parameter by the electronic device based on the results, and determining the exposure parameter by the electronic device based on the results output by the algorithms may increase the complexity of the electronic device in determining the exposure parameter. In addition, since the algorithms are independently developed and isolated from each other, when there is an erroneous result in the results output by the algorithms, it is difficult for the electronic device to determine the algorithm that outputs the erroneous result, so that it is difficult to update the algorithm that outputs the erroneous result with a subsequent algorithm, and the effect of automatic exposure of the electronic device is affected. Meanwhile, the electronic equipment deploys more algorithms, so that the cost of the electronic equipment and the maintenance burden on the algorithms are increased.
In summary, the electronic device invokes a plurality of algorithms to perform automatic exposure has the problems of high power consumption, more occupied memory, high complexity of determining exposure parameters, high optimization difficulty, high cost and the like. How to effectively integrate a plurality of independent algorithms to improve the efficiency in determining exposure parameters and optimize the exposure effect on images is a problem to be solved in the research of automatic exposure algorithms.
In view of this, an embodiment of the present application provides an image processing method. After the electronic equipment acquires the first image based on the initial exposure parameters, the first image is input into a multi-task learning model, and a scene classification result and a target detection result are output through the multi-task learning model. Then, the electronic device determines a target exposure parameter according to the shooting scene indicated by the scene classification result and/or the shooting object indicated by the target detection result, and the electronic device acquires a second image based on the target exposure parameter and displays the second image. In the embodiment of the application, the electronic equipment can determine the scene classification result and the target detection result through one model (namely a multi-task learning model), and compared with the prior art that the scene classification result and the target detection result can be determined through a plurality of models respectively, the method reduces the number of the models, reduces the operation pressure of the electronic equipment, and further improves the performance of the electronic equipment.
The image processing method according to the embodiment of the present application is described below with reference to a hardware structure and a software structure of a mobile phone.
In some embodiments, the electronic device includes, but is not limited to, a cell phone, a tablet (Portable Android Device, PAD), a personal digital assistant (Personal Digital Assistant, PDA), a camera-enabled handheld device, a computing device, an in-vehicle device, or a wearable device. The form of the electronic device in the embodiment of the application is not particularly limited.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application, as shown in fig. 1, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a key 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may generate operation control signals according to the instruction operation code and the timing signals to complete instruction fetching and instruction execution control.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it may be called directly from memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 inputs the first image acquired through the camera 193 into a multi-task learning model, and extracts, through the multi-task learning model, feature information corresponding to the first image, where the feature information corresponding to the first image includes feature information corresponding to the shooting scene and to the shooting object. The multi-task learning model then decouples the feature information corresponding to the first image into the feature information corresponding to the shooting scene and the feature information corresponding to the shooting object. Based on the feature information corresponding to the shooting scene, the multi-task learning model obtains a scene classification result, which includes a judgment of whether the shooting scene corresponding to the first image belongs to a preset scene and, when it does, the specific scene category. Based on the feature information corresponding to the shooting object, the multi-task learning model obtains a target detection result, which includes a judgment of whether the first image contains a shooting object of a preset category and, when it does, the specific object category. The processor 110 determines a target exposure parameter according to the scene classification result and the target detection result, so that the camera 193 can acquire an image based on the target exposure parameter.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, and a subscriber identity module (subscriber identity module, SIM) interface.
The UART interface is a universal serial data bus for asynchronous communications. The bus may be a bi-directional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 110 with the wireless communication module 160.
The MIPI interface may be used to connect the processor 110 to peripheral devices such as the display 194 and the camera 193. The MIPI interfaces include a camera serial interface (camera serial interface, CSI), a display serial interface (display serial interface, DSI), and the like.
In some embodiments, processor 110 and display 194 communicate via a DSI interface to implement the display functionality of electronic device 100.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example, the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The wireless communication module 160 may provide solutions for wireless communication applied to the electronic device 100, including wireless local area networks (wireless local area networks, WLAN) (e.g., a wireless fidelity (wireless fidelity, Wi-Fi) network), Bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field communication (near field communication, NFC), infrared (infrared, IR), and the like.
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The internal memory 121 may be used to store computer-executable program code that includes instructions. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like. The processor 110 performs various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.
In one embodiment of the present application, the internal memory 121 may store a multi-task learning model, where the multi-task learning model is used for determining whether a photographing scene of an image input into the multi-task learning model belongs to a preset scene, and for determining whether the image input into the multi-task learning model contains a photographing object whose category is a preset category.
The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are various types of pressure sensors 180A, such as a resistive pressure sensor, an inductive pressure sensor, and a capacitive pressure sensor. A capacitive pressure sensor may comprise at least two parallel plates of conductive material. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes, and the electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation acts on the display screen 194, the electronic device 100 detects the intensity of the touch operation according to the pressure sensor 180A. The electronic device 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location but with different touch operation intensities may correspond to different operation instructions.
The touch sensor 180K is also referred to as a "touch device". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is used to detect a touch operation acting on or near it. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a location different from that of the display screen 194.
The software system of the electronic device (such as a mobile phone) may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, an Android system with a layered architecture is taken as an example to illustrate the software architecture of the mobile phone. Referring to fig. 2, fig. 2 is a schematic diagram of a software architecture of an electronic device according to an embodiment of the application.
As shown in fig. 2, the layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into five layers, which are, from top to bottom, an application layer (application), an application framework layer (framework), a hardware abstraction layer (hardware abstraction layer, HAL), a driver layer, and a hardware layer. Wherein:
The application layer (application) may comprise a series of application packages. For example, the applications may include a camera, a gallery, and the like. The camera application may include, but is not limited to, a UI module, a photographing module, a gallery module, and the like. The UI module may be a CameraUI module and may be mainly responsible for the human-computer interaction of the camera application, for example, controlling the display of the preview interface and the preview picture therein, and receiving and responding to user operations occurring in the preview interface. The photographing module is used to provide a photographing function, a focusing function, and the like. The gallery module may be used to store photos taken by a user in a file system or a specific database of the electronic device for retrieval by applications such as the gallery.
The application framework layer (framework) provides an application programming interface (application programming interface, API) and a programming framework for application programs of the application layer. It mainly relates to the camera framework and may comprise camera access interfaces such as a camera extension library and a camera service. The application framework layer serves as a bridge between the layers above and below it: it can interact with the camera application through the application API, and it can also interact with the HAL through the HAL interface definition language (HAL interface definition language, HIDL). The application framework layer may also include a window manager, and the camera application and the gallery application may present the taken photos to the user with the support of the window manager.
A Hardware Abstraction Layer (HAL) is an interface layer located between the application framework layer and the driver layer, providing a virtual hardware platform for the operating system. By way of example, the hardware abstraction layer may include a camera hardware abstraction layer and an auto-exposure module. The camera hardware abstraction layer may provide, among other things, virtual hardware of the camera device 1 (first camera), the camera device 2 (second camera), and more camera devices.
The automatic exposure module stores a plurality of image processing algorithms. For example, in embodiments of the present application, the auto-exposure module may include an auto-exposure algorithm, or the like. The automatic exposure module can determine target exposure parameters by combining an automatic exposure algorithm and shooting parameters reported by the camera module and/or other sensors.
In one implementation, the automatic exposure module may perform scene classification and object detection on the first image according to an automatic exposure algorithm, so as to determine whether a shooting scene of the first image is a preset scene, and determine whether a shooting object with a category being a preset category is included in the first image. The first image is a frame of image acquired by the electronic device through the camera device.
Specifically, the automatic exposure module may extract, from the first image, feature information corresponding to the first image according to a multi-task learning model in the automatic exposure algorithm, where the feature information corresponding to the first image includes feature information corresponding to a shooting scene and a shooting object. Then, the feature information corresponding to the first image is decoupled, through the multi-task learning model, into feature information corresponding to the shooting scene and feature information corresponding to the shooting object. A scene classification result is obtained through the multi-task learning model based on the feature information corresponding to the shooting scene, where the scene classification result includes a judgment result about whether the shooting scene corresponding to the first image belongs to a preset scene and, when it does, the specific scene category. A target detection result is obtained through the multi-task learning model based on the feature information corresponding to the shooting object, where the target detection result includes a judgment result about whether the first image contains a shooting object of a preset category and, when it does, the specific object category.
In one implementation, after the automatic exposure module obtains the scene classification result and the target detection result, the automatic exposure module may further determine a target exposure parameter according to an automatic exposure algorithm based on the scene classification result and the target detection result, so as to drive the camera module to collect an image based on the target exposure parameter.
The driver layer is a layer between hardware and software and includes drivers for various hardware. The driver layer may include a camera device driver, a digital signal processor driver, an image processor driver, and the like. The camera device driver is used for driving an image sensor of one or more cameras in the camera module to acquire images and for driving an image signal processor to preprocess the images. The digital signal processor driver is used for driving the digital signal processor to process the image. The image processor driver is used for driving the image processor to process the image.
The hardware layer may include a camera module, an image signal processor, a digital signal processor, an image processor, and a memory.
The camera module may include one or more camera image sensors (e.g., image sensor 1, image sensor 2, etc.) therein.
The memory includes a plurality of storage units, and the electronic device can recognize the shooting scene and the shooting object in the first image through a model stored in the memory. For example, in a conventional scheme, the electronic device recognizes the shooting scene and the shooting object in the first image through a plurality of models that occupy a plurality of storage units. In the embodiment of the application, the electronic device can recognize both the shooting scene and the shooting object in the first image by running the multi-task learning model in one storage unit, so that the storage pressure of the electronic device is relieved.
The workflow of the electronic device software and hardware is illustrated below in connection with a scenario in which a camera application is launched.
In the embodiment of the present application, after the touch sensor 180K receives a touch operation, a corresponding hardware interrupt is sent to the kernel layer. The kernel layer processes the user operation into an original input event (including information such as the touch coordinates and the timestamp of the touch operation) and identifies the control corresponding to the input event. Taking the touch operation being a user operation acting on the camera application as an example, the camera application invokes the camera access interface of the application framework layer to start the camera application, and then sends an instruction for starting the camera capability by invoking a camera device in the camera hardware abstraction layer, such as the camera device 1. The camera hardware abstraction layer sends the instruction to the camera device driver of the driver layer, the camera device driver starts the sensor corresponding to the camera device, such as the sensor 1, and an image light signal is then acquired through the sensor 1 to obtain the first image.
In one implementation, the image signal processor passes the first image back through the camera device driver to the hardware abstraction layer, which returns the first image data through the camera interface to the camera application. The camera application may then present the first image to the user with the support of the window manager.
Meanwhile, the hardware abstraction layer can also send the first image to the automatic exposure module. Based on the support of the image signal processor and the digital signal processor, the automatic exposure module in the hardware abstraction layer can combine the automatic exposure algorithm and shooting parameters reported by the camera module and/or other sensors to determine whether the shooting scene of the first image is a preset scene or not, and determine whether the first image contains shooting objects with the category of the preset category or not. If the shooting scene of the first image is a preset scene, the automatic exposure module determines a target exposure parameter based on the shooting scene of the first image. If the shooting scene of the first image is not a preset scene and the first image contains a shooting object with a preset category, the automatic exposure module determines a target exposure parameter based on the shooting object of the first image. If the shooting scene of the first image is not a preset scene and the first image does not contain shooting objects with the preset categories, the automatic exposure module does not output target exposure parameters.
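The three branches described above can be summarized, purely as an illustration, in the following sketch; the object `auto_exposure` and its helper functions `exposure_from_scene` and `exposure_from_object` are hypothetical stand-ins for the automatic exposure algorithm.

```python
def determine_target_exposure(scene_result, detection_result, first_image, auto_exposure):
    # Hypothetical sketch of the branch logic above; auto_exposure and its helpers
    # are assumed names standing in for the automatic exposure algorithm.
    if scene_result.is_preset_scene:
        # Shooting scene is a preset scene: determine the target exposure
        # parameter based on the shooting scene of the first image.
        return auto_exposure.exposure_from_scene(scene_result.scene_category, first_image)
    if detection_result.has_preset_object:
        # Not a preset scene, but a preset-category shooting object is present:
        # determine the target exposure parameter based on the shooting object.
        return auto_exposure.exposure_from_object(
            detection_result.object_category, detection_result.object_box, first_image)
    # Neither condition holds: no target exposure parameter is output.
    return None
```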
The camera device driver may again activate a corresponding sensor of the camera device, e.g. sensor 1, and then acquire an image light signal based on the target exposure parameter via sensor 1, resulting in a second image.
The image signal processor transmits the second image back to the hardware abstraction layer through the camera device driver, and the hardware abstraction layer returns the second image data to the camera application through the camera interface. The camera application may then present the second image to the user with the support of the window manager, where the exposure effect of the second image is better than the exposure effect of the first image.
In one possible implementation, the hardware abstraction layer may also send the second image to an auto-exposure module. Based on the support of the image signal processor and the digital signal processor, the automatic exposure module in the hardware abstraction layer can determine the target exposure parameter corresponding to the second image by combining the automatic exposure algorithm so as to acquire the image based on the target exposure parameter corresponding to the second image until the image with moderate brightness (the brightness is equal to or close to the target brightness) is acquired.
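The acquire-evaluate-adjust cycle described above amounts to a simple loop, sketched below for illustration; `camera`, `mean_brightness`, and `compute_exposure` are assumed interfaces rather than actual components of the electronic device.

```python
def capture_until_moderate(camera, auto_exposure, target_brightness,
                           tolerance=0.02, max_frames=8):
    # Illustrative sketch; the camera and auto_exposure interfaces are assumptions.
    frame = camera.capture()
    for _ in range(max_frames):
        brightness = auto_exposure.mean_brightness(frame)
        if abs(brightness - target_brightness) <= tolerance:
            break  # brightness equal to or close to the target brightness
        params = auto_exposure.compute_exposure(brightness, target_brightness)
        frame = camera.capture(params)
    return frame
```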
The image processing method provided by the embodiment of the application is described in detail below with reference to the accompanying drawings.
The image processing method provided by the embodiment of the application can be applied to the electronic equipment with the hardware structure shown in fig. 1 and the software structure shown in fig. 2, or to hardware and software structures that include more or fewer components than illustrated, combine some components, split some components, or arrange the components differently.
In the embodiment of the application, when the electronic equipment collects an image, the shooting scene corresponds to the overall information of the image and can influence the overall brightness of the image, while the shooting object contained in the image corresponds to local information of the image and can affect the brightness of a local part of the image. Therefore, the embodiment of the application adjusts the brightness of the whole image based on the shooting scene and/or adjusts the brightness of the local image based on the shooting object, so that the acquired image has a higher-quality exposure effect, both overall and locally.
The process of automatically exposing the electronic device to the preset scene will be described with reference to fig. 3A to 3D.
Fig. 3A is a system desktop of an electronic device according to an embodiment of the present application, and fig. 3B to fig. 3D are preview interfaces of a set of camera applications according to an embodiment of the present application.
Fig. 3A is a system desktop 301 of the electronic device 100 according to an embodiment of the present application. As shown in fig. 3A, the system desktop 301 may include a status bar 302, a page indicator 303, and a plurality of application icons.
The status bar 302 may include one or more signal strength indicators of mobile communication signals (which may also be referred to as cellular signals), such as fifth generation mobile communication technology (5th generation mobile communication technology, 5G) signals, a wireless fidelity (wireless fidelity, Wi-Fi) signal strength indicator, a battery status indicator, a time indicator (e.g., 8:00), and the like.
The page indicator 303 may be used to indicate the positional relationship of the currently displayed page with other pages.
The plurality of application icons may include a time application icon (e.g., 08:00), a date application icon (e.g., January 1, Friday), a weather application icon (e.g., 5 ℃), an application marketplace application icon, a memo application icon, a mall application icon, a browser application icon, a phone application icon, an information application icon, a camera application icon, a settings application icon, and the like. Not limited to the above icons, the system desktop 301 may also include other application icons, which are not listed here. Multiple application icons may be distributed across multiple pages. The page indicator 303 may be used to indicate which of the multiple pages carrying the applications is currently being viewed by the user. The user can browse other pages through a left or right sliding touch operation.
It will be appreciated that the user interface of fig. 3A and the following description illustrate one possible user interface style of an electronic device, such as a cell phone, and should not be construed as limiting the embodiments of the present application.
As shown in fig. 3A, the electronic device 100 may receive a user operation for starting the camera application, such as an operation of clicking a desktop icon of the camera application, and in response to the operation, the electronic device may display a preview interface of the camera application as shown in fig. 3B.
Fig. 3B illustrates a user interface, also referred to as preview interface 304, for a capture and display service provided by an embodiment of the present application. In the preview interface 304 shown in fig. 3B, the electronic device 100 displays an image 305, a function selection area 306, a mode selection area 307, a zoom function 308, a gallery 309, a photographing button 310, and a switching button 311. The function selection area 306 comprises at least one function setting button. Exemplary function setting buttons include, but are not limited to, a visual button, a flash button, an artificial intelligence (artificial intelligence, AI) camera button, a color button, and a settings button. Optionally, if the electronic device 100 detects that the user touches any one of the function setting buttons in the function selection area 306, the corresponding function is set according to the touched function setting button. For example, if it is detected that the user touches the flash button, the flash is set. Illustratively, the mode selection area 307 includes at least one shooting mode button. The shooting mode buttons include, but are not limited to, an aperture mode button, a night scene mode button, a portrait mode button, a photo mode button, a video mode button, a professional mode button, a more mode button, and the like. Optionally, if the electronic device 100 detects that the user touches any one of the shooting mode buttons in the mode selection area 307, it switches to the corresponding shooting mode according to the touched shooting mode button. Illustratively, the gallery 309 provides an entry for the user to review the photographed pictures; the user clicks the gallery 309 to enter an interface for reviewing the photographed pictures. Illustratively, the capture button 310 is used to capture a photo, and the switching button 311 is used to switch between the front and rear cameras during capture.
As shown in fig. 3B, after the electronic apparatus 100 switches from the system desktop 301 shown in fig. 3A to the preview interface 304 shown in fig. 3B in response to a user operation, the electronic apparatus 100 may receive a user operation for switching modes, for example, a left or right sliding operation in the mode selection area 307, and change the currently used shooting mode according to the operation. By default, the electronic device first uses the "photo" mode. In one implementation, if the currently used shooting mode of the electronic device 100 is not the "photo" mode, the electronic device 100 may switch to the "photo" mode when it receives a left or right sliding operation that drags the mode selection area 307 and stops the floating indicator at the "photo" option.
As shown in fig. 3C, in the "photo" mode, a frame of the image captured by the camera, e.g., image 305, is displayed in the preview interface 304. Optionally, the preview interface 304 may display, in real time, images processed by the image processing algorithms corresponding to different modes, so that the user can perceive the photographing effects corresponding to different shooting modes in real time.
As can be seen from the image 305, the shooting scene corresponding to the image 305 is a snow scene and includes a large bright area. Illustratively, the brightness of the picture collected by the electronic device 100 is relatively high (i.e., higher than the 18% middle gray), and the electronic device reduces the brightness of the picture by reducing the luminous flux, so that the brightness of the picture approaches the 18% middle gray (the initial target brightness). As a result, the finally acquired image 305 may be underexposed and appear dark.
In one possible implementation, after capturing the image 305, the electronic device 100 may input the image 305 into the multi-task learning model. The scene classification result output by the multi-task learning model includes a result that the image 305 belongs to a preset scene and a result that the scene category corresponding to the image 305 is a snow scene, and the output target detection result includes a result that the image 305 contains a shooting object of a preset category (for example, a plant in the image 305) and the position information of the shooting object in the image 305. Then, the electronic device 100 obtains, based on the scene classification result, that the shooting scene of the image 305 belongs to a preset scene, and the electronic device 100 may determine the target exposure parameter based on the scene category corresponding to the image 305, without needing to determine the target exposure parameter based on the target detection result.
Illustratively, the electronic device 100 obtains a target luminance based on the snow shooting scene (e.g., determines a target luminance greater than the 18% middle gray), and then obtains the target exposure parameter based on the luminance of the image 305 and the target luminance. Optionally, the electronic device 100 captures an image 313 based on the target exposure parameter, which is displayed in the preview interface 312 as shown in fig. 3D.
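One common way to turn such a brightness gap into an exposure adjustment is to express it in EV stops; the sketch below assumes that frame brightness scales roughly linearly with exposure and is given only as an illustration, not as the formula used by the embodiment.

```python
import math

def exposure_compensation_stops(current_brightness, target_brightness):
    # Assumed relation (not the patent's formula): brightness scales roughly
    # linearly with exposure, so the required change in EV stops is a log ratio.
    return math.log2(target_brightness / current_brightness)

# Snow scene example: raising the target brightness above 18% middle gray gives a
# positive compensation, so the next frame (image 313) is exposed more than image 305.
print(round(exposure_compensation_stops(0.18, 0.30), 2))  # ≈ +0.74 EV
```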
As shown in fig. 3D, the image 313 displayed in the preview interface 312 is acquired by the electronic device 100 based on the target exposure parameter, and it can be seen that the brightness of the image 313 is higher than the brightness of the image 305, and the exposure degree of the image 313 is greater than the exposure degree of the image 305, so that the problem of underexposure in the preset scene of the snow scene can be solved.
A process of automatically exposing the electronic apparatus to the photographic subject will be described with reference to fig. 4A to 4C.
Fig. 4A is a system desktop of an electronic device, and fig. 4B and fig. 4C are preview interfaces of another set of camera applications provided by an embodiment of the present application.
Fig. 4A is a system desktop 301 of an electronic device according to an embodiment of the present application. As shown in fig. 4A, the system desktop 301 may include a status bar 302, a page indicator 303, and a plurality of application icons. The status bar 302, the page indicator 303, and the plurality of application icons may be referred to in the related description of fig. 3A, and will not be described herein.
It will be appreciated that the user interface of fig. 4A and the following description illustrate one possible user interface style of an electronic device, such as a cell phone, and should not be construed as limiting the embodiments of the application.
As shown in fig. 4A, the electronic device 100 may receive a user operation to open the camera application, such as clicking on a desktop icon of the camera application, in response to which the electronic device may display a preview interface as shown in fig. 4B.
Fig. 4B illustrates a user interface, also referred to as preview interface 401, for a capture and display service provided by an embodiment of the present application. In the preview interface 401 shown in fig. 4B, the electronic device 100 displays an image 402, a function selection area 306, a mode selection area 307, a zoom function 308, a gallery 309, a photographing button 310, and a switching button 311. The function selection area 306, the mode selection area 307, the zoom function 308, the gallery 309, the capture button 310, and the switch button 311 may be described with reference to fig. 3B, and will not be described herein.
As can be seen from the image 402, the shooting objects of the image 402 include a white bird 4021 and leaves, the white bird 4021 is a bright area, the leaves are a dark area, and the area of the region where the white bird 4021 is located is smaller than the area of the region where the leaves are located. Optionally, due to the influence of the leaf area, the brightness of the picture collected by the electronic device 100 may be relatively low (i.e., lower than the 18% middle gray), and the electronic device may increase the brightness of the picture by increasing the luminous flux, so that the brightness of the picture approaches the 18% middle gray (the initial target brightness). As a result, the white bird 4021 in the finally acquired image 402 may be overexposed, that is, the brightness of the white bird 4021 is too high, resulting in poor imaging quality of the image 402.
In one possible implementation, after the electronic device 100 acquires the image 402, the image 402 may be input into a multi-task learning model, where the scene classification result output by the multi-task learning model includes that the image 402 does not belong to a preset scene, and the output target detection result includes that the image 402 includes a shooting object (e.g., white bird 4021) in a preset category, and location information of the shooting object (e.g., white bird 4021) in the image 402. Then, the electronic device 100 obtains, based on the scene classification result, that the shooting scene of the image 402 does not belong to the preset scene, the electronic device 100 may continue to analyze the target detection result, determine, based on the target detection result, that the image 402 includes the shooting object of the preset category, and the electronic device 100 may determine the target exposure parameter based on the shooting object.
Illustratively, the electronic device 100 obtains a target luminance based on the animal shooting object (e.g., the white bird 4021), for example, determines a target luminance lower than the 18% middle gray, and then obtains the target exposure parameter based on the luminance of the image 402 and the target luminance. Optionally, the electronic device 100 captures an image 404 based on the target exposure parameter and displays it in the preview interface 403 as shown in fig. 4C.
As shown in fig. 4C, the image 404 displayed in the preview interface 403 is acquired by the electronic device 100 based on the target exposure parameter, it can be seen that the brightness of the white bird 4041 in the image 404 is lower than the brightness of the white bird 4021 in the image 402, and the exposure degree of the image 404 is lower than the exposure degree of the image 402, so that the problem of overexposure of the area where the white bird 4021 in the image 402 is located can be solved.
Fig. 3A to 3D and fig. 4A to 4C describe interface diagrams of the electronic device when performing automatic exposure, and a specific description will be given below of a process of performing automatic exposure based on a shooting scene and/or a shooting object after determining the shooting scene and the shooting object by using the electronic device in conjunction with fig. 5.
Referring to fig. 5, fig. 5 is a flowchart of an image processing method according to an embodiment of the application. As shown in fig. 5, the flowchart includes S501 to S504, and a specific implementation of automatic exposure of the electronic device through the multi-task learning model is described below with reference to the exemplary flowchart shown in fig. 5.
In step S501, a first image is acquired.
Specifically, after the electronic device starts the camera application program, a frame of image can be acquired at a certain interval, and then the frame of image is displayed in a preview interface of the camera application, so that the change condition of a shooting scene and/or a shooting object is provided for a user in real time, and the user can adjust composition or save the currently displayed image based on the image displayed in the preview interface. For example, the continuous multi-frame images acquired by the electronic device at intervals may be called a preview stream, and the preview stream image displayed in the preview interface by the electronic device may be an image acquired by the electronic device or an image after being processed (for example, exposure processing, denoising processing, color correction processing, etc.), and optionally, the first image and the second image in the embodiment of the present application are one frame of image in the preview stream. For example, the preview interface may be referred to as preview interface 304 shown in fig. 3B and 3C, preview interface 312 shown in fig. 3D, preview interface 401 shown in fig. 4B, and preview interface 403 shown in fig. 4C. The first image may be referred to as image 305 shown in fig. 3B and 3C and image 402 shown in fig. 4B.
The first image may be an image acquired by the electronic device based on an initial exposure parameter, for example, the initial exposure parameter may be an exposure parameter that is used by the electronic device by default after the electronic device starts a camera application program, alternatively may be an exposure parameter manually selected by a user and received by the electronic device, alternatively may also be an exposure parameter determined by the electronic device based on the brightness of the previous frame of image, which is not limited in the embodiment of the present application.
Step S502, inputting the first image into a multi-task learning model, and outputting a scene classification result and a target detection result through the multi-task learning model.
Specifically, the multi-task learning model is used for processing multiple tasks in parallel, for example, in the embodiment of the application, the multi-task learning model is used for processing the scene classification task and the target detection task of the first image in parallel so as to obtain the scene classification result and the target detection result corresponding to the first image.
For example, the scene classification result is used to characterize a shooting scene in which the electronic device collects the first image, for example, the scene classification result may include a result obtained by determining a class of the shooting scene, for example, a result obtained by determining whether the class of the shooting scene belongs to a preset scene through a multi-task learning model. Further, after the electronic device determines that the class of the shooting scene belongs to the preset scene, the specific class of the shooting scene can be obtained through the multi-task learning model. The embodiment of the application does not limit shooting scenes when the electronic equipment shoots, and does not limit preset scenes. The preset scenes include, but are not limited to, scenes unsuitable for exposure processing according to 18% intermediate gradation, such as a shooting scene of snow scenes shown in fig. 3B, and shooting scenes of stages, night scenes, grasslands, and the like, without limitation.
For example, the target detection result is used to represent the photographed object included in the first image, for example, the target detection result may include a result obtained by identifying a category of the photographed object, for example, the electronic device identifies whether the photographed object in the first image has a preset category through the multi-task learning model. Further, after the electronic device identifies that the first image includes the shooting object of the preset category, the electronic device may further obtain location information of the shooting object of the preset category in the first image through the multi-task learning model. According to the embodiment of the application, the shooting object is not limited when the electronic equipment shoots, and the preset category of the shooting object is not limited. The preset categories of the photographed objects include, but are not limited to, objects unsuitable for exposure processing according to an intermediate gradation of 18%, such as the photographed object of the category of animals shown in fig. 4B, and the photographed object of the category of faces, plants, balls, and the like, without limitation.
In one possible implementation manner, the electronic device inputs the first image into the multi-task learning model, determines feature information corresponding to the first image through the multi-task learning model, wherein the feature information corresponding to the first image includes feature information corresponding to a shooting scene and a shooting object, and then determines feature information corresponding to the shooting scene and feature information corresponding to the shooting object according to the feature information corresponding to the first image. The electronic equipment determines scene classification results based on the feature information corresponding to the shooting scene through a first output layer in the multi-task learning model, and determines target detection results based on the feature information corresponding to the shooting object through a second output layer in the multi-task learning model.
Specifically, the first image may include multiple types of feature information, for example, feature information related to the scene classification task, feature information related to the target detection task, and other feature information unrelated to the scene classification task and the target detection task in the embodiment of the present application. Optionally, the other feature information irrelevant to the scene classification task and the target detection task is redundant information, and when extracting features of the first image through the multi-task learning model, the electronic device can selectively extract the feature information relevant to the scene classification task and the target detection task and remove the feature information irrelevant to them. The process of extracting the feature information corresponding to the first image by the electronic device may be referred to the description of the feature extraction module 601 in fig. 6. Then, the electronic device decouples the feature information corresponding to the first image into feature information corresponding to the shooting scene and feature information corresponding to the shooting object, uses the feature information corresponding to the shooting scene for the scene classification task to obtain the scene classification result, and uses the feature information corresponding to the shooting object for the target detection task to obtain the target detection result. Therefore, the accuracy with which the electronic device performs the scene classification task and the target detection task based on the feature information of the first image can be improved, the amount of computation can be reduced, and the computation speed can be improved.
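A minimal PyTorch-style sketch of such a structure is shown below for illustration; the layer sizes, the class counts, and the simplified 1x1-convolution decoupling are assumptions and do not describe the actual multi-task learning model (an attention-based decoupling variant is sketched further below).

```python
import torch.nn as nn

class MultiTaskLearningModel(nn.Module):
    # Illustrative sketch; sizes and heads are assumptions, not the patent's model.
    def __init__(self, num_scene_classes=5, num_object_classes=4):
        super().__init__()
        # Shared feature extraction over the whole first image
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        # Decoupling into scene features and object features (simplified here as two
        # learned 1x1 projections; an attention-based variant is sketched further below)
        self.to_scene = nn.Conv2d(64, 64, kernel_size=1)
        self.to_object = nn.Conv2d(64, 64, kernel_size=1)
        # First output layer: scene classification
        self.scene_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(64, num_scene_classes))
        # Second output layer: target detection (per-location class scores + box offsets)
        self.detect_head = nn.Conv2d(64, num_object_classes + 4, kernel_size=1)

    def forward(self, first_image):
        shared = self.backbone(first_image)
        scene_feat = self.to_scene(shared)
        object_feat = self.to_object(shared)
        return self.scene_head(scene_feat), self.detect_head(object_feat)
```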
In one implementation, the electronic device determines a first weight and a second weight, then determines feature information corresponding to a shooting object according to feature information corresponding to the first image and the first weight, and determines feature information corresponding to a shooting scene according to feature information corresponding to the first image and the second weight.
Specifically, since the feature information corresponding to the first image includes feature information corresponding to the shooting scene and the shooting object, if the feature information corresponding to the shooting scene and the feature information corresponding to the shooting object are to be decoupled from the feature information corresponding to the first image, the electronic device may determine the feature information corresponding to the shooting scene based on the weight (the second weight) of the feature information corresponding to the shooting scene relative to the feature information corresponding to the first image, and determine the feature information corresponding to the shooting object based on the weight (the first weight) of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image.
The weight is used for representing the importance degree or contribution degree of a certain factor (such as the feature information corresponding to the shooting scene or the feature information corresponding to the shooting object) in the whole (such as the feature information corresponding to the first image) as well as the whole. Therefore, the feature decoupling according to the first weight and the second weight in the embodiment of the present application may be understood as feature decoupling according to the importance degree of the feature information corresponding to the shooting scene or the feature information corresponding to the shooting object in the feature information corresponding to the first image. For example, taking the first weight as an example, the first weight is used to represent the importance degree of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image, and the feature information corresponding to the shooting object is determined according to the first weight and the feature information corresponding to the first image, so that the useful feature information (information related to the shooting object) can be enhanced, and useless feature information (feature information unrelated to the shooting object) can be suppressed, so that the association degree of the determined feature information of the shooting object and the shooting object is higher. For example, the process of determining the first weight and the second weight by the electronic device may refer to the description of fig. 7, which is not repeated herein.
In one possible implementation, the electronic device performs a first process on the feature information corresponding to the first image in the channel dimension, determines a first processing result, and performs a second process on the feature information corresponding to the first image in the space dimension, determines a second processing result. Then, a first weight is determined according to the first processing result and the second processing result, and a second weight is determined according to the first weight.
The feature information corresponding to the first image extracted by the electronic device includes feature information in a channel dimension and feature information in a space dimension, and the electronic device may process the feature information corresponding to the first image in the channel dimension and the space dimension respectively to obtain a weight of the feature information corresponding to the shooting object in the channel dimension relative to the feature information corresponding to the first image, and a weight of the feature information corresponding to the shooting object in the space dimension relative to the feature information corresponding to the first image. Then, the first weight and the second weight are determined based on the first processing result and the second processing result, so that the first weight characterizes the weight of the characteristic information corresponding to the shooting object relative to the characteristic information corresponding to the first image in the channel dimension and in the space dimension, and the second weight characterizes the weight of the characteristic information corresponding to the shooting scene relative to the characteristic information corresponding to the first image in the channel dimension and in the space dimension.
For example, the electronic device may perform a first process on the feature information corresponding to the first image in the channel dimension based on a channel attention (Channel Attention) mechanism to obtain a first processing result. The first processing result is used for representing the weight of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image in the channel dimension, and the specific process can refer to step S701.
For example, the electronic device may perform a second processing on the features corresponding to the first image in the spatial dimension based on a spatial attention (Spatial Attention) mechanism to obtain a second processing result. The second processing result is used for representing the weight of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image in the spatial dimension, and the specific process can refer to step S702.
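One possible form of this channel-plus-spatial weighting, similar in spirit to CBAM-style attention, is sketched below; combining the two processing results by elementwise multiplication and deriving the second weight as one minus the first weight are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class FeatureDecouple(nn.Module):
    # Illustrative sketch; the exact form of the first/second processing and the
    # rule for deriving the second weight from the first weight are assumptions.
    def __init__(self, channels, reduction=8):
        super().__init__()
        # First processing, channel dimension (channel attention)
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        # Second processing, spatial dimension (spatial attention)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, feat):                              # feat: (N, C, H, W)
        n, c, _, _ = feat.shape
        # First processing result: per-channel weight of the object-related features
        channel_w = torch.sigmoid(self.channel_mlp(feat.mean(dim=(2, 3)))).view(n, c, 1, 1)
        # Second processing result: per-position weight of the object-related features
        pooled = torch.cat([feat.mean(dim=1, keepdim=True),
                            feat.amax(dim=1, keepdim=True)], dim=1)
        spatial_w = torch.sigmoid(self.spatial_conv(pooled))
        first_weight = channel_w * spatial_w              # weight for the shooting object
        second_weight = 1.0 - first_weight                # weight for the shooting scene (assumed)
        return feat * second_weight, feat * first_weight  # scene features, object features
```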
In one possible implementation, the electronic device inputs feature information corresponding to a shooting scene into a first output layer in the multi-task learning model, and outputs a scene classification result through the first output layer.
The scene classification result comprises a first judgment result and a class corresponding to the shooting scene, wherein the first judgment result is used for representing whether the shooting scene belongs to a preset scene or not.
Specifically, the electronic device inputs feature information corresponding to a shooting scene into a first output layer in the multi-task learning model, and judges whether the shooting scene of the first image belongs to a preset scene or not through the first output layer to obtain a first judgment result. If the shooting scene belongs to a preset scene, the electronic equipment can also output the scene category corresponding to the shooting scene through the first output layer. The process of determining the scene classification result by the electronic device may refer to the description of the multitasking module 603 in fig. 6.
In one possible implementation, the electronic device inputs feature information corresponding to the shooting object into a second output layer in the multi-task learning model, and outputs a target detection result through the second output layer, wherein the target detection result includes a category of the shooting object and position information of the shooting object in the first image.
Specifically, the electronic device inputs feature information corresponding to the shooting object into a second output layer in the multi-task learning model, and whether the first image contains the shooting object of a preset category is determined through the second output layer. If the electronic device includes a preset type of shooting object, the electronic device may further output the type of the shooting object and the position information of the shooting object relative to the first image through the second output layer. The process of determining the target detection result by the electronic device may refer to the description of the multitasking module 603 in fig. 6.
Step S503, determining the target exposure parameters according to the scene classification result and the target detection result.
Specifically, after the electronic device obtains the scene classification result and the target detection result, the electronic device may determine the target brightness according to the scene classification result and the target detection result. Then, a target exposure parameter is determined with the aim of adjusting the brightness of the first image to the target brightness.
In one possible implementation manner, if the first determination result is that the shooting scene belongs to the preset scene, determining the target exposure parameter according to the category corresponding to the shooting scene.
Specifically, since the shooting scene includes all visible elements of the first image, the shooting scene is related to the information of the whole first image. When the shooting scene belongs to a preset scene, the electronic equipment is required to adjust the overall brightness of the image, and the electronic equipment can also synchronously adjust the brightness of the shooting object contained in the first image. The electronic device can adjust the brightness of the image to a moderate state according to the type of the shooting scene without considering the type of the shooting object.
In one possible implementation manner, when the first determination result is that the shooting scene does not belong to the preset scene, the target exposure parameter is determined according to the type of the shooting object and the position information of the shooting object in the first image when the type of the shooting object belongs to the preset type.
Specifically, when the electronic device determines that the shooting scene does not belong to the preset scene according to the first determination result, the electronic device determines whether the category of the shooting object belongs to the preset category according to the target detection result. When the electronic device judges that the class of the shooting object belongs to the preset class, the electronic device is required to adjust the exposure parameters, and the electronic device can determine the target exposure parameters based on the class of the shooting object and the position information of the shooting object in the first image.
It should be noted that, in the above embodiment, the electronic device preferentially determines the target exposure parameter based on the shooting scene. In addition to the above embodiment, the electronic device may also preferentially determine the target exposure parameter based on the photographic subject. For example, in a case where the category of the photographic subject belongs to the preset category and the ratio of the area of the photographic subject to the total area of the first image is greater than or equal to a ratio threshold, the electronic device may preferentially determine the target exposure parameter based on the photographic subject. This is because, when the ratio of the area of the subject to the total area of the first image is greater than or equal to the ratio threshold, the area of the subject is large and the brightness of the subject has a large influence on the overall brightness of the image, so the electronic device may take the brightness of the photographic subject into consideration.
Further, in one implementation, when the difference between the luminance of the photographic subject and the luminance of the background in the image is large, it may be difficult for the electronic device to eliminate the luminance difference between the photographic subject and the background by adjusting the luminance of the image as a whole. The electronic device may process the brightness of the region where the photographic subject is located based on the position information of the photographic subject in the first image, so as to reduce the brightness gap between the photographic subject and the background, so that the image overall presents moderate brightness.
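The alternative, object-first priority described above can be summarized as follows; the sketch is illustrative and the ratio threshold value is an assumption.

```python
def exposure_basis(scene_result, detection_result, object_area, image_area,
                   ratio_threshold=0.3):
    # Sketch of the alternative priority; the threshold value 0.3 is an assumption.
    large_object = (detection_result.has_preset_object
                    and object_area / image_area >= ratio_threshold)
    if large_object:
        return "object"   # the object covers enough of the frame to dominate brightness
    if scene_result.is_preset_scene:
        return "scene"
    if detection_result.has_preset_object:
        return "object"
    return "none"
```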
Step S504, a second image is displayed.
Specifically, after the electronic device determines the target exposure parameter based on the shooting scene and/or the shooting object, the electronic device may acquire the second image based on the target exposure parameter. The target exposure parameters include parameters such as photosensitivity, aperture size, shutter speed and the like. The electronic equipment can adjust the sensitivity of the image sensor to light by adjusting the sensitivity, can adjust the light entering the camera by adjusting the aperture size, and can adjust the exposure time by adjusting the shutter speed, thereby controlling the exposure degree of the camera, collecting a second image with moderate brightness, and reducing the overexposure or underexposure of the second image.
For example, the second image may be referred to as image 313 shown in fig. 3D, and image 404 shown in fig. 4C.
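For reference, the usual photographic relation between these three parameters is the exposure value shown below; the formula is the standard definition, while the choice of realizing a compensation through the shutter speed alone is an assumption made for illustration.

```python
import math

def exposure_value(f_number, shutter_seconds, iso):
    # Standard photographic exposure value; lower EV settings let in more light.
    return math.log2(f_number ** 2 / shutter_seconds) - math.log2(iso / 100)

def lengthen_shutter(shutter_seconds, compensation_stops):
    # One simple way to realize a brightness compensation: change only the shutter
    # speed, doubling the exposure time for each +1 stop of compensation.
    return shutter_seconds * (2 ** compensation_stops)

print(round(exposure_value(f_number=2.0, shutter_seconds=1 / 100, iso=100), 2))  # ≈ 8.64
print(lengthen_shutter(1 / 100, 0.74))  # ≈ 1/60 s
```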
The structure of the multi-task learning model will be described below in conjunction with fig. 6 to illustrate one possible implementation of the electronic device to process the first image through the multi-task learning model.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a multi-task learning model according to an embodiment of the application. As shown in fig. 6, the multitasking learning model includes a feature extraction module 601, a feature decoupling module 602, and a multitasking processing module 603.
Illustratively, the feature extraction module 601 is configured to perform feature extraction on an image (for example, the first image) input into the multi-task learning model, so as to obtain feature information corresponding to the first image. Optionally, image feature extraction refers to processing and analyzing the information contained in an image and extracting information that is not easily disturbed by random factors as the features of the image, so that the original features of the image are represented as a group of features with obvious physical or statistical significance. Image features refer to a set of mathematical quantities that can characterize the content or characteristics of an image.
In one implementation, the feature extraction module 601 extracts the feature information corresponding to a task according to the task processed by the electronic device through the multi-task learning model. For example, the electronic device processes the scene classification task and the target detection task through the multi-task learning model, and the feature extraction module 601 may extract feature information according to the scene classification task and the target detection task. The scene classification task is used to determine the shooting scene when the electronic device acquires the first image, and the feature information related to the scene classification task is the feature information related to the shooting scene. Since the shooting scene contains all visible elements of the first image and is therefore related to the information of the first image as a whole, the electronic device needs to identify the shooting scene based on features characterizing the whole of the first image. For example, features characterizing the whole of the first image include, but are not limited to, statistical features, which quantify the pixel values, color distribution, and structure of the image and can be usefully applied in tasks such as image classification, segmentation, retrieval, and enhancement. The electronic device may analyze the statistical features of the first image to determine the style characteristics of the first image; for example, the gray-level histogram among the statistical features can reveal the contrast, brightness, and tone distribution of the first image, and the color histogram among the statistical features can reveal the color distribution in the first image, so that the style of the first image can be described more fully based on information about the overall situation of the first image, such as the color and the contrast, and the accuracy of the determined scene classification result is higher. Therefore, the feature extraction module 601 may extract feature information about the whole of the first image from the first image for the subsequent scene classification task.
In one implementation, the target detection task is configured to detect the photographic subject included in the first image, and the feature information related to the target detection task is the feature information related to the photographic subject. Since the area of the region where the photographed object is located is smaller than the total area of the first image, the photographed object is related to local information of the first image, and the electronic device therefore needs to detect the photographed object based on features characterizing local parts of the first image. For example, features characterizing a local part of the first image include, but are not limited to, texture features, edge features, shape features, spatial relationship features, and the like. For example, texture features are used to describe the surface properties of the scene to which an image or image region corresponds. Edge features represent regions of the image where the gray level changes significantly, typically corresponding to the boundary or shape of an object. Shape features are features that describe the geometry of an object, including information about the boundary, contour, area, perimeter, and shape complexity of the object. Spatial relationship features describe the relative position, orientation, and distance between objects, focusing not only on the shape and size of the objects but also on their distribution and arrangement in space. Therefore, the electronic device can identify the category of the photographed object, determine the position of the photographed object, or determine the outline of the photographed object based on features of local parts of the first image, so that the accuracy of the determined target detection result is higher. Thus, the feature extraction module 601 may extract feature information related to local parts of the first image from the first image for the subsequent target detection task.
Illustratively, the structure of the feature extraction module 601 includes, but is not limited to, a convolution layer, a pooling layer, a normalization layer, and an activation layer; the connection relationship, position, and number of these layers are not limited by the embodiments of the present application. In one implementation, the feature extraction module 601 first scans the first image through a filter (also referred to as a convolution kernel) of the convolution layer in a sliding-window manner, multiplying the filter with each local area of the first image and summing the results to obtain the output of the convolution layer, thereby extracting features (for example, features corresponding to the shooting scene and features corresponding to the shooting object) from the first image. Then, the feature extraction module 601 inputs the output of the convolution layer into the pooling layer, which reduces the spatial size of the feature map extracted by the convolution layer while retaining important information (information about the shooting scene and the shooting object). The feature extraction module 601 inputs the output of the pooling layer into the normalization layer, which performs a linear transformation on the batch data obtained from the pooling layer so that the distribution of the batch data is more stable, which helps accelerate convergence. Finally, the feature extraction module 601 inputs the output of the normalization layer into the activation layer to obtain the feature information corresponding to the first image. The activation layer introduces nonlinear factors so that the multi-task learning model can learn more complex (for example, nonlinear) features, improving its ability to extract such features. The combination of the convolution layer and the pooling layer effectively reduces the number of parameters of the multi-task learning model and enhances its generalization capability on different images, while the normalization layer and the activation layer improve the training effect, generalization capability, and overall performance of the model.
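For illustration only, such a layer stack could be sketched in PyTorch as follows; the layer order follows the description above, while the channel counts, kernel size, and pooling size are assumptions and not values taken from the embodiments.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Minimal sketch of the feature extraction module 601: convolution -> pooling ->
    normalization -> activation. Layer sizes are illustrative assumptions."""

    def __init__(self, in_channels: int = 3, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2)   # shrink the spatial size, keep salient responses
        self.norm = nn.BatchNorm2d(out_channels)  # stabilize the batch distribution
        self.act = nn.ReLU(inplace=True)          # introduce non-linearity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, 3, H, W) RGB image -> (N, C, H/2, W/2) feature map
        return self.act(self.norm(self.pool(self.conv(x))))
```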
In one possible implementation, the multi-task learning model may extract the feature information corresponding to the first image through a multi-task decoder (Multi-Task Learning Decoder, MTL Decoder). The multi-task decoder includes the feature extraction module 601 and the feature decoupling module 602, which perform feature extraction on an image (e.g., the first image) input into the multi-task decoder and then map the extracted features to a plurality of output tasks (e.g., the scene classification task and the target detection task in the multi-task processing module 603), so that the multi-task learning model can process the plurality of output tasks in parallel.
The format of the image input to the multi-task decoder by the electronic device is not limited. For example, the image acquired by the electronic device through the sensor may be in YUV format; the electronic device may convert the YUV image into red-green-blue (RGB) format and then input the RGB image into the multi-task decoder, so that the feature extraction module 601 in the multi-task decoder performs feature extraction on the RGB image. For example, the process of feature extraction by the electronic device through the multi-task decoder can be seen in the following expression (1):
$$F = \mathcal{F}_{\mathrm{ext}}(I) \tag{1}$$

where, in expression (1), $I$ is the image (e.g., the first image) input into the multi-task decoder by the electronic device, $F$ is the feature extracted by the electronic device through the multi-task decoder (e.g., the feature information corresponding to the first image), and $\mathcal{F}_{\mathrm{ext}}(\cdot)$ is the functional expression of the feature extraction performed on the first image by the electronic device through the multi-task decoder.
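As a small illustrative sketch of the format conversion mentioned before expression (1) — the assumption of a packed 3-channel YUV frame is ours; a real sensor may deliver NV12/NV21 and would need the corresponding conversion code:

```python
import cv2
import numpy as np

def sensor_frame_to_rgb(yuv_frame: np.ndarray) -> np.ndarray:
    # Convert a 3-channel YUV frame to RGB before it is fed into the multi-task decoder.
    return cv2.cvtColor(yuv_frame, cv2.COLOR_YUV2RGB)
```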
For example, the feature decoupling module 602 is configured to decouple the feature information output by the feature extraction module 601 (for example, the feature information corresponding to the first image) into feature information corresponding to each of the plurality of tasks in the multi-task processing module 603. For example, when the multi-task processing module 603 includes a scene classification task and a target detection task, the feature decoupling module 602 may decouple the feature information corresponding to the first image into feature information corresponding to the shooting object and feature information corresponding to the shooting scene, where the feature information corresponding to the shooting scene is used for the scene classification task and the feature information corresponding to the shooting object is used for the target detection task.
It may be understood that, in multi-task learning, the feature information corresponding to the image input into the multi-task learning model (for example, the feature information corresponding to the first image) may include feature information that can be shared by a plurality of tasks as well as feature information specific to a certain task. Therefore, when the feature decoupling module 602 determines the feature information corresponding to any one of the plurality of tasks, that feature information may include, in addition to the feature information specific to that task, feature information shared with other tasks. For example, the feature information corresponding to the shooting scene determined by the feature decoupling module 602 includes not only feature information specific to the scene classification task, but also feature information shared by the scene classification task and the target detection task. In this way, any one of the plurality of tasks can, on the basis of processing its task-specific feature information, strengthen its understanding of the first image based on the shared feature information, and thereby output a result with higher accuracy.
The process of feature decoupling by the feature decoupling module will be described in detail below with reference to fig. 7.
Referring to fig. 7, fig. 7 is a flowchart of feature decoupling provided in an embodiment of the present application. The process is described by taking, as an example, the weight of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image. It can be understood that the weight of the feature information corresponding to the shooting scene relative to the feature information corresponding to the first image may also be determined first based on the implementation shown in fig. 7 and then used for feature decoupling, which is not described again here. As shown in fig. 7, the flowchart includes S701 to S705, detailed as follows.
S701, determining a weight of feature information corresponding to the shooting object relative to feature information corresponding to the first image in the channel dimension.
Specifically, the feature information corresponding to the first image includes a feature map of size C×H×W, where C is the number of channels of the feature map and belongs to the channel dimension; it can be seen that the feature information corresponding to the first image includes a plurality of different channels. In general, different channels contribute differently to different features: some channels may contain more critical information (for example, information related to the shooting object), while other channels may contain noise or redundant information (for example, information unrelated to the shooting object). Therefore, the electronic device can determine, through a channel attention mechanism, the weight of the critical information (for example, the information related to the shooting object) on each channel. By assigning a higher weight to channels containing more critical information and a lower weight to channels containing less, the features related to the shooting object are strengthened and the features unrelated to it are suppressed in the channel dimension, thereby decoupling the feature information corresponding to the shooting object from the feature information corresponding to the first image.
The channel attention mechanism is one type of attention mechanism (Attention Mechanism). By simulating the selective attention of human vision, an attention mechanism can automatically focus on the most relevant information in massive data, improving the efficiency and accuracy of a model. More specifically, the channel attention mechanism focuses on judging the importance of different channels and determining which channels carry more discriminative features for a task.
Illustratively, the first processing result obtained by the electronic device from the feature information corresponding to the first image based on the channel attention mechanism is given by the following expression (2):

$$M_c = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \tag{2}$$

where $F$ is the feature information corresponding to the first image (see expression (1)), $\mathrm{AvgPool}(F)$ is the average pooling of the features corresponding to the first image, $\mathrm{MaxPool}(F)$ is the maximum pooling of the features corresponding to the first image, $\mathrm{MLP}$ is the fully connected layer structure, and $\sigma$ is the activation function.
Specifically, since the channel attention mechanism operates in the channel dimension, pooling in the spatial dimension is first required to compress the spatial size and obtain a vector in the channel dimension. In expression (2), the electronic device compresses the spatial size by performing average pooling and maximum pooling on the feature information corresponding to the first image, so that the resulting first feature information is the feature information of the first image in the channel dimension. Then, the electronic device processes the first feature information through the fully connected layer to obtain, for each channel, a weight describing how much information related to the shooting object the channel contains; this weight can be regarded as the attention weight of the channel and measures its contribution to the feature information corresponding to the shooting object. Finally, the electronic device maps the output of the fully connected layer through the activation function to obtain the first processing result, which therefore contains, for each channel, the weight of the feature information corresponding to the shooting object relative to the first feature information. At this point, the first processing result has a size of C×1×1 and contains information in the channel dimension.
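A minimal sketch of such a channel attention branch, in the spirit of expression (2); the reduction ratio and layer widths are assumptions:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of expression (2): per-channel weights from average and max pooling."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared fully connected structure
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (N, C, H, W) feature information corresponding to the first image
        avg = torch.mean(f, dim=(2, 3))                # (N, C) average pooling over space
        mx = torch.amax(f, dim=(2, 3))                 # (N, C) max pooling over space
        m_c = torch.sigmoid(self.mlp(avg) + self.mlp(mx))  # activation -> per-channel weights
        return m_c.view(f.size(0), -1, 1, 1)           # (N, C, 1, 1) first processing result
```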
S702, determining the weight of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image in the spatial dimension.
Specifically, the feature information corresponding to the first image includes a feature map of size C×H×W, where H is the height of the feature map and W is its width, both belonging to the spatial dimension. Based on the height and width of the feature map, a plurality of spatial positions of the feature map can be determined. In general, different spatial positions contribute differently to different features: some spatial positions may contain more critical information (for example, information related to the shooting object), while others may contain noise or redundant information (for example, information unrelated to the shooting object). Therefore, the electronic device may determine, through a spatial attention mechanism, the weight of the critical information (for example, the information related to the shooting object) at each spatial position. By assigning a higher weight to spatial positions containing more critical information and a lower weight to those containing less, the features related to the shooting object are strengthened and the features unrelated to it are suppressed in the spatial dimension, thereby decoupling the feature information corresponding to the shooting object from the feature information corresponding to the first image.
The spatial attention mechanism is an important attention mechanism in deep learning, used to focus on specific regions when processing, for example, image or video data, so as to improve the performance and efficiency of a model. For example, the electronic device may calculate the attention weight of each pixel in the first image through the spatial attention mechanism and then concentrate the attention of the model (e.g., the multi-task learning model) on the most important regions of the first image, so that the model can process complex visual tasks more effectively and improve its performance and generalization capability.
Illustratively, the second processing result obtained by the electronic device from the feature information corresponding to the first image based on the spatial attention mechanism is given by the following expression (3):

$$M_s = \sigma\big(\mathrm{Conv}\big([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)]\big)\big) \tag{3}$$

where $F$ is the feature information corresponding to the first image (see expression (1)), $\mathrm{AvgPool}(F)$ is the average pooling of the feature information corresponding to the first image along the channel dimension, $\mathrm{MaxPool}(F)$ is the maximum pooling of the feature information corresponding to the first image along the channel dimension, $\mathrm{Conv}$ is the convolution operation performed on the concatenation of $\mathrm{AvgPool}(F)$ and $\mathrm{MaxPool}(F)$, and $\sigma$ is the activation function.
Specifically, since the spatial attention mechanism operates in the spatial dimension, pooling in the channel dimension is first required to compress the channel size. In expression (3), the electronic device performs average pooling and maximum pooling on the feature information corresponding to the first image along the channel dimension, obtaining two feature maps of size 1×H×W. The two feature maps are concatenated along the channel to obtain a feature map of size 2×H×W. A convolution is then applied to this 2×H×W feature map (for example, with a 7×7 convolution kernel) so that it is reduced to a single-channel feature map, yielding the second feature information of size 1×H×W, which contains the feature information of the first image in the spatial dimension. Through the convolution operation, the electronic device obtains, for each spatial position, a weight describing how much information related to the shooting object the position contains; this weight can be regarded as the attention weight of the spatial position and measures its contribution to the feature information corresponding to the shooting object. Finally, the electronic device normalizes the output of the convolution operation into a probability distribution through the activation function, obtaining the second processing result, which contains, for each spatial position, the weight of the feature information corresponding to the shooting object relative to the second feature information, with the weights over all spatial positions summing to 1. The second processing result has a size of 1×H×W and contains information in the spatial dimension.
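A minimal sketch of the spatial attention branch corresponding to expression (3); the 7×7 kernel is an assumption, and the softmax follows the normalization described above:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of expression (3): per-position weights from channel-wise pooling."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (N, C, H, W) feature information corresponding to the first image
        avg = torch.mean(f, dim=1, keepdim=True)       # (N, 1, H, W) channel-wise average pooling
        mx, _ = torch.max(f, dim=1, keepdim=True)      # (N, 1, H, W) channel-wise max pooling
        m_s = self.conv(torch.cat([avg, mx], dim=1))   # (N, 1, H, W) second feature information
        n, _, h, w = m_s.shape
        # Normalize into a probability distribution over spatial positions (weights sum to 1).
        return torch.softmax(m_s.view(n, 1, h * w), dim=-1).view(n, 1, h, w)
```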
S703, determining a first weight and a second weight.
Specifically, the first processing result is a weight in the channel dimension, the second processing result is a weight in the spatial dimension, and the feature information corresponding to the first image contains feature information in both the channel dimension and the spatial dimension. In order that the feature information corresponding to the shooting object decoupled from the feature information corresponding to the first image also contains feature information in both dimensions, the electronic device needs to obtain the first weight from the first processing result and the second processing result, so that the first weight characterizes, in both the channel dimension and the spatial dimension, the weight of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image. When executing S704, the electronic device can then decouple the feature information corresponding to the shooting object from the feature information corresponding to the first image according to the first weight.
In one possible implementation, since the first processing result has a size of C×1×1, the second processing result has a size of 1×H×W, and the feature information corresponding to the first image has a size of C×H×W, the electronic device can first expand the first processing result from C×1×1 to C×H×W, which corresponds to duplicating its values H times along the height and W times along the width of the spatial dimension. The second processing result is expanded from 1×H×W to C×H×W, which corresponds to duplicating its single channel C times in the channel dimension. Then, the electronic device sums the expanded first processing result and the expanded second processing result and averages them to obtain the first weight. The first weight $W_1$ is given by the following expression (4):

$$W_1 = \frac{1}{2}\big(\mathrm{Expand}(M_c) + \mathrm{Expand}(M_s)\big) \tag{4}$$

where $M_c$ is the first processing result (see expression (2)), $M_s$ is the second processing result (see expression (3)), and $\mathrm{Expand}(\cdot)$ denotes expanding each processing result to the size C×H×W of the feature information corresponding to the first image (see expression (1)) before the two are summed and averaged.
Illustratively, since the feature information corresponding to the first image includes the feature information corresponding to the shooting object and the feature information corresponding to the shooting scene, the sum of the weight of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image (i.e., the first weight) and the weight of the feature information corresponding to the shooting scene relative to the feature information corresponding to the first image (i.e., the second weight) is 1. Therefore, after determining the first weight, the electronic device can determine the second weight as 1 minus the first weight. It should be noted that, if the feature information corresponding to the first image further includes other feature information besides that corresponding to the shooting object and the shooting scene, the electronic device may determine the corresponding weights in the same manner as in steps S701 and S702 and then perform feature decoupling.
S704, determining feature information corresponding to the shooting object based on the first weight.
Specifically, after determining the first weight, the electronic device may determine feature information corresponding to the shooting object according to feature information corresponding to the first image and the first weight. For example, the feature information corresponding to the first image is multiplied by the first weight, so that the feature information related to the photographing object is amplified among the feature information corresponding to the first image, and the feature information unrelated to the photographing object is attenuated, thereby decoupling the feature information corresponding to the photographing object from the feature information corresponding to the first image.
For example, the feature information corresponding to the shooting object can be expressed as the following expression (5):

$$F_{\mathrm{obj}} = W_1 \odot F \tag{5}$$

where $F$ is the feature information corresponding to the first image (see expression (1)), $W_1$ is the first weight (see expression (4)), and $F_{\mathrm{obj}}$, the weighted fusion of the feature information corresponding to the first image with the first weight, is the feature information corresponding to the shooting object.
It should be noted that the first weight may also be understood as a mask, where a mask marks a set of indices or a particular region in the image and is used to select, hide, or modify part of the data. For example, the first weight may be used to select the feature information corresponding to the shooting object from the feature information corresponding to the first image while hiding the other feature information, so that the multi-task learning model can focus on the information about the shooting object.
S705, determining feature information corresponding to the shooting scene based on the second weight.
Specifically, after determining the second weight, the electronic device may determine feature information corresponding to the shooting scene according to feature information corresponding to the first image and the second weight. For example, the feature information corresponding to the first image is multiplied by the second weight, so that the feature information related to the shooting scene is amplified among the feature information corresponding to the first image, and the feature information unrelated to the shooting scene is attenuated, thereby decoupling the feature information corresponding to the shooting scene from the feature information corresponding to the first image.
For example, the feature information corresponding to the shooting scene can be expressed as the following expression (6):

$$F_{\mathrm{scene}} = W_2 \odot F \tag{6}$$

where $F$ is the feature information corresponding to the first image (see expression (1)), $W_2$ is the second weight, and $F_{\mathrm{scene}}$, the weighted fusion of the feature information corresponding to the first image with the second weight, is the feature information corresponding to the shooting scene.
It should be noted that the second weight may also be understood as a mask. For example, the second weight may be used to select the feature information corresponding to the shooting scene from the feature information corresponding to the first image while hiding the other feature information, so that the multi-task learning model can focus on the information about the shooting scene.
Illustratively, according to expressions (5) and (6), the feature information corresponding to the first image may also be expressed as the following expression (7):

$$F = F_{\mathrm{obj}} + F_{\mathrm{scene}} = W_1 \odot F + W_2 \odot F \tag{7}$$

where $F$ is the feature information corresponding to the first image (see expression (1)), $W_1$ is the first weight (see expression (4)), $W_2$ is the second weight, $F_{\mathrm{obj}}$ is the feature information corresponding to the shooting object, and $F_{\mathrm{scene}}$ is the feature information corresponding to the shooting scene.
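Putting expressions (4) to (7) together, the decoupling step could be sketched as follows, reusing the ChannelAttention and SpatialAttention sketches above; this illustrates how the pieces may fit rather than the exact implementation of the embodiments:

```python
import torch
import torch.nn as nn

class FeatureDecoupling(nn.Module):
    """Sketch of expressions (4)-(7): build the first weight, take the second weight
    as its complement, and split F into object and scene features."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, f: torch.Tensor):
        m_c = self.channel_att(f)                          # (N, C, 1, 1) first processing result
        m_s = self.spatial_att(f)                          # (N, 1, H, W) second processing result
        w1 = (m_c.expand_as(f) + m_s.expand_as(f)) / 2     # expression (4): expand and average
        w2 = 1.0 - w1                                      # the two weights sum to 1
        f_obj = w1 * f                                     # expression (5): features for target detection
        f_scene = w2 * f                                   # expression (6): features for scene classification
        return f_obj, f_scene                              # f_obj + f_scene == f, as in expression (7)
```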
It should be noted that, the electronic device decouples the feature information corresponding to the first image into the feature information corresponding to the shooting scene and the feature information corresponding to the shooting object, so that the multi-task learning model can better capture the feature information required by different tasks when processing a plurality of tasks in parallel, thereby improving the accuracy of the tasks.
In addition, in existing scene classification or target detection schemes, the feature information required by a task is generally extracted directly from the input image, and the extracted features are then processed to obtain a result, without any feature decoupling. For example, in an existing face detection task, the electronic device directly extracts the feature information corresponding to a face from the input image and then performs the target detection task based on that feature information to obtain the face detection result. In the embodiments of the present application, by contrast, the feature information required by each task is decoupled from the feature information corresponding to the first image based on the weights, so that the feature information corresponding to the shooting scene includes not only information related to the shooting scene but also information related to the shooting object; similarly, the feature information corresponding to the shooting object includes not only information related to the shooting object but also information related to the shooting scene. In other words, the feature information corresponding to the shooting scene and that corresponding to the shooting object share feature information. Determining the scene classification result based on both the feature information corresponding to the shooting scene and the shared feature information strengthens the multi-task learning model's understanding of the first image compared with using only the feature information corresponding to the shooting scene, so that a scene classification result with higher accuracy is output. Similarly, the electronic device can obtain a target detection result with higher accuracy based on the feature information corresponding to the shooting object and the shared feature information.
With continued reference to fig. 6, after obtaining the feature information corresponding to the shooting scene and the feature information corresponding to the shooting object, the feature decoupling module 602 may input both into the multi-task processing module 603.
The multi-task processing module 603 is configured to process a plurality of tasks in parallel, for example, the scene classification task and the target detection task in the embodiments of the present application.
As shown in fig. 6, the multitasking module 603 includes a scene discriminator and a scene classifier for performing scene classification tasks. Wherein the scene discriminator and the scene classifier belong to a first output layer of the multi-task learning model.
The scene discriminator is used to judge whether the shooting scene corresponding to the first image is a preset scene, and obtains a first judgment result. In one implementation, the scene discriminator is a multi-layer fully connected structure used for making this decision, and the first judgment result can be obtained from the feature information corresponding to the shooting scene. In another implementation, the scene discriminator may be a combination of multiple convolution layers and a fully connected layer; the embodiments of the present application do not limit the structure of the scene discriminator.
For example, the electronic device inputs the feature information corresponding to the shooting scene into the scene discriminator, and the prediction process of the scene discriminator is given by the following expression (8):

$$\hat{y}_{d} = \mathcal{D}\big(F_{\mathrm{scene}}\big) = \mathcal{D}\big(W_2 \odot F\big) \tag{8}$$

where $\hat{y}_{d}$ is the first judgment result; for example, a value of 0 indicates that the shooting scene does not belong to the preset scene, and a value of 1 indicates that it does. $W_2$ is the second weight, $F_{\mathrm{scene}} = W_2 \odot F$ is the feature information corresponding to the shooting scene, $\mathcal{D}$ is the network structure of the scene discriminator, and $\mathcal{D}(\cdot)$ denotes the prediction process of the scene discriminator.
For example, the scene classifier is used for determining the category corresponding to the shooting scene, and the design principle of the scene classifier may be the same as that of the scene discriminator. For example, the scene classifier may be a multi-layer fully connected structure or a combination of multi-layer convolution layers and fully connected layers, and the embodiment of the present application does not limit the structure of the scene classifier.
For example, the electronic device inputs the feature information corresponding to the shooting scene into the scene classifier, and the prediction process of the scene classifier is given by the following expression (9):

$$\hat{y}_{c} = \mathcal{C}\big(F_{\mathrm{scene}}\big) = \mathcal{C}\big(W_2 \odot F\big) \tag{9}$$

where $\hat{y}_{c}$ is the category corresponding to the shooting scene; for example, if the shooting scene belongs to a preset scene, the category output by the scene classifier is one or more of the preset scenes, and if the shooting scene does not belong to a preset scene, the scene classifier does not output a specific scene category. $W_2$ is the second weight, $F_{\mathrm{scene}} = W_2 \odot F$ is the feature information corresponding to the shooting scene, $\mathcal{C}$ is the network structure of the scene classifier, and $\mathcal{C}(\cdot)$ denotes the prediction process of the scene classifier.
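For illustration, the first output layer could be sketched as follows; the number of preset scene categories, the hidden width, and the global average pooling are assumptions:

```python
import torch
import torch.nn as nn

class SceneHead(nn.Module):
    """Sketch of the first output layer: a scene discriminator (expression (8)) and a
    scene classifier (expression (9)), both plain fully connected stacks."""

    def __init__(self, channels: int, num_scenes: int, hidden: int = 128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                    # collapse spatial dimensions
        self.discriminator = nn.Sequential(                    # is it a preset scene? (0 / 1)
            nn.Linear(channels, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, 1)
        )
        self.classifier = nn.Sequential(                       # which preset scene?
            nn.Linear(channels, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, num_scenes)
        )

    def forward(self, f_scene: torch.Tensor):
        v = self.pool(f_scene).flatten(1)                      # (N, C)
        is_preset = torch.sigmoid(self.discriminator(v))       # first judgment result
        scene_logits = self.classifier(v)                      # scores over preset scene categories
        return is_preset, scene_logits
```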
As shown in fig. 6, the multitasking module 603 further includes a target detector for performing a target detection task. Wherein the object detector belongs to a second output layer of the multi-task learning model.
The object detector is configured to output an object detection result based on feature information corresponding to the photographic subject, where the object detection result includes a category corresponding to the photographic subject and position information of the photographic subject in the first image.
Unlike the scene discriminator and the scene classifier, the target detector follows the design principle of a decoupled detection head and can detect the category of the shooting object and its position at the same time. For example, in fig. 6, the target detector includes a shared structure, a classification branch, and a regression branch. The shared structure is the part of the network shared within the target detector; it simplifies the detector's structure and balances the representational capability of the detector's operators against the computing cost of the hardware in the electronic device. The output of the shared structure can be used directly by both the classification branch and the regression branch, which reduces the computational load on the electronic device and saves processing time.
For example, the electronic device inputs the feature information corresponding to the shooting object into the target detector, and the prediction process of the target detector is given by the following expression (10):

$$\hat{y}_{\mathrm{det}} = \mathcal{T}\big(F_{\mathrm{shared}}\big) \tag{10}$$

where $\hat{y}_{\mathrm{det}}$ is the target detection result, including the category of the shooting object output by the classification branch and the position information of the shooting object in the first image output by the regression branch; $F_{\mathrm{shared}}$ is the target feature of the shared structure (obtained from the feature information corresponding to the shooting object); $\mathcal{T}$ is the network structure of the target detector, and $\mathcal{T}(\cdot)$ denotes the prediction process of the target detector.
The classification branch is used for predicting the class of the shooting object, and the structure of the classification branch can be a full-connection layer or a convolution layer, which is not limited herein. For example, if the classification branch identifies that the photographic subject belongs to a preset category, the classification branch may display a candidate frame of the category on the photographic subject, and output a category probability of the candidate frame, where the category probability is used to characterize the probability that the photographic subject belongs to the category.
Wherein the regression branch is used for determining position information of the photographic subject in the first image, for example, the position information includes coordinate information of the candidate frame in the above-mentioned classification branch (for example, center point coordinates, width, height, and the like of the candidate frame). The structure of the regression branches may be a fully connected layer or a convolution layer, without limitation.
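A rough sketch of the second output layer with its shared structure, classification branch, and regression branch; the per-location output format and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the second output layer (expression (10)): a shared structure feeding a
    classification branch and a regression branch (decoupled detection head)."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.shared = nn.Sequential(                            # common structure reused by both branches
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)
        )
        self.cls_branch = nn.Conv2d(channels, num_classes, 1)   # per-location class scores
        self.reg_branch = nn.Conv2d(channels, 4, 1)             # per-location box (cx, cy, w, h)

    def forward(self, f_obj: torch.Tensor):
        shared = self.shared(f_obj)
        cls_logits = self.cls_branch(shared)                    # category of the shooting object
        boxes = self.reg_branch(shared)                         # position information in the first image
        return cls_logits, boxes
```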
The process by which the electronic device determines the target exposure parameters from the photographed scene and/or the photographed object is described below in connection with fig. 8.
Referring to fig. 8, fig. 8 is a flowchart of determining a target exposure parameter according to an embodiment of the application. As shown in fig. 8, the flowchart includes steps S801 to S805, and the specific flow is as follows:
S801, a scene classification result and a target detection result are obtained.
Specifically, after the electronic device inputs the first image into the multi-task learning model, a scene classification result and a target detection result output by the multi-task learning model are obtained, and then the target exposure parameters are determined according to the scene classification result and the target detection result based on steps S801 to S805.
The scene classification result comprises a first judgment result and a class corresponding to the shooting scene, and the target detection result comprises a class of the shooting object and position information of the shooting object in the first image.
S802, judging whether the shooting scene belongs to a preset scene.
When the electronic device determines that the shooting scene belongs to the preset scene according to the first judgment result in the scene classification result, the electronic device executes step S803 without considering the target detection result, so that the computing resource of the electronic device is saved, and the automatic exposure speed of the electronic device is increased. When the electronic device determines that the shooting scene does not belong to the preset scene according to the first determination result, the electronic device executes step S804 to determine the target exposure parameter according to the target detection result.
S803, determining a target exposure parameter based on the class of the shooting scene.
Specifically, when the electronic device determines that the shooting scene belongs to the preset scene according to the first determination result, the electronic device may determine the target brightness based on the type of the shooting scene, and then determine the target exposure parameter based on the difference between the brightness of the first image and the target brightness, so that the brightness of the image acquired by the electronic device based on the target exposure parameter is equal to or close to the target brightness, thereby achieving exposure convergence. The brightness of the first image may be obtained from an image signal output from an image sensor of the electronic device.
The preset scenes include, but are not limited to, snow scenes, stages, grasslands, night scenes, and the like. Taking the case where the shooting scene corresponding to the first image is a night scene as an example: because a night scene contains large dark areas, the brightness of the picture acquired by the electronic device is low (below the 18% middle gray level), and the electronic device would raise the picture brightness by increasing the luminous flux, so that the first image acquired at this time is overexposed. Therefore, when the electronic device determines that the shooting scene of the first image is a night scene, it can determine the target exposure parameter by lowering the target brightness, so that the exposure intensity of the image acquired based on the target exposure parameter is lower than that of the first image, reducing the degree of exposure.
S804, judging whether the class of the shooting object belongs to a preset class.
Specifically, when the electronic device determines, according to the target detection result, that the category of the shooting object belongs to a preset category, the electronic device may execute step S805 to determine the target exposure parameter corresponding to the shooting object. When the electronic device determines that the category of the shooting object does not belong to a preset category, then neither the shooting scene belongs to a preset scene nor the category of the shooting object belongs to a preset category, which means that the shooting scene and the shooting object can be moderately exposed based on the target brightness of the 18% middle gray level, and the exposure parameters do not need to be adjusted. Moreover, if the scene classification result and/or the target detection result is wrong, the electronic device does not adjust the exposure parameters, which avoids the possibility of an incorrect adjustment and ensures the stability and reliability of automatic exposure.
Illustratively, in step S502, the electronic apparatus may output a candidate frame of the photographing object and a category probability of the candidate frame through the object detector. The electronic device may determine the class of the photographic object based on the class probability of the candidate frame, for example, when the class probability is greater than or equal to the threshold, the electronic device determines that the class of the photographic object belongs to the class corresponding to the candidate frame, and the electronic device may execute step S805 to adjust the exposure parameter. When the class probability is smaller than the threshold, the electronic equipment determines that the class of the shooting object does not belong to the class corresponding to the candidate frame, and does not adjust the exposure parameters.
S805, determining a target exposure parameter based on the category and the position information of the photographing object.
Specifically, when the category of the photographic subject belongs to a preset category, the electronic device may determine the target brightness according to the category of the photographic subject and the position information of the photographic subject in the first image. The target exposure parameter is then determined based on the difference between the brightness of the first image and the target brightness such that the brightness of the image captured by the electronic device based on the target exposure parameter is equal to or close to the target brightness, thereby achieving exposure convergence.
It will be appreciated that since automatic exposure by the electronic device is a process of adjusting the brightness of the image to the target brightness, the electronic device may need to adjust the exposure parameters multiple times to adjust the brightness of the image to the target brightness. The electronic equipment can determine shooting scenes and shooting objects of the input image through the multi-task learning model based on the method provided by the embodiment of the application, then determine target exposure parameters based on the shooting scenes and/or the shooting objects, and acquire the image again based on the target exposure parameters until the image with the target brightness is acquired, so as to provide a preview image with better quality for a user.
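The decision flow of steps S801 to S805 could be sketched as follows; the brightness tables, the probability threshold, and the proportional update rule are hypothetical placeholders and not values from the embodiments:

```python
# Hypothetical target-brightness tables (fractions of full scale); placeholder values only.
SCENE_TARGET_BRIGHTNESS = {"night": 0.10, "snow": 0.24, "stage": 0.14, "grassland": 0.20}
OBJECT_TARGET_BRIGHTNESS = {"face": 0.22, "pet": 0.20}

def determine_target_exposure(scene_result, detect_result, image_brightness,
                              current_exposure, prob_threshold=0.5):
    """Sketch of steps S801-S805: use the scene category when the scene is a preset one,
    otherwise fall back to the detected object, otherwise keep the current exposure."""
    is_preset_scene, scene_class = scene_result              # from the scene classification result
    obj_class, obj_prob, obj_box = detect_result             # from the target detection result

    if is_preset_scene and scene_class in SCENE_TARGET_BRIGHTNESS:              # S802 -> S803
        target = SCENE_TARGET_BRIGHTNESS[scene_class]
    elif obj_prob >= prob_threshold and obj_class in OBJECT_TARGET_BRIGHTNESS:  # S804 -> S805
        # obj_box (the position information) could additionally be used to meter the object's region.
        target = OBJECT_TARGET_BRIGHTNESS[obj_class]
    else:
        return current_exposure                               # no adjustment needed

    # S803 / S805: derive the new exposure from the gap between current and target brightness.
    return current_exposure * (target / max(image_brightness, 1e-6))
```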
The above embodiment is a process of automatically exposing an electronic device through a multi-task learning model, and a training process of the multi-task learning model will be specifically described below.
The multi-task learning model may be trained on a first device, which may be a device with data transceiver capability, data storage capability, and data processing capability, and may be a physical device such as a host, a rack-mounted server, a blade server, or the like, or may be a virtual device such as a virtual machine, a container, or the like. Further, the first device may be one server, or may be a server cluster formed by a plurality of servers.
In one possible implementation, the first device obtains a training set of the multi-task learning model, the training set including one or more of a training image, a scene classification label corresponding to the training image, and a target detection label. Then, the first device takes the training image as input of the multi-task learning model, takes the scene classification label corresponding to the training image as output of a first output layer in the multi-task learning model, takes the target detection label as output of a second output layer in the multi-task learning model, trains the multi-task learning model, and obtains the trained multi-task learning model.
The structure of the multi-task learning model can be seen in fig. 6, and as can be seen in fig. 6, the multi-task learning model includes a scene discriminator, a scene classifier, and a target detector. The scene discriminator is used for judging whether the shooting scene of the input image belongs to a preset scene or not, and the scene classifier is used for determining which of the preset scenes the shooting scene specifically belongs to. Thus, the scene discriminator and the scene classifier are used for the scene classification task, and the first device may train the scene discriminator and the scene classifier according to the scene classification tags. The target detector belongs to a second output layer of the multi-task learning model, and is used for determining whether a shooting object belonging to a preset category exists in the input image, and if the shooting object belonging to the preset category exists, the target detector can also determine the position information of the shooting object in the input image. Thus, the object detector is used for an object detection task, and the first device may train the object detector according to the object detection tag.
The training set of the multi-task learning model may be obtained by the first device from other electronic devices, or may be constructed by the first device. Here, the case where the first device constructs the training set is taken as an example. First, the first device acquires a plurality of images that contain a preset scene and/or a shooting object of a preset category; these images may be captured by the electronic device or come from other electronic devices, which ensures the diversity of the training set. Then, the first device annotates the acquired images with scene classification labels and/or target detection labels. The scene classification label is mainly used to mark the category of the shooting scene corresponding to the image; for example, the scene classification label for a snow scene is labeled "0", the label for a stage is "1", the label for a grassland is "2", and so on. The target detection label is mainly used to mark the category of the shooting object contained in the image and the position of the shooting object in the image; for example, the first device determines the position of the shooting object by drawing a label frame around the region where the shooting object is located. After the first device annotates the acquired images, it obtains a training set that includes the training images (i.e., the acquired images), the scene classification labels, and the target detection labels.
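Purely as an illustration of what one annotated training sample might look like (the field names, any label values beyond those given above, and the box format are assumptions):

```python
# Hypothetical layout of one annotated training sample; field names are illustrative only.
sample = {
    "image": "train/000123.jpg",
    "scene_label": 2,                                        # 0: snow scene, 1: stage, 2: grassland, ...
    "objects": [
        {"category": 0, "bbox": [0.42, 0.37, 0.20, 0.31]},   # normalized (cx, cy, w, h) of the label frame
    ],
}
```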
In one possible implementation manner, the first device inputs the training images in the training set into the multi-task learning model, determines feature information corresponding to the training images through the multi-task learning model, and then decouples the feature information corresponding to the training images into the third feature information and the fourth feature information through the multi-task learning model. And finally, training the multi-task learning model based on the third characteristic information and the fourth characteristic information to obtain a trained multi-task learning model.
The structure of the multi-task learning model may be referred to as fig. 6, and as shown in fig. 6, the multi-task learning model includes a feature extraction module 601, a feature decoupling module 602, and a multi-task processing module 603. Firstly, the first device performs feature extraction on the training image through a feature extraction module 601, and extracts feature information corresponding to the training image. In order to train the scene classification task and the target detection task of the multi-task learning model, feature information corresponding to the training image includes feature information corresponding to a shooting scene related to the scene classification task and feature information corresponding to a shooting object related to the target detection task. Then, the first device decouples the feature information corresponding to the training image into third feature information and fourth feature information through the feature decoupling module 602, where the third feature information is feature information corresponding to a shooting scene in the feature information corresponding to the training image, and the fourth feature information is feature information corresponding to a shooting object in the feature information corresponding to the training image. The first device then trains the multitasking module 603 based on the third and fourth characteristic information.
In one possible implementation, as shown in fig. 6, the multi-tasking module 603 includes a scene discriminator, a scene classifier, and a target detector. The first device may obtain the scene classification result through the multitasking module 603, for example, the first device determines, through the scene identifier, whether the shooting scene of the training image belongs to a preset scene according to the third feature information, and determines, through the scene classifier, which category of the preset scene the shooting scene specifically belongs to. The first device may also obtain the target detection result through the multitasking module 603, for example, the first device may output, through the target detector, a category of the photographed object according to the fourth feature information, and location information of the photographed object in the training image. The first device may compare the scene classification result with the scene classification tag to obtain a first comparison result. And comparing the target detection result with the target detection label to obtain a second comparison result. And then, adjusting parameters of the multi-task learning model based on the first comparison result and the second comparison result. The first device processes other training images in the training set through the multi-task learning model according to the steps until the first comparison result is smaller than the first comparison threshold value and the second comparison result is smaller than the second comparison threshold value, and at the moment, the first device completes training of the multi-task learning model, so that the multi-task learning model has the capability of parallel processing of scene classification tasks and target detection tasks.
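A minimal sketch of one joint training step under these assumptions about the model's output signature and an equal weighting of the two losses:

```python
def train_step(model, batch, optimizer, scene_loss_fn, det_loss_fn):
    """One joint training step: the scene branch is supervised by the scene classification
    label (first output layer) and the detection branch by the target detection label
    (second output layer). The output signature and loss weighting are assumptions."""
    images, scene_labels, det_labels = batch
    is_preset, scene_logits, cls_logits, boxes = model(images)   # both tasks processed in parallel

    loss_scene = scene_loss_fn(scene_logits, scene_labels)       # compared against the scene classification label
    loss_det = det_loss_fn((cls_logits, boxes), det_labels)      # compared against the target detection label
    loss = loss_scene + loss_det                                 # equal weighting assumed

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_scene.item(), loss_det.item()
```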
It should be understood that each step in the above method embodiments provided by the present application may be implemented by an integrated logic circuit of hardware in a processor or an instruction in software form. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution.
The application also provides an electronic device, which may include a memory and a processor. Wherein the memory is operable to store a computer program and the processor is operable to invoke the computer program in the memory to cause the electronic device to perform the method of any of the embodiments described above.
The application also provides a chip system comprising at least one processor for implementing the functions involved in the method executed by the electronic device in any of the above embodiments.
In one possible design, the system on a chip further includes a memory to hold program instructions and data, the memory being located either within the processor or external to the processor.
The chip system may be formed of a chip or may include a chip and other discrete devices.
Alternatively, the processor in the system-on-chip may be one or more. The processor may be implemented in hardware or in software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general purpose processor, implemented by reading software code stored in a memory.
Alternatively, there may be one or more memories in the chip system. The memory may be integrated with the processor or separate from the processor; the embodiments of the present application are not limited in this respect. The memory may be a non-transitory memory, such as a ROM, which may be integrated on the same chip as the processor or provided separately on a different chip; the embodiments of the present application do not specifically limit the type of memory or the manner in which the memory and the processor are arranged.
Illustratively, the chip system may be a field programmable gate array (FPGA), an application specific integrated chip (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processing circuit (DSP), a microcontroller (micro controller unit, MCU), a programmable controller (programmable logic device, PLD), or another integrated chip.
The present application also provides a computer program product comprising a computer program (which may also be referred to as code, or instructions) which, when executed, causes a computer to perform the method performed by the electronic device in any of the embodiments described above.
The present application also provides a computer-readable storage medium storing a computer program (which may also be referred to as code, or instructions). The computer program, when executed, causes a computer to perform the method performed by the electronic device in any of the embodiments described above.
The embodiments of the present application may be arbitrarily combined to achieve different technical effects.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.
Those of ordinary skill in the art will appreciate that implementing all or part of the above-described method embodiments may be accomplished by a computer program to instruct related hardware, the program may be stored in a computer readable storage medium, and the program may include the above-described method embodiments when executed. The storage medium includes a ROM or a random access memory RAM, a magnetic disk or an optical disk, and other various media capable of storing program codes.
In summary, the foregoing description is only exemplary embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made according to the disclosure of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. An image processing method, the method comprising:
Acquiring a first image, wherein the first image is acquired based on initial exposure parameters;
Inputting the first image into a multi-task learning model, and outputting a scene classification result and a target detection result through the multi-task learning model, wherein the scene classification result is used for representing a shooting scene corresponding to the first image, and the target detection result is used for representing a shooting object contained in the first image;
determining target exposure parameters according to the scene classification result and/or the target detection result;
and displaying a second image, wherein the second image is acquired based on the target exposure parameters.
2. The method of claim 1, wherein the inputting the first image into a multi-task learning model, outputting scene classification results and object detection results through the multi-task learning model, comprises:
inputting the first image into the multi-task learning model, and determining feature information corresponding to the first image through the multi-task learning model, wherein the feature information corresponding to the first image comprises feature information corresponding to the shooting scene and the shooting object;
Determining feature information corresponding to the shooting scene and feature information corresponding to the shooting object according to the feature information corresponding to the first image;
Determining the scene classification result based on the feature information corresponding to the shooting scene through a first output layer in the multi-task learning model;
and determining the target detection result based on the characteristic information corresponding to the shooting object through a second output layer in the multi-task learning model.
3. The method according to claim 2, wherein the determining the feature information corresponding to the shooting scene and the feature information corresponding to the shooting object according to the feature information corresponding to the first image includes:
determining a first weight and a second weight;
determining the characteristic information corresponding to the shooting object according to the characteristic information corresponding to the first image and the first weight;
And determining the feature information corresponding to the shooting scene according to the feature information corresponding to the first image and the second weight.
4. A method according to claim 3, wherein the characteristic information comprises characteristic information in a channel dimension and characteristic information in a space dimension, and wherein determining the first weight and the second weight comprises:
performing first processing on the feature information corresponding to the first image in the channel dimension, and determining a first processing result, wherein the first processing result is used for representing the weight, in the channel dimension, of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image;
performing second processing on the feature information corresponding to the first image in the spatial dimension, and determining a second processing result, wherein the second processing result is used for representing the weight, in the spatial dimension, of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image;
determining the first weight according to the first processing result and the second processing result, wherein the first weight is used for representing the weight, in both the channel dimension and the spatial dimension, of the feature information corresponding to the shooting object relative to the feature information corresponding to the first image;
and determining the second weight according to the first weight.
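Claims 3 and 4 read like a channel-plus-spatial attention block that produces a first weight for the object branch and then derives the second weight from it. The sketch below assumes a CBAM-style structure and a complementary second weight (one minus the first); both are assumptions for illustration rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class FeatureSplitAttention(nn.Module):
    """Sketch of claims 3-4: a channel-dimension weight and a spatial-dimension
    weight are combined into a first weight for the object branch; the second
    weight (scene branch) is taken as its complement (assumed)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        # First processing: channel-dimension weight.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Second processing: spatial-dimension weight from channel-pooled maps.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid()
        )

    def forward(self, feats):
        channel_w = self.channel_mlp(feats)                      # (N, C, 1, 1)
        pooled = torch.cat(
            [feats.mean(dim=1, keepdim=True), feats.amax(dim=1, keepdim=True)], dim=1
        )
        spatial_w = self.spatial_conv(pooled)                    # (N, 1, H, W)
        first_weight = channel_w * spatial_w                     # joint channel+spatial weight
        second_weight = 1.0 - first_weight                       # complementary weight (assumed)
        object_feats = feats * first_weight                      # features for the detection branch
        scene_feats = feats * second_weight                      # features for the scene branch
        return scene_feats, object_feats

feats = torch.randn(1, 64, 56, 56)
scene_feats, object_feats = FeatureSplitAttention(64)(feats)
```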
5. The method according to any one of claims 2 to 4, wherein the determining, by the first output layer in the multi-task learning model, the scene classification result based on the feature information corresponding to the shooting scene comprises:
inputting the feature information corresponding to the shooting scene into the first output layer in the multi-task learning model, and outputting the scene classification result through the first output layer, wherein the scene classification result comprises a first judgment result and a category corresponding to the shooting scene, and the first judgment result is used for representing whether the shooting scene belongs to a preset scene.
6. The method according to claim 5, wherein the determining target exposure parameters according to the scene classification result and/or the target detection result comprises:
in a case that the first judgment result indicates that the shooting scene belongs to the preset scene, determining the target exposure parameters according to the category corresponding to the shooting scene.
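Claims 5 and 6 amount to: if the first output layer judges the scene to be a preset scene, pick exposure parameters from the scene category. A hypothetical lookup-table version of that policy is sketched below; the scene names and parameter values are invented for illustration.

```python
# Sketch of claim 6: when the first judgment result says the shooting scene is a
# preset scene, the target exposure parameters are looked up from the scene
# category. The table values below are purely illustrative assumptions.

PRESET_SCENE_EXPOSURE = {
    "night":     {"exposure_time_ms": 33.0, "iso": 1600, "ev_bias": +0.7},
    "backlight": {"exposure_time_ms": 8.0,  "iso": 200,  "ev_bias": +1.0},
    "snow":      {"exposure_time_ms": 4.0,  "iso": 100,  "ev_bias": +0.3},
}

def exposure_from_scene(scene_result, current_params):
    # scene_result is assumed to carry the first judgment result and the category.
    if scene_result["is_preset_scene"]:
        return PRESET_SCENE_EXPOSURE.get(scene_result["category"], current_params)
    return current_params
```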
7. The method according to claim 5, wherein the determining, by the second output layer in the multi-task learning model, the target detection result based on the feature information corresponding to the shooting object comprises:
inputting the feature information corresponding to the shooting object into the second output layer in the multi-task learning model, and outputting the target detection result through the second output layer, wherein the target detection result comprises the category of the shooting object and the position information of the shooting object in the first image.
8. The method according to claim 7, wherein the determining target exposure parameters according to the scene classification result and/or the target detection result comprises:
in a case that the first judgment result indicates that the shooting scene does not belong to the preset scene and the category of the shooting object belongs to a preset category, determining the target exposure parameters according to the category of the shooting object and the position information of the shooting object in the first image.
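Claims 7 and 8 cover the fallback path: when no preset scene is recognised but a detected object of a preset category is present, the exposure is derived from that object's category and position. The sketch below assumes a simple metering policy (drive the mean luminance inside the bounding box toward mid-gray); the category set and target value are illustrative, not taken from the patent.

```python
import numpy as np

# Sketch of claim 8: exposure driven by the region of a detected object whose
# category belongs to an assumed preset set.

PRESET_OBJECT_CATEGORIES = {"person", "pet", "document"}

def exposure_from_object(image, detection, current_params, target_luma=118.0):
    if detection["category"] not in PRESET_OBJECT_CATEGORIES:
        return current_params
    x1, y1, x2, y2 = detection["bbox"]                 # position information in the first image
    roi = image[y1:y2, x1:x2]
    roi_luma = float(roi.mean()) + 1e-6                # avoid division by zero
    gain = target_luma / roi_luma                      # push the object toward mid-gray
    return {
        "exposure_time_ms": current_params["exposure_time_ms"] * gain,
        "iso": current_params["iso"],
    }

image = np.random.randint(0, 256, (480, 640), dtype=np.uint8)   # dummy luminance plane
params = exposure_from_object(image,
                              {"category": "person", "bbox": (120, 80, 360, 420)},
                              {"exposure_time_ms": 10.0, "iso": 100})
```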
9. A method of training a model, the method comprising:
acquiring a training set of a multi-task learning model, wherein the training set comprises one or more of a training image, a scene classification label corresponding to the training image, and a target detection label corresponding to the training image, the scene classification label is used for representing a shooting scene corresponding to the training image, and the target detection label is used for representing a shooting object contained in the training image;
and training the multi-task learning model by taking the training image as the input of the multi-task learning model, taking the scene classification label as the output of a first output layer in the multi-task learning model, and taking the target detection label as the output of a second output layer in the multi-task learning model, so as to obtain a trained multi-task learning model, wherein the multi-task learning model is used for determining a shooting scene corresponding to an image and a shooting object contained in the image.
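The training set of claim 9 pairs each training image with a scene classification label and a target detection label. A minimal PyTorch Dataset sketch is shown below; the record layout (field names, box format) is an assumption for illustration.

```python
import torch
from torch.utils.data import Dataset

class MultiTaskTrainingSet(Dataset):
    """Sketch of the training set in claim 9: each sample pairs a training image
    with a scene classification label and a target detection label."""

    def __init__(self, records):
        # records: list of dicts like
        # {"image": Tensor[3,H,W], "scene": int, "boxes": Tensor[K,4], "classes": Tensor[K]}
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        r = self.records[idx]
        return r["image"], r["scene"], {"boxes": r["boxes"], "classes": r["classes"]}

dummy = [{"image": torch.randn(3, 224, 224), "scene": 2,
          "boxes": torch.tensor([[0.1, 0.2, 0.5, 0.6]]), "classes": torch.tensor([1])}]
dataset = MultiTaskTrainingSet(dummy)
```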
10. The method of claim 9, wherein training the multi-task learning model to obtain a trained multi-task learning model comprises:
determining feature information corresponding to the training image through the multi-task learning model, wherein the feature information corresponding to the training image comprises feature information corresponding to the shooting scene and the shooting object;
determining third feature information and fourth feature information according to the feature information corresponding to the training image, wherein the third feature information is the feature information corresponding to the shooting scene in the feature information corresponding to the training image, and the fourth feature information is the feature information corresponding to the shooting object in the feature information corresponding to the training image;
and training the multi-task learning model based on the third feature information and the fourth feature information to obtain the trained multi-task learning model.
11. The method of claim 10, wherein the training the multi-task learning model based on the third feature information and the fourth feature information to obtain the trained multi-task learning model comprises:
determining, through the first output layer, a scene classification result corresponding to the training image according to the third feature information;
determining, through the second output layer, a target detection result corresponding to the training image according to the fourth feature information;
and training the multi-task learning model based on a comparison between the scene classification result corresponding to the training image and the scene classification label, and a comparison between the target detection result corresponding to the training image and the target detection label, to obtain the trained multi-task learning model.
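Claims 10 and 11 describe joint training: the first output layer's prediction is compared with the scene classification label, the second output layer's prediction with the target detection label, and the model is updated from both comparisons. The sketch below assumes the MultiTaskModel class from the earlier sketch is in scope and reduces the detection loss to a single regression term against a dummy target map, which is a simplification rather than the patent's loss design.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, image, scene_label, det_target):
    scene_logits, det_map = model(image.unsqueeze(0))
    scene_loss = F.cross_entropy(scene_logits, scene_label.unsqueeze(0))   # compare with scene classification label
    det_loss = F.mse_loss(det_map, det_target)                             # compare with (simplified) detection target
    loss = scene_loss + det_loss                                           # joint multi-task objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = MultiTaskModel()                                   # from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
image = torch.randn(3, 224, 224)
scene_label = torch.tensor(2)
det_target = torch.zeros(1, 72, 56, 56)                    # matches det_head output for a 224x224 input
loss_value = train_step(model, optimizer, image, scene_label, det_target)
```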
12. An electronic device comprising one or more processors and one or more memories, wherein the one or more memories are coupled to the one or more processors, the one or more memories to store computer program code comprising computer instructions that the one or more processors invoke to cause the electronic device to perform the method of any of claims 1-11.
13. A chip system applied to an electronic device, the chip system comprising one or more processors configured to invoke computer instructions to cause the electronic device to perform the method of any one of claims 1 to 11.
14. A computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1 to 11.
15. A computer readable storage medium comprising instructions which, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1 to 11.
CN202411556598.2A 2024-11-04 2024-11-04 Image processing method, model training method and related device Pending CN119052654A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411556598.2A CN119052654A (en) 2024-11-04 2024-11-04 Image processing method, model training method and related device

Publications (1)

Publication Number Publication Date
CN119052654A true CN119052654A (en) 2024-11-29

Family

ID=93576221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411556598.2A Pending CN119052654A (en) 2024-11-04 2024-11-04 Image processing method, model training method and related device

Country Status (1)

Country Link
CN (1) CN119052654A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119854628A (en) * 2025-03-18 2025-04-18 杭州老板电器股份有限公司 Image shooting interaction method and shooting terminal

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101904166A (en) * 2007-12-19 2010-12-01 伊斯曼柯达公司 Camera using preview image to select exposure
CN106845549A (en) * 2017-01-22 2017-06-13 珠海习悦信息技术有限公司 A kind of method and device of the scene based on multi-task learning and target identification
CN111770285A (en) * 2020-07-13 2020-10-13 浙江大华技术股份有限公司 Exposure brightness control method and device, electronic equipment and storage medium
CN112651395A (en) * 2021-01-11 2021-04-13 上海优扬新媒信息技术有限公司 Image processing method and device
CN114627284A (en) * 2022-03-29 2022-06-14 上海商汤科技开发有限公司 Target detection method and device, electronic equipment and storage medium
CN117726949A (en) * 2023-09-14 2024-03-19 安徽大学 Minimum object detection method and system based on scene information decoupling guidance
CN118245844A (en) * 2024-04-10 2024-06-25 中国工商银行股份有限公司 Service processing method, device, electronic equipment, storage medium and program product
CN118657971A (en) * 2024-03-26 2024-09-17 马上消费金融股份有限公司 Model training method, image recognition method and related products

Similar Documents

Publication Publication Date Title
CN108399349B (en) Image recognition method and device
US12315243B2 (en) Image detection method and apparatus, and electronic device
CN111079576A (en) Living body detection method, living body detection device, living body detection equipment and storage medium
CN107844781A (en) Face character recognition methods and device, electronic equipment and storage medium
WO2021078001A1 (en) Image enhancement method and apparatus
CN113066048A (en) A method and device for determining the confidence level of a segmentation map
CN113170037A (en) A method and electronic device for capturing long exposure images
CN113536834B (en) Eye bag detection method and device
CN111738365B (en) Image classification model training method and device, computer equipment and storage medium
CN113810603A (en) Point light source image detection method and electronic device
CN119052654A (en) Image processing method, model training method and related device
CN117274109B (en) Image processing method, noise reduction model training method and electronic equipment
WO2022127609A1 (en) Image processing method and electronic device
CN117671669A (en) Image recognition method, device, electronic equipment and readable storage medium
KR102763586B1 (en) Apparatus and method for providing image
CN116055712A (en) Method, device, chip, electronic equipment, and medium for determining film yield
CN113343709B (en) Method for training intention recognition model, method, device and equipment for intention recognition
CN118450130B (en) Image processing method and related device
CN117499797A (en) Image processing method and related equipment
CN117372748A (en) Meat identification method, meat identification device, computer equipment and storage medium
CN118396857A (en) Image processing method and electronic equipment
CN117953508A (en) OCR (optical character recognition) method for text image, electronic equipment and medium
CN111068333B (en) Video-based carrier abnormal state detection method, device, equipment and medium
WO2022179271A1 (en) Search result feedback method and device, and storage medium
CN112712377A (en) Product defect arrangement and collection management database platform system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Unit 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong 518040
Applicant after: Honor Terminal Co.,Ltd.
Country or region after: China
Address before: 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong
Applicant before: Honor Device Co.,Ltd.
Country or region before: China