
CN111401546A - Training method of neural network model and its medium and electronic device - Google Patents


Info

Publication number
CN111401546A
CN111401546A
Authority
CN
China
Prior art keywords
network layer
weights
initial
weight
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010086380.0A
Other languages
Chinese (zh)
Other versions
CN111401546B (en)
Inventor
刘默翰
周力
白立勋
石文元
俞清华
隋志成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010086380.0A
Publication of CN111401546A
Application granted
Publication of CN111401546B
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks


Abstract

The application relates to the technical field of neural networks, and discloses a training method for a neural network model, together with a medium and an electronic device. The training method comprises the following steps: the first of the n network layers acquires sample data and inputs it into the second network layer; for the i-th of the n network layers, the following operations are performed: when i = 2, the output data of the i-th network layer is obtained based on the initial input data and a plurality of initial weights of the i-th network layer; when 2 < i ≤ n, the output data of the i-th network layer is obtained based on the output data of the (i-1)-th network layer and a plurality of initial weights of the i-th network layer, where the plurality of initial weights of the i-th network layer are derived from m discrete values. By setting the initial weights of the neural network model to low-bit discrete values, the method effectively avoids the vanishing-gradient problem during low-bit weight training and accelerates convergence of the neural network model.

Description

Training method of neural network model, medium thereof, and electronic device
Technical Field
The present disclosure relates to the field of neural network technologies, and in particular, to a training method for a neural network model, a medium and an electronic device thereof.
Background
A neural network model is a computational model formed by a large number of interconnected nodes (or neurons). A common neural network model includes an input layer, an output layer, and a plurality of hidden layers. The input to each node of each layer is typically weighted, producing a weighted sum (or other weighted operation result) at the node. The weights of each layer may be adjusted during training.
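As a minimal illustration of such a weighted operation (the helper name and the numbers are ours, not from the patent):

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """Weighted sum at a single node; an activation function would
    normally be applied to this result afterwards."""
    return float(np.dot(inputs, weights) + bias)

# A node receiving three inputs through three trainable weights.
y = node_output([1.0, 2.0, 3.0], [0.5, -0.5, 1.0])  # 0.5 - 1.0 + 3.0 = 2.5
```

Training adjusts the `weights` vector so that the node outputs move toward the expected results.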
During training of a traditional neural network model, the weights are initialized randomly in each training run. The weights of a traditional neural network model are generally floating-point numbers within a certain value range, and random initialization means that training starts from arbitrary floating-point numbers within that range. The large number of floating-point values and the repeated training runs make training such a model require a long time.
Disclosure of Invention
Embodiments of the present application provide a training method for a neural network model, together with a medium and an electronic device.
In a first aspect, an embodiment of the present application provides a method for training a neural network model, where the neural network model includes n network layers, n being a positive integer greater than 1. The method comprises:

the first network layer of the n network layers acquires sample data and inputs the sample data into the second network layer, where the sample data comprises initial input data and expected result data;

for the i-th network layer of the n network layers, the following operations are performed:

when i = 2, the output data of the i-th network layer is obtained based on the initial input data and a plurality of initial weights of the i-th network layer;

when 2 < i ≤ n, the output data of the i-th network layer is obtained based on the output data of the (i-1)-th network layer and the plurality of initial weights of the i-th network layer;

where the plurality of initial weights of the i-th network layer are derived from m discrete values, the value range of each initial weight is [-1, 1], and m ∈ {2, 3}, i.e., there are either two or three discrete values;

the plurality of initial weights of the i-th network layer are adjusted based on the error between the output data of the n-th network layer and the expected result data in the sample data.
For example, the plurality of initial weights of the i-th network layer can be set to {-1, 1} or {-1, 0, 1}. That is, in this embodiment, in order to constrain the final weights to 1 and -1 and convert multiplication operations into bitwise XNOR operations, thereby reducing memory access and occupancy, the plurality of initial weights of the neural network model are set to the discrete values {-1, 1} or {-1, 0, 1}, accelerating the convergence of the model while avoiding vanishing gradients.
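A minimal sketch of this low-bit initialization (the function name and shapes are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def init_discrete_weights(shape, values=(-1.0, 1.0), rng=None):
    """Draw every initial weight uniformly from a small set of discrete
    low-bit values, e.g. {-1, 1} (binary) or {-1, 0, 1} (ternary)."""
    rng = np.random.default_rng(0) if rng is None else rng
    return rng.choice(values, size=shape).astype(np.float32)

w_bin = init_discrete_weights((64, 64))                           # only -1 and 1
w_ter = init_discrete_weights((64, 64), values=(-1.0, 0.0, 1.0))  # -1, 0 and 1
```

Every initial weight is thus one of m discrete values rather than an arbitrary floating-point number.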
In a possible implementation of the first aspect, the method further includes: each of the plurality of initial weights of the i-th network layer is one of the m discrete values.
In a possible implementation of the first aspect, the method further includes: the m discrete values are -1 and 1, and the plurality of initial weights of the i-th network layer have a mean of 0 and a variance of 1.
In a possible implementation of the first aspect, the method further includes: the m discrete values are -1, 0 and 1, and the plurality of initial weights of the i-th network layer have a mean of 0 and a variance of 2/3.
In a possible implementation of the first aspect, the method further includes: the i-th network layer has p initial weights, each of which is calculated by the following formula:

W = α · W^b

where W^b is one of the m discrete values, with a value range of -1 ≤ W^b ≤ 1, one value W^b corresponding to each of the p initial weights, and α is a scaling factor. If a value were simply selected from the discrete values as an initial weight, the distributions of the input data and the output data of the network layers might not be consistent; therefore, in order to make these distributions substantially consistent, a scaling factor is set here, obtained by a normalization method, to scale the variance of the initial weights of the neural network model so that signals can propagate to deeper layers.
In a possible implementation of the first aspect, the method further includes: the variance of the p values W^b corresponding to the p initial weights is 1, and the m discrete values are -1 and 1.
In a possible implementation of the first aspect, the method further includes: the variance of the p values W^b corresponding to the p initial weights is 2/3, and the m discrete values are -1, 0 and 1.
In a possible implementation of the first aspect, the method further includes: the scaling factor is obtained by the following formula:

α = sqrt( p / ( l_i · Σ_{j=1}^{p} (W_j^b - mean(W^b))² ) )

where W_j^b is the discrete value corresponding to the j-th of the p initial weights, mean(W^b) is the average of the p values W^b of the i-th network layer, and l_i is the number of input channels of the i-th network layer.
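Under this reading, the scaling factor normalizes the sample variance of the discrete values against the layer's fan-in; a sketch (the formula reconstruction and names are our assumptions):

```python
import math

def scaling_factor(discrete_values, n_in):
    """alpha = sqrt(p / (l_i * sum_j (W_j - mean(W))^2)): rescales the
    discrete weights so their variance becomes roughly 1 / l_i."""
    p = len(discrete_values)
    mean = sum(discrete_values) / p
    sum_sq = sum((w - mean) ** 2 for w in discrete_values)
    return math.sqrt(p / (n_in * sum_sq))

# Balanced binary values: mean 0, sample variance 1, so alpha = 1/sqrt(l_i).
alpha = scaling_factor([-1.0, -1.0, 1.0, 1.0], n_in=16)  # 0.25
```

For binary weights with unit variance this reduces to 1/sqrt(l_i), i.e., a fan-in normalization.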
In a possible implementation of the first aspect, the method further includes: the plurality of initial weights of the i-th network layer are calculated by the following formula:

W = α · W^t

where W^t is any one of the plurality of weights determined by a previous training of the i-th network layer, with a value range of -1 ≤ W^t ≤ 1, and α is a scaling factor. If a value were simply selected from the discrete values as an initial weight, the distributions of the input data and the output data of the network layers might not be consistent; therefore, in order to make these distributions substantially consistent, a scaling factor is set here, obtained by a normalization method, to scale the variance of the initial weights of the neural network model so that signals can propagate to deeper layers.
In a possible implementation of the first aspect, the method further includes: the scaling factor is obtained by the following formula:

α = sqrt( p / Σ_{j=1}^{p} (W_j^t - mean(W^t))² )

where p is the number of weights determined by the previous training of the i-th network layer, W_j^t is the j-th of these weights, and mean(W^t) is the average of the p weights determined by the previous training.
In a possible implementation of the first aspect, the method further includes: the scaling factor is obtained by the following formula:

α = sqrt( 2 / (l_i + l_{i+1}) )

where l_i is the number of input channels of the i-th network layer and l_{i+1} is the number of input channels of the (i+1)-th network layer.
In a second aspect, an embodiment of the present application provides a method for training a neural network model, where the neural network model includes n network layers and has already converged, n being a positive integer greater than 1. The method applies low-bit quantization to the trained full-precision weights of the converged neural network model, so as to convert multiplication operations into bitwise XNOR operations and reduce memory access and occupancy. Specifically, the method comprises:

the first network layer of the n network layers acquires sample data and inputs the sample data into the second network layer, where the sample data comprises initial input data and expected result data;

for the i-th network layer of the n network layers, the following operations are performed:

when i = 2, the sign of each of the plurality of full-precision weights of the i-th network layer is taken to obtain a plurality of initial weights of the i-th network layer, and the output data of the i-th network layer is obtained based on the initial input data and the plurality of initial weights;

when 2 < i ≤ n, the sign of each of the plurality of full-precision weights of the i-th network layer is taken to obtain a plurality of initial weights of the i-th network layer, and the output data of the i-th network layer is obtained based on the output data of the (i-1)-th network layer and the plurality of initial weights;

where the plurality of initial weights of the i-th network layer are derived from m discrete values, the value range of each initial weight is [-1, 1], and m ∈ {2, 3}, i.e., there are either two or three discrete values;

the plurality of initial weights of the i-th network layer are adjusted based on the error between the output data of the n-th network layer and the expected result data in the sample data.
In a possible implementation of the second aspect, the method further includes: the m discrete values are -1 and 1, and the plurality of initial weights of the i-th network layer have a mean of 0 and a variance of 1; and taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights comprises:

if a full-precision weight is less than or equal to 0, taking -1 as the initial weight corresponding to that full-precision weight;

if a full-precision weight is greater than 0, taking 1 as the initial weight corresponding to that full-precision weight.

That is, the sign of each trained full-precision weight in the converged neural network model is taken, converting it into one of 1 and -1.
In a possible implementation of the second aspect, the method further includes: the m discrete values are -1, 0 and 1, and the plurality of initial weights of the i-th network layer have a mean of 0 and a variance of 2/3; and taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights comprises:

if a full-precision weight is less than 0, taking -1 as the initial weight corresponding to that full-precision weight;

if a full-precision weight is equal to 0, taking 0 as the initial weight corresponding to that full-precision weight;

if a full-precision weight is greater than 0, taking 1 as the initial weight corresponding to that full-precision weight.

That is, the sign of each trained full-precision weight in the converged neural network model is taken, converting it into one of 1, 0 and -1.
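The two sign-taking rules can be sketched as follows (the function names are ours; the thresholds follow the text above):

```python
import numpy as np

def binarize(w):
    """Binary rule: w <= 0 -> -1, w > 0 -> 1."""
    return np.where(w > 0, 1.0, -1.0)

def ternarize(w):
    """Ternary rule: w < 0 -> -1, w == 0 -> 0, w > 0 -> 1."""
    return np.sign(w)

full_precision = np.array([-0.7, 0.0, 0.3])
b = binarize(full_precision)   # [-1., -1.,  1.]
t = ternarize(full_precision)  # [-1.,  0.,  1.]
```

Note that the two rules differ only at exactly 0: the binary rule maps it to -1, while the ternary rule keeps it at 0.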
In a possible implementation of the second aspect, the method further includes: the m discrete values are -1 and 1, and taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights comprises:

if a full-precision weight is less than or equal to 0, taking the product of -1 and a scaling factor as the initial weight corresponding to that full-precision weight;

if a full-precision weight is greater than 0, taking the product of 1 and the scaling factor as the initial weight corresponding to that full-precision weight;

where the scaling factor is a positive number smaller than 1 and is used to adjust the distribution of the output data of the i-th network layer.

That is, the sign of each trained full-precision weight in the converged neural network model is taken and converted into the product of one of 1 and -1 with the scaling factor. If a value were simply selected from 1 and -1 as the initial weight after taking the sign, the distributions of the input data and the output data of the network layer might not be consistent; therefore, in order to make these distributions substantially consistent, a scaling factor is set here, obtained by a normalization method, to scale the variance of the initial weights of the neural network model so that signals can propagate to deeper layers.
In a possible implementation of the second aspect, the method further includes: the m discrete values are -1, 0 and 1, and taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights comprises:

if a full-precision weight is less than 0, taking the product of -1 and a scaling factor as the initial weight corresponding to that full-precision weight;

if a full-precision weight is equal to 0, taking 0 as the initial weight corresponding to that full-precision weight;

if a full-precision weight is greater than 0, taking the product of 1 and the scaling factor as the initial weight corresponding to that full-precision weight;

where the scaling factor is a positive number smaller than 1 and is used to adjust the distribution of the output data of the i-th network layer.

That is, the sign of each trained full-precision weight in the converged neural network model is taken and converted into the product of one of -1, 0 and 1 with the scaling factor (a weight equal to 0 remains 0). If a value were simply selected from -1, 0 and 1 as the initial weight after taking the sign, the distributions of the input data and the output data of the network layer might not be consistent; therefore, in order to make these distributions substantially consistent, a scaling factor is set here, obtained by a normalization method, to scale the variance of the initial weights of the neural network model so that signals can propagate to deeper layers.
In a possible implementation of the second aspect, the method further includes: the scaling factor is obtained by the following formula:

α = sqrt( 2 / (l_i + l_{i+1}) )

where α is the scaling factor, l_i is the number of input channels of the i-th network layer, and l_{i+1} is the number of input channels of the (i+1)-th network layer.
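This form matches a Xavier-style normalization over fan-in and fan-out; a sketch under that reading (the function name is ours):

```python
import math

def xavier_style_alpha(l_in, l_out):
    """alpha = sqrt(2 / (l_i + l_{i+1})): shrinks the +/-1 initial weights
    so the output variance stays roughly constant from layer to layer."""
    return math.sqrt(2.0 / (l_in + l_out))

alpha = xavier_style_alpha(128, 128)  # sqrt(2/256), about 0.0884
```

Because l_in + l_out is at least 2 for any real layer, the resulting α is a positive number smaller than 1, as the text requires.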
In a possible implementation of the second aspect, the method further includes: the scaling factor is obtained by the following formula:

α = sqrt( p / ( l_i · Σ_{j=1}^{p} (W_j^z - mean(W^z))² ) )

where α is the scaling factor, p is the number of initial weights of the i-th network layer, W_j^z is one of -1 and 1 and corresponds to the j-th of the p initial weights, the p values W_j^z have a mean of 0 and a variance of 1, mean(W^z) is the average of the p values W_j^z, and l_i is the number of input channels of the i-th network layer.
In a possible implementation of the second aspect, the method further includes: the scaling factor is obtained by the following formula:

α = sqrt( p / ( l_i · Σ_{j=1}^{p} (W_j^q - mean(W^q))² ) )

where α is the scaling factor, p is the number of initial weights of the i-th network layer, W_j^q is one of -1, 0 and 1 and corresponds to the j-th of the p initial weights, the p values W_j^q have a mean of 0 and a variance of 2/3, mean(W^q) is the average of the p values W_j^q, and l_i is the number of input channels of the i-th network layer.
In a possible implementation of the second aspect, the method further includes: the sample data comprises image data, and the neural network model is used for image recognition.
In a third aspect, an embodiment of the present application provides an electronic device for training a neural network model, including:

a first data acquisition module, configured to acquire sample data and input the sample data into the second network layer, where the sample data comprises initial input data and expected result data;

a first data processing module, configured to perform the following operations:

when i = 2, obtaining the output data of the i-th network layer based on the initial input data and a plurality of initial weights of the i-th network layer;

when 2 < i ≤ n, obtaining the output data of the i-th network layer based on the output data of the (i-1)-th network layer and the plurality of initial weights of the i-th network layer;

where the plurality of initial weights of the i-th network layer are derived from m discrete values, the value range of each initial weight is [-1, 1], and m ∈ {2, 3};

a first weight adjustment module, configured to adjust the plurality of initial weights of the i-th network layer based on the error between the output data of the n-th network layer and the expected result data in the sample data.
In a fourth aspect, an embodiment of the present application provides an electronic device for training a neural network model, including:

a second data acquisition module, configured to acquire sample data and input the sample data into the second network layer, where the sample data comprises initial input data and expected result data;

a second data processing module, configured to perform the following operations:

when i = 2, taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain a plurality of initial weights of the i-th network layer, and obtaining the output data of the i-th network layer based on the initial input data and the plurality of initial weights;

when 2 < i ≤ n, taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain a plurality of initial weights of the i-th network layer, and obtaining the output data of the i-th network layer based on the output data of the (i-1)-th network layer and the plurality of initial weights;

where the plurality of initial weights of the i-th network layer are derived from m discrete values, the value range of each initial weight is [-1, 1], and m ∈ {2, 3};

a second weight adjustment module, configured to adjust the plurality of initial weights of the i-th network layer based on the error between the output data of the n-th network layer and the expected result data in the sample data.
In a fifth aspect, an embodiment of the present application provides a computer-readable medium, where instructions are stored on the computer-readable medium, and when the instructions are executed on a computer, the instructions cause the computer to perform a method for training a neural network model according to any one of the first and second aspects.
In a sixth aspect, an embodiment of the present application provides an electronic device, including:
a memory, configured to store instructions to be executed by one or more processors of the system; and

a processor, which is one of the processors of the system, configured to perform the method for training a neural network model according to any one of the first and second aspects.
Drawings
FIG. 1 illustrates a block diagram of an electronic device, according to some embodiments of the present application;
FIG. 2 is a schematic diagram of a neural network model;
FIG. 3 illustrates a schematic diagram of a computational process of a node of a neural network model, according to some embodiments of the present application;
FIG. 4 is a graph of the output distributions of the activation functions of several layers near the output layer in a neural network model randomly initialized with a Gaussian distribution of mean 0 and variance 1;
FIG. 5(a) is a weight distribution graph illustrating 1-bit weight initialization of a convolutional neural network model, according to some embodiments of the present application;
FIG. 5(b) is a weight distribution graph at an intermediate point during convergence, after the convolutional neural network model is initialized with 1-bit weights, according to some embodiments of the present application;
FIG. 5(c) is a weight distribution graph after model convergence, for a convolutional neural network model trained after 1-bit weight initialization, according to some embodiments of the present application;
FIG. 6(a) is a graph illustrating training of a 1-bit fixed-point quantization model using the Xavier initialization function;
FIG. 6(b) is a graph illustrating training of a 1-bit fixed-point quantization model using the weight initialization method illustrated in FIG. 2, according to some embodiments of the present application;
FIG. 7 illustrates a schematic diagram of an electronic device for training a neural network model, according to some embodiments of the present application;
FIG. 8 illustrates a schematic structural diagram of another electronic device for training a neural network model, in accordance with some embodiments of the present application;
fig. 9 illustrates a schematic structural diagram of an electronic device, according to some embodiments of the present application.
Detailed Description
Illustrative embodiments of the present application include, but are not limited to, a weight initialization method, apparatus, medium, and electronic device for a neural network.
It is to be appreciated that as used herein, the term module may refer to or include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality, or may be part of such hardware components.
It is to be appreciated that in various embodiments of the present application, the processor may be a microprocessor, a digital signal processor, a microcontroller, or the like, and/or any combination thereof. According to another aspect, the processor may be a single-core processor, a multi-core processor, the like, and/or any combination thereof.
Embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It is to be understood that the neural network model provided in the present application may be any artificial neural network model that employs multiply-add operations, such as a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), or a Binary Neural Network (BNN).
It is to be appreciated that the method of weight initialization for neural networks provided herein can be implemented on a variety of electronic devices including, but not limited to, a server, a distributed server cluster of multiple servers, a cell phone, a tablet, a laptop, a desktop computer, a wearable device, a head mounted display, a mobile email device, a portable game console, a portable music player, a reader device, a personal digital assistant, a virtual reality or augmented reality device, a television or other electronic device having one or more processors embedded or coupled therein, and the like.
In particular, the weight initialization of the neural network provided by the present application is suitable for edge devices. Edge computing is a distributed open platform (architecture) that integrates core capabilities of networking, computation, storage, and applications at the edge of the network, close to the objects or data sources, and provides edge-intelligent services nearby; it can meet key requirements such as real-time business, data optimization, application intelligence, security, and privacy protection. For example, an edge device may be a device in a video surveillance system that performs edge computation on video data near the video data source (a network smart camera).
The following describes a weight initialization scheme of the neural network disclosed in the present application, taking the electronic device 100 as an example.
Fig. 1 illustrates a block diagram of an electronic device 100, according to some embodiments of the present application. Specifically, as shown in FIG. 1, the electronic device 100 includes one or more processors 104, system control logic 108 coupled to at least one of the processors 104, system memory 112 coupled to the system control logic 108, non-volatile memory (NVM) 116 coupled to the system control logic 108, and a network interface 120 coupled to the system control logic 108.
In some embodiments, the processor 104 may include one or more single-core or multi-core processors. In some embodiments, the processor 104 may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In embodiments where the electronic device 100 employs an enhanced Node B (eNB) or a Radio Access Network (RAN) controller, the processor 104 may be configured to perform various consistent embodiments.
In some embodiments, the processor 104 may be configured to invoke training information to train a neural network model. Specifically, for example, the processor 104 may obtain initialization information for the weights of the neural network model together with input data information (e.g., image information, voice information, etc.), and train the neural network model. The neural network model can be quantized into a binary network or a ternary network, and the weights of the neural network model can be set to preset discrete values. During the training of each layer of the neural network model, the processor 104 continuously adjusts the weights according to the obtained training information until the model converges. The processor 104 may also periodically update the neural network model to better accommodate changes in its various practical requirements.
In some embodiments, system control logic 108 may include any suitable interface controllers to provide any suitable interface to at least one of processors 104 and/or any suitable device or component in communication with system control logic 108.
In some embodiments, system control logic 108 may include one or more memory controllers to provide an interface to system memory 112. System memory 112 may be used to load and store data and/or instructions. In some embodiments, memory 112 of electronic device 100 may comprise any suitable volatile memory, such as Dynamic Random Access Memory (DRAM). In some embodiments, system memory 112 may be used to load or store instructions that implement the neural network model described above, or instructions that implement an application that utilizes the neural network model described above.
NVM/memory 116 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, NVM/memory 116 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as at least one of a Hard Disk Drive (HDD), Compact Disc (CD) Drive, and Digital Versatile Disc (DVD) Drive. NVM/memory 116 may also be used to store the trained weights for the neural network model described above.
NVM/memory 116 may comprise a portion of a storage resource on a device on which electronic device 100 is installed, or it may be accessible by, but not necessarily a part of, the device. For example, NVM/storage 116 may be accessed over a network via network interface 120.
In particular, system memory 112 and NVM/storage 116 may each include: a temporary copy and a permanent copy of instructions 124. The instructions 124 may include: instructions that, when executed by at least one of the processors 104, cause the electronic device 100 to implement the method as shown in fig. 3. In some embodiments, the instructions 124, hardware, firmware, and/or software components thereof may additionally/alternatively be disposed in the system control logic 108, the network interface 120, and/or the processor 104.
Network interface 120 may include a transceiver to provide a radio interface for electronic device 100 to communicate with any other suitable device (e.g., front end module, antenna, etc.) over one or more networks. In some embodiments, the network interface 120 may be integrated with other components of the electronic device 100. For example, network interface 120 may be integrated with at least one of processor 104, system memory 112, NVM/storage 116, and a firmware device (not shown) having instructions that, when executed by at least one of the processors 104, cause electronic device 100 to implement the method shown in fig. 3.
The network interface 120 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, network interface 120 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
In some embodiments, at least one of the processors 104 may be packaged together with logic for one or more controllers of the system control logic 108 to form a System In Package (SiP). In some embodiments, at least one of the processors 104 may be integrated on the same die with logic for one or more controllers of the system control logic 108 to form a system on a chip (SoC).
The electronic device 100 may further include: input/output (I/O) devices 132. I/O device 132 may include a user interface to enable a user to interact with electronic device 100, and a peripheral component interface designed so that peripheral components can also interact with the electronic device 100. In some embodiments, the electronic device 100 further comprises a sensor for determining at least one of environmental conditions and location information associated with the electronic device 100.
It is understood that the method for weight initialization provided by the embodiments of the present application is applicable to example applications of neural network models including, but not limited to, image recognition in the field of machine vision, voice recognition, and the like.
In the following, according to some embodiments of the present application, a technical solution for training the neural network model 200 shown in fig. 2 by using the electronic device 100 shown in fig. 1 is described in detail, taking image recognition as an example (for performing facial recognition, recognizing facial features such as mouth shape, eyebrow feature, and eye feature in a face image, for example).
Specifically, as shown in fig. 2, the neural network model 200 includes n network layers: an input layer, a plurality of hidden layers, and an output layer. The first layer is called the input layer, the last layer is called the output layer, and the other layers are called hidden layers. Each layer has a number of nodes (e.g., s nodes for the input layer in fig. 2), each node having a corresponding weight. The n network layers are fully cross-connected, and the output of each layer is the input of the next adjacent layer. The calculation formula of each node in a network layer is as follows:
y=f(Wx+b)
where W is the weight, b is the bias, x is the input, y is the output, and f is the activation function.
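For instance, the node formula above can be sketched in a few lines of Python (a hypothetical illustration; tanh stands in for the unspecified activation function f, and the function name is not the patent's):

```python
import math

def node_output(x, w, b):
    """One node's output y = f(W*x + b); tanh is used as the activation f."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b  # weighted sum Wx + b
    return math.tanh(z)

# A node with three inputs:
y = node_output([1.0, -1.0, 0.5], [0.2, 0.4, -0.6], 0.1)
```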
The following describes in detail a specific process of training the neural network model 200 shown in fig. 2 by using sample images when performing image recognition, for example, face recognition.
In training the neural network model 200, a large amount of sample image data and expected result data may be input into the model 200. The image data of each sample image is input into the s nodes of the input layer of the neural network model 200 shown in fig. 2, passes through the calculation of the hidden layers, and is finally calculated by the output layer to generate face recognition result data. It should be noted that, when a large number of sample images are used to train the model, each complete training pass uses only one image. For example, if there are 1000 images in total, the first image is trained first; after the training of the first image is completed, the second image is trained, and so on, until the neural network model 200 converges. After each image training is completed, the face recognition result data finally output by the neural network model 200 is compared with the expected result data to calculate an error, a partial derivative is calculated according to the error, and the weight of each node in the network layers other than the input layer is adjusted based on the calculated partial derivative. In this way, the neural network model 200 is trained by inputting image data and continuously adjusting the weights; when the error between the face recognition result data finally output by the neural network model 200 and the expected result data is smaller than the error threshold, the neural network model 200 is determined to have converged.
In particular, FIG. 3 illustrates the computational process of the network layers in the neural network model 200 in some embodiments. As shown in fig. 3, the calculation process of each network layer in the neural network model 200 includes:
1. Computation of the input layer
Image data of sample image A is input to the input layer as input data.
2. Computation of hidden layers
a) The input layer outputs the image data of sample image A to the first hidden layer. For example, the input image data may be the color information of each pixel in the image (e.g., numbers between 0 and 255 in an RGB color space), and the image data is input into the s nodes of the input layer (the first layer) of the model shown in fig. 2 (i.e., inputs x1, x2 to xs).
b) And initializing the weight of each hidden layer to obtain an initial weight.
In embodiments of the present application, a low-bit weight initialization model is used to obtain the initial weights of each hidden layer; for example, a 1-bit or 2-bit weight initialization model is used. The weight value set in 1-bit weight initialization is {1, -1}, and the weight value set in 2-bit weight initialization is {1, 0, -1}. That is, when the 1-bit weight initialization model is used to initialize the weights, the initial weight of a node in the hidden layer is set to one of the two values 1 and -1, and all the weights of the hidden layer need to satisfy a distribution with a mean of 0 and a variance of 1. When the 2-bit weight initialization model is used to initialize the weights, the initial weight of a node in the hidden layer is set to one of the three values -1, 0, and 1, and all the weights of the hidden layer need to satisfy a distribution with a mean of 0 and a variance of 2/3.
Specifically, for example, in some embodiments, the floating-point type weights of the untrained neural network model may be initialized using the following 1-bit weight initialization method:
the initial weights of the neural network model 200 are quantized to 1 bit; each quantized initial weight takes one of the discrete values 1 or -1. Since a distribution with a mean of 0 and a variance of 1 must be satisfied, the numbers of weights taking 1 and -1 among all weights in the same network layer are substantially the same (e.g., 1 and -1 are sampled with equal probability generated by a uniform distribution function).
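A minimal sketch of this 1-bit sampling, assuming plain Python with equal-probability draws of 1 and -1 (function and parameter names are illustrative, not the patent's):

```python
import random

def init_1bit(num_weights, seed=0):
    """Draw each initial weight uniformly from {1, -1}; with equal
    probabilities the distribution has mean 0 and variance 1."""
    rng = random.Random(seed)
    return [1 if rng.random() < 0.5 else -1 for _ in range(num_weights)]

weights = init_1bit(1000)
```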
In addition, in other embodiments, the input and output distributions of each layer of the neural network model are kept substantially consistent to alleviate the problem of gradient vanishing during propagation to deeper layers. The discrete value Wb selected from {1, -1} for a node in the quantization process can be compressed based on a preset scaling factor to obtain the initial weight Winit of the neural network model.
The preset scaling factor is obtained by a normalization method to scale the variance of the weights of the neural network model, so that signals can propagate to deeper layers of the network.
For example, in some embodiments, the discrete values Wb that have already been chosen may be compressed as: Winit = α·Wb, where α is a scaling factor, generally a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer.
In some embodiments, the scaling factor α may be calculated as follows: α = √(2/(li + li+1)), where li represents the number of input channels of the i-th network layer of the neural network model, and li+1 represents the number of input channels of the (i+1)-th network layer. It will be appreciated that for the input layer, i is 1 here; i+1 denotes the network layer following the i-th network layer in the neural network model.
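Under that reading of the formula, the scaling and compression steps might look as follows (a sketch only; `xavier_like_alpha` and `compress` are illustrative names, and the fan-in-based α is the reconstruction given above, not a confirmed form):

```python
import math

def xavier_like_alpha(l_i, l_next):
    # alpha from the input-channel counts of layer i and layer i+1.
    return math.sqrt(2.0 / (l_i + l_next))

def compress(discrete_weights, alpha):
    # Winit = alpha * Wb for each discrete weight of the layer.
    return [alpha * w for w in discrete_weights]

alpha = xavier_like_alpha(64, 128)
w_init = compress([1, -1, 1], alpha)
```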
In some embodiments, assume that the i-th network layer has p initial weights, so that there are p selected discrete values Wb corresponding to the p initial weights. The scaling factor α may also be calculated as follows: α = 1/√(li·Var(Wb)), where Var(Wb) = (1/p)·Σj(Wjb - μ)², Wjb represents the discrete value, among the p values Wb, corresponding to the j-th of the p initial weights, μ is the average of the p values Wb of the i-th network layer, and li indicates the number of input channels of the i-th network layer.
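The variance-based variant reads, under the same assumptions (the 1/√(li·Var(Wb)) reconstruction above; names are illustrative), roughly as:

```python
import math

def variance_alpha(wb, l_i):
    """alpha = 1 / sqrt(l_i * Var(Wb)): scale so the layer's output
    variance is pulled toward 1 (illustrative reading of the formula)."""
    p = len(wb)
    mean = sum(wb) / p
    var = sum((w - mean) ** 2 for w in wb) / p
    return 1.0 / math.sqrt(l_i * var)

alpha = variance_alpha([1, -1] * 50, l_i=4)  # balanced {1,-1}: Var = 1, alpha = 0.5
```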
In this way, the scaling factor α of the weights of each network layer in the neural network model 200 is calculated, and α is multiplied by the weights of the corresponding network layer. In the training process of the neural network model, the input data of each layer and the compressed weights undergo a weighting operation, so that the input and output data of each layer are propagated forward; the distributions of the input and output data of each layer can thus be kept substantially consistent, alleviating the problem of gradient vanishing during propagation to deeper layers.
It is to be appreciated that the above method of calculating the scaling factor α is merely exemplary and not limiting, and in other embodiments, other normalization methods may be employed to calculate the scaling factor α.
In some embodiments, the weights of the neural network model may be initialized using a 2-bit weight initialization method as follows:
first, in some embodiments, the floating-point weights of the untrained neural network model may be quantized to 2 bits; each quantized initial weight takes one of the discrete values -1, 0, or 1. Since a distribution with a mean of 0 and a variance of 2/3 must be satisfied, the numbers of weights taking 1 and -1 among all weights in the same network layer are substantially the same (e.g., 1 and -1 are sampled with equal probability generated by a uniform distribution function).
In particular, in some other embodiments, to ensure that the input and output distributions of each layer of the neural network model remain substantially consistent and to alleviate the problem of gradient vanishing during propagation to deeper layers, the discrete value Wb selected from {-1, 0, 1} for a node in the quantization process can be compressed based on a preset scaling factor to obtain the initial weight of the neural network model. The preset scaling factor is obtained by a normalization method to scale the variance of the weights of the neural network model, so that signals can propagate to deeper layers. The quantized discrete values Wb may be compressed in a manner similar to the 1-bit weight initialization method described above to obtain the initial weights of the neural network model. For the specific compression method, please refer to the above; it is not repeated here.
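By analogy with the 1-bit sketch earlier, the 2-bit sampling over {-1, 0, 1} could be sketched as follows (illustrative names; equal probabilities over the three values give mean 0 and variance 2/3):

```python
import random

def init_2bit(num_weights, seed=0):
    """Draw each initial weight uniformly from {-1, 0, 1}; with equal
    probabilities the distribution has mean 0 and variance 2/3."""
    rng = random.Random(seed)
    return [rng.choice((-1, 0, 1)) for _ in range(num_weights)]

weights = init_2bit(1000)
```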
Because a full-precision neural network model occupies a large amount of storage space and floating-point multiply-add operations consume a large amount of computing resources, edge devices in particular, whose computing and storage resources are limited, generally cannot bear a large number of floating-point multiply-add operations or store a large number of floating-point numbers. 8-bit quantization is currently a common solution, but it supports at most 4x compression, and although integer arithmetic replaces floating-point arithmetic, the consumption of computing resources is still large. The 1-bit and 2-bit weight initialization methods described above apply low-bit quantization (1 and 2 bits) to the weights of a full-precision neural network model and far exceed the 8-bit quantization model in both storage space and computational efficiency, so they are very suitable for running on edge devices and can reduce device power consumption. Compared with existing initialization methods, using the 1-bit or 2-bit weight initialization model also allows the neural network model to converge more easily. As mentioned above, when training a binary neural network (BNN), it is desirable to limit the final weights to 1 and -1 and to convert multiplication operations into bitwise exclusive-OR operations to reduce memory access and occupancy; by directly setting the initial weights to {-1, 1} or {-1, 0, 1}, the solution of the present application can avoid the vanishing of the model gradient and accelerate convergence.
It will be appreciated that in some embodiments, in addition to 1-bit or 2-bit quantization of the initial weights of the neural network model, the inputs of the neural network model (e.g., the image data of sample image A) may also be quantized, so that the matrix multiplication between the original weights and the inputs can be equivalently replaced by an XNOR operation, which further accelerates convergence.
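The XNOR replacement rests on a simple identity: for two {+1, -1} vectors of length n, the dot product equals 2·popcount(XNOR(a, w)) - n. A small sketch of this equivalence (list-based for clarity; a real implementation would pack the bits into machine words):

```python
def xnor_dot(a, w):
    """Dot product of two {+1, -1} vectors via the XNOR/popcount identity.
    A match (XNOR bit = 1) contributes +1, a mismatch contributes -1."""
    n = len(a)
    matches = sum(1 for ai, wi in zip(a, w) if ai == wi)  # popcount of XNOR
    return 2 * matches - n

# Agrees with the ordinary dot product:
a, w = [1, -1, 1, 1], [1, 1, -1, 1]
assert xnor_dot(a, w) == sum(x * y for x, y in zip(a, w))
```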
It can be understood that the value range {-1, 1} of the above-mentioned 1-bit quantized discrete value Wb is exemplary only and not limiting. In some embodiments, the 1-bit quantized discrete value Wb may take other integer discrete values, for example {-2, 2} or {1, 100}. To satisfy the condition that the mean is 0 and the variance is 1, both {-2, 2} and {1, 100} are converted into {-1, 1} when calculating each network layer in the neural network model. After the calculation of each network layer in the neural network model is completed, the weights of each network layer in the model are restored according to a preset ratio, for example, 0.8 is restored to 90, and 0.5 is restored to 2.
It can be understood that the value range {-1, 0, 1} of the above-mentioned 2-bit quantized discrete value Wb is also exemplary only and not limiting. In some embodiments, the 2-bit quantized discrete value Wb may take other integer discrete values, such as {-2, 0, 2} or {0, 50, 100}. To satisfy the conditions of mean 0 and variance 2/3, both {-2, 0, 2} and {0, 50, 100} are converted into {-1, 0, 1} when computing the network layers in the neural network model. After the calculation of each network layer in the neural network model is completed, the weights of each network layer in the model are restored according to a preset ratio, for example, 1 is restored to 100.
c) A weighting operation is performed on the input data of each hidden layer and the corresponding initial weights. For example, the image data of sample image A is divided into s data blocks, the s data blocks are input as s input data to the s nodes of the input layer in FIG. 2, and a weighting operation is performed with the weights of the first hidden layer (for example, the weighting operation of the first node of the first hidden layer is: x1 × w11 + x2 × w12 + x3 × w13 + … + xs × w1s + b). d) The activation function performs an activation operation on the result of the weighting operation.
For example, in some embodiments, the activation function may be a Sigmoid function, a Tanh function, or the like. Specifically, the feature data of sample image A output by each node of the input layer and the initial weights of each node of the first hidden layer (h nodes shown in fig. 2) undergo a weighting operation, and the outputs of the nodes of the first hidden layer (i.e., the inputs of the second hidden layer) are generated by the activation function (e.g., Sigmoid function, Tanh function) of the first hidden layer. The inputs of the second hidden layer and the initial weights of the respective nodes of the second hidden layer (u nodes shown in fig. 2) undergo a weighting operation, and the outputs of the nodes of the second hidden layer (i.e., the inputs of the third hidden layer) are generated by the activation function of the second hidden layer. The computation proceeds in the same way, layer by layer, up to the (n-2)-th hidden layer (there are n-2 hidden layers in fig. 2): the inputs of the v nodes of the (n-2)-th hidden layer (the outputs of the (n-3)-th hidden layer) and the corresponding initial weights undergo a weighting operation, and the outputs of the nodes of the (n-2)-th hidden layer (i.e., the inputs of the output layer) are generated by the activation function of each node of the (n-2)-th hidden layer.
It will be appreciated that in some embodiments, after the initial weights of the neural network model and its inputs and outputs are quantized according to the 1-bit or 2-bit quantization method above, only the sign function or an equivalent function can be used as the activation function, i.e., the inputs are transformed into {-1, 1} or {-1, 0, 1}, so as to ensure that XNOR and popcount bit operations can be used, saving computing resources and storage resources.
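Putting the pieces together, the layer-by-layer forward pass with a sign activation can be sketched as follows (illustrative structure; `layers` as a list of per-layer weight rows and biases is an assumption of this sketch, not the patent's data layout):

```python
def sign(z):
    # Keep outputs in {-1, 1}; 0 is mapped to 1 here by convention.
    return 1 if z >= 0 else -1

def forward(x, layers):
    """Each layer is (weights, biases) with weights[node][input]; every
    output stays in {-1, 1}, so downstream layers can use bit operations."""
    for weights, biases in layers:
        x = [sign(sum(wi * xi for wi, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

out = forward([1, -1], [([[1, 0], [0, 1]], [0, 0]), ([[1, 1]], [0])])
```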
3. Computation of the output layer
The computation of the output layer is similar to that of the hidden layers described above. Specifically, the input of each node of the output layer (i.e., the output of each node of the (n-2)-th hidden layer) and the initial weight of each node of the output layer undergo a weighting operation, and the final output of one training pass of the neural network model using sample image A is generated through the activation function (such as ReLU, Tanh, Sigmoid, etc.) of each node of the output layer. The initial weights of the output layer may also be obtained by using the low-bit weight initialization scheme described above (using the 1-bit or 2-bit weight initialization model). For a detailed description, please refer to the above; it is not repeated here.
4. Adjusting weights
Each time the output layer outputs a face recognition result value, the face recognition result data is compared with the expected result data of the image data of the corresponding input sample image, and the error is calculated. A partial derivative is obtained from the error, and the weight of each node in the network layers other than the input layer is adjusted based on the obtained partial derivative. The weights are continually adjusted until the final error reaches the error threshold.
The output of the model (i.e., the training result of the model trained using sample image A) is then compared with the actual image characteristics of sample image A to determine an error (i.e., the difference between the two), a partial derivative is determined for the error, and the weights are updated based on the partial derivative. Other sample image data may subsequently be input to train the model, so that, over training on a large amount of sample image data and with the weights continuously adjusted, the neural network model 200 is considered converged when the output error becomes small enough (for example, meets a predetermined error threshold), and model training is complete.
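The outer training loop described above, reduced to its control flow, might be sketched as follows (all names here are illustrative, not the patent's API; the real `update` would backpropagate the partial derivatives):

```python
def train_until_converged(samples, expected, model_forward, update,
                          err_threshold, max_epochs=100):
    """Run every sample, compare the output with the expected result, adjust
    weights via `update`, and stop once the worst error is below threshold."""
    for _ in range(max_epochs):
        worst = 0.0
        for x, y_true in zip(samples, expected):
            err = abs(model_forward(x) - y_true)
            worst = max(worst, err)
            update(err)  # stand-in for the partial-derivative weight update
        if worst < err_threshold:
            return True  # model considered converged
    return False
```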
After each training of input sample image data, the weights of the network layers of the neural network model 200 are adjusted, and the adjusted weights may be directly used as initial weights for next training of input sample image data, or may be scaled by a scaling factor and used as initial weights for next training of input sample image data.
For example, in some embodiments, after the neural network model is trained with sample image A, the weights of the network layers determined after that training may be used as the initial weights of the neural network model in the next training pass (e.g., training the model, already trained with sample image A, using sample image B).
In other embodiments, in order to ensure that the input and output distributions of each layer of the neural network model are substantially consistent to alleviate the problem of gradient disappearance during the process of transferring to a deeper layer, after the neural network model is trained through the sample image a, the product of the weight of each network layer determined after the neural network model is trained through the sample image a and the scaling factor may be used as the initial weight of the neural network model during the next training (for example, the neural network model trained through the sample image a is trained through the sample image B).
For example, in some embodiments, each of the plurality of initial weights of the i-th network layer among the n network layers of the neural network model can be calculated by the following formula: Winit = α·Wt, where Wt is any one of the plurality of weights of the i-th network layer determined after training on sample image A, the value of Wt ranges from -1 to 1, and α is a scaling factor, a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer. In some embodiments, the scaling factor α may be calculated as follows:
α = √(2/(li + li+1)), where li represents the number of input channels of the i-th network layer among the n network layers of the neural network model, and li+1 represents the number of input channels of the (i+1)-th network layer. It will be appreciated that for the input layer, i is 1 here; i+1 denotes the network layer following the i-th network layer in the neural network model.
In other embodiments, the scaling factor α may also be calculated according to the following formula: α = 1/√(li·Var(Wt)), where Var(Wt) = (1/p)·Σj(Wjt - μ)², p is the number of weights of the i-th network layer determined after training on sample image A, Wjt represents the j-th of the p weights determined after training on sample image A, μ is the average of the p weights determined after training on sample image A, and li is the number of input channels of the i-th network layer (as above).
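Combining the trained weights with the variance-based scaling factor, the re-initialization step might be sketched as follows (illustrative only; it assumes the 1/√(li·Var(Wt)) reading of the formula above):

```python
import math

def reinit_from_trained(wt, l_i):
    """Initial weights for the next pass: Winit = alpha * Wt with
    alpha = 1 / sqrt(l_i * Var(Wt))."""
    p = len(wt)
    mean = sum(wt) / p
    var = sum((w - mean) ** 2 for w in wt) / p
    alpha = 1.0 / math.sqrt(l_i * var)
    return [alpha * w for w in wt]

w_next = reinit_from_trained([1, -1, 1, -1], l_i=4)  # alpha = 0.5
```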
In addition, for the detailed process of training, with sample image B, the neural network model already trained with sample image A, please refer to the above description of training the neural network model with sample image A; it is not repeated here. It can also be understood that, in the present application, the scaling factor is provided in order to keep the input and output distributions of each layer of the neural network model substantially consistent and thereby alleviate the problem of gradient vanishing during propagation to deeper layers; the calculation of the scaling factor is not limited to the above formulas, and other calculation methods may also be adopted, which are not limited herein.
As described above, the weights generated by the weight initialization method of the present application can avoid the problem that the gradient of the neural network model easily vanishes during convergence, and can enable the neural network model to converge quickly. Fig. 4 and 5 respectively show the convergence of the model when the initial weights are obtained by the prior-art Gaussian random initialization method and by the initialization method of the present application. It can be seen that, for the model whose initial weights are discrete values obtained by the initialization method of the present application, the initial weight distribution is almost the same as the weight distribution after convergence, and the convergence speed and stability of the model are better.
It is understood that the above description of the technical solution for training the neural network model 200 shown in fig. 2 is only exemplary and not limiting, and in other embodiments, the weight initialization method of the present application may also be used for speech recognition and the like.
Specifically, as shown in fig. 4, when the neural network model uses the sign function as the activation function and the number of layers of the neural network model increases, since the neural network model generates a gradient only when its weights are between -k and k (for example, between -1 and 1), the output values of the activation functions of the deeper layers are almost all close to 0, easily causing the model gradient to vanish.
FIG. 5(a) shows the weight distribution of a convolutional neural network model of the present application at 1-bit weight initialization, according to some embodiments of the present application. FIG. 5(b) illustrates the weight distribution during a certain period of model convergence after the convolutional neural network model is initialized with 1-bit weights. FIG. 5(c) illustrates the converged weight distribution of the convolutional neural network model trained after initialization with 1-bit weights. The horizontal axis is the value of the weight, and the vertical axis is the number of sampled weights.
In the illustrated embodiment, the convolution kernel of the trained convolutional neural network model is 3x3x64x128. Referring to fig. 5(a), the number of weights equal to -1 and the number of weights equal to 1 at initialization are the same (35000 each). Referring to fig. 5(b), it can be seen that only a very small number of weights take values between -0.004 and 0.004 during a certain period of model convergence; the weights tend toward the discrete values -0.004 and 0.004. It should be noted that the weights participating in training are the compressed weights; for the specific compression method, please refer to the above, which is not repeated here. Referring to fig. 5(c), the weight distribution of the converged model is substantially the same as the distribution of the initial weights, but the number of -1s is greater than the number of 1s (the difference between the two is small). In other embodiments, the number of -1s may be less than the number of 1s. Therefore, after the weights of the target neural network model are initialized with the 1-bit weight initialization method and the model is trained, the initial weight distribution and the converged weight distribution of the model are almost the same, and the convergence speed and stability of the model are good.
It is understood that in some embodiments, the convolutional neural network model is trained after initialization with 2-bit weights (initial weights are-1, 0, and 1), and after the model converges, the weights of the network layers are also-1, 0, and 1.
Fig. 6(a) and fig. 6(b) respectively show the model convergence when an existing model is trained with the prior-art Xavier initialization function and with the weight initialization method provided by the present application. It can be seen that the 1-bit fixed-point quantization model trained with the weight initialization method provided by the present application reaches higher accuracy, and the model is more stable and converges more easily.
Specifically, as shown in fig. 6(a), the weights of a ResNet-32 (deep residual network, ResNet) model are initialized by the Xavier initialization function, and the 1-bit fixed-point quantized ResNet-32 model is trained on the CIFAR-10 data set. CIFAR-10 consists of 60000 32×32 RGB color images in 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, boat, truck). Referring to fig. 6(a), where the abscissa is the number of training steps and the ordinate is the accuracy, it can be seen that the accuracy of the ResNet-32 model is at most about 82%, and the swing amplitude of the model accuracy is large, i.e., the noise is large and the model is unstable (training stops after 150 epochs, and the accuracy cannot be improved further).
Referring to fig. 6(b), where the abscissa is the number of training steps and the ordinate is the accuracy, it can be seen that, when the ResNet-32 model is initialized by the weight initialization method provided in the embodiments of the present application, the accuracy of the ResNet-32 model stabilizes at about 98% as the number of training steps increases, and the noise is small. Compared with the result of training the 1-bit fixed-point quantization model with the Xavier initialization function in fig. 6(a), the training result of the 1-bit fixed-point quantization model trained with the weight initialization method provided by the present application has higher accuracy, and the model is more stable and converges more easily (300 epochs). It can be understood that the embodiments of fig. 6(a) and fig. 6(b) take the ResNet-32 model only as an example; the weight initialization method provided by the present application is also applicable to the initialization of other neural network models.
For example, when the low-bit weight initialization method provided by the embodiments of the present application is used for image recognition, the acquired image information to be learned undergoes necessary preprocessing (such as sampling, analog-to-digital conversion, feature extraction, and the like) to form the data on which the neural network model operates. The data to be trained is input into the neural network model for training, and the low-bit weight initialization method provided by the embodiments of the present application is applied to the model during training, so that the convergence efficiency of the model can be improved with high stability while ensuring that the accuracy of the operation meets the requirement.
The following describes in detail a technical solution for performing weight initialization on a trained neural network model by using the terminal device 100 shown in fig. 1 according to some embodiments of the present application.
1. Input layer computation
The image data of the sample image C is input to the input layer as input data.
2. Computation of hidden layers
a) The input layer outputs the image data of the sample image C to the first hidden layer. For example, the input image data may be color information (e.g., numbers between 0 and 255 of an RGB color space) of each pixel point in the image, and the image data is input into s nodes (i.e., inputs x1, x2 to xs) of the input layer (first layer) of the model shown in fig. 2.
b) And initializing the weight of each hidden layer to obtain an initial weight.
In embodiments of the present application, a low-bit weight initialization model is used to obtain the initial weights of each hidden layer; for example, the following 1-bit or 2-bit weight initialization model is used to initialize the weights of a trained neural network model (for example, a converged 8-bit model). The weight value set in 1-bit weight initialization is {1, −1}, and the weight value set in 2-bit weight initialization is {1, 0, −1}. That is, when the weights are initialized using the 1-bit weight initialization model, the initial weight of a given node in the hidden layer is set to one of the two values 1 and −1, and all the weights of the hidden layer need to satisfy a distribution with a mean of 0 and a variance of 1. When the 2-bit weight initialization model is used to initialize the weights, the initial weight of a given node in the hidden layer is set to one of the three values −1, 0 and 1, and all the weights of the hidden layer need to satisfy a distribution with a mean of 0 and a variance of 2/3.
Specifically, for example, in some embodiments, assuming that the trained model has converged, the following 1-bit weight initialization method may be used to initialize the full-precision weights of the neural network model:
that is, the full-precision weight of the trained neural network model can be valued according to the sign of the full-precision numerical value thereof through a sign function sign (w), that is: when the full-precision weight is a value greater than 0 (e.g., 0.23), it is rotatedBy changing to 1, when the full-precision weight is a value less than 0 (e.g., -0.15) or equal to 0, it is converted to-1, and the initial weight after quantization is thus changed to
Figure BDA0002382198810000151
Taking one of discrete values 1 or-1 and initial weight
Figure BDA0002382198810000152
The distribution of (c) still satisfies a mean of 0 and a variance of 1.
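The 1-bit sign rule described above can be sketched as follows; this is an illustrative sketch, not the patent's implementation, and the weight values are made up for the example:

```python
def quantize_1bit(weights):
    """Map each full-precision weight to 1 if it is greater than 0,
    and to -1 if it is less than or equal to 0 (the sign rule above)."""
    return [1.0 if w > 0 else -1.0 for w in weights]

# Full-precision weights of one hidden layer (illustrative values).
full_precision = [0.23, -0.15, 0.41, -0.02, 0.37, -0.29]
quantized = quantize_1bit(full_precision)  # [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
```

For this sample the quantized weights happen to have a mean of 0 and a variance of 1, matching the distribution requirement stated above.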
In addition, in some embodiments, to ensure that the input and output distributions of each layer of the neural network model remain basically consistent and to mitigate gradient vanishing in propagation to deeper layers, the discrete values W^z selected from 1 and −1 for the nodes in the quantization process may be compressed by a preset scaling factor, obtaining the initial weights of the neural network model as

Ŵ = α · W^z

wherein the preset scaling factor is obtained by a normalization method. α is the scaling factor, usually a positive decimal smaller than 1, and is used to adjust the distribution of the output data of the ith network layer.
In some embodiments, the scaling factor α may be calculated as follows:

α = √(2 / (l_i + l_{i+1}))

where l_i is the number of input channels of the ith network layer and l_{i+1} is the number of input channels of the (i+1)th network layer. It will be appreciated that for the input layer, i is 1 here; i+1 denotes the next network layer after the ith network layer in the neural network model.
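Assuming a Xavier-style normalization of the form α = √(2 / (l_i + l_{i+1})) (the function name and channel counts below are illustrative, not from the patent), the scaling factor can be computed as:

```python
import math

def scaling_factor(fan_in, fan_out):
    """alpha = sqrt(2 / (l_i + l_{i+1})), where l_i and l_{i+1} are the
    input channel counts of the ith and (i+1)th network layers."""
    return math.sqrt(2.0 / (fan_in + fan_out))

# Example: layer i has 256 input channels, layer i+1 also has 256.
alpha = scaling_factor(256, 256)  # sqrt(2/512) = 0.0625
```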
In addition, in some other embodiments, the scaling factor α may be calculated according to the following formula:

α = 1 / √( l_i · (1/p) · Σ_j (W_j^z − W̄^z)² )

wherein α is the scaling factor, p is the number of initial weights of the ith network layer, W_j^z is one of −1 and 1 and corresponds to the jth initial weight of the p initial weights, the p initial weights W_j^z have a mean of 0 and a variance of 1, W̄^z is the average value of the p weights W_j^z, and l_i is the number of input channels of the ith network layer.
Thus, the scaling factor α for the weights of each layer in the neural network model 200 is calculated and multiplied by the discrete values W^z of the corresponding network layer. During training of the neural network model, the input data of each layer undergoes a weighting operation with the compressed weights and the result is propagated forward, so that the input and output distributions of the neural network model remain consistent, alleviating the gradient vanishing problem in propagation to deeper layers.
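As a hedged sketch of the forward weighting just described, one layer's weighted sum with compressed 1-bit weights α·W^z might look like this (all names and values are illustrative assumptions):

```python
import math

def layer_forward(inputs, quantized_weights, alpha):
    """Weighted sum of the layer inputs with the compressed weights
    alpha * W^z, as in the forward propagation described above."""
    return sum(x * (alpha * w) for x, w in zip(inputs, quantized_weights))

inputs = [0.5, -0.25, 1.0, 0.75]         # outputs of the previous layer
w_z = [1.0, -1.0, -1.0, 1.0]             # 1-bit quantized weights
alpha = math.sqrt(2.0 / (4 + 4))         # l_i = l_{i+1} = 4 -> alpha = 0.5
out = layer_forward(inputs, w_z, alpha)  # 0.5 * (0.5 + 0.25 - 1.0 + 0.75) = 0.25
```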
It is to be appreciated that the above method of calculating the scaling factor α is merely exemplary and not limiting, and in other embodiments, other normalization methods may be employed to calculate the scaling factor α.
In some embodiments, the full-precision weights of the trained neural network model may be initialized using a 2-bit weight initialization method as follows:
that is, the full-precision weight of the trained model can be valued according to the sign of the full-precision numerical value thereof through a sign function sign (w), that is: when the full-precision weight is a value greater than 0 (e.g., 0.31), it is converted to 1, when the full-precision weight is a value less than 0 (e.g., -0.17), it is converted to-1, and when the full-precision weight is 0,take it to 0, the initial weight after quantization
Figure BDA0002382198810000166
Taking one of discrete values-1, 0 or 1, and initial weight
Figure BDA0002382198810000167
The distribution of (c) still satisfies a mean of 0 and a variance of 2/3.
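The 2-bit (ternary) sign rule described above can be sketched similarly; again an illustrative sketch with made-up values, not the patent's implementation:

```python
def quantize_2bit(weights):
    """Map each full-precision weight to 1 if > 0, -1 if < 0, and 0 if
    exactly 0, as in the 2-bit (ternary) sign rule described above."""
    return [1.0 if w > 0 else (-1.0 if w < 0 else 0.0) for w in weights]

# Full-precision weights of one hidden layer (illustrative values).
full_precision = [0.31, -0.17, 0.0, 0.62, -0.44, 0.0]
quantized = quantize_2bit(full_precision)  # [1.0, -1.0, 0.0, 1.0, -1.0, 0.0]
```

For this sample the quantized weights happen to have a mean of 0 and a variance of 2/3, matching the distribution requirement stated above.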
In order to ensure that the input and output distributions of each layer of the neural network model remain basically consistent and to alleviate the gradient vanishing problem in propagation to deeper layers, the discrete values W^q selected from −1, 0 and 1 for the nodes in the quantization process may be compressed based on a preset scaling factor, obtaining the initial weights of the neural network model as

Ŵ = α · W^q

The preset scaling factor is obtained by a normalization method to scale the variance of the weights of the neural network model, so that the network can propagate to deeper layers. For example, in some embodiments, the scaling factor is calculated as follows:

α = √(2 / (l_i + l_{i+1}))

wherein l_i is the number of input channels of the ith network layer and l_{i+1} is the number of input channels of the (i+1)th network layer.
For another example, in some other embodiments, the scaling factor α may also be calculated according to the following formula:

α = 1 / √( l_i · (1/p) · Σ_j (W_j^q − W̄^q)² )

wherein α is the scaling factor, p represents the number of initial weights of the ith network layer, W_j^q is one of −1, 0 and 1 and corresponds to the jth initial weight of the p initial weights, the p initial weights W_j^q have a mean of 0 and a variance of 2/3, W̄^q is the average value of the p weights W_j^q, and l_i is the number of input channels of the ith network layer.
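Assuming the normalization takes the form α = 1/√(l_i · Var(W)), with Var(W) the empirical variance of the p quantized weights (an assumption made for illustration, not the patent's verbatim formula), the computation can be sketched as:

```python
import math

def empirical_scaling_factor(quantized_weights, fan_in):
    """alpha = 1 / sqrt(l_i * Var(W)), where Var(W) is the empirical
    variance (1/p) * sum_j (W_j - W_bar)^2 of the p quantized weights."""
    p = len(quantized_weights)
    mean = sum(quantized_weights) / p
    var = sum((w - mean) ** 2 for w in quantized_weights) / p
    return 1.0 / math.sqrt(fan_in * var)

# Ternary weights with empirical variance 2/3, layer with 6 input channels.
w_q = [1.0, -1.0, 0.0, 1.0, -1.0, 0.0]
alpha = empirical_scaling_factor(w_q, fan_in=6)  # 1/sqrt(6 * 2/3) = 0.5
```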
Therefore, compared with traditional full-precision floating-point operations and the commonly used 8-bit models in the prior art, quantizing the full-precision neural network model to 1 bit or 2 bits can greatly reduce the size of the model, the computing resources required, and the power consumption.
c) A weighting operation is performed on the input data of each hidden layer and the corresponding initial weights Ŵ.
For example, the image data of the sample image C is divided into s data blocks, the s data blocks are input as s input data to the s nodes of the input layer in FIG. 2, and the weighting operation is performed with the weights of the first hidden layer (for example, the weighting operation of the first node of the first hidden layer is x1 × w11+ x2 × w12+ x3 × w13+ … + xs × w1s + b).
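The per-node weighting operation quoted above (x1 × w11 + x2 × w12 + … + xs × w1s + b) can be sketched directly; the numbers below are illustrative, not taken from the patent:

```python
def node_weighted_sum(x, w, b):
    """Compute x1*w1 + x2*w2 + ... + xs*ws + b for one hidden-layer node."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b

x = [0.2, 0.4, 0.6]    # s = 3 input data blocks
w1 = [1.0, -1.0, 1.0]  # initial (quantized) weights of the first node
b = 0.1                # bias of the first node
z = node_weighted_sum(x, w1, b)  # 0.2 - 0.4 + 0.6 + 0.1, approximately 0.5
```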
d) And performing activation operation on the result of the weighting operation by using an activation function.
For example, in some embodiments, the activation function may be a Sigmoid function, a Tanh function, or the like. Specifically, the feature data of the sample image C output by each node of the input layer and the initial weights Ŵ of each node (h nodes shown in fig. 2) of the first hidden layer undergo a weighting operation, and the outputs of the nodes of the first hidden layer (i.e., the inputs of the second hidden layer) are generated through the activation function (e.g., Sigmoid function, Tanh function) of the first hidden layer. The inputs of the second hidden layer and the initial weights Ŵ of the respective nodes (u nodes shown in fig. 2) of the second hidden layer undergo a weighting operation, and the outputs of the nodes of the second hidden layer (i.e., the inputs of the third hidden layer) are generated through the activation function of the second hidden layer. The remaining hidden layers are calculated in the same way in turn, until the inputs of the nodes of the (n−2)th hidden layer (there are n−2 hidden layers in fig. 2, so these inputs are the outputs of the (n−3)th hidden layer) and the corresponding initial weights Ŵ undergo a weighting operation, and the outputs of the nodes of the (n−2)th hidden layer (i.e., the inputs of the output layer) are generated through the activation function of each node of the (n−2)th hidden layer.
It is understood that the hidden layer calculation is similar to the hidden layer calculation method described in the above scheme for training the untrained neural network model 200; the difference is only that the weight initialization method uses the sign function sign(W) to take the values of the full-precision weights of the trained model according to the signs of the full-precision values. For a detailed description, please refer to the above, which is not repeated herein.
3. Computation of output layers
The calculation of the output layer is similar to the calculation method of the output layer described in the above scheme for training the untrained neural network model 200; the difference is only that the weight initialization method uses the sign function sign(W) to take the values of the full-precision weights of the trained model according to the signs of the full-precision values. For a detailed description, please refer to the above, which is not repeated herein.
4. The weights are adjusted, and the specific adjusting method is similar to the weight adjusting method described in the above-mentioned scheme for training the untrained neural network model 200, and for the detailed description, refer to the above, and are not repeated here.
Although the above embodiments exemplify face recognition of an image, the weight initialization model of the present application may be applied to any Neural Network model, such as a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), and the like.
After each training of input sample image data, the weights of the network layers of the neural network model 200 are adjusted, and the adjusted weights may be directly used as initial weights for next training of input sample image data, or may be scaled by a scaling factor and used as initial weights for next training of input sample image data.
For example, in some embodiments, after the trained neural network model is trained through the sample image C, the weights of the network layers determined after the trained neural network model is trained through the sample image C may be used as initial weights of the neural network model in the next training (for example, the neural network model trained through the sample image C is trained through the sample image D).
In other embodiments, in order to ensure that the input and output distributions of each layer of the neural network model are substantially consistent to alleviate the problem of gradient disappearance during the process of transferring to a deeper layer, after the neural network model is trained through the sample image C, the product of the weight of each network layer determined after the neural network model is trained through the sample image C and the scaling factor may be used as the initial weight of the neural network model during the next training (for example, the neural network model trained through the sample image C is trained through the sample image D).
For example, in some embodiments, the plurality of initial weights Ŵ of the ith network layer of the n network layers of the neural network model can be calculated by the following formula:

Ŵ = α · W^r

wherein W^r is any one of the plurality of weights of the ith network layer determined after training with sample image C, the value range of W^r is −1 ≤ W^r ≤ 1, and α is a positive decimal smaller than 1 used to adjust the distribution of the output data of the ith network layer, calculated as follows:

α = √(2 / (l_i + l_{i+1}))

wherein l_i represents the number of input channels of the ith network layer of the n network layers of the neural network model, and l_{i+1} represents the number of input channels of the (i+1)th network layer. It will be appreciated that for the input layer, i is 1 here; i+1 denotes the next network layer after the ith network layer in the neural network model.
In other embodiments, the scaling factor α may also be calculated according to the following formula:

α = 1 / √( l_i · (1/p) · Σ_j (W_j^r − W̄^r)² )

where p is the number of weights of the ith network layer determined after training with sample image C, W_j^r represents the jth weight of the p weights determined after training with sample image C, and W̄^r is the average of the p weights determined after training with sample image C.
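A hedged sketch of the re-initialization between training samples described above: the weights W^r obtained after the previous sample are scaled by a factor α computed from their empirical variance (the normalization form α = 1/√(l_i · Var(W^r)) is an assumption made for illustration; names and values are made up):

```python
import math

def reinit_weights(trained_weights, fan_in):
    """Scale the weights determined after the previous training sample by
    alpha = 1 / sqrt(l_i * Var(W^r)) and use them as the next initial weights."""
    p = len(trained_weights)
    mean = sum(trained_weights) / p
    var = sum((w - mean) ** 2 for w in trained_weights) / p
    alpha = 1.0 / math.sqrt(fan_in * var)
    return [alpha * w for w in trained_weights]

# Weights after training on sample image C (illustrative, in [-1, 1]).
w_r = [0.8, -0.8, 0.4, -0.4]
w_init = reinit_weights(w_r, fan_in=4)  # initial weights for sample image D
```

With this choice of α, the scaled weights satisfy l_i · Var(Ŵ) = 1, which keeps the layer's output variance roughly constant from one training pass to the next.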
In addition, for a detailed process of training the neural network model trained by the sample image C by using the sample image D, please refer to the above description of the process of training the neural network model by using the sample image C, which is not described herein again.
In addition, it can be understood that, in the present application, in order to ensure that the distribution of the input and the output of each layer of the neural network model is kept substantially consistent, so as to alleviate the problem of gradient disappearance in the process of transferring to a deeper layer, a scaling factor is provided, wherein the calculation method of the scaling factor is not limited to the above formula, and other calculation methods may also be adopted, which are not limited herein.
FIG. 7 provides a block diagram of an electronic device 700 for training a neural network model, according to some embodiments of the present application. As shown in fig. 7, the electronic device 700 includes:
a first data obtaining module 702, configured to obtain sample data, and input the sample data to a second network layer, where the sample data includes initial input data and expected result data;
a first data processing module 704, configured to perform the following operations:

when i is 2, obtain the output data of the ith network layer based on the initial input data and a plurality of initial weights Ŵ of the ith network layer;

when 2 < i ≤ n, obtain the output data of the ith network layer based on the output data of the (i−1)th network layer and a plurality of initial weights Ŵ of the ith network layer, wherein the plurality of initial weights Ŵ of the ith network layer are obtained based on m discrete values, the value range of the plurality of initial weights Ŵ consists of the m discrete values (e.g., {−1, 1} or {−1, 0, 1}), and m = {2, 3};
a first weight adjusting module 706, configured to adjust a plurality of initial weights of the ith network layer based on an error between output data of the n network layers and expected result data in the sample data.
It can be understood that the electronic device 700 for training the neural network model shown in fig. 7 corresponds to the training method of the neural network model provided in the present application, and the technical details in the above detailed description about the training method of the neural network model provided in the present application are still applicable to the electronic device 700 for training the neural network model shown in fig. 7, and the detailed description is referred to above and is not repeated herein.
FIG. 8 provides a block diagram of an electronic device 800 for training a neural network model, according to some embodiments of the present application. As shown in fig. 8, the electronic device 800 includes:
a second data obtaining module 802, configured to obtain sample data, and input the sample data to a second network layer, where the sample data includes initial input data and expected result data;
a second data processing module 804, configured to perform the following operations:

when i is 2, perform sign-based value-taking on a plurality of full-precision weights of the ith network layer to obtain a plurality of initial weights Ŵ of the ith network layer, and obtain the output data of the ith network layer based on the initial input data and the plurality of initial weights Ŵ;

when 2 < i ≤ n, perform sign-based value-taking on the plurality of full-precision weights of the ith network layer to obtain a plurality of initial weights Ŵ of the ith network layer, and obtain the output data of the ith network layer based on the output data of the (i−1)th network layer and the plurality of initial weights, wherein the plurality of initial weights Ŵ of the ith network layer are obtained based on m discrete values, the value range of the plurality of initial weights Ŵ consists of the m discrete values (e.g., {−1, 1} or {−1, 0, 1}), and m = {2, 3};

a second weight adjustment module 806, configured to adjust the plurality of initial weights of the ith network layer based on the error between the output data of the n network layers and the expected result data in the sample data.
It can be understood that the electronic device 800 for training a neural network model shown in fig. 8 corresponds to the training method for a neural network model provided in the present application, and the technical details in the above detailed description about the training method for a neural network model provided in the present application are still applicable to the electronic device 800 for training a neural network model shown in fig. 8, and the detailed description is referred to above and is not repeated herein.
Fig. 9 shows a schematic structural diagram of an electronic device 900 according to an embodiment of the present application. The electronic device 900 is also capable of performing the training of the neural network model disclosed in the above-described embodiments of the present application. In fig. 9, like parts have the same reference numerals. As shown in fig. 9, the electronic device 900 may include a processor 910, a power module 940, a memory 980, a mobile communication module 930, a wireless communication module 920, a sensor module 990, an audio module 950, a camera 970, an interface module 960, buttons 901, a display 902, and the like.
It is to be understood that the illustrated architecture of the present invention is not to be construed as a specific limitation for the electronic device 900. In other embodiments of the present application, electronic device 900 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 910 may include one or more processing units, for example, processing modules or processing circuits that may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), a Micro-programmed Control Unit (MCU), an Artificial Intelligence (AI) processor, a Field Programmable Gate Array (FPGA), or the like. The different processing units may be separate devices or may be integrated into one or more processors. A storage unit may be provided in the processor 910 for storing instructions and data. In some embodiments, the storage unit in the processor 910 is a cache. The memory 980 mainly includes a program storage area 9801 and a data storage area 9802, wherein the program storage area 9801 can store an operating system and the application programs required for at least one function (such as sound playing, image recognition, and the like). The neural network model provided in the embodiments of the present application can be regarded as an application program that can implement functions such as image processing and voice processing in the program storage area 9801. The weights of each network layer of the neural network model are stored in the above-described data storage area 9802.
The power module 940 may include a power supply, power management components, and the like. The power source may be a battery. The power management component is used for managing the charging of the power supply and the power supply of the power supply to other modules. In some embodiments, the power management component includes a charge management module and a power management module. The charging management module is used for receiving charging input from the charger; the power management module is used for connecting a power supply, and the charging management module is connected with the processor 910. The power management module receives power and/or charge management module input and provides power to the processor 910, the display 902, the camera 970, and the wireless communication module 920.
The mobile communication module 930 may include, but is not limited to, an antenna, a power amplifier, a filter, a low noise amplifier (LNA), and the like. The mobile communication module 930 may provide solutions for wireless communication including 2G/3G/4G/5G applied to the electronic device 900. The mobile communication module 930 may receive electromagnetic waves from the antenna, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 930 may further amplify signals modulated by the modem processor and convert them into electromagnetic waves radiated by the antenna. In some embodiments, at least some of the functional modules of the mobile communication module 930 may be disposed in the processor 910. In some embodiments, at least some of the functional modules of the mobile communication module 930 may be disposed in the same device as at least some of the modules of the processor 910. The wireless communication technologies may include the Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Global Navigation Satellite Systems (GNSS) such as the Global Positioning System (GPS), and the like.
The wireless communication module 920 may include an antenna and may transmit and receive electromagnetic waves via the antenna. The wireless communication module 920 may provide solutions for wireless communication applied to the electronic device 900, including wireless local area networks (WLAN) (e.g., wireless fidelity (Wi-Fi) networks), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The electronic device 900 may communicate with networks and other devices via wireless communication technologies.
In some embodiments, the mobile communication module 930 and the wireless communication module 920 of the electronic device 900 may also be located in the same module.
The display panel may be a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), an Active-Matrix Organic Light-Emitting Diode (AMOLED), a Flexible Light-Emitting Diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a Quantum Dot Light-Emitting Diode (QLED), or the like.
The sensor module 990 may include a proximity light sensor, a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
The audio module 950 is used to convert digital audio information into an analog audio signal for output, or convert an analog audio input into a digital audio signal. The audio module 950 may also be used to encode and decode audio signals. In some embodiments, the audio module 950 may be disposed in the processor 910, or some functional modules of the audio module 950 may be disposed in the processor 910. In some embodiments, audio module 950 may include speakers, an earpiece, a microphone, and a headphone interface.
The camera 970 is used to capture still images or video. An object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element converts the optical signal into an electrical signal, which is then transmitted to the ISP (Image Signal Processor) to be converted into a digital image signal. The electronic device 900 may implement a shooting function through the ISP, the camera 970, a video codec, a GPU (Graphics Processing Unit), the display 902, the application processor, and the like.
The interface module 960 includes an external memory interface, a Universal Serial Bus (USB) interface, a Subscriber Identity Module (SIM) card interface, and the like. The external memory interface may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the electronic device 900. The external memory card communicates with the processor 910 through an external memory interface to implement a data storage function. The universal serial bus interface is used for communication between the electronic device 900 and other electronic devices. The SIM card interface is used to communicate with a SIM card installed to the electronic device 900, such as to read a phone number stored in the SIM card or to write a phone number into the SIM card.
In some embodiments, the electronic device 900 also includes keys 901, a motor, indicators, and the like. The keys 901 may include a volume key, an on/off key, and the like. The motor is used to produce a vibration effect on the electronic device 900, for example, when the user's electronic device 900 is being called, to alert the user to answer the incoming call. The indicators may include a laser indicator, a radio frequency indicator, an LED indicator, and the like.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code can also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in this application are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable memory used to transmit information over the Internet via electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some features of the structures or methods may be shown in a particular arrangement and/or order. However, it is to be understood that such specific arrangement and/or ordering may not be required. Rather, in some embodiments, the features may be arranged in a manner and/or order different from that shown in the illustrative figures. In addition, the inclusion of a structural or methodical feature in a particular figure is not meant to imply that such feature is required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the apparatus embodiments of the present application, each unit/module is a logical unit/module. Physically, one logical unit/module may be one physical unit/module, may be a part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules; the physical implementation of the logical units/modules is not itself essential, and it is the combination of functions implemented by these units/modules that solves the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above apparatus embodiments do not introduce units/modules that are less closely related to solving the technical problem presented in the present application; this does not mean that no other units/modules exist in the above apparatus embodiments.
It is noted that, in the examples and descriptions of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprises a" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.

Claims (24)

1. A method for training a neural network model, wherein the neural network model comprises n network layers, n being a positive integer greater than 1, and the method comprises:

obtaining, by the first of the n network layers, sample data, and inputting the sample data to the second network layer, wherein the sample data comprises initial input data and expected result data;

for the i-th of the n network layers, performing the following operations:

when i = 2, obtaining the output data of the i-th network layer based on the initial input data and a plurality of initial weights W of the i-th network layer;

when 2 < i ≤ n, obtaining the output data of the i-th network layer based on the output data of the (i−1)-th network layer and a plurality of initial weights W of the i-th network layer,

wherein the plurality of initial weights W of the i-th network layer are obtained based on m discrete values, the value range of the initial weights W is −1 ≤ W ≤ 1, and m ∈ {2, 3}; and

adjusting the plurality of initial weights W of the i-th network layer based on the error between the output data of the n network layers and the expected result data in the sample data.
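The flow claimed above — a forward pass through n layers whose initial weights come from m discrete values, followed by weight adjustment from the output error — can be sketched roughly as follows. This is an illustrative reading only, not the patented implementation; the layer sizes, activation, and loss are placeholder choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_discrete(shape, m=2):
    """Draw initial weights from m discrete values in [-1, 1]:
    m=2 -> {-1, 1}, m=3 -> {-1, 0, 1} (cf. claims 1-4)."""
    values = [-1.0, 1.0] if m == 2 else [-1.0, 0.0, 1.0]
    return rng.choice(values, size=shape)

# Placeholder 3-layer model: the first "layer" only feeds the data in;
# the subsequent layers compute with discrete initial weights.
sizes = [4, 8, 1]
W = [init_discrete((sizes[k], sizes[k + 1]), m=2) for k in range(len(sizes) - 1)]

x = rng.normal(size=(16, 4))   # initial input data (a sample batch)
y = rng.normal(size=(16, 1))   # expected result data

h = x
for Wi in W:                   # layer i consumes the output of layer i-1
    h = np.tanh(h @ Wi)

# Error between the final output and the expected result data;
# the claim's final step adjusts the discrete weights based on this error.
err = np.mean((h - y) ** 2)
print(err)
```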
2. The method of claim 1, wherein each of the plurality of initial weights W of the i-th network layer is one of the m discrete values.
3. The method of claim 2, wherein the m discrete values are −1 and 1, and the plurality of initial weights W of the i-th network layer have a mean of 0 and a variance of 1.
4. The method of claim 2, wherein the m discrete values are −1, 0 and 1, and the plurality of initial weights W of the i-th network layer have a mean of 0 and a variance of 2/3.
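The moments stated in claims 3 and 4 follow from a uniform draw over the discrete set: over {−1, 1}, E[W] = 0 and Var[W] = E[W²] = 1; over {−1, 0, 1}, Var[W] = (1 + 0 + 1)/3 = 2/3. A quick empirical check (a sketch of this reading, not part of the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

binary = rng.choice([-1.0, 1.0], size=1_000_000)        # m = 2
ternary = rng.choice([-1.0, 0.0, 1.0], size=1_000_000)  # m = 3

# Uniform over {-1, 1}: mean 0, variance E[W^2] = 1.
print(round(abs(binary.mean()), 2), round(binary.var(), 2))    # ≈ 0.0 1.0
# Uniform over {-1, 0, 1}: mean 0, variance E[W^2] = 2/3.
print(round(abs(ternary.mean()), 2), round(ternary.var(), 2))  # ≈ 0.0 0.67
```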
5. The method of claim 1, wherein the i-th network layer has p initial weights W, and the p initial weights of the i-th network layer are calculated by the following formula:

W = α · W_b

wherein W_b is one of the m discrete values, the value range of W_b is −1 ≤ W_b ≤ 1, the p values of W_b corresponding to the p initial weights have a mean of 0 and a variance of 1 or 2/3, and α is a scaling factor, a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer.
6. The method of claim 5, wherein the variance of the p values of W_b corresponding to the p initial weights is 1, and the m discrete values are −1 and 1.
7. The method of claim 5, wherein the variance of the p values of W_b corresponding to the p initial weights is 2/3, and the m discrete values are −1, 0 and 1.
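Claims 5 to 7 describe initial weights of the form W = α·W_b with W_b drawn from the discrete set. A minimal sketch under that reading; the α value below is an arbitrary placeholder, since the patent's own scaling-factor formulas (claims 8 and 11) appear only as images in this text and involve the layers' input-channel counts:

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_discrete_init(p, m=2, alpha=0.1):
    """Initial weights W = alpha * W_b (cf. claim 5).
    W_b is uniform over {-1, 1} (m=2) or {-1, 0, 1} (m=3);
    alpha is a positive scaling factor < 1 that shrinks the
    layer's output distribution."""
    values = [-1.0, 1.0] if m == 2 else [-1.0, 0.0, 1.0]
    Wb = rng.choice(values, size=p)
    return alpha * Wb

W = scaled_discrete_init(p=1000, m=3, alpha=0.05)
print(sorted(set(W.tolist())))   # values drawn from {-0.05, 0.0, 0.05}
```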
8. The method of any one of claims 5 to 7, wherein the scaling factor is obtained by the following formula:

[scaling-factor formula shown only as an image in the source]

wherein W_j^b is the discrete value, among the p values of W_b, corresponding to the j-th of the p initial weights, W̄^b is the mean of the p values of W_b of the i-th network layer, and l_i is the number of input channels of the i-th network layer.
9. The method of claim 1, wherein the plurality of initial weights W of the i-th network layer are calculated by the following formula:

W = α · W_t

wherein W_t is any one of the plurality of weights determined in the previous training of the i-th network layer, the value range of W_t is −1 ≤ W_t ≤ 1, and α is a scaling factor, a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer.
10. The method of claim 9, wherein the scaling factor is obtained by the following formula:

[scaling-factor formula shown only as an image in the source]

wherein p is the number of weights determined in the previous training of the i-th network layer, W_j^t denotes the j-th of the p weights determined in the previous training, and W̄^t denotes the mean of the p weights determined in the previous training.
11. The method of claim 5 or 9, wherein the scaling factor is obtained by the following formula:

[scaling-factor formula shown only as an image in the source]

wherein l_i is the number of input channels of the i-th network layer, and l_{i+1} is the number of input channels of the (i+1)-th network layer.
12. A method for training a neural network model, wherein the neural network model comprises n network layers and the neural network model has converged, n being a positive integer greater than 1, and the method comprises:

obtaining, by the first of the n network layers, sample data, and inputting the sample data to the second network layer, wherein the sample data comprises initial input data and expected result data;

for the i-th of the n network layers, performing the following operations:

when i = 2, taking the sign of each of a plurality of full-precision weights of the i-th network layer to obtain a plurality of initial weights W of the i-th network layer, and obtaining the output data of the i-th network layer based on the initial input data and the plurality of initial weights W;

when 2 < i ≤ n, taking the sign of each of a plurality of full-precision weights of the i-th network layer to obtain a plurality of initial weights W of the i-th network layer, and obtaining the output data of the i-th network layer based on the output data of the (i−1)-th network layer and the plurality of initial weights W,

wherein the plurality of initial weights W of the i-th network layer are obtained based on m discrete values, the value range of the initial weights W is −1 ≤ W ≤ 1, and m ∈ {2, 3}; and

adjusting the plurality of initial weights W of the i-th network layer based on the error between the output data of the n network layers and the expected result data in the sample data.
13. The method of claim 12, wherein the m discrete values are −1 and 1, the plurality of initial weights W of the i-th network layer have a mean of 0 and a variance of 1, and taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights of the i-th network layer comprises:

if the full-precision weight is less than or equal to 0, taking −1 as the initial weight corresponding to the full-precision weight; and

if the full-precision weight is greater than 0, taking 1 as the initial weight corresponding to the full-precision weight.
14. The method of claim 12, wherein the m discrete values are −1, 0 and 1, the plurality of initial weights W of the i-th network layer have a mean of 0 and a variance of 2/3, and taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights W of the i-th network layer comprises:

if the full-precision weight is less than 0, taking −1 as the initial weight corresponding to the full-precision weight;

if the full-precision weight is equal to 0, taking 0 as the initial weight corresponding to the full-precision weight; and

if the full-precision weight is greater than 0, taking 1 as the initial weight corresponding to the full-precision weight.
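The sign-taking rules of claims 13 and 14 amount to a binary and a ternary quantizer. A direct transcription (illustrative; the function names are hypothetical):

```python
import numpy as np

def binarize(w):
    """Claim 13: w <= 0 -> -1, w > 0 -> 1."""
    return np.where(w > 0, 1.0, -1.0)

def ternarize(w):
    """Claim 14: w < 0 -> -1, w == 0 -> 0, w > 0 -> 1, i.e. sign(w)."""
    return np.sign(w)

w = np.array([-0.7, 0.0, 0.3, -0.1, 2.5])
print(binarize(w))   # -> [-1, -1, 1, -1, 1]
print(ternarize(w))  # -> [-1, 0, 1, -1, 1]
```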
15. The method of claim 12, wherein the m discrete values are −1 and 1, and taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights W of the i-th network layer comprises:

if the full-precision weight is less than or equal to 0, taking the product of −1 and a scaling factor as the initial weight corresponding to the full-precision weight; and

if the full-precision weight is greater than 0, taking the product of 1 and the scaling factor as the initial weight corresponding to the full-precision weight,

wherein the scaling factor is a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer.
16. The method of claim 12, wherein the m discrete values are −1, 0 and 1, and taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights W of the i-th network layer comprises:

if the full-precision weight is less than 0, taking the product of −1 and a scaling factor as the initial weight corresponding to the full-precision weight;

if the full-precision weight is equal to 0, taking 0 as the initial weight corresponding to the full-precision weight; and

if the full-precision weight is greater than 0, taking the product of 1 and the scaling factor as the initial weight corresponding to the full-precision weight,

wherein the scaling factor is a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer.
17. The method of claim 15 or 16, wherein the scaling factor is obtained by the following formula:

[scaling-factor formula shown only as an image in the source]

wherein α is the scaling factor, l_i is the number of input channels of the i-th network layer, and l_{i+1} is the number of input channels of the (i+1)-th network layer.
18. The method of claim 15, wherein the scaling factor is obtained by the following formula:

[scaling-factor formula shown only as an image in the source]

wherein α is the scaling factor; p denotes the number of the plurality of initial weights of the i-th network layer; W_j^z is one of −1 and 1 and corresponds to the j-th of the p initial weights, and the p values of W_j^z corresponding to the p initial weights have a mean of 0 and a variance of 1; W̄^z is the mean of the p values of W_j^z; and l_i is the number of input channels of the i-th network layer.
19. The method of claim 16, wherein the scaling factor is obtained by the following formula:

[scaling-factor formula shown only as an image in the source]

wherein α is the scaling factor; p denotes the number of the plurality of initial weights of the i-th network layer; W_j^q is one of −1, 0 and 1 and corresponds to the j-th of the p initial weights, and the p values of W_j^q corresponding to the p initial weights have a mean of 0 and a variance of 2/3; W̄^q is the mean of the p values of W_j^q; and l_i is the number of input channels of the i-th network layer.
20. The method of any one of claims 12 to 19, wherein the sample data comprises image data, and the neural network model is used for image recognition.

21. An electronic device for training a neural network model, comprising:

a first data acquisition module, configured to obtain sample data and input the sample data to the second network layer, wherein the sample data comprises initial input data and expected result data;

a first data processing module, configured to perform the following operations:

when i = 2, obtaining the output data of the i-th network layer based on the initial input data and a plurality of initial weights W of the i-th network layer;

when 2 < i ≤ n, obtaining the output data of the i-th network layer based on the output data of the (i−1)-th network layer and a plurality of initial weights W of the i-th network layer,

wherein the plurality of initial weights W of the i-th network layer are obtained based on m discrete values, the value range of the initial weights W is −1 ≤ W ≤ 1, and m ∈ {2, 3}; and

a first weight adjustment module, configured to adjust the plurality of initial weights of the i-th network layer based on the error between the output data of the n network layers and the expected result data in the sample data.
22. An electronic device for training a neural network model, comprising:

a second data acquisition module, configured to obtain sample data and input the sample data to the second network layer, wherein the sample data comprises initial input data and expected result data;

a second data processing module, configured to perform the following operations:

when i = 2, taking the sign of each of a plurality of full-precision weights of the i-th network layer to obtain a plurality of initial weights W of the i-th network layer, and obtaining the output data of the i-th network layer based on the initial input data and the plurality of initial weights W;

when 2 < i ≤ n, taking the sign of each of a plurality of full-precision weights of the i-th network layer to obtain a plurality of initial weights W of the i-th network layer, and obtaining the output data of the i-th network layer based on the output data of the (i−1)-th network layer and the plurality of initial weights W,

wherein the plurality of initial weights W of the i-th network layer are obtained based on m discrete values, the value range of the initial weights W is −1 ≤ W ≤ 1, and m ∈ {2, 3}; and

a second weight adjustment module, configured to adjust the plurality of initial weights W of the i-th network layer based on the error between the output data of the n network layers and the expected result data in the sample data.

23. A computer-readable medium having instructions stored thereon which, when executed on a computer, cause the computer to perform the method for training a neural network model of any one of claims 1 to 20.

24. An electronic device, comprising:

a memory, configured to store instructions to be executed by one or more processors of a system; and

a processor, being one of the processors of the system, configured to perform the method for training a neural network model of any one of claims 1 to 20.
CN202010086380.0A 2020-02-11 2020-02-11 Training method of neural network model and its media and electronic equipment Active CN111401546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010086380.0A CN111401546B (en) 2020-02-11 2020-02-11 Training method of neural network model and its media and electronic equipment


Publications (2)

Publication Number Publication Date
CN111401546A true CN111401546A (en) 2020-07-10
CN111401546B CN111401546B (en) 2023-12-08

Family

ID=71428349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010086380.0A Active CN111401546B (en) 2020-02-11 2020-02-11 Training method of neural network model and its media and electronic equipment

Country Status (1)

Country Link
CN (1) CN111401546B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408197A (en) * 2021-06-11 2021-09-17 华帝股份有限公司 Training method of temperature field mathematical model
CN113467590A (en) * 2021-09-06 2021-10-01 南京大学 Many-core chip temperature reconstruction method based on correlation and artificial neural network
CN113642740A (en) * 2021-08-12 2021-11-12 百度在线网络技术(北京)有限公司 Model training method and device, electronic device and medium
CN115238871A (en) * 2022-08-12 2022-10-25 湖南国科微电子股份有限公司 Correcting method and system of quantization neural network and related components

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6269351B1 (en) * 1999-03-31 2001-07-31 Dryken Technologies, Inc. Method and system for training an artificial neural network
US10192327B1 (en) * 2016-02-04 2019-01-29 Google Llc Image compression with recurrent neural networks
CN109447532A (en) * 2018-12-28 2019-03-08 中国石油大学(华东) A kind of oil reservoir inter well connectivity based on data-driven determines method
US10422854B1 (en) * 2019-05-01 2019-09-24 Mapsted Corp. Neural network training for mobile device RSS fingerprint-based indoor navigation
WO2019180314A1 (en) * 2018-03-20 2019-09-26 Nokia Technologies Oy Artificial neural networks
CN110490295A (en) * 2018-05-15 2019-11-22 华为技术有限公司 A neural network model, data processing method and processing device
EP3591586A1 (en) * 2018-07-06 2020-01-08 Capital One Services, LLC Data model generation using generative adversarial networks and fully automated machine learning system which generates and optimizes solutions given a dataset and a desired outcome
CN110717585A (en) * 2019-09-30 2020-01-21 上海寒武纪信息科技有限公司 Training method of neural network model, data processing method and related product


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
蔡瑞初 et al.: "面向"边缘"应用的卷积神经网络量化与压缩方法" [Quantization and compression methods for convolutional neural networks for "edge" applications], 《计算机应用》 (Journal of Computer Applications), vol. 38, no. 9, 10 September 2018, pages 2449-2454 *


Also Published As

Publication number Publication date
CN111401546B (en) 2023-12-08

Similar Documents

Publication Publication Date Title
US11475298B2 (en) Using quantization in training an artificial intelligence model in a semiconductor solution
CN113326930B (en) Data processing method, neural network training method, related device and equipment
US12475388B2 (en) Machine learning model search method, related apparatus, and device
CN109816589B (en) Method and apparatus for generating manga style transfer model
CN111401546A (en) Training method of neural network model and its medium and electronic device
CN110084281A (en) Image generation method, neural network compression method, related device and equipment
CN109800865B (en) Neural network generation and image processing method and device, platform and electronic equipment
CN114065900B (en) Data processing methods and data processing devices
CN111670463B (en) Machine learning-based geometric mesh simplification
US20200320385A1 (en) Using quantization in training an artificial intelligence model in a semiconductor solution
CN112990440A (en) Data quantization method for neural network model, readable medium, and electronic device
JP2020004433A (en) Information processing apparatus and information processing method
CN120318591A (en) Image recognition and classification method, system, device and medium
CN113762297A (en) An image processing method and device
CN117975211A (en) Image processing method and device based on multi-mode information
CN113537470A (en) Model quantization method and device, storage medium and electronic device
WO2024060727A1 (en) Method and apparatus for training neural network model, and device and system
CN114510911B (en) Text processing method, device, computer equipment and storage medium
KR20210082993A (en) Quantized image generation method and sensor debice for perfoming the same
CN115936092A (en) Neural network model quantization method and device, storage medium and electronic device
CN113468935B (en) Face recognition method
CN114945105B (en) Wireless earphone audio hysteresis cancellation method combined with sound compensation
CN114239792B (en) Systems, devices and storage media for image processing using quantitative models
Chang et al. Ternary weighted networks with equal quantization levels
CN112435169A (en) Image generation method and device based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant