
WO2017214970A1 - Building convolutional neural network - Google Patents

Building convolutional neural network

Info

Publication number
WO2017214970A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature maps
indexes
list
determining
program code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2016/086154
Other languages
French (fr)
Inventor
Jiale CAO
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Beijing Co Ltd
Nokia Technologies Oy
Original Assignee
Nokia Technologies Beijing Co Ltd
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Beijing Co Ltd and Nokia Technologies Oy
Priority to CN201680086870.2A (published as CN109643396A)
Priority to EP16905085.3A (published as EP3472760A4)
Priority to PCT/CN2016/086154 (published as WO2017214970A1)
Publication of WO2017214970A1

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide a method, apparatus and computer program product for information processing. The method comprises: determining, based on a training dataset for multimedia content, convolution parameters and first feature maps for a convolutional layer in a convolutional neural network; changing an order of the first feature maps according to correlation among the first feature maps to obtain second feature maps; and updating the convolution parameters based on the training dataset and the second feature maps.

Description

BUILDING CONVOLUTIONAL NEURAL NETWORK

FIELD OF THE INVENTION
Embodiments of the present disclosure generally relate to information processing, and more particularly to methods, apparatuses and computer program products for building a Convolutional Neural Network (CNN) .
BACKGROUND
CNNs have achieved state-of-the-art performance in applications such as image recognition, object detection, and speech recognition. Representative applications of the CNN include AlphaGo, Advanced Driver Assistance Systems (ADAS), self-driving cars, Optical Character Recognition (OCR), face recognition, large-scale image classification (for example, ImageNet classification), and Human-Computer Interaction (HCI).
Generally, CNNs are organized in interleaved layers of two types: convolutional layers and pooling (subsampling) layers. The role of the convolutional layers is feature representation, with the semantic level of the features increasing with the depth of the layers. Designing effective convolutional layers to obtain robust feature maps is the key to improving the performance of CNNs.
SUMMARY
In general, example embodiments of the present disclosure include a method, apparatus and computer program product for building a CNN.
In a first aspect of the present disclosure, a method is provided. The method comprises: determining, based on a training dataset for multimedia content, convolution parameters and first feature maps for a convolutional layer in a convolutional neural network; changing an order of the first feature maps according to correlation among the first feature maps to obtain second feature maps; and updating the convolution parameters based on the training dataset and the second feature maps.
In some embodiments, updating the convolution parameters comprises: determining an amount of change in the order of the first feature maps; and in response to the amount being larger than a predetermined threshold, updating the convolution  parameters.
In some embodiments, the method further comprises: assigning indexes to the first feature maps; and generating a list of the indexes.
In some embodiments, the method further comprises: updating the list of the indexes based on the second feature maps.
In some embodiments, determining an amount of change in the order of the first feature maps comprises: determining a difference between the generated list of the indexes and the updated list of the indexes.
In some embodiments, changing an order of the first feature maps according to correlation among the first feature maps comprises: obtaining representation information of the first feature maps; determining differences among the representation information; and determining the correlation based on the differences among the representation information.
In a second aspect of the present disclosure, an apparatus is provided. The apparatus comprises: at least one processor; and at least one memory including computer program code; the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to perform: determining, based on a training dataset for multimedia content, convolution parameters and first feature maps for a convolutional layer in a convolutional neural network; changing an order of the first feature maps according to correlation among the first feature maps to obtain second feature maps; and updating the convolution parameters based on the training dataset and the second feature maps.
In a third aspect of the present disclosure, an apparatus is provided. The apparatus comprises means for performing a method according to the first aspect of the present disclosure.
In a fourth aspect of the present disclosure, a computer program product is provided. The computer program product comprises at least one computer readable non-transitory memory medium having program code stored thereon, the program code which, when executed by an apparatus, causes the apparatus to perform a method according to the first aspect of the present disclosure.
It is to be understood that the Summary is not intended to identify key or essential features of embodiments of the present disclosure, nor is it intended to be used to limit the  scope of the present disclosure. Other features of the present disclosure will become easily comprehensible through the description below.
BRIEF DESCRIPTION OF THE DRAWINGS
Through the more detailed description of some embodiments of the present disclosure in the accompanying drawings, the above and other objects, features and advantages of the present disclosure will become more apparent, wherein:
Fig. 1 schematically shows an architecture of a CNN in which embodiments of the present disclosure can be implemented;
Fig. 2 is a flowchart of a method in accordance with embodiments of the present disclosure;
Fig. 3a shows an example of the feature maps prior to re-ranking;
Fig. 3b shows an example of the re-ranked feature maps; and
Fig. 4 is a block diagram of an electronic device in which embodiments of the present disclosure can be implemented;
Throughout the drawings, same or similar reference numerals represent the same or similar element.
DETAILED DESCRIPTION
Principles of the present disclosure will now be described with reference to some example embodiments. It is to be understood that these embodiments are described for the purpose of illustration only and to help those skilled in the art understand and implement the present disclosure, without suggesting any limitations as to the scope of the invention. The invention described herein can be implemented in various manners other than the ones described below.
As used herein, the term “includes” and its variants are to be read as open terms that mean “includes, but is not limited to.” The term “based on” is to be read as “based at least in part on.” The terms “one embodiment” and “an embodiment” are to be read as “at least one embodiment.” The term “another embodiment” is to be read as “at least one other embodiment.” Other definitions, explicit and implicit, may be included below.
Reference is first made to Fig. 1, which schematically shows an architecture of a  CNN 100 in which embodiments of the present disclosure can be implemented. It is to be understood that the structure and functionality of the CNN 100 are described only for the purpose of illustration without suggesting any limitations as to the scope of the present disclosure described herein. The present disclosure described herein can be embodied with a different structure and/or functionality.
As shown in Fig. 1, the CNN 100 includes an input layer 110,  convolutional layers  120, 140,  pooling layers  130, 150, and an output layer 160. Typically, the  convolutional layers  120, 140 and the  pooling layers  130, 150 are organized in an alternating form. In some embodiments, the convolutional layer 120 is followed by the pooling layer 130 and the convolutional layer 140 is followed by the pooling layer 150. In some embodiments, the CNN 100 only includes one of the  pooling layers  130 and 150 which follows the successive  convolutional layers  120 and 140. In other embodiments, the CNN 100 does not include any pooling layers.
The CNN 100 may be trained with a training dataset. The training dataset enters the CNN 100 at the input layer 110. Once trained, the CNN 100 may be used for image recognition, object detection, speech recognition, and so on. The role of the  convolutional layers  120 and 140 is feature representation with the semantic level of the features increasing with the depth of the layers. The  pooling layers  130 and 150 are obtained by replacing an output of a preceding convolutional layer at certain location with summary statistic of the nearby outputs. The output layer 160 outputs classification results.
Each of the input layer 110,  convolutional layers  120, 140,  pooling layers  130, 150, and output layer 160 includes one or more planes arranged along a Z dimension. Each of the planes is defined by an X dimension and a Y dimension, which is referred to as a spatial domain. Each of the planes in the  convolutional layers  120, 140 and pooling  layers  130, 150 may be considered as a feature map or a channel which has a feature detector. Thus, the Z dimension is also referred to as a channel dimension or channel domain.
The feature maps of each of the  convolutional layers  120 and 140 may be obtained by applying convolution operation on the feature maps in a respective preceding layer in both spatial domain and channel domain. By means of the convolution operation, each of elements in the feature maps of the  convolutional layers  120 and 140 is only connected with elements in a local region of the feature maps in a preceding layer. In this sense, applying the convolution operation to a preceding layer of a convolutional layer means that  there is a sparse connection between these two layers. Thus, as used herein, the terms “convolution operation” and “sparse connection” may be used interchangeably.
It is known that the convolution operation is suitable for the situation where neighboring elements are highly correlated. However, because existing learning algorithms do not guarantee that neighboring elements between different feature maps in the channel domain are highly correlated, the correlation between neighboring elements in the channel domain is not as large as the correlation between neighboring elements in the spatial domain. As a result, the sparse connections in the channel domain cannot result in a good performance. For example, in the case of image recognition, the feature maps obtained via the convolution operation do not have strong ability of feature representation and thus cannot be used as discriminative representations of an image.
In accordance with embodiments of the present disclosure, a scheme for building a CNN is proposed to improve the correlation between neighboring elements in the channel domain so that applying convolution operations in both spatial domain and channel domain yields a better performance.
Fig. 2 shows a flowchart of a method 200 for building a CNN in accordance with embodiments of the present disclosure. The method 200 may be implemented in a CNN, such as the CNN 100 as shown in Fig. 1. In some embodiments, the method 200 may be performed with respect to both of the  convolutional layers  120 and 140 so as to improve correlation between different feature maps in these convolutional layers. In other embodiments, the method 200 may be performed with respect to any of the  convolutional layers  120 and 140. Hereinafter, for the purpose of illustration, the method 200 will be described by taking convolutional layer 120 as an example.
As shown, the method 200 is entered in step 210, where convolution parameters and first feature maps for the convolutional layer 120 in the CNN 100 are determined based on a training dataset for multimedia content.
As described above, once trained, the CNN 100 may be used for image recognition, object detection, speech recognition, and so on. In this regard, examples of the training dataset for multimedia content include, but are not limited to, training datasets for images, speech, video and the like.
Typically, a convolution operation may be performed by using linear filters. In particular, each of the filters is convolved over the feature maps in a preceding layer with a predefined stride, followed by a nonlinear activation. In the case of using linear filters, the convolution parameters include the weights of the linear filters. Different feature maps correspond to different filter parameters, while all elements within a single feature map share the same parameters. Alternatively, instead of using the linear filters, the convolution operation may be performed by using nonlinear functions, such as a shallow MultiLayer Perceptron (MLP). In this case, the convolution parameters include parameters for the MLP.
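As an illustration of the two variants just described, the following is a minimal sketch (assuming PyTorch; the layer sizes and the ReLU activation are illustrative assumptions, not taken from the source):

```python
# A minimal sketch of the two convolution variants described above: a linear
# filter convolved with a predefined stride followed by a nonlinear
# activation, and a "shallow MLP" convolution realized with 1x1 convolutions.
import torch
import torch.nn as nn

M = 8  # number of feature maps produced by the convolutional layer

# Variant 1: linear filters + nonlinear activation. The weights of Conv2d
# are the "convolution parameters"; all elements of one output feature map
# share the same filter parameters.
linear_conv = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=M, kernel_size=3,
              stride=1, padding=1),
    nn.ReLU(),
)

# Variant 2: a shallow MLP applied at every location, here a 3x3 convolution
# followed by a 1x1 convolution acting as the MLP's hidden layer.
mlp_conv = nn.Sequential(
    nn.Conv2d(3, M, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(M, M, kernel_size=1), nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)   # a dummy training image
feature_maps = linear_conv(x)   # shape: (1, M, 32, 32)
```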
In some embodiments, the convolution parameters and first feature maps for the convolutional layer 120 are determined by applying a learning algorithm to the training dataset. Examples of the learning algorithm include, but are not limited to, back propagation (BP) , stochastic gradient descent (SGD) , and limited-memory BFGS (Broyden, Fletcher, Goldfarb, and Shanno) .
It is to be understood that in order to determine the convolution parameters and first feature maps for the convolutional layer 120, a number of the first feature maps may be pre-determined. Usually, in order to detect multiple features of the multimedia content, multiple feature maps are used in the convolutional layer 120. Thus, the number, denoted by M, may be pre-determined as any appropriate integer larger than one, for example, eight.
In some embodiments, the method 200 further comprises assigning indexes to the determined feature maps. Fig. 3a shows an example of the first feature maps with the indexes assigned. As shown in Fig. 3a, the first feature maps include eight feature maps with the indexes of 1, 2, ..., 8. Thus, the eight feature maps with the indexes of 1, 2, ..., 8 are also called feature maps C1, C2, ..., CM.
In some embodiments, the method 200 further comprises generating a list of the indexes. With the list of the indexes, it can be determined which feature maps are neighbors. In the example shown in Fig. 3a, the list of the indexes, denoted by R, is [1, 2, 3, 4, 5, 6, 7, 8].
As described above, the existing learning algorithms, such as the BP algorithm, do not guarantee that neighboring elements in different feature maps in the channel domain are highly correlated. For example, elements in the feature map 1 and those in feature map 2 are neighboring elements as shown in Fig. 3a, but correlation between these neighboring elements may not be high. Similarly, correlation between the neighboring elements in the feature maps 3 and 4 may not be high.
Referring back to Fig. 2, in order to improve the correlation between the  neighboring elements in different feature maps, in step 220, an order of the first feature maps obtained in step 210 is changed according to correlation among the first feature maps so as to obtain second feature maps. In other words, the second feature maps are obtained by re-ranking the first feature maps. Thus, hereinafter, the second feature maps are also called the “re-ranked feature maps” .
In some embodiments, changing an order of the first feature maps according to correlation among the first feature maps comprises: obtaining representation information of the first feature maps; determining differences among the representation information; and determining the correlation based on the differences among the representation information.
Examples of the representation information include, but are not limited to, Histograms of Oriented Gradient (HOG) in an intensity image, information extracted by an algorithm of Scale-Invariant Feature Transform (SIFT) and the like. In the case of HOG, the representation information is also referred to as HOG features. Hereinafter, for the purpose of illustration, the step 220 will be described in detail by taking HOG features as an example and referring to Fig. 3a.
With respect to each of the feature maps C1, C2, ..., CM as shown in Fig. 3a, HOG features are extracted and expressed as f(Ci), where i = 1, ..., M. It is to be appreciated that the process for extracting HOG features is known in the art, and thus a detailed description thereof is omitted herein.
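For concreteness, a minimal sketch (assuming scikit-image; the HOG parameters below are illustrative defaults, not taken from the source) of extracting f(Ci) from each feature map:

```python
# Extract the K-dimensional HOG descriptor f(Ci) of each feature map Ci.
import numpy as np
from skimage.feature import hog

def hog_features(feature_map: np.ndarray) -> np.ndarray:
    """Return the HOG descriptor f(Ci) of one 2D feature map."""
    return hog(feature_map, orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))

feature_maps = [np.random.rand(32, 32) for _ in range(8)]  # stand-ins for Ci
f = [hog_features(c) for c in feature_maps]                # f[i] = f(C_{i+1})
```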
The re-ranked maps (for example, the second feature maps) are denoted by D1, D2, ..., DM, and let the feature map C1 be the first one of the re-ranked maps, for example, D1←C1. Then, the differences between the HOG features f(C1) of the feature map C1 and those of the remaining M-1 feature maps C2, C3, ..., CM may be determined as follows:
$$g(C_1, C_i) = \sum_{k=1}^{K} \left| f(C_1)(k) - f(C_i)(k) \right|, \quad i = 2, \ldots, M \tag{1}$$

where K is the number of HOG features in each feature map and f(Ci)(k) is the kth HOG feature of the feature map Ci.
Then, a feature map j which has the smallest difference g to the feature map C1 is determined as follows:
$$j = \arg\min_{i \in \{2, \ldots, M\}} g(C_1, C_i) \tag{2}$$
Next, let the feature map Cj be the second one of the re-ranked maps D1, D2, ... , DM, for example, D2←Cj, and the above Equations (1) and (2) are applied to the feature  map Cj to find from the remaining M-2 feature maps a feature map which has the smallest difference to the feature map Cj. Similarly, the other feature maps of the re-ranked maps can be determined.
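To make the greedy re-ranking concrete, the following is a minimal Python sketch, an illustration assuming the difference g of Equation (1) is the sum of absolute HOG differences; the stand-in descriptors are random:

```python
# Greedy re-ranking: start from C1 (D1 <- C1), then repeatedly append the
# remaining feature map whose HOG descriptor is closest to that of the last
# selected map, per Equations (1) and (2).
import numpy as np

def rerank(f):
    order = [0]                              # D1 <- C1 (0-based index)
    remaining = list(range(1, len(f)))
    while remaining:
        last = order[-1]
        # Equation (1): g = sum_k |f(C_last)(k) - f(C_i)(k)|
        # Equation (2): pick the index j with the smallest difference g
        j = min(remaining, key=lambda i: np.abs(f[last] - f[i]).sum())
        order.append(j)
        remaining.remove(j)
    return order

f = [np.random.rand(324) for _ in range(8)]  # stand-in HOG descriptors
R_star = [i + 1 for i in rerank(f)]          # 1-based updated list R*
```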
Fig. 3b shows an example of the re-ranked feature maps. As shown in Fig. 3b, the re-ranked feature maps include eight feature maps with the indexes of 1, 5, 3, 4, 8, 6, 7, 2.
In some embodiments, the method 200 further comprises updating the list of the indexes R based on the re-ranked feature maps. With reference to Fig. 3b, the updated list of the indexes, denoted by R*, is [1, 5, 3, 4, 8, 6, 7, 2].
Referring back to Fig. 2, in step 230, the convolution parameters for the convolutional layer 120 are updated based on the training dataset and the re-ranked feature maps.
In some embodiments, updating the convolution parameters comprises: determining an amount of change in the order of the first feature maps; and in response to the amount being larger than a predetermined threshold, updating the convolution parameters.
In some embodiments, determining the amount of change in the order of the first feature maps comprises: determining a difference between the generated list of the indexes and the updated list of the indexes. In the above example, determining the amount of change in the order of the first feature maps comprises determining a difference between the list of the indexes R and the updated list of the indexes R*.
As an example, the difference between the list of the indexes R and the updated list of the indexes R* may be determined as follows:

$$s(R^*, R) = \sum_{j=1}^{M} \mathbf{1}\left(R^*_j \neq R_j\right) \tag{3}$$

where s(R*, R) represents the difference between the list of the indexes R and the updated list of the indexes R*, 1(·) equals 1 when its condition holds and 0 otherwise, R*_j represents the jth element in the list R*, and R_j represents the jth element in the list R. If s(R*, R) is larger than a predetermined threshold, for example, 0, the convolution parameters will be updated.
As another example, the difference between the list of the indexes R and the updated list of the indexes R* may be determined by determining the ratio of different elements between the lists R* and R to the total number of elements in the list R*, as follows:

$$d_j = \begin{cases} 1, & \text{if } R^*_j \neq R_j \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

$$w(R^*, R) = \frac{1}{M} \sum_{j=1}^{M} d_j \tag{5}$$

where w(R*, R) represents the ratio of different elements between the lists R* and R to the total number of elements in the list R*, R*_j represents the jth element in the list R*, and R_j represents the jth element in the list R. If w(R*, R) is larger than a predetermined threshold, the convolution parameters will be updated. The predetermined threshold may be any value in the range of 0.5 to 1, for example, 0.8.
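The following is a minimal Python sketch of these two measures as reconstructed in Equations (3) to (5), using the index lists from Figs. 3a and 3b:

```python
# s counts the positions where the updated list R* differs from R;
# w normalizes that count by the number of elements M.
def order_change_s(R_star, R):
    return sum(1 for a, b in zip(R_star, R) if a != b)   # Equation (3)

def order_change_w(R_star, R):
    return order_change_s(R_star, R) / len(R)            # Equations (4)-(5)

R      = [1, 2, 3, 4, 5, 6, 7, 8]
R_star = [1, 5, 3, 4, 8, 6, 7, 2]                        # from Fig. 3b
print(order_change_s(R_star, R))                         # 3 positions differ
print(order_change_w(R_star, R))                         # 0.375
```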
As described above, the method 200 may be performed with respect to both of the convolutional layers 120 and 140 so as to improve correlation between different feature maps in these convolutional layers. In this case, the above Equation (3) may be re-written as follows:

$$s(R^*, R) = \sum_{i=1}^{N} \sum_{j=1}^{M} \mathbf{1}\left(R^*(i, j) \neq R(i, j)\right)$$

where R*(i, j) represents the jth element in the list R* for the convolutional layer i, R(i, j) represents the jth element in the list R for the convolutional layer i, and N represents the number of the convolutional layers.
Similarly, the above Equations (4) and (5) may be re-written as follows:

$$d_{i,j} = \begin{cases} 1, & \text{if } R^*(i, j) \neq R(i, j) \\ 0, & \text{otherwise} \end{cases}$$

$$w(R^*, R) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} d_{i,j}$$
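A minimal Python sketch of these multi-layer measures, an illustration under the same reconstruction as above:

```python
# Mismatches are summed over all N convolutional layers; w is normalized by
# the total number of elements N*M across the per-layer index lists.
def multilayer_s(R_star_lists, R_lists):
    return sum(1 for rs, r in zip(R_star_lists, R_lists)
                 for a, b in zip(rs, r) if a != b)

def multilayer_w(R_star_lists, R_lists):
    total = sum(len(r) for r in R_lists)                 # N * M elements
    return multilayer_s(R_star_lists, R_lists) / total

# Example with N = 2 layers (e.g., the convolutional layers 120 and 140)
R_lists      = [[1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7, 8]]
R_star_lists = [[1, 5, 3, 4, 8, 6, 7, 2], [2, 1, 3, 4, 5, 6, 7, 8]]
print(multilayer_w(R_star_lists, R_lists))               # 5/16 = 0.3125
```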
It is to be appreciated that the step 230 may be understood as a process of re-learning of the convolution parameters for the convolutional layer 120 based on the training dataset and the re-ranked feature maps.
In addition, the method 200 may be performed iteratively until the amount of change in the order of the first feature maps is equal to or smaller than the predetermined threshold. In this sense, the method 200 may further comprise updating the feature maps for the convolutional layer 120 based on the updated convolution parameters, and changing an order of the updated feature maps according to correlation among the updated feature maps to obtain third feature maps. Furthermore, the method 200 may further comprise: determining an amount of change in the order of the updated feature maps; and in response to the amount being equal to or smaller than the predetermined threshold, stopping updating the convolution parameters.
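This iterative procedure can be summarized as follows; the sketch below is a hypothetical Python skeleton in which train_layer and extract_hog are placeholder stubs for the learning algorithm (for example, BP or SGD) and the HOG extractor, not APIs from the source:

```python
# Iterate train -> re-rank -> re-learn until the order change is at or
# below the threshold (steps 210, 220, 230 of method 200).
import numpy as np

def train_layer(dataset, M, params=None):       # stub learner (step 210/230)
    return {"w": np.random.rand(M)}, [np.random.rand(32, 32) for _ in range(M)]

def extract_hog(c):                             # stub for f(Ci)
    return c.ravel()

def rerank(f):                                  # Equations (1)-(2), greedy
    order, rest = [0], list(range(1, len(f)))
    while rest:
        j = min(rest, key=lambda i: np.abs(f[order[-1]] - f[i]).sum())
        order.append(j); rest.remove(j)
    return order

def build_layer(dataset, M=8, threshold=0, max_iters=10):
    params, maps = train_layer(dataset, M)      # step 210
    R = list(range(1, M + 1))                   # initial index list R
    for _ in range(max_iters):
        perm = rerank([extract_hog(c) for c in maps])
        R_star = [R[p] for p in perm]           # updated index list R*
        if sum(a != b for a, b in zip(R_star, R)) <= threshold:
            break                               # order is stable: stop
        maps = [maps[p] for p in perm]          # step 220: re-ranked maps
        params, maps = train_layer(dataset, M, params)  # step 230: re-learn
        R = R_star
    return params, maps
```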
In some embodiments, the method 200 may further comprise receiving a testing dataset for multimedia content at the input layer 110, performing a classification on the testing dataset, and outputting results of the classification at the output layer 160.
In some embodiments, the convolution operation is performed by using linear filters and a length of each of the filters is smaller than the number of the feature maps in each of the convolutional layers.
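For illustration, a minimal sketch (assuming PyTorch; treating the channel domain as the depth axis of a 3D convolution is one possible reading of this constraint):

```python
# Convolution that is sparse in both the spatial and the channel domain:
# the M feature maps are stacked along the Z (channel) axis and convolved
# with a 3D filter whose length along Z (3) is smaller than M (= 8).
import torch
import torch.nn as nn

M = 8
maps = torch.randn(1, 1, M, 32, 32)        # (batch, 1, Z, Y, X)
conv3d = nn.Conv3d(in_channels=1, out_channels=1,
                   kernel_size=(3, 3, 3),  # filter length 3 < M along Z
                   padding=(1, 1, 1))
out = conv3d(maps)                         # shape: (1, 1, M, 32, 32)
# Each output element depends on only 3 neighboring feature maps, which is
# why re-ranking correlated maps to be neighbors improves this connection.
```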
To sum up, in the embodiments of the present disclosure, an order of feature maps of at least one of convolutional layers in a CNN is changed according to correlation among the feature maps so that similar feature maps are arranged to be neighbors and then the convolution parameters are re-learned. Thus, the correlation between neighboring elements in the channel domain is improved so that applying convolution operations in both spatial domain and channel domain yields a better performance.
Reference is made to Fig. 4, in which an example electronic device or computer system/server 12 which is applicable to implement the embodiments of the present disclosure is shown. Computer system/server 12 is only illustrative and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention described herein.
As shown in Fig. 4, computer system/server 12 is shown in the form of a general-purpose computing device. The components of computer system/server 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including system memory 28 to processor 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system/server 12 typically includes a variety of computer system  readable media. Such media may be any available media that is accessible by computer system/server 12, and it includes both volatile and non-volatile media, removable and non-removable media.
System memory 28 can include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 can be provided for reading from and writing to a non-removable, non-volatile magnetic media (not shown and typically called a “hard drive” ) . Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (for example, a “floppy disk” ) , and an optical disk drive for reading from or writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM or other optical media can be provided. In such instances, each can be connected to bus 18 by one or more data media interfaces. As will be further depicted and described below, memory 28 may include at least one program product having a set (for example, at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
Program/utility 40, having a set (at least one) of program modules 42, may be stored in memory 28 by way of example, and not limitation, as well as an operating system, one or more application programs, other program modules, and program data. Each of the operating system, one or more application programs, other program modules, and program data or some combination thereof, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methodologies of embodiments of the invention as described herein.
Computer system/server 12 may also communicate with one or more external devices 14, such as a keyboard, a pointing device, or a display 24; with one or more devices that enable a user to interact with computer system/server 12; and/or with any devices (for example, a network card or modem) that enable computer system/server 12 to communicate with one or more other computing devices. Such communication can occur via Input/Output (I/O) interfaces 22. Still yet, computer system/server 12 can communicate with one or more networks such as a local area network (LAN), a general wide area network (WAN), and/or a public network (for example, the Internet) via network adapter 20. As depicted, network adapter 20 communicates with the other components of computer system/server 12 via bus 18. It should be understood that, although not shown, other hardware and/or software components could be used in conjunction with computer system/server 12. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data archival storage systems, and the like.
In computer system/server 12, I/O interfaces 22 may support one or more of various different input devices that can be used to provide input to computer system/server 12. For example, the input device(s) may include a user device such as a keyboard, keypad, touch pad, trackball, and the like. The input device(s) may implement one or more natural user interface techniques, such as speech recognition, touch and stylus recognition, recognition of gestures in contact with the input device(s) and adjacent to the input device(s), recognition of air gestures, head and eye tracking, voice and speech recognition, sensing user brain activity, and machine intelligence.

Claims (14)

  1. A method, comprising:
    determining, based on a training dataset for multimedia content, convolution parameters and first feature maps for a convolutional layer in a convolutional neural network;
    changing an order of the first feature maps according to correlation among the first feature maps to obtain second feature maps; and
    updating the convolution parameters based on the training dataset and the second feature maps.
  2. The method of Claim 1, wherein updating the convolution parameters comprises:
    determining an amount of change in the order of the first feature maps; and
    in response to the amount being larger than a predetermined threshold, updating the convolution parameters.
  3. The method of Claim 2, further comprising:
    assigning indexes to the first feature maps; and
    generating a list of the indexes.
  4. The method of Claim 3, further comprising:
    updating the list of the indexes based on the second feature maps.
  5. The method of Claim 4, wherein determining an amount of change in the order of the first feature maps comprises:
    determining a difference between the generated list of the indexes and the updated list of the indexes.
  6. The method of any of Claims 1 to 5, wherein changing an order of the first feature maps according to correlation among the first feature maps comprises:
    obtaining representation information of the first feature maps;
    determining differences among the representation information; and
    determining the correlation based on the differences among the representation information.
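 
As one concrete interpretation of Claim 6, offered only as a sketch, the mean activation of each map can stand in for its "representation information", with smaller pairwise differences read as higher correlation. Both choices are assumptions made here, since the claim leaves the form of the representation information open.

    import numpy as np

    def correlation_from_representations(feature_maps):
        # feature_maps: (channels, height, width). Each map is summarized by its
        # mean activation; pairwise absolute differences of these summaries are
        # mapped to similarities in (0, 1], so a smaller difference means a
        # higher assumed correlation.
        reps = feature_maps.mean(axis=(1, 2))          # representation information
        diffs = np.abs(reps[:, None] - reps[None, :])  # differences among representations
        return 1.0 / (1.0 + diffs)                     # larger difference -> lower correlation

    maps = np.random.default_rng(1).standard_normal((4, 8, 8))
    print(correlation_from_representations(maps).round(2))
 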
  7. An apparatus, comprising:
    at least one processor; and
    at least one memory including computer program code;
    the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to:
    determine, based on a training dataset for multimedia content, convolution parameters and first feature maps for a convolutional layer in a convolutional neural network;
    change an order of the first feature maps according to correlation among the first feature maps to obtain second feature maps; and
    update the convolution parameters based on the training dataset and the second feature maps.
  8. The apparatus of Claim 7, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to:
    determine an amount of change in the order of the first feature maps; and
    update the convolution parameters in response to the amount being larger than a predetermined threshold.
  9. The apparatus of Claim 8, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus at least to:
    assign indexes to the first feature maps; and
    generate a list of the indexes.
  10. The apparatus of Claim 9, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus at least to:
    update the list of the indexes based on the second feature maps.
  11. The apparatus of Claim 10, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to determine an amount of change in the order of the first feature maps by determining a difference between the generated list of the indexes and the updated list of the indexes.
  12. The apparatus of any of Claims 7 to 11, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to:
    obtain representation information of the first feature maps;
    determine differences among the representation information;
    determine the correlation based on the differences among the representation information; and
    change the order of the first feature maps according to the correlation among the first feature maps to obtain the second feature maps.
  13. An apparatus comprising means for performing a method according to any of Claims 1 to 6.
  14. A computer program product comprising at least one non-transitory computer readable memory medium having program code stored thereon which, when executed by an apparatus, causes the apparatus to perform a method according to any of Claims 1 to 6.