
CN104021224A - Image labeling method based on layer-by-layer label fusing deep network


Info

Publication number
CN104021224A
CN104021224A (application CN201410290316.9A)
Authority
CN
China
Prior art keywords
layer
representation
image
deep network
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410290316.9A
Other languages
Chinese (zh)
Inventor
徐常胜 (Changsheng Xu)
袁召全 (Zhaoquan Yuan)
桑基韬 (Jitao Sang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201410290316.9A
Publication of CN104021224A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/19 Recognition using electronic means
    • G06V 30/192 Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V 30/194 References adjustable by an adaptive method, e.g. learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image annotation method based on a deep network with layer-by-layer label fusion. The method comprises the following steps: extracting low-level visual features from the training images in a training set; organizing the labels of the training images into a hierarchy to construct a hierarchical label structure; fusing the low-level visual feature information and the label information of the training images layer by layer, and obtaining hierarchical feature representations of the training images through deep-network parameter learning; extracting low-level visual features from the test images in a test set and obtaining their hierarchical feature representations through the learned deep network; and finally, predicting the annotation information of each test image from its hierarchical feature representation. The disclosed method performs hierarchical annotation and is more accurate than conventional annotation methods.

Description

Image annotation method based on a layer-by-layer label fusion deep network

Technical Field

The present invention relates to the technical field of social-network image annotation, and in particular to an image annotation method based on a deep network with layer-by-layer label fusion.

Background

In recent years, with the continuous development of social media, the number of images on social platforms has grown explosively, and how to annotate massive collections of social images has become an important research topic in network multimedia.

Current mainstream image annotation methods concentrate on approaches based on visual information: they first extract low-level features and then use machine learning models to classify images represented by those features. Such methods have achieved good results to a certain extent, but because they use only visual information and ignore the contextual text information, their performance is still not ideal.

The core of image annotation is to use image-related information (including visual content, contextual text labels, and so on) to understand image content. Fusing the label information and the visual information of an image yields more expressive image features, which significantly benefits image annotation, especially for social images. However, the heterogeneity of visual features and textual label information makes fusing the two kinds of information challenging. The image annotation method proposed in the present invention, based on a deep network with layer-by-layer label fusion, fuses the two kinds of information layer by layer, solves the problem of heterogeneous information fusion, and plays an important role in social image annotation.

Summary of the Invention

In order to solve the above problems in the prior art, the present invention proposes an image annotation method based on a deep network with layer-by-layer label fusion.

The image annotation method based on a layer-by-layer label fusion deep network proposed by the present invention comprises the following steps:

Step 1: for the training images in the training set, extract their low-level visual features X;

Step 2: organize the labels of the training images hierarchically, constructing a hierarchical label structure;

Step 3: for the training images, fuse their low-level visual feature information and label information layer by layer, and obtain hierarchical feature representations of the training images through deep-network parameter learning;

Step 4: for the test images in the test set, extract their low-level visual features, obtain their hierarchical feature representations through the learned deep network, and finally predict their annotation information from those hierarchical feature representations.

Internet image annotation has found wide application in many important related fields. Owing to the semantic gap between low-level visual information and high-level semantics, vision-based image annotation is a challenging problem. The method described above can annotate social images automatically, and its hierarchical annotation is more accurate than traditional annotation methods.

Brief Description of the Drawings

Fig. 1 is a flowchart of the image annotation method based on a layer-by-layer label fusion deep network according to an embodiment of the present invention;

Fig. 2 is an example diagram of a label hierarchy;

Fig. 3 is a model structure diagram of the layer-by-layer feature fusion deep network according to an embodiment of the present invention.

Detailed Description

In order to make the object, technical solution and advantages of the present invention clearer, the present invention is described in further detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

The data sets involved in the proposed method include: 1) a training set, which contains images and the social labels associated with those images; and 2) a test set, which contains only the test images to be annotated, without any label information.

Considering the heterogeneity of an image's low-level visual information and its social label information, the present invention proposes an image annotation method based on a deep network with layer-by-layer label fusion. The core idea of the method is to fuse label information and visual information layer by layer within the framework of a deep network, thereby learning hierarchical image features that provide the feature representation for image annotation.

Fig. 1 shows the flowchart of the proposed method. As shown in Fig. 1, the method comprises:

Step 1: for the training images in the training set, extract their low-level visual features;

Step 2: organize the labels of the training images hierarchically, constructing a hierarchical label structure;

Step 3: for the training images, fuse their low-level visual feature information and label information layer by layer, and obtain hierarchical feature representations of the training images through deep-network parameter learning;

Step 4: for the test images in the test set, extract their low-level visual features, obtain their hierarchical feature representations through the learned deep network, and finally predict their annotation information from those hierarchical feature representations.

The execution of each of the four steps is described in detail below.

In step 1, low-level visual feature extraction yields the initial representation of each image. For the image information, the present invention preferably uses scale-invariant feature transform (SIFT) features (for example, 1000-dimensional) as the low-level visual features of an image; these features are denoted by X.
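For illustration only (the patent does not specify an extraction pipeline), a 1000-dimensional SIFT representation of this kind is commonly built with a bag-of-visual-words approach: detect SIFT descriptors, cluster them into a 1000-word codebook, and describe each image by its word histogram. The sketch below is one plausible realization using OpenCV and scikit-learn; the function names and codebook size are assumptions.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def sift_descriptors(image_paths):
    """Collect raw 128-dim SIFT descriptors from each image."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        per_image.append(desc if desc is not None else np.zeros((0, 128)))
    return per_image

def build_bow_features(image_paths, n_words=1000):
    """Quantize descriptors into a 1000-word codebook and represent
    each image as a normalized visual-word histogram (the feature X)."""
    per_image = sift_descriptors(image_paths)
    codebook = MiniBatchKMeans(n_clusters=n_words, random_state=0)
    codebook.fit(np.vstack(per_image).astype(np.float32))
    X = np.zeros((len(per_image), n_words))
    for i, desc in enumerate(per_image):
        if len(desc):
            words = codebook.predict(desc.astype(np.float32))
            hist = np.bincount(words, minlength=n_words).astype(float)
            X[i] = hist / hist.sum()
    return X
```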

In step 2, using available tools (the present invention preferably uses WordNet), a label hierarchy with K layers is constructed for the social labels of the images. For example, if an image carries the labels animal, plant, cat, dog and flower, the corresponding label hierarchy is as shown in Fig. 2 (here the number of layers is 2).
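For illustration, a two-layer hierarchy like the one in Fig. 2 can be derived from WordNet hypernym relations. The following is a minimal sketch assuming NLTK's WordNet interface (after `nltk.download('wordnet')`); the choice of candidate parents and the helper name are illustrative assumptions, not prescribed by the patent.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def parent_label(tag, parents=("animal", "plant")):
    """Map a leaf tag (e.g. 'cat') to a top-level parent by walking
    up the WordNet hypernym path of its most common noun sense."""
    synsets = wn.synsets(tag, pos=wn.NOUN)
    if not synsets:
        return None
    path = synsets[0].hypernym_paths()[0]
    ancestors = {s.name().split(".")[0] for s in path}
    for p in parents:
        if p in ancestors:
            return p
    return None

tags = ["cat", "dog", "flower"]
hierarchy = {t: parent_label(t) for t in tags}
# expected: {'cat': 'animal', 'dog': 'animal', 'flower': 'plant'}
```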

Step 3 fuses, for the training images, their low-level visual feature information and label information layer by layer, and obtains the hierarchical features of the training images through deep-network parameter learning.

In step 3, a deep network with L layers (L > K) is constructed, and the K layers of the label hierarchy are aligned with the top layers of the deep network. Let the variables of the layers of the deep network be denoted h = {h^{(0)}, ..., h^{(L)}}, where h^{(0)} is the low-level visual feature X of the image; the variables of the layers corresponding to the K-layer label hierarchy are denoted y = {y^{(L-K+1)}, ..., y^{(L)}}.
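To fix the notation, the sketch below lays out one illustrative configuration of layer sizes, with the top K layers of the network paired with the K layers of the label hierarchy; all dimensions are assumptions made for the example, not values from the patent.

```python
import numpy as np

# Illustrative configuration: L = 4 network layers above h^(0), K = 2 label layers.
L, K = 4, 2
layer_dims = [1000, 800, 500, 300, 200]   # dimensions of h^(0) ... h^(L)
label_dims = {3: 10, 4: 25}               # labels per hierarchy layer, y^(L-K+1) ... y^(L)

rng = np.random.default_rng(0)
params = {  # weight and bias of Eq. (1) for each layer transition
    l: {"W": rng.normal(scale=0.01, size=(layer_dims[l], layer_dims[l - 1])),
        "b": np.zeros(layer_dims[l])}
    for l in range(1, L + 1)
}
```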

This step is the key part of the present invention. Fig. 3 is a model structure diagram of the layer-by-layer feature fusion deep network according to an embodiment of the present invention. Referring to Fig. 3, step 3 can be divided into the following sub-steps:

Step 3.1: by constructing auto-encoders, preliminarily adjust the parameters of the deep network from layer h^{(0)} to layer h^{(L-K+1)} based on the reconstruction error.

Step 3.1 further comprises the following steps:

Step 3.1.1: from layer h^{(0)} up to layer h^{(L-K+1)}, build an auto-encoder between every pair of adjacent layers; through this auto-encoder, the representation of the upper layer is obtained as a mapping of the representation of the lower layer.

For example, based on the auto-encoder between layers h^{(l-1)} and h^{(l)}, the representation of layer h^{(l)} is obtained by mapping the representation of layer h^{(l-1)}:

h^{(l)} = s(W_h^{(l-1)} h^{(l-1)} + b^{(l)})    (1)

where W_h^{(l-1)} denotes the weight parameters between layers h^{(l-1)} and h^{(l)}, b^{(l)} denotes the bias parameters of layer h^{(l)}, and s(·) denotes the logistic function s(x) = 1/(1 + e^{-x}).

In this way, the representation of layer h^{(l)} is obtained from the representation of layer h^{(l-1)} through the mapping.

Step 3.1.2: map the representation of the upper layer back to obtain a reconstructed representation of the lower layer.

For example, mapping the representation of h^{(l)} back yields the reconstructed representation z of h^{(l-1)}:

z = s(W_h'^{(l-1)} h^{(l)} + b')    (2)

where W_h'^{(l-1)} is the transpose of W_h^{(l-1)}, and b' denotes the bias parameters of layer h^{(l-1)}.

Step 3.1.3: adjust the parameters of the deep network according to the error between the correct representation and the reconstructed representation.

For example, the preliminary adjustment of the deep-network parameters is achieved by minimizing the reconstruction error between z and the representation of layer h^{(l-1)}. In an embodiment of the present invention, the reconstruction cross-entropy is preferably minimized:

L_rec = -Σ_{k=1}^{D^{(l-1)}} [ h_k^{(l-1)} ln z_k + (1 - h_k^{(l-1)}) ln(1 - z_k) ]    (3)

where k indexes the components of z and D^{(l-1)} denotes the dimensionality of z.

Proceeding in this way, the parameters are adjusted layer by layer up to layer h^{(L-K+1)}.
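To make equations (1)-(3) concrete, here is a minimal numpy sketch of one gradient step for a single tied-weight auto-encoder (sigmoid units, cross-entropy reconstruction loss). The learning rate, shapes and function name are assumptions; a real implementation would iterate over mini-batches and repeat the procedure for each pair of adjacent layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def autoencoder_step(h, W, b, b_prime, lr=0.1):
    """One update of the auto-encoder of Eqs. (1)-(3) for an input h in [0, 1]."""
    h_up = sigmoid(W @ h + b)              # Eq. (1): encode lower layer into upper layer
    z = sigmoid(W.T @ h_up + b_prime)      # Eq. (2): decode back (tied weights W')
    loss = -np.sum(h * np.log(z) + (1 - h) * np.log(1 - z))  # Eq. (3)
    d_out = z - h                            # delta at decoder pre-activation (sigmoid + cross-entropy)
    d_hid = (W @ d_out) * h_up * (1 - h_up)  # delta at encoder pre-activation
    W -= lr * (np.outer(d_hid, h) + np.outer(h_up, d_out))  # both paths share W
    b -= lr * d_hid
    b_prime -= lr * d_out
    return loss

# Toy usage: pretrain one 1000 -> 500 transition on a random feature vector.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(500, 1000))
b, b_prime = np.zeros(500), np.zeros(1000)
x = rng.random(1000)
for _ in range(10):
    loss = autoencoder_step(x, W, b, b_prime)
```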

Step 3.2: for layers h^{(L-K+1)} up to the top layer h^{(L)} of the deep network, combine a layer of the deep network, say h^{(l)}, with the corresponding layer of the label hierarchy, say y^{(l)}, to perform feature fusion and adjust the corresponding parameters of the deep network.

This step can in turn be divided into two sub-steps (taking h^{(l)} as an example):

Step 3.2.1: use the labels of layer y^{(l)} of the label hierarchy to adjust the parameters of the deep network from layer h^{(0)} to layer h^{(l)}.

In this step, the cross-entropy loss is computed first:

Loss({W, b}) = -Σ_{n=1}^{N} Σ_{k=1}^{K} t_{nk} ln y_{nk}    (4)

where N denotes the number of samples, K here denotes the number of labels in this layer, y_{nk} denotes the model's prediction for the k-th dimension of the n-th sample, and t_{nk} denotes the true value of the k-th dimension of the n-th training sample.

This loss is then propagated back to adjust the parameters of the deep network from layer h^{(0)} to layer h^{(l)}; in an embodiment of the present invention, the well-known back-propagation algorithm is used for this global parameter adjustment.
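As an illustration of equation (4), the sketch below computes the label-layer cross-entropy loss and the gradient that back-propagation would start from at layer h^{(l)}, assuming softmax outputs of the form of equation (6). The array shapes and names are assumptions made for the sketch.

```python
import numpy as np

def label_layer_loss(H, T, W_lab):
    """Eq. (4) for one label layer.
    H: (N, D) hidden representations h^(l);  T: (N, K) binary targets t_nk;
    W_lab: (D, K) label weights.  Returns the loss and dLoss/dH."""
    scores = H @ W_lab
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    Y = np.exp(scores)
    Y /= Y.sum(axis=1, keepdims=True)             # softmax predictions y_nk, as in Eq. (6)
    loss = -np.sum(T * np.log(Y + 1e-12))         # Eq. (4)
    dS = Y * T.sum(axis=1, keepdims=True) - T     # dLoss/dscores (valid for multi-hot T)
    dH = dS @ W_lab.T                             # gradient entering h^(l) for back-propagation
    return loss, dH
```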

Step 3.2.2: merge the representations of layers h^{(l)} and y^{(l)} and learn from them the feature representation of layer h^{(l+1)}.

In this step, the representations of layers h^{(l)} and y^{(l)} are merged and, together with the representation of layer h^{(l+1)}, constitute an auto-encoder:

h^{(l+1)} = s(W_h^{(l)} h^{(l)} + W_y^{(l)} y^{(l)} + b^{(l+1)})    (5)

Likewise, the parameters among h^{(l)}, y^{(l)} and h^{(l+1)} are optimized by minimizing the reconstruction cross-entropy.

This proceeds in the same manner up to layer h^{(L)}.
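A minimal sketch of the fusion step of equation (5) follows; decoding and the cross-entropy update mirror the auto-encoder step shown earlier, so only the fused encoding is written out. The dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_layer(h_l, y_l, W_h, W_y, b_next):
    """Eq. (5): fuse the visual representation h^(l) with the
    label-layer representation y^(l) into h^(l+1)."""
    return sigmoid(W_h @ h_l + W_y @ y_l + b_next)

# Toy usage: a 500-dim hidden layer fused with a 20-label layer into 300 units.
rng = np.random.default_rng(0)
h_next = fuse_layer(rng.random(500),
                    rng.integers(0, 2, 20).astype(float),
                    rng.normal(scale=0.01, size=(300, 500)),
                    rng.normal(scale=0.01, size=(300, 20)),
                    np.zeros(300))
```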

Through this layer-by-layer feature fusion, the label information of the image is fused into the visual information, and the parameters of the deep network are optimized at the same time.

In step 4, the deep network whose parameters have been optimized is used to annotate the test images in the test set.

Step 4 is further divided into the following sub-steps:

Step 4.1: extract the low-level visual features X_test of each test image; this step is analogous to the extraction of low-level visual features from the training images in step 1.

Step 4.2: using the deep network with optimized parameters, obtain the hierarchical feature representation {h^{(L-K+1)}, ..., h^{(L)}} of the test image's low-level visual features X_test.

Step 4.3: use the hierarchical feature representation {h^{(L-K+1)}, ..., h^{(L)}} to predict the label information of the test image:

y_i^{(l)} = exp(W_i^T h^{(l)}) / Σ_j exp(W_j^T h^{(l)})    (6)

where W_i denotes the weights between label i and the feature h^{(l)}.
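For illustration, equation (6) is a softmax over the labels of one hierarchy layer; a minimal sketch of the resulting prediction step is given below, where the number of returned labels and the helper name are assumptions.

```python
import numpy as np

def predict_labels(h_l, W_lab, label_names, top=2):
    """Eq. (6): softmax over one label layer, returning the top-scoring labels.
    h_l: (D,) feature h^(l);  W_lab: (D, K) weights W_i;  label_names: K names."""
    scores = W_lab.T @ h_l              # W_i^T h^(l) for every label i
    scores -= scores.max()              # numerical stability
    y = np.exp(scores) / np.exp(scores).sum()
    best = np.argsort(y)[::-1][:top]
    return [(label_names[i], float(y[i])) for i in best]
```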

The specific embodiments described above further illustrate the object, technical solutions and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. An image annotation method based on a layer-by-layer label fusion deep network, characterized by comprising the following steps:
step 1, extracting low-level visual features X of the training images in a training set;
step 2, organizing the labels of the training images hierarchically to construct a hierarchical label structure;
step 3, fusing the low-level visual feature information and label information of the training images layer by layer, and obtaining hierarchical feature representations of the training images through deep-network parameter learning;
step 4, extracting low-level visual features of the test images in a test set, then obtaining hierarchical feature representations of the test images through the deep network, and finally predicting the annotation information of the test images according to their hierarchical feature representations.
2. The method of claim 1, wherein the low-level visual features of a training image are its scale-invariant feature transform features.
3. The method of claim 1, wherein the deep network has L layers and the label hierarchy has K layers, with L > K; the variables of the layers of the deep network are denoted h = {h^{(0)}, ..., h^{(L)}}, where h^{(0)} denotes the low-level visual feature X of an image, and the variables of the layers corresponding to the label hierarchy are denoted y = {y^{(L-K+1)}, ..., y^{(L)}}.
4. The method according to claim 3, wherein step 3 comprises the steps of:
step 3.1: by constructing auto-encoders, preliminarily adjusting the parameters of the deep network from layer h^{(0)} to layer h^{(L-K+1)} based on the reconstruction error;
step 3.2: for layers h^{(L-K+1)} up to the top layer h^{(L)} of the deep network, combining a layer of the deep network, such as h^{(l)}, with the corresponding layer of the label hierarchy, such as y^{(l)}, to perform feature fusion and adjust the corresponding parameters of the deep network.
5. The method according to claim 4, wherein step 3.1 further comprises the steps of:
step 3.1.1: from layer h^{(0)} up to layer h^{(L-K+1)}, constructing an auto-encoder between every two adjacent layers, by which the representation of the upper layer is derivable as a mapping of the representation of the lower layer;
step 3.1.2: mapping the upper-layer representation back to obtain a reconstructed representation of the lower layer;
step 3.1.3: adjusting the parameters of the deep network according to the error between the correct representation and the reconstructed representation, up to layer h^{(L-K+1)}.
6. The method according to claim 5, wherein in step 3.1.3 the parameters of the deep network are adjusted by minimizing the reconstruction cross-entropy.
7. The method according to claim 4, wherein step 3.2 further comprises the steps of:
step 3.2.1: using the labels of a layer y^{(l)} of the label hierarchy to adjust the parameters of the deep network from layer h^{(0)} to layer h^{(l)};
step 3.2.2: merging the representations of layers h^{(l)} and y^{(l)} to learn the feature representation of layer h^{(l+1)}, and adjusting the corresponding parameters of the deep network, up to layer h^{(L)}.
8. The method according to claim 7, wherein in steps 3.2.1 and 3.2.2 the parameters of the deep network are adjusted using a back-propagation algorithm based on the cross-entropy loss.
9. The method of claim 7, wherein in step 3.2.2 the merged representations of layers h^{(l)} and y^{(l)}, together with the representation of layer h^{(l+1)}, constitute an auto-encoder.
10. The method of claim 1, wherein step 4 further comprises the steps of:
step 4.1: extracting the low-level visual features of a test image;
step 4.2: obtaining the hierarchical feature representation of the low-level visual features of the test image using the deep network;
step 4.3: predicting the label information of the test image using its hierarchical feature representation.
CN201410290316.9A 2014-06-25 2014-06-25 Image labeling method based on layer-by-layer label fusing deep network Pending CN104021224A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410290316.9A CN104021224A (en) 2014-06-25 2014-06-25 Image labeling method based on layer-by-layer label fusing deep network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410290316.9A CN104021224A (en) 2014-06-25 2014-06-25 Image labeling method based on layer-by-layer label fusing deep network

Publications (1)

Publication Number Publication Date
CN104021224A true CN104021224A (en) 2014-09-03

Family

ID=51437978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410290316.9A Pending CN104021224A (en) 2014-06-25 2014-06-25 Image labeling method based on layer-by-layer label fusing deep network

Country Status (1)

Country Link
CN (1) CN104021224A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572940A (en) * 2014-12-30 2015-04-29 中国人民解放军海军航空工程学院 Automatic image annotation method based on deep learning and canonical correlation analysis
CN105631479A (en) * 2015-12-30 2016-06-01 中国科学院自动化研究所 Imbalance-learning-based depth convolution network image marking method and apparatus
CN106570910A (en) * 2016-11-02 2017-04-19 南阳理工学院 Auto-encoding characteristic and neighbor model based automatic image marking method
CN108595558A (en) * 2018-04-12 2018-09-28 福建工程学院 A kind of image labeling method of data balancing strategy and multiple features fusion
CN108875934A (en) * 2018-05-28 2018-11-23 北京旷视科技有限公司 A kind of training method of neural network, device, system and storage medium
CN109271539A (en) * 2018-08-31 2019-01-25 华中科技大学 A kind of image automatic annotation method and device based on deep learning
WO2020073952A1 (en) * 2018-10-10 2020-04-16 腾讯科技(深圳)有限公司 Method and apparatus for establishing image set for image recognition, network device, and storage medium
CN111583321A (en) * 2019-02-19 2020-08-25 富士通株式会社 Image processing device, method and medium
CN112331314A (en) * 2020-11-25 2021-02-05 中山大学附属第六医院 Image annotation method and device, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120233159A1 (en) * 2011-03-10 2012-09-13 International Business Machines Corporation Hierarchical ranking of facial attributes
CN103544392A (en) * 2013-10-23 2014-01-29 电子科技大学 Deep learning based medical gas identifying method
CN103593474A (en) * 2013-11-28 2014-02-19 中国科学院自动化研究所 Image retrieval ranking method based on deep learning
CN103823845A (en) * 2014-01-28 2014-05-28 浙江大学 Method for automatically annotating remote sensing images on basis of deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120233159A1 (en) * 2011-03-10 2012-09-13 International Business Machines Corporation Hierarchical ranking of facial attributes
CN103544392A (en) * 2013-10-23 2014-01-29 电子科技大学 Deep learning based medical gas identifying method
CN103593474A (en) * 2013-11-28 2014-02-19 中国科学院自动化研究所 Image retrieval ranking method based on deep learning
CN103823845A (en) * 2014-01-28 2014-05-28 浙江大学 Method for automatically annotating remote sensing images on basis of deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhaoquan Yuan et al., "Tag-aware image classification via nested deep belief nets," IEEE International Conference on Multimedia and Expo. *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572940B (en) * 2014-12-30 2017-11-21 中国人民解放军海军航空工程学院 A kind of image automatic annotation method based on deep learning and canonical correlation analysis
CN104572940A (en) * 2014-12-30 2015-04-29 中国人民解放军海军航空工程学院 Automatic image annotation method based on deep learning and canonical correlation analysis
CN105631479A (en) * 2015-12-30 2016-06-01 中国科学院自动化研究所 Imbalance-learning-based depth convolution network image marking method and apparatus
CN105631479B (en) * 2015-12-30 2019-05-17 中国科学院自动化研究所 Depth convolutional network image labeling method and device based on non-equilibrium study
CN106570910A (en) * 2016-11-02 2017-04-19 南阳理工学院 Auto-encoding characteristic and neighbor model based automatic image marking method
CN106570910B (en) * 2016-11-02 2019-08-20 南阳理工学院 Image automatic labeling method based on self-encoding features and neighbor model
CN108595558B (en) * 2018-04-12 2022-03-15 福建工程学院 Image annotation method based on data equalization strategy and multi-feature fusion
CN108595558A (en) * 2018-04-12 2018-09-28 福建工程学院 A kind of image labeling method of data balancing strategy and multiple features fusion
CN108875934A (en) * 2018-05-28 2018-11-23 北京旷视科技有限公司 A kind of training method of neural network, device, system and storage medium
CN109271539A (en) * 2018-08-31 2019-01-25 华中科技大学 A kind of image automatic annotation method and device based on deep learning
WO2020073952A1 (en) * 2018-10-10 2020-04-16 腾讯科技(深圳)有限公司 Method and apparatus for establishing image set for image recognition, network device, and storage medium
US11853352B2 (en) 2018-10-10 2023-12-26 Tencent Technology (Shenzhen) Company Limited Method and apparatus for establishing image set for image recognition, network device, and storage medium
CN111583321A (en) * 2019-02-19 2020-08-25 富士通株式会社 Image processing device, method and medium
CN112331314A (en) * 2020-11-25 2021-02-05 中山大学附属第六医院 Image annotation method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN104021224A (en) Image labeling method based on layer-by-layer label fusing deep network
CN116541911B (en) Packaging design system based on artificial intelligence
CN103823845B (en) Method for automatically annotating remote sensing images on basis of deep learning
Zhang et al. Fast and accurate land-cover classification on medium-resolution remote-sensing images using segmentation models
CN105631479B (en) Depth convolutional network image labeling method and device based on non-equilibrium study
CN105045907B (en) A kind of construction method of vision attention tagging user interest tree for Personalized society image recommendation
CN107220506A (en) Breast cancer risk assessment analysis system based on deep convolutional neural network
CN112990222B (en) A Guided Semantic Segmentation Method Based on Image Boundary Knowledge Transfer
CN112837338B (en) Semi-supervised medical image segmentation method based on generation countermeasure network
CN109359297A (en) A method and system for relation extraction
CN110399518A (en) A Visual Question Answering Enhancement Method Based on Graph Convolution
CN116975615A (en) Task prediction method and device based on video multi-mode information
Harrie et al. Machine learning in cartography
CN109597998A (en) A kind of characteristics of image construction method of visual signature and characterizing semantics joint insertion
CN103440651B (en) A kind of multi-tag image labeling result fusion method minimized based on order
CN115292568B (en) A method of extracting people's livelihood news events based on joint model
CN110533074B (en) Automatic image category labeling method and system based on double-depth neural network
CN112036659A (en) Prediction method of social network media information popularity based on combination strategy
CN106056609B (en) Method based on DBNMI model realization remote sensing image automatic markings
CN103530405A (en) Image retrieval method based on layered structure
Lu et al. Exploration and application of graphic design language based on artificial intelligence visual communication
Yi et al. Steel strip defect sample generation method based on fusible feature GAN model under few samples
CN113360659A (en) Cross-domain emotion classification method and system based on semi-supervised learning
CN118152525A (en) Water conservancy knowledge service system based on knowledge graph technology
Xu et al. Remote sensing image segmentation of mariculture cage using ensemble learning strategy

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140903

WD01 Invention patent application deemed withdrawn after publication