
CN110097010A - Picture and text detection method, device, server and storage medium

Picture and text detection method, device, server and storage medium

Info

Publication number
CN110097010A
Authority
CN
China
Prior art keywords: text, image, detected, feature vector, vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910371247.7A
Other languages
Chinese (zh)
Inventor
申世伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910371247.7A
Publication of CN110097010A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/413: Classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an image-text detection method, an apparatus, a server, and a storage medium, belonging to the field of internet technologies. The image-text detection method provided by the embodiments of the present disclosure acquires an image-text to be detected, where the image-text to be detected includes an image to be detected and a text to be detected; extracts an image feature vector of the image to be detected to obtain a first image feature vector; and extracts a text feature vector of the text to be detected to obtain a first text feature vector. Based on an image-text conversion matrix, the first image feature vector is converted into a text feature vector to obtain a second text feature vector, and the image to be detected and the text to be detected are detected based on the first text feature vector and the second text feature vector. By converting the first image feature vector into the second text feature vector in the text modality through the image-text conversion matrix, the method detects the image-text to be detected according to the first and second text feature vectors of the same modality, improving the accuracy of detection.

Description

Image-text detection method and device, server and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method, an apparatus, a server, and a storage medium for detecting an image and text.
Background
With the development of internet technology, a user can upload images to the internet and attach texts to those images, but the image content and the text information may or may not be consistent. For example, if the image uploaded by the user shows the weather and the matched text is "The weather is fine today", the image content is consistent with the text information. For another example, if the image uploaded by the user shows a cat but the matched text is "The weather is really good today", the image content and the text information are inconsistent. Therefore, it is necessary to detect the consistency between the image content and the text information when a user uploads an image with matched text.
In the related art, the image features and text features of the image-text to be detected are extracted, spliced, and fused; the fused features are then input into a classification model to obtain a classification result, which may be that the image content is consistent or inconsistent with the text information.
However, in the related art, classification is performed mainly on the fused features, and the fused features cannot reflect the relationship between the image features and the text features, so classifying according to the spliced and fused features reduces the accuracy of image-text detection.
Disclosure of Invention
The present disclosure provides an image-text detection method, an apparatus, a server, and a storage medium, which can solve the problem of low accuracy in image-text detection.
According to a first aspect of the embodiments of the present disclosure, there is provided an image-text detection method, including:
acquiring an image-text to be detected, wherein the image-text to be detected comprises an image to be detected and a text to be detected;
extracting an image feature vector of the image to be detected to obtain a first image feature vector, and extracting a text feature vector of the text to be detected to obtain a first text feature vector;
converting the first image feature vector into a text feature vector based on an image-text conversion matrix to obtain a second text feature vector;
and detecting the image to be detected and the text to be detected based on the first text feature vector and the second text feature vector.
In a possible implementation manner, the converting the first image feature vector into a text feature vector based on the image-text conversion matrix to obtain a second text feature vector includes:
performing a cross multiplication operation on the first image feature vector and the image-text conversion matrix to obtain an operation result, and taking the operation result as the second text feature vector.
In another possible implementation manner, the detecting the image to be detected and the text to be detected based on the first text feature vector and the second text feature vector includes:
determining a matching degree between the first text feature vector and the second text feature vector;
when the matching degree is larger than a preset threshold value, determining that the image to be detected is consistent with the text to be detected;
and when the matching degree is not greater than the preset threshold value, determining that the image to be detected is inconsistent with the text to be detected.
In another possible implementation manner, before the converting the first image feature vector into a text feature vector based on the image-text conversion matrix to obtain a second text feature vector, the method further includes:
obtaining a second image feature vector and a third text feature vector corresponding to each sample image-text in at least one sample image-text, wherein the sample image included in each sample image-text is consistent with the sample text included in that sample image-text;
converting the third text feature vector corresponding to each sample image-text into an image feature vector based on a first variable corresponding to a transpose matrix of the image-text conversion matrix to obtain an image vector function of each sample image-text;
and for each sample image-text, determining a first variable when the image vector function of the sample image-text is matched with the second image feature vector of the sample image-text, and transposing the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, the determining, for each sample image-text, a first variable when an image vector function of the sample image-text matches a second image feature vector of the sample image-text, and transposing the first variable to obtain the image-text conversion matrix includes:
determining the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function;
and determining a first variable when the function value of the first function is minimum, and transposing the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, the determining a first variable when the image vector function of the sample image-text matches with the second image feature vector of the sample image-text, and transposing the first variable to obtain the image-text conversion matrix includes:
determining the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function;
converting a second image feature vector of the sample image-text into a text feature vector based on a second variable corresponding to the image-text conversion matrix to obtain a text vector function of the sample image-text, wherein the second variable is a transposed variable of the first variable;
determining the difference between the text vector function of the sample image-text and the third text feature vector of the sample image-text to obtain a second function;
determining the sum of the first function and the second function to obtain a third function;
and determining a first variable when the function value of the third function is minimum, and transposing the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, the acquiring the to-be-detected image-text includes:
receiving an uploading request of a first terminal, wherein the uploading request carries the image-text to be detected;
and acquiring the image-text to be detected from the uploading request.
In another possible implementation manner, the method further includes:
when the image to be detected is consistent with the text to be detected, storing the image to be detected into an image library;
when a query request sent by a second terminal is received, acquiring, according to a query text carried in the query request, an image corresponding to the query text from the image library.
According to a second aspect of the embodiments of the present disclosure, there is provided an image-text detection apparatus, the apparatus including:
the acquisition unit is configured to acquire an image-text to be detected, wherein the image-text to be detected comprises an image to be detected and a text to be detected;
the extraction unit is configured to extract an image feature vector of the image to be detected to obtain a first image feature vector, and extract a text feature vector of the text to be detected to obtain a first text feature vector;
the conversion unit is configured to convert the first image feature vector into a text feature vector based on an image-text conversion matrix to obtain a second text feature vector;
and the detection unit is configured to detect the image to be detected and the text to be detected based on the first text feature vector and the second text feature vector.
In a possible implementation manner, the conversion unit is further configured to perform a cross-product operation on the first image feature vector and the image-text conversion matrix to obtain an operation result, and use the operation result as the second text feature vector.
In another possible implementation manner, the detection unit is further configured to determine a matching degree between the first text feature vector and the second text feature vector; when the matching degree is larger than a preset threshold value, determining that the image to be detected is consistent with the text to be detected; and when the matching degree is not greater than the preset threshold value, determining that the image to be detected is inconsistent with the text to be detected.
In another possible implementation manner, the apparatus further includes:
the obtaining unit is further configured to obtain a second image feature vector and a third text feature vector corresponding to each sample image-text in at least one sample image-text, where the sample image included in each sample image-text is consistent with the sample text;
the conversion unit is further configured to convert the third text feature vector corresponding to each sample image-text into an image feature vector based on a first variable corresponding to a transpose matrix of the image-text conversion matrix, so as to obtain an image vector function of each sample image-text;
the transposition unit is configured to determine a first variable when an image vector function of each sample image-text is matched with a second image feature vector of the sample image-text, and transpose the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, the transposing unit is further configured to determine the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function, determine the first variable when the function value of the first function is minimum, and transpose the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, the transposing unit is further configured to determine the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function; convert the second image feature vector of the sample image-text into a text feature vector based on a second variable corresponding to the image-text conversion matrix to obtain a text vector function of the sample image-text, where the second variable is a transposed variable of the first variable; determine the difference between the text vector function of the sample image-text and the third text feature vector of the sample image-text to obtain a second function; determine the sum of the first function and the second function to obtain a third function; and determine the first variable when the function value of the third function is minimum, and transpose the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, the obtaining unit is further configured to receive an upload request of a first terminal, where the upload request carries the to-be-detected graphics and text; and acquiring the image-text to be detected from the uploading request.
In another possible implementation manner, the apparatus further includes:
the storage unit is configured to store the image to be detected into an image library when the image to be detected is consistent with the text to be detected;
the acquiring unit is further configured to, when receiving a query request sent by a second terminal, acquire an image corresponding to the query text from the image library according to the query text carried in the query request.
According to a third aspect of embodiments of the present disclosure, there is provided a server, including:
one or more processors;
volatile or non-volatile memory for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the image-text detection method of any one of the above first aspects.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having instructions stored thereon, which when executed by a processor of a server, implement the image-text detection method according to any one of the first aspect above.
According to a fifth aspect of embodiments of the present disclosure, there is provided an application program, wherein when instructions of the application program are executed by a processor of a server, the server is enabled to execute the image-text detection method according to any one of the above-mentioned first aspects.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the image-text detection method provided by the embodiment of the disclosure obtains an image-text to be detected, the image-text to be detected comprises an image to be detected and a text to be detected, extracts an image characteristic vector of the image to be detected to obtain a first image characteristic vector, and extracts a text characteristic vector of the text to be detected to obtain a first text characteristic vector. And based on the first text characteristic vector and the second text characteristic vector, detecting the image to be detected and the text to be detected. According to the method, the first image characteristic vector is converted into the second text characteristic vector in the text mode through the image-text conversion matrix, so that the image-text to be detected is detected according to the first text characteristic vector and the second text characteristic vector in the same mode, and the accuracy of detection is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic diagram illustrating an implementation environment for image-text detection according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating an image-text detection method according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating an image-text detection method according to an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating encoding and decoding by a semantic auto-encoder according to an example embodiment.
Fig. 5 is a schematic structural diagram illustrating an image-text detection apparatus according to an exemplary embodiment.
FIG. 6 is a block diagram illustrating the structure of a server in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The embodiment of the present disclosure provides an implementation environment for image-text detection. Referring to fig. 1, the implementation environment includes a first terminal 101 and a server 102, which may be connected through a wireless network. The first terminal 101 runs an application associated with the server 102 and, based on the application, can log into the server 102 to interact with it. The application may be a short video application, a social networking application, or a browser. In the embodiment of the present disclosure, a short video application is taken as an example.
In a possible implementation manner, the first terminal 101 may upload an image-text to the server 102; for convenience of description, the image-text is referred to as the image-text to be detected. The server 102 may share the image-text to be detected on the information display platform corresponding to the first terminal 101, and may detect the image-text to be detected when the first terminal 101 uploads it to the server 102.
In the embodiment of the present disclosure, the image-text to be detected includes an image to be detected and a text to be detected. When the server 102 detects the image-text to be detected, it extracts the image feature vector of the image to be detected in the image-text to obtain a first image feature vector, and extracts the text feature vector of the text to be detected in the image-text to obtain a first text feature vector. Based on the image-text conversion matrix, the first image feature vector is converted into a text feature vector to obtain a second text feature vector, and the image to be detected and the text to be detected are detected according to the first text feature vector and the second text feature vector.
In another possible implementation, the implementation environment further includes a second terminal 103. The second terminal 103 can view the image-texts uploaded by the first terminal 101 through the server 102; correspondingly, the server 102 may also detect an image-text uploaded by the first terminal 101 when the second terminal 103 views it. The timing of the image-text detection by the server 102 is not particularly limited in the embodiment of the present disclosure; in the embodiment described here, the server 102 detects the image-text to be detected upon receiving it.
Correspondingly, after the server 102 detects the image-text to be detected, it may store the image to be detected in an image library when the image to be detected is consistent with the text to be detected. The text to be detected may be located on the image to be detected, that is, the text matched by the user lies on the image; alternatively, the text to be detected and the image to be detected may be independent, that is, the text is not on the image.
When the text to be detected is located on the image to be detected, the server 102 extracts the keywords in the text to be detected, uses the keywords as the labels of the image-text to be detected, and stores image-texts with the same keywords in an image library; when the server 102 subsequently receives a query request, the image corresponding to a label can be queried in the image library according to the label of the image-text to be detected.
When the text to be detected is not on the image to be detected, the server 102 extracts the keywords in the text to be detected and stores the image to be detected into the image library corresponding to the keywords; when the server 102 subsequently receives a query request, the image can be queried from the image library corresponding to the keyword carried in the query, as sketched below.
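Both storage schemes above amount to a keyword-indexed image store. The following is a minimal sketch of that structure, not part of the patent; the class and method names are hypothetical, and keyword extraction is assumed to happen elsewhere.

```python
from collections import defaultdict

class ImageLibrary:
    """Keyword-indexed store: images are filed under keyword labels."""

    def __init__(self):
        self._by_keyword = defaultdict(list)

    def store(self, image, keywords):
        # keywords: labels extracted from the (consistent) matched text
        for kw in keywords:
            self._by_keyword[kw].append(image)

    def query(self, keyword):
        # return the images filed under the keyword from the query text
        return list(self._by_keyword[keyword])
```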
The terminal may be any device on which a short video APP is installed, such as a mobile phone, a tablet (PAD), or a computer. The server 102 provides a background service for the terminal and may be a single server, a server cluster composed of a plurality of servers, or a cloud computing server center; this is not specifically limited in this disclosure. In one possible implementation, the server 102 may be a background server of the short video APP installed in the terminal.
Fig. 2 is a flowchart illustrating an image-text detection method according to an exemplary embodiment. As shown in fig. 2, the method includes the following steps:
in step S21, the to-be-detected text is obtained, and the to-be-detected text includes the to-be-detected image and the to-be-detected text.
In step S22, the image feature vector of the image to be detected is extracted to obtain a first image feature vector, and the text feature vector of the text to be detected is extracted to obtain a first text feature vector.
In step S23, the first image feature vector is converted into a text feature vector based on the image-text conversion matrix, and a second text feature vector is obtained.
In step S24, the image to be detected and the text to be detected are detected based on the first text feature vector and the second text feature vector.
In a possible implementation manner, converting the first image feature vector into a text feature vector based on the image-text conversion matrix to obtain a second text feature vector includes:
and performing a cross multiplication operation on the first image feature vector and the image-text conversion matrix to obtain an operation result, and taking the operation result as the second text feature vector.
In another possible implementation manner, detecting the image to be detected and the text to be detected based on the first text feature vector and the second text feature vector includes:
determining the matching degree between the first text feature vector and the second text feature vector;
when the matching degree is larger than a preset threshold value, determining that the image to be detected is consistent with the text to be detected;
and when the matching degree is not greater than a preset threshold value, determining that the image to be detected is inconsistent with the text to be detected.
In another possible implementation manner, before converting the first image feature vector into a text feature vector based on the image-text conversion matrix to obtain the second text feature vector, the method further includes:
obtaining a second image feature vector and a third text feature vector corresponding to each sample image-text in at least one sample image-text, wherein each sample image-text comprises a sample image consistent with a sample text;
converting the third text feature vector corresponding to each sample image-text into an image feature vector based on a first variable corresponding to a transpose matrix of the image-text conversion matrix to obtain an image vector function of each sample image-text;
and for each sample image-text, determining a first variable when the image vector function of the sample image-text is matched with the second image feature vector of the sample image-text, and transposing the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, for each sample image-text, determining a first variable when the image vector function of the sample image-text matches with the second image feature vector of the sample image-text, and transposing the first variable to obtain an image-text conversion matrix, includes:
determining the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function;
and determining a first variable when the function value of the first function is minimum, and transposing the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, determining a first variable when the image vector function of the sample image-text is matched with the second image feature vector of the sample image-text, and transposing the first variable to obtain an image-text conversion matrix includes:
determining the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function;
converting the second image feature vector of the sample image-text into a text feature vector based on a second variable corresponding to the image-text conversion matrix to obtain a text vector function of the sample image-text, wherein the second variable is a transposed variable of the first variable;
determining the difference between the text vector function of the sample image-text and the third text feature vector of the sample image-text to obtain a second function;
determining the sum of the first function and the second function to obtain a third function;
and determining a first variable when the function value of the third function is minimum, and transposing the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, the obtaining of the to-be-detected image-text includes:
receiving an uploading request of a first terminal, wherein the uploading request carries the image-text to be detected;
and acquiring the image-text to be detected from the uploading request.
In another possible implementation manner, the method further includes:
when the image to be detected is consistent with the text to be detected, storing the image to be detected into an image library;
and when a query request sent by the second terminal is received, acquiring an image corresponding to the query text from the image library according to the query text carried in the query request.
The image-text detection method provided by the embodiments of the present disclosure acquires an image-text to be detected, where the image-text to be detected includes an image to be detected and a text to be detected, extracts an image feature vector of the image to be detected to obtain a first image feature vector, and extracts a text feature vector of the text to be detected to obtain a first text feature vector. Based on an image-text conversion matrix, the first image feature vector is converted into a text feature vector to obtain a second text feature vector, and the image to be detected and the text to be detected are detected based on the first text feature vector and the second text feature vector. By converting the first image feature vector into the second text feature vector in the text modality through the image-text conversion matrix, the method detects the image-text to be detected according to the first text feature vector and the second text feature vector of the same modality, thereby improving the accuracy of detection.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Fig. 3 is a flowchart illustrating an image-text detection method applied to a server, according to an exemplary embodiment. As shown in fig. 3, the method includes the following steps:
in step S31, the server receives an upload request from the first terminal, and obtains an image to be detected from the upload request, where the image to be detected includes an image to be detected and a text to be detected.
An application program is installed on the first terminal, through which the first terminal can share image-texts. When the first terminal shares an image-text, it sends an upload request carrying the image-text to the server. The server receives the upload request, obtains the image-text from it, and takes the image-text as the image-text to be detected, referring to the image in it as the image to be detected and the text in it as the text to be detected.
The upload request may also carry a terminal identifier of the first terminal, where the terminal identifier may be a user account registered by the first terminal with the server. The text to be detected may be located on the image to be detected, or may be a text independent of the image to be detected. For example, the image to be detected in the image-text uploaded by the user shows the weather, and the text to be detected is "The weather is really good today".
In step S32, the server extracts an image feature vector of the image to be detected to obtain a first image feature vector, and extracts a text feature vector of the text to be detected to obtain a first text feature vector.
In this step, the server can extract the image feature vector of the image to be detected through an image feature extractor to obtain the first image feature vector, and extract the text feature vector of the text to be detected through a text feature extractor to obtain the first text feature vector.
Correspondingly, the step in which the server extracts the image feature vector of the image to be detected through the image feature extractor to obtain the first image feature vector may be: the server extracts the abstract semantic features of the image to be detected through the image feature extractor to obtain an image feature vector, which is used as the first image feature vector. The abstract semantic features may be at least one of the scene semantic features, behavior semantic features, and emotion semantic features contained in the image. The image feature extractor may be a pre-trained deep learning network, for example a network pre-trained on ImageNet or a VGG (Visual Geometry Group) network.
For example, when the image feature extractor is VGG, this step may be: the server extracts the image feature vector of the image to be detected through the VGG to obtain the first image feature vector. In the embodiments of the present disclosure, the image feature extractor is not particularly limited.
The step in which the server extracts the text feature vector of the text to be detected through the text feature extractor to obtain the first text feature vector may be: the server maps each word in the text to be detected to a vector of a fixed size through the text feature extractor to obtain at least one vector, determines the average vector of the at least one vector, and takes the average vector as the first text feature vector. For example, the text feature extractor may be a Word2vec neural network. In the embodiments of the present disclosure, the text feature extractor is not particularly limited.
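As an illustration only, the two extractors can be sketched as follows; this is not the patent's implementation. It assumes torchvision's VGG-16 for the image side and a gensim-style Word2vec keyed-vector lookup for the text side; the function names, the 300-dimension default, and the whitespace tokenizer are all assumptions.

```python
import numpy as np
import torch
from torchvision import models

def image_feature_vector(image_tensor, vgg=None):
    # Abstract semantic features from the penultimate fully connected
    # layer of VGG-16 (4096-dim): the "first image feature vector".
    # (Newer torchvision prefers models.vgg16(weights=...).)
    vgg = vgg or models.vgg16(pretrained=True).eval()
    with torch.no_grad():
        x = vgg.features(image_tensor.unsqueeze(0))  # image_tensor: (3, 224, 224)
        x = vgg.avgpool(x).flatten(1)
        for layer in list(vgg.classifier)[:-1]:      # stop before the class scores
            x = layer(x)
    return x.squeeze(0).numpy()

def text_feature_vector(text, word2vec, dim=300):
    # Map each word to a fixed-size vector and average them:
    # the "first text feature vector".
    words = [w for w in text.split() if w in word2vec]
    if not words:
        return np.zeros(dim)
    return np.mean([word2vec[w] for w in words], axis=0)
```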
In step S33, the server converts the first image feature vector into a text feature vector based on the image-text conversion matrix, and obtains the second text feature vector.
In this step, the server performs a cross multiplication on the first image feature vector and the image-text conversion matrix to obtain an operation result, and uses the operation result as the second text feature vector, thereby obtaining the representation of the first image feature vector in the text modality.
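Given the dimensions quoted later (image features of 1024 or 2048 dimensions, text features of 300 dimensions), the cross multiplication here amounts to an ordinary matrix-vector product. A minimal sketch, with shapes assumed rather than specified by the patent:

```python
import numpy as np

def to_text_modality(conversion_matrix, image_vec):
    # conversion_matrix: image-text conversion matrix W, e.g. shape (300, 2048)
    # image_vec:         first image feature vector, e.g. shape (2048,)
    # returns:           second text feature vector, shape (300,)
    return conversion_matrix @ image_vec
```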
In or before this step, the server obtains the image-text conversion matrix. The image-text conversion matrix may be obtained by the server through its own training, or the server may obtain it from other equipment that trained it. In the embodiments of the present disclosure, this is not particularly limited.
When the server trains itself to obtain the image-text conversion matrix, the method can be realized by the following steps (1) to (3), including:
(1) the server obtains a second image feature vector and a third text feature vector corresponding to each sample image-text in at least one sample image-text.
This step is similar to step S32 described above and will not be described herein again.
(2) And the server converts the third text characteristic vector corresponding to each sample image-text into an image characteristic vector based on the first variable corresponding to the transpose matrix of the image-text conversion matrix, so as to obtain an image vector function of each sample image-text.
In the embodiment of the present disclosure, the image-text conversion matrix may be a matrix of m rows and n columns, for example represented as

$$W = \begin{bmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & \ddots & \vdots \\ w_{m1} & \cdots & w_{mn} \end{bmatrix}$$

where the entries $w_{ij}$ are the second variable of the image-text conversion matrix. The transpose matrix is obtained by interchanging the rows and columns of the image-text conversion matrix, so the transpose of the image-text conversion matrix is

$$W^{T} = \begin{bmatrix} w_{11} & \cdots & w_{m1} \\ \vdots & \ddots & \vdots \\ w_{1n} & \cdots & w_{mn} \end{bmatrix}$$

whose entries constitute the first variable; the second variable is a transposed variable of the first variable.
The server performs a cross multiplication on the third text feature vector corresponding to each sample image-text and the transpose matrix to obtain an image feature vector; because the first variable in the transpose matrix is an unknown variable, the obtained image feature vector is an image vector function.
(3) And for each sample image-text, the server determines a first variable when the image vector function of the sample image-text is matched with the second image feature vector of the sample image-text, and transposes the first variable to obtain an image-text conversion matrix.
For each sample image-text, in the embodiment of the present disclosure, the server may encode the second image feature vector of the sample image-text through the image-text conversion matrix based on the semantic self-encoder, and decode the encoded feature vector through the transpose matrix of the image-text conversion matrix to obtain a decoded feature vector.
The semantic self-encoder is a kind of neural network consisting mainly of three parts: an encoder, a hidden layer, and a decoder. Referring to fig. 4, fig. 4 is a schematic diagram of encoding and decoding performed by the semantic self-encoder. In fig. 4, the encoder encodes the original data X into a new representation S, i.e. the hidden layer, via a matrix W; the decoder decodes the hidden layer S into X' via the transpose matrix $W^{T}$, obtaining the output data X'. The original data and the output data can be image feature vectors, and the hidden layer can be a text feature vector.
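A toy numerical illustration of the round trip in fig. 4; the 1024 and 300 dimensions are taken from the figures quoted below, and the random data is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(300, 1024))  # encoder: image space -> text space
X = rng.normal(size=(1024,))      # original data: an image feature vector

S = W @ X           # encode into the hidden layer (text modality)
X_prime = W.T @ S   # decode via the transpose, approximating X
```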
When the semantic self-encoder encodes and decodes, the data obtained after decoding should restore the original data as closely as possible. Based on this, the server can obtain the image-text conversion matrix through the following two implementations.
In a first implementation manner, the server determines the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function, determines the first variable when the function value of the first function is minimum, and transposes the first variable to obtain the image-text conversion matrix.
In the embodiment of the disclosure, the semantic self-encoder has only one hidden layer, and the dimension of the hidden layer S is smaller than that of the original data X; the dimension of the image feature is generally 1024 or 2048, the dimension of the text feature is generally 300, and so the text feature dimension is smaller than the image feature dimension. Therefore, in the embodiment of the present disclosure, the server may use the image feature as the original data X in the semantic self-encoder and as the output data X', use the text feature as the hidden layer S, use the image-text conversion matrix as the matrix W, and use the transpose of the image-text conversion matrix as $W^{T}$.
In summary, the second image feature vector can be represented as X, and the image vector function is the product of the transpose matrix and the third text feature vector: with the transpose matrix denoted $W^{T}$ and the third text feature vector denoted $S$, the image vector function can be expressed as $W^{T}S$. The server obtains the first function from the difference between the second image feature vector and the image vector function; the first function can be represented as

$$\min_{W} \left\| X - W^{T}S \right\|_{F}^{2}$$

where $\left\| X - W^{T}S \right\|_{F}^{2}$ represents the square of the norm of $X - W^{T}S$, $F$ denotes the (Frobenius) norm, and $\min_{W}$ indicates that the square of the norm of $X - W^{T}S$ is to be made smallest.
The server can directly solve the first function, obtain a first variable corresponding to the transpose matrix when the function value of the first function is minimum, and exchange rows and columns of the first variable to obtain a second variable of the image-text conversion matrix, so that the image-text conversion matrix is obtained.
In a second implementation manner, the server determines the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function; converts the second image feature vector of the sample image-text into a text feature vector based on the second variable corresponding to the image-text conversion matrix to obtain a text vector function of the sample image-text; determines the difference between the text vector function of the sample image-text and the third text feature vector of the sample image-text to obtain a second function; determines the sum of the first function and the second function to obtain a third function; and determines the first variable when the function value of the third function is minimum, and transposes the first variable to obtain the image-text conversion matrix.
In this implementation, before the server solves the first function, the server may perform a relaxation operation on the first function to obtain a third function, solve the third function, determine a first variable when a function value of the third function is minimum, transpose the first variable to obtain a second variable of the image-text conversion matrix, and thereby obtain the image-text conversion matrix.
The relaxation operation performed on the first function may proceed as follows: the server performs a cross multiplication on the second image feature vector of the sample image-text and the image-text conversion matrix based on the second variable corresponding to the image-text conversion matrix to obtain a text feature vector; because the first variable is an unknown variable, the obtained text feature vector is a text vector function. The server determines the difference between the text vector function and the third text feature vector to obtain the second function, and sums the first function and the second function to obtain the third function.
For example, if the third text feature vector is denoted as $S$ and the text vector function is denoted as $WX$, the second function can be expressed as $\left\| WX - S \right\|_{F}^{2}$. The server sums the second function and the first function to obtain the third function, which may be represented as

$$\min_{W} \left\| X - W^{T}S \right\|_{F}^{2} + \left\| WX - S \right\|_{F}^{2}$$

The server can solve the third function through any solving algorithm, obtain the first variable corresponding to the transpose matrix when the function value of the third function is minimum, and exchange the rows and columns of the first variable to obtain the second variable, thereby obtaining the image-text conversion matrix. When solving the third function, the server may use a Lagrangian method; in the embodiment of the present disclosure, the solving algorithm is not specifically limited. The relaxation operation may be Lagrangian relaxation; in the embodiment of the present disclosure, the relaxation operation is not particularly limited.
In a possible implementation manner, the server may further determine a relaxation factor corresponding to the second function, determine the product of the relaxation factor and the second function to obtain a fourth function, and sum the fourth function and the first function to obtain a fifth function. The server then determines the first variable when the function value of the fifth function is minimum, and transposes the first variable to obtain the second variable corresponding to the image-text conversion matrix, thereby obtaining the image-text conversion matrix. For example, the fourth function may be expressed as $\lambda \left\| WX - S \right\|_{F}^{2}$, and the fifth function can be expressed as

$$\min_{W} \left\| X - W^{T}S \right\|_{F}^{2} + \lambda \left\| WX - S \right\|_{F}^{2}$$

where $\lambda$ is the relaxation factor.
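The patent leaves the solver open. One concrete choice, used in the semantic-autoencoder literature, is a closed-form solution: stacking the sample feature vectors as columns of matrices X and S and setting the gradient of the fifth function to zero yields the Sylvester equation $SS^{T}W + W(\lambda XX^{T}) = (1+\lambda)SX^{T}$. A sketch of that route, assuming SciPy; the function name and the value of the relaxation factor are illustrative only.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def learn_conversion_matrix(X, S, lam=0.2):
    """Minimize ||X - W.T @ S||_F**2 + lam * ||W @ X - S||_F**2 over W.

    X   : second image feature vectors of the sample image-texts, shape (d, N)
    S   : third text feature vectors of the sample image-texts, shape (k, N)
    lam : relaxation factor (lambda above); 0.2 is an arbitrary example
    """
    A = S @ S.T                  # (k, k)
    B = lam * (X @ X.T)          # (d, d)
    C = (1.0 + lam) * (S @ X.T)  # (k, d)
    return solve_sylvester(A, B, C)  # W, the image-text conversion matrix
```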
In the embodiment of the disclosure, the server obtains a plurality of sample image-texts and continuously performs iterative optimization through the semantic self-encoder, finally obtaining the image-text conversion matrix.
It should be noted that the original semantic self-encoder is unsupervised: when the second image feature vector is encoded directly by the semantic self-encoder, the result may be a text feature vector or a feature vector in some other modality, so the conversion is uncertain. In the embodiment of the disclosure, constraining the product of the image-text conversion matrix and the second image feature vector to be a text feature vector constrains the encoding process of the semantic self-encoder, changing it from an unsupervised semantic self-encoder into a supervised one, so that the hidden layer S of the semantic self-encoder is represented in the corresponding modality space. In addition, in the embodiment of the present disclosure, the hidden layer S is not only another representation of the second image feature vector in the text modality space, but also has a clear semantic meaning, that is, it carries the common features of the second image feature vector and the third text feature vector.
In step S34, the server determines a degree of matching between the first text feature vector and the second text feature vector.
In this step, the server may determine a cosine distance between the first text feature vector and the second text feature vector, and use the cosine distance as a matching degree between the first text feature vector and the second text feature vector.
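A minimal sketch of this matching step; it computes cosine similarity, so that a larger matching degree means a closer match, which is how the threshold test below reads. The function names and the placeholder threshold are assumptions.

```python
import numpy as np

def matching_degree(t1, t2):
    # Cosine of the angle between the first and second text feature vectors
    return float(t1 @ t2 / (np.linalg.norm(t1) * np.linalg.norm(t2)))

def is_consistent(t1, t2, threshold=0.5):
    # The preset threshold is left open by the patent; 0.5 is a placeholder
    return matching_degree(t1, t2) > threshold
```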
The server determines the magnitude relation between the matching degree and a preset threshold, and when the matching degree is greater than the preset threshold, the server executes step S35; when the matching degree is not greater than the preset threshold, the server performs step S36.
In step S35, when the matching degree is greater than the preset threshold, the server determines that the image to be detected and the text to be detected are consistent.
In this step, when the matching degree is greater than a preset threshold value, the server determines that the image to be detected is consistent with the text to be detected. The preset threshold may be set and changed as needed, and is not particularly limited in the embodiment of the present disclosure.
It should be noted that, in this step, the server may store the image to be detected in an image library when the image to be detected and the text to be detected are consistent. In the embodiment of the present disclosure, the text to be detected may be located on the image to be detected, or the text to be detected and the image to be detected may be independent text and image. When the text to be detected is located on the image to be detected and the server detects that the two are consistent, the server may store the image-text to be detected in a first image library, which stores images whose image and text are consistent. When storing the image-text to be detected, the server may extract a first keyword in the text to be detected and use it as the label of the image-text to be detected; when the server subsequently receives a query request, it can query the image corresponding to the label in the first image library according to the label of the image-text to be detected.
When the text to be detected and the image to be detected are independent text and image, the server extracts the first keyword in the text to be detected and stores the image to be detected in a second image library corresponding to the first keyword; when the server subsequently receives a query request, it can query the image from the image library corresponding to the first keyword.
In the disclosed embodiment, the second terminal may query the server for the image. When the second terminal inquires the image from the server, the second terminal sends an inquiry request to the server, wherein the inquiry request carries an inquiry text; and the server receives a query request of the second terminal, acquires an image corresponding to the query text from the image library according to the query text carried in the query request, and sends the image to the second terminal.
When the image to be detected is stored in the first image library, the step in which the server obtains the image corresponding to the query text from the first image library according to the query text carried in the query request may be: the server extracts a second keyword in the query text and queries the first image library for an image whose label is the second keyword. The second keyword and the first keyword may be the same or different. When the second keyword is the same as the first keyword, the second terminal queries the image-text uploaded by the first terminal; when they differ, the second terminal queries image-texts uploaded by other terminals.
When the image to be detected is stored in the second image library, the step in which the server obtains the image corresponding to the query text from the image library according to the query text carried in the query request may be: the server extracts the second keyword in the query text and obtains images in the second image library corresponding to the second keyword. The server may obtain all images in the second image library corresponding to the second keyword, or only some of them.
In step S36, when the matching degree is not greater than the preset threshold, the server determines that the image to be detected and the text to be detected are inconsistent.
And when the matching degree is not greater than the preset threshold value, the server determines that the image to be detected is inconsistent with the text to be detected. The server can store the image to be detected into another image library or not store the image to be detected when the image to be detected is inconsistent with the text to be detected. In the embodiments of the present disclosure, this is not particularly limited.
The image-text detection method provided by the embodiments of the present disclosure acquires an image-text to be detected, where the image-text to be detected includes an image to be detected and a text to be detected, extracts an image feature vector of the image to be detected to obtain a first image feature vector, and extracts a text feature vector of the text to be detected to obtain a first text feature vector. Based on an image-text conversion matrix, the first image feature vector is converted into a text feature vector to obtain a second text feature vector, and the image to be detected and the text to be detected are detected based on the first text feature vector and the second text feature vector. By converting the first image feature vector into the second text feature vector in the text modality through the image-text conversion matrix, the method detects the image-text to be detected according to the first text feature vector and the second text feature vector of the same modality, thereby improving the accuracy of detection.
Fig. 5 is a block diagram illustrating an image-text detection apparatus according to an exemplary embodiment. As shown in fig. 5, the apparatus includes:
an acquiring unit 501 configured to acquire an image-text to be detected, wherein the image-text to be detected comprises an image to be detected and a text to be detected;
the extracting unit 502 is configured to extract an image feature vector of an image to be detected to obtain a first image feature vector, and extract a text feature vector of a text to be detected to obtain a first text feature vector;
a conversion unit 503 configured to convert the first image feature vector into a text feature vector based on the image-text conversion matrix, resulting in a second text feature vector;
a detecting unit 504 configured to detect the image to be detected and the text to be detected based on the first text feature vector and the second text feature vector.
In a possible implementation manner, the conversion unit 503 is further configured to perform a cross multiplication operation on the first image feature vector and the image-text conversion matrix to obtain an operation result, and use the operation result as the second text feature vector.
In another possible implementation manner, the detecting unit 504 is further configured to determine a matching degree between the first text feature vector and the second text feature vector; when the matching degree is larger than a preset threshold value, determining that the image to be detected is consistent with the text to be detected; and when the matching degree is not greater than a preset threshold value, determining that the image to be detected is inconsistent with the text to be detected.
In another possible implementation manner, the apparatus further includes:
the obtaining unit 501 is further configured to obtain a second image feature vector and a third text feature vector corresponding to each sample image-text in at least one sample image-text, where each sample image-text includes a sample image consistent with a sample text;
the converting unit 503 is further configured to convert the third text feature vector corresponding to each sample image-text into an image feature vector based on the first variable corresponding to the transpose matrix of the image-text conversion matrix, so as to obtain an image vector function of each sample image-text;
and the transposition unit is configured to determine a first variable when the image vector function of the sample image-text is matched with the second image feature vector of the sample image-text for each sample image-text, and transpose the first variable to obtain an image-text conversion matrix.
In another possible implementation manner, the transposing unit is further configured to determine the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function, determine the first variable when the function value of the first function is minimum, and transpose the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, the transposing unit is further configured to determine the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function; convert the second image feature vector of the sample image-text into a text feature vector based on the second variable corresponding to the image-text conversion matrix to obtain a text vector function of the sample image-text, where the second variable is a transposed variable of the first variable; determine the difference between the text vector function of the sample image-text and the third text feature vector of the sample image-text to obtain a second function; determine the sum of the first function and the second function to obtain a third function; and determine the first variable when the function value of the third function is minimum, and transpose the first variable to obtain the image-text conversion matrix.
In another possible implementation manner, the obtaining unit 501 is further configured to receive an upload request of the first terminal, where the upload request carries the to-be-detected pictures and texts; and acquiring the image-text to be detected from the uploading request.
In another possible implementation manner, the apparatus further includes:
the storage unit is configured to store the image to be detected into the image library when the image to be detected is consistent with the text to be detected;
the obtaining unit 501 is further configured to, when receiving an inquiry request sent by the second terminal, obtain an image corresponding to the inquiry text from the image library according to the inquiry text carried in the inquiry request.
The image-text detection device provided by the embodiment of the present disclosure obtains an image-text to be detected, where the image-text to be detected includes an image to be detected and a text to be detected; extracts an image feature vector of the image to be detected to obtain a first image feature vector; and extracts a text feature vector of the text to be detected to obtain a first text feature vector. Based on the image-text conversion matrix, the device converts the first image feature vector into a text feature vector to obtain a second text feature vector, and then detects the image to be detected and the text to be detected based on the first text feature vector and the second text feature vector. Because the device converts the first image feature vector into the second text feature vector under the text modality through the image-text conversion matrix, the image-text to be detected is detected according to the first text feature vector and the second text feature vector of the same modality, which improves the accuracy of detection.
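Under the reading adopted here, the conversion step itself reduces to a single linear map. The claims call this a cross-multiplication operation; interpreting it as an ordinary matrix-vector product is an assumption of this sketch.

    import numpy as np

    def convert_image_to_text_vector(conversion_matrix: np.ndarray,
                                     first_image_vec: np.ndarray) -> np.ndarray:
        # A (d_txt x d_img) conversion matrix applied to a (d_img,) first
        # image feature vector yields the (d_txt,) second text feature vector.
        return conversion_matrix @ first_image_vec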
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present disclosure. The server 600 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 601 and one or more memories 602, where the memory 602 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 601 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a non-transitory computer-readable storage medium is further provided, on which instructions are stored; when the instructions are executed by a processor of a server, the image-text detection method provided by the embodiments of the present disclosure is implemented.
In an exemplary embodiment, an application program is further provided; when instructions in the application program are executed by a processor of a server, the server is enabled to execute the image-text detection method provided by the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An image-text detection method, characterized by comprising the following steps:
acquiring an image-text to be detected, wherein the image-text to be detected comprises an image to be detected and a text to be detected;
extracting an image feature vector of the image to be detected to obtain a first image feature vector, and extracting a text feature vector of the text to be detected to obtain a first text feature vector;
converting the first image feature vector into a text feature vector based on an image-text conversion matrix to obtain a second text feature vector;
and detecting the image to be detected and the text to be detected based on the first text feature vector and the second text feature vector.
2. The method according to claim 1, wherein the converting the first image feature vector into a text feature vector based on an image-text conversion matrix to obtain a second text feature vector comprises:
performing a cross-multiplication operation on the first image feature vector and the image-text conversion matrix to obtain an operation result, and taking the operation result as the second text feature vector.
3. The method according to claim 1, wherein the detecting the image to be detected and the text to be detected based on the first text feature vector and the second text feature vector comprises:
determining a matching degree between the first text feature vector and the second text feature vector;
when the matching degree is greater than a preset threshold value, determining that the image to be detected is consistent with the text to be detected;
and when the matching degree is not greater than the preset threshold value, determining that the image to be detected is inconsistent with the text to be detected.
4. The method according to any one of claims 1-3, wherein before the converting the first image feature vector into a text feature vector based on an image-text conversion matrix to obtain a second text feature vector, the method further comprises:
obtaining a second image feature vector and a third text feature vector corresponding to each sample image-text in at least one sample image-text, wherein the sample image included in each sample image-text is consistent with the sample text;
converting the third text feature vector corresponding to each sample image-text into an image feature vector based on a first variable corresponding to a transpose matrix of the image-text conversion matrix to obtain an image vector function of each sample image-text;
and for each sample image-text, determining a first variable when the image vector function of the sample image-text matches the second image feature vector of the sample image-text, and transposing the first variable to obtain the image-text conversion matrix.
5. The method according to claim 4, wherein the determining, for each sample image-text, a first variable when the image vector function of the sample image-text matches the second image feature vector of the sample image-text, and transposing the first variable to obtain the image-text conversion matrix comprises:
determining the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function;
and determining a first variable when the function value of the first function is minimum, and transposing the first variable to obtain the image-text conversion matrix.
6. The method according to claim 4, wherein the determining a first variable when the image vector function of the sample image-text matches the second image feature vector of the sample image-text, and transposing the first variable to obtain the image-text conversion matrix comprises:
determining the difference between the second image feature vector of the sample image-text and the image vector function of the sample image-text to obtain a first function;
converting the second image feature vector of the sample image-text into a text feature vector based on a second variable corresponding to the image-text conversion matrix to obtain a text vector function of the sample image-text, wherein the second variable is a transposed variable of the first variable;
determining a difference between the text vector function of the sample image-text and the third text feature vector of the sample image-text to obtain a second function;
determining the sum of the first function and the second function to obtain a third function;
and determining a first variable when the function value of the third function is minimum, and transposing the first variable to obtain the image-text conversion matrix.
7. The method according to claim 1, wherein the obtaining the image-text to be detected comprises:
receiving an upload request from a first terminal, wherein the upload request carries the image-text to be detected;
and acquiring the image-text to be detected from the upload request.
8. An image-text detection device, characterized in that the device comprises:
the image-text detection device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire images and texts to be detected, and the images and texts to be detected comprise images to be detected and texts to be detected;
the extraction unit is configured to extract an image feature vector of the image to be detected to obtain a first image feature vector, and extract a text feature vector of the text to be detected to obtain a first text feature vector;
the conversion unit is configured to convert the first image feature vector into a text feature vector based on a graph-text conversion matrix to obtain a second text feature vector;
and the detection unit is configured to detect the image to be detected and the text to be detected based on the first text feature vector and the second text feature vector.
9. A server, characterized in that the server comprises:
one or more processors;
a volatile or non-volatile memory for storing instructions executable by the one or more processors;
wherein the one or more processors are configured to perform the image-text detection method according to any one of claims 1-7.
10. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a processor of a server, implement the image-text detection method according to any one of claims 1-7.