CN117763175A - Heterogeneous knowledge-fused multi-strategy image retrieval method and system - Google Patents
- Publication number: CN117763175A
- Application number: CN202311515168.1A
- Authority
- CN
- China
- Prior art keywords
- image
- similarity
- query text
- search result
- retrieval
- Prior art date
- Legal status: Pending (the status listed is an assumption and is not a legal conclusion)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a multi-strategy image retrieval method and system fusing heterogeneous knowledge, belonging to the technical field of image retrieval. The method comprises the following steps: collecting images, processing each image with several different strategies, and storing the processing results in a database; receiving a query text input by a user, extracting semantic features from the query text or preprocessing it, and matching corresponding search result sets based on the semantic feature extraction results or the preprocessing results; and acquiring the intersection between all the search result sets as well as the intersection between any two of them, and forming a total result set based on these intersections. The method and system can deeply mine the latent semantic information in images, improving retrieval recall and understanding the user's query intention more accurately; in addition, the retrieval result set is reordered so that the most relevant information is displayed first while retrieval relevance is preserved.
Description
Technical Field
The invention relates to the technical field of image retrieval, in particular to a multi-strategy image retrieval method and system integrating heterogeneous knowledge.
Background
As the number of network-disk (cloud storage) users grows, so does the volume of data stored in network disks. This data is diverse in type, wide in source and large in scale, which makes it difficult for users to retrieve the data they want across modalities.
At present, data retrieval is mainly performed either with a cloud-disk search method or with an image retrieval method. The cloud-disk search method relies only on the search capability of Elasticsearch: retrieval is efficient, but the image and text semantic information present in the existing data is not exploited. The image retrieval method, by adopting a multi-modal association map, overcomes the limitation of processing only text and supports cross-modal retrieval; however, it cannot determine the text content and the specific person information present in an image, and because it focuses on documents in the aviation field, it correlates poorly with network-disk user data.
Disclosure of Invention
The invention aims to provide a multi-strategy image retrieval method and system integrating heterogeneous knowledge, which are used for solving the defects in the prior art.
The multi-strategy image retrieval method for fusing heterogeneous knowledge provided by the invention comprises the following steps:
collecting a plurality of images uploaded by a user, respectively carrying out corresponding processing on each image by adopting different strategies, and storing the processing results in a database;
inputting a query text by a user, extracting or preprocessing semantic features of the query text, and matching a corresponding search result set based on semantic feature extraction results or preprocessing results;
and acquiring intersections between the search result sets and intersections between any two of the search result sets, and forming a total result set corresponding to the query text based on the intersections between the search result sets and the intersections between any two of the search result sets.
In the above scheme, the respective processing of each image by adopting different strategies includes:
and respectively carrying out face recognition clustering, optical character recognition and semantic feature vector extraction on each image by adopting different strategies.
In the above scheme, the respective processing of each image by adopting different strategies further includes:
forming a plurality of clusters based on face recognition clusters, assigning the same identifier to the images in the same cluster, and associating the identifier with the corresponding image by adopting an index;
and correlating the optical character recognition result with the corresponding image.
In the above scheme, preprocessing the query text includes:
and performing word segmentation processing on the query text to obtain a plurality of characters or a plurality of sub-character strings.
In the above-mentioned scheme, matching the corresponding search result set based on the semantic feature extraction result includes:
calculating cosine similarity between the extracted semantic features and semantic feature vectors of the various images stored in the database;
comparing cosine similarity between the extracted semantic features and semantic feature vectors of the respective images stored in the database with a similarity threshold;
judging the image corresponding to the semantic feature vector when the cosine similarity is larger than a similarity threshold as a related image;
and arranging all the related images in descending order of similarity, the arranged related images serving as a first retrieval result set.
In the above-mentioned scheme, matching the corresponding search result set based on the preprocessing result includes:
acquiring position ordinals of a plurality of characters corresponding to the query text in the query text, and judging whether the difference between the position ordinals of any two characters is smaller than or equal to a corresponding numerical value;
when the difference of the position ordinals of any two characters is smaller than or equal to the corresponding numerical value, obtaining the number of matching characters between the query text and an identifier in the database, and calculating the Jaro distance between the query text and the identifier;
acquiring the Jaro-Winkler similarity between the query text and the identifiers in the database based on the calculated Jaro distance;
comparing the obtained Jaro-Winkler similarity with a similarity threshold, and judging an image corresponding to the identifier when the Jaro-Winkler similarity is larger than the similarity threshold as a related image;
and arranging all the related images in descending order of similarity, the arranged related images serving as a second retrieval result set.
In the above scheme, the corresponding numerical value is obtained through the following calculation formula:
corresponding numerical value = ⌊max(|str1|, |str2|) / 2⌋ - 1
where |str1| is the length of the query text and |str2| is the length of the identifier.
In the above-mentioned scheme, matching the corresponding search result set based on the preprocessing result further includes:
taking the ratio of the length of each sub-string of the query text to the total length of the query text as that sub-string's weight, matching each sub-string against the optical character recognition results to obtain its matching rate, and multiplying each sub-string's weight by its matching rate to obtain the query similarity;
comparing the obtained query similarity with a similarity threshold, and judging an image corresponding to an optical character recognition result when the query similarity is larger than the similarity threshold as a related image;
and arranging all the related images in descending order of similarity, the arranged related images serving as a third retrieval result set.
In the above-mentioned scheme, forming the total result set corresponding to the query text based on the intersection between the search result sets and the intersection between any two of the search result sets includes:
the intersection between the search result sets is recorded as a first total result set;
calculating cosine similarity between semantic features corresponding to the query text and semantic feature vectors of the image in the intersection between any two of the search result sets, and arranging all images in the intersection between any two of the search result sets from large to small according to the cosine similarity to form a second total result set;
and combining the first total result set and the second total result set to form a total result set corresponding to the query text.
The multi-strategy image retrieval system fusing heterogeneous knowledge provided by the invention performs image retrieval with the above multi-strategy image retrieval method fusing heterogeneous knowledge, and the system comprises:
the multi-strategy processing module is used for collecting a plurality of images uploaded by a user, respectively and correspondingly processing each image by adopting different strategies, and storing the processing results in the database;
the query result matching module is used for inputting a query text by a user, extracting or preprocessing semantic features of the query text, and matching a corresponding search result set based on the semantic feature extraction result or the preprocessing result;
and the total result set acquisition module is used for acquiring the intersection between the search result sets and the intersection between any two of the search result sets, and forming a total result set corresponding to the query text based on the intersection between the search result sets and the intersection between any two of the search result sets.
The embodiment of the invention has the following advantages:
the multi-strategy image retrieval method and system integrating heterogeneous knowledge can integrate face recognition clustering, optical character recognition and semantic feature vector extraction processes of images, fully mine semantic information in query texts input by users and calculate similarity of the semantic information, aim at deep mining potential semantic information in the images, so that retrieval recall rate is remarkably improved, user query intention is more accurately understood, in addition, retrieval result sets are reordered, and the most relevant information is preferentially displayed on the basis of meeting retrieval relevance.
Drawings
FIG. 1 is a step diagram of a multi-strategy image retrieval method incorporating heterogeneous knowledge in an embodiment of the invention;
FIG. 2 is a flow chart of the multi-strategy processing in one embodiment of the invention;
FIG. 3 is a flow chart of query result matching in one embodiment of the invention;
FIG. 4 is a schematic diagram of the composition of a multi-strategy image retrieval system incorporating heterogeneous knowledge in an embodiment of the invention.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
As shown in fig. 1, the present invention provides a multi-strategy image retrieval method fusing heterogeneous knowledge, which comprises the following steps:
step S1: collecting a plurality of images uploaded by a user, respectively carrying out face recognition clustering, optical character recognition and semantic feature vector extraction on each image by adopting different deep learning models, matching unique identifiers for corresponding images based on face recognition clustering results, associating the identifiers with the corresponding images by adopting indexes, storing each identifier in an elastic search database, simultaneously associating the optical character recognition results with the corresponding images and storing the optical character recognition results in the elastic search database, and storing semantic feature vectors of the images obtained by extracting the semantic feature vectors of the images in the elastic search database, wherein the specific reference can be seen in fig. 2.
Specifically, if an image uploaded by the user contains persons but no text, face recognition clustering, optical character recognition and semantic feature vector extraction are all applied, and a face recognition clustering result and the semantic feature vector of the image are obtained; if the image contains text but no persons, the same three processes are applied and only an optical character recognition result and the semantic feature vector of the image are obtained; if the image contains both persons and text, all three outputs are obtained: a face recognition clustering result, an optical character recognition result and the semantic feature vector of the image.
Specifically, the text in an image includes, for example, the text on an identity card or the location information on a roadside guideboard.
In one embodiment of the invention, face recognition clustering is performed on each image with the InsightFace deep learning model to form a plurality of clusters, and images in the same cluster are assigned the same identifier, whose content the user can set as required (for example, a person's name). When clusters need to be merged, a new index table maps the indexes of those clusters to the same index, so that an image is mapped to that index no matter which of the merged clusters it is assigned to.
Specifically, the index is a person ID: face recognition clustering with the InsightFace model forms a plurality of clusters, images in the same cluster are given the same person ID, and when a new index is created, a 32-character string is randomly generated based on the current time.
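As an illustrative sketch of the person-ID generation described above (the patent states only that a 32-character string is randomly generated based on the current time; hashing a nanosecond timestamp plus a random salt is one assumed realization):

```python
import hashlib
import random
import time


def new_person_id() -> str:
    """Generate a 32-character person ID seeded by the current time.

    The hashing scheme is a hypothetical choice: the MD5 hex digest of a
    time-plus-salt seed is exactly 32 characters long, matching the
    identifier length described in the patent.
    """
    seed = f"{time.time_ns()}-{random.random()}".encode()
    return hashlib.md5(seed).hexdigest()  # hex digest has exactly 32 chars
```

Any scheme producing unique 32-character strings would serve equally well here; the salt merely guards against two IDs being created in the same clock tick.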
In one embodiment of the invention, the PaddleOCR deep learning model is adopted to recognize and extract the text in each image; the extracted text content is associated with the corresponding image and both are stored in the Elasticsearch database for subsequent retrieval, ensuring retrieval accuracy.
In one embodiment of the present invention, extraction of the semantic feature vectors of images is implemented with the Chinese-CLIP deep learning model: when a user uploads a new image (scaled to 224×224), the model extracts a 768-dimensional or 512-dimensional semantic feature vector and stores it in the Elasticsearch database.
Specifically, a trained Chinese-CLIP model is used to extract the semantic feature vectors of images. During Chinese-CLIP training, if an image and a text are semantically matched, the distance between their feature vectors in the shared vector space is reduced; if they are not matched, the distance is increased, thereby aligning the semantics of images and texts.
Step S2: the user inputs a query text; the RoBERTa text encoder of CLIP extracts the semantic features of the query text, the cosine similarity between the extracted features and the semantic feature vectors of all images stored in the Elasticsearch database is calculated, and corresponding images are matched based on the cosine similarity. The query text is also preprocessed; the preprocessing result is matched against the identifiers stored in the Elasticsearch database and corresponding images are acquired from that matching result, and the preprocessing result is likewise matched against the optical character recognition results stored in the Elasticsearch database, with corresponding images acquired from that matching result; see fig. 3.
Specifically, when matching the preprocessing result against the identifiers stored in the Elasticsearch database, or against the stored optical character recognition results, yields no corresponding image, only the results obtained from cosine-similarity matching are returned.
Specifically, when the RoBERTa text encoder of CLIP extracts the semantic features of the query text input by the user, it outputs a 768-dimensional or 512-dimensional semantic feature vector, according to the image encoder selected.
In one embodiment of the invention, images are matched based on cosine similarity as follows: the cosine similarity between the extracted semantic features and the semantic feature vector of each image stored in the Elasticsearch database is compared with a similarity threshold; every image whose cosine similarity exceeds the threshold is judged to be a related image; and all related images are arranged in descending order of similarity, the sorted list serving as the first retrieval result set.
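The matching just described can be sketched in plain Python as follows; the threshold value 0.25 is a placeholder assumption, since the patent does not fix the similarity threshold:

```python
import math


def first_retrieval_set(query_vec, image_vecs, threshold=0.25):
    """Return (image_index, cosine similarity) pairs for every stored image
    whose similarity to the query vector exceeds the threshold,
    sorted in descending order of similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    related = [(i, cosine(query_vec, v)) for i, v in enumerate(image_vecs)]
    related = [(i, s) for i, s in related if s > threshold]
    related.sort(key=lambda pair: -pair[1])  # descending similarity
    return related
```

In production the same ranking would typically be pushed into the vector store (for example an Elasticsearch script-score query over a dense-vector field) rather than computed client-side.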
Specifically, preprocessing the query text includes word segmentation processing of the query text to obtain a plurality of characters or a plurality of substrings.
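A naive sketch of the word-segmentation preprocessing; a real system would use a proper Chinese word segmenter, so the whitespace splitting below is purely illustrative:

```python
def preprocess_query(text: str):
    """Segment a query into its individual characters and sub-strings.

    Splitting on whitespace stands in for real word segmentation
    (e.g. a dictionary- or model-based Chinese segmenter).
    """
    substrings = text.split()  # coarse "words"
    characters = [c for c in text if not c.isspace()]
    return characters, substrings
```

The characters feed the identifier (person-ID) matching path, and the sub-strings feed the OCR matching path described below.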
Specifically, matching the preprocessing result with the identifiers stored in the Elasticsearch database, and acquiring the corresponding images based on the matching result, includes:
acquiring the position ordinals of the characters of the query text, calculating the difference between the position ordinals of any two characters, and judging whether this difference is smaller than or equal to a corresponding numerical value; this check helps determine whether the query text is similar to, or matches, an identifier in the Elasticsearch database. The corresponding numerical value is obtained through the following calculation formula:
corresponding numerical value = ⌊max(|str1|, |str2|) / 2⌋ - 1
wherein |str1| is the length of the query text and |str2| is the length of the identifier;
when the difference between the position ordinals of any two characters is smaller than or equal to the corresponding numerical value, the number of matching characters between the query text and an identifier in the Elasticsearch database is obtained, and the Jaro distance between the query text and the identifier is calculated based on that number through the following calculation formula:
Jaro(str1, str2) = (1/3) × (m/|str1| + m/|str2| + (m - t)/m)
where |str1| is the length of the query text, |str2| is the length of the identifier, m is the number of matching characters between the query text and the identifier in the Elasticsearch database, and t is half the number of characters that need to be adjusted for the query text and the identifier to agree. For example, if the query text is "abcde" and the identifier in the Elasticsearch database is "abcd", one character ("e" adjusted to "d") needs adjusting, so t = 0.5 in that case;
and the Jaro-Winkler similarity between the query text and the identifier in the Elasticsearch database is acquired from the calculated Jaro distance through the following calculation formula:
JaroWinkler(str1, str2) = Jaro(str1, str2) + l × p × (1 - Jaro(str1, str2))
wherein JaroWinkler(str1, str2) is the Jaro-Winkler similarity between the query text and the identifier in the Elasticsearch database, Jaro(str1, str2) is the Jaro distance between them, l is the length of their common prefix (capped at a maximum), and p is a constant that determines the weight of l. In one embodiment of the invention, l is capped at 4 and p is 0.1. The common prefix of the query text and the identifier is the identical initial character sequence the two strings share, a string segment consisting of characters at the beginning of both strings; its length may be 0. For example, if the query text is "app" and the identifier in the Elasticsearch database is "app", the common prefix is "app";
comparing the obtained Jaro-Winkler similarity with a similarity threshold, judging the images corresponding to identifiers whose Jaro-Winkler similarity exceeds the threshold as related images, and arranging all related images in descending order of similarity to form the second retrieval result set.
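The Jaro and Jaro-Winkler computations above can be sketched with their standard definitions (matching window ⌊max(|str1|, |str2|)/2⌋ - 1, prefix length l capped at 4, p = 0.1):

```python
def jaro(str1: str, str2: str) -> float:
    """Jaro similarity between two strings (standard definition)."""
    if str1 == str2:
        return 1.0
    if not str1 or not str2:
        return 0.0
    # characters count as matching only within this window of positions
    window = max(max(len(str1), len(str2)) // 2 - 1, 0)
    m1 = [False] * len(str1)
    m2 = [False] * len(str2)
    m = 0
    for i, c in enumerate(str1):
        lo, hi = max(0, i - window), min(len(str2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and str2[j] == c:
                m1[i] = m2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that are out of order
    t, k = 0, 0
    for i in range(len(str1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if str1[i] != str2[k]:
                t += 1
            k += 1
    t /= 2
    return (m / len(str1) + m / len(str2) + (m - t) / m) / 3


def jaro_winkler(str1: str, str2: str, p: float = 0.1, max_l: int = 4) -> float:
    """Jaro-Winkler similarity: Jaro boosted by the common-prefix length l."""
    j = jaro(str1, str2)
    l = 0
    for a, b in zip(str1, str2):
        if a != b or l == max_l:
            break
        l += 1
    return j + l * p * (1 - j)
```

For the patent's own example, "abcde" versus "abcd" gives m = 4, t = 0, Jaro = 14/15, and with a common prefix of 4 the Jaro-Winkler similarity is 0.96.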
Specifically, matching the preprocessing result with the optical character recognition results stored in the Elasticsearch database, and acquiring the corresponding images based on the matching result, includes:
taking the ratio of each sub-string's length to the total length of the query text as that sub-string's weight, so as to account for the importance of the different parts of the query text; matching each sub-string of the query text against the optical character recognition results to obtain its matching rate; and multiplying each sub-string's weight by its matching rate to obtain the query similarity;
comparing the obtained query similarity with a similarity threshold, judging the images corresponding to optical character recognition results whose query similarity exceeds the threshold as related images, and arranging all related images in descending order of similarity to form the third retrieval result set.
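A minimal sketch of the weighted sub-string matching above; the patent does not define the matching rate precisely, so a 0/1 containment test against the OCR text is assumed here:

```python
def query_similarity(substrings, ocr_text):
    """Weighted sub-string match of a segmented query against OCR text.

    Each sub-string's weight is its share of the total query length;
    the matching rate is a hypothetical 0/1 containment test.
    """
    total = sum(len(s) for s in substrings)
    if total == 0:
        return 0.0
    score = 0.0
    for s in substrings:
        weight = len(s) / total                      # length share = weight
        match_rate = 1.0 if s in ocr_text else 0.0   # assumed matching rate
        score += weight * match_rate
    return score
```

A fuzzier matching rate (e.g. edit-distance based) could replace the containment test without changing the weighting scheme.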
Step S3: the intersection of the first, second and third search result sets is obtained and recorded as the first total result set; for the images in the pairwise intersection between any two of the three result sets, the cosine similarity between the semantic features of the query text and each image's semantic feature vector is calculated, and those images are arranged by cosine similarity in descending order to form the second total result set; the first total result set and the second total result set are then combined to form the total result set corresponding to the query text.
Specifically, when the first total result set is an empty set, the second total result set is taken as the total result set corresponding to the query text.
Specifically, when the second and third search result sets are empty sets (that is, when matching the preprocessing result against the identifiers stored in the Elasticsearch database or against the stored optical character recognition results yields no corresponding image), the top-ranked image of the first search result set is taken as the search result corresponding to the query text.
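Step S3 together with its two fallback rules can be sketched as follows; `pairwise_rank` stands in for the cosine-similarity re-ranking of the pairwise intersections and is an assumed helper:

```python
def total_result_set(r1, r2, r3, pairwise_rank):
    """Combine the three retrieval result sets as described in Step S3.

    r1, r2, r3 are lists of image ids sorted by similarity (descending);
    pairwise_rank re-ranks a collection of ids by cosine similarity
    to the query (an assumed helper, not defined in the patent).
    """
    if not r2 and not r3:
        # both text-based strategies matched nothing: return the
        # top-ranked image of the first (cosine-similarity) result set
        return r1[:1]
    s1, s2, s3 = set(r1), set(r2), set(r3)
    first_total = [i for i in r1 if i in s2 and i in s3]  # triple intersection
    pairwise = (s1 & s2) | (s1 & s3) | (s2 & s3)
    second_total = pairwise_rank(pairwise)                # re-ranked pairwise hits
    if not first_total:
        return second_total
    seen = set(first_total)
    return first_total + [i for i in second_total if i not in seen]
```

Ordering the triple intersection by r1's ranking and placing it ahead of the pairwise hits is one reading of "combining" the two total sets; the patent does not spell out the concatenation order.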
As shown in fig. 4, the present invention provides a multi-strategy image retrieval system fusing heterogeneous knowledge, which performs image retrieval with the multi-strategy image retrieval method fusing heterogeneous knowledge described above; the system comprises:
the multi-strategy processing module is used for collecting a plurality of images uploaded by a user, applying face recognition clustering, optical character recognition and semantic feature vector extraction to each image with different deep learning models, matching unique identifiers to the corresponding images based on the face recognition clustering results, associating the identifiers with the corresponding images by an index, and storing the identifiers, the optical character recognition results (associated with their corresponding images) and the extracted semantic feature vectors in the Elasticsearch database;
the query result matching module is used for receiving the query text input by the user, extracting its semantic features with the RoBERTa text encoder of CLIP, calculating the cosine similarity between the extracted features and the semantic feature vectors of all images stored in the Elasticsearch database, and matching corresponding images based on that similarity; it also preprocesses the query text, matches the preprocessing result against the identifiers and the optical character recognition results stored in the Elasticsearch database, and acquires the corresponding images from each matching result;
the total result set acquisition module is used for obtaining the intersection of the first, second and third search result sets and recording it as the first total result set; calculating the cosine similarity between the semantic features of the query text and the semantic feature vector of each image in the pairwise intersection between any two result sets, and arranging those images by cosine similarity in descending order to form the second total result set; and combining the first total result set and the second total result set to form the total result set corresponding to the query text.
It should be noted that the foregoing detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is intended to include the plural unless the context clearly indicates otherwise. Furthermore, it will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or groups thereof.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Spatially relative terms, such as "above", "over", "upper surface" and the like, may be used herein for ease of description to describe one device's or feature's spatial relationship to another device or feature as illustrated in the figures. It will be understood that such terms are intended to encompass different orientations in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as "above" or "over" other devices or structures would then be oriented "below" or "beneath" them; thus, the exemplary term "above" may include both the "above" and "below" orientations. The device may also be positioned in other ways (rotated 90 degrees or at other orientations), and the spatially relative descriptors used herein are interpreted accordingly.
In the above detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, like numerals typically identify like components unless context indicates otherwise. The illustrated embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A multi-strategy image retrieval method fusing heterogeneous knowledge, the method comprising:
collecting a plurality of images uploaded by a user, processing each image correspondingly with a plurality of different strategies, and storing the processing results in a database;
receiving a query text input by a user, performing semantic feature extraction or preprocessing on the query text, and matching corresponding retrieval result sets based on the semantic feature extraction results or the preprocessing results;
and acquiring the intersection of all the retrieval result sets and the intersections between any two of the retrieval result sets, and forming a total result set corresponding to the query text based on the intersection of all the retrieval result sets and the intersections between any two of them.
2. The heterogeneous knowledge-fused multi-strategy image retrieval method according to claim 1, wherein processing each image correspondingly with different strategies comprises:
performing face recognition clustering, optical character recognition, and semantic feature vector extraction on each image with the different strategies, respectively.
3. The heterogeneous knowledge-fused multi-strategy image retrieval method according to claim 2, wherein processing each image correspondingly with different strategies further comprises:
forming a plurality of clusters based on the face recognition clustering, assigning the same identifier to the images in the same cluster, and associating the identifier with the corresponding images by means of an index;
and associating each optical character recognition result with the corresponding image.
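The indexing step in claim 3 can be illustrated with a minimal in-memory sketch. The data structures and names below (`person_` prefix, dictionary index) are illustrative assumptions; the patent does not prescribe a storage layout.

```python
def build_index(clusters, ocr_results):
    """clusters: list of lists of image ids, one inner list per face cluster.
    ocr_results: dict mapping image id -> recognized text."""
    identifier_of = {}           # image id -> cluster identifier
    images_with_identifier = {}  # identifier -> image ids (the index)
    for cluster_id, images in enumerate(clusters):
        key = f"person_{cluster_id}"      # same identifier for the whole cluster
        images_with_identifier[key] = list(images)
        for img in images:
            identifier_of[img] = key
    # Associate each optical character recognition result with its image.
    text_of = dict(ocr_results)
    return identifier_of, images_with_identifier, text_of
```

In a production system the same associations would live in the database mentioned in claim 1 rather than in Python dictionaries.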
4. The heterogeneous knowledge-fused multi-strategy image retrieval method according to claim 3, wherein preprocessing the query text comprises:
performing word segmentation on the query text to obtain a plurality of characters or a plurality of substrings.
5. The heterogeneous knowledge-fused multi-strategy image retrieval method according to claim 2, wherein matching the corresponding retrieval result set based on the semantic feature extraction results comprises:
calculating the cosine similarity between the extracted semantic features and the semantic feature vectors of the images stored in the database;
comparing each calculated cosine similarity with a similarity threshold;
judging an image whose semantic feature vector has a cosine similarity greater than the similarity threshold as a related image;
and ranking all the related images in descending order of similarity to obtain a first retrieval result set.
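The semantic-matching steps of claim 5 amount to threshold-filtered cosine ranking. A minimal sketch (the threshold value 0.5 is an assumption; the patent does not specify one):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def first_result_set(query_vec, image_vecs, threshold=0.5):
    """image_vecs: dict mapping image id -> stored semantic feature vector."""
    scored = [(img, cosine(query_vec, v)) for img, v in image_vecs.items()]
    # Keep only images whose similarity exceeds the threshold.
    related = [(img, s) for img, s in scored if s > threshold]
    # Rank related images in descending order of similarity.
    related.sort(key=lambda t: t[1], reverse=True)
    return [img for img, _ in related]
```

In practice the semantic feature vectors would come from a vision-language encoder and the scan over the database would be replaced by an approximate nearest-neighbor index.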
6. The heterogeneous knowledge-fused multi-strategy image retrieval method according to claim 4, wherein matching the corresponding retrieval result set based on the preprocessing results comprises:
acquiring the position ordinals, within the query text, of the plurality of characters corresponding to the query text, and judging whether the difference between the position ordinals of any two characters to be matched is less than or equal to a corresponding numerical value;
when the difference between the position ordinals of any two characters is less than or equal to the corresponding numerical value, counting the number of identical characters between the query text and an identifier in the database, and calculating the Jaro distance between the query text and the identifier;
acquiring the Jaro-Winkler similarity between the query text and the identifier in the database based on the calculated Jaro distance;
comparing the obtained Jaro-Winkler similarity with a similarity threshold, and judging the image corresponding to the identifier as a related image when the Jaro-Winkler similarity is greater than the similarity threshold;
and ranking all the related images in descending order of similarity to obtain a second retrieval result set.
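The matching described in claim 6 corresponds to the standard Jaro-Winkler computation: characters count as matches only when their position ordinals differ by at most the matching window, and the Jaro score is then boosted by the length of the common prefix. A sketch of the textbook algorithm (the patent's exact variant may differ):

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    w = max(len(s1), len(s2)) // 2 - 1   # matching window on position ordinals
    match1, match2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - w), min(len(s2), i + w + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count transpositions among the matched characters.
    t, k = 0, 0
    for i in range(len(s1)):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    j = jaro(s1, s2)
    l = 0                                # common prefix length, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return j + l * p * (1 - j)
```

For the classic example pair "MARTHA"/"MARHTA" this yields roughly 0.961, which would then be compared against the similarity threshold of the claim.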
7. The heterogeneous knowledge fusion multi-strategy image retrieval method according to claim 6, wherein the calculation formula of the corresponding numerical value is:
where str1 is the length of the query text and str2 is the length of the identifier.
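The formula itself appears only as an image in the original publication and is not reproduced in this text. For reference, in the standard Jaro definition the matching window computed from the two lengths is half the longer length minus one; this may or may not coincide with the claimed formula:

```python
def matching_window(str1, str2):
    # Standard Jaro matching window: half the longer string length, minus one.
    # Shown for reference only; the patent's own formula is not reproduced here.
    return max(len(str1), len(str2)) // 2 - 1
```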
8. The heterogeneous knowledge-fused multi-strategy image retrieval method according to claim 6, wherein matching the corresponding retrieval result set based on the preprocessing results further comprises:
taking the ratio of the length of each substring corresponding to the query text to the total length of the query text as the weight of that substring, matching each substring against the optical character recognition results to obtain a matching rate for each substring, and multiplying the weight of each substring by its matching rate to obtain a query similarity;
comparing the obtained query similarity with a similarity threshold, and judging the image corresponding to an optical character recognition result as a related image when its query similarity is greater than the similarity threshold;
and ranking all the related images in descending order of similarity to obtain a third retrieval result set.
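The weighted substring scoring of claim 8 can be sketched as follows. The claim does not define the matching rate precisely, so a simple containment check (1.0 if the substring occurs in the OCR text, else 0.0) is assumed here:

```python
def query_similarity(substrings, query_text, ocr_text):
    """Weight each substring by its share of the query length, score it
    against one image's OCR text, and sum the weighted scores."""
    total = 0.0
    for sub in substrings:
        weight = len(sub) / len(query_text)      # share of total query length
        match_rate = 1.0 if sub in ocr_text else 0.0  # assumed matching rate
        total += weight * match_rate
    return total
```

With this assumption the score is 1.0 when every substring appears in the OCR result and degrades in proportion to the length of the substrings that fail to match.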
9. The heterogeneous knowledge-fused multi-strategy image retrieval method according to claim 1, wherein forming a total result set corresponding to the query text based on the intersection of all the retrieval result sets and the intersections between any two of the retrieval result sets comprises:
recording the intersection of all the retrieval result sets as a first total result set;
calculating the cosine similarity between the semantic features corresponding to the query text and the semantic feature vector of each image in the intersections between any two of the retrieval result sets, and ranking all such images in descending order of cosine similarity to form a second total result set;
and combining the first total result set and the second total result set to form the total result set corresponding to the query text.
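The result-fusion logic of claim 9 can be sketched as set operations over the per-strategy result sets. Ranking the first total result set by similarity, and excluding its images from the second, are assumptions made for a deterministic ordering; the claim does not specify either:

```python
def total_result_set(result_sets, similarity):
    """result_sets: iterables of image ids, one per retrieval strategy.
    similarity: dict mapping image id -> cosine similarity to the query's
    semantic features (a hypothetical precomputed lookup)."""
    sets = [set(r) for r in result_sets]
    # First total result set: images returned by every strategy.
    first = set.intersection(*sets)
    # Second total result set: images in any pairwise intersection,
    # ranked in descending order of cosine similarity.
    pairwise = set()
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            pairwise |= sets[i] & sets[j]
    second = sorted(pairwise - first, key=lambda im: similarity[im], reverse=True)
    # Combine: full-intersection images first, then pairwise-only images.
    return sorted(first, key=lambda im: similarity[im], reverse=True) + second
```

This realizes the reordering goal stated in the abstract: images supported by all strategies are shown first, then images supported by at least two strategies, each group ordered by semantic relevance.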
10. A heterogeneous knowledge-fused multi-strategy image retrieval system for performing image retrieval using the heterogeneous knowledge-fused multi-strategy image retrieval method according to any one of claims 1 to 9, characterized in that the system comprises:
a multi-strategy processing module for collecting a plurality of images uploaded by a user, processing each image correspondingly with different strategies, and storing the processing results in a database;
a query result matching module for receiving a query text input by a user, performing semantic feature extraction or preprocessing on the query text, and matching corresponding retrieval result sets based on the semantic feature extraction results or the preprocessing results;
and a total result set acquisition module for acquiring the intersection of all the retrieval result sets and the intersections between any two of the retrieval result sets, and forming a total result set corresponding to the query text based on those intersections.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311515168.1A CN117763175A (en) | 2023-11-14 | 2023-11-14 | Heterogeneous knowledge-fused multi-strategy image retrieval method and system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117763175A true CN117763175A (en) | 2024-03-26 |
Family
ID=90311163
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311515168.1A Pending CN117763175A (en) | 2023-11-14 | 2023-11-14 | Heterogeneous knowledge-fused multi-strategy image retrieval method and system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117763175A (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118820545A (en) * | 2024-07-01 | 2024-10-22 | 广东粤孵产业大数据研究有限公司 | A graphene product retrieval method and system based on large model |
| CN120011410A (en) * | 2025-01-24 | 2025-05-16 | 常州市第一人民医院 | A hospital library search system and method |
| CN120011410B (en) * | 2025-01-24 | 2025-10-31 | 常州市第一人民医院 | Hospital library retrieval system and method |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110795543B (en) | Unstructured data extraction method, device and storage medium based on deep learning | |
| CN106649818B (en) | Application search intent identification method, device, application search method and server | |
| CN105824959B (en) | Public opinion monitoring method and system | |
| CN107291783B (en) | Semantic matching method and intelligent equipment | |
| CN110188197B (en) | Active learning method and device for labeling platform | |
| CN113157867A (en) | Question answering method and device, electronic equipment and storage medium | |
| CN107679070B (en) | Intelligent reading recommendation method and device and electronic equipment | |
| CN112559684A (en) | Keyword extraction and information retrieval method | |
| CN112035675A (en) | Medical text labeling method, device, equipment and storage medium | |
| CN107844493B (en) | File association method and system | |
| CN102053991A (en) | Method and system for multi-language document retrieval | |
| CN117763175A (en) | Heterogeneous knowledge-fused multi-strategy image retrieval method and system | |
| CN110807121A (en) | Electronic education resource matching method based on image-text intelligent identification and computer readable storage medium | |
| CN109299233A (en) | Text data processing method, device, computer equipment and storage medium | |
| CN110580301A (en) | efficient trademark retrieval method, system and platform | |
| CN110019703A (en) | Data markers method and device, intelligent answer method and system | |
| CN114118082B (en) | A resume retrieval method and device | |
| CN113868389B (en) | Data query method and device based on natural language text and computer equipment | |
| CN120179886A (en) | Information retrieval method and device, electronic device and storage medium | |
| CN112868001B (en) | Document retrieval device, document retrieval program, and document retrieval method | |
| KR101565367B1 (en) | Method for calculating plagiarism rate of documents by number normalization | |
| CN112989811B (en) | A BiLSTM-CRF-based historical book reading assistance system and its control method | |
| CN114647756A (en) | Image-based searching method and device, electronic equipment and storage medium | |
| CN115659964A (en) | Form entity extraction method and system based on multi-mode information | |
| CN109766442A (en) | method and system for classifying user notes |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||