CN112015923A - Multi-mode data retrieval method, system, terminal and storage medium

Info

Publication number: CN112015923A
Application number: CN202010922939.9A
Authority: CN
Inventors: 王硕; 吴振宇; 王建明
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-09-04
Filing date: 2020-09-04
Publication date: 2020-12-01
Also published as: WO2021155682A1

Abstract

The invention discloses a multimodal data retrieval method, system, terminal and storage medium. The method comprises: acquiring historical multimodal data, the historical multimodal data comprising at least picture data and text data; training a cross-modal retrieval model according to the historical multimodal data, the cross-modal retrieval model comprising at least a picture modal retrieval model and a text modal retrieval model; and inputting data to be retrieved into the cross-modal retrieval model, wherein the cross-modal retrieval model retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files, and sorts the candidate set by similarity to obtain the data file with the highest similarity to the data to be retrieved. The invention can effectively retrieve multimodal data such as pictures and text, improving retrieval accuracy and retrieval efficiency.

Description

Multimodal data retrieval method, system, terminal and storage medium

Technical Field

The present invention relates to the technical field of data retrieval, and in particular to a multimodal data retrieval method, system, terminal and storage medium.

Background

With the rapid development of network technology, multimodal documents containing data such as text and images have appeared on a large scale in people's daily lives. In the world of information, these data resources of different modalities can imperceptibly enhance the ability of the sense organs to take in knowledge.

Because multimodal data is diverse, complex and arbitrary, it is important to be able to retrieve information useful to the user from a large number of multimodal documents quickly and accurately. Traditional data retrieval methods, however, usually search by keyword, which requires the keywords to be extracted manually in advance; and because keywords are coarse-grained, both retrieval accuracy and retrieval efficiency are relatively poor.

Summary of the Invention

The present invention provides a multimodal data retrieval method, system, terminal and storage medium, which can overcome the deficiencies of the prior art to a certain extent.

To solve the above technical problems, the technical solution adopted by the present invention is as follows:

A multimodal data retrieval method, comprising:

acquiring historical multimodal data, the historical multimodal data comprising at least picture data and text data;

training a cross-modal retrieval model according to the historical multimodal data, the cross-modal retrieval model comprising at least a picture modal retrieval model and a text modal retrieval model;

inputting data to be retrieved into the cross-modal retrieval model, wherein the cross-modal retrieval model retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files of the data to be retrieved, and sorts the candidate set of similar data files by similarity to obtain the data file with the highest similarity to the data to be retrieved.

The technical solution adopted in the embodiments of the present invention further includes: the acquiring of historical multimodal data further comprises:

constructing a multimodal file database, the multimodal file database including the picture data and text data of each data file.

The technical solution adopted in the embodiments of the present invention further includes: before the training of the cross-modal retrieval model according to the historical multimodal data, the method further comprises:

labeling the categories to which the data files in the multimodal file database belong, to generate data samples for training the model.

The technical solution adopted in the embodiments of the present invention further includes: the training of the cross-modal retrieval model according to the historical multimodal data comprises:

the model training includes a retrieval recall stage and a precise sorting stage, wherein:

in the retrieval recall stage, a matching algorithm is used to coarsely screen all data samples, yielding at least two sets of similar data files of the file to be retrieved in different modalities, and the union of the at least two sets of similar data files is then taken as the candidate set of similar data files of the data to be retrieved;

in the precise sorting stage, the candidate set of similar data files is sorted by similarity to obtain the data file with the highest similarity to the data to be retrieved.

The technical solution adopted in the embodiments of the present invention further includes: the cross-modal retrieval model retrieving the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively comprises:

determining whether the picture data of the data to be retrieved is empty; if it is not empty, inputting the picture data into the picture modal retrieval model to obtain retrieval results of similar data files in the picture modality, sorting the retrieval results, and taking the top M retrieval results as the similar data file set S_I in the picture modality;

determining whether the text data of the data to be retrieved is empty; if it is not empty, inputting the text data into the text modal retrieval model to obtain retrieval results of similar data files in the text modality, sorting the retrieval results, and taking the top M retrieval results as the similar data file set S_T in the text modality;

taking the union of the sets S_I and S_T as the candidate set of similar data files of the data to be retrieved;

sorting the candidate set of similar data files by similarity to obtain the data file with the highest similarity to the data to be retrieved.

The technical solution adopted in the embodiments of the present invention further includes: the picture modal retrieval model uses ResNet for encoding, and the text modal retrieval model uses BERT for encoding.

The technical solution adopted in the embodiments of the present invention further includes: the retrieval algorithm of the text modal retrieval model includes the BM25 or TF-IDF algorithm, and the retrieval algorithm of the picture modal retrieval model includes similarity matching using the visual features of pictures, the visual features including color distribution, geometric shape or texture.

Another technical solution adopted by the embodiments of the present invention is a multimodal data retrieval system, comprising:

a data collection module, configured to acquire historical multimodal data, the historical multimodal data comprising at least picture data and text data;

a model building module, configured to train a cross-modal retrieval model according to the historical multimodal data, the cross-modal retrieval model comprising at least a picture modal retrieval model and a text modal retrieval model;

a data retrieval module, configured to input data to be retrieved into the cross-modal retrieval model, wherein the cross-modal retrieval model retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files of the data to be retrieved, and sorts the candidate set of similar data files by similarity to obtain the data file with the highest similarity to the data to be retrieved.

A further technical solution adopted by the embodiments of the present invention is a terminal, the terminal comprising a processor and a memory coupled to the processor, wherein:

the memory stores program instructions for implementing the above multimodal data retrieval method;

the processor is configured to execute the program instructions stored in the memory to perform the multimodal data retrieval operation.

A further technical solution adopted by the embodiments of the present invention is a storage medium storing program instructions executable by a processor, the program instructions being used to execute the above multimodal data retrieval method.

The beneficial effects of the present invention are as follows: the multimodal data retrieval method, system, terminal and storage medium of the embodiments of the present invention construct a cross-modal retrieval model based on data files of different modalities, take data files of different modalities directly as input, and output the retrieval results of the corresponding modalities. This realizes an end-to-end retrieval scheme that can effectively retrieve multimodal data such as pictures and text, and improves retrieval accuracy and retrieval efficiency.

Description of Drawings

FIG. 1 is a schematic flowchart of a multimodal data retrieval method according to a first embodiment of the present invention;

FIG. 2 is a schematic flowchart of a multimodal data retrieval method according to a second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a multimodal data retrieval system according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The terms "first", "second" and "third" in the present invention are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined with "first", "second" or "third" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, for example two or three, unless otherwise expressly and specifically defined. All directional indications (such as up, down, left, right, front, back, and so on) in the embodiments of the present invention are used only to explain the relative positional relationship, movement and the like between components in a particular posture (as shown in the drawings); if that posture changes, the directional indication changes accordingly. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units not listed, or optionally also includes other steps or units inherent to such a process, method, product or device.

Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present invention. Appearances of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.

Multimodal data usually includes data in different forms such as text, images, speech and video. For data of the same type, although data of different modalities are heterogeneous in their low-level features, their high-level semantics are correlated to a certain degree. For example, for a particular disease, the medication prescribed and the medical imaging examinations performed are largely similar across cases; that is, the text data describing the medication and the medical image data from the imaging examinations are semantically related. Based on this property, the embodiments of the present invention use historical multimodal data to train a cross-modal retrieval model, use the cross-modal retrieval model to retrieve data sets of different modalities that are similar to the data to be retrieved, and then take the intersection of the data sets of the different modalities as the final retrieval result.

For ease of description, the following embodiments of the present invention use only the two most common modalities, picture and text, as examples. It can be understood that the present invention is equally applicable to the retrieval of data in other modalities such as speech and video.

Specifically, please refer to FIG. 1, which is a schematic flowchart of the multimodal data retrieval method according to the first embodiment of the present invention. The multimodal data retrieval method according to the first embodiment of the present invention includes the following steps:

S10: Acquire historical multimodal data, and build a multimodal file database based on the historical multimodal data;

In this step, the multimodal file database includes the multimodal data, such as pictures and text, of each data file. Assuming that the number of data files collected in the multimodal file database is N, the data set contained in the database is {(I_1, T_1), (I_2, T_2), (I_3, T_3), …, (I_N, T_N)}, where (I_i, T_i) denotes the picture-text pair of the i-th data file.

S11: Label the categories to which a certain number of data files in the multimodal file database belong, and generate data samples for training the model;

In this step, taking data files of the medical data type as an example, the category labels of a data file include the disease name, medication type, imaging examination type and so on. The category of each data file is labeled manually, so that data files belonging to the same category are treated as similar.
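As an illustration of how category labels can be turned into training samples for a pairwise objective, the following is a minimal sketch. The function name `build_pairwise_samples`, the file identifiers and the sampling strategy (one random negative per same-category pair) are assumptions made for illustration; the source does not specify how samples are constructed.

```python
import random
from itertools import combinations


def build_pairwise_samples(labels, seed=0):
    """Build (anchor, positive, negative) triples from category labels.

    labels: dict mapping a file id to its manually assigned category.
    Files sharing a category are treated as similar (positive);
    a file from any other category is drawn at random as the negative.
    """
    rng = random.Random(seed)
    by_category = {}
    for file_id, category in labels.items():
        by_category.setdefault(category, []).append(file_id)

    samples = []
    for category, members in by_category.items():
        negatives = [f for f, c in labels.items() if c != category]
        if not negatives:
            continue  # no contrasting category available
        for anchor, positive in combinations(sorted(members), 2):
            samples.append((anchor, positive, rng.choice(negatives)))
    return samples


# Hypothetical labels for four data files.
labels = {"f1": "pneumonia", "f2": "pneumonia", "f3": "fracture", "f4": "fracture"}
triples = build_pairwise_samples(labels)
```

Each triple pairs two files of the same category with one file of a different category, which is the shape of input a pairwise ranking loss expects.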

S12: Train a cross-modal retrieval model according to the data samples, and use the cross-modal retrieval model to retrieve the picture data and the text data of the file to be retrieved respectively, to obtain the data file with the highest similarity to the file to be retrieved;

In this step, the cross-modal retrieval model includes at least a text modal retrieval model and a picture modal retrieval model. The embodiment of the present invention uses the Pairwise mode for model training, and the training process includes two stages, retrieval recall and precise sorting:

In the retrieval recall stage, a matching algorithm is used to coarsely screen all data samples and obtain a relatively small candidate set of similar data files. Retrieval in this stage is performed within a single modality: the picture data of the file to be retrieved is used to retrieve similar picture data files through the picture modal retrieval model, and the text data of the file to be retrieved is used to retrieve similar text data files through the text modal retrieval model. This yields one set of similar data files in the picture modality and one in the text modality, and the union of the two sets is taken as the candidate set of similar data files of the file to be retrieved. Assuming that the candidate set obtained in the retrieval recall stage has size K, the corresponding set of data files is {(I_1, T_1), (I_2, T_2), …, (I_K, T_K)}.

In the precise sorting stage, the candidate set of similar data files obtained in the retrieval recall stage is sorted by similarity to obtain the data file with the highest similarity to the file to be retrieved. The precise sorting stage is designed on the basis of Learning to Rank, and its optimization target is the degree of matching between text data and picture data. In Pairwise mode, Hinge Loss is used as the loss function.
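The loss formula itself is not reproduced in this text. For orientation only, the commonly used pairwise hinge loss has the form max(0, margin - (s_pos - s_neg)), where s_pos is the model's score for a matching pair and s_neg its score for a non-matching pair. The sketch below implements that standard form; the margin value and the function name are assumptions, and this is not necessarily the exact formula used in the patent.

```python
def pairwise_hinge_loss(score_pos, score_neg, margin=1.0):
    """Standard pairwise hinge loss.

    The loss is zero once the matching pair outscores the
    non-matching pair by at least `margin`; otherwise it grows
    linearly with the shortfall.
    """
    return max(0.0, margin - (score_pos - score_neg))


# The positive outscores the negative by 0.7, short of the 1.0 margin.
loss_violating = pairwise_hinge_loss(0.9, 0.2)
# The positive outscores the negative by 1.5, so the margin is satisfied.
loss_satisfied = pairwise_hinge_loss(2.0, 0.5)
```

Minimizing this quantity over many (matching, non-matching) pairs pushes matching picture-text pairs above non-matching ones in the ranking.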

The picture modal retrieval model and the text modal retrieval model are each obtained using the above training mode. The picture modal retrieval model uses ResNet, a deep learning pre-trained image model, for encoding, and the text modal retrieval model uses BERT (Bidirectional Encoder Representations from Transformers, a deep learning pre-trained language model) for encoding. As shown below, the embedding vectors of picture I_i and text T_i after encoding are I_Ei and T_Ei respectively:

T_Ei = BERT(T_i)

I_Ei = ResNet(I_i)
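Once files are represented as embedding vectors, candidates can be ordered by a vector similarity. The sketch below uses cosine similarity over small hand-written vectors that stand in for the encoder outputs; the choice of cosine similarity, the function names and the example vectors are all assumptions, since the source does not state which similarity measure is applied to the embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_embedding(query_vec, candidates):
    """Return (file_id, similarity) pairs, most similar first.

    candidates: dict mapping a file id to its embedding vector.
    """
    scored = [(fid, cosine_similarity(query_vec, vec))
              for fid, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in 2-dimensional embeddings; real encoder outputs are much larger.
candidates = {"f1": [1.0, 0.0], "f2": [0.6, 0.8], "f3": [0.0, 1.0]}
ranking = rank_by_embedding([1.0, 0.0], candidates)
```

In a real system the vectors would come from the trained encoders, and the same ranking function could be applied to either modality.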

In the embodiment of the present invention, the retrieval algorithm of the text modal retrieval model includes, but is not limited to, the BM25 or TF-IDF algorithm, and the retrieval algorithm of the picture modal retrieval model includes similarity matching using simple visual features of pictures such as color distribution, geometric shape and texture.
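For readers unfamiliar with BM25, the following is a minimal sketch of the usual Okapi BM25 scoring function over pre-tokenized documents. The parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy corpus are assumptions chosen for illustration; the source names the algorithm but gives no parameters. The sketch assumes a non-empty corpus.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in a non-empty corpus against the query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Toy corpus of tokenized text records (hypothetical content).
corpus = [
    ["chest", "pain", "aspirin"],
    ["knee", "fracture", "cast"],
    ["chest", "xray", "pneumonia"],
]
scores = bm25_scores(["chest", "pneumonia"], corpus)
```

The third document matches both query terms and therefore scores highest, the first matches one term, and the second matches none.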

Please refer to FIG. 2, which is a schematic flowchart of a multimodal data retrieval method according to a second embodiment of the present invention. The multimodal data retrieval method according to the second embodiment of the present invention includes the following steps:

S20: Select the file to be retrieved, and obtain the picture data and text data of the file to be retrieved;

S21: Input the picture data and text data into the trained cross-modal retrieval model;

S22: Use the cross-modal retrieval model to retrieve the picture data and the text data respectively to obtain a candidate set of similar data files of the file to be retrieved, and sort the candidate set of similar data files by similarity to obtain the data file with the highest similarity to the file to be retrieved;

In this step, the retrieval procedure of the cross-modal retrieval model specifically includes:

1. Determine whether the picture data of the file to be retrieved is empty. If it is not empty, input the picture data into the picture modal retrieval model to obtain retrieval results of similar data files in the picture modality, sort those retrieval results, and take the top M retrieval results as the similar data file set S_I in the picture modality;

2. Determine whether the text data of the file to be retrieved is empty. If it is not empty, input the text data into the text modal retrieval model to obtain retrieval results of similar data files in the text modality, sort those retrieval results, and take the top M retrieval results as the similar data file set S_T in the text modality;

3. Take the union of the sets S_I and S_T as the candidate set of similar data files of the file to be retrieved;

4. Sort the candidate set of similar data files by similarity to obtain the data file retrieval result with the highest similarity to the file to be retrieved.

The value of M can be set according to actual needs.
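The four steps above can be sketched as follows. Per-modality similarity scores are passed in as dictionaries, and `None` stands for a query whose data in that modality is empty; the function names and the re-ranking function (here a simple sum of the two modality scores) are assumptions for illustration, since the source leaves the final similarity measure to the trained model.

```python
def top_m(scores, m):
    """Return the ids of the m highest-scoring files as a set."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:m])


def retrieve(image_scores, text_scores, rerank_score, m):
    """Recall per modality, take the union, then re-rank.

    image_scores / text_scores: dict of file id -> similarity, or None
    when the query has no data in that modality.
    rerank_score: callable mapping a file id to its final similarity.
    """
    s_i = top_m(image_scores, m) if image_scores is not None else set()
    s_t = top_m(text_scores, m) if text_scores is not None else set()
    candidates = s_i | s_t  # union of the two recall sets
    return sorted(candidates, key=rerank_score, reverse=True)


# Hypothetical single-modality similarity scores for three files.
image_scores = {"f1": 0.9, "f2": 0.4, "f3": 0.1}
text_scores = {"f1": 0.2, "f2": 0.3, "f3": 0.8}


def fused(file_id):
    # Assumed re-ranking score: sum of the two modality scores.
    return image_scores[file_id] + text_scores[file_id]


result = retrieve(image_scores, text_scores, fused, m=1)
image_only = retrieve(image_scores, None, image_scores.get, m=2)
```

With M = 1 each modality contributes one file, the union holds two candidates, and the first element of `result` is the file reported as most similar.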

In summary, the multimodal data retrieval method of the embodiments of the present invention constructs a cross-modal retrieval model based on data files of different modalities, takes data files of different modalities directly as input, and outputs the retrieval results of the corresponding modalities. This realizes an end-to-end retrieval scheme that handles multimodal data such as pictures and text well, and improves retrieval accuracy and retrieval efficiency.

In an optional implementation, the result of the multimodal data retrieval method may also be uploaded to a blockchain.

Specifically, corresponding digest information is obtained from the result of the multimodal data retrieval method; that is, the digest information is obtained by hashing the result, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency to users. A user can download the digest information from the blockchain to verify whether the result of the multimodal data retrieval method has been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer and so on.
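The digest step can be sketched with Python's standard `hashlib` module. The result structure, the function name `result_digest` and the use of canonical JSON serialization are assumptions for illustration; the source only states that the result is hashed, and uploading the digest to a blockchain is outside the scope of this sketch.

```python
import hashlib
import json


def result_digest(result):
    """Return the SHA-256 hex digest of a retrieval result.

    The result is serialized as JSON with sorted keys so that the
    same content always produces the same digest.
    """
    canonical = json.dumps(result, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical retrieval result and a tampered copy of it.
result = {"query": "q1", "ranking": ["f1", "f3"]}
digest = result_digest(result)
tampered_digest = result_digest({"query": "q1", "ranking": ["f3", "f1"]})
```

A user who later recomputes the digest from the result they received can compare it with the stored value; any change to the ranking produces a different digest.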

Please refer to FIG. 3, which is a schematic structural diagram of a multimodal data retrieval system according to an embodiment of the present invention. The multimodal data retrieval system 40 according to the embodiment of the present invention includes:

Data acquisition module 41: configured to acquire historical multimodal data and build a multimodal file database based on the historical multimodal data, wherein the multimodal file database includes the multimodal data, such as pictures and text, of each data file. Assuming that the number of data files collected in the multimodal file database is N, the data set contained in the database is {(I_1, T_1), (I_2, T_2), (I_3, T_3), …, (I_N, T_N)}, where (I_i, T_i) denotes the picture-text pair of the i-th data file.

Model building module 42: configured to train a cross-modal retrieval model according to the data samples in the multimodal file database. The model training method is specifically as follows: first, label the categories to which a certain number of data files in the multimodal file database belong, generating data samples for training the model; then train the cross-modal retrieval model according to the data samples;

In the embodiment of the present invention, the cross-modal retrieval model includes a text modal retrieval model and a picture modal retrieval model. The embodiment of the present invention uses the Pairwise mode for model training, and the training process includes two stages, retrieval recall and precise sorting:

In the retrieval recall stage, a matching algorithm is used to coarsely screen all data samples and obtain a relatively small candidate set of similar data files. Retrieval in this stage is performed within a single modality: the picture data of the file to be retrieved is used to retrieve similar picture data files through the picture modal retrieval model, and the text data of the file to be retrieved is used to retrieve similar text data files through the text modal retrieval model. This yields one set of similar data files in the picture modality and one in the text modality, and the union of the two sets is taken as the candidate set of similar data files of the file to be retrieved. Assuming that the candidate set obtained in the retrieval recall stage has size K, the corresponding set of data files is {(I_1, T_1), (I_2, T_2), …, (I_K, T_K)}.

T_Ei = BERT(T_i)

I_Ei = ResNet(I_i)

Data retrieval module 43: configured to use the cross-modal retrieval model to retrieve the picture data and the text data respectively to obtain a candidate set of similar data files of the file to be retrieved, and to sort the candidate set of similar data files by similarity to obtain the data file with the highest similarity to the file to be retrieved;

The retrieval procedure of the cross-modal retrieval model is the same as in steps 1 to 4 described for step S22 above.

Please refer to FIG. 4, which is a schematic structural diagram of a terminal according to an embodiment of the present invention. The terminal 50 includes a processor 51 and a memory 52 coupled to the processor 51.

The memory 52 stores program instructions for implementing the multimodal data retrieval method described above.

The processor 51 is configured to execute the program instructions stored in the memory 52 to perform the multimodal data retrieval operation.

The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.

Please refer to FIG. 5, which is a schematic structural diagram of a storage medium according to an embodiment of the present invention. The storage medium of this embodiment stores a program file 61 capable of implementing all of the above methods. The program file 61 may be stored in the storage medium in the form of a software product and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc, or a terminal device such as a computer, a server, a mobile phone or a tablet.

In the several embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or of another form.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

The above are only embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims

1. A multimodal data retrieval method, characterized by comprising:

Obtaining historical multimodal data, wherein the historical multimodal data includes at least picture data and text data;

Training a cross-modal retrieval model according to the historical multimodal data, wherein the cross-modal retrieval model includes at least a picture modal retrieval model and a text modal retrieval model;

Inputting the data to be retrieved into the cross-modal retrieval model, wherein the cross-modal retrieval model retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files for the data to be retrieved, and sorts the candidate set of similar data files by similarity to obtain the data file with the highest similarity to the data to be retrieved.

2. The multimodal data retrieval method according to claim 1, wherein obtaining the historical multimodal data further comprises:

Constructing a multimodal file database, wherein the multimodal file database includes picture data and text data of each data file.

3. The multimodal data retrieval method according to claim 2, wherein before the cross-modal retrieval model is trained according to the historical multimodal data, the method further comprises:

Labeling the categories to which the data files in the multimodal file database belong to generate data samples for training the model.

4. The multimodal data retrieval method according to claim 3, wherein the training of a cross-modal retrieval model according to the historical multimodal data comprises:

The model training includes a retrieval recall stage and a precise sorting stage, wherein:

In the retrieval recall stage, a matching algorithm is used to coarsely screen all data samples to obtain at least two sets of similar data files for the data to be retrieved in different modalities, and the union of the at least two sets of similar data files is taken as the candidate set of similar data files for the data to be retrieved;

In the precise sorting stage, similarity sorting is performed on the similar data file candidate set to obtain a data file with the highest similarity with the data to be retrieved.

5. The multimodal data retrieval method according to claim 4, wherein the cross-modal retrieval model retrieving the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively comprises:

Determining whether the picture data of the data to be retrieved is empty and, if not, inputting the picture data into the picture modal retrieval model to obtain retrieval results of similar data files in the picture modality, sorting the retrieval results, and taking the first M retrieval results as the similar data file set S_I in the picture modality;

Determining whether the text data of the data to be retrieved is empty and, if not, inputting the text data into the text modal retrieval model to obtain retrieval results of similar data files in the text modality, sorting the retrieval results, and taking the first M retrieval results as the similar data file set S_T in the text modality;

Taking the union of the sets S_I and S_T as the candidate set of similar data files for the data to be retrieved;

Sorting the candidate set of similar data files by similarity to obtain the data file with the highest similarity to the data to be retrieved.

6. The multimodal data retrieval method according to claim 1, wherein the picture modal retrieval model uses ResNet for encoding, and the text modal retrieval model uses BERT for encoding.

7. The multimodal data retrieval method according to claim 1, characterized in that the retrieval algorithm of the text modal retrieval model comprises the BM25 or TF-IDF algorithm, and the retrieval algorithm of the picture modal retrieval model comprises performing similarity matching on visual features of the picture, the visual features including color distribution, geometric shape or texture.
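For illustration only, the BM25 text matching named in claim 7 can be sketched in Python as below. The function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are assumptions chosen for the example; the claims do not fix any of them. Documents and the query are assumed to be already tokenized.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized document in docs against query_tokens with Okapi BM25.

    Uses the non-negative IDF variant log(1 + (N - n + 0.5) / (n + 0.5)).
    k1 and b are conventional defaults, not values taken from the patent.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Sorting documents by these scores and keeping the first M gives a text-modality recall set of the kind described in claim 5.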

8. A multimodal data retrieval system, comprising:

Data collection module: used to obtain historical multimodal data, the historical multimodal data includes at least picture data and text data;

Model building module: used to train a cross-modal retrieval model according to the historical multimodal data; the cross-modal retrieval model includes at least a picture modal retrieval model and a text modal retrieval model;

Data retrieval module: used to input the data to be retrieved into the cross-modal retrieval model, wherein the cross-modal retrieval model retrieves the data to be retrieved through the picture modal retrieval model and the text modal retrieval model respectively to obtain a candidate set of similar data files for the data to be retrieved, and sorts the candidate set of similar data files by similarity to obtain the data file with the highest similarity to the data to be retrieved.

9. A terminal, characterized in that the terminal comprises a processor and a memory coupled to the processor, wherein,

The memory stores program instructions for implementing the multimodal data retrieval method according to any one of claims 1 to 7;

The processor is configured to execute the program instructions stored in the memory to perform the multimodal data retrieval method.

10. A storage medium, characterized in that it stores program instructions executable by a processor, and the program instructions are used to execute the multimodal data retrieval method according to any one of claims 1 to 7.