CN114676138A

CN114676138A - Data processing method, electronic device and readable storage medium

Info

Publication number: CN114676138A
Application number: CN202210317485.1A
Authority: CN
Inventors: 谢超; 葛希; 龙际全; 栾小凡
Original assignee: Shanghai Xuyu Intelligent Technology Co ltd
Current assignee: Shanghai Xuyu Intelligent Technology Co ltd
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2022-06-28
Anticipated expiration: 2042-03-29
Also published as: CN114676138B

Abstract

The present application relates to the field of computer technology, and in particular to a data processing method, an electronic device, and a readable storage medium. The data processing method includes: receiving a data query request; determining a target data set corresponding to the query request, and determining at least one data subset of the target data set and an index information set corresponding to each data subset; and determining, in the first information and the second information of each index information set, target index information matching the data query request and a target data entity corresponding to the target index information. The data processing method of the embodiments of the present application adds storage and querying of string-type data and enriches the query capability for data entities.

Description

Data processing method, electronic device and readable storage medium

Technical Field

The present application relates to the field of computer technology, and in particular to a data processing method, an electronic device, and a readable storage medium.

Background

With the development of computer technology, data volumes keep growing and data processing becomes increasingly difficult. Data includes structured data and unstructured data. Structured data is generally represented as a two-dimensional table; examples include numbers, dates, and strings. Unstructured data is generally processed into vectors; examples include pictures, videos, and text.

Unstructured data is currently processed by computers into vector-type data. When a data entity that contains both structured and unstructured data is stored, the storage area is divided into non-string scalar columns and vector columns, or into non-string scalar rows or vector rows for persistent storage, and an index is built on the vector columns. When a data query is required, a filter condition on the vector columns is generated from the query request, and the vector column data is filtered according to the vector column index. For example, a face image is processed into vector-type unstructured data representing the face, which is combined with scalar-type structured data giving the age associated with the face image; the two are stored as a vector column representing the face data and a scalar column representing the age, and the face image together with its corresponding age can be treated as one data entity.

When querying data entities, the above data processing method supports only non-string scalar data and vector data. Few data types can be queried and query efficiency is low, so the computer's ability to query data entities is limited.

Summary of the Invention

The embodiments of the present application provide a data processing method, an electronic device, and a readable storage medium, which add storage and querying of string-type data and enrich the query capability for data entities.

In a first aspect, an embodiment of the present application provides a data processing method for an electronic device, including:

receiving a data query request;

determining a target data set corresponding to the query request, and determining at least one data subset of the target data set and an index information set corresponding to each data subset, wherein the data subset includes a plurality of data entities, and some of the plurality of data entities include first data of a string type and second data of a non-string type; and the index information set includes at least first information representing the correspondence between the first data in the data subset and string indexes, and second information representing the correspondence between the string indexes in the data subset and the data entities;

determining, in the first information and the second information of the multiple index information sets of the target data set, target index information matching the data query request and a target data entity corresponding to the target index information.

By querying a target data set that includes string data, the data processing method provided by the embodiments of the present application makes it possible to query the string data associated with unstructured data, which extends the data types that can be queried. In addition, string data and non-string data can be queried at the same time, which enriches the query capability for data entities.

In a possible implementation of the first aspect, the first information representing the correspondence between the first data in the data subset and the string indexes is dictionary tree (trie) information.

In a possible implementation of the first aspect, the data query request includes third data of a string type;

determining, in the first information and the second information of the multiple index information sets of the target data set, the target index information matching the data query request and the target data entity corresponding to the target index information includes:

determining state information of the at least one data subset of the target data set, the state information including a sealed state and a growing state;

if the at least one data subset is in the sealed state, determining, according to the dictionary tree information, a target string index corresponding to the third data as the target index information;

determining, according to the second information, the target data entity corresponding to the target string index.

In a possible implementation of the first aspect, determining, according to the dictionary tree information, the target string index corresponding to the third data as the target index information includes:

looking up the third data in the dictionary tree information;

determining, according to the dictionary tree information, the target string index corresponding to the third data as the target index information.

In a possible implementation of the first aspect, determining, in the first information and the second information of the multiple index information sets of the target data set, the target index information matching the data query request and the target data entity corresponding to the target index information further includes:

determining the target data entity as the data query result corresponding to the data query request.

In a possible implementation of the first aspect, the data query request further includes fourth data of a non-string type.

In a possible implementation of the first aspect, the data query request includes a query condition, and the query condition includes at least one of the following:

a Boolean expression;

a prefix matching condition on a string prefix;

an exact matching condition on a string.
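The lookup described above (string to string index through the first information, then string index to data entities through the second information) can be illustrated with a minimal Python sketch. It is not the implementation claimed in the application: the names `TrieNode`, `StringTrie`, and `entities_for` are hypothetical, the second information is assumed to be a plain dictionary from string index to entity offsets, and only the exact-match and prefix-match conditions are shown.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set only on the node where a stored string ends.
        self.string_index: Optional[int] = None


class StringTrie:
    """Hypothetical stand-in for the 'first information': string -> string index."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, text: str, string_index: int) -> None:
        node = self.root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.string_index = string_index

    def _walk(self, text: str) -> Optional[TrieNode]:
        # Follow the characters of `text`; None means no stored string has this prefix.
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def exact(self, text: str) -> Optional[int]:
        node = self._walk(text)
        return None if node is None else node.string_index

    def prefix(self, prefix: str) -> List[int]:
        start = self._walk(prefix)
        if start is None:
            return []
        found: List[int] = []
        stack = [start]
        while stack:  # depth-first walk of the subtree below the prefix
            node = stack.pop()
            if node.string_index is not None:
                found.append(node.string_index)
            stack.extend(node.children.values())
        return sorted(found)


def entities_for(string_indexes: List[int],
                 index_to_entities: Dict[int, List[int]]) -> List[int]:
    """Assumed 'second information': string index -> offsets of data entities."""
    hits: List[int] = []
    for idx in string_indexes:
        hits.extend(index_to_entities.get(idx, []))
    return sorted(hits)


trie = StringTrie()
for i, name in enumerate(["alice", "alan", "bob"]):
    trie.insert(name, i)
index_to_entities = {0: [0, 3], 1: [1], 2: [2]}

exact_hit = trie.exact("alan")
prefix_entities = entities_for(trie.prefix("al"), index_to_entities)
```

In this toy data set the prefix condition `"al"` resolves to two string indexes, which the second mapping then expands into the offsets of the matching data entities.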

In a second aspect, an embodiment of the present application provides a data processing method for an electronic device, including:

receiving a data insertion request, wherein the data insertion request includes a string set to be inserted, and the string set includes a plurality of pieces of string data;

in response to the data insertion request, writing the plurality of pieces of string data into corresponding data subsets of a target data set, wherein each data subset includes at least one data entity, and the at least one data entity includes at least one piece of string data.

It can be understood that, by constructing dictionary tree information and mapping information for the string data of each data subset, the data processing method provided by the embodiments of the present application can build the index for string data quickly and effectively. The mapping relationship so constructed can also be used to build indexes on field columns of non-string data, providing a basis for more types of scalar indexes.

In a possible implementation of the second aspect, the method further includes:

obtaining the at least one piece of string data in the data subset;

constructing, according to the at least one piece of string data, third information of the data subset, the third information representing the correspondence between the at least one piece of string data in the data subset and string indexes;

constructing, according to each piece of string data and its corresponding string index in the first information, and the data entity corresponding to each piece of string data, fourth information representing the correspondence between the string indexes in the data subset and the data entities.
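As a rough illustration of the two construction steps above, the following Python sketch builds both mappings for one data subset. It rests on assumptions not stated in the text: a plain dictionary stands in for the string-to-index structure, string indexes are assigned in sorted order of the distinct strings, and a data entity is identified by its row offset. The function name `build_string_index` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_string_index(column: List[str]) -> Tuple[Dict[str, int], Dict[int, List[int]]]:
    """column[i] is the string field of data entity i within one data subset."""
    # String -> string index. Sorted order is an assumption made for this sketch.
    string_to_index = {s: i for i, s in enumerate(sorted(set(column)))}

    # String index -> offsets of the data entities that hold that string.
    index_to_entities: Dict[int, List[int]] = defaultdict(list)
    for entity_offset, s in enumerate(column):
        index_to_entities[string_to_index[s]].append(entity_offset)

    return string_to_index, dict(index_to_entities)


string_to_index, index_to_entities = build_string_index(["bob", "alice", "bob", "carol"])
```

Because duplicates collapse to one string index, the second mapping is one-to-many: a single index can point at several data entities.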

In a possible implementation of the second aspect, writing the plurality of pieces of string data into the corresponding data subsets in response to the data insertion request includes:

in response to the data insertion request, dividing the plurality of pieces of string data into M string subsets and sending the string subsets to M corresponding data nodes, where M is greater than or equal to 2;

writing each piece of string data in the string subsets of the M data nodes into the corresponding data subset.
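The text does not say how the strings are divided among the M data nodes. One common choice, shown here purely as an assumption, is to hash each string and take the result modulo M. The function name `partition_strings` is hypothetical; `zlib.crc32` is used only because it is a deterministic hash available in the Python standard library.

```python
import zlib
from typing import List


def partition_strings(strings: List[str], m: int) -> List[List[str]]:
    """Split `strings` into m subsets, one per data node (hash sharding is an assumption)."""
    if m < 2:
        raise ValueError("the described method requires M >= 2")
    subsets: List[List[str]] = [[] for _ in range(m)]
    for s in strings:
        node = zlib.crc32(s.encode("utf-8")) % m  # deterministic node choice
        subsets[node].append(s)
    return subsets


names = ["alice", "bob", "carol", "dave", "alice"]
shards = partition_strings(names, 3)
```

With this scheme equal strings always land on the same data node, which keeps duplicates together; other partitioning rules would satisfy the described step equally well.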

In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and one or more memories, the one or more memories storing one or more programs which, when executed by the one or more processors, cause the electronic device to perform the data processing method of the first aspect or the second aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing instructions which, when executed on a computer, cause the computer to perform the data processing method of the first aspect or the second aspect.

In a fifth aspect, an embodiment of the present application provides a computer program product, including a computer program or instructions which, when executed by a processor, implement the data processing method of the first aspect or the second aspect.

Description of the Drawings

FIG. 1 is a schematic diagram of a data query system according to some embodiments of the present application;

FIG. 2 is a schematic structural diagram of a data table according to some embodiments of the present application;

FIG. 3 is a schematic flowchart of a data processing method according to some embodiments of the present application;

FIG. 4 is a schematic flowchart of data insertion according to some embodiments of the present application;

FIG. 5 is a schematic flowchart of data insertion according to some embodiments of the present application;

FIG. 6 is a schematic flowchart of index construction according to some embodiments of the present application;

FIG. 7 is a schematic flowchart of index construction according to some embodiments of the present application;

FIG. 8 is a schematic flowchart of a data processing method according to some embodiments of the present application;

FIG. 9 is a schematic diagram of the interaction in a data query process according to some embodiments of the present application;

FIG. 10 is a schematic diagram of the interaction in another data query process according to some embodiments of the present application;

FIG. 11 is a schematic diagram of the hardware structure of an electronic device according to some embodiments of the present application.

Detailed Description

The present invention is described below on the basis of embodiments, but it is not limited to these embodiments. Some specific details are set out in the following description. Those skilled in the art can fully understand the present invention without these details. Well-known methods, procedures, flows, components, and circuits are not described in detail, so as not to obscure the substance of the present invention.

Furthermore, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.

Unless the context clearly requires otherwise, words such as "including" and "comprising" throughout the specification should be construed in an inclusive sense rather than an exclusive or exhaustive sense; that is, in the sense of "including but not limited to".

In the description of the present invention, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and should not be construed as indicating or implying relative importance. In addition, unless otherwise specified, "a plurality of" means two or more.

To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.

Before the solutions of the present application are introduced, some concepts and terms involved in the present application are described below with reference to FIG. 1 and FIG. 2 to aid understanding.

FIG. 1 shows a data query system provided by an embodiment of the present application.

As shown in FIG. 1, the data query system includes a proxy component 1, a coordination service component 2, a message storage component 3, an execution component 4, and an object storage module 5.

The proxy component 1 receives external requests such as data insertion, deletion, and query requests, and returns data processing results to the outside. The proxy component 1 includes a plurality of proxy modules, such as the proxy module 11 and the proxy module 12 in FIG. 1.

The coordination service component 2 assigns data processing tasks to the modules of the execution component 4 and saves the state information of each module. The coordination service component 2 may include a coordination management module 21, a coordination query module 22, a coordination data module 23, a coordination index module 24, and a metadata storage module 25 that stores information such as the state of the execution component 4.

The message storage component 3 receives the data manipulation language (DML) commands corresponding to the data addition, insertion, deletion, and query requests sent by the proxy component 1, and temporarily stores them as logs in the log storage module 31 of the message storage component 3.

The execution component 4 executes the data processing tasks assigned by the coordination service component 2 and the DML commands initiated by the proxy modules of the proxy component 1, and writes the data stored in the log storage module 31 into the object storage module 5 in the form of a binary log (binlog). The execution component 4 also obtains the binlog files stored in the object storage module 5, builds indexes for the data in the binlog files, and stores the indexes in the object storage module 5.

The object storage module 5 stores the binlog files and the indexes built by the execution component 4. In some embodiments, the log file 51 in the object storage module 5 may store binlog files, and the index file 53 in the object storage module 5 may store scalar and vector indexes, where scalar data includes numeric data, string data, date data, and Boolean data.

The data stored in the object storage module 5 is described below with reference to FIG. 2. FIG. 2 is a schematic structural diagram of a data table in an embodiment of the present application.

As shown in FIG. 2, in some embodiments the data in the object storage module 5 is stored as a data table 6. The data table 6 includes a plurality of data partitions, for example data partition 601 and data partition 602. Each data partition may include a plurality of data segments; for example, data partition 601 includes data segment 1 and data segment 2, and data partition 602 includes data segment 3, data segment 4, and data segment 5. Each data segment includes a*b data fields, where each of the a rows of data fields is one data entity, that is, corresponds to the same piece of unstructured data, and each of the b columns holds data of a single data type. The data partitions may be divided according to user-defined rules, for example by date or by user location.

For example, taking face images as the unstructured data, the computer-readable data corresponding to one face image is one data entity. The data entity corresponding to each face image may include an integer column field holding the age, a vector column field holding the face data, and a string column field holding the label of the face data. The computer-readable data of every 4 face images is stored as one group, and the integer age column field and the string label column field are scalar column fields. Ten added face images can then be represented as 3 data segments in the data table 6, where each data segment can include 4*3 data fields. The label of the face data may be, for example, the name of the person in the face image.
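The arithmetic in this example (10 face images, 4 data entities per data segment, hence 3 data segments) can be checked with a short Python sketch. The `Entity` tuple layout and the function name `split_into_segments` are hypothetical, and the sketch assumes that the last data segment is simply left partially filled.

```python
from typing import List, Tuple

# One data entity: (age, face vector, label). The layout is illustrative only.
Entity = Tuple[int, List[float], str]


def split_into_segments(entities: List[Entity], rows_per_segment: int = 4) -> List[List[Entity]]:
    """Group data entities into data segments holding at most rows_per_segment rows each."""
    return [entities[i:i + rows_per_segment]
            for i in range(0, len(entities), rows_per_segment)]


faces: List[Entity] = [(20 + i, [float(i), 0.0], f"person-{i}") for i in range(10)]
segments = split_into_segments(faces)
segment_sizes = [len(seg) for seg in segments]
```

Here two data segments are full (4 rows by 3 columns) and the third holds the remaining 2 data entities.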

In some embodiments, each data segment includes a*b data fields, where each of the b columns of data fields is one data entity, that is, corresponds to the same piece of unstructured data, and the data in each of the a rows is of the same data type.

Continuing with FIG. 1, in some embodiments the execution component 4 includes at least one query module 41, at least one data module 42, and at least one index module 43.

The data module 42 receives the data stored in the log storage module 31 and writes it into the object storage module 5 (that is, the log file 51) in the form of a binary log (binlog). If data is stored in different field columns according to its data type, the log file may include scalar columns and vector columns.

The query module 41 receives the query operation data stored in the log storage module 31, loads the log file 51 from the object storage module 5 into the memory of the query module 41, and then completes the data query based on the query operation data. The query module 41 can implement search, hybrid search, and query functions.

It can be understood that a search performs a nearest-neighbor search on a vector column field and returns the k data entities that best match the query condition. A hybrid search searches, in a preset order, at least two column fields, including a vector column field, among the string scalar column fields, the vector column fields, and the non-string scalar column fields. A query filters the non-string scalar column fields and the string scalar column fields and returns the data entities whose non-string scalar column fields or string column fields satisfy the query condition.

For example, suppose each data entity represents a face image and includes an integer column field for the age, a vector column field for the face data, and a string column field for the name. A search may find the face images most similar to a target face: a nearest-neighbor search is performed on the face data, and the data entities corresponding to the k face vectors most similar to the target face are returned. A hybrid search may find the face images whose name is A and which are most similar to the target face: the string column holding the names is first filtered according to the query condition to obtain the entities named A, the filtered result is then fed into the nearest-neighbor search on the face data, and the k face images named A that are most similar to the target face are returned. A query may retrieve the first 5 columns of data in the data table 6: the data segments corresponding to the first 5 columns of data in the data table 6 (each data segment including 3 columns of data fields) are first filtered out according to the query condition, and the face images corresponding to the filtered data fields are then returned.
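To make the three operations concrete, here is a small Python sketch over an in-memory list of face entities. Everything in it is an assumption made for illustration: the nearest-neighbor step is an exhaustive scan using Euclidean distance rather than a vector index, the string filter is a plain equality test rather than an index lookup, and the names `search` and `query` are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# One data entity: (age, face vector, name). The layout is illustrative only.
Entity = Tuple[int, List[float], str]


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(entities: List[Entity], target: Sequence[float], k: int,
           name: Optional[str] = None) -> List[int]:
    """Top-k nearest neighbors; passing `name` turns this into a hybrid search."""
    # Step 1 (hybrid only): filter on the string column.
    candidates = [i for i, e in enumerate(entities) if name is None or e[2] == name]
    # Step 2: rank the surviving entities by distance to the target vector.
    candidates.sort(key=lambda i: l2(entities[i][1], target))
    return candidates[:k]


def query(entities: List[Entity], predicate: Callable[[Entity], bool]) -> List[int]:
    """Scalar filtering: offsets of the entities that satisfy the condition."""
    return [i for i, e in enumerate(entities) if predicate(e)]


faces: List[Entity] = [
    (30, [0.0, 0.0], "A"),
    (25, [1.0, 1.0], "B"),
    (41, [0.1, 0.0], "A"),
    (35, [5.0, 5.0], "A"),
]
target = [1.0, 0.9]

plain_top2 = search(faces, target, k=2)             # nearest faces overall
hybrid_top2 = search(faces, target, k=2, name="A")  # nearest faces named "A"
older_than_30 = query(faces, lambda e: e[0] > 30)
```

In the hybrid case the entity named B is closest to the target but is removed by the string filter before ranking, which is the ordering the example above describes: filter on the name first, then run the nearest-neighbor search on what remains.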

As described in the Background, existing data processing methods have limited query capability for unstructured data. To solve this problem, the present application discloses a data processing method that can be applied to a data query system in which the data table stores string data and, for each data segment, an index information set corresponding to the string data of that data segment. Specifically, the method includes: receiving a data query request; determining a target data set corresponding to the query request, and determining at least one data subset of the target data set and an index information set corresponding to each data subset, wherein the data subset includes a plurality of data entities, and some of the plurality of data entities include first data of a string type and second data of a non-string type; and the index information set includes at least first information representing the correspondence between the first data in the data subset and string indexes, and second information representing the correspondence between the string indexes in the data subset and the data entities; and determining, in the first information and the second information of the multiple index information sets of the target data set, target index information matching the data query request and a target data entity corresponding to the target index information.

In the embodiments of the present application, the data query request may be a request received from the outside and may be used to query data entities. A data entity includes at least one of structured data and unstructured data. Structured data is data that is stored in a database and can be logically expressed with a two-dimensional table structure; each item has a specific meaning, for example integer data or string data. Unstructured data is data whose structure is irregular, which has no unified predefined data model, and which is not convenient to represent with two-dimensional logical database tables. Unstructured data includes pictures, video, audio, natural language, and the like, and accounts for 80% of all data. Unstructured data can be converted into vector data for processing by various artificial intelligence (AI) or machine learning (ML) models.

It can be understood that, in some embodiments, the target data set is the data table 6 in FIG. 2 and the data subsets are the data segments in FIG. 2, for example data segment 1 to data segment 5. Each data entity corresponds to one piece of unstructured data, the first data and the second data may be data fields in a data segment, and the first data and the second data are stored in binlog form.

It can be understood that the second data of a non-string type may be a non-string scalar field, such as an integer field or a floating-point field, or a vector field, such as a floating-point vector field; the present application does not limit this.

It can be understood that the index information set includes index information for string-type data (hereinafter, string data) and index information for non-string-type data.

The data processing method provided by the embodiments of the present application adds storage of string data to the target data set and constructs, for the string data, first information representing the correspondence between the first data in the data subset and string indexes. When data entities are queried, vector data and non-string scalar data can be queried, and string data can also be queried on the basis of the first information, which extends the queryable data types of data entities. In addition, string data and non-string data can be queried at the same time, which enriches the query capability for unstructured data.

Furthermore, the index information set in the embodiments of the present application also includes second information representing the correspondence between the string indexes in the data subset and the data entities. On the basis of the second information, scalar string field columns can be queried and indexes can be built for scalar-type data; that is, the embodiments of the present application can support scalar indexes and provide a basis for building more types of scalar indexes.

为了支持对字符串数据的查询，本申请实施例中提供的数据处理方法需要先完成字符串数据的存储与索引的构建。下面结合图3对本申请实施例中的数据的存储过程与索引的构建过程进行介绍。In order to support the query of character string data, the data processing method provided in the embodiment of the present application needs to complete the storage of character string data and the construction of an index first. The following describes the data storage process and the index construction process in the embodiment of the present application with reference to FIG. 3 .

图3所示为本申请实施例中数据存储与构建索引的流程示意图。FIG. 3 is a schematic flowchart of data storage and index building in an embodiment of the present application.

As shown in FIG. 3, the process of data storage and index building includes:

301：接收数据插入请求。其中，数据插入请求中包括待插入的字符串集，字符串集中包括多个字符串数据。其中，待插入的字符串集为字符串型的标量型数据。301: A data insertion request is received. The data insertion request includes a character string set to be inserted, and the character string set includes multiple character string data. The string set to be inserted is string-type scalar data.

It can be understood that, when the proxy module in the data query system receives the data insertion request, it generates a data insertion command and the data to be inserted according to the request, and stores both in the log storage module 31 in the form of logs.

在一些实施例中，数据插入请求中还可以包括非字符串数据，例如非字符串型标量数据、向量数据等。例如，数据插入请求为插入人脸图像，则数据插入请求中可以包括表征年龄的整型标量数据、表征人脸数据的向量数据以及表征姓名的字符串数据等。In some embodiments, the data insertion request may further include non-string data, such as non-string type scalar data, vector data, and the like. For example, if the data insertion request is to insert a face image, the data insertion request may include integer scalar data representing age, vector data representing face data, and character string data representing name.

302：响应于数据插入请求，将多个字符串数据分别写入对应的数据子集中。其中，每个数据子集中包括至少一个数据实体，且至少一个数据实体中包括至少一个字符串数据。302: In response to the data insertion request, write a plurality of string data into corresponding data subsets respectively. Wherein, each data subset includes at least one data entity, and at least one data entity includes at least one character string data.

It can be understood that the data subset includes at least one data entity corresponding to unstructured data, and each data entity includes one piece of string data as its string field. It can also be understood that, after the data module 42 in the data query system receives the data insertion task generated by the coordination service component 2, the data module 42 reads the data to be inserted from the log storage module 31 and stores the data to be inserted and the data insertion operation in the object storage module 5 in binlog form.

The storage space occupied by each data subset in the embodiments of the present application is preset. In some embodiments, the data entities corresponding to the string data to be inserted do not yet exist in the data table. In that case, writing the plurality of string data into the corresponding data subsets means starting from the data subset that can currently store data and writing the string data one by one into the string column of the corresponding data entities in that data subset; once the string data stored in the current data subset reaches the upper limit, the remaining string data is written into the next data subset. For example, take a storage upper limit of 4 data entities per data subset and 10 images to be inserted. If the data entities corresponding to the 10 images do not exist in data table 6, the position of the first data subset that has not reached its storage upper limit is determined, and the string data of the 10 images is written into that data subset in order; when the data subset reaches the storage upper limit of 4, writing continues in the next data subset.
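As an illustration only, the sequential write with rollover described above can be sketched in Python as follows. The function name `write_into_segments`, the list-of-lists representation of data subsets, and the assumption that every subset before the last one is already full are choices made for this sketch and are not part of the present application.

```python
from typing import List


def write_into_segments(
    segments: List[List[str]], new_strings: List[str], capacity: int
) -> List[List[str]]:
    """Append strings to the last data subset, opening a new one when it is full.

    Assumption of this sketch: all subsets except the last are already full,
    so the last subset is the first one that has not reached its upper limit.
    """
    for s in new_strings:
        # Roll over when there is no subset yet or the current one is full.
        if not segments or len(segments[-1]) >= capacity:
            segments.append([])
        segments[-1].append(s)
    return segments


# A subset that already holds 2 entities, an upper limit of 4, 10 new strings.
segments = write_into_segments(
    [["n0", "n1"]], [f"img{i}" for i in range(10)], capacity=4
)
print([len(seg) for seg in segments])
```

With these inputs the first subset is topped up to four entities and the remaining eight strings fill two further subsets.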

In other embodiments, the data entities corresponding to the string data to be inserted already exist in the data table. In that case, writing the plurality of string data into the corresponding data subsets means performing a data query through the query module 41 to determine where the data subsets containing those data entities are located in data table 6, and then inserting the string data one by one into the string column of the corresponding data entities. For example, take a storage upper limit of 4 data entities per data subset and 10 images to be inserted. If the data entities corresponding to the 10 images already exist in data table 6, the positions of the data entities of the 10 images are determined, and the string data of the 10 images is written in order into the string column of the corresponding data entities.

在一些实施例中，不同数据子集所占的存储空间可以相同，也可以不同，本申请对此不作限制。In some embodiments, the storage space occupied by different data subsets may be the same or different, which is not limited in this application.

In some embodiments, each data entity may further include non-string data as non-string fields of the data entity, for example non-string scalar fields (such as integer scalar fields and floating-point scalar fields) and vector fields.

In some embodiments, 302 may specifically include: in response to the data insertion request, dividing the plurality of string data into M string subsets and sending the string subsets to the corresponding M data nodes, where M is greater than or equal to 2; and writing each piece of string data in the string subsets of the M data nodes into the corresponding data subset.

It can be understood that a data node is the data module 42 in FIG. 1. In the embodiments of the present application, the plurality of string data is divided into M small batches, that is, M string subsets, so that during data storage the M string subsets can be sent to the corresponding data modules 42 at the same time and stored. 302 is further described below with reference to FIG. 4 and FIG. 5.

图4和图5所示为本申请实施例中的数据存储过程的流程图。FIG. 4 and FIG. 5 are flowcharts of a data storage process in an embodiment of the present application.

如图4所示，在一些实施例中，待插入的数据包括数据实体的ID、字符串数据和向量数据。其中的ID也可以理解为数据实体的主键哈希值。As shown in FIG. 4 , in some embodiments, the data to be inserted includes IDs of data entities, string data, and vector data. The ID in it can also be understood as the primary key hash value of the data entity.

Specifically, the proxy module divides the data entities into two small batches (batch data 1 and batch data 2) according to the primary key hash values of the data entities, and stores them in the corresponding data nodes (node s1 and node s2) in the log storage module 31. Each data node can be served by one query module or data module: node s1 corresponds to query module 411 and data module 421, and node s2 corresponds to query module 412 and data module 422.

The data module 421 can obtain batch data 1 stored on node s1, and the data module 422 can obtain batch data 2 stored on node s2. The data module 421 and the data module 422 can persist batch data 1 and batch data 2 to the object storage module 5 at the same time.
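The division of data entities into batches by primary key hash can be sketched as follows. This is a minimal Python illustration, not the implementation of the present application: routing by `id % num_nodes` is an assumed rule (the text only states that the division follows the primary key hash value), and the names `partition_by_pk_hash`, `id`, and `name` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def partition_by_pk_hash(entities: List[dict], num_nodes: int) -> Dict[int, List[dict]]:
    """Group entities into one batch per data node.

    Each entity's "id" is treated as its primary key hash value; the
    modulo routing rule is an assumption of this sketch.
    """
    batches: Dict[int, List[dict]] = defaultdict(list)
    for entity in entities:
        batches[entity["id"] % num_nodes].append(entity)
    return dict(batches)


entities = [{"id": i, "name": f"person{i}"} for i in range(10, 15)]
batches = partition_by_pk_hash(entities, num_nodes=2)
for node, batch in sorted(batches.items()):
    print(node, [e["id"] for e in batch])
```

Because the batches are disjoint, each one could be handed to its own data module and persisted independently, which is what allows the two writes to proceed at the same time.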

如图5所示，在一些实施例中，不同节点的数据可以通过不同的数据传输通道传输至数据模块进行数据的持久化存储。As shown in FIG. 5 , in some embodiments, data of different nodes may be transmitted to the data module through different data transmission channels for persistent storage of the data.

例如，节点s1的批量数据1可以通过通道A传输至数据模块421，数据模块421将批量数据1写入对应的多个数据片段61中。同时，节点s2的批量数据2可以通过通道B传输至数据模块422，数据模块422将批量数据2写入对应的多个数据片段61中。For example, the batch data 1 of the node s1 may be transmitted to the data module 421 through the channel A, and the data module 421 writes the batch data 1 into the corresponding multiple data segments 61 . At the same time, the batch data 2 of the node s2 can be transmitted to the data module 422 through the channel B, and the data module 422 writes the batch data 2 into the corresponding multiple data segments 61 .

可以理解，数据片段61中的每行数据表示一个数据实体，即对应的一个非结构化数据。每个数据实体中可以包括多个字段，例如id字段(整型标量字段)、字符串字段(字符串型标量字段)和向量字段。当一个数据片段的数据量大小到达存储上限后，数据实体会被写入下一个数据片段61中。其中的存储上限可以例如512MB。It can be understood that each row of data in the data segment 61 represents a data entity, that is, a corresponding piece of unstructured data. Each data entity may include multiple fields, such as an id field (integer scalar field), a string field (string scalar field), and a vector field. When the data size of one data segment reaches the storage limit, the data entity will be written into the next data segment 61 . The storage upper limit may be, for example, 512MB.

303：获取数据子集中的至少一个字符串数据。303: Acquire at least one character string data in the data subset.

It can be understood that acquiring at least one piece of string data in the data subset means acquiring the string column data in the data subset. In some embodiments, the acquired string data may include the string fields and, for each string field, its row offset relative to the first row of data entities in its data subset.

For example, if the data subset includes the string field column S shown in FIG. 6, the acquired string data includes the string field column S and its corresponding row offset list R. The row offset list consists of the row offset of each string field in column S relative to the first row of string fields in the data subset.

可以理解，索引模块43从对象存储模块5中加载数据子集的至少一个字符串数据，并将其存储至索引模块43的内存中。It can be understood that the indexing module 43 loads at least one character string data of the data subset from the object storage module 5 and stores it in the memory of the indexing module 43 .

304: According to the at least one piece of string data, construct, for the data subset, third information representing the correspondence between the at least one piece of string data in the data subset and string indexes.

在一些实施例中，第三信息为字典树信息。可以理解，索引模块43根据内存中加载的数据子集的字符串数据，构建字典树信息。In some embodiments, the third information is dictionary tree information. It can be understood that the indexing module 43 constructs the dictionary tree information according to the character string data of the data subset loaded in the memory.

It can be understood that dictionary tree information treats each string field in the data subset as a character sequence and builds a top-down tree structure following the order of the characters in each sequence. Each edge in the tree corresponds to a character, and each child node of the dictionary tree can serve as the index of the character sequence formed by the characters on the path from the root node to that child node.

For example, as shown in FIG. 6, the string field column S consists of the set of string fields (a, ab, abc, ac, acd, acc, accd, acdd). From this set, the index module 43 constructs the dictionary tree T (that is, the dictionary tree information described above), taking the character “a” as the root node, the remaining characters in the data subset as edges, and tree IDs (that is, string indexes) as child nodes. The edges of the dictionary tree T include (c, b, c, b, dd, c, d), and the tree IDs range from 0 to 7. The character sequence formed by the characters stored on the edges connecting the root node to any child node represents one string field in the string field column S.
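To make the structure concrete, the following Python sketch builds a plain character trie over the same eight strings. It is an approximation under stated assumptions: it uses an empty root with one character per edge, whereas FIG. 6 uses “a” as the root and at least one multi-character edge, and it assigns tree IDs in insertion order, which need not match the IDs in FIG. 6. The names `TrieNode`, `build_trie`, and `lookup` are hypothetical.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.tree_id: Optional[int] = None  # set only where a stored string ends


def build_trie(strings: List[str]) -> TrieNode:
    """Insert each string character by character; IDs follow insertion order."""
    root = TrieNode()
    next_id = 0
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        if node.tree_id is None:  # duplicates reuse the existing ID
            node.tree_id = next_id
            next_id += 1
    return root


def lookup(root: TrieNode, s: str) -> Optional[int]:
    """Return the tree ID of s, or None if s was never inserted."""
    node = root
    for ch in s:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.tree_id


S = ["a", "ab", "abc", "ac", "acd", "acc", "accd", "acdd"]
root = build_trie(S)
print(lookup(root, "acc"), lookup(root, "b"))
```

Strings that share a prefix share the path from the root, so the tree stores each common prefix once.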

305: According to each piece of string data in the first information and its corresponding string index, and the data entity corresponding to each piece of string data, construct, for the data subset, fourth information representing the correspondence between the string indexes in the data subset and the data entities.

It can be understood that, in some embodiments, the index module 43 constructs the fourth information, which relates string indexes to data entities, according to the constructed dictionary tree information and the row offset of each string field relative to the first row of data entities.

In some embodiments, the fourth information (hereinafter referred to as the mapping relationship) can be expressed as the correspondence between the tree IDs in the dictionary tree information and the data entities. For example, the mapping relationship can be expressed as the index ID column in FIG. 6, which has the same number of rows as the string field column S. Each row of the index ID column holds the tree ID, in the dictionary tree T, of the string field in that row. The row offset of a tree ID relative to the first tree ID in the index ID column equals the row offset of the corresponding string field relative to the first string field in the string field column.

In some embodiments, once the data module 42 has performed steps 301 and 302 and finished writing one data subset, the index module 43 can perform steps 303 to 305 on it while the data module 42 writes data into the next data subset.
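The row-aligned mapping can be illustrated with a short Python sketch. For brevity a plain dictionary stands in for the dictionary tree, and IDs are assigned in order of first appearance; both are assumptions of this sketch, as are the names `build_mapping` and `index_id_column`.

```python
from typing import Dict, List, Tuple


def build_mapping(column: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Return (string -> tree ID, index ID column aligned with the rows).

    index_id_column[k] is the tree ID of the string stored at row offset k,
    so the position in the list is itself the row offset.
    """
    tree_ids: Dict[str, int] = {}
    index_id_column: List[int] = []
    for s in column:
        if s not in tree_ids:
            tree_ids[s] = len(tree_ids)
        index_id_column.append(tree_ids[s])
    return tree_ids, index_id_column


column = ["a", "ab", "a", "ac"]
tree_ids, index_id_column = build_mapping(column)
print(index_id_column)
```

Because the index ID column has exactly one entry per row, a matching tree ID can be turned back into row offsets with a single pass over the column.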

下面结合图7对数据子集的索引构建过程进行进一步介绍。The index construction process of the data subset will be further introduced below with reference to FIG. 7 .

图7所示为本申请实施例中构建字符串数据的索引的流程图。FIG. 7 shows a flowchart of constructing an index of character string data in an embodiment of the present application.

In some embodiments, as shown in FIG. 7, the object storage module 5 stores data segment 611 and data segment 612. Data segment 611 includes multiple data entities, each with two field columns, field column 1 and field column 2. The fields in each field column are stored as data fields in binlog form, for example log 1 and log 2 in field column 1 and log 3 and log 4 in field column 2. Data segment 612 likewise includes multiple data entities, each with two field columns, field column 3 and field column 4, whose fields are stored as data fields in binlog form, for example log 5 in field column 3 and log 6 in field column 4.

The index module 43 loads the string field columns of data segment 611 and data segment 612 (assuming that field column 2 and field column 4 are string field columns) into its memory, and runs index building task 1 and index building task 2 for the string field columns of data segment 611 and data segment 612 respectively. Specifically, index building task 1 builds an index for field column 2 and generates dictionary tree information 5311 and mapping information 5321, and index building task 2 builds an index for field column 4 and generates dictionary tree information 5312 and mapping information 5322.

After the index module 43 finishes building the indexes, it stores the generated dictionary tree information and mapping information in the index file 53 of the corresponding data segment in the object storage module 5. Specifically, dictionary tree information 5311 and mapping information 5321 are stored in the index file corresponding to data segment 611, and dictionary tree information 5312 and mapping information 5322 are stored in the index file corresponding to data segment 612.

It can be understood that, in the embodiments of the present application, corresponding dictionary tree information and mapping information are constructed for every data subset. To illustrate the benefit of this index construction method more clearly, the present application also measured the time and space overhead of building indexes over string data of different volumes, taking as an example a machine configured with a 12-core Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz and 32G of storage space. The test results are shown in Table 1 below.

Table 1

As Table 1 shows, when the number of string rows stays the same and the string length keeps growing, the index construction time, the dictionary tree loading time, the index data volume, the string data volume, and the prefix query time all rise sharply. When the number of string rows and the string length grow together, the index construction time, the index data volume, the string data volume, and the prefix query time also rise sharply, and index construction may fail when there are too many rows or the strings are too long.

可以理解，随着字符串数据的长度和行数的增多，字符串索引的时间开销和空间开销大幅度上升，同时索引还可能构建失败。It can be understood that as the length and number of rows of string data increase, the time and space overhead of string indexing increases significantly, and the index may also fail to be constructed.

In the embodiments of the present application, the target data set is divided into multiple data subsets, and dictionary tree information and mapping information are constructed separately for the string fields of each data subset. This shortens the construction time of the index information set, reduces the memory occupied by each piece of dictionary tree information, and lowers the likelihood that index construction fails.
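The effect of indexing each data subset separately can be sketched as follows. This Python example is illustrative only: it splits a string column into fixed-size subsets, builds a small dictionary and index ID column for each, and answers an exact-match query by combining per-subset results. A plain dictionary again stands in for the dictionary tree, and the names `build_segment_indexes`, `exact_match`, and `first_row` are hypothetical.

```python
from typing import Dict, List


def build_segment_indexes(table: List[str], capacity: int) -> List[dict]:
    """Split a string column into subsets of `capacity` rows and index each one."""
    indexes = []
    for start in range(0, len(table), capacity):
        rows = table[start:start + capacity]
        # IDs are local to the subset, so each dictionary stays small.
        dictionary: Dict[str, int] = {}
        for s in rows:
            dictionary.setdefault(s, len(dictionary))
        indexes.append({
            "first_row": start,
            "dictionary": dictionary,
            "index_ids": [dictionary[s] for s in rows],
        })
    return indexes


def exact_match(indexes: List[dict], value: str) -> List[int]:
    """Return global row numbers whose string equals `value`."""
    hits: List[int] = []
    for idx in indexes:
        target = idx["dictionary"].get(value)
        if target is None:
            continue  # this subset does not contain the value at all
        hits.extend(
            idx["first_row"] + offset
            for offset, tree_id in enumerate(idx["index_ids"])
            if tree_id == target
        )
    return hits


table = ["a", "b", "a", "c", "b", "a"]
indexes = build_segment_indexes(table, capacity=4)
print(exact_match(indexes, "a"))
```

Each per-subset index only covers the strings of its own subset, so no single structure grows with the size of the whole data set.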

在一些实施例中，索引模块43还可以获取数据子集中的向量字段列，根据预设的索引类型，对向量字段列构建索引。其中的索引类型可例如平面点阵转换(Flat LatticeTransformer，FLAT)、倒排乘积量化转换(Inverted File Product Quantization，IVF_PQ)等。In some embodiments, the indexing module 43 may also acquire vector field columns in the data subset, and build an index for the vector field columns according to a preset index type. The index type may be, for example, a flat lattice transformation (Flat LatticeTransformer, FLAT), an inverted product quantization transformation (Inverted File Product Quantization, IVF_PQ), and the like.

下面结合图8，对本申请实施例中的数据查询过程进行进一步介绍。The data query process in the embodiment of the present application will be further introduced below with reference to FIG. 8 .

图8所示为本申请实施例中的一种数据查询的流程图。FIG. 8 shows a flowchart of a data query in an embodiment of the present application.

如图8所示，在一些实施例中，数据查询的过程包括：As shown in FIG. 8, in some embodiments, the process of data query includes:

801：接收数据查询请求。801: Receive a data query request.

可以理解代理模块接收到外部的数据查询请求后，根据数据查询请求生成数据查询命令以及数据查询条件，协调处理组件2可以生成对应的数据查询任务。代理模块将数据查询命令和数据查询条件以日志的形式存储于日志存储模块31。查询模块41接收协调处理组件2生成的数据查询任务，并从日志存储模块31获取数据查询条件。It can be understood that after the proxy module receives an external data query request, it generates a data query command and data query conditions according to the data query request, and the coordination processing component 2 can generate a corresponding data query task. The proxy module stores the data query commands and data query conditions in the log storage module 31 in the form of logs. The query module 41 receives the data query task generated by the coordination processing component 2 , and obtains data query conditions from the log storage module 31 .

在一些实施例中，数据查询条件中包括字符串数据的查询。例如数据查询请求中的数据查询条件为查询姓名为A的人脸图像。在另一些实施例中，数据查询条件中还可以包括非字符串数据的查询，例如包括整型标量字段、浮点型标量字段等的非字符串型标量数据、向量数据等。例如，数据查询请求为查询与目标人脸数据最接近的人脸数据。In some embodiments, the data query conditions include query of string data. For example, the data query condition in the data query request is to query the face image whose name is A. In other embodiments, the data query condition may also include a query of non-string data, such as non-string scalar data, vector data, etc. including integer scalar fields, floating-point scalar fields, and the like. For example, the data query request is to query the face data closest to the target face data.

802：确定查询请求对应的目标数据集，以及确定目标数据集的至少一个数据子集和每个数据子集对应的索引信息集。802: Determine a target data set corresponding to the query request, and determine at least one data subset of the target data set and an index information set corresponding to each data subset.

The data subset includes multiple data entities, and some of those data entities include first data of a string type and second data of a non-string type. The index information set includes at least first information representing the correspondence between the first data in the data subset and string indexes, and second information representing the correspondence between the string indexes in the data subset and data entities (that is, the mapping information described above, hereinafter referred to as mapping information). In some embodiments, the first information may be expressed as dictionary tree information.

可以理解，查询模块41在接收到数据查询任务时，可以从对象存储模块5中加载目标数据集的数据子集至查询模块41的内存中。It can be understood that when the query module 41 receives the data query task, it can load the data subset of the target data set from the object storage module 5 into the memory of the query module 41 .

In some embodiments, a data subset can be in one of two states, growing or sealed. Accordingly, the data subsets that the query module 41 loads into its memory are of two kinds: growing data subsets and sealed data subsets. Data can be written continuously into a growing data subset; when its data volume exceeds the preset storage upper limit, or no data is written within a preset time period, the growing data subset is converted into a sealed data subset. A growing data subset can be queried by brute-force search. A sealed data subset does not accept writes and supports only deletion; the index module 43 obtains the string data in the sealed data subset to build an index, and the query module can query the sealed data subset with the data processing method provided in the embodiments of the present application.
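The two states and the transition between them can be sketched in Python as follows. The class name `Segment`, the row-count capacity, the explicit `now` timestamps, and sealing when the limit is reached rather than exceeded are all assumptions of this sketch, chosen to keep it deterministic.

```python
from typing import List


class Segment:
    """A data subset that starts in the growing state and can become sealed."""

    def __init__(self, capacity: int, idle_timeout: float) -> None:
        self.capacity = capacity          # assumed limit, counted in rows
        self.idle_timeout = idle_timeout  # seconds without writes before sealing
        self.rows: List[str] = []
        self.sealed = False
        self.last_write = 0.0

    def insert(self, row: str, now: float) -> None:
        if self.sealed:
            raise RuntimeError("sealed segments do not accept writes")
        self.rows.append(row)
        self.last_write = now
        if len(self.rows) >= self.capacity:
            self.sealed = True  # storage limit reached

    def maybe_seal(self, now: float) -> None:
        """Seal the segment if nothing was written for idle_timeout seconds."""
        if not self.sealed and now - self.last_write >= self.idle_timeout:
            self.sealed = True

    def brute_force_prefix(self, prefix: str) -> List[int]:
        """Scan every row; usable on a growing segment that has no index yet."""
        return [i for i, s in enumerate(self.rows) if s.startswith(prefix)]


seg = Segment(capacity=3, idle_timeout=10.0)
seg.insert("ab", now=0.0)
seg.insert("cd", now=1.0)
print(seg.brute_force_prefix("a"), seg.sealed)
```

Once a segment is sealed its contents no longer change on insertion, which is what makes it reasonable to build the dictionary tree and mapping information for it once and reuse them across queries.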

In some embodiments, the data that the query module 41 loads into its memory also includes the index information sets of the data subsets. Further, the dictionary tree information in the index file 53 loaded by the query module 41 can support query conditions such as "==", "!=", ">", ">=", "<", "<=", prefix search, and exact search.

In some embodiments, the query condition of the query module 41 can be expressed as a Boolean expression, for example string field > “abc”. In some embodiments, the query condition can also be expressed as a specified string prefix, for example that the prefix of the string field is “abc”. In some embodiments, the query condition can also be expressed as a filter on a non-string field column, for example selecting the first 100 columns of a non-string scalar field column.

In some embodiments, the index information set includes dictionary tree information and mapping information. The dictionary tree information represents the correspondence between the string field column in the data subset and string indexes, and the mapping information represents the correspondence between the string indexes in the data subset and data entities. For example, the dictionary tree information in the index information set can be expressed as the dictionary tree T shown in FIG. 6, and the mapping information as the index ID column shown in FIG. 6. Further, in some embodiments, the mapping relationship may also represent the correspondence between the vector index of a vector field and data entities.
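The string query conditions mentioned above can be illustrated with a small Python sketch that evaluates a comparison operator or a prefix condition against a string-to-ID dictionary. It scans the dictionary linearly instead of walking a dictionary tree, compares strings by Python's default code point ordering, and uses the hypothetical names `OPS` and `matching_ids`; none of this is claimed to be the evaluation strategy of the present application.

```python
import operator
from typing import Dict, List

# Comparison operators, mapped to their Python equivalents.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}


def matching_ids(tree_ids: Dict[str, int], op: str, operand: str) -> List[int]:
    """Return the sorted string indexes whose string satisfies `op operand`."""
    if op == "prefix":
        def predicate(s: str) -> bool:
            return s.startswith(operand)
    else:
        compare = OPS[op]

        def predicate(s: str) -> bool:
            return compare(s, operand)
    return sorted(i for s, i in tree_ids.items() if predicate(s))


tree_ids = {"a": 0, "ab": 1, "abc": 2, "ac": 3, "acd": 4}
print(matching_ids(tree_ids, ">", "abc"))
print(matching_ids(tree_ids, "prefix", "ab"))
```

The result of each condition is a list of string indexes, which can then be translated into data entities through the mapping information.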

803：在目标数据集的多个索引信息集的第一信息和第二信息中确定匹配于数据查询请求的目标索引信息以及目标索引信息对应的目标数据实体。803: Determine, from the first information and the second information of the multiple index information sets of the target data set, the target index information that matches the data query request and the target data entity corresponding to the target index information.

可以理解，查询模块41可以基于加载的索引信息集，在索引信息集中确定匹配于数据查询条件的目标索引信息。It can be understood that the query module 41 can determine the target index information that matches the data query condition in the index information set based on the loaded index information set.

In some embodiments, the data query request is a search, whose query condition is a query on a vector field. In other embodiments, the data query request is a hybrid search, whose query conditions cover at least two of the following kinds of field columns: string field columns (that is, string scalar columns), vector field columns, and non-string scalar field columns. In still other embodiments, the data query request is a query, whose query condition is a query on scalar field columns (including string field columns and non-string scalar field columns). The data query request and the query process are further described below with reference to FIG. 9 and FIG. 10.

In some embodiments, the data query request is a query on a string field column, and the query module 41 determines the target data entity corresponding to the target string index information according to the mapping information in the index information set. For example, the query module 41 first filters out, from the dictionary tree information, the tree IDs that match the query condition in the data query request. Then, using the mapping relationship between each tree ID and the row offset of its string field relative to the first string field in the data subset, the query module 41 determines the row offsets of the string fields corresponding to the filtered tree IDs, and thereby the data entities that meet the query condition.

In some embodiments, the data query request is a query on a vector field column, and the query module 41 can determine the target data entity corresponding to the target vector index information according to the mapping information, in the index information, that represents the correspondence between the vector index of the vector field and the data entities.

在一些实施例中，查询模块41查询出的符合查询条件的数据实体可以作为查询结果，并通过代理模块返回给发起数据查询请求的外部。In some embodiments, the data entities that meet the query conditions queried by the query module 41 can be used as query results, and returned to the outside that initiates the data query request through the proxy module.

图9所示，为本申请实施例提供的一种数据查询的交互过程示意图。FIG. 9 is a schematic diagram of an interaction process of a data query provided by an embodiment of the present application.

As shown in FIG. 9, in some embodiments the data query request is a query and the query condition 901 includes string data. The query module 41 may use the dictionary tree information 531 loaded from the object storage module 5 to determine an index list 902 made up of the string indexes that match the query condition 901; the index list 902 is the target index information. The query module 41 then uses the mapping information 532 loaded from the object storage module 5 to generate a bitmap 903 that indicates, for each row of data in the data subset, whether the row satisfies the query condition. In the bitmap 903, a value of 1 indicates that the corresponding string field satisfies the query condition and that its data entity may be returned as a query result; a value of 0 indicates that the corresponding string field does not satisfy the query condition and that its data entity may not be returned as a query result.

For example, suppose the dictionary tree information in the index information set is the dictionary tree T shown in FIG. 6, the mapping information is the index ID shown in FIG. 6, and the query condition selects data whose string prefix is "ac". The generated index list 902 may then be (1, 4, 3, 7, 5), and the bitmap 903 generated from the mapping information 532 may be expressed as (0, 0, 0, 1, 1, 1, 1), meaning that the data entities in the string field column S that satisfy the query condition are those in rows 4 to 8. The query module 41 may return the data entities in rows 4 to 8 of the data subset, through the proxy module, to the external party that initiated the data query request.
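The step from an index list to a bitmap can be sketched as below. This is only an illustration: the functions `build_bitmap` and `selected_rows` are hypothetical, and the per-row string indexes in `row_to_string_id` are invented for the sketch and are not taken from FIG. 6.

```python
def build_bitmap(row_to_string_id, matching_ids):
    """Return one bit per row: 1 if the row's string index is in the index list."""
    wanted = set(matching_ids)
    return [1 if sid in wanted else 0 for sid in row_to_string_id]


def selected_rows(bitmap):
    """Return the 0-based row offsets whose bit is set."""
    return [offset for offset, bit in enumerate(bitmap) if bit == 1]


# Hypothetical mapping information: the string index stored for each row.
row_to_string_id = [2, 6, 8, 1, 4, 3, 7, 5]
# Index list of string indexes that matched the query condition.
index_list = [1, 4, 3, 7, 5]

bitmap = build_bitmap(row_to_string_id, index_list)
rows = selected_rows(bitmap)
```

A bitmap has a fixed size of one bit per row, so it can be combined with the results of other conditions by bitwise AND or OR before any data entity is fetched.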

In some embodiments, the data query request in FIG. 9 is a hybrid search, and the query condition 901 may also include vector data. After obtaining the bitmap 903, the query module 41 may select the vector fields of the data entities whose string fields have the value 1 in the bitmap, run the nearest-neighbor query over only those vector fields, and return the data entities corresponding to the k vector fields that satisfy the query condition 901.
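A minimal sketch of this filtered nearest-neighbor step follows. It assumes a brute-force scan with Euclidean distance, which stands in for whatever vector index an actual system would use; the function names and the sample data are hypothetical.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_knn(vectors, bitmap, query, k):
    """Return row offsets of the k nearest vectors among rows whose bit is 1."""
    candidates = (
        (l2(vec, query), offset)
        for offset, (vec, bit) in enumerate(zip(vectors, bitmap))
        if bit
    )
    return [offset for _, offset in heapq.nsmallest(k, candidates)]


# Hypothetical vector field column and a bitmap from a string condition.
vectors = [[0, 0], [1, 1], [5, 5], [1, 0], [9, 9]]
bitmap = [0, 1, 1, 1, 0]
nearest = filtered_knn(vectors, bitmap, [0, 0], 2)
```

Row 0 is the closest vector to the query but is excluded because its bit is 0, which shows the string condition acting as a pre-filter on the vector search.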

It can be understood that the query condition of a hybrid search may combine at least two kinds of data drawn from the various kinds of scalar data and vector data, provided that vector data is included. It is not limited to the above example, and further details are omitted here.

In other embodiments, the data query request is a search and the query condition 901 includes vector data. The query module 41 may use the index information of the vector data loaded from the object storage module 5 to determine, through a nearest-neighbor search, an index list made up of the vector indexes that match the query condition; this index list is the target index information. The query module 41 then uses the mapping information loaded from the object storage module 5 to generate a bitmap that indicates, for each row of the vector field column in the data subset, whether the row satisfies the query condition. In the bitmap, a value of 1 indicates that the corresponding vector field satisfies the query condition and that its data entity may be returned as a query result; a value of 0 indicates that the corresponding vector field does not satisfy the query condition and that its data entity may not be returned as a query result.

For example, the data entities are face images, the vector field represents the face data, and the query condition is to search for face data that matches target face data. The query module 41 obtains the index information of the face data in the data subset, determines through a nearest-neighbor search the k face data most similar to the target face data, and returns the face images corresponding to those k face data as the query result to the external party that initiated the data query request.
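The search case can be sketched as a top-k similarity ranking whose result is written back as a bitmap. The sketch assumes cosine similarity and an exhaustive scan, neither of which is specified in the text; `cosine`, `top_k_bitmap`, and the sample vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_bitmap(vectors, query, k):
    """Mark with 1 the rows holding the k vectors most similar to the query."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine(vectors[i], query),
        reverse=True,
    )
    hits = set(ranked[:k])
    return [1 if i in hits else 0 for i in range(len(vectors))]


# Hypothetical face feature vectors, one per data entity.
faces = [[1, 0], [0, 1], [1, 1], [-1, 0]]
result = top_k_bitmap(faces, [1, 0.1], 2)
```

The rows whose bit is 1 identify the data entities, here face images, that would be returned as the query result.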

FIG. 10 is a schematic diagram of another interaction process of a data query provided by an embodiment of the present application.

As shown in FIG. 10, in some embodiments the data query request is a query that includes a query condition 1001, and the query condition 1001 includes non-string scalar data. The query module 41 may use the data subset loaded from the object storage module 5 to determine a row offset list 1002, made up of the row offsets of the non-string scalar fields that match the query condition 1001, each offset being relative to the first non-string scalar field in the data subset. The query module 41 then uses the mapping information 532 loaded from the object storage module 5 to determine the string index corresponding to each row offset in the row offset list 1002, determines the string 1003 corresponding to each string index according to the dictionary tree information, and returns the data entities to which the strings 1003 belong as the query result to the external party that initiated the data query request.

For example, suppose the dictionary tree information in the index information set is the dictionary tree T shown in FIG. 6, the mapping information is the index ID shown in FIG. 6, and the query condition 1001 is "non-string scalar data > 100". The query module 41 first applies the query condition 1001 to the non-string scalar field column of the data subset and obtains the row offset list 1002 of the non-string scalar fields that satisfy the condition. The query module 41 may then determine the string indexes corresponding to the row offset list based on the mapping information loaded from the object storage module 5, and determine the strings 1003 corresponding to those string indexes based on the dictionary tree information loaded from the object storage module 5. The query module 41 may return the data entities in the data subset that correspond to the strings 1003 satisfying the query condition 1001, through the proxy module, to the external party that initiated the data query request.

It can be understood that the query module 41 maps the query on the non-string scalar field column to a query on the string indexes of the string fields, and then obtains the strings 1003 corresponding to those string indexes based on the dictionary tree information. The mapping relationship can also be understood as an index on the string-type scalar field column, which enables queries on that column.
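The flow of FIG. 10 can be sketched as a scalar filter followed by two lookups. In this sketch the dictionary `id_to_string` stands in for the reverse lookup through the dictionary tree, which is an assumption made for brevity; all names and sample values are hypothetical.

```python
def filter_offsets(scalar_column, predicate):
    """Return 0-based row offsets whose scalar value satisfies the predicate."""
    return [offset for offset, value in enumerate(scalar_column) if predicate(value)]


def strings_for_offsets(offsets, row_to_string_id, id_to_string):
    """Row offset -> string index (mapping info) -> string (dictionary lookup)."""
    return [id_to_string[row_to_string_id[offset]] for offset in offsets]


# Hypothetical non-string scalar column and string index data of one subset.
scalar_column = [50, 120, 99, 300, 101]
row_to_string_id = [0, 1, 0, 2, 1]
id_to_string = {0: "ab", 1: "ac", 2: "acd"}

offsets = filter_offsets(scalar_column, lambda value: value > 100)
strings = strings_for_offsets(offsets, row_to_string_id, id_to_string)
```

Because each row stores only a string index, the string itself is materialized only for rows that survive the scalar filter.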

FIG. 11 shows a schematic structural diagram of an electronic device 1100 according to an embodiment of the present application. The electronic device 1100 may include a processor 1110, an internal memory 1120, an interface module 1130, a power module 1140, and a wireless communication module 1150.

It can be understood that the electronic device 1100 includes, but is not limited to, mobile phones (including foldable phones), tablet computers, laptop computers, desktop computers, servers, wearable devices, head-mounted displays, mobile email devices, in-vehicle devices, portable game consoles, portable music players, reader devices, televisions with one or more processors embedded in or coupled to them, and other electronic devices.

It can be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device 1100. In other embodiments of the present application, the electronic device 1100 may include more or fewer components than shown, combine some components, split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The processor 1110 may include one or more processing units. For example, the processor 1110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural network processor. The different processing units may be independent devices or may be integrated in one or more processors.

A memory may also be provided in the processor 1110 for storing instructions and data. In the embodiments of the present application, the processor 1110 may execute the data processing method of the present application.

The internal memory 1120 may be used to store computer-executable program code, which includes instructions. The internal memory 1120 may include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function). The data storage area may store data created during use of the electronic device 1100 (such as audio data or a phone book).

The interface module 1130 may be used to connect an external storage device, such as an external hard disk, to expand the storage capacity of the electronic device 1100. The external hard disk communicates with the processor 1110 through the interface module 1130 to provide the data storage function.

The power module 1140 is used to connect to the power grid and supply power to the processor 1110, the internal memory 1120, and other components.

The wireless communication module 1150 may provide wireless communication solutions applied on the electronic device 1100, including wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technologies.

Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one example implementation or technique disclosed in this application. Appearances of the phrase "in one embodiment" in various places in the specification do not necessarily all refer to the same embodiment.

The present disclosure also relates to apparatuses for performing the operations described herein. Such an apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored on a computer-readable medium, such as, but not limited to, any type of disk, including floppy disks, optical disks, CD-ROMs, and magneto-optical disks, read-only memory (ROM), random access memory (RAM), EPROM, EEPROM, magnetic or optical cards, application-specific integrated circuits (ASICs), or any type of medium suitable for storing electronic instructions, each of which may be coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may use an architecture with multiple processors for increased computing capability.

In addition, the language used in this specification has been selected principally for readability and instructional purposes and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, the present disclosure is intended to illustrate, not to limit, the scope of the concepts discussed herein.

Claims

1. A data processing method for an electronic device, comprising:

receiving a data query request;

determining a target data set corresponding to the query request, and determining at least one data subset of the target data set and an index information set corresponding to each data subset, wherein the data subsets comprise a plurality of data entities, and a part of the data entities in the plurality of data entities comprise first data of a character string type and second data of a non-character string type; and the index information set at least comprises first information for characterizing the correspondence of the first data and the string index in the data subset, and second information for characterizing the correspondence of the string index and the data entity in the data subset;

and determining target index information matched with the data query request and a target data entity corresponding to the target index information in the first information and the second information of a plurality of index information sets of the target data set.

2. The data processing method according to claim 1, wherein the first information for characterizing the correspondence between the first data in the data subset and the string index is dictionary tree information.

3. The data processing method according to claim 2, wherein the data query request includes third data of a string type;

the determining, from the first information and the second information of multiple index information sets of the target data set, target index information matching the data query request and a target data entity corresponding to the target index information includes:

determining state information for the at least one data subset of the target data set, the state information including a sealed state and a growing state;

if the at least one data subset is in the sealed state, determining a target character string index corresponding to the third data as the target index information according to the dictionary tree information;

and determining the target data entity corresponding to the target character string index according to the second information.

4. The data processing method according to claim 3, wherein the determining, according to the dictionary tree information, that a target string index corresponding to the third data is the target index information includes:

looking up the third data in the dictionary tree information;

and determining the target character string index corresponding to the third data as the target index information according to the dictionary tree information.

5. The data processing method according to claim 3, wherein the determining, among the first information and the second information of the index information sets of the target data set, a target index information that matches the data query request and a target data entity corresponding to the target index information further comprises:

and determining the target data entity as a data query result corresponding to the data query request.

6. The data processing method of claim 1, wherein the data query request further includes fourth data of a non-string type.

7. The data processing method of claim 1, wherein the data query request comprises a query condition, the query condition comprising at least one of:

a Boolean expression;

a prefix matching condition characterizing a character string prefix;

an exact matching condition characterizing a character string.

8. A data processing method for an electronic device, comprising:

receiving a data insertion request, wherein the data insertion request comprises a character string set to be inserted, and the character string set comprises a plurality of character string data;

and responding to the data insertion request, writing the plurality of character string data into corresponding data subsets of a target data set respectively, wherein each data subset comprises at least one data entity, and the at least one data entity comprises at least one character string data.

9. The data processing method of claim 8, further comprising:

acquiring the at least one character string data in the data subset;

according to the at least one character string data, constructing third information of the data subset, wherein the third information is used for representing the corresponding relation between the at least one character string data in the data subset and a character string index;

and according to each character string data in the first information, the corresponding character string index and the data entity corresponding to each character string data, fourth information of the data subset is constructed, wherein the fourth information is used for representing the corresponding relation between the character string index and the data entity in the data subset.

10. The data processing method according to claim 8, wherein said writing the plurality of character string data into the corresponding data subsets in response to the data insertion request comprises:

in response to the data insertion request, dividing the plurality of character string data into M character string subsets, and sending the character string subsets to corresponding M data nodes, wherein M is greater than or equal to 2;

and respectively writing the character string data in the character string subsets of the M data nodes into the corresponding data subsets.

11. An electronic device, comprising:

a memory for storing instructions for execution by one or more processors of the electronic device, and

a processor, being one of the processors of the electronic device, for controlling execution of the data processing method of any one of claims 1 to 7 or the data processing method of any one of claims 8 to 10.

12. A computer-readable storage medium, characterized in that the storage medium has stored thereon instructions which, when executed on a computer, cause the computer to carry out the data processing method of any one of claims 1 to 7 or the data processing method of any one of claims 8 to 10.

13. A computer program product, characterized in that it comprises instructions for implementing the data processing method of any one of claims 1 to 7 or the data processing method of any one of claims 8 to 10.