
CN118550447A - A data processing method, device, equipment and storage medium

Info

Publication number
CN118550447A
Authority
CN
China
Prior art keywords
data
processed
storage
primary key
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310175932.9A
Other languages
Chinese (zh)
Inventor
刘庆
王宇超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202310175932.9A
Publication of CN118550447A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023 Free address space management
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F3/0674 Disk device
    • G06F3/0676 Magnetic disk device
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/548 Queue
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a data processing method, device, equipment and storage medium, applied to the field of data processing technology. The method first obtains data to be processed and stores it in the memory cache of the storage partition corresponding to the primary key of that data. When a storage condition is satisfied, at least part of the data held in the memory cache of the storage partition is stored in the disk storage space of that storage partition. Because the storage partition for each piece of data is determined by its primary key, a later query only needs to search the storage partition corresponding to the target primary key of the query, which narrows the scope of the query and speeds it up.

Description

A data processing method, device, equipment and storage medium

Technical Field

The present application relates to the field of data processing technology, and in particular to a data processing method, device, equipment and storage medium.

Background Art

Data processing technology is now applied in many industries. In practice, the large volumes of data that are collected are usually stored in a data management platform, and the relevant data is retrieved from the platform when a query or analysis is needed later. Taking log data as an example, log data can be stored in the data management platform for anomaly detection and fault diagnosis, and the relevant log data is queried from the platform when troubleshooting is required.

Currently, querying data from the data management platform is inefficient and time-consuming.

Summary of the Invention

In view of this, the present application provides a data processing method, device, equipment and storage medium that can improve the efficiency of querying data.

To solve the above problems, the technical solutions provided by the present application are as follows:

In a first aspect, the present application provides a data processing method, the method comprising:

obtaining data to be processed;

storing the data to be processed in a memory cache of a storage partition corresponding to the primary key of the data to be processed;

in response to a storage condition being met, storing at least part of the data to be processed held in the memory cache of the storage partition in the disk storage space of the storage partition.

In a second aspect, the present application provides a data processing device, the device comprising:

a first acquisition unit, configured to obtain data to be processed;

a first storage unit, configured to store the data to be processed in a memory cache of a storage partition corresponding to the primary key of the data to be processed;

a second storage unit, configured to store, in response to a storage condition being met, at least part of the data to be processed held in the memory cache of the storage partition in the disk storage space of the storage partition.

In a third aspect, the present application provides an electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method of any implementation of the first aspect.

In a fourth aspect, the present application provides a computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method of any implementation of the first aspect.

In a fifth aspect, the present application provides a computer program product which, when run on a device, causes the device to execute the method of any implementation of the first aspect.

It can be seen that the present application has the following beneficial effects:

The present application provides a data processing method, device, equipment and storage medium. Data to be processed is first obtained and stored in the memory cache of the storage partition corresponding to its primary key. When a storage condition is satisfied, at least part of the data held in the memory cache of the storage partition is stored in the disk storage space belonging to that storage partition. Because the storage partition for each piece of data is determined by its primary key, a later query only needs to search the storage partition corresponding to the target primary key of the query, which narrows the scope of the query and speeds it up.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the framework of an exemplary application scenario provided in an embodiment of the present application;

FIG. 2 is a flow chart of a data processing method provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of storing data to be processed in a memory table of the storage partition corresponding to the primary key of the data to be processed, provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of the structure of an SST file provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of the structure of a block in an SST file provided in an embodiment of the present application;

FIG. 6 is a schematic diagram of the structure of a data block provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of the structure of a leaf index block provided in an embodiment of the present application;

FIG. 8 is a schematic diagram of the structure of a root index block provided in an embodiment of the present application;

FIG. 9 is a schematic diagram of merging SST files provided in an embodiment of the present application;

FIG. 10 is a flow chart of a data query provided in an embodiment of the present application;

FIG. 11 is a schematic diagram of the structure of a data processing device provided in an embodiment of the present application;

FIG. 12 is a schematic diagram of the basic structure of an electronic device provided in an embodiment of the present application.

Detailed Description

To make the technical solutions provided by the embodiments of the present application easier to understand and explain, the background of the present application is described first.

Data with a large volume, such as log data, is usually stored in a data management platform dedicated to data storage. When users need the data, they query the platform for previously stored log data that meets the query conditions and receive the query results returned by the platform. However, querying data through the data management platform is currently inefficient and has high latency.

Based on this, the present application provides a data processing method, device, equipment and storage medium. Data to be processed is first obtained and stored in the memory cache of the storage partition corresponding to its primary key. When a storage condition is satisfied, at least part of the data held in the memory cache of the storage partition is stored in the disk storage space belonging to that storage partition. Because the storage partition for each piece of data is determined by its primary key, a later query only needs to search the storage partition corresponding to the target primary key of the query, which narrows the scope of the query and speeds it up.

To make the data processing method provided by the embodiments of the present application easier to understand, it is described below with reference to the scenario example shown in FIG. 1, which is a schematic diagram of the framework of an exemplary application scenario provided in an embodiment of the present application.

As an example, the data processing method provided in the embodiments of the present application is applied in a data management platform, which stores and manages large amounts of data. The platform obtains the data to be processed that a reporting end reports for storage, stores it, and returns the data found by a query once a query instruction is received later. The platform first writes the reported data to be processed into a message queue. It then reads the data from the message queue and stores it in the memory cache of the storage partition corresponding to the primary key of the data. During a later query, data can therefore be looked up in the storage partition corresponding to the target primary key included in the query instruction, which narrows the scope of the query and improves query efficiency. When a storage condition is met, for example when the amount of data to be processed held in the memory cache of a storage partition exceeds a data volume threshold, all or part of the data held in that memory cache is stored in the disk storage space of the storage partition. This forms a two-level storage structure consisting of the memory cache and the disk storage space. Recently stored data may still be in the memory cache, so the memory cache is queried first and the disk storage space is queried only if nothing is found there. Data stored earlier may already be in the disk storage space and can be queried there directly. Data with different storage times can thus be queried quickly, and data that has not yet been written to the disk storage space remains queryable.
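The two-level lookup described above can be sketched as follows. This is a minimal illustration rather than the implementation of the present application: the class `StoragePartition`, its `get` method, and the use of plain dictionaries holding one value per key for both levels are assumptions made for the example.

```python
from typing import Optional


class StoragePartition:
    """Hypothetical storage partition with a memory cache and a disk store.

    Both levels are modelled as dictionaries purely for illustration.
    """

    def __init__(self) -> None:
        self.memory_cache: dict[str, str] = {}
        self.disk: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        # Recently written data may still be in the memory cache, so check it first.
        if key in self.memory_cache:
            return self.memory_cache[key]
        # Data written earlier may already have been moved to disk.
        return self.disk.get(key)


partition = StoragePartition()
partition.disk["req-1"] = "old log line"          # written earlier, already on disk
partition.memory_cache["req-2"] = "new log line"  # written recently, not yet on disk
```

In this sketch a record remains queryable whether or not it has reached the disk level, which is the property the scenario above relies on.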

To make the technical solutions provided by the embodiments of the present application easier to understand, a data processing method provided by an embodiment of the present application is described below with reference to the accompanying drawings. The storage process for the data to be processed is introduced first. Referring to FIG. 2, which is a flow chart of a data processing method provided in an embodiment of the present application, the data processing method includes S201-S203.

S201: Obtain data to be processed.

The data to be processed is data that a reporting end reports and that the data management platform needs to store. The embodiments of the present application do not limit the data type of the data to be processed. As one example, the data to be processed is log data, that is, data generated while users use a service and used for operation and maintenance. As another example, the data to be processed is video detection data recording a specific location. The reporting end is a network node that handles the data to be processed. For example, if the data to be processed is log data of a live broadcast service, the log data is generated by the network nodes providing the live broadcast service and reported to the data management platform, and the reporting ends are those network nodes. If the data to be processed is detection data, the reporting end can be a detection device that generates the video detection data.

The embodiments of the present application do not limit the source of the data to be processed. In one possible implementation, the data to be processed is obtained from a message queue. The reporting end sends the reported data to the message queue so that the data management platform can obtain it from there. For example, the message queue is BMQ (a unified stream-batch message queue based on a storage-compute separated architecture). The message queue includes multiple queue partitions, and the data to be processed is randomly distributed across them. The embodiments of the present application do not limit how the data to be processed is allocated within the message queue. As an example, when the reporting end reports data to be processed, a random algorithm can be used to distribute the data randomly across the queue partitions of the message queue. This avoids assigning queue partitions by primary key, which could leave some queue partitions with far more data than others, overload those partitions, and cause data to be lost.
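The random distribution across queue partitions can be illustrated with a short sketch. The function name `pick_queue_partition`, the partition count of 4 and the fixed seed are assumptions chosen for the example; the source only states that a random algorithm can be used.

```python
import random
from collections import Counter


def pick_queue_partition(num_queue_partitions: int, rng: random.Random) -> int:
    """Pick a queue partition uniformly at random, ignoring the record's primary key."""
    return rng.randrange(num_queue_partitions)


# 1000 records that all share one "hot" primary key are still spread over partitions,
# because the primary key plays no part in the choice.
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
counts = Counter(pick_queue_partition(4, rng) for _ in range(1000))
```

Had the queue partition been derived from the primary key instead, all 1000 records would have landed in a single queue partition, which is the overload scenario the paragraph above describes.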

In another possible implementation, the data to be processed is obtained from another data management platform or system, for example from Kafka (a messaging system). As an example, the Flink (a framework and distributed processing engine) Kafka connector can be used to obtain the data to be processed. The Flink Kafka connector establishes the connection between the data management platform and Kafka.

The embodiments of the present application also do not limit how the data to be processed is obtained. As an example, obtaining the data is implemented by a Flink task. The data management platform creates a data source (source) operator for the Flink task, and the data source operator enables the platform to obtain the data to be processed from the message queue or from Kafka.

S202: Store the data to be processed in the memory cache of the storage partition corresponding to the primary key of the data to be processed.

A storage partition is a pre-divided storage space for storing data to be processed. Each storage partition has a corresponding primary key and stores the data to be processed whose primary key is the one corresponding to that partition. A storage partition includes a memory cache and a disk storage space. Data to be processed is first stored in the memory cache of the storage partition and, once the storage condition is met, is stored in the disk storage space of the storage partition. In one possible implementation, the data to be processed is stored in a memory table within the memory cache of the storage partition. The embodiments of the present application do not limit the storage engine used by the storage partition. As an example, a storage partition can use an LSM-tree (Log-Structured Merge Tree) storage engine to store the data to be processed.

As an example, the data management platform stores the data to be processed by means of a Flink task, for which it creates data output (sink) operators. The data source operator created by the platform obtains the data to be processed from the message queue or from Kafka and, using Flink's custom routing mechanism, routes each piece of data according to its primary key to the data output operator responsible for writing to the storage partition corresponding to that primary key. The data output operator writes the data it receives to the memory cache of the storage partition. The custom routing mechanism specifies the correspondence between storage partitions and primary keys in advance. In this way, data that was randomly distributed across the queue partitions of the message queue is reorganized and stored by primary key, so that a later query only needs to search the storage partition corresponding to the target primary key, which narrows the scope of the query and improves query efficiency.
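The routing step can be sketched outside Flink as a pure function from primary key to storage partition. The source only says that the correspondence between storage partitions and primary keys is specified in advance; mapping keys with a CRC32 hash modulo the partition count is an assumption made here, as are the name `route_to_partition` and the sample records.

```python
import zlib
from collections import defaultdict


def route_to_partition(primary_key: str, num_storage_partitions: int) -> int:
    """Map a primary key to a storage partition index (assumed hash-based rule)."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str,
    # which is randomized per process by default.
    return zlib.crc32(primary_key.encode("utf-8")) % num_storage_partitions


# Records arrive in arbitrary order; each is handed to the sink for its partition.
records = [("user-7", "login"), ("user-9", "play"), ("user-7", "logout")]
sinks: dict[int, list[tuple[str, str]]] = defaultdict(list)
for primary_key, payload in records:
    sinks[route_to_partition(primary_key, 8)].append((primary_key, payload))
```

Whatever rule is used, it has to be deterministic, so that the query side can compute the same partition from the target primary key.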

The data management platform can create one or more data source operators, and the number of data output operators it creates equals the number of storage partitions. Taking data obtained from a message queue as an example, the message queue has multiple queue partitions, and each data source operator reads one or more of them. The number of data source operators is less than or equal to the number of queue partitions. The number of data source operators and the number of data output operators are independent of each other. In some possible implementations, there are more data output operators than data source operators.
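The relationship between data source operators and queue partitions can be illustrated with a small assignment sketch. Round-robin assignment and the function name `assign_queue_partitions` are assumptions; the source only requires that each data source operator reads one or more queue partitions and that there are no more operators than queue partitions.

```python
def assign_queue_partitions(num_queue_partitions: int, num_sources: int) -> dict[int, list[int]]:
    """Assign every queue partition to exactly one source operator, round-robin."""
    if not 1 <= num_sources <= num_queue_partitions:
        raise ValueError("need 1 <= num_sources <= num_queue_partitions")
    assignment: dict[int, list[int]] = {source: [] for source in range(num_sources)}
    for queue_partition in range(num_queue_partitions):
        assignment[queue_partition % num_sources].append(queue_partition)
    return assignment


# Two source operators sharing five queue partitions.
assignment = assign_queue_partitions(5, 2)
```

The number of data output operators does not appear in this function at all, which reflects the statement that the two operator counts are independent.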

As an example, the data to be processed is obtained from a message queue. Refer to FIG. 3, which is a schematic diagram, provided in an embodiment of the present application, of storing data to be processed in the memory cache of the storage partition corresponding to its primary key. The data management platform creates two data source operators. The two data source operators obtain data to be processed from the queue partitions of the message queue and determine the storage partition in which each piece of data is to be stored, based on its primary key and on the correspondence between storage partitions and primary keys specified in advance by the custom routing mechanism. The data source operator sends the data to the data output operator responsible for writing to that storage partition, and the data output operator stores the data it receives in the memory cache of the storage partition.

S203: In response to the storage condition being met, store at least part of the data to be processed held in the memory table of the storage partition in the disk storage space of the storage partition.

The memory cache of a storage partition has limited storage resources. When the storage condition is met, data to be processed held in the memory cache of the storage partition is stored in the disk storage space of the storage partition. As one example, the storage condition is that the data to be processed held in the memory cache reaches a data volume threshold. As another example, the storage condition is that the data to be processed has been held in the memory table for longer than a duration threshold. The storage condition can be set according to storage needs, and the embodiments of the present application do not limit it.
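The two example storage conditions can be written as a single predicate. The source presents the size threshold and the duration threshold as alternative examples; combining them with a logical OR is an assumption of this sketch, as are the names `CacheState` and `storage_condition_met`.

```python
from dataclasses import dataclass


@dataclass
class CacheState:
    """Hypothetical snapshot of one storage partition's memory cache."""
    bytes_stored: int         # total size of the cached data to be processed
    oldest_write_time: float  # timestamp (seconds) of the oldest cached record


def storage_condition_met(state: CacheState, now: float,
                          max_bytes: int, max_age_seconds: float) -> bool:
    # Condition 1: the cached data reaches the data volume threshold.
    size_reached = state.bytes_stored >= max_bytes
    # Condition 2: the oldest cached data has been held longer than the duration threshold.
    too_old = (now - state.oldest_write_time) > max_age_seconds
    return size_reached or too_old
```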

In addition, when the storage condition is met, either all or part of the data to be processed held in the memory cache of the storage partition can be stored in the disk storage space of the storage partition. For example, suppose the storage condition is that the data held in the memory cache reaches a data volume threshold of 1 GB (gigabyte), and the memory cache holds 1.5 GB of data to be processed. In that case 1 GB of the data in the memory cache is stored in the disk storage space of the storage partition, and the remaining 0.5 GB stays in the memory cache.
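The partial flush in the 1 GB / 1.5 GB example can be sketched as follows. Sizes are expressed in MiB to keep the arithmetic in integers, and flushing the oldest records first is an assumption; the source does not say which part of the cached data is written to disk. The function name `flush_partial` is hypothetical.

```python
from collections import deque


def flush_partial(cache: deque, threshold: int) -> list:
    """Move records from the cache to a flush batch once the threshold is reached.

    `cache` holds (record_id, size) pairs in arrival order. Records are removed,
    oldest first, until the flushed size reaches `threshold`.
    """
    total = sum(size for _, size in cache)
    if total < threshold:
        return []  # storage condition not met, nothing is flushed
    flushed, flushed_size = [], 0
    while cache and flushed_size < threshold:
        record = cache.popleft()
        flushed.append(record)
        flushed_size += record[1]
    return flushed


# 1.5 GB cached as three 512 MiB records, with a 1 GB (1024 MiB) threshold.
cache = deque([("r1", 512), ("r2", 512), ("r3", 512)])
to_disk = flush_partial(cache, threshold=1024)
```

After the call, `to_disk` holds 1 GB worth of records and 0.5 GB remains in `cache`, matching the figures in the example above.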

基于上述S201-S203的相关内容可知,能够基于主键确定存储待处理数据的存储分区,便于后续在数据查询时,在查询的目标主键所对应的存储分区中进行数据查询,缩小数据查询的范围,提高查询数据的速度。并且,基于存储分区包括的内存缓存和磁盘存储空间两级存储结构,能够实现对不同时间存储的待处理数据的查询,提高查询数据的效率,减少时延。Based on the above S201-S203, it can be known that the storage partition for storing the data to be processed can be determined based on the primary key, so that when querying data later, data query can be performed in the storage partition corresponding to the target primary key of the query, thereby narrowing the scope of the data query and improving the speed of querying data. In addition, based on the two-level storage structure of memory cache and disk storage space included in the storage partition, it is possible to query the data to be processed stored at different times, improve the efficiency of querying data, and reduce latency.

In a possible implementation, the data to be processed is stored in SST (sorted string table) files in the disk storage space.

In one flush to disk, that is, in one pass of storing data to be processed into the disk storage space, one sorted string table file is generated. That file holds the data to be processed stored into the disk storage space in this pass. The disk storage files of a storage partition can include one or more SST files. The SST files of different storage partitions are independent of one another.

The generated SST files can be stored in HDFS (Hadoop Distributed File System) to reduce the cost of storing data.

An SST file is stored in units of blocks. An SST file includes multiple data blocks. Each data block stores data to be processed in key-value pair format. The data to be processed within a data block is sorted by primary key. The data blocks themselves are ordered by the primary key of the first data to be processed stored in each data block, for example in ascending order of primary key. For example, an SST file includes data block A and data block B. Data block A includes data to be processed with primary key values from 90 to 100. Data block B includes data to be processed with primary key values from 95 to 105. The data to be processed in data block A and in data block B is stored in ascending order of primary key. The primary key of the first data to be processed in data block A is 90, which is smaller than 95, the primary key of the first data to be processed in data block B. In the SST file, data block A is therefore placed before data block B.

This makes it possible to build a clustered index for subsequent queries and to query the data of a target primary key in order, which improves query efficiency.

In addition, the SST file can also include an index block. The index block indicates the primary key of the first data stored in each data block and the storage position of each data block in the sorted string table file. From the index block, the data block that stores the data to be processed of the target primary key can be determined, together with its storage position in the SST file. That data block can then be read to obtain the data to be processed of the target primary key.
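By way of illustration only, the relationship between key-sorted data blocks and the index block can be sketched as follows. The function names, the fixed number of records per block, and the use of a block's list position as its "storage position" are assumptions made for this sketch.

```python
def build_blocks(records, records_per_block):
    """Sort (primary_key, value) records and cut them into fixed-size blocks."""
    ordered = sorted(records, key=lambda record: record[0])
    return [ordered[i:i + records_per_block]
            for i in range(0, len(ordered), records_per_block)]


def build_index(blocks):
    """One index entry per block: (first primary key in block, block position)."""
    return [(block[0][0], position) for position, block in enumerate(blocks)]


records = [(95, "c"), (90, "a"), (105, "e"), (100, "d")]
blocks = build_blocks(records, records_per_block=2)
index = build_index(blocks)
```

Because the blocks are cut from one sorted sequence, the first keys in the index are themselves in ascending order, which is what allows the index to be searched efficiently.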

When an SST file contains a large number of data blocks, its index block includes multiple leaf index blocks and a root index block. A leaf index block stores the index information of its corresponding one or more data blocks. The root index block stores the index information of all leaf index blocks in the SST file. A leaf index block stores, for each of its corresponding data blocks, the primary key of the first data stored in that data block and the storage position of that data block in the sorted string table file. The root index block stores, for each leaf index block, the first primary key covered by that leaf index block and the storage position of that leaf index block in the sorted string table file.

As an example, an embodiment of the present application provides an SST file. The format of this SST file is described below.

Refer to FIG. 4, which is a schematic structural diagram of an SST file provided in an embodiment of the present application. The SST file includes data blocks (Data Block), leaf index blocks (Leaf Index Block), a root index block (Root Index Block), a meta-information block (Meta-Info Block), and a fixed-length trailer (Fixed-Length Meta-Info Trailer). The trailer contains the file attribute information required when the SST file is first opened. The root index block, the meta-information block, and the trailer are read and cached when the SST file is opened. Data blocks and leaf index blocks are read on demand: when a query instruction is executed, only the parts relevant to that query instruction are read.

The structure of each block in the SST file (data block, leaf index block, root index block, and meta-information block) is shown in FIG. 5. Each block consists of block header information (Block Header) and block data (Block Data). The block header has five fixed fields: the block type (Block Type), the size the block occupies on disk (On-Disk Size), the original block size after decompression (Uncompressed Raw Size), the offset of the previous block of the same type (Previous Block Offset), and the number of entries in the block (Number of Entries). The block type is one of data block, leaf index block, root index block, and meta-information block. The block data is the actual information stored in the block. The block data of a data block is data to be processed. The block data of a leaf index block or a root index block is index entries. The block data of a meta-information block is metadata.
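By way of illustration only, a block header with these five fields could be serialized as follows. The field widths (one byte for the block type, 64-bit unsigned integers for the rest), the little-endian byte order, and the numeric codes for the block types are assumptions made for this sketch; the embodiments do not specify them.

```python
import struct

# Assumed layout: 1-byte block type followed by four unsigned 64-bit integers.
HEADER = struct.Struct("<BQQQQ")
BLOCK_TYPES = {0: "data", 1: "leaf_index", 2: "root_index", 3: "meta_info"}


def pack_header(block_type, on_disk_size, raw_size, prev_offset, num_entries):
    return HEADER.pack(block_type, on_disk_size, raw_size, prev_offset, num_entries)


def unpack_header(buf):
    block_type, on_disk_size, raw_size, prev_offset, num_entries = HEADER.unpack_from(buf)
    return {
        "block_type": BLOCK_TYPES[block_type],
        "on_disk_size": on_disk_size,
        "uncompressed_raw_size": raw_size,
        "previous_block_offset": prev_offset,
        "number_of_entries": num_entries,
    }
```

A fixed-size header of this kind lets a reader learn how many bytes of block data follow, and how large a buffer decompression will need, before reading the block data itself.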

The specific structures of the data block, the leaf index block, and the root index block are described below in turn.

Refer to FIG. 6, which is a schematic structural diagram of a data block provided in an embodiment of the present application. The data block is the smallest unit of data storage in an SST file. A data block includes block header information (Block Header) and multiple pieces of data to be processed. The data to be processed is stored as key-value pairs. As an example, one piece of data to be processed includes a key length (Key Length), a value length (Value Length), a key (Key), and a value (Value). The data to be processed inside a data block is compressed. To read it, the data block must first be decompressed into memory.
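By way of illustration only, the key-value entry layout and block compression can be sketched as follows. The 32-bit length fields and the use of `zlib` are assumptions made for this sketch; the embodiments do not name a field width or a compression codec, and the block header is omitted here.

```python
import struct
import zlib


def encode_data_block(records):
    """Serialize (key, value) byte pairs in key order, then compress the body."""
    body = bytearray()
    for key, value in sorted(records, key=lambda record: record[0]):
        # Each entry: key length, value length, key bytes, value bytes.
        body += struct.pack("<II", len(key), len(value)) + key + value
    return zlib.compress(bytes(body))


def decode_data_block(blob):
    """Decompress the block into memory and walk the entries one by one."""
    raw = zlib.decompress(blob)
    pos, records = 0, []
    while pos < len(raw):
        key_len, value_len = struct.unpack_from("<II", raw, pos)
        pos += 8
        key = raw[pos:pos + key_len]
        pos += key_len
        value = raw[pos:pos + value_len]
        pos += value_len
        records.append((key, value))
    return records
```

Storing both lengths ahead of the key and value is what allows variable-length entries to be read back sequentially without any separator byte.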

The data to be processed within a data block is ordered by primary key. The data blocks in an SST file are likewise ordered by the primary key of the first data to be processed stored in each data block.

Refer to FIG. 7, which is a schematic structural diagram of a leaf index block provided in an embodiment of the present application. A leaf index block contains the index information of one or more data blocks. A leaf index block consists of block header information (Block Header), a meta index (Meta Index), and multiple leaf index entries (Entries). The meta index is used to locate the storage position of a single leaf index entry directly from its subscript, so that a binary search can be performed over the leaf index entries. The meta index includes the number of index entries in the leaf index block (Number of Index Entries), the offset of each leaf index entry relative to the first leaf index entry (Relative Offsets to Entries), and the total size of all leaf index entries (Total Size of Entries). Each leaf index entry contains the index information of one data block: the offset of the data block (Data Block Offset), the size the data block occupies on disk (Data Block On-Disk Size), and the first key of the data block (First Key in the Data Block).
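By way of illustration only, the role of the meta index in enabling binary search over variable-length leaf index entries can be sketched as follows. The function names and the 64-bit encoding of the offset and size fields are assumptions made for this sketch.

```python
import struct


def build_leaf_index(entries):
    """entries: (data_block_offset, on_disk_size, first_key) tuples sorted by first_key."""
    body, relative_offsets = bytearray(), []
    for offset, size, first_key in entries:
        relative_offsets.append(len(body))   # offset relative to the first entry
        body += struct.pack("<QQ", offset, size) + first_key
    return relative_offsets, len(body), bytes(body)


def entry_at(relative_offsets, total_size, body, i):
    """Jump straight to entry i using the meta index."""
    start = relative_offsets[i]
    end = relative_offsets[i + 1] if i + 1 < len(relative_offsets) else total_size
    offset, size = struct.unpack_from("<QQ", body, start)
    return offset, size, body[start + 16:end]


def find_entry(relative_offsets, total_size, body, target_key):
    """Binary search for the last entry whose first key is <= target_key."""
    lo, hi = 0, len(relative_offsets)
    while lo < hi:
        mid = (lo + hi) // 2
        if entry_at(relative_offsets, total_size, body, mid)[2] <= target_key:
            lo = mid + 1
        else:
            hi = mid
    return None if lo == 0 else entry_at(relative_offsets, total_size, body, lo - 1)
```

The total size of all entries is what gives the last entry an end position; every other entry ends where the next relative offset begins.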

Refer to FIG. 8, which is a schematic structural diagram of a root index block provided in an embodiment of the present application. The root index block contains the index information of all leaf index blocks in the SST file. A root index block consists of block header information (Block Header) and several root index entries (Entries). Each root index entry contains the index information of one leaf index block: the offset of the leaf index block (Leaf Block Offset), the size the leaf index block occupies on disk (Leaf Block On-Disk Size), and the first key of the leaf index block (First Key in the Leaf Block).

SST files are immutable, read-only files. Once written, they cannot be appended to or modified. As flushes to disk accumulate, the number of SST files in a storage partition gradually grows. A single subsequent query may then involve multiple SST files, which can reduce query efficiency.

To prevent the number of SST files from growing continuously over time, SST files can be merged. Multiple SST files are combined into one ordered SST file, which increases the degree of data aggregation and facilitates subsequent data queries.

It should be noted that, in some possible implementations, the data to be processed, for example log data, is generated in chronological order as users use the service. When merging SST files, the files being merged should cover similar time spans, so that data queries can be served from a concentrated set of files.

For this kind of data to be processed, an embodiment of the present application provides a data processing method which, in addition to the above steps, further includes the following step:

In response to the number of SST files that belong to the same storage partition and whose creation times fall within a preset time period being greater than or equal to a threshold, merging the SST files whose creation times fall within the preset time period.

For the SST files belonging to one storage partition, the creation time of each SST file is obtained. The creation time of an SST file is the time at which that SST file was created.

If the number of SST files whose creation times fall within the preset time period is greater than or equal to the threshold, there are many such SST files and they cover a narrow time span, so they can be merged. It should be noted that, during merging, the data to be processed stored in the SST files being merged is re-sorted by primary key, so the data to be processed in the merged SST file is still sorted by primary key.

The preset time period can be set in advance and adjusted as time passes. As an example, refer to FIG. 9, which is a schematic diagram of merging SST files provided in an embodiment of the present application. Time is divided into multiple preset time periods, for example one preset time period per hour. If the number of SST files whose creation times fall within one preset time period is greater than or equal to the threshold, the SST files whose creation times fall within that preset time period are merged. Multiple consecutive preset time periods are then combined into an updated preset time period. For example, four adjacent one-hour periods are combined into an updated preset time period of four hours. If the number of SST files whose creation times fall within the updated preset time period is greater than or equal to the threshold, the SST files whose creation times fall within that updated preset time period are merged. This continues until a stop-merging condition is met, at which point merging of SST files stops. The stop-merging condition is a preset condition for stopping the merging of SST files. One example of a stop-merging condition is that the preset time period exceeds a duration limit, for example that the preset time period cannot exceed 8 hours. Another example is that the size of the merged SST file exceeds a size limit, for example 4 GB (gigabytes). In this way, as the preset time period is adjusted over time, SST files are merged gradually, which increases the degree of data aggregation and facilitates subsequent data queries.
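By way of illustration only, merging SST files by creation-time window can be sketched as follows. The function name `merge_by_window`, the dictionary representation of an SST file, and the choice to give a merged file the earliest creation time of its inputs are assumptions made for this sketch. The stop-merging conditions are omitted.

```python
import heapq
from collections import defaultdict


def merge_by_window(files, window_seconds, threshold):
    """files: dicts {"created": seconds, "records": key-sorted (key, value) list}."""
    buckets = defaultdict(list)
    for f in files:
        buckets[f["created"] // window_seconds].append(f)

    result = []
    for window in sorted(buckets):
        group = buckets[window]
        if len(group) >= threshold:
            # k-way merge keeps the merged records sorted by primary key.
            merged = list(heapq.merge(*(f["records"] for f in group),
                                      key=lambda record: record[0]))
            result.append({"created": min(f["created"] for f in group),
                           "records": merged})
        else:
            result.extend(group)
    return result


files = [
    {"created": 100, "records": [(1, "a"), (5, "e")]},
    {"created": 200, "records": [(2, "b"), (4, "d")]},
    {"created": 4000, "records": [(3, "c")]},
]
hourly = merge_by_window(files, window_seconds=3600, threshold=2)
four_hourly = merge_by_window(hourly, window_seconds=4 * 3600, threshold=2)
```

Calling the function again with a wider window, as in the last line, corresponds to combining adjacent preset time periods into an updated preset time period.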

The process of storing the data to be processed has been described above. The data query process based on this storage method is described below.

After the data to be processed has been stored, it can be queried. An embodiment of the present application provides a data processing method which, in addition to the above steps, further includes the following steps:

A1: Obtain a query instruction, where the query instruction includes a target primary key.

The query instruction is used to query the stored data to be processed that corresponds to the target primary key. The query instruction can include one or more target primary keys.

The embodiments of the present application do not limit how the query instruction is triggered. In a possible implementation, refer to FIG. 10, which is a schematic flowchart of a data query provided in an embodiment of the present application. A user writes the query instruction into redis (Remote Dictionary Server) through a client that implements the data query function. The data management platform polls redis periodically to obtain the query instruction. The query instruction includes the target primary key, which is the primary key of the data to be processed that needs to be queried.

A2: Query the data to be processed corresponding to the target primary key in one or more of the memory cache of the storage partition corresponding to the target primary key and the disk storage space of the storage partition corresponding to the target primary key, to obtain a query result.

Based on the target primary key included in the query instruction, the query can be performed in one or more of the memory cache and the disk storage space of the storage partition corresponding to the target primary key. Querying within the storage partition corresponding to the target primary key restricts the query to a smaller range of data and improves query efficiency.

The embodiments of the present application do not limit which storage space is queried; this can be set according to the needs of the query. For example, in one possible implementation, the data to be processed corresponding to the target primary key can be queried in both the memory cache and the disk storage space of the storage partition. In another possible implementation, the query instruction further includes a storage time range. The storage time range is the range of times at which the data to be processed was reported to the data management platform. If the transmission delay of the data to be processed is ignored, the storage time range is the range of times at which the data to be processed was generated. In some cases, the storage time range is a single moment; in others, it is a time period. Based on the storage time range, it can be determined whether the data to be processed corresponding to the target primary key is stored in the memory cache or in the disk storage space of the storage partition.

As an example, the storage time range is compared with the flush time to determine which is earlier. The flush time is, for the storage partition corresponding to the target primary key, the most recent time before the current time at which at least part of the data to be processed in the memory cache was stored to the disk storage space.

If the storage time range is later than the flush time (the most recent time at which memory-cache data was stored to the disk storage space), the data to be processed that needs to be queried may still be in the memory cache. For example, the storage time range is 10:00-10:05 and the flush time is 9:30. The data to be queried was stored to the data management platform between 10:00 and 10:05, so it was not stored to the disk storage space at 9:30. The data to be processed corresponding to the target primary key is therefore queried in the memory cache of the storage partition corresponding to the target primary key. If it is found, the data found is used as the query result. If it is not found, the data to be processed corresponding to the target primary key is queried in the disk storage space of the storage partition corresponding to the target primary key. If it is found there, the data found is used as the query result. If it is still not found, the query result is empty.
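By way of illustration only, the lookup order for a storage time range later than the most recent disk write can be sketched as follows. The function name, the dictionary stand-ins for the memory cache and the disk storage space, and the handling of every other case (checking both stores) are assumptions made for this sketch. Times are expressed as minutes since midnight.

```python
def query(target_key, time_range, last_disk_write, memory_cache, disk_storage):
    """memory_cache / disk_storage: dicts mapping primary key -> list of values."""
    start, _end = time_range
    if start > last_disk_write:
        # Data this recent may still be in memory: try the cache first.
        hits = memory_cache.get(target_key, [])
        if hits:
            return hits
        # Not in memory: fall back to disk; an empty list means "no result".
        return disk_storage.get(target_key, [])
    # Assumption: for any other time range, check both stores.
    return disk_storage.get(target_key, []) + memory_cache.get(target_key, [])


# 10:00-10:05 is later than a 9:30 disk write, so the cache is tried first.
result = query("k", (600, 605), 570, {"k": ["from-memory"]}, {"k": ["from-disk"]})
```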

If the storage time range is earlier than the flush time (the most recent time at which memory-cache data was stored to the disk storage space), the data to be processed that needs to be queried may already be in the disk storage space. The data to be processed corresponding to the target primary key is therefore queried in the disk storage space of the storage partition corresponding to the target primary key.

As an example, an embodiment of the present application provides a specific implementation for querying data to be processed from the memory cache. In a possible implementation, a memory table in the memory cache stores the data to be processed. The data to be processed in the memory table is sorted by primary key, and the memory table is indexed by the primary key of the data to be processed. The query retrieves the data to be processed whose index equals the target primary key. As an example, when the memory table is queried, an iterator can be created for it. The iterator locates the starting position of the data to be processed in the memory table whose primary key is the target primary key. The data management platform builds a min-heap to manage the iterators. Starting from the position located by the iterator, the min-heap reads, in order, the data to be processed in the memory table whose primary key is the target primary key, until it reads data whose primary key is not the target primary key. The data read is the data to be processed corresponding to the target primary key, that is, the query result.

As another example, an embodiment of the present application provides a specific implementation for querying data to be processed from the disk storage space. Take the SST file shown in FIG. 4 as an example. Because the data to be processed in an SST file is sorted by primary key, the smallest primary key stored in the SST file can be looked up during a query. If the primary keys in the SST file are sorted in ascending order, the smallest primary key in the SST file is the primary key of the first data to be processed in the first data block. If the smallest primary key of the SST file is larger than the target primary key, the SST file does not contain data to be processed corresponding to the target primary key, and there is no need to query that SST file.
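By way of illustration only, skipping SST files by their smallest primary key can be sketched as follows. The function name and the dictionary representation of an SST file are assumptions made for this sketch.

```python
def candidate_files(sst_files, target_key):
    """Keep only files whose smallest primary key does not exceed target_key.

    Each file is a dict with a key-sorted "records" list of (key, value),
    so the smallest key is the key of the first record.
    """
    return [f for f in sst_files
            if f["records"] and f["records"][0][0] <= target_key]


sst_files = [
    {"name": "a.sst", "records": [(10, "x"), (30, "y")]},
    {"name": "b.sst", "records": [(50, "z")]},
]
names = [f["name"] for f in candidate_files(sst_files, 20)]
```

This check only rules files out; a file that passes it may still contain no record for the target primary key.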

With reference to FIG. 10, the following describes in detail how the data to be processed corresponding to the target primary key is queried in the SST files to obtain the query result.

A query instruction, sent to the remote dictionary service by a client that implements the data query function, is obtained. The query instruction includes the target primary key. The remote dictionary service delivers the query instruction to the data management platform. The data management platform queries the data to be processed corresponding to the target primary key in the memory table of the memory cache and in the SST files of the disk storage space of the storage partition corresponding to the target primary key.

The root index block of an SST file stores the first key of each leaf index block of the SST file and the storage position of each leaf index block in the SST file. The root index block of the SST file is read to find the target leaf index block corresponding to the target primary key and the first storage position, that is, the position of the target leaf index block in the SST file. Specifically, the first storage position can be the offset of the target leaf index block in the SST file. The target leaf index block is then read starting from the first storage position to find the second storage position, that is, the position of the target data block corresponding to the target primary key. The second storage position is the starting position of the target data block in the SST file; specifically, it can be the offset of the target data block in the SST file. The data in the SST file is sorted in ascending order of primary key. Starting from the second storage position, the data to be processed whose primary key is the target primary key is read in order to generate the query result. In this way, the two-level index structure of root index block and leaf index blocks makes it possible to locate the data to be queried in the SST file quickly, and because the data is sorted by primary key, the data to be processed whose primary key is the target primary key can be read from a contiguous region of the SST file, which enables fast data queries.
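By way of illustration only, the two-level lookup through the root index block and a leaf index block can be sketched as follows. The in-memory dictionary layout of the SST file, the names used, and the assumption that all records of one primary key sit inside a single data block are made for this sketch only.

```python
from bisect import bisect_right


def last_not_after(entries, target_key):
    """Index of the last (first_key, location) entry with first_key <= target_key."""
    return bisect_right([first_key for first_key, _ in entries], target_key) - 1


def lookup(sst, target_key):
    i = last_not_after(sst["root"], target_key)
    if i < 0:
        return []                                  # target precedes every key in the file
    leaf = sst["leaves"][sst["root"][i][1]]        # first storage position -> leaf block
    j = last_not_after(leaf, target_key)
    block = sst["blocks"][leaf[j][1]]              # second storage position -> data block
    return [value for key, value in block if key == target_key]


sst = {
    "root": [(10, "leaf0"), (50, "leaf1")],
    "leaves": {"leaf0": [(10, 0), (30, 4096)], "leaf1": [(50, 8192)]},
    "blocks": {
        0: [(10, "a"), (20, "b"), (20, "c")],
        4096: [(30, "d"), (40, "e")],
        8192: [(50, "f"), (60, "g")],
    },
}
```

Only one leaf index block and one data block are touched per lookup, regardless of how many data blocks the file contains.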

As an example, the data management platform can create one iterator for each SST file that needs to be queried. The iterator locates the starting position of the data to be processed in that SST file whose primary key is the target primary key. The data management platform builds a min-heap to manage the iterators of the SST files. It should be noted that, if the data to be processed of the target primary key is queried in both the memory table and the SST files, the min-heap manages the iterators of the SST files together with the iterator of the memory table. Starting from the positions located by the iterators, the min-heap reads, in order, the data to be processed in the data blocks whose primary key is the target primary key, until it reads data whose primary key is not the target primary key. The data read is the data to be processed corresponding to the target primary key, that is, the query result. Creating a separate iterator for each SST file that needs to be queried allows the files to be queried in parallel and speeds up the query.
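By way of illustration only, reading one primary key from several sorted sources through a min-heap can be sketched as follows. The sketch uses Python's `heapq.merge`, which performs a heap-based k-way merge, in place of an explicitly managed heap; the function names are assumptions, and the sketch runs sequentially rather than in parallel.

```python
import heapq
from bisect import bisect_left


def seek(sorted_records, target_key):
    """Iterator positioned at the first record whose key is >= target_key."""
    keys = [key for key, _ in sorted_records]
    for i in range(bisect_left(keys, target_key), len(sorted_records)):
        yield sorted_records[i]


def read_target(sources, target_key):
    """Merge the positioned iterators and stop at the first non-matching key."""
    iterators = (seek(source, target_key) for source in sources)
    results = []
    for key, value in heapq.merge(*iterators, key=lambda record: record[0]):
        if key != target_key:
            break
        results.append(value)
    return results


memory_table = [(1, "m1"), (7, "m7"), (9, "m9")]
sst_a = [(5, "a5"), (7, "a7"), (7, "a7b")]
sst_b = [(7, "b7"), (8, "b8")]
found = read_target([memory_table, sst_a, sst_b], 7)
```

Because every source is sorted and every iterator starts at or after the target key, the merged stream yields all matching records first, so reading can stop at the first key that differs.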

In some possible implementations, the query instruction further includes a filter condition. The filter condition is a condition, in addition to the target primary key, that the data to be processed must satisfy. As an example, the filter condition is that the client version number is greater than a target version number. Before the query result is obtained, the data to be processed found by the query can be filtered by the filter condition to obtain filtered data that satisfies it. The filtered data is used as the query result. This makes the query more precise.
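By way of illustration only, applying a version-based filter condition to the queried data can be sketched as follows. The record field `client_version`, the dotted numeric version format, and the function names are assumptions made for this sketch.

```python
def parse_version(text):
    """Turn a dotted version such as '1.10.0' into a comparable tuple (1, 10, 0)."""
    return tuple(int(part) for part in text.split("."))


def version_greater_than(target_version):
    target = parse_version(target_version)
    return lambda record: parse_version(record["client_version"]) > target


def apply_filter(records, condition):
    return [record for record in records if condition(record)]


queried = [
    {"id": 1, "client_version": "1.9.0"},
    {"id": 2, "client_version": "1.10.0"},
    {"id": 3, "client_version": "2.0.1"},
]
filtered = apply_filter(queried, version_greater_than("1.9.5"))
```

Comparing parsed tuples rather than raw strings matters here: as strings, "1.10.0" would sort before "1.9.5".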

In a possible implementation, after the query result is obtained, a result file containing the query result can be output. As an example, the SST files are stored in HDFS, and the result file is output to HDFS. As an example, the query instruction further includes an output result path, which indicates the location to which the result file is output. Based on the output result path, the data management platform can output the result file containing the query result to the indicated location, so that the user can retrieve the query result.

Based on the data processing method provided by the above method embodiments, an embodiment of the present application further provides a data processing apparatus, which is described below with reference to the accompanying drawings.

Refer to FIG. 11, which is a schematic structural diagram of a data processing apparatus provided in an embodiment of the present application. As shown in FIG. 11, the data processing apparatus includes:

a first acquisition unit 1101, configured to acquire data to be processed;

a first storage unit 1102, configured to store the data to be processed into the memory cache of the storage partition corresponding to the primary key of the data to be processed;

a second storage unit 1103, configured to, in response to a storage condition being met, store at least part of the data to be processed held in the memory cache of the storage partition into the disk storage space of the storage partition.

In a possible implementation, the data to be processed is stored in a sorted string table file in the disk storage space. The sorted string table file includes multiple data blocks, each of which stores data to be processed in key-value pair format. The data to be processed within a data block is sorted by primary key, and the data blocks are ordered by the primary key of the first data to be processed stored in each data block.

In a possible implementation, the sorted string table file further includes an index block, which indicates the primary key of the first data to be processed stored in each data block and the storage position of each data block in the sorted string table file.

In a possible implementation, the index block includes multiple leaf index blocks and a root index block. Each leaf index block stores, for each of its corresponding data blocks, the primary key of the first data stored in that data block and the storage position of that data block in the sorted string table file. The root index block stores, for each leaf index block, the first primary key covered by that leaf index block and the storage position of that leaf index block in the sorted string table file.

In a possible implementation, the apparatus further includes:

A merging unit, configured to, in response to the number of sorted string table files that belong to the same storage partition and whose creation times fall within a preset time period being greater than or equal to a threshold, merge the sorted string table files whose creation times fall within the preset time period.
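As a non-limiting illustration, the merging condition can be sketched as follows. The names `SSTableFile` and `maybe_merge`, the half-open time window, and the choice to give the merged file the latest creation time among its inputs are assumptions of this sketch rather than details stated in the application.

```python
import heapq
from dataclasses import dataclass


@dataclass
class SSTableFile:
    created_at: float  # creation time, e.g. seconds since some epoch
    records: list      # (primary_key, value) pairs sorted by primary key


def maybe_merge(files, window_start, window_end, threshold):
    """Merge the files of one partition created in [window_start, window_end)
    when there are at least `threshold` of them; otherwise return `files`."""
    def in_window(f):
        return window_start <= f.created_at < window_end

    candidates = sorted((f for f in files if in_window(f)),
                        key=lambda f: f.created_at)
    if len(candidates) < threshold:
        return files

    # heapq.merge performs a k-way merge of already sorted inputs.
    merged_records = list(heapq.merge(*(f.records for f in candidates),
                                      key=lambda kv: kv[0]))
    merged = SSTableFile(created_at=candidates[-1].created_at,
                         records=merged_records)
    return [f for f in files if not in_window(f)] + [merged]
```

Merging reduces the number of files a disk query must open for a given partition, at the cost of rewriting the merged records once.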

In a possible implementation, the apparatus further includes:

A second acquisition unit, configured to acquire a query instruction, where the query instruction includes a target primary key;

A query unit, configured to query the memory cache of the storage partition corresponding to the target primary key for the data to be processed corresponding to the target primary key, and/or query the disk storage space of the storage partition corresponding to the target primary key for the data to be processed corresponding to the target primary key, to obtain a query result.

In a possible implementation, the query instruction further includes a storage time range. The query unit is configured to: if the storage time range is later than the flush time, query the memory cache of the storage partition corresponding to the target primary key for the data to be processed corresponding to the target primary key, where the flush time is the most recent time, relative to the current time, at which at least part of the data to be processed in the memory cache of that storage partition was stored to the disk storage space; if data to be processed corresponding to the target primary key is found, use the found data as the query result; and if no data to be processed corresponding to the target primary key is found, query the disk storage space of the storage partition corresponding to the target primary key for the data to be processed corresponding to the target primary key, to obtain the query result.
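As a non-limiting illustration, the decision made by the query unit can be sketched as follows. The dictionary layout of `partition`, the parameter `range_start`, and the behaviour when the storage time range is not later than the flush time (here, going straight to disk) are assumptions of this sketch; the disk lookup is simplified to a linear scan.

```python
def query(partition, target_key, range_start):
    """Hypothetical sketch of the memory-first query path.

    partition: dict with keys
        'mem_cache'       -> list of (primary_key, value) not yet flushed,
        'disk'            -> list of (primary_key, value) already flushed,
        'last_flush_time' -> time of the most recent flush of this partition.
    range_start: start of the storage time range carried by the query.
    """
    # Data stored after the last flush can only be in the memory cache.
    if range_start > partition["last_flush_time"]:
        hits = [v for k, v in partition["mem_cache"] if k == target_key]
        if hits:
            return hits
    # Either the range reaches back before the flush, or the cache had no hit.
    return [v for k, v in partition["disk"] if k == target_key]
```

Checking the memory cache first avoids disk reads for recent data, which is the case in which the comparison against the flush time succeeds.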

In a possible implementation, the data to be processed is stored in a sorted string table file in the disk storage space. The sorted string table file includes multiple data blocks, each of which stores data to be processed in key-value pair format. Within a data block, the data to be processed is sorted by primary key, and the data blocks are ordered by the primary key of the first piece of data stored in each block. The sorted string table file further includes multiple leaf index blocks and a root index block. Each leaf index block stores, for each of its corresponding data blocks, the primary key of the first piece of data stored in that data block and the storage position of that data block in the sorted string table file. The root index block stores, for each leaf index block, the first primary key stored in that leaf index block and the storage position of that leaf index block in the sorted string table file. The query unit being configured to query the disk storage space of the storage partition corresponding to the target primary key for the data to be processed corresponding to the target primary key, to obtain the query result, includes:

The query unit is configured to: query the root index block of the sorted string table file of the storage partition corresponding to the target primary key for a first storage position of the target leaf index block corresponding to the target primary key, where the first storage position is the starting position at which the target leaf index block is stored; based on the first storage position, query the target leaf index block for a second storage position of the target data block corresponding to the target primary key, where the second storage position is the starting position at which the target data block is stored; and, starting from the second storage position, sequentially read the data to be processed whose primary key is the target primary key.
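As a non-limiting illustration, the root-to-leaf-to-data-block lookup can be sketched as follows. The table literal `SSTABLE`, the helper `_floor_entry`, and the use of list indexes in place of byte offsets are assumptions of this sketch. The helper deliberately selects the last entry whose first key is strictly smaller than the target, so that records sharing the target primary key are still found when they begin at the end of one data block and continue into the next; this conservative choice is not stated in the application.

```python
from bisect import bisect_left

# Hypothetical table: positions are list indexes standing in for file offsets.
SSTABLE = {
    "data_blocks": [[("a", 1), ("b", 2)], [("b", 3), ("c", 4)], [("e", 5)]],
    "leaf_index": [[("a", 0), ("b", 1)], [("e", 2)]],
    "root_index": [("a", 0), ("e", 1)],
}


def _floor_entry(entries, target):
    """Position stored in the last entry whose first key is < target,
    or in the first entry when no such entry exists."""
    keys = [key for key, _ in entries]
    i = max(bisect_left(keys, target) - 1, 0)
    return entries[i][1]


def lookup(sstable, target):
    if not sstable["data_blocks"]:
        return []
    # First storage position: where the target leaf index block starts.
    leaf_pos = _floor_entry(sstable["root_index"], target)
    # Second storage position: where the target data block starts.
    block_pos = _floor_entry(sstable["leaf_index"][leaf_pos], target)

    hits = []
    # Read sequentially from the second storage position onward.
    for block in sstable["data_blocks"][block_pos:]:
        for key, value in block:
            if key == target:
                hits.append(value)
            elif key > target:
                return hits  # keys are sorted, so nothing further can match
    return hits
```

For example, `lookup(SSTABLE, "b")` collects the values of both records whose primary key is `"b"`, even though they sit in two adjacent data blocks.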

In a possible implementation, the query instruction further includes a filtering condition, and the apparatus further includes:

A filtering unit, configured to select, from the data to be processed corresponding to the target primary key obtained by the query, filtered data that satisfies the filtering condition;

The query unit being configured to obtain the query result includes:

The query unit is configured to use the filtered data as the query result.

In a possible implementation, the first acquisition unit 1101 is configured to acquire the data to be processed from a message queue, where the data to be processed is randomly distributed across multiple queue partitions of the message queue.

Based on the data processing method provided by the above method embodiments, the present application further provides an electronic device, including: one or more processors; and a storage apparatus storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the data processing method described in any of the above embodiments.

Reference is now made to FIG. 12, which shows a schematic structural diagram of an electronic device 1200 suitable for implementing embodiments of the present application. The terminal device in the embodiments of the present application may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (portable Android devices, i.e. tablet computers), PMPs (Portable Media Players), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs (televisions) and desktop computers. The electronic device shown in FIG. 12 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.

As shown in FIG. 12, the electronic device 1200 may include a processing apparatus (for example, a central processing unit or a graphics processing unit) 1201, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage apparatus 1208 into a random access memory (RAM) 1203. The RAM 1203 also stores various programs and data required for the operation of the electronic device 1200. The processing apparatus 1201, the ROM 1202, and the RAM 1203 are connected to one another via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.

Typically, the following apparatuses may be connected to the I/O interface 1205: an input apparatus 1206 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, or a gyroscope; an output apparatus 1207 including, for example, a liquid crystal display (LCD), a speaker, or a vibrator; a storage apparatus 1208 including, for example, a magnetic tape or a hard disk; and a communication apparatus 1209. The communication apparatus 1209 may allow the electronic device 1200 to communicate with other devices, wirelessly or by wire, to exchange data. Although FIG. 12 shows the electronic device 1200 with various apparatuses, it should be understood that it is not required to implement or include all of the apparatuses shown. More or fewer apparatuses may alternatively be implemented or included.

In particular, according to embodiments of the present application, the process described above with reference to the flowcharts may be implemented as a computer software program. For example, embodiments of the present application include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network through the communication apparatus 1209, installed from the storage apparatus 1208, or installed from the ROM 1202. When the computer program is executed by the processing apparatus 1201, the above functions defined in the method of the embodiments of the present application are performed.

The electronic device provided in this embodiment and the data processing method provided in the above embodiments belong to the same inventive concept. For technical details not described exhaustively in this embodiment, reference may be made to the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.

Based on the data processing method provided by the above method embodiments, an embodiment of the present application provides a computer storage medium storing a computer program which, when executed by a processor, implements the data processing method described in any of the above embodiments.

It should be noted that the computer-readable medium described above in the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in combination with, an instruction execution system, apparatus, or device. In the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: electric wires, optical cables, RF (radio frequency), or any suitable combination of the above.

In some implementations, the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.

The computer-readable medium may be included in the electronic device, or it may exist separately without being assembled into the electronic device.

The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to execute the data processing method described above.

Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present application may be implemented in software or in hardware. In some cases, the name of a unit/module does not constitute a limitation on the unit itself; for example, a voice data acquisition module may also be described as a "data acquisition module".

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.

In the context of the present application, a machine-readable medium may be a tangible medium that may contain or store a program for use by, or in combination with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present application, [Example 1] provides a data processing method, the method including:

acquiring data to be processed;

storing the data to be processed in the memory cache of the storage partition corresponding to the primary key of the data to be processed;

in response to a storage condition being met, storing at least part of the data to be processed held in the memory cache of the storage partition into the disk storage space of the storage partition.

According to one or more embodiments of the present application, [Example 2] provides a data processing method in which the data to be processed is stored in a sorted string table file in the disk storage space, the sorted string table file includes multiple data blocks, each data block stores data to be processed in key-value pair format, the data to be processed within a data block is sorted by primary key, and the data blocks are ordered by the primary key of the first piece of data stored in each data block.

According to one or more embodiments of the present application, [Example 3] provides a data processing method in which the sorted string table file further includes an index block, the index block indicating, for each data block, the primary key of the first piece of data stored in that data block and the storage position of that data block in the sorted string table file.

According to one or more embodiments of the present application, [Example 4] provides a data processing method in which the index block includes multiple leaf index blocks and a root index block, each leaf index block stores, for each of its corresponding data blocks, the primary key of the first piece of data stored in that data block and the storage position of that data block in the sorted string table file, and the root index block stores, for each leaf index block, the first primary key stored in that leaf index block and the storage position of that leaf index block in the sorted string table file.

According to one or more embodiments of the present application, [Example 5] provides a data processing method, the method further including:

in response to the number of sorted string table files that belong to the same storage partition and whose creation times fall within a preset time period being greater than or equal to a threshold, merging the sorted string table files whose creation times fall within the preset time period.

According to one or more embodiments of the present application, [Example 6] provides a data processing method, the method further including:

acquiring a query instruction, the query instruction including a target primary key;

querying the memory cache of the storage partition corresponding to the target primary key for the data to be processed corresponding to the target primary key, and/or querying the disk storage space of the storage partition corresponding to the target primary key for the data to be processed corresponding to the target primary key, to obtain a query result.

According to one or more embodiments of the present application, [Example 7] provides a data processing method in which the query instruction further includes a storage time range, and the querying of the memory cache of the storage partition corresponding to the target primary key, and/or of the disk storage space of the storage partition corresponding to the target primary key, for the data to be processed corresponding to the target primary key, to obtain a query result, includes:

if the storage time range is later than the flush time, querying the memory cache of the storage partition corresponding to the target primary key for the data to be processed corresponding to the target primary key, where the flush time is the most recent time, relative to the current time, at which at least part of the data to be processed in the memory cache of the storage partition corresponding to the target primary key was stored to the disk storage space;

if data to be processed corresponding to the target primary key is found, using the found data as the query result;

if no data to be processed corresponding to the target primary key is found, querying the disk storage space of the storage partition corresponding to the target primary key for the data to be processed corresponding to the target primary key, to obtain the query result.

According to one or more embodiments of the present application, [Example 8] provides a data processing method in which the data to be processed is stored in a sorted string table file in the disk storage space, the sorted string table file includes data blocks, each data block stores data to be processed in key-value pair format, the data to be processed within a data block is sorted by primary key, the data blocks are ordered by the primary key of the first piece of data stored in each data block, and the querying of the disk storage space of the storage partition corresponding to the target primary key for the data to be processed corresponding to the target primary key, to obtain a query result, includes:

querying the root index block of the sorted string table file of the storage partition corresponding to the target primary key for a first storage position of the target leaf index block corresponding to the target primary key, where the first storage position is the starting position at which the target leaf index block is stored, and the root index block stores the correspondence between the target primary key and the target leaf index block as well as the first storage position of the target leaf index block;

based on the first storage position, querying the target leaf index block for a second storage position of the target data block corresponding to the target primary key, where the second storage position is the starting position at which the target data block is stored, and the target leaf index block stores the correspondence between the target primary key and the target data block as well as the second storage position of the target data block;

starting from the second storage position, sequentially reading the data to be processed whose primary key is the target primary key.

According to one or more embodiments of the present application, [Example 9] provides a data processing method in which the query instruction further includes a filtering condition, and before the query result is obtained, the method further includes:

selecting, from the data to be processed corresponding to the target primary key obtained by the query, filtered data that satisfies the filtering condition;

the obtaining of the query result includes:

using the filtered data as the query result.

According to one or more embodiments of the present application, [Example 10] provides a data processing method in which the acquiring of the data to be processed includes:

acquiring the data to be processed from a message queue, where the data to be processed is randomly distributed across multiple queue partitions of the message queue.

According to one or more embodiments of the present application, [Example 11] provides a data processing apparatus, the apparatus including:

a first acquisition unit, configured to acquire data to be processed;

a first storage unit, configured to store the data to be processed in the memory cache of the storage partition corresponding to the primary key of the data to be processed;

a second storage unit, configured to, in response to a storage condition being met, store at least part of the data to be processed held in the memory cache of the storage partition into the disk storage space of the storage partition.

根据本申请的一个或多个实施例,【示例十二】提供了一种数据处理装置,所述待处理数据存储在磁盘储存空间的排序字符串表文件中,所述排序字符串表文件包括多个数据块,每个所述数据块用于以键值对的格式存储待处理数据,所述数据块内存储的所述待处理数据按照主键排序,多个所述数据块之间按照各个数据块存储的首个待处理数据的主键进行排序。According to one or more embodiments of the present application, [Example 12] provides a data processing device, wherein the data to be processed is stored in a sorted string table file in a disk storage space, the sorted string table file includes multiple data blocks, each of the data blocks is used to store the data to be processed in a key-value pair format, the data to be processed stored in the data block is sorted according to a primary key, and the multiple data blocks are sorted according to the primary key of the first data to be processed stored in each data block.

根据本申请的一个或多个实施例,【示例十三】提供了一种数据处理装置,所述排序字符串表文件还包括索引块,所述索引块用于指示各个数据块存储的首个待处理数据的主键以及所述数据块在所述排序字符串表文件中的存储位置。According to one or more embodiments of the present application, [Example 13] provides a data processing device, wherein the sorted string table file also includes an index block, which is used to indicate the primary key of the first data to be processed stored in each data block and the storage position of the data block in the sorted string table file.

根据本申请的一个或多个实施例,【示例十四】提供了一种数据处理装置,所述索引块包括多个叶索引块和根索引块,各个所述叶索引块存储所对应的各个数据块存储的首个数据的主键以及对应的所述各个数据块在所述排序字符串表文件中的存储位置,所述根索引块存储各个叶索引块存储的数据的首个主键以及所述叶索引块在所述排序字符串表文件中的存储位置。According to one or more embodiments of the present application, [Example 14] provides a data processing device, wherein the index block includes multiple leaf index blocks and a root index block, each leaf index block stores the primary key of the first data stored in the corresponding data block and the storage position of the corresponding data block in the sorted string table file, and the root index block stores the first primary key of the data stored in each leaf index block and the storage position of the leaf index block in the sorted string table file.

According to one or more embodiments of the present application, [Example 15] provides a data processing apparatus, the apparatus further comprising:

a merging unit, configured to merge the sorted string table files whose creation time falls within a preset time period, in response to the number of such files belonging to the same storage partition being greater than or equal to a threshold.
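As a non-limiting illustration, the merge trigger of Example 15 may be sketched as follows. The sketch assumes that the preset time period is a fixed-length window obtained by integer division of the creation time, and that each file is an already sorted list of key-value pairs; the names `files_to_merge` and `merge_sorted_runs` are hypothetical.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

FileInfo = Tuple[int, int, str]   # (storage partition, creation time, file name)
Record = Tuple[str, str]          # (primary key, value)


def files_to_merge(files: List[FileInfo], window_seconds: int,
                   threshold: int) -> Dict[Tuple[int, int], List[str]]:
    # Group files by (storage partition, time window of the creation time).
    groups: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for partition, created_at, name in files:
        groups[(partition, created_at // window_seconds)].append(name)
    # Only groups whose file count reaches the threshold are merged.
    return {key: names for key, names in groups.items()
            if len(names) >= threshold}


def merge_sorted_runs(runs: List[List[Record]]) -> List[Record]:
    # Merge several runs, each sorted by primary key, into one sorted run.
    return list(heapq.merge(*runs, key=lambda r: r[0]))


candidates = files_to_merge(
    [(0, 10, "a"), (0, 20, "b"), (0, 70, "c"), (1, 15, "d"), (0, 30, "e")],
    window_seconds=60, threshold=3)
print(candidates)
```

Merging reduces the number of files that a later query in the same storage partition has to consult.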

According to one or more embodiments of the present application, [Example 16] provides a data processing apparatus, the apparatus further comprising:

a second acquisition unit, configured to acquire a query instruction, the query instruction including a target primary key; and

a query unit, configured to query the data to be processed corresponding to the target primary key in the memory cache of the storage partition corresponding to the target primary key, and/or in the disk storage space of that storage partition, to obtain a query result.

According to one or more embodiments of the present application, [Example 17] provides a data processing apparatus in which the query instruction further includes a storage time range, and the query unit being configured to query the data to be processed corresponding to the target primary key in at least one of the memory cache and the disk storage space of the storage partition corresponding to the target primary key, to obtain a query result, includes:

the query unit being configured to: if the storage time range is later than the flush time, query the data to be processed corresponding to the target primary key in the memory cache of the storage partition corresponding to the target primary key, where the flush time is the most recent time, relative to the current time, at which at least part of the data to be processed in the memory cache of that storage partition was stored to the disk storage space; and, if no data to be processed corresponding to the target primary key is found, query the data to be processed corresponding to the target primary key in the disk storage space of that storage partition, to obtain the query result.
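As a non-limiting illustration, the lookup order of Example 17 may be sketched as follows. The sketch makes two assumptions that the present application does not state: "the storage time range is later than the flush time" is read as the start of the range being later than the flush time, and a range that is not later than the flush time is served from the disk storage space directly. The dictionary layout and the name `query` are hypothetical.

```python
from typing import Dict, List


def query(partition: Dict, target_key: str, range_start: int) -> List[str]:
    # 'partition' holds the state of one storage partition:
    #   last_flush_time: time of the most recent flush to disk
    #   memory_cache:    primary key -> values not yet flushed
    #   disk:            primary key -> values already flushed
    if range_start > partition["last_flush_time"]:
        # Data stored after the last flush can only be in the memory cache.
        hit = partition["memory_cache"].get(target_key)
        if hit:
            return hit
    # Not found in memory (or the range predates the flush): read from disk.
    return partition["disk"].get(target_key, [])


state = {
    "last_flush_time": 100,
    "memory_cache": {"k1": ["new"]},
    "disk": {"k1": ["old"], "k2": ["x"]},
}
print(query(state, "k1", range_start=150))
```

Checking the memory cache first avoids a disk read whenever the requested data is recent enough to be resident in memory.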

According to one or more embodiments of the present application, [Example 18] provides a data processing apparatus in which the data to be processed is stored in a sorted string table file in the disk storage space. The sorted string table file includes data blocks, each of which stores data to be processed as key-value pairs. Within a data block, the data to be processed is sorted by primary key, and the data blocks are ordered by the primary key of the first item of data to be processed stored in each data block. The query unit is configured to:

query, in the root index block of the sorted string table file of the storage partition corresponding to the target primary key, a first storage position of the target leaf index block corresponding to the target primary key, where the first storage position is the starting position at which the target leaf index block is stored, and the root index block stores the correspondence between the target primary key and the target leaf index block together with the first storage position of the target leaf index block;

query, based on the first storage position, a second storage position of the target data block corresponding to the target primary key in the target leaf index block, where the second storage position is the starting position at which the target data block is stored, and the target leaf index block stores the correspondence between the target primary key and the target data block together with the second storage position of the target data block; and

read in sequence, starting from the second storage position, the data to be processed whose primary key is the target primary key.
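As a non-limiting illustration, the three-step lookup of Example 18 may be sketched as follows. The sketch assumes that index entries are pairs of a first primary key and a position, that positions are list indexes rather than file offsets, and that several records may share one primary key; the names `find_slot` and `lookup` are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple

Record = Tuple[str, str]        # (primary key, value)
IndexEntry = Tuple[str, int]    # (first primary key, storage position)


def find_slot(entries: List[IndexEntry], key: str) -> int:
    # Position of the last entry whose first key is strictly smaller than
    # 'key' (or of the first entry if there is none). Starting one entry
    # early keeps records that share 'key' across a block boundary.
    i = bisect_left([first for first, _ in entries], key) - 1
    return entries[max(i, 0)][1]


def lookup(root_index: List[IndexEntry], leaf_blocks: List[List[IndexEntry]],
           data_blocks: List[List[Record]], target_key: str) -> List[str]:
    if not root_index:
        return []
    leaf_pos = find_slot(root_index, target_key)               # step 1
    block_pos = find_slot(leaf_blocks[leaf_pos], target_key)   # step 2
    values: List[str] = []
    for block in data_blocks[block_pos:]:                      # step 3
        for key, value in block:
            if key == target_key:
                values.append(value)
            elif key > target_key:
                return values    # records are sorted, nothing further matches
    return values


data_blocks = [[("k1", "a"), ("k2", "b1")], [("k2", "b2"), ("k3", "c")],
               [("k5", "e")]]
leaf_blocks = [[("k1", 0), ("k2", 1)], [("k5", 2)]]
root_index = [("k1", 0), ("k5", 1)]
print(lookup(root_index, leaf_blocks, data_blocks, "k2"))
```

Because both index levels and the data blocks are sorted, each step narrows the search and the final scan stops at the first primary key greater than the target.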

According to one or more embodiments of the present application, [Example 19] provides a data processing apparatus in which the query instruction further includes a filtering condition, the apparatus further comprising:

a filtering unit, configured to select, from the queried data to be processed corresponding to the target primary key, the filtered data that satisfies the filtering condition;

wherein the query unit being configured to obtain the query result includes:

the query unit being configured to take the filtered data as the query result.

According to one or more embodiments of the present application, [Example 20] provides a data processing apparatus in which the first acquisition unit is configured to acquire the data to be processed from a message queue, the data to be processed being randomly distributed among a plurality of queue partitions of the message queue.
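As a non-limiting illustration, the path from the message queue to the storage partitions may be sketched as follows. The sketch assumes that the storage partition is chosen by a CRC32 hash of the primary key and that the storage condition is a record-count threshold on the memory cache; neither choice is specified by the present application, and the names `PartitionedStore` and `ingest` are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Iterable, List, Tuple

Record = Tuple[str, int]   # (primary key, value)


class PartitionedStore:
    def __init__(self, num_partitions: int, flush_threshold: int) -> None:
        self.num_partitions = num_partitions
        self.flush_threshold = flush_threshold
        self.memory_cache = defaultdict(list)   # partition -> pending records
        self.disk = defaultdict(list)           # partition -> flushed sorted runs

    def partition_of(self, primary_key: str) -> int:
        # Deterministic hash, so one primary key always maps to one partition.
        return zlib.crc32(primary_key.encode("utf-8")) % self.num_partitions

    def put(self, primary_key: str, value: int) -> None:
        p = self.partition_of(primary_key)
        self.memory_cache[p].append((primary_key, value))
        # Assumed storage condition: the cache holds enough records.
        if len(self.memory_cache[p]) >= self.flush_threshold:
            self.flush(p)

    def flush(self, p: int) -> None:
        # Write the cached records to disk as one run sorted by primary key.
        self.disk[p].append(sorted(self.memory_cache[p], key=lambda r: r[0]))
        self.memory_cache[p] = []


def ingest(queue_partitions: Iterable[List[Record]],
           store: PartitionedStore) -> None:
    # Records may sit in any queue partition; routing by primary key
    # happens only when they are written to the store.
    for messages in queue_partitions:
        for primary_key, value in messages:
            store.put(primary_key, value)


store = PartitionedStore(num_partitions=1, flush_threshold=2)
ingest([[("b", 1)], [("a", 2), ("c", 3)]], store)
print(store.disk[0], store.memory_cache[0])
```

Under these assumptions, records with the same primary key end up in the same storage partition regardless of which queue partition delivered them.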

According to one or more embodiments of the present application, [Example 21] provides an electronic device, comprising:

one or more processors; and

a storage apparatus on which one or more programs are stored,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of [Example 1] to [Example 10].

According to one or more embodiments of the present application, [Example 22] provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of [Example 1] to [Example 10].

It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the others, and for parts that are the same or similar the embodiments may be referred to one another. Because the system or apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.

It should be understood that, in the present application, "at least one (item)" means one or more, and "a plurality of" means two or more. "And/or" describes the relationship between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean that only A exists, that only B exists, or that both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of those items, including any combination of single or plural items. For example, at least one of a, b or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b and c may be single or multiple.

It should also be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article or device. Absent further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Accordingly, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A data processing method, the method comprising:
acquiring data to be processed;
storing the data to be processed into a memory cache of a storage partition corresponding to a primary key of the data to be processed; and
in response to a storage condition being met, storing at least part of the data to be processed stored in the memory cache of the storage partition into a disk storage space of the storage partition.
2. The method of claim 1, wherein the data to be processed is stored in a sorted string table file in the disk storage space, the sorted string table file comprising a plurality of data blocks, each of the data blocks being configured to store the data to be processed in a key-value pair format, the data to be processed stored in each data block being sorted by primary key, and the plurality of data blocks being ordered by the primary key of the first data to be processed stored in each data block.
3. The method of claim 2, wherein the sorted string table file further comprises an index block for indicating the primary key of the first data to be processed stored in each data block and the storage position of the data block in the sorted string table file.
4. The method of claim 3, wherein the index block comprises a plurality of leaf index blocks and a root index block, each of the leaf index blocks storing the primary key of the first data to be processed stored in each corresponding data block and the storage position of each corresponding data block in the sorted string table file, and the root index block storing the first primary key stored in each leaf index block and the storage position of the leaf index block in the sorted string table file.
5. The method of claim 2, wherein the method further comprises:
in response to the number of sorted string table files that belong to the same storage partition and whose creation time falls within a preset time period being greater than or equal to a threshold, merging the sorted string table files whose creation time falls within the preset time period.
6. The method of claim 1, wherein the method further comprises:
acquiring a query instruction, wherein the query instruction comprises a target primary key; and
querying the data to be processed corresponding to the target primary key in the memory cache of the storage partition corresponding to the target primary key, and/or querying the data to be processed corresponding to the target primary key in the disk storage space of the storage partition corresponding to the target primary key, to obtain a query result.
7. The method of claim 6, wherein the query instruction further comprises a storage time range, and the querying the data to be processed corresponding to the target primary key in the memory cache of the storage partition corresponding to the target primary key, and/or querying the data to be processed corresponding to the target primary key in the disk storage space of the storage partition corresponding to the target primary key, to obtain a query result, comprises:
if the storage time range is later than a flush time, querying the data to be processed corresponding to the target primary key in the memory cache of the storage partition corresponding to the target primary key, wherein the flush time is the time, closest to the current time, at which at least part of the data to be processed in the memory cache of the storage partition corresponding to the target primary key was stored into the disk storage space;
if the data to be processed corresponding to the target primary key is found, taking the found data to be processed corresponding to the target primary key as the query result; and
if the data to be processed corresponding to the target primary key is not found, querying the data to be processed corresponding to the target primary key in the disk storage space of the storage partition corresponding to the target primary key, to obtain the query result.
8. The method of claim 6 or 7, wherein the data to be processed is stored in a sorted string table file in the disk storage space, the sorted string table file comprises data blocks, each data block is configured to store the data to be processed in a key-value pair format, the data to be processed stored in each data block is sorted by primary key, a plurality of the data blocks are ordered by the primary key of the first data to be processed stored in each data block, and the querying the data to be processed corresponding to the target primary key in the disk storage space of the storage partition corresponding to the target primary key, to obtain a query result, comprises:
querying, in a root index block of the sorted string table file of the storage partition corresponding to the target primary key, a first storage position of a target leaf index block corresponding to the target primary key, wherein the first storage position is the starting position at which the target leaf index block is stored, and the root index block stores the correspondence between the target primary key and the target leaf index block and the first storage position of the target leaf index block;
querying, based on the first storage position, a second storage position of a target data block corresponding to the target primary key in the target leaf index block, wherein the second storage position is the starting position at which the target data block is stored, and the target leaf index block stores the correspondence between the target primary key and the target data block and the second storage position of the target data block; and
reading in sequence, starting from the second storage position, the data to be processed whose primary key is the target primary key.
9. The method of claim 6 or 7, wherein the query instruction further comprises a filtering condition, and the method further comprises, prior to the obtaining the query result:
selecting, from the queried data to be processed corresponding to the target primary key, filtered data that satisfies the filtering condition;
wherein the obtaining the query result comprises:
taking the filtered data as the query result.
10. The method of any one of claims 1-7, wherein the acquiring data to be processed comprises:
acquiring the data to be processed from a message queue, wherein the data to be processed is randomly distributed among a plurality of queue partitions of the message queue.
11. A data processing apparatus, the apparatus comprising:
a first acquisition unit, configured to acquire data to be processed;
a first storage unit, configured to store the data to be processed into a memory cache of a storage partition corresponding to a primary key of the data to be processed; and
a second storage unit, configured to store, in response to a storage condition being met, at least part of the data to be processed stored in the memory cache of the storage partition into a disk storage space of the storage partition.
12. An electronic device, comprising:
one or more processors; and
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-10.
13. A computer readable medium, characterized in that a computer program is stored thereon, wherein the program, when executed by a processor, implements the method according to any of claims 1-10.
CN202310175932.9A 2023-02-27 2023-02-27 A data processing method, device, equipment and storage medium Pending CN118550447A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310175932.9A CN118550447A (en) 2023-02-27 2023-02-27 A data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310175932.9A CN118550447A (en) 2023-02-27 2023-02-27 A data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118550447A true CN118550447A (en) 2024-08-27

Family

ID=92454667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310175932.9A Pending CN118550447A (en) 2023-02-27 2023-02-27 A data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118550447A (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination