[go: up one dir, main page]

CN116028506A - Hash connection method, device, equipment and medium - Google Patents

Hash connection method, device, equipment and medium Download PDF

Info

Publication number
CN116028506A
CN116028506A CN202310160998.0A CN202310160998A CN116028506A CN 116028506 A CN116028506 A CN 116028506A CN 202310160998 A CN202310160998 A CN 202310160998A CN 116028506 A CN116028506 A CN 116028506A
Authority
CN
China
Prior art keywords
data
hash
hash table
target
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310160998.0A
Other languages
Chinese (zh)
Other versions
CN116028506B (en
Inventor
王帅
王雄儒
于帆
赵鑫鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Science Research Institute Co Ltd
Original Assignee
Shandong Inspur Science Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Science Research Institute Co Ltd filed Critical Shandong Inspur Science Research Institute Co Ltd
Priority to CN202310160998.0A priority Critical patent/CN116028506B/en
Publication of CN116028506A publication Critical patent/CN116028506A/en
Application granted granted Critical
Publication of CN116028506B publication Critical patent/CN116028506B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a hash connection method, a hash connection device, hash connection equipment and a hash connection medium, which relate to the field of hardware acceleration of database query operation and comprise the following steps: acquiring first data of a first data tuple to be connected and second data of a second data tuple to be connected, and respectively calculating the first data and the second data by using a cuckoo algorithm to obtain a hash result of the first data and a hash result of the second data; determining a first target hash table corresponding to the hash result of the first data and a second target hash table corresponding to the hash result of the second data; dividing the first data and the second data into a plurality of groups of first sub data and second sub data, and respectively storing the first sub data and the second sub data into a first target hash table and a second target hash table; reading the first sub data and the second sub data from the first hash target table and the second hash target table, and merging and connecting the first sub data and the second sub data meeting the equivalent condition to obtain connected data. And the hash connection speed is improved.

Description

哈希连接方法、装置、设备及介质Hash connection method, device, equipment and medium

技术领域technical field

本发明涉及数据库查询操作的硬件加速领域,特别涉及哈希连接方法、装置、设备及介质。The invention relates to the field of hardware acceleration of database query operations, in particular to a hash connection method, device, equipment and media.

背景技术Background technique

数据库查询操作中,哈希连接算法是数据库中应用最为广泛的连接算法之一,可实现低时间复杂度和低功耗的连接计算。哈希连接算法分为构建和探测两个阶段,在构建阶段所有数据计算完成后,再进行探测阶段的数据计算。In database query operations, the hash join algorithm is one of the most widely used join algorithms in databases, which can realize join calculations with low time complexity and low power consumption. The hash join algorithm is divided into two phases: construction and detection. After all data calculations in the construction phase are completed, the data calculation in the detection phase is performed.

常用的基于FPGA(Field Programmable Gate Array,即现场可编程逻辑门阵列)的哈希连接加速方式,主要通过解决哈希冲突的实现方式来实现更优的加速方式,但是这些方式都无法实现探测阶段的大量数据的并行处理。现有探测阶段的并行数据处理,仅能通过划分哈希连接实现,计算复杂,且并行度和可扩展性较差,因此哈希连接速度较低。The commonly used hash connection acceleration method based on FPGA (Field Programmable Gate Array, that is, Field Programmable Logic Gate Array) mainly achieves a better acceleration method by resolving hash conflicts, but these methods cannot achieve the detection stage. Parallel processing of large amounts of data. The parallel data processing in the existing detection stage can only be realized by dividing the hash join, which is complex in calculation, and has poor parallelism and scalability, so the speed of the hash join is low.

综上可见,如何提高哈希连接的速度是本领域有待解决的问题。To sum up, how to increase the speed of the hash join is a problem to be solved in this field.

发明内容Contents of the invention

有鉴于此,本发明的目的在于提供一种哈希连接方法、装置、设备及介质,能够提高哈希连接的速度。其具体方案如下:In view of this, the object of the present invention is to provide a hash join method, device, equipment and medium, which can increase the speed of hash join. The specific plan is as follows:

第一方面,本申请公开了一种哈希连接方法,包括:In the first aspect, the application discloses a hash connection method, including:

获取第一待连接数据元组的第一数据和第二待连接数据元组的第二数据,并利用布谷鸟算法中第一哈希函数和第二哈希函数分别对所述第一数据和所述第二数据进行计算,以得到所述第一数据的哈希结果以及所述第二数据的哈希结果;Obtain the first data of the first data tuple to be connected and the second data of the second data tuple to be connected, and use the first hash function and the second hash function in the cuckoo algorithm to respectively process the first data and the second data performing calculations on the second data to obtain a hash result of the first data and a hash result of the second data;

确定出与所述第一数据的哈希结果对应的第一目标哈希表以及与所述第二数据的哈希结果对应的第二目标哈希表;determining a first target hash table corresponding to the hash result of the first data and a second target hash table corresponding to the hash result of the second data;

分别将所述第一数据和所述第二数据分成若干组第一子数据和第二子数据,并分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表;respectively dividing the first data and the second data into several groups of first sub-data and second sub-data, and respectively storing the first sub-data and the second sub-data in the first target hash table and said second target hash table;

从所述第一哈目标希表和所述第二目标哈希表中读取所述第一子数据和所述第二子数据,并将满足等值条件的所述第一子数据和所述第二子数据进行归并连接,以得到连接后数据。Read the first sub-data and the second sub-data from the first hash table and the second target hash table, and store the first sub-data and the second sub-data that satisfy the equivalence condition Merge and join the second sub-data to obtain the joined data.

可选的,所述获取第一待连接数据元组的第一数据和第二待连接数据元组的第二数据,包括:Optionally, the acquiring the first data of the first data tuple to be connected and the second data of the second data tuple to be connected includes:

读取外部存储器中第一待连接数据元组和第二待连接数据元组,以得到所述第一待连接数据元组的第一数据和所述第二待连接数据元组的第二数据;其中,所述外部存储器为双倍速率同步动态随机存储器、同步动态随机存取存储器中的任意一种存储器。Reading the first data tuple to be connected and the second data tuple to be connected in the external memory to obtain the first data of the first data tuple to be connected and the second data of the second data tuple to be connected ; Wherein, the external memory is any one of double-rate synchronous dynamic random access memory and synchronous dynamic random access memory.

可选的,所述利用布谷鸟算法中第一哈希函数和第二哈希函数分别对所述第一数据和所述第二数据进行计算,以得到所述第一数据的哈希结果以及所述第二数据的哈希结果,包括:Optionally, using the first hash function and the second hash function in the cuckoo algorithm to calculate the first data and the second data respectively, so as to obtain a hash result of the first data and The hash result of the second data includes:

利用布谷鸟算法中第一哈希函数和第二哈希函数分别对所述第一数据和所述第二数据进行计算,以得到所述第一数据的第一哈希结果和第二哈希结果以及所述第二数据的第一哈希结果和第二哈希结果;Using the first hash function and the second hash function in the cuckoo algorithm to calculate the first data and the second data respectively, so as to obtain the first hash result and the second hash of the first data a result and a first hash result and a second hash result of said second data;

相应的,所述确定出与所述第一数据的哈希结果对应的第一目标哈希表以及与所述第二数据的哈希结果对应的第二目标哈希表,包括:Correspondingly, the determining the first target hash table corresponding to the hash result of the first data and the second target hash table corresponding to the hash result of the second data includes:

确定出与所述第一数据的第一哈希结果和第二哈希结果分别对应的第一哈希表地址、第二哈希表地址以及与所述第二数据的第一哈希结果和第二哈希结果分别对应的第三哈希表地址、第四哈希表地址;determining a first hash table address and a second hash table address corresponding to the first hash result and the second hash result of the first data, and the first hash result and The third hash table address and the fourth hash table address respectively corresponding to the second hash result;

从所述第一哈希表地址、所述第二哈希表地址中确定出第一目标哈希表地址,并从所述第三哈希表地址、所述第四哈希表地址中确定出第二目标哈希表地址;Determine the first target hash table address from the first hash table address and the second hash table address, and determine from the third hash table address and the fourth hash table address Output the address of the second target hash table;

可选的,所述分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表,包括:Optionally, storing the first sub-data and the second sub-data in the first target hash table and the second target hash table respectively includes:

存储所述第一子数据至与所述第一目标哈希表地址对应的哈希表内,并存储所述第二子数据至与所述第二目标哈希表地址对应的哈希表内。storing the first sub-data in a hash table corresponding to the first target hash table address, and storing the second sub-data in a hash table corresponding to the second target hash table address .

可选的,所述从所述第一哈希表地址、所述第二哈希表地址中确定出第一目标哈希表地址,并从所述第三哈希表地址、所述第四哈希表地址中确定出第二目标哈希表地址,包括:Optionally, determining the first target hash table address from the first hash table address and the second hash table address, and determining the first target hash table address from the third hash table address, the fourth hash table address The second target hash table address is determined in the hash table address, including:

判断与所述第一哈希表地址、所述第二哈希表地址分别对应的哈希表中是否存在空闲哈希桶,若存在则将与所述第一哈希表地址、所述第二哈希表地址分别对应的哈希表中存在空闲哈希桶数量最大的哈希表的地址确定为第一目标哈希表地址,将另一哈希表的地址确定为第一目标哈希表备用地址;Judging whether there is an idle hash bucket in the hash table corresponding to the first hash table address and the second hash table address respectively, if there is, it will be combined with the first hash table address, the second hash table address The address of the hash table with the largest number of free hash buckets in the hash tables corresponding to the two hash table addresses is determined as the first target hash table address, and the address of the other hash table is determined as the first target hash table table alternate address;

判断与所述第三哈希表地址、所述第四哈希表地址分别对应的哈希表中是否存在空闲哈希桶,若存在则将与所述第三哈希表地址、所述第四哈希表地址分别对应的哈希表中存在空闲哈希桶数量最大的哈希表确定为第二目标哈希表地址,将另一哈希表的地址确定为第二目标哈希表备用地址;Judging whether there is a free hash bucket in the hash table corresponding to the third hash table address and the fourth hash table address respectively, if there is, it will be combined with the third hash table address and the fourth hash table address The hash table with the largest number of free hash buckets in the hash tables corresponding to the addresses of the four hash tables is determined as the second target hash table address, and the address of the other hash table is determined as the second target hash table for backup address;

相应的,所述分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表,包括:Correspondingly, storing the first sub-data and the second sub-data in the first target hash table and the second target hash table respectively includes:

将所述第一子数据和第一目标哈希表备用地址存储至所述第一目标哈希表的空闲哈希桶中,并将所述第二子数据和第二目标哈希表备用地址存储至所述第二目标哈希表的空闲哈希桶中。storing the first sub-data and the standby address of the first target hash table into an idle hash bucket of the first target hash table, and storing the second sub-data and the standby address of the second target hash table stored in an idle hash bucket of the second target hash table.

可选的,所述判断与所述第一哈希表地址、所述第二哈希表地址分别对应的哈希表中是否存在空闲哈希桶之后,还包括:Optionally, after the judging whether there is an idle hash bucket in the hash table respectively corresponding to the first hash table address and the second hash table address, the method further includes:

若不存在,则查找与所述第一哈希表地址、所述第二哈希表地址分别对应的哈希表中哈希桶内存储数据的备用地址,基于所述备用地址判断是否存在满足预设条件的哈希桶;If it does not exist, then look up the backup address of the data stored in the hash bucket in the hash table corresponding to the first hash table address and the second hash table address respectively, and judge whether there is a satisfying condition based on the backup address. Hash buckets with preset conditions;

若存在,将满足预设条件的哈希桶内存储的数据转存至与所述备用地址对应的哈希表中的哈希桶内,以便将所述第一子数据存储到满足预设条件的哈希桶;若不存在,将第一子数据存储到所述第一目标哈希表对应的链表中。If it exists, transfer the data stored in the hash bucket that meets the preset condition to the hash bucket in the hash table corresponding to the standby address, so that the first sub-data is stored until the preset condition is met hash bucket; if it does not exist, store the first sub-data in the linked list corresponding to the first target hash table.

可选的,所述分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表之后,还包括:Optionally, after storing the first sub-data and the second sub-data respectively in the first target hash table and the second target hash table, further comprising:

记录所述第一目标哈希表和所述第二目标哈希表中存储的数据的总量。Record the total amount of data stored in the first target hash table and the second target hash table.

第二方面,本申请公开了一种哈希连接装置,包括:In a second aspect, the present application discloses a hash connection device, including:

哈希结果获取模块,用于获取第一待连接数据元组的第一数据和第二待连接数据元组的第二数据,并利用布谷鸟算法中第一哈希函数和第二哈希函数分别对所述第一数据和所述第二数据进行计算,以得到所述第一数据的哈希结果以及所述第二数据的哈希结果;The hash result acquisition module is used to acquire the first data of the first data tuple to be connected and the second data of the second data tuple to be connected, and utilize the first hash function and the second hash function in the cuckoo algorithm performing calculations on the first data and the second data respectively to obtain a hash result of the first data and a hash result of the second data;

哈希表确定模块,用于确定出与所述第一数据的哈希结果对应的第一目标哈希表以及与所述第二数据的哈希结果对应的第二目标哈希表;a hash table determining module, configured to determine a first target hash table corresponding to the hash result of the first data and a second target hash table corresponding to the hash result of the second data;

存储模块,用于分别将所述第一数据和所述第二数据分成若干组第一子数据和第二子数据,并分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表;a storage module, configured to divide the first data and the second data into several groups of first sub-data and second sub-data, and respectively store the first sub-data and the second sub-data in the a first target hash table and said second target hash table;

归并连接模块,用于从所述第一哈目标希表和所述第二目标哈希表中读取所述第一子数据和所述第二子数据,并将满足等值条件的所述第一子数据和所述第二子数据进行归并连接,以得到连接后数据。a merge connection module, configured to read the first sub-data and the second sub-data from the first hash table and the second target hash table, and combine the The first sub-data and the second sub-data are merged and joined to obtain the joined data.

第三方面,本申请公开了一种电子设备,包括:In a third aspect, the present application discloses an electronic device, comprising:

存储器,用于保存计算机程序;memory for storing computer programs;

处理器,用于执行所述计算机程序,以实现前述公开的哈希连接方法的步骤。A processor, configured to execute the computer program, so as to realize the steps of the hash connection method disclosed above.

第四方面,本申请公开了一种计算机可读存储介质,用于存储计算机程序;其中,所述计算机程序被处理器执行时实现前述公开的哈希连接方法的步骤。In a fourth aspect, the present application discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, the steps of the hash connection method disclosed above are implemented.

可见,本申请获取第一待连接数据元组的第一数据和第二待连接数据元组的第二数据,并利用布谷鸟算法中第一哈希函数和第二哈希函数分别对所述第一数据和所述第二数据进行计算,以得到所述第一数据的哈希结果以及所述第二数据的哈希结果;确定出与所述第一数据的哈希结果对应的第一目标哈希表以及与所述第二数据的哈希结果对应的第二目标哈希表;分别将所述第一数据和所述第二数据分成若干组第一子数据和第二子数据,并分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表;从所述第一哈目标希表和所述第二目标哈希表中读取所述第一子数据和所述第二子数据,并将满足等值条件的所述第一子数据和所述第二子数据进行归并连接,以得到连接后数据。由此可见,本申请在构建阶段实现对第一待连接数据元组的第一数据和第二待连接数据元组的第二数据进行哈希计算与存储,以便后续可以将满足等值条件的第一子数据和第二子数据进行归并连接,以得到连接后数据,消除了探测阶段每次只能处理一个数据元组连接的限制,即可以实现并行哈希计算、存储以及归并连接,因此提高了哈希连接的速度。It can be seen that the present application obtains the first data of the first data tuple to be connected and the second data of the second data tuple to be connected, and utilizes the first hash function and the second hash function in the cuckoo algorithm to respectively Perform calculations on the first data and the second data to obtain a hash result of the first data and a hash result of the second data; determine the first data corresponding to the hash result of the first data a target hash table and a second target hash table corresponding to the hash result of the second data; respectively dividing the first data and the second data into several groups of first sub-data and second sub-data, And storing the first sub-data and the second sub-data to the first target hash table and the second target hash table respectively; from the first hash target hash table and the second target The first sub-data and the second sub-data are read from the hash table, and the first sub-data and the second sub-data satisfying the equivalence condition are merged and connected to obtain the connected data. It can be seen that, in the construction stage, the present application implements hash calculation and storage of the first data of the first data tuple to be connected and the second data of the second data tuple to be connected, so that the subsequent The first sub-data and the second sub-data are merged and connected to obtain the connected data, which eliminates the limitation that only one data tuple connection can be processed at a time in the detection stage, that is, parallel hash calculation, storage and merge connection can be realized, so Increased the speed of hash joins.

附图说明Description of drawings

为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present application or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are only It is an embodiment of the present invention, and those skilled in the art can also obtain other drawings according to the provided drawings without creative work.

图1为本申请公开的一种哈希连接方法流程图;Fig. 1 is a flow chart of a hash connection method disclosed in the present application;

图2为本申请公开的一种具体的哈希连接方法流程图;Fig. 2 is a kind of specific hash connection method flowchart disclosed in the present application;

图3为本申请公开的一种具体的数据存储示意图;FIG. 3 is a specific schematic diagram of data storage disclosed in the present application;

图4为本申请公开的一种具体的数据归并连接示意图;FIG. 4 is a schematic diagram of a specific data merge connection disclosed in the present application;

图5为本申请公开的一种哈希连接装置结构示意图;FIG. 5 is a schematic structural diagram of a hash connection device disclosed in the present application;

图6为本申请公开的一种电子设备结构图。FIG. 6 is a structural diagram of an electronic device disclosed in the present application.

具体实施方式Detailed ways

下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The following will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

数据库查询操作中,哈希连接算法是数据库中应用最为广泛的连接算法之一,可实现低时间复杂度和低功耗的连接计算。哈希连接算法分为构建和探测两个阶段,在构建阶段所有数据计算完成后,再进行探测阶段的数据计算。In database query operations, the hash join algorithm is one of the most widely used join algorithms in databases, which can realize join calculations with low time complexity and low power consumption. The hash join algorithm is divided into two phases: construction and detection. After all data calculations in the construction phase are completed, the data calculation in the detection phase is performed.

常用的基于FPGA的哈希连接加速方式,主要通过解决哈希冲突的实现方式来实现更优的加速方式,但是这些方式都无法实现探测阶段的大量数据的并行处理。现有探测阶段的并行数据处理,仅能通过划分哈希连接实现,计算复杂,且并行度和可扩展性较差,因此哈希连接速度较低。Commonly used FPGA-based hash join acceleration methods mainly achieve better acceleration methods by solving hash conflicts, but none of these methods can achieve parallel processing of a large amount of data in the detection phase. The parallel data processing in the existing detection stage can only be realized by dividing the hash join, which is complex in calculation, and has poor parallelism and scalability, so the speed of the hash join is low.

为此本申请相应的提供了一种哈希连接方案,能够提高哈希连接的速度。For this reason, the present application correspondingly provides a hash join solution, which can increase the speed of hash join.

参见图1所示,本申请实施例公开了一种哈希连接方法,包括:Referring to Fig. 1, the embodiment of the present application discloses a hash connection method, including:

步骤S11:获取第一待连接数据元组的第一数据和第二待连接数据元组的第二数据,并利用布谷鸟算法中第一哈希函数和第二哈希函数分别对所述第一数据和所述第二数据进行计算,以得到所述第一数据的哈希结果以及所述第二数据的哈希结果。Step S11: Obtain the first data of the first data tuple to be connected and the second data of the second data tuple to be connected, and use the first hash function and the second hash function in the cuckoo algorithm to respectively A calculation is performed on the first data and the second data to obtain a hash result of the first data and a hash result of the second data.

本实施例中,所述获取第一待连接数据元组的第一数据和第二待连接数据元组的第二数据,包括:读取外部存储器中第一待连接数据元组和第二待连接数据元组,以得到所述第一待连接数据元组的第一数据和所述第二待连接数据元组的第二数据;其中,所述外部存储器为双倍速率同步动态随机存储器、同步动态随机存取存储器中的任意一种存储器。可以理解的是,当外部存储器为DDR(Double Data Rate,即双倍速率同步动态随机存储器)时,则从DDR中读取第一待连接数据元组和第二待连接数据元组,以得到第一待连接数据元组的第一数据和第二待连接数据元组的第二数据,当外部存储器为SDRAM(Synchronous Dynamic Random Access Memory,即同步动态随机存取存储器)时,则从SDRAM中读取第一待连接数据元组和第二待连接数据元组,以得到第一待连接数据元组的第一数据和第二待连接数据元组的第二数据。In this embodiment, the acquisition of the first data of the first data tuple to be connected and the second data of the second data tuple to be connected includes: reading the first data tuple to be connected and the second data tuple to be connected in the external memory Connecting the data tuples to obtain the first data of the first data tuple to be connected and the second data of the second data tuple to be connected; wherein, the external memory is a double-rate synchronous dynamic random access memory, Any type of synchronous dynamic random access memory. It can be understood that, when the external memory is DDR (Double Data Rate, double-rate synchronous dynamic random access memory), then read the first data tuple to be connected and the second data tuple to be connected from the DDR to obtain The first data of the first data tuple to be connected and the second data of the second data tuple to be connected, when the external memory is SDRAM (Synchronous Dynamic Random Access Memory, i.e. synchronous dynamic random access memory), then from the SDRAM The first data tuple to be connected and the second data tuple to be connected are read to obtain the first data of the first data tuple to be connected and the second data of the second data tuple to be connected.

需要注意的是,因为第一待连接数据元组和第二待连接数据元组数据量较大,一般将第一待连接数据元组和第二待连接数据元组分别划分为若干个数据单元,第一数据、第二数据是其中一个数据单元,例如可以将第一待连接数据元组和第二待连接数据元组分别划分为10个数据单元,按照顺序可以依次哈希计算、存储以及归并连接第一待连接数据元组和第二待连接数据元组的各个数据单元。It should be noted that because the data volume of the first data tuple to be connected and the second data tuple to be connected is relatively large, the first data tuple to be connected and the second data tuple to be connected are generally divided into several data units , the first data and the second data are one of the data units, for example, the first to-be-connected data tuple and the second to-be-connected data tuple can be divided into 10 data units, and hash calculation, storage and The respective data units of the first data tuple to be connected and the second data tuple to be connected are merged and connected.

步骤S12:确定出与所述第一数据的哈希结果对应的第一目标哈希表以及与所述第二数据的哈希结果对应的第二目标哈希表。Step S12: Determine a first target hash table corresponding to the hash result of the first data and a second target hash table corresponding to the hash result of the second data.

本实施例中,确定与第一数据的哈希结果对应的第一目标哈希表的地址,即可以根据该确定出对应的第一目标哈希表,同理,可以确定与第二数据的哈希结果对应的第二目标哈希表的地址,即可以根据该确定出对应的第二目标哈希表。In this embodiment, the address of the first target hash table corresponding to the hash result of the first data is determined, that is, the corresponding first target hash table can be determined based on the address. Similarly, the address of the second data can be determined. The address of the second target hash table corresponding to the hash result can be used to determine the corresponding second target hash table.

步骤S13:分别将所述第一数据和所述第二数据分成若干组第一子数据和第二子数据,并分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表。Step S13: Separate the first data and the second data into several groups of first sub-data and second sub-data, and respectively store the first sub-data and the second sub-data in the first a target hash table and the second target hash table.

需要注意的是,第一数据和第二数据可以分别包含若干组子数据,例如第一数据包含1000个子数据,则依次将这1000个子数据存储至第一目标哈希表中。It should be noted that the first data and the second data may respectively contain several sets of sub-data, for example, the first data contains 1000 sub-data, then these 1000 sub-data are stored in the first target hash table in turn.

可以理解的是,所述分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表之后,还包括:记录所述第一目标哈希表和所述第二目标哈希表中存储的数据的总量。It can be understood that after storing the first sub-data and the second sub-data in the first target hash table and the second target hash table respectively, it also includes: recording the first The total amount of data stored in the target hash table and the second target hash table.

步骤S14:从所述第一哈目标希表和所述第二目标哈希表中读取所述第一子数据和所述第二子数据,并将满足等值条件的所述第一子数据和所述第二子数据进行归并连接,以得到连接后数据。Step S14: Read the first sub-data and the second sub-data from the first hash table and the second target hash table, and store the first sub-data satisfying the equivalence condition The data and the second sub-data are merged and joined to obtain the joined data.

本实施例中,进行归并连接是哈希连接的探测阶段,因为预先将第一数据和第二数据分别划分为若干组子数据,所以需要将第一数据的第一子数据和第二数据的第二子数据依次进行归并连接,直至所有的子数据均连接完毕。其中,需要注意的是,在进行归并连接前,需要判断当前的第一子数据和当前的第二子数据之间是否满足等值条件,如果满足,才可以进行归并连接,如果不满足,则跳过当前的第一子数据和当前的第二子数据,判断下一个第一子数据和下一个第二子数据是否满足等值条件。其中,以单行哈希表数据元归并计算进行说明:In this embodiment, the merge join is the detection stage of the hash join. Since the first data and the second data are divided into several groups of sub-data in advance, it is necessary to divide the first sub-data of the first data and the sub-data of the second data The second sub-data is sequentially merged and connected until all the sub-data are connected. Among them, it should be noted that before performing the merge connection, it is necessary to judge whether the current first sub-data and the current second sub-data meet the equivalence condition. If it is satisfied, the merge connection can be performed. If not, then Skip the current first sub-data and the current second sub-data, and judge whether the next first sub-data and the next second sub-data satisfy the equivalence condition. Among them, the merge calculation of single-line hash table data elements is used for illustration:

1)分别检测第一子数据和第二子数据对应的相同哈希表行中的数据元数量,当有一个子数据对应的哈希表行数据总量为0时,直接跳过该行的数据归并计算,检测下一个哈希行数据;1) Detect the number of data elements in the same hash table row corresponding to the first sub-data and the second sub-data respectively. When the total amount of data in the hash table row corresponding to a sub-data is 0, directly skip the The data is merged and calculated, and the next hash row data is detected;

2)当两个哈希行都存在子数据时,分别从对应的哈希表中读取单行哈希桶中的数据;当存在链表时,读取对应链表内所有数据;读完所有数据并缓存;每个哈希桶对应一个链表存储模块,用于提高链表的读取速度;2) When both hash rows have sub-data, read the data in the single-row hash bucket from the corresponding hash table; when there is a linked list, read all the data in the corresponding linked list; read all the data and Cache; each hash bucket corresponds to a linked list storage module, which is used to improve the reading speed of the linked list;

3)当第一子数据缓存数据大于等于第二子数据时,以第一子数据缓存为基准,将数据顺序写入M个归并连接单元;第二子数据缓存中的数据以流水线形式写入归并连接单元,并顺序传递到归并连接单元M,完成归并连接计算;3) When the first sub-data cache data is greater than or equal to the second sub-data, the data is sequentially written into M merge connection units based on the first sub-data cache; the data in the second sub-data cache is written in pipeline form The merge connection unit is sequentially passed to the merge connection unit M to complete the merge connection calculation;

4)在归并连接单元中,当同时存在两个数据元时,比较两个数据元,是否满足连接条件,满足连接条件则输出连接结果;否则,跳过;重复上述操作,完成所有哈希表行中数据的归并连接。4) In the merge connection unit, when there are two data elements at the same time, compare the two data elements to see if the connection condition is satisfied, and if the connection condition is met, the connection result is output; otherwise, skip; repeat the above operation, and complete all hash tables Merge join of data in rows.

可以理解的是,得到连接后数据后,需要通过DMA(直接内存访问,即DirectMemory Access)方式将连接后数据传输到加速结构外部的CPU(central processingunit,即中央处理器)处理模块。It is understandable that after obtaining the connected data, the connected data needs to be transferred to the CPU (central processing unit, central processing unit) processing module outside the acceleration structure through DMA (direct memory access, namely DirectMemory Access).

可见,本申请获取第一待连接数据元组的第一数据和第二待连接数据元组的第二数据,并利用布谷鸟算法中第一哈希函数和第二哈希函数分别对所述第一数据和所述第二数据进行计算,以得到所述第一数据的哈希结果以及所述第二数据的哈希结果;确定出与所述第一数据的哈希结果对应的第一目标哈希表以及与所述第二数据的哈希结果对应的第二目标哈希表;分别将所述第一数据和所述第二数据分成若干组第一子数据和第二子数据,并分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表;从所述第一哈目标希表和所述第二目标哈希表中读取所述第一子数据和所述第二子数据,并将满足等值条件的所述第一子数据和所述第二子数据进行归并连接,以得到连接后数据。由此可见,本申请在构建阶段实现对第一待连接数据元组的第一数据和第二待连接数据元组的第二数据进行哈希计算与存储,以便后续可以将满足等值条件的第一子数据和第二子数据进行归并连接,以得到连接后数据,消除了探测阶段每次只能处理一个数据元组连接的限制,即可以实现并行哈希计算、存储以及归并连接,因此提高了哈希连接的速度。It can be seen that the present application obtains the first data of the first data tuple to be connected and the second data of the second data tuple to be connected, and utilizes the first hash function and the second hash function in the cuckoo algorithm to respectively Perform calculations on the first data and the second data to obtain a hash result of the first data and a hash result of the second data; determine the first data corresponding to the hash result of the first data a target hash table and a second target hash table corresponding to the hash result of the second data; respectively dividing the first data and the second data into several groups of first sub-data and second sub-data, And storing the first sub-data and the second sub-data to the first target hash table and the second target hash table respectively; from the first hash target hash table and the second target The first sub-data and the second sub-data are read from the hash table, and the first sub-data and the second sub-data satisfying the equivalence condition are merged and connected to obtain the connected data. It can be seen that, in the construction stage, the present application implements hash calculation and storage of the first data of the first data tuple to be connected and the second data of the second data tuple to be connected, so that the subsequent The first sub-data and the second sub-data are merged and connected to obtain the connected data, which eliminates the limitation that only one data tuple connection can be processed at a time in the detection stage, that is, parallel hash calculation, storage and merge connection can be realized, so Increased the speed of hash joins.

参见图2所示,本申请实施例公开了一种具体的哈希连接方法,包括:Referring to Figure 2, the embodiment of the present application discloses a specific hash connection method, including:

步骤S21:获取第一待连接数据元组的第一数据和第二待连接数据元组的第二数据。Step S21: Obtain the first data of the first data tuple to be connected and the second data of the second data tuple to be connected.

步骤S22:利用布谷鸟算法中第一哈希函数和第二哈希函数分别对所述第一数据和所述第二数据进行计算,以得到所述第一数据的第一哈希结果和第二哈希结果以及所述第二数据的第一哈希结果和第二哈希结果。Step S22: Using the first hash function and the second hash function in the cuckoo algorithm to calculate the first data and the second data respectively, so as to obtain the first hash result and the second hash result of the first data Two hash results and a first hash result and a second hash result of the second data.

步骤S23:确定出与所述第一数据的第一哈希结果和第二哈希结果分别对应的第一哈希表地址、第二哈希表地址以及与所述第二数据的第一哈希结果和第二哈希结果分别对应的第三哈希表地址、第四哈希表地址。Step S23: Determine the address of the first hash table, the address of the second hash table and the first hash of the second data respectively corresponding to the first hash result and the second hash result of the first data. The third hash table address and the fourth hash table address corresponding to the hash result and the second hash result respectively.

步骤S24:从所述第一哈希表地址、所述第二哈希表地址中确定出第一目标哈希表地址,并从所述第三哈希表地址、所述第四哈希表地址中确定出第二目标哈希表地址。Step S24: Determine the first target hash table address from the first hash table address and the second hash table address, and determine the first target hash table address from the third hash table address, the fourth hash table address The second target hash table address is determined from the address.

本实施例中,所述从所述第一哈希表地址、所述第二哈希表地址中确定出第一目标哈希表地址,并从所述第三哈希表地址、所述第四哈希表地址中确定出第二目标哈希表地址,包括:判断与所述第一哈希表地址、所述第二哈希表地址分别对应的哈希表中是否存在空闲哈希桶,若存在则将与所述第一哈希表地址、所述第二哈希表地址分别对应的哈希表中存在空闲哈希桶数量最大的哈希表的地址确定为第一目标哈希表地址,将另一哈希表的地址确定为第一目标哈希表备用地址;判断与所述第三哈希表地址、所述第四哈希表地址分别对应的哈希表中是否存在空闲哈希桶,若存在则将与所述第三哈希表地址、所述第四哈希表地址分别对应的哈希表中存在空闲哈希桶数量最大的哈希表确定为第二目标哈希表地址,将另一哈希表的地址确定为第二目标哈希表备用地址。其中,哈希桶的空满状态项为空,即判定该哈希桶为空闲哈希桶。可以理解的是,哈希表中包含N(N≥2)个哈希桶槽位,哈希桶包含数据元、链表地址、空满状态、备用哈希桶地址等标志信号,每个槽位以{数据元;空满标志;链表地址;备用哈希桶地址}的格式存储数据。In this embodiment, the first target hash table address is determined from the first hash table address and the second hash table address, and the first target hash table address is determined from the third hash table address, the second hash table address Determining the second target hash table address from the four hash table addresses includes: judging whether there is an idle hash bucket in the hash table respectively corresponding to the first hash table address and the second hash table address , if it exists, the address of the hash table with the largest number of free hash buckets in the hash tables corresponding to the first hash table address and the second hash table address respectively is determined as the first target hash Table address, the address of another hash table is determined as the standby address of the first target hash table; judge whether there is in the hash table respectively corresponding to the third hash table address and the fourth hash table address Free hash buckets, if they exist, determine the hash table with the largest number of free hash buckets in the hash tables corresponding to the third hash table address and the fourth hash table address as the second target The address of the hash table is used to determine the address of another hash table as the standby address of the second target hash table. Wherein, the empty and full status item of the hash bucket is empty, that is, it is determined that the hash bucket is an idle hash bucket. It can be understood that the hash table contains N (N ≥ 2) hash bucket slots, and the hash bucket contains flag signals such as data elements, linked list addresses, empty and full states, and backup hash bucket addresses. Store data in the format of {data element; empty and full flag; linked list address; backup hash bucket address}.

本实施例中,所述判断与所述第一哈希表地址、所述第二哈希表地址分别对应的哈希表中是否存在空闲哈希桶之后,还包括:若不存在,则查找与所述第一哈希表地址、所述第二哈希表地址分别对应的哈希表中哈希桶内存储数据的备用地址,基于所述备用地址判断是否存在满足预设条件的哈希桶;若存在,将满足预设条件的哈希桶内存储的数据转存至与所述备用地址对应的哈希表中的哈希桶内,以便将所述第一子数据存储到满足预设条件的哈希桶;若不存在,将第一子数据存储到所述第一目标哈希表对应的链表中。In this embodiment, after the judging whether there is a free hash bucket in the hash table respectively corresponding to the first hash table address and the second hash table address, it also includes: if not, searching A standby address for storing data in the hash bucket in the hash table corresponding to the first hash table address and the second hash table address respectively, and judging whether there is a hash that satisfies the preset condition based on the standby address Bucket; if it exists, transfer the data stored in the hash bucket that meets the preset condition to the hash bucket in the hash table corresponding to the standby address, so that the first sub-data is stored in the hash bucket that meets the preset condition A conditional hash bucket is set; if it does not exist, the first sub-data is stored in a linked list corresponding to the first target hash table.

可以理解的是,如果不存在空闲哈希桶,则检测备用哈希桶是否存在空闲,如果存在空闲备用哈希桶,则在与第一哈希表地址、第二哈希表地址分别对应的哈希表中筛选出空闲备用哈希桶最多的哈希表,并将其确定为第一目标哈希表,然后将哈希桶1中的数据转移至备用哈希桶1中,以便后续存储将第一数据的子数据时,将子数据存储至该哈希桶1中。It can be understood that, if there is no idle hash bucket, it is detected whether the standby hash bucket is idle, and if there is an idle standby hash bucket, then in the addresses corresponding to the first hash table address and the second hash table address respectively Filter out the hash table with the most free spare hash buckets from the hash table, and determine it as the first target hash table, and then transfer the data in hash bucket 1 to spare hash bucket 1 for subsequent storage When sub-data of the first data is stored, the sub-data is stored in the hash bucket 1 .

步骤S25:分别将所述第一数据和所述第二数据分成若干组第一子数据和第二子数据,存储所述第一子数据至与所述第一目标哈希表地址对应的哈希表内,并存储所述第二子数据至与所述第二目标哈希表地址对应的哈希表内。Step S25: Separate the first data and the second data into several groups of first sub-data and second sub-data, and store the first sub-data in the hash corresponding to the address of the first target hash table and store the second sub-data in the hash table corresponding to the address of the second target hash table.

本实施例中,将所述第一子数据和第一目标哈希表备用地址存储至所述第一目标哈希表的空闲哈希桶中,并将所述第二子数据和第二目标哈希表备用地址存储至所述第二目标哈希表的空闲哈希桶中。In this embodiment, the first subdata and the standby address of the first target hash table are stored in an idle hash bucket of the first target hash table, and the second subdata and the second target The standby address of the hash table is stored in an idle hash bucket of the second target hash table.

步骤S26:从所述第一哈目标希表和所述第二目标哈希表中读取所述第一子数据和所述第二子数据,并将满足等值条件的所述第一子数据和所述第二子数据进行归并连接,以得到连接后数据。Step S26: Read the first sub-data and the second sub-data from the first hash table and the second target hash table, and store the first sub-data satisfying the equivalence condition The data and the second sub-data are merged and joined to obtain the joined data.

由此可见,本申请在构建阶段并行计算并存储两个待连接数据元组,以便后续在探测阶段也可以并行连接两个子数据,提高了哈希连接的速度;其中,利用布谷鸟算法中两个哈希函数进行计算,第一数据和第二数据的计算结果增加,其可以进行存储的哈希表也相应的增加,因此提高了存储第一数据、第二数据的概率。It can be seen that the application calculates and stores two data tuples to be connected in parallel in the construction phase, so that the two sub-data can also be connected in parallel in the subsequent detection phase, which improves the speed of the hash connection; A hash function is calculated, the calculation results of the first data and the second data increase, and the hash tables that can be stored also increase accordingly, thus increasing the probability of storing the first data and the second data.

下面对本申请的技术方案进行进一步的说明:The technical scheme of the present application is further described below:

1)读取外部存储器中第一待连接数据元组和第二待连接数据元组,以得到第一待连接数据元组的第一数据和第二待连接数据元组的第二数据。1) Reading the first data tuple to be connected and the second data tuple to be connected in the external memory to obtain the first data of the first data tuple to be connected and the second data of the second data tuple to be connected.

2)利用布谷鸟算法中第一哈希函数和第二哈希函数分别对第一数据和第二数据进行计算,以得到第一数据的哈希结果11和哈希结果12以及第二数据的哈希结果21和哈希结果22。2) Using the first hash function and the second hash function in the cuckoo algorithm to calculate the first data and the second data respectively, to obtain the hash result 11 and the hash result 12 of the first data and the hash result 12 of the second data Hash result 21 and hash result 22.

3)确定出与哈希结果11对应的第一哈希表地址11以及与哈希结果12对应的第二哈希表地址12,从第一哈希表地址11和第二哈希表地址12中确定出第一目标哈希表地址;确定过程具体为:3) Determine the first hash table address 11 corresponding to the hash result 11 and the second hash table address 12 corresponding to the hash result 12, from the first hash table address 11 and the second hash table address 12 The address of the first target hash table is determined in ; the determination process is specifically as follows:

确定与第一哈希表地址11和第二哈希表地址12分别对应的哈希表11和哈希表12中空闲哈希桶的数量,判断哈希表11和哈希表12中是否存在空闲哈希桶;Determine the number of free hash buckets in the hash table 11 and the hash table 12 respectively corresponding to the first hash table address 11 and the second hash table address 12, and determine whether there is a hash bucket in the hash table 11 and the hash table 12 free hash bucket;

如果存在,则以空闲哈希桶数量最多的哈希表为第一目标哈希表,另一个哈希表的地址则为第一目标哈希表备用地址;If it exists, the hash table with the largest number of free hash buckets is used as the first target hash table, and the address of another hash table is the standby address of the first target hash table;

如果不存在,则确定哈希表11和哈希表12中空闲备用哈希桶的数量,首先判断哈希表11和哈希表12中是否存在空闲备用哈希桶,如果存在,则以空闲备用哈希桶数量最多的哈希表为第一目标哈希表,如果不存在,则确定哈希桶的链表地址槽位,将链表地址槽位为空的数量最多的哈希表确定为第一目标哈希表;If it does not exist, then determine the number of idle standby hash buckets in the hash table 11 and the hash table 12, first judge whether there is an idle standby hash bucket in the hash table 11 and the hash table 12, if there is, then use idle The hash table with the largest number of spare hash buckets is the first target hash table. If it does not exist, the linked list address slot of the hash bucket is determined, and the hash table with the largest number of empty linked list address slots is determined as the first target hash table. a target hash table;

可以理解的是,第二目标哈希表的过程与确定第一目标哈希表的过程相同,需要注意的是,如果将其中一个哈希表确定为目标哈希表,则将另一个哈希表确定为备用哈希表。It can be understood that the process of the second target hash table is the same as the process of determining the first target hash table. It should be noted that if one of the hash tables is determined as the target hash table, the other hash table The table is identified as an alternate hash table.

4)分别将第一数据和第二数据分成若干组第一子数据和第二子数据,并分别存储第一子数据和第二子数据至第一目标哈希表和第二目标哈希表;其中,因为确定目标哈希表的过程不同,其存储过程也不同,如以存储第一子数据为例:4) Dividing the first data and the second data into several groups of first sub-data and second sub-data respectively, and storing the first sub-data and the second sub-data to the first target hash table and the second target hash table respectively ; Among them, because the process of determining the target hash table is different, the storage process is also different, such as storing the first sub-data as an example:

如果确定出的第一目标哈希表中存在空闲哈希桶,则将第一数据的第一子数据存储至空闲的哈希桶1的数据元项中,并将第一目标哈希表备用地址存储至哈希桶1的备用哈希桶地址项中;If there is an idle hash bucket in the determined first target hash table, store the first sub-data of the first data in the data element item of the free hash bucket 1, and reserve the first target hash table The address is stored in the standby hash bucket address item of hash bucket 1;

如果确定出的第一目标哈希表中不存在空闲哈希桶,但是存在空闲的备用哈希桶,则将哈希桶1的数据元项中的数据转移至与哈希桶1对应的备用哈希桶中,然后将第一数据的第一子数据存储至哈希桶1中;If there is no idle hash bucket in the determined first target hash table, but there is an idle standby hash bucket, then the data in the data element item of hash bucket 1 is transferred to the standby corresponding to hash bucket 1 in the hash bucket, and then store the first sub-data of the first data in the hash bucket 1;

如果确定出的第一目标哈希表中不存在空闲哈希桶,也不存在空闲的备用哈希桶,则例如图3所示的一种具体的数据存储示意图,将第一数据的第一组第一子数据存储至链表1-1中,链表1-1的地址存储到链表地址项中,将第一数据的第二组第一子数据存储至链表1-2中,链表1-2的地址存储到链表地址项中,直至第一数据存储完毕;If there is neither an idle hash bucket nor an idle standby hash bucket in the determined first target hash table, then, for example, in a specific data storage schematic diagram shown in FIG. 3 , the first The first sub-data of the group is stored in the linked list 1-1, the address of the linked list 1-1 is stored in the linked list address item, the second group of the first sub-data of the first data is stored in the linked list 1-2, and the linked list 1-2 The address of is stored in the linked list address item until the storage of the first data is completed;

可以理解的是,上述过程中,会存在数据的写入或/和数据的转移,那么需要记录当前哈希表写入数据元的总量,其中,哈希桶的数据格式可以进行一定的调整、删除、增加,备用地址槽位没有空闲时,可继续检测备用槽位的备用地址;进行多级备用槽位的检测,提高数据元的插入成功率。It is understandable that in the above process, there will be data writing or/and data transfer, so it is necessary to record the total amount of data elements written in the current hash table, and the data format of the hash bucket can be adjusted to a certain extent , delete, add, and when the spare address slot is not free, it can continue to detect the spare address of the spare slot; perform multi-level spare slot detection to improve the success rate of data element insertion.

5)从第一哈目标希表和第二目标哈希表中读取第一子数据和第二子数据,并将满足等值条件的第一子数据和第二子数据进行归并连接,以得到连接后数据;以图4所示的一种具体的数据归并连接示意图为例,按照数据组的顺序将第一子数据写入归并连接单元,读取第一组第二子数据,如果第一组第一子数据和第一组第二子数据满足等值条件,则将第一组第二子数据写入归并连接单元1,进行归并连接,以此类推,直至所有的数据进行归并连接完毕,并将所有的连接后数据输出至连接结果缓存模块,根据控制指令,将数据结果输入至到CPU处理单元。其中,扩展多行哈希表数据元归并计算,每个哈希表归并计算模块读取并计算单行的数据元;完成计算后,继续读取下一个有效行并计算,通过扩展数据元归并模块,可实现多行哈希表内数据的并行归并计算,充分利用了FPGA并行计算的能力,可扩展性和可移植性较强。5) read the first sub-data and the second sub-data from the first hash table and the second target hash table, and merge and connect the first sub-data and the second sub-data satisfying the equivalence condition, to Get the data after connection; take a specific data merge connection schematic diagram shown in Figure 4 as an example, write the first sub-data into the merge connection unit according to the order of the data groups, read the first group of second sub-data, if the first A group of first sub-data and a first group of second sub-data satisfy the equivalence condition, then write the first group of second sub-data into the merge connection unit 1, perform merge connection, and so on, until all data are merged and connected After completion, output all the connected data to the connection result buffer module, and input the data result to the CPU processing unit according to the control instruction. Among them, the multi-row hash table data elements are merged and calculated, and each hash table merged calculation module reads and calculates the data elements of a single row; after the calculation is completed, continue to read the next valid row and calculate, through the extended data element merge module , can realize the parallel merging calculation of data in the multi-line hash table, fully utilizes the ability of FPGA parallel computing, and has strong scalability and portability.

参见图5所示,本申请实施例公开了一种哈希连接装置,包括:Referring to Figure 5, the embodiment of the present application discloses a hash connection device, including:

哈希结果获取模块11,用于获取第一待连接数据元组的第一数据和第二待连接数据元组的第二数据,并利用布谷鸟算法中第一哈希函数和第二哈希函数分别对所述第一数据和所述第二数据进行计算,以得到所述第一数据的哈希结果以及所述第二数据的哈希结果;The hash result acquisition module 11 is used to acquire the first data of the first data tuple to be connected and the second data of the second data tuple to be connected, and utilize the first hash function and the second hash in the cuckoo algorithm The function calculates the first data and the second data respectively to obtain a hash result of the first data and a hash result of the second data;

哈希表确定模块12,用于确定出与所述第一数据的哈希结果对应的第一目标哈希表以及与所述第二数据的哈希结果对应的第二目标哈希表;a hash table determining module 12, configured to determine a first target hash table corresponding to the hash result of the first data and a second target hash table corresponding to the hash result of the second data;

存储模块13,用于分别将所述第一数据和所述第二数据分成若干组第一子数据和第二子数据,并分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表;A storage module 13, configured to divide the first data and the second data into several groups of first sub-data and second sub-data, and respectively store the first sub-data and the second sub-data in the said first target hash table and said second target hash table;

归并连接模块14,用于从所述第一哈目标希表和所述第二目标哈希表中读取所述第一子数据和所述第二子数据,并将满足等值条件的所述第一子数据和所述第二子数据进行归并连接,以得到连接后数据。A merge connection module 14, configured to read the first sub-data and the second sub-data from the first hash object hash table and the second target hash table, and combine all the sub-data that meet the equivalence condition The first sub-data and the second sub-data are merged and connected to obtain the connected data.

可见,本申请获取第一待连接数据元组的第一数据和第二待连接数据元组的第二数据,并利用布谷鸟算法中第一哈希函数和第二哈希函数分别对所述第一数据和所述第二数据进行计算,以得到所述第一数据的哈希结果以及所述第二数据的哈希结果;确定出与所述第一数据的哈希结果对应的第一目标哈希表以及与所述第二数据的哈希结果对应的第二目标哈希表;分别将所述第一数据和所述第二数据分成若干组第一子数据和第二子数据,并分别存储所述第一子数据和所述第二子数据至所述第一目标哈希表和所述第二目标哈希表;从所述第一哈目标希表和所述第二目标哈希表中读取所述第一子数据和所述第二子数据,并将满足等值条件的所述第一子数据和所述第二子数据进行归并连接,以得到连接后数据。由此可见,本申请在构建阶段实现对第一待连接数据元组的第一数据和第二待连接数据元组的第二数据进行哈希计算与存储,以便后续可以将满足等值条件的第一子数据和第二子数据进行归并连接,以得到连接后数据,消除了探测阶段每次只能处理一个数据元组连接的限制,即可以实现并行哈希计算、存储以及归并连接,因此提高了哈希连接的速度。It can be seen that the present application obtains the first data of the first data tuple to be connected and the second data of the second data tuple to be connected, and utilizes the first hash function and the second hash function in the cuckoo algorithm to respectively Perform calculations on the first data and the second data to obtain a hash result of the first data and a hash result of the second data; determine the first data corresponding to the hash result of the first data a target hash table and a second target hash table corresponding to the hash result of the second data; respectively dividing the first data and the second data into several groups of first sub-data and second sub-data, And storing the first sub-data and the second sub-data to the first target hash table and the second target hash table respectively; from the first hash target hash table and the second target The first sub-data and the second sub-data are read from the hash table, and the first sub-data and the second sub-data satisfying the equivalence condition are merged and connected to obtain the connected data. It can be seen that, in the construction stage, the present application implements hash calculation and storage of the first data of the first data tuple to be connected and the second data of the second data tuple to be connected, so that the subsequent The first sub-data and the second sub-data are merged and connected to obtain the connected data, which eliminates the limitation that only one data tuple connection can be processed at a time in the detection stage, that is, parallel hash calculation, storage and merge connection can be realized, so Increased the speed of hash joins.

进一步的,本申请实施例还提供了一种电子设备。图6是根据一示例性实施例示出的电子设备20结构图,图中的内容不能认为是对本申请的使用范围的任何限制。Further, the embodiment of the present application also provides an electronic device. Fig. 6 is a structural diagram of an electronic device 20 according to an exemplary embodiment, and the content in the diagram should not be regarded as any limitation on the application scope of the present application.

图6为本申请实施例提供的一种电子设备的结构示意图。具体可以包括:至少一个处理器21、至少一个存储器22、电源23、通信接口24、输入输出接口25和通信总线26。其中,所述存储器22用于存储计算机程序,所述计算机程序由所述处理器21加载并执行,以实现前述任一实施例公开的由电子设备执行的哈希连接方法中的相关步骤。FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. Specifically, it may include: at least one processor 21 , at least one memory 22 , a power supply 23 , a communication interface 24 , an input/output interface 25 and a communication bus 26 . Wherein, the memory 22 is used to store a computer program, and the computer program is loaded and executed by the processor 21 to implement relevant steps in the hash connection method performed by the electronic device disclosed in any of the foregoing embodiments.

本实施例中,电源23用于为电子设备上的各硬件设备提供工作电压;通信接口24能够为电子设备创建与外界设备之间的数据传输通道,其所遵循的通信协议是能够适用于本申请技术方案的任意通信协议,在此不对其进行具体限定;输入输出接口25,用于获取外界输入数据或向外界输出数据,其具体的接口类型可以根据具体应用需要进行选取,在此不进行具体限定。In this embodiment, the power supply 23 is used to provide working voltage for each hardware device on the electronic device; the communication interface 24 can create a data transmission channel between the electronic device and the external device, and the communication protocol it follows is applicable to this Any communication protocol for applying for a technical solution is not specifically limited here; the input and output interface 25 is used to obtain external input data or output data to the external world, and its specific interface type can be selected according to specific application needs, and will not be described here. Specific limits.

其中,处理器21可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器21可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器21也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central ProcessingUnit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器21可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器21还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。Wherein, the processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. Processor 21 can adopt at least one hardware form in DSP (Digital Signal Processing, digital signal processing), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array, programmable logic array) accomplish. Processor 21 also can comprise main processor and coprocessor, and main processor is the processor that is used for processing the data in wake-up state, also claims CPU (Central Processing Unit, central processing unit); Low-power processor for processing data in standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 21 may also include an AI (Artificial Intelligence, artificial intelligence) processor, and the AI processor is used to process computing operations related to machine learning.

另外,存储器22作为资源存储的载体,可以是只读存储器、随机存储器、磁盘或者光盘等,其上所存储的资源包括操作系统221、计算机程序222及数据223等,存储方式可以是短暂存储或者永久存储。In addition, the memory 22, as a resource storage carrier, can be a read-only memory, random access memory, magnetic disk or optical disk, etc., and the resources stored thereon include the operating system 221, computer program 222 and data 223, etc., and the storage method can be short-term storage or permanent storage.

其中,操作系统221用于管理与控制电子设备上的各硬件设备以及计算机程序222,以实现处理器21对存储器22中海量数据223的运算与处理,其可以是Windows、Unix、Linux等。计算机程序222除了包括能够用于完成前述任一实施例公开的由电子设备执行的哈希连接方法的计算机程序之外,还可以进一步包括能够用于完成其他特定工作的计算机程序。数据223除了可以包括电子设备接收到的由外部设备传输进来的数据,也可以包括由自身输入输出接口25采集到的数据等。Among them, the operating system 221 is used to manage and control each hardware device and computer program 222 on the electronic device, so as to realize the calculation and processing of the massive data 223 in the memory 22 by the processor 21, which can be Windows, Unix, Linux, etc. In addition to the computer program that can be used to complete the hash connection method performed by the electronic device disclosed in any of the foregoing embodiments, the computer program 222 can further include a computer program that can be used to complete other specific tasks. The data 223 may not only include data received by the electronic device and transmitted from an external device, but may also include data collected by its own input and output interface 25 and the like.

进一步的,本申请实施例还公开了一种计算机可读存储介质,所述存储介质中存储有计算机程序,所述计算机程序被处理器加载并执行时,实现前述任一实施例公开的由哈希连接过程中执行的方法步骤。Further, the embodiment of the present application also discloses a computer-readable storage medium, and a computer program is stored in the storage medium, and when the computer program is loaded and executed by a processor, it can implement the hashing method disclosed in any of the above-mentioned embodiments. Method steps performed during connection.

最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。Finally, it should also be noted that in this text, relational terms such as first and second etc. are only used to distinguish one entity or operation from another, and do not necessarily require or imply that these entities or operations, any such actual relationship or order exists. Furthermore, the term "comprises", "comprises" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article, or apparatus comprising a set of elements includes not only those elements, but also includes elements not expressly listed. other elements of or also include elements inherent in such a process, method, article, or device. Without further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus comprising said element.

以上对本发明所提供的一种哈希连接方法、装置、设备及介质进行了详细介绍,本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想;同时,对于本领域的一般技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本发明的限制。A hash connection method, device, equipment, and medium provided by the present invention have been introduced in detail above. In this paper, specific examples have been used to illustrate the principle and implementation of the present invention. The descriptions of the above embodiments are only used to help Understand the method of the present invention and its core idea; at the same time, for those of ordinary skill in the art, according to the idea of the present invention, there will be changes in the specific implementation and scope of application. In summary, the content of this specification is not It should be understood as a limitation of the present invention.

Claims (10)

1. A hash connection method, comprising:
acquiring first data of a first data tuple to be connected and second data of a second data tuple to be connected, and respectively calculating the first data and the second data by utilizing a first hash function and a second hash function in a cuckoo algorithm to obtain a hash result of the first data and a hash result of the second data;
determining a first target hash table corresponding to the hash result of the first data and a second target hash table corresponding to the hash result of the second data;
dividing the first data and the second data into a plurality of groups of first sub data and second sub data respectively, and storing the first sub data and the second sub data into the first target hash table and the second target hash table respectively;
And reading the first sub data and the second sub data from the first hash target hash table and the second hash target table, and merging and connecting the first sub data and the second sub data which meet the equivalent condition to obtain connected data.
2. The hash connection method as claimed in claim 1, wherein the acquiring the first data of the first data tuple to be connected and the second data of the second data tuple to be connected comprises:
reading a first data tuple to be connected and a second data tuple to be connected in an external memory to obtain first data of the first data tuple to be connected and second data of the second data tuple to be connected; the external memory is any one of a double-rate synchronous dynamic random access memory and a synchronous dynamic random access memory.
3. The hash connection method as claimed in claim 1, wherein the calculating the first data and the second data by using a first hash function and a second hash function in a cuckoo algorithm to obtain a hash result of the first data and a hash result of the second data includes:
Calculating the first data and the second data by using a first hash function and a second hash function in a cuckoo algorithm to obtain a first hash result and a second hash result of the first data and a first hash result and a second hash result of the second data;
correspondingly, the determining the first target hash table corresponding to the hash result of the first data and the second target hash table corresponding to the hash result of the second data includes:
determining a first hash table address and a second hash table address which correspond to the first hash result and the second hash result of the first data respectively, and a third hash table address and a fourth hash table address which correspond to the first hash result and the second hash result of the second data respectively;
and determining a first target hash table address from the first hash table address and the second hash table address, and determining a second target hash table address from the third hash table address and the fourth hash table address.
4. The hash connection method as claimed in claim 3, wherein said storing said first sub data and said second sub data to said first target hash table and said second target hash table, respectively, comprises:
Storing the first sub data into a hash table corresponding to the first target hash table address, and storing the second sub data into a hash table corresponding to the second target hash table address.
5. The hash table connection method as claimed in claim 4, wherein determining a first target hash table address from the first hash table address and the second hash table address, and determining a second target hash table address from the third hash table address and the fourth hash table address, comprises:
judging whether idle hash buckets exist in hash tables corresponding to the first hash table address and the second hash table address respectively, if so, determining the hash table address with the largest idle hash bucket number in the hash tables corresponding to the first hash table address and the second hash table address as a first target hash table address, and determining the other hash table address as a first target hash table standby address;
judging whether idle hash buckets exist in hash tables corresponding to the third hash table address and the fourth hash table address respectively, if so, determining the hash table with the largest idle hash bucket quantity in the hash tables corresponding to the third hash table address and the fourth hash table address as a second target hash table address, and determining the address of the other hash table as a second target hash table standby address;
Correspondingly, the storing the first sub data and the second sub data in the first target hash table and the second target hash table respectively includes:
storing the first sub-data and the first target hash table standby address into a free hash bucket of the first target hash table, and storing the second sub-data and the second target hash table standby address into a free hash bucket of the second target hash table.
6. The hash connection method as claimed in claim 5, wherein after determining whether there is a free hash bucket in the hash tables corresponding to the first hash table address and the second hash table address, respectively, the method further comprises:
if the hash table addresses do not exist, searching standby addresses of data stored in hash buckets in the hash tables corresponding to the first hash table address and the second hash table address respectively, and judging whether the hash buckets meeting preset conditions exist or not based on the standby addresses;
if so, the data stored in the hash bucket meeting the preset condition are transferred to the hash bucket in the hash table corresponding to the standby address, so that the first sub-data are stored in the hash bucket meeting the preset condition; and if the first sub data does not exist, storing the first sub data into a linked list corresponding to the first target hash table.
7. The hash connection method as claimed in any one of claims 1 to 6, wherein after storing the first sub data and the second sub data in the first target hash table and the second target hash table, respectively, further comprises:
and recording the total amount of data stored in the first target hash table and the second target hash table.
8. A hash connection apparatus, comprising:
the hash result acquisition module is used for acquiring first data of a first data tuple to be connected and second data of a second data tuple to be connected, and calculating the first data and the second data by using a first hash function and a second hash function in a cuckoo algorithm respectively to obtain a hash result of the first data and a hash result of the second data;
the hash table determining module is used for determining a first target hash table corresponding to the hash result of the first data and a second target hash table corresponding to the hash result of the second data;
the storage module is used for dividing the first data and the second data into a plurality of groups of first sub data and second sub data respectively and storing the first sub data and the second sub data to the first target hash table and the second target hash table respectively;
And the merging connection module is used for reading the first sub data and the second sub data from the first hash target table and the second hash target table, and merging and connecting the first sub data and the second sub data which meet the equivalent condition to obtain connected data.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the hash connection method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program; wherein the computer program when executed by a processor implements the steps of the hash connection method as claimed in any one of claims 1 to 7.
CN202310160998.0A 2023-02-23 2023-02-23 Hash connection method, device, equipment and medium Active CN116028506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310160998.0A CN116028506B (en) 2023-02-23 2023-02-23 Hash connection method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310160998.0A CN116028506B (en) 2023-02-23 2023-02-23 Hash connection method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN116028506A true CN116028506A (en) 2023-04-28
CN116028506B CN116028506B (en) 2025-08-22

Family

ID=86079648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310160998.0A Active CN116028506B (en) 2023-02-23 2023-02-23 Hash connection method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116028506B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6263331B1 (en) * 1998-07-30 2001-07-17 Unisys Corporation Hybrid hash join process
CN102508924A (en) * 2011-11-22 2012-06-20 上海达梦数据库有限公司 Method for realizing grace hash joint by using merge join
CN108549666A (en) * 2018-03-22 2018-09-18 上海达梦数据库有限公司 A kind of sort method of tables of data, device, equipment and storage medium
CN111382158A (en) * 2020-03-09 2020-07-07 北京东方金信科技有限公司 Method and system for realizing parallel connection by adopting two-stage hash table structure
CN113468181A (en) * 2021-07-14 2021-10-01 中国人民解放军军事科学院国防科技创新研究院 Parallel Hash connection acceleration method and system based on FPGA
CN114021113A (en) * 2021-11-02 2022-02-08 北京天融信网络安全技术有限公司 Threat detection method and device and storage medium
CN114268501A (en) * 2021-12-24 2022-04-01 深信服科技股份有限公司 Data processing method, firewall generation method, computing device and storage medium
CN115186145A (en) * 2022-09-09 2022-10-14 华控清交信息科技(北京)有限公司 Privacy keyword query method, device and system
CN115237954A (en) * 2022-07-08 2022-10-25 三星(中国)半导体有限公司 Method, PIM device and system for cuckoo hash query based on PIM device
CN115576954A (en) * 2022-11-24 2023-01-06 恒生电子股份有限公司 Hash table determining method and device

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6263331B1 (en) * 1998-07-30 2001-07-17 Unisys Corporation Hybrid hash join process
CN102508924A (en) * 2011-11-22 2012-06-20 上海达梦数据库有限公司 Method for realizing grace hash joint by using merge join
CN108549666A (en) * 2018-03-22 2018-09-18 上海达梦数据库有限公司 A kind of sort method of tables of data, device, equipment and storage medium
CN111382158A (en) * 2020-03-09 2020-07-07 北京东方金信科技有限公司 Method and system for realizing parallel connection by adopting two-stage hash table structure
CN113468181A (en) * 2021-07-14 2021-10-01 中国人民解放军军事科学院国防科技创新研究院 Parallel Hash connection acceleration method and system based on FPGA
CN114021113A (en) * 2021-11-02 2022-02-08 北京天融信网络安全技术有限公司 Threat detection method and device and storage medium
CN114268501A (en) * 2021-12-24 2022-04-01 深信服科技股份有限公司 Data processing method, firewall generation method, computing device and storage medium
CN115237954A (en) * 2022-07-08 2022-10-25 三星(中国)半导体有限公司 Method, PIM device and system for cuckoo hash query based on PIM device
CN115186145A (en) * 2022-09-09 2022-10-14 华控清交信息科技(北京)有限公司 Privacy keyword query method, device and system
CN115576954A (en) * 2022-11-24 2023-01-06 恒生电子股份有限公司 Hash table determining method and device

Also Published As

Publication number Publication date
CN116028506B (en) 2025-08-22

Similar Documents

Publication Publication Date Title
CN110704360B (en) Graph calculation optimization method based on heterogeneous FPGA data flow
CN107391653B (en) Distributed NewSQL database system and picture data storage method
US20230195569A1 (en) Solid state disk access method and apparatus, device, and medium
CN104778077B (en) Figure processing method and system outside high speed core based on random and continuous disk access
US11030714B2 (en) Wide key hash table for a graphics processing unit
US11475009B2 (en) Intelligent memory allocation and deallocation of data
CN111625546B (en) Data writing method, device, equipment and medium
CN118672654B (en) Register sharing method, device, equipment and medium for general graphics processor
CN107894921A (en) A kind of realization method and system of distributed block storage volume performance statistics
CN106909554A (en) A kind of loading method and device of database text table data
CN116227599A (en) An optimization method, device, electronic equipment and storage medium for a reasoning model
CN111475736A (en) Community mining method, device and server
US20230132117A1 (en) Handling system-characteristics drift in machine learning applications
CN103729166A (en) Method, device and system for determining thread relation of program
WO2025185569A1 (en) Data management
CN113064895B (en) Incremental updating method, device and system for map
CN110442594A (en) A kind of Dynamic Execution method towards Spark SQL Aggregation Operators
CN103150157B (en) Based on the GPU kernel program restructuring optimization method of memory access difference
CN116028506A (en) Hash connection method, device, equipment and medium
US9857864B1 (en) Systems and methods for reducing power consumption in a memory architecture
CN116339643B (en) Formatting method, device, equipment and medium of a disk array
CN103870204A (en) Data writing and reading method in cache as well as cache controller
CN109992535B (en) A storage control method, device and system
CN105912404A (en) Method for searching strongly connected component in large-scale graph data on the basis of disk
JP2022552885A (en) Near memory data reduction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant