
HK40020782B - Data transmission method and device based on cloud storage and computer equipment - Google Patents

Data transmission method and device based on cloud storage and computer equipment

Info

Publication number
HK40020782B
HK40020782B HK42020010229.1A HK42020010229A HK40020782B HK 40020782 B HK40020782 B HK 40020782B HK 42020010229 A HK42020010229 A HK 42020010229A HK 40020782 B HK40020782 B HK 40020782B
Authority
HK
Hong Kong
Prior art keywords
data
partition
database
sorted
hbase database
Prior art date
Application number
HK42020010229.1A
Other languages
Chinese (zh)
Other versions
HK40020782A (en)
Inventor
邓煜
Original Assignee
平安科技(深圳)有限公司
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of HK40020782A
Publication of HK40020782B

Description

Cloud storage-based data transmission method, apparatus, and computer equipment

Technical Field

The present invention relates to the field of cloud storage technology, and in particular to a cloud storage-based data transmission method, apparatus, computer device, and storage medium.

Background Art

Currently, when data from a Hive database (Hive is a data warehouse tool that maps structured data files to database tables) is written to HBase (a distributed, column-oriented open-source database), either offline batch writing or streaming writing is generally used. Both approaches, however, write data to HBase with the put command (one of HBase's data insertion methods), which sorts while inserting. This degrades the data processing efficiency of the HBase cluster and results in low write efficiency.

Summary of the Invention

Embodiments of the present invention provide a cloud storage-based data transmission method, apparatus, computer device, and storage medium, aiming to solve the prior-art problem that data is written to HBase with the put command, which sorts while inserting, thereby degrading the data processing efficiency of the HBase cluster and resulting in low write efficiency.

In a first aspect, an embodiment of the present invention provides a cloud storage-based data transmission method, comprising:

receiving the full data uploaded by a Hive database and storing it, wherein the Hive database is a data-warehouse-style database;

obtaining the number of pre-partitions in an HBase database, wherein the HBase database is a distributed open-source database and each pre-partition in the HBase database corresponds to one partition server;

partitioning the full data according to the number of pre-partitions and the row key of each record in the full data to obtain corresponding partition data, wherein the total number of partitions of the partition data equals the number of pre-partitions and each partition's data uniquely corresponds to one partition server;

sorting each partition's data in ascending order by column and then by row key to obtain corresponding sorted partition data; and

sending each sorted partition's data to the corresponding partition server of the HBase database for storage.
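The five steps of the first aspect can be sketched end to end as follows. This is a minimal Python sketch under assumed data shapes (records as (row_key, column, value) tuples; the function and variable names are illustrative, not from the patent):

```python
import hashlib


def partition_and_sort(full_data, num_partitions):
    """Sketch of the claimed pipeline: partition the full data by hashed
    row key, then sort each partition by row key and column before it is
    shipped to the partition servers.
    """
    # Partition: hash(row_key) mod num_partitions picks the partition.
    partitions = [[] for _ in range(num_partitions)]
    for row_key, column, value in full_data:
        digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
        idx = int(digest, 16) % num_partitions
        partitions[idx].append((row_key, column, value))
    # Sort: within a partition, cells sharing a row key are ordered by
    # column, and row keys are ordered ascending (net key: row key, column).
    return [sorted(p, key=lambda c: (c[0], c[1])) for p in partitions]
```

On a real cluster these steps would run distributedly in Spark; the sketch only shows the partition-then-sort order of operations that the claims prescribe.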

In a second aspect, an embodiment of the present invention provides a cloud storage-based data transmission apparatus, comprising:

a receiving unit configured to receive the full data uploaded by a Hive database and store it, wherein the Hive database is a data-warehouse-style database;

a partition count acquisition unit configured to obtain the number of pre-partitions in an HBase database, wherein the HBase database is a distributed open-source database and each pre-partition in the HBase database corresponds to one partition server;

a partitioning unit configured to partition the full data according to the number of pre-partitions and the row key of each record in the full data to obtain corresponding partition data, wherein the total number of partitions of the partition data equals the number of pre-partitions and each partition's data uniquely corresponds to one partition server;

a sorting unit configured to sort each partition's data in ascending order by column and then by row key to obtain corresponding sorted partition data; and

a transmission unit configured to send each sorted partition's data to the corresponding partition server of the HBase database for storage.

In a third aspect, an embodiment of the present invention further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the cloud storage-based data transmission method described in the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the cloud storage-based data transmission method described in the first aspect.

Embodiments of the present invention provide a cloud storage-based data transmission method, apparatus, computer device, and storage medium that complete the sorting process in the cloud before the full data is written to the HBase database, thereby improving the efficiency of writing data to the HBase database.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Figure 1 is a schematic diagram of an application scenario of the cloud storage-based data transmission method provided by an embodiment of the present invention;

Figure 2 is a schematic flowchart of the cloud storage-based data transmission method provided by an embodiment of the present invention;

Figure 3 is a schematic diagram of a sub-process of the cloud storage-based data transmission method provided by an embodiment of the present invention;

Figure 4 is a schematic diagram of another sub-process of the cloud storage-based data transmission method provided by an embodiment of the present invention;

Figure 5 is a schematic diagram of another sub-process of the cloud storage-based data transmission method provided by an embodiment of the present invention;

Figure 6 is a schematic diagram of another sub-process of the cloud storage-based data transmission method provided by an embodiment of the present invention;

Figure 7 is a schematic block diagram of the cloud storage-based data transmission apparatus provided by an embodiment of the present invention;

Figure 8 is a schematic block diagram of a subunit of the cloud storage-based data transmission apparatus provided by an embodiment of the present invention;

Figure 9 is a schematic block diagram of another subunit of the cloud storage-based data transmission apparatus provided by an embodiment of the present invention;

Figure 10 is a schematic block diagram of another subunit of the cloud storage-based data transmission apparatus provided by an embodiment of the present invention;

Figure 11 is a schematic block diagram of another subunit of the cloud storage-based data transmission apparatus provided by an embodiment of the present invention;

Figure 12 is a schematic block diagram of the computer device provided by an embodiment of the present invention.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.

Please refer to Figures 1 and 2. Figure 1 is a schematic diagram of an application scenario of the cloud storage-based data transmission method provided by an embodiment of the present invention; Figure 2 is a schematic flowchart of that method. The method is applied to a server and is executed by application software installed on the server.

As shown in Figure 2, the method includes steps S110 to S150.

S110: Receive the full data uploaded by the Hive database and store it, wherein the Hive database is a data-warehouse-style database.

In this embodiment, the technical solution is described from the perspective of a cloud computing platform. The cloud computing platform used in this application is specifically Spark, a fast, general-purpose computing engine designed for large-scale data processing. Spark enables in-memory distributed datasets and, in addition to providing interactive queries, can optimize iterative workloads.

After the cloud computing platform receives the full data uploaded by the Hive database, it generates a logical DataFrame (a DataFrame is a Dataset organized into rows; Dataset is a new interface added in Spark 1.6+) for physical storage (physical storage combines memory and disk).

S120: Obtain the number of pre-partitions in the HBase database, wherein the HBase database is a distributed open-source database and each pre-partition in the HBase database corresponds to one partition server.

In this embodiment, after the storage of the full data has been completed on the cloud computing platform, the number of pre-partitions must first be obtained from the HBase database in order to know how many partitions the full data will subsequently be divided into for storage.

The HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server. It is a highly reliable, high-performance, column-oriented, scalable distributed storage system based on Hadoop; using HBase, large-scale structured storage clusters can be built on inexpensive commodity servers.

In one embodiment, as shown in Figure 3, step S120 includes:

S121: Send an RPC request to the HBase database, wherein the RPC request is a Remote Procedure Call protocol request;

S122: Receive the metadata sent by the HBase database in response to the RPC request, and obtain the number of pre-partitions from the metadata.

In this embodiment, after the cloud computing platform has finished storing the full data, it initiates an RPC request (a Remote Procedure Call protocol request, i.e., a request for a service from a remote computer program over a network) to access the zk metadata of the HBase database (ZooKeeper metadata; ZooKeeper is a distributed, open-source coordination service for distributed applications). The zk metadata already stores the partition information of the tables pre-built in HBase, which reveals the number of pre-partitions in the HBase database. Knowing the number of pre-partitions in the HBase database allows the full data to be accurately divided into the same number of partitions.
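Steps S121 and S122 can be illustrated with a minimal Python sketch. The client object and its get_table_regions method are hypothetical stand-ins for the RPC that reads the pre-built table's partition information from the ZooKeeper metadata:

```python
def get_prepartition_count(hbase_client, table_name):
    """Sketch: ask the cluster for the table's region (pre-partition)
    metadata and count the regions. `hbase_client.get_table_regions` is an
    assumed interface, not a real HBase client API.
    """
    regions = hbase_client.get_table_regions(table_name)
    return len(regions)
```

The returned count is what the subsequent partitioning step (S130) uses as the number of target partitions.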

S130: Partition the full data according to the number of pre-partitions and the row key of each record in the full data to obtain corresponding partition data, wherein the total number of partitions of the partition data equals the number of pre-partitions and each partition's data uniquely corresponds to one partition server.

In this embodiment, the full data stored in the DataFrame on the cloud computing platform is scattered, record by record, into the corresponding partitions according to the HexStringSplit pre-partitioning scheme. HexStringSplit is a pre-splitting method suitable for row keys prefixed with hexadecimal strings.

In one embodiment, as shown in Figure 4, step S130 includes:

S131: Obtain the row key corresponding to each record in the full data;

S132: Generate the corresponding hash value for each record's row key using the MD5 or SHA-256 hash algorithm;

S133: Take each row key's hash value modulo the number of pre-partitions to obtain the remainder corresponding to each row key;

S134: Store the data corresponding to each row key in the partition corresponding to that row key's remainder, thereby obtaining the corresponding partition data.

In this embodiment, each record in Spark has a corresponding row key (rowkey). The row key of each record is obtained first so that, after processing, the data can be assigned to the corresponding region.

Each record's row key is then hashed with the MD5 or SHA-256 algorithm to produce a corresponding hash value. MD5 is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value and is used to verify the integrity of transmitted information. SHA-256 is a secure hash algorithm that computes a fixed-length string (also called a message digest) for a digital message. Hashing the row keys with MD5 or SHA-256 scatters the records into the corresponding partitions, so that records whose row keys yield the same remainder are placed in the same partition. In this way, the full data is partitioned quickly and effectively.
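Steps S131 through S134 reduce to hashing each row key and taking the digest modulo the pre-partition count. A minimal Python sketch (the function name and string row keys are illustrative assumptions):

```python
import hashlib


def partition_index(row_key, num_partitions, algo="md5"):
    """Sketch of S131-S134: hash the row key (MD5 or SHA-256), take the
    digest value modulo the pre-partition count, and use the remainder as
    the partition index. Rows sharing a remainder land in one partition.
    """
    h = hashlib.md5 if algo == "md5" else hashlib.sha256
    digest = h(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions
```

Because the hash is deterministic, the same row key always maps to the same partition, which is what makes the pre-set partition-to-server correspondence described below workable.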

Since each pre-partition in the HBase database corresponds to one partition server, and each partition's data uniquely corresponds to one partition server, the correspondence between partition data and partition servers can be set in advance, for example partition 1 corresponding to partition server 1, ..., and partition N to partition server N. Once the correspondence between each partition's data and its partition server is known, targeted storage can be performed during subsequent data storage, improving storage efficiency.

S140: Sort each partition's data in ascending order by column and then by row key to obtain the corresponding sorted partition data.

In this embodiment, after the full data has been partitioned on the cloud computing platform according to the number of pre-partitions, each partition's data must also be sorted; once sorting is complete, the data can be sent to the HBase database and stored quickly. When sorting each partition's data, the column values and row key values can be used as sort keys.

In one embodiment, as shown in Figure 5, step S140 includes:

S141: Within each partition's data, obtain the records sharing the same row key and sort them in ascending order by column to obtain first sorted partition data corresponding to each partition;

S142: Sort each set of first sorted partition data in ascending order by row key to obtain the sorted partition data corresponding to each set of first sorted partition data.

In this embodiment, within each partition the records sharing the same row key value are first grouped together, and within each group the records are sorted in ascending order by column value, yielding the first sorted partition data. The first sorted partition data obtained from this initial sort is then sorted in ascending order by row key, yielding the sorted partition data corresponding to each set of first sorted partition data. As can be seen, sorting each partition's data by column and row key allows the data to be stored more regularly.
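The two-pass sort of steps S141 and S142 can be sketched as follows, again assuming (row_key, column, value) tuples (an illustrative shape, not specified by the patent):

```python
from itertools import groupby


def sort_partition(cells):
    """Sketch of S141-S142: group cells by row key, sort each group by
    column (first pass), then order the groups by row key (second pass).
    """
    # groupby requires the input to be grouped; pre-sort by row key.
    by_row = sorted(cells, key=lambda c: c[0])
    groups = []
    for row_key, grp in groupby(by_row, key=lambda c: c[0]):
        # First pass: ascending column order within one row key.
        groups.append((row_key, sorted(grp, key=lambda c: c[1])))
    # Second pass: ascending row key order across the groups.
    groups.sort(key=lambda g: g[0])
    return [cell for _, grp in groups for cell in grp]
```

The result is the (row key, column) ordering that HFiles expect, which is what lets the later write step skip put-style sort-while-insert.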

S150: Send each sorted partition's data to the corresponding partition server of the HBase database for storage.

In this embodiment, after each partition's data has been sorted to obtain the corresponding sorted partition data, it can be sent directly to the HBase database for storage. There is no need to sort while inserting, as the put command does, which would degrade the data processing efficiency of the HBase cluster. The already partitioned and sorted data is stored directly in the HBase database, which improves storage efficiency.

In one embodiment, as shown in Figure 6, step S150 includes:

S151: Input each sorted partition's data into the local HDFS layer to convert it into corresponding data files, wherein the HDFS layer is a distributed file system layer;

S152: Send the data files to the corresponding partition server of the HBase database for storage.

In this embodiment, the lowest layer of the cloud computing platform (i.e., Spark) is the HDFS layer, which stores data. Each sorted partition's data is input into the HDFS layer, which converts it into data files. The data files are specifically HFile files. An HFile comprises seven kinds of blocks, which by block type are:

a) data block, which stores key-value data; the default size of a data block is generally 64 KB;

b) data index block, which stores the index of the data blocks; the index can be multi-level, and intermediate and leaf indexes are generally distributed throughout the HFile;

c) bloom filter block, which stores the values of the Bloom filter;

d) meta data block; there are multiple meta data blocks, and they are laid out contiguously;

e) meta data index, which is the index of the meta data;

f) file-info block, which records information about the file, such as the largest key in the HFile, the average key length, the HFile creation timestamp, and the encoding used by the data blocks;

g) trailer block, which every HFile has. The trailer length may differ between HFile versions (there are three versions, V1, V2, and V3; V2 and V3 differ little), but all HFile trailers of the same version have the same length, and the last 4 bytes of the trailer always contain the version information.

As can be seen, each sorted partition's data is stored in the local HDFS layer, converted into HFile files.

Once the HDFS layer has converted each sorted partition's data into HFile files, the HFile files corresponding to each sorted partition can be sent to the corresponding partition server of the HBase database. The HBase partition server then uses the Bulkload approach (a bulk loading scheme) to write the HFiles into the HBase database. The advantages of Bulkload are that the import process does not occupy partition resources, massive amounts of data can be imported quickly, and memory is saved.

In one embodiment, after step S150 the method further includes:

if a data transmission error message sent by the HBase database is detected, locating the data transmission interruption point in each sorted partition's data according to the log file corresponding to the error message; and

sending the data located after the interruption point in each sorted partition's data to the corresponding partition server of the HBase database for storage.

In this embodiment, if transmission is interrupted while each sorted partition's data is being sent to the HBase database for storage, the data transmission error message sent by the HBase database can be received, and the interruption point can be located in each sorted partition's data according to the log file corresponding to the error message. Once the interruption point has been obtained, transmission can resume from the data after that point, ensuring that normal transmission can be restored after an anomaly.
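The recovery step can be sketched as follows. The last_ok_index recovered from the log file and the send callback are illustrative assumptions about interfaces the patent does not specify:

```python
def resume_transfer(partition, send, last_ok_index):
    """Sketch of the retry step: `last_ok_index` is the position of the
    last record confirmed by the log file (an assumed log format);
    everything after it is retransmitted. `send` stands in for the call
    that ships one record to the partition server.
    """
    for record in partition[last_ok_index + 1:]:
        send(record)
    # Return how many records were retransmitted.
    return len(partition) - (last_ok_index + 1)
```

Because each partition's data is already sorted, the interruption point cleanly splits the partition into an already-stored prefix and a suffix to resend, so no records are duplicated or skipped.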

This method completes the sorting process in the cloud before the full data is written to the HBase database, improving the efficiency of writing data to the HBase database.

An embodiment of the present invention further provides a cloud storage-based data transmission apparatus for performing any embodiment of the foregoing cloud storage-based data transmission method. Specifically, please refer to Figure 7, which is a schematic block diagram of the cloud storage-based data transmission apparatus provided by an embodiment of the present invention. The cloud storage-based data transmission apparatus 100 may be configured in a server.

As shown in Figure 7, the cloud storage-based data transmission apparatus 100 includes a receiving unit 110, a partition count acquisition unit 120, a partitioning unit 130, a sorting unit 140, and a transmission unit 150.

The receiving unit 110 is configured to receive the full data uploaded by the Hive database and store it, wherein the Hive database is a data-warehouse-style database.

In this embodiment, the technical solution is described from the perspective of a cloud computing platform. The cloud computing platform used in this application is specifically Spark, a fast, general-purpose computing engine designed for large-scale data processing. Spark enables in-memory distributed datasets and, in addition to providing interactive queries, can optimize iterative workloads.

After the cloud computing platform receives the full data uploaded by the Hive database, it generates a logical DataFrame (a DataFrame is a Dataset organized into rows; Dataset is a new interface added in Spark 1.6+) for physical storage (physical storage combines memory and disk).

The partition count acquisition unit 120 is configured to obtain the number of pre-partitions in the HBase database, wherein the HBase database is a distributed open-source database and each pre-partition in the HBase database corresponds to one partition server.

In this embodiment, after the storage of the full data has been completed on the cloud computing platform, the number of pre-partitions must first be obtained from the HBase database in order to know how many partitions the full data will subsequently be divided into for storage.

The HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server. It is a highly reliable, high-performance, column-oriented, scalable distributed storage system based on Hadoop; using HBase, large-scale structured storage clusters can be built on inexpensive commodity servers.

In one embodiment, as shown in Figure 8, the partition count acquisition unit 120 includes:

a request sending unit 121 configured to send an RPC request to the HBase database, wherein the RPC request is a Remote Procedure Call protocol request; and

a metadata parsing unit 122 configured to receive the metadata sent by the HBase database in response to the RPC request and obtain the number of pre-partitions from the metadata.

In this embodiment, after the cloud computing platform has finished storing the full data, it initiates an RPC request (a Remote Procedure Call protocol request, i.e., a request for a service from a remote computer program over a network) to access the zk metadata of the HBase database (ZooKeeper metadata; ZooKeeper is a distributed, open-source coordination service for distributed applications). The zk metadata already stores the partition information of the tables pre-built in HBase, which reveals the number of pre-partitions in the HBase database. Knowing the number of pre-partitions in the HBase database allows the full data to be accurately divided into the same number of partitions.

The partitioning unit 130 is configured to partition the full data according to the number of pre-partitions and the row key of each record in the full data to obtain corresponding partition data, wherein the total number of partitions of the partition data equals the number of pre-partitions and each partition's data uniquely corresponds to one partition server.

在本实施例中,将云计算平台中的dataframe中所存储的全量数据,根据HexStringSplit的预分区方式将所述全量数据中每一条数据打散到对应的分区中。其中,HexStringSplit是一种适用于行键以十六进制字符串为前缀的预分区方式。In this embodiment, the full data stored in the dataframe of the cloud computing platform is distributed into the corresponding partitions according to the HexStringSplit pre-partitioning method. HexStringSplit is a pre-partitioning method suited to row keys prefixed with hexadecimal strings.
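HexStringSplit divides the hexadecimal key space into equal-sized ranges. A rough Python approximation of where the split boundaries fall (a sketch of the idea, not HBase's exact implementation):

```python
def hex_split_points(num_regions, width=8):
    """Uniform split points over the hex-string keyspace, in the spirit
    of HBase's HexStringSplit (an approximation, not the exact tool)."""
    space = 16 ** width          # size of the keyspace, e.g. 16^8
    step = space // num_regions  # width of each region's key range
    return [format(i * step, f"0{width}x") for i in range(1, num_regions)]

print(hex_split_points(4))  # ['40000000', '80000000', 'c0000000']
```

A row key whose hex prefix falls between two adjacent split points is routed to the region bounded by them, which is why prefix distribution determines load balance.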

在一实施例中,如图9所示,分区单元130包括:In one embodiment, as shown in FIG9, the partitioning unit 130 includes:

行键获取单元131,用于获取所述全量数据中各数据对应的行键;Row key acquisition unit 131 is used to acquire the row key corresponding to each data in the full data;

哈希单元132,用于将各数据的行键通过MD5加密算法或SHA-256加密算法生成对应的哈希值;Hash unit 132 is used to generate a corresponding hash value for each row key of the data using the MD5 encryption algorithm or the SHA-256 encryption algorithm;

求模运算单元133,用于将各行键对应的哈希值对所述预分区个数求模,得到与各行键对应的余数;Modulo operation unit 133 is used to calculate the modulo of the hash value corresponding to each row key with the number of pre-partitions to obtain the remainder corresponding to each row key;

数据分区单元134,用于将各行键对应的数据存储至该行键对应的余数所对应的分区中,以得到对应的分区数据。Data partitioning unit 134 is used to store the data corresponding to each row key into the partition corresponding to the remainder of the row key, so as to obtain the corresponding partition data.

在本实施例中,在Spark中各数据均对应有一个行键(即rowkey),此时先获取各数据的行键,便于对应进行处理后将数据划分至对应的区域。In this embodiment, each piece of data in Spark has a corresponding row key. The row key of each piece of data is obtained first, so that the data can be processed and divided into the corresponding regions.

之后将各数据的行键通过MD5加密算法或SHA-256加密算法进行计算,生成对应的哈希值。其中,MD5算法是一种被广泛使用的密码散列函数,可以产生出一个128位(16字节)的散列值(hash value),用于确保信息传输完整一致。SHA-256算法是一种安全散列算法,能计算出一个数字消息所对应的长度固定的字符串(又称消息摘要)。通过上述MD5或SHA-256的方式将行键生成哈希值,并据此将数据打散到对应的分区中,使得行键哈希余数相同的数据被划分在同一分区中。通过这一方式,实现了对全量数据的快速且有效的划分。The row key of each data item is then computed with the MD5 or SHA-256 algorithm to generate the corresponding hash value. MD5 is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, used to ensure the integrity and consistency of transmitted information. SHA-256 is a secure hash algorithm that computes a fixed-length string (also called a message digest) for a digital message. Generating hash values from the row keys with MD5 or SHA-256 and scattering the data accordingly ensures that data whose row-key hashes share the same remainder is placed in the same partition. This achieves fast and effective partitioning of the full data.
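The hash-and-modulo step can be sketched in Python; the row data below are made-up examples, and `hashlib.md5` stands in for the MD5 computation described:

```python
import hashlib

def assign_partition(row_key, num_partitions):
    """Hash the row key with MD5, then take the remainder modulo the
    number of pre-partitions, as described in the text."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Toy rows: (row key, column, value).
rows = [("user001", "colA", "v1"),
        ("user002", "colA", "v2"),
        ("user001", "colB", "v3")]

partitions = {}
for key, col, val in rows:
    partitions.setdefault(assign_partition(key, 4), []).append((key, col, val))
# Rows sharing a row key always hash to the same remainder,
# so they always land in the same partition.
```

Because the assignment is deterministic, every record with row key `user001` is guaranteed to reach the same partition server, regardless of when or where it is processed.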

由于所述HBase数据库中每一预分区均对应一个分区服务器,而且每一分区数据唯一对应一个分区服务器,故分区数据与分区服务器的对应关系可以预先设置,例如分区1对应分区服务器1,……,分区N对应分区服务器N。在获知了各分区数据与分区服务器的对应关系后,后续进行数据存储时,则可实现定向存储,提高存储效率。Since each pre-partition in the HBase database corresponds to a partition server, and each partition's data uniquely corresponds to a partition server, the correspondence between partition data and partition servers can be pre-set. For example, partition 1 corresponds to partition server 1, ..., partition N corresponds to partition server N. Once the correspondence between each partition's data and its partition server is known, targeted storage can be implemented during subsequent data storage, improving storage efficiency.

排序单元140,用于将每个分区数据依次根据列和行键进行升序排序,得到对应的排序后分区数据。The sorting unit 140 is used to sort the data of each partition in ascending order according to the column and row keys to obtain the corresponding sorted partition data.

在本实施例中,当在云计算平台中将全量数据根据预分区个数对应进行分区后,之后还需要对每一分区数据再进行排序,当完成排序后再发送至所述Hbase数据库即可快速存储。此时对各分区数据进行排序时,可以选取列值和行键值的大小来进行排序。In this embodiment, after the full data is partitioned according to the number of pre-defined partitions in the cloud computing platform, the data in each partition needs to be sorted. Once sorting is complete, the data is sent to the HBase database for fast storage. When sorting the data in each partition, the column value and row key value can be selected for sorting.

在一实施例中,如图10所示,排序单元140包括:In one embodiment, as shown in FIG10, the sorting unit 140 includes:

第一排序单元141,用于在每个分区数据中各自获取具有相同行键的数据,将具有相同行键的数据中根据列的升序进行排序,得到与每一分区数据对应的第一排序后分区数据;The first sorting unit 141 is used to obtain data with the same row key in each partition data, sort the data with the same row key according to the ascending order of the column, and obtain the first sorted partition data corresponding to each partition data.

第二排序单元142,用于将每一第一排序后分区数据根据行键的升序进行排序,得到与每一第一排序后分区数据对应的排序后分区数据。The second sorting unit 142 is used to sort each first sorted partition data according to the ascending order of the row key to obtain sorted partition data corresponding to each first sorted partition data.

在本实施例中,在每一分区数据中先将具有相同行键值的数据归为一类,具有相同行键值的数据内部则按照列值进行升序排序,从而得到第一排序后分区数据。完成初次排序后,再根据行键的升序对各第一排序后分区数据进行排序,得到与每一第一排序后分区数据对应的排序后分区数据。可见,通过列和行键对各分区数据进行排序后,能将数据更有规律地存储。In this embodiment, within each partition, data with the same row key value are first grouped together, and within each group the data are sorted in ascending order by column value, yielding the first sorted partition data. After this initial sort, each first sorted partition data is further sorted in ascending order by row key, yielding the sorted partition data corresponding to each first sorted partition data. Thus, sorting each partition's data by column and row key stores the data more systematically.
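The two-stage ordering above can be sketched as a single composite sort, since ordering the groups by row key after a per-group column sort produces the same result as one sort on the (row key, column) pair:

```python
def sort_partition(rows):
    """Two-stage ordering from the text: within each row key, sort by
    column (ascending); then order the row-key groups ascending.
    Equivalent to a single stable sort on (row_key, column)."""
    return sorted(rows, key=lambda r: (r[0], r[1]))

# Toy rows: (row key, column, value).
data = [("k2", "c1", "v"), ("k1", "c2", "v"), ("k1", "c1", "v")]
print(sort_partition(data))
# [('k1', 'c1', 'v'), ('k1', 'c2', 'v'), ('k2', 'c1', 'v')]
```

This total (row key, column) order is what lets the receiving side store the data sequentially instead of re-sorting on insert.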

传输单元150,用于将各排序后分区数据发送至所述Hbase数据库对应的分区服务器中以进行存储。The transmission unit 150 is used to send the sorted partition data to the partition server corresponding to the HBase database for storage.

在本实施例中,完成对各分区数据的排序而得到对应的各排序后分区数据后,直接发送至所述Hbase数据库进行存储即可,无需再像put指令那样一边排序一边插入而影响HBase集群的数据处理效率;已完成分区和排序的数据只需直接存储于所述Hbase数据库,提高了存储效率。In this embodiment, once the partition data have been sorted into the corresponding sorted partition data, they can be sent directly to the HBase database for storage. There is no need to sort while inserting, as the put command does, which would affect the data processing efficiency of the HBase cluster; the already partitioned and sorted data only needs to be stored directly in the HBase database, improving storage efficiency.

在一实施例中,如图11所示,传输单元150包括:In one embodiment, as shown in FIG11, the transmission unit 150 includes:

底层存储单元151,用于将各排序后分区数据输入至本地的HDFS层,以将各排序后分区数据转化为对应的数据文件;其中,所述HDFS层为分布式文件系统层;The underlying storage unit 151 is used to input the sorted partition data to the local HDFS layer to convert the sorted partition data into corresponding data files; wherein, the HDFS layer is a distributed file system layer.

数据发送单元152,用于将所述数据文件发送至所述Hbase数据库对应的分区服务器中以进行存储。The data sending unit 152 is used to send the data file to the partition server corresponding to the HBase database for storage.

在本实施例中,云计算平台(即Spark)的最底层是用于存储数据的HDFS层,将各排序后分区数据输入至HDFS层,即可由HDFS层将各排序后分区数据转化为数据文件。可见,各排序后分区数据是存储在本地的HDFS层中,而且是转化为HFile文件的方式进行存储。In this embodiment, the lowest layer of the cloud computing platform (i.e., Spark) is the HDFS layer used for data storage. The sorted partition data is input into the HDFS layer, which then converts the sorted partition data into data files. Therefore, the sorted partition data is stored locally in the HDFS layer and is stored as HFile files.

当在HDFS层将各排序后分区数据转化为HFile文件后,即可将各排序后分区数据对应的HFile文件发送至Hbase数据库对应的分区服务器中。之后再由Hbase数据库的分区服务器采用Bulkload方案(即批量加载方案)将HFile写入HBase数据库。其中,Bulkload的优点在于:导入过程不占用分区资源;能快速导入海量的数据;节省内存。Once the sorted partition data are converted into HFile files at the HDFS layer, the corresponding HFile files can be sent to the corresponding partition servers of the HBase database. The partition servers of the HBase database then use the Bulkload scheme (i.e., the bulk loading scheme) to write the HFiles into the HBase database. The advantages of Bulkload are that the import process does not consume partition resources, it can quickly import massive amounts of data, and it saves memory.
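On the HBase side, a bulk load of HFiles is commonly driven by the `LoadIncrementalHFiles` tool invoked from the `hbase` command. The sketch below only builds the shell command; the fully qualified class name differs across HBase versions (older releases ship it under a `mapreduce` package), so it is left as a parameter rather than a confirmed constant, and the HDFS path is a made-up example:

```python
def bulkload_command(hfile_dir, table,
                     tool="org.apache.hadoop.hbase.tool.LoadIncrementalHFiles"):
    """Build the shell command for HBase's bulk load tool.

    `tool` is version-dependent (assumption: an HBase 2.x-style class
    name); `hfile_dir` is the HDFS directory holding the HFiles.
    """
    return ["hbase", tool, hfile_dir, table]

cmd = bulkload_command("/user/spark/hfiles/user_table", "user_table")
print(" ".join(cmd))
```

In a real deployment this list would be handed to a process runner (e.g. `subprocess.run(cmd, check=True)`) on a node with HBase client configuration available.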

在一实施例中,基于云存储的数据传输装置100还包括:In one embodiment, the cloud storage-based data transmission device 100 further includes:

中断点获取单元,用于若检测到已接收所述Hbase数据库发送的数据传输报错信息,根据数据传输报错信息对应的日志文件在各排序后分区数据定位获取数据传输中断点;The interruption point acquisition unit is used to locate and acquire the data transmission interruption point in each sorted partition data according to the log file corresponding to the data transmission error information if a data transmission error information has been received from the HBase database.

数据传输恢复单元,用于将位于各排序后分区数据的数据传输中断点之后的数据,发送至所述Hbase数据库对应的分区服务器中以进行存储。The data transmission recovery unit is used to send the data located after the data transmission interruption point of each sorted partition data to the partition server corresponding to the HBase database for storage.

在本实施例中,在将各排序后分区数据发送至所述Hbase数据库进行存储的过程中,若存在传输中断的情况,此时可以接收由所述Hbase数据库发送的数据传输报错信息,并根据数据传输报错信息对应的日志文件在各排序后分区数据中定位数据传输中断点。在获取了数据传输中断点后,即可从该数据传输中断点之后的数据开始继续传输,确保异常发生后也能恢复正常传输。In this embodiment, if a transmission interruption occurs while the sorted partition data are being sent to the HBase database for storage, a data transmission error message sent by the HBase database can be received, and the interruption point can be located in the sorted partition data based on the log file corresponding to the error message. Once the interruption point is identified, transmission can resume with the data after that point, ensuring that normal transmission can be restored even after an anomaly occurs.
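A minimal sketch of this resume logic, assuming a hypothetical log format with `sent=<index>` progress markers (the actual log layout is not specified in the text):

```python
def resume_point_from_log(log_lines):
    """Find the last successfully sent index from hypothetical
    'sent=<n>' markers in the log; -1 means restart from the top."""
    last = -1
    for line in log_lines:
        if "sent=" in line:
            last = int(line.split("sent=")[1].split()[0])
    return last

def resume_transfer(sorted_rows, log_lines):
    """Return only the rows after the interruption point."""
    return sorted_rows[resume_point_from_log(log_lines) + 1:]

rows = ["r0", "r1", "r2", "r3", "r4"]
log = ["partition=0 sent=1", "partition=0 sent=2", "ERROR: connection reset"]
print(resume_transfer(rows, log))  # ['r3', 'r4']
```

Because the partition data are already sorted, an index-based interruption point is well defined, which is what makes this resume-after-failure scheme possible.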

该装置实现了全量数据写入Hbase数据库之前,将排序过程在云端完成,提高了数据写入Hbase数据库的效率。This device enables the sorting process to be completed in the cloud before all data is written to the HBase database, thus improving the efficiency of writing data to the HBase database.

上述基于云存储的数据传输装置可以实现为计算机程序的形式,该计算机程序可以在如图12所示的计算机设备上运行。The aforementioned cloud storage-based data transmission device can be implemented as a computer program, which can run on the computer device shown in Figure 12.

请参阅图12,图12是本发明实施例提供的计算机设备的示意性框图。该计算机设备500是服务器,服务器可以是独立的服务器,也可以是多个服务器组成的服务器集群。Please refer to Figure 12, which is a schematic block diagram of a computer device provided in an embodiment of the present invention. The computer device 500 is a server, which can be a standalone server or a server cluster composed of multiple servers.

参阅图12,该计算机设备500包括通过系统总线501连接的处理器502、存储器和网络接口505,其中,存储器可以包括非易失性存储介质503和内存储器504。Referring to Figure 12, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected via a system bus 501. The memory may include a non-volatile storage medium 503 and internal memory 504.

该非易失性存储介质503可存储操作系统5031和计算机程序5032。该计算机程序5032被执行时,可使得处理器502执行基于云存储的数据传输方法。The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, it enables the processor 502 to perform a cloud storage-based data transfer method.

该处理器502用于提供计算和控制能力,支撑整个计算机设备500的运行。The processor 502 provides computing and control capabilities to support the operation of the entire computer device 500.

该内存储器504为非易失性存储介质503中的计算机程序5032的运行提供环境,该计算机程序5032被处理器502执行时,可使得处理器502执行基于云存储的数据传输方法。The internal memory 504 provides an environment for the execution of the computer program 5032 in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, the processor 502 can execute a cloud storage-based data transmission method.

该网络接口505用于进行网络通信,如提供数据信息的传输等。本领域技术人员可以理解,图12中示出的结构,仅仅是与本发明方案相关的部分结构的框图,并不构成对本发明方案所应用于其上的计算机设备500的限定,具体的计算机设备500可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。The network interface 505 is used for network communication, such as providing data information transmission. Those skilled in the art will understand that the structure shown in FIG12 is merely a block diagram of a portion of the structure related to the present invention and does not constitute a limitation on the computer device 500 to which the present invention is applied. A specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have different component arrangements.

其中,所述处理器502用于运行存储在存储器中的计算机程序5032,以实现本发明实施例公开的基于云存储的数据传输方法。The processor 502 is used to run a computer program 5032 stored in a memory to implement the cloud storage-based data transmission method disclosed in this embodiment of the invention.

本领域技术人员可以理解,图12中示出的计算机设备的实施例并不构成对计算机设备具体构成的限定,在其他实施例中,计算机设备可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。例如,在一些实施例中,计算机设备可以仅包括存储器及处理器,在这样的实施例中,存储器及处理器的结构及功能与图12所示实施例一致,在此不再赘述。Those skilled in the art will understand that the embodiments of the computer device shown in FIG12 do not constitute a limitation on the specific configuration of the computer device. In other embodiments, the computer device may include more or fewer components than shown, or combine certain components, or have different component arrangements. For example, in some embodiments, the computer device may include only a memory and a processor. In such embodiments, the structure and function of the memory and processor are consistent with those shown in FIG12, and will not be repeated here.

应当理解,在本发明实施例中,处理器502可以是中央处理单元(CentralProcessing Unit,CPU),该处理器502还可以是其他通用处理器、数字信号处理器(DigitalSignal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。其中,通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。It should be understood that, in this embodiment of the invention, the processor 502 may be a Central Processing Unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor.

在本发明的另一实施例中提供计算机可读存储介质。该计算机可读存储介质可以为非易失性的计算机可读存储介质。该计算机可读存储介质存储有计算机程序,其中计算机程序被处理器执行时实现本发明实施例公开的基于云存储的数据传输方法。In another embodiment of the invention, a computer-readable storage medium is provided. This computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the cloud storage-based data transmission method disclosed in this embodiment of the invention.

所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的设备、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。Those skilled in the art will readily understand that, for the sake of convenience and brevity, the specific working processes of the devices, apparatuses, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here. Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been generally described in terms of function in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention.

在本发明所提供的几个实施例中,应该理解到,所揭露的设备、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为逻辑功能划分,实际实现时可以有另外的划分方式,也可以将具有相同功能的单元集合成一个单元,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口、装置或单元的间接耦合或通信连接,也可以是电的,机械的或其它的形式连接。In the embodiments provided by this invention, it should be understood that the disclosed devices, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. Units with the same function may be grouped into one unit. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, or it may be an electrical, mechanical, or other form of connection.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本发明实施例方案的目的。The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of the embodiments of the present invention, depending on actual needs.

另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以是两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分,或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、磁碟或者光盘等各种可以存储程序代码的介质。If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), magnetic disks, or optical disks.

以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以权利要求的保护范围为准。The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims (8)

1.一种基于云存储的数据传输方法,其特征在于,包括:1. A data transmission method based on cloud storage, characterized in that it includes: 接收由Hive数据库上传的全量数据,并进行存储;其中,所述Hive数据库为数据仓库式数据库;Receive and store all data uploaded from a Hive database; wherein the Hive database is a data warehouse database. 获取HBase数据库中的预分区个数;其中,所述HBase数据库为分布式开源数据库,且所述HBase数据库中每一预分区均对应一个分区服务器;Get the number of pre-partitions in the HBase database; wherein, the HBase database is a distributed open source database, and each pre-partition in the HBase database corresponds to a partition server; 根据所述预分区个数及全量数据中各数据的行键,对所述全量数据进行分区,得到对应的分区数据;其中,分区数据的总分区数与所述预分区个数相等,且每一分区数据唯一对应一个分区服务器;Based on the number of pre-partitions and the row keys of each data in the full data, the full data is partitioned to obtain the corresponding partition data; wherein, the total number of partitions of the partition data is equal to the number of pre-partitions, and each partition data uniquely corresponds to a partition server; 将每个分区数据依次根据列和行键进行升序排序,得到对应的排序后分区数据;以及Sort the data in each partition in ascending order according to the column and row keys to obtain the corresponding sorted partition data; and 将各排序后分区数据发送至所述HBase数据库对应的分区服务器中以进行存储;The sorted partition data is sent to the partition server corresponding to the HBase database for storage; 所述获取HBase数据库中的预分区个数,包括:The process of obtaining the number of pre-partitions in the HBase database includes: 发送RPC请求至所述HBase数据库;其中,所述RPC请求为远程过程调用协议请求;Send an RPC request to the HBase database; wherein the RPC request is a Remote Procedure Call protocol request; 接收所述HBase数据库根据所述RPC请求发送的元信息,根据元信息获取预分区个数;Receive metadata sent by the HBase database according to the RPC request, and obtain the number of pre-partitions based on the metadata; 其中,所述元信息里已存储HBase数据库预先建好的表的分区信息。The metadata contains partition information for tables pre-built in the HBase database. 2.根据权利要求1所述的基于云存储的数据传输方法,其特征在于,所述根据所述预分区个数及全量数据中各数据的行键,对所述全量数据进行分区,得到对应的分区数据,包括:2. 
The data transmission method based on cloud storage according to claim 1, characterized in that, the step of partitioning the full data according to the number of pre-partitions and the row key of each data in the full data to obtain the corresponding partition data includes: 获取所述全量数据中各数据对应的行键;Obtain the row key corresponding to each data point in the full dataset; 将各数据的行键通过MD5加密算法或SHA-256加密算法生成对应的哈希值;Generate the corresponding hash value for the row key of each data using the MD5 encryption algorithm or the SHA-256 encryption algorithm; 将各行键对应的哈希值对所述预分区个数求模,得到与各行键对应的余数;Take the modulo of the hash value corresponding to each row key with the number of pre-partitions to obtain the remainder corresponding to each row key; 将各行键对应的数据存储至该行键对应的余数所对应的分区中,以得到对应的分区数据。The data corresponding to each row key is stored in the partition corresponding to the remainder of that row key, so as to obtain the corresponding partition data. 3.根据权利要求1所述的基于云存储的数据传输方法,其特征在于,所述将每个分区数据依次根据列和行键进行升序排序,得到对应的排序后分区数据,包括:3. The data transmission method based on cloud storage according to claim 1, characterized in that, the step of sorting each partition data in ascending order according to the column and row keys to obtain the corresponding sorted partition data includes: 在每个分区数据中各自获取具有相同行键的数据,将具有相同行键的数据中根据列的升序进行排序,得到与每一分区数据对应的第一排序后分区数据;In each partition, retrieve the data with the same row key, sort the data with the same row key in ascending order of the column, and obtain the first sorted partition data corresponding to each partition. 将每一第一排序后分区数据根据行键的升序进行排序,得到与每一第一排序后分区数据对应的排序后分区数据。Sort each first sorted partition data in ascending order according to the row key to obtain the sorted partition data corresponding to each first sorted partition data. 4.根据权利要求1所述的基于云存储的数据传输方法,其特征在于,所述将各排序后分区数据发送至所述HBase数据库对应的分区服务器中以进行存储,包括:4. 
The data transmission method based on cloud storage according to claim 1, characterized in that, sending the sorted partition data to the partition server corresponding to the HBase database for storage includes: 将各排序后分区数据输入至本地的HDFS层,以将各序后分区数据转化为对应的数据文件;其中,所述HDFS层为分布式文件系统层;The sorted partition data is input into the local HDFS layer to convert the sorted partition data into corresponding data files; wherein, the HDFS layer is a distributed file system layer. 将所述数据文件发送至所述HBase数据库对应的分区服务器中以进行存储。The data file is sent to the partition server corresponding to the HBase database for storage. 5.根据权利要求1所述的基于云存储的数据传输方法,其特征在于,所述将各排序后分区数据发送至所述HBase数据库对应的分区服务器中以进行存储之后,还包括:5. The data transmission method based on cloud storage according to claim 1, characterized in that, after sending the sorted partition data to the partition server corresponding to the HBase database for storage, it further includes: 若检测到已接收所述HBase数据库发送的数据传输报错信息,根据数据传输报错信息对应的日志文件在各排序后分区数据定位获取数据传输中断点;If a data transmission error message is detected that has been received from the HBase database, the data transmission interruption point is located in the sorted partition data according to the log file corresponding to the data transmission error message. 将位于各排序后分区数据的数据传输中断点之后的数据,发送至所述HBase数据库对应的分区服务器中以进行存储。Data located after the data transmission interruption point of each sorted partition is sent to the partition server corresponding to the HBase database for storage. 6.一种基于云存储的数据传输装置,其特征在于,包括:6. A data transmission device based on cloud storage, characterized in that it comprises: 接收单元,用于接收由Hive数据库上传的全量数据,并进行存储;其中,所述Hive数据库为数据仓库式数据库;A receiving unit is used to receive and store the full amount of data uploaded from the Hive database; wherein the Hive database is a data warehouse database. 
分区个数获取单元,用于获取HBase数据库中的预分区个数;其中,所述HBase数据库为分布式开源数据库,且所述HBase数据库中每一预分区均对应一个分区服务器;The partition count acquisition unit is used to obtain the number of pre-partitions in the HBase database; wherein, the HBase database is a distributed open source database, and each pre-partition in the HBase database corresponds to a partition server; 分区单元,用于根据所述预分区个数及全量数据中各数据的行键,对所述全量数据进行分区,得到对应的分区数据;其中,分区数据的总分区数与所述预分区个数相等,且每一分区数据唯一对应一个分区服务器;A partitioning unit is used to partition the full data according to the number of pre-partitions and the row key of each data in the full data to obtain corresponding partition data; wherein, the total number of partitions of the partition data is equal to the number of pre-partitions, and each partition data uniquely corresponds to a partition server; 排序单元,用于将每个分区数据依次根据列和行键进行升序排序,得到对应的排序后分区数据;以及The sorting unit is used to sort the data in each partition in ascending order according to the column and row keys, thus obtaining the corresponding sorted partition data; and 传输单元,用于将各排序后分区数据发送至所述HBase数据库对应的分区服务器中以进行存储;The transmission unit is used to send the sorted partition data to the partition server corresponding to the HBase database for storage; 所述分区个数获取单元,包括:The partition count acquisition unit includes: 请求发送单元,用于发送RPC请求至所述HBase数据库;其中,所述RPC请求为远程过程调用协议请求;A request sending unit is used to send an RPC request to the HBase database; wherein the RPC request is a Remote Procedure Call protocol request; 元信息解析单元,用于接收所述HBase数据库根据所述RPC请求发送的元信息,根据元信息获取预分区个数;Meta-information parsing unit, used to receive meta-information sent by the HBase database according to the RPC request, and obtain the number of pre-partitions based on the meta-information; 其中,所述元信息里已存储HBase数据库预先建好的表的分区信息。The metadata contains partition information for tables pre-built in the HBase database. 7.一种计算机设备,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现如权利要求1至5中任一项所述的基于云存储的数据传输方法。7. 
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the data transmission method based on cloud storage as described in any one of claims 1 to 5. 8.一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序,所述计算机程序当被处理器执行时使所述处理器执行如权利要求1至5任一项所述的基于云存储的数据传输方法。8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which, when executed by a processor, causes the processor to perform the data transmission method based on cloud storage as described in any one of claims 1 to 5.
HK42020010229.1A 2020-06-30 Data transmission method and device basedon cloud storage and computer equipment HK40020782B (en)

Publications (2)

Publication Number Publication Date
HK40020782A HK40020782A (en) 2020-10-30
HK40020782B true HK40020782B (en) 2024-05-03
