CN107566496A - Hadoop data storage method and device - Google Patents

A Hadoop data storage method and device

Info

Publication number
CN107566496A
Authority
CN
China
Prior art keywords
datanode
data
node
data storage
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710799237.4A
Other languages
Chinese (zh)
Inventor
辛永欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201710799237.4A
Publication of CN107566496A
Legal status: Pending

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

An embodiment of the invention discloses a Hadoop data storage method, comprising: when a data storage request submitted by a user is received, the NameNode randomly selects a preset number of DataNodes from multiple different racks; obtaining, for each of the preset number of DataNodes, its distance in the network topology to the current DataNode and the number of data replicas it currently stores; calculating a scheduling evaluation value for each DataNode according to the distance information and the number of data replicas; and selecting data storage nodes according to the calculated scheduling evaluation values. An embodiment of the invention also discloses a Hadoop data storage device. The scheme balances the data storage load while achieving good data transmission performance.

Description

A Hadoop data storage method and device

Technical field

Embodiments of the present invention relate to data storage technology, and in particular to a Hadoop data storage method and device.

Background

With the development of the Internet and distributed computing, data-intensive applications have become increasingly common. These applications often involve several terabytes (TB) of data, so processing large volumes of data efficiently, reliably, and conveniently has become an important research direction, and storing massive data reliably and sensibly is an important problem in the Hadoop system (a distributed system infrastructure). Hadoop stores multiple replicas of data on different machines in the cluster, so the data can still be read when a node fails. On the other hand, computations in MapReduce often take large amounts of input data, and moving large amounts of data significantly affects computing performance. Data placement should therefore follow the locality principle: data should be close to the compute node, which reduces the performance loss caused by data movement.

The current replica placement policy of HDFS (Hadoop Distributed File System) is as follows: if the writer is on a DataNode, the first replica is placed on the local machine; otherwise a node is selected at random. The second replica is placed on a different rack, and the third replica is placed on the same rack as the second but on a different DataNode. This scheme has the following problems: a node on a randomly selected rack may be too far from the local node, which adds unnecessary data recovery time, and random selection cannot guarantee balanced data storage across nodes. Because node failure is the norm in such a system, unnecessary performance loss during data recovery degrades the performance of the entire storage system.
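The default policy described above can be sketched as follows. This is a minimal, self-contained illustration and not Hadoop's actual implementation; the classes `SketchNode` and `DefaultPlacementSketch` and their methods are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical node: a host name plus the rack it sits in. */
final class SketchNode {
    final String host;
    final String rack;
    SketchNode(String host, String rack) { this.host = host; this.rack = rack; }
}

final class DefaultPlacementSketch {
    /** Picks a random, not-yet-chosen node that is on (or off) the given rack. */
    static SketchNode pick(List<SketchNode> cluster, List<SketchNode> chosen,
                           String rack, boolean sameRack, Random rnd) {
        List<SketchNode> pool = new ArrayList<>();
        for (SketchNode n : cluster) {
            if (!chosen.contains(n) && n.rack.equals(rack) == sameRack) {
                pool.add(n);
            }
        }
        if (pool.isEmpty()) throw new IllegalStateException("no eligible node");
        return pool.get(rnd.nextInt(pool.size()));
    }

    /** The writer may be null when the client does not run on a DataNode. */
    static List<SketchNode> chooseThree(List<SketchNode> cluster, SketchNode writer, Random rnd) {
        List<SketchNode> chosen = new ArrayList<>();
        // Replica 1: the writer's own node if it is a DataNode, else a random node.
        chosen.add(writer != null ? writer : cluster.get(rnd.nextInt(cluster.size())));
        // Replica 2: a random node on a different rack.
        chosen.add(pick(cluster, chosen, chosen.get(0).rack, false, rnd));
        // Replica 3: a different node on the same rack as replica 2.
        chosen.add(pick(cluster, chosen, chosen.get(1).rack, true, rnd));
        return chosen;
    }
}
```

Because replicas 2 and 3 are drawn uniformly at random, nothing in this sketch accounts for how far the chosen rack is or how many blocks a node already holds, which is the gap the background section points out.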

Summary of the invention

To solve the above technical problem, an embodiment of the present invention provides a Hadoop data storage method that balances the data storage load while also achieving good data transmission performance.

To this end, an embodiment of the present invention provides a Hadoop data storage method, comprising:

when a data storage request submitted by a user is received, the NameNode randomly selects a preset number of DataNodes from multiple different racks;

obtaining, for each of the preset number of DataNodes, its distance in the network topology to the current DataNode and the number of data replicas it currently stores;

calculating a scheduling evaluation value for each DataNode according to the distance information and the number of data replicas;

selecting data storage nodes according to the calculated scheduling evaluation values.

Optionally, the NameNode randomly selecting a preset number of DataNodes from multiple different racks when a data storage request submitted by a user is received comprises:

according to the data storage request, the NameNode invokes a preset replica placement policy, BlockPlacementPolicy, wherein a member variable clusterMap of the network topology class NetworkTopology is added to the node selection function chooseTarget() of the BlockPlacementPolicy;

obtaining random DataNodes from multiple different racks through the node selection function Node chooseRandom(String scope) of clusterMap.

Optionally, obtaining the distance in the network topology from each of the preset number of DataNodes to the current DataNode comprises:

obtaining the network distance between each DataNode and the current DataNode through the distance function int getDistance(Node node1, Node node2) of clusterMap.

Optionally, obtaining the number of data replicas currently stored by each DataNode comprises:

invoking DataNodeDescriptor, which describes a DataNode in the Hadoop system;

obtaining the number of data blocks already stored on each DataNode through the function int numBlocks() of DataNodeDescriptor, and using it as the number of data replicas.

Optionally, calculating the scheduling evaluation value of each DataNode according to the distance information and the number of data replicas comprises: substituting the distance information and the number of data replicas obtained for each DataNode into the following preset node evaluation function to calculate the scheduling evaluation value E of that DataNode:

E = f(l, d) = A·l + (1 - A)·d

where f(l, d) is the node evaluation function; l is the load coefficient of the DataNode, inversely proportional to the number of data replicas it currently stores; d is the distance coefficient, inversely proportional to the network distance from the DataNode to the current DataNode; and A ∈ [0, 1] is the balance factor.
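As a purely illustrative calculation of the evaluation function, suppose A = 0.6 and a candidate DataNode has load coefficient l = 0.5 and distance coefficient d = 0.25. These numbers are assumed for the example and do not come from the patent.

```latex
E = A\,l + (1 - A)\,d = 0.6 \times 0.5 + 0.4 \times 0.25 = 0.30 + 0.10 = 0.40
```

Under the same A, a second candidate with l = 0.2 and d = 0.5 would score 0.6 × 0.2 + 0.4 × 0.5 = 0.32, so the first candidate would be ranked higher even though it is farther away.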

Optionally, selecting data storage nodes according to the calculated scheduling evaluation values comprises:

sorting the calculated scheduling evaluation values;

selecting data storage nodes in descending order of scheduling evaluation value.

To this end, an embodiment of the present invention also provides a Hadoop data storage device, comprising a selection module, an acquisition module, a calculation module, and a storage module;

the selection module is configured to randomly select a preset number of DataNodes from multiple different racks when a data storage request submitted by a user is received;

the acquisition module is configured to obtain, for each of the preset number of DataNodes, its distance in the network topology to the current DataNode and the number of data replicas it currently stores;

the calculation module is configured to calculate a scheduling evaluation value for each DataNode according to the distance information and the number of data replicas;

the storage module is configured to select data storage nodes according to the calculated scheduling evaluation values.

Optionally, the selection module randomly selecting a preset number of DataNodes from multiple different racks when a data storage request submitted by a user is received comprises:

according to the data storage request, the selection module invokes a preset replica placement policy, BlockPlacementPolicy, wherein a member variable clusterMap of the network topology class NetworkTopology is added to the node selection function chooseTarget() of the BlockPlacementPolicy;

obtaining random DataNodes from multiple different racks through the node selection function Node chooseRandom(String scope) of clusterMap.

Optionally, the acquisition module obtaining the distance in the network topology from each of the preset number of DataNodes to the current DataNode comprises:

obtaining the network distance between each DataNode and the current DataNode through the distance function int getDistance(Node node1, Node node2) of clusterMap.

Optionally, the acquisition module obtaining the number of data replicas currently stored by each DataNode comprises:

invoking DataNodeDescriptor, which describes a DataNode in the Hadoop system;

obtaining the number of data blocks already stored on each DataNode through the function int numBlocks() of DataNodeDescriptor, and using it as the number of data replicas.

Optionally, the calculation module calculating the scheduling evaluation value of each DataNode according to the distance information and the number of data replicas comprises: substituting the distance information and the number of data replicas obtained for each DataNode into the following preset node evaluation function to calculate the scheduling evaluation value E of that DataNode:

E = f(l, d) = A·l + (1 - A)·d

where f(l, d) is the node evaluation function; l is the load coefficient of the DataNode, inversely proportional to the number of data replicas it currently stores; d is the distance coefficient, inversely proportional to the network distance from the DataNode to the current DataNode; and A ∈ [0, 1] is the balance factor.

Optionally, the storage module selecting data storage nodes according to the calculated scheduling evaluation values comprises:

sorting the calculated scheduling evaluation values;

selecting data storage nodes in descending order of scheduling evaluation value.

The embodiments of the present invention include: when a data storage request submitted by a user is received, the NameNode randomly selects a preset number of DataNodes from multiple different racks; obtaining, for each of the preset number of DataNodes, its distance in the network topology to the current DataNode and the number of data replicas it currently stores; calculating a scheduling evaluation value for each DataNode according to the distance information and the number of data replicas; and selecting data storage nodes according to the calculated scheduling evaluation values. The scheme of the embodiments balances the data storage load and achieves good data transmission performance.

Other features and advantages of the embodiments of the present invention are set forth in the description that follows; in part they will be apparent from the description, or may be learned by practicing the embodiments. The objectives and other advantages of the embodiments can be realized and obtained through the structures particularly pointed out in the description, the claims, and the accompanying drawings.

Brief description of the drawings

The accompanying drawings provide a further understanding of the technical solutions of the embodiments of the present invention and form part of the description. Together with the embodiments of this application, they serve to explain those technical solutions and do not limit them.

Fig. 1 is a flowchart of the Hadoop data storage method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the Hadoop data storage method according to an embodiment of the present invention;

Fig. 3 is a block diagram of the Hadoop data storage device according to an embodiment of the present invention.

Detailed description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another arbitrarily.

The steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

To this end, an embodiment of the present invention provides a Hadoop data storage method which, as shown in Fig. 1, comprises S101 to S104:

S101. When a data storage request submitted by a user is received, the NameNode randomly selects a preset number of DataNodes from multiple different racks.

In the embodiment of the present invention, a selection strategy based on an evaluation value is proposed for the placement of HDFS data block replicas. The strategy combines node distance and data load to improve replica placement, so as to improve the storage performance of the system and balance the utilization of node storage resources.

In the embodiment of the present invention, as shown in Fig. 2, when a user submits a data storage request, the NameNode randomly selects a certain number of nodes from multiple different racks, such as the DataNodes shown in box 1 of Fig. 2. The specific number of DataNodes selected, that is, the preset number mentioned above, can be defined according to the application scenario and is not specifically limited here.

In the embodiment of the present invention, the proposed strategy can be implemented on Hadoop, mainly on Hadoop-2.6, and mainly by implementing the abstract class BlockPlacementPolicy. This class is responsible for selecting the target DataNodes for the replicas of a data block, and it is invoked whenever a data block storage request is submitted. Based on this BlockPlacementPolicy, a preset number of DataNodes can be randomly selected from multiple different racks.

Optionally, the NameNode randomly selecting a preset number of DataNodes from multiple different racks when a data storage request submitted by a user is received comprises S201 and S202:

S201. According to the data storage request, the NameNode invokes a preset replica placement policy, BlockPlacementPolicy, wherein a member variable clusterMap of the network topology class NetworkTopology is added to the node selection function chooseTarget() of the BlockPlacementPolicy;

S202. Random DataNodes are obtained from multiple different racks through the node selection function Node chooseRandom(String scope) of clusterMap.

In the embodiment of the present invention, the most important method of BlockPlacementPolicy is chooseTarget(), which is directly responsible for selecting DataNodes when a data block is stored. To obtain the data load and network distance of each DataNode, chooseTarget() can be overridden to do the following: a member variable clusterMap of type NetworkTopology is added to the class and instantiated in the constructor of the BlockPlacementPolicy class. The member method Node chooseRandom(String scope) of clusterMap can then be used to obtain the preset number of random DataNodes.
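The candidate sampling in S201 and S202 can be sketched without a Hadoop dependency as follows. `TopologySketch` is a hypothetical stand-in for the cluster map, not Hadoop's NetworkTopology class; the path format "/rack/host", the prefix-based scope matching, and the `sampleCandidates` helper are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical stand-in for a cluster map; node paths look like "/rackA/host1". */
final class TopologySketch {
    private final List<String> nodePaths;
    private final Random rnd;

    TopologySketch(List<String> nodePaths, Random rnd) {
        this.nodePaths = nodePaths;
        this.rnd = rnd;
    }

    /** Returns a random node whose path lies under the given scope, or null if none does. */
    String chooseRandom(String scope) {
        String prefix = scope.endsWith("/") ? scope : scope + "/";
        List<String> pool = new ArrayList<>();
        for (String p : nodePaths) {
            if (p.startsWith(prefix)) pool.add(p);
        }
        return pool.isEmpty() ? null : pool.get(rnd.nextInt(pool.size()));
    }

    /** Draws up to count distinct random candidates, cycling over the given racks. */
    Set<String> sampleCandidates(List<String> racks, int count) {
        Set<String> picked = new LinkedHashSet<>();
        if (racks.isEmpty()) return picked;
        int maxAttempts = 100 * count;  // guard against racks with too few nodes
        for (int i = 0; picked.size() < count && i < maxAttempts; i++) {
            String node = chooseRandom(racks.get(i % racks.size()));
            if (node != null) picked.add(node);  // the set silently drops repeats
        }
        return picked;
    }
}
```

Cycling over the racks is one simple way to make sure the candidates come from several racks; the patent only requires that the preset number of DataNodes be drawn at random from multiple different racks.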

S102. Obtain, for each of the preset number of DataNodes, its distance in the network topology to the current DataNode and the number of data replicas it currently stores.

In the embodiment of the present invention, before the selected DataNodes can be evaluated, the distance from each DataNode to the current DataNode in the network topology and the number of data replicas each DataNode currently stores must first be obtained.

Optionally, obtaining the distance in the network topology from each of the preset number of DataNodes to the current DataNode comprises:

obtaining the network distance between each DataNode and the current DataNode through the distance function int getDistance(Node node1, Node node2) of clusterMap.

In the embodiment of the present invention, building on S201 above, the member method int getDistance(Node node1, Node node2) of clusterMap can be used to obtain the network distance between a newly selected DataNode and the current DataNode. In int getDistance(Node node1, Node node2), node1 and node2 are the two DataNodes whose network distance is to be calculated.
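To make the notion of network distance concrete, the sketch below computes a tree distance between two node paths. It assumes a tree-shaped topology in which each step from a node to its parent costs 1 and the distance is the sum of both nodes' steps to their closest common ancestor. `DistanceSketch` and the "/rack/host" path format are hypothetical and stand in for the getDistance call named above.

```java
/** Sketch of tree distance between two node paths such as "/rackA/host1". */
final class DistanceSketch {
    /** Both paths are assumed to start with "/" and to have the same depth. */
    static int getDistance(String path1, String path2) {
        String[] a = path1.substring(1).split("/");
        String[] b = path2.substring(1).split("/");
        int common = 0;
        while (common < a.length && common < b.length && a[common].equals(b[common])) {
            common++;
        }
        // Steps from each node up to the closest common ancestor, added together.
        return (a.length - common) + (b.length - common);
    }
}
```

Under these assumptions a node is at distance 0 from itself, 2 from another node on the same rack, and 4 from a node on a different rack.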

Optionally, obtaining the number of data replicas currently stored by each DataNode may comprise:

invoking DataNodeDescriptor, which describes a DataNode in the Hadoop system;

obtaining the number of data blocks already stored on each DataNode through the function int numBlocks() of DataNodeDescriptor, and using it as the number of data replicas.

In the embodiment of the present invention, the method int numBlocks() provided by the class DataNodeDescriptor, which describes a DataNode in the Hadoop system, returns the number of data blocks already stored on a given DataNode. This count can be used to represent the current load on that DataNode.
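One way to turn the block count into a load coefficient is sketched below. The patent only states that the coefficient is inversely proportional to the number of stored replicas; the specific form 1 / (1 + numBlocks), chosen here so that an empty node does not cause a division by zero, and the name `LoadSketch` are assumptions for illustration.

```java
final class LoadSketch {
    /** Load coefficient: 1.0 for an empty node, falling toward 0 as blocks accumulate. */
    static double loadCoefficient(int numBlocks) {
        if (numBlocks < 0) {
            throw new IllegalArgumentException("numBlocks must be >= 0");
        }
        // The +1 is an assumed smoothing term, not something the patent specifies.
        return 1.0 / (1.0 + numBlocks);
    }
}
```

A distance coefficient could be derived from the network distance in the same way, so that both inputs to the evaluation function fall in the range (0, 1].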

S103. Calculate the scheduling evaluation value of each DataNode according to the distance information and the number of data replicas.

In the embodiment of the present invention, the distance information and the number of data replicas obtained in the preceding steps can be substituted into a preset node evaluation function to calculate the scheduling evaluation value E of each DataNode.

Optionally, calculating the scheduling evaluation value of each DataNode according to the distance information and the number of data replicas comprises: substituting the distance information and the number of data replicas obtained for each DataNode into the following preset node evaluation function to calculate the scheduling evaluation value E of that DataNode:

E = f(l, d) = A·l + (1 - A)·d

where f(l, d) is the node evaluation function; l is the load coefficient of the DataNode, inversely proportional to the number of data replicas it currently stores; d is the distance coefficient, inversely proportional to the network distance from the DataNode to the current DataNode; and A ∈ [0, 1] is the balance factor.

In the embodiment of the present invention, A sets the relative weight of the data load (that is, the number of data replicas the DataNode currently stores) and the network distance in the evaluation. The system administrator can set its value according to the system's requirements for load balancing and data transmission performance.
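The effect of the balance factor can be seen in a small sketch of the evaluation function. `EvaluationSketch` is a hypothetical name, and the coefficient values in the usage note below are assumed for illustration only.

```java
final class EvaluationSketch {
    /** E = A*l + (1 - A)*d, with the balance factor A restricted to [0, 1]. */
    static double evaluate(double a, double l, double d) {
        if (a < 0.0 || a > 1.0) {
            throw new IllegalArgumentException("A must be in [0, 1]");
        }
        return a * l + (1.0 - a) * d;
    }

    /** True when candidate X (lx, dx) scores strictly higher than candidate Y (ly, dy). */
    static boolean prefersFirst(double a, double lx, double dx, double ly, double dy) {
        return evaluate(a, lx, dx) > evaluate(a, ly, dy);
    }
}
```

Take a lightly loaded but distant candidate X (l = 0.5, d = 0.125) and a heavily loaded but nearby candidate Y (l = 0.125, d = 0.5). With A = 0.75 the load term dominates and X scores 0.40625 against 0.21875 for Y; with A = 0.25 the scores swap and Y is preferred. Setting A = 1 ranks purely by load, and A = 0 purely by distance.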

S104. Select data storage nodes according to the calculated scheduling evaluation values.

Optionally, selecting data storage nodes according to the calculated scheduling evaluation values may comprise:

sorting the calculated scheduling evaluation values;

selecting data storage nodes in descending order of scheduling evaluation value.

In the embodiment of the present invention, once the distance and load information has been substituted into the evaluation function to obtain the scheduling evaluation value of each DataNode, for example E1, E2, E3, E4, and E5 in Fig. 2, the calculated values can be sorted and data storage nodes selected in descending order of evaluation value. For example, choosing the DataNode with the largest evaluation value as the placement node selects the node with the best overall trade-off between data load and network distance, which optimizes data block placement.
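The sort-and-select step can be sketched as follows. `ScoredNode` and `SelectionSketch` are hypothetical names, and any scores passed in are assumed values rather than figures from the patent.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical pairing of a DataNode name with its scheduling evaluation value. */
final class ScoredNode {
    final String name;
    final double score;
    ScoredNode(String name, double score) { this.name = name; this.score = score; }
}

final class SelectionSketch {
    /** Returns the names of the highest-scoring candidates, best first. */
    static List<String> selectTargets(List<ScoredNode> candidates, int replicas) {
        List<ScoredNode> sorted = new ArrayList<>(candidates);  // leave the input untouched
        sorted.sort(Comparator.comparingDouble((ScoredNode n) -> n.score).reversed());
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < Math.min(replicas, sorted.size()); i++) {
            targets.add(sorted.get(i).name);
        }
        return targets;
    }
}
```

For five candidates named after the values in Fig. 2, with assumed scores E1 = 0.40, E2 = 0.32, E3 = 0.55, E4 = 0.10, and E5 = 0.47, requesting three targets yields E3, E5, and E1 in that order.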

To this end, an embodiment of the present invention also provides a Hadoop data storage device 1. It should be noted that any of the method embodiments above also applies to this device embodiment, so they are not repeated here. As shown in Fig. 3, the device comprises a selection module 11, an acquisition module 12, a calculation module 13, and a storage module 14;

the selection module 11 is configured to randomly select a preset number of DataNodes from multiple different racks when a data storage request submitted by a user is received;

the acquisition module 12 is configured to obtain, for each of the preset number of DataNodes, its distance in the network topology to the current DataNode and the number of data replicas it currently stores;

the calculation module 13 is configured to calculate a scheduling evaluation value for each DataNode according to the distance information and the number of data replicas;

the storage module 14 is configured to select data storage nodes according to the calculated scheduling evaluation values.
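How the four modules could cooperate is sketched below as plain Java interfaces. The interface names, method names, and the `StorageDeviceSketch` class are hypothetical; the patent describes the modules functionally and does not prescribe these signatures.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical interfaces mirroring modules 11 to 14. */
interface SelectionModule { List<String> selectCandidates(int presetNumber); }
interface AcquisitionModule { double[] distanceAndBlocks(String node); }  // {distance, blockCount}
interface CalculationModule { double evaluate(double distance, double blockCount); }
interface StorageModule { List<String> chooseTargets(Map<String, Double> scores, int replicas); }

final class StorageDeviceSketch {
    private final SelectionModule selection;
    private final AcquisitionModule acquisition;
    private final CalculationModule calculation;
    private final StorageModule storage;

    StorageDeviceSketch(SelectionModule selection, AcquisitionModule acquisition,
                        CalculationModule calculation, StorageModule storage) {
        this.selection = selection;
        this.acquisition = acquisition;
        this.calculation = calculation;
        this.storage = storage;
    }

    /** Runs the four steps in order for one storage request. */
    List<String> handleRequest(int presetNumber, int replicas) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String node : selection.selectCandidates(presetNumber)) {
            double[] info = acquisition.distanceAndBlocks(node);
            scores.put(node, calculation.evaluate(info[0], info[1]));
        }
        return storage.chooseTargets(scores, replicas);
    }
}
```

Keeping each module behind its own interface mirrors the block diagram in Fig. 3 and lets the evaluation function or the selection rule be swapped without touching the other steps.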

Optionally, the selection module 11 randomly selecting a preset number of DataNodes from multiple different racks when a data storage request submitted by a user is received comprises:

according to the data storage request, the selection module invokes a preset replica placement policy, BlockPlacementPolicy, wherein a member variable clusterMap of the network topology class NetworkTopology is added to the node selection function chooseTarget() of the BlockPlacementPolicy;

obtaining random DataNodes from multiple different racks through the node selection function Node chooseRandom(String scope) of clusterMap.

Optionally, the acquisition module obtaining the distance in the network topology from each of the preset number of DataNodes to the current DataNode comprises:

obtaining the network distance between each DataNode and the current DataNode through the distance function int getDistance(Node node1, Node node2) of clusterMap.

Optionally, the acquisition module 12 obtaining the number of data replicas currently stored by each DataNode comprises:

invoking DataNodeDescriptor, which describes a DataNode in the Hadoop system;

obtaining the number of data blocks already stored on each DataNode through the function int numBlocks() of DataNodeDescriptor, and using it as the number of data replicas.

Optionally, the calculation module 13 calculating the scheduling evaluation value of each DataNode according to the distance information and the number of data replicas comprises: substituting the distance information and the number of data replicas obtained for each DataNode into the following preset node evaluation function to calculate the scheduling evaluation value E of that DataNode:

E = f(l, d) = A·l + (1 - A)·d

where f(l, d) is the node evaluation function; l is the load coefficient of the DataNode, inversely proportional to the number of data replicas it currently stores; d is the distance coefficient, inversely proportional to the network distance from the DataNode to the current DataNode; and A ∈ [0, 1] is the balance factor.

Optionally, the storage module 14 selecting data storage nodes according to the calculated scheduling evaluation values includes:

Sorting the calculated scheduling evaluation values;

Selecting data storage nodes in descending order of scheduling evaluation value.
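
The sort-and-select step can be sketched in a few lines of Python. The node names, the scores, and the helper `select_storage_nodes` are invented for illustration. The source does not say how many nodes are taken, so the `replicas` parameter is an assumption.

```python
def select_storage_nodes(scores: dict, replicas: int) -> list:
    """Return the `replicas` node names with the highest evaluation values."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:replicas]]


# Hypothetical evaluation values for four candidate nodes.
scores = {"dn1": 0.42, "dn2": 0.77, "dn3": 0.58, "dn4": 0.31}
chosen = select_storage_nodes(scores, replicas=3)
```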

The embodiment of the present invention includes: when a data storage request submitted by a user is received, the name node NameNode randomly selects a preset number of data nodes DataNodes from a plurality of different racks; the distance information, in the network topology, from each of the preset number of DataNodes to the current DataNode is obtained, together with the number of data replicas currently stored on each DataNode; a scheduling evaluation value is calculated for each DataNode from the distance information and the number of data replicas; and data storage nodes are selected according to the calculated scheduling evaluation values. The solution of the embodiment achieves load balancing of data storage together with good data transmission performance.
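
The four steps summarised above can be strung together in one hedged Python sketch. Nothing here is Hadoop code. The rack layout, block counts, the fixed hop values 0/2/4, the coefficient forms 1/(1 + x), and the reading of "from a plurality of different racks" as one candidate per rack are all assumptions made for illustration.

```python
import random


def choose_candidates(racks, count, rng):
    """Pick `count` candidates at random, one from each of `count` racks."""
    picked_racks = rng.sample(sorted(racks), count)
    return [(rack, rng.choice(racks[rack])) for rack in picked_racks]


def hops(rack_a, node_a, rack_b, node_b):
    # Assumed distances: 0 for the same node, 2 within a rack, 4 across racks.
    if rack_a != rack_b:
        return 4
    return 0 if node_a == node_b else 2


def place(racks, block_counts, current, count, replicas, balance=0.5, rng=None):
    """Score randomly chosen candidates and return the best `replicas` nodes."""
    rng = rng or random.Random()
    scored = []
    for rack, node in choose_candidates(racks, count, rng):
        load = 1.0 / (1.0 + block_counts[node])
        dist = 1.0 / (1.0 + hops(current[0], current[1], rack, node))
        scored.append((balance * load + (1.0 - balance) * dist, node))
    scored.sort(reverse=True)  # highest evaluation value first
    return [node for _, node in scored[:replicas]]


# Hypothetical cluster: three racks with two nodes each.
racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"], "rack3": ["dn5", "dn6"]}
block_counts = {"dn1": 120, "dn2": 40, "dn3": 10, "dn4": 300, "dn5": 75, "dn6": 5}
chosen = place(racks, block_counts, current=("rack1", "dn1"),
               count=3, replicas=2, rng=random.Random(7))
```

Passing a seeded `random.Random` makes the random candidate selection repeatable, which is convenient when comparing different values of the balance factor.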

In the embodiment of the present invention, the solution can, on the one hand, shorten the average distance over which data blocks are stored and thereby reduce storage time; on the other hand, it can reduce the imbalance of storage load among data nodes, and in particular can distribute data blocks fairly evenly across the data nodes within a single rack. The method combines two factors, the network distance of a candidate storage node and its data load, so that it achieves both load balancing of data storage and good data transmission performance.

Although the implementations disclosed in the embodiments of the present invention are as described above, they are adopted only to facilitate understanding of the invention and are not intended to limit the embodiments. Any person skilled in the art to which the embodiments belong may make modifications and changes to the form and details of implementation without departing from the spirit and scope disclosed by the embodiments, but the scope of patent protection of the embodiments remains subject to the scope defined by the appended claims.

Claims (10)

1. A hadoop data storage method, characterized by comprising:
when a data storage request submitted by a user is received, a name node NameNode randomly selecting a preset number of data nodes DataNodes from a plurality of different racks;
obtaining the distance information, in the network topology, from each of the preset number of DataNodes to the current DataNode, and the number of data replicas currently stored on each DataNode;
calculating the scheduling evaluation value of each DataNode according to the distance information and the number of data replicas;
selecting data storage nodes according to the calculated scheduling evaluation values.

2. The hadoop data storage method according to claim 1, characterized in that the name node NameNode randomly selecting a preset number of data nodes DataNodes from a plurality of different racks when a data storage request submitted by a user is received comprises:
according to the data storage request, the NameNode calling a preset replica placement policy BlockPlacementPolicy, wherein a member variable clusterMap of the network topology class NetworkTopology is added to the node selection function chooseTarget() of the BlockPlacementPolicy;
obtaining random DataNodes from the plurality of different racks according to the node selection function Node chooseRandom(String scope) of the clusterMap.

3. The hadoop data storage method according to claim 2, characterized in that obtaining the distance information, in the network topology, from each of the preset number of DataNodes to the current DataNode comprises:
obtaining the network distance between each DataNode and the current DataNode according to the target distance function int getDistance(Node node1, Node node2) of the clusterMap.

4. The hadoop data storage method according to claim 1, characterized in that obtaining the number of data replicas currently stored on each DataNode comprises:
calling the data node description policy DataNodeDescriptor, which describes a DataNode in the Hadoop system;
obtaining, via the block count function int numBlocks() of the DataNodeDescriptor, the number of data blocks already stored on each DataNode, and using it as the number of data replicas.

5. The hadoop data storage method according to claim 1, characterized in that calculating the scheduling evaluation value of each DataNode according to the distance information and the number of data replicas comprises: substituting the distance information and the number of data replicas obtained for each DataNode into the following preset node evaluation function to calculate the scheduling evaluation value E of each DataNode:
E = f(l, d) = A·l + (1 − A)·d;
where f(l, d) is the node evaluation function; l is the load coefficient of the DataNode, inversely proportional to the number of data replicas it currently stores; d is the distance coefficient, inversely proportional to the network distance from the DataNode to the current DataNode; and A ∈ [0, 1] is the balance factor.

6. The hadoop data storage method according to claim 1, characterized in that selecting data storage nodes according to the calculated scheduling evaluation values comprises:
sorting the calculated scheduling evaluation values;
selecting data storage nodes in descending order of scheduling evaluation value.

7. A hadoop data storage device, characterized by comprising: a selection module, an acquisition module, a calculation module and a storage module;
the selection module is configured to randomly select a preset number of data nodes DataNodes from a plurality of different racks when a data storage request submitted by a user is received;
the acquisition module is configured to obtain the distance information, in the network topology, from each of the preset number of DataNodes to the current DataNode, and the number of data replicas currently stored on each DataNode;
the calculation module is configured to calculate the scheduling evaluation value of each DataNode according to the distance information and the number of data replicas;
the storage module is configured to select data storage nodes according to the calculated scheduling evaluation values.

8. The hadoop data storage method according to claim 7, characterized in that the selection module randomly selecting a preset number of data nodes DataNodes from a plurality of different racks when a data storage request submitted by a user is received comprises:
according to the data storage request, the selection module calling a preset replica placement policy BlockPlacementPolicy, wherein a member variable clusterMap of the network topology class NetworkTopology is added to the node selection function chooseTarget() of the BlockPlacementPolicy;
obtaining random DataNodes from the plurality of different racks according to the node selection function Node chooseRandom(String scope) of the clusterMap.

9. The hadoop data storage method according to claim 8, characterized in that the acquisition module obtaining the distance information, in the network topology, from each of the preset number of DataNodes to the current DataNode comprises:
obtaining the network distance between each DataNode and the current DataNode according to the target distance function int getDistance(Node node1, Node node2) of the clusterMap.

10. The hadoop data storage method according to claim 7, characterized in that the acquisition module obtaining the number of data replicas currently stored on each DataNode comprises:
calling the data node description policy DataNodeDescriptor, which describes a DataNode in the Hadoop system;
obtaining, via the block count function int numBlocks() of the DataNodeDescriptor, the number of data blocks already stored on each DataNode, and using it as the number of data replicas.
CN201710799237.4A 2017-09-07 2017-09-07 A kind of hadoop date storage methods and device Pending CN107566496A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710799237.4A CN107566496A (en) 2017-09-07 2017-09-07 A kind of hadoop date storage methods and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710799237.4A CN107566496A (en) 2017-09-07 2017-09-07 A kind of hadoop date storage methods and device

Publications (1)

Publication Number Publication Date
CN107566496A true CN107566496A (en) 2018-01-09

Family

ID=60979530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710799237.4A Pending CN107566496A (en) 2017-09-07 2017-09-07 A kind of hadoop date storage methods and device

Country Status (1)

Country Link
CN (1) CN107566496A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632621A (en) * 2020-12-30 2021-04-09 中国移动通信集团江苏有限公司 Data access method, device, equipment and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009029792A2 (en) * 2007-08-29 2009-03-05 Nirvanix, Inc. Method and system for global usage based file location manipulation
CN102984280A (en) * 2012-12-18 2013-03-20 北京工业大学 Data backup system and method for social cloud storage network application
CN103595805A (en) * 2013-11-22 2014-02-19 浪潮电子信息产业股份有限公司 Data placement method based on distributed cluster
CN106936905A (en) * 2017-03-07 2017-07-07 中国联合网络通信集团有限公司 The dispatching method and its scheduling system of the Nova component virtual machines based on openstack

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘言青: "面向超级计算机的海量近线存储系统关键技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
林伟伟: "一种改进的 Hadoop 数据放置策略", 《华南理工大学学报(自然科学版)》 *

Similar Documents

Publication Publication Date Title
US12393607B2 (en) System and method for implementing a scalable data storage service
US20230400990A1 (en) System and method for performing live partitioning in a data store
US11609697B2 (en) System and method for providing a committed throughput level in a data store
US11789925B2 (en) System and method for conditionally updating an item with attribute granularity
US9053167B1 (en) Storage device selection for database partition replicas
US9489443B1 (en) Scheduling of splits and moves of database partitions
US8572091B1 (en) System and method for partitioning and indexing table data using a composite primary key
US8732517B1 (en) System and method for performing replica copying using a physical copy mechanism
CN109117275B (en) Account reconciliation method, device, computer equipment and storage medium based on data sharding
CN110347651A (en) Method of data synchronization, device, equipment and storage medium based on cloud storage
CN110321225B (en) Load balancing method, metadata server and computer readable storage medium
CN109783564A (en) Support the distributed caching method and equipment of multinode
US10614055B2 (en) Method and system for tree management of trees under multi-version concurrency control
CN107566496A (en) A kind of hadoop date storage methods and device
Zhang et al. Speeding up VM startup by cooperative VM image caching
CN108270851A (en) A kind of date storage method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180109)