
CN103971066A - Verification method for integrity of big data migration in HDFS - Google Patents


Info

Publication number
CN103971066A
CN103971066A (application CN201410212726.1A)
Authority
CN
China
Prior art keywords
file
hdfs
new
fileinfo
information
Prior art date
Legal status: Pending (assumption, not a legal conclusion)
Application number
CN201410212726.1A
Other languages
Chinese (zh)
Inventor
赵仁明
辛国茂
亓开元
房体盈
Current Assignee
IEIT Systems Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN201410212726.1A
Publication of CN103971066A
Legal status: Pending

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/64: Protecting data integrity, e.g. using checksums, certificates or signatures
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for verifying the integrity of big data migration in HDFS. The specific implementation process is as follows: obtain detailed information on the original HDFS files and directory structure and on the new HDFS files after migration; shard the original file information and the new file information; compare the old and new file information and output the verification results. Compared with the prior art, the method requires no compiling or packaging of programs; simple scripts suffice to complete the verification. It plays to the flexibility and convenience of big data, letting users locate potentially incomplete data very quickly and easily, and it applies to a wide range of HDFS environments, making it highly practical.

Description

A Method for Verifying the Integrity of Big Data Migration in HDFS

Technical Field

The invention relates to the field of computer technology, and in particular to a method for verifying the integrity of big data migration in HDFS.

Background

Big data refers to data sets so large that current mainstream software tools cannot capture, manage, process, and organize them within a reasonable time into information that supports more proactive business decision-making.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It is highly fault-tolerant and suited to deployment on inexpensive machines. HDFS provides high-throughput data access and is well suited to applications with large-scale data sets; programs running on HDFS typically work with files on the order of terabytes, so HDFS is tuned for large files. It is intended to provide high aggregate data bandwidth, scale to hundreds of nodes per cluster, and support tens of millions of files per cluster. HDFS is also designed to be easily portable across platforms, which encourages its wider adoption as a platform for applications that require large data sets.

This invention provides a simple, inexpensive way to verify data integrity after an HDFS data migration, enabling administrators to confirm quickly and conveniently whether the migrated data is complete and valid, and recording the verification results in a log file.

Summary of the Invention

The technical task of the invention is to remedy the deficiencies of the prior art by providing a method for verifying the integrity of big data migration in HDFS.

The technical solution of the invention is realized in the following manner; the specific implementation process of this method for verifying the integrity of big data migration in HDFS is as follows:

1) Obtain detailed information on the original HDFS files and directory structure, and the new HDFS file information after migration;

2) Shard the original file information and the new file information;

3) Compare the old and new file information and output the verification results.

The detailed process of step 1) is:

In the original HDFS file system, run hadoop fs -lsr / > oldInfo to obtain detailed information on the original HDFS files, redirecting the output to the oldInfo file;

In the new HDFS file system after migration, run the same command, hadoop fs -lsr / > newInfo, to obtain the new HDFS file information, redirecting the output to the newInfo file.
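The scripts given later assume that each line of this recursive listing carries eight whitespace-separated fields (permissions, replication, owner, group, size, date, time, path). A minimal parsing sketch follows; the sample line is hypothetical. Note that on Hadoop 2 and later, hadoop fs -ls -R / replaces the deprecated -lsr option and produces the same recursive listing.

```python
# Parse one (hypothetical) line of `hadoop fs -lsr` output into (size, path).
# Real listings have 8 whitespace-separated fields:
# permissions, replication, owner, group, size, date, time, path.
sample = "-rw-r--r--   3 hdfs supergroup   1048576 2014-05-20 10:00 /data/part-00000"

items = sample.split()
assert len(items) == 8
size, path = items[4], items[7]   # the same indices the comparison script uses
print(size, path)  # -> 1048576 /data/part-00000
```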

The detailed process of step 2) is: shard the original file information oldInfo and the new file information newInfo by the same rule, the rule being division by line count into the same number of files.

The detailed process of step 3) is: compare the sharded old and new HDFS file information entry by entry and save the results in a log file; the comparison checks whether the file or folder names and the file sizes match.

The matching process is:

1. Using the old file information as the baseline, match each entry against the new file information;

2. If an entry matches completely, take the next old entry and repeat this step;

3. If the file sizes do not match completely, the migration of that file is incomplete; record the file information in the log file and continue with step 2;

4. If the file information is not found, the file was not migrated to the new file system; record the file information in the log file and continue with step 2;

5. Once all old file entries have been processed, this integrity verification is complete.

Compared with the prior art, the invention has the following beneficial effects:

The method of the invention is an efficient, fast, and easily implemented way to verify the integrity of data migrated out of HDFS. Using it, the integrity of the newly migrated data can be verified efficiently and simply, which further reduces the workload of verifying data manually entry by entry and greatly reduces the programming workload. No compiling or packaging of programs is required; simple scripts complete the verification. The method plays to the flexibility and convenience of big data, letting users find potentially incomplete data very quickly and easily. It applies to a wide range of HDFS environments, making it practical and easy to adopt.

Brief Description of the Drawings

Figure 1 is a schematic flowchart of the implementation of the invention.

Detailed Description

A method for verifying the integrity of big data migration in HDFS according to the invention is described in detail below with reference to the accompanying drawing.

As shown in Figure 1, a method for verifying the integrity of big data migration in HDFS is provided. The idea of the method is to take each pre-migration HDFS file entry in turn and search for it in the new HDFS file information. If the entry exists and its size and other attributes match, the next pre-migration entry is taken and the comparison continues. If the entry is not found in the new HDFS file information, or is found but the file sizes differ, that piece of data was not migrated successfully.

The specific implementation process is as follows:

1. Obtain detailed information on the original HDFS files and directory structure, and the new HDFS file information after migration.

In the original HDFS file system, run hadoop fs -lsr / > oldInfo to obtain detailed information on the original HDFS files, redirecting the output to the oldInfo file.

In the new HDFS file system after migration, run the same command, hadoop fs -lsr / > newInfo, to obtain the new HDFS file information, redirecting the output to the newInfo file.

2. Shard the original file information and the new file information.

Because the number of files in a Hadoop file system is huge and the directory structure is extremely complex, the original file information oldInfo and the new file information newInfo can be sharded by the same rule to make the old/new comparison manageable. The sharding method is shown in the following script:

```python
#!/usr/bin/env python
# Shard the oldInfo listing by line count; newInfo is sharded the same way.
import sys

if __name__ == "__main__":
    subdir_file = open("oldInfo")
    line_count = 0
    subdir_list = []
    for line in subdir_file:
        line_count += 1
        if line_count == 1:      # skip the listing's header line
            continue
        items = line.strip().split()
        subdir_list.append(items[-1])   # last field is the file path

    # Open FILE_COUNT + 1 output files: with integer division below,
    # the trailing entries can land in shard index FILE_COUNT.
    output_file_list = []
    FILE_COUNT = 8
    for i in range(0, FILE_COUNT + 1):
        output_file_name = "split_subdir_" + str(i)
        output_file_list.append(open(output_file_name, "w"))

    count_per_file = len(subdir_list) // FILE_COUNT
    total_count = 0
    for subdir in subdir_list:
        file_index = total_count // count_per_file
        output_file_list[file_index].write("%s\n" % subdir)
        total_count += 1

    for i in range(0, FILE_COUNT + 1):
        output_file_list[i].close()

    print("total count", total_count)
```

Using this method, the information in oldInfo is sharded by line count into FILE_COUNT + 1 = 9 split_subdir files (split_subdir_0 through split_subdir_8).
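The shard count deserves a note: the script opens FILE_COUNT + 1 output files because, under integer division, the final entries spill into shard index FILE_COUNT. A small sketch illustrates this (the line count of 100 is hypothetical):

```python
# Why the splitting script needs FILE_COUNT + 1 output files: with integer
# division, the last entries land in shard index FILE_COUNT itself.
FILE_COUNT = 8
n_lines = 100                            # hypothetical number of file entries
count_per_file = n_lines // FILE_COUNT   # 12 entries per shard
indices = {i // count_per_file for i in range(n_lines)}
print(sorted(indices))  # -> [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Entries 96 through 99 fall into shard 8, so opening only FILE_COUNT files would raise an IndexError.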

3. Compare the old and new file information and output the verification results.

The sharded old and new HDFS file information is compared entry by entry, and the results are saved in a log file. The comparison mainly checks whether the file or folder names and the file sizes match.

The comparison method is shown in the following script:

```python
#!/usr/bin/env python
# Compare the old and new HDFS listings and log missing or mismatched files.
import sys

def load_info(info_file_name):
    # Build a {file_path: size} map from a `hadoop fs -lsr` listing file.
    info = {}
    for line in open(info_file_name):
        items = line.strip().split()
        if len(items) != 8:
            print("wrong info, %s" % line)
            sys.exit(1)
        size = items[4]       # 5th field of the listing is the file size
        file_name = items[7]  # 8th field is the full path
        info[file_name] = size
    return info

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("%s <info.old> <info.new> <log_file> [dirlist_file_name]" % sys.argv[0])
        sys.exit(1)

    # Optional: restrict the check to the paths listed in dirlist_file_name.
    dir_list = []
    if len(sys.argv) >= 5:
        for line in open(sys.argv[4]):
            dir_list.append(line.strip())

    old_info = load_info(sys.argv[1])
    new_info = load_info(sys.argv[2])
    log_file = open(sys.argv[3], "w")

    total_file_count = 0
    missed_file_count = 0
    mismatch_file_count = 0

    for file_name in old_info:
        if len(dir_list) > 0:
            match = False
            for target_dir in dir_list:
                if file_name.startswith(target_dir):
                    match = True
                    break
            if not match:
                continue
        total_file_count += 1
        if file_name not in new_info:
            log_file.write("[MISSING] %s\n" % file_name)
            missed_file_count += 1
        elif new_info[file_name] != old_info[file_name]:
            log_file.write("[MISMATCH] %s [%s != %s]\n"
                           % (file_name, old_info[file_name], new_info[file_name]))
            mismatch_file_count += 1

    log_file.close()
```

The old and new HDFS file information is compared by invoking the script in the form command <info.old> <info.new> <log_file> [dirlist_file_name], passing in the corresponding arguments.
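As a sanity check, the core of the comparison can be exercised on toy data; the paths and sizes below are hypothetical, and the dictionaries mirror what load_info() builds (path mapped to size, both as strings):

```python
# End-to-end sketch of the comparison logic on toy data (hypothetical paths/sizes).
old_info = {"/data/a.log": "100", "/data/b.log": "200", "/data/c.log": "300"}
new_info = {"/data/a.log": "100", "/data/b.log": "999"}  # b mismatched, c missing

log_lines = []
for name in old_info:
    if name not in new_info:
        log_lines.append("[MISSING] %s" % name)
    elif new_info[name] != old_info[name]:
        log_lines.append("[MISMATCH] %s [%s != %s]"
                         % (name, old_info[name], new_info[name]))

print(sorted(log_lines))
# -> ['[MISMATCH] /data/b.log [200 != 999]', '[MISSING] /data/c.log']
```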

Embodiment:

The implementation steps in this embodiment of the invention are as follows:

1. Obtain the file information from before and after the migration.

2. Split the file information into shards.

3. Using the old file information as the baseline, match each entry against the new file information.

4. On a complete match, take the next old entry and repeat the matching of step 3.

5. If the file sizes do not match completely, the migration of that file is incomplete. Record the file information in the log file, then continue with step 3.

6. If the file information is not found, the file was not migrated to the new file system. Record the file information in the log file, then continue with step 3.

7. Once all old file entries have been processed, this integrity verification is complete.

The above is merely an embodiment of the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (5)

1. A method for verifying the integrity of big data migration in HDFS, characterized in that its specific implementation process is as follows:
1) obtain detailed information on the original HDFS files and directory structure, and the new HDFS file information after migration;
2) shard the original file information and the new file information;
3) compare the old and new file information and output the verification results.
2. The method for verifying the integrity of big data migration in HDFS according to claim 1, characterized in that the detailed process of step 1) is:
in the original HDFS file system, run hadoop fs -lsr / > oldInfo to obtain the details of the original HDFS files and redirect the result to the oldInfo file;
in the new HDFS file system after migration, run the same command, hadoop fs -lsr / > newInfo, to obtain the new HDFS file information and redirect the result to the newInfo file.
3. The method for verifying the integrity of big data migration in HDFS according to claim 1 or 2, characterized in that the detailed process of step 2) is: shard the original file information oldInfo and the new file information newInfo by the same rule, the rule being division by line count into the same number of files.
4. The method for verifying the integrity of big data migration in HDFS according to claim 3, characterized in that the detailed process of step 3) is: compare the sharded old and new HDFS file information entry by entry and save the results in a log file, the comparison checking whether the file or folder names and the file sizes match.
5. The method for verifying the integrity of big data migration in HDFS according to claim 4, characterized in that the matching process is:
1) using the old file information as the baseline, match each entry against the new file information;
2) if an entry matches completely, take the next old entry and repeat this step;
3) if the file sizes do not match completely, the migration of that file is incomplete; record the file information in the log file and continue with step 2;
4) if the file information is not found, the file was not migrated to the new file system; record the file information in the log file and continue with step 2;
5) once all old file entries have been processed, this integrity verification is complete.
CN201410212726.1A 2014-05-20 2014-05-20 Verification method for integrity of big data migration in HDFS Pending CN103971066A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410212726.1A CN103971066A (en) 2014-05-20 2014-05-20 Verification method for integrity of big data migration in HDFS


Publications (1)

Publication Number Publication Date
CN103971066A 2014-08-06

Family

ID=51240546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410212726.1A Pending CN103971066A (en) 2014-05-20 2014-05-20 Verification method for integrity of big data migration in HDFS

Country Status (1)

Country Link
CN (1) CN103971066A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005174192A (en) * 2003-12-15 2005-06-30 Hitachi Ltd Electronic application data management method and management system, and electronic pen and server constituting the management system
CN1776670A (en) * 2004-11-19 2006-05-24 国际商业机器公司 Method and system of verifying metadata of a migrated file
CN102724306A (en) * 2012-06-13 2012-10-10 中山大学 Cloud computing based method and system for data migration
CN103793424A (en) * 2012-10-31 2014-05-14 阿里巴巴集团控股有限公司 Database data migration method and database data migration system


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389312A (en) * 2014-09-04 2016-03-09 上海福网信息科技有限公司 Big data migration method and tool
CN104408047A (en) * 2014-10-28 2015-03-11 浪潮电子信息产业股份有限公司 Method for uploading text file to HDFS (hadoop distributed file system) in multi-machine parallel mode based on NFS (network file system) file server
CN105808612A (en) * 2014-12-31 2016-07-27 北京嘀嘀无限科技发展有限公司 Method and equipment used for migrating data of database
CN105808612B (en) * 2014-12-31 2019-08-27 北京嘀嘀无限科技发展有限公司 The method and apparatus of data for migrating data library
CN108415853A (en) * 2018-03-15 2018-08-17 深圳市江波龙电子有限公司 A kind of method, apparatus and storage device of garbage reclamation
CN113448613A (en) * 2021-08-30 2021-09-28 湖南省佳策测评信息技术服务有限公司 Software delivery data checking method and device
CN115061979A (en) * 2022-07-08 2022-09-16 建信金融科技有限责任公司 P-level data migration method and system
CN115061979B (en) * 2022-07-08 2024-12-20 建信金融科技有限责任公司 P-level data migration method and system


Legal Events

C06 / PB01: Publication
C10 / SE01: Entry into force of request for substantive examination
WD01: Invention patent application deemed withdrawn after publication (application publication date: 2014-08-06)