CN110990358B - Decompression method, electronic equipment and computer readable storage medium

Info

Publication number: CN110990358B
Application number: CN201910944737.1A
Authority: CN
Inventors: 闫威
Original assignee: MIGU Culture Technology Co Ltd
Current assignee: MIGU Culture Technology Co Ltd
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2023-06-30
Anticipated expiration: 2039-09-30
Also published as: CN110990358A

Abstract

Embodiments of the invention relate to the field of data processing and disclose a decompression method, an electronic device, and a computer-readable storage medium. In some embodiments of the present application, the decompression method includes: pre-decoding the data to be decompressed to obtain block information of the data to be decompressed, wherein the block information indicates the positions of the compressed blocks in the data to be decompressed; dividing the data to be decompressed into N data blocks according to the block information, wherein each data block comprises at least one compressed block and N is a positive integer; and decompressing the data blocks concurrently. These embodiments increase the decompression speed.

Description

A decompression method, electronic device and computer-readable storage medium

Technical field

Embodiments of the present invention relate to the field of data processing, and in particular to a decompression method, an electronic device, and a computer-readable storage medium.

Background

GZIP-based data compression formats are widely used in the Internet and big data fields because of their high compression ratio and industry maturity. For example, the massive access logs of a content delivery network (CDN) are usually compressed and stored as GZIP packages, with a compression ratio of 5 to 6 times. Transmitting data streams in this format over the network to a big data platform for real-time or offline analysis greatly improves network transmission efficiency and reduces network congestion.

However, the inventors found at least the following problem in the prior art: decompression methods for GZIP-based data compression formats are too slow.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a decompression method, an electronic device, and a computer-readable storage medium that increase the decompression speed.

To solve the above technical problem, embodiments of the present invention provide a decompression method comprising the following steps: pre-decoding the data to be decompressed to obtain block information of the data to be decompressed, the block information indicating the positions of the compressed blocks in the data to be decompressed; dividing the data to be decompressed into N data blocks according to the block information, each data block including at least one compressed block and N being a positive integer; and decompressing the data blocks concurrently.

Embodiments of the present invention also provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the decompression method described in the above embodiments.

Embodiments of the present invention also provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the decompression method described in the above embodiments.

Compared with the prior art, the embodiments of the present invention pre-decode the data to be decompressed to obtain its block information, so that the data can be divided into blocks and decompressed in parallel. Because the divided data blocks are decompressed in parallel rather than the file being decompressed as a whole, the decompression speed is increased and the total decompression latency is reduced.

In addition, pre-decoding the data to be decompressed to obtain the block information specifically includes: pre-decoding the data to be decompressed to determine the position information of the block tail of each compressed block; and determining the block information according to the position information of the block tail of each compressed block.

In addition, pre-decoding the data to be decompressed to determine the position information of the block tail of each compressed block specifically includes: matching the characters in the data to be decompressed against the code table of the data to be decompressed; and, if a character matches the code value 256, taking the position information of that character as the position information of the block tail of the current compressed block. In this implementation, only the end character of each compressed block is located and no distance-based replacement is performed; the data is then divided into blocks and decompressed in parallel, which increases the decompression speed.

In addition, dividing the data to be decompressed into N data blocks according to the block information specifically includes: merging the compressed blocks in the data to be decompressed into N data blocks according to the block information and a preset merging rule; wherein, among the merged data blocks, the first compressed block of the (i+1)-th data block is the same as the last compressed block of the i-th data block, 1 ≤ i < N. In this implementation, each data block redundantly carries the last compressed block of the previous data block, which ensures that the first compressed block in a data block can find referenced characters at a sufficient distance during decompression and complete normal character replacement.

In addition, the block information also indicates the order of the compressed blocks, and the merging rule is: according to the order of the compressed blocks, merge the 1st to the M-th compressed blocks into one data block, where M is a positive integer; determine whether 2M is less than N; if so, merge the M-th to the 2M-th compressed blocks into one data block, set M = 2M, and return to the step of determining whether 2M is less than N; if not, merge the M-th to the N-th compressed blocks into one data block. In this implementation, the compressed blocks are merged in order, which guarantees the continuity of content between compressed blocks during subsequent parallel decompression.

In addition, decompressing the k-th data block includes: if k = 1, decompressing from the first compressed block up to the last predetermined symbol of the last compressed block; if 1 < k < N, decompressing from the last predetermined symbol of the first compressed block up to the last predetermined symbol of the last compressed block; and if k = N, decompressing from the last predetermined symbol of the first compressed block until the last compressed block is completely decompressed. This implementation guarantees the integrity of the content of each data block after decompression and provides a basis for the seamless integration of the Hadoop platform with a streaming computing cluster (the Spark computing platform).

In addition, a distributed computing platform divides the data to be decompressed into N data blocks according to the block information and decompresses the data blocks concurrently.

In addition, the distributed computing platform is communicatively connected to the Spark computing platform and transmits the decompressed data of each data block to the Spark computing platform.

Description of the drawings

One or more embodiments are illustrated by the figures in the corresponding drawings. These illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures are not drawn to scale.

Fig. 1 is a flowchart of the decompression method according to the first embodiment of the present invention;

Fig. 2 is a flowchart of the decompression method according to the second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a decompression device according to the third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic device according to the fourth embodiment of the present invention.

Detailed description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will understand that many technical details are given in the embodiments to help the reader understand the present application. However, the technical solutions claimed in this application can be realized even without these technical details and the various changes and modifications based on the following embodiments.

The first embodiment of the present invention relates to a decompression method applied to an electronic device, such as a server or a terminal. As shown in Fig. 1, the decompression method includes the following steps:

Step 101: Pre-decode the data to be decompressed to obtain block information of the data to be decompressed.

Specifically, the block information indicates the positions of the compressed blocks in the data to be decompressed.

It should be noted that the block information may be the position information of the block header of each compressed block, the position information of the block tail of each compressed block, or both; no limitation is imposed here.

In one embodiment, the block information is the position information of the block tail of each compressed block. Specifically, the electronic device pre-decodes the data to be decompressed to determine the position information of the block tail of each compressed block, and determines the block information according to that position information.

In one embodiment, the position information of the block tail may be the addressing position of the end character of the compressed block. Specifically, the code value of the end character of a compressed block is 256. The electronic device matches the characters in the data to be decompressed against the code table of the data to be decompressed; if a character matches the code value 256, the position information of that character is taken as the position information of the block tail of the current compressed block.
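As a rough illustration of the end-of-block scan described above, the following Python sketch walks a raw DEFLATE stream and records the bit offset just past each block. It follows the block layout of RFC 1951 rather than anything stated in the patent: in a real stream, length symbols above 256 are followed by extra bits and a distance code, which the scanner must skip even though it produces no output. The names `BitReader`, `build_table`, `decode`, `read_dynamic_tables`, and `scan_blocks` are hypothetical, and the bit-by-bit decoding is chosen for clarity, not speed.

```python
import zlib

# Order in which code-length code lengths are stored (RFC 1951, section 3.2.7).
CL_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15]

class BitReader:
    """Reads bits LSB-first from a bytes object; pos is a bit offset."""
    def __init__(self, data):
        self.data, self.pos = data, 0

    def bits(self, n):
        value = 0
        for i in range(n):
            value |= ((self.data[self.pos >> 3] >> (self.pos & 7)) & 1) << i
            self.pos += 1
        return value

def build_table(lengths):
    """Build a canonical Huffman table: (code length, code) -> symbol."""
    count = {}
    for n in lengths:
        if n:
            count[n] = count.get(n, 0) + 1
    code, next_code = 0, {}
    for n in range(1, 16):
        code = (code + count.get(n - 1, 0)) << 1
        next_code[n] = code
    table = {}
    for sym, n in enumerate(lengths):
        if n:
            table[(n, next_code[n])] = sym
            next_code[n] += 1
    return table

def decode(br, table):
    """Read one Huffman-coded symbol, one bit at a time."""
    code = 0
    for n in range(1, 16):
        code = (code << 1) | br.bits(1)
        if (n, code) in table:
            return table[(n, code)]
    raise ValueError("invalid Huffman code")

def read_dynamic_tables(br):
    """Parse the header of a dynamic-Huffman block into its two tables."""
    hlit = br.bits(5) + 257
    hdist = br.bits(5) + 1
    hclen = br.bits(4) + 4
    cl = [0] * 19
    for i in range(hclen):
        cl[CL_ORDER[i]] = br.bits(3)
    cl_table, lengths = build_table(cl), []
    while len(lengths) < hlit + hdist:
        sym = decode(br, cl_table)
        if sym < 16:
            lengths.append(sym)
        elif sym == 16:                      # repeat previous length
            lengths += [lengths[-1]] * (3 + br.bits(2))
        elif sym == 17:                      # short run of zeros
            lengths += [0] * (3 + br.bits(3))
        else:                                # long run of zeros
            lengths += [0] * (11 + br.bits(7))
    return build_table(lengths[:hlit]), build_table(lengths[hlit:])

def scan_blocks(raw):
    """Return [(block type, bit offset just past the block), ...]."""
    br, blocks, final = BitReader(raw), [], 0
    while not final:
        final = br.bits(1)
        btype = br.bits(2)
        if btype == 0:      # stored block: skip its bytes, no symbol 256 here
            br.pos = (br.pos + 7) & ~7
            size = br.bits(16)
            br.bits(16)     # NLEN, the one's complement of LEN
            br.pos += 8 * size
        elif btype == 3:
            raise ValueError("reserved block type")
        else:
            if btype == 1:  # fixed Huffman codes
                lit = build_table([8] * 144 + [9] * 112 + [7] * 24 + [8] * 8)
                dist = build_table([5] * 30)
            else:
                lit, dist = read_dynamic_tables(br)
            while True:
                sym = decode(br, lit)
                if sym == 256:   # end-of-block symbol: the block tail
                    break
                if sym > 256:    # length symbol: skip extra bits and distance
                    if 265 <= sym <= 284:
                        br.bits((sym - 261) // 4)
                    dsym = decode(br, dist)
                    if dsym >= 4:
                        br.bits(dsym // 2 - 1)
        blocks.append((btype, br.pos))
    return blocks

# Usage: a small raw DEFLATE stream (wbits=-15 means no zlib or gzip wrapper).
comp = zlib.compressobj(6, zlib.DEFLATED, -15)
raw = comp.compress(b"GET /a.ts 200\n" * 200) + comp.flush(zlib.Z_FULL_FLUSH)
raw += comp.compress(b"GET /b.ts 404\n" * 200) + comp.flush()
block_ends = scan_blocks(raw)
```

Each entry of `block_ends` pairs a block type with a bit offset, which mirrors the observation elsewhere in the description that compressed blocks are not stored on whole-byte boundaries.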

Assume that the data to be decompressed is a compressed file produced by a lossless data compression algorithm (the deflate algorithm), hereinafter referred to as a deflate file. The electronic device determines the block information as follows. First, the electronic device unpacks the deflate file, that is, it removes the file header and keeps the compressed stream. The file header may include description information of the deflate file, and the description information may indicate whether the deflate file uses dynamic or static compression. Then, the electronic device reads the coding tree of the deflate file and performs an initial pre-decoding of the compressed stream. Any value decoded during pre-decoding that is greater than or less than 256 is discarded and parsing continues; if the decoded value is 256, the compressed block has ended, and all information of that block is cached. The next block is processed by the same rule until all blocks have been parsed. The pre-decoding process usually decodes the coding tree with the Huffman algorithm to obtain the code table, and then decodes each character of the deflate file with the LZ77 algorithm according to the code table. This process only checks whether the character read is 256 and does not perform distance-based replacement of repeated characters, so the content of each cached block is still data that has not been fully decompressed. After this step, the following are available: the compressed blocks that have not been fully decompressed, and the address of the end character of each compressed block. Optionally, each compressed block is numbered during pre-decoding, so that an ID sequence number is obtained for each compressed block.
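The unpacking step at the start of this process can be sketched as follows for the common case of a GZIP container. This is a minimal illustration that assumes a single-member GZIP file laid out as in RFC 1952; the function name `strip_gzip_wrapper` is hypothetical and does not come from the patent.

```python
import gzip
import struct
import zlib

# Flag bits of the FLG header byte (RFC 1952).
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10

def strip_gzip_wrapper(blob: bytes) -> bytes:
    """Return the raw DEFLATE stream inside a single-member GZIP file."""
    if blob[:3] != b"\x1f\x8b\x08":
        raise ValueError("not a GZIP member that uses DEFLATE")
    flags = blob[3]
    pos = 10  # fixed part: magic(2) CM(1) FLG(1) MTIME(4) XFL(1) OS(1)
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", blob, pos)
        pos += 2 + xlen
    if flags & FNAME:       # zero-terminated original file name
        pos = blob.index(b"\x00", pos) + 1
    if flags & FCOMMENT:    # zero-terminated comment
        pos = blob.index(b"\x00", pos) + 1
    if flags & FHCRC:       # 16-bit header checksum
        pos += 2
    return blob[pos:-8]     # drop the CRC32 and ISIZE trailer

# Usage: the stripped stream inflates with a negative wbits value (no wrapper).
payload = b"GET /index.html 200\n" * 100
raw_stream = strip_gzip_wrapper(gzip.compress(payload))
restored = zlib.decompress(raw_stream, -15)
```

Dropping the trailer also drops the CRC32 check, so a real implementation would presumably verify the checksum once the parallel results have been reassembled.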

It is worth mentioning that separating the compressed blocks and recording the addressing position of each block's end character provide the basis for the subsequent parallel decompression.

It is worth mentioning that numbering the compressed blocks provides the basis for the subsequent merging of compressed blocks.

Step 102: Divide the data to be decompressed into N data blocks according to the block information.

Specifically, each data block includes at least one compressed block, and N is a positive integer.

In one embodiment, the electronic device may treat each compressed block as one data block.

In one embodiment, a distributed computing platform may divide the data to be decompressed into N data blocks according to the block information and decompress the data blocks concurrently.

Step 103: Decompress the data blocks concurrently.

Specifically, the electronic device may decompress the data blocks in parallel through the distributed computing platform. The final decompressed file of the data to be decompressed is determined from the decompressed data of each data block.
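The idea of decompressing pieces of one stream concurrently and then reassembling them in order can be sketched in Python with the standard `zlib` and `concurrent.futures` modules. Note that this sketch swaps in a simpler technique than the one claimed: it assumes the compressor issued a full flush after every chunk, so each segment is byte-aligned and carries no back-references into earlier data, whereas the patent locates block tails by pre-decoding and handles back-references through overlapping blocks. A thread pool on one machine stands in for the distributed computing platform, and the function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_in_segments(chunks):
    """Compress chunks into one raw DEFLATE stream and return its segments.

    A full flush after each chunk byte-aligns the output and resets the
    history window, so every segment can be inflated on its own.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    segments = []
    for i, chunk in enumerate(chunks):
        body = comp.compress(chunk)
        if i == len(chunks) - 1:
            tail = comp.flush()                    # finish the stream
        else:
            tail = comp.flush(zlib.Z_FULL_FLUSH)   # independent restart point
        segments.append(body + tail)
    return segments

def inflate_segment(segment):
    """Inflate one independent segment of a raw DEFLATE stream."""
    return zlib.decompressobj(wbits=-15).decompress(segment)

def parallel_inflate(segments, workers=4):
    """Inflate all segments concurrently and join the results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(inflate_segment, segments))

# Usage: eight chunks of log-like lines, compressed once, inflated in parallel.
chunks = [b"line %d\n" % i * 50 for i in range(8)]
segments = compress_in_segments(chunks)
restored = parallel_inflate(segments)
```

Because `pool.map` yields results in input order, the reassembled output has the same byte order as the original data regardless of which worker finishes first.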

In one embodiment, the distributed computing platform is communicatively connected to the Spark computing platform and transmits the decompressed data of each data block to the Spark computing platform.

It should be noted that the above is only an example and does not limit the technical solution of the present invention.

Compared with the prior art, the decompression method provided in this embodiment pre-decodes the data to be decompressed to obtain its block information, so that the data can be divided into blocks and decompressed in parallel. Because the divided data blocks are decompressed in parallel rather than the file being decompressed as a whole, the decompression speed is increased and the total decompression latency is reduced.

The second embodiment of the present invention relates to a decompression method and gives an example of steps 102 and 103 of the first embodiment.

Specifically, as shown in Fig. 2, this embodiment includes steps 201 to 203. Step 201 is substantially the same as step 101 in the first embodiment and is not repeated here. The differences are described below:

Step 201: Pre-decode the data to be decompressed to obtain block information of the data to be decompressed.

Step 202: Merge the compressed blocks in the data to be decompressed into N data blocks according to the block information and a preset merging rule.

Specifically, among the merged data blocks, the first compressed block of the (i+1)-th data block is the same as the last compressed block of the i-th data block, 1 ≤ i < N.

It is worth mentioning that merging multiple compressed blocks into one data block avoids running too many parallel decompression tasks on the electronic device and exhausting its resources.

In one embodiment, the block information also indicates the order of the compressed blocks; for example, the block information also includes the number of each compressed block, that is, its ID sequence number. The merging rule is: according to the order of the compressed blocks, merge the 1st to the M-th compressed blocks into one data block, where M is a positive integer; determine whether 2M is less than N; if so, merge the M-th to the 2M-th compressed blocks into one data block, set M = 2M, and return to the step of determining whether 2M is less than N; if not, merge the M-th to the N-th compressed blocks into one data block.

It should be noted that, as those skilled in the art will understand, in practical applications M can be determined according to the data size of each compressed block and the parallel processing capability of the distributed computing platform; no limitation is imposed here.

It is worth mentioning that merging the compressed blocks in the order of their ID sequence numbers guarantees the continuity of content between compressed blocks during subsequent parallel decompression.

It is worth mentioning that each data block redundantly carries the last compressed block of the previous data block, which ensures that the first compressed block in a data block can find referenced characters at a sufficient distance during decompression and complete normal character replacement.

Assume that the data to be decompressed is a deflate file (a GZIP file) and the distributed computing platform is the Hadoop platform. According to the ID sequence numbers of the compressed blocks, the Hadoop platform merges a predetermined number of compressed blocks, in order, into Hadoop partitions (data blocks) of suitable size. The predetermined number can be set as required. The compressed blocks of a GZIP file are each on the order of tens of KB, while a Hadoop partition is usually 64 MB, so starting a processing task for every few tens of KB of data is not optimal for a big data system. Therefore, after all block information has been identified and cached, the Hadoop platform merges the compressed blocks accordingly. Considering that the data volume grows several times after decompression, this embodiment combines 100 consecutive compressed blocks into one large set that serves as one Hadoop partition. Each Hadoop partition provides the number of compressed blocks it contains, their ID sequence numbers, their contents, and so on, as part of the Hadoop partition information.

When compressed blocks are merged into Hadoop partitions, they can be merged in the order of their ID sequence numbers; for example, the first Hadoop partition is #1, #2, #3, #4, a later Hadoop partition is #8, #9, #10, #11, and so on. Merging in the order of the ID sequence numbers ensures that the content is continuous from block to block during the subsequent parallel decompression, so that each Hadoop partition can be decompressed successfully.

In addition, when compressed blocks are merged into Hadoop partitions, the following can also be satisfied: the first compressed block of each Hadoop partition is the last compressed block of the previous Hadoop partition. That is, the tail block of the previous Hadoop partition is redundantly added at the start of every Hadoop partition except the first. For example, if the first Hadoop partition consists of the four blocks #1, #2, #3, #4, the second Hadoop partition should contain #4, #5, #6, #7, #8, so that block #4 is included; the third Hadoop partition should contain #8, #9, #10, #11, #12, so that block #8 is included; and so on. The purpose is to ensure that the first block in a partition can find referenced characters at a sufficient distance during decompression and complete normal character replacement.
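The overlapping grouping in this example can be sketched in a few lines of Python. The sketch assumes one reading of the merging rule, namely that the partition boundary advances by a fixed stride of M blocks and that each boundary block is shared by two neighbouring partitions; the function name `merge_blocks` and its parameters are hypothetical.

```python
def merge_blocks(total_blocks, stride):
    """Group block IDs 1..total_blocks into overlapping partitions.

    Returns inclusive (first_id, last_id) pairs. Every partition after the
    first starts with the last block of the partition before it.
    """
    if total_blocks < 1 or stride < 1:
        raise ValueError("total_blocks and stride must be positive")
    partitions = []
    first, last = 1, min(stride, total_blocks)
    while True:
        partitions.append((first, last))
        if last >= total_blocks:
            break
        first = last                                # shared boundary block
        last = min(last + stride, total_blocks)
    return partitions

# Usage: 12 compressed blocks with a stride of 4, as in the example above.
example = merge_blocks(12, 4)
```

With 12 blocks and a stride of 4 this reproduces the partitions #1 to #4, #4 to #8, and #8 to #12. When the block count is not a multiple of the stride, the final partition is simply shorter.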

Step 203: Decompress the data blocks concurrently.

Specifically, decompressing the k-th data block includes: if k = 1, decompressing from the first compressed block up to the last predetermined symbol of the last compressed block; if 1 < k < N, decompressing from the last predetermined symbol of the first compressed block up to the last predetermined symbol of the last compressed block; and if k = N, decompressing from the last predetermined symbol of the first compressed block until the last compressed block is completely decompressed.

It should be noted that the predetermined symbol may be a newline character or another specified symbol, such as a colon; this embodiment imposes no limitation.

It should be noted that the electronic device may decompress the data in each data block with the LZ77 algorithm, or may select a suitable decompression algorithm according to the compression algorithm of the data to be decompressed; this embodiment does not limit the algorithm used for decompression.

Assume that the predetermined symbol is a newline character. After the Hadoop partitions have been formed in the previous step, the distributed computing platform obtains the partition information list and then decompresses the contents of the Hadoop partitions in parallel. Each Hadoop partition is decompressed as follows:

(1) Determine whether the Hadoop partition being decompressed is the first partition. If it is, decode from the first compressed block of the current partition up to the last visible newline character of the last compressed block of the current partition. The string after the last newline character is considered incomplete; it is discarded when the current partition is decompressed and left for the next partition to process. If it is not the first partition, go to the next step;

(2) Determine whether the Hadoop partition being decompressed is the last partition. If it is not, decode from the last visible newline character of the first compressed block of the current partition up to the last visible newline character of the last compressed block. That is, the data before the last newline character of the first compressed block is considered to have been processed already by the previous partition. If it is the last partition, go to the next step;

(3) Decode from the last visible newline character of the first compressed block of the current partition until the last compressed block has been completely decompressed.
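The three cases above amount to choosing a start and an end offset in the decompressed text of each partition. The following Python sketch shows only that trimming logic and assumes the blocks of a partition have already been decompressed into byte strings; it does not perform the DEFLATE decoding itself, and the function name `trim_partition` and the toy data are hypothetical.

```python
def trim_partition(blocks, k, n, sep=b"\n"):
    """Return the bytes that partition k (1-based) of n is responsible for.

    blocks holds the decompressed text of each compressed block in the
    partition; for k > 1, blocks[0] is shared with the previous partition.
    """
    data = b"".join(blocks)
    start = 0
    if k > 1:
        # Everything up to the last separator of the shared first block
        # was already emitted by partition k - 1.
        start = blocks[0].rfind(sep) + 1
    end = len(data)
    if k < n:
        # Stop after the last separator of the last block; the incomplete
        # record that follows is left to partition k + 1.
        end = len(data) - len(blocks[-1]) + blocks[-1].rfind(sep) + 1
    return data[start:end]

# Usage: 12 text records cut into 10-byte toy "blocks" that ignore line ends.
text = b"".join(b"record-%d\n" % i for i in range(12))
blocks = [text[i:i + 10] for i in range(0, len(text), 10)]
groups = [(0, 3), (3, 7), (7, 10)]   # 0-based block ranges sharing boundaries
parts = [trim_partition(blocks[a:b + 1], k, len(groups))
         for k, (a, b) in enumerate(groups, start=1)]
```

Because both neighbours cut at the same separator inside the shared block, the trimmed partitions concatenate back to the original text with no duplicated or missing records.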

The purpose of the above steps is to solve the problem that the decompressed content of a Hadoop partition cannot be used directly for subsequent RDD construction and concurrent big data computation. Because of the way GZIP compression works, for the compressed blocks assigned to each partition, especially the first and last blocks, there is no guarantee that the first character of the first block is the first character of a text record, nor that the last character of the last block is the last character of a text record. In this case, handing the partition content directly to the subsequent distributed computation may cause system errors and other abnormal conditions. With the above steps, splitting at the last newline character of the first block and of the last block of each Hadoop partition ensures that the partition content is read completely. This in turn ensures that the Spark computing platform can construct an in-memory Spark RDD, its smallest unit of computation, and ultimately achieves parallel decompression of GZIP files and seamless integration of the Hadoop platform with a streaming computing cluster (the Spark computing platform).

The inventor found that, in existing GZIP-based file collection and big data analysis systems, the dominant cost is GZIP decompression. A GZIP file has its own encapsulation header, and its compressed blocks are stored not on whole-byte boundaries but as a continuous binary stream (bit stream). Because the stream provides no start, end, or summary information for each compressed block, a list of all blocks cannot be obtained from the stream in one pass. These characteristics mean that GZIP natively supports neither Hadoop partitioning nor parallel reading and decompression. Without support for Hadoop's partitioning, a big data platform cannot exploit its scale, however many physical machines and CPUs it has. A GZIP file cannot be decompressed and processed by parallel tasks by default; it can only be decompressed by a single process on a single core. This bottleneck greatly limits the computing power and efficiency of the big data platform. This embodiment provides a decompression method that obtains the block information of the compressed blocks through pre-decoding and adapts the decoding process of the Hadoop platform accordingly, so that GZIP files can be decoded in parallel on the Hadoop platform, which increases the decoding speed.

It is worth mentioning that, because the data input to the Hadoop platform is the data to be decompressed in its still-compressed form, the decompression efficiency of real-time computation on the Hadoop platform is improved compared with the approach of feeding already decompressed data into the Hadoop platform for partitioning. Fast preprocessing and caching enable parallel decompression, which reduces the total decompression latency. The larger the compressed file, the greater the efficiency gain. In addition, the Hadoop storage space occupied by intermediate decompressed files is reduced by roughly a factor of 10 or more. With this embodiment, the Hadoop platform is seamlessly integrated with a streaming computing framework such as the Spark computing platform, so developers of big data business applications only need to focus on developing the large-scale distributed computation for their business, and do not need to deal with purely technical issues, such as how to improve decompression efficiency, that arise from the technical limitations of GZIP.

Compared with the prior art, the decompression method provided in this embodiment pre-decodes the data to be decompressed to obtain its block information, so that the data to be decompressed can be divided into blocks and decompressed in parallel. Because the divided data blocks are decompressed in parallel, the decompression speed is higher and the total decompression latency is lower than when the entire file is decompressed as a whole. In addition, the compressed blocks are merged in the order of their ID sequence numbers, which guarantees the continuity of content between compressed blocks during the subsequent parallel decompression. Each data block redundantly includes the last compressed block of the previous data block, which ensures that the first compressed block within a data block can find referenced characters at a sufficiently long distance during decompression and complete normal character substitution.
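The overlapping merge described above can be sketched in Python as follows. This is a simplified reading of the merging rule that assumes a fixed group size `m`; the function name `merge_with_overlap` and both parameters are hypothetical. It only computes which compressed blocks belong to which data block and performs no decompression. The overlap matters because a DEFLATE back-reference can point at most 32,768 bytes back in the output, so the first new block of a data block may depend on text produced by the block before it.

```python
def merge_with_overlap(num_blocks: int, m: int) -> list[tuple[int, int]]:
    """Group compressed blocks 1..num_blocks (in ID order) into data blocks.

    Returns 1-based inclusive (first, last) ranges. Every data block after
    the first starts with the last compressed block of the previous one.
    """
    if num_blocks < 1 or m < 1:
        raise ValueError("num_blocks and m must be positive")
    ranges = []
    first, last = 1, m
    while last < num_blocks:
        ranges.append((first, last))
        first = last   # redundancy: reuse the previous last block
        last += m
    ranges.append((first, num_blocks))  # final data block takes the rest
    return ranges


data_blocks = merge_with_overlap(10, 4)
```

With ten compressed blocks and `m = 4`, the sketch produces the ranges 1 to 4, 4 to 8 and 8 to 10, so blocks 4 and 8 each appear in two adjacent data blocks.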

The division of the above methods into steps is only for clarity of description. In implementation, steps may be combined into a single step, or a step may be split into multiple steps; as long as the same logical relationship is included, such variations fall within the scope of protection of this patent. Adding insignificant modifications to the algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm and process, also falls within the scope of protection of this patent.

The third embodiment of the present invention relates to a decompression device which, as shown in FIG. 3, includes a pre-decoding module 301, a blocking module 302 and a decompression module 303. The pre-decoding module 301 is configured to pre-decode the data to be decompressed to obtain block information of the data to be decompressed, the block information indicating the positions of the compressed blocks in the data to be decompressed. The blocking module 302 is configured to divide the data to be decompressed into N data blocks according to the block information, where each data block includes at least one compressed block and N is a positive integer. The decompression module 303 is configured to decompress each data block concurrently.
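The three-stage structure (obtain block positions, split, decompress concurrently) can be illustrated with a small Python sketch built on the standard `zlib` module. It does not reproduce the bit-level pre-decoding of this embodiment. As a loudly labeled substitution, the writer inserts `Z_FULL_FLUSH` points, which byte-align the stream and reset the back-reference window, and records their byte offsets as a stand-in for the block information. All function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_with_index(chunks):
    """Compress chunks into one raw DEFLATE stream and record offsets.

    Stand-in for module 301: the offsets play the role of block information.
    Assumption: a full flush after each chunk makes segments independent.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw DEFLATE
    out = bytearray()
    offsets = [0]
    for chunk in chunks:
        out += comp.compress(chunk)
        out += comp.flush(zlib.Z_FULL_FLUSH)
        offsets.append(len(out))
    out += comp.flush()  # terminate the stream
    return bytes(out), offsets


def split_segments(stream, offsets):
    """Stand-in for module 302: cut the stream at the recorded positions."""
    return [stream[a:b] for a, b in zip(offsets, offsets[1:])]


def decompress_segment(segment):
    # Each segment starts at a byte boundary with an empty window.
    return zlib.decompressobj(-15).decompress(segment)


def parallel_decompress(stream, offsets, workers=4):
    """Stand-in for module 303: decompress all segments concurrently."""
    segments = split_segments(stream, offsets)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(decompress_segment, segments))


chunks = [b"alpha\n" * 100, b"beta\n" * 100, b"gamma\n" * 100]
stream, offsets = compress_with_index(chunks)
restored = parallel_decompress(stream, offsets)
```

The sketch only works for streams written with such flush points. An ordinary GZIP file has none, which is why the embodiment relies on pre-decoding and on overlapping data blocks instead.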

It is easy to see that this embodiment is a system embodiment corresponding to the first embodiment, and that it can be implemented in cooperation with the first embodiment. The relevant technical details mentioned in the first embodiment remain valid in this embodiment and are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the first embodiment.

It is worth mentioning that all the modules involved in this embodiment are logical modules. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, units that are not closely related to solving the technical problem addressed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.

The fourth embodiment of the present invention relates to an electronic device which, as shown in FIG. 4, includes at least one processor 401 and a memory 402 communicatively connected to the at least one processor 401. The memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 so that the at least one processor 401 can perform the decompression method described in the above embodiments.

The electronic device includes one or more processors 401 and a memory 402; FIG. 4 takes one processor 401 as an example. The processor 401 and the memory 402 may be connected through a bus or in other ways; FIG. 4 takes connection through a bus as an example. The memory 402, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules. The processor 401 runs the non-volatile software programs, instructions and modules stored in the memory 402 to execute the various functional applications and data processing of the device, that is, to implement the above decompression method.

The memory 402 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required by at least one function; the data storage area may store an option list and the like. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, and such remote memory may be connected to an external device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

One or more modules are stored in the memory 402 and, when executed by the one or more processors 401, perform the decompression method of any of the above method embodiments.

The above product can execute the methods provided in the embodiments of this application and has the functional modules and beneficial effects corresponding to executing those methods. For technical details not described in detail in this embodiment, refer to the methods provided in the embodiments of this application.

The fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the above method embodiments are implemented.

That is, those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc.

Those of ordinary skill in the art will understand that the above embodiments are specific examples of implementing the present invention, and that in practical applications various changes may be made to them in form and detail without departing from the spirit and scope of the present invention.

Claims

1. A decompression method, characterized by comprising:

pre-decoding the data to be decompressed to obtain block information of the data to be decompressed, the block information indicating the positions of the compressed blocks in the data to be decompressed;

dividing the data to be decompressed into N data blocks according to the block information, each data block including at least one compressed block, N being a positive integer;

decompressing each of the data blocks concurrently;

wherein dividing the data to be decompressed into N data blocks according to the block information specifically comprises:

merging the compressed blocks in the data to be decompressed into N data blocks according to the block information and a preset merging rule;

wherein, in the merged data blocks, the first compressed block of the (i+1)th data block is the same as the last compressed block of the ith data block, 1≤i<N.

2. The decompression method according to claim 1, wherein pre-decoding the data to be decompressed to obtain the block information of the data to be decompressed specifically comprises:

pre-decoding the data to be decompressed and determining the position information of the block end of each compressed block;

determining the block information according to the position information of the block end of each compressed block.

3. The decompression method according to claim 2, wherein pre-decoding the data to be decompressed and determining the position information of the block end of each compressed block specifically comprises:

performing code table matching on the characters in the data to be decompressed according to the code table of the data to be decompressed;

if the code value matched by a character is 256, using the position information of that character as the position information of the block end of the current compressed block.

4. The decompression method according to claim 1, wherein the block information also indicates the order of the compressed blocks, and the merging rule is:

merging, according to the order of the compressed blocks, the first compressed block to the Mth compressed block into one data block, where M is a positive integer;

determining whether 2M is less than N;

if so, merging the Mth compressed block to the 2Mth compressed block into one data block, setting M=2M, and returning to the step of determining whether 2M is less than N;

if not, merging the Mth compressed block to the Nth compressed block into one data block.

5. The decompression method according to claim 1, wherein the decompression process for the kth data block comprises:

if it is determined that k=1, decompressing from the first compressed block up to the last predetermined symbol of the last compressed block;

if it is determined that 1<k<N, decompressing from the last predetermined symbol of the first compressed block up to the last predetermined symbol of the last compressed block;

if it is determined that k=N, decompressing from the last predetermined symbol of the first compressed block until decompression of the last compressed block is complete.

6. The decompression method according to any one of claims 1 to 5, wherein the data to be decompressed is divided into N data blocks according to the block information by a distributed computing platform, and each of the data blocks is decompressed concurrently by the distributed computing platform.

7. The decompression method according to claim 6, wherein the distributed computing platform is communicatively connected to a Spark computing platform, and the distributed computing platform transmits the decompressed data of each of the data blocks to the Spark computing platform.

8. An electronic device, comprising: at least one processor; and

a memory communicatively connected to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the decompression method according to any one of claims 1 to 7.

9. A computer-readable storage medium storing a computer program, wherein the computer program implements the decompression method according to any one of claims 1 to 7 when executed by a processor.