
CN111966571B - Co-processing Method for Time Estimation Based on ARM-FPGA Coprocessor Heterogeneous Platform - Google Patents


Info

Publication number
CN111966571B
CN111966571B (application CN202010807124.6A)
Authority
CN
China
Prior art keywords: data, fpga, source data, arm, time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010807124.6A
Other languages
Chinese (zh)
Other versions
CN111966571A (en)
Inventor
罗志勇
何禹辰
马国喜
王耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202010807124.6A
Publication of CN111966571A
Application granted
Publication of CN111966571B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G06F11/3419Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment by assessing time
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/22Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2236Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3452Performance evaluation by statistical analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Stored Programmes (AREA)
  • Power Sources (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a time-estimation co-processing method based on an ARM-FPGA coprocessor heterogeneous platform, applied to scenarios in which the big-data storage unit and the computing unit are separated. A high-performance FPGA coprocessor is deployed in the storage unit through an AXI protocol interface, enhancing the data-processing capability of the storage-unit node and providing computing power for preprocessing massive data. When the redundant information in the source data is too small, preprocessing the source data only increases the storage unit's data-interaction and preprocessing time and cannot meaningfully reduce the end-to-end delay. Therefore, the amount of source data that would be preprocessed is estimated probabilistically, in order to decide whether preprocessing is needed at all. The time cost of the whole procedure is then estimated by a time-estimation method, where the estimated cost is the sum of the delays of source-data extraction, transmission and communication, computation, and so on. Finally, a suitable processor is selected for data processing, maximizing efficiency and improving the overall performance of the system.

Description

Co-processing Method for Time Estimation Based on ARM-FPGA Coprocessor Heterogeneous Platform

Technical Field

The invention belongs to the technical field of distributed computing with heterogeneous-platform cooperative processing, and in particular relates to a time-predictable co-processing method based on an ARM-FPGA coprocessor heterogeneous system.

Background Art

The explosive growth of data volume in the big-data era means that massive data sets contain a great deal of redundant and invalid information. In big-data systems where the data storage unit and the computing unit are separated, how to preprocess massive data efficiently and remove redundant data, so as to save storage nodes, reduce communication delay, and improve data availability, is a problem worth attention. Traditional data processing relies on high-performance processors, or several processors working together, to meet big-data workloads. Constrained by the slowdown of Moore's Law, clock-frequency improvements across processor architectures are severely limited; driven by the information explosion, hybrid architectures that couple ARM processors with high-performance new hardware (GPUs, field-programmable gate arrays) have therefore received extensive attention from researchers and become the focus of this field.

The ARM processor supports two instruction sets, makes extensive use of registers, executes instructions quickly, offers flexible addressing modes, and has high execution efficiency. With its small size, low power consumption, low cost, and high performance, it is widely used in embedded systems. The processor has a parallel-processing advantage when executing program tasks, but the complexity of program logic limits instruction-level parallelism, and heavily loaded multi-threading is hard to achieve. As data volumes and data structures grow more complex, relying on a single processor becomes unrealistic, so a coprocessor is attached to share the processor's computational load. The FPGA, as a semi-custom programmable logic device, can implement arbitrary logic functions through its programmability, and offers outstanding advantages such as efficient programming, short development cycles, parallel computing, and low power consumption. These characteristics make data processing faster and more real-time, in a way that much other new hardware cannot match.

Therefore, on a heterogeneous platform that carries an FPGA as coprocessor, software and hardware cooperate to preprocess massive data and remove redundant data, reducing the amount of data transmitted between nodes, the communication delay between them, and the space consumed on data-storage nodes, and effectively addressing issues such as workload allocation and transmission time. However, during data preprocessing, the redundancy in the data may be very small, while preprocessing still has to pass over the entire data set; in that case it cannot effectively reduce internal data traffic, network transmission delay, or computation time, and the cost outweighs the benefit. Therefore, in application scenarios where source-data storage and computing nodes are separated, the key problem is how, before processing massive data on the heterogeneous platform, to estimate and judge in advance the time and computational characteristics of the task on each processor, so that computing work can be allocated reasonably among processors, data-preprocessing and network-communication time is further reduced, the respective strengths of the heterogeneous platform are fully exploited, and efficiency is maximized. This is a problem worth attention.

Summary of the Invention

The present invention aims to solve the above problems of the prior art by proposing a time-estimation co-processing method based on an ARM-FPGA coprocessor heterogeneous platform. The technical scheme of the present invention is as follows:

A time-predictable co-processing method based on an ARM-FPGA coprocessor heterogeneous platform, the method comprising:

In application scenarios where the big-data storage unit and the computing unit are separated, the coprocessor FPGA is deployed in the data storage unit through an AXI4.0 interface; information including the redundancy in the source data is estimated, the total communication delay of the whole procedure is estimated, and a comprehensive processor choice is made. The specific steps are as follows:

S1. Through the AXI4.0 interface, realize high-speed data interaction between the interconnected ARM processor and FPGA;

S2. The source data of the storage-unit node is stored in the form of an index table, and the data is divided into two broad categories: data sets with strong regularity and data sets with complex regularity;

S3. For the different forms the source data takes, build a class-index-table probabilistic storage structure to express the probability of each data-distribution form, and thereby judge whether the source data needs preprocessing;

S4. Delay arises at every stage of source-data processing. It is therefore necessary to identify the communication process at each transmission stage and compute the total delay, providing a reliable basis for the final choice of source-data processor; accordingly, the overall delays of the ARM-processor data-processing path and the FPGA data-processing path must be estimated separately;

S5. Once the overall delay estimates are complete, decide which processor should process the source data according to the estimated delays. To judge accurately which processor is more suitable, it is not enough to rely on the estimated overall communication delay alone; the current states of the ARM processor and the FPGA processor must also be considered.
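
Steps S1-S5 can be sketched, at a high level, as a single planning routine. This is an illustrative sketch only: the function and parameter names, the redundancy threshold, and the idle-processor simplification are assumptions for illustration, not part of the patent.

```python
def plan_task(redundancy_ratio, t_arm_est, t_fpga_est, threshold=0.1):
    """Decide whether to preprocess (S3) and which processor to use (S5).

    redundancy_ratio: estimated fraction of redundant/invalid source data (S3)
    t_arm_est, t_fpga_est: estimated total delays of the two paths (S4)
    threshold: hypothetical cut-off below which preprocessing is skipped
    """
    preprocess = redundancy_ratio > threshold
    if not preprocess:
        # Too little redundancy: preprocessing would only add delay,
        # so forward the raw data along the ARM path (S3).
        return (False, "ARM")
    # Idle processors assumed here; step S5 additionally folds in the
    # current load of each processor before the final choice.
    return (True, "ARM" if t_arm_est < t_fpga_est else "FPGA")
```

The routine returns a pair (preprocess?, processor), which a dispatcher could use to route the request.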

Further, in step S1, when a data request is generated, the ARM processor extracts data by accessing memory. When the source data cannot all fit in processor memory, the source data stored on the storage medium (hard disk) can be accessed and all of it sent to the computing-unit processor; if, however, the FPGA is used for data preprocessing, the AXI interface provides high-speed communication and data interaction between the ARM processor and the FPGA.

Further, in step S2, different data characteristics are described in different ways: the gender column follows a binomial distribution, and the node records its occurrence probability; the age column is described by a uniform distribution, and the node records its maximum and minimum values; the score column's distribution is comparatively complex, so its data keys are obtained by hashing for processing. Different key-value columns are marked with different numbers: the gender column with 1, the age column with 2, and the score column with 3.
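
As a concrete illustration of the three column descriptions, the per-column metadata might be recorded as follows. The dictionary layout, the probability value, the age bounds, and the choice of hash are all invented for this sketch; only the tag numbers 1/2/3 come from the text.

```python
import hashlib

# Hypothetical per-column descriptors matching the numeric tags in the
# text: 1 = gender (binomial), 2 = age (uniform), 3 = score (hashed).
columns = {
    "gender": {"tag": 1, "dist": "binomial", "p": 0.5},             # assumed probability
    "age":    {"tag": 2, "dist": "uniform", "min": 16, "max": 25},  # assumed bounds
    "score":  {"tag": 3, "dist": "hashed"},  # too irregular; keyed by hash
}

def score_key(value):
    """Hash-derived key for the irregular score column (tag 3)."""
    return hashlib.md5(str(value).encode()).hexdigest()[:8]
```

A lookup by column name yields the distribution summary a node would store, while `score_key` gives a stable key for processing score values.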

Further, in step S3, for source data with strong regularity the probability distribution is relatively simple, and a fixed-value calculation estimates the probability. For source data with complex regularity, the key of each data item is described with a class structure, and a class index table is built to estimate data sizes. An index-node table is built from the source-data key values; during its construction, nodes are created from frequently indexed key values, to guarantee accuracy at estimation time. The bottom-level leaf nodes store the probability of data A, with Pi,j denoting the probability that Ai < A < Aj between node values. Taking the first leaf node of the index structure as an example, P0,4 denotes the probability that the node data is smaller than A4; if a symmetric probability distribution holds on the interval A < A4, a definite probability value can be estimated from the P0,4 probability value.
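
A minimal sketch of the leaf-probability lookup described above. The leaf layout, the example probabilities, and the midpoint (symmetric/uniform) split of a straddling leaf are assumptions made for illustration.

```python
# Each leaf stores (lower_key, upper_key, P_{i,j}) for its interval;
# the values below are invented example probabilities that sum to 1.
leaves = [
    (0, 4, 0.25),
    (4, 9, 0.40),
    (9, 15, 0.35),
]

def prob_less_than(key):
    """Estimate P(A < key): sum whole leaves below `key`, and split any
    straddling leaf proportionally (the symmetric-distribution assumption)."""
    total = 0.0
    for lo, hi, p in leaves:
        if hi <= key:
            total += p
        elif lo < key < hi:
            total += p * (key - lo) / (hi - lo)  # symmetric split inside the leaf
    return total
```

With these leaves, `prob_less_than(4)` returns exactly the stored P0,4, while intermediate keys are interpolated, mirroring how a definite value is estimated from a leaf probability.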

Further, in step S4, when the ARM-processor system handles the source data, the delay of the whole process mainly comprises the source-data extraction and transfer time, the network transmission delay, and the source-data computation time, denoted TAo-t, TAf-n, and TAo-c respectively. The procedure first takes the source data out of the storage-node unit and then forwards it to the computing-node unit; the overall time cost is denoted TArm, given by TArm = TAo-t + TAf-n + TAo-c. Inside the storage unit the source data travels over a high-speed PCIe interface, whose communication delay is the product of the stable transmission rate (V1) and the source-data volume (Ao), expressed as TAo-t = V1*Ao.

Further, in step S4, when the FPGA acts as coprocessor for source-data preprocessing, the delay of the whole process comprises the communication delay of extracting the source data, the data-preprocessing delay inside the FPGA, the network transmission delay of the processed data, and the computing-unit processing delay, denoted TFo-t, TFo-c, TFf-n, and TFf-c respectively. The FPGA exchanges source data with the ARM processor over the AXI interface; its communication delay is the product of the stable transmission rate (V) and the source-data volume (Ao), expressed as TFo-t = (V+V1)*Ao. The delay of data transmitted over the network is determined by the current network speed (Nv) and the amount of data transmitted (Ao+f), expressed as TAf-n, TFf-n = Nv*Ao+f. When a source-data processing request arrives, the ARM processor parses the task message, the FPGA processor fetches the source data from the ARM processor over the AXI protocol interface and then preprocesses it, and once source-data preprocessing is complete the processed data is sent over the network to the computing-node unit directly through the network card integrated on the FPGA. The overall time cost is denoted TFpga, given by TFpga = TFo-t + TFo-c + TFf-n + TFf-c.
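
The two delay totals of step S4 can be transcribed directly. Note that the patent writes each transfer delay as a rate-data product, so V1, V, and Nv are treated below as per-unit-of-data time coefficients; the assumption that the full (unreduced) source data crosses the network on the ARM path, and all numeric inputs, are illustrative.

```python
def t_arm(a_o, v1, n_v, t_ao_c):
    """T_Arm = T_Ao-t + T_Af-n + T_Ao-c for the ARM-only path."""
    t_ao_t = v1 * a_o   # PCIe extraction: T_Ao-t = V1 * Ao
    t_af_n = n_v * a_o  # network transfer of the unreduced source data (assumed)
    return t_ao_t + t_af_n + t_ao_c

def t_fpga(a_o, a_of, v, v1, n_v, t_fo_c, t_ff_c):
    """T_Fpga = T_Fo-t + T_Fo-c + T_Ff-n + T_Ff-c for the FPGA path."""
    t_fo_t = (v + v1) * a_o  # AXI transfer: T_Fo-t = (V + V1) * Ao
    t_ff_n = n_v * a_of      # network transfer of the reduced data A_{o+f}
    return t_fo_t + t_fo_c + t_ff_n + t_ff_c
```

When preprocessing removes substantial redundancy, the reduced network term n_v * a_of is what can let the FPGA path win despite its extra AXI round trip.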

Further, in step S5, suppose both processors currently each have a task in progress, with remaining processing times TA-now and TF-now respectively. If another task arrives at that moment and would take TArm on the ARM processor and TFpga on the FPGA processor, then the total time each processor needs to finish is TA-all and TF-all, where TA-all = TA-now + TArm and TF-all = TF-now + TFpga. A reasonable decision follows from the total time required to complete the task: when TA-all < TF-all, the ARM processor is chosen for the task's data processing; otherwise, the FPGA processor is chosen.
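
The S5 decision rule reduces to a few lines; the function name is an assumption, but the comparison is exactly the one stated above.

```python
def choose_processor(t_a_now, t_f_now, t_arm, t_fpga):
    """Pick a processor by total completion time, including current load."""
    t_a_all = t_a_now + t_arm    # T_A-all = T_A-now + T_Arm
    t_f_all = t_f_now + t_fpga   # T_F-all = T_F-now + T_Fpga
    return "ARM" if t_a_all < t_f_all else "FPGA"
```

Folding in the backlog terms T_A-now and T_F-now is what distinguishes this rule from a naive comparison of T_Arm against T_Fpga alone.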

The advantages and beneficial effects of the present invention are as follows:

1. The FPGA coprocessor is deployed in the data storage unit over the AXI4.0 protocol interface. The focus here is on how, during data transmission, the source data can be preprocessed to remove invalid data, thereby easing network congestion and reducing the volume of transmitted data. The approach mainly exploits the FPGA's programmable flexibility, short development cycle, and strong parallel-computing efficiency: the hardware FPGA is deployed in a distributed fashion at the edge, and software and hardware cooperate to accelerate source-data preprocessing, rather than following the mainstream research direction of inserting additional hardware processing along the source-data transmission path to achieve the final effect.

2. The amount of invalid data in the source data is estimated probabilistically to decide whether preprocessing is needed. During preprocessing, the redundancy in the data may be very small, yet preprocessing still passes over the entire data set and cannot effectively reduce internal data traffic, network transmission delay, or computation time, so the cost outweighs the benefit; whether to preprocess the source data is therefore especially important. Here the data is estimated probabilistically, with the source data stored in the form of an index table, and two cases arise. First, the source data exhibits obvious regularity and can be analyzed with fixed-value probabilities, solving the estimation problem for source-data preprocessing. Second, the source data is relatively complex, but since it is stored as an index table, its size can be gauged from the probabilities in the index-table structure. Although the final estimate depends heavily on the class-index structure that is built, this approach greatly reduces estimation time: no fine-grained scheme is needed to measure the source-data redundancy precisely, which would impose excessive computational pressure. The precision of the estimate has little influence on the result; overall, trading away some precision yields a large improvement in processing time.

3. A time-estimation method predicts the time cost of each processor over the whole procedure. Measuring the time of the entire communication process provides a basis for sharing the workload reasonably among the processors: different tasks are assigned to different processors, the strengths of each processor are fully exploited, and overall system efficiency improves.

4. Decision-making for processor selection. The FPGA coprocessor or the ARM processor is chosen for data processing based on a comprehensive decision that combines each processor's estimated processing-time cost with its predicted real-time state.

Brief Description of the Drawings

Figure 1 shows the application scenario of the preferred embodiment of the present invention, in which the big-data storage unit and the computing unit are separated;

Figure 2 is a schematic diagram of the FPGA device deployed in the data storage unit in the preferred embodiment of the present invention;

Figure 3 is a diagram of the index structure;

Figure 4 is a diagram of the class-index-table probabilistic storage structure built from the index-table example above;

Figure 5 is a diagram of the data preprocessing process carried out by the ARM processor system;

Figure 6 is a diagram of the source-data processing process with the FPGA as coprocessor;

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the drawings of the embodiments. The described embodiments are only some of the embodiments of the invention.

The technical scheme by which the present invention solves the above technical problems is:

The present invention is a time-predictable co-processing method based on an ARM-FPGA coprocessor heterogeneous platform. It is built on the application scenario in which the big-data storage unit and the computing unit are separated; the scenario structure is shown in Figure 1. The coprocessor FPGA is deployed in the data storage unit through the AXI4.0 interface; information such as the redundancy in the source data is estimated, the total delay of the whole procedure is estimated, and a comprehensive processor choice is made. The specific steps are as follows:

1. Through the AXI4.0 interface protocol, the FPGA device is deployed in the data storage unit; the structure of the storage unit so deployed is shown in Figure 2. When a data request is generated, the ARM processor extracts data by accessing memory. When the source data cannot all fit in processor memory, the source data stored on the storage medium (hard disk) can be accessed and all of it sent to the computing-unit processor. If, however, the FPGA is used for data preprocessing, the AXI interface provides high-speed communication and data interaction between the ARM processor and the FPGA.

2. The source data in the storage unit is stored by building an index table, whose structure is shown in Figure 3. Data forms fall roughly into two broad categories: data sets with strong regularity and data sets with complex regularity. A concrete example illustrates the classification. The gender column follows a binomial distribution, and the node records its occurrence probability. The age column is described by a uniform distribution, and the node records its maximum and minimum values. The score column's distribution is comparatively complex, so its data keys are obtained by hashing for processing. Different key-value columns are marked with different numbers: the gender column with 1, the age column with 2, and the score column with 3.

3. The class-index-table probabilistic storage structure built from the index-table example above is shown in Figure 4.

For source data with strong regularity, the probability distribution is relatively simple, and a fixed-value calculation estimates the probability.

For source data with complex regularity, the key of each data item is described with a class structure, and a class index table is built to estimate data sizes. An index-node table is built from the source-data key values; during its construction, nodes are created from frequently indexed key values, to guarantee accuracy at estimation time. The bottom-level leaf nodes store the probability of data A, with Pi,j denoting the probability that Ai < A < Aj between node values. Taking the first leaf node of the index structure as an example, P0,4 denotes the probability that the node data is smaller than A4. If a symmetric probability distribution holds on the interval A < A4, a definite probability value can be estimated from the P0,4 probability value. Under this node-index-table storage structure, the number of nodes is therefore positively correlated with the final estimation precision: the more nodes, the more accurate the final estimate; conversely, the fewer nodes, the larger the error.

4. For the overall communication delay during source-data processing, the communication process at each stage of data transmission must be identified and the total delay computed, informing the final decision made during source-data processing. The overall delays of ARM-processor preprocessing and FPGA preprocessing must therefore be estimated separately.

The data preprocessing process carried out by the ARM processor system is shown in Figure 5. When the ARM processor system handles the source data, the delay of the whole process mainly comprises the source-data extraction and transfer time, the network transmission delay, and the source-data computation time, denoted TAo-t, TAf-n, and TAo-c respectively. The procedure first takes the source data out of the storage-node unit and then forwards it to the computing-node unit. The overall time cost is denoted TArm, given by TArm = TAo-t + TAf-n + TAo-c. Inside the storage unit the source data travels over a high-speed PCIe interface, whose communication delay is the product of the stable transmission rate (V1) and the source-data volume (Ao), expressed as TAo-t = V1*Ao.

由FPGA作为协处理器的源数据处理过程如图6所示。FPGA作为协处理器进行源数据预处理,整个过程时延包含提取源数据通信时延、FPGA中数据预处理时延、处理后数据网络传输时延和计算单元处理时延,分别用TFo-t、TFo-c、TFf-n和TFf-c表示。源数据FPGA以AXI接口与ARM处理器进行数据通信,其往返通信时延由稳定的传输速率(V、V1)与源数据量(Ao)决定,表示为TFo-t=(V+V1)*Ao。数据在网络中传输的时延由当前的网络速度(Nv)和传输的数据量(Ao+f)决定,表示为TFf-n=Nv*Ao+f。当存在源数据处理请求时,ARM处理器解析任务消息,使得FPGA处理器通过AXI协议接口从ARM处理器中进行源数据交互,再进行数据预处理;当源数据预处理完成后,直接通过FPGA上融合的网卡将处理后的数据通过网络发送给计算节点单元。整体时间开销用TFpga表示,公式为:TFpga=TFo-t+TFo-c+TFf-n+TFf-c。其中,计算单元数据处理时间由计算单元处理速度和数据量大小决定,与数据存储在索引表中的行数呈正相关,具体时延通过实验来估算。The source data processing with the FPGA as coprocessor is shown in Figure 6. With the FPGA acting as coprocessor for source data preprocessing, the whole-process delay comprises the source data extraction communication delay, the data preprocessing delay inside the FPGA, the network transmission delay of the processed data, and the computing unit processing delay, denoted TFo-t, TFo-c, TFf-n and TFf-c respectively. The FPGA exchanges data with the ARM processor over the AXI interface; the round-trip communication delay is determined by the stable transmission rates (V, V1) and the source data amount (Ao), expressed as TFo-t=(V+V1)*Ao. The delay of data transmission over the network is determined by the current network speed (Nv) and the amount of data transmitted (Ao+f), expressed as TFf-n=Nv*Ao+f. When a source data processing request exists, the ARM processor parses the task message so that the FPGA processor fetches the source data from the ARM processor through the AXI protocol interface and then performs the data preprocessing; once the preprocessing is completed, the processed data is sent over the network to the computing node unit directly through the network card integrated on the FPGA. The overall time overhead is denoted TFpga, with the formula TFpga=TFo-t+TFo-c+TFf-n+TFf-c. The data processing time at the computing unit is determined by the computing unit's processing speed and the data volume, and is positively correlated with the number of rows stored in the index table; the specific delay is estimated through experiments.
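The FPGA-path delay estimate can be sketched in the same way. Again a hypothetical illustration with made-up values: the AXI round-trip term follows the form TFo-t=(V+V1)*Ao and the network term TFf-n=Nv*Ao+f given in the text, while TFo-c and TFf-c are assumed to be estimated experimentally.

```python
# Sketch of the FPGA-path overall delay T_Fpga (illustrative values only).
def estimate_t_fpga(v, v1, a_o, n_v, a_of, t_fo_c, t_ff_c):
    """T_Fpga = T_Fo-t + T_Fo-c + T_Ff-n + T_Ff-c."""
    t_fo_t = (v + v1) * a_o    # AXI round-trip communication delay
    t_ff_n = n_v * a_of        # network transmission delay of processed data
    return t_fo_t + t_fo_c + t_ff_n + t_ff_c

t_fpga = estimate_t_fpga(v=0.001, v1=0.002, a_o=1000.0,
                         n_v=0.004, a_of=500.0, t_fo_c=2.0, t_ff_c=1.0)
# (0.003*1000) + 2 + (0.004*500) + 1 = 8.0
```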

5、当整体时延预估完成后,根据预估时延情况,对源数据预处理时处理器的选择进行决策。要准确判定哪个处理器更合适,仅使用整体通信预估时延来抉择并不妥当,还需考虑当前ARM处理器和FPGA处理器的状态。假设两个处理器当前各有一个任务正在处理,所需处理时间分别为TA-now和TF-now;如果此时另外一个任务到来,该任务使用ARM处理器和FPGA处理器的时间分别为TArm和TFpga,那么可以判断出ARM处理器和FPGA处理器处理任务所需总时间为TA-all和TF-all,其中TA-all=TA-now+TArm,TF-all=TF-now+TFpga。根据任务完成所需总时间即可做出合理决策:当TA-all<TF-all时,选择ARM处理器进行任务数据处理;反之,选择FPGA处理器进行任务数据处理。5. After the overall delay estimation is completed, the choice of processor for source data preprocessing is decided according to the estimated delays. To judge accurately which processor is more suitable, deciding on the estimated overall communication delay alone is not appropriate, because the current states of the ARM processor and the FPGA processor must also be considered. Suppose each of the two processors currently has one task in progress, with remaining processing times TA-now and TF-now respectively; if another task then arrives, and its processing times on the ARM processor and the FPGA processor are TArm and TFpga respectively, then the total times required by the ARM processor and the FPGA processor are TA-all and TF-all, where TA-all=TA-now+TArm and TF-all=TF-now+TFpga. A reasonable decision can then be made from the total time required for task completion: when TA-all<TF-all, the ARM processor is selected for the task data processing; otherwise, the FPGA processor is selected.
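The step-5 decision rule above reduces to a simple comparison. A minimal sketch, with illustrative queue times modelling the work already in progress on each processor (all names and values are assumptions):

```python
# Sketch of the processor-selection decision of step 5.
def choose_processor(t_a_now, t_f_now, t_arm, t_fpga):
    """Select ARM when TA-all = TA-now + TArm < TF-all = TF-now + TFpga."""
    t_a_all = t_a_now + t_arm      # total time on the ARM processor
    t_f_all = t_f_now + t_fpga     # total time on the FPGA coprocessor
    return "ARM" if t_a_all < t_f_all else "FPGA"

choice = choose_processor(t_a_now=2.0, t_f_now=1.0, t_arm=3.0, t_fpga=5.0)
# TA-all = 5.0 < TF-all = 6.0, so "ARM" is chosen
```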

以上实施方式以实例结合附图,详细解释了基于ARM-FPGA协处理器异构平台的时间可预估协同处理方法的具体过程,实现对大数据系统中存储单元与计算单元分离场景下数据处理的加速。The above embodiments explain in detail, with examples and in conjunction with the accompanying drawings, the specific process of the time-predictable co-processing method based on the ARM-FPGA coprocessor heterogeneous platform, thereby accelerating data processing in big data systems where the storage unit and the computing unit are separated.

还需要说明的是,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、商品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、商品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、商品或者设备中还存在另外的相同要素。It should also be noted that the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus comprising that element.

以上这些实施例应理解为仅用于说明本发明,而不用于限制本发明的保护范围。在阅读了本发明的记载内容之后,技术人员可以对本发明作各种改动或修改,这些等效变化和修饰同样落入本发明权利要求所限定的范围。The above embodiments should be understood as merely illustrating the present invention and not limiting its scope of protection. After reading the contents of the present invention, those skilled in the art may make various changes or modifications to it, and such equivalent changes and modifications likewise fall within the scope defined by the claims of the present invention.

Claims (3)

1.基于ARM-FPGA协处理器异构平台的时间估算协同处理方法,其特征在于,在大数据存储单元和计算单元分离的应用场景下,通过AXI4.0接口将协处理器FPGA部署在数据存储单元,对源数据中包括冗余量在内的信息进行预估,估算整个实施过程中通信时延的总和,并对处理器进行综合抉择,具体步骤如下:S1、通过AXI4.0接口,实现ARM处理器和FPGA互联的高速数据交互;S2、存储单元结点源数据以索引表的形式进行存储,数据体现划分为两大类:具有强规律性的数据集和复杂规律数据集;S3、针对源数据不同体现形式,建立类索引表概率存储结构来表现不同数据分布形式的概率问题,从而判断是否需要对源数据进行预处理;所述步骤S3中,针对规律性较强的源数据,采用定值计算估算概率值;针对复杂规律的源数据,每项数据的键名以类的结构来描述,建立类索引表来估算数据大小;S4、源数据在处理的各个过程中都存在时延,需要明确数据每个传输阶段的通信过程,计算整个时延的总和,需要分别估算ARM处理器数据处理和FPGA数据处理过程的整体时延;S5、当整体时延预估完成后,根据预估时延情况,对源数据处理时处理器的选择进行决策,进行决策时还要考虑当前ARM处理器和FPGA处理器的状态;所述步骤S1中,当数据请求产生时,ARM处理器通过对内存的访问提取数据;当源数据无法全部存储在处理器内存中时,可以访问存储在存储介质中的源数据,并将全部源数据发送至计算单元处理器中;如果使用FPGA进行数据预处理,则通过AXI4.0接口,使得ARM处理器和FPGA之间进行高速通信和数据交互;所述步骤S2中,性别列分布满足二项分布,结点记录其发生的概率值;年龄列用均匀分布来描述,结点记录它的最大值和最小值;成绩列以哈希方式获得数据键值,从而进行处理;其中,不同的键值列用不同的数字标记:性别列用1标记,年龄列用2标记,成绩列用3标记;所述步骤S4中,以ARM处理器系统进行源数据处理,整个过程时延主要包含源数据提取传输时间、网络传输时延和源数据计算时间,分别用TAo-t、TAf-n和TAo-c表示;处理过程先从存储节点单元取出源数据,再转发给计算节点单元,其整体时间开销用TArm表示,公式为:TArm=TAo-t+TAo-c+TAf-n;源数据内部存储单元以高速PCIe接口进行数据通信,其通信时延为稳定的传输速率V1与源数据量Ao的乘积,表示为TAo-t=V1*Ao;所述步骤S4中,以FPGA作为协处理器进行源数据预处理,整个过程时延包含提取源数据通信时延、FPGA中数据预处理时延、处理后数据网络传输时延和计算单元处理时延,分别用TFo-t、TFo-c、TFf-n和TFf-c表示;源数据FPGA以AXI接口与ARM处理器进行数据通信,其通信时延由稳定的传输速率V、V1与源数据量Ao决定,表示为TFo-t=(V+V1)*Ao;数据在网络中传输的时延由当前的网络速度Nv和传输的数据量Ao+f决定,表示为TFf-n=Nv*Ao+f;当存在源数据处理请求时,ARM处理器解析任务消息,使得FPGA处理器通过AXI协议接口从ARM处理器中进行源数据交互,再进行数据预处理;当源数据预处理完成后,直接通过FPGA上融合的网卡将处理后的数据通过网络发送给计算节点单元;整体时间开销用TFpga表示,公式为:TFpga=TFo-t+TFo-c+TFf-n+TFf-c。1. A time-estimation co-processing method based on an ARM-FPGA coprocessor heterogeneous platform, characterized in that, in an application scenario where the big-data storage unit and the computing unit are separated, the coprocessor FPGA is deployed at the data storage unit through an AXI4.0 interface, information of the source data including its redundancy is estimated, the sum of the communication delays over the whole procedure is estimated, and a comprehensive processor decision is made; the specific steps are as follows: S1, realizing high-speed data interaction between the interconnected ARM processor and FPGA through the AXI4.0 interface; S2, storing the source data of the storage unit nodes in the form of an index table, the data being divided into two categories: data sets with strong regularity and data sets with complex patterns; S3, for the different forms of the source data, establishing a class index table probability storage structure to represent the probability of the different data distributions, so as to judge whether the source data needs preprocessing; in step S3, for source data with strong regularity, estimating the probability value by fixed-value calculation; for source data with complex patterns, describing the key of each data item with a class structure and building a class index table to estimate the data size; S4, since delays exist in every stage of source data processing, identifying the communication process of each transmission stage, computing the total delay, and estimating separately the overall delays of the ARM processor data processing and the FPGA data processing; S5, after the overall delay estimation is completed, deciding the choice of processor for source data processing according to the estimated delays, the current states of the ARM processor and the FPGA processor also being considered; in step S1, when a data request is generated, the ARM processor extracts data by accessing the memory; when the source data cannot be entirely held in the processor memory, the source data stored in the storage medium is accessed and all the source data is sent to the computing unit processor; if the FPGA is used for data preprocessing, high-speed communication and data interaction between the ARM processor and the FPGA take place through the AXI4.0 interface; in step S2, the gender column follows a binomial distribution and the node records its probability of occurrence; the age column is described by a uniform distribution and the node records its maximum and minimum values; the score column obtains its data key values by hashing for processing; different key-value columns are marked with different numbers: the gender column with 1, the age column with 2, and the score column with 3; in step S4, with the ARM processor system processing the source data, the whole-process delay mainly comprises the source data extraction and transmission time, the network transmission delay and the source data computation time, denoted TAo-t, TAf-n and TAo-c respectively; the source data is first fetched from the storage node unit and then forwarded to the computing node unit, the overall time overhead is denoted TArm, and the formula is TArm=TAo-t+TAo-c+TAf-n; the internal storage unit of the source data communicates over a high-speed PCIe interface, with a communication delay equal to the product of the stable transmission rate V1 and the source data amount Ao, expressed as TAo-t=V1*Ao; in step S4, with the FPGA acting as coprocessor for source data preprocessing, the whole-process delay comprises the source data extraction communication delay, the data preprocessing delay inside the FPGA, the network transmission delay of the processed data and the computing unit processing delay, denoted TFo-t, TFo-c, TFf-n and TFf-c respectively; the FPGA communicates with the ARM processor over the AXI interface, with a delay determined by the stable transmission rates V, V1 and the source data amount Ao, expressed as TFo-t=(V+V1)*Ao; the network transmission delay is determined by the current network speed Nv and the amount of data transmitted Ao+f, expressed as TFf-n=Nv*Ao+f; when a source data processing request exists, the ARM processor parses the task message so that the FPGA processor exchanges the source data with the ARM processor through the AXI protocol interface and then performs the data preprocessing; once the source data preprocessing is completed, the processed data is sent over the network to the computing node unit directly through the network card integrated on the FPGA; the overall time overhead is denoted TFpga, and the formula is TFpga=TFo-t+TFo-c+TFf-n+TFf-c.

2.根据权利要求1所述的基于ARM-FPGA协处理器异构平台的时间估算协同处理方法,其特征在于,所述步骤S3中,以源数据键值大小建立索引结点表;在结点索引表建立过程中,结点以频繁被索引的数据键值建立,以保证估算时的精确度;底层叶子结点存放数据A的概率值,用Pi,j表示结点数据间Ai<A<Aj的概率值;对于索引结构中第一个叶子结点,P0,4表示结点数据小于A4的概率值;如果在A<A4区间满足对称概率分布,可依据P0,4概率值估算一个确定的概率值。2. The time-estimation co-processing method based on the ARM-FPGA coprocessor heterogeneous platform according to claim 1, characterized in that, in step S3, an index node table is built according to the key values of the source data; during its construction, nodes are created from frequently indexed data key values to guarantee estimation accuracy; the bottom leaf nodes store probability values of data A, where Pi,j denotes the probability that Ai<A<Aj among the node data; for the first leaf node of the index structure, P0,4 denotes the probability that the node data is smaller than A4; if a symmetric probability distribution holds over the interval A<A4, a definite probability value can be estimated from the probability value P0,4.

3.根据权利要求1所述的基于ARM-FPGA协处理器异构平台的时间估算协同处理方法,其特征在于,所述步骤S5中,当两个处理器目前都有一个任务正在处理时,所需处理时间分别为TA-now和TF-now;如果此时另外一个任务到来,该任务使用ARM处理器和FPGA处理器的时间分别为TArm和TFpga,那么可以判断出ARM处理器和FPGA处理器处理任务所需总时间为TA-all和TF-all,其中TA-all=TA-now+TArm,TF-all=TF-now+TFpga;根据任务完成所需总时间,当TA-all<TF-all时,选择ARM处理器进行任务数据处理;反之,选择FPGA处理器进行任务数据处理。3. The time-estimation co-processing method based on the ARM-FPGA coprocessor heterogeneous platform according to claim 1, characterized in that, in step S5, when both processors currently each have a task being processed, the required processing times are TA-now and TF-now respectively; if another task arrives at this moment, and the times for this task on the ARM processor and the FPGA processor are TArm and TFpga respectively, then the total times required by the ARM processor and the FPGA processor to process the tasks are TA-all and TF-all, where TA-all=TA-now+TArm and TF-all=TF-now+TFpga; according to the total time required for task completion, when TA-all<TF-all, the ARM processor is selected for the task data processing; otherwise, the FPGA processor is selected.
CN202010807124.6A 2020-08-12 2020-08-12 Co-processing Method for Time Estimation Based on ARM-FPGA Coprocessor Heterogeneous Platform Active CN111966571B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010807124.6A CN111966571B (en) 2020-08-12 2020-08-12 Co-processing Method for Time Estimation Based on ARM-FPGA Coprocessor Heterogeneous Platform


Publications (2)

Publication Number Publication Date
CN111966571A CN111966571A (en) 2020-11-20
CN111966571B true CN111966571B (en) 2023-05-12

Family

ID=73365093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010807124.6A Active CN111966571B (en) 2020-08-12 2020-08-12 Co-processing Method for Time Estimation Based on ARM-FPGA Coprocessor Heterogeneous Platform

Country Status (1)

Country Link
CN (1) CN111966571B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2363484A (en) * 2000-06-16 2001-12-19 Sony Uk Ltd Adaptive pre/post processing of data for compression/decompression
CN103699432A (en) * 2013-12-17 2014-04-02 华中科技大学 Multi-task runtime collaborative scheduling system under heterogeneous environment
EP2808792A1 (en) * 2013-05-28 2014-12-03 ZFaas Pty Ltd Method and system for using arbitrary computing devices for distributed data processing
CN104820657A (en) * 2015-05-14 2015-08-05 西安电子科技大学 Inter-core communication method and parallel programming model based on embedded heterogeneous multi-core processor
CN108776649A (en) * 2018-06-11 2018-11-09 山东超越数控电子股份有限公司 One kind being based on CPU+FPGA heterogeneous computing systems and its accelerated method
CN109165202A (en) * 2018-07-04 2019-01-08 华南理工大学 A kind of preprocess method of multi-source heterogeneous big data
CN110147357A (en) * 2019-05-07 2019-08-20 浙江科技学院 The multi-source data polymerization methods of sampling and system under a kind of environment based on big data
CN111181773A (en) * 2019-12-13 2020-05-19 西安交通大学 Delay prediction method for multi-component applications for heterogeneous edge-cloud collaborative intelligent systems

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9329843B2 (en) * 2011-08-02 2016-05-03 International Business Machines Corporation Communication stack for software-hardware co-execution on heterogeneous computing systems with processors and reconfigurable logic (FPGAs)


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于运行时基本代码块的可重构软硬件协同设计;袁仲达等;《清华大学学报(自然科学版)》;20130915(第09期);全文 *
数据预处理技术在异构数据中的应用;罗长银等;《软件》;20200515(第05期);全文 *

Also Published As

Publication number Publication date
CN111966571A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
EP3612942B1 (en) Queue management for direct memory access
Kwon et al. Tensor casting: Co-designing algorithm-architecture for personalized recommendation training
Chen et al. Dygnn: Algorithm and architecture support of dynamic pruning for graph neural networks
CN111669291A (en) Deployment method of virtualized network service function chain based on deep reinforcement learning
CN106815254A (en) A kind of data processing method and device
CN107247623A (en) A kind of distributed cluster system and data connecting method based on multi-core CPU
WO2017185576A1 (en) Multi-streaming data processing method, system, storage medium, and device
CN113836318A (en) Dynamic knowledge graph completion method, device and electronic device
CN111488051A (en) Cloud deep neural network optimization method based on CPU and FPGA cooperative computing
Chen et al. GCIM: Towards Efficient Processing of Graph Convolutional Networks in 3D-Stacked Memory
WO2025217942A1 (en) Simulation optimization method and apparatus for network architecture of intelligent computing center, and device and medium
CN101980166B (en) Time sequence control method for parallel simulation of cluster system
CN118821730A (en) Method and device for processing very long text sequences for large-scale language models
CN117251275B (en) Multi-application asynchronous I/O request scheduling method, system, equipment and medium
CN112685162B (en) High-efficiency scheduling method, system and medium for heterogeneous computing resources of edge servers
CN106202224A (en) Search processing method and device
CN111966571B (en) Co-processing Method for Time Estimation Based on ARM-FPGA Coprocessor Heterogeneous Platform
CN116595040A (en) An optimization method and device for data classification query in an overload scenario
CN110349635A (en) A kind of parallel compression method of gene sequencing quality of data score
CN114840402A (en) A kind of cloud host failure prediction method, device and medium
Wu et al. Efficient llm inference solution on intel gpu
US20240283844A1 (en) Systems and Methods for Optimizing Distributed Computing Systems Including Server Architectures and Client Drivers
CN106407137A (en) Hardware accelerator and method of collaborative filtering recommendation algorithm based on neighborhood model
CN110618963A (en) Heterogeneous computing hardware acceleration system and method integrating computing and storage
CN117610616A (en) Large data set training and related products for accelerating heterogeneous graph neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant