CN113625264B - Method and system for parallel processing of railway detection big data - Google Patents
Method and system for parallel processing of railway detection big data Download PDFInfo
- Publication number
- CN113625264B CN113625264B CN202110665774.6A CN202110665774A CN113625264B CN 113625264 B CN113625264 B CN 113625264B CN 202110665774 A CN202110665774 A CN 202110665774A CN 113625264 B CN113625264 B CN 113625264B
- Authority
- CN
- China
- Prior art keywords
- data
- detection data
- node
- parallel
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S13/00—Systems using the reflection or reradiation of radio waves, e.g. radar systems; Analogous systems using reflection or reradiation of waves whose nature or wavelength is irrelevant or unspecified
- G01S13/88—Radar or analogous systems specially adapted for specific applications
- G01S13/885—Radar or analogous systems specially adapted for specific applications for ground probing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Electromagnetism (AREA)
- Computational Mathematics (AREA)
- Pure & Applied Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Algebra (AREA)
- Operations Research (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Networks & Wireless Communication (AREA)
- Train Traffic Observation, Control, And Security (AREA)
- Multi Processors (AREA)
Abstract
Description
技术领域Technical Field
本发明涉及集群资源处理及优化技术领域,尤其涉及一种并行处理铁路检测大数据的方法及系统。The present invention relates to the technical field of cluster resource processing and optimization, and in particular to a method and system for parallel processing of railway inspection big data.
背景技术Background Art
铁路运输为百姓的生活和工作带来诸多便利和支持,但是保障铁路的安全运营也是现在及未来须始终重视的核心方向,随着我国铁路运营里程的迅速增长,如何对日益庞大铁路网的基础设施进行快速、准确、无损地检测,及时掌握基础设施的健康状态,成为当前亟待解决的重大课题。Railway transportation brings many conveniences and support to people's lives and work, but ensuring the safe operation of railways is also a core direction that must always be paid attention to now and in the future. With the rapid growth of my country's railway operating mileage, how to conduct fast, accurate and non-destructive detection of the infrastructure of the increasingly large railway network and timely grasp the health status of the infrastructure has become a major issue that needs to be solved urgently.
探地雷达(Ground Penetrating Radar.GPR)是利用天线发射和接收高频电磁波来探测介质内部物质特性和分布规律的一种地球物理方法,通过研究雷达波极化方式的变化可以获得与地下介质物性相关的信息,应用时可以采用多种方式利用探地雷达进行检测,相应地获得的检测数据种类复杂度和规模都较大,需要采用设定的策略进行处理和分析。Ground Penetrating Radar (GPR) is a geophysical method that uses antennas to transmit and receive high-frequency electromagnetic waves to detect the material properties and distribution patterns inside a medium. By studying the changes in the polarization of radar waves, information related to the physical properties of underground media can be obtained. When applied, ground penetrating radar can be used for detection in a variety of ways. The types, complexity and scale of the corresponding detection data obtained are large, and set strategies need to be used for processing and analysis.
现有技术中针对探地雷达检测数据的处理手段进行了研究,但这些技术绝大部分基于单计算节点,涉及的算法往往高度串行化,仅适用于小尺度GPR数据集的处理。随着当前探地雷达数据规模的快速扩张,部分技术人员也进行了现有数据处理技术的优化研究,基于上述技术,其借助大数据云计算技术的海量高速并行化的处理能力,提升对大规模数据进行快速处理的能力,其基于Hadoop的分布式集群管理数据,具体采用分布式文件系统HDFS、MySQL的关系型数据库集群结合Hbase来解决结构化数据的海量存储和高效访问,这样虽然保障了集群资源的可扩展性和可移植性,能够存储多种格式的数据,但是其将雷达数据进行预处理或后续处理的过程中存在重复读写的问题,且容易因不同并行任务的计算量差异过大引起负载均衡问题,当工作流中的算法步骤复杂时,可能导致各并行计算节点无法正常运作,无法满足大规模探地雷达检测数据的处理要求。因此,亟需提供一种能够满足应用需求的高效合理的数据并行处理方法。In the prior art, research has been conducted on the processing methods of ground penetrating radar detection data, but most of these technologies are based on a single computing node, and the algorithms involved are often highly serialized and only applicable to the processing of small-scale GPR data sets. With the rapid expansion of the current ground penetrating radar data scale, some technicians have also conducted optimization research on existing data processing technologies. Based on the above technology, they use the massive high-speed parallel processing capabilities of big data cloud computing technology to improve the ability to quickly process large-scale data. They manage data based on Hadoop's distributed clusters, specifically using distributed file systems HDFS and MySQL's relational database clusters combined with Hbase to solve the massive storage and efficient access of structured data. Although this ensures the scalability and portability of cluster resources and can store data in multiple formats, there are problems of repeated reading and writing in the process of preprocessing or subsequent processing of radar data, and it is easy to cause load balancing problems due to the large difference in the amount of calculation of different parallel tasks. When the algorithm steps in the workflow are complex, it may cause each parallel computing node to fail to operate normally and fail to meet the processing requirements of large-scale ground penetrating radar detection data. Therefore, it is urgent to provide an efficient and reasonable data parallel processing method that can meet application requirements.
公开于本发明背景技术部分的信息仅仅旨在加深对本发明的一般背景技术的理解,而不应当被视为承认或以任何形式暗示该信息构成己为本领域技术人员所公知的现有技术。The information disclosed in the background technology section of the present invention is only intended to deepen the understanding of the general background technology of the present invention, and should not be regarded as acknowledging or suggesting in any form that the information constitutes the prior art known to those skilled in the art.
新一代高性能硬件架构体系的快速发展,给海量GPR数据快速处理的开展创造了新的机遇。大量高精度、大区域的GPR检测数据可以利用并行技术进行处理,以极大地提高处理效率。The rapid development of a new generation of high-performance hardware architecture systems has created new opportunities for the rapid processing of massive GPR data. Large amounts of high-precision, large-area GPR detection data can be processed using parallel technology to greatly improve processing efficiency.
发明内容Summary of the invention
为解决上述问题,本发明提供了一种并行处理铁路检测大数据的方法,在一个实施例中,所述方法包括:To solve the above problems, the present invention provides a method for parallel processing of railway detection big data. In one embodiment, the method includes:
数据存储步骤、将采集的检测数据进行划分,并按照设定的存储策略将划分后的检测数据进行存储;Data storage step, dividing the collected detection data, and storing the divided detection data according to the set storage strategy;
算法组合制定步骤、根据检测数据的工程性质需求和解释目标需求,创建对应的运算工作流,其中,所述运算工作流为检测数据需进行的多个信号处理算法的组合,包括常规类工作流和迭代类工作流;The algorithm combination formulation step creates a corresponding operation workflow according to the engineering property requirements and interpretation target requirements of the detection data, wherein the operation workflow is a combination of multiple signal processing algorithms required for the detection data, including conventional workflows and iterative workflows;
并行处理步骤、基于预先创建的并行节点架构并行调取存储的检测数据,并分别按照与所述运算工作流匹配的调度策略实现基于常规类运算和迭代类运算的并行处理。The parallel processing step is to retrieve the stored detection data in parallel based on the pre-created parallel node architecture, and implement parallel processing based on conventional operations and iterative operations respectively according to the scheduling strategy matching the operation workflow.
优选地,在所述数据存储步骤中,按照采集区域和采集设备类型对铁路检测数据进行划分,得到处理优先级不同的各级的检测数据。Preferably, in the data storage step, the railway detection data is divided according to the collection area and the type of collection equipment to obtain detection data of different levels with different processing priorities.
进一步地,在一个实施例中,在存储划分后的检测数据时,采用关系型数据库和分布式数据库多种存储方式,结合优先级信息实现对多种格式检测数据的并行存储;Furthermore, in one embodiment, when storing the divided detection data, multiple storage methods such as relational database and distributed database are used, and priority information is combined to realize parallel storage of detection data in multiple formats;
其中,所述检测数据文件的数据体包括道头数据和数据内容,各检测数据的道头数据与所述文件头关联存储在关系型数据库中,各检测数据的数据内容存储在分布式数据库中。The data body of the detection data file includes header data and data content. The header data of each detection data is associated with the file header and stored in a relational database, and the data content of each detection data is stored in a distributed database.
进一步地,一个优选的实施例中,所述方法还包括:节点架构创建步骤、设置一个主节点作为任务调度节点,与一个读写节点和N个计算节点共同构成并行节点架构。Furthermore, in a preferred embodiment, the method further includes: a node architecture creation step, setting a master node as a task scheduling node, which together with a read-write node and N computing nodes constitutes a parallel node architecture.
进一步地,一个实施例中,在所述并行处理步骤中,按照以下步骤实现基于常规类运算的并行处理:Furthermore, in one embodiment, in the parallel processing step, the parallel processing based on conventional class operations is implemented according to the following steps:
任务调度节点判断是否存在需要处理的检测数据,若存在,统计需处理检测数据的总道数x,启动读写节点和计算节点;The task scheduling node determines whether there is any test data that needs to be processed. If so, it counts the total number of test data x that needs to be processed and starts the read-write node and the computing node.
依次循环执行如下步骤,直至所有待处理的检测数据运算完成,释放读写节点和计算节点:The following steps are executed in a loop until all pending detection data operations are completed and the read/write nodes and computing nodes are released:
基于预先设置的调度基数K,由读写节点读取x/K道作为本轮处理数据;任务调度节点将本轮处理数据平均划分为N部分,分配至各个计算节点;Based on the preset scheduling base K, the read/write node reads x/K channels as the data for this round of processing; the task scheduling node divides the data for this round of processing into N parts evenly and distributes them to each computing node;
各个计算节点根据常规类工作流并行执行运算;Each computing node performs operations in parallel according to the conventional class workflow;
并行调度节点确认是否收到计算完成的已处理检测数据,若收到,将其传至读写节点,由读写节点按照对应的空间结构进行整合并写入存储区;The parallel scheduling node confirms whether the processed detection data has been received. If received, it transmits it to the read-write node, which integrates it according to the corresponding spatial structure and writes it into the storage area;
其中,K为正整数,1≤K≤N,或进一步结合单位标记数据对应的道数S进行优化,确定调度基数K。Wherein, K is a positive integer, 1≤K≤N, or further optimized in combination with the number of channels S corresponding to the unit mark data to determine the scheduling base K.
进一步地,一个实施例中,按照以下步骤实现基于迭代类运算的并行处理:Furthermore, in one embodiment, parallel processing based on iterative operations is implemented according to the following steps:
任务调度节点判断是否存在需要处理的检测数据,若存在,统计需处理检测数据的总道数x,启动读写节点和计算节点;The task scheduling node determines whether there is any test data that needs to be processed. If so, it counts the total number of test data x that needs to be processed and starts the read-write node and the computing node.
根据硬件资源、x以及该迭代算法的计算单位大小设置调度基数K,由读写节点读取x/K道作为第一轮处理数据;任务调度节点将本轮处理数据平均划分为N部分,分配至各个计算节点;The scheduling base K is set according to the hardware resources, x, and the computing unit size of the iterative algorithm. The read/write node reads x/K channels as the first round of processing data. The task scheduling node divides the processing data of this round into N parts and distributes them to each computing node.
进而依次循环执行如下步骤,直至所有待处理的检测数据运算完成,释放读写节点和计算节点:Then, the following steps are executed in a loop until all pending detection data operations are completed and the read/write nodes and computing nodes are released:
任务调度节点实时统计处于空闲的计算节点数量n,The task scheduling node counts the number of idle computing nodes n in real time.
读写节点读取道,数据,将其平均划分为n部分,分配至各个空闲的计算节点;Read-write node reads The data is divided into n parts evenly and distributed to each idle computing node;
各个计算节点根据迭代类工作流并行执行运算,并实时向任务调度节点反馈运算状态信息;Each computing node executes operations in parallel according to the iterative workflow and feeds back the operation status information to the task scheduling node in real time;
并行调度节点确认是否收到计算完成的已处理检测数据,若收到,将其传至读写节点,由读写节点按照对应的空间结构进行整合并写入存储区。The parallel scheduling node confirms whether the processed detection data has been received. If received, it will be transmitted to the read-write node, which will integrate it according to the corresponding spatial structure and write it into the storage area.
基于上述任意一个或多个实施例中所述的并行处理铁路检测大数据的方法,本发明还提供一种存储介质,该存储介质上存储有可实现如上述任意一个或多个实施例所述方法的程序代码。Based on the method for parallel processing of railway detection big data described in any one or more of the above embodiments, the present invention also provides a storage medium storing program codes that can implement the method described in any one or more of the above embodiments.
基于上述任意一个或多个实施例中所述方法的其他方面,本发明还提供一种并行处理铁路检测大数据的系统,该系统包括:Based on other aspects of the method described in any one or more of the above embodiments, the present invention also provides a system for parallel processing of railway detection big data, the system comprising:
数据存储模块,其配置为将采集的检测数据进行划分,并按照设定的存储策略将划分后的检测数据进行存储;A data storage module, configured to divide the collected detection data and store the divided detection data according to a set storage strategy;
算法组合制定模块,其配置为根据检测数据的工程性质需求和解释目标需求,创建对应的运算工作流,其中,所述运算工作流为检测数据需进行的多个信号处理算法的组合,包括常规类工作流和迭代类工作流;An algorithm combination formulation module is configured to create a corresponding operation workflow according to the engineering property requirements and interpretation target requirements of the detection data, wherein the operation workflow is a combination of multiple signal processing algorithms required for the detection data, including conventional workflows and iterative workflows;
并行处理模块,其配置为基于预先创建的并行节点架构并行调取存储的检测数据,并分别按照与所述运算工作流匹配的调度策略实现基于常规类运算和迭代类运算的并行处理。The parallel processing module is configured to retrieve the stored detection data in parallel based on a pre-created parallel node architecture, and implement parallel processing based on conventional operations and iterative operations according to a scheduling strategy that matches the operation workflow.
进一步地,在一个实施例中,所述系统还包括:节点架构创建模块、其配置为设置一个主节点作为任务调度节点,与一个读写节点和N个计算节点共同构成并行节点架构。Furthermore, in one embodiment, the system also includes: a node architecture creation module, which is configured to set a master node as a task scheduling node, which together with a read-write node and N computing nodes constitutes a parallel node architecture.
与最接近的现有技术相比,本发明还具有如下有益效果:Compared with the closest prior art, the present invention also has the following beneficial effects:
本发明提供的一种并行处理铁路检测大数据的方法及系统,其将采集的检测数据进行划分,能够保障具有紧急处理需求的数据能第一时间得到处理结果,有利于避免数据处理不及时引起的事故或影响;进一步地,本发明提出的数据划分与存储方法,针对数据处理算法原理进行多粒度的切分运算,针对常规类算法和迭代类算法基于并行节点架构按照不同的调度策略实现并行处理,从负载均衡性能和集群资源利用率两个层面上提升了数据处理的速率,能有效满足当下及未来大规模数据的处理需求,为铁路检测数据的后续应用提供可靠助力。The present invention provides a method and system for parallel processing of railway detection big data, which divides the collected detection data to ensure that data with urgent processing needs can obtain processing results in the first time, which is conducive to avoiding accidents or impacts caused by untimely data processing; further, the data division and storage method proposed in the present invention performs multi-granularity segmentation operations based on the principle of data processing algorithm, and realizes parallel processing based on the parallel node architecture according to different scheduling strategies for conventional algorithms and iterative algorithms, which improves the data processing rate from two levels of load balancing performance and cluster resource utilization, can effectively meet the current and future large-scale data processing needs, and provide reliable assistance for the subsequent application of railway detection data.
本发明的其它特征和优点将在随后的说明书中阐述,并且,部分地从说明书中变得显而易见,或者通过实施本发明而了解。本发明的目的和其他优点可通过在说明书、权利要求书以及附图中所特别指出的结构来实现和获得。Other features and advantages of the present invention will be described in the following description, and partly become apparent from the description, or understood by practicing the present invention. The purpose and other advantages of the present invention can be realized and obtained by the structures particularly pointed out in the description, claims and drawings.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
附图用来提供对本发明的进一步理解,并且构成说明书的一部分,与本发明的实施例共同用于解释本发明,并不构成对本发明的限制。在附图中:The accompanying drawings are used to provide a further understanding of the present invention and constitute a part of the specification. Together with the embodiments of the present invention, they are used to explain the present invention and do not constitute a limitation of the present invention. In the accompanying drawings:
图1是本发明一实施例提供的并行处理铁路检测大数据的方法的流程示意图;FIG1 is a schematic flow chart of a method for parallel processing of railway inspection big data provided by an embodiment of the present invention;
图2是本发明另一实施例提供的并行处理铁路检测大数据的系统的结构示意图。FIG2 is a schematic diagram of the structure of a system for parallel processing of railway inspection big data provided by another embodiment of the present invention.
具体实施方式DETAILED DESCRIPTION
以下将结合附图及实施例来详细说明本发明的实施方式,借此本发明的实施人员可以充分理解本发明如何应用技术手段来解决技术问题,并达成技术效果的实现过程并依据上述实现过程具体实施本发明。需要说明的是,只要不构成冲突,本发明中的各个实施例以及各实施例的各个特征可以相互结合,所形成的技术方案均在本发明的保护范围之内。The following will describe the implementation methods of the present invention in detail in conjunction with the accompanying drawings and embodiments, so that the implementers of the present invention can fully understand how the present invention applies technical means to solve technical problems and achieve the implementation process of technical effects and implement the present invention specifically according to the above implementation process. It should be noted that as long as there is no conflict, the various embodiments and various features of the embodiments in the present invention can be combined with each other, and the formed technical solutions are all within the protection scope of the present invention.
虽然流程图将各项操作描述成顺序的处理,但是其中的许多操作可以被并行地、并发地或者同时实施。各项操作的顺序可以被重新安排。当其操作完成时处理可以被终止,但是还可以具有未包括在附图中的附加步骤。处理可以对应于方法、函数、规程、子例程、子程序等等。Although the flowcharts describe the operations as sequential processes, many of the operations may be performed in parallel, concurrently, or simultaneously. The order of the operations may be rearranged. A process may be terminated when its operations are complete, but may also have additional steps not included in the accompanying drawings. A process may correspond to a method, function, procedure, subroutine, subprogram, etc.
计算机设备包括用户设备与网络设备。其中,用户设备或客户端包括但不限于电脑、智能手机、PDA等;网络设备包括但不限于单个网络服务器、多个网络服务器组成的服务器组或基于云计算的由大量计算机或网络服务器构成的云。计算机设备可单独运行来实现本发明,也可接入网络并通过与网络中的其他计算机设备的交互操作来实现本发明。计算机设备所处的网络包括但不限于互联网、广域网、城域网、局域网、VPN网络等。Computer devices include user devices and network devices. Among them, user devices or clients include but are not limited to computers, smart phones, PDAs, etc.; network devices include but are not limited to a single network server, a server group consisting of multiple network servers, or a cloud based on cloud computing consisting of a large number of computers or network servers. Computer devices can be run alone to implement the present invention, or they can be connected to the network and implement the present invention through interactive operations with other computer devices in the network. The network where the computer device is located includes but is not limited to the Internet, wide area network, metropolitan area network, local area network, VPN network, etc.
2019年末,全国铁路营业里程达到13.9万公里以上;其中,高速铁路营业里程突破3.5万公里,稳居世界第一。根据发改基础[2016]1536号《中长期铁路网规划》,到2025年,铁路网规模达到17.5万公里左右,其中高速铁路3.8万公里左右,网络覆盖进一步扩大,路网结构更加优化,骨干作用更加显著,更好发挥铁路对经济社会发展的保障作用。普速铁路网规模达到13.1万公里左右,并规划实施既有线扩能改造2万公里左右。随着我国铁路运营里程的迅速增长,如何对未来庞大铁路网的基础设施进行快速、准确、无损地检测,及时掌握基础设施的健康状态,成为当前亟待解决的重大课题。At the end of 2019, the operating mileage of railways in China reached more than 139,000 kilometers; among them, the operating mileage of high-speed railways exceeded 35,000 kilometers, ranking first in the world. According to the Medium- and Long-Term Railway Network Plan (No. 1536 of the Development and Reform Foundation [2016]), by 2025, the scale of the railway network will reach about 175,000 kilometers, of which about 38,000 kilometers will be high-speed railways. The network coverage will be further expanded, the network structure will be more optimized, the backbone role will be more significant, and the railway will better play its role in safeguarding economic and social development. The scale of the conventional railway network will reach about 131,000 kilometers, and it is planned to implement the capacity expansion and reconstruction of existing lines of about 20,000 kilometers. With the rapid growth of my country's railway operating mileage, how to quickly, accurately and non-destructively detect the infrastructure of the future huge railway network and timely grasp the health status of the infrastructure has become a major issue that needs to be solved urgently.
现有技术中GPR(探地雷达)数据处理技术在处理中小尺度数据集上己经相对成熟。但这些技术绝大部分基于单计算节点,涉及的算法往往高度串行化,无法有效适用于大规模铁路检测数据的处理。随着当前探地雷达数据规模的快速扩张,部分技术人员也进行了现有数据处理技术的优化研究,基于上述技术,其借助大数据云计算技术的海量高速并行化的处理能力,提升对大规模数据进行快速处理的能力,其基于Hadoop的分布式集群管理数据,具体采用分布式文件系统HDFS、MySQL的关系型数据库集群结合Hbase来解决结构化数据的海量存储和高效访问,这样虽然保障了集群资源的可扩展性和可移植性,能够存储多种格式的数据,但是其并未提供明确的数据存储策略,且将雷达数据进行预处理或后续处理的过程中存在重复读写的问题,容易因不同并行任务的计算量差异过大引起负载均衡问题,当工作流中的算法步骤复杂时,可能导致各并行计算节点无法正常运作,无法满足大规模探地雷达检测数据的处理要求。因此,亟需提供一种能够满足应用需求的高效合理的数据并行处理方法。In the prior art, GPR (ground penetrating radar) data processing technology is relatively mature in processing small and medium-scale data sets. However, most of these technologies are based on single computing nodes, and the algorithms involved are often highly serialized, which cannot be effectively applied to the processing of large-scale railway detection data. With the rapid expansion of the current ground penetrating radar data scale, some technicians have also conducted optimization research on existing data processing technologies. Based on the above technology, it uses the massive high-speed parallel processing capability of big data cloud computing technology to improve the ability to quickly process large-scale data. It manages data based on Hadoop's distributed cluster, and specifically uses the distributed file system HDFS and MySQL's relational database cluster combined with Hbase to solve the massive storage and efficient access of structured data. Although this ensures the scalability and portability of cluster resources and can store data in multiple formats, it does not provide a clear data storage strategy, and there is a problem of repeated reading and writing in the process of pre-processing or subsequent processing of radar data. It is easy to cause load balancing problems due to the large difference in the amount of calculation of different parallel tasks. When the algorithm steps in the workflow are complicated, it may cause each parallel computing node to fail to operate normally, and it cannot meet the processing requirements of large-scale ground penetrating radar detection data. Therefore, it is urgent to provide an efficient and reasonable data parallel processing method that can meet application requirements.
新一代高性能硬件架构体系的快速发展,给海量GPR数据快速处理的开展创造了新的机遇,高精度、大区域的GPR检测数据可以利用并行技术进行处理,以极大地提高处理效率,但是并行实现处理的过程中,由于各路并行运算的运算量难以均衡,难以克服负载差异过大的影响,容易引起运算资源不均和中间内存资源消耗不足的现象,处理效率难以保障。The rapid development of a new generation of high-performance hardware architecture systems has created new opportunities for the rapid processing of massive GPR data. High-precision, large-area GPR detection data can be processed using parallel technology to greatly improve processing efficiency. However, in the process of parallel processing, it is difficult to balance the amount of computation of each parallel operation and overcome the impact of excessive load differences, which can easily lead to uneven computing resources and insufficient consumption of intermediate memory resources, making it difficult to guarantee processing efficiency.
为解决上述问题,本发明提供一种并行处理铁路检测大数据的方法及系统,该方法建立“数据并行+算法并行”的混合并行计算架构,将雷达信号数据处理算法分为两大类,分别制定适用的调度和计算策略,从而使整个工作流处理模式能够实现负载均衡的效果,最大化发挥集群资源的利用效率。To solve the above problems, the present invention provides a method and system for parallel processing of railway inspection big data. The method establishes a hybrid parallel computing architecture of "data parallelism + algorithm parallelism", divides radar signal data processing algorithms into two categories, and formulates applicable scheduling and computing strategies respectively, so that the entire workflow processing mode can achieve load balancing effect and maximize the utilization efficiency of cluster resources.
接下来基于附图详细描述本发明实施例的方法的详细流程,附图的流程图中示出的步骤可以在包含诸如一组计算机可执行指令的计算机系统中执行。虽然在流程图中示出了各步骤的逻辑顺序,但是在某些情况下,可以以不同于此处的顺序执行所示出或描述的步骤。Next, the detailed process of the method of the embodiment of the present invention is described in detail based on the accompanying drawings. The steps shown in the flowchart of the accompanying drawings can be executed in a computer system including a set of computer executable instructions. Although the logical order of each step is shown in the flowchart, in some cases, the steps shown or described can be executed in a different order than here.
实施例一Embodiment 1
图1示出了本发明实施例一提供的并行处理铁路检测大数据的方法的流程示意图,参照图1可知,该方法包括如下步骤。FIG1 shows a flow chart of a method for parallel processing of railway detection big data provided by Embodiment 1 of the present invention. Referring to FIG1 , it can be seen that the method includes the following steps.
数据存储步骤、将采集的检测数据进行划分,并按照设定的存储策略将划分后的检测数据进行存储;Data storage step, dividing the collected detection data, and storing the divided detection data according to the set storage strategy;
算法组合制定步骤、根据检测数据的工程性质需求和解释目标需求,创建对应的运算工作流,其中,所述运算工作流为检测数据需进行的多个信号处理算法的组合,包括常规类工作流和迭代类工作流;The algorithm combination formulation step creates a corresponding operation workflow according to the engineering property requirements and interpretation target requirements of the detection data, wherein the operation workflow is a combination of multiple signal processing algorithms required for the detection data, including conventional workflows and iterative workflows;
并行处理步骤、基于预先创建的并行节点架构并行调取存储的检测数据,并分别按照与所述运算工作流匹配的调度策略实现基于常规类运算和迭代类运算的并行处理。The parallel processing step is to retrieve the stored detection data in parallel based on the pre-created parallel node architecture, and implement parallel processing based on conventional operations and iterative operations respectively according to the scheduling strategy matching the operation workflow.
铁路数据检测领域的实际应用过程中,应用的探地雷达数据采集装置,包括探地雷达主机、发射与接收天线以及测点与测线空间坐标确定与记录系统;所述探地雷达主机分别连接并控制发射与接收天线、测点与测线空间坐标确定与记录系统;所述测点与测线空间坐标确定与记录系统用于确定与记录每个探地雷达数据采集点的具体位置。In the actual application process in the field of railway data detection, the ground penetrating radar data acquisition device used includes a ground penetrating radar host, transmitting and receiving antennas, and a measuring point and line spatial coordinate determination and recording system; the ground penetrating radar host is respectively connected to and controls the transmitting and receiving antennas, the measuring point and line spatial coordinate determination and recording system; the measuring point and line spatial coordinate determination and recording system is used to determine and record the specific position of each ground penetrating radar data acquisition point.
对于采集的原始检测数据,还通过对应型号的探地雷达数据读取器以及数据转换器读取原始检测数据并将其转换为满足本发明并行处理的标准格式。The collected original detection data is also read by a ground penetrating radar data reader and a data converter of a corresponding model and converted into a standard format that satisfies the parallel processing of the present invention.
本发明研究人员考虑到由于不同类型的探地雷达设备以及设置与不同区域的探地雷达设备获取的检测数据在处理需求方面存在轻重缓急之分,例如,某区域发生了地震,则该区域铁路相关探地雷达的实时检测数据则具备紧急的处理需求。因此,一个实施例中,在所述数据存储步骤中,按照采集区域和采集设备类型对铁路检测数据进行划分,得到处理优先级不同的各级的检测数据。The researchers of the present invention have considered that the detection data obtained by different types of ground penetrating radar equipment and ground penetrating radar equipment set in different areas have different processing requirements. For example, if an earthquake occurs in a certain area, the real-time detection data of the ground penetrating radar related to the railway in the area has urgent processing requirements. Therefore, in one embodiment, in the data storage step, the railway detection data is divided according to the acquisition area and the type of acquisition equipment to obtain detection data of different levels with different processing priorities.
具体地,可以为各批检测数据包添加优先级标记,并依据该优先级标记直接批量存储至不同优先等级对应的子存储区中,这样,避免了优先级标记位于检测数据文件头中,影响后续并行处理过程中的读取速率和调度运算速率。相应地,具有最高处理优先级的存储区,其中所存数据具备最高的调度优先级和读取优先级。Specifically, priority tags can be added to each batch of detection data packets, and based on the priority tags, they can be directly stored in batches in sub-storage areas corresponding to different priority levels. In this way, the priority tags are avoided from being located in the detection data file header, affecting the reading rate and scheduling operation rate in the subsequent parallel processing process. Accordingly, the storage area with the highest processing priority has the highest scheduling priority and reading priority for the stored data.
实际应用场景中,每次检测得到的数据集往往是多个文件,需要将数据集分配到多个并行计算节点,同时进行处理。切分调度运算的粒度直接影响每个并行计算节点的数据处理量和处理效率,对负载均衡有重要影响。In actual application scenarios, the data sets obtained from each detection are often multiple files, and the data sets need to be distributed to multiple parallel computing nodes for simultaneous processing. The granularity of the split scheduling operation directly affects the data processing volume and processing efficiency of each parallel computing node, and has an important impact on load balancing.
现有技术中采用以文件为单位进行切分,比如总共10个文件5个节点,每个节点分配2个文件。但有时单个文件很大,导致单个节点的计算量过大占用全部内存,其他节点无法运行,完全无法发挥并行化处理的优势。通俗来说,就是明明有多个生产线,却无法同时生产。The existing technology uses file-based segmentation, for example, 10 files and 5 nodes in total, with 2 files assigned to each node. However, sometimes a single file is very large, resulting in a single node's computational workload being too large to occupy all memory, and other nodes cannot run, completely failing to take advantage of parallel processing. In layman's terms, there are multiple production lines, but they cannot produce at the same time.
因此,为了进行细粒度的拆分,需要分解雷达文件,本发明研究人员考虑到由于探地雷达检测数据文件格式的特殊性,无法直接分解,其结合文件格式的具体内容实现并行分区存储,以便于实现检测数据的灵活切分。Therefore, in order to perform fine-grained splitting, it is necessary to decompose the radar file. The researchers of the present invention took into account the particularity of the ground penetrating radar detection data file format and failed to decompose it directly. They combined the specific content of the file format to realize parallel partition storage, so as to realize flexible segmentation of the detection data.
具体地,检测数据的文件格式为:文件头+信号数据(例如第1道数据+第2道数据……);其中,文件头为多个采集参数,且不同雷达设备的格式不同。Specifically, the file format of the detection data is: file header + signal data (such as the first channel data + the second channel data...); wherein the file header is a plurality of acquisition parameters, and the format is different for different radar devices.
进一步地,信号数据中的每1道数据包含2部分,道头+雷达信号(数据内容,如某品牌为4个数+512个数);Furthermore, each data in the signal data contains two parts: header + radar signal (data content, such as 4 numbers + 512 numbers for a certain brand);
由于后续的信号处理算法需要用到文件头中的参数,所以本专利的研究人员将文件头存储到关系型数据库中,如mysql数据库,领域内技术人员可以根据需求选用任何合理的关系型数据库;雷达信号数据内容则存储到分布式数据库中,如hdfs数据库,应用时,领域内技术人员可以根据需求选用任何合理的分布式数据库;基于此,信号数据就可以进行灵活的均等切分了,最小的粒度为单道信号。由此可见,采用本发明的技术构思,能够使各并行处理节点的数据处理粒度更小更均衡,从1个文件(通常路基检测单个文件有几十万道数据)到较小的道集(比如几十道,几百道)甚至1道数据。Since the subsequent signal processing algorithm needs to use the parameters in the file header, the researchers of this patent store the file header in a relational database, such as a MySQL database. Technical personnel in the field can choose any reasonable relational database according to their needs; the radar signal data content is stored in a distributed database, such as an HDFS database. When applied, technical personnel in the field can choose any reasonable distributed database according to their needs; based on this, the signal data can be flexibly and evenly divided, and the smallest granularity is a single-channel signal. It can be seen that the technical concept of the present invention can make the data processing granularity of each parallel processing node smaller and more balanced, from 1 file (usually a single file for roadbed detection has hundreds of thousands of data) to a smaller set of data (such as dozens or hundreds of data) or even 1 data.
此外,道头中会表征这一道是否为标记数据,对后续处理也有用,也存到关系型数据库中。所述标记数据可用于匹配雷达数据的真实空间位置,是采集时操作采集软件插入的,一般是等距离处,或者表征检测区段的地质特征发生变化的位置。In addition, the trace header will indicate whether this trace is marked data, which is also useful for subsequent processing and is also stored in the relational database. The marked data can be used to match the real spatial position of the radar data. It is inserted by the acquisition software during acquisition, generally at equal distances, or at locations where the geological characteristics of the detection section have changed.
具体地,考虑到探地雷达数据及其属性特征值涉及多种不同的数据格式,包括时域均方根振幅、时域相干性、频域-3dB带宽平均频率、频域-3dB带宽平均相位、时频域低频增加面积、时频域高频衰减面积等多维属性数据。因此,在一个实施例中,针对各个优先级存储区,在存储划分后的检测数据时,采用MySQL关系型数据库和分布式数据库多种存储方式,结合优先级信息实现对多种格式检测数据的并行分区存储;Specifically, considering that the ground penetrating radar data and its attribute characteristic values involve multiple different data formats, including time domain root mean square amplitude, time domain coherence, frequency domain -3dB bandwidth average frequency, frequency domain -3dB bandwidth average phase, time-frequency domain low frequency increase area, time-frequency domain high frequency attenuation area and other multi-dimensional attribute data. Therefore, in one embodiment, for each priority storage area, when storing the divided detection data, multiple storage methods such as MySQL relational database and distributed database are used, and priority information is combined to realize parallel partition storage of detection data in multiple formats;
其中,所述检测数据包括文件头和数据体,文件头包括各文件对应的数据采集时总道数、采样点数、天线频率、道间距以及时窗参数;各检测数据的文件头存储至关系型数据库中。The detection data includes a file header and a data body. The file header includes the total number of channels, the number of sampling points, the antenna frequency, the channel spacing and the time window parameters corresponding to each file during data collection. The file header of each detection data is stored in a relational database.
所述检测数据文件的数据体包括道头数据和数据内容,各检测数据的道头数据与所述文件头关联存储在关系型数据库中,各检测数据的数据内容存储在分布式数据库中。The data body of the detection data file includes header data and data content. The header data of each detection data is associated with the file header and stored in a relational database, and the data content of each detection data is stored in a distributed database.
具体实施时,可采取将文件头对应的数据采集时属性参数存入MySQL关系数据库,如总道数、采样点数、天线频率、道间距、时窗等参数。In specific implementation, the attribute parameters corresponding to the data collection in the file header can be stored in the MySQL relational database, such as the total number of channels, the number of sampling points, the antenna frequency, the channel spacing, the time window and other parameters.
将数据体在HDFS中分块存储,如果道头提示是道数标记,将道数标记也关联存入数据库,采用这样的存储方式,雷达数据可以按任意单道数据大小的倍数切分。The data body is stored in blocks in HDFS. If the header prompts a channel number mark, the channel number mark is also associated and stored in the database. With this storage method, radar data can be divided into multiples of any single channel data size.
实际应用时,数据库存储格式可以如下例所示:In actual application, the database storage format can be as follows:
文件属性参数表示例:Example of file attribute parameter table:
文件标记表示例:Example of file tag table:
传统算法的输入输出为一个或多个雷达文件,并行计算时单个任务的最小数据量普遍较大,容易引起并行计算负载不均衡,以及计算模块内存不足的问题,严重时甚至影响所有并行计算的正常运行。The input and output of traditional algorithms are one or more radar files. The minimum data size of a single task during parallel computing is generally large, which can easily cause unbalanced parallel computing load and insufficient memory in the computing module. In severe cases, it can even affect the normal operation of all parallel computing.
进一步地,即使现有技术中设置其并行计算的最小计算单位为一个雷达数据文件,由于其可能需要执行迭代运算处理,而迭代运算的次数不均,依然可能造成计算负载不均的问题,而现有技术中若该批计算中,存在未执行完毕的计算则其他并行运算通道处于毫无价值的等待状态,大大影响运算处理的时效性。Furthermore, even if the minimum computing unit of parallel computing is set to a radar data file in the prior art, since it may need to perform iterative processing and the number of iterative operations is uneven, it may still cause the problem of uneven computing load. In the prior art, if there are unfinished calculations in the batch of calculations, other parallel computing channels are in a worthless waiting state, which greatly affects the timeliness of the computing processing.
因此,一个实施例中,在算法组合制定步骤、根据检测数据的工程性质需求和解释目标需求,创建对应的运算工作流,其中,所述运算工作流为检测数据需进行的多个信号处理算法的组合,包括常规类工作流和迭代类工作流。本发明中针对不同的运算原理,根据其是否为迭代算法将运算工作流划分为常规类工作流和迭代工作流。Therefore, in one embodiment, in the algorithm combination formulation step, a corresponding operation workflow is created according to the engineering property requirements and interpretation target requirements of the detection data, wherein the operation workflow is a combination of multiple signal processing algorithms required for the detection data, including conventional workflows and iterative workflows. In the present invention, the operation workflow is divided into conventional workflows and iterative workflows according to different operation principles and whether it is an iterative algorithm.
需要说明的是,针对某一个雷达文件,可能涉及到具备执行顺序的运算处理,基于此,各雷达文件的运算处理工作流不限于一个常规类工作流和一个迭代类工作流。研究人员可以根据其运算需求合理规划各雷达数据文件不同阶段的运算工作流,各阶段运算工作流中至少包括一个算法,并进行统一规划存储。实际应用中,同类型批量雷达数据对应的运算工作流一致。It should be noted that for a certain radar file, it may involve computational processing with an execution order. Based on this, the computational processing workflow of each radar file is not limited to one conventional workflow and one iterative workflow. Researchers can reasonably plan the computational workflows of different stages of each radar data file according to their computational requirements. Each stage of the computational workflow includes at least one algorithm and is uniformly planned and stored. In practical applications, the computational workflows corresponding to the same type of batch radar data are consistent.
进一步地,为了保障各并行计算通道的负载均衡和利用利用率,本发明研究人员设置雷达信号处理算法的输入输出单元及最小计算单元为单道或道集。Furthermore, in order to ensure the load balance and utilization rate of each parallel computing channel, the researchers of the present invention set the input and output units and the minimum computing unit of the radar signal processing algorithm to a single channel or a channel set.
确定算法的最小计算单元e,即可开展互相之间无关联的独立计算,以最小计算单元为输入输出,比如单道或道集,道集大小可指定。By determining the minimum computational unit e of the algorithm, independent computations that are not related to each other can be carried out, with the minimum computational unit as input and output, such as a single channel or a channel set, and the channel set size can be specified.
以线性增益算法为例,对于每一道数据的计算结果体现为每个采样点乘以不同的系数:Taking the linear gain algorithm as an example, the calculation result for each data is reflected by multiplying each sampling point by a different coefficient:
P(i)=aiΔt+bexpciΔt,i=1,2,…,N1 P(i)=aiΔt+bexp ciΔt , i=1, 2,...,N 1
式中:N1为每一道信号的采样点个数(文件头中可读取);Where: N 1 is the number of sampling points of each signal (can be read in the file header);
P(i)为第i个采样点对应的加权因子;P(i) is the weighting factor corresponding to the i-th sampling point;
Δt为采样时间间隔(文件头中可读取);Δt is the sampling time interval (can be read in the file header);
a,b,c为系数,系数的值可由用户自行设定,在界面输入;a, b, c are coefficients, and the values of the coefficients can be set by the user and input in the interface;
因此针对上述线性增益算法,则设置最小计算单位为单道数据。Therefore, for the above linear gain algorithm, the minimum calculation unit is set to single-channel data.
结合实际应用,一个实施例中,本发明提供的并行处理方法还包括:节点架构创建步骤、设置一个主节点A作为任务调度节点,与一个读写节点B和N个计算节点C1~CN共同构成并行节点架构。In combination with practical applications, in one embodiment, the parallel processing method provided by the present invention further includes: a node architecture creation step, setting a master node A as a task scheduling node, which together with a read/write node B and N computing nodes C 1 -CN constitute a parallel node architecture.
进一步地,在所述并行处理步骤中,对于运算工作流中的多个算法,逐个处理:Furthermore, in the parallel processing step, multiple algorithms in the computational workflow are processed one by one:
如果是常规类算法,采用静态调度方法,A节点将本次处理数据划分为N部分,平均分配到C1~CN节点,在并行执行过程中将不再进行任务的调度。节点C1~CN分别调用设定的常规类算法进行并行处理,并在处理完后将计算结果反馈回A节点。If it is a conventional algorithm, a static scheduling method is used. Node A divides the data to be processed into N parts and evenly distributes them to nodes C 1 to C N. No task scheduling will be performed during the parallel execution. Nodes C 1 to C N respectively call the set conventional algorithm for parallel processing and feed back the calculation results to node A after processing.
因此,在一个实施例中,按照以下步骤实现基于常规类运算的并行处理:Therefore, in one embodiment, the following steps are followed to implement parallel processing based on conventional class operations:
任务调度节点判断是否存在需要处理的检测数据,若存在,统计需处理检测数据的总道数x,启动读写节点和计算节点;The task scheduling node determines whether there is any test data that needs to be processed. If so, it counts the total number of test data x that needs to be processed and starts the read-write node and the computing node.
依次循环执行如下步骤,直至所有待处理的检测数据运算完成,释放读写节点和计算节点:The following steps are executed in a loop until all pending detection data operations are completed and the read/write nodes and computing nodes are released:
基于预先设置的调度基数K,由读写节点读取x/K道作为本轮处理数据;任务调度节点将本轮处理数据平均划分为N部分,分配至各个计算节点;因此,运算过程中K的设置决定了调度粒度的大小,调度粒度指的是每个计算节点每次计算的数据量;在这一场景下,调度粒度=x/NK;Based on the preset scheduling base K, the read/write node reads x/K channels as the data for this round of processing; the task scheduling node divides the data for this round of processing into N parts evenly and distributes them to each computing node; therefore, the setting of K during the operation determines the size of the scheduling granularity, which refers to the amount of data calculated by each computing node each time; in this scenario, the scheduling granularity = x/NK;
各个计算节点根据常规类工作流并行执行运算;Each computing node performs operations in parallel according to the conventional class workflow;
并行调度节点确认是否收到计算完成的已处理检测数据,若收到,将其传至读写节点,由读写节点按照对应的空间结构进行整合并写入存储区;The parallel scheduling node confirms whether the processed detection data has been received. If received, it transmits it to the read-write node, which integrates it according to the corresponding spatial structure and writes it into the storage area;
其中,K为正整数,1≤K≤N,或进一步结合单位标记数据对应的道数S进行优化,确定调度基数K。Wherein, K is a positive integer, 1≤K≤N, or further optimized in combination with the number of channels S corresponding to the unit mark data to determine the scheduling base K.
实际应用时,由于常规运算各并行通道的计算过程是一致的,计算时间差距不会太大,因此,采用批量完成后再启动下一批数据调取及运算,但是需要说明的是,并行调度节点一旦识别到收到计算完成的已处理检测数据,就可以立即将其传至读写节点,由读写节点按照对应的空间结构进行整合并写入存储区,与上述的批量完后启动下一批读取和运算不冲突,可以有效避免数据读写和存储的拥挤。In actual application, since the calculation process of each parallel channel of conventional operation is consistent, the calculation time difference will not be too large. Therefore, the next batch of data retrieval and operation will be started after the batch is completed. However, it should be noted that once the parallel scheduling node recognizes that the processed detection data has been received and the calculation has been completed, it can immediately transmit it to the read-write node, and the read-write node will integrate it according to the corresponding spatial structure and write it to the storage area. This does not conflict with the start of the next batch of reading and operation after the above-mentioned batch is completed, and can effectively avoid data reading, writing and storage congestion.
其中,按照待处理数据总道数x的大小设定计算单位大小e的值,通常待处理数据总道数越大,相应的计算单位大小越大,最小的e值为1,表示以单道为最小计算单位。The value of the calculation unit size e is set according to the total number of data channels x to be processed. Generally, the larger the total number of data channels to be processed, the larger the corresponding calculation unit size. The minimum e value is 1, indicating that a single channel is the minimum calculation unit.
进一步地,考虑到用于匹配雷达数据的真实空间位置的标记数据,在一个可选的实施例中,确定切分信息内容的调度基数时,结合标记数据的位置进行计算,具体地,保持单位调度道数为单位标记数据对应道数S的单倍或多倍,以使各2个标记范围内的数据能够采用相同的信号处理算法,避免对应数据单元的计算信号有太大差异。比如标记是每1000道左右一个,那么数据处理时的单位调度道数也会参考这个标记粒度道数(单位标记数据对应的道数),与之相等或者倍数相等。Furthermore, considering the marker data used to match the real spatial position of the radar data, in an optional embodiment, when determining the scheduling base number for segmenting the information content, the position of the marker data is combined for calculation. Specifically, the unit scheduling channel number is kept as a single or multiple times of the channel number S corresponding to the unit marker data, so that the data within each 2 marker ranges can use the same signal processing algorithm to avoid too much difference in the calculation signal of the corresponding data unit. For example, if the marker is one per 1,000 channels, the unit scheduling channel number during data processing will also refer to the marker granularity channel number (the channel number corresponding to the unit marker data), and be equal to it or a multiple of it.
具体地,实际应用时,可以按照以下策略确定目标调取基数的值:Specifically, in actual application, the value of the target call base can be determined according to the following strategy:
若基于初始调度基数K’得到的单位调度道数N*e的值小于等于对应数据单元的S,则选定目标调度基数K的取值等于S;If the value of the unit scheduling channel number N*e obtained based on the initial scheduling cardinality K' is less than or equal to S of the corresponding data unit, the value of the selected target scheduling cardinality K is equal to S;
若单位调度道数N*e的值大于对应数据单元的S,则基于的值向上取整得到目标倍数m,选定m*S为目标调取粒度K的取值。If the value of the unit scheduling channel number N*e is greater than the corresponding data unit S, then based on The value of is rounded up to get the target multiple m, and m*S is selected as the target value of the granularity K.
进一步地,如果是迭代类算法,则采用循环分配的动态调度方法:Furthermore, if it is an iterative algorithm, a dynamic scheduling method of cyclic allocation is adopted:
设置对应的并行调度基数K,K为整数。Set the corresponding parallel scheduling base K, where K is an integer.
A节点首次分配数据量占总数据量比例为1/K,将这些数据划分为N部分,平均分配到C1~CN节点。C1~CN分别调用指定的算法进行并行处理,并在处理完后将计算结果反馈回A节点。The amount of data initially allocated by node A accounts for 1/K of the total amount of data, and the data is divided into N parts and evenly distributed to nodes C 1 to C N. C 1 to C N respectively call the specified algorithm for parallel processing and feed back the calculation results to node A after processing.
A节点接收C1~CN节点的反馈信息,并实时统计处于空闲状态的计算节点数,本发明研究人员考虑到,对于迭代类工作流中的算法组合,由于迭代次数未知,对于迭代次数多的算法,不同计算节点的单次运算时间差会随着迭代次数而增大,形成明显的时间差,导致各计算节点耗时明显不同);从剩余计算量中,继续均衡分配数据量(1/Kn)给每个空闲的计算节点;Node A receives feedback information from nodes C1 ~ CN , and counts the number of computing nodes in idle state in real time. The researchers of the present invention consider that for the algorithm combination in the iterative workflow, since the number of iterations is unknown, for the algorithm with many iterations, the single operation time difference of different computing nodes will increase with the number of iterations, forming a significant time difference, resulting in significantly different time consumption of each computing node); from the remaining computing amount, continue to evenly distribute the data amount (1/Kn) to each idle computing node;
循环执行上述步骤,直至所有数据处理完毕。The above steps are executed repeatedly until all data are processed.
因此,一个实施例中,在所述并行处理步骤中,按照以下步骤实现基于迭代类运算的并行处理:Therefore, in one embodiment, in the parallel processing step, the parallel processing based on iterative operations is implemented according to the following steps:
任务调度节点判断是否存在需要处理的检测数据,若存在,统计需处理检测数据的总道数x,启动读写节点和计算节点;The task scheduling node determines whether there is any test data that needs to be processed. If so, it counts the total number of test data x that needs to be processed and starts the read-write node and the computing node.
根据硬件资源、x以及该迭代算法的计算单位大小设置调度基数K,由读写节点读取x/K道作为第一轮处理数据;任务调度节点将本轮处理数据平均划分为N部分,分配至各个计算节点;The scheduling base K is set according to the hardware resources, x, and the computing unit size of the iterative algorithm. The read/write node reads x/K channels as the first round of processing data. The task scheduling node divides the processing data of this round into N parts and distributes them to each computing node.
进而依次循环执行如下步骤,直至所有待处理的检测数据运算完成,释放读写节点和计算节点:Then, the following steps are executed in a loop until all pending detection data operations are completed and the read/write nodes and computing nodes are released:
任务调度节点实时统计处于空闲的计算节点数量n,The task scheduling node counts the number of idle computing nodes n in real time.
读写节点读取道,数据,将其平均划分为n部分,分配至各个空闲的计算节点;Read-write node reads The data is divided into n parts evenly and distributed to each idle computing node;
各个计算节点根据迭代类工作流并行执行运算,并实时向任务调度节点反馈运算状态信息;Each computing node executes operations in parallel according to the iterative workflow and feeds back the operation status information to the task scheduling node in real time;
并行调度节点确认是否收到计算完成的已处理检测数据,若收到,将其传至读写节点,由读写节点按照对应的空间结构进行整合并写入存储区。The parallel scheduling node confirms whether the processed detection data has been received. If received, it will be transmitted to the read-write node, which will integrate it according to the corresponding spatial structure and write it into the storage area.
对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本发明并不受所描述的动作顺序的限制,因为依据本发明,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本发明所必须的。For the aforementioned method embodiments, for the sake of simplicity, they are all described as a series of action combinations, but those skilled in the art should know that the present invention is not limited by the order of the actions described, because according to the present invention, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
需要指出的是,在本发明的其他实施例中,该方法还可以通过将上述实施例中的某一个或某几个进行结合来得到新的数据并行处理方法,以实现对海量检测数据的高效分析和运算。It should be pointed out that, in other embodiments of the present invention, the method can also obtain a new data parallel processing method by combining one or several of the above embodiments to achieve efficient analysis and calculation of massive detection data.
采用本发明上述任意一个或多个实施例中所述的方案实现铁路检测数据的并行处理,在实现快速准确运算的基础上,进一步提升了各并行计算通道的负载均衡性,且保障了最大程度地发挥并行计算资源的利用效率,有助于为后续的应用研究提供可靠的数据支持。The scheme described in any one or more of the above embodiments of the present invention is adopted to realize parallel processing of railway detection data. On the basis of achieving fast and accurate calculations, the load balancing of each parallel computing channel is further improved, and the utilization efficiency of parallel computing resources is guaranteed to be maximized, which helps to provide reliable data support for subsequent application research.
需要说明的是,基于本发明上述任意一个或多个实施例中的方法,本发明还提供一种存储介质,该存储介质上存储有可实现如述任意一个或多个实施例中所述方法的程序代码,该代码被操作系统执行时能够实现如上所述的并行处理铁路检测大数据的方法。It should be noted that, based on the method in any one or more of the above-mentioned embodiments of the present invention, the present invention also provides a storage medium, on which is stored a program code that can implement the method as described in any one or more of the above-mentioned embodiments, and when the code is executed by the operating system, the method for parallel processing of railway detection big data as described above can be implemented.
实施例二Embodiment 2
上述本发明公开的实施例中详细描述了方法,对于本发明的方法可采用多种形式的装置或系统实现,因此基于上述任意一个或多个实施例中所述方法的其他方面,本发明还提供一种并行处理铁路检测大数据的系统,该系统用于执行上述任意一个或多个实施例中所述并行处理铁路检测大数据的方法。下面给出具体的实施例进行详细说明。The method is described in detail in the embodiments disclosed in the present invention. The method of the present invention can be implemented by various forms of devices or systems. Therefore, based on other aspects of the method described in any one or more embodiments, the present invention also provides a system for parallel processing of railway detection big data, which is used to execute the method for parallel processing of railway detection big data described in any one or more embodiments. Specific embodiments are given below for detailed description.
具体地,图2中示出了本发明实施例中提供的并行处理铁路检测大数据的系统的结构示意图,如图2所示,该系统包括:Specifically, FIG2 shows a schematic diagram of the structure of a system for parallel processing of railway detection big data provided in an embodiment of the present invention. As shown in FIG2 , the system includes:
数据存储模块,其配置为将采集的检测数据进行划分,并按照设定的存储策略将划分后的检测数据进行存储;A data storage module, configured to divide the collected detection data and store the divided detection data according to a set storage strategy;
算法组合制定模块,其配置为根据检测数据的工程性质需求和解释目标需求,创建对应的运算工作流,其中,所述运算工作流为检测数据需进行的多个信号处理算法的组合,包括常规类工作流和迭代类工作流;An algorithm combination formulation module is configured to create a corresponding operation workflow according to the engineering property requirements and interpretation target requirements of the detection data, wherein the operation workflow is a combination of multiple signal processing algorithms required for the detection data, including conventional workflows and iterative workflows;
并行处理模块,其配置为基于预先创建的并行节点架构并行调取存储的检测数据,并分别按照与所述运算工作流匹配的调度策略实现基于常规类运算和迭代类运算的并行处理。The parallel processing module is configured to retrieve the stored detection data in parallel based on a pre-created parallel node architecture, and implement parallel processing based on conventional operations and iterative operations according to a scheduling strategy that matches the operation workflow.
一个实施例中,所述数据存储模块包括数据划分单元,其配置为按照采集区域和采集设备类型对铁路检测数据进行划分,得到处理优先级不同的各级的检测数据。In one embodiment, the data storage module includes a data partitioning unit configured to partition the railway detection data according to the collection area and the type of collection equipment to obtain detection data of different levels with different processing priorities.
进一步地,一个实施例中,在存储划分后的检测数据时,所述数据存储模块的数据存储单元采用MySQL关系型数据库、HDFS多种存储方式,结合优先级信息实现对多种格式检测数据的分区存储;Further, in one embodiment, when storing the divided detection data, the data storage unit of the data storage module adopts a MySQL relational database, HDFS and multiple storage methods, and combines priority information to realize partition storage of detection data in multiple formats;
其中,所述检测数据包括文件头和数据体,文件头包括数据采集时总道数、采样点数、天线频率、道间距以及时窗参数。The detection data includes a file header and a data body, and the file header includes the total number of channels, the number of sampling points, the antenna frequency, the channel spacing and the time window parameters during data collection.
所述检测数据文件的数据体包括道头数据和数据内容,各检测数据的道头数据与所述文件头关联存储在关系型数据库中,各检测数据的数据内容存储在分布式数据库中。The data body of the detection data file includes header data and data content. The header data of each detection data is associated with the file header and stored in a relational database, and the data content of each detection data is stored in a distributed database.
进一步地,一个实施例中,所述系统还包括:节点架构创建步骤、设置一个主节点作为任务调度节点,与一个读写节点和N个计算节点共同构成并行节点架构。Furthermore, in one embodiment, the system also includes: a node architecture creation step, setting a master node as a task scheduling node, which together with a read-write node and N computing nodes constitutes a parallel node architecture.
优选地,一个实施例中,所述并行处理模块包括常规运算处理单元,其配置为按照以下步骤实现基于常规类运算的并行处理:Preferably, in one embodiment, the parallel processing module includes a conventional operation processing unit, which is configured to implement parallel processing based on conventional operations according to the following steps:
任务调度节点判断是否存在需要处理的检测数据,若存在,统计需处理检测数据的总道数x,启动读写节点和计算节点;The task scheduling node determines whether there is any test data that needs to be processed. If so, it counts the total number of test data x that needs to be processed and starts the read-write node and the computing node.
依次循环执行如下步骤,直至所有待处理的检测数据运算完成,释放读写节点和计算节点:The following steps are executed in a loop until all pending detection data operations are completed and the read/write nodes and computing nodes are released:
基于预先设置的调度基数K,由读写节点读取x/K道作为本轮处理数据;任务调度节点将本轮处理数据平均划分为N部分,分配至各个计算节点;Based on the preset scheduling base K, the read/write node reads x/K channels as the data for this round of processing; the task scheduling node divides the data for this round of processing into N parts evenly and distributes them to each computing node;
各个计算节点根据常规类工作流并行执行运算;Each computing node performs operations in parallel according to the conventional class workflow;
并行调度节点确认是否收到计算完成的已处理检测数据,若收到,将其传至读写节点,由读写节点按照对应的空间结构进行整合并写入存储区;The parallel scheduling node confirms whether the processed detection data has been received. If received, it transmits it to the read-write node, which integrates it according to the corresponding spatial structure and writes it into the storage area;
其中,K为正整数,1≤K≤N,或进一步结合单位标记数据对应的道数S进行优化,确定调度基数K。Wherein, K is a positive integer, 1≤K≤N, or further optimized in combination with the number of channels S corresponding to the unit mark data to determine the scheduling base K.
一个实施例中,所述并行处理模块包括迭代运算处理单元,其配置为按照以下步骤实现基于迭代类运算的并行处理:In one embodiment, the parallel processing module includes an iterative operation processing unit, which is configured to implement parallel processing based on iterative operations according to the following steps:
任务调度节点判断是否存在需要处理的检测数据,若存在,统计需处理检测数据的总道数x,启动读写节点和计算节点;The task scheduling node determines whether there is any test data that needs to be processed. If so, it counts the total number of test data x that needs to be processed and starts the read-write node and the computing node.
根据硬件资源、x以及该迭代算法的计算单位大小设置调度基数K,由读写节点读取x/K道作为第一轮处理数据;任务调度节点将本轮处理数据平均划分为N部分,分配至各个计算节点;The scheduling base K is set according to the hardware resources, x, and the computing unit size of the iterative algorithm. The read/write node reads x/K channels as the first round of processing data. The task scheduling node divides the processing data of this round into N parts and distributes them to each computing node.
进而依次循环执行如下步骤,直至所有待处理的检测数据运算完成,释放读写节点和计算节点:Then, the following steps are executed in a loop until all pending detection data operations are completed and the read/write nodes and computing nodes are released:
任务调度节点实时统计处于空闲的计算节点数量n,The task scheduling node counts the number of idle computing nodes n in real time.
读写节点读取道,数据,将其平均划分为n部分,分配至各个空闲的计算节点;Read-write node reads The data is divided into n parts evenly and distributed to each idle computing node;
各个计算节点根据迭代类工作流并行执行运算,并实时向任务调度节点反馈运算状态信息;Each computing node executes operations in parallel according to the iterative workflow and feeds back the operation status information to the task scheduling node in real time;
并行调度节点确认是否收到计算完成的已处理检测数据,若收到,将其传至读写节点,由读写节点按照对应的空间结构进行整合并写入存储区。The parallel scheduling node confirms whether the processed detection data has been received. If received, it will be transmitted to the read-write node, which will integrate it according to the corresponding spatial structure and write it into the storage area.
本发明实施例提供的并行处理铁路检测大数据的系统中,各个模块或单元结构可以根据实际分析和运算需求独立运行或组合运行,以实现相应的技术效果。In the system for parallel processing of railway inspection big data provided by the embodiment of the present invention, each module or unit structure can operate independently or in combination according to actual analysis and computing requirements to achieve corresponding technical effects.
应该理解的是,本发明所公开的实施例不限于这里所公开的特定结构、处理步骤或材料,而应当延伸到相关领域的普通技术人员所理解的这些特征的等同替代。还应当理解的是,在此使用的术语仅用于描述特定实施例的目的,而不意味着限制。It should be understood that the embodiments disclosed in the present invention are not limited to the specific structures, processing steps or materials disclosed herein, but should be extended to equivalent substitutions of these features understood by ordinary technicians in the relevant field. It should also be understood that the terms used herein are only used for the purpose of describing specific embodiments and are not meant to be limiting.
说明书中提到的“一实施例”意指结合实施例描述的特定特征、结构或特征包括在本发明的至少一个实施例中。因此,说明书通篇各个地方出现的短语“一实施例”并不一定均指同一个实施例。The "one embodiment" mentioned in the specification means that a particular feature, structure or characteristic described in conjunction with the embodiment is included in at least one embodiment of the present invention. Therefore, the phrase "one embodiment" appearing in various places throughout the specification does not necessarily refer to the same embodiment.
虽然本发明所揭露的实施方式如上,但所述的内容只是为了便于理解本发明而采用的实施方式,并非用以限定本发明。任何本发明所属技术领域内的技术人员,在不脱离本发明所揭露的精神和范围的前提下,可以在实施的形式上及细节上作任何的修改与变化,但本发明的专利保护范围,仍须以所附的权利要求书所界定的范围为准。Although the embodiments disclosed in the present invention are as above, the contents described are only embodiments adopted for facilitating the understanding of the present invention and are not intended to limit the present invention. Any technician in the technical field to which the present invention belongs can make any modifications and changes in the form and details of the implementation without departing from the spirit and scope disclosed in the present invention, but the patent protection scope of the present invention shall still be subject to the scope defined in the attached claims.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110665774.6A CN113625264B (en) | 2021-06-16 | 2021-06-16 | Method and system for parallel processing of railway detection big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110665774.6A CN113625264B (en) | 2021-06-16 | 2021-06-16 | Method and system for parallel processing of railway detection big data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113625264A CN113625264A (en) | 2021-11-09 |
CN113625264B true CN113625264B (en) | 2024-08-30 |
Family
ID=78378122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110665774.6A Active CN113625264B (en) | 2021-06-16 | 2021-06-16 | Method and system for parallel processing of railway detection big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113625264B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117812123B (en) * | 2023-12-29 | 2024-12-06 | 雷帝斯(杭州)流体科技有限公司 | A valve control method and system based on Internet of Things |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059631A (en) * | 2019-04-19 | 2019-07-26 | 中铁第一勘察设计院集团有限公司 | The contactless monitoring defect identification method of contact net |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5968109A (en) * | 1996-10-25 | 1999-10-19 | Navigation Technologies Corporation | System and method for use and storage of geographic data on physical media |
EP1134674A4 (en) * | 1998-11-24 | 2010-06-09 | Panasonic Corp | DATA STRUCTURE OF A NUMERIC MAPPING FILE |
US7272613B2 (en) * | 2000-10-26 | 2007-09-18 | Intel Corporation | Method and system for managing distributed content and related metadata |
CN100547594C (en) * | 2007-06-27 | 2009-10-07 | 中国科学院遥感应用研究所 | A Digital Earth Prototype System |
CN103338261B (en) * | 2013-07-04 | 2016-06-29 | 北京泰乐德信息技术有限公司 | The storage of a kind of track traffic Monitoring Data and processing method and system |
CA2891151C (en) * | 2014-05-19 | 2023-07-04 | Siddhartha Sengupta | System and method for generating vehicle movement plans in a large railway network |
CN103969627B (en) * | 2014-05-26 | 2016-07-06 | 苏州市数字城市工程研究中心有限公司 | The extensive D integral pin-fin tube analogy method of GPR based on FDTD |
EP3472767A4 (en) * | 2016-06-18 | 2019-12-04 | Fractal Industries, Inc. | PRECISE AND DETAILED MODELING OF SYSTEMS USING A DISTRIBUTED SIMULATION ENGINE |
CN107423338B (en) * | 2017-04-28 | 2020-12-25 | 中国铁道科学研究院 | Railway comprehensive detection data display method and device |
EP3662331A4 (en) * | 2017-08-02 | 2021-04-28 | Strong Force Iot Portfolio 2016, LLC | METHODS AND SYSTEMS FOR DETECTION IN AN INDUSTRIAL INTERNET OF THINGS DATA COLLECTION ENVIRONMENT WITH LARGE AMOUNTS OF DATA |
KR101950935B1 (en) * | 2017-09-27 | 2019-02-22 | (주) 퓨처젠 | System for sniffing detection data of railway safety device, and program |
CN108804220A (en) * | 2018-01-31 | 2018-11-13 | 中国地质大学(武汉) | A method of the satellite task planning algorithm research based on parallel computation |
-
2021
- 2021-06-16 CN CN202110665774.6A patent/CN113625264B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059631A (en) * | 2019-04-19 | 2019-07-26 | 中铁第一勘察设计院集团有限公司 | The contactless monitoring defect identification method of contact net |
Also Published As
Publication number | Publication date |
---|---|
CN113625264A (en) | 2021-11-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107092627B (en) | Columnar storage representation of records | |
CN112000703B (en) | Data warehousing processing method and device, computer equipment and storage medium | |
CN107301206A (en) | A kind of distributed olap analysis method and system based on pre-computation | |
CN107451233B (en) | Storage method of time-attribute-priority spatiotemporal trajectory data file in auxiliary storage device | |
CN108763299A (en) | A kind of large-scale data processing calculating acceleration system | |
US11954084B2 (en) | Method and apparatus for processing table, device, and storage medium | |
KR20110035899A (en) | Dimension reduction mechanisms to represent large communication network graphs for structured queries | |
CN112579586A (en) | Data processing method, device, equipment and storage medium | |
Emeliyanov et al. | GPU-based tracking algorithms for the ATLAS high-level trigger | |
Magana-Zook et al. | Large-scale seismic waveform quality metric calculation using Hadoop | |
CN105808358A (en) | Data dependency thread group mapping method for many-core system | |
CN107729555A (en) | A kind of magnanimity big data Distributed Predictive method and system | |
CN109657197A (en) | A kind of pre-stack depth migration calculation method and system | |
Sethy et al. | Big data analysis using Hadoop: a survey | |
CN120179606B (en) | Storage and computing integrated parallel processing system and method | |
CN113625264B (en) | Method and system for parallel processing of railway detection big data | |
CN104573082A (en) | Space small file data distribution storage method and system based on access log information | |
CN108334532B (en) | A Spark-based Eclat parallelization method, system and device | |
Ulmer et al. | Extending composable data services into SmartNICs | |
CN107679133B (en) | A practical mining method for massive real-time PMU data | |
CN112308317B (en) | Noise power spectrum calculation method and system for massive seismic observation data based on distributed architecture | |
Huang et al. | Performance evaluation of enabling logistic regression for big data with R | |
US8832157B1 (en) | System, method, and computer-readable medium that facilitates efficient processing of distinct counts on several columns in a parallel processing system | |
Harris et al. | Monte carlo based server consolidation for energy efficient cloud data centers | |
Wu et al. | Indexing blocks to reduce space and time requirements for searching large data files |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |