CN117009139A - Data stream computing anomaly recovery method, device, equipment and storage medium
- Publication number: CN117009139A
- Application number: CN202310884253.9A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1471—Saving, restoring, recovering or retrying involving logging of persistent data for recovery
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24568—Data stream processing; Continuous queries
Abstract
The present invention relates to the field of data processing, and in particular to a data stream computing anomaly recovery method, device, equipment and storage medium. When stream computation fails, the invention determines the abnormal operator and its adjacent operators, invokes a target operator corresponding to the abnormal operator, and has the target operator take the abnormal operator's place in the connections with the adjacent operators, so that the job as a whole remains executable. The adjacent operators then read the replay point data from the write-ahead log and deliver it to the target operator, which performs stream computation on the replay point data and recomputes the failed portion of the stream computation without recomputing all the data. This avoids the technical problem in the prior art that recomputation is inefficient when Flink-based real-time data stream computing fails, and shortens the stream computing anomaly recovery cycle.
Description
Technical Field
The present invention relates to the field of data processing, and in particular to a data stream computing anomaly recovery method, device, equipment and storage medium.
Background Art
In cloud computing or large-scale data computing jobs, real-time computing jobs run on a large cluster of servers, where any hardware fault or anomaly causes the computing job to fail. Conventionally, the distributed stream- and batch-processing framework Apache Flink periodically saves snapshots, so that when an anomaly occurs during data processing the stream computation can be redone from the saved data. For large-scale real-time computing jobs, however, the saved job state can be very large, the state backup and reprocessing cycle is very long, and the fault-tolerance cost is high; moreover, because the data state changes in real time, the recomputed result may be inconsistent with the normal computation result, which hurts data processing efficiency.
The above content serves only to aid understanding of the technical solution of the present invention and is not an admission that it constitutes prior art.
Summary of the Invention
The main object of the present invention is to provide a data stream computing anomaly recovery method, device, equipment and storage medium, aiming to solve the technical problem in the prior art that recomputation is inefficient when Flink-based real-time data stream computing fails.
To achieve the above object, the present invention provides a data stream computing anomaly recovery method, the method comprising the following steps:
When an anomaly occurs in the data stream computation, determining the abnormal operator and the corresponding adjacent operators;
Invoking the target operator corresponding to the abnormal operator, and establishing connections between the target operator and the adjacent operators;
Reading the replay point data in the write-ahead log via the adjacent operators, and delivering the replay point data to the target operator;
Performing data stream computation on the replay point data with the target operator.
Optionally, performing data stream computation on the replay point data with the target operator comprises:
Reading, with the target operator, the state information corresponding to the replay point data from an external storage database;
Performing, with the target operator, data stream computation according to the replay point data and the state information.
Optionally, the data stream computing anomaly recovery method further comprises:
Writing the stream computation result into the write-ahead log;
Calling the external storage database to update the state information corresponding to the replay point data.
Optionally, the adjacent operators include an upstream operator;
The data stream computing anomaly recovery method further comprises:
Sending, with the target operator, confirmation information to the upstream operator;
Adjusting, by the upstream operator according to the confirmation information, the data offset of the replay point data in the write-ahead log, and executing the normal data stream computing mode.
Optionally, performing stream computation on the replay point data with the target operator comprises:
Judging, by the target operator, whether the replay point data is duplicate data based on the data offset in the write-ahead log;
When the replay point data is not duplicate data, executing the step of performing data stream computation on the replay point data with the target operator.
Optionally, the data stream computing anomaly recovery method further comprises:
When the replay point data is duplicate data, generating reminder information with the target operator, and feeding the reminder information back to the upstream operator;
Adjusting the data offset of the replay point data in the write-ahead log;
Reading, by the upstream operator according to the reminder information, the to-be-processed data corresponding to the adjusted data offset from the external storage database, and delivering the to-be-processed data to the target operator to execute data stream computation.
Optionally, reading the replay point data in the write-ahead log via the adjacent operators comprises:
Querying, via the adjacent operators, the historical data offset in the write-ahead log;
Reading the replay point data in the write-ahead log according to the historical data offset.
In addition, to achieve the above object, the present invention further provides a data stream computing anomaly recovery device, comprising:
an acquisition module, configured to determine the abnormal operator and the corresponding adjacent operators when an anomaly occurs in data processing;
an invocation module, configured to invoke the target operator corresponding to the abnormal operator and establish connections between the target operator and the adjacent operators;
a reading module, configured to read the replay point data in the write-ahead log via the adjacent operators and deliver the replay point data to the target operator;
a computing module, configured to perform data stream computation on the replay point data with the target operator.
In addition, to achieve the above object, the present invention further provides a data stream computing anomaly recovery apparatus comprising a memory, a processor, and a data stream computing anomaly recovery program stored in the memory and executable on the processor, the program being configured to implement the steps of the data stream computing anomaly recovery method described above.
In addition, to achieve the above object, the present invention further provides a storage medium on which a data stream computing anomaly recovery program is stored, the program, when executed by a processor, implementing the steps of the data stream computing anomaly recovery method described above.
In the present invention, when an anomaly occurs in the data stream computation, the abnormal operator and the corresponding adjacent operators are determined; the target operator corresponding to the abnormal operator is invoked, and connections between the target operator and the adjacent operators are established; the replay point data in the write-ahead log is read via the adjacent operators and delivered to the target operator; and data stream computation is performed on the replay point data with the target operator. By determining the abnormal operator and the adjacent operators when stream computation fails, invoking the target operator corresponding to the abnormal operator, and having the target operator take the abnormal operator's place in the connections with the adjacent operators, the invention keeps the overall data processing executable; meanwhile, the adjacent operators read the replay point data from the write-ahead log and deliver it to the target operator, so the target operator can perform stream computation on the replay point data and recompute the failed portion of the stream computation without recomputing all the data. This avoids the low recomputation efficiency of existing Flink-based real-time data stream computing when an anomaly occurs, and shortens the stream computing anomaly recovery cycle.
Brief Description of the Drawings
Figure 1 is a schematic structural diagram of the data stream computing anomaly recovery apparatus of the hardware operating environment involved in an embodiment of the present invention;
Figure 2 is a schematic flowchart of the first embodiment of the data stream computing anomaly recovery method of the present invention;
Figure 3 is a schematic diagram of the Flink architecture in an embodiment of the data stream computing anomaly recovery method of the present invention;
Figure 4 is a schematic diagram of an existing single Flink job in an embodiment of the data stream computing anomaly recovery method of the present invention;
Figure 5 is a schematic diagram of a single Flink job after the compute-node state has been stripped out, in an embodiment of the data stream computing anomaly recovery method of the present invention;
Figure 6 is a schematic flowchart of the second embodiment of the data stream computing anomaly recovery method of the present invention;
Figure 7 is a schematic diagram of the normal data processing flow in an embodiment of the data stream computing anomaly recovery method of the present invention;
Figure 8 is a schematic diagram of the anomaly recovery flow in an embodiment of the data stream computing anomaly recovery method of the present invention;
Figure 9 is a structural block diagram of the first embodiment of the data stream computing anomaly recovery device of the present invention.
The realization of the objects, functional features and advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description of the Embodiments
It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Referring to Figure 1, Figure 1 is a schematic structural diagram of the data stream computing anomaly recovery apparatus of the hardware operating environment involved in an embodiment of the present invention.
As shown in Figure 1, the data stream computing anomaly recovery apparatus may include a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 implements connection and communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard, and optionally may also include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as disk storage, and may optionally be a storage device independent of the aforementioned processor 1001.
Those skilled in the art will understand that the structure shown in Figure 1 does not limit the data stream computing anomaly recovery apparatus, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in Figure 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module, and a data stream computing anomaly recovery program.
In the data stream computing anomaly recovery apparatus shown in Figure 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 may be arranged in the data stream computing anomaly recovery apparatus, which calls, through the processor 1001, the data stream computing anomaly recovery program stored in the memory 1005 and executes the data stream computing anomaly recovery method provided by embodiments of the present invention.
An embodiment of the present invention provides a data stream computing anomaly recovery method. Referring to Figure 2, Figure 2 is a schematic flowchart of the first embodiment of the data stream computing anomaly recovery method of the present invention.
In this embodiment, the data stream computing anomaly recovery method includes the following steps:
Step S10: When an anomaly occurs in the data stream computation, determine the abnormal operator and the corresponding adjacent operators.
It should be noted that the method of this embodiment may be executed by a device capable of data processing, program execution and network communication, such as a cloud server or a computer cluster, or by other devices that can implement the same or similar functions; this embodiment imposes no specific restriction. In this embodiment and the following embodiments, a cloud server is used as an example.
It is worth noting that the application scenario of the method of this embodiment is the framework and distributed processing engine Apache Flink, hereinafter Flink. Flink performs stateful computations over unbounded and bounded data streams, runs in all common cluster environments, and computes at in-memory speed and at any scale, with high throughput and low latency.
In a specific implementation, referring to Figure 3, a Flink cluster contains a Job Manager and Task Managers. The Job Manager is the job-management process and the Task Managers are worker processes; the management process periodically triggers state saving (Trigger Checkpoint) and confirms its completion (Ack. Checkpoint). The operators in a Task Manager are responsible for saving state snapshots to an external distributed snapshot store, such as the Hadoop Distributed File System (HDFS) or Redis; this embodiment imposes no specific restriction.
Referring to Figure 4, in traditional technology, when a data processing job runs on Flink, each operator stores its state in memory, or in memory combined with a file system, on its own compute node, and periodically backs up the data state information to a distributed file system. If a compute node or operator then fails, the backed-up data state information can be read from the distributed file system so that the stream computation can be redone without recomputing the entire data processing job, which improves data processing efficiency.
The data state information mainly concerns data in intermediate states across multiple processing steps. For example, in a job that computes "five times 2 to the power of 10", one can first compute 2 to the power of 10 and then multiply it by 5. If the multiplication step fails, it can be recomputed from the previously saved value of 2 to the power of 10, without rerunning the entire data processing flow. This practice of saving intermediate computation results so that a job can resume from a "breakpoint" after a fault is what the data state information provides.
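The "breakpoint recovery" idea above can be sketched in a few lines of Python. This is an illustration only; the function and variable names are hypothetical, and the dict stands in for the distributed file system backup:

```python
saved_state = {}  # stands in for the distributed file system backup

def step_power():
    result = 2 ** 10
    saved_state["pow_2_10"] = result  # checkpoint the intermediate value
    return result

def step_multiply(retry=False):
    # On retry, restore the checkpointed intermediate value instead of recomputing it.
    base = saved_state["pow_2_10"] if retry else step_power()
    return base * 5

first = step_multiply()                 # normal run
recovered = step_multiply(retry=True)   # recovery: reuses the saved 2**10
assert first == recovered == 5120
```

The retry path never re-executes `step_power`, which is exactly the saving the passage describes: only the failed step is redone.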
In a specific implementation, because Flink's data throughput is large, a batch of data may occupy anywhere from kilobytes to terabytes. When periodically backing up data state information to the distributed file system, smaller state can be backed up at a shorter interval, e.g. every 30 seconds, which reduces the amount of data that must be recomputed after a stream computing anomaly and improves recovery efficiency; larger state can be backed up at a longer interval, e.g. every 2 minutes. This embodiment imposes no specific restriction.
Understandably, an operator is a mapping from one function space to another, e.g. a source operator, a window operator, or a sink operator; this embodiment imposes no specific restriction. An abnormal operator is a mapping that produces incorrect computation results during stream computation. In this embodiment, referring to Figure 5, Flink's handling of a real-time data stream is divided into acquiring data, processing data, and outputting the processing result; operators are used to process the data. The data source may be a Kafka business system: after an operator obtains data from the Kafka business system, it can output the computation result back to the Kafka business system, completing one full data stream computation.
In this embodiment, the adjacent operators of an abnormal operator are the upstream and downstream operators connected to it. For example, when the window operator in Figure 5 is the abnormal operator, the adjacent operators include the upstream source operator and the downstream sink operator.
In addition, traditional stream computing anomaly recovery fetches the data state information from a distributed storage engine, applies it locally, and then re-executes the stream computation; this embodiment instead reads the data state information of the distributed storage engine directly, with no local-application step, which improves recovery efficiency.
Step S20: Invoke the target operator corresponding to the abnormal operator, and establish connections between the target operator and the adjacent operators.
Understandably, the target operator may be an operator with the same function as the abnormal operator but hosted on a different server; this process is executed by the Job Manager. Establishing connections between the target operator and the adjacent operators means sending the target operator the information about the upstream and downstream operators connected to the abnormal operator, the storage path of the abnormal operator's write-ahead log, and the state storage information, so that the target operator replaces the abnormal operator, a complete connection is established, and data and messages can again be passed between the upstream and downstream operators.
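Step S20 can be illustrated with a small sketch of how a job manager might re-wire a replacement operator into the topology. All class and field names here are illustrative assumptions, not structures defined in the patent:

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    upstream: list = field(default_factory=list)
    downstream: list = field(default_factory=list)
    wal_path: str = ""        # storage path of the operator's write-ahead log
    state_store: str = ""     # location of its state information

def replace_operator(failed: Operator, target: Operator) -> Operator:
    # Hand the target the failed operator's wiring and recovery metadata.
    target.upstream = list(failed.upstream)
    target.downstream = list(failed.downstream)
    target.wal_path = failed.wal_path
    target.state_store = failed.state_store
    # Re-point the neighbours at the target so the pipeline stays connected.
    for up in target.upstream:
        up.downstream = [target if op is failed else op for op in up.downstream]
    for down in target.downstream:
        down.upstream = [target if op is failed else op for op in down.upstream]
    return target

source = Operator("source")
window = Operator("window", upstream=[source], wal_path="/wal/window")
sink = Operator("sink", upstream=[window])
source.downstream = [window]
window.downstream = [sink]

new_window = replace_operator(window, Operator("window-2"))
assert source.downstream[0] is new_window and sink.upstream[0] is new_window
assert new_window.wal_path == "/wal/window"
```

The key point the sketch shows is that only the failed node is swapped; the neighbours keep running and merely update their links.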
Step S30: Read the replay point data in the write-ahead log via the adjacent operators, and deliver the replay point data to the target operator.
It should be noted that the write-ahead log (WAL) is a mechanism in database systems for guaranteeing the atomicity and durability of data operations. For non-in-memory databases, disk I/O is a major efficiency bottleneck. For the same volume of data, a database system using a write-ahead log performs only about half the disk writes at transaction commit compared with a traditional rollback log, which greatly improves the efficiency of database disk I/O and thus database performance.
In a system using a write-ahead log, all modifications are written to the log file before being committed; the log file usually contains both redo and undo information. For example, if the machine loses power while a program is performing some operations, on restart the program may need to know whether those operations succeeded, partially succeeded, or failed. With a write-ahead log, the program can inspect the log file and compare what it had planned to do at the moment of power loss with what was actually done; on that basis it can decide to undo the work, finish it, or leave it as is.
Replay point data is the data after the offset of the last record that was computed successfully and correctly. For example, when processing three records A, B and C, if A and B are computed successfully with correct results but the computation of C fails, the replay point data is C, i.e. the data following the last successful computation.
In a specific implementation, because different operators execute different data processing flows and the data an operator processes is delivered by its upstream operator, anomaly recovery works the same way: an adjacent operator reads the replay point data from the write-ahead log and delivers it to the target operator, and only then can the target operator execute data stream computation on the replay point data and complete the recovery.
Further, reading the replay point data in the write-ahead log via the adjacent operators comprises:
Querying, via the adjacent operators, the historical data offset in the write-ahead log;
Reading the replay point data in the write-ahead log according to the historical data offset.
In a specific implementation, when reading the replay point data in the write-ahead log, the position of the replay point data can be determined from the offset recorded in the write-ahead log.
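As an illustration of this offset-based lookup, the sketch below (with an assumed record layout) treats the replay point as the first record after the last offset confirmed as successfully processed, matching the A/B/C example above:

```python
# Hypothetical WAL records; "committed" marks offsets whose computation succeeded.
wal = [
    {"offset": 0, "data": "A", "committed": True},
    {"offset": 1, "data": "B", "committed": True},
    {"offset": 2, "data": "C", "committed": False},  # failed mid-computation
    {"offset": 3, "data": "D", "committed": False},
]

def replay_point(log):
    # Last offset known to have been processed successfully.
    last_ok = max((r["offset"] for r in log if r["committed"]), default=-1)
    # Everything after it must be replayed.
    return [r["data"] for r in log if r["offset"] > last_ok]

assert replay_point(wal) == ["C", "D"]
```

Only the unconfirmed tail of the log is replayed; the committed prefix is never recomputed.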
Step S40: Perform data stream computation on the replay point data with the target operator.
It is worth noting that, when performing the data stream computation, the received data is simply computed according to the business logic corresponding to the target operator's function-space mapping.
In a specific implementation, each upstream operator finds the data replay point in the write-ahead log and replays the data from there. The state is stored in an external, highly concurrent distributed KV store, so the state needs no special handling and no redistribution. Immutability is achieved by replaying the data: even if an external dimension table changes, the data processing results after recovery remain consistent with those before the fault.
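A minimal sketch of keeping operator state in an external KV store, so a replacement operator can resume without any state redistribution. The in-memory dict stands in for something like Redis or HBase, and the key scheme is an assumption:

```python
external_kv = {}  # plays the role of the distributed KV state store

def process(operator_id: str, record: int) -> int:
    # A running-sum operator whose only state lives in the external store.
    state = external_kv.get(operator_id, 0)
    state += record
    external_kv[operator_id] = state
    return state

for r in [1, 2, 3]:
    last = process("op-3", r)
assert last == 6

# A freshly started replacement operator reads the same key and resumes
# exactly where the failed instance left off; nothing is copied locally.
assert process("op-3", 4) == 10
```

Because no state lives on the compute node, swapping the node changes nothing about where the state is read from.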
本实施例通过确定流计算异常时的异常算子和相邻算子,并调用异常算子对应的目标算子,以目标算子代替异常算子建立与相邻算子之间的连接,保证整体数据处理可执行性,同时通过相邻数字读取预写式日志中的重放点数据,下发重放点数据至目标算子,从而目标算子可以根据重放点数据进行流计算,完成异常流计算结果的重新计算,而重新计算所有的数据,避免了现有技术中基于flink的实时数据流计算异常时,重新计算的效率较低的技术问题,减少流计算异常恢复的周期。This embodiment determines the abnormal operator and adjacent operators when the stream calculation is abnormal, calls the target operator corresponding to the abnormal operator, and uses the target operator to replace the abnormal operator to establish a connection with the adjacent operator, ensuring that Overall data processing executability, while reading the replay point data in the pre-written log through adjacent numbers, and delivering the replay point data to the target operator, so that the target operator can perform stream calculations based on the replay point data. Complete the recalculation of abnormal flow calculation results and recalculate all data, which avoids the technical problem of low recalculation efficiency in the existing technology when the real-time data flow calculation based on flink is abnormal, and reduces the cycle of abnormal recovery of flow calculation.
Referring to FIG. 6, FIG. 6 is a schematic flowchart of a second embodiment of the data stream computing anomaly recovery method of the present invention.

Based on the first embodiment above, in this embodiment, step S40 includes:

Step S401: reading, based on the target operator, the state information corresponding to the replay point data from the external storage database.
It should be noted that when computing the replay point data, the data may differ over time. For example, suppose a stream computing job determines whether a user is online. If no error occurs, the associated result is (user XX, online); if the user goes offline during the abnormal period, then after the job is recovered the user's status becomes (user XX, offline). The two computations produce completely different results, and it is impossible to tell whether the fault caused the user to be misjudged as offline or the user went offline voluntarily, so idempotence cannot be guaranteed.

The external storage database may be a distributed, highly available, high-performance KV-type state store, such as Redis or HBase; this embodiment imposes no specific restriction on this.

Step S402: performing, based on the target operator, data stream computation according to the replay point data and the state information.
In a specific implementation, referring to FIG. 7, FIG. 7 is a schematic flowchart of stream computation exception recovery in this embodiment, where the new operator is the target operator. When a stream computation exception occurs, the job manager starts new operator 3 on another server and informs it of its upstream/downstream information, the storage path of old operator 3's WAL, and the old operator's state storage information. The new operator 3 instance is initialized and re-establishes its connections with upstream operator 2 and downstream operator 4, enabling data transfer and event notification between upstream and downstream. The upstream operator then finds the replay point data in the write-ahead log and sends it to the new downstream operator 3; the new operator reads the state information from the distributed state store and can then redo the stream computation based on the function mapping relationship corresponding to the target operator.
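The sequence just described (pull up a replacement operator, reconnect it to its upstream, replay from the WAL while the new operator consumes state from the external store) can be sketched roughly as follows. All class and method names here are hypothetical, chosen only to mirror the FIG. 7 flow.

```python
# Illustrative-only sketch of the recovery sequence; not the patent's API.

class StateStore:
    """Stands in for an external distributed KV state store."""
    def __init__(self):
        self.kv = {}
    def get(self, k, default=None):
        return self.kv.get(k, default)
    def put(self, k, v):
        self.kv[k] = v

class Operator:
    def __init__(self, name, fn, store):
        self.name, self.fn, self.store = name, fn, store
        self.downstream = None

    def connect(self, downstream):
        # Re-establish the upstream -> downstream connection.
        self.downstream = downstream

    def process(self, record):
        result = self.fn(record, self.store)
        if self.downstream:
            self.downstream.process(result)
        return result

def recover(job, failed_name, wal_entries, store):
    """Replace the failed operator and replay WAL records into the new one."""
    upstream = job["upstream"]
    new_op = Operator(failed_name + "_new", job["logic"], store)  # pulled up elsewhere
    upstream.connect(new_op)          # rebuild the connection relationship
    for rec in wal_entries:           # upstream replays from the replay point
        new_op.process(rec)
    return new_op

store = StateStore()

def logic(record, state):
    # Toy business logic: count records and emit a transformed value.
    state.put("count", state.get("count", 0) + 1)
    return record * 2

upstream = Operator("op2", lambda r, s: r, store)
job = {"upstream": upstream, "logic": logic}
new_op = recover(job, "op3", [1, 2, 3], store)
```

Because the state lives in the shared `StateStore` rather than inside the operator object, the replacement operator picks up exactly where the old one left off.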
Further, to prevent abnormal results from recurring after the stream computation is completed, the data stream computing anomaly recovery method further includes:

writing the stream computation result into the write-ahead log; and

calling the external storage database to update the state information corresponding to the replay point data.

It can be understood that after the stream computation is completed, the result can first be written into the write-ahead log; if the computation involves data state, the state in the distributed state store can also be updated.

Further, to ensure that subsequent data stream computations proceed normally, the data stream computing anomaly recovery method further includes:

sending, based on the target operator, confirmation information to the upstream operator; and

adjusting, based on the upstream operator, the data offset of the replay point data in the write-ahead log according to the confirmation information, and executing the normal data stream computation mode.
After the exception recovery is completed, the upstream can be notified that data processing is finished, and the normal data processing flow begins. For example, after operator 3 completes exception recovery, it notifies operator 2 that recovery is complete; operator 2 can then read the data in the write-ahead log and adjust the data offset of the replay point data so that the offset moves backward or is incremented by 1. The specific adjustment depends on how the offset is configured. For example, if a stream computing job contains four data computation tasks A, B, C, and D, and an exception occurs at C, the job's offset is 2; after the exception recovery is completed, the job's offset becomes 3, and the computation task for D will then be computed. This embodiment imposes no specific restriction on this.
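The ABCD example above amounts to a one-step advance of the confirmed offset; the representation below is an assumption for illustration.

```python
# Tiny illustration of the offset adjustment after recovery.
tasks = ["A", "B", "C", "D"]
offset = 2                 # task C (index 2) failed mid-computation

# ... exception recovery recomputes tasks[offset] ("C") from the WAL ...
offset += 1                # upstream advances the replay offset by 1

next_task = tasks[offset]  # the next task to compute is "D"
```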
In a specific implementation, referring to FIG. 8, the normal data stream computation mode means that the upstream operator (operator 2) sends data downstream. The downstream operator is responsible for deduplication to prevent repeated processing: if the data is a resend, it need not be processed; otherwise, after receiving the data, the downstream operator processes it according to the business logic and first writes the result into the WAL. In this way, idempotence is guaranteed no matter how many times the job is restarted or recovered. After the log write succeeds, if the computation is stateful, the external distributed state store is called to update the state; updating the state and writing the log form an atomic operation that either succeeds or fails as a whole. Once the data state is updated, a message is sent to the upstream operator confirming that the data sent by that operator has been processed, and the upstream operator records in the WAL the offset of the position of the successfully processed data.
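The FIG. 8 loop can be sketched as a small downstream handler; names, the business logic, and the rollback-based atomicity model are illustrative assumptions, not the patent's implementation.

```python
# Sketch of the normal mode: dedup by offset, write WAL + update state as
# one all-or-nothing step, then ack so the upstream can record the offset.

class Downstream:
    def __init__(self):
        self.processed_offset = -1   # highest offset already handled
        self.wal = []                # (offset, result) pairs
        self.state = {"count": 0}    # stands in for the external state store

    def on_data(self, offset, record):
        if offset <= self.processed_offset:
            return "duplicate"       # resent data: skip processing, still confirm
        result = record * 2          # toy business logic
        # WAL write and state update modelled as an atomic pair: roll back
        # the WAL entry if the state update fails, so they succeed or fail together.
        self.wal.append((offset, result))
        try:
            self.state["count"] += 1
        except Exception:
            self.wal.pop()
            raise
        self.processed_offset = offset
        return "ack"                 # upstream records this offset as confirmed

down = Downstream()
acks = [down.on_data(0, 10), down.on_data(1, 20), down.on_data(1, 20)]
```

The third delivery reuses offset 1, so the handler recognizes it as a resend: the state is not double-counted, which is what makes restarts and recoveries idempotent.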
Further, performing stream computation on the replay point data based on the target operator includes:

determining, by the target operator, whether the replay point data is duplicate data based on the data offset in the write-ahead log; and

when the replay point data is not duplicate data, executing the step of performing data stream computation on the replay point data based on the target operator.
In a specific implementation, determining whether the replay point data is duplicate data may involve checking whether its position in the write-ahead log corresponds to the data offset. For example, if a stream computing job contains four data computation tasks A, B, C, and D and the job's offset is 2, then when the replay point data is at C's position, it is not duplicate data; when the job's offset is 2 and the replay point data is at B's or A's position, it is duplicate data.
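Under the convention assumed in that example (the offset names the next position to process), the duplicate check reduces to a single comparison:

```python
# Hedged sketch: a record is a duplicate iff its WAL position lies strictly
# before the job's confirmed offset.

tasks = ["A", "B", "C", "D"]
job_offset = 2  # positions 0 and 1 (A, B) are already processed

def is_duplicate(position, offset):
    # Data strictly before the confirmed offset has already been processed.
    return position < offset

results = {t: is_duplicate(i, job_offset) for i, t in enumerate(tasks)}
# A and B are duplicates; C (the replay point) and D still need processing.
```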
Further, when the replay point data is duplicate data, reminder information is generated based on the target operator and fed back to the upstream operator;

the data offset of the replay point data in the write-ahead log is adjusted; and

the upstream operator reads, according to the reminder information, the to-be-processed data corresponding to the adjusted data offset from the external storage database, and delivers the to-be-processed data to the target operator for data stream computation.
It should be noted that in this embodiment, besides resending data after receiving the reminder information from the target operator, the upstream operator can also detect whether stream-computation-completion information from the target operator is received within a certain time. If it is not, there is an exception in one of the following: the upstream operator delivering data, the operator connection, writing the stream computation result to the write-ahead log, the state update, or the notification to the upstream operator; the occurrence of any of these exceptions affects the data stream computation process.

Therefore, if the upstream does not receive the downstream's processing-completion confirmation within a certain time, upstream operator 2 resends the data, and downstream operator 3 is responsible for determining whether it is duplicate data. If it is duplicate, operator 3 notifies the upstream that the data has already been processed; if the downstream operator considers it non-duplicate, the data is processed according to the normal data processing flow.
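This timeout-and-resend handshake can be sketched as a small retry loop; the function names, the retry bound, and the flaky downstream are all hypothetical, and the safety of resending rests on the downstream's deduplication described above.

```python
# Illustrative sketch of the resend handshake between upstream operator 2
# and downstream operator 3: resend until acked; dedup makes resends harmless.

def deliver_with_retry(send, max_attempts=3):
    """Resend the record until the downstream confirms processing."""
    for attempt in range(max_attempts):
        ack = send(attempt)
        if ack == "ack":
            return attempt + 1       # number of sends that were needed
    raise TimeoutError("no confirmation from downstream")

# A downstream whose first confirmation is lost (simulating a timeout):
seen = []
def flaky_send(attempt):
    seen.append(attempt)
    return "ack" if attempt >= 1 else None  # first attempt goes unconfirmed

sends = deliver_with_retry(flaky_send)
```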
Based on the replay point data and its corresponding state information, this embodiment performs data stream computation so that state changes do not cause inconsistent results before and after recovery, preserving the idempotence of the data processing job and improving its accuracy.
In addition, an embodiment of the present invention further provides a storage medium storing a data stream computing anomaly recovery program which, when executed by a processor, implements the steps of the data stream computing anomaly recovery method described above.

Since this storage medium adopts all the technical solutions of all the above embodiments, it has at least all the beneficial effects brought by those solutions, which are not repeated one by one here.
Referring to FIG. 9, FIG. 9 is a structural block diagram of a first embodiment of a data stream computing anomaly recovery device of the present invention.

As shown in FIG. 9, the data stream computing anomaly recovery device provided by the embodiment of the present invention includes:

an acquisition module 10, configured to determine the abnormal operator and the corresponding adjacent operators when an exception occurs in data processing;

a calling module 20, configured to call the target operator corresponding to the abnormal operator and establish connections between the target operator and the adjacent operators;

a reading module 30, configured to read, based on the adjacent operator, the replay point data in the write-ahead log and deliver the replay point data to the target operator; and

a computation module 40, configured to perform data stream computation on the replay point data based on the target operator.
In an embodiment, the computation module 40 is further configured to read, based on the target operator, the state information corresponding to the replay point data from the external storage database, and to perform, based on the target operator, data stream computation according to the replay point data and the state information.

In an embodiment, the computation module 40 is further configured to write the stream computation result into the write-ahead log, and to call the external storage database to update the state information corresponding to the replay point data.

In an embodiment, the computation module 40 is further configured to send, based on the target operator, confirmation information to the upstream operator, and to adjust, based on the upstream operator, the data offset of the replay point data in the write-ahead log according to the confirmation information and execute the normal data stream computation mode.

In an embodiment, the computation module 40 is further configured to determine, by the target operator, whether the replay point data is duplicate data based on the data offset in the write-ahead log, and, when the replay point data is not duplicate data, to execute the step of performing data stream computation on the replay point data based on the target operator.

In an embodiment, the computation module 40 is further configured to, when the replay point data is duplicate data, generate reminder information based on the target operator and feed the reminder information back to the upstream operator; adjust the data offset of the replay point data in the write-ahead log; and have the upstream operator read, according to the reminder information, the to-be-processed data corresponding to the adjusted data offset from the external storage database and deliver it to the target operator for data stream computation.

In an embodiment, the computation module 40 is further configured to query, based on the adjacent operator, the historical data offset in the write-ahead log, and to read the replay point data from the write-ahead log according to the historical data offset.
In this embodiment, when a stream computation becomes abnormal, the abnormal operator and its adjacent operators are determined, the target operator corresponding to the abnormal operator is called, and the target operator replaces the abnormal operator in establishing connections with the adjacent operators, keeping the overall data processing executable. Meanwhile, the adjacent operator reads the replay point data from the write-ahead log and delivers it to the target operator, so that the target operator can perform stream computation on the replay point data and recompute the abnormal stream computation results without recomputing all the data. This avoids the prior-art problem of low recomputation efficiency when a Flink-based real-time data stream computation fails, and shortens the recovery cycle for stream computation exceptions.
It should be understood that although the steps in the flowcharts of the embodiments of the present application are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restriction on their execution, and they may be executed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or stages; these sub-steps or stages are not necessarily completed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential — they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

It should be understood that the above is merely illustrative and does not constitute any limitation on the technical solution of the present invention; in specific applications, those skilled in the art can make configurations as needed, and the present invention imposes no limitation on this.

It should be noted that the workflow described above is merely illustrative and does not limit the protection scope of the present invention; in practical applications, those skilled in the art may select some or all of it according to actual needs to achieve the purpose of this embodiment, without limitation here.

In addition, for technical details not described exhaustively in this embodiment, reference may be made to the data stream computing anomaly recovery method provided by any embodiment of the present invention, which is not repeated here.

Furthermore, it should be noted that, as used herein, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the statement "comprises a ..." does not exclude the presence of other identical elements in the process, method, article, or system that includes that element.

The above serial numbers of the embodiments of the present invention are for description only and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as read-only memory (ROM)/RAM, a magnetic disk, or an optical disc), including several instructions for causing a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention; any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310884253.9A CN117009139A (en) | 2023-07-18 | 2023-07-18 | Data stream computing anomaly recovery method, device, equipment and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN117009139A true CN117009139A (en) | 2023-11-07 |
Family
ID=88570291
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310884253.9A Pending CN117009139A (en) | 2023-07-18 | 2023-07-18 | Data stream computing anomaly recovery method, device, equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117009139A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117950764A (en) * | 2024-01-31 | 2024-04-30 | 中国兵器工业计算机应用技术研究所 | Task breakpoint re-running method and system based on intermediate data |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| WO2021217986A1 (en) | Distributed message transmission method and apparatus, computer device and storage medium | |
| JP5075736B2 (en) | System failure recovery method and system for virtual server | |
| CN113157710B (en) | Block chain data parallel writing method and device, computer equipment and storage medium | |
| CN113849348A (en) | System and method for processing events of an event stream | |
| US11106553B2 (en) | System for increasing intra-application processing efficiency by transmitting failed processing work over a processing recovery network for resolution | |
| US10180881B2 (en) | System for increasing inter-application processing efficiency by transmitting failed processing work over a processing recovery network for resolution | |
| US8417669B2 (en) | Auto-correction in database replication | |
| CN112132652B (en) | Order information acquisition method, device, computer equipment and storage medium | |
| CN110765148B (en) | A business data processing method and device | |
| CN116610752A (en) | Transactional distributed data synchronization method, device, system and storage medium | |
| CN111240891A (en) | Data recovery method and device based on data consistency among multiple tables of database | |
| Luo et al. | Lazylog: A new shared log abstraction for low-latency applications | |
| CN117009139A (en) | Data stream computing anomaly recovery method, device, equipment and storage medium | |
| CN111159300A (en) | Data processing method and device based on block chain | |
| CN114637646A (en) | Data monitoring method and device, computer equipment and storage medium | |
| JP6613315B2 (en) | Transaction processing system and transaction control method | |
| CN114691639B (en) | Real-time streaming data processing method, device, computer equipment and storage medium | |
| CN113568890A (en) | Data migration method and device, electronic equipment and computer readable storage medium | |
| US20220121390A1 (en) | Accelerated non-volatile memory device inspection and forensics | |
| CN112148762A (en) | Statistical method and device for real-time data flow | |
| CN115033351A (en) | A distributed transaction compensation method and related device | |
| JP5466740B2 (en) | System failure recovery method and system for virtual server | |
| CN115220957A (en) | A thread restart method and related equipment | |
| CN114385230A (en) | System construction method, apparatus, computer device, storage medium, and program product | |
| CN116643733B (en) | Service processing system and method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |