CN118467217A - Information recording method, apparatus, electronic device, and computer-readable storage medium - Google Patents
- Publication number
- CN118467217A (application number CN202410444329.0A)
- Authority
- CN
- China
- Prior art keywords
- node
- information
- taskmanager
- job
- jobmanager
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The disclosure provides an information recording method, an information recording apparatus, an electronic device and a computer-readable storage medium, and relates to the field of information maintenance. The method comprises the following steps: while a Flink job is running, recording in real time first information of each TASKMANAGER node executing the job; triggering an upload signal when a TASKMANAGER node exits due to a fault while the Flink job is running; and uploading, based on the upload signal, error information generated by the TASKMANAGER node during the Flink job run to the JobManager node, so that the JobManager node stores the error information in the first information. By this method, the first information of the TASKMANAGER nodes can be recorded in real time, and error information is uploaded when a fault occurs.
Description
Technical Field
The present disclosure relates to the field of information maintenance, and in particular, to an information recording method, an information recording device, an electronic device, and a computer readable storage medium.
Background
APACHE FLINK is an open-source stream and batch processing framework that provides low-latency, high-throughput and Exactly-Once semantic data processing capabilities on unbounded and bounded data streams. It is maintained by the Apache Software Foundation. At job run time, a Flink cluster is made up of two core components: JobManager nodes and TASKMANAGER nodes.
However, the related art has a disadvantage in locating the cause of a TASKMANAGER node fault exit at Flink job run time. Although the Flink cluster can acquire the reason for an abnormality when a TASKMANAGER node throws an exception, the cause of a crash cannot be accurately acquired when the TASKMANAGER node process crashes and exits. A new TASKMANAGER node may be started to replace the original one, so the log produced at the time of the crash is difficult to find and check. For a stopped Flink job, it is also not ensured that the historical log is kept intact. Therefore, a solution is needed that can conveniently locate the error information and history log of a TASKMANAGER node crash exit.
Disclosure of Invention
The embodiment of the disclosure provides an information recording method, an information recording device, an electronic device and a computer readable storage medium, which aim to solve the problems in the background art.
In order to solve the above technical problems, the present disclosure is implemented as follows:
In a first aspect, an embodiment of the present disclosure provides an information recording method, which is applied to a Flink cluster, where the Flink cluster includes a TASKMANAGER node, a JobManager node, an error information uploading module and a history information recording module, the error information uploading module is deployed on the TASKMANAGER node, and the history information recording module is deployed on the JobManager node, and the method includes:
When a Flink job runs, recording, in real time by the history information recording module, first information of each TASKMANAGER node executing the Flink job, wherein the first information comprises the content necessary for tracing back the running log information of the TASKMANAGER nodes running the Flink job, and the first information is used for responding to information queries and/or problem tracing from a user side;
triggering an upload signal when a TASKMANAGER node exits due to a fault while the Flink job is running;
uploading, based on the upload signal and through the error information uploading module, error information generated by the TASKMANAGER node during the Flink job run to the JobManager node, so that the JobManager node stores the error information in the first information.
Optionally, the first information includes operation attribute information, operation state information and positioning information, where the operation state information is used to represent the running states of current TASKMANAGER nodes and historical TASKMANAGER nodes, and the operation state information includes: running, normal exit and abnormal exit, and the recording, in real time by the history information recording module, of first information of each TASKMANAGER node executing the Flink job includes:
storing the operation attribute information of each TASKMANAGER node according to the different scheduling frameworks of the Flink job;
detecting the running state information of each TASKMANAGER node for executing the Flink operation in real time, and updating the running state information in a running state table;
recording positioning information of each TASKMANAGER node, wherein the positioning information is used for describing a mode of tracing first information of the TASKMANAGER node;
the method further comprises the steps of:
receiving a TASKMANAGER node state query request of the user side;
And displaying the running state table on a job monitoring page of the JobManager node based on the TASKMANAGER node state query request so that the user side obtains the TASKMANAGER node which is running and/or the TASKMANAGER node which is stopped, wherein the TASKMANAGER node which is stopped comprises the TASKMANAGER node which is normally stopped and the TASKMANAGER node which is abnormally stopped.
Optionally, the storing of the operation attribute information of each TASKMANAGER node according to the different scheduling frameworks of the Flink job includes:
When the scheduling framework is a Yarn schedule, first operation attribute information in the Yarn schedule is stored, wherein the first operation attribute information comprises: the host name of the TASKMANAGER node when running, the container ID of the TASKMANAGER node when running, the user name of the Flink job, the IP of the resource manager node in the Yarn schedule, the port number of the management page service of the resource manager node in the Yarn schedule, wherein the container is packaged with the Flink job, the running environment of the Flink job and required resources, and the TASKMANAGER node executes the Flink job in the container;
The method further comprises the steps of: constructing a uniform resource locator based on the name of each item of information contained in the first operation attribute information, wherein the uniform resource locator is used for acquiring the first operation attribute information of the TASKMANAGER node;
When the scheduling framework is Kubernetes scheduling, second operation attribute information in the Kubernetes scheduling is stored, wherein the second operation attribute information comprises: the Flink job submits the name space to which the Flink job belongs and the name of the unit where the TASKMANAGER node runs;
The method further comprises the steps of: when the TASKMANAGER node is determined to be in fault exit under the condition that the Flink job is operated, calling a command to acquire second operation attribute information of the TASKMANAGER node; and when determining that the Flink job is stopped, positioning log information of the TASKMANAGER nodes exported by a log synchronization tool based on the second operation attribute information.
Optionally, before the process of the TASKMANAGER node exits under the condition that the Flink job runs, the method further includes:
calling a hook function of the Java virtual machine before the Java virtual machine running the TASKMANAGER node exits;
And collecting and uploading error environment information when the Java virtual machine runs based on the content of the hook function.
Optionally, the error information includes fault exception information, the memory usage of the Java virtual machine, and the starting parameters, starting command and path information of the Java virtual machine, and before the error information generated by the TASKMANAGER node during the Flink job run is uploaded to the JobManager node by the error information uploading module, the method includes:
when the TASKMANAGER node normally operates, collecting starting parameters, starting commands and path information of the Java virtual machine after the TASKMANAGER node is started;
When the process of the TASKMANAGER node fails, collecting the failure exception information and the memory use condition of the Java virtual machine, wherein the failure exception information comprises error stack information, and the memory use condition of the Java virtual machine is the memory use condition of the Java virtual machine when the TASKMANAGER node fails;
The uploading, by the error information uploading module, of the error information generated by the TASKMANAGER node when the Flink job runs to the JobManager node includes:
and when the JobManager node is available, uploading the fault exception information, the Java virtual machine memory use condition, the starting parameters of the Java virtual machine, the starting command and the path information to the JobManager node through the error information uploading module.
Optionally, the method further comprises:
when the JobManager node is not available, discarding uploading the error information and accelerating the process termination of the TASKMANAGER node;
and after the JobManager nodes are restored to be available, responding to information inquiry and/or problem tracing of the user side based on the first information of the TASKMANAGER nodes which are exited by the faults and recorded by the historical information recording module in real time.
Optionally, the uploading, by the error information uploading module, of the error information generated by the TASKMANAGER node when the Flink job runs to the JobManager node includes:
The RPC interface based on the RPC protocol uploads the error information to the JobManager node;
the method further comprises the steps of:
After the JobManager node receives the error information, the error information is displayed on a job monitoring page of the JobManager node, so that a user side can view and analyze the error information.
In a second aspect, an embodiment of the present disclosure provides an information recording apparatus, which is applied to a Flink cluster, where the Flink cluster includes a TASKMANAGER node, a JobManager node, an error information uploading module and a history information recording module, the error information uploading module is deployed on the TASKMANAGER node, and the history information recording module is deployed on the JobManager node, and the apparatus includes:
The first information recording module is used for recording, through the history information recording module when the Flink job is running, first information of each TASKMANAGER node executing the Flink job, wherein the first information comprises the content necessary for tracing back the running log information of the TASKMANAGER nodes running the Flink job, and the first information is used for responding to information queries and/or problem tracing from a user side;
the upload signal triggering module is used for triggering an upload signal when the TASKMANAGER node exits due to a fault while the Flink job is running;
and the uploading module is used for uploading, based on the upload signal and through the error information uploading module, error information generated by the TASKMANAGER node during the Flink job run to the JobManager node, so that the JobManager node stores the error information in the operation log information.
Optionally, the first information includes operation attribute information, operation state information, and positioning information, where the operation state information is used to represent current TASKMANAGER nodes and historical TASKMANAGER nodes operation states, and the operation state information includes: running, normal exiting and abnormal exiting, the first information recording module includes:
The operation attribute storage module is used for storing operation attribute information of each TASKMANAGER node according to the different scheduling frameworks of the Flink job;
The real-time detection module is used for detecting the running state information of each TASKMANAGER node for executing the Flink operation in real time and updating the running state information in a running state table;
The positioning information recording module is used for recording the positioning information of each TASKMANAGER node, and the positioning information is used for describing the mode of tracing the first information of the TASKMANAGER node;
The apparatus further comprises:
The receiving query request module is used for receiving a TASKMANAGER node state query request of the user side;
and the running state display module is used for displaying the running state table on the job monitoring page of the JobManager node based on the TASKMANAGER node state query request so that the user side obtains the TASKMANAGER node which is running and/or the TASKMANAGER node which is stopped, wherein the TASKMANAGER node which is stopped comprises the TASKMANAGER node which is normally stopped and the TASKMANAGER node which is abnormally stopped.
Optionally, the save operation attribute module includes:
The first operation attribute saving module is used for saving first operation attribute information in the Yarn schedule when the scheduling framework is the Yarn schedule, and the first operation attribute information comprises: the host name of the TASKMANAGER node when running, the container ID of the TASKMANAGER node when running, the user name of the Flink job, the IP of the resource manager node in the Yarn schedule, the port number of the management page service of the resource manager node in the Yarn schedule, wherein the container is packaged with the Flink job, the running environment of the Flink job and required resources, and the TASKMANAGER node executes the Flink job in the container;
The apparatus further comprises:
The construction module is used for constructing a uniform resource locator based on the name of each item of information contained in the first operation attribute information, and the uniform resource locator is used for acquiring the first operation attribute information of the TASKMANAGER node;
the second operation attribute saving module is configured to save second operation attribute information in Kubernetes scheduling when the scheduling framework is the Kubernetes scheduling, where the second operation attribute information includes: the Flink job submits the name space to which the Flink job belongs and the name of the unit where the TASKMANAGER node runs;
The apparatus further comprises:
The second operation attribute acquisition module is used for calling a command to acquire second operation attribute information of the TASKMANAGER node when determining that the TASKMANAGER node is out of order under the condition that the Flink job is operated; and when determining that the Flink job is stopped, positioning log information of the TASKMANAGER nodes exported by a log synchronization tool based on the second operation attribute information.
Optionally, the apparatus further comprises:
The monitoring module is used for calling a hook function of the Java virtual machine before the Java virtual machine running the TASKMANAGER node exits;
and the collection and uploading module is used for collecting and uploading error environment information when the Java virtual machine runs based on the content of the hook function.
Optionally, the error information includes fault exception information, java virtual machine memory usage, and starting parameters, starting commands, and path information of the Java virtual machine, and the apparatus further includes:
The first collecting module is used for collecting starting parameters, starting commands and path information of the Java virtual machine after the starting of the TASKMANAGER node when the TASKMANAGER node operates normally;
The second collecting module is used for collecting the fault abnormal information and the memory use condition of the Java virtual machine when the process of the TASKMANAGER node is faulty, wherein the fault abnormal information comprises error stack information, and the memory use condition of the Java virtual machine is the memory use condition of the Java virtual machine when the TASKMANAGER node is faulty;
the uploading module comprises:
And the parameter uploading module is used for uploading the fault abnormal information, the Java virtual machine memory use condition, the starting parameters, the starting command and the path information of the Java virtual machine to the JobManager node through the error information uploading module when the JobManager node is available.
Optionally, the apparatus further comprises:
the termination uploading module is used for giving up uploading the error information and accelerating the process termination of the TASKMANAGER node when the JobManager node is unavailable;
And the corresponding module is used for responding to the information inquiry and/or problem tracing of the user side based on the first information of the TASKMANAGER node which is exited by the fault and recorded by the historical information recording module in real time after the JobManager node is recovered to be available.
The uploading module comprises:
The RPC uploading module is used for uploading the error information to the JobManager node based on an RPC interface of an RPC protocol;
The apparatus further comprises:
And the display module is used for displaying the error information on the job monitoring page of the JobManager node after the JobManager node receives the error information, so that a user side can view and analyze the error information.
In a third aspect, the present disclosure proposes an electronic device comprising: a processor, a memory and a computer program stored on the memory and capable of running on the processor, which computer program, when being executed by the processor, implements the steps of the information recording method.
In a fourth aspect, the disclosed embodiments provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of an information recording method.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
The first information of each TASKMANAGER node executing the Flink job can be recorded in real time through the history information recording module. The first information comprises the content necessary for tracing back the running log information of the TASKMANAGER nodes of the Flink job, and can be used for responding to information queries and problem tracing from the user side. When a TASKMANAGER node exits due to a fault while the Flink job is running, an upload signal is triggered. In addition, through the error information uploading module, the error information generated by the TASKMANAGER node during the Flink job run is uploaded to the JobManager node. In this way, the JobManager node can save the error information in the first information in real time. By this method, the first information of the TASKMANAGER nodes can be recorded in real time and error information is uploaded at fault time, so that the user can conveniently query and trace back the running condition of the Flink job, and the efficiency of fault detection and problem locating is improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a system architecture diagram of a Flink cluster provided in one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of steps of an information recording method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of method steps implemented on a historian module provided in accordance with one embodiment of the present disclosure;
FIG. 4 is a flowchart of method steps implemented on an error information upload module provided by one embodiment of the present disclosure;
fig. 5 is a block diagram of an information recording apparatus according to an embodiment of the present disclosure;
Fig. 6 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
The disclosure provides an information recording method, which is applied to a Flink cluster, wherein the Flink cluster comprises TASKMANAGER nodes, JobManager nodes, an error information uploading module and a history information recording module, the error information uploading module is deployed on the TASKMANAGER nodes, and the history information recording module is deployed on the JobManager nodes. Fig. 1 is a system architecture diagram of a Flink cluster according to an embodiment of the present disclosure. As shown in Fig. 1, the Flink cluster system in the present disclosure includes the following components: a TASKMANAGER node, a JobManager node, an error information uploading module and a history information recording module. The error information uploading module is deployed on the TASKMANAGER node, and the history information recording module is deployed on the JobManager node. The TASKMANAGER node is a worker node in the Flink cluster and is responsible for executing the actual data processing tasks. The error information uploading module is used for collecting error information and uploading it to the JobManager node before the TASKMANAGER node exits due to a fault. The JobManager node plays the role of a control node in the Flink cluster and is responsible for managing and coordinating the execution lifecycle of the whole application program. The first information refers to a log file in which the history information recording module records the running state of the Flink job and related information while the Flink job runs. The key information includes what is necessary to trace back the log information of the TASKMANAGER nodes running the Flink job. The purpose of the first information is to respond to the information query and/or problem tracing requirements of the user side. By recording the first information, a user can conveniently query and trace back the running condition and log information of the Flink job on each TASKMANAGER node for fault investigation and problem analysis. The history information recording module records the first information of each TASKMANAGER node executing the Flink job in real time. Through the error information uploading module, the error information can be collected in time before the TASKMANAGER node exits due to a fault, and through the history information recording module, the first information of the TASKMANAGER node can be recorded in real time. The Flink cluster system can therefore effectively realize the uploading of error information and the recording of running logs, and improves the efficiency of fault detection and problem locating.
The present disclosure may be applied to task management and troubleshooting scenarios in a Flink cluster. Specifically, when a Flink job runs in a Yarn cluster, a TASKMANAGER node may exit due to a fault caused by network fluctuations, illegal inputs, dirty data, insufficient resources, node failure, and so on. In this case, the error information uploading module and the history information recording module provided by the present disclosure can help the user to quickly locate and solve the problem. For example, in a real-time data processing scenario, the error information uploading module can collect error information and upload it to the JobManager node before a TASKMANAGER node of the Flink job exits. Thus, the user can acquire detailed error information through the JobManager node and rapidly locate the root cause of the fault. In addition, in the case that the Flink job has already exited, the history information recording module has recorded the first information of each TASKMANAGER node in real time. Therefore, even after the job has stopped running, the user can query and trace back the running state and log information of the historical TASKMANAGER nodes through the JobManager node, which facilitates fault detection and problem locating. The two modules are described in detail separately below.
Fig. 2 is a schematic diagram of steps of an information recording method according to an embodiment of the present disclosure, as shown in fig. 2, where the method includes:
Step S101, when a Flink job runs, first information of each TASKMANAGER node for executing the Flink job is recorded in real time through the historical information recording module, wherein the first information comprises content necessary for tracing running log information of TASKMANAGER nodes for running the Flink job, and the first information is used for responding to information inquiry and/or problem tracing of a user side;
step S102, triggering an upload signal when a TASKMANAGER node exits due to a fault while the Flink job is running;
and step S103, uploading, based on the upload signal and through the error information uploading module, error information generated by the TASKMANAGER node during the Flink job run to the JobManager node, so that the JobManager node stores the error information in the first information.
Referring to step S101, when the Flink job is running, the history information recording module records the first information of each TASKMANAGER node in real time. The first information includes the content necessary for tracing back the running of the Flink job, such as the status of the node, the running condition, error information, and so on. By recording the first information, the user can obtain detailed information about the running log information of the TASKMANAGER nodes through the query and trace-back functions, so that the user can analyze and solve problems according to this information and quickly locate the cause of a fault. In addition, the history information recording module can also respond to information queries and problem tracing from the user side. A user may query the first information of a particular TASKMANAGER node by interacting with the JobManager node. Through the provided query interface, a user can acquire the first information of a specific time period and a specific node according to their own requirements, for problem tracing and troubleshooting.
Referring to step S102, when the Flink job is running, if a TASKMANAGER node exits due to a fault, the error information uploading module triggers an upload signal. The purpose of this upload signal is to upload error information and environment information to the JobManager node before the TASKMANAGER node exits. When the TASKMANAGER node is about to exit due to a fault, the error information uploading module binds an exit hook function (shutdown hook). This shutdown hook is triggered immediately before the TASKMANAGER node exits, collecting and sorting the error information. In a Flink cluster, when the TASKMANAGER node is about to exit, the shutdown hook of the JVM (the Java virtual machine in which the TASKMANAGER node runs) is called. The shutdown hook is not invoked only in the extreme cases of a system crash, system power-down, or a process forcibly ended with the kill -9 command; in most cases, such as when the Java program exits with an error or the TASKMANAGER process is killed by Yarn, monitoring the shutdown hook call is a feasible way to detect the TASKMANAGER node exit.
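As an illustration of the shutdown-hook mechanism described above, the following Java sketch registers such a hook on the TASKMANAGER side; the class and interface names are assumptions made for this example and are not taken from the Flink code base.

```java
// Illustrative sketch only: binding a JVM shutdown hook so that error
// information is gathered just before the TASKMANAGER process exits.
public final class ErrorUploadShutdownHook {

    /** Minimal callback the error information uploading module could implement (hypothetical). */
    public interface ErrorInfoUploader {
        void collectAndUpload();
    }

    public static void install(ErrorInfoUploader uploader) {
        // The hook runs on normal or erroneous JVM termination; it is not
        // invoked on a hard crash, power loss, or a forced "kill -9".
        Runtime.getRuntime().addShutdownHook(
                new Thread(uploader::collectAndUpload, "taskmanager-error-upload-hook"));
    }
}
```

Registering the hook once at TASKMANAGER start-up is sufficient; the JVM then calls it on most exit paths without further coordination.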
Referring to step S103, when the TASKMANAGER node is about to exit due to a fault, the error information uploading module triggers an upload signal. This upload signal informs the error information uploading module to send the collected error information to the JobManager node. After receiving the error information, the JobManager node stores the error information in the first information. As mentioned above, the first information is a log file used by the JobManager node to record the running status of the Flink job and related information. These log files can be viewed and analyzed by the user for troubleshooting and problem locating. By storing the error information in the first information, the user can acquire the error information of the TASKMANAGER node by querying the first information while the job is running or after the job has stopped. Thus, even after the Flink job has stopped, the user can conveniently locate the reason for the abnormal exit of the TASKMANAGER node and carry out fault detection and problem tracing.
For example, assume that a Flink job runs on a cluster of 3 TASKMANAGER nodes. When the job is executed on the three TASKMANAGER nodes, the history information recording module records the status, operation and error information of each node. If, during operation of the job, one TASKMANAGER node fails and exits, the error information uploading module triggers an upload signal. The upload signal informs the error information uploading module to upload the error information generated by that TASKMANAGER node during the job run to the JobManager node. After receiving the error information, the JobManager node stores the error information in the first information. For example, the JobManager node may save the error information of the TASKMANAGER node in the first information for the user to query while the job is running or after the job has stopped. By querying the first information, the user may obtain the error information of the TASKMANAGER node, such as error stack information, the abnormal situation, and so on. Thus, the user can carry out fault detection and problem locating according to this information and find out the reason for the TASKMANAGER node fault. Through the history information recording module and the error information uploading module, a user can acquire detailed information and error information of the TASKMANAGER nodes by querying the first information on the JobManager node, either while the Flink job is running or after the job has stopped, so as to perform fault investigation and problem tracing.
Illustratively, the first information includes operation attribute information, running state information and positioning information, the running state information being used to represent the running states of current TASKMANAGER nodes and historical TASKMANAGER nodes, and the running state information including: running, normal exit and abnormal exit, wherein the recording, in real time by the history information recording module, of the first information of each TASKMANAGER node executing the Flink job includes: storing the operation attribute information of each TASKMANAGER node according to the different scheduling frameworks of the Flink job; detecting in real time the running state information of each TASKMANAGER node executing the Flink job, and updating the running state information in a running state table; and recording the positioning information of each TASKMANAGER node, wherein the positioning information is used for describing the manner of tracing back the first information of the TASKMANAGER node.
The history information recording module can record in real time the first information of each TASKMANAGER node executing the Flink job, where the first information comprises operation attribute information, running state information and positioning information. For example, suppose a Flink job runs on a cluster of 3 TASKMANAGER nodes. Each TASKMANAGER node has some specific operation attribute information, such as the node's IP address, host name, container ID, and so on. The history information recording module stores the operation attribute information of each TASKMANAGER node according to the scheduling framework used (such as Yarn or Kubernetes). Meanwhile, the history information recording module detects the running state information of each TASKMANAGER node in real time and updates it in a running state table; by querying the running state table, the user can obtain the real-time running state information of each TASKMANAGER node, where the running state information is used to represent the running states of the current TASKMANAGER nodes and the historical TASKMANAGER nodes. The running state information covers three states: running, normal exit and abnormal exit. Running indicates that the TASKMANAGER node is currently operating normally. Normal exit means that the TASKMANAGER node exited normally after completing its task. Abnormal exit indicates that an abnormal condition occurred during operation and caused the TASKMANAGER node to exit. Such running state information helps the monitoring and management system understand the state of the TASKMANAGER nodes and perform corresponding processing and analysis. In addition, the history information recording module records the positioning information of each TASKMANAGER node so that the user can trace back to the first information of a specific TASKMANAGER node. The positioning information can differ between scheduling frameworks, because different scheduling frameworks have different characteristics and requirements, and the operation attribute information of the TASKMANAGER nodes is adapted to the scheduling framework in use. Through the positioning information, a user can conveniently trace back to the first information of a specific TASKMANAGER node for fault investigation and problem tracing.
The historical information recording module is used for recording the first information of each TASKMANAGER node in real time, including operation attribute information, operation state information and positioning information, so that a user is provided with convenient inquiry and tracing functions, and fault detection and problem positioning can be conveniently carried out.
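A minimal sketch of how the first information of a TASKMANAGER node could be modelled inside the history information recording module is given below; the field and type names are illustrative assumptions, not the actual data structures of this disclosure.

```java
// Hypothetical record held per TASKMANAGER node by the history information
// recording module: attributes, running state and a locator for trace-back.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class TaskManagerRecord {

    /** Running state values described in this disclosure. */
    public enum RunState { RUNNING, NORMAL_EXIT, ABNORMAL_EXIT }

    public String taskManagerId;
    public Map<String, String> operationAttributes = new ConcurrentHashMap<>(); // host name, container ID, ...
    public volatile RunState state = RunState.RUNNING;  // mirrored into the running state table
    public String locator;                               // how to trace back the first information (e.g. a URL)
    public String errorInfo;                             // filled in when a fault exit is reported
}
```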
Illustratively, the method further comprises: receiving a TASKMANAGER node state query request of the user side;
And displaying the running state table on a job monitoring page of the JobManager node based on the TASKMANAGER node state query request so that the user side obtains the TASKMANAGER node which is running and/or the TASKMANAGER node which is stopped, wherein the TASKMANAGER node which is stopped comprises the TASKMANAGER node which is normally stopped and the TASKMANAGER node which is abnormally stopped.
The JobManager node is ready to receive, from the user side, requests to query the state information of the TASKMANAGER nodes. The JobManager node performs corresponding processing according to the received query request and displays the contents of the running state table on the job monitoring page. On the job monitoring page, the user can see the running state information of each TASKMANAGER node. For example, if a TASKMANAGER node is running, it is marked as "running"; if a TASKMANAGER node has stopped running, it is marked as "stopped". Through the job monitoring page, the user can conveniently obtain the information of the running TASKMANAGER nodes and the stopped TASKMANAGER nodes, where the stopped TASKMANAGER nodes cover two cases: TASKMANAGER nodes that stopped normally and TASKMANAGER nodes that stopped abnormally. A normally stopped TASKMANAGER node means that the TASKMANAGER node stopped after the Flink job finished executing or exited normally. An abnormally stopped TASKMANAGER node means that the TASKMANAGER node stopped abnormally during execution of the Flink job for various reasons (such as network faults, insufficient resources, and so on). The user can thus monitor the running state of the job in real time and know the state of the TASKMANAGER nodes in time. This improves the user's ability to visually monitor and manage the job, so that the user can better understand and master the running condition of the job.
Fig. 3 is a flowchart of the method steps implemented on the history information recording module according to an embodiment of the present disclosure. As shown in Fig. 3, when the JobManager node is alive, i.e. in the running state, the running state of the TASKMANAGER nodes is detected and recorded in real time by the history information recording module, and the corresponding operation attribute information is updated in real time and saved according to the scheduling framework of the Flink job, which will be described in detail later. Meanwhile, the JobManager node checks, through the history information recording module, whether the Flink job is about to exit; if so, the TASKMANAGER state information is written into the log of the JobManager node, and then the JobManager node exits.
Illustratively, the storing of the operation attribute information of each TASKMANAGER node according to the different scheduling frameworks of the Flink job includes: when the scheduling framework is Yarn scheduling, saving first operation attribute information in the Yarn scheduling, wherein the first operation attribute information comprises: the host name of the TASKMANAGER node at run time, the container ID of the TASKMANAGER node at run time, the user name of the Flink job, the IP of the resource manager node in the Yarn scheduling, and the port number of the management page service of the resource manager node in the Yarn scheduling, wherein the container encapsulates the Flink job, the running environment of the Flink job and the required resources, and the TASKMANAGER node executes the Flink job in the container.
In order to better track and monitor the execution of the job and provide query and problem-tracing functions for the user, the operation attribute information of each TASKMANAGER node is stored according to the different scheduling frameworks of the Flink job. Specifically, when the scheduling framework is Yarn scheduling, the saved first operation attribute information includes the following:
Host name where the TASKMANAGER node runs: the host name of each TASKMANAGER node is recorded, so that it can be accurately known on which nodes the job runs; container ID of the TASKMANAGER node at run time: the unique identification of each TASKMANAGER node's container is recorded, which makes it convenient to locate and find a specific TASKMANAGER node; user name of the Flink job: the user name under which the job runs is recorded and can be used for authority management and job tracing; IP of the resource manager node in the Yarn scheduling: the IP address of the Yarn resource manager node is recorded, allowing interaction with the resource manager to obtain the cluster state and resource allocation; port number of the management page service of the resource manager node in the Yarn scheduling: the port number of the management page service of the Yarn resource manager node is recorded, and the management page of the resource manager can be accessed through this port to check the cluster state and resource allocation. In addition, the container encapsulates the Flink job, its running environment and the required resources. The TASKMANAGER node executes the Flink job in the container, and the isolation and resource management of the job are realized in a containerized manner. By storing this operation attribute information, the system can track and monitor the execution of the job in real time and know the state and behavior of each TASKMANAGER node. Meanwhile, this information can also be provided to users for queries and problem tracing, helping users locate and solve problems during job execution.
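For illustration, the first operation attribute information listed above could be held in a simple structure such as the following Java sketch; the class and field names are assumptions made for this example.

```java
// Hypothetical container for the first operation attribute information saved
// per TASKMANAGER node under Yarn scheduling, as enumerated above.
public final class YarnRunAttributes {
    public String containerHost;        // host name where the TASKMANAGER node runs
    public String containerId;          // container ID of the TASKMANAGER node at run time
    public String userName;             // user name that submitted the Flink job
    public String resourceManagerIp;    // IP of the Yarn resource manager node
    public int resourceManagerWebPort;  // port of the resource manager management page service
}
```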
Illustratively, the method further comprises: constructing a uniform resource locator based on the name of each item of information contained in the first operation attribute information, wherein the uniform resource locator is used for acquiring the first operation attribute information of the TASKMANAGER node; when the scheduling framework is Kubernetes scheduling, second operation attribute information in the Kubernetes scheduling is stored, wherein the second operation attribute information comprises: the Flink job submits the name space to which it belongs and the name of the unit in which the TASKMANAGER node is running.
In addition to the steps of storing the first operational attribute information for the TASKMANAGER node described above, constructing a Uniform Resource Locator (URL) based on the information is included for obtaining the first operational attribute information for the TASKMANAGER node. Specifically, the constructed Uniform Resource Locator (URL) is used to access and obtain the first operation attribute information of the TASKMANAGER node, that is, the operation log of the TASKMANAGER node, according to the name of each item of information contained in the saved first operation attribute information. This URL may be constructed by concatenating the values of the various pieces of information to locate the running log to a particular TASKMANAGER node. For example, in the Yarn schedule, according to the hostname, container ID, user name, the IP of the Yarn resource manager node and the port number of the management page service in the first operation attribute information, the following URL may be constructed to obtain the operation log of the TASKMANAGER node:
http://<yarn-ip>:<yarn-port>/jobhistory/logs/<container-host>/<container-id>/<container-id>/<username>
By constructing such a URL, we can directly access the history log service of the Yarn dispatch framework to obtain the running log of a particular TASKMANAGER node. The log can help a system or a user to know the state, behavior and abnormal condition of the TASKMANAGER node in the running process, and is convenient for the user to conduct fault detection and problem analysis.
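A minimal sketch of constructing such a uniform resource locator from the saved Yarn attributes is shown below; the method name is illustrative, and the URL pattern simply follows the example given above.

```java
// Illustrative sketch: building the history-log URL by concatenating the
// saved first operation attribute information (container ID appears twice,
// as in the pattern above).
public final class YarnLogUrlBuilder {
    public static String buildLogUrl(String yarnIp, int yarnPort, String containerHost,
                                     String containerId, String userName) {
        return String.format("http://%s:%d/jobhistory/logs/%s/%s/%s/%s",
                yarnIp, yarnPort, containerHost, containerId, containerId, userName);
    }
}
```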
When the Kubernetes scheduling framework is used, the second operation attribute information is correspondingly saved, and this information comprises the namespace to which the Flink job submission belongs and the name of the unit in which the TASKMANAGER node runs. In Kubernetes, the namespace is a mechanism for isolating and organizing resources. Each Flink job submission specifies a namespace that identifies the scope to which the job belongs. Saving the namespace to which the Flink job submission belongs associates the TASKMANAGER nodes with a specific job, which facilitates subsequent management and queries. The name of a unit (pod-name) refers to the smallest, most basic computational unit that can be created and deployed in a Kubernetes cluster, representing a process or a set of closely related processes running on a cluster node. Each TASKMANAGER node runs in one unit; by saving the names of the units, the unit in which a TASKMANAGER node runs can be accurately located, so that the related running attribute information of that unit can be traced back. In the Kubernetes scheduling framework, the second operation attribute information of the TASKMANAGER node may be obtained by executing the following command:
kubectl logs <pod-name> --previous -n <namespace>
by executing this command, the TASKMANAGER node's travel log before the crash can be obtained. The command is used for tracing logs before the TASKMANAGER node crashes, so that a user can conveniently conduct fault detection and problem analysis.
Illustratively, the method further comprises: when it is determined that a TASKMANAGER node has exited due to a fault while the Flink job is running, calling a command to acquire the second operation attribute information of the TASKMANAGER node; and when it is determined that the Flink job has stopped, locating, based on the second operation attribute information, the log information of the TASKMANAGER nodes exported by a log synchronization tool.
Upon detecting that a TASKMANAGER node has exited due to a fault, action is taken immediately to obtain the second operation attribute information of the node. Specifically, by calling and executing the command kubectl describe pod <pod-name> -n <namespace>, the second operation attribute information of the TASKMANAGER node can be obtained. Upon determining that the Flink job has stopped, the log information of the TASKMANAGER nodes exported by the log synchronization tool is located. When a Flink job has stopped, the TASKMANAGER node logs associated with the job can be traced back and looked up based on the second operation attribute information; the log information of a historical TASKMANAGER node can be located through the log path or other access means derived by the log synchronization tool. Thus, even after the Flink job has stopped, the relevant information of the TASKMANAGER node can be obtained by looking up the exported logs and traced back for problem tracing and analysis. It should be noted that for a stopped Flink job, a scheduling framework such as Kubernetes does not guarantee that the historical pod logs will be kept intact. This means that Kubernetes cannot guarantee that the running logs of all historical pods are preserved when the Flink job stops, because the Kubernetes design does not focus on maintaining the execution logs of historical pods, but on container orchestration and management. Therefore, if the complete historical TASKMANAGER logs of a stopped Flink job are desired, the present disclosure proposes using the log synchronization tool to export the pod logs produced during running to other storage locations in real time, so as to trace back the first information of the historical TASKMANAGER nodes, improve the usability of the Flink cluster system, and facilitate fault investigation and problem analysis by the user.
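For illustration, the two kubectl commands mentioned above could be invoked programmatically as in the following Java sketch; the helper class is an assumption and error handling is kept minimal.

```java
// Sketch only: shelling out to kubectl to fetch the second operation attribute
// information (describe pod) and the pre-crash log (logs --previous) of a unit.
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public final class KubectlHelper {

    public static String run(String... command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
        String output = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        p.waitFor();
        return output;
    }

    public static String describePod(String podName, String namespace) throws Exception {
        return run("kubectl", "describe", "pod", podName, "-n", namespace);
    }

    public static String previousLog(String podName, String namespace) throws Exception {
        return run("kubectl", "logs", podName, "--previous", "-n", namespace);
    }
}
```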
Illustratively, before the TASKMANAGER node fails to exit in the case of the Flink job running, the method further includes: calling a hook function of the Java virtual machine before the Java virtual machine running the TASKMANAGER node crashes; and collecting and uploading error environment information when the Java virtual machine runs based on the content of the hook function.
Before the TASKMANAGER node exits due to a fault while the Flink job is running, the error information uploading module performs operations covering the following two aspects. First, before the Java virtual machine running the TASKMANAGER node exits, the hook function of the Java virtual machine needs to be called, as described above, which is not repeated here. Second, when the TASKMANAGER node exits due to a fault during the Flink job run, an upload signal needs to be triggered. When the hook function is invoked, it can be determined that the TASKMANAGER node has failed and is exiting during the Flink job run; the upload signal is triggered and, immediately before the TASKMANAGER node exits, the relevant error information and running state information are uploaded to a specified location for subsequent troubleshooting and problem analysis. Thus, based on the content of the hook function, the error environment information generated while the Java virtual machine runs can be collected and uploaded.
Illustratively, the error information includes fault exception information, memory usage of the Java virtual machine, and startup parameters, startup commands, and path information of the Java virtual machine, and before the error information generated by the TASKMANAGER node during the Flink operation is uploaded to the JobManager node by the error information uploading module, the method includes: when the TASKMANAGER node normally operates, collecting starting parameters, starting commands and path information of the Java virtual machine after the TASKMANAGER node is started; and when the process of the TASKMANAGER node fails, collecting the failure exception information and the memory use condition of the Java virtual machine, wherein the failure exception information comprises error stack information, and the memory use condition of the Java virtual machine is the memory use condition of the Java virtual machine when the TASKMANAGER node fails.
Before the TASKMANAGER node uploads, through the error information uploading module, the error information generated during the Flink job run to the JobManager node, the error information is collected in two categories. First, when the TASKMANAGER node operates normally, the starting parameters, starting command and path information of the Java virtual machine after the TASKMANAGER node starts are collected, i.e. the information that does not change after the TASKMANAGER node starts; this information is fixed once the TASKMANAGER node has started and includes the starting parameters, starting command and path information of the Java virtual machine. Such information is collected and cached during normal TASKMANAGER operation to reduce the time consumed in the information collection phase at fault time. Second, when the process of the TASKMANAGER node fails, the fault exception information and the memory usage of the Java virtual machine are collected, i.e. the information that characterizes the fault and is produced at the moment the process fails. The fault exception information includes error stack information and can provide the detailed error type and error stack. The memory usage of the Java virtual machine refers to the memory usage of the Java virtual machine at the moment the TASKMANAGER node fails. Collecting the memory usage of the Java virtual machine makes it possible to know how much memory the Java virtual machine occupied when the fault occurred, which helps in analyzing the cause and impact of the fault.
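A hedged sketch of collecting the two categories of information described above with the standard java.lang.management API follows; the wrapper class and the use of the sun.java.command system property are assumptions made for this example, not part of this disclosure.

```java
// Illustrative sketch: startup information is cached at start-up, memory
// usage and the exception stack are gathered at fault time.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.List;

public final class JvmInfoCollector {

    /** Collected once after start-up: start parameters, command and path information. */
    public static String collectStartupInfo() {
        var runtime = ManagementFactory.getRuntimeMXBean();
        List<String> jvmArgs = runtime.getInputArguments();           // JVM start parameters
        String command = System.getProperty("sun.java.command", "");  // start command (may be empty on some JVMs)
        String classPath = runtime.getClassPath();                    // path information
        return "args=" + jvmArgs + ", command=" + command + ", classpath=" + classPath;
    }

    /** Collected at fault time, together with the exception stack. */
    public static String collectMemoryUsage() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        MemoryUsage nonHeap = ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
        return "heapUsed=" + heap.getUsed() + "/" + heap.getMax()
                + ", nonHeapUsed=" + nonHeap.getUsed();
    }
}
```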
Illustratively, the uploading, by the error information uploading module, of the error information generated by the TASKMANAGER node during the Flink job run to the JobManager node includes: when the JobManager node is available, uploading the fault exception information, the memory usage of the Java virtual machine, and the starting parameters, starting command and path information of the Java virtual machine to the JobManager node through the error information uploading module.
When the JobManager node is available, the error information uploading module may upload the collected fault exception information, the Java virtual machine memory usage condition, and the starting parameters, the starting command and the path information of the Java virtual machine to the JobManager node. In this way, developers and operators can obtain the information through the JobManager node for troubleshooting and problem positioning.
Fig. 4 is a flowchart of method steps implemented on an error information uploading module according to an embodiment of the present disclosure, as shown in fig. 4, when a TASKMANAGER node starts to boot, first, the boot parameters, the boot command and the path information of the Java virtual machine after the start of the TASKMANAGER node, that is, the unchanged information after the start of the TASKMANAGER node, are collected. The error uploading module detects whether the TASKMANAGER node is abnormal or not in real time in operation, namely detects whether the Java virtual machine operated by the TASKMANAGER node is to be terminated or not, and does not need to operate when no abnormality is detected; when the abnormality is detected, fault abnormality information is collected under the condition that the JobManager node is judged to be normal and alive, otherwise, the process of the TASKMANAGER node is terminated. The collected error information is uploaded and reported to the JobManager node, and then the process of the TASKMANAGER node is terminated.
Illustratively, the method further comprises: when the JobManager node is not available, discarding uploading the error information and accelerating the process termination of the TASKMANAGER node; and after the JobManager nodes are restored to be available, responding to information inquiry and/or problem tracing of the user side based on the first information of the TASKMANAGER nodes which are exited by the faults and recorded by the historical information recording module in real time.
In contrast to the previous case, if the JobManager node is not available (e.g. it has crashed or an active-standby switch is in progress), the error information uploading module cannot upload the error information to the JobManager node. In this case, the error information uploading module gives up uploading the error information in order to accelerate the termination of the TASKMANAGER node's process. After the JobManager node becomes available again, information queries and/or problem tracing from the user side can be answered based on the first information of the failed TASKMANAGER nodes recorded by the history information recording module. The history information recording module maintains the information and running status of all TASKMANAGER nodes started during the operation of the job, including the TASKMANAGER nodes' running states, runtime environment information, and the like.
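The branch just described — upload only when the JobManager node is reachable, otherwise give up so the process can terminate quickly — could look like the following Java sketch, whose interface and method names are assumptions.

```java
// Illustrative sketch of the availability check performed before uploading.
public final class ConditionalErrorUpload {

    public interface JobManagerClient {
        boolean isAvailable();
        void uploadErrorInfo(String errorInfo);
    }

    public static void onFaultExit(JobManagerClient jobManager, String errorInfo) {
        if (jobManager.isAvailable()) {
            jobManager.uploadErrorInfo(errorInfo);   // JobManager stores it in the first information
        }
        // If the JobManager is unavailable, skip the upload so the TASKMANAGER
        // process can terminate as fast as possible; the history information
        // recording module still holds the node's first information.
    }
}
```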
Illustratively, the uploading, by the error information uploading module, of the error information generated by the TASKMANAGER node during the Flink job run to the JobManager node includes: uploading the error information to the JobManager node through an RPC interface based on the RPC protocol; and the method further comprises: after the JobManager node receives the error information, displaying the error information on the job monitoring page of the JobManager node, so that the user side can view and analyze the error information.
The error information uploading module uses a customized RPC interface to upload the collected error information to the JobManager node based on the RPC protocol. RPC (remote procedure call) is a protocol for communication between different processes or different computers that allows one program to call the procedure of another program just as a local program. The error information uploading module may transmit the collected error information to the JobManager node through the RPC interface. After receiving the error information, the JobManager node displays the error information in a monitoring page for the user to check and analyze.
A Flink job typically runs in a distributed environment, where the TASKMANAGER node and the JobManager node may be located on different machines or in different processes. By using the RPC protocol and the customized RPC interface, the nodes can communicate remotely and exchange data. Uploading the collected error information to the JobManager node and displaying it on the monitoring page enables real-time monitoring and analysis of error conditions during operation. This is very important for developers and operation and maintenance personnel: problems can be discovered and solved in time, thereby improving the stability and reliability of the job.
Illustratively, in the case that multiple Flink clusters are deployed across data centers, a unified log collection and management mechanism is introduced, ensuring that the status and error information of all clusters can be shared in real time.
Information collection for a single Flink cluster follows the embodiments of the present disclosure. For information collection across multiple Flink clusters, the present disclosure proposes a unified log collection and management mechanism. First, the scope and format of the synchronized data are determined: which data need to be synchronized (such as running logs, state information and error information), a unified data format agreed in advance (such as JSON or Avro) so that data from different clusters can be understood and processed by one another, and version numbers assigned to the synchronized data in advance to guarantee the order and consistency of updates. The present disclosure proposes using a message queue (e.g., Apache Kafka) to implement this real-time synchronization: the status updates and log records of each Flink cluster are sent as messages to a shared message queue, and the other clusters subscribe to this queue to receive the updates. For data that does not require real-time synchronization, a distributed file system (e.g., HDFS) or an object store (e.g., Amazon S3) may be used for periodic batch synchronization; when the JobManager node exits, the logs are uploaded to the configured distributed file system (HDFS, S3, etc.), and the JobManager node may implement this function through an added history information recording module.
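A minimal sketch of the real-time synchronization path is given below, using the standard Apache Kafka producer client. The topic name, the cluster identifier and the JSON message layout are assumptions made for this sketch.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch of the real-time synchronization path: each Flink cluster publishes its status
// updates and log records as JSON messages to a shared Kafka topic that the other
// clusters subscribe to.
public final class ClusterStatusPublisher implements AutoCloseable {

    private static final String TOPIC = "flink-cluster-status"; // assumed shared topic

    private final KafkaProducer<String, String> producer;
    private final String clusterId;

    public ClusterStatusPublisher(String bootstrapServers, String clusterId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
        this.clusterId = clusterId;
    }

    /** Publishes one status/log update; the version field keeps updates ordered and consistent. */
    public void publish(long version, String payloadJson) {
        String value = "{\"cluster\":\"" + clusterId + "\",\"version\":" + version
                + ",\"payload\":" + payloadJson + "}";
        // Keying by cluster id keeps the updates of one cluster ordered within a partition.
        producer.send(new ProducerRecord<>(TOPIC, clusterId, value));
    }

    @Override
    public void close() {
        producer.close();
    }
}
```

Keying each message by the cluster identifier is a design choice that preserves per-cluster ordering while still letting all clusters share a single topic.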
Through the real-time synchronization mechanism, state and log information can be shared quickly among the clusters: when a problem occurs in one cluster, the other clusters can respond rapidly and take corresponding measures, reducing system downtime and improving the stability and availability of the overall system. The synchronized data can also support global optimization: by analyzing the running data of the different clusters, performance bottlenecks and optimization points can be found, providing decision support for system optimization and upgrades. In this way, the log information of multiple Flink jobs can be stored and managed uniformly.
Fig. 5 is a block diagram of an information recording apparatus according to an embodiment of the present disclosure. As shown in Fig. 5, the apparatus is applied to a Flink cluster, where the Flink cluster includes a TASKMANAGER node, a JobManager node, an error information uploading module and a history information recording module, the error information uploading module is deployed on the TASKMANAGER node, and the history information recording module is deployed on the JobManager node. The apparatus includes:
The first information recording module 201 is configured to record, in real time through the history information recording module, first information of each TASKMANAGER node executing the Flink job while the Flink job is running, where the first information includes the content necessary for tracing back the running log information of the TASKMANAGER nodes running the Flink job, and the first information is used for responding to information queries and/or problem tracing from the user side;
An upload signal triggering module 202, configured to trigger an upload signal when the TASKMANAGER node exits due to a failure while the Flink job is running;
and the uploading module 203, configured to upload, based on the upload signal, the error information generated by the TASKMANAGER node while the Flink job runs to the JobManager node through the error information uploading module, so that the JobManager node stores the error information in the first information.
Illustratively, the first information includes operation attribute information, operation state information and positioning information, where the operation state information is used to represent the operation states of the current TASKMANAGER nodes and of the historical TASKMANAGER nodes, the operation state information including: running, normally exited and abnormally exited. The first information recording module includes:
The operation attribute storage module is used for storing operation attribute information of each TASKMANAGER node according to different scheduling frames of the Flink job;
The real-time detection module is used for detecting the running state information of each TASKMANAGER node for executing the Flink operation in real time and updating the running state information in a running state table;
The positioning information recording module is used for recording the positioning information of each TASKMANAGER node, and the positioning information is used for describing the mode of tracing the first information of the TASKMANAGER node;
The apparatus further comprises:
The receiving query request module is used for receiving a TASKMANAGER node state query request of the user side;
and the running state display module is used for displaying the running state table on the job monitoring page of the JobManager node based on the TASKMANAGER node state query request, so that the user side can obtain the TASKMANAGER nodes that are running and/or the TASKMANAGER nodes that have stopped, where the stopped TASKMANAGER nodes include those that stopped normally and those that stopped abnormally.
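A minimal sketch of the running state table maintained by the modules above is given below; the class name, the three state values and the map-based layout are illustrative assumptions made for this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the running state table kept on the JobManager side. The enum values mirror
// the three operation states named above; names are illustrative, not defined by Flink.
public final class RunningStateTable {

    public enum State { RUNNING, EXITED_NORMALLY, EXITED_ABNORMALLY }

    private final Map<String, State> states = new ConcurrentHashMap<>();

    /** Called by the real-time detection module whenever a TASKMANAGER node changes state. */
    public void update(String taskManagerId, State state) {
        states.put(taskManagerId, state);
    }

    /** Answers a state query from the user side, e.g. for the job monitoring page. */
    public Map<String, State> snapshot() {
        return Map.copyOf(states);
    }
}
```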
Optionally, the operation attribute storage module includes:
The first operation attribute saving module is used for saving first operation attribute information under Yarn scheduling when the scheduling framework is Yarn scheduling, the first operation attribute information including: the host name of the TASKMANAGER node at runtime, the container ID of the TASKMANAGER node at runtime, the user name of the Flink job, the IP of the resource manager node in the Yarn scheduling, and the port number of the management page service of the resource manager node in the Yarn scheduling, where the container encapsulates the Flink job, the running environment of the Flink job and the required resources, and the TASKMANAGER node executes the Flink job in the container;
The apparatus further comprises:
The construction module is used for constructing a uniform resource locator based on the name of each item of information contained in the first operation attribute information, and the uniform resource locator is used for acquiring the first operation attribute information of the TASKMANAGER node;
the second operation attribute saving module is configured to save second operation attribute information under Kubernetes scheduling when the scheduling framework is Kubernetes scheduling, where the second operation attribute information includes: the namespace to which the Flink job is submitted and the name of the unit where the TASKMANAGER node runs;
The apparatus further comprises:
The second operation attribute acquisition module is used for calling a command to acquire the second operation attribute information of the TASKMANAGER node when it is determined that the TASKMANAGER node has exited due to a failure while the Flink job is running; and, when it is determined that the Flink job has stopped, locating the log information of the TASKMANAGER nodes exported by a log synchronization tool based on the second operation attribute information.
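The construction module and the second operation attribute acquisition module described above can be sketched as follows. The URL path and parameter names, as well as the kubectl invocation, are assumptions about a typical deployment rather than interfaces defined by this disclosure.

```java
import java.io.IOException;

// Sketch of how the positioning information could be turned into concrete look-ups.
// Both the URL layout and the external command are illustrative assumptions.
public final class TaskManagerLocator {

    /**
     * Builds a uniform resource locator from the names of the items in the first
     * operation attribute information (Yarn scheduling). The path and parameter
     * names are hypothetical.
     */
    public static String yarnAttributeUrl(String rmIp, int rmWebPort, String hostName,
                                          String containerId, String userName) {
        return String.format(
                "http://%s:%d/taskmanager-info?host=%s&containerId=%s&user=%s",
                rmIp, rmWebPort, hostName, containerId, userName);
    }

    /** Calls a command to fetch the logs of the unit (pod) running the TASKMANAGER node (Kubernetes scheduling). */
    public static Process kubernetesLogs(String namespace, String podName) throws IOException {
        return new ProcessBuilder("kubectl", "logs", "-n", namespace, podName)
                .redirectErrorStream(true)
                .start();
    }
}
```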
Optionally, the apparatus further comprises:
The monitoring module is used for calling a hook function of the Java virtual machine before the Java virtual machine running the TASKMANAGER node exits;
and the collection and uploading module is used for collecting and uploading error environment information when the Java virtual machine runs based on the content of the hook function.
Optionally, the error information includes fault exception information, java virtual machine memory usage, and starting parameters, starting commands, and path information of the Java virtual machine, and the apparatus further includes:
The first collecting module is used for collecting starting parameters, starting commands and path information of the Java virtual machine after the starting of the TASKMANAGER node when the TASKMANAGER node operates normally;
The second collecting module is used for collecting the fault abnormal information and the memory use condition of the Java virtual machine when the process of the TASKMANAGER node is faulty, wherein the fault abnormal information comprises error stack information, and the memory use condition of the Java virtual machine is the memory use condition of the Java virtual machine when the TASKMANAGER node is faulty;
the uploading module comprises:
And the parameter uploading module is used for uploading the fault abnormal information, the Java virtual machine memory use condition, the starting parameters, the starting command and the path information of the Java virtual machine to the JobManager node through the error information uploading module when the JobManager node is available.
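The two collection steps described above can be sketched with standard JDK management APIs. The class and method names are illustrative; how the collected values are packaged for the error information uploading module is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.List;

// Sketch of the first and second collecting modules, using standard JDK APIs.
public final class JvmInfoCollector {

    /** Collected once after the TASKMANAGER node starts: start parameters, start command and JVM path. */
    public static String collectStaticInfo() {
        List<String> startParameters = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String startCommand = ProcessHandle.current().info().commandLine().orElse("unknown");
        String jvmPath = System.getProperty("java.home");
        return "startParameters=" + startParameters
                + ", startCommand=" + startCommand
                + ", jvmPath=" + jvmPath;
    }

    /** Collected when the TASKMANAGER process fails: error stack information and current heap usage. */
    public static String collectFailureInfo(Throwable failure) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        StringBuilder stack = new StringBuilder(failure.toString());
        for (StackTraceElement element : failure.getStackTrace()) {
            stack.append("\n\tat ").append(element);
        }
        return "heapUsed=" + heap.getUsed() + ", heapMax=" + heap.getMax() + ", stack=" + stack;
    }
}
```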
Optionally, the apparatus further comprises:
the termination uploading module is used for giving up uploading the error information and accelerating the process termination of the TASKMANAGER node when the JobManager node is unavailable;
and the responding module is used for responding to information queries and/or problem tracing from the user side, based on the first information of the failed TASKMANAGER node recorded in real time by the history information recording module, after the JobManager node becomes available again.
The uploading module comprises:
The RPC uploading module is used for uploading the error information to the JobManager node based on an RPC interface of an RPC protocol;
The apparatus further comprises:
And the display module is used for displaying the error information on the job monitoring page of the JobManager node after the JobManager node receives the error information, so that a user side can view and analyze the error information.
The embodiment of the present disclosure further provides an electronic device. Fig. 6 is a structural block diagram of an electronic device provided in an embodiment of the present disclosure. As shown in Fig. 6, the electronic device includes a processor 301, a communication interface 302, a memory 303 and a communication bus 304, where the processor 301, the communication interface 302 and the memory 303 communicate with one another through the communication bus 304.
A memory 303 for storing a computer program;
the processor 301 is configured to implement the information recording method according to any of the embodiments described above when executing the program stored in the memory 303.
The communication bus mentioned for the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the terminal and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the processes of the information recording method embodiments described above and can achieve the same technical effects; to avoid repetition, details are not repeated here. In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the identical or similar parts of the embodiments may be referred to one another.
While the preferred embodiments of the disclosed embodiments have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the disclosed embodiments.
Finally, it should also be noted that relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes the element. The information recording method, apparatus, electronic device and computer-readable storage medium provided by the present disclosure have been described in detail above; specific examples have been used herein to illustrate the principles and embodiments of the present disclosure, and these examples are provided only to assist in understanding the method of the present disclosure and its core ideas. Meanwhile, those of ordinary skill in the art may make changes to the specific implementation and the application scope in light of the ideas of the present disclosure, and therefore the contents of this specification should not be construed as limiting the present disclosure.
Claims (10)
1. An information recording method, characterized by being applied to a Flink cluster, wherein the Flink cluster comprises a TASKMANAGER node, a JobManager node, an error information uploading module and a history information recording module, the error information uploading module is deployed on the TASKMANAGER node, and the history information recording module is deployed on the JobManager node, and the method comprises the following steps:
When a Flink job runs, recording first information of each TASKMANAGER node for executing the Flink job by the history information recording module, wherein the first information comprises content necessary for running log information of TASKMANAGER nodes for running the Flink job in a tracing way, and the first information is used for responding to information inquiry and/or problem tracing of a user side;
Triggering an uploading signal when the TASKMANAGER node exits due to a failure while the Flink job is running;
Based on the uploading signal, uploading error information generated by the TASKMANAGER node when the Flink job runs to the JobManager node through the error information uploading module, so that the JobManager node stores the error information in the running log information.
2. The information recording method according to claim 1, wherein the first information includes operation attribute information, operation state information, and positioning information, the operation state information being used to represent the operation states of current TASKMANAGER nodes and historical TASKMANAGER nodes, the operation state information including: running, normally exited and abnormally exited, and wherein the recording, by the history information recording module, first information of each TASKMANAGER node for executing the Flink job in real time includes:
Storing the operation attribute information of each TASKMANAGER node according to different scheduling frames of the Flink job;
detecting the running state information of each TASKMANAGER node for executing the Flink operation in real time, and updating the running state information in a running state table;
recording positioning information of each TASKMANAGER node, wherein the positioning information is used for describing a mode of tracing first information of the TASKMANAGER node;
the method further comprises the steps of:
receiving a TASKMANAGER node state query request of the user side;
And displaying the running state table on a job monitoring page of the JobManager node based on the TASKMANAGER node state query request so that the user side obtains the TASKMANAGER node which is running and/or the TASKMANAGER node which is stopped, wherein the TASKMANAGER node which is stopped comprises the TASKMANAGER node which is normally stopped and the TASKMANAGER node which is abnormally stopped.
3. The information recording method according to claim 2, wherein the storing the operation attribute information of each TASKMANAGER node according to the different scheduling frameworks of the Flink job includes:
When the scheduling framework is a Yarn schedule, first operation attribute information in the Yarn schedule is stored, wherein the first operation attribute information comprises: the host name of the TASKMANAGER node when running, the container ID of the TASKMANAGER node when running, the user name of the Flink job, the IP of the resource manager node in the Yarn schedule, the port number of the management page service of the resource manager node in the Yarn schedule, wherein the container is packaged with the Flink job, the running environment of the Flink job and required resources, and the TASKMANAGER node executes the Flink job in the container;
The method further comprises the steps of: constructing a uniform resource locator based on the name of each item of information contained in the first operation attribute information, wherein the uniform resource locator is used for acquiring the first operation attribute information of the TASKMANAGER node;
When the scheduling framework is Kubernetes scheduling, second operation attribute information in the Kubernetes scheduling is stored, wherein the second operation attribute information includes: the namespace to which the Flink job is submitted and the name of the unit where the TASKMANAGER node runs;
The method further comprises the steps of: when the TASKMANAGER node is determined to be in fault exit under the condition that the Flink job is operated, calling a command to acquire second operation attribute information of the TASKMANAGER node; and when determining that the Flink job is stopped, positioning log information of the TASKMANAGER nodes exported by a log synchronization tool based on the second operation attribute information.
4. The information recording method according to claim 1, wherein before the process of the TASKMANAGER node exits due to a failure while the Flink job runs, the method further comprises:
calling a hook function of the Java virtual machine before the Java virtual machine running the TASKMANAGER node exits;
And collecting and uploading error environment information when the Java virtual machine runs based on the content of the hook function.
5. The information recording method according to claim 1, wherein the error information includes failure exception information, java virtual machine memory usage, and start parameters, start commands, and path information of the Java virtual machine, and before the error information generated by the TASKMANAGER node at the time of the Flink job run is uploaded to the JobManager node by the error information upload module, the method includes:
when the TASKMANAGER node normally operates, collecting starting parameters, starting commands and path information of the Java virtual machine after the TASKMANAGER node is started;
When the process of the TASKMANAGER node fails, collecting the failure exception information and the memory use condition of the Java virtual machine, wherein the failure exception information comprises error stack information, and the memory use condition of the Java virtual machine is the memory use condition of the Java virtual machine when the TASKMANAGER node fails;
The uploading, by the error information uploading module, the error information generated by the TASKMANAGER node when the Flink job runs to the JobManager node includes:
and when the JobManager node is available, uploading the fault exception information, the Java virtual machine memory use condition, the starting parameters of the Java virtual machine, the starting command and the path information to the JobManager node through the error information uploading module.
6. The information recording method according to claim 1, characterized in that the method further comprises:
when the JobManager node is not available, discarding uploading the error information and accelerating the process termination of the TASKMANAGER node;
after the JobManager nodes are restored to be available, responding to information inquiry and/or problem tracing of the user side based on the first information of the TASKMANAGER nodes which are exited by the faults and recorded by the historical information recording module.
7. The information recording method according to claim 1, wherein the uploading, by the error information uploading module, the error information generated by the TASKMANAGER node when the Flink job runs to the JobManager node includes:
The RPC interface based on the RPC protocol uploads the error information to the JobManager node;
the method further comprises the steps of:
After the JobManager node receives the error information, the error information is displayed on a job monitoring page of the JobManager node, so that a user side can view and analyze the error information.
8. An information recording apparatus, characterized in that the apparatus is applied to a Flink cluster, the Flink cluster including a TASKMANAGER node, a JobManager node, an error information uploading module and a history information recording module, the error information uploading module being deployed on the TASKMANAGER node and the history information recording module being deployed on the JobManager node, the apparatus comprising:
The first information recording module is used for recording first information of each TASKMANAGER node for executing the Flink job through the history information recording module when the Flink job is running, wherein the first information comprises content necessary for running log information of TASKMANAGER nodes for running the Flink job in a tracing way, and the first information is used for responding to information inquiry and/or problem tracing of a user side;
the uploading signal triggering module is used for triggering an uploading signal when the TASKMANAGER node exits due to a failure while the Flink job is running;
And the uploading module is used for uploading the error information generated by the TASKMANAGER node when the Flink job runs to the JobManager node through the error information uploading module based on the uploading signal, so that the JobManager node stores the error information in the operation log information.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the information recording method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the information recording method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410444329.0A CN118467217A (en) | 2024-04-12 | 2024-04-12 | Information recording method, apparatus, electronic device, and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410444329.0A CN118467217A (en) | 2024-04-12 | 2024-04-12 | Information recording method, apparatus, electronic device, and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118467217A true CN118467217A (en) | 2024-08-09 |
Family
ID=92165256
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410444329.0A Pending CN118467217A (en) | 2024-04-12 | 2024-04-12 | Information recording method, apparatus, electronic device, and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118467217A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119496703A (en) * | 2024-11-14 | 2025-02-21 | 中电云计算技术有限公司 | Service switching method, device and storage medium based on stream processing framework |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110716842B (en) | Cluster fault detection method and device | |
EP3806432A1 (en) | Method for changing service on device and service changing system | |
CN114745295A (en) | Data acquisition method, device, equipment and readable storage medium | |
GB2363488A (en) | Referencing failure information representative of multiple related failures in a distributed computing environment | |
CN115865680B (en) | Method, system and device for accessing, controlling and transmitting data of distributed equipment | |
US20210065083A1 (en) | Method for changing device business and business change system | |
CN114816914A (en) | Data processing method, equipment and medium based on Kubernetes | |
CN114422386B (en) | Monitoring method and device for micro-service gateway | |
CN113986714B (en) | A container-based automated continuous testing method and device | |
CN118467217A (en) | Information recording method, apparatus, electronic device, and computer-readable storage medium | |
CN114167181B (en) | Method and system for monitoring local and allopatric line fault tracing | |
CN118860799B (en) | Method and system for realizing automated operation and maintenance of OpenStack cloud computing | |
CN112260902B (en) | Network equipment monitoring method, device, equipment and storage medium | |
CN113570347A (en) | RPA operation and maintenance method for micro-service architecture system | |
CN119520522A (en) | Network equipment index collection method, device, equipment, storage medium and product | |
CN112100019B (en) | Multi-source fault collaborative analysis positioning method for large-scale system | |
CN113900898B (en) | Data processing system, equipment and medium | |
CN114205231B (en) | Method, system and readable storage medium for starting hadoop clusters in batches | |
US11474928B2 (en) | Remote system filtered data item logging | |
CN114090382B (en) | Health inspection method and device for super-converged cluster | |
CN112422349B (en) | Network management system, method, equipment and medium for NFV | |
CN112433915B (en) | Data monitoring method and related device based on distributed performance monitoring tool | |
CN115220992A (en) | Interface change monitoring method and device, computer equipment and storage medium | |
CN115687036A (en) | Log collection method and device and log system | |
Arefin et al. | Cloudinsight: Shedding light on the cloud |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||