CN120704987A

CN120704987A - System status acquisition method, device, system and non-volatile storage medium

Info

Publication number: CN120704987A
Application number: CN202510806382.5A
Authority: CN
Inventors: 代镇聪; 周东旭; 孙坚; 刘炜
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2025-06-16
Filing date: 2025-06-16
Publication date: 2025-09-26

Abstract

The application discloses a system state acquisition method, device and system, and a non-volatile storage medium. The method comprises: obtaining first observation data of a micro-service application program in a micro-service layer of a system; obtaining second observation data of a node application program in a container network layer of the system through a first extended Berkeley Packet Filter (eBPF) agent module, wherein the first eBPF agent module is arranged in a kernel space of each node of a cluster; obtaining third observation data of the nodes in an infrastructure layer of the system through a second eBPF agent module, wherein the second eBPF agent module is arranged in the kernel space; and summarizing and analyzing the first observation data, the second observation data and the third observation data to obtain the running state of the system. The application solves the technical problem in the related art that the system state cannot be accurately determined because application programs deployed in an information technology application innovation (Xinchuang) environment cannot be observed efficiently.

Description

System state acquisition method, device and system and nonvolatile storage medium

Technical Field

The present application relates to the field of cloud computing, and in particular, to a method, an apparatus, a system, and a non-volatile storage medium for acquiring a system state.

Background

In the related art, cloud-native application programs deployed in an information technology application innovation (Xinchuang) environment cannot be observed comprehensively, and observing them incurs large resource overheads. As a result, application programs deployed in the Xinchuang environment cannot be observed efficiently, and the system state cannot be determined in a timely and accurate manner.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiments of the application provide a system state acquisition method, device and system, and a non-volatile storage medium, which at least solve the technical problem in the related art that the system state cannot be accurately determined because application programs deployed in a Xinchuang environment cannot be observed efficiently.

According to one aspect of the embodiments of the application, a system state acquisition method is provided, comprising: acquiring first observation data of a micro-service application program in a micro-service layer of a system; acquiring second observation data of a node application program in a container network layer of the system through a first extended Berkeley Packet Filter (eBPF) agent module, wherein the first eBPF agent module is arranged in a kernel space of each node of a cluster; acquiring third observation data of the node in an infrastructure layer of the system through a second eBPF agent module, wherein the second eBPF agent module is arranged in the kernel space; and summarizing and analyzing the first observation data, the second observation data and the third observation data to obtain the running state of the system.

Optionally, acquiring the first observation data of the micro-service application program in the micro-service layer of the system comprises: determining first tracking information of a network data packet in the system, wherein the first tracking information indicates transmission path information of the network data packet in the micro-service layer; screening initial first observation data with a neural network model and removing noise data from the initial first observation data to obtain the first observation data; and processing the first observation data and the first tracking information with the neural network model to obtain a micro-service calling mode identification result, wherein the micro-service calling mode identification result indicates whether an abnormal micro-service calling mode exists in the micro-service layer of the system.

Optionally, determining the first tracking information of the east-west network data packet of the micro service application program comprises adding a distributed tracking context to a network data packet header of the micro service application program, determining initial first tracking information of the network data packet according to the distributed tracking context, screening the initial first tracking information by adopting a neural network model, and removing noise information in the initial first tracking information to obtain the first tracking information.

Optionally, the method further comprises identifying tracking context information of the network data packet in the system through the first extended Berkeley Packet Filter (eBPF) agent module to obtain second tracking information of the network data packet, wherein the second tracking information comprises transmission path information of the network data packet in the container network layer.

Optionally, after the second observation data of the node application program is obtained through the first extended Berkeley Packet Filter (eBPF) agent module, the method further comprises processing the second observation data and the second tracking information with a neural network model to determine the fault type of the system and the fault cause corresponding to the fault type, wherein the fault type comprises at least one of network congestion, packet loss, and network delay higher than a preset delay threshold.

Optionally, obtaining the third observation data of each node through the second extended Berkeley Packet Filter (eBPF) agent module comprises: obtaining the third observation data of each node through the second eBPF agent module and a data exporter; and processing the third observation data with a neural network model to obtain an association relation among the internal data of the third observation data.

Optionally, the first observation data comprises at least one of metric data or log data of the micro-service application program, the second observation data comprises communication performance metric data of the node application program, and the third observation data comprises container-aware telemetry data of the node.

According to another aspect of the embodiments of the application, a system state acquisition device is provided, comprising a first processing module, a second processing module, a third processing module and a fourth processing module. The first processing module is used for acquiring first observation data of a micro-service application program in a micro-service layer of a system. The second processing module is used for acquiring second observation data of a node application program in a container network layer of the system through a first extended Berkeley Packet Filter (eBPF) agent module, wherein the first eBPF agent module is arranged in a kernel space of each node of a cluster. The third processing module is used for acquiring third observation data of the node in an infrastructure layer of the system through a second eBPF agent module, wherein the second eBPF agent module is arranged in the kernel space. The fourth processing module is used for summarizing and analyzing the first observation data, the second observation data and the third observation data to obtain the running state of the system.

According to another aspect of the embodiment of the present application, there is further provided a nonvolatile storage medium, in which a program is stored, where when the program runs, the program controls a device where the nonvolatile storage medium is located to execute the method for acquiring the system state.

According to another aspect of the embodiments of the present application, there is also provided an electronic device including a memory and a processor, where the processor is configured to execute a program stored in the memory, and the program, when run, executes the system state acquisition method.

According to another aspect of embodiments of the present application, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements a method of acquiring a system state.

In the embodiments of the application, first observation data of a micro-service application program is acquired in a micro-service layer of a system; second observation data of a node application program in a container network layer of the system is acquired through a first extended Berkeley Packet Filter (eBPF) agent module, wherein the first eBPF agent module is arranged in a kernel space of each node of a cluster; third observation data of the node in an infrastructure layer of the system is acquired through a second eBPF agent module, wherein the second eBPF agent module is arranged in the kernel space; and the first observation data, the second observation data and the third observation data are summarized and analyzed to obtain the running state of the system. By acquiring the observation data of the different layers separately through the eBPF agent modules, the observation data is acquired comprehensively and the resource overhead of acquiring it is reduced.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a schematic structural diagram of a computer terminal (or mobile terminal) according to an embodiment of the present application;

FIG. 2 is a flowchart of a system state acquisition method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an eBPF-based observability agent architecture according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a cloud-native micro-service full-link observability architecture according to an embodiment of the present application;

FIG. 5 is a comparison of performance metrics according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a system state acquisition device according to an embodiment of the present application.

Detailed Description

So that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of protection of the present application.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In order to better understand the embodiments of the present application, technical terms related to the embodiments of the present application are explained as follows:

eBPF (extended Berkeley Packet Filter): a powerful kernel technology originally designed for network packet filtering that has since been extended to many other uses. It allows a user to run small, efficient programs in the kernel without modifying the kernel source code or loading kernel modules.

eBPF has attracted considerable attention in load balancing, firewalls and network security. An eBPF program is a sequence of 64-bit instructions whose safety is checked by the in-kernel verifier before it runs. It is compiled on the host with tools such as Clang/LLVM and just-in-time compiled to instrument its targets. An eBPF program exports metrics by writing kernel-level metrics to a memory map that resides in kernel space but is accessible from user space. This approach reduces the need to copy data from kernel space to user space, enabling higher throughput with lower overhead. The lower overhead in turn allows eBPF scripts to be executed dynamically in a production environment to debug a live system without affecting users.

OTel (OpenTelemetry): an open observability technology that provides standardized software development kits (SDKs), application programming interfaces (APIs), and tools for ingesting, transforming, and transmitting data to observability backends.

Sidecar: a container architecture pattern in which an auxiliary container runs alongside a main application container, provides functions such as log collection, monitoring and proxy services, and supports the operation and scaling of the main application. Sidecars are commonly used in micro-service and containerized environments.

LLM (Large Language Model): a model based on deep learning that can process and generate natural language text, has powerful language understanding and generation capabilities, and can be used for intelligent interaction and data processing in various fields.

Observability in cloud-native applications rests on metrics, logs, and tracking information. Metrics are numerical representations of data over time that can be used to infer system behavior through mathematical modeling or prediction. Metrics are optimized for storage, compression and retrieval, which enables longer retention and faster queries. A log is an immutable, time-stamped record of discrete events. Finally, tracking information represents interactions between distributed applications and provides a macroscopic view of the request-response lifecycle. In a distributed system, a single request may traverse multiple services distributed across different servers or geographic locations. Implementing observability in a monolithic application is straightforward because its code base and deployment both sit on a single infrastructure endpoint, so a developer can instrument the application or the host to collect observability data. In a Xinchuang environment deployment, the same task is complex. With numerous containers, distributed hosts, and a variety of micro-service implementation ecosystems, it is not feasible to manually add observability code to each micro-service or operating system (OS) kernel. Thus, a key issue in making a cloud-native application observable is the overhead of maintaining and updating observability code across different micro-service implementation ecosystems.

Kubernetes is a highly modular and extensible open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. The per-language instrumentation libraries provided by OTel can be used to export metrics, logs, and tracking information from applications, and support a variety of languages and platforms, including Java, Go, Node.js, Python, and .NET. However, in an agile multi-language environment, instrumenting each application with OTel is very cumbersome. Furthermore, OTel only sees the L7 metrics of the application and does not gather information from other layers, the network stack, or the kernel, all of which is critical to cloud-native observability. Moreover, OTel aggregates observability data from multiple pods into services, which limits its ability to perform pod-level analysis.

An observability sidecar (i.e., an auxiliary container running alongside the application container in a Kubernetes pod) reduces the overhead of maintaining multiple applications by redirecting all observability functions from the application container to a separate (sidecar) container. Typically, a sidecar is deployed next to each micro-service and is responsible for handling network functions and performing L3, L4 and L7 network observability tasks. However, the sidecar approach is limited to collecting network metrics, i.e., delay, traffic and error rates, and tracking information. Moreover, because sidecars add container overhead to cloud-native clusters, they occupy valuable cluster resources (including computing, storage, and network resources). Furthermore, although sidecars are particularly useful for detecting problems in micro-services, they do not provide accurate root cause analysis. For example, tracking information may be used to identify services that are behaving abnormally and to analyze delays, but the internals of the application container cannot be debugged through the sidecar. Thus, to achieve full observability of the container, observability data must be collected directly from the underlying host operating system. While there are many agent-based approaches (e.g., OpenTelemetry, cAdvisor, AWS X-Ray, CloudWatch Container Insights), deploying these solutions (typically in user space) creates additional resource overhead and affects the performance of the deployed workload.

It should be noted that, unlike components in a monolithic architecture, which communicate through in-process calls, a micro-service architecture suffers from more service interaction failures because interactions between micro-services take place over unreliable networks. In addition, containers in the related art are designed to be short-lived and can be started and destroyed quickly, which makes tracking their data challenging. Furthermore, deploying containers on distributed hosts complicates the collection of a comprehensive view of service behavior and performance. Finally, managing containers with Kubernetes requires monitoring additional components such as kube-apiserver, kube-controller-manager, agents, kubelet, and Container Network Interfaces (CNIs).

In addition, LLMs have made significant progress in the field of natural language processing, exhibiting powerful language understanding, generation and reasoning capabilities. However, in the observability scenario of Xinchuang container clusters, the advantages of LLMs have not yet been fully exploited to solve the problems existing in the related art.

To solve the above problems and achieve full observability of the container, observability data must be collected directly from the underlying host operating system. While there are many agent-based approaches, deploying these solutions (typically in user space) creates additional resource overhead and affects the performance of the deployed workload. For the Xinchuang container environment, the embodiments of the application develop a full-coverage, intelligent, eBPF-based observability solution covering metrics, logs and distributed tracking, which is described in detail below.

It is noted that observability solutions in the related art use user-space programs to capture and analyze data, which can result in significant performance overhead. The method provided by the embodiments of the application transfers these tasks to eBPF programs in kernel space, thereby greatly reducing the overhead while enriching the data with deep context information from the kernel. In addition, the eBPF framework proposed by the embodiments of the present application can be dynamically injected into any deployed application without recompilation or redeployment. The proposed solution improves on Cloudflare's existing eBPF exporter and adds new functionality in several bpf.c scripts, including accept-latency, cachestat, llcstat, malloc, oomkill, runqlat, shrinklat and tcpbacklog.

That is, the embodiments of the present application provide a complete intelligent observability solution for the Xinchuang container environment that uses the extended Berkeley Packet Filter (eBPF) in the Linux kernel. The proposed solution relies on small eBPF-based event triggers to extend operating system functionality without instrumenting the corresponding application code, while running at wire speed (i.e., at the maximum transfer rate supported by the hardware device) in kernel space, without degrading the performance of the observed events. By using observability agents based on kernel-space eBPF, the solution can directly collect observability data with deep context without incurring additional overhead. Furthermore, the proposed solution provides detailed knowledge of the context of the observed application without the burden of instrumentation. The LLM is used for intelligent data screening, analysis and visualization, reducing manual intervention and raising the level of intelligence of the observability system. The framework provided by the embodiments of the application can be dynamically injected into deployed application programs without recompilation or redeployment, thereby enhancing the flexibility and extensibility of the system.

According to an embodiment of the present application, a method embodiment of a system state acquisition method is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.

The method embodiments provided by the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 1 shows a hardware block diagram of a computer terminal (or mobile device) for implementing the system state acquisition method. As shown in FIG. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal 10 may include a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in FIG. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.

It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits." A data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or may be incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination connected to the interface).

The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the system state acquisition method in the embodiment of the present application, and the processor 102 executes the software programs and modules stored in the memory 104, thereby executing various functional applications and data processing, that is, implementing the system state acquisition method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).

In the above operating environment, the embodiment of the present application provides a method for acquiring a system state, as shown in fig. 2, where the method includes the following steps:

Step S202, acquiring first observation data of a micro-service application program in a micro-service layer of a system;

In the technical solution provided in step S202, acquiring the first observation data of the micro-service application program in the micro-service layer of the system includes: determining first tracking information of network data packets in the system, wherein the first tracking information indicates transmission path information of the network data packets in the micro-service layer; screening initial first observation data with a neural network model and removing noise data from the initial first observation data to obtain the first observation data; and processing the first observation data and the first tracking information with the neural network model to obtain a micro-service calling mode identification result, wherein the micro-service calling mode identification result indicates whether an abnormal micro-service calling mode exists in the micro-service layer of the system.

As an alternative embodiment, the following manner may be used to identify whether an abnormal micro service invocation mode exists at the micro service layer of the system:

First, the first tracking information of the network data packets in the system is determined. This information details the transmission path of a data packet in the micro-service layer and includes the source service, the target service, and the service nodes the packet passes through. To obtain it, a first eBPF agent module is deployed on each node of the Kubernetes cluster and captures the headers of network packets directly in kernel space in order to read and parse key tracking identifiers such as the distributed tracking ID. These identifiers are carried in standard fields such as X-Request-ID and are updated by each service node as the packet traverses the service mesh, thereby recording the complete request-response cycle and the interactions between micro-services.
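The header parsing described above can be illustrated with a minimal user-space sketch. It is not the in-kernel eBPF program: it only shows, in Python, how a tracking ID might be read from a captured HTTP/1.x request head and how hops sharing the same ID can be grouped into a path. The function names, the capture tuple layout and the service names are hypothetical.

```python
from collections import defaultdict

# Header field named in the text; matched case-insensitively.
TRACE_HEADER = b"x-request-id"


def extract_trace_id(payload: bytes):
    """Return the tracking ID from an HTTP/1.x request head, or None."""
    head, _, _ = payload.partition(b"\r\n\r\n")
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == TRACE_HEADER:
            return value.strip().decode("ascii", errors="replace")
    return None


def build_paths(captures):
    """Group (source, target, payload) captures into per-ID hop lists."""
    paths = defaultdict(list)
    for source, target, payload in captures:
        trace_id = extract_trace_id(payload)
        if trace_id is not None:
            paths[trace_id].append((source, target))
    return dict(paths)


# Hypothetical captures: two hops of one request and one untagged packet.
captures = [
    ("gateway", "svc-a",
     b"GET /order HTTP/1.1\r\nHost: svc-a\r\nX-Request-ID: req-42\r\n\r\n"),
    ("svc-a", "svc-b",
     b"GET /stock HTTP/1.1\r\nX-Request-ID: req-42\r\n\r\n"),
    ("svc-a", "svc-c",
     b"GET /ping HTTP/1.1\r\nHost: svc-c\r\n\r\n"),
]
paths = build_paths(captures)
```

In this sketch the untagged packet is simply skipped; a real agent would more likely count such packets so that gaps in the recorded path are visible.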

The initially collected first observation data may then be screened using the LLM in order to remove irrelevant or redundant data, i.e., the noise data in the first observation data, and so ensure the accuracy and efficiency of subsequent analysis. The first observation data includes micro-service metrics such as call frequency, response time and error rate, as well as network-layer data such as packet size and send timestamp. Through training, the LLM learns features closely related to the micro-service invocation pattern, such as the mean and standard deviation of the response time of a particular service call under normal conditions. Based on these features, the model filters out data points that provide no additional diagnostic value, such as service call records within the normal response time range, yielding the refined first observation data.
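The screening step can be approximated without a trained model. The sketch below swaps the LLM for a plain statistical rule: a record is kept only if its response time lies more than `k` standard deviations from a baseline mean, or if it carries an error. The record fields, the baseline values and the threshold `k = 3.0` are assumptions made for illustration.

```python
from statistics import mean, stdev


def refine(records, baseline_ms, k=3.0):
    """Drop records inside the normal response-time band unless they failed."""
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    kept = []
    for rec in records:
        deviates = abs(rec["response_ms"] - mu) > k * sigma
        if deviates or rec["error"]:
            kept.append(rec)
    return kept


# Hypothetical baseline of normal response times for one service call (ms).
baseline = [48.0, 50.0, 52.0, 49.0, 51.0]
records = [
    {"service": "svc-b", "response_ms": 50.5, "error": False},   # normal
    {"service": "svc-b", "response_ms": 210.0, "error": False},  # slow
    {"service": "svc-b", "response_ms": 49.0, "error": True},    # failed
]
refined = refine(records, baseline)
```

Only the slow call and the failed call survive; the record inside the normal band is treated as noise, which matches the filtering goal described above.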

The LLM can then be used to analyze the refined first observation data together with the tracking information in order to identify anomalies in the micro-service invocation pattern. This involves associating the tracking information with the first observation data, combining network path information with micro-service interaction data. In the process, the LLM constructs a dynamic graph of micro-service calls in which nodes represent services, edges represent calls between services, and edge weights are determined by metrics such as call frequency and delay. By comparing against historical data, the model can identify call patterns that deviate from normal behavior, such as a sudden increase in the number of calls to a service, a significant increase in average delay, or an abnormally high error rate, any of which may be a sign of service failure, network congestion or malicious attack.
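The call graph and the comparison against history can be sketched as follows. This is an illustrative rule-based stand-in, not the LLM analysis itself: edges are keyed by (caller, callee), weighted by call count, average delay and error rate, and flagged when they exceed a historical baseline. The growth ratio of 2.0 and the error-rate limit of 0.05 are assumed values.

```python
from collections import defaultdict


def build_call_graph(calls):
    """Aggregate (caller, callee, latency_ms, failed) into per-edge weights."""
    acc = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "errors": 0})
    for caller, callee, latency_ms, failed in calls:
        edge = acc[(caller, callee)]
        edge["count"] += 1
        edge["total_ms"] += latency_ms
        edge["errors"] += int(failed)
    return {
        key: {
            "count": e["count"],
            "avg_ms": e["total_ms"] / e["count"],
            "error_rate": e["errors"] / e["count"],
        }
        for key, e in acc.items()
    }


def deviating_edges(current, history, ratio=2.0, max_error_rate=0.05):
    """Return sorted edges whose count or delay grew by `ratio`,
    or whose error rate exceeds `max_error_rate`."""
    flagged = []
    for key, now in current.items():
        past = history.get(key)
        if past is None:
            continue  # new edge: no baseline to compare against
        if (now["count"] >= ratio * past["count"]
                or now["avg_ms"] >= ratio * past["avg_ms"]
                or now["error_rate"] > max_error_rate):
            flagged.append(key)
    return sorted(flagged)


history = build_call_graph([
    ("A", "B", 50.0, False), ("A", "B", 50.0, False), ("B", "C", 20.0, False),
])
current = build_call_graph([
    ("A", "B", 210.0, False), ("A", "B", 190.0, False), ("B", "C", 21.0, False),
])
flagged = deviating_edges(current, history)
```

Here only the A-to-B edge is flagged, because its average delay quadrupled while the B-to-C edge stayed close to its baseline.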

In one exemplary embodiment, suppose the system detects that the delay of calls from micro-service A to micro-service B suddenly increases from an average of 50 ms to over 200 ms and that the call frequency is 30% higher than usual, while the LLM observes that the response time of calls from micro-service B to micro-service C is normal but that calls from micro-service B to micro-service D fail in large numbers. Combining these observations with the tracking information, the model may infer that the abnormal call pattern originates in the specific path from micro-service B to micro-service D rather than in a performance bottleneck of micro-service B itself.
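The inference in this example can be mimicked by a simple heuristic, shown here only as a sketch: starting from the service where the symptom was first seen, follow abnormal edges downstream and report the last abnormal edge reached. The source attributes this reasoning to the LLM; the function below is a hand-written substitute, and the edge labels encode the scenario above.

```python
def locate_root_edge(start, edges):
    """Follow abnormal edges downstream from `start`; return the last one."""
    current, root, seen = start, None, set()
    while current not in seen:  # the `seen` set guards against call cycles
        seen.add(current)
        abnormal = [
            (src, dst) for (src, dst), bad in edges.items()
            if src == current and bad
        ]
        if not abnormal:
            break
        root = abnormal[0]
        current = root[1]
    return root


# True marks an edge whose metrics deviate from normal behavior.
edges = {
    ("A", "B"): True,   # delay rose from about 50 ms to over 200 ms
    ("B", "C"): False,  # response time normal
    ("B", "D"): True,   # calls fail in large numbers
}
root = locate_root_edge("A", edges)
```

The walk moves from A to B and then to D, so the reported edge is B to D, in line with the conclusion that the path from B to D, rather than B itself, is the likely source.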

In this process, the tracking information serves as a path identifier that provides accurate context for micro-service invocations and helps locate problem services and paths. The first observation data serves as a quantitative indicator that reveals the abnormal characteristics of the calling mode. Combining the two allows abnormal micro-service calling modes in the system to be identified accurately, providing key clues for rapid diagnosis and repair. By continually monitoring and dynamically updating the model, this mechanism can adapt to changes in the system environment, such as the introduction of new services or adjustments to the network configuration, and so remain effective at identifying abnormal calling modes.

In some embodiments of the present application, the network data packets include east-west data packets and north-south data packets, and the distributed tracking context may be added to, and tracked in, the headers of only the east-west data packets or only the north-south data packets.

As an optional implementation, the step of determining the first tracking information of the east-west network data packets of the microservice application includes: adding a distributed tracking context to the headers of the network data packets of the microservice application; determining initial first tracking information of the network data packets according to the distributed tracking context; and screening the initial first tracking information with a neural network model to remove noise information, thereby obtaining the first tracking information.

In some embodiments of the present application, the microservices may be instrumented at the microservice layer using the OTel (OpenTelemetry) libraries, with an additional distributed tracking context attached to the headers of east-west network packets for propagation. Meanwhile, a neural network model, the LLM, is introduced to perform preliminary screening and analysis of the metrics, logs and tracking information generated by OTel. For example, the LLM may identify an abnormal microservice invocation pattern according to predefined business rules and historical data patterns, and issue an early warning.
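Attaching a distributed tracking context to a packet header and propagating it to the next hop can be sketched as follows. The embodiment does not state the header format; the sketch assumes a header shaped like the W3C Trace Context `traceparent` field (version, 32-hex trace ID, 16-hex span ID, flags), and the helper names `inject_trace_context` and `extract_trace_id` are hypothetical.

```python
import re
import secrets

# Assumed layout: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject_trace_context(headers, trace_id=None):
    """Return a copy of headers carrying a tracking context.

    A new trace ID is generated unless one is passed in, which is how an
    existing trace is continued on the next east-west hop.
    """
    trace_id = trace_id or secrets.token_hex(16)
    span_id = secrets.token_hex(8)  # a fresh span ID for this hop
    out = dict(headers)
    out["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return out


def extract_trace_id(headers):
    """Return the trace ID from the header, or None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    return match.group(1) if match else None


# First hop starts a trace; the second hop reuses the same trace ID.
first_hop = inject_trace_context({"host": "service-b"})
second_hop = inject_trace_context({"host": "service-c"},
                                  trace_id=extract_trace_id(first_hop))
```

Because both hops share one trace ID, records collected at different services can later be joined into a single transmission path.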

Step S204, obtaining second observation data of a node application program in a container network layer of a system through a first extended Berkeley filter agent module, wherein the first extended Berkeley filter agent module is arranged in a kernel space of each node in a cluster node;

In the technical solution provided in step S204, the method further includes identifying, by the first extended Berkeley filter agent module, the tracking context information of the network data packets in the system to obtain second tracking information of the network data packets, where the second tracking information includes transmission path information of the network data packets at the container network layer.

As an optional implementation, after the second observation data of the node application program is obtained through the first extended Berkeley filter agent module, a neural network model may further be used to process the second observation data and the second tracking information to determine a fault type of the system and the fault cause corresponding to that fault type, where the fault type includes at least one of network congestion, packet loss, and network delay higher than a preset delay threshold.

In some embodiments of the present application, the first extended Berkeley filter agent module can collect the second observation data of the node application program directly in the kernel space, including but not limited to key performance metrics such as network interface throughput, the sending and receiving status of data packets, and network delay. These data are captured directly by eBPF programs without processing by a user-space agent, which ensures efficient and accurate data acquisition. The second tracking information is then extracted; it contains the trace ID covering the complete path of each network packet and its request-response cycle, providing detailed context for network activity.

The LLM then processes the second observation data and the second tracking information to identify a potential fault type of the system. In this process, the second observation data provides quantitative indicators of the system's operation, while the second tracking information provides specific details about the transmission of data packets in the network, including the network nodes they pass through, the delays they experience, and whether packet loss has occurred.

For example, assuming there is a network node in the cluster that is continuously highly loaded, the number of inbound and outbound packets for its network interface is large, which may lead to network congestion. The eBPF agent monitors the network interface of the node and collects second observation data, such as queue latency, number of transmission failures, and average network delay of the data packet. At the same time, by parsing the second trace information residing in the header of the network packets, the origin and destination of the packets, as well as the intermediate nodes they traverse, can be identified.

Thus, by learning from historical data, the LLM establishes baseline behavior for network congestion, packet loss rate and network delay under normal network operation. When the model receives the latest second observation data, it compares the data with the baseline and identifies possible abnormal patterns. For example, if the queue latency of data packets increases significantly, the packet loss rate exceeds the normal threshold, and the network delay exceeds the preset delay threshold, the model treats these anomalies as signs of network congestion.
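The baseline comparison can be illustrated with a simple statistical stand-in for the learned model. This sketch assumes that "significantly" means more than `k` standard deviations above the historical mean; the metric names, the `build_baseline` and `is_congested` helpers and all numeric values are hypothetical.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Map each metric name to (mean, standard deviation) of its history."""
    return {name: (mean(values), stdev(values)) for name, values in history.items()}


def is_congested(baseline, sample, delay_threshold_ms, k=3.0):
    """Flag congestion when queue latency and loss rate are far above baseline
    and the network delay exceeds the preset delay threshold."""
    def far_above(metric):
        mu, sigma = baseline[metric]
        return sample[metric] > mu + k * sigma

    return (far_above("queue_wait_ms")
            and far_above("loss_rate")
            and sample["delay_ms"] > delay_threshold_ms)


# Illustrative history collected under normal operation.
history = {
    "queue_wait_ms": [2.0, 2.5, 3.0, 2.5],
    "loss_rate": [0.001, 0.002, 0.001, 0.002],
}
baseline = build_baseline(history)

normal = {"queue_wait_ms": 2.6, "loss_rate": 0.002, "delay_ms": 40.0}
overloaded = {"queue_wait_ms": 12.0, "loss_rate": 0.03, "delay_ms": 150.0}
```

Here `is_congested(baseline, overloaded, delay_threshold_ms=100.0)` holds while the `normal` sample is not flagged; a trained model would replace the fixed `k` with learned behavior.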

In another exemplary embodiment, assume that the system observes frequent loss of network packets from microservice E to microservice F, which may lead to interrupted inter-service communication or data inconsistency. The eBPF agent module identifies all data packets from E to F through the second tracking information and collects second observation data, including the packet size, the send timestamp, and whether a receipt acknowledgement is missing. By processing this information, the LLM identifies the pattern and frequency of packet loss; if the packet loss rate reaches or exceeds a preset threshold, the model can diagnose a network packet loss fault in the system.
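The threshold check on the E to F flow can be sketched as below. The record fields (`src`, `dst`, `acked`) and the helper names are hypothetical; the only behavior taken from the text is that a loss rate that reaches or exceeds the preset threshold is diagnosed as a packet loss fault.

```python
def loss_rate(packets):
    """Fraction of packets that never received an acknowledgement."""
    if not packets:
        return 0.0
    lost = sum(1 for p in packets if not p["acked"])
    return lost / len(packets)


def has_packet_loss_fault(packets, src, dst, threshold):
    """Diagnose a fault when the src -> dst loss rate reaches or exceeds threshold."""
    flow = [p for p in packets if p["src"] == src and p["dst"] == dst]
    return loss_rate(flow) >= threshold


# Illustrative records: 10 packets from E to F, 2 of them unacknowledged,
# plus unrelated traffic that must not affect the result.
packets = [{"src": "E", "dst": "F", "size": 512, "acked": i not in (3, 7)}
           for i in range(10)]
packets.append({"src": "G", "dst": "H", "size": 128, "acked": False})
```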

Finally, if the network delay is above the preset delay threshold, the second observation data collected by the eBPF agent will include the timestamps at which data packets were sent and received, as well as the calculated delay. The LLM analyzes these delay data and, by comparison with historical data, identifies services or paths with abnormal delays. If the model detects that the average delay from microservice G to microservice H is much higher than normal, this may mean that the network connection between G and H is disturbed or has a bottleneck, and the model may diagnose a network delay fault.
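Deriving per-path delay from send and receive timestamps, and comparing it with history, can be sketched as follows. The record layout, the `factor` multiplier standing in for "much higher than normal", and both helper names are assumptions made for illustration.

```python
from collections import defaultdict


def average_delay_ms(records):
    """Average delay per (src, dst) path from timestamps given in seconds."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in records:
        acc = totals[(r["src"], r["dst"])]
        acc[0] += (r["recv_ts"] - r["send_ts"]) * 1000.0
        acc[1] += 1
    return {path: total / count for path, (total, count) in totals.items()}


def delayed_paths(current_ms, historical_ms, delay_threshold_ms, factor=2.0):
    """Paths whose delay exceeds both the preset threshold and factor x history.

    Paths with no history are never flagged, since there is nothing to compare.
    """
    return sorted(
        path for path, delay in current_ms.items()
        if delay > delay_threshold_ms
        and delay > factor * historical_ms.get(path, float("inf"))
    )


records = [
    {"src": "G", "dst": "H", "send_ts": 10.0, "recv_ts": 10.25},
    {"src": "G", "dst": "H", "send_ts": 11.0, "recv_ts": 11.5},
    {"src": "I", "dst": "J", "send_ts": 10.0, "recv_ts": 10.03125},
]
current = average_delay_ms(records)
historical = {("G", "H"): 50.0, ("I", "J"): 30.0}
suspects = delayed_paths(current, historical, delay_threshold_ms=100.0)
```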

In this way, the eBPF framework and the LLM work together to detect network congestion, packet loss and abnormal network delay at the container network layer and to identify the corresponding fault type. This provides timely early-warning information to operation and maintenance teams, helps them locate and resolve problems quickly, and improves the reliability and efficiency of the system. As the model continues to learn and be optimized, the accuracy and response speed of this identification mechanism keep improving, making it better suited to complex and changing container environments.

In some embodiments of the application, the RED metrics of an application, i.e., rate, errors, and duration, are monitored at the container network layer by eBPF agents (i.e., the first extended Berkeley filter agent module) deployed on the cluster nodes. The eBPF agents capture and export throughput, delay, and error codes. By using the DeepFlow eBPF library, the agent provides a mechanism for automatic context propagation and for collecting OpenTelemetry distributed traces. On this basis, the LLM analyzes the collected network data in real time, determines the root cause of problems such as network congestion and packet loss, and provides corresponding suggestions. For example, when an increase in network delay is detected, the LLM may analyze whether it is caused by excessive traffic to a certain microservice or by a network configuration problem, and suggest adjusting the traffic allocation or optimizing the network configuration.
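The three RED metrics can be derived from captured request records as in the following sketch. The record fields and the `red_metrics` helper are hypothetical, and counting HTTP status codes of 500 and above as errors is an assumption rather than something the embodiment specifies.

```python
def red_metrics(requests, window_s):
    """Compute rate, error ratio and mean duration over one observation window."""
    count = len(requests)
    if count == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "avg_duration_ms": None}
    errors = sum(1 for r in requests if r["status"] >= 500)  # assumed error rule
    total_ms = sum(r["duration_ms"] for r in requests)
    return {
        "rate_per_s": count / window_s,
        "error_ratio": errors / count,
        "avg_duration_ms": total_ms / count,
    }


# Illustrative records captured during a 2-second window.
requests = [
    {"status": 200, "duration_ms": 10.0},
    {"status": 200, "duration_ms": 20.0},
    {"status": 500, "duration_ms": 30.0},
    {"status": 503, "duration_ms": 40.0},
]
metrics = red_metrics(requests, window_s=2.0)
```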

Step S206, obtaining third observation data of the node at an infrastructure layer of the system through a second extended Berkeley filter agent module, wherein the second extended Berkeley filter agent module is arranged in a kernel space;

In some embodiments of the present application, the step of obtaining the third observation data of each node through the second extended Berkeley filter agent module includes: obtaining the third observation data of each node through the second extended Berkeley filter agent module and a data exporter; and processing the third observation data with the neural network model to obtain the association relationships among the data within the third observation data.

As an alternative embodiment, at the infrastructure layer, an eBPF agent (i.e., the second extended Berkeley filter agent module) may be deployed together with Cloudflare's eBPF exporter to obtain detailed, container-aware telemetry data about the cluster nodes and to facilitate debugging and troubleshooting of anomalies. The LLM performs in-depth analysis of this low-level telemetry data, mines potential associations within the data, and provides system administrators with a more comprehensive system health report.

Optionally, the second extended Berkeley filter agent module may aggregate node data for each domain of the cluster, where each domain can be regarded as a set of nodes made up of a subset of the cluster nodes. Nodes in different domains cannot interact with each other directly. During acquisition of the third observation data, the extended Berkeley filter (eBPF) program in the kernel of each node writes the passively collected data into eBPF maps, and the eBPF exporter then transmits the data to the second extended Berkeley filter agent module.
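Per-domain aggregation of node data can be sketched in user space as follows. The dictionaries stand in for data already read out of the eBPF maps by the exporter; the node names, metric names and the `aggregate_by_domain` helper are hypothetical, and summation is only one possible aggregation.

```python
from collections import defaultdict


def aggregate_by_domain(node_domain, node_metrics):
    """Sum each metric over the nodes of every domain.

    node_domain maps node name -> domain name; node_metrics maps node name ->
    {metric name: value}. Nodes never contribute to a domain other than their own.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for node, metrics in node_metrics.items():
        domain = node_domain[node]
        for name, value in metrics.items():
            totals[domain][name] += value
    return {domain: dict(metrics) for domain, metrics in totals.items()}


node_domain = {"node-1": "domain-a", "node-2": "domain-a", "node-3": "domain-b"}
node_metrics = {
    "node-1": {"syscalls": 100.0, "rx_packets": 40.0},
    "node-2": {"syscalls": 50.0, "rx_packets": 10.0},
    "node-3": {"syscalls": 7.0, "rx_packets": 3.0},
}
per_domain = aggregate_by_domain(node_domain, node_metrics)
```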

Step S208, summarizing and analyzing the first observation data, the second observation data and the third observation data to obtain the running state of the system.

In the technical solution provided in step S208, the first observation data includes at least one of metric data or log data of the microservice application, the second observation data includes communication performance metric data of the node application, and the third observation data includes container-aware telemetry data of the node. The container-aware telemetry data includes data that reflects the running status of applications inside the containers, container resource usage, and the interactions of the containers on the node. It may be used to monitor and analyze the health, performance, and security of the containerized environment.

The scheme provided by the embodiment of the application, which combines a dynamically injected eBPF framework with LLM technology to intelligently observe a Xinchuang (information technology application innovation) container environment, adopts multi-level data collection and analysis in order to achieve a comprehensive understanding of the whole system. The first, second and third observation data originate from the microservice layer, the container network layer and the infrastructure layer respectively. The data of each layer offers a specific perspective, and together they form a picture of the overall running state of the system.

The first observation data, i.e., the metric data or log data of the microservice application, includes the call frequency, response time and error rate between services as well as application logs, and reveals the health status and interaction patterns of the microservices. Through the OTel library, this information is attached to network packets for propagation and centralized processing. In the analysis phase, the LLM interprets the data and identifies abnormal patterns in microservice invocation or degradation of service performance, such as increased response delay, frequent errors, or unexpected service interaction behavior.

The second observation data concerns the communication performance metrics of the node application, mainly the degree of network congestion, the packet loss rate and the network delay. These data are captured in real time in the kernel space by the eBPF agent and directly reflect the efficiency of network interaction among containers and between containers and the outside. From the second observation data, the LLM can gain insight into potential network-layer problems, such as congestion on a network path, unstable packet transmission or abnormal delay, give timely warning of the risk of network performance degradation, and assist operation and maintenance decisions.

The third observation data is the container-aware telemetry data of the node. It covers key information about the containerized environment, including the running state of applications in the containers, container resource usage (such as CPU, memory and disk I/O), and the interactions between containers. These data reflect not only the resource consumption and performance of a single container but also the interactions among containers and between containers and the host. Using its semantic understanding and pattern recognition capabilities, the LLM performs in-depth analysis of the third observation data, evaluates the overall health of the container environment, and predicts possible resource bottlenecks or security vulnerabilities, helping to ensure effective utilization and safe operation of system resources.

By integrating the first, second and third observation data, the LLM can perform data fusion and analysis and establish a comprehensive view of the running state of the system. The model combines the call patterns obtained from the microservice layer with the communication performance of the network layer and the resource usage of the infrastructure layer, so that the layers verify one another and system-level problems can be identified. For example, if the first observation data shows a rise in microservice call delay, the second observation data reveals increased packet loss on a particular network path, and the third observation data shows a surge in the CPU utilization of a container, the LLM can correlate these otherwise isolated observations and infer that the network delay and packet transmission problems may be caused by excessive CPU resource consumption.
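The cross-layer correlation in this example can be illustrated with a rule-based stand-in for the LLM. All field names, thresholds and the `correlate_layers` helper are hypothetical; the sketch only shows how observations from the three layers might be joined on a shared service name.

```python
def correlate_layers(first, second, third, delay_ms=200.0, loss=0.05, cpu=0.9):
    """Join the three layers of observation data by service name.

    first:  {service: {"call_delay_ms": float}}            (microservice layer)
    second: {(src, dst): {"loss_rate": float}}             (container network layer)
    third:  {container: {"service": str, "cpu_util": float}} (infrastructure layer)
    """
    findings = []
    for service, stats in sorted(first.items()):
        if stats["call_delay_ms"] <= delay_ms:
            continue  # this service is not slow, nothing to explain
        lossy = sorted(path for path, s in second.items()
                       if service in path and s["loss_rate"] > loss)
        hot = sorted(name for name, s in third.items()
                     if s["service"] == service and s["cpu_util"] > cpu)
        if lossy and hot:
            findings.append({
                "service": service,
                "lossy_paths": lossy,
                "hot_containers": hot,
                "hypothesis": "CPU saturation may be causing delay and packet loss",
            })
    return findings


first = {"orders": {"call_delay_ms": 350.0}, "users": {"call_delay_ms": 40.0}}
second = {("gateway", "orders"): {"loss_rate": 0.12},
          ("gateway", "users"): {"loss_rate": 0.0}}
third = {"orders-pod-1": {"service": "orders", "cpu_util": 0.97},
         "users-pod-1": {"service": "users", "cpu_util": 0.20}}
findings = correlate_layers(first, second, third)
```

The output is a hypothesis for an operator or a downstream model to verify, not a confirmed root cause.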

Through such comprehensive analysis, the LLM can not only detect and diagnose faults or performance bottlenecks occurring in the system but also predict challenges that may arise in the future, providing a data-driven basis for optimizing resource allocation strategies, strengthening security measures and improving user experience. As data accumulates and the model optimizes itself, the analysis framework becomes more intelligent and efficient, enabling intelligent operation and maintenance of the Xinchuang container environment.

In some embodiments of the present application, Prometheus may also be used to store the data from the observable data sources, which is then visualized using a Grafana dashboard.

In some embodiments of the present application, as shown in FIG. 3, an observable agent architecture based on the extended Berkeley filter is also provided. The agent architecture is also integrated with the Kubernetes infrastructure. As can be seen from FIG. 3, the scheme for enhancing full-link observability of cloud-native microservices based on eBPF and LLM technologies consists of a plurality of eBPF programs running in kernel space. These eBPF programs are triggered by system calls issued by other applications, and a Prometheus collector running on a worker node (Worker) periodically pulls the collected data from the eBPF agents on the cluster. The master node (Master Node) exports relevant data through the Control Plane and Network components. The master node and the worker nodes interact through API communication. Further, the eBPF Agents in FIG. 3 and FIG. 4 refer to eBPF collection clients used to collect node data. If the collected data does not require cross-domain transmission, it is sent directly to the Prometheus instance in the domain. If cross-domain transmission is required, the data collected by the collection client in one domain is first sent to a proxy (i.e., the eBPF agent) and then forwarded to Prometheus.
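The two delivery paths described for collected data can be sketched as a small routing decision. The component labels and the `delivery_path` helper are hypothetical; the sketch only encodes the rule that data stays within the domain when possible and otherwise passes through the proxy first.

```python
def delivery_path(client_domain, prometheus_domain):
    """Return the ordered hops that collected data takes to reach Prometheus."""
    if client_domain == prometheus_domain:
        # Same domain: the collection client sends directly to Prometheus.
        return ["collection-client", "prometheus"]
    # Different domain: nodes cannot interact directly, so go through the proxy.
    return ["collection-client", "proxy", "prometheus"]


same_domain = delivery_path("domain-a", "domain-a")
cross_domain = delivery_path("domain-b", "domain-a")
```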

It can be seen that the scheme provided by the embodiment of the application achieves complete microservice observability for Kubernetes applications by combining three observable levels: the microservice layer, the container network layer and the infrastructure layer. Metrics and tracking information can be collected from distributed containers and performance anomalies can be detected, balancing scalability and efficiency. eBPF-based agents are deployed on each cluster node to dynamically extend the functions of the operating system without affecting the life cycle of the applications. The observable data generated at each level is exported to the Prometheus instance and visualized through a custom Grafana dashboard. The full-link observability architecture for cloud-native microservices inside a worker node is shown in FIG. 4. In addition, in the method provided by the application, the microservices run in the container runtime environment and communicate through Envoy sidecar proxies. The eBPF Agent (eBPF collection client) runs as a DaemonSet in the kernel space of all cluster nodes. The sidecar, the collector and the application provide distributed trace IDs and metrics to the Prometheus instance, with a scrape interval of 1 second.

The method thus comprises the steps of: obtaining, at the microservice layer of the system, first observation data of the microservice application program in the microservice layer; obtaining second observation data of the node application program in the container network layer of the system through the first extended Berkeley filter agent module, where the first extended Berkeley filter agent module is arranged in the kernel space of each node among the cluster nodes; obtaining third observation data of the nodes in the infrastructure layer of the system through the second extended Berkeley filter agent module, where the second extended Berkeley filter agent module is arranged in the kernel space; and summarizing and analyzing the first observation data, the second observation data and the third observation data to obtain the running state of the system.

In addition, the scheme provided by the embodiment of the application, which combines a dynamically injected eBPF framework with LLM technology to intelligently observe the Xinchuang container environment, simultaneously achieves three effects: reducing OpenTelemetry Collector overhead, eliminating sidecars from distributed tracing, and improving the accuracy of host-level metrics.

Regarding the reduction of OpenTelemetry Collector overhead, manually instrumenting the application is the first step toward observability. In the embodiment of the application, the application is instrumented by integrating the OTel libraries. The embodiment provides a two-step method for achieving complete application-level observability. First, the OTel language SDKs are used to generate metrics, logs and tracking information; this requires application developers to instrument the application, but is a reliable way to expose custom (organization-specific) observable data. Next, the eBPF Agent collects the metrics and tracking information generated by the OpenTelemetry-instrumented application and exports them to Prometheus. In the process, the LLM can compress and optimize the collected data in real time, reducing the volume of data transmitted and stored. For example, the LLM may identify duplicate or redundant data and merge or delete it, thereby improving the efficiency of data transmission and storage.
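Merging duplicate data before export can be illustrated with a simple deterministic stand-in for the LLM-based optimization. The sample layout and the `merge_duplicates` helper are hypothetical, and treating two samples as duplicates when metric name, label set and timestamp all match is an assumption.

```python
def merge_duplicates(samples):
    """Keep one sample per (name, labels, timestamp); the last occurrence wins."""
    merged = {}
    for sample in samples:
        key = (sample["name"],
               tuple(sorted(sample["labels"].items())),
               sample["ts"])
        merged[key] = sample
    return list(merged.values())


samples = [
    {"name": "http_requests", "labels": {"svc": "a"}, "ts": 1, "value": 10},
    {"name": "http_requests", "labels": {"svc": "a"}, "ts": 1, "value": 10},
    {"name": "http_requests", "labels": {"svc": "b"}, "ts": 1, "value": 4},
]
compact = merge_duplicates(samples)
```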

Regarding the elimination of sidecars from distributed tracing, the Kubernetes microservice architecture uses an ingress or reverse proxy (e.g., Envoy) as the entry point for all incoming network packets to form a network service mesh. OpenTelemetry uses this proxy to generate distributed traces of the application. When a Pod forwards a request to other Pods in the network, the Pod's Envoy sidecar extracts and propagates the distributed trace ID using the X-Request-ID header. By aggregating the information from multiple sidecars, the round trip of the request-response cycle can be visualized in distributed tracing. The Envoy sidecar approach also tracks throughput, delay, and errors. Sidecars such as Envoy help provide layer-7 (application-layer) observability because they can inspect network packets and decrypt packets protected by Transport Layer Security (TLS). However, a complete network-stack observability solution does not require deploying user-space sidecars. The solution proposed in the embodiment of the application uses eBPF to natively parse all data packets flowing through the network and extract the X-Request-ID header to generate complete distributed traces; DeepFlow demonstrates such an eBPF-based kernel-space distributed tracing method. In the process, the LLM can analyze the distributed tracing data, quickly locate fault points and provide guidance for fault repair. For example, when a request is lost or its delay is too high, the LLM can analyze the trace data to find out which node or microservice is at fault and give corresponding repair suggestions.
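Assembling a distributed trace from parsed packets that share an X-Request-ID can be sketched in user space as follows. The packet records stand in for data already parsed in the kernel; their field names and the `request_id` and `build_traces` helpers are hypothetical.

```python
from collections import defaultdict


def request_id(headers):
    """Look up X-Request-ID case-insensitively, as HTTP header names are."""
    for name, value in headers.items():
        if name.lower() == "x-request-id":
            return value
    return None


def build_traces(packets):
    """Group packets by request ID and order each group's hops by timestamp."""
    grouped = defaultdict(list)
    for packet in packets:
        rid = request_id(packet["headers"])
        if rid is not None:
            grouped[rid].append(packet)
    return {
        rid: [(p["src"], p["dst"]) for p in sorted(hops, key=lambda p: p["ts"])]
        for rid, hops in grouped.items()
    }


packets = [
    {"ts": 2, "src": "svc-a", "dst": "svc-b", "headers": {"x-request-id": "r1"}},
    {"ts": 1, "src": "ingress", "dst": "svc-a", "headers": {"X-Request-ID": "r1"}},
    {"ts": 3, "src": "ingress", "dst": "svc-c", "headers": {"X-Request-ID": "r2"}},
    {"ts": 4, "src": "svc-c", "dst": "db", "headers": {}},  # untraced, skipped
]
traces = build_traces(packets)
```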

Regarding the improved accuracy of host-level metrics, most host-level observability agents are privileged applications that access the /proc virtual file system. The /proc folder contains runtime system information (e.g., hardware configuration). Prometheus NodeExporter is a wrapper that reads data from the /proc folder and serves it through an HTTP endpoint. Unlike NodeExporter, cAdvisor also provides a per-container breakdown of resource utilization while exporting metrics; that is, it is a container-aware exporter. Although NodeExporter and cAdvisor provide time-series metrics by sampling data in the /proc folder, information can be lost depending on the chosen sampling interval. The eBPF-based scheme provided by the embodiment of the application adopts a different sampling method: the eBPF Agent collects metrics directly by running eBPF programs in the Linux kernel. These metrics include overall system performance, resource utilization, and network traffic information. The collected metrics are written to eBPF maps so that all events are captured without losing information. Using the eBPF exporter significantly reduces overhead, to only 1% of the NodeExporter overhead. At the same time, comprehensive metric collection is ensured: kernel-level metrics become accessible, with more than 2000 kernel-level metrics and tracepoints available that other tools cannot access. The LLM can perform in-depth analysis of these host-level metrics and provide more accurate assessment and prediction of system performance. For example, the LLM can predict the system's future resource usage trends from historical metric data, helping administrators plan and adjust resources in advance.
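Why interval sampling can lose information that event-driven collection retains can be shown with a small simulation. This is plain Python over a synthetic series and does not run any eBPF code; the series values, the interval and the helper names are illustrative assumptions.

```python
def sampled_peak(series, interval):
    """Peak seen by a poller that reads one value every `interval` ticks."""
    return max(series[::interval])


def event_peak(series):
    """Peak seen when every event is recorded, as with per-event collection."""
    return max(series)


# Synthetic CPU utilization (%): steady at 10 with a one-tick spike to 95.
series = [10] * 30
series[17] = 95

polled = sampled_peak(series, interval=15)  # reads ticks 0 and 15 only
recorded = event_peak(series)
```

The poller reports a peak of 10 because the spike at tick 17 falls between two samples, while per-event recording reports 95.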

FIG. 5 compares the observation capability of the method provided by the embodiment of the application with that of related-art schemes such as OpenTelemetry, Envoy Sidecar, cAdvisor and NodeExporter. In FIG. 5, "Our solution" denotes the method provided by the embodiment of the application, "v" indicates that the performance on the evaluation metric meets the preset requirement, and "x" indicates that it does not.

The embodiment of the application provides a system state acquisition device, and FIG. 6 is a schematic structural diagram of the device. As can be seen from FIG. 6, the device comprises: a first processing module 60 for obtaining first observation data of the microservice application program in the microservice layer of the system; a second processing module 62 for obtaining second observation data of the node application program in the container network layer of the system through the first extended Berkeley filter agent module, where the first extended Berkeley filter agent module is arranged in the kernel space of each node among the cluster nodes; a third processing module 64 for obtaining third observation data of the nodes in the infrastructure layer of the system through the second extended Berkeley filter agent module, where the second extended Berkeley filter agent module is arranged in the kernel space; and a fourth processing module 66 for summarizing and analyzing the first observation data, the second observation data and the third observation data to obtain the running state of the system.

In some embodiments of the present application, the step in which the first processing module 60 obtains the first observation data of the microservice application program in the microservice layer of the system includes: determining first tracking information of network data packets in the system, where the first tracking information indicates transmission path information of the network data packets in the microservice layer; filtering initial first observation data with a neural network model to remove noise data, thereby obtaining the first observation data; and processing the first observation data and the tracking information with the neural network model to obtain a microservice call pattern recognition result, where the result indicates whether an abnormal microservice call pattern exists in the microservice layer of the system.

In some embodiments of the present application, the step in which the first processing module 60 determines the first tracking information of the east-west network data packets of the microservice application includes: adding a distributed tracking context to the headers of the network data packets of the microservice application; determining initial first tracking information of the network data packets according to the distributed tracking context; and filtering the initial first tracking information with a neural network model to remove noise information, thereby obtaining the first tracking information.

In some embodiments of the present application, the second processing module 62 is further configured to identify the tracking context information of the network data packets in the system through the first extended Berkeley filter agent module and obtain second tracking information of the network data packets, where the second tracking information includes transmission path information of the network data packets at the container network layer.

In some embodiments of the present application, after the second observation data of the node application program is obtained through the first extended Berkeley filter agent module, the second processing module 62 is further configured to process the second observation data and the second tracking information with a neural network model and determine a fault type of the system and the fault cause corresponding to that fault type, where the fault type includes at least one of network congestion, packet loss, and network delay higher than a preset delay threshold.

In some embodiments of the present application, the step in which the third processing module 64 obtains the third observation data of each node through the second extended Berkeley filter agent module includes: obtaining the third observation data of each node through the second extended Berkeley filter agent module and a data exporter; and processing the third observation data with the neural network model to obtain the association relationships among the data within the third observation data.

In some embodiments of the application, the first observation data comprises at least one of metric data or log data of the microservice application, the second observation data comprises communication performance metric data of the node application, and the third observation data comprises container-aware telemetry data of the node.

The respective modules in the system state acquiring device may be program modules (for example, a set of program instructions for implementing a specific function), or may be hardware modules, and the latter may be expressed in the form of, but not limited to, a processor, or the functions of the respective modules may be implemented by a processor.

According to the embodiment of the application, a nonvolatile storage medium is provided. A program is stored in the nonvolatile storage medium and, when the program runs, the device where the nonvolatile storage medium is located is controlled to execute the following system state acquisition method: obtaining first observation data of the microservice application program in the microservice layer of the system; obtaining second observation data of the node application program in the container network layer of the system through the first extended Berkeley filter agent module, where the first extended Berkeley filter agent module is arranged in the kernel space of each node among the cluster nodes; obtaining third observation data of the nodes in the infrastructure layer of the system through the second extended Berkeley filter agent module, where the second extended Berkeley filter agent module is arranged in the kernel space; and summarizing and analyzing the first observation data, the second observation data and the third observation data to obtain the running state of the system.

According to the embodiment of the application, an electronic device is provided, comprising a memory and a processor, where the processor is configured to run a program stored in the memory, and the program, when run, executes the following system state acquisition method: obtaining, at the microservice layer of the system, first observation data of the microservice application program in the microservice layer; obtaining second observation data of the node application program in the container network layer of the system through the first extended Berkeley filter agent module, where the first extended Berkeley filter agent module is arranged in the kernel space of each node among the cluster nodes; obtaining third observation data of the nodes in the infrastructure layer of the system through the second extended Berkeley filter agent module, where the second extended Berkeley filter agent module is arranged in the kernel space; and summarizing and analyzing the first observation data, the second observation data and the third observation data to obtain the running state of the system.

According to the embodiment of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the following system state acquisition method: obtaining, at the microservice layer of the system, first observation data of the microservice application program in the microservice layer; obtaining second observation data of the node application program in the container network layer of the system through the first extended Berkeley filter agent module, where the first extended Berkeley filter agent module is arranged in the kernel space of each node among the cluster nodes; obtaining third observation data of the nodes in the infrastructure layer of the system through the second extended Berkeley filter agent module, where the second extended Berkeley filter agent module is arranged in the kernel space; and summarizing and analyzing the first observation data, the second observation data and the third observation data to obtain the running state of the system.

In the foregoing embodiments of the present application, each embodiment is described with its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, units, or modules, and may be electrical or in other forms.

The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed across a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes over the related art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The storage medium includes a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.

The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and refinements without departing from the principles of the present application, and such modifications and refinements are also to be regarded as falling within the scope of protection of the present application.

Claims

1. A method for acquiring a system state, comprising:

acquiring, in a micro-service layer of a system, first observation data of a micro-service application program in the micro-service layer;

acquiring second observation data of a node application program in a container network layer of the system through a first extended Berkeley filter agent module, wherein the first extended Berkeley filter agent module is arranged in a kernel space of each node in a cluster node;

acquiring third observation data of the node at an infrastructure layer of the system through a second extended Berkeley filter agent module, wherein the second extended Berkeley filter agent module is arranged in the kernel space;

and carrying out summarization analysis on the first observation data, the second observation data and the third observation data to obtain the running state of the system.

2. The method of claim 1, wherein acquiring the first observation data of the micro-service application program in the micro-service layer of the system comprises:

determining first tracking information of a network data packet in the system, wherein the first tracking information is used for indicating transmission path information of the network data packet in the micro-service layer;

screening initial first observation data by adopting a neural network model, and removing noise data from the initial first observation data to obtain the first observation data;

and processing the first observation data and the first tracking information by adopting the neural network model to obtain a micro-service calling mode identification result, wherein the micro-service calling mode identification result is used for indicating whether an abnormal micro-service calling mode exists in the micro-service layer of the system.

3. The method of claim 2, wherein determining the first tracking information of the east-west network data packet of the micro-service application program comprises:

adding a distributed tracking context to a header of the network data packet of the micro-service application program;

determining initial first tracking information of the network data packet according to the distributed tracking context;

and screening the initial first tracking information by adopting the neural network model, and removing noise information from the initial first tracking information to obtain the first tracking information.

4. The method of claim 1, wherein the method further comprises:

identifying tracking context information of a network data packet in the system through the first extended Berkeley filter agent module to obtain second tracking information of the network data packet, wherein the second tracking information comprises transmission path information of the network data packet in the container network layer.

5. The method of claim 4, wherein after the second observation data of the node application program is acquired through the first extended Berkeley filter agent module, the method further comprises:

processing the second observation data and the second tracking information by adopting a neural network model to determine a fault type of the system and a fault cause corresponding to the fault type, wherein the fault type comprises at least one of network congestion, packet loss, and a network delay higher than a preset delay threshold.

6. The method of claim 1, wherein acquiring the third observation data of the node through the second extended Berkeley filter agent module comprises:

acquiring the third observation data of each node through the second extended Berkeley filter agent module and a data exporter;

and processing the third observation data by adopting a neural network model to obtain association relations among the data items within the third observation data.

7. The method of claim 1, wherein the first observation data comprises at least one of metric data or log data of the micro-service application program, the second observation data comprises communication performance metric data of the node application program, and the third observation data comprises container-aware telemetry data of the node.

8. A system state acquisition apparatus, comprising:

a first processing module, configured to acquire first observation data of a micro-service application program in a micro-service layer of a system;

a second processing module, configured to acquire second observation data of a node application program in a container network layer of the system through a first extended Berkeley filter agent module, wherein the first extended Berkeley filter agent module is arranged in a kernel space of each node among the cluster nodes;

a third processing module, configured to acquire third observation data of the node at an infrastructure layer of the system through a second extended Berkeley filter agent module, wherein the second extended Berkeley filter agent module is arranged in the kernel space;

and a fourth processing module, configured to perform summary analysis on the first observation data, the second observation data, and the third observation data to obtain the running state of the system.

9. A nonvolatile storage medium, wherein a program is stored in the nonvolatile storage medium, and wherein the program, when executed, controls a device in which the nonvolatile storage medium is located to execute the system state acquisition method according to any one of claims 1 to 7.

10. An electronic device comprising a memory and a processor for executing a program stored in the memory, wherein the program is executed to perform the method of acquiring a system state according to any one of claims 1 to 7.

11. A computer program product comprising a computer program which, when executed by a processor, implements a method of acquiring a system state according to any one of claims 1 to 7.