WO2024125111A1

WO2024125111A1 - Cloud service alarm method and apparatus, and system and computing node

Info

Publication number: WO2024125111A1
Application number: PCT/CN2023/127455
Authority: WO
Inventors: 谢尚周
Original assignee: 华为云计算技术有限公司
Priority date: 2022-12-12
Filing date: 2023-10-30
Publication date: 2024-06-20

Abstract

The method of the present application comprises: a database node sends usage indicators of the database node to a data lake, and the data lake stores the usage indicators; when fault handling of the database node is completed, a high-level management service receives a health analysis event of the database node; upon receiving the health analysis event, the high-level management service obtains the usage indicators of the database node from the data lake; the high-level management service analyzes the usage indicators to determine whether a user application has restored its connection to a healthy database node; and if the user application is not connected to a healthy database node, an alarm platform alerts the user. In this way, during fault handling of database nodes, if a user application has not connected to a healthy database node, the user is alerted promptly and reminded to switch the user application back to a healthy database node in time.

Description

Cloud service alarm method, device, system and computing node

Technical Field

The present application relates to the field of computers, and in particular to a cloud service alarm method, a data node, a distributed database, a computing node, and a computer program product.

Background

Ultimate high availability is the goal pursued by database cloud services. In real-world scenarios, hardware failures, unexpected network fluctuations, and similar events trigger a high-availability switchover of the database, which is perceptible to the database client on the user side.

Conventional database connection tools use manual connections: the tool connects to the database when the user starts it, and the connection is then kept unchanged. Once the original connection is dropped because of an abnormality, the client tool reports an error only when the next query is attempted, and it does not reconnect automatically after a network fluctuation.

Summary of the Invention

In view of this, the present application provides a cloud service alarm method and apparatus, a system, a computing node, a computer program product, and a non-volatile storage medium. During fault handling of a database node, if the user is still using an abnormal database, the user is alerted promptly.

In a first aspect, the present application provides a cloud service alarm method. A database node sends usage indicators of the database node to a data lake, and the data lake stores the usage indicators of the database node; when fault handling of the database node is completed, a high-level management service receives a health analysis event of the database node; upon receiving the health analysis event, the high-level management service obtains the usage indicators of the database node from the data lake; the high-level management service analyzes the usage indicators to determine whether a user application has restored its connection to a healthy database node; and if the user application is not connected to a healthy database node, an alarm platform alerts the user.

In this way, after the fault of the database node has been handled, if the user application has not yet connected to a healthy database node, the user is alerted promptly and reminded to switch the user application back to a healthy database node in time.

In a possible design of the first aspect, the high-level management service subscribes, at a message service, to the health analysis event of the database node; and the message service receives the health analysis event of the database node.

In a possible design of the first aspect, completing the fault handling of the database node includes: a master database node and a slave database node completing a master-slave switchover.

In a possible design of the first aspect, completing the fault handling of the database node includes: the database node completing fault recovery.

In a possible design of the first aspect, the usage indicators of the database node include one or more of the following:

the processor utilization of the database node;

the memory utilization of the database node;

the number of user connections to the database node.

In a second aspect, the present application provides a cloud service alarm apparatus. The apparatus includes multiple functional modules for implementing the different steps of the method provided in the first aspect or in any possible design of the first aspect.

In a third aspect, the present application provides a system, comprising: a database node, configured to send usage indicators of the database node to a data lake; the data lake, configured to store the usage indicators of the database node; a high-level management service, configured to receive a health analysis event of the database node when fault handling of the database node is completed, obtain the usage indicators of the database node from the data lake, and analyze the usage indicators to determine whether a user application has restored its connection to a healthy database node; and an alarm platform, configured to alert the user if the user application is not connected to a healthy database node.

In this way, during fault handling of a database node, if the user application has not yet connected to a healthy database node, the user is alerted promptly and reminded to switch the user application back to a healthy database node in time.

In a possible design of the third aspect, the high-level management service is configured to subscribe, at a message service, to the health analysis event of the database node; and the message service is configured to receive the health analysis event of the database node.

In a possible design of the third aspect, completing the fault handling of the database node includes: a master database node and a slave database node completing a master-slave switchover.

In a possible design of the third aspect, completing the fault handling of the database node includes: the database node completing fault recovery.

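In a possible design of the third aspect, the usage indicators of the database node include one or more of the following: the processor utilization of the database node; the memory utilization of the database node; and the number of user connections to the database node.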

In a fourth aspect, the present application provides a computing device cluster, comprising at least one computing device, each computing device including a processor and a memory; the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster performs the method provided in the first aspect or in any possible design of the first aspect.

The present application provides a computing node that includes a processor and a memory. The processor executes instructions stored in the memory, so that the computing node deploys the apparatus provided in the second aspect.

In a fifth aspect, the present application provides a computer-readable storage medium storing instructions. When a processor of a computing node executes the instructions, the computing node performs the method provided in the first aspect or in any possible design of the first aspect.

The present application provides a computer-readable storage medium storing instructions. When a processor of a computing node executes the instructions, the computing node implements the apparatus provided in the second aspect.

In a sixth aspect, the present application provides a computer program product comprising instructions stored in a computer-readable storage medium. A processor of a computing node can read the instructions from the computer-readable storage medium; when the processor executes the instructions, the computing node performs the method provided in the first aspect or in any possible design of the first aspect.

The present application provides a computer program product comprising instructions stored in a computer-readable storage medium. A processor of a computing node can read the instructions from the computer-readable storage medium; when the processor executes the instructions, the computing node implements the apparatus provided in the second aspect.

Brief Description of the Drawings

FIG. 1 is a flowchart of a cloud service alarm method provided by way of example in this application;

FIG. 2 is a flowchart of a cloud service alarm method provided by way of example in this application;

FIG. 3 is a schematic structural diagram of a computing device 100 provided by way of example in this application;

FIG. 4 is a schematic architecture diagram of a cluster provided by way of example in this application;

FIG. 5 is a schematic architecture diagram of a cluster provided by way of example in this application.

Detailed Description

The technical solution provided in this application is described below with reference to the drawings.

FIG. 1 provides an example of a cloud service alarm method. The flow of the method is described below with reference to FIG. 1. The method includes at least steps S11 to S16.

Step S11: The database node sends usage indicators of the database node to the data lake.

A database node hosts a database and provides database services to users. For example, users can use the database on the database node.

There may be one or more database nodes. For example, a database system includes a master database node and one or more slave database nodes. The master database node supports user read and write operations, and a slave database node can support user read operations. When the master database node fails or needs to be upgraded, a master-slave switchover can be performed: the master database node becomes a slave database node, and one slave database node becomes the master database node.
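The role swap performed by a master-slave switchover can be sketched as follows. This is a minimal illustration only; the class `DatabaseCluster`, its fields, and the node names are hypothetical and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatabaseCluster:
    """Hypothetical model of one master node and its slave nodes."""
    master: str
    slaves: List[str] = field(default_factory=list)

    def switch_over(self, new_master: str) -> None:
        """Promote one slave to master and demote the old master to slave."""
        if new_master not in self.slaves:
            raise ValueError(f"{new_master} is not a slave database node")
        self.slaves.remove(new_master)
        self.slaves.append(self.master)  # the old master becomes a slave
        self.master = new_master


cluster = DatabaseCluster(master="db-node-1", slaves=["db-node-2", "db-node-3"])
cluster.switch_over("db-node-2")
print(cluster.master, cluster.slaves)
```

After the switchover, user applications that still hold connections to the old master are the case the alarm method is meant to detect.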

As a possible implementation of step S11, each database node autonomously monitors its own usage indicators and actively reports or sends the monitored usage indicators to the data lake.

As a possible implementation example of the present application, the usage indicators of the database node include one or more of the following: the processor utilization of the database node, the memory utilization of the database node, and the number of user connections to the database node.

For example, a database node uses a central processing unit (CPU) and memory to run a database. The database node monitors its CPU utilization and memory utilization in real time. In addition, the database node monitors in real time the total number of user database connections (that is, the number of user connections to the database node).
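A usage-indicator sample as described above could be represented as a small record and serialized for reporting. The sketch below is an assumption for illustration: the application does not define a schema, so the class `UsageIndicator`, its field names, and the JSON encoding are all hypothetical.

```python
from dataclasses import dataclass, asdict
import json
import time


@dataclass(frozen=True)
class UsageIndicator:
    """One usage-indicator sample reported by a database node (hypothetical schema)."""
    node_id: str               # identifier of the reporting database node
    timestamp: float           # sample time, seconds since the epoch
    cpu_utilization: float     # processor utilization, 0.0 to 1.0
    memory_utilization: float  # memory utilization, 0.0 to 1.0
    user_connections: int      # total number of user database connections


def to_report(sample: UsageIndicator) -> str:
    """Serialize a sample as JSON text for sending to the data lake."""
    return json.dumps(asdict(sample), sort_keys=True)


sample = UsageIndicator("db-node-1", time.time(), 0.42, 0.63, 17)
print(to_report(sample))
```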

Step S12: The data lake stores the usage indicators of the database node.

The data lake provides cloud storage. In this application, the data lake is used to store the usage indicators of database nodes.
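The store-and-query role of the data lake can be sketched with an in-memory stand-in. The class `DataLake` and its `store` and `query` methods are hypothetical names chosen for illustration; a real data lake would be a persistent cloud storage system, and only the user-connection indicator is kept here for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (timestamp, user_connections); CPU and memory indicators are omitted for brevity.
Sample = Tuple[float, int]


class DataLake:
    """In-memory stand-in for the data lake (hypothetical interface)."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[Sample]] = defaultdict(list)

    def store(self, node_id: str, timestamp: float, user_connections: int) -> None:
        """Store one usage-indicator sample for a database node (step S12)."""
        self._samples[node_id].append((timestamp, user_connections))

    def query(self, node_id: str, since: float) -> List[Sample]:
        """Return the node's samples taken at or after `since`, oldest first."""
        return sorted(s for s in self._samples.get(node_id, []) if s[0] >= since)


lake = DataLake()
lake.store("db-node-1", 100.0, 12)
lake.store("db-node-1", 160.0, 0)
print(lake.query("db-node-1", since=120.0))
```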

Step S13: When fault handling of the database node is completed, the high-level management service receives the health analysis event of the database node.

A database node may fail, for example through a single-node failure, or may undergo a master-slave switchover. After the fault is cleared, a health analysis event is triggered, which in turn triggers the high-level management service to analyze whether the user is using a healthy database.

Two example scenarios of completing the fault handling of the database node follow. In scenario 1, the master database node and the slave database node complete a master-slave switchover. In scenario 2, the database node completes fault recovery.

As an example of how the high-level management service obtains the health analysis event, the high-level management service subscribes, at the message service, to the health analysis event of the database node, and the message service receives the health analysis event of the database node.
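The subscribe-then-deliver pattern described above can be sketched as a minimal publish/subscribe service. The class `MessageService`, the topic name `health_analysis`, and the event fields are assumptions made for illustration; the application does not specify a message format or API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class MessageService:
    """Minimal in-process publish/subscribe service (hypothetical interface)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler, e.g. the high-level management service."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


received: List[Event] = []
bus = MessageService()
bus.subscribe("health_analysis", received.append)
bus.publish("health_analysis", {"node_id": "db-node-2", "reason": "switchover_completed"})
print(received)
```

Here `received.append` stands in for the handler of the high-level management service.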

Step S14: Upon receiving the health analysis event, the high-level management service obtains the usage indicators of the database node from the data lake.

Step S15: The high-level management service analyzes the usage indicators to determine whether the user application has restored its connection to a healthy database node.

The health analysis event triggers the high-level management service to analyze whether the user application is connected to, and using, a healthy database. For example, after the master database node and the slave database node complete a master-slave switchover, the service analyzes whether the user's database connections have switched to the new master database node, that is, whether the user is currently using the new master database node to handle business.
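The application does not state the concrete analysis rule used in step S15. One possible heuristic, shown here purely as an assumption, is to treat the connection as restored if the healthy node reports at least a minimum number of user connections in any sample taken after the health analysis event. The function name, parameters, and threshold below are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, int]  # (timestamp, user_connections)


def has_restored_connection(healthy_node_samples: Sequence[Sample],
                            event_time: float,
                            min_connections: int = 1) -> bool:
    """Assumed heuristic: True if any sample taken at or after event_time
    shows at least min_connections user connections on the healthy node."""
    return any(ts >= event_time and conns >= min_connections
               for ts, conns in healthy_node_samples)


# The new master shows 4 user connections after the switchover at t = 15.0.
print(has_restored_connection([(10.0, 0), (20.0, 4)], event_time=15.0))
```

A production rule would likely also weigh CPU and memory utilization and use an observation window; those refinements are left out of this sketch.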

In the present application, if the analysis in step S15 concludes that the user application is not connected to a healthy database node, the alarm platform is notified to alert the user.

Step S16: If the user application is not connected to a healthy database node, the alarm platform alerts the user.

Through steps S11 to S16, the present application can promptly alert the user when, after fault handling of the database node is completed, the user application is still not connected to a healthy database node.

In this application, as shown in FIG. 1, the database node, high-availability service, message service, data lake, high-level management service, and alarm platform can be deployed independently. In other implementations, some of these components can be deployed together; for example, the alarm platform can be integrated into the high-level management service, and the high-availability service can be integrated into the database node.

The present application provides an example implementation of the method shown in FIG. 1, as shown in FIG. 2.

In FIG. 2, the database node is used to deploy the database cloud service purchased by the user.

In FIG. 2, the high-availability service is responsible for managing database high availability, including database fault repair, master-slave switchover, and abnormal-state monitoring.

In FIG. 2, the high-level management service is responsible for the lifecycle management of the database service, including the creation, specification change, power-on and power-off, and deletion of database service nodes.

In FIG. 2, the data lake is a data system that stores the monitoring indicators of database nodes and supports efficient queries of indicator items.

In FIG. 2, the message service is responsible for delivering event notifications for the database management services.

In FIG. 2, the alarm platform is responsible for alerting users.

FIG. 2 shows an example flow implementing the method shown in FIG. 1.

The database node sends its usage indicators, such as CPU utilization, memory utilization, and the number of user connections, to the data lake in real time. These usage indicators can reflect whether the user application has restored its connection to a healthy database node.

When the high-availability service completes the fault recovery or master-slave switchover of a database node, it sends a health analysis event to the message service. Because the high-level management service has subscribed to this health analysis event at the message service, the message service actively forwards the event to the high-level management service. The event triggers the high-level management service to perform a health analysis of the database node, for example analyzing whether the user application has restored its connection to a healthy database node, or whether the user application is connected to and using the master database node. To perform the health analysis, the high-level management service obtains the usage indicators of the database node from the data lake and uses them to determine whether the user application has restored its connection to a healthy database node. If the analysis shows that the user application is not connected to a healthy database node, the alarm platform alerts the user. The user can then manually connect to a healthy database node and use it to support the business.
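The flow just described can be tied together in a small end-to-end sketch. Everything named here (`AlarmPlatform`, `ManagementService`, the event fields, and the "no connections after the event" rule) is a hypothetical simplification for illustration; the data lake is reduced to a dictionary and the message service to a list of subscriber callbacks.

```python
from typing import Any, Callable, Dict, List, Tuple

Sample = Tuple[float, int]  # (timestamp, user_connections)


class AlarmPlatform:
    """Collects alarms instead of notifying a real user (hypothetical)."""

    def __init__(self) -> None:
        self.alarms: List[str] = []

    def alarm(self, message: str) -> None:
        self.alarms.append(message)


class ManagementService:
    """Stand-in for the high-level management service (hypothetical)."""

    def __init__(self, lake: Dict[str, List[Sample]], alarms: AlarmPlatform) -> None:
        self._lake = lake
        self._alarms = alarms

    def on_health_analysis_event(self, event: Dict[str, Any]) -> None:
        node_id = str(event["healthy_node_id"])
        event_time = float(event["time"])
        # Obtain the usage indicators of the healthy node from the data lake.
        samples = self._lake.get(node_id, [])
        # Assumed rule: the application is connected if any sample after the
        # event shows at least one user connection.
        connected = any(ts >= event_time and conns > 0 for ts, conns in samples)
        if not connected:
            self._alarms.alarm(
                f"No user connections on healthy node {node_id}; please reconnect."
            )


def run_demo() -> List[str]:
    # The new master db-node-2 still reports zero connections after the event.
    lake = {"db-node-2": [(100.0, 0), (160.0, 0)]}
    alarms = AlarmPlatform()
    service = ManagementService(lake, alarms)
    # The message service is reduced to a list of subscriber callbacks.
    subscribers: List[Callable[[Dict[str, Any]], None]] = [service.on_health_analysis_event]
    # The high-availability service publishes the event after the switchover.
    event = {"healthy_node_id": "db-node-2", "time": 120.0}
    for deliver in subscribers:
        deliver(event)
    return alarms.alarms


print(run_demo())
```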

【System】

The present application also provides a cloud service alarm system, as shown in FIG. 1 or FIG. 2, including:

a database node, configured to send usage indicators of the database node to a data lake;

the data lake, configured to store the usage indicators of the database node;

a high-level management service, configured to receive a health analysis event of the database node when fault handling of the database node is completed, obtain the usage indicators of the database node from the data lake, and analyze the usage indicators to determine whether the user application has restored its connection to a healthy database node;

an alarm platform, configured to alert the user if the user application is not connected to a healthy database node.

In a possible implementation of the system, the high-level management service is configured to subscribe, at a message service, to the health analysis event of the database node;

the message service is configured to receive the health analysis event of the database node.

In a possible implementation of the system, completing the fault handling of the database node includes: the master database node and the slave database node completing a master-slave switchover.

In a possible implementation of the system, completing the fault handling of the database node includes: the database node completing fault recovery.

In a possible implementation of the system, the usage indicators of the database node include one or more of the following:

the processor utilization of the database node;

the memory utilization of the database node;

the number of user connections to the database node.

The database node, data lake, high-level management service, alarm platform, and message service can each be implemented in software or in hardware. The implementation of the high-level management service is described below as an example. The database node, data lake, alarm platform, and message service can be implemented in a similar way.

As an example of a software functional unit, the high-level management service may include code running on a computing instance. The computing instance may be at least one of a physical host (computing device), a virtual machine, or a container. Furthermore, there may be one or more such computing instances. For example, the high-level management service may include code running on multiple hosts, virtual machines, or containers. It should be noted that the multiple hosts, virtual machines, or containers used to run the code may be distributed in the same region or in different regions, and in the same availability zone (AZ) or in different AZs, where each AZ includes one data center or multiple geographically close data centers. Typically, one region includes multiple AZs.

Similarly, the multiple hosts, virtual machines, or containers used to run the code may be distributed in the same virtual private cloud (VPC) or in multiple VPCs. Typically, one VPC is set up within one region. For communication between two VPCs in the same region, and for cross-region communication between VPCs in different regions, a communication gateway must be set up in each VPC, and the VPCs are interconnected through the communication gateways.

As an example of a hardware functional unit, the high-level management service may include at least one computing device, such as a server. Alternatively, the high-level management service may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The multiple computing devices included in the high-level management service may be distributed in the same region or in different regions, and in the same AZ or in different AZs. Similarly, they may be distributed in the same VPC or in multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.

【Computing device】

The present application also provides a computing device 100. As shown in FIG. 3, the computing device 100 includes a bus 102, a processor 104, a memory 106, and a communication interface 108. The processor 104, the memory 106, and the communication interface 108 communicate with each other through the bus 102. The computing device 100 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors or memories in the computing device 100.

The bus 102 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of representation, only one line is used in FIG. 3, but this does not mean that there is only one bus or one type of bus. The bus 102 may include a path for transmitting information between the components of the computing device 100 (for example, the memory 106, the processor 104, and the communication interface 108).

The processor 104 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), a digital signal processor (DSP), or a similar processor.

The memory 106 may include a volatile memory, such as a random access memory (RAM). The memory 106 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

The memory 106 stores executable program code, and the processor 104 executes the executable program code to implement the functions of the aforementioned database node, data lake, high-level management service, alarm platform, and message service respectively, thereby implementing the cloud service alarm method. That is, the memory 106 stores instructions for executing the cloud service alarm method.

Alternatively, the memory 106 stores executable code, and the processor 104 executes the executable code to implement the functions of the aforementioned database node, data lake, high-level management service, alarm platform, and message service respectively, thereby implementing the cloud service alarm method. That is, the memory 106 stores instructions for executing the cloud service alarm method.

The communication interface 108 uses a transceiver module, such as but not limited to a network interface card or a transceiver, to implement communication between the computing device 100 and other devices or a communication network.

【Computing device cluster】

An embodiment of the present application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.

As shown in FIG. 4, the computing device cluster includes at least one computing device 100. The memory 106 in one or more computing devices 100 in the computing device cluster may store the same instructions for executing the cloud service alarm method.

In some possible implementations, the memories 106 of one or more computing devices 100 in the computing device cluster may each store part of the instructions for executing the cloud service alarm method. In other words, a combination of one or more computing devices 100 may jointly execute the instructions for executing the cloud service alarm method.

It should be noted that the memories 106 in different computing devices 100 in the computing device cluster may store different instructions, each used to execute part of the functions of the high-level management service. That is, the instructions stored in the memories 106 of different computing devices 100 may implement the functions of one or more of the database node, data lake, high-level management service, alarm platform, and message service.

In some possible implementations, one or more computing devices in the computing device cluster may be connected via a network. The network may be a wide area network, a local area network, or the like. FIG. 5 shows one possible implementation. As shown in FIG. 5, two computing devices 100A and 100B are connected via a network. Specifically, each computing device connects to the network through its communication interface. In this type of implementation, the memory 106 in the computing device 100A stores instructions for executing the functions of the high-level management service, while the memory 106 in the computing device 100B stores instructions for executing the functions of the database node, data lake, alarm platform, and message service.
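The division of functions between computing devices 100A and 100B described for FIG. 5 can be written down as a simple placement table. The dictionary `PLACEMENT`, the component identifiers, and the helper `device_for` are hypothetical names used only to illustrate the mapping.

```python
from typing import Dict, List

# Which computing device stores the instructions for which component (per FIG. 5).
PLACEMENT: Dict[str, List[str]] = {
    "100A": ["high_level_management_service"],
    "100B": ["database_node", "data_lake", "alarm_platform", "message_service"],
}


def device_for(component: str) -> str:
    """Return the computing device whose memory holds the component's instructions."""
    for device, components in PLACEMENT.items():
        if component in components:
            return device
    raise KeyError(component)


print(device_for("data_lake"))
```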

The connection arrangement of the computing device cluster shown in FIG. 5 may reflect the consideration that the cloud service alarm method provided in this application requires a large amount of storage; the functions implemented by the database node, the data lake, the alarm platform, and the message service are therefore assigned to computing device 100B.

It should be understood that the functions of computing device 100A shown in FIG. 5 may also be performed by multiple computing devices 100. Similarly, the functions of computing device 100B may also be performed by multiple computing devices 100.
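As an illustration only, the FIG. 5 arrangement can be pictured as a placement table that maps each computing device to the functions whose instructions its memory stores. The sketch below is a minimal Python model under that assumption; the names FUNCTIONS, PLACEMENT, devices_for, and unassigned are hypothetical and do not appear in the application.

```python
# Hypothetical names for the five functions named in the description.
FUNCTIONS = {
    "database_node",
    "data_lake",
    "high_level_management_service",
    "alarm_platform",
    "message_service",
}

# Assumed placement mirroring FIG. 5: device label -> functions stored in its memory.
PLACEMENT = {
    "100A": {"high_level_management_service"},
    "100B": {"database_node", "data_lake", "alarm_platform", "message_service"},
}


def devices_for(function, placement):
    """Return the sorted device labels whose memory holds instructions for `function`."""
    return sorted(dev for dev, funcs in placement.items() if function in funcs)


def unassigned(placement):
    """Return the functions that no device in the cluster implements."""
    covered = set()
    for funcs in placement.values():
        covered |= funcs
    return FUNCTIONS - covered
```

Splitting the functions of 100A or 100B across several devices, as the preceding paragraph allows, only means adding more keys to the table; a function may also appear under more than one device.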

An embodiment of the present application further provides another computing device cluster. For the connection relationship between the computing devices in this cluster, reference may similarly be made to the connection arrangements of the computing device clusters described with reference to FIG. 4 and FIG. 5. The difference is that the memories 106 of one or more computing devices 100 in this cluster may store the same instructions for executing the cloud service alarm method.

It should be noted that the memories 106 of different computing devices 100 in the computing device cluster may store different instructions for performing part of the functions of the cloud service alarm system. That is, the instructions stored in the memories 106 of different computing devices 100 may implement the functions of one or more apparatuses among the database node, the data lake, the high-level management service, the alarm platform, and the message service.

An embodiment of the present application further provides a computer program product containing instructions. The computer program product may be a software or program product that contains instructions and that can run on a computing device or be stored in any available medium. When the computer program product runs on at least one computing device, the at least one computing device is caused to execute the cloud service alarm method.

An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium may be any available medium that a computing device can use for storage, or a data storage device, such as a data center, that contains one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive). The computer-readable storage medium includes instructions that instruct a computing device to execute the cloud service alarm method.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the protection scope of the technical solutions of the embodiments of the present invention.

Claims

A cloud service alarm method, characterized in that the method comprises:

a database node sending usage indicators of the database node to a data lake, and the data lake storing the usage indicators of the database node;

when fault handling of the database node is completed, a high-level management service receiving a health analysis event of the database node;

upon receiving the health analysis event, the high-level management service obtaining the usage indicators of the database node from the data lake;

the high-level management service analyzing the usage indicators to determine whether a user application has restored its connection to a healthy database node; and

if the user application is not connected to the healthy database node, an alarm platform issuing an alarm to a user.
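For illustration only, the flow recited above can be sketched in Python as follows. This is a minimal in-memory model, not the applicant's implementation. All class and method names are hypothetical, and the analysis rule (a healthy node with zero user connections means the user application has not reconnected) is an assumption made for the example; the claim itself does not fix how the usage indicators are analyzed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class UsageIndicator:
    """One usage sample reported by a database node (hypothetical shape)."""
    node_id: str
    processor_utilization: float
    memory_utilization: float
    user_connections: int


class DataLake:
    """Stores the usage indicators sent by database nodes."""

    def __init__(self) -> None:
        self._records: Dict[str, List[UsageIndicator]] = {}

    def store(self, indicator: UsageIndicator) -> None:
        self._records.setdefault(indicator.node_id, []).append(indicator)

    def latest(self, node_id: str) -> Optional[UsageIndicator]:
        records = self._records.get(node_id, [])
        return records[-1] if records else None


class AlarmPlatform:
    """Collects alarms; a real platform would notify the user."""

    def __init__(self) -> None:
        self.alarms: List[str] = []

    def alarm(self, message: str) -> None:
        self.alarms.append(message)


class HighLevelManagementService:
    def __init__(self, data_lake: DataLake, alarm_platform: AlarmPlatform) -> None:
        self.data_lake = data_lake
        self.alarm_platform = alarm_platform

    def on_health_analysis_event(self, healthy_node_id: str) -> bool:
        """Handle a health analysis event; return True if connections look restored."""
        indicator = self.data_lake.latest(healthy_node_id)
        # Assumed rule: at least one user connection means the application is back.
        restored = indicator is not None and indicator.user_connections > 0
        if not restored:
            self.alarm_platform.alarm(
                f"user application has not reconnected to healthy node {healthy_node_id}"
            )
        return restored
```

In this sketch the database node's role is played by whoever calls DataLake.store, and the event that fault handling has completed is delivered by calling on_health_analysis_event directly.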

The method according to claim 1, characterized in that the method further comprises:

the high-level management service subscribing to the health analysis event of the database node in a message service;

the message service receiving the health analysis event of the database node.
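The subscription relationship recited above resembles a publish/subscribe pattern. The following is a minimal in-memory Python sketch under that assumption; MessageService, subscribe, publish, and the topic name used in the usage note are hypothetical and are not taken from the application.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class MessageService:
    """Minimal in-memory publish/subscribe broker (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        """Register a callback to be invoked for every event published on `topic`."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, event: Any) -> int:
        """Deliver `event` to all subscribers of `topic`; return how many were notified."""
        callbacks = list(self._subscribers[topic])
        for callback in callbacks:
            callback(event)
        return len(callbacks)
```

A management service could, for example, call subscribe("health_analysis", handler) once at startup; when the message service later receives a health analysis event for a database node, publish("health_analysis", event) would pass it to that handler.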

The method according to claim 2, characterized in that

completing the fault handling of the database node includes: a master database node and a slave database node completing a master-slave switchover.

The method according to any one of claims 1 to 3, characterized in that

completing the fault handling of the database node includes: the database node completing fault recovery.

The method according to any one of claims 1 to 4, characterized in that the usage indicators of the database node include one or more of the following:

the processor utilization of the database node;

the memory utilization of the database node;

the number of user connections to the database node.

A cloud service alarm system, characterized in that the system comprises:

a database node, configured to send usage indicators of the database node to a data lake;

the data lake, configured to store the usage indicators of the database node;

a high-level management service, configured to receive a health analysis event of the database node when fault handling of the database node is completed, obtain the usage indicators of the database node from the data lake, and analyze the usage indicators to determine whether a user application has restored its connection to a healthy database node;

an alarm platform, configured to issue an alarm to a user if the user application is not connected to the healthy database node.

The system according to claim 6, characterized in that

the high-level management service is configured to subscribe to the health analysis event of the database node in a message service;

the message service is configured to receive the health analysis event of the database node.

The system according to claim 6 or 7, characterized in that completing the fault handling of the database node includes: a master database node and a slave database node completing a master-slave switchover.

The system according to any one of claims 6 to 8, characterized in that completing the fault handling of the database node includes: the database node completing fault recovery.

The system according to any one of claims 6 to 9, characterized in that the usage indicators of the database node include one or more of the following:

the processor utilization of the database node;

the memory utilization of the database node;

the number of user connections to the database node.

A computing device cluster, characterized in that it includes at least one computing device, each computing device including a processor and a memory;

The processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster executes the method according to any one of claims 1 to 5.

A computer program product comprising instructions, characterized in that when the instructions are executed by a computing device cluster, the computing device cluster executes the method according to any one of claims 1 to 5.

A computer-readable storage medium, characterized in that it includes computer program instructions, and when the computer program instructions are executed by a computing device cluster, the computing device cluster executes the method according to any one of claims 1 to 5.