CN109067597A

CN109067597A - Dynamic intelligent service governance method for a distributed system

Info

Publication number: CN109067597A
Application number: CN201811110086.8A
Authority: CN
Inventors: 袁海; 范渊
Original assignee: Hangzhou Dbappsecurity Technology Co Ltd
Current assignee: Hangzhou Dbappsecurity Technology Co Ltd
Priority date: 2018-09-21
Filing date: 2018-09-21
Publication date: 2018-12-21

Abstract

The present invention relates to governance techniques for distributed systems and aims to provide a dynamic intelligent service governance method for a distributed system. The method comprises: configuring a sidecar process for each service application in the distributed system; routing each request through the sidecar process first, so that the sidecar process acts as a proxy for the service application; having the sidecar process record request information for the monitoring and scheduling server and act as that server's collection agent; and having the monitoring and scheduling server fit the data accumulated during operation, calculate the optimal scheduling strategy under the current load, and then issue scheduling instructions. The invention can reduce the complexity of the business system and the functional redundancy in the distributed system; improve system availability and load capacity and reduce response time; provide intelligent monitoring and alerting so that alarms are genuinely meaningful and valuable; facilitate troubleshooting of online problems; and provide a basis for decisions on dynamic capacity expansion.

Description

Distributed system dynamic intelligent service management method

Technical Field

The invention relates to governance technology for distributed systems, and in particular to a dynamic intelligent service governance method for a distributed system.

Background

Operation and maintenance of distributed systems often relies on link monitoring. The response time of the whole call chain is obtained by monitoring the response time, latency, and similar measures of each service, and the load capacity of the system is obtained through stress testing and similar means. While the system is running, an alarm is raised and rate limiting is applied once indicators such as QPS exceed a certain value. Rate limits for IT systems are usually set through stress testing and empirical judgment. Dynamic rate limiting can apply different limits according to the state of each subsystem, which safeguards the availability and stability of the whole system.

As services are split, distributed systems run into problems with service governance, monitoring, and reduced availability. Current schemes often cannot give intelligent early warning, the rate-limiting threshold cannot be set dynamically according to the state of the system, and the failure or timeout of one service often causes cascading failures or timeouts, which can ultimately be catastrophic for the whole system.

Distributed systems solve many problems in software development but also bring many difficulties, for which the industry has some solutions. One solution for link monitoring is described in Google's Dapper paper; Prometheus and similar tools are representative of the monitoring field. These solutions solve some real-world problems but are far from adequate, for example:

1. They are invasive to the business system: existing link tracing systems require modifying existing service code for custom development, and it is difficult to make legacy systems that are already running compatible.

2. They are difficult to adapt to heterogeneous systems: if the system uses different operating systems and different programming languages, existing schemes can hardly cover all of them.

3. Alarms and intervention cannot be carried out dynamically and intelligently: existing alarm systems often set a fixed threshold and raise an alarm once it is exceeded. No negative feedback mechanism is introduced. Because the relationship between system load and system performance is not linear, intervention based on a fixed threshold cannot keep the system operating optimally.

Disclosure of Invention

The invention aims to solve the technical problem of overcoming the defects in the prior art and provides a distributed system dynamic intelligent service management method.

In order to solve the technical problems, the invention adopts the following solution:

The dynamic intelligent service governance method for a distributed system comprises the following steps:

(1) configuring a sidecar process, i.e., an independent process capable of running in sidecar mode, for each service application in the distributed system; the sidecar process runs alongside the service application, provides additional capabilities for it, and can manage and schedule the service application according to a preset scheme;

(2) each request is processed by the sidecar process before it reaches the service application, and the sidecar process acts as a proxy for the service application, thereby monitoring the request link and the service operating environment and applying rate limiting to the service;

(3) the sidecar process records the start time and end time of the request and synchronizes the request ID to the monitoring and scheduling server; meanwhile, the sidecar process also serves as the collection agent of the monitoring and scheduling server and collects the monitoring indicators of the service operating environment in real time at a set interval;

(4) after receiving the new load data and performance indicators collected by the sidecar process, the monitoring and scheduling server performs fitting using the data accumulated during operation, calculates the optimal scheduling strategy under the current load, and then issues scheduling instructions.
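Steps (2) and (3) can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Sidecar`, `MonitorServer`, and `RequestRecord` are hypothetical, the "service application" is reduced to a callable that returns a status code, and reporting is an in-memory call rather than a network transfer.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RequestRecord:
    request_id: str
    start: float
    end: float
    status: int


@dataclass
class MonitorServer:
    """Hypothetical stand-in for the monitoring and scheduling server."""
    records: List[RequestRecord] = field(default_factory=list)

    def report(self, record: RequestRecord) -> None:
        self.records.append(record)


class Sidecar:
    """Proxies requests to a service handler and reports timing data."""

    def __init__(self, handler: Callable[[str], int], monitor: MonitorServer):
        self.handler = handler
        self.monitor = monitor

    def handle(self, payload: str) -> int:
        request_id = uuid.uuid4().hex
        start = time.monotonic()
        try:
            status = self.handler(payload)  # forward to the real service
        except Exception:
            status = 500  # assumption: an unhandled failure counts as a server error
        end = time.monotonic()
        self.monitor.report(RequestRecord(request_id, start, end, status))
        return status
```

Because the handler is injected, the same sidecar class can wrap any service, which mirrors the idea that sidecars differ only in configuration.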

In the invention, the monitoring indicators of the service operating environment in step (3) include: CPU utilization, memory utilization, disk IO, network IO, and the return status codes of requests.

In the invention, the sidecar processes configured for different service applications in the distributed system are generic and differ only in their configuration parameters.

In the invention, the proxy service of the sidecar process is implemented in one of the following ways:

(1) following the traditional proxy server principle, the real address of the service application is replaced by the address of the sidecar process during service registration and discovery; or,

(2) implemented in conjunction with an SDN controller.

In the invention, the monitoring and scheduling server serves as a monitoring and alarm center: it calculates the load and performance of the whole system from the received load data and performance indicators and raises alarms according to set conditions; meanwhile, it continuously learns from and fits historical data to obtain the typical load state of the system, and then dynamically adjusts the alarm threshold for different periods of system operation.

Compared with the prior art, the invention has the technical effects that:

1. The complexity of the business system is reduced: the business system only needs to concern itself with the actual business, not with the management and scheduling of services.

2. Functional redundancy in the distributed system is reduced. In a distributed system, the operating environment of every service requires functions such as rate limiting, monitoring, and circuit breaking. A generic sidecar reduces duplicated development.

3. System availability and load capacity are improved, and response time is reduced. Through the negative feedback principle and the fitting of historical data, the system is kept in its optimal operating state at all times.

4. Intelligent monitoring and alerting make alarms genuinely meaningful and valuable.

5. Troubleshooting of online problems is made easier. Link tracing and monitoring let troubleshooting personnel locate the point of failure quickly and clearly.

6. A basis is provided for decisions on dynamic capacity expansion of the system.

Drawings

FIG. 1 is a deployment diagram of the present invention (a sidecar process and a service application are deployed on the same machine, virtual machine, or container).

FIG. 2 is a schematic diagram of the workflow of the sidecar process.

Detailed Description

First, the technical terms or concepts related to the present invention are explained as follows:

1. Principle of negative feedback

The principle of negative feedback is a basic concept of control theory. Feedback can be divided simply into positive feedback and negative feedback. In negative feedback, the system output is returned to the input in some way and acts in opposition to the output, so that the error between the system output and the target is reduced and the system tends toward stability. In the system described in the invention, the parameters for rate limiting and for monitoring alarms are continuously optimized based on the principle of negative feedback.
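As a minimal sketch of how negative feedback could drive a limit, the function below performs one proportional correction step. The choice of latency as the controlled output, the proportional rule, and the names `adjust_limit`, `gain`, `min_limit`, and `max_limit` are all assumptions made for illustration; the source does not specify a control law.

```python
def adjust_limit(current_limit: float, observed_latency_ms: float,
                 target_latency_ms: float, gain: float = 0.5,
                 min_limit: float = 1.0, max_limit: float = 1000.0) -> float:
    """One negative-feedback step: move the limit against the latency error."""
    # Relative error is positive when the service is slower than the target.
    error = (observed_latency_ms - target_latency_ms) / target_latency_ms
    # Negative feedback: a positive error lowers the limit, a negative one raises it.
    new_limit = current_limit * (1.0 - gain * error)
    # Clamp so that a single noisy sample cannot drive the limit to an extreme.
    return max(min_limit, min(max_limit, new_limit))
```

With a limit of 100 and a target of 200 ms, an observed latency of 300 ms lowers the limit to 75, while an observed latency of 100 ms raises it to 125.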

2. Sidecar mode

Sidecar mode, also known as the sidecar pattern, refers to an independent process that runs alongside the service application, provides additional capabilities to the application, and enables a degree of management and scheduling of the application.

3. System load

The system load is a measure of the current pressure on the system and can be quantified by the following indicators:

a) QPS (queries per second): generally, the number of requests processed per unit time.

b) TPS (transactions per second): the number of complete transactions processed per unit time. A transaction can be understood simply as a set of one or more operation steps performed by a user.

c) Number of concurrent users: the total number of users sending requests to the server at the same time.

d) Disk IO, CPU usage, memory usage, etc.
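To make the QPS indicator concrete, here is a small sliding-window counter. It is an illustrative sketch only: the class name `QpsCounter` and the window length are hypothetical, and it assumes timestamps are recorded in non-decreasing order.

```python
from collections import deque


class QpsCounter:
    """Counts requests seen during the most recent `window_s` seconds."""

    def __init__(self, window_s: float = 1.0):
        self.window_s = window_s
        self._arrivals = deque()

    def record(self, timestamp: float) -> None:
        # Assumes timestamps arrive in non-decreasing order.
        self._arrivals.append(timestamp)

    def qps(self, now: float) -> float:
        # Drop arrivals that have fallen out of the window.
        while self._arrivals and self._arrivals[0] <= now - self.window_s:
            self._arrivals.popleft()
        return len(self._arrivals) / self.window_s
```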

4. System performance

System performance reflects the processing capability of the system under the current load. The main indicators are:

a) User request latency

b) Server processing time

c) Request timeout rate and error rate
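The performance indicators above could be derived from per-request samples roughly as follows. This is a sketch under stated assumptions: the `Sample` type, the 1000 ms timeout cutoff, and the rule that status codes of 500 and above count as errors are illustrative choices, not taken from the source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    latency_ms: float
    status: int  # HTTP-style status code returned for the request


def summarize(samples: List[Sample], timeout_ms: float = 1000.0) -> Dict[str, float]:
    """Compute mean latency, timeout rate, and error rate for a batch."""
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    timeouts = sum(1 for s in samples if s.latency_ms > timeout_ms)
    errors = sum(1 for s in samples if s.status >= 500)  # assumed error rule
    return {
        "mean_latency_ms": sum(s.latency_ms for s in samples) / n,
        "timeout_rate": timeouts / n,
        "error_rate": errors / n,
    }
```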

5. Proxy mode

The proxy pattern provides a proxy object for a target, and access to the target is controlled through the proxy object. In this system, a sidecar is used to proxy calls to specific services.

Link monitoring: a single request relies on multiple services, and when a request ultimately fails it is often difficult to troubleshoot. If the call link is traced and monitored, it is easy to find the step at which the request failed.

Intelligent early warning: in a traditional IT operation and maintenance system, a threshold is generally set, and a system alarm is raised once it is exceeded, for example when CPU load exceeds 80%. This threshold has to be set by operations staff based on experience, which is a very coarse-grained judgment. Intelligent early warning can assess the condition of the system differently at different times and for different events. If CPU load exceeds 50% while the system is idle, this often indicates an unusual event, and an early warning is needed at that point.

Dynamic rate limiting: a distributed, intelligent, and dynamic rate-limiting method. In a distributed system, the traffic routed to each service is limited to some extent to ensure that the service operates efficiently. In addition, the limits are adjusted dynamically by combining statistical data with the negative feedback principle.
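One common way to enforce a limit that can change at run time is a token bucket whose refill rate is adjustable. The sketch below swaps in that well-known technique for illustration; the source does not say which limiting algorithm the sidecar uses, and the names `TokenBucket`, `set_rate`, and `allow` are hypothetical. Time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Token-bucket limiter whose refill rate can be changed while running."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = 0.0             # time of the previous call to allow()

    def set_rate(self, rate: float) -> None:
        """Apply a new limit, e.g. one issued by a scheduler."""
        self.rate = rate

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the previous call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Calling `set_rate` is the hook through which a feedback loop or a scheduling instruction would tighten or relax the limit.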

Circuit-breaker protection: in a distributed microservice system, a function often depends on the coordination of multiple services, and a simple page request may invoke N services in the microservice system.
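A minimal sketch of this kind of protection is the widely used circuit-breaker pattern, shown below. The thresholds, the half-open trial behavior, and the names `CircuitBreaker`, `failure_threshold`, and `reset_after_s` are assumptions for illustration; the source leaves the policy to a predetermined configuration.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T], now: float) -> T:
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            # Half-open: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now
            return fallback()
        self.failures = 0  # success closes the circuit fully
        return result
```

While the circuit is open the slow or failing service is not called at all, which is what prevents one service's timeouts from cascading along the call chain.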

The following detailed description of embodiments of the invention refers to the accompanying drawings.

The invention relates to a distributed system dynamic intelligent service management method, which specifically comprises the following steps:

In a distributed system, the sidecar processes configured for different service applications are generic and differ only in their configuration parameters. This full reuse means that nothing else needs extra attention when business code is developed, which reduces the complexity and coupling of the business system.

The proxy service of the sidecar process can be implemented in the following ways: (1) following the traditional proxy server principle, the real address of the service application is replaced by the address of the sidecar process during service registration and discovery; or (2) in conjunction with the controller of an SDN (Software Defined Network).
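Option (1) can be pictured with a toy in-memory registry, sketched below. The `Registry` class and the example addresses are hypothetical; a real deployment would use an actual service registry, which the source does not name.

```python
from typing import Dict, Optional


class Registry:
    """Toy service registry that publishes the sidecar address when one exists."""

    def __init__(self) -> None:
        self._addresses: Dict[str, str] = {}

    def register(self, name: str, service_addr: str,
                 sidecar_addr: Optional[str] = None) -> None:
        # The real address is replaced by the sidecar address at registration,
        # so callers that discover the service reach the sidecar first.
        self._addresses[name] = sidecar_addr or service_addr

    def discover(self, name: str) -> str:
        return self._addresses[name]
```

Callers are unchanged: they still look the service up by name, which is why this approach does not require modifying business code.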

(3) The sidecar process records the start time and end time of the request and synchronizes the request ID to the monitoring and scheduling server; meanwhile, the sidecar process also serves as the collection agent of the monitoring and scheduling server and collects the monitoring indicators of the service operating environment in real time at a set interval. The monitoring indicators of the service operating environment include (but are not limited to): CPU utilization, memory utilization, disk IO, network IO, and the return status codes of requests.

(4) After receiving the new load data and performance indicators collected by the sidecar process, the monitoring and scheduling server performs fitting using the data accumulated during operation (the longer the system has been running, the better the fit), calculates the optimal scheduling strategy under the current load, and then issues scheduling instructions.
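The source does not specify the fitting method, so the sketch below substitutes a deliberately simple stand-in: it groups historical samples into QPS buckets and returns the highest bucket whose mean response time stays within a target. The objective, the bucketing, and the names `highest_safe_qps`, `target_ms`, and `bucket` are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, Optional, Tuple


def highest_safe_qps(history: Iterable[Tuple[float, float]],
                     target_ms: float, bucket: int = 10) -> Optional[int]:
    """Return the largest QPS bucket whose mean response time meets the target.

    `history` holds (observed_qps, overall_response_ms) pairs accumulated
    during operation. Returns None if no bucket meets the target.
    """
    by_bucket = defaultdict(list)
    for qps, response_ms in history:
        # Group samples by QPS rounded to the nearest multiple of `bucket`.
        by_bucket[round(qps / bucket) * bucket].append(response_ms)
    safe = [b for b, values in by_bucket.items() if mean(values) <= target_ms]
    return max(safe) if safe else None
```

More accumulated samples per bucket make the per-bucket means more reliable, which matches the remark that the fit improves the longer the system has been running.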

A simple example is:

1. The system contains service applications A, B, and C.

2. The current QPS of service applications A, B, and C is 100, 10, and 50 respectively.

3. Fitting of historical data shows that overall system response performance is best when the QPS of service A is limited to 80.

4. The scheduling service sends the computed result to the sidecar.

5. The sidecar throttles service A to a QPS of 80 and either puts the excess traffic into a message queue to smooth the QPS or, if permitted, degrades or rate-limits service A.

6. If service D is a non-critical service in the call chain A -> B -> C -> D and its response time is long, the scheduling service decides whether to trip the circuit breaker according to a predetermined policy. The call chain then becomes A -> B -> C. The work of service D is placed in an asynchronous queue or compensated later by a scheduled task or some other additional mechanism.
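The numbers in this example can be replayed with a short sketch. The helpers `split_traffic` and `effective_chain` are hypothetical names, and one second of traffic is modeled as a plain list of request IDs.

```python
from collections import deque
from typing import Deque, List, Set, Tuple


def split_traffic(requests: List[int], limit: int) -> Tuple[List[int], Deque[int]]:
    """Pass at most `limit` requests through and queue the rest for later."""
    passed = requests[:limit]
    queued = deque(requests[limit:])  # stand-in for a message queue
    return passed, queued


def effective_chain(chain: List[str], open_circuits: Set[str]) -> List[str]:
    """Drop services whose circuit is open from the synchronous call chain."""
    return [service for service in chain if service not in open_circuits]
```

With 100 incoming requests for service A and a limit of 80, 80 requests pass and 20 are queued; with the circuit for D open, the chain A, B, C, D is reduced to A, B, C.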

In the invention, the monitoring and scheduling server also acts as a monitoring and alarm center. Conventional monitoring alarms often check whether an indicator exceeds a fixed threshold and often target a single service. In the invention, the monitoring and scheduling server not only holds the load and performance data of individual services, but can also calculate the load and performance of the whole system from the received load data and performance indicators and raise alarms according to set conditions; meanwhile, it continuously learns from and fits historical data to obtain the typical load state of the system, and then dynamically adjusts the alarm threshold for different periods of system operation. For example, if a government system is in an off-peak period after 8:00 p.m. and in a peak period during certain hours of the day, the alarm threshold is adjusted dynamically for each period.
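A period-dependent threshold could be learned from history along the following lines. The rule used here (mean plus k standard deviations per hour of day), the fallback default of 80, and the names `learn_thresholds` and `should_alarm` are assumptions chosen for illustration; the source does not state how the threshold is computed.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple


def learn_thresholds(history: Iterable[Tuple[int, float]],
                     k: float = 3.0) -> Dict[int, float]:
    """Map each hour of day to an alarm threshold of mean + k * std."""
    by_hour = defaultdict(list)
    for hour, cpu_percent in history:
        by_hour[hour].append(cpu_percent)
    return {hour: mean(values) + k * pstdev(values)
            for hour, values in by_hour.items()}


def should_alarm(thresholds: Dict[int, float], hour: int,
                 cpu_percent: float, default: float = 80.0) -> bool:
    """Alarm when the reading exceeds the learned threshold for that hour."""
    # Hours with no history fall back to a fixed default threshold.
    return cpu_percent > thresholds.get(hour, default)
```

With this rule, a CPU reading of 50% can trigger an alarm during a normally idle hour while passing unnoticed during a busy one.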

Claims

1. A distributed system dynamic intelligent service governance method is characterized by comprising the following steps:

2. The method of claim 1, wherein the monitoring indicators of the service operating environment in step (3) comprise: CPU utilization, memory utilization, disk IO, network IO, and the return status codes of requests.

3. The method of claim 1, wherein the sidecar processes configured by different service applications are common in the distributed system and differ only in configuration parameters.

4. The method of claim 1, wherein the proxy service of the sidecar process is implemented by:

(2) implemented in conjunction with a controller of an SDN.

5. The method of claim 1, wherein the monitoring and scheduling server acts as a monitoring and alarm center: the monitoring and scheduling server calculates the load and performance of the whole system from the received load data and performance indicators and raises alarms according to set conditions; meanwhile, the monitoring and scheduling server continuously learns from and fits historical data to obtain the typical load state of the system, and then dynamically adjusts the alarm threshold for different periods of system operation.