CN109067597A - A kind of distributed system dynamic and intelligent service administering method - Google Patents

A kind of distributed system dynamic and intelligent service administering method

Info

Publication number
CN109067597A
CN109067597A CN201811110086.8A CN201811110086A CN109067597A CN 109067597 A CN109067597 A CN 109067597A CN 201811110086 A CN201811110086 A CN 201811110086A CN 109067597 A CN109067597 A CN 109067597A
Authority
CN
China
Prior art keywords
monitoring
service
sidecar
request
distributed system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811110086.8A
Other languages
Chinese (zh)
Inventor
袁海
范渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dbappsecurity Technology Co Ltd
Original Assignee
Hangzhou Dbappsecurity Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dbappsecurity Technology Co Ltd
Priority to CN201811110086.8A
Publication of CN109067597A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04: Network management architectures or arrangements
    • H04L41/042: Network management architectures or arrangements comprising distributed management centres cooperatively managing the network
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0631: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multi Processors (AREA)

Abstract

The present invention relates to governance technology for distributed systems and aims to provide a dynamic intelligent service governance method for distributed systems. The method comprises: configuring a sidecar process for each service application in the distributed system; having each request first pass through the sidecar process, which acts as a proxy for the service application; having the sidecar process record request information for the monitoring and scheduling server and act as its collection agent; and having the monitoring and scheduling server fit the data accumulated during operation, calculate the optimal scheduling strategy under the current load, and then issue scheduling instructions. The invention can reduce the complexity of the business system and reduce functional redundancy in the distributed system; improve system availability and load capacity and reduce response time; provide intelligent monitoring and alarming so that alarms are genuinely meaningful and valuable; facilitate troubleshooting of production problems; and provide a basis for deciding on dynamic capacity expansion of the system.

Description

Distributed system dynamic intelligent service governance method
Technical Field
The invention relates to governance technology for distributed systems, and in particular to a dynamic intelligent service governance method for distributed systems.
Background
Operation and maintenance of distributed systems often relies on link (call-chain) monitoring. By monitoring the response time, latency, and similar metrics of each service, the response time of the whole call chain is obtained, and the load capacity of the system is obtained through stress testing and similar means. At run time, when an indicator such as QPS exceeds a certain value, an alarm is raised and traffic is rate-limited. Rate limits for IT systems are usually set through stress testing combined with empirical judgment. Dynamic rate limiting can differentiate according to the state of each subsystem, safeguarding the availability and stability of the whole system.
As services are split apart, distributed systems run into problems with service governance, monitoring, and reduced availability. Existing schemes often cannot give intelligent early warning and cannot set the rate-limit threshold dynamically according to the state of the system, and the failure or timeout of one service often causes cascading failures or timeouts that ultimately have catastrophic consequences for the whole system.
Distributed systems solve many problems in software development but also bring many difficulties, for some of which the industry has solutions. For link monitoring, one solution is described in Google's Dapper paper; in the monitoring field, Prometheus and similar tools are representative. These solutions solve some real-world problems but are far from adequate, for example:
1. They are intrusive to the business system: existing link-tracing systems require existing service code to be modified for custom development, and legacy systems already in operation are hard to make compatible.
2. They adapt poorly to heterogeneous systems: if the system uses different operating systems and different programming languages, existing schemes can hardly cover it completely.
3. Alarming and intervention cannot be carried out dynamically and intelligently: existing alarm systems usually set a fixed threshold and raise an alarm once it is exceeded. No negative feedback mechanism is introduced. Because the relationship between system load and system performance is not linear, intervention based on a fixed threshold cannot keep the system operating optimally.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the deficiencies of the prior art by providing a dynamic intelligent service governance method for distributed systems.
In order to solve this technical problem, the invention adopts the following solution:
The dynamic intelligent service governance method for a distributed system comprises the following steps:
(1) configuring a sidecar process, i.e., an independent process operating in the sidecar pattern, for each service application in the distributed system; the sidecar process runs alongside the service application, provides additional capabilities to it, and can manage and schedule the service application according to a preset scheme;
(2) each request is processed by the sidecar process before it reaches the service application, and the sidecar process acts as a proxy of the service application, thereby monitoring the request link and the service operating environment and rate-limiting the service;
(3) the sidecar process records the start time and end time of each request and synchronizes the request ID to the monitoring and scheduling server; meanwhile, the sidecar process also serves as a collection agent of the monitoring and scheduling server and collects the monitoring indicators of the service operating environment in real time at a set interval;
(4) after receiving new load data and performance indicators collected by the sidecar process, the monitoring and scheduling server fits the data accumulated during operation, calculates the optimal scheduling strategy under the current load, and then issues a scheduling instruction.
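Steps (2) and (3) can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the names `Sidecar`, `RequestRecord`, and `demo_service` are hypothetical, the in-memory `reports` list stands in for the monitoring and scheduling server, and the sidecar is modelled as an in-process wrapper for brevity, whereas the method describes it as an independent process.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RequestRecord:
    """What the sidecar reports for one request (hypothetical shape)."""
    request_id: str
    start: float
    end: float
    status: int


@dataclass
class Sidecar:
    """In-process stand-in for a sidecar that proxies one service."""
    service: Callable[[str], int]  # the wrapped service; returns a status code
    # Stand-in for the monitoring and scheduling server (assumption).
    reports: List[RequestRecord] = field(default_factory=list)

    def handle(self, payload: str) -> int:
        request_id = uuid.uuid4().hex      # ID synchronized to the server
        start = time.monotonic()           # start time of the request
        try:
            status = self.service(payload)
        except Exception:
            status = 500                   # treat any failure as a server error
        end = time.monotonic()             # end time of the request
        self.reports.append(RequestRecord(request_id, start, end, status))
        return status


def demo_service(payload: str) -> int:
    """Toy service: accepts any non-empty payload."""
    return 200 if payload else 400
```

Because the service itself is untouched, the same wrapper could in principle front services written in any language, which is the non-intrusive property the method is after.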
In the invention, the monitoring indicators of the service operating environment in step (3) include: CPU utilization, memory utilization, disk I/O, network I/O, and the return status codes of requests.
In the invention, the sidecar processes configured for different service applications in the distributed system are generic and differ only in their configuration parameters.
In the invention, the proxy service of the sidecar process is implemented in one of the following ways:
(1) following the traditional proxy server principle, the real address of the service application is replaced by the address of the sidecar process during service registration and discovery; or,
(2) implementation in combination with an SDN controller.
In the invention, the monitoring and scheduling server serves as a monitoring and alarm center: it calculates the load and performance of the whole system from the received load data and performance indicators and raises alarms according to set conditions; meanwhile, it continuously learns from and fits historical data to obtain the typical load states of the system, and then dynamically adjusts the alarm threshold for different periods of system operation.
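Option (1) can be sketched as follows. This is an assumption-laden illustration rather than the patented implementation: the `Registry` class, its method names, and the example addresses are all hypothetical. The point it shows is only that discovery returns the sidecar's address, while the real address is kept as the sidecar's upstream.

```python
from typing import Dict


class Registry:
    """Toy service registry (hypothetical) used to illustrate address replacement."""

    def __init__(self) -> None:
        self._published: Dict[str, str] = {}

    def register(self, name: str, real_address: str, sidecar_address: str) -> Dict[str, str]:
        # Publish the sidecar's address in place of the service's real address.
        self._published[name] = sidecar_address
        # The real address is handed only to the sidecar as its upstream target.
        return {"upstream": real_address}

    def discover(self, name: str) -> str:
        # Callers always receive the sidecar's address.
        return self._published[name]


registry = Registry()
sidecar_config = registry.register("orders", "10.0.0.5:8080", "10.0.0.5:15001")
```

Callers that resolve `orders` through the registry therefore reach the sidecar first, without any change to their own code.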
Compared with the prior art, the invention has the technical effects that:
1. The complexity of the business system is reduced: the business system only needs to concern itself with the actual business and not with the management and scheduling of services.
2. Functional redundancy in the distributed system is reduced. In a distributed system, the operating environment of every service needs capabilities such as rate limiting, monitoring, and circuit breaking. A generic sidecar reduces repeated development.
3. The availability of the system is improved, its load capacity is increased, and response time is reduced. Through the negative feedback principle and the fitting of historical data, the system is kept in an optimal operating state at all times.
4. Intelligent monitoring and alarming make alarms genuinely meaningful and valuable.
5. Troubleshooting of production problems is made easier. Link tracing and monitoring let troubleshooters locate the fault point quickly and clearly.
6. A basis is provided for deciding on dynamic capacity expansion of the system.
Drawings
FIG. 1 is a deployment diagram of the present invention (a sidecar process and a service application are deployed on the same machine, virtual machine, or container).
FIG. 2 is a schematic diagram of the workflow of the sidecar process.
Detailed Description
First, the technical terms or concepts related to the present invention are explained as follows:
1. principle of negative feedback
The negative feedback principle is a basic concept of control theory. Feedback can be broadly divided into positive feedback and negative feedback. In negative feedback, the system output is returned to the input in some way and acts in opposition to the output, so that the error between the system's output and the target is reduced and the system tends toward stability. In the system described in the present invention, the parameters for rate limiting and for monitoring alarms are continuously optimized based on the negative feedback principle.
2. Sidecar pattern
In the sidecar pattern, an independent process runs alongside the service application, provides additional capabilities to the application, and can perform a certain degree of management and scheduling of the application.
3. System load
The system load is a measure of the current pressure of the system, and can be measured by the following quantitative indexes:
a) QPS (queries per second) generally refers to the number of requests processed per unit time.
b) TPS (transactions per second) refers to the number of complete transactions processed per unit time. A transaction can be understood simply as a set of one or several operation steps performed by a user.
c) The number of concurrent users refers to the total number of users sending requests to the server at the same time.
d) Disk I/O, CPU usage, memory usage, etc.
4. System performance
The system performance reflects the system processing capability under the current load, and mainly has the following indexes:
a) user request latency
b) Server processing time
c) Request timeout rate and error rate
5. Proxy pattern
The proxy pattern provides a proxy object for a target object, and access to the target is controlled by the proxy object. In this system, a sidecar is used to proxy calls to specific services.
Link monitoring: a single request depends on multiple services, and when a request ultimately fails it is often difficult to troubleshoot. If the call link is traced and monitored, it is easy to find out at which stage the request went wrong.
Intelligent early warning: in a traditional IT operation and maintenance system, a threshold is generally set, and a system alarm is raised once it is exceeded, for example when CPU load exceeds 80%. This threshold has to be set empirically by operations staff, which makes it a very coarse-grained criterion. Intelligent early warning can distinguish system conditions according to different times and different events. For example, if CPU load exceeds 50% while the system is normally idle, an unusual event has often occurred, and an early warning is needed at that moment.
Dynamic rate limiting: a distributed, intelligent, and dynamic rate-limiting method. In a distributed system, the traffic distributed to each service is limited to a certain extent to ensure that the service operates efficiently. In addition, the limit is adjusted dynamically by combining statistical data with the negative feedback principle.
Circuit-breaker protection: in a distributed microservice system, a function often depends on the cooperation of multiple services, and a simple page request may invoke N services in the microservice system.
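As a rough illustration of how the negative feedback principle might drive a dynamic limit on request traffic, the sketch below uses a simple proportional controller. The text does not specify any control law, so the function name `adjust_limit`, the use of latency as the fed-back output, and the `gain` and clamping parameters are all assumptions made for this example.

```python
def adjust_limit(
    current_limit: float,
    observed_latency_ms: float,
    target_latency_ms: float,
    gain: float = 0.5,
    min_limit: float = 1.0,
    max_limit: float = 10000.0,
) -> float:
    """Return a new QPS limit that opposes the observed latency error.

    Latency above target gives a negative error and lowers the limit;
    latency below target gives a positive error and raises it.
    """
    error = (target_latency_ms - observed_latency_ms) / target_latency_ms
    new_limit = current_limit * (1.0 + gain * error)
    # Clamp so a single bad sample cannot drive the limit to zero or infinity.
    return max(min_limit, min(max_limit, new_limit))
```

For instance, with a limit of 100 QPS and a target latency of 100 ms, an observed latency of 200 ms halves the limit to 50, while an observed latency of 50 ms raises it to 125. Applied repeatedly, the adjustment pulls the measured latency back toward the target.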
The following detailed description of embodiments of the invention refers to the accompanying drawings.
The dynamic intelligent service governance method for a distributed system according to the invention specifically comprises the following steps:
(1) configuring a sidecar process, i.e., an independent process operating in the sidecar pattern, for each service application in the distributed system; the sidecar process runs alongside the service application, provides additional capabilities to it, and can manage and schedule the service application according to a preset scheme;
(2) each request is processed by the sidecar process before it reaches the service application, and the sidecar process acts as a proxy of the service application, thereby monitoring the request link and the service operating environment and rate-limiting the service;
In the distributed system, the sidecar processes configured for different service applications are generic and differ only in their configuration parameters. Such full reuse means that business code can be developed without attention to these other concerns, which reduces the complexity and coupling of the business system.
The proxy service of the sidecar process can be implemented in one of the following ways: (1) following the traditional proxy server principle, the real address of the service application is replaced by the address of the sidecar process during service registration and discovery; or (2) implementation in combination with a controller of an SDN (Software Defined Network).
(3) The sidecar process records the start time and end time of each request and synchronizes the request ID to the monitoring and scheduling server; meanwhile, the sidecar process also serves as a collection agent of the monitoring and scheduling server and collects the monitoring indicators of the service operating environment in real time at a set interval. The monitoring indicators of the service operating environment include (but are not limited to): CPU utilization, memory utilization, disk I/O, network I/O, and the return status codes of requests.
(4) After receiving new load data and performance indicators collected by the sidecar process, the monitoring and scheduling server fits the data accumulated during operation (the longer the system has been running, the better the fit), calculates the optimal scheduling strategy under the current load, and then issues a scheduling instruction.
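The fitting in step (4) is not tied to any particular model in the text. Purely as an assumed illustration, the sketch below fits a quadratic curve to accumulated (QPS limit, mean response time) samples by least squares and picks the limit at the fitted minimum. The function names `best_limit` and `_det3` and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def best_limit(samples: Sequence[Tuple[float, float]]) -> float:
    """Fit y = a*u^2 + b*u + c (u = limit - mean limit) and return the
    limit with the lowest fitted response time within the observed range."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    mean_x = sum(xs) / len(xs)
    us = [x - mean_x for x in xs]  # centering improves numerical conditioning
    s = [sum(u ** k for u in us) for k in range(5)]
    t = [sum((u ** k) * y for u, y in zip(us, ys)) for k in range(3)]
    m: List[List[float]] = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    det = _det3(m)
    if det == 0:
        raise ValueError("need samples at three or more distinct limits")

    def solve(col: int) -> float:
        # Cramer's rule: replace one column of the normal matrix with rhs.
        replaced = [[rhs[r] if c == col else m[r][c] for c in range(3)] for r in range(3)]
        return _det3(replaced) / det

    a, b, c = solve(0), solve(1), solve(2)
    lo, hi = min(xs), max(xs)
    if a <= 0:
        # No interior minimum: choose the better end of the observed range.
        def fitted(x: float) -> float:
            return a * (x - mean_x) ** 2 + b * (x - mean_x) + c
        return lo if fitted(lo) <= fitted(hi) else hi
    return max(lo, min(hi, mean_x - b / (2 * a)))


# Invented history: response time is lowest around a limit of 80 QPS.
history = [(40.0, 36.0), (60.0, 24.0), (80.0, 20.0), (100.0, 24.0), (120.0, 36.0)]
```

With this invented history, `best_limit(history)` lands at approximately 80. A production system would likely fit over several services and indicators at once; the single-variable curve here only shows the idea of deriving a limit from accumulated data.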
A simple example is:
1. The system contains service applications A, B, and C.
2. The current QPS of service applications A, B, and C is 100, 10, and 50 respectively.
3. Fitting of historical data shows that the overall response performance of the system is best when the QPS of service A is limited to 80.
4. The scheduling service sends the computed result to the sidecar.
5. The sidecar limits the rate to 80 and puts the excess traffic into a message queue to smooth the QPS; or, if permitted, service A is scaled out horizontally or throttled.
6. If a service D is a non-critical service that sits in the service call chain A -> B -> C -> D, and the response time of D is long, the scheduling service decides according to a predetermined policy whether to trip the circuit breaker. The call chain then becomes A -> B -> C. The work of service D is placed in an asynchronous queue or is compensated by a subsequent timed task or some other additional mechanism.
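The circuit-breaking decision in item 6 can be sketched as below. The "predetermined policy" is not specified in the text, so the latency threshold, the `Service` dataclass, and the `plan_call_chain` function are assumptions made for illustration; the `deferred` queue stands in for the asynchronous queue or later compensation task.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Sequence


@dataclass
class Service:
    name: str
    critical: bool
    recent_latency_ms: float  # hypothetical figure reported by the sidecar


def plan_call_chain(
    chain: Sequence[Service],
    max_latency_ms: float,
    deferred: Deque[str],
) -> List[str]:
    """Return the services to call synchronously.

    Non-critical services slower than max_latency_ms are cut out of the
    chain and queued for asynchronous compensation instead.
    """
    active: List[str] = []
    for svc in chain:
        if not svc.critical and svc.recent_latency_ms > max_latency_ms:
            deferred.append(svc.name)  # handled later by a timed task or queue consumer
        else:
            active.append(svc.name)
    return active


chain = [
    Service("A", True, 40.0),
    Service("B", True, 35.0),
    Service("C", True, 60.0),
    Service("D", False, 900.0),
]
compensation_queue: Deque[str] = deque()
```

Calling `plan_call_chain(chain, 500.0, compensation_queue)` with these made-up latencies yields the chain A, B, C and leaves D in the compensation queue. Critical services are never dropped, however slow they are.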
In the invention, the monitoring and scheduling server also acts as a monitoring and alarm center. Conventional monitoring alarms usually check whether an indicator exceeds a fixed threshold and usually target a single service. In the invention, the monitoring and scheduling server not only holds the load and performance data of each single service but can also calculate the load and performance of the whole system from the received load data and performance indicators, and it raises alarms according to set conditions; meanwhile, it continuously learns from and fits historical data to obtain the typical load states of the system, and then dynamically adjusts the alarm threshold for different periods of system operation. For example, a government system may be in an off-peak period after 8:00 p.m. and in a peak period during certain daytime hours, so the alarm threshold is adjusted dynamically for the different periods.
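A period-dependent alarm threshold of this kind could be learned in many ways. The sketch below is one assumed approach, not the method prescribed by the text: it groups historical load samples by hour of day and sets each hour's threshold to the mean plus `k` population standard deviations. The function names and the sample figures are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def learn_thresholds(history: Sequence[Tuple[int, float]], k: float = 3.0) -> Dict[int, float]:
    """Learn one alarm threshold per hour of day from (hour, load) samples."""
    by_hour: Dict[int, List[float]] = defaultdict(list)
    for hour, load in history:
        by_hour[hour].append(load)
    # Typical load for the hour plus k standard deviations of headroom.
    return {hour: mean(loads) + k * pstdev(loads) for hour, loads in by_hour.items()}


def should_alert(thresholds: Dict[int, float], hour: int, load: float) -> bool:
    """Alert only when the load is unusual for that hour of day."""
    return hour in thresholds and load > thresholds[hour]


# Invented CPU-load history: quiet at 21:00, busy at 10:00.
samples = [(21, 10.0), (21, 12.0), (21, 11.0), (10, 70.0), (10, 74.0), (10, 72.0)]
thresholds = learn_thresholds(samples)
```

With this invented history, a CPU load of 50% triggers an alert at 21:00 but not at 10:00, which mirrors the idle-period case that a single fixed threshold would miss.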

Claims (5)

1. A dynamic intelligent service governance method for a distributed system, characterized by comprising the following steps:
(1) configuring a sidecar process, i.e., an independent process operating in the sidecar pattern, for each service application in the distributed system; the sidecar process runs alongside the service application, provides additional capabilities to it, and can manage and schedule the service application according to a preset scheme;
(2) each request is processed by the sidecar process before it reaches the service application, and the sidecar process acts as a proxy of the service application, thereby monitoring the request link and the service operating environment and rate-limiting the service;
(3) the sidecar process records the start time and end time of each request and synchronizes the request ID to the monitoring and scheduling server; meanwhile, the sidecar process also serves as a collection agent of the monitoring and scheduling server and collects the monitoring indicators of the service operating environment in real time at a set interval;
(4) after receiving new load data and performance indicators collected by the sidecar process, the monitoring and scheduling server fits the data accumulated during operation, calculates the optimal scheduling strategy under the current load, and then issues a scheduling instruction.
2. The method of claim 1, wherein the monitoring indicators of the service operating environment in step (3) comprise: CPU utilization, memory utilization, disk I/O, network I/O, and the return status codes of requests.
3. The method of claim 1, wherein the sidecar processes configured for different service applications in the distributed system are generic and differ only in their configuration parameters.
4. The method of claim 1, wherein the proxy service of the sidecar process is implemented in one of the following ways:
(1) following the traditional proxy server principle, the real address of the service application is replaced by the address of the sidecar process during service registration and discovery; or,
(2) implementation in combination with an SDN controller.
5. The method of claim 1, wherein the monitoring and scheduling server serves as a monitoring and alarm center: the monitoring and scheduling server calculates the load and performance of the whole system from the received load data and performance indicators and raises alarms according to set conditions; meanwhile, the monitoring and scheduling server continuously learns from and fits historical data to obtain the typical load states of the system, and then dynamically adjusts the alarm threshold for different periods of system operation.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811110086.8A CN109067597A (en) 2018-09-21 2018-09-21 A kind of distributed system dynamic and intelligent service administering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811110086.8A CN109067597A (en) 2018-09-21 2018-09-21 A kind of distributed system dynamic and intelligent service administering method

Publications (1)

Publication Number Publication Date
CN109067597A (en) 2018-12-21

Family

ID=64763413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811110086.8A Pending CN109067597A (en) 2018-09-21 2018-09-21 A kind of distributed system dynamic and intelligent service administering method

Country Status (1)

Country Link
CN (1) CN109067597A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903175A (en) * 2019-03-26 2019-06-18 众安在线财产保险股份有限公司 A kind of Insurance core system monitor supervision platform
CN110868449A (en) * 2019-10-14 2020-03-06 深圳市非零无限科技有限公司 Method and system for realizing timing task based on asynchronous message
CN111130944A (en) * 2019-12-30 2020-05-08 苏州思必驰信息科技有限公司 System monitoring method and system
CN111212129A (en) * 2019-12-30 2020-05-29 北京浪潮数据技术有限公司 Container application high-availability method, device and equipment based on side car mode
CN111475772A (en) * 2020-03-27 2020-07-31 微梦创科网络科技(中国)有限公司 Capacity evaluation method and device
CN111917844A (en) * 2020-07-17 2020-11-10 中信银行股份有限公司 Distributed service tracking method and device
WO2020227266A1 (en) * 2019-05-08 2020-11-12 Cisco Technology, Inc. Systems and methods for protecting a service mesh from external attacks on exposed software vulnerabilities
CN111935289A (en) * 2020-08-14 2020-11-13 中国工商银行股份有限公司 Dynamic monitoring method and device based on block chain
CN112241355A (en) * 2020-10-19 2021-01-19 恩亿科(北京)数据科技有限公司 Link tracking method, system, computer readable storage medium and electronic device
CN112511560A (en) * 2020-12-21 2021-03-16 北京云思畅想科技有限公司 Data security guarantee method in hybrid cloud environment based on service grid
CN112615790A (en) * 2020-12-22 2021-04-06 苏州思必驰信息科技有限公司 Multi-server-side flow limiting and flow monitoring system and method
CN114374693A (en) * 2021-12-09 2022-04-19 中国空间技术研究院 Decentralized real-time service scheduling management method and system for distributed system
CN115665590A (en) * 2022-10-21 2023-01-31 北京中电飞华通信有限公司 Internet of things data acquisition system and method based on eSIM card and 5G communication
WO2023016415A1 (en) * 2021-08-09 2023-02-16 华为云计算技术有限公司 Node for running container group, and management system and method of container group

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103875276A (en) * 2011-10-07 2014-06-18 瑞典爱立信有限公司 BNG to PCRF intermediary entity for BBF and 3GPP access interworking
CN107025256A (en) * 2015-11-06 2017-08-08 国际商业机器公司 The method and system for reactivating the time for reducing the service based on cloud
US9842045B2 (en) * 2016-02-19 2017-12-12 International Business Machines Corporation Failure recovery testing framework for microservice-based applications

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
陈皓: "《管理设计篇之"服务网络"》", 《极客时间》 *
陈皓: "《管理设计篇之"边车模式》", 《极客时间》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903175A (en) * 2019-03-26 2019-06-18 众安在线财产保险股份有限公司 A kind of Insurance core system monitor supervision platform
US10999312B2 (en) 2019-05-08 2021-05-04 Cisco Technology, Inc. Systems and methods for protecting a service mesh from external attacks on exposed software vulnerabilities
WO2020227266A1 (en) * 2019-05-08 2020-11-12 Cisco Technology, Inc. Systems and methods for protecting a service mesh from external attacks on exposed software vulnerabilities
CN110868449A (en) * 2019-10-14 2020-03-06 深圳市非零无限科技有限公司 Method and system for realizing timing task based on asynchronous message
CN111130944A (en) * 2019-12-30 2020-05-08 苏州思必驰信息科技有限公司 System monitoring method and system
CN111212129A (en) * 2019-12-30 2020-05-29 北京浪潮数据技术有限公司 Container application high-availability method, device and equipment based on side car mode
CN111475772A (en) * 2020-03-27 2020-07-31 微梦创科网络科技(中国)有限公司 Capacity evaluation method and device
CN111475772B (en) * 2020-03-27 2023-12-15 微梦创科网络科技(中国)有限公司 Capacity assessment method and device
CN111917844A (en) * 2020-07-17 2020-11-10 中信银行股份有限公司 Distributed service tracking method and device
CN111935289A (en) * 2020-08-14 2020-11-13 中国工商银行股份有限公司 Dynamic monitoring method and device based on block chain
CN111935289B (en) * 2020-08-14 2022-10-18 中国工商银行股份有限公司 Dynamic monitoring method and device based on block chain
CN112241355A (en) * 2020-10-19 2021-01-19 恩亿科(北京)数据科技有限公司 Link tracking method, system, computer readable storage medium and electronic device
CN112511560A (en) * 2020-12-21 2021-03-16 北京云思畅想科技有限公司 Data security guarantee method in hybrid cloud environment based on service grid
CN112615790A (en) * 2020-12-22 2021-04-06 苏州思必驰信息科技有限公司 Multi-server-side flow limiting and flow monitoring system and method
WO2023016415A1 (en) * 2021-08-09 2023-02-16 华为云计算技术有限公司 Node for running container group, and management system and method of container group
CN114374693A (en) * 2021-12-09 2022-04-19 中国空间技术研究院 Decentralized real-time service scheduling management method and system for distributed system
CN115665590A (en) * 2022-10-21 2023-01-31 北京中电飞华通信有限公司 Internet of things data acquisition system and method based on eSIM card and 5G communication

Similar Documents

Publication Publication Date Title
CN109067597A (en) A kind of distributed system dynamic and intelligent service administering method
CN110971444B (en) Alarm management method, device, server and storage medium
US5796633A (en) Method and system for performance monitoring in computer networks
US5696701A (en) Method and system for monitoring the performance of computers in computer networks using modular extensions
CN109857558A (en) A kind of data flow processing method and system
CN112380086B (en) Intelligent sensing control system and method for distributed micro-service architecture data center
CN104780220B (en) Towards the intelligent monitor system and monitoring method of the large-scale distributed system of stock futures industry
US20100043004A1 (en) Method and system for computer system diagnostic scheduling using service level objectives
CN104407926B (en) A kind of dispatching method of cloud computing resources
KR20080044508A (en) Performance failure management system and its method using statistical analysis
KR20120023703A (en) Server control program, control server, virtual server distribution method
CN104378262A (en) Intelligent monitoring analyzing method and system under cloud computing
CN102681904B (en) Data syn-chronization dispatching method and device
CN112751726B (en) Data processing method and device, electronic equipment and storage medium
CN114896121A (en) Monitoring method and device for distributed processing system
CN110750425A (en) Database monitoring method, device and system and storage medium
CN117632897A (en) Dynamic capacity expansion and contraction method and device
CN120447720A (en) Throughput-optimized service quality early warning power capping system
CN118963974A (en) Multi-dimensional distributed task scheduling method, device, equipment and storage medium
CN107463490B (en) Cluster log centralized collection method applied to platform development
CN111158763B (en) Equipment instruction processing system for intelligent management and control of building
CN120216194A (en) Real-time data stream processing method, device, electronic device and storage medium
CN118672758B (en) A system, method, device and medium for multi-cluster task scheduling and monitoring
CN119806880A (en) Automated maintenance method, device and computer program product for container
CN101442437A (en) Method, system and equipment for implementing high availability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181221