From 4a44183d16d449ea4d4ce2c46794b91b10345f72 Mon Sep 17 00:00:00 2001
From: Killian Delarue
Date: Fri, 12 Aug 2022 11:23:55 +0200
Subject: [PATCH] Doc: add node-monitoring doc

---
 docs/developer/openmetrics.rst |  18 ---
 docs/index.rst                 |   1 +
 docs/shell/validation.rst      |   6 +
 docs/user/node-monitoring.rst  | 260 +++++++++++++++++++++++++++++++++
 4 files changed, 267 insertions(+), 18 deletions(-)
 create mode 100644 docs/user/node-monitoring.rst

diff --git a/docs/developer/openmetrics.rst b/docs/developer/openmetrics.rst
index 31ac7f3f1789..5b7278eb6fff 100644
--- a/docs/developer/openmetrics.rst
+++ b/docs/developer/openmetrics.rst
@@ -62,21 +62,3 @@ source - using adequate values:
      scheme: http
      static_configs:
      - targets: ['localhost:9091']
-
-
-Monitoring the node with metrics
---------------------------------
-
-Once the node is correctly set up to export metrics
-and those are collected by a `Prometheus server `_,
-you can graphically monitor your node with a `Grafana dashboard `_.
-
-Dashboards suited for Octez can be easily built with the `Grafazos `_ tool.
-Grafazos provides several ready-to-use dashboards for Octez on the `Grafazos packages page `__, as plain JSON files.
-Their sources are also available as `jsonnet `__ files, that can be adjusted to build customized dashboards, if needed:
-
-
-- ``octez-basic``: A basic dashboard with all the node metrics
-- ``octez-full``: A full dashboard with the logs and hardware data.
-  This dashboard should be used with `Netdata `_ (for supporting hardware data) and `Promtail `_ (for exporting the logs).
-- ``octez-compact``: A compact dashboard that gives a brief overwiev of the various node metrics on a single page.
diff --git a/docs/index.rst b/docs/index.rst
index bf1225dafdcd..81e3e01d5fdb 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -116,6 +116,7 @@ in the :ref:`introduction `.
    user/key-management
    user/node-configuration
+   user/node-monitoring
    user/versioning
    user/snapshots
    user/history_modes
diff --git a/docs/shell/validation.rst b/docs/shell/validation.rst
index afcbcfe65436..0c058488bc3d 100644
--- a/docs/shell/validation.rst
+++ b/docs/shell/validation.rst
@@ -78,6 +78,8 @@ communicating with each other via message passing. Workers are spawned and killed dynamically, according to connected peers, incoming blocks to validate, and active (test)chains.
+.. _chain_validator:
+
 A *chain validator* worker is launched by the validator for each *chain* that it considers alive. Each chain validator is responsible for handling blocks that belong to this chain, and select the best head for
@@ -87,6 +89,8 @@ chain. Forking a chain is decided from within the economic protocol. In protocol Alpha, this is only used to try new protocols before self amending the main chain.
+.. _peer_validator:
+
 The chain validator spawns one *peer validator* worker per connected peer. The set of peer validators is updated, grown, or shrunk on the fly, according to the connections and disconnections signals from the peer-to-peer component.
@@ -123,6 +127,8 @@ pipeline, or multipass) will interact with the distributed DB to get the data they need (block headers and operations). When they have everything needed for a block, they will call the *block validator*.
+.. _block_validator:
+
 The *block validator* validates blocks (currently in sequence), assuming that all the necessary data have already been retrieved from the peer-to-peer network. When a block is valid, it will notify the
diff --git a/docs/user/node-monitoring.rst b/docs/user/node-monitoring.rst
new file mode 100644
index 000000000000..2dee27493f47
--- /dev/null
+++ b/docs/user/node-monitoring.rst
@@ -0,0 +1,260 @@
+Monitoring a Tezos Node
+=======================
+
+Monitoring the behavior of a Tezos node can be partially achieved by exploring the logs or,
+more efficiently, through the RPC server.
+The use of RPCs is detailed in :doc:`the RPC documentation <../developer/rpc>`
+and :doc:`the RPC references <../shell/rpc>`.
+
+The most practical approach, however, is to use Octez Metrics to gather information and statistics; this facility has been integrated directly into the node
+since Octez version 14. Users are now able to get metrics without using an external tool,
+such as `tezos-metrics `_ (which is now deprecated).
+The node now includes a server that registers the implemented metrics and outputs them for each received ``/metrics`` HTTP request.
+You can therefore configure and launch your node with a metrics exporter.
+
+
+Starting a node with monitoring
+-------------------------------
+
+Start
+~~~~~
+
+The node can be started with its metrics exporter with the option ``--metrics-addr`` which takes as a parameter ``<ADDR>:<PORT>`` or ``<ADDR>`` or ``:<PORT>``.
+
+``<ADDR>`` and ``<PORT>`` are respectively the address and the port on which to expose the metrics.
+By default, ``<ADDR>`` is ``localhost`` and ``<PORT>`` is ``9932``.
+
+.. code-block:: shell
+
+    tezos-node run --metrics-addr=<ADDR>:<PORT> …
+
+Note that it is possible to serve metrics on several addresses by using the option more than once.
+
+Configure
+~~~~~~~~~
+
+You can also add this configuration to your persistent configuration file through the command line:
+
+.. code-block:: shell
+
+    tezos-node config init --metrics-addr=<ADDR>:<PORT> ...
+
+    # Or, if the configuration file already exists:
+    tezos-node config update --metrics-addr=<ADDR>:<PORT> ...
+
+See :doc:`the documentation of the node configuration <./node-configuration>` for more information.
+
+A correct setup should write an entry in the logs similar to:
+
+::
+
+    - node.main: starting metrics server on <ADDR>:<PORT>
+
+Octez Metrics
+-------------
+
+This section focuses on access to the metrics and their uses.
+More details on the metrics specifications are available :doc:`here <../developer/openmetrics>`.
+
+Scraping Octez Metrics
+~~~~~~~~~~~~~~~~~~~~~~
+
+Once your node is correctly set up to export metrics, you can scrape them by querying the metrics server of your node with the request ``/metrics``.
+
+For example:
+
+.. code-block:: shell
+
+    curl http://<ADDR>:<PORT>/metrics
+
+You will be presented with the list of defined and computed metrics as follows:
+
+::
+
+    #HELP metric description
+    #TYPE metric type
+    octez_metric_name{label_name=label_value} x.x
+
+
+The metrics that can be exposed by the node can be listed with the command:
+
+.. code-block:: shell
+
+    tezos-node dump-metrics
+
+
+Version 14 of Octez exports metrics from various components of the node, namely:
+
+- :doc:`The p2p layer <../shell/p2p>`
+- :doc:`The store <../shell/storage>`
+- :doc:`The prevalidator <../shell/prevalidation>`
+- :ref:`The chain validator <chain_validator>`
+- :ref:`The block validator <block_validator>`
+- :ref:`The peer validator <peer_validator>`
+- The distributed database
+- :doc:`The RPC server <../shell/rpc>`
+- The node version
+
+Each exported metric has the following form::
+
+    octez_subsystem_metric{label_name=label_value,...} value
+
+Each metric name starts with ``octez`` as its namespace, followed by a subsystem name, which is the section of the node described by the metric.
+It follows the OpenMetrics specification described `here `__.
+
+A metric may provide labeled parameters which allow for different instances of the metric, with different label values.
+For instance, the metric ``octez_distributed_db_requester_table_length`` has a label name ``requester_kind`` which allows this metric to have one value for each kind of requester.
+
+::
+
+    octez_distributed_db_requester_table_length{requester_kind="block_header"} x
+    octez_distributed_db_requester_table_length{requester_kind="protocol"} y
+    ...
+
+Metrics provide information about the node in the form of a `gauge `_ that can increase or decrease (like the number of connections),
+a `counter `_ that can only increase (like the head level),
+or a `histogram `_ used to track the size of events and how long they usually take (e.g., the time taken by an RPC call).
+
+The label value is sometimes used to store information that can't be described by the metric value (which can only be a float). This is used, for example, by the ``octez_version`` metric, which provides the version within the labels.
+
+.. note::
+
+   Most of the metrics are computed when scraped from the node. As there is no rate limiter, you should consider scraping sparingly and adding a proxy for a public endpoint, to limit the impact on performance.
+
+.. _prometheus_server:
+
+Prometheus
+~~~~~~~~~~
+
+Scraping metrics gives you instantaneous values of the metrics. For more effective monitoring, you should create a time series of these metrics.
+
+We suggest using `Prometheus `_ for that purpose.
+
+Once installed, you need to add the scraping job to the configuration file.
+
+::
+
+    - job_name: 'tezos-exporter'
+      scrape_interval: interval s
+      metrics_path: "/metrics"
+      static_configs:
+        - targets: ['addr:port']
+
+Prometheus is a service, so you need to start it. Note that Prometheus can also scrape metrics from several nodes!
+
+.. code-block:: shell
+
+    sudo systemctl start prometheus
+
+.. _hardware_metrics:
+
+Hardware metrics
+~~~~~~~~~~~~~~~~
+
+In addition to node metrics, you may want to gather other information and statistics for effective monitoring, such as hardware metrics.
+
+For that purpose, we suggest using `Netdata `_.
+
+To install Netdata:
+
+.. code-block:: shell
+
+    bash <(curl -Ss https://my-netdata.io/kickstart.sh)
+
+Add the following at the end of ``/etc/netdata/app_groups.conf``:
+
+.. code-block:: shell
+
+    tezos: tezos-node tezos-validator
+
+.. _filecheck:
+
+Optionally, you can enable storage monitoring with ``filecheck``.
+
+To do so, create a ``filecheck.conf`` file in ``/etc/netdata/go.d/`` and add::
+
+    jobs:
+      - name: octez-data-dir-size
+        discovery_every: 30s
+        dirs:
+          collect_dir_size: yes
+          include:
+            - '/path/to/data/dir'
+
+      - name: octez-context-size
+        discovery_every: 30s
+        dirs:
+          collect_dir_size: yes
+          include:
+            - '/path/to/data/dir/context'
+
+      - name: octez-store-size
+        discovery_every: 30s
+        dirs:
+          collect_dir_size: yes
+          include:
+            - '/path/to/data/dir/store'
+
+
+Then, you need to make sure that the ``netdata`` user has the correct read/write/execute permissions.
+This can be achieved by adding this user to your user's group, or by defining custom rules.
+
+To check that the setup is correct::
+
+    # Log in as the netdata user
+    sudo -u netdata -s
+
+    # Go to the plugin directory
+    cd /usr/libexec/netdata/plugins.d/
+
+    # Run the plugin in debug mode
+    ./go.d.plugin -d -m filecheck
+
+
+With a correct install, you should see lines such as::
+
+    BEGIN 'filecheck_octez-data-dir-size.dir_size' 9999945
+    SET '/path/to/data/dir/' = 48585735837
+    END
+
+Note that if you use filecheck for storage monitoring, you need to configure your dashboards accordingly. More details are given in the :ref:`Grafazos configuration section <grafazos_configuration>`.
+
+Dashboards
+----------
+
+Dashboards will take your node monitoring to the next level, allowing you to visualize the raw data collected with pretty, colorful graphs.
+
+Grafana
+~~~~~~~
+
+Dashboards can be created and visualized with `Grafana `_. Grafana can be installed by following `these instructions `_.
+
+Once installed and running, you should be able to reach the interface on port ``3000`` (you can change the port in the Grafana configuration file).
+
+Then you need to add the configured Prometheus server (see :ref:`Prometheus <prometheus_server>`) as a data source in ``Configuration/Data sources``.
+
+
+Grafazos
+~~~~~~~~
+
+You can interactively create your own dashboards to monitor your node, using the Grafana GUI. Alternatively, Grafana allows you to import dashboards from JSON files.
+
+`Grafazos `_ generates JSON files that you can import into the Grafana interface.
+
+This tool generates the following dashboards:
+
+- ``octez-compact``: A compact dashboard that gives a brief overview of the various node metrics on a single page.
+- ``octez-basic``: A basic dashboard with all the node metrics.
+- ``octez-with-logs``: Same as ``octez-basic``, but also displays the node's logs, exported with `Promtail `_.
+- ``octez-full``: A full dashboard with the logs and hardware data. This dashboard should be used with `Netdata `_ (for supporting hardware data) in addition to Promtail.
+
+You can generate them from the sources, with your own configuration. Alternatively, you can use the JSON files compatible with your node version, found `here `_.
+
+.. _grafazos_configuration:
+
+The dashboards can be configured by setting environment variables before starting their generation (using ``make``).
+
+The available variables are:
+
+- ``BRANCH``: Used to specify the name of the branch of the node.
+- ``NODE_INSTANCE_LABEL``: Used to set the name of the node instance label in the metrics.
+- ``STORAGE_MODE``: To be set to ``filecheck`` if the :ref:`storage monitoring with filecheck <filecheck>` is enabled.
-- 
GitLab
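
The labeled exposition format described in the "Octez Metrics" section of the new page can be explored from the command line. The following is a minimal sketch, not part of the patch: the helper name ``metric_values`` and the sample numbers are hypothetical, and the parsing assumes that label values contain no spaces. It prints the label set and value of every sample of one metric.

```shell
#!/bin/sh
# metric_values NAME
# Read metrics exposition text on stdin and print "<labels> <value>"
# for each sample of metric NAME. Hypothetical helper, not part of Octez.
# Assumption: label values contain no whitespace.
metric_values() {
    awk -v name="$1" '
        /^#/ { next }                      # skip HELP/TYPE comment lines
        {
            brace  = index($1, "{")        # position of the label block, if any
            metric = (brace > 0) ? substr($1, 1, brace - 1) : $1
            labels = (brace > 0) ? substr($1, brace) : "{}"
            if (metric == name) print labels, $NF
        }
    '
}

# Invented sample input in the format shown in the documentation.
sample='# HELP octez_distributed_db_requester_table_length hypothetical help text
# TYPE octez_distributed_db_requester_table_length gauge
octez_distributed_db_requester_table_length{requester_kind="block_header"} 3
octez_distributed_db_requester_table_length{requester_kind="protocol"} 0
octez_version{version="hypothetical"} 0'

printf '%s\n' "$sample" | metric_values octez_distributed_db_requester_table_length
```

Against a running node, the same function could presumably be fed from ``curl -s`` on the ``/metrics`` endpoint instead of the sample text, using the address and port given to ``--metrics-addr``.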