What

This is the base MR for a new Grafana dashboard that hosts the panels generated from the Prometheus data added by the PPX profiling effort.

Debate topics (which can/should be addressed in this MR):

Do we want a new Grafana dashboard for this new information? I advocate for yes: to produce this data, a node must be run with TEZOS_PROFILING=profiling PROFILING_BACKEND="prometheus", and I believe the existing nodes used for observation should not be impacted by the profiling overhead.
Is the dashboard design up to standard? This is the first one I have built, so I am not sure how well it will be received or how expressive it is. Please note that the dashboard added by this MR contains a panel only for the Store module, because we still need to add more Prometheus labels to the other components.
Should there be more panels? I deliberately kept the number of panels small, although much more detailed profiling output is available. Consider the ordinary store_profiler output for a ghostnet node:

2024-11-26T11:21:44.189-00:00
BMNMbXBSzpPdiVvN5MXvfyHPejWVLhkignTaYBzMDCdkxCJ3CQL ............................ 1         5371.821ms  11% +1m10s706.923ms
  compute_live_blocks .......................................................... 1            0.004ms  75%
  store_block .................................................................. 1            0.446ms 102% +0.001ms
  set_head ..................................................................... 1            1.366ms 107% +0.500ms
    get_pred_block ............................................................. 1            0.002ms 467% +0.003ms
    may_split_context .......................................................... 1            0.001ms 210% +0.005ms
    finalize_set_head .......................................................... 1            1.348ms 107% +0.015ms
      write_new_head ........................................................... 1            0.490ms 105% +0.617ms
      write_new_target ......................................................... 1            0.001ms 105% +1.107ms
      updating live blocks ..................................................... 1            0.168ms 101% +1.135ms
        locked compute live blocks with cache .................................. 1            0.166ms 101% +0.002ms
          compute live blocks with new head .................................... 1            0.164ms 100% +0.000ms

I chose to show graphs only for the 3 big (Notice-level) categories (compute_live_blocks, store_block and set_head), which I thought would be sufficient. If we see an unexplained increase in the average time they take, we can investigate more deeply in the profiling text logs.
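When digging into the text logs, the top-level sections can be pulled out mechanically. The following is a minimal sketch, assuming the report layout shown above: nesting is encoded as two spaces per level, and each entry is a name, a run of dots, a count and a time. The function name top_level_sections is hypothetical and not part of Octez.

```shell
# Hypothetical helper (not part of Octez): read a profiler text report on
# stdin and print "<name><TAB><time>" for every depth-1 entry, i.e. every
# line indented by exactly two spaces.
top_level_sections() {
  awk '
    /^  [^ ]/ {
      line = $0
      sub(/^  /, "", line)              # drop the two-space indentation
      if (match(line, / \.\.+ /) == 0)  # locate the run of dots
        next
      name = substr(line, 1, RSTART - 1)
      rest = substr(line, RSTART + RLENGTH)
      split(rest, f, " ")               # f[1] = count, f[2] = time
      print name "\t" f[2]
    }
  '
}
```

Fed the report above, this keeps compute_live_blocks, store_block and set_head and skips both the block hash line and the nested entries.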

Combining graphs: As you can see, I have not combined any graphs yet, because I believe two measurements x and y should only share a graph when they are two alternatives of the same process (for instance, an operation performed on either x = attestation or y = preattestation by a function operating on consensus operations). Is this a good approach?

Why

How

Manually testing the MR

To test the MR, please refer to the Monitoring an Octez node tutorial; it was my first step, too.

Then you need a node. I used a node running on ghostnet (which had been running for a few days):

$ TEZOS_PPX_PROFILER=profiling PROFILING_BACKEND="prometheus" PROFILING="debug" ./octez-node run --data-dir ~/.tezos-node-ghostnet --metrics-addr localhost:9091

You should be able to see the metrics at this address: http://localhost:9091/metrics
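The metrics page is long, so it can help to filter it. The following is a minimal sketch, assuming the endpoint serves the standard Prometheus text format, in which lines starting with # are HELP/TYPE comments. The function name filter_metrics is hypothetical, and the substring to search for is yours to choose, since the exact metric names are not listed here.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and keep
# only the sample lines (not the "# HELP" / "# TYPE" comments) that contain
# the given fixed substring.
filter_metrics() {
  pattern=$1
  grep -v '^#' | grep -F -- "$pattern"
}
```

For example, curl -s http://localhost:9091/metrics | filter_metrics store would list the samples whose line mentions "store", assuming the Store metrics carry that word in their name or labels.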

To generate the Grafana dashboard, run:

$ cd ~/tezos/grafazos

$ NODE_INSTANCE_LABEL=instance make profiling

This creates the octez-profiling.json file (a Grafana dashboard template) which, when imported into Grafana, should give you a nice-looking dashboard.

You can also check the other, already implemented panels by running the make command without any argument. This builds all the other .json files, each of which you can import into a new Grafana dashboard.

Checklist

Document the interface of any function added or modified (see the coding guidelines)
Document any change to the user interface, including configuration parameters (see node configuration)
Provide automatic testing (see the testing guide).
For new features and bug fixes, add an item in the appropriate changelog (docs/protocols/alpha.rst for the protocol and the environment, CHANGES.rst at the root of the repository for everything else).
Select suitable reviewers using the Reviewers field below.
Select as Assignee the next person who should take action on that MR.

Edited Nov 29, 2024 by Gabriel Moise

Grafazos: Add file to generate profiling dashboard
