Octez-P2P manages TCP connections. When a peer is no longer reachable, TCP may take from a few minutes up to a few dozen minutes to detect it, because of how TCP retransmission works (see for example https://stackoverflow.com/questions/5907527/application-control-of-tcp-retransmission-on-linux).
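
To give an order of magnitude for that delay, here is a rough sketch of the retransmission arithmetic. It assumes Linux defaults that are not stated in this MR: a minimum retransmission timeout (RTO) of 200 ms, a maximum of 120 s, and `net.ipv4.tcp_retries2 = 15`. The function name is made up for illustration. The real RTO is derived from the measured round-trip time, so this is only a lower bound on the effective delay.

```c
/* Assumed Linux defaults (not taken from this MR): minimum RTO 200 ms,
   maximum RTO 120 s. */
#define RTO_MIN_MS 200UL
#define RTO_MAX_MS 120000UL

/* Hypothetical helper: total time, in milliseconds, spent waiting over
   the initial transmission plus `retries` retransmissions, with the RTO
   doubling after each attempt and capped at RTO_MAX_MS. */
unsigned long retransmission_budget_ms(unsigned int retries)
{
    unsigned long total = 0;
    unsigned long rto = RTO_MIN_MS;

    for (unsigned int i = 0; i <= retries; i++) {
        total += rto;
        rto = (rto * 2 > RTO_MAX_MS) ? RTO_MAX_MS : rto * 2;
    }
    return total;
}
```

Under these assumptions, `retransmission_budget_ms(15)` gives 924600 ms, roughly 15 minutes, which matches the hypothetical timeout quoted in the Linux kernel documentation for `tcp_retries2`.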

This MR proposes setting TCP_USER_TIMEOUT to detect connection issues much earlier. To understand how this works, you can read this blog article:

https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

Since this option is not available in Unix.ml, we introduce a stub.
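
For readers unfamiliar with the option, the sketch below shows what setting it looks like at the C level. This is not the stub from this MR: the helper name `set_tcp_user_timeout` and the choice to silently do nothing where the option is undefined are assumptions made for illustration.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper, not the MR's actual stub: sets TCP_USER_TIMEOUT
   (expressed in milliseconds) on a TCP socket. Returns 0 on success and
   -1 on error, with errno set by setsockopt. On platforms whose headers
   do not define TCP_USER_TIMEOUT, it does nothing and reports success. */
int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
#ifdef TCP_USER_TIMEOUT
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
#else
    (void)fd;
    (void)timeout_ms;
    return 0;
#endif
}
```

With the 10 s constant discussed in this MR, the call would be `set_tcp_user_timeout(fd, 10000)`. Whether a silent no-op or a warning is the better fallback on platforms without the option is still an open question.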

How to test

Check that it does not fail on macOS (should we warn the user somehow?)
Check that it solves the issue (following the reproduction steps below)
Check whether the constant value for the timeout (currently 10 s) is acceptable; if not, what would be an acceptable value?

Reproduction steps (Based on a simpler tezt-cloud scenario)

Prevent the baker from receiving connections:

sudo iptables -I INPUT 1 -p tcp --dport 30055 -j DROP

Run the following tezt-cloud scenario:

tezt-cloud cloud dal -v --log-file /tmp/log --stake 1 --producers 1 --localhost

Observe the port picked by the baker to connect to the producer:

docker exec -it teztcloud-saroupille-dal-producer-1 bash
cat /tmp/tezt-<pid>/1/producer-dal-node-0/daily_logs/daily-20250221.log | grep -i conn

You will see, for example:

2025-02-21T11:01:56.097-00:00 [p2p.connect_handler.accept_connection] Accepting new connection from 127.0.0.1:34500
2025-02-21T11:01:56.104-00:00 [p2p.connect_handler.new_connection] authentification status for ::ffff:127.0.0.1:30055: check identity idrfj95Nx5EY3KhwX1ruMAa98vdR2a

Hence the port picked is 34500.

Prevent packets from being sent to port 34500 (simulating the host-unreachable issue):

sudo iptables -I OUTPUT 1 -p tcp --dport 34500 -m conntrack --ctstate NEW,ESTABLISHED -j REJECT --reject-with icmp-host-unreachable

At this stage, we can observe that neither the observer nor the baker is able to detect an issue. The socket stays alive. TCP is likely still at work, trying to figure out what is going on.
If we try to disconnect explicitly from the point on the producer by running:

./octez-client --endpoint http://127.0.0.1:30104 rpc delete /p2p/points/disconnect/127.0.0.1:30055

you can observe that it may take a very long time to return. This is an issue I observed in the past and never understood. If we wait long enough, we see the EHOSTUNREACH error on the producer side, and the attester is also notified of the disconnection. I am not sure why both happen at about the same time; maybe because both nodes have the same TCP configuration. This was tested with the gossipsub patches, so one should also test on the old master to check whether the attester observes the disconnection there.

By changing the keepalive properties (from 2 h to 1 min), I was able to observe a much shorter time to disconnection.
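
The MR description does not say how the keepalive properties were changed (for instance system-wide or per socket). As one possible illustration, the sketch below tunes them per socket; the helper name and its parameters are hypothetical, and the `TCP_KEEP*` options are only applied where the platform headers define them. The 2 h figure corresponds to the Linux default for `net.ipv4.tcp_keepalive_time` (7200 s).

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enables keepalive on one socket. Where supported,
   the first probe is sent after idle_s seconds of inactivity, then one
   probe every interval_s seconds, and the connection is dropped after
   `probes` unanswered probes. Returns 0 on success, -1 on error. */
int enable_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
        return -1;
#if defined(TCP_KEEPIDLE) && defined(TCP_KEEPINTVL) && defined(TCP_KEEPCNT)
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) != 0)
        return -1;
#else
    /* Per-socket tuning is unavailable here; system defaults apply. */
    (void)idle_s;
    (void)interval_s;
    (void)probes;
#endif
    return 0;
}
```

For example, `enable_keepalive(fd, 60, 10, 3)` would start probing after one minute of inactivity. Note that keepalive only covers idle connections, whereas TCP_USER_TIMEOUT also bounds how long unacknowledged data may stay in flight.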

Edited Feb 24, 2025 by François Thiré

Octez/P2P: Decrease the time to detect a connection is dead
