Octez-p2p: enable SO_KEEPALIVE flag for connections
(patch proposed by @Saroupille)
This MR enables the SO_KEEPALIVE socket option on connections established by the Octez-P2P library. The goal is to improve connection reliability in the Data Availability Layer (DAL) node, where stale connections have been observed.
Rationale
In the DAL node, connections have been observed to go stale over time because some peers communicate only infrequently. A socket-level keep-alive mechanism handles this without an application-level ping strategy, which keeps the architecture simpler while still providing the desired connection reliability.
Impact on DAL Nodes
- Maintain Active Connections: With SO_KEEPALIVE enabled, the kernel sends keep-alive probes on idle connections to verify that the peer is still reachable. This is particularly useful for the DAL node, where some peers may send messages infrequently.
- Prevent Stale Connections: There is no ping mechanism at the GossipSub/application layer, so SO_KEEPALIVE is what allows unresponsive or dead connections to be detected and closed, preventing resource leaks and improving overall network stability.
- Resource Management: Connections whose peer no longer responds are cleaned up automatically, so they stop holding system resources.
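As a minimal illustration of the mechanism described above, the sketch below turns on SO_KEEPALIVE for a TCP socket and reads the option back. It is not the code from this MR: Octez is written in OCaml, where the corresponding standard-library call is `Unix.setsockopt fd Unix.SO_KEEPALIVE true` (whether the patch uses exactly that call is an assumption). Python is used here only for brevity, and the function names are hypothetical.

```python
import socket


def enable_keepalive(sock: socket.socket) -> bool:
    """Turn on SO_KEEPALIVE and report whether the option is now set."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # getsockopt returns a non-zero integer when the option is enabled.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


def keepalive_enabled_on_fresh_socket() -> bool:
    """Create an unconnected TCP socket (hypothetical helper) and enable keep-alive."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return enable_keepalive(sock)
```

The option only takes effect once the connection is idle. On Linux, when the kernel gives up on a peer after its probes go unanswered, pending and later reads on the socket fail with ETIMEDOUT, which the application can treat like any other disconnection.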
Impact on L1 Nodes
- Minimal Necessity: L1 nodes typically engage in frequent data exchanges, reducing the likelihood of stale connections.
- Non-Harmful: While enabling SO_KEEPALIVE may offer negligible benefits for L1 nodes, it should not introduce any adverse effects, ensuring compatibility and safety across different node types.
Usage Considerations
- Performance: SO_KEEPALIVE adds periodic probe traffic on idle connections. The overhead is generally minimal and is outweighed by the benefit of detecting dead connections.
- Configurability: The timing and number of keep-alive probes can typically be adjusted at the system level (on Linux, through the net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl and net.ipv4.tcp_keepalive_probes sysctls) to balance responsiveness against network overhead for a given deployment.
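To make the trade-off between responsiveness and overhead concrete, here is a rough sketch of how the three keep-alive parameters combine into an approximate worst-case detection delay, together with a per-socket override of the system-wide values. The constants are the stock Linux defaults. The per-socket tuning is shown only for illustration: this MR is described as enabling SO_KEEPALIVE, not as setting these values, and the function names are hypothetical.

```python
import socket

# Stock Linux defaults for net.ipv4.tcp_keepalive_time, _intvl and _probes.
LINUX_DEFAULT_IDLE_S = 7200
LINUX_DEFAULT_INTERVAL_S = 75
LINUX_DEFAULT_PROBES = 9


def detection_delay(idle_s: int, interval_s: int, probes: int) -> int:
    """Approximate seconds before a silent, dead peer is declared gone."""
    # The first probe is sent after idle_s seconds of silence; the connection
    # is dropped once `probes` probes, spaced interval_s apart, go unanswered.
    return idle_s + interval_s * probes


def tune_keepalive(sock: socket.socket, idle_s: int, interval_s: int, probes: int) -> None:
    """Enable keep-alive and, where the platform exposes them, override timings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These TCP-level options are platform specific (available on Linux),
    # so each one is applied only if this Python build defines it.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    ):
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)
```

With the Linux defaults, the delay works out to 7200 + 75 × 9 = 7875 seconds, a little over two hours and eleven minutes. A deployment that needs dead DAL peers to be noticed sooner would have to lower these values, at the cost of more frequent probes.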