Orbit 2 - Improved diagnostics for silent actor routing failure
Originally created by: mattdkerr
Previously we ran into a bug with Orbit 2 where the client gave no indication that the problem lay with a particular deployment of the service component in Kubernetes. The observed behavior was that calls to the actor were attempted but never returned and never logged an error. The resolution was to delete the Orbit 2 server deployment and redeploy it. This experience showed that the logs and metrics do not contain enough information to indicate that a deployment is in a bad state, why the server is not instantiating an actor, or why the client isn't returning. I believe an improvement is needed here to greatly reduce the MTTR for these types of issues.
Originally posted by: brettmorien
This issue was created while debugging a problem in our multi-tenant dev environment where activations of addressables were happening only a fraction of the time. Redeploying Orbit entirely solved the problem temporarily, but it returned.
After further debugging, we determined that two installations of a client service were running against the same Orbit Server cluster. They were independent of each other (one a team-dev instance, one a team member's personal instance), but both used the same namespace in their connections. This caused addressable placement to happen randomly between the two clients, which looked like missing placements when watching only the expected client.
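To illustrate why this looked like intermittent failure, here is a minimal simulation. It is not Orbit's placement algorithm: it simply assumes each activation goes to a uniformly random client in the namespace, and the client names and the `simulate_placements` helper are hypothetical.

```python
import random
from collections import Counter


def simulate_placements(clients, n_activations, seed=0):
    """Assign each activation to a randomly chosen client in the namespace.

    Assumption: uniform random choice stands in for the real placement logic.
    """
    rng = random.Random(seed)
    return Counter(rng.choice(clients) for _ in range(n_activations))


# Two independent installations that (mistakenly) share one namespace.
clients = ["team-dev", "personal"]
total = 1000
counts = simulate_placements(clients, total)

# Someone watching only "team-dev" sees just a fraction of the activations.
observed = counts["team-dev"] / total
print(f"team-dev saw {observed:.0%} of activations")
```

From the point of view of one client, the activations that landed on the other installation simply never appear, which matches the "missing placements" symptom.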
It's hard to write diagnostic logging that helps with this situation. Metrics are exposed on the /metrics route, which would show more connected clients than expected, and perhaps that can be improved and exposed better in our environment. But disambiguating between messages from connected clients would have to rely on the data sent by the clients, which in this case was identical.
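A rough sketch of what a check against the metrics output could look like. It assumes the /metrics route returns Prometheus-style text; the metric name `orbit_connected_clients` is a placeholder and not a confirmed Orbit metric, and the helper functions are hypothetical.

```python
def read_gauge(metrics_text, name):
    """Return the value of the first unlabelled sample called `name`, or None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def check_client_count(metrics_text, expected, name="orbit_connected_clients"):
    """Compare the reported client count with the number we expect."""
    actual = read_gauge(metrics_text, name)
    if actual is None:
        return f"metric {name!r} not found"
    if actual > expected:
        return (f"{int(actual)} clients connected, expected {expected}: "
                "possible shared namespace")
    return "ok"


# Hand-written sample text, not real server output.
sample = """# HELP orbit_connected_clients Hypothetical gauge
# TYPE orbit_connected_clients gauge
orbit_connected_clients 2
"""
result = check_client_count(sample, expected=1)
print(result)
```

In practice the text would come from an HTTP GET against the server's /metrics route, and the expected count would have to be supplied per environment.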
I have updated the documentation around client configuration to make the role of namespace a little more explicit, hoping this will help others avoid this situation.
Originally posted by: mattdkerr
What about some diagnostic logging about which Orbit namespace and cluster the client is connecting to?
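A minimal sketch of that kind of startup log line. This is not the Orbit client API: the function names, logger name, and example values are all made up for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orbit.client.sketch")  # hypothetical logger name


def connection_summary(namespace, server_url):
    """Build one line stating which namespace and cluster the client targets."""
    return f"Connecting to Orbit: namespace={namespace!r} server={server_url!r}"


def log_connection(namespace, server_url):
    """Log the summary once at client startup and return it."""
    message = connection_summary(namespace, server_url)
    logger.info(message)
    return message


# Example values only; neither is a real deployment.
line = log_connection("team-dev", "dns:///orbit-server:50056")
```

With a line like this in each client's startup log, two installations pointing at the same namespace and cluster would at least be visible by comparing their logs.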