#427 Orbit 2 - Improved diagnostics for silent actor routing failure

Status: open

Owner: nobody

Labels: feature (6)

Updated: 2020-06-02

Created: 2020-05-02

Creator: Anonymous

Private: No

Originally created by: mattdkerr

We previously ran into a bug with Orbit 2 in which the client gave no indication that the problem lay with a particular deployment of the service component in Kubernetes. Calls to the actor were attempted but never returned and never logged an error. The resolution was to delete the Orbit 2 server deployment and redeploy it. This experience showed that the logs and metrics do not contain enough information to tell that a deployment is in a bad state, why the server is not instantiating an actor, or why the client is not returning. I believe an improvement is needed here to greatly reduce the MTTR for these types of issues.

Discussion

Anonymous - 2020-06-02

Originally posted by: brettmorien

This issue was created while debugging a problem in our multi-tenant dev environment where addressables were being activated only a fraction of the time. Redeploying Orbit entirely solved the problem temporarily, but it returned.

After further debugging, we determined that two installations of a client service were running against the same Orbit Server cluster. They were independent of each other (one a team-dev instance, the other a team member's personal instance), but both used the same namespace in their connections. As a result, addressables were placed randomly across the two clients, which looked like missing placements to anyone watching only the expected client.

It's hard to write diagnostic logging to help with this situation. Metrics are enabled on the /metrics route which would show more connected clients than expected, and perhaps that can be improved and exposed better in our environment. But disambiguating between messages from connected clients would be based on the data sent by the clients, which in this case was identical.
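As a rough illustration of the metrics idea above, here is a minimal sketch that scans the text returned by /metrics and flags more connected clients than expected. It assumes the route returns Prometheus-style text; the metric name `orbit_connected_clients` and both helper functions are hypothetical and not taken from Orbit.

```python
from typing import Optional


def connected_clients(metrics_text: str, metric_name: str) -> float:
    """Sum every sample of `metric_name` found in Prometheus-style text."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]
        else:
            # name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name == metric_name and fields:
            total += float(fields[0])
    return total


def check_client_count(
    metrics_text: str,
    expected: int,
    metric_name: str = "orbit_connected_clients",  # hypothetical metric name
) -> Optional[str]:
    """Return a warning string if more clients are connected than expected."""
    actual = connected_clients(metrics_text, metric_name)
    if actual > expected:
        return f"WARNING: {actual:g} clients connected, expected at most {expected}"
    return None
```

A check like this only shows that an extra client is attached. It cannot say which client it is, which matches the limitation described above.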

I have updated the documentation around client configuration to make the role of namespace a little more explicit, hoping this will help others avoid this situation.

Anonymous - 2020-06-02

Originally posted by: mattdkerr

What about some diagnostics logging about which Orbit namespace and cluster the client is connecting to?
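A minimal sketch of what such a startup log line might look like. `ClientConfig`, its fields, and the logger name are hypothetical stand-ins, not Orbit's actual client API. The per-process `instance_id` is an added assumption, included so that two clients with otherwise identical settings could be told apart in the logs.

```python
import logging
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClientConfig:
    """Hypothetical stand-in for the client's connection settings."""
    namespace: str
    server_url: str
    instance_id: str  # assumed per-process identifier, e.g. hostname or a UUID


def describe_connection(config: ClientConfig) -> str:
    """Build a single line stating exactly where the client is connecting."""
    return (
        "Orbit client connecting: "
        f"namespace={config.namespace!r} "
        f"server={config.server_url!r} "
        f"instance={config.instance_id!r}"
    )


def log_connection(config: ClientConfig, logger: Optional[logging.Logger] = None) -> str:
    """Log the connection target at INFO level and return the message."""
    logger = logger or logging.getLogger("orbit.client.diagnostics")
    message = describe_connection(config)
    logger.info(message)
    return message
```

Emitting this once at startup would make a shared namespace visible when comparing the logs of the two services side by side.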
