WHY: Health-check the socket itself so we can tell whether the socket's systemd service is in a healthy state.
HOW: tbd
DONE: Self-healing for socket.
External Healthcheck: Host-level health check for the commands-receivable socket
- Status: open
- Priority: 3
- Complexity: 2
- Swimlane: Service Resiliency
- Column: Parking Lot
- Position: 6
- Assignee: AndrewCz
- Creator: Jack
- Assigned Group: not assigned
- Started:
- Created: 2021/08/04 20:47
- Modified: 2021/10/16 14:54
- Moved: 2021/10/16 14:54
Moved to column Planned
Can't reproduce
Moved to column Planned
You can crash it by breaking the interpreter on the script you call:
This leads to:
This only happens when the Python script runs into an uncaught exception. The script is then retried until it hits the limit, at which point the socket is failed entirely. The fix would be to restart the socket, and then either manually mount it into portal (doable?) or re-run the compositional role (potentially invasive?). So it's not great either way.
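A minimal sketch of the "restart the socket" half of that fix, assuming the host can run `systemctl` directly. The unit name `commands.socket` and the helper names are hypothetical; the ticket does not name the real unit. It does not cover getting the socket back into portal, and it will not help if the Python script itself is still broken.

```python
import subprocess

UNIT = "commands.socket"  # hypothetical unit name; the ticket does not give the real one


def run_systemctl(*args):
    """Run systemctl and return (exit_code, stripped stdout)."""
    proc = subprocess.run(
        ["systemctl", *args], capture_output=True, text=True, check=False
    )
    return proc.returncode, proc.stdout.strip()


def check_and_heal(unit=UNIT, run=run_systemctl):
    """Return the unit state that was observed, restarting the unit if it was not active."""
    _, state = run("is-active", unit)
    if state == "active":
        return state
    # A unit that ran into its limit is left in "failed"; reset-failed clears
    # that state and the start-rate counter before the restart is attempted.
    if state == "failed":
        run("reset-failed", unit)
    run("restart", unit)
    return state
```

Something like this could run from a timer or cron job on the host. The `run` parameter exists only so the logic can be exercised without touching systemd.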
What I'm wondering is how I would execute commands on the host, in a self-sustaining way, if the socket were down because the Python file had an unfixable error.
So, not really knowing what's happening, I could attempt the following...
We have to assume that the service is failing because of the Python script. Therefore, the first thing to do would be to replace the Python script: reach back up to the git repo, get the latest version, and format the raw Jinja2 template into an actual Python script.
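A rough sketch of the replace step, assuming the template text has already been fetched from the repo (the git part is left out). `render_template` and `install_script` are hypothetical names, and the syntax check before installing is an added precaution, not something the ticket specifies. Note that `ast.parse` only catches syntax errors; it would not catch a script that raises at runtime.

```python
import ast
import os
import tempfile


def render_template(template_text, variables):
    """Render Jinja2 template text into Python source."""
    import jinja2  # third-party; imported here so install_script works without it

    return jinja2.Template(template_text).render(**variables)


def install_script(source, dest):
    """Replace dest with source, but only if source parses as Python."""
    ast.parse(source)  # raises SyntaxError before anything on disk is touched
    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(source)
        os.chmod(tmp_path, 0o755)
        os.replace(tmp_path, dest)  # atomic when both paths are on one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Writing to a temporary file and swapping it in means the old script stays in place if rendering or validation fails partway through.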
Next, we would have to deal with the socket and portal. We would have to restart the socket to start it back up again. Then we would have to insert it into the portal container. I don't know what that would look like, but I bet nsenter would have a role to play. See the docs:
Since we aren't running into this in the wild, only on development builds, we should probably de-prioritize this. It should be re-classified as a resilience item. Definitely something to implement before we go to a zero-trust model.