External Healthcheck: Host-level health check for the commands_receivable socket

  • Swimlane: Service Resiliency
  • Column: Parking Lot
  • Position: 6
  • Assignee: AndrewCz
  • Creator: Jack
  • Assigned Group: not assigned
  • Started:
  • Created: 2021/08/04 20:47
  • Modified: 2021/10/16 14:54
  • Moved: 2021/10/16 14:54
Description

WHY: Health check for the socket itself, to verify that the systemd socket unit is in a healthy state.
HOW: tbd
DONE: Self-healing for the socket.

Sub-Tasks
Internal links
Comments
AndrewCz Created at: 2021/09/06 21:25 Updated at: 2021/09/06 21:25

Moved to column Planned

AndrewCz Created at: 2021/09/15 20:08 Updated at: 2021/09/15 20:08

Can't reproduce

AndrewCz Created at: 2021/09/24 22:23 Updated at: 2021/09/24 22:23

Moved to column Planned

AndrewCz Created at: 2021/09/24 22:23 Updated at: 2021/09/24 22:24

You can crash it by breaking the interpreter path in the shebang of the script you call:

-bash: /usr/local/bin/commands_receivable.py: /usr/local/lib/docker/virtualenv/bin/python3: bad interpreter: No such file or directory

This leads to:

● commands_receivable.socket
     Loaded: loaded (/etc/systemd/system/commands_receivable.socket; disabled; vendor preset: enabled)
     Active: failed (Result: service-start-limit-hit) since Sat 2021-09-25 02:20:13 UTC; 3min 48s ago
   Triggers: ● commands_receivable.service
     Listen: /run/commands_receivable.sock (Stream)

Sep 25 02:18:31 vanilla-test systemd[1]: Listening on commands_receivable.socket.
Sep 25 02:20:13 vanilla-test systemd[1]: commands_receivable.socket: Failed with result 'service-start-limit-hit'.
AndrewCz Created at: 2021/09/26 23:07 Updated at: 2021/09/26 23:07

This only happens when the python script runs into an uncaught exception. That causes systemd to retry the service until it hits the start limit, at which point the socket fails entirely. The fix would be to restart the socket, and then either manually mount it into portal (doable?) or re-run the compositional role (potentially invasive?). So it's not great either way.
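
As a first cut at the self-healing half, something like this could run from a host-level timer. A minimal sketch, assuming the unit names from the status output above; the portal re-mount half is deliberately left out since it's still the open question:

#!/usr/bin/env python3
"""Restart commands_receivable.socket if it has landed in the failed state.
Minimal sketch; the portal re-mount step is TBD."""
import subprocess

SOCKET = "commands_receivable.socket"
SERVICE = "commands_receivable.service"

def unit_failed(unit):
    # `systemctl is-failed --quiet` exits 0 when the unit is in the failed state
    return subprocess.run(["systemctl", "is-failed", "--quiet", unit]).returncode == 0

if unit_failed(SOCKET):
    # Clear the start-limit counters first, otherwise the next activation
    # can trip service-start-limit-hit again right away
    subprocess.run(["systemctl", "reset-failed", SERVICE, SOCKET], check=True)
    subprocess.run(["systemctl", "restart", SOCKET], check=True)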

AndrewCz Created at: 2021/10/02 02:00 Updated at: 2021/10/02 02:00

I'm wondering how, in a self-sustaining way, I would execute commands on the host if the socket were down because the python file had an unfixable error.

AndrewCz Created at: 2021/10/02 14:19 Updated at: 2021/10/02 14:19

So, not really knowing what's happening, I could attempt the following...

We have to assume that the service is failing due to the python script. Therefore, the first thing to do would be to replace the python script: render the raw Jinja2 template into an actual python script, reaching back up to the git repo for the latest version.
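
A sketch of what that re-render could look like, assuming jinja2 and git are available on the host. The repo URL, template path, and template variables are hypothetical placeholders; only the output path comes from the error above.

#!/usr/bin/env python3
"""Re-render commands_receivable.py from its Jinja2 template.
Repo URL, template path, and variables are hypothetical placeholders."""
import subprocess
import tempfile
from pathlib import Path

from jinja2 import Template

REPO = "https://git.example.com/infra/roles.git"   # hypothetical
TEMPLATE = "templates/commands_receivable.py.j2"   # hypothetical
OUTPUT = Path("/usr/local/bin/commands_receivable.py")

with tempfile.TemporaryDirectory() as tmp:
    # Reach back up to the git repo for the latest version of the template
    subprocess.run(["git", "clone", "--depth=1", REPO, tmp], check=True)
    source = (Path(tmp) / TEMPLATE).read_text()

# Render with whatever variables the role would normally supply (guessed here)
rendered = Template(source).render(python_interpreter="/usr/bin/python3")
OUTPUT.write_text(rendered)
OUTPUT.chmod(0o755)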

Next, we would have to deal with the socket and portal: restart the socket to bring it back up, then insert it into the portal container. I don't know what that would look like, but I bet nsenter would have a role to play.
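
Very roughly, and heavily caveated: the container name portal comes from the earlier comment, and whether the socket path is even visible in its mount namespace (or whether a bind mount via nsenter would stick) is exactly the open question. A sketch of the shape:

#!/usr/bin/env python3
"""Rough shape of step two: restart the socket, then poke at the portal
container's mount namespace. Sketch only; the actual re-mount is TBD."""
import subprocess

# Bring the socket back up on the host
subprocess.run(["systemctl", "restart", "commands_receivable.socket"], check=True)

# Find the init PID of the portal container
pid = subprocess.run(
    ["docker", "inspect", "--format", "{{.State.Pid}}", "portal"],
    check=True, capture_output=True, text=True,
).stdout.strip()

# Enter the container's mount namespace and check whether the socket
# path is visible there; if not, this is where the re-mount would go
subprocess.run(
    ["nsenter", "--target", pid, "--mount",
     "ls", "-l", "/run/commands_receivable.sock"],
)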

See the docs:

   TriggerLimitIntervalSec=, TriggerLimitBurst=

Configures a limit on how often this socket unit may be
activated within a specific time interval. The
TriggerLimitIntervalSec= may be used to configure the length
of the time interval in the usual time units "us", "ms", "s",
"min", "h", ... and defaults to 2s (See systemd.time(7) for
details on the various time units understood). The
TriggerLimitBurst= setting takes a positive integer value and
specifies the number of permitted activations per time
interval, and defaults to 200 for Accept=yes sockets (thus by
default permitting 200 activations per 2s), and 20 otherwise
(20 activations per 2s). Set either to 0 to disable any form
of trigger rate limiting. If the limit is hit, the socket
unit is placed into a failure mode, and will not be
connectible anymore until restarted. Note that this limit is
enforced before the service activation is enqueued.
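
For what it's worth, the failure above reads service-start-limit-hit rather than trigger-limit-hit, which suggests it's the triggered service's StartLimitIntervalSec=/StartLimitBurst= being exhausted, with the socket failing as a consequence. If we wanted to loosen that while debugging, a drop-in along these lines would do it (values illustrative only, not a recommendation):

# /etc/systemd/system/commands_receivable.service.d/override.conf
[Unit]
# Widen the start-limit window so crash loops take longer to trip it
StartLimitIntervalSec=30s
StartLimitBurst=10

[Service]
# Back off between restarts instead of hammering the limit
RestartSec=5s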

AndrewCz Created at: 2021/10/02 14:20 Updated at: 2021/10/02 14:20

Since we aren't running into this in the wild, only on development builds, we should probably de-prioritize it. It should be re-classified as a resiliency item. Definitely something to implement before we go to a zero-trust model.