Healthchecks and Monitoring with Docker and OurCompose

Monitoring services is foundational for discovering issues in processes and programs. At OurCompose we use Docker healthchecks to detect issues with applications, and alongside our monitoring we use autonomous processes to fix issues when they arise. This means we rarely have to spend time fixing issues ourselves (they self-resolve), and it allows us more time to develop.

This post goes into some of the Docker healthchecks we use, and how we monitor and resolve issues within containers to confirm that services are online for users on OurCompose Instances.

It All Starts with a Container HealthCheck

Docker provides, out of the box, the ability to run healthchecks on containers to determine whether they are in a healthy or unhealthy state. Although healthchecks can also be defined in a Dockerfile, at OurCompose we use the Docker Compose version because we don’t roll our own containers for every service; we do, however, roll a healthcheck for every service.

Because we use Ansible to configure instances, in our 3.X series of the Ansible Collection we pass a variable in as the healthcheck test to the Docker Compose configuration, similar to the following:

Within the Service Role YAML File (in this case dolibarr):

    healthcheck:
      test: "{{ compositional_dolibarr_healthcheck }}"
      interval: 5s
      timeout: 30s
      retries: 3

Where the compositional_dolibarr_healthcheck variable looks like this:

compositional_dolibarr_healthcheck: |
  wget --quiet --no-verbose --tries=1 --spider localhost:80 \
  || exit 1

This means that every 5 seconds Docker will run wget --quiet --no-verbose --tries=1 --spider localhost:80 || exit 1 from within the container, against itself, to check that responses are returning properly. If they are not, the healthcheck exits 1, reporting an error, and will retry two more times to check whether the service is working.

If it fails three times in a row, the container is marked unhealthy and we have an issue :(. This means our Docker container is having problems and needs to be fixed!
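
Docker records the result of these checks in the container’s state, which is exactly what report_health.yml reads later with docker inspect. Purely as an illustration (this is not part of the OurCompose tooling), the same state can be read with the Docker SDK for Python, assuming a container named dolibarr:

    # Illustration only: read the health state Docker records for a container.
    # 'dolibarr' is just an example container name.
    import docker

    client = docker.from_env()
    container = client.containers.get('dolibarr')
    health = container.attrs['State']['Health']

    print(health['Status'])         # 'healthy', or 'unhealthy' after three failed checks
    print(health['FailingStreak'])  # number of consecutive failed checks
    print(health['Log'][-1])        # details of the most recent check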

This is where Portal, the CommandsReceivable Service, and its Socket come into play!

Collect the HealthChecks and Take Action!

Now that each service reports its status every 5 seconds, we need a way to collect and parse those statuses to see whether the instance is healthy or having issues. If all the services are healthy, great, we aren’t having problems! In our 3.X Collection, if even one service is having issues we try to fix it automatically by running the compositional role against the entire instance. If a container continues to run into issues after multiple fix attempts, someone will have to take a look manually, which we alert for. Here is a look at how we check statuses from Portal using the OS-level Socket and Service, and how we attempt to resolve problems.

From the OS level we have a few things going on:

  • A cron job that runs every 10 minutes, calling a Portal command to collect and parse the status of each service.
  • A CommandsReceivable Service on the OS that runs whitelisted commands from within its own Ansible container
  • A Socket on the OS that listens for Portal connections
  • Portal itself, with access to the OS socket

What the cron job looks like on the server:

*/10 * * * * /usr/bin/docker exec portal /app/bin/seeds/health_check.sh

What the portal_commands_receivable.service looks like:

[Service]
StandardOutput=journal+console
StandardError=journal+console
ExecStart=/usr/local/bin/commands_receivable.py

What the portal_commands_receivable.socket looks like:

[Socket]
ListenStream=/var/run/commands_receivable.sock

[Install]
WantedBy=sockets.target
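
With that pairing, systemd owns the UNIX socket and starts the service when something connects to it, handing the already-bound socket over to the script. As a rough sketch of that hand-off (this is not the real commands_receivable.py, just the standard socket-activation pattern), the script can pick the listener up as file descriptor 3:

    # Sketch only: under systemd socket activation, the listening socket from
    # the .socket unit above is passed to the service as file descriptor 3.
    import socket

    SD_LISTEN_FDS_START = 3  # first fd systemd hands to an activated service

    server = socket.fromfd(SD_LISTEN_FDS_START, socket.AF_UNIX, socket.SOCK_STREAM)
    connection, _ = server.accept()        # Portal connecting for a health check run
    print(connection.recv(4096).decode())  # the spec string Portal sent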

Now back to Portal. With the cron job in place, every 10 minutes the health_check shell script runs against our instance; Portal then makes a call to the socket, which starts the Python script (the service) to run the needed commands from the collection, in this case the report_health.yml playbook.

In Portal, the shell script calls a Ruby rake task, which can be found here. I have excluded the full script for brevity, but essentially what occurs is: every 10 minutes we call into the Portal Docker container to collect the health of our applications by checking against our CommandsReceivable Socket and Service.

From Portal, these are the most important lines used when we call a health check run:

    # Connect to the OS-level socket and describe what we want run
    socket_connection = UNIXSocket.new("/var/run/commands_receivable.sock")
    vault_pass = ENV['ENVIRONMENT_VAULT_PASSWORD'].blank? ? 'notnil' : ENV['ENVIRONMENT_VAULT_PASSWORD']
    pass_string = "{'script': 'playbooks/report_health.yml', 'vault_password': '#{vault_pass}', 'collection_version': '#{ENV['ROLE_BRANCH']}'}"
    socket_connection.print(pass_string)
    socket_connection.close_write
    # Read the streamed output back and record each service's reported status
    services_status = {}
    begin
        service_name = nil
        while true
            line = socket_connection.readline()
            # "changed: ... (item=<service>)" lines tell us which service is being inspected
            if line.to_s.starts_with?("changed:")
                service_name = line[line.index("(item=")+6..line.index(")")-1]
            end
            # The line following "STDOUT" carries that service's health status
            if line.to_s.starts_with?("STDOUT")
                puts "Found STDOUT"
                status = socket_connection.readline.to_s.delete("\n")
                puts status
                services_status[service_name] = status
            end
        end
    rescue EOFError
        puts "End of File, do nothing"
    end

In the lines above you can see that we connect to the socket and pass some JSON-like data, the most important argument being playbooks/report_health.yml. This tells the service that we want to run the report_health.yml playbook from the collection.

After data is passed over the socket, the service starts and commands_receivable.py kicks off. commands_receivable.py does a number of things, including building an image for us to run containers from, building safe containers, and allowing Portal to run collection commands.
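
The receiving half of the script isn’t shown here, but conceptually it reads the spec off the socket, checks it against a whitelist, and hands it to the function below. A rough sketch of that step (ALLOWED_SCRIPTS and handle_connection are hypothetical names, not taken from the actual script):

    import ast

    # Hypothetical whitelist of collection commands Portal may trigger
    ALLOWED_SCRIPTS = {'playbooks/report_health.yml'}

    def handle_connection(connection):
        raw = connection.recv(4096).decode()
        spec = ast.literal_eval(raw)  # the payload is a single-quoted, dict-style string
        if spec.get('script') not in ALLOWED_SCRIPTS:
            connection.sendall(b'command not whitelisted\n')
            return None
        # Hand the parsed spec to run_docker_command() shown below
        return run_docker_command(spec)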

Below are some of the most important lines in commands_receivable.py for running collection commands via a container (in this case report_health.yml):

def run_docker_command(spec):
    """
    Takes the spec that the server passes us and runs a one-off container based
    off of it.
    """
    client = docker.from_env()
    set_entrypoint_path()
    print('Running Container')
    # TODO Deal with local/remove pathing
    container = client.containers.run(
        image=get_container_image(spec),
        command=build_command(spec),
        entrypoint='/entrypoint/entrypoint.sh',
        network_mode='host',
        detach=True,
        environment={
            'VAULT_PASSWORD': spec['vault_password']
            },
        volumes={
            '/srv/local/portal_storage/': {
                'bind': '/portal_storage',
                'mode': 'rw'
                },
            '/root/.ssh/': {
                'bind': '/root/.ssh',
                'mode': 'ro'
                },
            '/tmp/entrypoint/': {
                'bind': '/entrypoint',
                'mode': 'ro'
                }
            },
        )
    return container
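
To tie it together, the spec parsed from the socket is what gets passed into this function, and because the container runs detached, its logs can be relayed back over the still-open connection. A sketch of that usage (the spec values and the connection variable are illustrative, not taken from the actual script):

    # Example spec mirroring the pass_string Portal sends; values are examples only
    spec = {
        'script': 'playbooks/report_health.yml',
        'vault_password': 'notnil',
        'collection_version': '3.x',
    }

    container = run_docker_command(spec)
    # Relay the ansible output back to Portal over the open socket connection
    for chunk in container.logs(stream=True, follow=True):
        connection.sendall(chunk)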

After our command has been passed from Portal to the Python script, report_health.yml is run. The playbook looks like the following:

---
- name: Check the health of all of the services
  hosts: all
  vars_files:
    - ../ansible_collections/compositionalenterprises/ourcompose/roles/compositional/defaults/main.yml
    - ../environment/group_vars/compositional/all.yml
  tasks:
    - name: Check for all of the services being healthy
      shell: docker inspect --format='{% raw %}{{.State.Health.Status}}{% endraw %}' {{ item }}
      loop: "{{ compositional_services }}"
      register: report_health_result
      failed_when: report_health_result['stdout'] != 'healthy'

This means we are able to run the report_health playbook from its own container rather than exposing the OS to Portal directly. And because we’ve connected to the socket and haven’t closed the connection, the output can be sent back to Portal to parse, and this is where the magic happens.

As it currently stands, Portal parses the output and pulls the status of every service. If every service is healthy, the instance doesn’t need to be intervened on or touched. In 3.X, if even one service is unhealthy, Portal runs the compositional role on the instance in an attempt to fix the broken service. If another 10 minutes passes, the health check runs again, and the service is still unhealthy, Portal sends an alert to the admins to fix the service manually. Since this implementation went in, we have not had to do much manual intervention, as the instances fix themselves. Talk about automation.
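
The remediation logic itself lives in Portal’s Ruby rake task, with services_status mapping each service name to its reported status, but each 10-minute cycle boils down to a decision like the following sketch (shown here in Python for illustration; run_compositional_role and alert_admins are hypothetical names):

    def run_compositional_role():
        ...  # ask the CommandsReceivable service to run the compositional role

    def alert_admins(unhealthy):
        ...  # notify the admins which services need a manual look

    def handle_health_report(services_status, previously_unhealthy):
        """Decide what to do after one 10-minute health check cycle."""
        unhealthy = [name for name, status in services_status.items()
                     if status != 'healthy']
        if not unhealthy:
            return  # everything healthy, leave the instance alone
        if previously_unhealthy:
            # Still broken after a fix attempt: someone has to step in manually
            alert_admins(unhealthy)
        else:
            # First failure: try to self-heal by running the compositional role
            run_compositional_role()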

And Beyond

The 3.X series was mentioned quite a bit in this post. As we start to move towards our 4.X rollout, health checks will continue to operate similarly; however, services are now fixed independently. This means that if an instance is running dolibarr, nextcloud, and akaunting, and nextcloud reports an unhealthy status, a fix will be run against only the nextcloud service rather than the entire instance, minimizing downtime.

Want to learn more?

Fill out our Contact Form, or do some more research at OurCompose.com