Containers, Zombie Processes, and Init Systems

As mentioned throughout the podcast, we are continuing to make updates to our Portal and CommandCenter services to make them more efficient, feature-rich, and useful. Some of these changes are aesthetic or add user-facing features, while others are minor changes to the codebase or happen entirely behind the scenes.

One of our recent behind-the-scenes changes was to have the Portal and CommandCenter containers run the web server directly, rather than running an init system that in turn launched the web server, in order to reduce timeouts and speed up deployments of these two internal applications. The change did improve our production services, but it also started to cause zombie processes whenever health checks were run against the container.

What this change looked like:

## Dockerfile

### FROM
CMD ["/bin/sh", "/app/entrypoint.sh"]

### TO
ENTRYPOINT ["/app/entrypoint.sh"]
CMD ["bundle", "exec", "puma", "-C", "config/puma.rb", "-e", "production"]
## entrypoint.sh

## FROM
bundle exec puma -C config/puma.rb -e production

## TO
exec "$@"

What this meant was that instead of running /bin/sh as the container's main process and launching the web server (and any subsequent exec statements) from that shell, the web server now replaces the parent shell process and runs as PID 1 within the container. There are plenty of posts on the benefits of using exec "$@" over running a shell with a process inside it, but that is beyond the scope of this post.
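
A quick way to verify the new behaviour is to look at what is actually running inside the container. This is only a sketch: the container name portal is taken from the compose file further down, and the first command assumes ps is available in the image.

# From inside the container, the web server should now be PID 1,
# with no /bin/sh wrapper above it:
docker exec portal ps -ef

# From the host, list the container's processes without needing ps in the image:
docker top portal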

Our issue surfaced after this change, when we noticed zombie processes accumulating on servers. On one server we saw over 7,000 zombie processes that looked like the following:

root     31082 24379  0 00:10 ?        00:00:00 [ssl_client] <defunct>
root     31170 24379  0 00:10 ?        00:00:00 [ssl_client] <defunct>
root     31256 24379  0 00:10 ?        00:00:00 [ssl_client] <defunct>
root     31460 24379  0 00:10 ?        00:00:00 [ssl_client] <defunct>
root     31546 24379  0 00:10 ?        00:00:00 [ssl_client] <defunct>
root     31630 24379  0 00:10 ?        00:00:00 [ssl_client] <defunct>
root     31717 24379  0 00:11 ?        00:00:00 [ssl_client] <defunct>
root     31802 24379  0 00:11 ?        00:00:00 [ssl_client] <defunct>
root     31882 24379  0 00:11 ?        00:00:00 [ssl_client] <defunct>
root     31969 24379  0 00:11 ?        00:00:00 [ssl_client] <defunct>
root     32172 24379  0 00:11 ?        00:00:00 [ssl_client] <defunct>

Initially these meant nothing to us, but we soon pieced together that they were part of our issue. Looking up similar reports, we found that health checks against the container were creating zombie ssl_client processes. Each zombie still occupies an entry in the process table, so if they kept being created they would eventually kill the container.
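
If you want to gauge how bad the build-up is, zombies can be counted straight from ps by filtering on the Z process state. A generic sketch, not the exact commands we ran at the time:

# Count all zombie (state Z) processes on the host:
ps -eo stat | grep -c '^Z'

# Group zombies by parent PID to see which process is failing to reap them:
ps -eo ppid,stat,comm | awk '$2 ~ /^Z/ {print $1, $3}' | sort | uniq -c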

After some research, our solution was to include an init system within the container to protect against zombies. (Running commands under a shell, such as with bash -c, can also offer some protection against zombie processes, since the shell waits on its child processes.)

We did this by including the following line (init: true) in the docker-compose configuration:

version: '3.6'
services:
    portal:
        image: "compositionalenterprises/portal:"
        container_name: portal
        # The line below:
        init: true
        restart: always
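
For containers started with plain docker run rather than Compose, the equivalent is the --init flag, which has Docker inject a small init binary as PID 1. A sketch mirroring the service above, with the image tag left as a placeholder:

# Same effect as init: true in docker-compose:
docker run -d --init --name portal --restart always compositionalenterprises/portal:<tag>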

Essentially, an init system does not prevent or magically remove zombies. Instead, it is designed to reap them: when the parent process that failed to wait on its children exits, any lingering zombies are re-parented to the init process (PID 1), which then waits on them so they can finally be cleaned up.
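
The behaviour is easy to reproduce with a throwaway container. The sketch below is purely illustrative and assumes an alpine image whose BusyBox ps supports the stat column; it deliberately orphans a short-lived child so that only PID 1 is left to reap it:

# The subshell forks 'sleep 1' and exits, orphaning it; the shell then
# execs 'sleep 300', which never calls wait() on the adopted child.
docker run -d --rm --name zombie-demo alpine sh -c '(sleep 1 &); exec sleep 300'
sleep 5
# Without an init process, the exited 'sleep 1' lingers in state Z (zombie):
docker exec zombie-demo ps -o pid,ppid,stat,comm
docker rm -f zombie-demo

# With --init, docker-init runs as PID 1, adopts the orphan, and reaps it:
docker run -d --rm --init --name zombie-demo alpine sh -c '(sleep 1 &); exec sleep 300'
sleep 5
docker exec zombie-demo ps -o pid,ppid,stat,comm
docker rm -f zombie-demo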

As explained well in this StackOverflow post, an init system should be used:

  • when you want to run more than one service in a container
  • when you run a single process that spawns a lot of child processes
  • when you can’t add signal handlers to the process running as PID 1

Including an init system in our containers resolved our zombie process issue and likely prevented more catastrophic problems from occurring on production systems.

Want to learn more?

Fill out our Contact Form, or do some more research at OurCompose.com