DPKG Locks with Cloud-Init and Our Solution

Throughout the podcast we have discussed some issues we have encountered with DPKG. This post goes into the problems we encountered with some deploys, our first solution, and then our updated solution.

As part of our deploys for the OurCompose Suite on DigitalOcean droplets, we need to install system updates and required packages. One of the major hurdles we encountered on some deploys and software installs was the inability to get a lock on dpkg and apt because another process was using it. This failure to update and install packages caused the entire deploy to halt, and fixing it required manual intervention and a re-deploy. As a temporary workaround, we retried the install until a timeout, waiting for the other process to finish before continuing with our updates. This was only a short-term solution to our problem.

Because we still occasionally hit the issue even with the timeout fix in place, we decided to dig into what was causing apt and dpkg to lock. After analyzing the services running on the virtual machine, we came to realize cloud-init was causing the issue.

We recognized there were issues with deploys when we saw errors being returned stating dpkg was locked and in use. The error was the following:

    E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
    E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?

This broke everything in the deploy from the apt install ${application} step onward, meaning none of the services or tools were installed. What we were left with was essentially a new virtual machine with minimal configuration.

Our first fix, as described above, was to attempt the package install and keep retrying for a limited time if it failed on the lock. In Ansible, that fix looked like this (the key part is the "until" condition at the end):

    - name: Install additional packages
      apt:
        name: "{{ pkgs }}"
        state: latest
        update_cache: True
      vars:
        pkgs:
          - "pkg-config"
          - "libsystemd-dev"
      register: pkg_install
      retries: 60
      delay: 5
      # Keep retrying only while the failure is a 'could not get lock' error;
      # any other failure ends the retry loop right away.
      # See: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/apt_module.html#examples
      until: pkg_install is success or (
        not pkg_install['msg'] | d() | regex_search('Failed to lock apt|Could not get lock'))

This was a short-term solution, as we continued to see issues even with the retry window above (60 retries, 5 seconds apart, roughly five minutes of waiting). We could have increased the timeout, but that would not have addressed the root cause. We needed to figure out what was causing apt to lock.

Digging through journalctl to find out what was holding the lock, we discovered that cloud-init was still running updates and customizations and collecting metadata, using apt and dpkg in the process.
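
A minimal sketch of how that journal inspection could be run from a playbook (an illustration, not the exact commands used during the investigation). It reuses the _COMM=cloud-init journal filter that appears in the fix further down; the task names and the cloud_init_journal variable are placeholders chosen for this example.

    - name: Collect cloud-init journal entries for the current boot
      command: journalctl --boot --no-pager _COMM=cloud-init
      register: cloud_init_journal
      changed_when: false  # read-only inspection, never reports a change

    - name: Print the collected entries
      debug:
        var: cloud_init_journal.stdout_lines

Reading the timestamps of these entries next to the time of the failed apt task should show whether cloud-init was still active when the lock error occurred.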

To resolve the lock issue, we added a few lines of Ansible that check the journalctl output and wait for cloud-init to finish successfully. From there we were free to run our updates and installs without issues. The following code is the fix we currently have in place to wait for cloud-init to complete:

    - name: Wait for cloud-init to complete
      shell: journalctl --boot _COMM=cloud-init | grep 'Cloud-init.*finished at'
      register: cloud_init_install
      retries: 60
      delay: 5
      until: cloud_init_install is success
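
An alternative worth noting, as a sketch rather than what is deployed here: recent cloud-init releases ship a status subcommand whose --wait flag blocks until cloud-init has finished, which avoids parsing the journal. This assumes the cloud-init version on the image supports that flag, and exit codes for error or degraded states differ between versions, so check the behavior on your image before relying on it. The cloud_init_status variable name is a placeholder.

    - name: Wait for cloud-init to complete (alternative sketch)
      command: cloud-init status --wait
      register: cloud_init_status
      changed_when: false  # only waits, does not modify the system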

Since this fix was put in place, we have been able to successfully deploy instances without issue and the locks are no longer causing problems on deploys.
