Migrating from Docker to rootless Buildah

An in-depth guide to overcoming the challenges of migrating from Docker to a safer, higher-performance, and well-supported container image builder

To provide a seamless recruiting experience to millions of job seekers and employers, ZipRecruiter relies on a unique tech stack for building container images and for orchestrating and running containers.

This includes AWS EC2 clusters for compute, Kubernetes for container orchestration, the Flatcar operating system, Docker to build images and run containers for our apps, and Jenkins to help streamline deployment operations. One way this all comes together, for example, is by mounting the Docker socket from the Kubernetes node into each Jenkins agent container.
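For context, here is a hedged sketch of what that socket mount can look like in a Jenkins agent pod spec (the pod name and agent image are illustrative, not our actual manifests):

# Illustrative Jenkins agent pod fragment: the Docker CLI inside the
# container talks to the node's Docker daemon through the mounted socket.
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-agent
spec:
  containers:
    - name: agent
      image: jenkins/inbound-agent
      volumeMounts:
        - name: docker-sock
          mountPath: /var/run/docker.sock
  volumes:
    - name: docker-sock
      hostPath:
        path: /var/run/docker.sock
        type: Socket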

During the second quarter of 2022, we encountered a critical technical failure in Docker itself. Investigations uncovered inherent security vulnerabilities and throttling of parallel builds, which prompted an urgent search for an alternative way to build and push images to the registry.

The most recent docs on these issues are from 2019 and 2021. This article can hopefully save you a few months of trial and error.

The file system driver error and COPY failure that prompted a Docker investigation

Like many issues in our field, a version update kicked off a chain reaction. 

We use the Flatcar operating system on the nodes of our K8s infrastructure, which runs on AWS EC2 machines. Every few weeks, we upgrade the Flatcar AMI to the latest version, which also brings an upgrade to the Docker version.

After upgrading to Docker 19.xx, we encountered build failures in some of our applications related to a bug outlined in GitHub issue #403. In essence, Docker 18.xx has its Native Overlay Diff parameter set to true by default, but in Docker 19.xx and 20.xx the default is false. This affects the performance of the Overlay driver, which I’ll explain a bit more about later.

In addition, a COPY failure issue, outlined here, affecting Docker versions later than 17.06 popped up.

These bugs meant we could not build properly, forcing us to revert to the previous Docker and Flatcar versions.

The implications of this failure to upgrade were two-pronged.

First, we were not able to leverage new features and functionality. More importantly, however, by not upgrading to newer versions we were increasing our security risk exposure over time, as new vulnerabilities are discovered and exploited every day.

As we devised a solution (which I share below), we uncovered larger issues with Docker that merited dropping it altogether for a better build system. 

Three (more) reasons to drop the Docker-daemon

1. Security Risks

In today’s world, running as root is bad practice, not to mention dangerous. The Docker daemon, unfortunately, runs as root with the highest level of access, in conflict with the security paradigm of containerized processes. A hacker who gains access to a pod could gain root access to the node’s data and inflict serious damage.

In addition, mounting and setting up the Docker socket is not straightforward. Docker’s architecture requires mounting its Unix socket into a pod to build images, and failure to configure it correctly can result in vulnerabilities. There are alternatives like docker-in-docker, but those have their own issues, such as nested containerization and poor resource isolation. Docker eventually released a rootless mode, but you still have to mount the Docker socket, as the Docker daemon runs on the node.

2. Docker-daemon throttles parallel builds

Docker runs a single Docker daemon on each node. No matter how many pods Kubernetes spins up on a node, all of them use the same daemon. This limited our build system’s ability to build images in parallel: no matter how large an EC2 instance we picked, the single daemon throttled our throughput.

3. Lack of maintenance and support 

As Kubernetes adopted a more modular and standardized approach for interacting with container runtimes, the project dropped dockershim as an intermediary for interacting with Docker. Continuing to use an unsupported tool would be risky and would prevent us from using the latest versions of K8s.


Although this is a runtime concern rather than one directly related to the build system, it served as yet another reason to find an alternative to Docker. We ultimately migrated to containerd, but that’s a story for another day.

Evaluating other build tools

In light of the bugs and aforementioned drawbacks, we set out to evaluate our alternatives for building images. We explored and tested four different options: Kaniko (by Google), Buildah (open source), s2i, and img.

Ultimately we chose Buildah, as it posed the path of least resistance without compromise on security and performance. 

Four key criteria guided the evaluation: security, performance, compatibility, and stability.

Security

❓Which extra privileges, if any, would pods running the new builder need?

❓Would builder images that are not built from our own base images present a problem for our security posture?

Buildah can run inside a Kubernetes container with far fewer privileges than root on the node.

Performance

❓How fast are builds under the new builder, in comparison to Docker build? 

❓How well does the new builder leverage build/image caches?

The new solution had to achieve, at the very least, what Docker was already doing. By using the native OverlayFS storage driver (explained below) we achieved the desired performance.

Compatibility

❓How easy is it to integrate the new builder with what we already have? How much change, if any, would each app need to undergo to make use of the new builder (including documentation and re-training)? 

❓How well does the new builder integrate directly with GitLab?

❓Does using this tool restrict our choices in integrating a more comprehensive OSS CI system in the future?

With over 1,300 ZipRecruiter apps using Dockerfiles, and for the sake of maintaining backward compatibility, we wanted to continue using Dockerfile syntax and the files we already had.

Buildah seamlessly ingests Dockerfiles, requiring virtually no rewriting or retraining.
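In practice, the switch can be as small as swapping the CLI. A minimal sketch, assuming a registry URL and image tag of your own (the ones below are illustrative):

# buildah bud ("build-using-dockerfile") consumes an ordinary Dockerfile
buildah bud -t registry.example.com/myapp:latest .
buildah push registry.example.com/myapp:latest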

Stability

❓How well maintained is this project, especially if it is open-source?

Buildah is backed by Red Hat, with a very active open source community. Long-term stability seemed like a safe bet.

Key technical issues when transitioning from Docker to Buildah

Transitioning from Docker to Buildah is not straightforward. The solutions below will shorten the time it takes you to get up and running.

1. Loading the Native Overlay Diff driver correctly

Overlay is a storage driver that both Docker and Buildah use. Thus, even after migrating to Buildah, we had to avoid the aforementioned bug (#403) by making sure that the Flatcar AMI’s ‘Native Overlay Diff’ parameter is true so that Overlay works as desired.

To achieve this, you must load the overlay kernel module in the Flatcar AMI with the following module options set at boot:

options overlay metacopy=off redirect_dir=off

💡 If you are using EC2 you can do this in a boot script. On a Flatcar AMI you can also do this via a vendor config. Whichever option you use, the end result must be that Native Overlay Diff reports true.
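Here is a hypothetical boot-time snippet showing one way that could look (the file path and reload step are assumptions, not our exact setup):

# Persist the module options, then reload overlay so they take effect
cat <<'EOF' > /etc/modprobe.d/overlay.conf
options overlay metacopy=off redirect_dir=off
EOF
modprobe -r overlay 2>/dev/null || true   # unload only if nothing is using it
modprobe overlay                          # reload with the new options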

To check this, run buildah info and look for Native Overlay Diff: true.

buildah info
{
    "host": {
        "CgroupVersion": "v2",
        "OCIRuntime": "runc",
        "kernel": "5.15.142-flatcar",
        "os": "linux",
        "rootless": true
    },
    "store": {
        "GraphDriverName": "overlay",
        "GraphOptions": [
            "overlay.ignore_chown_errors=true"
        ],
        "GraphStatus": {
            "Backing Filesystem": "extfs",
            "Native Overlay Diff": "true",
            "Supports d_type": "true",
            "Using metacopy": "false"
        }
    }
}

2. Choosing the best performing storage driver

Initially, we tried using the VFS storage driver with Buildah, but it was very slow. The VFS backend is a very simple fallback that has no copy-on-write support. Each layer is just a separate directory. Creating a new layer based on another layer is done by making a deep copy of the base layer into a new directory.

Fuse-overlay is another option. A 2019 Red Hat blog notes: “Fuse-overlay works quite well and gives us better performance than using the VFS storage driver.” However, compared to Docker and our benchmark expectations, fuse-overlay was still too slow, and its performance was not acceptable for Buildah to qualify as a Docker replacement.

While this 2021 article theorized about implementing ‘native OverlayFS’, we put it to the test. We found that it performed as well as Docker, finally enabling our move to Buildah. 

💡 For anyone interested in performance, OverlayFS is probably the best option.

To implement this, you must use Linux kernel v5.13 or later, which in turn means using a Flatcar AMI that ships kernel 5.13 or above. Serendipitously, v5.13 was released just as we figured out how to solve the issue.
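A quick sanity check we would suggest running on the node (the expected values assume the module options shown earlier, and that the overlay module exposes these parameters):

uname -r                                          # expect 5.13 or later, e.g. 5.15.142-flatcar
cat /sys/module/overlay/parameters/metacopy       # expect N
cat /sys/module/overlay/parameters/redirect_dir   # expect N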

3. Performance

After dealing with the functional issues, we set out to build images with Buildah in a controlled environment to assess performance. There we noticed it was inconsistently slower at pulling images.

Debugging Buildah and inspecting the underlying Golang io.Copy code path (the copy from ECR to local storage) came up clean. Then, while monitoring network traffic, we noticed a high rate of TCP retransmissions. Eventually, an upgrade of Flatcar OS from version 3227.2.4 to 3510.2.0 fixed this issue.
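If you need to chase a similar symptom, one way to spot retransmissions with standard Linux tools (counter names vary slightly by distribution):

netstat -s | grep -i retrans      # cumulative retransmission counters
ss -ti state established          # per-socket details, including retrans counts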

This was one of the last issues we hit during the Docker-to-Buildah migration; once fixed, Buildah’s performance was on par with Docker’s, allowing us to complete the migration.

4. Configuring authentication

As with Docker, after Buildah builds the images, our build system pushes them to cloud-based AWS ECR for storage. To do so, you have to provide authentication. In some instances we also need to access third-party images, to which we must authenticate ourselves. We authenticate once at the beginning, when the pod starts, using the buildah login command, and specify the Buildah config to be used.
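A minimal sketch of the ECR login step (the region and account ID are placeholders for your registry):

aws ecr get-login-password --region us-east-1 \
  | buildah login --username AWS --password-stdin \
    123456789012.dkr.ecr.us-east-1.amazonaws.com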

This is an example storage configuration (storage.conf) that needs to be set up for every pod:

[storage]
# Default Storage Driver, Must be set for proper operation.
driver = "overlay"

[storage.options.overlay]
ignore_chown_errors = "true"

5. Multistage builds

We were building multiple images for a single app, and that needed to continue working. Buildah, however, has an optimization that skips stages the final image doesn’t need, so it produced only a single image.

To solve this issue we reached out to the Buildah team and told them it was breaking our system. They added new functionality to Buildah to ‘Skip unneeded stages from multi-stages.’ You can read the feature request here.

If you need all stages to be built, use the following flag (available from v1.27.2 onwards):

--skip-unused-stages=false 
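For example, a usage sketch with an illustrative tag:

buildah bud --skip-unused-stages=false -t myapp:latest .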

6. Bazel builds 

We have some scripting on top of Makefiles that use Bazel to build packages. 

With Docker, Bazel shut down properly after each build, and when the next RUN command came through, Bazel would start a new server. This is the desired behavior.

When we tested building images with Buildah, the local Bazel server process didn’t shut down properly and left state files behind, so when the next RUN directive started, Bazel assumed the server was still running from the previous image layer and didn’t start one, failing the entire build.

To fix this issue, either force Bazel to start a local server at each layer by deleting the state files left by the previous layer, or combine multiple make commands in a single RUN directive, as sketched below.
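An illustrative Dockerfile fragment of the second workaround (the make targets are hypothetical): keeping the Bazel-backed targets in one RUN keeps the server’s lifecycle within a single layer, so no stale state leaks into the next one.

# Combine the Bazel-backed targets instead of using one RUN per target
RUN make build-api && make build-worker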

7. Unable to resolve hostname

There was a bug in Buildah that caused Java builds to fail: when using the host network, the intermediate container got its container ID as its hostname, which was unresolvable because no matching entry was present in /etc/hosts. Details here and fix here.

To solve this, we used an internal patch for a few months, and once the migration was complete, we got it fixed upstream by the Buildah team. These days, as long as you use Buildah v1.31.0 or later, you should have no issues.

8. Setting rootless privileges for pods on a node

By running each pod on a node as rootless, and following the principle of least privilege, we ensure that bad actors who compromise a pod cannot access actual node data.

To run Buildah inside a Kubernetes container without root privileges, set the following: 

# Set default security profile strategies

  runAsUser:
    rule: MustRunAsNonRoot
  allowedCapabilities:
    # Required for Buildah to run in a non-privileged container. See
    # https://github.com/containers/buildah/issues/4049
    - SETUID
    - SETGID
    # "Since Linux 5.12, this capability is also needed to map
    # user ID 0 in a new user namespace" from:
    #   - https://man7.org/linux/man-pages/man7/capabilities.7.html
    # See also (search for "If updating /proc/pid/uid_map"):
    #   - https://man7.org/linux/man-pages/man7/user_namespaces.7.html
    - SETFCAP
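For reference, here is a hedged sketch of the corresponding pod-level security context (the pod name, image, and UID are illustrative; quay.io/buildah/stable is the upstream Buildah image):

apiVersion: v1
kind: Pod
metadata:
  name: buildah-builder
spec:
  containers:
    - name: builder
      image: quay.io/buildah/stable
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        capabilities:
          add: ["SETUID", "SETGID", "SETFCAP"]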

9. Chown errors

We had to set ignore_chown_errors = "true" [see storage.conf above] to fix some of the apps we build.

Here’s some documentation for this flag on GitHub and in the Podman docs:

“This will allow non-privileged users running with a single UID within a user namespace to run containers. The user can pull and use any image, even those with multiple uids. Note multiple UIDs will be squashed down to the default uid in the container. These images will have no separation between the users in the container. Only supported for the overlay and vfs drivers.” – GitHub

10. Mknod requires root privileges

While attempting to build the open-source ingress-nginx app with rootless Buildah, we encountered the following error:

Fail to run mknod  - Operation not permitted

As explained here, the error occurs because the kernel blocks mknod for unprivileged (rootless) users. Regardless of how many capabilities remain in the user namespace, the operation won’t be permitted.

This is a limitation (documented here) we had to live with. The workaround is to use an already-built image from the GitHub repository.

11. Too Many Open Files 

“Too many open files” is the error Linux returns when open(), connect(), or anything else that allocates a file descriptor fails because the process has hit its upper limit on open files.

The buildah-build manual page gave us a hint on what number to choose for the max number of open files: 

“nofile”: maximum number of open files (ulimit -n)

“nofile”: maximum number of open files (1048576); when run by root

Initially we started with half of what Buildah uses when run as root (i.e., 1048576 / 2 = 524288), and this did the trick.

The final config we used was as follows: 

"--ulimit", "nofile=524288"

You will want to experiment based on your requirements and set nofile accordingly.
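If you pass the limit directly on the Buildah command line instead, it could look like this (the image tag is illustrative):

buildah bud --ulimit nofile=524288 -t myapp:latest .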

12. Inability to access root file system

When building images, Buildah runs rootless and doesn’t have permission to access root-owned directories. For example, I saw an issue where /var/run was symlinked to /run; /run was owned by root, so Buildah was unable to access it.

Make sure you are not accessing files owned by root when building images. 

Keep in mind that many other Buildah build configuration arguments may be useful depending on your project, for example: --network, --layers, --format, --build-context, --memory, etc.

Go exploring! 

In summary, we determined that migrating from Docker to Buildah was necessary to maintain ZipRecruiter’s high standards for performance and data security. Despite many challenges along the way, we persevered, went to the source, and eventually achieved our goal while helping create fixes for the whole community.

If you’re interested in working on solutions like these, visit our Careers page to see open roles.

* * *

About the Author

Saurabh Ahuja is a Staff Software Engineer at ZipRecruiter. As a key member of the Core Systems team, Saurabh builds and maintains the framework and tools that power our technological development and online services. After 18+ years at the most successful tech companies in the world, he still loves to get his hands deep in code. ZipRecruiter offers him precisely that, as well as the opportunity to influence the company as a leading IC, and the flexibility to take part in family life at home and train for intense Ultraman triathlons.
