Troubleshooting Dockerfile builds with checkpoint containers

(This was originally published at http://blogs.dlt.com/troubleshooting-dockerfile-builds-checkpoint-containers/ and is republished here with permission and some minor edits.)

The scene

I recently ran into some problems compiling code while building a Docker image from a Dockerfile.  The compile step wasn’t working, so the failed make install stopped the container image from being built.  There are three ways I could have approached troubleshooting this failed build.

I used what I’m calling a checkpoint container to troubleshoot quickly and get back on track with the rest of the build process.

Dockerfile quick tour

A Dockerfile is an ordered list of instructions for the docker build process.  Each instruction results in a layer of the image holding the results of whatever commands were executed.  RUN instructions execute commands in a shell and modify the file system, COPY instructions copy files from the working directory on the host into the specified location in the container, and so on.  You can get more info on the Dockerfile format and the build process on the web.
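Since the build output refers to instructions by step number, it can help to see which instruction maps to which step.  The following is a minimal sketch, not part of the original build: list_steps is a made-up helper name, and it assumes a simple Dockerfile where comment lines start with #, a trailing backslash continues an instruction, and no heredocs or parser directives are used.

```shell
# list_steps: print each Dockerfile instruction with the step number the
# build output would assign to it (hypothetical helper for illustration).
list_steps() {
    awk '
        /^[ \t]*#/ { next }   # comment lines are not instructions
        /^[ \t]*$/ { next }   # blank lines are not instructions
        {
            if (!cont) {
                step++
                line = $0
                sub(/^[ \t]+/, "", line)
                # show only the first line of a multi-line instruction
                sub(/[ \t]*\\[ \t]*$/, " ...", line)
                printf "Step %d : %s\n", step, line
            }
            # a trailing backslash means the next line continues this one
            cont = ($0 ~ /\\[ \t]*$/)
        }
    ' "$1"
}

# Usage (assumes a Dockerfile in the current directory):
#   list_steps Dockerfile
```

Running it against the Dockerfile before a build gives a quick map from step numbers to instructions, which makes the build log easier to follow.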

Let’s take a quick look at the Dockerfile.  This is the build that was failing:

# Derived from instructions here https://gist.github.com/lucasoldaini/2d548cafa7ea4d147aa2bb1c7cd393cc
FROM rhel7

# Needs rhel-7-server-optional-rpms enabled on the docker host that builds the container
RUN  yum repolist && yum-config-manager --enable rhel-7-server-optional-rpms && \
INSTALL_PKGS="gcc-gfortran gcc binutils autoconf automake make git python-setuptools Cython python-devel" && yum -y install $INSTALL_PKGS

RUN git clone https://github.com/xianyi/OpenBLAS && git clone https://github.com/numpy/numpy.git

# OpenBLAS build section
WORKDIR /OpenBLAS
COPY OpenBLAS.conf /etc/ld.so.conf.d/
RUN git checkout tags/v0.2.18 && \
make CC="gcc -m64" FC="gfortran -m64" RANLIB="ranlib" FFLAGS=" -O2 -fPIC" TARGET= BINARY=64 USE_THREAD=1 GEMM_MULTITHREADING_THRESHOLD=50 NUM_THREADS=16 NO_AFFINITY=1 DYNAMIC_ARCH=1 INTERFACE64=1 && \
make install && \
ldconfig

# NumPY build section
COPY site.cfg /numpy/
WORKDIR /numpy
RUN unset CPPFLAGS && \
unset LDFLAGS && \
python setup.py build --fcompiler=gnu95 && \
python setup.py install

# SciPY build section
WORKDIR /
RUN git clone https://github.com/scipy/scipy.git
RUN yum -y install gcc-c++
WORKDIR /scipy
RUN git checkout tags/v0.15.1
RUN python setup.py config_fc --fcompiler=gnu95 install

WORKDIR /
ADD experiments.tar /tmp/

# docker run /tmp/benchmark
ENTRYPOINT ["python"]
CMD ["--help"]

It’s not important here to discuss what OpenBLAS is, only that the libraries it builds are tuned to the machine by compile-time flags.  The compile of OpenBLAS happening in the container was creating a library with a different name than the install section of the Makefile expected.

I had to see what the result of the compile was in the source directory, but the build failed so the final container wasn’t around.

3 easy troubleshooting techniques

1. Change the Dockerfile and rebuild

This is probably the most common way folks would look at troubleshooting a docker container image build.  Simply look at the instruction that failed, interpret the output from the build, and modify the instruction.  Then we can re-run the build to see what happens.

This is a very simple way to troubleshoot, but has the disadvantage of re-running the build every time.  If the instruction that needs to change is the failed instruction, e.g. adding another package to a list of RPMs, then the build will start from the cache at that point.  This would be very fast.  If the instruction is from a previous layer, then all of the intermediate instructions would be re-run as well.  This could take a while if, say, we need to update the list of RPMs to add a library, which would result in reinstalling all the RPMs.  The resulting Dockerfile may not be the most optimized, but it will create reproducible images.

Overall, this is fairly low drag, and therefore common.  The downside comes when the output from the failed instruction isn’t clear.  If you need to see what winds up on the disk, like in my compiler / installer mismatch, you are out of luck.  The make was creating something, but not the something that make install needed.  So I needed to see the artifacts being created on disk.

2. Build in fresh container and record steps

When we need to be able to interact with the container build artifacts, many folks will turn to launching the base layer container, getting a shell, doing the work, and recording actions for use in a Dockerfile for later builds.

This provides the most control and is just like troubleshooting compiling on a workstation.  If the compile fails, we can just change the flags and try again.  Once we’ve gotten a good build, we can take the notes of what RPMs needed to be installed, the flags to send make, additional files and configs that need to be updated, and create individual instructions in a Dockerfile.  The resulting Dockerfile will be clean and we can apply some optimizations to the docker build process right out of the gate.

However, this is fairly high friction.  We’re doing most of the work just like on a workstation, which ignores the benefits of using the build system in the first place.  We’re human: we may forget to write things down, and the Dockerfile might miss steps we did in the shell.  Plus, after all our work, we still need to run a docker build to get our final container image.  Not the optimal way to use the tools at hand.

3. Create a checkpoint container

I needed a method that combined the two approaches: using the built-in build system and getting a shell in a container as close to the failure point as possible.  It turns out we can do that by taking advantage of how the docker build process works.

First, we need to take a look at how images are created.

Docker image sidebar

Docker images are composed of file system layers, with each subsequent action done on a new copy-on-write layer.  You’ve probably seen a diagram that looks something like this:

[Figure: container layers.  Image credit: https://blog.docker.com/2015/10/docker-basics-webinar-qa/]

During the docker build process, each instruction starts a new container based on the previously committed layer, executes the instruction, and commits the new layer if the instruction succeeds.  At the end of the process, the tags you provided to docker build are applied and you wind up with an image named the way you expected, like myuser/python.

When we run docker build -t oblasbench . it ends up like this:

...
Step 16 : ENTRYPOINT python
---> Using cache
---> 305e4fec274d
Step 17 : CMD --help
---> Using cache
---> 61db194d3341
Successfully built 61db194d3341

If we look at docker images oblasbench, we can see that the Image ID from the Repository named oblasbench is the same as the last line of our docker build output.

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
oblasbench          latest              61db194d3341        About an hour ago   1.716 GB

Troubleshooting in the intermediate layers

Let’s rewind back to troubleshooting the oblasbench container compile.  The issue I mentioned happened in Step 6.  The make flags were copied from a Gist that had some clear instructions on building OpenBLAS from source.

RUN git checkout tags/v0.2.18 && \
make CC="gcc -m64" FC="gfortran -m64" RANLIB="ranlib" FFLAGS=" -O2 -fPIC" TARGET= BINARY=64 USE_THREAD=1 GEMM_MULTITHREADING_THRESHOLD=50 NUM_THREADS=16 NO_AFFINITY=1 DYNAMIC_ARCH=1 INTERFACE64=1 && \
make install && \
ldconfig

But on this system, those compiler flags resulted in the following error:

Step 5 : COPY OpenBLAS.conf /etc/ld.so.conf.d/
---> Using cache
---> 20a1fbb0d798
Step 6 : RUN git checkout tags/v0.2.18 && make CC="gcc -m64" FC="gfortran -m64" RANLIB="ranlib" FFLAGS=" -O2 -fPIC" TARGET= BINARY=64 USE_THREAD=1 GEMM_MULTITHREADING_THRESHOLD=50 NUM_THREADS=16 NO_AFFINITY=1 DYNAMIC_ARCH=1 INTERFACE64=1 && make install && ldconfig
---> Running in 64b59c47cdb0
[snip]

make -j 1 -f Makefile.install install
make[1]: Entering directory `/OpenBLAS'
Generating openblas_config.h in /opt/OpenBLAS/include
Generating f77blas.h in /opt/OpenBLAS/include
Generating cblas.h in /opt/OpenBLAS/include
Copying LAPACKE header files to /opt/OpenBLAS/include
Copying the static library to /opt/OpenBLAS/lib
make[1]: Leaving directory `/OpenBLAS'
install: cannot stat 'libopenblas-r0.2.18.a': No such file or directory
make[1]: *** [install] Error 1
make: *** [install] Error 2

The failed build means there’s no final image for me to see in docker images, but let’s look closer at Steps 5 and 6.  We know that docker build runs a new container in Step 6, which is based on the intermediate layer from Step 5.  This means we can use the layer from Step 5 for our checkpoint container.

A checkpoint container is a container we launch manually using that same intermediate layer that docker build uses to create the cache.
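Finding that layer by eye gets tedious in a long build log, so here is a rough sketch of pulling it out automatically.  The function name last_good_layer and the build.log file name are made up for illustration, and the sketch assumes build output in the "---> <id>" format shown in this post; other builders may print something different.

```shell
# last_good_layer: read docker build output on stdin and print the ID of
# the last layer that was committed (hypothetical helper for illustration).
last_good_layer() {
    # Committed layers show up as "---> <hex id>" on a line of their own.
    # Lines such as "---> Using cache" or "---> Running in <id>" do not
    # match because the arrow is not followed directly by a bare hex ID.
    sed -n 's/^[[:space:]]*---> \([0-9a-f]\{12,64\}\)[[:space:]]*$/\1/p' | tail -n 1
}

# Usage (assumes the build output was saved to build.log):
#   docker build -t oblasbench . 2>&1 | tee build.log
#   last_good_layer < build.log
```

For a failed build, the last committed layer is the one the failing instruction started from, which is exactly the checkpoint we want.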

The last line of Step 5 gives the image layer ID we want for our checkpoint container, 20a1fbb0d798.  Let’s take a look at that image:

docker history 20a1fbb0d798
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
20a1fbb0d798        21 hours ago        /bin/sh -c #(nop) COPY file:0aefaff87b25769f6   18 B
01aaf43c0d50        22 hours ago        /bin/sh -c #(nop) WORKDIR /OpenBLAS             0 B
bbe3500e7e2d        22 hours ago        /bin/sh -c git clone https://github.com/xiany   179.7 MB
7f0e79eee3c5        22 hours ago        /bin/sh -c yum repolist && yum-config-manager   528.8 MB
4a6b6e1a17d7        4 weeks ago                                                         201.6 MB            Created by Image Factory

That’s the history of the working build so far.  We can go ahead and fire up a container that has the checkpoint we need and get a shell.

docker run -it 20a1fbb0d798 /bin/bash
[root@1ca7e23a12c6 OpenBLAS]# pwd
/OpenBLAS
[root@1ca7e23a12c6 OpenBLAS]# git status
# On branch develop
nothing to commit, working directory clean

Now we can start figuring out the right flags we need to get OpenBLAS to compile and install cleanly.  We can run several experiments to find the right optimizations for the environment.  And if we need to, we can just launch a new checkpoint container to start from scratch.
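If you expect to start over a few times, a small wrapper can keep the throwaway containers from piling up.  This is only a sketch under my own assumptions: checkpoint_shell and the DRY_RUN variable are invented names, and adding --rm (which deletes the container on exit) is a choice I’m making here, not something the original session used.

```shell
# checkpoint_shell: start a throwaway shell in a container created from an
# intermediate layer (hypothetical helper; DRY_RUN is also made up here).
checkpoint_shell() {
    layer=$1
    # Accept only something that looks like an image ID (hex digits).
    case $layer in
        ''|*[!0-9a-f]*)
            echo "usage: checkpoint_shell <layer-id>" >&2
            return 1
            ;;
    esac
    # --rm removes the container on exit, so every launch starts again
    # from the same untouched layer.
    set -- docker run --rm -it "$layer" /bin/bash
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"    # only print the command
    else
        "$@"         # actually start the container
    fi
}

# Usage:
#   checkpoint_shell 20a1fbb0d798
#   DRY_RUN=1 checkpoint_shell 20a1fbb0d798   # print the command only
```

Leave out --rm if you want to keep a container around to compare results between experiments.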

The experiment with compiler flags and rebuilds ended up with a new set:

make CC="gcc -m64" FC="gfortran -m64" TARGET= && \
make install && \
ldconfig

Once we’ve found that working set, we update the Dockerfile and exit the container.  Since we’ve been working in a separate running container, we haven’t mucked about with the cache the build uses, so docker build will pick right back up at Step 6 from the same image we were using for troubleshooting.

A new way of making Dockerfiles

Using this pattern, we can improve our ability to troubleshoot container images as we build them.  It means making initial Dockerfiles to get a working build, then improving the Dockerfile to build to our standards for smaller layers and compounding some of our instructions.

Let’s take a look at creating the scipy section of the Dockerfile:

WORKDIR /
RUN git clone https://github.com/scipy/scipy.git
RUN yum -y install gcc-c++
WORKDIR /scipy
RUN git checkout tags/v0.15.1
RUN python setup.py config_fc --fcompiler=gnu95 install

This looks like bash history output rather than a streamlined Dockerfile like the NumPy section.  Each instruction is only doing one thing, which increases the number of layers.  It also means we have a wider choice for creating checkpoint containers to do troubleshooting and additional experimentation.

For example, the latest stable release of SciPy needs a later version of Cython than I had available in the container, so I needed to hunt around for a release tag in the git tree that compiled.  One caveat: images can’t have more than 127 layers, so we do need to balance checkpoints against total layer depth.
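To keep an eye on that balance, one option is to count the entries that docker history reports for an image.  The sketch below is an approximation under stated assumptions: history_rows and layer_headroom are made-up names, the 127 figure is simply the limit quoted above, and depending on the Docker version some history entries (metadata-only instructions, for example) may not count toward the limit.

```shell
MAX_LAYERS=127   # the limit quoted in the text; treated as an assumption

# history_rows: count non-empty lines on stdin, intended for the output of
# `docker history -q <image>`, which prints one ID per history entry.
history_rows() {
    awk 'NF { n++ } END { print n + 0 }'
}

# layer_headroom: print how many more entries fit under MAX_LAYERS.
layer_headroom() {
    rows=$(history_rows)
    echo $((MAX_LAYERS - rows))
}

# Usage (assumes the image from this post exists locally):
#   docker history -q oblasbench | history_rows
#   docker history -q oblasbench | layer_headroom
```

If the headroom gets small, that’s a sign it’s time to collapse some of the single-purpose instructions.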

Once we’re satisfied with the way SciPy has been built and installed in the container, we can modify the Dockerfile to optimize for the build process and minimize image size.

...
INSTALL_PKGS="gcc-gfortran gcc binutils autoconf automake make git python-setuptools Cython python-devel gcc-c++" && yum -y install $INSTALL_PKGS

RUN git clone https://github.com/xianyi/OpenBLAS && git clone https://github.com/numpy/numpy.git && git clone https://github.com/scipy/scipy.git
...

WORKDIR /scipy
RUN git checkout tags/v0.15.1 && python setup.py config_fc --fcompiler=gnu95 install

The gcc-c++ RPM goes into the package list at the top and the git clone joins the other 2 git clone operations.  We can eliminate the WORKDIR / since that was only needed to make sure the git clone happened in the right spot.

Collapsing the resulting individual instructions to a single RUN means that we’ve taken the installation of SciPy down to 2 layers, one of which is a cd NOP.  You could probably go further and change those to actual cd or pushd/popd commands in the RUN instructions to eliminate the NOP layers.

TL;DR

Understanding the docker build process lets us take advantage of the multiple layers in docker container images.  Using these intermediate build layers to build checkpoint containers means we can inspect, manipulate, and troubleshoot our build process in a sane and simple manner.  Getting to know the Dockerfile format can improve your build process and resulting container images.

Sidebar: Continuous improvement

As I was writing this post, I noticed quite a few things that I could have done better or differently in the build process.  Changing the build order and moving the detection of libraries from scipy to numpy made the overall process faster.  It’s always important to review what you’re doing to see if there’s something to learn.

Copyright (c) 2016 Copyright Holder All Rights Reserved.
