(This was originally published on http://blogs.dlt.com/troubleshooting-dockerfile-builds-checkpoint-containers/ Republished here with permission and some minor edits.)
I recently ran into some problems compiling code while building a docker image using a Dockerfile. The compile process wasn’t working, so the failed
make install was stopping the container image from being built. There are three ways I could have approached troubleshooting this failed build.
I used what I’m calling a checkpoint container to troubleshoot quickly and get back on track with the rest of the build process.
Dockerfile quick tour
Dockerfiles are an ordered list of instructions for the
docker build process. Each instruction results in an image layer containing whatever changes its commands made. RUN instructions execute commands in a shell and modify the file system, COPY instructions copy files from the working directory on the host into the specified location in the container, and so on. You can get more info on the Dockerfile and the build process on the web.
Let’s take a quick look at the Dockerfile. This is the build that was failing:
# Derived from instructions here https://gist.github.com/lucasoldaini/2d548cafa7ea4d147aa2bb1c7cd393cc
FROM rhel7

# Needs rhel-7-server-optional-rpms enabled on the docker host that builds the container
RUN yum repolist && yum-config-manager --enable rhel-7-server-optional-rpms && \
    INSTALL_PKGS="gcc-gfortran gcc binutils autoconf automake make git python-setuptools Cython python-devel" && \
    yum -y install $INSTALL_PKGS
RUN git clone https://github.com/xianyi/OpenBLAS && git clone https://github.com/numpy/numpy.git

# OpenBLAS build section
WORKDIR /OpenBLAS
COPY OpenBLAS.conf /etc/ld.so.conf.d/
RUN git checkout tags/v0.2.18 && \
    make CC="gcc -m64" FC="gfortran -m64" RANLIB="ranlib" FFLAGS=" -O2 -fPIC" TARGET= BINARY=64 USE_THREAD=1 GEMM_MULTITHREADING_THRESHOLD=50 NUM_THREADS=16 NO_AFFINITY=1 DYNAMIC_ARCH=1 INTERFACE64=1 && \
    make install && \
    ldconfig

# NumPY build section
COPY site.cfg /numpy/
WORKDIR /numpy
RUN unset CPPFLAGS && \
    unset LDFLAGS && \
    python setup.py build --fcompiler=gnu95 && \
    python setup.py install

# SciPY build section
WORKDIR /
RUN git clone https://github.com/scipy/scipy.git
RUN yum -y install gcc-c++
WORKDIR /scipy
RUN git checkout tags/v0.15.1
RUN python setup.py config_fc --fcompiler=gnu95 install

WORKDIR /
ADD experiments.tar /tmp/

# docker run /tmp/benchmark
ENTRYPOINT ["python"]
CMD ["--help"]
While it’s not important to discuss what OpenBLAS is in detail, the libraries it builds are optimized for the machine via compile-time flags. The OpenBLAS compile happening in the container was creating a library with a different name than the install section of the Makefile expected.
I had to see what the result of the compile was in the source directory, but the build failed so the final container wasn’t around.
3 easy troubleshooting techniques
1. Change the Dockerfile and rebuild
This is probably the most common way folks would look at troubleshooting a docker container image build. Simply look at the instruction that failed, interpret the output from the build, and modify the instruction. Then we can re-run the build to see what happens.
This is a very simple way to troubleshoot, but has the disadvantage of re-running the build every time. If the instruction that needs to change is the failed instruction, e.g. adding another package to a list of RPMs, then the build will start from the cache at that point. This would be very fast. If the instruction is from a previous layer, then all of the intermediate instructions would be re-run as well. This could take a while if, say, we need to update the list of RPMs to add a library, which would result in reinstalling all the RPMs. The resulting Dockerfile may not be the most optimized, but it will create reproducible images.
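As a minimal sketch of that cache behavior (with hypothetical instructions, not the real build), edits invalidate the cache for the changed instruction and everything after it:

```
FROM rhel7
# Changing this line invalidates the cache for it and every instruction
# below it, so the whole RPM install re-runs.
RUN yum -y install gcc make
# Changing only this line reuses the cached layer above it,
# so the rebuild restarts here and is fast.
RUN make install
```

This is why the cost of technique 1 depends entirely on how early in the Dockerfile the change lands.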
Overall, this is fairly low drag, and therefore common. The downfall comes when the output from the failed instruction isn’t clear. If you need to see what winds up on the disk, like in my compiler / installer mismatch, you are out of luck. The
make was creating something, but it wasn’t making the something that
make install needed. So I needed to see the artifacts being created on disk.
2. Build in fresh container and record steps
When we need to be able to interact with the container build artifacts, many folks will turn to launching the base layer container, getting a shell, doing the work, and recording actions for use in a Dockerfile for later builds.
This provides the most control and is just like troubleshooting compiling on a workstation. If the compile fails, we can just change the flags and try again. Once we’ve gotten a good build, we can take the notes of what RPMs needed to be installed, the flags to send
make, additional files and configs that need to be updated, and create individual instructions in a Dockerfile. The resulting Dockerfile will be clean and we can apply some optimizations to the docker build process right out of the gate.
However, this is fairly high friction. We’re doing most of the work just like a workstation, which ignores the benefits of using the build system in the first place. We’re human, we may forget to write things down, the Dockerfile might miss steps we did in the shell. Plus, after all our work, we still need to run a
docker build to get our final container image. Not the optimal way to use the tools at hand.
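For contrast, that manual workflow looks roughly like this. It is only a sketch; the packages and commands shown are just the ones this particular build needs:

```
# Start a shell in a fresh container from the base image
docker run -it rhel7 /bin/bash

# ...then, inside the container, do the work by hand,
# writing each step down as you go:
yum -y install gcc-gfortran gcc make git
git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS && make && make install

# ...and finally, transcribe the notes into Dockerfile
# instructions and run the build anyway:
docker build -t oblasbench .
```

Every line you forget to write down is a difference between the container you debugged and the image the Dockerfile actually produces.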
3. Create a checkpoint container
I needed a method that combined the two approaches: using the built-in build system and getting a shell in a container as close to the failure point as possible. It turns out we can do that by taking advantage of how the
docker build process works.
First, we need to take a look at how images are created.
Docker image sidebar
Docker images are composed of file system layers, with each subsequent action recorded in a new copy-on-write layer. You’ve probably seen a diagram that looks something like this:
image credit: https://blog.docker.com/2015/10/docker-basics-webinar-qa/
During the docker build process, each instruction starts a new container based on the previously committed layer, executes the instruction, and commits the new layer if the instruction succeeds. At the end of the process, the tags you provided to
docker build are applied and you wind up with an image named the way you expected. When we run
docker build -t oblasbench . it ends up like this:
...
Step 16 : ENTRYPOINT python
 ---> Using cache
 ---> 305e4fec274d
Step 17 : CMD --help
 ---> Using cache
 ---> 61db194d3341
Successfully built 61db194d3341
If we look at
docker images oblasbench, we can see that the Image ID from the Repository named
oblasbench is the same as the last line of our
docker build output.
REPOSITORY    TAG       IMAGE ID        CREATED              SIZE
oblasbench    latest    61db194d3341    About an hour ago    1.716 GB
Troubleshooting in the intermediate layers
Let’s rewind back to troubleshooting the
oblasbench container compile. The issue I mentioned happened in Step 6. The
make flags were copied from a Gist that had some clear instructions on building OpenBLAS from source.
RUN git checkout tags/v0.2.18 && \
    make CC="gcc -m64" FC="gfortran -m64" RANLIB="ranlib" FFLAGS=" -O2 -fPIC" TARGET= BINARY=64 USE_THREAD=1 GEMM_MULTITHREADING_THRESHOLD=50 NUM_THREADS=16 NO_AFFINITY=1 DYNAMIC_ARCH=1 INTERFACE64=1 && \
    make install && \
    ldconfig
But on this system, those compiler flags resulted in the following error:
Step 5 : COPY OpenBLAS.conf /etc/ld.so.conf.d/
 ---> Using cache
 ---> 20a1fbb0d798
Step 6 : RUN git checkout tags/v0.2.18 && make CC="gcc -m64" FC="gfortran -m64" RANLIB="ranlib" FFLAGS=" -O2 -fPIC" TARGET= BINARY=64 USE_THREAD=1 GEMM_MULTITHREADING_THRESHOLD=50 NUM_THREADS=16 NO_AFFINITY=1 DYNAMIC_ARCH=1 INTERFACE64=1 && make install && ldconfig
 ---> Running in 64b59c47cdb0
[snip]
make -j 1 -f Makefile.install install
make: Entering directory `/OpenBLAS'
Generating openblas_config.h in /opt/OpenBLAS/include
Generating f77blas.h in /opt/OpenBLAS/include
Generating cblas.h in /opt/OpenBLAS/include
Copying LAPACKE header files to /opt/OpenBLAS/include
Copying the static library to /opt/OpenBLAS/lib
make: Leaving directory `/OpenBLAS'
install: cannot stat 'libopenblas-r0.2.18.a': No such file or directory
make: *** [install] Error 1
make: *** [install] Error 2
The failed build means no final container I can see in
docker images, but let’s look closer at Step 5 & 6. We know that
docker build is running a new container in Step 6, which is based on the intermediate layer from Step 5. This means we can use the layer from Step 5 for our checkpoint container.
A checkpoint container is a container we launch manually using that same intermediate layer that
docker build uses to create the cache.
The last line of Step 5 is the image layer ID we want to use for our checkpoint container,
20a1fbb0d798. Let’s take a look at that image:
docker history 20a1fbb0d798
IMAGE           CREATED         CREATED BY                                       SIZE        COMMENT
20a1fbb0d798    21 hours ago    /bin/sh -c #(nop) COPY file:0aefaff87b25769f6    18 B
01aaf43c0d50    22 hours ago    /bin/sh -c #(nop) WORKDIR /OpenBLAS              0 B
bbe3500e7e2d    22 hours ago    /bin/sh -c git clone https://github.com/xiany    179.7 MB
7f0e79eee3c5    22 hours ago    /bin/sh -c yum repolist && yum-config-manager    528.8 MB
4a6b6e1a17d7    4 weeks ago                                                      201.6 MB    Created by Image Factory
That’s the history of the working build so far. We can go ahead and fire up a container that has the checkpoint we need and get a shell.
docker run -it 20a1fbb0d798 /bin/bash
[root@1ca7e23a12c6 OpenBLAS]# pwd
/OpenBLAS
[root@1ca7e23a12c6 OpenBLAS]# git status
# On branch develop
nothing to commit, working directory clean
Now we can start figuring out the right flags we need to get OpenBLAS to compile and install cleanly. We can run several experiments to find the right optimizations for the environment. And if we need to, we can just launch a new checkpoint container to start from scratch.
The experiment with compiler flags and rebuilds ended up with a new set:
make CC="gcc -m64" FC="gfortran -m64" TARGET= && \
make install && \
ldconfig
Once we’ve found that working set, we update the Dockerfile and exit the container. Since we’ve been working in a new running container, we haven’t mucked about with the cache the build uses, so
docker build will pick right back up again at Step 6 with the same image we were using for troubleshooting.
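With the fix in the Dockerfile, the resumed build output would look something like this. This is illustrative, but the intermediate IDs through Step 5 would match the earlier output, because those cached layers are untouched:

```
docker build -t oblasbench .
...
Step 5 : COPY OpenBLAS.conf /etc/ld.so.conf.d/
 ---> Using cache
 ---> 20a1fbb0d798
Step 6 : RUN git checkout tags/v0.2.18 && make CC="gcc -m64" FC="gfortran -m64" TARGET= && make install && ldconfig
 ---> Running in <a fresh container based on 20a1fbb0d798>
...
```

In other words, the build resumes from exactly the checkpoint we debugged in.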
A new way of making Dockerfiles
Using this pattern, we can improve our ability to troubleshoot container images as we build them. It means writing an initial Dockerfile to get a working build, then improving the Dockerfile to meet our standards for smaller layers by compounding some of our instructions.
Let’s take a look at creating the
scipy section of the Dockerfile:
WORKDIR /
RUN git clone https://github.com/scipy/scipy.git
RUN yum -y install gcc-c++
WORKDIR /scipy
RUN git checkout tags/v0.15.1
RUN python setup.py config_fc --fcompiler=gnu95 install
This looks like
bash history output rather than a streamlined Dockerfile like the NumPY section. Each instruction is only doing one thing, which increases the number of layers. It also means we have a wider choice for creating checkpoint containers to do troubleshooting and additional experimentation.
For example, the latest stable release of SciPY needs a later version of Cython than I had available in the container, so I needed to hunt around for a release tag in the git tree that compiled. One caveat: images can’t have more than 127 layers, so we do need to balance checkpoints against total layer depth.
Once we’re satisfied with the way SciPY has been built and installed in the container, we can modify the Dockerfile to optimize for the build process and minimize image size.
...
    INSTALL_PKGS="gcc-gfortran gcc binutils autoconf automake make git python-setuptools Cython python-devel gcc-c++" && \
    yum -y install $INSTALL_PKGS
RUN git clone https://github.com/xianyi/OpenBLAS && \
    git clone https://github.com/numpy/numpy.git && \
    git clone https://github.com/scipy/scipy.git
...
WORKDIR /scipy
RUN git checkout tags/v0.15.1 && python setup.py config_fc --fcompiler=gnu95 install
The
gcc-c++ RPM goes into the package list at the top, and the scipy
git clone joins the other two
git clone operations. We can eliminate the
WORKDIR / since that was only needed to make sure the
git clone happened in the right spot.
Collapsing the resulting individual instructions to a single
RUN means that we’ve taken the installation of SciPY down to 2 layers, one of which is a
cd NOP. You could probably go further and change those to actual
pushd/popd commands in the
RUN instructions to eliminate the NOP layers.
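As a sketch of that pushd/popd idea: pushd and popd are bash builtins, and docker build runs RUN instructions with /bin/sh by default, so this assumes a Docker version new enough to support the SHELL instruction:

```
# Switch the build shell to bash so pushd/popd are available
SHELL ["/bin/bash", "-c"]

# One layer: enter /scipy, build and install, return to where we started.
# No separate WORKDIR /scipy NOP layer is committed.
RUN pushd /scipy && \
    git checkout tags/v0.15.1 && \
    python setup.py config_fc --fcompiler=gnu95 install && \
    popd
```

The trade-off is the same one as before: fewer layers means fewer checkpoints available for troubleshooting later.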
The docker build process lets us take advantage of the multiple layers in docker container images. Using these intermediate build layers to build checkpoint containers means we can inspect, manipulate, and troubleshoot our build process in a sane and simple manner. Getting to know the Dockerfile format can improve your build process and resulting container images.
Sidebar: Continuous improvement
As I was writing this post, I noticed quite a few things that I could have done better or differently in the build process. Changing the build order and how libraries are detected by
scipy made the overall process faster. It’s always worth reviewing what you’re doing to see if there’s something to learn.
Copyright (c) 2016 Copyright Holder All Rights Reserved.