Automating Every Aspect of Your Python Project
Every Python project can benefit from automation using Makefile, optimized Docker images, well configured CI/CD, Code Quality Tools and more…
By Martin Heinz, DevOps Engineer at IBM
Every project, whether it is a web app, a data science pipeline or an AI system, can benefit from well-configured CI/CD, Docker images that are both debuggable in development and optimized for production, and a few extra code quality tools such as CodeClimate or SonarCloud. We will go over all of these in this article and see how to add them to your Python project!
This is a follow-up to the previous article about creating the “Ultimate” Python Project Setup, so you might want to check that out before reading this one.
TL;DR: Here is my repository with full source code and docs: https://github.com/MartinHeinz/python-project-blueprint
Debuggable Docker Containers for Development
Some people don’t like Docker because containers can be hard to debug or because their images take a long time to build. So let’s start here, by building images that are ideal for development: fast to build and easy to debug.
To make the image easily debuggable we need a base image that includes all the tools we might ever need when debugging, things like bash, vim, netcat, wget, cat, find, grep etc. python:3.8.1-buster seems like an ideal candidate for the task. It includes a lot of tools by default and we can easily install anything that is missing. This base image is pretty big, but that doesn't matter here as it's only going to be used for development. Also, as you probably noticed, I chose a very specific image, locking both the version of Python and the Debian release. That's intentional, as we want to minimize the chance of "breakage" caused by a newer, possibly incompatible version of either Python or Debian.
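As an alternative you could use an Alpine-based image. That, however, might cause some issues, as Alpine uses musl libc instead of the glibc that many prebuilt Python packages (wheels) are built against, so packages with C extensions may have to be compiled from source. So just keep that in mind if you decide to go this route.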
As for the speed of builds, we will leverage multistage builds so that as many layers as possible can be cached. This way we avoid re-downloading dependencies and tools like gcc, as well as all the libraries required by our application (from requirements.txt), on every build.
To further speed things up we will create a custom base image from the previously mentioned python:3.8.1-buster that includes all the tools we need, as we cannot cache the steps needed to download and install these tools into the final runner image.
Enough talking, let’s see the Dockerfile:
The Dockerfile goes through 3 intermediate images before creating the final runner image. The first of them is named builder. It downloads everything needed to build our final application, which includes gcc and Python virtual environment support. After installation it also creates the actual virtual environment, which is then used by the following images.
Next comes the builder-venv image, which copies the list of our dependencies (requirements.txt) into the image and then installs them. This intermediate image is needed for caching: we only want to install libraries if requirements.txt changes, otherwise we just use the cache.
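To see why the order of these stages matters for caching, here is a toy model in Python. It is not how Docker computes its cache keys internally; it only assumes, for illustration, that a layer can be reused when its parent layer, its instruction and the contents of any copied files are all unchanged. The names layer_key and build_keys and the sample file contents are hypothetical.

```python
import hashlib


def layer_key(parent_key: str, instruction: str, content: bytes = b"") -> str:
    """Toy cache key: depends on the parent layer, the instruction text
    and the bytes of any files the instruction copies into the image."""
    digest = hashlib.sha256()
    digest.update(parent_key.encode())
    digest.update(instruction.encode())
    digest.update(content)
    return digest.hexdigest()


def build_keys(requirements: bytes, source: bytes) -> dict:
    """Compute toy keys for the dependency layer and the source layer."""
    base = layer_key("", "FROM builder")
    deps = layer_key(base, "COPY requirements.txt", requirements)
    app = layer_key(deps, "COPY source", source)
    return {"deps": deps, "app": app}


first = build_keys(b"pytest\n", b"print('v1')\n")
# Only the application code changes: the dependency layer keeps its key,
# so the (slow) install step could be served from cache.
second = build_keys(b"pytest\n", b"print('v2')\n")
print(first["deps"] == second["deps"])  # True
print(first["app"] == second["app"])    # False
```

In this model, changing requirements.txt would change the dependency key and every key after it, which is why the dependency list is copied and installed before the rest of the source code.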
Before we create our final image we first want to run tests against our application. That’s what happens in the tester image. We copy our source code into the image and run the tests. If they pass, we move on to the runner.
For the runner image we use a custom image that includes some extras like vim or netcat that are not present in a normal Debian image. You can find this image on Docker Hub, and you can also check out its very simple Dockerfile, base.Dockerfile.
So, what do we do in this final image? First we copy the virtual environment that holds all our installed dependencies from the tester image, next we copy our tested application. Now that we have all the sources in the image, we move to the directory where the application is and set ENTRYPOINT so that it runs our application when the image is started. For security reasons we also set USER to 1001, as best practices tell us never to run containers as the root user. The final 2 lines set the labels of the image. These get populated when the build is run using a make target, which we will see a little later.
Optimized Docker Containers for Production
When it comes to production-grade images we want to make sure that they are small, secure and fast. My personal favourite for this task is the Python image from the Distroless project. What is Distroless, though?
Let me put it this way: in an ideal world everybody would build their image using FROM scratch as their base image (that is, an empty image). That's however not what most of us would like to do, as it requires you to statically link your binaries, etc. That's where Distroless comes into play: it's FROM scratch for everybody.
Alright, now to actually describe what Distroless is. It’s a set of images made by Google that contain the bare minimum needed for your app, meaning that there are no shells, package managers or other tools that would bloat the image and create noise for security scanners (such as CVE findings in packages you never use), making it harder to establish compliance.
Now that we know what we are dealing with, let’s see the production Dockerfile... Well, actually, we are not going to change that much here, it's just 2 lines:
All we had to change is our base images for building and running the application! But the difference is pretty big: our development image was 1.03GB and this one is just 103MB. I know, I can already hear you: “But Alpine can be even smaller!” Yes, that’s right, but size doesn’t matter that much. You will only ever notice image size when downloading or uploading it, which is not that often. When the image is running, size doesn’t matter at all. What is more important than size is security, and in that regard Distroless is surely superior, as Alpine (which is a great alternative) has lots of extra packages that increase the attack surface.
The last thing worth mentioning when talking about Distroless is debug images. Considering that Distroless doesn’t contain any shell (not even sh), it gets pretty tricky when you need to debug and poke around. For that, there are debug versions of all Distroless images. So, when poop hits the fan, you can build your production image using the debug tag and deploy it alongside your normal image, exec into it and take, for example, a thread dump. You can use the debug version of the python3 image like so:
Single Command for Everything
With all the Dockerfiles ready, let's automate the hell out of it with a Makefile! The first thing we want to do is build our application with Docker. So, to build the dev image we can run make build-dev, which runs the following target:
This target builds the image by first substituting the labels at the bottom of dev.Dockerfile with the image name and a tag, which is created by running git describe, and then running docker build.
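As a rough illustration of the substitution step, here is a minimal Python sketch. The real target does this inside the Makefile and takes the version from git describe; the {NAME} and {VERSION} placeholder syntax, the function name render_labels and the sample values below are assumptions made for this example.

```python
def render_labels(dockerfile_text: str, name: str, version: str) -> str:
    """Fill in the label placeholders of a Dockerfile template.

    The {NAME} and {VERSION} markers are an assumption of this sketch;
    adapt them to whatever placeholders your Dockerfile uses.
    """
    return (
        dockerfile_text
        .replace("{NAME}", name)
        .replace("{VERSION}", version)
    )


# Hypothetical tail of a Dockerfile and hypothetical image name/version.
template = "LABEL name={NAME}\nLABEL version={VERSION}\n"
rendered = render_labels(template, "blueprint", "0.0.1")
print(rendered)
```

The rendered text can then be handed to docker build, so every image carries labels that identify which project and version it was built from.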
Next up: building for production with make build-prod VERSION=1.0.0:
This one is very similar to the previous target, but instead of using the git tag as the version, we use the version passed as an argument, in the example above 1.0.0.
When you run everything in Docker, you will at some point also need to debug it in Docker. For that, there is the following target:
From the above we can see that the entrypoint gets overridden by bash and the container command gets overridden by an argument. This way we can either just enter the container and poke around or run a one-off command, like in the example above.
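The entrypoint override can be sketched in Python as the argument list such a target might hand to docker run. This is only an illustration under assumptions: the function name debug_command, the image name blueprint and the tag dev are hypothetical, and the real target may pass additional flags.

```python
import shlex
from typing import List, Optional


def debug_command(image: str, tag: str, cmd: Optional[str] = None) -> List[str]:
    """Build a `docker run` invocation that starts bash instead of the app."""
    argv = [
        "docker", "run",
        "-ti",                        # interactive session with a terminal
        "--rm",                       # remove the container on exit
        "--entrypoint", "/bin/bash",  # replace the image's ENTRYPOINT
        f"{image}:{tag}",
    ]
    if cmd:
        # Everything after the image name becomes the arguments to bash.
        argv += shlex.split(cmd)
    return argv


# Interactive shell: no extra command.
print(shlex.join(debug_command("blueprint", "dev")))
# One-off command: bash receives "-c" and the command string.
print(shlex.join(debug_command("blueprint", "dev", "-c 'ls -la'")))
```

Without a command you land in an interactive bash session; with one, bash runs it and the container exits.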
When we are done coding and want to push the image to a Docker registry, we can use make push VERSION=0.0.2. Let's see what the target does:
It first runs the build-prod target we looked at previously and then just runs docker push. This assumes that you are logged into the Docker registry, so before running this you will need to run docker login.
The last target is for cleaning up Docker artifacts. It uses the name label that was substituted into the Dockerfiles to filter and find the artifacts that need to be deleted:
You can find the full code listing for this Makefile in my repository here: https://github.com/MartinHeinz/python-project-blueprint/blob/master/Makefile
CI/CD with GitHub Actions
Now, let’s use all these handy make targets to set up our CI/CD. We will be using GitHub Actions and GitHub Package Registry to build our pipelines (jobs) and to store our images. So, what exactly are those?
- GitHub Actions are jobs/pipelines that help you automate your development workflows. You can use them to create individual tasks and then combine them into custom workflows, which are then executed, for example, on every push to the repository or when a release is created.
- GitHub Package Registry is a package hosting service, fully integrated with GitHub. It allows you to store various types of packages, e.g. Ruby gems or npm packages. We will use it to store our Docker images. If you are not familiar with GitHub Package Registry and want more info on it, then you can check out my blog post here.
Now, to use GitHub Actions, we need to create workflows that are going to be executed based on the triggers we choose (e.g. a push to the repository). These workflows are YAML files that live in the .github/workflows directory in our repository:
In there, we will create 2 files, build-test.yml and push.yml. The first of them, build-test.yml, contains 2 jobs which are triggered on every push to the repository. Let's look at those:
The first job, called build, verifies that our application can be built by running our make build-dev target. Before it does that, though, it checks out our repository by executing an action called checkout, which is published on GitHub.
The second job is a little more complicated. It runs tests against our application as well as 3 linters (code quality checkers). Same as for the previous job, we use the checkout@v1 action to get our source code. After that we run another published action called setup-python@v1, which sets up a Python environment for us (you can find details about it here). Now that we have a Python environment, we also need the application dependencies from requirements.txt, which we install with pip. At this point we can proceed to run the make test target, which triggers our Pytest suite. If our test suite passes, we go on to install the linters mentioned previously: pylint, flake8 and bandit. Finally, we run the make lint target, which triggers each of these linters.
That’s all for the build/test job, but what about the pushing one? Let’s go over that too:
The first 4 lines define when we want this job to be triggered. We specify that this job should start only when tags are pushed to the repository (* specifies the pattern of the tag name, in this case anything). This is so that we don't push our Docker image to GitHub Package Registry every time we push to the repository, but rather only when we push a tag that specifies a new version of our application.
Now for the body of this job: it starts by checking out the source code and setting the RELEASE_VERSION environment variable to the git tag we pushed. This is done using the built-in ::set-env feature of GitHub Actions (more info here). Next, it logs into the Docker registry using the REGISTRY_TOKEN secret stored in the repository and the login of the user who initiated the workflow (github.actor). Finally, on the last line it runs the push target, which builds the prod image and pushes it into the registry with the previously pushed git tag as the image tag.
You can check out the complete code listings in the files in my repository here.
Code Quality Checks using CodeClimate
Last but not least, we will also add code quality checks using CodeClimate and SonarCloud. These get triggered together with our test job shown above. So, let’s add a few lines to it:
We start with CodeClimate, for which we first export the GIT_BRANCH variable, which we retrieve using the GITHUB_REF environment variable. Next, we download the CodeClimate test reporter and make it executable. We then use it to format the coverage report generated by our test suite, and on the last line we send it to CodeClimate with the test reporter ID, which we store in the repository secrets.
As for SonarCloud, we need to create a sonar-project.properties file in our repository, which looks like this (values for this file can be found on the SonarCloud dashboard in the bottom right):
Other than that, we can just use the existing sonarcloud-github-action, which does all the work for us. All we have to do is supply 2 tokens: the GitHub one, which is in the repository by default, and the SonarCloud token, which we can get from the SonarCloud website.
Note: Steps on how to get and set all the previously mentioned tokens and secrets are in the repository README here.
Conclusion
That’s it! With the tools, configs and code from above, you are ready to build and automate all aspects of your next Python project! If you need more info about the topics discussed in this article, go ahead and check out the docs and code in my repository here: https://github.com/MartinHeinz/python-project-blueprint. If you have any suggestions or issues, please submit an issue in the repository, or just star it if you like this little project of mine.
Bio: Martin Heinz is a DevOps Engineer at IBM. A software developer, Martin is passionate about computer security, privacy and cryptography, focused on cloud and serverless computing, and is always ready to take on a new challenge.
Original. Reposted with permission.