Configuration Management in the era of Containers

Several independent events prompted me to write down my opinion on this matter. Among them (but not limited to): an opinionated blogpost describing The Evolution of Distributed Systems Management, which got quite some readers from Hacker News and the Devops Weekly newsletter and probably created a bit of a stir in the Configuration Management community, plus things that I noticed over the past few weeks.

Quite some people nowadays seem convinced that containers are the future and will (magically) solve all your problems. Surprise, surprise, that is not how it works. For example, say you come up with a container that runs nginx, which is all nice and dandy, and you decide to run it on Amazon's Elastic Container Service. Suddenly you get a peak traffic load and, oopsie, you see this in your error logs: open() [..] failed (24: Too many open files). You figure out that you actually had to think about the host system your container runs on as well, and that you need to bump the ulimits so that this does not happen again... I hope you didn't find out in production.
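
To make the ulimit point concrete, here is a minimal sketch of a check you could run on the host (or inside the container) before the traffic peak finds the problem for you. The function names, the reserve of 64 descriptors and the idea of comparing against an expected peak are my own assumptions for illustration; only the resource module calls are standard Python, and they are Unix-only.

```python
import resource  # Unix-only standard library module


def descriptor_headroom(soft_limit, expected_descriptors, reserve=64):
    """Descriptors left over at peak load; a negative value means the limit is too low.

    `reserve` is a hypothetical safety margin for log files, sockets to
    upstreams and so on. Pick your own number.
    """
    return soft_limit - (expected_descriptors + reserve)


def current_soft_limit():
    """Soft limit on open files for this process (what `ulimit -n` reports)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return float("inf")  # no limit configured
    return soft
```

Calling descriptor_headroom(current_soft_limit(), 2000) with a soft limit of 1024 gives a negative number, which is exactly the situation from the error log above. The check itself is trivial; the point is that it lives in code and runs everywhere, instead of living in somebody's head.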

The benefits of containers are pretty clear: build once somewhere, run everywhere. The thinking then goes that you can make system engineers obsolete, because you are all DevOps yourself now. The fact that you can run the exact same artifact in D, T, A and P (development, test, acceptance and production) is very powerful, I will not deny that. Nor will I deny that Docker made containers accessible to the masses, and that they did a way better job than LXC/LXD in getting mainstream traction for this. While under the hood most of these projects are still enabled by cgroups and other features in the (beautiful) Linux kernel, Docker has earned its right of existence in my opinion. It is clear that this is the direction we're currently heading, and that Kubernetes will most likely become and stay the preferred way to run your containers. This movement makes quite a lot more sense to me than the microservices hype we had before, since this is actually solving a practical problem and not creating dozens of new problems at the same time.

Yesterday, this repository appeared on the HN frontpage, describing Kubernetes Security Best Practices. To give you a small excerpt: "Contributors to this guide are running Kubernetes in production and worked on several K8s projects to learn about security flaws the hard way." and "Your cluster is as secure as the system running it." Nobody is stopping you from manually securing your system(s). But if there is one thing we learned from the previous configuration management hype, it is that manual human work is error-prone and easy to forget, and that it scales very poorly when you suddenly have to manage a bit more than 5-10 installations (no matter if these are dedicated servers or VMs, you really don't wanna do this by hand).

Apparently, to completely understand and run containers in a sane way, you still need Linux knowledge. Say what?! So the system engineers are not suddenly unemployed. It is easy to get containers or Kubernetes running, but you might end up with something insecure (see above) or performing poorly.

Look, I get that (most) configuration management tools are not sexy, and writing configuration might be less fun to you than writing actual code. But these tools are most likely written in an attempt to address the four principles of modern Release Engineering, being Identifiability, Reproducibility, Consistency and Agility. To clarify this with a quick example: some people believe that their servers should be in the timezone where the server is physically located (makes sense, right?). Some others believe they should never be set to anything else than UTC, and that you can localize it to your own time, e.g. via your browser or other tools (also makes sense, especially when you start taking things like Daylight Saving Time into account). Do you care which one of the previous statements makes more sense? No, you don't. What you do care about is that, if you see timestamps somewhere, you know how to interpret them. How can you achieve this, besides asking your friendly SysOps/DevOps colleague who set up the server for you and writing it down on a (digital) sticky note afterwards? Eureka, configuration management, infrastructure-as-code, here we come!
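
As a small illustration of "knowing how to interpret timestamps", the sketch below reports the host clock with an explicit UTC offset and tells you whether the host is on UTC. The function name and the shape of the returned dictionary are made up for this example; a configuration management tool would enforce the setting, this only shows what there is to check.

```python
from datetime import datetime, timedelta


def describe_host_clock(now=None):
    """Return the current time with its UTC offset, plus whether that offset is zero.

    `now` must be a timezone-aware datetime; it defaults to the host's local time.
    """
    if now is None:
        now = datetime.now().astimezone()  # attach the host's local timezone
    return {
        "iso": now.isoformat(),                  # always carries the offset
        "is_utc": now.utcoffset() == timedelta(0),
    }
```

A timestamp that carries its own offset can be interpreted without asking anybody, whichever of the two camps configured the server.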

Another interesting example: GitLab introduced Auto DevOps recently and Jenkins has also supported containers for a while. So your development team writes a build job, and (obviously) makes some implicit assumptions in these run.sh scripts (or YAML, otherwise), for example that curl is available on the host system (not the container, that's within your own control). On Jenkins Slave 1 this is true, because a SysOps/DevOps member installed it by hand upon somebody's request. On Jenkins Slave 2 it is true because it runs on a Linux distribution that happens to ship curl by default (yay!). On Jenkins Slave 3 it is not available at all. On Jenkins Slave 4 a development team member installed it themselves, because the SysOps/DevOps team wouldn't do it for them and it is a computer under their desk anyways on which they have root access. And Jenkins Slave 5 runs an operating system that has no installation candidate for curl at all. This might sound like a joke, but it definitely is not. Maybe you say: yeah, well, curl should not be there by default, only wget should be! Sure, thanks for your knowledge, but what is this based on, and where is this knowledge stored? Or maybe you are an extremist and you say: curl, wget, that is all for noobs, you can only have netcat on there. Where is the line here? (Hint: in this example there is no line, your infrastructure is a mess consisting of all kinds of different snowflakes and computers under people's desks, and you might want to improve that.)

After reading this example, you might think that I am a mister know-it-all who tries to make everybody else look dumb, but that is not the case (or at least not my purpose). I am genuinely asking you, the readers: what is the line? Which of the following commands can you expect to be present on a host system, without knowing the operating system or Googling the answer beforehand, and while not using configuration management tools? git, diff, curl, grep, wc, awk, xargs, sed, cut, cat, head?

Maybe you firmly believe everything should be run in a container, including things up to ls. That is fine! But as a development team member, I would want to know. And I would want to know beforehand, not by you telling me after my build has been broken because the Jenkins Slaves are picked by a round-robin algorithm and you have been debugging this for an hour or so. Infrastructure-as-code, configuration management, could help you out here. Make some rules, pick a standard, then enforce it everywhere. Don't create snowflakes. Manual work is error-prone. Humans make mistakes. Life is good!
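
To show what "pick a standard, then enforce it" could look like at its very smallest, here is a sketch that checks a build host against an agreed list of commands. The list itself is a hypothetical team standard that I made up for this example, as are the names; shutil.which is the standard Python way to look a command up on the PATH. A real configuration management tool would install what is missing, this only detects the snowflakes.

```python
import shutil

# Hypothetical team standard: the commands every build host must provide.
REQUIRED_COMMANDS = ["git", "curl", "grep", "awk", "sed"]


def missing_commands(commands, which=shutil.which):
    """Return the commands from `commands` that cannot be found on the PATH.

    `which` is injectable so the check can be exercised without touching
    the real host.
    """
    return [name for name in commands if which(name) is None]
```

Running missing_commands(REQUIRED_COMMANDS) on every Jenkins Slave as the first build step would at least make the job fail with a clear message on Slave 3 and Slave 5, instead of halfway through a run.sh script.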

The Dockerfile approach makes it easy for you to describe how a container should be built, but if you write more than N lines (fill in yourself) in your Dockerfile, aren't you just going back 10 years, to when we provisioned servers using install.sh bash scripts? How is that an improvement over configuration management tools, and why does it seem like containers and configuration management tools are each other's opponents? I am not saying that it is sensible to run configuration management tools inside your Docker containers; I think you should (at least) not have the agents running in there, as the whole idea is that containers are not VMs and they should be mostly immutable after being created in a certain way. But I don't think that running something like chef-solo, masterless Puppet or Ansible/Salt while doing a one-off docker build is that crazy. Of course, I could also be wrong in the long term, this is a blogpost after all.

I just want to trigger people's critical thinking process with this blogpost. It would be a waste if the new hype (containers) means a former hype (configuration management) has to be thrown in the trash can. There are valuable lessons learned from the past, so why would you ignore or neglect them? Things like ulimits and sysctl tweaks, firewalls, monitoring/alerting, backups and unattended-upgrades are not suddenly solved by running your applications in containers. Security, standardization and best practices start with reproducible infrastructure and as little manual work as possible, I believe.
