Containers are ephemeral, here one minute and gone the next without so much as a goodbye. Of course, containers were designed to be lightweight and temporary right from the beginning, and if all you’re doing is running lightweight web applications, this isn’t a problem. The problem arises when you’re a bank or an insurance company that’s running databases and can’t afford to throw away your data on a daily basis. The interesting thing here is that Docker doesn’t really have a full-fledged native persistent storage option, though users have found a few different ways around the problem. To someone who works on databases, containers losing all of their data every time they’re deleted is a nightmare, closer to a system catastrophe than to a solution.
Although containers are typically short-lived and uncoupled from any persistent resources, the ability to ensure that data outlives a container is crucial to many applications. There are a few workarounds to permanently store data from containers, though they’re not all as straightforward as we’d like them to be. Containers are just little mini-environments in a box, and without data or code running in them they don’t even exist. They pop up when there’s a job to do, and then disappear, exactly as designed. Data today is extremely valuable, though, and people like to have a record somewhere of what’s happening. Docker data volumes are the first solution we’re going to look at, and they’re quite easy to set up as well.
Docker data volumes
Rather than injecting containers with code every time they run, a Docker data volume is nothing but a directory on the host, much like a DOS directory such as C:/DOS/GAMES. The directory is used as a mount point and can be defined by the user before the containers start up. It sits in the host file system and persistently stores data for a container. Volumes are normally given random 64-character names, like the temp files in Windows, but it’s good practice to change those to something more manageable.
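A minimal sketch of the workflow, assuming a running Docker daemon (the volume name, container name, and image here are purely illustrative):

```shell
# Create a named volume up front instead of letting Docker
# generate one of those random 64-character identifiers
docker volume create app-data

# Mount it into a container; anything written under /var/lib/mysql
# lands on the host and survives the container's removal
docker run -d --name db -v app-data:/var/lib/mysql mysql:5.7

# Inspect the volume to see where it actually lives on the host
docker volume inspect app-data
```

The `Mountpoint` field in the inspect output shows the host directory backing the volume.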
The reason the data volume sits on the host and not inside the container is that it’s managed outside the Docker storage driver. This makes everything a lot faster, since a lot of load is taken off the storage driver, which would otherwise have to keep track of all the changes and differences. Linux containers are built from a base image, so any data baked into that image basically gets replicated in every single container and increases the resource footprint considerably. To avoid that, containers use an overlay file system to implement a copy-on-write process that compares new information to the base image and saves just the changes. Those changes are then discarded when the container terminates, and hence the dilemma.
Data volumes created this way are permanent and will exist whether the containers are still running or not. The drawback, however, is that deleting a container does not delete its volumes, which leads to a little problem called “orphan volumes,” or in other words, volumes with no associated containers, quietly eating disk space until someone cleans them up.
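The Docker CLI can find and sweep up those orphans; a quick sketch, again assuming a running daemon (the volume name is a made-up example):

```shell
# List volumes that no container references anymore
docker volume ls -f dangling=true

# Remove a single orphan by name, or sweep them all in one go
docker volume rm old-data
docker volume prune
```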
Docker data containers
An alternate and more interesting method: rather than having data externally available from a volume or loaded onto the base container image, assign a container just for storage and have all the other containers access it like an external storage device. Just to be clear, this container does not run any application code but acts like a mounted data volume from which other containers can access data. Another advantage is that you are now effectively reusing the same volume across all the containers rather than creating new random 64-character volumes. The important thing to remember, however, is that if the data container is removed along with its volume, that’s the end of your data, so a backup script needs to be executed before that happens.
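The classic pattern looks something like this, sketched with illustrative names and a stock image (note the data container is created but never actually started):

```shell
# A container whose only job is to own the /dbdata volume
docker create -v /dbdata --name dbstore postgres:9.6 /bin/true

# Application containers borrow its volume with --volumes-from
docker run -d --name db1 --volumes-from dbstore postgres:9.6

# Back up the volume before anything drastic happens: mount it into
# a throwaway container alongside a host directory and tar it up
docker run --rm --volumes-from dbstore -v "$(pwd)":/backup ubuntu \
    tar cvf /backup/dbdata.tar /dbdata
```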
In both scenarios above, the storage volume lives within the Docker file structure at /var/lib/docker/volumes. In the third method, we’re going to mount an external directory from the host into the container the same way we did earlier, except now the containers have access to an entire host directory, which can be used for storing data by one or more containers at the same time. This is also useful when you want to test code: all you need to do is load the code into the directory before you mount it, and once it’s running you can make changes to the code and see the effects in a controlled environment. One thing to remember is that containers have read/write access by default, so if you give them one of your critical directories to play with, make sure you change that access to read-only or be prepared for the consequences.
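The read/write versus read-only distinction is a one-suffix change; a sketch with made-up host paths and a stock image:

```shell
# Bind-mount a host directory; the container gets read/write access by default
docker run -d --name devbox -v /srv/app/code:/usr/src/app node:8

# Append :ro to the mount to make it read-only inside the container
docker run -d --name devbox-ro -v /srv/app/code:/usr/src/app:ro node:8
```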
Finally, we come to the interesting storage solutions. Though storage has been considered an Achilles’ heel for Docker, it’s a brilliant opportunity for young enterprise startups to step up. Docker is excellent, but what’s even better is that it leaves huge gaps in the stack for new tools and solutions to make themselves useful. Though Docker has Swarm now, the original lack of orchestration software made Kubernetes what it is today (much to Docker’s jealousy, but that’s a whole other post). The lack of persistent storage is nothing but a gift to the enterprise saying, come join the party if you can.
Answering that call, we now have a host of startups and big enterprises coming up with solutions that address the persistent storage requirements of containers. Apart from the plugin architecture itself, Docker has also designed an interface and API that allow storage vendors to build drivers to this effect. You can’t really ask for more from an offering that’s up the food chain in your ecosystem.
The recent DockerCon 2017 brought two major storage announcements. The first came from London-based StorageOS, which released a beta of its persistent storage offering for containers; it will be available as a free 40MB plugin once beta testing is done.
The other offering was from Nimble Storage, and it’s an actual physical product rather than a software plugin. Nimble’s offering, called MultiCloud Flash Fabric, is a storage device based on Nimble’s predictive flash platform. With flash performance, predictive analytics, and the ability to swap out or add hard disks without disrupting the system, this is a robust storage offering. Nimble’s flash storage solution costs north of $40,000, but it is the ultimate fix for Docker’s storage issue, as it lets developers work on, ship, and run with production data. It truly is a portable data stack to match Docker.
Storage plugins work by mapping storage from a single host to an external storage service like a traditional storage device or one of Nimble’s flash arrays. If a container switches hosts, however, that mapping is lost, and that’s where our third storage solution steps in: Flocker. Flocker is not a storage solution but rather a platform, developed by ClusterHQ, that manages and automates the relocation of containers from host to host. A lot of big enterprise names are writing code for the Flocker API, including Dell, EMC, NetApp, and Pure Storage, further validating it as an enterprise-ready tool for Docker.
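From the user’s side, consuming a volume plugin is a one-flag change; a sketch, assuming the plugin registers itself under the driver name `flocker` (the actual driver name, volume name, and image below vary by vendor and setup):

```shell
# Create a volume backed by the external storage driver
docker volume create -d flocker db-volume

# Containers then mount it exactly like a local named volume
docker run -d -v db-volume:/var/lib/postgresql/data postgres:9.6
```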
Docker’s success can be significantly attributed to the tools in its ecosystem that fill the gaps in the stack, and those tools, in turn, have the gaps to thank for their success. With the entire enterprise world backing Docker and jumping on every opportunity to make it better, storage is definitely not going to remain a problem for containers going forward.