Data Science Using Docker
30 Oct 2018

Wow, it’s been 2 years since I last wrote a blog post. Since then I finished work as a neuroscientist and started a data science job at SEEK. I’m loving the career change: I’m working in the AI platform services team, where I use machine learning (ML) to help improve the efficiency of employment markets. I’m lucky to be surrounded by some excellent, experienced data scientists and engineers and have learnt many useful skills. I plan to start blogging again to share some of the useful things I’ve picked up over the last 2 years.
One of the most useful and versatile tools I’ve picked up to help my data science workflow is Docker.
What’s so good about using docker for interactive data science?
The thing I love about using docker is that it has eliminated the hassle of re-installing software and managing package/library/module versions every time I want to train ML models on a different machine – no more fighting module conflicts! You can make a docker image that has all your favourite data science tooling, and then use that image to build a container with a data science work environment that is identical every time, no matter what machine you build it on.
So no more “it worked when I ran it on my machine”!! ;-P
Some basic terminology
Dockerfile:
- The ‘dockerfile’ is a text file with simple code that describes your docker image.
Docker image:
- The docker ‘image’ describes the base operating system and all the other programs you want (for example, in my case: linux, R, dplyr, python, fastText, xgboost, PyICU etc.).
Docker container:
- A running instance of an image is called a ‘container’. You can have one or many containers of the same image running on one or many physical host machines.
For want of an analogy… analogies are rarely perfect and this one is no exception, but in terms of baking a cake:
- the dockerfile is your ingredients list,
- the image is your recipe,
- the container is the cake!
- And the machine you are using is the oven.
You can bake as many cakes as you like with a given recipe. You can bake the exact same cake many times in different ovens. You can bake multiple cakes simultaneously in the same oven. As long as a machine has docker installed, your docker image is going to run and the container will work the same every time!
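To make the analogy concrete, here is a minimal illustrative dockerfile – the base image and packages below are just examples to show the idea, not the file I actually use:

```dockerfile
# The "ingredients list": start from an official Python base image
FROM python:3.6

# Bake the data science tooling you want into the image
RUN pip install pandas xgboost

# Default command when a container (the "cake") starts from this image
CMD ["python"]
```

Running `docker build -t my-image .` in the directory containing this file produces the image (the recipe), and `docker run -it my-image` bakes a cake – it starts a container from that image.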
How to get started
- Install Docker if you haven’t already, then test the installation worked by running the simple hello-world Docker image in your terminal/command prompt:

docker run hello-world
- To start with, you may like to use a pre-made public image created for data science tooling, pulled directly from Docker Hub. There are many of these available for free on Docker Hub, which has a nice search function.
- for example, you could download this Jupyter Notebook Data Science Stack using
docker pull jupyter/datascience-notebook
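Once pulled, you can start the notebook server in a container. Following the jupyter/datascience-notebook docs, you publish port 8888 so you can open the notebook in your browser:

```shell
# Start a container from the image, mapping the notebook port to your machine
docker run -p 8888:8888 jupyter/datascience-notebook
# Then open the http://localhost:8888/?token=... link printed in the terminal
```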
- Alternatively, create your own custom docker image with the exact tooling and versions you want. For example, below are the files I edit and the steps I use to build my own custom docker image:
My docker image for training ML models
I made this docker image to help with my data science workflow. Specifically, it allows me to quickly and easily set up the required versions of my tooling/packages (python3.6, R, dplyr, xgboost, fastText, PyICU etc.) in a container on other machines.
The image can be pulled as-is directly from my public docker repo using the terminal command:
docker pull danielpnewman/training-tools
Alternatively, you can update my docker files and rebuild your own custom image using the steps below. :-)
Making your own docker image, ideal for model training using Python3.6, xgboost, fastText, PyICU, R, dplyr, and any other data science tools you like:
- Clone the files for my docker image:

git clone git@github.com:DanielPNewman/training-docker-files.git
- If needed, update the Dockerfile with the required software.

- If needed, update the requirements file with the required python packages.
- Build a local docker image from the Dockerfile in the ~/training-docker-files directory. The code below tags the image as “danielpnewman/training-tools”, which can be changed to whatever name you like:
cd training-docker-files
docker build -t danielpnewman/training-tools .
- Put training data, scripts etc. into the local /to-mount directory and then mount it into the docker container when you start it, using this command:

docker run --interactive --tty --volume $(pwd)/to-mount:/training/to-mount danielpnewman/training-tools
- Note you can mount multiple directories:

docker run --interactive --tty --volume $(pwd)/to-mount:/training/to-mount --volume $(pwd)/scripts:/training/scripts danielpnewman/training-tools
- You can close the terminal of an active docker session and then log back into it later using its CONTAINER ID (listed by docker container ls), e.g.:

docker exec -it d40b2796e7ca /bin/bash
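As an aside on the requirements step above: it is a standard pip requirements file, and pinning exact versions is what makes the rebuilt environment identical every time. A sketch of what that looks like (the versions below are illustrative, not the ones from my actual repo):

```
# requirements.txt - pin exact versions so every image build is identical
# (versions here are illustrative examples only)
numpy==1.15.2
pandas==0.23.4
xgboost==0.80
```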
Very basic docker cheat sheet
List Docker CLI commands
docker
docker container --help
Display Docker version and info
docker --version
docker version
docker info
Execute Docker image
docker run hello-world
List Docker images
docker image ls
List Docker containers (running, all, all in quiet mode)
docker container ls
docker container ls --all
docker container ls -aq
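Two more housekeeping commands I’d add to the list – stopping and removing containers and images once you’re done with them (the container ID and image name below are just examples):

```shell
# Stop a running container, then remove it
docker container stop d40b2796e7ca
docker container rm d40b2796e7ca
# Remove an image you no longer need
docker image rm danielpnewman/training-tools
```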