Data Science Using Docker

Wow, it’s been 2 years since I last wrote a blog post. Since then I finished work as a neuroscientist and started a data science job at SEEK. Loving the career change, I’m working in the AI platform services team where I use machine learning (ML) to help improve the efficiency of employment markets. I’m lucky to be surrounded by some excellent experienced data scientists and engineers and have learnt many useful skills. I plan to start blogging again to share some of the useful things I’ve picked up over the last 2 years.

One of the most useful and versatile tools I’ve picked up to help my data science workflow is Docker.

What’s so good about using Docker for interactive data science?

The thing I love about using Docker is that it has eliminated the hassle of re-installing software and managing package/library/module versions every time I want to train ML models on a different machine – no more fighting module conflicts! You can make a Docker image that has all your favourite data science tooling, and then use that image to easily build a container with your data science work environment that is identical every time, no matter what machine you build it on.

So no more “it worked when I ran it on my machine”!! ;-P

Some basic terminology

Dockerfile:

  • The ‘Dockerfile’ is a text file with simple code that describes how to build your Docker image.

Docker image:

  • The Docker ‘image’ bundles the base operating system and all the other programs you want (for example, in my case: Linux, R, dplyr, Python, fastText, xgboost, PyICU, etc.).

Docker container:

  • A running instance of an image is called a ‘container’. You can have one or many containers of the same image running on one or many physical host machines.
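
In command form, the relationship between the three looks like this (the image name my-image is just a placeholder):

docker build -t my-image .   # Dockerfile -> image
docker run my-image          # image -> running container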

For want of an analogy (analogies are rarely perfect, and this one is no exception), in terms of baking a cake:

  • the Dockerfile is your ingredients list,
  • the image is your recipe,
  • the container is the cake!
  • And the machine you are using is the oven.

You can bake as many cakes as you like with a given recipe. You can bake the exact same cake many times in different ovens. You can bake multiple cakes simultaneously in the same oven. As long as a machine has Docker installed, your Docker image is going to run and the container will work the same every time!

How to get Started

  1. Install Docker

  2. Test that the installation worked by running the simple hello-world Docker image:
    • in your terminal/command prompt, run: docker run hello-world
  3. To start with, you may like to use a pre-made public image created for data science tooling, pulled directly from Docker Hub. There are many of these available for free on Docker Hub, which has a nice search function (see the example pull command after this list).
  4. Alternatively, create your own custom Docker image, with the exact tooling and versions you want. Further below I discuss the files needed to edit and build my own custom Docker image.
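
For example, for step 3, the jupyter/datascience-notebook image on Docker Hub bundles Jupyter with Python, R and Julia tooling. Pulling and running it might look something like this (check Docker Hub for that image’s current tags and docs):

docker pull jupyter/datascience-notebook
docker run -p 8888:8888 jupyter/datascience-notebook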

My Docker image for training ML models

I made this Docker image to help with my data science workflow. Specifically, it allows me to quickly and easily set up the required versions of my tooling/packages (Python 3.6, R, dplyr, xgboost, fastText, PyICU, etc.) in a container on other machines.

The image can be pulled as-is directly from my public Docker repo using this terminal command:

  • docker pull danielpnewman/training-tools

Alternatively, you can update my Docker files and rebuild your own custom image using the steps below. :-)

Making your own Docker image, ideal for model training using Python 3.6, xgboost, fastText, PyICU, R, dplyr, and any other data science tools you like:

  1. Clone the files for my Docker image:
    • git clone git@github.com:DanielPNewman/training-docker-files.git
  2. If needed, update the Dockerfile with the software you require (a sketch of what this file might look like is shown after this list).

  3. If needed, update the requirements file with the Python packages you require (also sketched after this list).

  4. Build a local Docker image from the Dockerfile in the ~/training-docker-files directory. The command below tags the image as “danielpnewman/training-tools”, which can be changed to whatever name you like:

    • cd training-docker-files
    • docker build -t danielpnewman/training-tools .
  5. Put training data, scripts etc. into the local /to-mount directory and then mount it into the Docker container when you run it, using this command:

    • docker run --interactive --tty --volume $(pwd)/to-mount:/training/to-mount danielpnewman/training-tools

    • Note you can mount multiple directories:

      • docker run --interactive --tty --volume $(pwd)/to-mount:/training/to-mount --volume $(pwd)/scripts:/training/scripts danielpnewman/training-tools
  6. You can close the terminal of an active Docker session and then log back into it later using its CONTAINER ID (listed by docker container ls), e.g.:
    • docker exec -it d40b2796e7ca /bin/bash
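
As a reference for steps 2 and 3, here is a minimal sketch of what the Dockerfile and requirements file might look like. The base image, system packages and package list below are illustrative assumptions, not the exact contents of my repo:

# Dockerfile (sketch) - assumes the official python:3.6 base image
FROM python:3.6

# System dependencies: R, plus the ICU library that PyICU compiles against
RUN apt-get update && apt-get install -y \
    r-base \
    libicu-dev \
 && rm -rf /var/lib/apt/lists/*

# Python packages, listed in the requirements file
COPY requirements.txt /training/requirements.txt
RUN pip install -r /training/requirements.txt

# R packages
RUN Rscript -e 'install.packages("dplyr", repos = "https://cloud.r-project.org")'

WORKDIR /training
CMD ["/bin/bash"]

And a matching requirements file (pin the exact versions you need):

xgboost
fasttext
pyicu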

Very basic Docker cheat sheet

List Docker CLI commands

docker
docker container --help

Display Docker version and info

docker --version
docker version
docker info

Execute Docker image

docker run hello-world

List Docker images

docker image ls

List Docker containers (running, all, all in quiet mode)

docker container ls
docker container ls --all
docker container ls -aq
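
Stop and remove containers, and remove images (a handy extra beyond the basics above; the CONTAINER ID and image name are the examples from earlier)

docker container stop d40b2796e7ca
docker container rm d40b2796e7ca
docker image rm danielpnewman/training-tools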


Median Property Prices 2005-16 - PART 2!

A few weeks back I made a blog post with a nice little .gif showing the change over time in median Melbourne property prices ($) from 2005-2016 - see my previous blog post from 29 Sep 2016.

Well, I’ve just come back to that data set, and this time I’ve plotted the % change per annum and overall, as well as the absolute $ change, from 2005-2016 on some interactive plots. These plots allow you to zoom in, hover over a suburb to see more info, or click on a suburb to open a new window and explore that suburb in more detail.

The R code I used to make the plots below is here.

Explore below - it’s interesting to see that SYNDAL has the greatest per annum and overall % growth, however it’s TOORAK that by far has the highest absolute $ growth over the same period of time.


Using the plotly package to give your ggplot2 plots simple reactivity to user input

First make up some fake revenue data for a company with a number of shops operating in each State from 2012 to 2015:

### Install/load required packages
#List of R packages required for this analysis:
required_packages <- c("ggplot2", "stringr", "plotly", "dplyr")
#Install required_packages:
new.packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
#Load required_packages:
lapply(required_packages, require, character.only = TRUE)

#Set decimal points and disable scientific notation
options(digits=3, scipen=999) 

#Make up some fake data
df<-data_frame(state=rep(c("New South Wales", 
                 "Victoria", 
                 "Queensland",
                 "Western Australia",
                 "South Australia",
                 "Tasmania"), 36)) %>%
    group_by(state) %>%
    mutate(year=c(rep(2012, 9), rep(2013,9),rep(2014, 9),rep(2015, 9))) %>%
    group_by(state, year) %>%
    mutate(`store ID` = str_c("shop_#",as.character(seq_along(state)))) %>%
    group_by(state, year, `store ID`) %>%
    mutate(`Revenue ($)` =  ifelse(state=="New South Wales", sample(x=c(1000000:9000000), 1),
                            ifelse(state=="Victoria", sample(x=c(1000000:7000000), 1),
                            ifelse(state=="Queensland", sample(x=c(1000000:5000000), 1),
                            ifelse(state=="Western Australia",sample(x=c(100000:2000000), 1),
                            ifelse(state=="South Australia",sample(x=c(100000:900000), 1),       
                            ifelse(state=="Tasmania", sample(x=c(100000:2000000), 1), NA)))))))

Now visualise this data using ggplot:

ggplot(df, aes(state, `Revenue ($)`, colour=state, label = `store ID`)) +
    geom_boxplot() + 
    geom_point() +
    theme(axis.title.x =  element_blank(),
          axis.text.x  =  element_blank(), 
          axis.title.y = element_text(face="bold", size=12),
          axis.text.y  = element_text(angle=0, vjust=0.5, size=11),
          legend.title = element_text(size=12, face="bold"),
          legend.text = element_text(size = 12, face = "bold"),
          plot.title = element_text(face="bold", size=14)) + 
    ggtitle("Store Revenue per State from 2012 to 2015") +
    facet_wrap(~year)

Now make the plot reactive to the user’s mouse by wrapping plotly’s ggplotly() function around it:

p<-ggplotly(ggplot(df, aes(state, `Revenue ($)`, colour=state, label = `store ID`)) +
    geom_boxplot() + 
    geom_point() +
    theme(axis.title.x =  element_blank(),
          axis.text.x  =  element_blank(), 
          axis.title.y = element_text(face="bold", size=12),
          axis.text.y  = element_text(angle=0, vjust=0.5, size=10),
          legend.title = element_text(size=12, face="bold"),
          legend.text = element_text(size = 12, face = "bold"))+
    facet_wrap(~year))

p # print the plot object so the interactive widget displays


##Publish to plotly
# plotly_POST(p, filename = "dans_plotly_example")

These simple plots made using plotly and ggplot2 in R are great because they have some basic “reactivity” to user input (e.g. hover your mouse over a data point and a label appears with info about that point, such as its “store ID”), but they do not need to be hosted on a server - they are simple enough to be knitted into a stand-alone HTML document.
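
As a bonus, even outside of a knitted document you can save the interactive widget to a self-contained HTML file using the htmlwidgets package (the file name here is just an example):

# Save the interactive plotly widget as a stand-alone HTML file
htmlwidgets::saveWidget(p, "store_revenue.html", selfcontained = TRUE)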