Learn How to Containerize Your R Notebooks with Docker

Have you ever written an R Notebook that worked on your local machine, only to share it with a colleague and find out that it broke? It’s an all too common story. In this article, I’ll show how to write reporting R Notebooks the right way – by containerizing them with Docker.

Docker is a platform that supports the development of isolated and reproducible environments. With it, you can wrap up your R Notebook code and dependencies so that it works on any machine.

But what does that mean? Well, every computer is a little different: different operating systems, libraries, package versions and so on. Given these differences, code written on one computer may not run properly on another. Docker provides a ‘virtual environment’, a space where everything can run the same way every time, on every machine. For anyone developing software, this is a game changer.

Anyway, let’s get started!

R Notebook (R Markdown)

Of course, to start, you’ll need your R Notebook, which performs an analysis, reports results, and saves the outputs. In case you are unsure, this is your typical .Rmd file that can be used for generating HTML reports. If you don’t know how to create an R Markdown file, here is an introductory tutorial on the R website for reference.

Producing a reproducible R report is very different from just creating a report. To effectively make your report reproducible, we should take note of the following properties of our notebook:

Inputs: Does your notebook require any input arguments?
Dependencies: Which libraries are needed for its execution?
Outputs: Does your notebook write output data, e.g., in the form of an HTML document?
Data: Where is the required data coming from? Are there any associated keys and secrets?

For this exercise, I am using a notebook called reporting_template.Rmd. You’ll find references to this name in the code below. The contents of the notebook are not important – it just loads some random libraries, loads some data from Snowflake, prints ‘hello world’ and exits.

Therefore:

The notebook has several unique dependencies that need to be managed in an automated way.
It also fetches data from a centralized Snowflake warehouse, ensuring that the analysis is always up-to-date. This step isn’t strictly necessary, but by using a remote database, we ensure that no data is shared manually, improving reproducibility and security.

Other than that, there’s nothing special about the report. It’s just a basic R Notebook. Replace reporting_template.Rmd with your own R Notebook (Markdown) as needed.

Dependency Management with ‘renv’

In R, I like to manage dependencies using the package renv. renv is an R package designed to create and manage project-specific libraries. It helps ensure that your R projects are reproducible by tracking and isolating dependencies. Basically, every library that you install will be tracked with renv so that anyone using the notebook doesn’t have to take manual actions to reproduce your work.

As a preparatory step, we’ll need to create this R environment before we can build our Docker image.

Here’s how it works.

1. Install the renv package:

install.packages("renv")

2. ‘Initialize’ the project:

library(renv)
renv::init()

3. Install any necessary packages:

renv::install("example_package_name")

4. Use the ‘snapshot’ command to save the project’s state:

renv::snapshot()

Our Docker image will use this saved state generated above. This means that within our Docker image, we’ll need to add the following ‘restore’ command. This command will ‘load’ our R environment within the Docker container.

renv::restore()
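
Since renv::restore() reads the renv.lock file that renv::snapshot() writes, it can be worth confirming the lockfile is present before building the image. The following is only a rough sketch: check_renv_project is a hypothetical helper name, and it assumes the lockfile sits at the top level of the project folder.

```shell
#!/bin/sh
# Hypothetical helper: confirm that renv.lock exists in a project directory.
# Assumes renv::snapshot() was run from the top level of that directory.
check_renv_project() {
    project_dir="${1:-.}"
    if [ ! -f "$project_dir/renv.lock" ]; then
        echo "renv.lock not found in $project_dir; run renv::snapshot() first" >&2
        return 1
    fi
    echo "renv.lock found in $project_dir"
}
```

Calling check_renv_project . from the project folder before the build gives an early, readable failure instead of an error halfway through the image build.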

Fetching Data

As mentioned earlier, we should ensure that no data is shared manually to support reproducibility and improve security. To that end, we’ll store and retrieve the data from the analytics database Snowflake. Of course, our code will need keys and passwords to access the data, so we’ll need to support this in our code. Never put your passwords into the code. I can’t stress that enough.

You’ll notice that the DB passwords in the code below have been replaced by placeholder variables. These variables can be passed to the Docker container at runtime via the command line.
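
One way to keep the credentials out of the code is to hold them in environment variables and check they are set before starting a run. The snippet below is a minimal sketch of that idea. require_env is a hypothetical helper, and DB_ID and DB_PASSWORD are the variable names used in this article.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every named environment variable
# is set and non-empty.
require_env() {
    for name in "$@"; do
        # Indirect variable lookup that works in plain POSIX sh.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing required environment variable: $name" >&2
            return 1
        fi
    done
}

# Example: check the two credentials before starting a report run.
if require_env DB_ID DB_PASSWORD; then
    echo "Credentials found in the environment."
else
    echo "Export DB_ID and DB_PASSWORD first." >&2
fi
```

The check fails early with a clear message, and the values themselves never appear in the script.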

Regarding the database – Snowflake is just one of many possible options here. Your database will very likely be different, depending on your company’s tech infrastructure. Replace the Snowflake steps in the container with your own requirements.

The Docker Container

FROM rocker/r-apt:bionic

USER root
RUN apt-get update
RUN apt upgrade -y
RUN apt-get install curl -y

RUN apt-get update && \
    apt-get -y install libgdal-dev

# Copy the local files
COPY . .

# Install pandoc
RUN apt-get install pandoc -y

# Download the Snowflake ODBC driver
RUN curl "https://sfc-repo.snowflakecomputing.com/odbc/linux/latest/snowflake_linux_x8664_odbc-2.25.3.tgz" -o ./snowflake_linux_x8664_odbc-2.25.3.tgz 

# Unzip and untar the snowflake odbc driver
RUN tar -xvzf ./snowflake_linux_x8664_odbc-2.25.3.tgz

# Copy the driver to a new folder for reference
RUN mkdir /odbc
RUN cp -R ./snowflake_odbc /odbc/snowflake_odbc
RUN /odbc/snowflake_odbc/unixodbc_setup.sh

# Create the reports directory
RUN mkdir reports

# Load the environment
RUN R -e "install.packages('renv', repos = 'http://cran.us.r-project.org')"
RUN R -e "renv::restore()"

# Run the markdown with the command line arguments
CMD R -e "rmarkdown::render('reporting_template.Rmd', output_file='/reports/Report.html', params=list(db_id='${DB_ID}', db_password='${DB_PASSWORD}'))"

Docker Container (Breakdown)

Let’s go through some of the key components of this code. Firstly, the block below loads a base image with R installed. A base image is a pre-built starting point, so you don’t have to specify absolutely everything from scratch. On top of that, I’ve installed some basic Unix prerequisites needed to run the markdown.

FROM rocker/r-apt:bionic

USER root
RUN apt-get update
RUN apt upgrade -y
RUN apt-get install curl -y

RUN apt-get update && \
    apt-get -y install libgdal-dev

Next, this line installs pandoc, which R Markdown uses to generate HTML.

# Install pandoc
RUN apt-get install pandoc -y

Then we install the drivers required for interacting with the remote Snowflake database. Replace this code with your own database drivers.

# Download the Snowflake ODBC driver
RUN curl "https://sfc-repo.snowflakecomputing.com/odbc/linux/latest/snowflake_linux_x8664_odbc-2.25.3.tgz" -o ./snowflake_linux_x8664_odbc-2.25.3.tgz 

# Unzip and untar the snowflake odbc driver
RUN tar -xvzf ./snowflake_linux_x8664_odbc-2.25.3.tgz

# Copy the driver to a new folder for reference
RUN mkdir /odbc
RUN cp -R ./snowflake_odbc /odbc/snowflake_odbc
RUN /odbc/snowflake_odbc/unixodbc_setup.sh

Then, we can load the dependencies required for our markdown. Notice the ‘restore’ function that was described earlier.

# Load the environment
RUN R -e "install.packages('renv', repos = 'http://cran.us.r-project.org')"
RUN R -e "renv::restore()"

Finally, the output report is generated with the last line of the image. As noted earlier, my markdown is called ‘reporting_template.Rmd‘ and creates an output file called ‘Report.html‘. The database id and password are handed to the notebook as R Markdown parameters, so the notebook must declare db_id and db_password under params in its YAML header.

Notice that this last line is CMD, rather than RUN. Docker RUN commands are executed during the build (setup) process. The CMD command is executed at runtime, i.e. when the container starts and produces a report.

# Run the markdown with the command line arguments
CMD R -e "rmarkdown::render('reporting_template.Rmd', output_file='/reports/Report.html', params=list(db_id='${DB_ID}', db_password='${DB_PASSWORD}'))"
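
A CMD written in this form is run through a shell when the container starts, which is why ${DB_ID} and ${DB_PASSWORD} are filled in at runtime rather than at build time. The sketch below imitates that behaviour locally without Docker. The echo command is only a stand-in for the render call, and analyst_a and analyst_b are made-up values.

```shell
#!/bin/sh
# Stand-in for the CMD line. Single quotes stop this script from expanding
# the variable, so the inner `sh -c` expands it when it starts, much as the
# shell inside the container would.
render_cmd='echo "db_id=${DB_ID}"'

# First "container start" with one value...
first="$(DB_ID=analyst_a sh -c "$render_cmd")"
# ...and a second start with another value; the command itself is unchanged.
second="$(DB_ID=analyst_b sh -c "$render_cmd")"

echo "$first"
echo "$second"
```

The same stored command produces a different result on each start, depending only on the environment it is given.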

How to Generate a Report

Everything is now set up to be fully reproducible. The only requirement to reproduce the report is that the user has Docker installed on their system (and has database access passwords).

However, there are a couple of small steps remaining.

First, we must build the Docker image from the file we created above. Building an image means that we run all the setup steps once and save the result into an ‘image’.

You’ll need to have Docker installed on your system and then, from the command line, run the command below. If you don’t have command line access, or are not sure how to use it, Docker also has a Desktop version.

docker build -t reporting_image .

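Next, create an output directory ‘LOCAL_DIR’ where you will store the results. Use its absolute path in the command below, since Docker treats a bare name as a named volume rather than a folder. The generated report will be written to this location.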

Then, simply run the following command:

docker run -v LOCAL_DIR:/reports -e DB_ID="enter the id here" -e DB_PASSWORD="enter the snowflake password here" reporting_image

You’ll notice that the database id and password are passed in at runtime. The results are created and saved to your LOCAL_DIR. Easy!
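
If you run reports often, the run step could be wrapped in a small helper. The following is a sketch only: report_run_command is a hypothetical name, it assumes the image is tagged reporting_image as above, and it assumes DB_ID and DB_PASSWORD are already exported in your shell. It prints the command instead of running it, so you can inspect it first.

```shell
#!/bin/sh
# Hypothetical helper: create the output directory and print the matching
# `docker run` command. It only echoes the command and does not start Docker.
report_run_command() {
    out_dir="$1"
    mkdir -p "$out_dir" || return 1
    # Resolve to an absolute path, since Docker treats a bare name in -v
    # as a named volume rather than a folder on the host.
    abs_dir="$(cd "$out_dir" && pwd)" || return 1
    # `-e NAME` without a value forwards NAME from the calling shell, so the
    # password never appears on the command line. `--rm` removes the
    # container once the report has been written.
    echo docker run --rm -v "$abs_dir:/reports" -e DB_ID -e DB_PASSWORD reporting_image
}
```

For example, report_run_command ./reports_out prints the full command with the resolved path, ready to copy and run.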

Summary

As you can see, it’s relatively straightforward to set up a Docker container for your R Notebook. In doing so, you’ll make the lives of your colleagues a lot easier, as they can reproduce your work almost instantly without any tedious manual setup.

I highly recommend containerizing your work and applications, particularly where the work needs to be reproduced, shared or deployed.

Aaron Pickering