Docker
According to the Docker Documentation:
Docker is a platform for developers and sysadmins to develop, deploy, and run applications with containers. The use of Linux containers to deploy applications is called containerization. Containers are not new, but their use for easily deploying applications is. Containerization is increasingly popular because containers are:
- Flexible: Even the most complex applications can be containerized.
- Lightweight: Containers leverage and share the host kernel.
- Interchangeable: You can deploy updates and upgrades on-the-fly.
- Portable: You can build locally, deploy to the cloud, and run anywhere.
- Scalable: You can increase and automatically distribute container replicas.
- Stackable: You can stack services vertically and on-the-fly.
For this course, we use Docker primarily to ensure every student is using the exact same platform for their applications, and to avoid certain platform-specific issues and peculiarities.
You are not required to use Docker for this lab if you feel comfortable setting up the required tools on your own system.
A basic understanding of some Docker concepts helps in getting started with this course. Part 1: Orientation and setup of the Get Started Guide covers the basic Docker concepts used in this course.
Before trying the lab assignments and tutorials in the next sections, make sure you Install Docker (stable) and test your installation by running the simple hello-world image:
docker run hello-world
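If the hello-world image runs fine but you want to double-check that both the Docker client and the daemon are working, the standard docker version command reports both (this is a generic Docker check, not something specific to this lab):

# Prints client and server (daemon) versions; the server section only
# appears if the Docker daemon is actually running.
docker version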
Setting up Spark in Docker
In order to run Spark in a container, a Dockerfile is provided in the root of all repositories we will use in this lab, including the repository for the Getting Started guide. The Dockerfile can be used to build images for spark-submit to run your Spark application, spark-shell to run a Spark interactive shell, and the Spark history server to view event logs from application runs. You need to build these images before you get started. The Dockerfiles we provide assume that you run Docker from the folder in which they are located. Don't move them around! They will stop working.
To build a Docker image from the Dockerfile, we use docker build:

docker build --target <target> -t <tag> .

Here <target> selects the build target from the Dockerfile, <tag> sets the tag for the resulting image, and the trailing . sets the build context to the current working directory.
We use docker to pull and build the images we need to use Spark and SBT.

- sbt
  docker pull hseeberger/scala-sbt:11.0.12_1.5.5_2.12.14
  docker tag hseeberger/scala-sbt:11.0.12_1.5.5_2.12.14 sbt
- spark-shell
  docker build --target spark-shell -t spark-shell .
- spark-submit
  docker build --target spark-submit -t spark-submit .
- spark-history-server
  docker build --target spark-history-server -t spark-history-server .
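After pulling and building, you can check that all four images are available under the expected tags with the standard docker image ls command:

# List local images; sbt, spark-shell, spark-submit, and
# spark-history-server should all appear as repository names.
docker image ls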
You can then run the following commands from the Spark application root (the folder containing the build.sbt file). Please make sure to use the provided template project.

The commands below are provided as a reference, and they will be used throughout the rest of this guide. You do not have to run them now, because some of them (e.g. spark-submit) require additional parameters that we will provide later in the manual.
- To run SBT to package or test your application (sbt <command>):
  docker run -it --rm -v "`pwd`":/root sbt sbt
- To start a Spark shell (spark-shell):
  docker run -it --rm -v "`pwd`":/io spark-shell
- To run your Spark application (spark-submit) (fill in the class name of your application and the name of your project!):
  docker run -it --rm -v "`pwd`":/io -v "`pwd`"/spark-events:/spark-events \
    spark-submit --class <YOUR_CLASSNAME> \
    target/scala-2.12/<YOUR_PROJECT_NAME>_2.12-1.0.jar
- To spawn the history server to view event logs, accessible at localhost:18080:
  docker run -it --rm -v "`pwd`"/spark-events:/spark-events \
    -p 18080:18080 spark-history-server
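Note that the spark-submit and history server commands bind-mount a spark-events folder from the application root. It may be worth creating that folder up front, since Docker otherwise creates missing bind-mount directories on the host itself (owned by root on Linux):

# Create the event-log directory in the application root before the
# first spark-submit or history server run (run once per project).
mkdir -p spark-events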
Further on in the manual, we will generally not spell out the full Docker commands this explicitly again, so keep in mind that if we mention e.g. spark-shell, you should run the corresponding docker run command listed above. You can create scripts or aliases for your favorite shell to avoid having to type a lot; a small sketch follows below.
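As an example of such aliases, the snippet below is a minimal sketch for a Bash-like shell, assuming the image names used above; you could add it to your ~/.bashrc or ~/.zshrc:

# Hypothetical convenience aliases wrapping the docker run commands above.
# Arguments you type after the alias (e.g. "sbt test") are appended as usual.
alias sbt='docker run -it --rm -v "`pwd`":/root sbt sbt'
alias spark-shell='docker run -it --rm -v "`pwd`":/io spark-shell'
alias spark-submit='docker run -it --rm -v "`pwd`":/io -v "`pwd`"/spark-events:/spark-events spark-submit'
alias spark-history-server='docker run -it --rm -v "`pwd`"/spark-events:/spark-events -p 18080:18080 spark-history-server'

With these in place, running e.g. sbt test from the project root executes the tests inside the container.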