Introduction
In this tutorial, we will train a machine learning model to identify water in Sentinel-2 satellite images, using code from this GitHub repo and this dataset.
First, find a target executor.
╭──────────────────┬────────────────────┬────────────────┬─────────────────────╮
│ NAME │ TYPE │ LOCATION │ LAST HEARD FROM │
├──────────────────┼────────────────────┼────────────────┼─────────────────────┤
│ icekube │ container-executor │ RISE, Sweden │ 2024-02-28 20:05:45 │
│ dev │ container-executor │ Rutvik, Sweden │ 2024-02-28 20:05:45 │
│ leonardo-booster │ container-executor │ Cineca, Italy │ 2024-02-27 18:50:11 │
│ lumi-std │ container-executor │ CSC, Finland │ 2024-02-28 20:06:01 │
╰──────────────────┴────────────────────┴────────────────┴─────────────────────╯
Generate an empty working directory, targeting the ICEKube K8s cluster. Note that the target executor can be changed later.
mkdir waterml
cd waterml
pollinator new -n icekube
INFO[0000] Creating directory Dir=./cfs/src
INFO[0000] Creating directory Dir=./cfs/data
INFO[0000] Creating directory Dir=./cfs/result
INFO[0000] Generating Filename=./project.yaml
INFO[0000] Generating Filename=./cfs/data/hello.txt
INFO[0000] Generating Filename=./cfs/src/main.py
Dataset
Copy the water_body_dataset to the ./cfs/data directory.
cp ~/water_body_dataset ./cfs/data/water
If the dataset is already stored in Colonies CFS, you can copy the dataset directly from CFS to the project directory.
colonies fs sync -l /water -d ./cfs/data/water
The dataset will be uploaded the next time the project runs, and will be available in the container at these paths:
import os

projdir = os.environ.get("PROJECT_DIR")
image_path = projdir + '/data/water/Images/'
mask_path = projdir + '/data/water/Masks/'
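To sanity-check the layout before training, you can list the image/mask pairs. This is a minimal sketch that assumes each mask shares its image's filename; the actual naming convention in the dataset may differ:

```python
import os

def list_pairs(image_path, mask_path):
    """Return (image, mask) path pairs for filenames present in both directories."""
    names = sorted(set(os.listdir(image_path)) & set(os.listdir(mask_path)))
    return [(os.path.join(image_path, n), os.path.join(mask_path, n)) for n in names]
```

Files that appear in only one of the two directories are silently skipped, which makes mismatches easy to spot by comparing the pair count against the directory sizes.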
Build a Docker container (optional)
We are going to use the Container Executor, which comes in three variants.
- Kube Executor runs containers as Kubernetes batch jobs.
- Docker Executor runs containers as Docker containers on bare-metal servers or VMs.
- HPC Executor runs containers as Singularity containers on HPC systems, managing them as Slurm jobs.
The function specification is identical across all three, meaning that we can easily switch between these executor types. To run containers, we first need to create a Dockerfile with the following content:
FROM docker.io/tensorflow/tensorflow:2.13.0-gpu
RUN apt-get update && apt-get install -y python3 python3-pip wget vim git fish libgl1-mesa-glx libglib2.0-0
RUN python3 -m pip install --upgrade pip
RUN pip3 install pycolonies opencv-python tqdm Pillow scikit-learn keras matplotlib numpy
Build the Docker image and publish it to a public Docker registry.
docker build -t johan/hackaton .
docker push johan/hackaton
The johan/hackaton Docker image has already been published on DockerHub.
Training the model
Now that we have prepared the dataset and created a Docker container, it’s time to proceed with training the model.
Configure the Pollinator project
projectname: johantest
conditions:
  executorNames:
    - icekube
  nodes: 1
  processesPerNode: 1
  cpu: 10000m
  mem: 15000Mi
  walltime: 600
  gpu:
    count: 1
    name: "nvidia-gtx-2080ti"
environment:
  docker: johan/hackaton
  rebuildImage: false
  cmd: python3
  source: main.py
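Because the function specification is executor-agnostic, retargeting the project only requires changing the executorNames field. For example, to run on the LUMI HPC system from the executor table above (a sketch; the resource limits may also need adjusting for that system):

```yaml
conditions:
  executorNames:
    - lumi-std
```

The cpu and mem values use Kubernetes-style quantities: 10000m is 10 CPU cores and 15000Mi is roughly 15.7 GB of memory.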
Replace main.py
Download the source code from this GitHub repo.
wget -O cfs/src/main.py https://raw.githubusercontent.com/johankristianss/colonyoshackaton/main/src/main.py
Note that the Python code saves the training result and a random prediction example to the result directory, which is automatically synchronized back to the client after the process completes.
plt.savefig(projdir + '/result/res_' + processid + '.png')
plt.savefig(projdir + '/result/samples_' + processid + '.png')
ls cfs/result
.rw-r--r-- 55k johan 12 Dec 21:40 res_076e273a1d082dd2886892dfd7d1723e12c747cf2899f2c2ede27ceb55e06ae2.png
.rw-r--r-- 266k johan 12 Dec 21:40 samples_076e273a1d082dd2886892dfd7d1723e12c747cf2899f2c2ede27ceb55e06ae2.png
Train the model
Pollinator will automatically synchronize the cfs/src, cfs/data, and cfs/result directories to Colonies CFS, generate and submit a function specification, follow the process execution, and, upon completion, synchronize the project files back to your local computer.
pollinator run --follow
67/67 [==============================] - 1s 18ms/step - loss: 0.3434 - accuracy: 0.7024 - val_loss: 0.3263 - val_accuracy: 0.7038
Epoch 25/30
67/67 [==============================] - 1s 17ms/step - loss: 0.3307 - accuracy: 0.7092 - val_loss: 0.3146 - val_accuracy: 0.7121
Epoch 26/30
67/67 [==============================] - 1s 18ms/step - loss: 0.3139 - accuracy: 0.7140 - val_loss: 0.2947 - val_accuracy: 0.7249
Epoch 27/30
67/67 [==============================] - 1s 17ms/step - loss: 0.3226 - accuracy: 0.7110 - val_loss: 0.3027 - val_accuracy: 0.7244
Epoch 28/30
67/67 [==============================] - 1s 17ms/step - loss: 0.2994 - accuracy: 0.7208 - val_loss: 0.2910 - val_accuracy: 0.7259
Epoch 29/30
67/67 [==============================] - 1s 17ms/step - loss: 0.2910 - accuracy: 0.7239 - val_loss: 0.2781 - val_accuracy: 0.7261
Epoch 30/30
67/67 [==============================] - 1s 17ms/step - loss: 0.2856 - accuracy: 0.7258 - val_loss: 0.2733 - val_accuracy: 0.7313
23/23 [==============================] - 0s 4ms/step
INFO[0141] Process finished successfully ProcessID=61e597845ed3df4456c5be7d358e35141b8dc4c1f76a89d7caad0f31f792106c
Downloading samples_076e273a1d082dd2886892dfd7d1723e12c747cf2899f2c2ede27ceb55e06ae2.png 100% [===============] (5.0 MB/s)
Downloading res_076e273a1d082dd2886892dfd7d1723e12c747cf2899f2c2ede27ceb55e06ae2.png 100% [===============] (1.7 MB/s)
We can now open the sample prediction and training plot images.
![../_images/prediction_example.png](../_images/prediction_example.png)
![../_images/training_result.png](../_images/training_result.png)