Docker usage inside DGXH100

  1. Create a docker image using a base image with the CUDA toolkit 12.2 preinstalled; here we use nvidia/cuda:12.2.0-devel-ubuntu20.04
  2. Create a SERC user inside the docker container. To get the user details, type the below command on DGXH100
    1. id <uid>
      
      Example: id secdsan

      output: uid=18308(secdsan) gid=1040(serc3) groups=998(docker),4002(sec_yoginderkumarnegi),1040(serc3)

  3. Make note of the uid and gid. If the group name is not available, use a 3-letter short form of your department name; for example, for the CDS department, the group name can be cds
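    The values noted above can also be captured directly into shell variables on DGXH100 (a small convenience sketch; the variable names are illustrative):

```shell
# Capture the current user's uid, gid, and primary group name,
# for use in the Dockerfile in the next step.
DOCKER_UID=$(id -u)
DOCKER_GID=$(id -g)
DOCKER_GROUP=$(id -gn)
echo "uid=${DOCKER_UID} gid=${DOCKER_GID} group=${DOCKER_GROUP}"
```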
  4. Create a new directory inside /raid/<uid>/
    1. mkdir mydocker
  5. Create a file named “Dockerfile” and paste the below contents. Modify the user to be created and any packages to be preinstalled inside the docker container.
    1. FROM nvidia/cuda:12.2.0-devel-ubuntu20.04
      
      #set environment variables for user credentials obtained in step 2; the password can be of the user's choice
      ENV dockerusername=secdsan
      ENV dockeruserpassword=password
      ENV dockerusergroupid=1040
      ENV dockeruserid=18308
      ENV dockerusergroupname=serc3
      
      # Set environment variables for non-interactive installation
      ENV DEBIAN_FRONTEND=noninteractive
      
      # Install necessary system packages
      RUN apt-get update && apt-get install -y wget bzip2 && apt-get clean && rm -rf /var/lib/apt/lists/*
      
      # Install Miniconda
      RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh && bash /tmp/miniconda.sh -b -p /opt/miniconda && rm /tmp/miniconda.sh
      
      # Add Conda to the PATH environment variable
      ENV PATH=/opt/miniconda/bin:$PATH
      
      # To install any other packages, uncomment the below line and update it; leave as is if not needed
      #RUN apt-get install <your desired package>
      
      #Install sudo to give user root access inside docker
      RUN apt-get update && apt-get install -y sudo
      
      #Create group and user
      RUN groupadd $dockerusergroupname -g $dockerusergroupid
      RUN useradd $dockerusername -u $dockeruserid -g $dockerusergroupid -d /home/$dockerusername
      RUN mkdir -p /home/$dockerusername && chown -R $dockerusername:$dockerusergroupid /home/$dockerusername
      
      #Set user password 
      RUN echo "$dockerusername:$dockeruserpassword" | chpasswd
      
      #Add user to sudo group
      RUN usermod -aG sudo $dockerusername
      
      #Switch to the user for all subsequent commands
      USER $dockerusername
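      Note: with ENV, the password is stored in the image metadata and appears in every container's environment. If that is a concern, one variant (a sketch, not required by this guide) is to pass it at build time instead:

```dockerfile
# Sketch: declare the password as a build-time argument instead of an ENV,
# so it is supplied on the command line rather than hard-coded in the file.
# (Build args can still show up in `docker history`, so this is a
# mitigation, not a full secret-management solution.)
ARG dockeruserpassword
RUN echo "$dockerusername:$dockeruserpassword" | chpasswd
```

      The image would then be built with docker build --build-arg dockeruserpassword=<password> -t <preferred_docker_image_name> . instead of the plain build command in step 6.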
      
  6. Within the folder where the Dockerfile is present, run the below command to create a docker image with prebuilt conda and the preferred user.
    1. docker build -t <preferred_docker_image_name> .

      Example:

      docker build -t secdsan_conda_with_dependency .

      Note: there is a . (dot) at the end of the above command; it is needed.

  7. If the image is successfully built, you can verify it with the below command by making sure the image is listed; the <preferred_docker_image_name> will be visible.
    1. docker image list
  8. Run the code inside the docker container by launching the container from the slurm script first, then logging into the container, adding the necessary dependencies, and running the code. A sample format is shared below for queue q_1day-1G; refer to https://www.serc.iisc.ac.in/queue-configuration-dgxh100/ for other queue formats:
    1. The sample job script below for q_1day-1G launches the container with docker yet to be configured; the code is then run from inside the docker.
    2. Alternatively, you can run the code directly via the slurm script while launching the docker container, if no additional dependencies are needed for running the code inside the container.
    3. #!/bin/sh
      #SBATCH --job-name=serial_job_test     ## Job name
      #SBATCH --ntasks=1                     ## Run on a single CPU; can take up to 10
      #SBATCH --time=24:00:00                ## Time limit hrs:min:sec; specific to the queue being used
      #SBATCH --output=serial_test_job.out   ## Standard output
      #SBATCH --error=serial_test_job.err    ## Error log
      #SBATCH --gres=gpu:1                   ## GPUs needed, should be same as selected queue GPUs
      #SBATCH --partition=q_1day-1G          ## Specific to queue being used, need to select from queues available
      #SBATCH --mem=20GB                     ## Memory for computation; can go up to 100GB
      
      pwd; hostname; date |tee result
      docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/<uid>:/workspace/raid/<uid> <preferred_docker_image_name>:<tag>
      ##example for above looks like (do not include these two example lines in your script):
      docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/secdsan:/workspace/raid/secdsan secdsan_cuda:latest
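    The quoting in the --gpus option above is easy to misread: the single quotes end just before $CUDA_VISIBLE_DEVICES so the shell expands the variable, while the inner double quotes are passed through literally to docker. A quick illustration (in a real job CUDA_VISIBLE_DEVICES is set by slurm; the value 0,1 here is only for demonstration):

```shell
# Demonstrate how the shell expands the --gpus argument from the job script.
CUDA_VISIBLE_DEVICES=0,1            # set by slurm in a real job
arg='"device='$CUDA_VISIBLE_DEVICES'"'
echo "$arg"                         # prints: "device=0,1"
```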
  9. You can check if the job is running successfully via the below command.
    1. squeue
  10. If the job fails, you can debug using serial_test_job.out and serial_test_job.err
  11. You can connect to the launched docker container via the slurm script and even pass the code to execute; follow the below steps

    Note: the user's directory inside DGXH100 is mounted inside the docker container when the user runs “-v /raid/<uid>:/workspace/raid/<uid>”, example: “-v /raid/secdsan:/workspace/raid/secdsan”

    1. Check the running containers using below command and make sure your container is running

      docker ps
    2. The above command gives you the container id, which is required for getting inside the container.

      docker exec -u <dockerusername> -it <CONTAINER_ID> bash

      Example: if the CONTAINER_ID is ac65b8ccf8f5 and the docker user name inside the docker file is secdsan:

      docker exec -u secdsan -it ac65b8ccf8f5 bash
    3. Once inside the container, you can install packages of your choice with the sudo privileges of the user created inside the docker container via the docker file

      sudo apt install <desired_package>

      The password will be the one set for the docker user inside the docker file

    4. To check the conda version and activate conda, use the below commands:

      conda --version
      conda init
      source ~/.bashrc
    5. To deactivate conda, use the below command

      conda deactivate
      
    6. After the docker container is ready with the required packages, go to the code present inside the container and run it
      1. cd /<path to the script>/
      2. python <script to be run>.py
    7. If no additional dependencies are needed for running the code inside the container, use the below sample format for slurm job submission. The sample format shared is for queue q_1day-1G; refer to https://www.serc.iisc.ac.in/queue-configuration-dgxh100/ for other queue formats:

      1. #!/bin/sh
        #SBATCH --job-name=serial_job_test     ## Job name
        #SBATCH --ntasks=1                     ## Run on a single CPU; can take up to 10
        #SBATCH --time=24:00:00                ## Time limit hrs:min:sec; specific to the queue being used
        #SBATCH --output=serial_test_job.out   ## Standard output
        #SBATCH --error=serial_test_job.err    ## Error log
        #SBATCH --gres=gpu:1                   ## GPUs needed, should be same as selected queue GPUs
        #SBATCH --partition=q_1day-1G          ## Specific to queue being used, need to select from queues available
        #SBATCH --mem=20GB                     ## Memory for computation; can go up to 100GB
        
        pwd; hostname; date |tee result
        docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/<uid>:/workspace/raid/<uid> <preferred_docker_image_name>:<tag> bash -c 'cd /workspace/raid/<uid>/<path to desired folder>/ && python <script to be run.py>' | tee -a log_out.txt
        ##example for above looks like (do not include these two example lines in your script):
        docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/secdsan:/workspace/raid/secdsan secdsan_cuda:latest bash -c 'cd /workspace/raid/secdsan/gputest/test1/ && python gputestscript.py' | tee -a log_out.txt