- Create a Docker image using a base image that has CUDA Toolkit 12.2 preinstalled; here we use nvidia/cuda:12.2.0-devel-ubuntu20.04.
- Create your SERC user inside the Docker container. To get your user details, run the command below on the DGX H100:
-
id <uid>
Example: id secdsan
Output: uid=18308(secdsan) gid=1040(serc3) groups=998(docker),4002(sec_yoginderkumarnegi),1040(serc3)
-
- Make a note of the uid and gid. If the group name is not available, use the three-letter short form of your department name; for example, for the CDS department the group name can be cds.
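- If you only need the individual values, the id command can also print each one directly; a minimal example, using the same sample user secdsan as above:
-
id -u secdsan    ## prints the uid, e.g. 18308
id -g secdsan    ## prints the gid, e.g. 1040
id -gn secdsan   ## prints the group name, e.g. serc3
-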
- Create a new directory inside /raid/<uid>/:
-
mkdir mydocker
-
- Create a file named “Dockerfile” and paste the contents below into it. Modify the user to be created and any packages to be preinstalled inside the Docker container.
-
FROM nvidia/cuda:12.2.0-devel-ubuntu20.04

# Set environment variables for the user credentials obtained in step 2; the password can be of the user's choice
ENV dockerusername=secdsan
ENV dockeruserpassword=password
ENV dockerusergroupid=1040
ENV dockeruserid=18308
ENV dockerusergroupname=serc3

# Set environment variables for non-interactive installation
ENV DEBIAN_FRONTEND=noninteractive

# Install necessary system packages
RUN apt-get update && apt-get install -y wget bzip2 && apt-get clean && rm -rf /var/lib/apt/lists/*

# Install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh && bash /tmp/miniconda.sh -b -p /opt/miniconda && rm /tmp/miniconda.sh

# Add Conda to the PATH environment variable
ENV PATH=/opt/miniconda/bin:$PATH

# To install any other packages, uncomment the line below and update it, or leave it as is if not needed
#RUN apt-get install <your desired package>

# Install sudo to give the user root access inside the container
RUN apt-get update && apt-get install -y sudo

# Create group and user
RUN groupadd $dockerusergroupname -g $dockerusergroupid
RUN useradd $dockerusername -u $dockeruserid -g $dockerusergroupid -d /home/$dockerusername
RUN mkdir -p /home/$dockerusername && chown -R $dockerusername:$dockerusergroupid /home/$dockerusername

# Set user password
RUN echo "$dockerusername:$dockeruserpassword" | chpasswd

# Add user to sudo group
RUN usermod -aG sudo $dockerusername

# Switch to the user for all subsequent commands
USER $dockerusername
-
- From the folder where the Dockerfile is present, run the command below to create a Docker image with Conda prebuilt and your preferred user.
-
docker build -t <preferred_docker_image_name> .
Example:
docker build -t secdsan_conda_with_dependency .
Note: the . (dot) at the end of the command is required.
-
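- Optionally, you can build with an explicit tag so the image name matches the <preferred_docker_image_name>:<tag> form used in the SLURM scripts below; the image name and tag here are only examples (if no tag is given, Docker applies the tag latest):
-
docker build -t secdsan_conda_with_dependency:v1 .
-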
- If the image is built successfully, verify it with the command below and make sure the image is listed; <preferred_docker_image_name> will be visible.
-
docker image list
-
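- Optionally, to show just your own image instead of the full list, pass the repository name to the command (the name below is the example from the build step):
-
docker image list secdsan_conda_with_dependency
-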
- Run the code inside the Docker container by first launching the container from the SLURM script, then logging into the container, adding the necessary dependencies, and running the code. The sample format below is for the queue q_1day-1G; refer to https://www.serc.iisc.ac.in/queue-configuration-dgxh100/ for other queue formats:
- A sample job script for q_1day-1G is shared below; in this option the Docker container still needs to be configured, and the code is run from inside the container.
- The other option is to run the code directly from the SLURM script while launching the Docker container, if no additional dependencies are needed for running the code inside the container (see the last section below).
-
#!/bin/sh
#SBATCH --job-name=serial_job_test     ## Job name
#SBATCH --ntasks=1                     ## Run on a single CPU; can take up to 10
#SBATCH --time=24:00:00                ## Time limit hrs:min:sec, specific to the queue being used
#SBATCH --output=serial_test_job.out   ## Standard output
#SBATCH --error=serial_test_job.err    ## Error log
#SBATCH --gres=gpu:1                   ## GPUs needed, should match the GPUs of the selected queue
#SBATCH --partition=q_1day-1G          ## Specific to the queue being used; select from the available queues
#SBATCH --mem=20GB                     ## Memory for the computation process; can go up to 100GB

pwd; hostname; date | tee result

docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/<uid>:/workspace/raid/<uid> <preferred_docker_image_name>:<tag>

## Example of the docker run line above (do not include these two lines in your script):
## docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/secdsan:/workspace/raid/secdsan secdsan_cuda:latest
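- Save the script above to a file and submit it with sbatch; the filename below is only an example:
-
sbatch docker_job.sh
-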
- Check whether the job is running successfully with the command below.
-
squeue
-
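- To list only your own jobs rather than the whole queue, you can also run:
-
squeue -u $USER
-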
- If the job fails, debug using serial_test_job.out and serial_test_job.err.
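- For example, to follow both log files live while the job runs (the filenames come from the #SBATCH --output and --error lines above):
-
tail -f serial_test_job.out serial_test_job.err
-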
- You can connect to the Docker container launched via the SLURM script and even pass it the code to execute; follow the steps below.
Note: the user's directory on the DGX H100 is mounted inside the Docker container when the user passes the -v option shown in the script above, for example “-v /raid/secdsan:/workspace/raid/secdsan”.
-
Check the running containers using the command below and make sure your container is running:
docker ps
-
The above command gives you the container ID, which is required for getting inside the container.
docker exec -u <dockerusername> -it <CONTAINER_ID> bash
Example: if the CONTAINER_ID is ac65b8ccf8f5 and the Docker username set in the Dockerfile is secdsan:
docker exec -u secdsan -it ac65b8ccf8f5 bash
-
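- Since the job script names the container after the SLURM job ID (--name $SLURM_JOB_ID), you can also filter docker ps for it directly; replace the placeholder with your own job ID:
-
docker ps --filter "name=<your_slurm_job_id>"
-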
Once inside the container, you can install packages of your choice with sudo privileges for the user created in the Dockerfile:
sudo apt install <desired_package>
The password is the one set for the Docker user in the Dockerfile.
-
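- Note that the Dockerfile above clears the apt package lists (rm -rf /var/lib/apt/lists/*), so run sudo apt update once before the first install; the package below (git) is only an example:
-
sudo apt update
sudo apt install -y git
-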
To check the Conda version and activate Conda, use the commands below:
conda --version
conda init
source ~/.bashrc
-
To deactivate Conda, use the command below:
conda deactivate
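- Optionally, you can keep your packages in a dedicated Conda environment instead of the base one; a minimal sketch, where the environment name, Python version, and package are only examples:
-
conda create -n myenv python=3.10 -y
conda activate myenv
conda install -y numpy
-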
- After the Docker container is ready with the required packages, go to the code present inside the container and run it:
-
cd /<path to the script>/
-
python <script to be run>.py
-
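- For example, with the -v /raid/secdsan:/workspace/raid/secdsan mount from the docker run command above, a script stored under /raid/secdsan on the DGX H100 appears under /workspace/raid/secdsan inside the container (the path and script name below are illustrative):
-
cd /workspace/raid/secdsan/gputest/test1/
python gputestscript.py
-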
If no additional dependencies are needed for running the code inside the container, use the sample format below for SLURM job submission. This sample is for the queue q_1day-1G; refer to https://www.serc.iisc.ac.in/queue-configuration-dgxh100/ for other queue formats:
-
#!/bin/sh
#SBATCH --job-name=serial_job_test     ## Job name
#SBATCH --ntasks=1                     ## Run on a single CPU; can take up to 10
#SBATCH --time=24:00:00                ## Time limit hrs:min:sec, specific to the queue being used
#SBATCH --output=serial_test_job.out   ## Standard output
#SBATCH --error=serial_test_job.err    ## Error log
#SBATCH --gres=gpu:1                   ## GPUs needed, should match the GPUs of the selected queue
#SBATCH --partition=q_1day-1G          ## Specific to the queue being used; select from the available queues
#SBATCH --mem=20GB                     ## Memory for the computation process; can go up to 100GB

pwd; hostname; date | tee result

docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/<uid>:/workspace/raid/<uid> <preferred_docker_image_name>:<tag> bash -c 'cd /workspace/raid/<uid>/<path to desired folder>/ && python <script to be run>.py' | tee -a log_out.txt

## Example of the docker run line above (do not include these two lines in your script):
## docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/secdsan:/workspace/raid/secdsan secdsan_cuda:latest bash -c 'cd /workspace/raid/secdsan/gputest/test1/ && python gputestscript.py' | tee -a log_out.txt
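- Optional: to quickly confirm that the container can see the GPU allocated by SLURM before running a longer job, a one-off check like the one below can be added to a job script; it reuses the same docker run options, the image name is the same example as above, and nvidia-smi is made available in the container by the NVIDIA container runtime:
-
docker run --rm -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --user $(id -u $USER):$(id -g $USER) secdsan_cuda:latest nvidia-smi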
-