Docker usage inside DGXH100

  1. Create a docker image using a base image with the CUDA toolkit 12.2 preinstalled; here we use nvidia/cuda:12.2.0-devel-ubuntu20.04
  2. Create a SERC user inside the docker container. To get the user details, type the below command on DGXH100
    1. id <uid>
      
      Example: id secdsan

      output: uid=18308(secdsan) gid=1040(serc3) groups=998(docker),4002(sec_yoginderkumarnegi),1040(serc3)

  3. Make note of the uid and gid. If the group name is not available, use a 3-letter short form of your department name; for example, for the CDS department, the group name can be cds
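    The values noted above can also be captured directly into shell variables on DGXH100 (a small convenience sketch; the variable names are illustrative):

```shell
# Capture the current user's uid, gid, and primary group name,
# for use in the Dockerfile in the next step.
DOCKER_UID=$(id -u)
DOCKER_GID=$(id -g)
DOCKER_GROUP=$(id -gn)
echo "uid=${DOCKER_UID} gid=${DOCKER_GID} group=${DOCKER_GROUP}"
```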
  4. Create a new directory inside /raid/<uid>/
    1. mkdir mydocker
  5. Create a file named “Dockerfile” and paste the below contents. Modify the user to be created and any packages to be preinstalled inside the docker container.
    1. FROM nvidia/cuda:12.2.0-devel-ubuntu20.04
      
      #set environment variables for user credentials obtained in step 2; the password can be of the user's choice
      ENV dockerusername=secdsan
      ENV dockeruserpassword=password
      ENV dockerusergroupid=1040
      ENV dockeruserid=18308
      ENV dockerusergroupname=serc3
      
      # Set environment variables for non-interactive installation
      ENV DEBIAN_FRONTEND=noninteractive
      
      # Install necessary system packages
      RUN apt-get update && apt-get install -y wget bzip2 && apt-get clean && rm -rf /var/lib/apt/lists/*
      
      # Install Miniconda
      RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh && bash /tmp/miniconda.sh -b -p /opt/miniconda && rm /tmp/miniconda.sh
      
      # Add Conda to the PATH environment variable
      ENV PATH=/opt/miniconda/bin:$PATH
      
      # To install any other packages, uncomment the below line and update it; leave as is if not needed
      #RUN apt-get install <your desired package>
      
      #Install sudo to give user root access inside docker
      RUN apt-get update && apt-get install -y sudo
      
      #Create group and user
      RUN groupadd $dockerusergroupname -g $dockerusergroupid
      RUN useradd $dockerusername -u $dockeruserid -g $dockerusergroupid -d /home/$dockerusername
      RUN mkdir -p /home/$dockerusername && chown -R $dockerusername:$dockerusergroupid /home/$dockerusername
      
      #Set user password 
      RUN echo "$dockerusername:$dockeruserpassword" | chpasswd
      
      #Add user to sudo group
      RUN usermod -aG sudo $dockerusername
      
      #Switch to the user for all subsequent commands
      USER $dockerusername
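      Note: with ENV, the password is stored in the image metadata and appears in every container's environment. If that is a concern, one variant (a sketch, not required by this guide) is to pass it at build time instead:

```dockerfile
# Sketch: declare the password as a build-time argument instead of an ENV,
# so it is supplied on the command line rather than hard-coded in the file.
# (Build args can still show up in `docker history`, so this is a
# mitigation, not a full secret-management solution.)
ARG dockeruserpassword
RUN echo "$dockerusername:$dockeruserpassword" | chpasswd
```

      The image would then be built with docker build --build-arg dockeruserpassword=<password> -t <preferred_docker_image_name> . instead of the plain build command in step 6.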
      
  6. Within the folder where the Dockerfile is present, run the below command to create a docker image with prebuilt conda and the preferred user.
    1. docker build -t <preferred_docker_image_name> .

      Example:

      docker build -t secdsan_conda_with_dependency .

      Note: there is a . (dot) at the end of the above command; it is needed.

  7. If the image is successfully built, you can verify it with the below command by making sure the image is listed; the <preferred_docker_image_name> will be visible.
    1. docker image list
  8. Run the code inside the docker container by launching the container from the slurm script first, then logging into the container, adding the necessary dependencies, and running the code. A sample format is shared below for queue q_1day-1G; refer to https://www.serc.iisc.ac.in/queue-configuration-dgxh100/ for other queue formats:
    1. The sample job script below for q_1day-1G launches the container with docker yet to be configured; the code is then run from inside the docker.
    2. Alternatively, you can run the code directly via the slurm script while launching the docker container, if no additional dependencies are needed for running the code inside the container.
    3. #!/bin/sh
      #SBATCH --job-name=serial_job_test     ## Job name
      #SBATCH --ntasks=1                     ## Run on a single CPU; can take up to 10
      #SBATCH --time=24:00:00                ## Time limit hrs:min:sec; specific to the queue being used
      #SBATCH --output=serial_test_job.out   ## Standard output
      #SBATCH --error=serial_test_job.err    ## Error log
      #SBATCH --gres=gpu:1                   ## GPUs needed, should be same as selected queue GPUs
      #SBATCH --partition=q_1day-1G          ## Specific to queue being used, need to select from queues available
      #SBATCH --mem=20GB                     ## Memory for computation; can go up to 100GB
      
      pwd; hostname; date |tee result
      docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/<uid>:/workspace/raid/<uid> <preferred_docker_image_name>:<tag>
      ##example for above looks like (do not include these two example lines in your script):
      docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/secdsan:/workspace/raid/secdsan secdsan_cuda:latest
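    The quoting in the --gpus option above is easy to misread: the single quotes end just before $CUDA_VISIBLE_DEVICES so the shell expands the variable, while the inner double quotes are passed through literally to docker. A quick illustration (in a real job CUDA_VISIBLE_DEVICES is set by slurm; the value 0,1 here is only for demonstration):

```shell
# Demonstrate how the shell expands the --gpus argument from the job script.
CUDA_VISIBLE_DEVICES=0,1            # set by slurm in a real job
arg='"device='$CUDA_VISIBLE_DEVICES'"'
echo "$arg"                         # prints: "device=0,1"
```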
  9. You can check if the job is running successfully via the below command.
    1. squeue
  10. If the job fails, you can debug using serial_test_job.out and serial_test_job.err
  11. You can connect to the launched docker container via the slurm script and even pass the code to execute; follow the below steps

    Note: the user's directory inside DGXH100 is mounted inside the docker container when the user runs “-v /raid/<uid>:/workspace/raid/<uid>”, example: “-v /raid/secdsan:/workspace/raid/secdsan”

    1. Check the running containers using below command and make sure your container is running

      docker ps
    2. The above command gives you the container id, which is required for getting inside the container.

      docker exec -u <dockerusername> -it <CONTAINER_ID> bash

      Example: if the CONTAINER_ID is ac65b8ccf8f5 and the docker user name inside the docker file is secdsan:

      docker exec -u secdsan -it ac65b8ccf8f5 bash
    3. Once inside the container, you can install packages of your choice with the sudo privileges of the user created inside the docker container via the docker file

      sudo apt install <desired_package>

      The password will be the one set for the docker user inside the docker file

    4. To check the conda version and activate conda, use the below commands:

      conda --version
      conda init
      source ~/.bashrc
    5. To deactivate conda, use the below command

      conda deactivate
      
    6. After the docker container is ready with the required packages, go to the code present inside the container and run it
      1. cd /<path to the script>/
      2. python <script to be run>.py
    7. If no additional dependencies are needed for running the code inside the container, use the below sample format for slurm job submission. The sample format shared is for queue q_1day-1G; refer to https://www.serc.iisc.ac.in/queue-configuration-dgxh100/ for other queue formats:

      1. #!/bin/sh
        #SBATCH --job-name=serial_job_test     ## Job name
        #SBATCH --ntasks=1                     ## Run on a single CPU; can take up to 10
        #SBATCH --time=24:00:00                ## Time limit hrs:min:sec; specific to the queue being used
        #SBATCH --output=serial_test_job.out   ## Standard output
        #SBATCH --error=serial_test_job.err    ## Error log
        #SBATCH --gres=gpu:1                   ## GPUs needed, should be same as selected queue GPUs
        #SBATCH --partition=q_1day-1G          ## Specific to queue being used, need to select from queues available
        #SBATCH --mem=20GB                     ## Memory for computation; can go up to 100GB
        
        pwd; hostname; date |tee result
        docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/<uid>:/workspace/raid/<uid> <preferred_docker_image_name>:<tag> bash -c 'cd /workspace/raid/<uid>/<path to desired folder>/ && python <script to be run.py>' | tee -a log_out.txt
        ##example for above looks like (do not include these two example lines in your script):
        docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/secdsan:/workspace/raid/secdsan secdsan_cuda:latest bash -c 'cd /workspace/raid/secdsan/gputest/test1/ && python gputestscript.py' | tee -a log_out.txt