1-GPU Job Script q_2day-1G (Wall time = 48 Hrs) - DGXH100

1-GPU Job-Script q_2day-1G (Wall time = 48Hrs) to run code from inside docker container 

Note: Text in red color is for your understanding; please do not add it to your script.

Kindly refer here for creating a docker image inside DGXH100 with a predefined user and preinstalled dependencies, then follow the steps below.

If your code can run inside docker without additional dependencies, kindly follow the job submission method in the "to run code while launching docker container" section at the end of the page.

#!/bin/sh 
#SBATCH --job-name=serial_job_test    ## Job name 
#SBATCH --ntasks=1                    ## Number of tasks; runs on a single CPU, can be up to 10 
#SBATCH --time=48:00:00               ## Time limit hrs:min:sec; 48 hrs is specific to this queue 
#SBATCH --output=serial_test_job.out  ## Standard output 
#SBATCH --error=serial_test_job.err   ## Error log 
#SBATCH --gres=gpu:1                  ## GPUs needed, should be same as selected queue GPUs 
#SBATCH --partition=q_2day-1G         ## Specific to queue being used, need to select from queues available 
#SBATCH --mem=20GB                    ## Memory for the computation process; can go up to 100GB 

pwd; hostname; date | tee result
docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/<uid>:/workspace/raid/<uid> <preferred_docker_image_name>:<tag>

## Example of the above (do not include these two highlighted lines in your script):
docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/secdsan:/workspace/raid/secdsan secdsan_cuda:latest
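The `--gpus` argument above uses nested quoting so that docker receives a literal "device=<ids>" string. A minimal sketch shows how it expands; the value of CUDA_VISIBLE_DEVICES is hard-coded here for illustration, whereas in a real job slurm sets it for you:

```shell
# CUDA_VISIBLE_DEVICES is normally set by slurm; hard-coded for illustration.
CUDA_VISIBLE_DEVICES=0
# Same quoting as in the docker run line: the single quotes protect the
# literal double quotes, and the variable expands between them.
arg='"device='$CUDA_VISIBLE_DEVICES'"'
echo "$arg"   # prints "device=0" (double quotes included)
```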

Job Submission Instructions:

  1. All jobs must be submitted via slurm.
  2. If jobs are run without slurm, your professor will be notified of your actions and your account will be blocked.
  3. Save the sbatch script above to a file; to submit the job, use the command below.

    sbatch <SCRIPT NAME> 

     example: sbatch test_script.sh

  4. You can check whether the job is running via the command:
    1. squeue
  5. If the job fails, you can debug using serial_test_job.out and serial_test_job.err.
  6. You can connect to the launched docker container via the slurm script and even pass it code to execute; follow the steps below.

    Note: the user's directory inside DGXH100 is mounted inside the docker container when the user passes “-v /raid/<uid>:/workspace/raid/<uid>”, for example: “-v /raid/secdsan:/workspace/raid/secdsan”

    1. Check the running containers using the command below and make sure your container is running

      docker ps
    2. The above command gives you the container ID, which is required to get inside the container.

      docker exec -u <dockerusername> -it <CONTAINER_ID> bash

      Example: if the CONTAINER_ID is ac65b8ccf8f5 and the docker user name inside the docker file is secdsan:

      docker exec -u secdsan -it ac65b8ccf8f5 bash
    3. Once inside the container, you can install packages of your choice with sudo privileges, using the user created inside the docker container via the docker file

      sudo apt install <desired_package>

      The password will be the one set for the docker user inside the docker file.

    4. To check the conda version and activate conda, use the commands below:

      conda --version
      source ~/.bashrc
    5. To deactivate conda, use the command below:

      conda deactivate
      
    6. After the docker container is ready with the required packages, go to the code present inside the container and run it
      1. cd /<path to the script>/
      2. python <script to be run>.py
    7. If no additional dependencies are needed to run the code inside the container, use the sample format below for slurm job submission
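The attach-to-container step above can be sketched as a small helper that assembles the `docker exec` command line; this is only an illustration, using the example user and container ID from the steps above:

```shell
# Hypothetical helper: build the docker exec command for a given docker
# user and container ID (values below are the example ones from above).
make_exec_cmd() {
  # -u selects the user inside the container; -it requests an interactive TTY
  printf 'docker exec -u %s -it %s bash' "$1" "$2"
}

make_exec_cmd secdsan ac65b8ccf8f5
```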

1-GPU Job-Script q_2day-1G (Wall time = 48Hrs) to run code while launching docker container

Note: Text in red color is for your understanding; please do not add it to your script.

#!/bin/sh 
#SBATCH --job-name=serial_job_test    ## Job name 
#SBATCH --ntasks=1                    ## Number of tasks; runs on a single CPU, can be up to 10 
#SBATCH --time=48:00:00               ## Time limit hrs:min:sec; 48 hrs is specific to this queue 
#SBATCH --output=serial_test_job.out  ## Standard output 
#SBATCH --error=serial_test_job.err   ## Error log 
#SBATCH --gres=gpu:1                  ## GPUs needed, should be same as selected queue GPUs 
#SBATCH --partition=q_2day-1G         ## Specific to queue being used, need to select from queues available 
#SBATCH --mem=20GB                    ## Memory for the computation process; can go up to 100GB 

pwd; hostname; date | tee result
docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/<uid>:/workspace/raid/<uid> <preferred_docker_image_name>:<tag> bash -c 'cd /workspace/raid/<uid>/<path to desired folder>/ && python <script to be run.py>' | tee -a log_out.txt

## Example of the above (do not include these two highlighted lines in your script):
docker run -t --gpus '"device='$CUDA_VISIBLE_DEVICES'"' --name $SLURM_JOB_ID --ipc=host --shm-size=20GB --user $(id -u $USER):$(id -g $USER) -v /raid/secdsan:/workspace/raid/secdsan secdsan_cuda:latest bash -c 'cd /workspace/raid/secdsan/gputestfolder/ && python gputest.py' | tee -a log_out.txt
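The trailing `bash -c 'cd … && python …' | tee -a log_out.txt` part of the command above simply changes into the code directory, runs the script, and appends the output to log_out.txt. A local sketch illustrates the pattern using a throwaway script under /tmp rather than the real container paths:

```shell
# Create a stand-in for the user's script (illustration only).
mkdir -p /tmp/gputestfolder
printf 'print("ok")\n' > /tmp/gputestfolder/gputest.py

# Same pattern as the docker command's trailing arguments: cd into the
# folder, run the script, and append its output to a log file with tee -a.
bash -c 'cd /tmp/gputestfolder && python3 gputest.py' | tee -a log_out.txt
```

Because `tee -a` appends, repeated job runs accumulate in the same log_out.txt rather than overwriting it.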

Job Submission Instructions:

  1. All jobs must be submitted via slurm.
  2. If jobs are run without slurm, your professor will be notified of your actions and your account will be blocked.
  3. Save the sbatch script above to a file; to submit the job, use the command below.

    sbatch <SCRIPT NAME> 

     example: sbatch test_script.sh