How to Use DGXH100:
Accessing the system:
The NVIDIA-DGXH100 cluster has one login node, DGXH100, through which the user can access the cluster and submit jobs.
The machine can be accessed over ssh from inside the IISc network:
ssh <computational_userid>@dgxh100.serc.iisc.ac.in
Basic steps for better utilization of DGXH100:
- After login, users usually land in their home directory. To check the current directory, run the command below.
  pwd
- Home directories are usually limited to 1.5 GB of storage.
- A directory is also created for each user on the DGXH100 machine for storing data. Its path is /raid/<computational_userid>, and users can move to it with the command below.
  cd /raid/<computational_userid>
- It is important that users back up their raid directory regularly, as SERC clears the raid space every 2 weeks. Frequent backups protect against such data loss.
- Since each user has specific software and package requirements, our team has enabled Docker, which lets users set up the packages, software, or environments needed to run their jobs.
- Users can follow the link below to learn how to create a customized Docker container.
- It is suggested to create a user inside the Docker image with the same details as the user's account on DGXH100, so that file access and permissions remain the same during job execution.
- It is important that users back up their Dockerfile frequently, so that Docker images are easier to recreate after the SERC team clears Docker storage every 2 weeks.
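To create a user inside the Docker image with the same details as on DGXH100, the account's name, UID, and GID are needed first. The sketch below prints them with the standard `id` command. The variable names `USER_NAME`, `USER_UID`, and `USER_GID`, the image tag, and the commented build command are illustrative assumptions and are not taken from the SERC instructions; a Dockerfile would need matching `ARG` lines for them to have any effect.

```shell
# Print the account details that the user inside the container should mirror.
# The USER_* names are hypothetical labels chosen for this example.
print_user_details() {
    echo "USER_NAME=$(id -un)"
    echo "USER_UID=$(id -u)"
    echo "USER_GID=$(id -g)"
}

print_user_details

# Hypothetical build command (assumes the Dockerfile declares
# ARG USER_NAME, ARG USER_UID and ARG USER_GID and uses them to create the user):
# docker build \
#     --build-arg USER_NAME="$(id -un)" \
#     --build-arg USER_UID="$(id -u)" \
#     --build-arg USER_GID="$(id -g)" \
#     -t my-image:latest .
```

Running this on the DGXH100 login node, rather than on a personal machine, gives the values that match file ownership under /raid.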
NOTE:
“Running jobs without SLURM will lead to blocking of the computational account”
Please note that the “/raid/” space is meant for saving your job outputs for a temporary period only. Data in the raid space older than 14 days (2 weeks) will be deleted. Docker images older than 14 days (2 weeks) will also be deleted from the Docker system. SERC does not maintain any backups of the raid space data and hence will not be responsible for any data loss.
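Because the raid space is temporary and is not backed up by SERC, one possible way to take a backup is to pack the raid directory into a dated archive and then copy that archive to storage outside DGXH100. The following is a minimal sketch, not an official SERC procedure: the function name `backup_dir` and the destination path in the example are hypothetical, and it assumes `tar` with gzip support is available.

```shell
# Hypothetical helper: pack a directory into a dated .tar.gz archive.
# Prints the path of the archive on success.
backup_dir() {
    src="$1"     # directory to back up, e.g. /raid/<computational_userid>
    dest="$2"    # directory that will hold the archive
    stamp="$(date +%Y%m%d)"
    archive="$dest/$(basename "$src")-$stamp.tar.gz"

    mkdir -p "$dest" || return 1
    # -C changes into the parent directory so the archive stores relative paths.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}

# Example (replace both placeholders with real values before running):
# backup_dir /raid/<computational_userid> /path/to/backup/location
```

The home directory is usually limited to 1.5 GB, so it is unlikely to be a suitable destination for large archives. Copying the archive to a machine outside the cluster keeps it safe from the 14-day cleanup.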