The NVIDIA DGX-1 is a deep learning system architected for high throughput and high interconnect bandwidth to maximize neural network training performance. The core of the system is a complex of eight Tesla V100 GPUs connected in a hybrid cube-mesh NVLink network topology. In addition to the eight GPUs, DGX-1 includes two CPUs for boot, storage management, and deep learning framework coordination. DGX-1 is built into a three-rack-unit (3U) enclosure that provides power, cooling, networking, multi-system interconnect, and an SSD file system cache, balanced to optimize throughput and deep learning training time.

NVLink is an energy-efficient, high-bandwidth interconnect that enables NVIDIA GPUs to connect to peer GPUs or other devices within a node at an aggregate bidirectional bandwidth of up to 300 GB/s per GPU: over nine times that of a PCIe Gen3 x16 interconnection. Together, the NVLink interconnect and the DGX-1 architecture's hybrid cube-mesh GPU network topology enable the highest achievable data-exchange bandwidth between a group of eight Tesla V100 GPUs.


OEM – NVIDIA Corporation

Authorized seller – LOCUZ Enterprise Solutions Ltd

Hardware Overview:

GPUs – 8 x Tesla V100
GPU Memory – 256 GB total system
CPU – Dual 20-core Intel Xeon E5-2698 v4 2.2 GHz
NVIDIA CUDA cores – 40,960
NVIDIA Tensor cores (on V100 based systems) – 5,120
System Memory – 512 GB 2,133 MHz DDR4 RDIMM
Storage – 4 x 1.92 TB SSD RAID-0
Network – Dual 10 GbE




Performance – 1 petaFLOPS (mixed precision)

TESLA V100 GPU (NVLink) Performance     Single V100 GPU        Total (8 × V100)
Double Precision                        Up to 7.8 TFLOPS       Up to 62.4 TFLOPS
Single Precision                        Up to 15.7 TFLOPS      Up to 125.6 TFLOPS
Deep Learning (Mixed Precision)         Up to 125 TFLOPS       Up to 1 PFLOPS

 Software Overview:

Operating System – Ubuntu 16.04 LTS (Linux x86_64)

Deep Learning Frameworks


Job Submission System – SLURM


“Running jobs without SLURM will lead to blocking of the computational account”
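Since all work must go through SLURM, a job is wrapped in a batch script and submitted with `sbatch`. The fragment below is a minimal sketch, not a site-verified template: the GPU count, CPU count, time limit, output path, and the training command are all illustrative, and any partition or module names should be taken from the cluster's own settings (`sinfo`, `module avail`).

```shell
#!/bin/bash
#SBATCH --job-name=dgx-train                  # name shown in squeue
#SBATCH --gres=gpu:1                          # request one of the eight V100 GPUs
#SBATCH --cpus-per-task=8                     # CPU cores for data loading
#SBATCH --time=02:00:00                       # wall-clock limit (illustrative)
#SBATCH --output=/localscratch/%u/%x-%j.out   # job output under localscratch

# Hypothetical training command -- replace with your framework's invocation.
python train.py
```

Submit with `sbatch job.sh` and monitor with `squeue -u $USER`; since this is a job-script fragment, it only runs under a SLURM installation.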

Please note that the /localscratch/ space is meant for saving your job outputs for a temporary period only; data in /localscratch/ older than 14 days (2 weeks) will be deleted.

SERC does not maintain any backups of the localscratch space data, and hence will not be responsible for any data loss.
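Given the 14-day purge window and the lack of backups, it is worth periodically checking what is about to expire and copying results to permanent storage in time. A minimal sketch, assuming your outputs live under a per-user directory /localscratch/<userid> (that layout and the destination path are assumptions, not documented paths):

```shell
# List files approaching the 14-day purge threshold (older than 13 days).
SCRATCH="/localscratch/$USER"
find "$SCRATCH" -type f -mtime +13 -print 2>/dev/null || true

# Copy anything worth keeping to permanent storage before it is purged, e.g.:
# rsync -av "$SCRATCH/results/" "$HOME/dgx-results/"
```

`find -mtime +13` matches files whose age exceeds 13 full 24-hour periods, i.e. those at or past the deletion boundary.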

How to Use DGX-1:

Accessing the system:

The NVIDIA DGX-1 cluster has one login node, nvidia-dgx, through which users access the cluster and submit jobs.
The machine is accessible for login using ssh from inside the IISc network:
ssh <computational_userid>@nvidia-dgx.serc.iisc.ac.in

The machine can be accessed after applying for basic HPC access, for which:

  • Fill in the online computational account form here and submit it by email to nisadmin.serc@iisc.ac.in.
  • The HPC application form must be duly signed by your Advisor/Research Supervisor.
  • Once the computational account is created, kindly fill in the NVIDIA DGX Access form to request access to the DGX.

Location of DGX 1 Cluster:

CPU Room – Ground Floor, SERC, IISc

For any queries, raise a ticket in the helpdesk or contact the System Administrator, Room #103, SERC.