The NVIDIA DGX-1 is a deep learning system architected for high throughput and high interconnect bandwidth to maximize neural network training performance. The core of the system is a complex of eight Tesla V100 GPUs connected in a hybrid cube-mesh NVLink network topology. In addition to the eight GPUs, DGX-1 includes two CPUs for boot, storage management, and deep learning framework coordination. DGX-1 is built into a three-rack-unit (3U) enclosure that provides power, cooling, networking, multi-system interconnect, and an SSD file system cache, balanced to optimize throughput and deep learning training time.

NVLink is an energy-efficient, high-bandwidth interconnect that enables NVIDIA GPUs to connect to peer GPUs or other devices within a node at an aggregate bidirectional bandwidth of up to 300 GB/s per GPU: over nine times that of a PCIe Gen3 x16 interconnection. Together, the NVLink interconnect and the DGX-1 architecture's hybrid cube-mesh GPU network topology enable the highest achievable data-exchange bandwidth between a group of eight Tesla V100 GPUs.


OEM – NVIDIA Corporation

Authorized seller – LOCUZ Enterprise Solutions Ltd

Hardware Overview:

GPUs – 8 x Tesla V100
GPU Memory – 256 GB total system
CPU – Dual 20-core Intel Xeon E5-2698 v4 2.2 GHz
NVIDIA CUDA cores – 40,960
NVIDIA Tensor cores (on V100 based systems) – 5,120
System Memory – 512 GB 2,133 MHz DDR4 RDIMM
Storage – 4 x 1.92 TB SSD RAID-0
Network – Dual 10 GbE




Performance – 1 petaFLOPS (mixed precision)

TESLA V100 GPU (NVLink) Performance     Single V100 GPU        Total (8 × V100)
Double Precision                        Up to 7.8 TFLOPS       Up to 62.4 TFLOPS
Single Precision                        Up to 15.7 TFLOPS      Up to 125.6 TFLOPS
Deep Learning (Mixed Precision)         Up to 125 TFLOPS       Up to 1 PFLOPS

 Software Overview:

Operating System – Ubuntu 16.04 LTS (Linux x86_64)

Deep Learning Frameworks


Job Submission System – SLURM


“Running jobs without SLURM will lead to blocking of the computational account”
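Since all work must go through SLURM, a job is wrapped in a batch script and submitted with `sbatch`. The fragment below is a minimal sketch, not a site-verified template: the GPU count, CPU count, time limit, output path, and the training command are all illustrative, and any partition or module names should be taken from the cluster's own settings (`sinfo`, `module avail`).

```shell
#!/bin/bash
#SBATCH --job-name=dgx-train                  # name shown in squeue
#SBATCH --gres=gpu:1                          # request one of the eight V100 GPUs
#SBATCH --cpus-per-task=8                     # CPU cores for data loading
#SBATCH --time=02:00:00                       # wall-clock limit (illustrative)
#SBATCH --output=/localscratch/%u/%x-%j.out   # job output under localscratch

# Hypothetical training command -- replace with your framework's invocation.
python train.py
```

Submit with `sbatch job.sh` and monitor with `squeue -u $USER`; since this is a job-script fragment, it only runs under a SLURM installation.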

Please note that the /localscratch/ space is meant for saving your job outputs for a temporary period only; data in /localscratch/ older than 14 days (2 weeks) will be deleted.

SERC does not maintain any backups of the localscratch space data, and hence will not be responsible for any data loss.
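Given the 14-day purge window and the lack of backups, it is worth periodically checking what is about to expire and copying results to permanent storage in time. A minimal sketch, assuming your outputs live under a per-user directory /localscratch/<userid> (that layout and the destination path are assumptions, not documented paths):

```shell
# List files approaching the 14-day purge threshold (older than 13 days).
SCRATCH="/localscratch/$USER"
find "$SCRATCH" -type f -mtime +13 -print 2>/dev/null || true

# Copy anything worth keeping to permanent storage before it is purged, e.g.:
# rsync -av "$SCRATCH/results/" "$HOME/dgx-results/"
```

`find -mtime +13` matches files whose age exceeds 13 full 24-hour periods, i.e. those at or past the deletion boundary.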

How to Use DGX-1:

Accessing the system:

The NVIDIA DGX-1 cluster has one login node, nvidia-dgx, through which users access the cluster and submit jobs.
The machine is accessible for login using ssh from inside the IISc network:
ssh <computational_userid>@nvidia-dgx.serc.iisc.ac.in

The machine can be accessed after applying for basic HPC access, for which:

  • Fill in the online computational account form here and submit it by email to nisadmin.serc@iisc.ac.in.
  • The HPC application form must be duly signed by your Advisor/Research Supervisor.
  • Once the computational account is created, kindly fill in the NVIDIA DGX Access form to request access to the DGX.

Location of DGX 1 Cluster:

CPU Room – Ground Floor, SERC, IISc

For any queries, raise a ticket in the helpdesk or contact the System Administrator, Room #103, SERC.