NVIDIA DGX-H100 CLUSTER

Introduction:

The NVIDIA DGX H100 is a universal system purpose-built for AI infrastructure and workloads, from analytics to training to inference. It is built around eight NVIDIA H100 Tensor Core GPUs.

  • 8x NVIDIA H100 GPUs with 640 GB of total GPU memory
      18x NVIDIA® NVLink® connections per GPU; 900 GB/s of bidirectional GPU-to-GPU bandwidth

  • 4x NVIDIA NVSwitches™
      7.2 TB/s of bidirectional GPU-to-GPU bandwidth, 1.5x more than the previous generation

  • Dual Intel Xeon Platinum 8480C processors (112 cores total) and 2 TB of system memory
      Powerful CPUs for the most intensive AI jobs

  • 30 TB NVMe SSD
      High-speed storage for maximum performance

Vendors: 

OEM – NVIDIA Corporation

Vendor – Frontier

 Hardware Overview: 

Table 1: Component Description

GPU – 8 x NVIDIA H100 GPUs that provide 640 GB of total GPU memory

CPU – 2 x Intel Xeon 8480C PCIe Gen5 CPUs, 56 cores each, 2.0/2.9/3.8 GHz (base/all-core turbo/max turbo)

NVSwitch – 4 x 4th-generation NVLinks that provide 900 GB/s of GPU-to-GPU bandwidth

Storage (OS) – 2 x 1.92 TB NVMe M.2 SSDs in a RAID 1 array

Storage (Data Cache) – 8 x 3.84 TB NVMe U.2 SEDs in a RAID 0 array

Network (cluster) card – 4 x OSFP ports for 8 x NVIDIA® ConnectX®-7 single-port InfiniBand cards. Each card provides the following speeds:

  • InfiniBand (default): up to 400 Gbps
  • Ethernet: 400GbE, 200GbE, 100GbE, 50GbE, 40GbE, 25GbE, and 10GbE

Network (storage and in-band management) card – 2 x NVIDIA® ConnectX®-7 dual-port Ethernet cards. Each card provides the following speeds:

  • Ethernet (default): 400GbE, 200GbE, 100GbE, 50GbE, 40GbE, 25GbE, and 10GbE
  • InfiniBand: up to 400 Gbps

System memory (DIMM) – 2 TB using 32 x DIMMs

BMC (out-of-band system management) – 1 GbE RJ45 interface; supports Redfish, IPMI, SNMP, KVM, and a web user interface

System management interfaces (optional) – Dual-port 100GbE in slot 3 and a 10 GbE RJ45 interface

Power supply – 6 x 3.3 kW
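On the node itself, the data-cache array above is what users later see as the /raid/ scratch space. Assuming the default mount layout, its capacity and the NVMe devices backing the arrays can be checked with standard tools:

df -h /raid                   # capacity and usage of the RAID 0 data cache
lsblk -d -o NAME,SIZE,MODEL   # lists the NVMe devices backing the OS and data arrays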


Table 2: Mechanical Specifications

Form factor – 8U rackmount

Height – 14 in (356 mm)

Width – 19 in (482.3 mm) max

Depth – 35.3 in (897.1 mm) max

System weight – 287.6 lb (130.45 kg) max


Performance:

NVLink: 4 fourth-generation NVLinks, providing 900 GB/s of GPU-to-GPU bandwidth. 
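On a node, the GPU and NVLink layout described above can be verified with the standard NVIDIA driver tools (a quick check; exact output varies with driver version):

nvidia-smi                  # lists the eight H100 GPUs with their memory and utilization
nvidia-smi topo -m          # prints the GPU-to-GPU NVLink/NVSwitch topology matrix
nvidia-smi nvlink --status  # reports the active NVLink links and speeds per GPU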

 

Software Overview: 

Ubuntu 22.04.2 LTS Linux OS – Linux x86_64 Platform 

Job submission system – SLURM

NOTE:

Running jobs without SLURM will lead to blocking of the computational account.

Please note that the /raid/ space is meant for saving your job outputs for a temporary period only. Raid space data older than 14 days (2 weeks) will be deleted.

SERC does not maintain any backups of the raid space data and hence will not be responsible for any data loss.
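A minimal SLURM batch script sketch is shown below; the resource values and the output path are illustrative placeholders, so adjust them to your job and check the available partitions and limits with sinfo:

#!/bin/bash
#SBATCH --job-name=h100-test            # job name shown in squeue
#SBATCH --nodes=1                       # a single DGX H100 node
#SBATCH --gres=gpu:1                    # request one H100 GPU (up to 8 per node)
#SBATCH --cpus-per-task=14              # CPU cores for the task
#SBATCH --time=01:00:00                 # wall-clock limit
#SBATCH --output=/raid/%u/job_%j.out    # output under /raid/ (illustrative path; purged after 14 days)

srun python train.py                    # replace with your actual workload

Submit and monitor the job with the usual SLURM commands:

sbatch job.sh       # submit the batch script
squeue -u $USER     # check queued and running jobs
scancel <jobid>     # cancel a job if needed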

How to Use DGXH100:

Accessing the system:

The NVIDIA DGX H100 cluster has one login node, DGXH100, through which users can access the cluster and submit jobs.
The machine is accessible for login using ssh from inside the IISc network:
ssh <computational_userid>@dgxh100.serc.iisc.ac.in

The machine can be accessed after applying for basic HPC access. To apply:

  • Fill the online computational account form here and submit it by mail to nisadmin.serc@iisc.ac.in.
  • The HPC application form must be duly signed by your Advisor/Research Supervisor.
  • Once the computational account is created, kindly fill the NVIDIA DGX H100 access form to access the DGX.
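Since /raid/ is not backed up and is purged after 14 days, copy important job outputs back to your own machine promptly. A typical transfer from a machine inside the IISc network (the remote path is illustrative) looks like:

scp <computational_userid>@dgxh100.serc.iisc.ac.in:/raid/<path_to_your_outputs>/results.tar.gz .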

Location of DGXH100 Cluster:

CPU Room – Ground Floor, SERC, IISc

For any queries, raise a ticket in the helpdesk or contact the System Administrator, Room #103, SERC.