NVIDIA DGX-H100 CLUSTER

Introduction:

The NVIDIA DGX H100 is a universal system purpose-built for AI infrastructure and workloads, from analytics to training to inference. It is built around eight NVIDIA H100 Tensor Core GPUs.

  • 8x NVIDIA H100 GPUs with 640 GB of total GPU memory
      18x NVIDIA® NVLink® connections per GPU; 900 GB/s of bidirectional GPU-to-GPU bandwidth

  • 4x NVIDIA NVSwitches™
      7.2 TB/s of bidirectional GPU-to-GPU bandwidth, 1.5x more than the previous generation

  • Dual Intel Xeon Platinum 8480C processors (112 cores total) and 2 TB of system memory
      Powerful CPUs for the most intensive AI jobs

  • 30 TB NVMe SSD
      High-speed storage for maximum performance

Vendors: 

OEM – NVIDIA Corporation

Vendor – Frontier

 Hardware Overview: 

Table 1: Component Description

GPU – 8 x NVIDIA H100 GPUs that provide 640 GB of total GPU memory

CPU – 2 x Intel Xeon 8480C PCIe Gen5 CPUs, 56 cores each, 2.0/2.9/3.8 GHz (base/all-core turbo/max turbo)

NVSwitch – 4 x 4th-generation NVLinks that provide 900 GB/s of GPU-to-GPU bandwidth

Storage (OS) – 2 x 1.92 TB NVMe M.2 SSDs in a RAID 1 array

Storage (Data Cache) – 8 x 3.84 TB NVMe U.2 SEDs in a RAID 0 array

Network (cluster) card – 4 x OSFP ports for 8 x NVIDIA® ConnectX®-7 single-port InfiniBand cards. Each card provides the following speeds:

  • InfiniBand (default): up to 400 Gbps
  • Ethernet: 400GbE, 200GbE, 100GbE, 50GbE, 40GbE, 25GbE, and 10GbE

Network (storage and in-band management) card – 2 x NVIDIA® ConnectX®-7 dual-port Ethernet cards. Each card provides the following speeds:

  • Ethernet (default): 400GbE, 200GbE, 100GbE, 50GbE, 40GbE, 25GbE, and 10GbE
  • InfiniBand: up to 400 Gbps

System memory (DIMM) – 2 TB using 32 x DIMMs

BMC (out-of-band system management) – 1 GbE RJ45 interface; supports Redfish, IPMI, SNMP, KVM, and a web user interface

System management interfaces (optional) – Dual-port 100GbE in slot 3 and a 10 GbE RJ45 interface

Power supply – 6 x 3.3 kW
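On the node itself, the data-cache array above is what users later see as the /raid/ scratch space. Assuming the default mount layout, its capacity and the NVMe devices backing the arrays can be checked with standard tools:

df -h /raid                   # capacity and usage of the RAID 0 data cache
lsblk -d -o NAME,SIZE,MODEL   # lists the NVMe devices backing the OS and data arrays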


Table 2: Mechanical Specifications

Form factor – 8U rackmount

Height – 14 in (356 mm)

Width – 19 in (482.3 mm) max

Depth – 35.3 in (897.1 mm) max

System weight – 287.6 lb (130.45 kg) max


Performance:

NVLink: 4 fourth-generation NVLinks, providing 900 GB/s of GPU-to-GPU bandwidth. 
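On a node, the GPU and NVLink layout described above can be verified with the standard NVIDIA driver tools (a quick check; exact output varies with driver version):

nvidia-smi                  # lists the eight H100 GPUs with their memory and utilization
nvidia-smi topo -m          # prints the GPU-to-GPU NVLink/NVSwitch topology matrix
nvidia-smi nvlink --status  # reports the active NVLink links and speeds per GPU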

 

Software Overview: 

Ubuntu 22.04.2 LTS Linux OS – Linux x86_64 Platform 

Job submission system – SLURM

NOTE:

Running jobs without SLURM will lead to blocking of the computational account.

Please note that the /raid/ space is meant for saving your job outputs for a temporary period only. Raid space data older than 14 days (2 weeks) will be deleted.

SERC does not maintain any backups of the raid space data and hence will not be responsible for any data loss.
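A minimal SLURM batch script sketch is shown below; the resource values and the output path are illustrative placeholders, so adjust them to your job and check the available partitions and limits with sinfo:

#!/bin/bash
#SBATCH --job-name=h100-test            # job name shown in squeue
#SBATCH --nodes=1                       # a single DGX H100 node
#SBATCH --gres=gpu:1                    # request one H100 GPU (up to 8 per node)
#SBATCH --cpus-per-task=14              # CPU cores for the task
#SBATCH --time=01:00:00                 # wall-clock limit
#SBATCH --output=/raid/%u/job_%j.out    # output under /raid/ (illustrative path; purged after 14 days)

srun python train.py                    # replace with your actual workload

Submit and monitor the job with the usual SLURM commands:

sbatch job.sh       # submit the batch script
squeue -u $USER     # check queued and running jobs
scancel <jobid>     # cancel a job if needed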

How to Use DGXH100:

Accessing the system:

The NVIDIA DGX H100 cluster has one login node, DGXH100, through which users can access the cluster and submit jobs.
The machine is accessible for login using ssh from inside the IISc network:
ssh <computational_userid>@dgxh100.serc.iisc.ac.in

The machine can be accessed after applying for basic HPC access. To apply:

  • Fill the online computational account form here and submit it by mail to nisadmin.serc@iisc.ac.in.
  • The HPC application form must be duly signed by your Advisor/Research Supervisor.
  • Once the computational account is created, kindly fill the NVIDIA DGX H100 access form to access the DGX.
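Since /raid/ is not backed up and is purged after 14 days, copy important job outputs back to your own machine promptly. A typical transfer from a machine inside the IISc network (the remote path is illustrative) looks like:

scp <computational_userid>@dgxh100.serc.iisc.ac.in:/raid/<path_to_your_outputs>/results.tar.gz .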

Location of DGXH100 Cluster:

CPU Room – Ground Floor, SERC, IISc

For any queries, raise a ticket in the helpdesk or contact the System Administrator, Room #103, SERC.