Introduction:
The NVIDIA DGX H100 System is the universal system purpose-built for all AI infrastructure and workloads, from analytics to training to inference. The system is built on eight NVIDIA H100 Tensor Core GPUs.
- 8x NVIDIA H100 GPUs with 640 GB of total GPU memory
  18x NVIDIA® NVLink® connections per GPU; 900 GB/s of bidirectional GPU-to-GPU bandwidth
- 4x NVIDIA NVSwitches™
  7.2 TB/s of bidirectional GPU-to-GPU bandwidth, 1.5x more than the previous generation
- Dual Intel Xeon Platinum 8480C processors, 112 cores total, and 2 TB of system memory
  Powerful CPUs for the most intensive AI jobs
- 30 TB NVMe SSD
  High-speed storage for maximum performance
Vendors:
OEM – NVIDIA Corporation.
Vendor – Frontier.
Hardware Overview:
Table 1: Component Description
Component | Description
GPU | 8 x NVIDIA H100 GPUs providing 640 GB of total GPU memory
CPU | 2 x Intel Xeon Platinum 8480C PCIe Gen5 CPUs, 56 cores each, 2.0/2.9/3.8 GHz (base/all-core turbo/max turbo)
NVSwitch | 4 x NVSwitches; 4th-generation NVLink provides 900 GB/s of GPU-to-GPU bandwidth
Storage (OS) | 2 x 1.92 TB NVMe M.2 SSDs in a RAID 1 array
Storage (Data Cache) | 8 x 3.84 TB NVMe U.2 SEDs in a RAID 0 array
Network (Cluster) card | 4 x OSFP ports for 8 x NVIDIA® ConnectX®-7 single-port InfiniBand cards (up to 400 Gb/s per card)
Network (storage and in-band management) card | 2 x NVIDIA® ConnectX®-7 dual-port Ethernet cards (up to 400 GbE per port)
System memory (DIMM) | 2 TB using 32 x DIMMs
BMC (out-of-band system management) | 1 GbE RJ45 interface; supports Redfish, IPMI, SNMP, KVM, and a web user interface
System management interfaces (optional) | Dual-port 100 GbE card in slot 3 and a 10 GbE RJ45 interface
Power supply | 6 x 3.3 kW
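These specifications can be cross-checked from a shell on the system using standard Linux and NVIDIA utilities (nvme list comes from the nvme-cli package and may require elevated privileges):

nvidia-smi -L                            # lists the 8 H100 GPUs
lscpu | grep -E 'Model name|^CPU\(s\)'   # CPU model and total core count
free -h                                  # total system memory (~2 TB)
nvme list                                # NVMe OS and data-cache drives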
Table 2: Mechanical Specifications
Feature | Description
Form Factor | 8U rackmount
Height | 14 in (356 mm)
Width | 19 in (482.3 mm) max
Depth | 35.3 in (897.1 mm) max
System Weight | 287.6 lb (130.45 kg) max
Performance:
NVLink: 4th-generation NVLink (18 connections per GPU) providing 900 GB/s of bidirectional GPU-to-GPU bandwidth.
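The NVLink topology and per-link status can be inspected on the node with standard nvidia-smi queries:

nvidia-smi nvlink --status   # per-GPU NVLink link count and speed
nvidia-smi topo -m           # GPU-to-GPU connectivity matrix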
Software Overview:
Ubuntu 22.04.2 LTS (Linux x86_64 platform)
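The installed release can be confirmed after logging in:

cat /etc/os-release   # shows Ubuntu 22.04.2 LTS
uname -srm            # kernel version and x86_64 architecture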
NOTE:
Running jobs without SLURM will lead to blocking of the computational account. The /raid/ space is meant for saving your job outputs for a temporary period only: raid space data older than 14 days (2 weeks) will be deleted. SERC does not maintain any backups of the raid space data and hence will not be responsible for any data loss.
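A minimal SLURM batch script for a GPU job is sketched below. The partition name and the training script are placeholders, not the actual DGXH100 settings; check sinfo and the SERC documentation for the real values:

#!/bin/bash
#SBATCH --job-name=h100-test
#SBATCH --partition=<partition>              # placeholder; list real partitions with sinfo
#SBATCH --gres=gpu:1                         # request one H100 GPU
#SBATCH --time=01:00:00
#SBATCH --output=/raid/%u/h100-test.%j.out   # /raid output is purged after 14 days

srun python train.py                         # train.py stands in for your workload

Submit with "sbatch job.sh", monitor with "squeue -u $USER", and remember to copy results out of /raid within the 14-day window.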
How to Use DGXH100:
Accessing the system:
The NVIDIA-DGXH100 cluster has one login node, DGXH100, through which the user can access the cluster and submit jobs.
The machine is accessible for login using ssh from inside the IISc network.
ssh <computational_userid>@dgxh100.serc.iisc.ac.in
The machine can be accessed after applying for basic HPC access, for which:
- Fill in the online computational account form and submit it by email to nisadmin.serc@iisc.ac.in.
- The HPC application form must be duly signed by your Advisor/Research Supervisor.
- Once the computational account is created, fill in the NVIDIA DGXH100 Access form to get access to the DGX.
Location of DGXH100 Cluster:
CPU Room – Ground Floor, SERC, IISc
For any queries, raise a ticket on the helpdesk or contact the System Administrator, Room #103, SERC.