EMF: System Design and Challenges for Disaggregated GPUs
in Datacenters for Efficiency, Modularity, and Flexibility
With Dennard scaling phasing out in the mid-2000s, architectural scaling and hardware specialization gained importance as a means to keep delivering performance benefits alongside an already stalling Moore's law. One outcome of this hardware specialization is the GPU, which exploits the data-level parallelism in an application. A common approach is therefore to augment the existing cloud infrastructure with accelerators like GPUs that cater to data-parallel, throughput-centric workloads ranging from AI and HPC to visualization. Further, the availability of GPUs in public cloud offerings has expedited their mass adoption. Simultaneously, modern cloud-based applications are placing increasing demands on the infrastructure in terms of versatility, performance, and efficiency. Given the high CAPEX and OPEX associated with GPUs, their optimal utilization becomes imperative. However, current deployment and allocation policies are inefficient for GPUs and suffer from issues like resource stranding. In this context, GPU disaggregation is proposed to alleviate such infrastructure inefficiencies (e.g., rigidity and resource stranding); disaggregating these expensive and power-hungry GPUs enables a cost-efficient and adaptive ecosystem for their deployment.
This work first evaluates the existing literature on IO disaggregation and GPU disaggregation to identify the limitations of these approaches in supporting an open and backward-compatible solution for GPU disaggregation. The NVIDIA GPU computing stack is then examined to determine how the disaggregation constructs could be met at different abstraction levels. The concept of Disaggregation Planes is introduced to evaluate the feasibility and limitations of a disaggregated solution for any general resource. The possible disaggregation approaches at each of these planes are assessed using three metrics:
1) Composability
2) Independent existence
3) Backward compatibility.
Further, the gains associated with disaggregated GPU deployments were quantified at the datacenter scale using metrics such as failed VM requests and GPU Watt-hours consumption. For this, a simulator called QUADD-SIM was built by extending CloudSim to model, quantify, and contrast different facets of these emerging GPU deployments. Using QUADD-SIM, different VM and resource provisioning aspects of disaggregated GPU deployments were modeled, and realistic AI workload requests spanning a period of three months, with characteristics derived from recent public datacenter traces, were simulated. The results attest that disaggregated GPU deployment strategies outperform traditional GPU deployments in reducing failed VM requests and GPU Watt-hours consumption. Through extensive experimentation, it was shown that the disaggregated GPU deployments serviced 5.14% and 7.90% additional VM requests that would otherwise have failed, while consuming 10.92% and 3.30% less GPU Watt-hours, respectively, compared to the traditional deployment.
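To make the resource-stranding effect behind the failed-VM-request metric concrete, the following minimal, self-contained Java sketch contrasts a traditional deployment, where all GPUs of a VM must come from a single server, with a disaggregated rack-level GPU pool. This is not QUADD-SIM or CloudSim code: the class names, rack dimensions, and request sizes are illustrative assumptions, and details such as GPU release on VM completion and the GPU Watt-hours accounting are deliberately omitted.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Simplified illustration (not the actual QUADD-SIM/CloudSim code); all names and
// numbers are assumptions. It counts failed VM requests under two GPU deployment models.
public class GpuDeploymentSketch {

    // A VM request asking for a number of GPUs.
    record VmRequest(int gpus) {}

    // Traditional deployment: leftover GPUs scattered across servers are stranded,
    // so a multi-GPU request can fail even when enough GPUs are free rack-wide.
    static int failedTraditional(List<VmRequest> reqs, int servers, int gpusPerServer) {
        int[] free = new int[servers];
        Arrays.fill(free, gpusPerServer);
        int failed = 0;
        for (VmRequest r : reqs) {
            int s = 0;
            while (s < servers && free[s] < r.gpus()) s++;  // first-fit within a single server
            if (s == servers) failed++;
            else free[s] -= r.gpus();
        }
        return failed;
    }

    // Disaggregated deployment: any free GPUs in the rack can be composed into a VM,
    // so a request fails only when the whole pool runs out.
    static int failedDisaggregated(List<VmRequest> reqs, int pooledGpus) {
        int free = pooledGpus, failed = 0;
        for (VmRequest r : reqs) {
            if (free < r.gpus()) failed++;
            else free -= r.gpus();
        }
        return failed;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<VmRequest> reqs = new ArrayList<>();
        for (int i = 0; i < 40; i++)                        // synthetic request stream, not the real traces
            reqs.add(new VmRequest(1 + rnd.nextInt(4)));    // 1-4 GPUs per VM
        System.out.println("traditional failed requests:   " + failedTraditional(reqs, 8, 4));
        System.out.println("disaggregated failed requests: " + failedDisaggregated(reqs, 8 * 4));
    }
}

Under the same request stream, the pooled allocator rejects a request only when the whole rack runs out of GPUs, whereas the server-bound allocator can also reject requests whose GPUs would have to be composed across servers; this difference is what the failed-VM-request metric captures at scale.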
Based on the identified limitations of existing approaches, the quantified benefits of disaggregated GPUs in large-scale deployments, and the possible solutions for disaggregating GPUs, a GPU disaggregation system design is proposed. This rack-level, open, and backward-compatible system design for GPU disaggregation is called EMF. In addition to the detailed system design, several critical design decisions of EMF are discussed, along with how these choices enable a scalable, efficient, and fault-tolerant ecosystem for GPU disaggregation. Further, requirements are specified for each of the EMF elements to allow easy prototyping of the system. Lastly, since disaggregated resources interact over the external datacenter network rather than over local interconnects confined within a server's physical boundaries, such interactions incur overheads that may degrade the performance of workloads executing in these disaggregated deployments. In this context, an evaluation of these overheads for EMF is also presented. The performance impact is evaluated by quantifying the worst-case latency overheads due to disaggregation. For this, the interactions between the proprietary NVIDIA host device driver and the GPU for data-transfer operations over PCIe were captured in an analytical model expressed in terms of Transaction Layer Packets (TLPs) to understand the performance impact in EMF.
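As an illustration of how such a TLP-level analysis can be framed, a first-order formulation is sketched below; the symbols are generic assumptions introduced for exposition and do not necessarily match the exact model used for EMF.

\[
N_{\mathrm{TLP}} = \left\lceil \frac{S}{\mathrm{MPS}} \right\rceil, \qquad
T_{\mathrm{local}} \approx N_{\mathrm{TLP}} \cdot \frac{\mathrm{MPS} + H}{B_{\mathrm{PCIe}}}, \qquad
T_{\mathrm{disagg}} \approx T_{\mathrm{local}} + N_{\mathrm{TLP}} \cdot \delta_{\mathrm{fabric}}
\]
\[
\text{worst-case latency overhead} = \frac{T_{\mathrm{disagg}} - T_{\mathrm{local}}}{T_{\mathrm{local}}}
\]

Here, $S$ is the size of the data transfer, $\mathrm{MPS}$ the PCIe maximum payload size, $H$ the per-TLP header overhead, $B_{\mathrm{PCIe}}$ the effective link bandwidth, and $\delta_{\mathrm{fabric}}$ the additional per-TLP latency introduced by traversing the rack-level fabric; the worst case assumes that every TLP pays this latency with no overlap between successive TLPs.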
Further, profiling of six Deep Learning applications and an empirical evaluation of the design showed that these worst-case overheads can vary from 7.6% to 20.2%, justifying the proposed design's practicality. It was also found that the latency overheads are directly correlated with the average throughput of the application, and that short-lived applications with bursty data-transfer characteristics may show visible performance degradation. Finally, some relevant future extensions of this study are enumerated.