Table of Contents
- Deployment Scenarios
- End-User Goals
- Intended Audience
- VMware vSphere 7 Update 2
- NVIDIA AI Enterprise Suite
- NVIDIA vGPU
- NVIDIA Multi-Instance GPU (MIG)
- Multi-Node (Distributed) Learning – Summary
- Single-Node Learning: Deployment Prerequisites
- Environment Preparation (Summary)
- NVIDIA Virtual GPU Manager Installation Process
Deployment Scenarios
Single-node Learning
- Description: Deploy one or more VMs on a single VMware ESXi™ server equipped with GPU(s).
- Ideal For: Running AI workloads on vSphere with NVIDIA GPU(s) using minimal configurations.
Multi-node Learning
- Description: Deploy multiple VMs across several interconnected ESXi servers that have GPU(s) and a Mellanox Network Interface Card.
- Ideal For: Scaling compute resources across multiple GPUs on different ESXi hosts to enhance AI workload performance. This setup requires additional hardware and configuration steps.
End-User Goals
Key Objectives for System Administrators
- Set up an AI/ML environment on vSphere by deploying virtual GPUs with one or more VMs on a single host.
- Confirm GPU sharing among multiple VMs to ensure efficient resource use for various users and applications.
- Prepare the Platform for NVIDIA AI Enterprise Suite Deployment by configuring an environment that meets all prerequisites.
- Enable vSphere vMotion® for vGPU-Enabled VMs to allow live migration of VMs running ML workloads between GPU-capable hosts (a configuration note follows this list).
- Build a Cluster for Distributed Training Workloads by creating a cluster of two or more networked host servers equipped with GPUs or NVIDIA virtual GPUs (vGPUs).
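Note on the vMotion goal: live migration of vGPU-enabled VMs is disabled by default. A minimal sketch of enabling it, assuming vCenter Server 7.0 Update 2 (the setting name is the one documented by VMware; verify against your release): in the vSphere Client, select the vCenter Server instance, open Configure > Advanced Settings > Edit Settings, and set:
vgpu.hotmigrate.enabled = true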
Intended Audience
- VMware Administrators
- Machine Learning Practitioners
- Data Scientists
- Application Architects
- IT Managers
VMware vSphere 7 Update 2
Released: March 2021
vSphere 7 Update 2 introduces support for modern GPUs and multi-instance GPUs, enabling improved sharing of AI/ML infrastructure among data scientists and GPU users.
Key Features Delivered
- Support for the Latest NVIDIA GPUs – Adds support for the NVIDIA Ampere architecture, including the NVIDIA A100 Tensor Core GPU, which delivers up to 20x the performance of previous-generation GPUs on some workloads.
- Enhanced Peer-to-Peer Performance – Integrates Address Translation Services (ATS) to speed up device-to-device transfers between NVIDIA NICs and GPUs (a per-VM configuration sketch follows this list).
- Support for NVIDIA Multi-Instance GPU (MIG) and vGPU vMotion – Adds MIG-backed vGPU profiles and live migration of vGPU-powered VMs, simplifying host consolidation, expansion, and upgrades.
- Optimized Workload Placement – VMware vSphere Distributed Resource Scheduler™ (DRS) automatically places AI workloads to ensure optimal resource use and avoid bottlenecks.
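For the peer-to-peer item above: P2P transfers are opted into per VM. A minimal sketch, assuming the VMX parameters that NVIDIA documents for vGPU peer-to-peer setups (added under VM Options > Advanced > Edit Configuration; confirm the exact keys against your vGPU release notes):
pciPassthru.allowP2P = "true"
pciPassthru.RelaxACSforP2P = "true"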
NVIDIA AI Enterprise Suite
An end-to-end, cloud-native suite of AI and data science applications and frameworks, optimized and certified by NVIDIA to run on VMware vSphere with NVIDIA-Certified Systems.
Includes essential technologies for the rapid deployment, management, and scaling of AI workloads in modern hybrid cloud environments.
NVIDIA vGPU
Enables multiple VMs to have simultaneous, direct access to a single physical GPU, or aggregates several physical GPUs within a single VM (a configuration sketch follows the list below). It delivers:
- High-performance compute
- Broad application compatibility
- Scalability
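As a sketch of how this sharing is configured: a VM receives a vGPU when an NVIDIA vGPU profile is attached to it as a shared PCI device in the vSphere Client. The resulting entry in the VM's .vmx file looks roughly like the line below, where grid_a100-4c is an illustrative profile name (actual names depend on the GPU model and vGPU software release):
pciPassthru0.vgpu = "grid_a100-4c"
Once VMs are running, the host's active vGPUs can be listed from the ESXi shell:
#nvidia-smi vgpu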
NVIDIA Multi-Instance GPU (MIG)
- Allows NVIDIA A100 GPUs to be partitioned into up to seven independent GPU compute slices for CUDA applications
- Provides separate, secure GPU resources for multiple users (a host-side enablement sketch follows this list)
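A minimal sketch of enabling MIG from the ESXi shell, assuming GPU index 0 and the nvidia-smi tool that ships with the vGPU Manager (a GPU reset or host reboot may be needed for the mode change to take effect):
#nvidia-smi -i 0 -mig 1
#nvidia-smi mig -lgip
The first command turns on MIG mode for GPU 0; the second lists the GPU instance profiles (the up-to-seven slices) that the GPU supports.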
Multi-Node (Distributed) Learning – Summary
To speed up AI/ML training, multi-node learning distributes work across VMs running on multiple ESXi hosts, each VM backed by a dedicated GPU.
These VMs communicate over the network using either TCP or RDMA (Remote Direct Memory Access); a quick way to check which mode a host supports is sketched after the list below.
Communication Modes
- Same ESXi host: Memory copy (no special NIC needed)
- Different ESXi hosts with host channel adapter (HCA) cards: Uses RDMA over the HCAs for the best performance
- Different ESXi hosts without HCA cards: Falls back to slower TCP-based communication
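A quick way to check which mode a host supports is to look for RDMA-capable adapters from the ESXi shell (a sketch, assuming ESXi 7.0 Update 2, where the rdma namespace is available):
#esxcli network nic list
#esxcli rdma device list
If the second command returns no devices, the host has no usable RDMA adapter and inter-host traffic falls back to TCP.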
Single-Node Learning: Deployment Prerequisites
Hardware Requirements
- Server: At least one server approved on both the VMware Hardware Compatibility List and the NVIDIA Virtual GPU Certified Servers List.
- GPU: At least one NVIDIA GPU installed in one of the servers.
Software Requirements
Download and install the following (a version check sketch follows this list):
- VMware vSphere Hypervisor (ESXi) 7.0 Update 2
- VMware vCenter Server 7.0 Update 2
- NVIDIA vGPU software 12.2 for VMware vSphere 7.0
- NVIDIA vGPU software license server
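After installing the hypervisor, the ESXi version and build can be confirmed from the shell before proceeding (either command works on ESXi 7.0 Update 2):
#vmware -vl
#esxcli system version get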
NVIDIA Virtual GPU Manager Installation Process
1. Enter Maintenance Mode
In vCenter, right-click the host and select “Enter Maintenance Mode”.
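The same step can be performed from the ESXi shell if preferred (running VMs must be powered off or migrated first):
#esxcli system maintenanceMode set --enable true
The matching command with --enable false exits maintenance mode in step 5.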
2. Remove Any Previously Installed NVIDIA vGPU VIB
If an older vGPU Manager VIB is present, SSH into the ESXi host and run:
#esxcli software vib remove -n NVIDIA-VMware_ESXi_7.0_Host_Driver
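If you are unsure of the exact VIB name on the host, list the installed NVIDIA VIBs first (the grep pattern is illustrative):
#esxcli software vib list | grep -i nvidia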
3. Install New NVIDIA vGPU VIB
Copy the VIB file to the host and run:
#esxcli software vib install -v /path/to/NVIDIA-vGPU-Host-Driver.vib
4. Reboot the ESXi Host
#reboot
5. Exit Maintenance Mode
In vCenter, right-click the host and select “Exit Maintenance Mode”.
6. Verify GPU Detection
SSH into the host and run:
#nvidia-smi
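If the vGPU Manager loaded correctly, nvidia-smi prints a table of the installed physical GPUs with the driver version and current utilization. If it instead reports that it cannot communicate with the driver, check that the NVIDIA kernel module is loaded (a standard check on ESXi):
#vmkload_mod -l | grep nvidia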