How to Build an AI-Ready Enterprise Platform Using vSphere 7 Update 2

Table of Contents

  1. Deployment Scenarios
  2. End-User Goals
  3. Intended Audience
  4. VMware vSphere 7 Update 2
  5. NVIDIA AI Enterprise Suite
  6. NVIDIA vGPU
  7. NVIDIA Multi-Instance GPU (MIG)
  8. Multi-Node (Distributed) Learning – Summary
  9. Single-Node Learning: Deployment Prerequisites
  10. Environment Preparation (Summary)
  11. NVIDIA Virtual GPU Manager Installation Process

Deployment Scenarios

Single-node Learning

  • Description: Deploy one or more VMs on a single VMware ESXi™ server equipped with GPU(s).
  • Ideal For: Running AI workloads on vSphere with NVIDIA GPU(s) using minimal configurations.

Multi-node Learning

  • Description: Deploy multiple VMs across several interconnected ESXi servers that have GPU(s) and a Mellanox Network Interface Card.
  • Ideal For: Scaling compute resources across multiple GPUs on different ESXi hosts to enhance AI workload performance. This setup requires additional hardware and configuration steps.
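
As a quick sanity check for this scenario (assuming SSH access to each ESXi host), the physical NICs can be listed from the ESXi shell to confirm that the Mellanox adapter is visible to the host; exact device names will vary:

# esxcli network nic list
# lspci | grep -i mellanox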

End-User Goals

Key Objectives for System Administrators

  • Set up an AI/ML environment on vSphere by deploying virtual GPUs with one or more VMs on a single host.
  • Confirm GPU sharing among multiple VMs to ensure efficient resource use for various users and applications.
  • Prepare the platform for NVIDIA AI Enterprise Suite deployment by configuring an environment that meets all prerequisites.
  • Enable vSphere vMotion® for vGPU-enabled VMs to allow live migration of VMs running ML workloads between GPU-capable hosts.
  • Build a cluster for distributed training workloads by creating a cluster of two or more networked host servers equipped with GPUs or NVIDIA virtual GPUs (vGPUs).

Intended Audience

  • VMware Administrators
  • Machine Learning Practitioners
  • Data Scientists
  • Application Architects
  • IT Managers

VMware vSphere 7 Update 2

Released: March 2021

vSphere 7 Update 2 introduces support for the latest NVIDIA GPUs and for NVIDIA Multi-Instance GPU (MIG), enabling improved sharing of AI/ML infrastructure among data scientists and other GPU users.

Key Features Delivered

  • Support for the Latest NVIDIA GPUs – Adds support for GPUs based on the NVIDIA Ampere architecture, including the NVIDIA A100 Tensor Core GPU, which offers up to 20X better performance than previous-generation GPUs for some workloads.
  • Enhanced Peer-to-Peer Performance – Integrates Address Translation Services (ATS) to boost peer-to-peer performance between NVIDIA GPUs and NICs.
  • Support for NVIDIA Multi-Instance GPU (MIG) and vGPU vMotion – A100 GPUs can be partitioned into isolated instances, and vGPU-powered VMs can be live-migrated, simplifying consolidation, expansion, or upgrades.
  • Optimized Workload Placement – VMware vSphere Distributed Resource Scheduler™ (DRS) automatically places AI workloads to ensure optimal resource use and avoid bottlenecks.

NVIDIA AI Enterprise Suite

An end-to-end, cloud-native suite of AI and data science applications and frameworks, optimized and certified by NVIDIA to run on VMware vSphere with NVIDIA-Certified Systems.

Includes essential technologies for the rapid deployment, management, and scaling of AI workloads in modern hybrid cloud environments.

NVIDIA vGPU

Enables simultaneous, direct access to a single physical GPU by multiple VMs, or aggregates multiple physical GPUs within a single VM. It delivers:

  • High-performance compute
  • Broad application compatibility
  • Scalability
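
As a rough sketch of the host-side setup (assuming the NVIDIA Virtual GPU Manager described later is already installed), the host's default graphics type is typically switched to shared direct so that VMs can be assigned vGPU profiles; this can also be done from the vSphere Client, and a host reboot or graphics service restart may be needed afterwards:

# esxcli graphics host set --default-type SharedPassthru
# esxcli graphics device list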

NVIDIA Multi-Instance GPU (MIG)

  • Allows an NVIDIA A100 GPU to be partitioned into up to seven independent, isolated GPU instances for CUDA applications
  • Provides separate, secure GPU resources for multiple users
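
As a minimal sketch (assuming the NVIDIA Virtual GPU Manager covered later is installed and the A100 is GPU index 0), MIG mode is enabled per GPU from the ESXi shell, after which the available GPU instance profiles can be listed; a GPU reset or host reboot may be required for the mode change to take effect:

# nvidia-smi -i 0 -mig 1
# nvidia-smi mig -lgip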

Multi-Node (Distributed) Learning – Summary

To speed up AI/ML training, multi-node learning uses VMs on multiple ESXi hosts, each equipped with a dedicated GPU.
These VMs communicate over the network using TCP or RDMA (Remote Direct Memory Access).

Communication Modes

  • Same ESXi host: Memory copy between VMs (no special NIC needed)
  • Different ESXi hosts with host channel adapter (HCA) cards: Uses RDMA over the HCA cards for best performance
  • Different ESXi hosts without HCA cards: Falls back to slower TCP-based communication
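
To see which mode a given host can use, the RDMA-capable devices detected by ESXi can be listed from the shell (a quick check, assuming SSH access; an empty list means communication will fall back to TCP):

# esxcli rdma device list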

Single-Node Learning: Deployment Prerequisites

Hardware Requirements

  • Server: At least one server approved on both the VMware Hardware Compatibility List and the NVIDIA Virtual GPU Certified Servers List.
  • GPU: At least one NVIDIA GPU must be installed in the server; a quick way to confirm that ESXi sees the GPU is shown below.
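
A quick check from the ESXi shell confirms that the GPU is visible on the PCI bus (the device description will vary by GPU model):

# lspci | grep -i nvidia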

Software Requirements

Download and install:

  • VMware vSphere Hypervisor (ESXi) 7.0 Update 2
  • VMware vCenter Server 7.0 Update 2
  • NVIDIA vGPU software 12.2 for VMware vSphere 7.0
  • NVIDIA vGPU software license server
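
After installation, the running ESXi version and build can be confirmed from the shell (a quick check; the build number will vary by patch level):

# vmware -vl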

NVIDIA Virtual GPU Manager Installation Process

1. Enter Maintenance Mode

In vCenter, right-click the host and select “Enter Maintenance Mode”.
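
The same step can be performed from an SSH session on the host if you prefer the command line (an equivalent, assuming no powered-on VMs remain on the host):

# esxcli system maintenanceMode set --enable true
# esxcli system maintenanceMode get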

2. Remove Any Existing NVIDIA vGPU VIB

SSH into the ESXi host and run:

# esxcli software vib remove -n NVIDIA-VMware_ESXi_7.0_Host_Driver
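
If you are unsure whether an NVIDIA driver is already installed, or what the VIB is named on your host, list the installed NVIDIA VIBs first (the name may differ from the example above):

# esxcli software vib list | grep -i nvidia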

3. Install New NVIDIA vGPU VIB

Copy the VIB file to the host and run:

# esxcli software vib install -v /path/to/NVIDIA-vGPU-Host-Driver.vib
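
Once the install command reports success, the same listing can be used to confirm that the new host driver VIB is present before rebooting (version strings will vary with the vGPU release):

# esxcli software vib list | grep -i nvidia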

4. Reboot the ESXi Host

# reboot

5. Exit Maintenance Mode

In vCenter, right-click the host and select “Exit Maintenance Mode”.

6. Verify GPU Detection

SSH into the host and run:

# nvidia-smi
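
If nvidia-smi shows the expected GPU(s), two further checks can confirm that the NVIDIA vmkernel module is loaded and that the host is ready to serve vGPUs (availability of these commands depends on the vGPU release installed):

# vmkload_mod -l | grep nvidia
# nvidia-smi vgpu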
