How to Build an AI-Ready Enterprise Platform Using vSphere 7 Update 2

Table of Contents

  1. Deployment Scenarios
  2. End-User Goals
  3. Intended Audience
  4. VMware vSphere 7 Update 2
  5. NVIDIA AI Enterprise Suite
  6. NVIDIA vGPU
  7. NVIDIA Multi-Instance GPU (MIG)
  8. Multi-Node (Distributed) Learning – Summary
  9. Single-Node Learning: Deployment Prerequisites
  10. Environment Preparation (Summary)
  11. NVIDIA Virtual GPU Manager Installation Process

Deployment Scenarios

Single-node Learning

  • Description: Deploy one or more VMs on a single VMware ESXi™ server equipped with GPU(s).
  • Ideal For: Running AI workloads on vSphere with NVIDIA GPU(s) using minimal configurations.

Multi-node Learning

  • Description: Deploy multiple VMs across several interconnected ESXi servers that have GPU(s) and a Mellanox Network Interface Card.
  • Ideal For: Scaling compute resources across multiple GPUs on different ESXi hosts to enhance AI workload performance. This setup requires additional hardware and configuration steps.
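
As a quick sanity check for this scenario (assuming SSH access to each ESXi host), the physical NICs can be listed from the ESXi shell to confirm that the Mellanox adapter is visible to the host; exact device names will vary:

# esxcli network nic list
# lspci | grep -i mellanox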

End-User Goals

Key Objectives for System Administrators

  • Set up an AI/ML environment on vSphere by deploying virtual GPUs with one or more VMs on a single host.
  • Confirm GPU sharing among multiple VMs to ensure efficient resource use for various users and applications.
  • Prepare the platform for NVIDIA AI Enterprise Suite deployment by configuring an environment that meets all prerequisites.
  • Enable vSphere vMotion® for vGPU-enabled VMs to allow live migration of VMs running ML workloads between GPU-capable hosts.
  • Build a cluster for distributed training workloads by creating a cluster of two or more networked host servers equipped with GPUs or NVIDIA virtual GPUs (vGPUs).

Intended Audience

  • VMware Administrators
  • Machine Learning Practitioners
  • Data Scientists
  • Application Architects
  • IT Managers

VMware vSphere 7 Update 2

Released: March 2021

vSphere 7 Update 2 introduces support for the latest NVIDIA GPUs and for NVIDIA Multi-Instance GPU (MIG), enabling improved sharing of AI/ML infrastructure among data scientists and other GPU users.

Key Features Delivered

  • Support for the Latest NVIDIA GPUs – Adds support for GPUs based on the NVIDIA Ampere architecture, including the NVIDIA A100 Tensor Core GPU, which offers up to 20X better performance than previous-generation GPUs for some workloads.
  • Enhanced Peer-to-Peer Performance – Integrates Address Translation Services (ATS) to boost peer-to-peer performance between NVIDIA GPUs and NICs.
  • Support for NVIDIA Multi-Instance GPU (MIG) and vGPU vMotion – A100 GPUs can be partitioned into isolated instances, and vGPU-powered VMs can be live-migrated, simplifying consolidation, expansion, or upgrades.
  • Optimized Workload Placement – VMware vSphere Distributed Resource Scheduler™ (DRS) automatically places AI workloads to ensure optimal resource use and avoid bottlenecks.

NVIDIA AI Enterprise Suite

An end-to-end, cloud-native suite of AI and data science applications and frameworks, optimized and certified by NVIDIA to run on VMware vSphere with NVIDIA-Certified Systems.

Includes essential technologies for the rapid deployment, management, and scaling of AI workloads in modern hybrid cloud environments.

NVIDIA vGPU

Enables simultaneous, direct access to a single physical GPU by multiple VMs, or aggregates multiple physical GPUs within a single VM. It delivers:

  • High-performance compute
  • Broad application compatibility
  • Scalability
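
As a rough sketch of the host-side setup (assuming the NVIDIA Virtual GPU Manager described later is already installed), the host's default graphics type is typically switched to shared direct so that VMs can be assigned vGPU profiles; this can also be done from the vSphere Client, and a host reboot or graphics service restart may be needed afterwards:

# esxcli graphics host set --default-type SharedPassthru
# esxcli graphics device list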

NVIDIA Multi-Instance GPU (MIG)

  • Allows an NVIDIA A100 GPU to be partitioned into up to seven independent, isolated GPU instances for CUDA applications
  • Provides separate, secure GPU resources for multiple users
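
As a minimal sketch (assuming the NVIDIA Virtual GPU Manager covered later is installed and the A100 is GPU index 0), MIG mode is enabled per GPU from the ESXi shell, after which the available GPU instance profiles can be listed; a GPU reset or host reboot may be required for the mode change to take effect:

# nvidia-smi -i 0 -mig 1
# nvidia-smi mig -lgip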

Multi-Node (Distributed) Learning – Summary

To speed up AI/ML training, multi-node learning uses VMs on multiple ESXi hosts, each equipped with a dedicated GPU.
These VMs communicate over the network using TCP or RDMA (Remote Direct Memory Access).

Communication Modes

  • Same ESXi host: Memory copy between VMs (no special NIC needed)
  • Different ESXi hosts with host channel adapter (HCA) cards: Uses RDMA over the HCA cards for best performance
  • Different ESXi hosts without HCA cards: Falls back to slower TCP-based communication
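
To see which mode a given host can use, the RDMA-capable devices detected by ESXi can be listed from the shell (a quick check, assuming SSH access; an empty list means communication will fall back to TCP):

# esxcli rdma device list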

Single-Node Learning: Deployment Prerequisites

Hardware Requirements

  • Server: At least one server approved on both the VMware Hardware Compatibility List and the NVIDIA Virtual GPU Certified Servers List.
  • GPU: At least one NVIDIA GPU must be installed in the server; a quick way to confirm that ESXi sees the GPU is shown below.
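
A quick check from the ESXi shell confirms that the GPU is visible on the PCI bus (the device description will vary by GPU model):

# lspci | grep -i nvidia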

Software Requirements

Download and install:

  • VMware vSphere Hypervisor (ESXi) 7.0 Update 2
  • VMware vCenter Server 7.0 Update 2
  • NVIDIA vGPU software 12.2 for VMware vSphere 7.0
  • NVIDIA vGPU software license server
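
After installation, the running ESXi version and build can be confirmed from the shell (a quick check; the build number will vary by patch level):

# vmware -vl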

NVIDIA Virtual GPU Manager Installation Process

1. Enter Maintenance Mode

In vCenter, right-click the host and select “Enter Maintenance Mode”.
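
The same step can be performed from an SSH session on the host if you prefer the command line (an equivalent, assuming no powered-on VMs remain on the host):

# esxcli system maintenanceMode set --enable true
# esxcli system maintenanceMode get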

2. Remove Any Existing NVIDIA vGPU VIB

SSH into the ESXi host and run:

# esxcli software vib remove -n NVIDIA-VMware_ESXi_7.0_Host_Driver
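
If you are unsure whether an NVIDIA driver is already installed, or what the VIB is named on your host, list the installed NVIDIA VIBs first (the name may differ from the example above):

# esxcli software vib list | grep -i nvidia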

3. Install New NVIDIA vGPU VIB

Copy the VIB file to the host and run:

# esxcli software vib install -v /path/to/NVIDIA-vGPU-Host-Driver.vib
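
Once the install command reports success, the same listing can be used to confirm that the new host driver VIB is present before rebooting (version strings will vary with the vGPU release):

# esxcli software vib list | grep -i nvidia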

4. Reboot the ESXi Host

# reboot

5. Exit Maintenance Mode

In vCenter, right-click the host and select “Exit Maintenance Mode”.

6. Verify GPU Detection

SSH into the host and run:

# nvidia-smi
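
If nvidia-smi shows the expected GPU(s), two further checks can confirm that the NVIDIA vmkernel module is loaded and that the host is ready to serve vGPUs (availability of these commands depends on the vGPU release installed):

# vmkload_mod -l | grep nvidia
# nvidia-smi vgpu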
