Installing NVIDIA Drivers on RHEL 9

This guide walks you through the steps to install and configure NVIDIA drivers and the NVIDIA Container Toolkit on Red Hat Enterprise Linux 9, enabling you to leverage your GPU for containerized workloads.

Step 1: Install Prerequisites

Before you begin, you need to set up your system with the necessary tools and repositories.

First, check that your NVIDIA GPU is detected by running the following command:

lspci | grep -i nvidia
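
If a supported GPU is present, the command prints one or more lines similar to the following (the exact model, PCI address, and controller type will vary by system; this is only an illustrative example):

01:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070] (rev a1)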

Next, enable the RHEL 9 CodeReady Builder repository and install the EPEL repository, which provides additional packages:

sudo subscription-manager repos --enable codeready-builder-for-rhel-9-$(uname -i)-rpms
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm

Finally, install the necessary build tools and libraries required for the driver installation:

sudo dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r) gcc make dkms acpid libglvnd-glx libglvnd-opengl libglvnd-devel pkgconfig
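
If you want to double-check that the headers you just installed match the running kernel, which is a common cause of DKMS build failures, the following check should print the contents of a matching source directory (the path is the standard location used by the kernel-devel package on RHEL):

ls /usr/src/kernels/$(uname -r)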

Step 2: Install the NVIDIA Drivers

Now you are ready to install the NVIDIA drivers from the official repository.

Add the official NVIDIA CUDA repository to your system:

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo
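
To confirm the repository was added, you can list the enabled repositories and look for the CUDA entry (the exact repository id depends on your architecture):

sudo dnf repolist | grep -i cuda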

Install the drivers using the dnf module command. This will install the latest version of the NVIDIA driver and configure it with the DKMS (Dynamic Kernel Module Support) framework.

sudo dnf module install -y nvidia-driver:latest-dkms
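
Once the module install finishes, you can ask DKMS whether the nvidia kernel module was built for your running kernel. The exact output format varies between DKMS versions, but it should list the nvidia module with a status of installed:

dkms status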

Step 3: Verify the Installation

After the installation is complete, it’s crucial to verify that the drivers are loaded correctly.

Check that the nvidia kernel module is loaded and that the open-source nouveau driver is not:

lsmod | grep nvidia
lsmod | grep nouveau

The first command should return a result, and the second should return nothing. This indicates the NVIDIA driver is in use instead of the default open-source one.
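
The driver packages normally disable nouveau for you. If nouveau is still loaded after a reboot, one common manual approach is to blacklist it and rebuild the initramfs; the file name used below is only a convention, not something the packages require:

echo -e "blacklist nouveau\noptions nouveau modeset=0" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo dracut --force
sudo reboot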

Next, run nvidia-smi to display information about your GPU and confirm that the drivers are working correctly.

nvidia-smi

Step 4: Install and Configure the NVIDIA Container Toolkit

To use your GPU with container runtimes like Podman, you need to install the NVIDIA Container Toolkit.

Configure the NVIDIA Container Toolkit Repository:

curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

Install the nvidia-container-toolkit package:

sudo dnf install -y nvidia-container-toolkit
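
A quick way to confirm the toolkit installed correctly is to print its version:

nvidia-ctk --version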

Generate the CDI (Container Device Interface) specification file, which allows Podman to discover and expose your GPU devices to containers:

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
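
You can list the device names exposed by the generated specification; you should see an nvidia.com/gpu=all entry plus one entry per GPU index:

nvidia-ctk cdi list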

If you are running Podman rootless, you need to set the SELinux boolean to allow containers to access devices:

sudo setsebool -P container_use_devices=true
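
To confirm the boolean took effect, query it with getsebool; it should report the value as on:

getsebool container_use_devices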

Step 5: Verify the Installation

Finally, test the entire setup by running a container that uses the GPU. This command pulls an official NVIDIA CUDA image based on UBI 9 and runs nvidia-smi inside it.

podman run --rm --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi

If the command successfully outputs information about your GPU, your installation is complete and ready for containerized workloads.
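
If you only want to expose a single GPU to a container rather than all of them, you can reference it by index using the device names from the CDI specification; the image below is the same one used in the test above:

podman run --rm --device nvidia.com/gpu=0 nvcr.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi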

The following bash script automates all of the steps above:

#!/bin/bash

# This script automates the installation of the NVIDIA drivers and the NVIDIA Container Toolkit on RHEL 9.
# It uses sudo for privileged commands, so run it as a user with sudo access.

# Exit immediately if a command exits with a non-zero status.
set -e
set -o pipefail

echo "Starting NVIDIA driver and Container Toolkit installation on RHEL 9..."

# --- Step 1: Install Prerequisites ---
echo "--- Step 1: Installing prerequisites ---"

# Check for NVIDIA GPU
if ! lspci | grep -i nvidia &> /dev/null; then
    echo "No NVIDIA GPU detected. Exiting."
    exit 1
fi
echo "NVIDIA GPU detected. Proceeding with installation."

# Enable the RHEL 9 CodeReady Builder repository
echo "Enabling CodeReady Builder repository..."
sudo subscription-manager repos --enable codeready-builder-for-rhel-9-$(uname -i)-rpms

# Install the EPEL repository
echo "Installing EPEL repository..."
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm

# Install necessary build tools and libraries
echo "Installing build tools and libraries..."
sudo dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r) gcc make dkms acpid libglvnd-glx libglvnd-opengl libglvnd-devel pkgconfig

# --- Step 2: Install the NVIDIA Drivers ---
echo "--- Step 2: Installing NVIDIA drivers ---"

# Add the official NVIDIA CUDA repository
echo "Adding official NVIDIA CUDA repository..."
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -i)/cuda-rhel9.repo

# Install the drivers using the dnf module command
echo "Installing NVIDIA drivers..."
sudo dnf module install -y nvidia-driver:latest-dkms

# --- Step 3: Verify the Driver Installation ---
echo "--- Step 3: Verifying driver installation ---"

# Check that the nvidia kernel module is loaded and nouveau is not
echo "Checking for nvidia kernel module..."
if lsmod | grep nvidia &> /dev/null; then
    echo "NVIDIA kernel module loaded successfully."
else
    echo "ERROR: NVIDIA kernel module not loaded. Check for errors above."
    exit 1
fi

echo "Checking for nouveau module..."
if lsmod | grep nouveau &> /dev/null; then
    echo "WARNING: Nouveau module is still loaded. Please reboot your system to apply changes."
    read -p "Press Enter to continue without rebooting or Ctrl+C to exit."
fi

# Run nvidia-smi to confirm
echo "Running nvidia-smi to confirm drivers are working..."
if ! nvidia-smi &> /dev/null; then
    echo "ERROR: nvidia-smi failed. The drivers may not be working correctly."
    exit 1
fi
echo "nvidia-smi check passed. Drivers are working."

# --- Step 4: Install and configure NVIDIA Container Toolkit ---
echo "--- Step 4: Installing and configuring NVIDIA Container Toolkit ---"

# Configure the NVIDIA Container Toolkit Repository
echo "Configuring NVIDIA Container Toolkit repository..."
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

# Install the nvidia-container-toolkit package
echo "Installing nvidia-container-toolkit..."
sudo dnf install -y nvidia-container-toolkit

# Generate the CDI specification file
echo "Generating CDI specification file..."
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Set SELinux boolean
echo "Setting SELinux boolean to allow containers access to devices..."
sudo setsebool -P container_use_devices=true

# --- Step 5: Verify the Installation ---
echo "--- Step 5: Verifying final installation with Podman ---"

# Run a test container
echo "Running a test container with GPU access. This may take a few moments..."
if sudo podman run --rm --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi; then
    echo "============================================================"
    echo "✅ Success: All components are installed and working correctly!"
    echo "============================================================"
else
    echo "============================================================"
    echo "❌ ERROR: Verification with Podman failed. Check output above."
    echo "============================================================"
    exit 1
fi
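
To use the script, save it to a file (the name below is just an example), make it executable, and run it as a user with sudo privileges:

chmod +x install-nvidia-rhel9.sh
./install-nvidia-rhel9.sh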
