Set Up an NVIDIA GPU Computing Stack
Docker, DCGM, Triton Server & Enterprise Tools on Ubuntu 22.04
This comprehensive guide will walk you through setting up a professional-grade NVIDIA GPU computing environment on Ubuntu 22.04, similar to what you'd find on NVIDIA DGX systems. Whether you're using a laptop with a single GPU or a workstation, this guide will help you create a robust platform for AI/ML workloads, inference serving, and GPU-accelerated computing.
By the end of this guide, you'll have the NVIDIA driver, Docker with GPU support, DCGM monitoring, the NGC CLI, and Triton Inference Server installed and validated.
Before starting, ensure you have an Ubuntu 22.04 system with an NVIDIA GPU, sudo privileges, a working internet connection, and enough free disk space for several multi-gigabyte container images.
Secure Boot can prevent NVIDIA kernel modules from loading. You have two options:
Option 1: Disable Secure Boot (Recommended for beginners). Reboot into your UEFI/BIOS settings and turn Secure Boot off.
Option 2: MOK (Machine Owner Key) Enrollment. If you must keep Secure Boot enabled, you'll need to sign the kernel modules yourself. This is an advanced topic beyond this guide's scope.
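If you're not sure whether Secure Boot is currently enabled, a quick check (assuming the mokutil package is present, as it is on most Ubuntu installs) is:
bash
$ mokutil --sb-state
SecureBoot enabled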
First, let's confirm your system can see the GPU:
bash
$ lspci | grep -i nvidia
Expected Output:
01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile]
If you see no output, check that the card is properly seated (on desktops), that the GPU is enabled in your BIOS/UEFI, and that hybrid-graphics laptops haven't disabled the discrete GPU.
NVIDIA drivers compile kernel modules using DKMS, which requires matching kernel headers:
bash
$ uname -r
5.15.0-91-generic
$ sudo apt update
$ sudo apt install -y linux-headers-$(uname -r)
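To double-check that the headers package for your running kernel actually landed, an optional verification is:
bash
$ dpkg -s linux-headers-$(uname -r) | grep -E 'Package|Status'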
Troubleshooting:
- If no headers are found for your running kernel, run sudo apt upgrade and reboot.
- To see which header packages are installed: ls /usr/src | grep linux-headers

The open-source Nouveau driver must be disabled before installing NVIDIA proprietary drivers.
bash
$ lsmod | grep nouveau
If you see output, Nouveau is active and needs to be disabled.
bash
$ cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
bash
$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-5.15.0-91-generic
bash
$ sudo reboot
After reboot:
bash
$ lsmod | grep nouveau
Should return no output.
Troubleshooting:
- Confirm the blacklist file exists: cat /etc/modprobe.d/blacklist-nouveau.conf
- On systems using systemd-boot, run sudo bootctl update after initramfs generation

Ubuntu 22.04 offers two driver options. Choose based on your GPU generation. Note that both metapackages below (nvidia-open and cuda-drivers) come from NVIDIA's CUDA apt repository; if apt can't find them, add that repository first via the cuda-keyring package shown in the DCGM section, or install Ubuntu's own nvidia-driver-535 / nvidia-driver-535-open packages instead.
For RTX 20-series, RTX 30-series, RTX 40-series, and newer:
bash
$ sudo apt update
$ sudo apt install -y nvidia-open
Reading package lists... Done
Building dependency tree... Done
...
Setting up nvidia-open (560.35.03-0ubuntu1) ...
For Maxwell, Pascal, Volta, or older GPUs:
bash
$ sudo apt update
$ sudo apt install -y cuda-drivers
bash
$ sudo reboot
bash
$ nvidia-smi
Expected Output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 45C P8 8W / 80W | 1MiB / 6144MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
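Beyond the default table, nvidia-smi can emit machine-readable output, which is handy for scripts; a minimal example:
bash
# One CSV row per GPU, similar to the values in the table above
$ nvidia-smi --query-gpu=name,driver_version,memory.total,temperature.gpu --format=csv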
Troubleshooting:
- Check dmesg | grep -i nvidia for errors
- If the install failed, remove it with sudo apt remove nvidia-* and try the other driver option
- Make sure build tools are installed: sudo apt install build-essential
- List installed NVIDIA packages: dpkg -l | grep nvidia

We'll use Docker's official repository for the latest stable version.
bash
# Install prerequisites
$ sudo apt-get update
$ sudo apt-get install -y ca-certificates curl
# Create keyrings directory
$ sudo install -m 0755 -d /etc/apt/keyrings
# Add Docker's official GPG key
$ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
-o /etc/apt/keyrings/docker.asc
$ sudo chmod a+r /etc/apt/keyrings/docker.asc
# Set up the repository
$ echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
bash
$ sudo apt-get update
$ sudo apt-get install -y docker-ce docker-ce-cli containerd.io \
docker-buildx-plugin docker-compose-plugin
bash
$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled)
Active: active (running) since Mon 2024-01-15 10:30:00 EST
$ sudo docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
bash
$ sudo usermod -aG docker $USER
$ newgrp docker
# Test without sudo
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
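If docker ps still reports a permission error after a fresh login, confirm that your user actually landed in the docker group (a quick sanity check):
bash
# The printed group list should include "docker"
$ id -nG "$USER"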
Troubleshooting:
- If Docker fails to start, check the logs: sudo journalctl -xeu docker.service

The NVIDIA Container Toolkit is the critical component that enables Docker containers to access your GPU.
bash
# Add the GPG key
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add the repository
$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
bash
$ sudo apt-get update
# Install specific version for stability
$ export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
$ sudo apt-get install -y \
nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
bash
$ sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading docker config from /etc/docker/daemon.json
INFO[0000] Config file does not exist, creating new one
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended to restart the Docker daemon
$ sudo systemctl restart docker
bash
$ sudo docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
You should see the same nvidia-smi output as on your host system.
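The --gpus flag also accepts device selectors, which is useful on multi-GPU systems when you want to pin a container to a particular GPU; for example:
bash
# Expose only GPU 0 to the container
$ docker run --rm --gpus device=0 nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
# Or expose a specific number of GPUs
$ docker run --rm --gpus 1 nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi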
Troubleshooting:
- If containers can't see the GPU, restart Docker: sudo systemctl restart docker
- Confirm the runtime was registered: cat /etc/docker/daemon.json
- Test with a minimal image: docker run --rm --gpus all ubuntu nvidia-smi
- Running systemctl daemon-reload can cause containers to lose GPU access. Solution: restart Docker and reconfigure the runtime.

DCGM provides enterprise-grade GPU monitoring and diagnostics.
bash
# Download and install CUDA keyring
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
# Update package lists
$ sudo apt-get update
bash
$ sudo apt-get install -y datacenter-gpu-manager
bash
$ sudo systemctl enable nvidia-dcgm
$ sudo systemctl start nvidia-dcgm
# Verify service status
$ sudo systemctl status nvidia-dcgm
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled)
Active: active (running)
bash
# Quick diagnostic (Level 1)
$ dcgmi diag -r 1
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
| Deployment | Pass |
| Blacklist | Pass |
| NVML Library | Pass |
| CUDA Library | Pass |
+---------------------------+------------------------------------------------+
# Comprehensive diagnostic (Level 3)
$ dcgmi diag -r 3
bash
# Set active health checks
$ dcgmi health -s a
Health monitor systems set successfully
# Check health status
$ dcgmi health -c
Health Status: Healthy
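For ad-hoc monitoring you can also stream individual metrics with dcgmi dmon. The field IDs below (150 = temperature, 155 = power, 203 = GPU utilization) are the commonly used ones, but consult the DCGM documentation for the full field list on your version:
bash
# List GPUs known to DCGM
$ dcgmi discovery -l
# Stream temperature, power, and utilization once per second, five samples
$ dcgmi dmon -e 150,155,203 -c 5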
Troubleshooting:
- If the host nvidia-dcgm service isn't available, DCGM can also be run from a container:
bash
docker run --gpus all --rm -v /var/run/docker.sock:/var/run/docker.sock \
nvidia/dcgm:latest
- To reset a hung GPU: sudo nvidia-smi --gpu-reset -i 0

NGC (NVIDIA GPU Cloud) CLI provides access to enterprise containers and models.
bash
# For .deb package (Ubuntu/Debian)
$ sudo dpkg -i ngc-cli_3.31.0_amd64.deb
# Or extract and add to PATH for portable installation
$ tar -xvf ngc-cli_3.31.0_linux.tar.gz
$ export PATH=$PATH:$(pwd)/ngc-cli
bash
$ ngc config set
Enter API key [no-apikey]: <your-api-key>
Enter CLI output format type [ascii]: ascii
Enter org [no-org]: <your-org>
Enter team [no-team]: <your-team>
Enter ace [no-ace]:
Configuration saved to /home/user/.ngc/config
bash
$ ngc user who
User Information
Username: your.email@example.com
Org: your-organization
Team: your-team
ACE: nv-us-west-2
$ ngc registry model list --format_type csv | head -5
"Name","Repository","Latest Version","Application","Framework","Precision","Use Case"
"BERT Large","nvidia/bert","1.0","Natural Language Processing","PyTorch","FP16","Question Answering"
Troubleshooting:
- If authentication fails, generate a fresh API key at ngc.nvidia.com and rerun ngc config set.
Triton is NVIDIA's production-grade inference serving solution.
bash
# Clone Triton server repository
$ git clone https://github.com/triton-inference-server/server.git
$ cd server/docs/examples
# Fetch example models
$ ./fetch_models.sh
Downloading densenet_onnx model...
Downloading inception_graphdef model...
Downloading simple model...
Model repository prepared at: ./model_repository
bash
# Start Triton with GPU support
$ docker run --gpus=1 --rm \
-p8000:8000 -p8001:8001 -p8002:8002 \
-v $(pwd)/model_repository:/models \
nvcr.io/nvidia/tritonserver:24.01-py3 \
tritonserver --model-repository=/models
I0115 10:00:00.000000 1 server.cc:650]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0115 10:00:00.000000 1 server.cc:677]
+--------+------+--------+
| Model | Ver | Status |
+--------+------+--------+
| simple | 1 | READY |
+--------+------+--------+
I0115 10:00:00.000000 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0115 10:00:00.000000 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0115 10:00:00.000000 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
In a new terminal:
bash
# Check if server is ready
$ curl -v localhost:8000/v2/health/ready
* Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
>
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
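Triton also publishes Prometheus metrics on port 8002, which is worth a quick sanity check before wiring up monitoring:
bash
# Inspect the first few Prometheus metrics exposed by Triton
$ curl -s localhost:8002/metrics | head -n 15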
bash
# Pull client SDK image
$ docker pull nvcr.io/nvidia/tritonserver:24.01-py3-sdk
# Run inference on example image
$ docker run -it --rm --net=host \
-v $(pwd):/workspace \
nvcr.io/nvidia/tritonserver:24.01-py3-sdk \
/workspace/install/bin/image_client \
-m densenet_onnx -c 3 -s INCEPTION \
/workspace/images/mug.jpg
Request 0, batch size 1
Image '/workspace/images/mug.jpg':
15.349568 (504) = COFFEE MUG
13.227468 (968) = CUP
10.424893 (505) = COFFEEPOT
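You can also hit the HTTP/REST inference API directly with curl. The sketch below assumes the example "simple" model keeps its stock signature (two INT32 inputs of shape [1,16] named INPUT0 and INPUT1); check the model metadata first if in doubt:
bash
# Inspect the model's inputs and outputs
$ curl -s localhost:8000/v2/models/simple
# Send a KServe v2 inference request
$ curl -s -X POST localhost:8000/v2/models/simple/infer \
  -H 'Content-Type: application/json' \
  -d '{"inputs":[
        {"name":"INPUT0","shape":[1,16],"datatype":"INT32","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},
        {"name":"INPUT1","shape":[1,16],"datatype":"INT32","data":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}]}'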
Troubleshooting:
- If ports 8000-8002 are already in use, map different host ports, for example -p8010:8000 -p8011:8001 -p8012:8002
- Metrics are served on port 8002 by default (--metrics-port=8002)

For production deployments with enterprise support, consider NVIDIA AI Enterprise.
bash
# Install license client
$ sudo apt-get install -y nvidia-license-client
# Configure license token
$ sudo nvidia-license-client --token-config /etc/nvidia/license-client.tok
bash
# Login to NGC with enterprise credentials
$ docker login nvcr.io
Username: $oauthtoken
Password: <your-api-key>
# Pull enterprise containers
$ docker pull nvcr.io/nvidia/tensorflow:24.01-tf2-py3
$ docker pull nvcr.io/nvidia/pytorch:24.01-py3
$ docker pull nvcr.io/nvidia/rapids/rapids:24.01-cuda12.0-runtime-ubuntu22.04-py3.10
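A quick smoke test that a pulled framework container actually sees the GPU (using the PyTorch image pulled above):
bash
# Should print True plus the detected GPU name
$ docker run --gpus all --rm nvcr.io/nvidia/pytorch:24.01-py3 \
  python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"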
bash
# Example: Deploy Llama model via NIM
$ docker run --gpus all --rm \
-p 8000:8000 \
-e NGC_API_KEY=$NGC_API_KEY \
nvcr.io/nvidia/nim/llama-2-7b-chat:latest
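Once the NIM container is up, recent NIM images expose an OpenAI-compatible API on port 8000. A hedged sketch of querying it; the model ID must match whatever the /v1/models endpoint reports for your container:
bash
# List the model IDs served by this NIM container
$ curl -s localhost:8000/v1/models
# Send a chat completion request (replace the model ID with one from the list above)
$ curl -s -X POST localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"<model-id-from-above>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'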
Create a validation script:
bash
$ cat > validate_system.sh << 'EOF'
#!/bin/bash
echo "=== System Validation Report ==="
echo ""
# Check NVIDIA Driver
echo "1. NVIDIA Driver Check:"
if nvidia-smi &>/dev/null; then
echo " ✓ Driver installed"
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
else
echo " ✗ Driver not detected"
fi
echo ""
# Check Docker
echo "2. Docker Check:"
if docker --version &>/dev/null; then
echo " ✓ Docker installed"
docker --version
else
echo " ✗ Docker not installed"
fi
echo ""
# Check GPU in Docker
echo "3. Docker GPU Access:"
if docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi &>/dev/null; then
echo " ✓ GPU accessible in containers"
else
echo " ✗ GPU not accessible in containers"
fi
echo ""
# Check DCGM
echo "4. DCGM Check:"
if dcgmi discovery -l &>/dev/null; then
echo " ✓ DCGM installed and running"
dcgmi discovery -l | head -3
else
echo " ✗ DCGM not available"
fi
echo ""
# Check NGC
echo "5. NGC CLI Check:"
if ngc --version &>/dev/null; then
echo " ✓ NGC CLI installed"
ngc --version
else
echo " ✗ NGC CLI not installed"
fi
echo ""
# Check Triton
echo "6. Triton Server Check:"
if curl -sf localhost:8000/v2/health/ready &>/dev/null; then
echo " ✓ Triton server is running"
else
echo " ✗ Triton server not running or not accessible"
fi
echo ""
echo "=== Validation Complete ==="
EOF
$ chmod +x validate_system.sh
$ ./validate_system.sh
Create a monitoring dashboard script:
bash
$ cat > gpu_monitor.sh << 'EOF'
#!/bin/bash
while true; do
clear
echo "=== GPU Monitoring Dashboard ==="
echo "Timestamp: $(date)"
echo ""
# GPU Utilization
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total \
--format=csv,noheader,nounits
echo ""
echo "=== Docker Containers Using GPU ==="
docker ps --filter "label=com.nvidia.gpu" --format "table {{.Names}}\t{{.Status}}"
echo ""
echo "=== DCGM Health Status ==="
dcgmi health -c | grep "Health Status"
echo ""
echo "Press Ctrl+C to exit"
sleep 5
done
EOF
$ chmod +x gpu_monitor.sh
$ ./gpu_monitor.sh
Problem: NVIDIA-SMI shows "No devices were found"
bash
# Check if driver is loaded
$ lsmod | grep nvidia
# If empty, try loading manually
$ sudo modprobe nvidia
# Check for errors
$ dmesg | grep -i nvidia | tail -20
Problem: Driver version mismatch with CUDA
bash
# Check compatibility
$ nvidia-smi
# Note the CUDA Version shown (e.g., 12.4)
# Use matching CUDA container versions
$ docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Problem: "could not select device driver" error
bash
# Reconfigure container runtime
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
# Test again
$ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Problem: Container loses GPU access after system updates
bash
# This is a known systemd issue. Fix:
$ sudo systemctl restart docker
$ sudo nvidia-ctk runtime configure --runtime=docker
Enable Persistence Mode for Better Performance:
bash
# Enable persistence mode
$ sudo nvidia-smi -pm 1
# Set power limit (example: 250W)
$ sudo nvidia-smi -pl 250
# Set GPU clocks for compute (optional)
$ sudo nvidia-smi -ac 5001,1500
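The memory,graphics pairs accepted by -ac vary per GPU model (and some GeForce cards don't allow setting application clocks at all); you can list what your card actually supports with:
bash
# Show supported memory/graphics clock combinations
$ nvidia-smi -q -d SUPPORTED_CLOCKS | head -n 20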
Configure Memory Growth for TensorFlow:
python
import tensorflow as tf

# Allow memory growth so TensorFlow doesn't grab all GPU memory up front
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
Weekly:
- Check GPU health: dcgmi health -c
- Review Docker disk usage: docker system df
- Clean up unused images and containers: docker system prune -a

Monthly:
- Update system packages: sudo apt update && sudo apt upgrade
- Refresh your CUDA base images, e.g. docker pull nvidia/cuda:12.4.0-base-ubuntu22.04
- Run a full diagnostic: dcgmi diag -r 3

bash
# Create backup directory
$ mkdir -p ~/gpu-setup-backup
# Backup configurations
$ cp /etc/docker/daemon.json ~/gpu-setup-backup/
$ cp /etc/modprobe.d/blacklist-nouveau.conf ~/gpu-setup-backup/
$ cp ~/.ngc/config ~/gpu-setup-backup/ngc-config
You now have a professional-grade GPU computing environment similar to NVIDIA DGX systems. This setup provides GPU-accelerated Docker containers, enterprise-grade monitoring with DCGM, production inference serving with Triton, and access to NVIDIA's NGC catalog of containers and models.
Remember to keep your system updated and regularly check NVIDIA's documentation for the latest best practices and security updates. Happy computing!