How to Set Up NVIDIA GPU Computing Environment with Docker, DCGM, and Triton on Ubuntu

This comprehensive guide will walk you through setting up a professional-grade NVIDIA GPU computing environment on Ubuntu 22.04, similar to what you'd find on NVIDIA DGX systems. Whether you're using a laptop with a single GPU or a workstation, this guide will help you create a robust platform for AI/ML workloads, inference serving, and GPU-accelerated computing.

What You'll Build

By the end of this guide, you'll have:

  • NVIDIA GPU drivers properly configured
  • Docker with GPU support
  • NVIDIA Data Center GPU Manager (DCGM) for monitoring and diagnostics
  • NGC (NVIDIA GPU Cloud) CLI for accessing enterprise containers
  • Triton Inference Server for model deployment
  • Optional NVIDIA AI Enterprise components

Prerequisites

Before starting, ensure you have:

  • Ubuntu 22.04 LTS (Jammy Jellyfish) installed
  • An NVIDIA GPU (GeForce, Quadro, or Data Center GPU)
  • Administrator (sudo) access
  • At least 20GB of free disk space
  • Active internet connection
  • Basic familiarity with Linux command line
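
A quick way to confirm the release and available headroom before starting (the mount point checked here is an assumption; point it at wherever Docker images and models will live):

bash

# Confirm Ubuntu release
$ lsb_release -ds

# Check free disk space on the root filesystem
$ df -h /

# Check memory and CPU headroom
$ free -h && nproc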

Step 0: System Preparation and Verification

Disable Secure Boot

Secure Boot can prevent NVIDIA kernel modules from loading. You have two options:

Option 1: Disable Secure Boot (Recommended for beginners)

  1. Reboot your system
  2. Enter BIOS/UEFI settings (usually F2, F10, F12, or DEL during boot)
  3. Find Secure Boot settings (often under Security or Boot tabs)
  4. Disable Secure Boot
  5. Save and exit

Option 2: MOK (Machine Owner Key) Enrollment

If you must keep Secure Boot enabled, you'll need to sign the kernel modules. This is an advanced topic beyond this guide's scope.
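
Before choosing, you can check whether Secure Boot is currently enabled (assuming the mokutil package is available):

bash

$ mokutil --sb-state
SecureBoot enabled

If it reports "SecureBoot enabled" and you cannot disable it in firmware, Option 2 is your path.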

Verify GPU Detection

First, let's confirm your system can see the GPU:

bash

$ lspci | grep -i nvidia

Expected Output:

01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile]

If you see no output, check:

  • Is the GPU properly seated (desktop systems)?
  • Are you on the correct graphics mode (laptops may have hybrid graphics)?
  • Check BIOS for GPU-related settings

Install Kernel Headers

NVIDIA drivers compile kernel modules using DKMS, which requires matching kernel headers:

bash

$ uname -r
5.15.0-91-generic

$ sudo apt update
$ sudo apt install -y linux-headers-$(uname -r)

Troubleshooting:

  • If headers aren't available, your kernel might be outdated. Run sudo apt upgrade and reboot
  • Verify headers installed: ls /usr/src | grep linux-headers

Step 1: Disable Nouveau Driver

The open-source Nouveau driver must be disabled before installing NVIDIA proprietary drivers.

Check if Nouveau is Loaded

bash

$ lsmod | grep nouveau

If you see output, Nouveau is active and needs to be disabled.

Create Blacklist Configuration

bash

$ cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

Regenerate Initial RAM Disk

bash

$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-5.15.0-91-generic

Reboot System

bash

$ sudo reboot

Verify Nouveau is Disabled

After reboot:

bash

$ lsmod | grep nouveau

Should return no output.

Troubleshooting:

  • Nouveau still loads: Check if blacklist file exists: cat /etc/modprobe.d/blacklist-nouveau.conf
  • System won't boot: Boot to recovery mode, remove blacklist file, regenerate initramfs
  • Using systemd-boot: Run sudo bootctl update after initramfs generation

Step 2: Install NVIDIA Data Center Driver

Ubuntu 22.04 offers two driver options. Choose based on your GPU generation:
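
Both driver metapackages below (nvidia-open and cuda-drivers) come from NVIDIA's CUDA apt repository. If that repository isn't on your system yet, a minimal way to add it (assuming x86_64 Ubuntu 22.04; the same keyring is reused for DCGM in Step 5) is:

bash

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt-get update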

Option A: Open Kernel Modules (Recommended for Turing and newer)

For RTX 20-series, RTX 30-series, RTX 40-series, and newer:

bash

$ sudo apt update
$ sudo apt install -y nvidia-open
Reading package lists... Done
Building dependency tree... Done
...
Setting up nvidia-open (560.35.03-0ubuntu1) ...

Option B: Proprietary Kernel Modules (Legacy GPUs)

For Maxwell, Pascal, Volta, or older GPUs:

bash

$ sudo apt update
$ sudo apt install -y cuda-drivers

Reboot to Load Driver

bash

$ sudo reboot

Verify Driver Installation

bash

$ nvidia-smi

Expected Output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03    Driver Version: 560.35.03    CUDA Version: 12.4   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   45C    P8     8W /  80W |      1MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
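
You can also confirm the kernel modules loaded and check the driver build directly; if lsmod shows nothing, the driver isn't active and the troubleshooting notes below apply:

bash

$ lsmod | grep nvidia
$ cat /proc/driver/nvidia/version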

Troubleshooting:

  • "NVIDIA-SMI has failed": Driver not loaded. Check dmesg | grep -i nvidia for errors
  • Black screen after install: Boot to recovery, run sudo apt remove nvidia-*, try other driver option
  • Kernel module build failed: Ensure gcc installed: sudo apt install build-essential
  • Version mismatch: Check installed version: dpkg -l | grep nvidia

Step 3: Install Docker Engine

We'll use Docker's official repository for the latest stable version.

Set Up Docker Repository

bash

# Install prerequisites
$ sudo apt-get update
$ sudo apt-get install -y ca-certificates curl

# Create keyrings directory
$ sudo install -m 0755 -d /etc/apt/keyrings

# Add Docker's official GPG key
$ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
   -o /etc/apt/keyrings/docker.asc
$ sudo chmod a+r /etc/apt/keyrings/docker.asc

# Set up the repository
$ echo \
 "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
 https://download.docker.com/linux/ubuntu \
 $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
 sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Install Docker Packages

bash

$ sudo apt-get update
$ sudo apt-get install -y docker-ce docker-ce-cli containerd.io \
   docker-buildx-plugin docker-compose-plugin

Verify Docker Installation

bash

$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
    Loaded: loaded (/lib/systemd/system/docker.service; enabled)
    Active: active (running) since Mon 2024-01-15 10:30:00 EST

$ sudo docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.

Configure Docker for Non-Root Access (Optional)

bash

$ sudo usermod -aG docker $USER
$ newgrp docker

# Test without sudo
$ docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

Troubleshooting:

  • Permission denied: Logout and login again for group changes to take effect
  • Docker daemon not starting: Check logs: sudo journalctl -xeu docker.service
  • Network issues: Check if Docker's bridge network conflicts with your network

Step 4: Install NVIDIA Container Toolkit

This critical component enables Docker containers to access your GPU.

Add NVIDIA Container Toolkit Repository

bash

# Add the GPG key
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
 sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

# Add the repository
$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
 sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
 sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Install Container Toolkit

bash

$ sudo apt-get update

# Install specific version for stability
$ export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
$ sudo apt-get install -y \
 nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
 nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
 libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
 libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

Configure Docker Runtime

bash

$ sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading docker config from /etc/docker/daemon.json
INFO[0000] Config file does not exist, creating new one
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended to restart the Docker daemon

$ sudo systemctl restart docker

Test GPU Access in Container

bash

$ sudo docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

You should see the same nvidia-smi output as on your host system.
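
To confirm that the nvidia runtime was actually registered with Docker, you can check the daemon's runtime list (exact output varies by Docker version):

bash

$ docker info | grep -i runtimes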

Troubleshooting:

  • "docker: Error response from daemon: could not select device driver":
    • Ensure nvidia-container-toolkit is installed
    • Restart Docker: sudo systemctl restart docker
  • GPU not visible in container:
    • Check runtime config: cat /etc/docker/daemon.json
    • Verify: docker run --rm --gpus all ubuntu nvidia-smi
  • Known Issue - systemd cgroup: Running systemctl daemon-reload can cause containers to lose GPU access. Solution: Restart Docker and reconfigure runtime.

Step 5: Install DCGM (Data Center GPU Manager)

DCGM provides enterprise-grade GPU monitoring and diagnostics.

Add CUDA Repository (Required for DCGM)

bash

# Download and install CUDA keyring
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb

# Update package lists
$ sudo apt-get update

Install DCGM

bash

$ sudo apt-get install -y datacenter-gpu-manager

Enable and Start DCGM Service

bash

$ sudo systemctl enable nvidia-dcgm
$ sudo systemctl start nvidia-dcgm

# Verify service status
$ sudo systemctl status nvidia-dcgm
● nvidia-dcgm.service - NVIDIA DCGM service
    Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled)
    Active: active (running)

Run Diagnostics

bash

# Quick diagnostic (Level 1)
$ dcgmi diag -r 1
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
| Deployment                | Pass                                           |
| Blacklist                 | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Library              | Pass                                           |
+---------------------------+------------------------------------------------+

# Comprehensive diagnostic (Level 3)
$ dcgmi diag -r 3

Set Up Health Monitoring

bash

# Set active health checks
$ dcgmi health -s a
Health monitor systems set successfully

# Check health status
$ dcgmi health -c
Health Status: Healthy
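
dcgmi can also stream live per-GPU statistics. A short sample, assuming the default field IDs 150 (GPU temperature) and 155 (power draw):

bash

# Stream temperature and power, 5 samples at the default interval
$ dcgmi dmon -e 150,155 -c 5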

Troubleshooting:

  • DCGM service not found: Some distributions use containerized DCGM. Run via Docker instead:

bash

 docker run --gpus all --rm -v /var/run/docker.sock:/var/run/docker.sock \
   nvidia/dcgm:latest

  • Memory faults reported: Reset GPU: sudo nvidia-smi --gpu-reset -i 0
  • Diagnostic failures: Ensure no other applications are using GPU during tests

Step 6: Install and Configure NGC CLI

NGC (NVIDIA GPU Cloud) CLI provides access to enterprise containers and models.

Download NGC CLI

  1. Visit: https://org.ngc.nvidia.com/setup/installers/cli
  2. Download the appropriate package for Linux/AMD64
  3. Install the package:

bash

# For .deb package (Ubuntu/Debian)
$ sudo dpkg -i ngc-cli_3.31.0_amd64.deb

# Or extract and add to PATH for portable installation
$ tar -xvf ngc-cli_3.31.0_linux.tar.gz
$ export PATH=$PATH:$(pwd)/ngc-cli

Configure NGC with API Key

bash

$ ngc config set
Enter API key [no-apikey]: <your-api-key>
Enter CLI output format type [ascii]: ascii
Enter org [no-org]: <your-org>
Enter team [no-team]: <your-team>
Enter ace [no-ace]:

Configuration saved to /home/user/.ngc/config

Verify Configuration

bash

$ ngc user who
User Information
Username: your.email@example.com
Org: your-organization
Team: your-team
ACE: nv-us-west-2

$ ngc registry model list --format_type csv | head -5
"Name","Repository","Latest Version","Application","Framework","Precision","Use Case"
"BERT Large","nvidia/bert","1.0","Natural Language Processing","PyTorch","FP16","Question Answering"

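Two more quick sanity checks: review the saved configuration and list a few container images from the registry (assuming your org has registry access):

bash

$ ngc config current
$ ngc registry image list --format_type csv | head -5
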
Troubleshooting:

  • "401 Unauthorized": Check API key is correct and active
  • No models listed: Ensure you have access to the repository in your org/team
  • Connection issues: Check proxy settings if behind corporate firewall

Step 7: Deploy Triton Inference Server

Triton is NVIDIA's production-grade inference serving solution.

Prepare Model Repository

bash

# Clone Triton server repository
$ git clone https://github.com/triton-inference-server/server.git
$ cd server/docs/examples

# Fetch example models
$ ./fetch_models.sh
Downloading densenet_onnx model...
Downloading inception_graphdef model...
Downloading simple model...
Model repository prepared at: ./model_repository

Run Triton Server

bash

# Start Triton with GPU support
$ docker run --gpus=1 --rm \
 -p8000:8000 -p8001:8001 -p8002:8002 \
 -v $(pwd)/model_repository:/models \
 nvcr.io/nvidia/tritonserver:24.01-py3 \
 tritonserver --model-repository=/models

I0115 10:00:00.000000 1 server.cc:650]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0115 10:00:00.000000 1 server.cc:677]
+--------+------+--------+
| Model  | Ver  | Status |
+--------+------+--------+
| simple | 1    | READY  |
+--------+------+--------+

I0115 10:00:00.000000 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0115 10:00:00.000000 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0115 10:00:00.000000 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Verify Server Health

In a new terminal:

bash

# Check if server is ready
$ curl -v localhost:8000/v2/health/ready
*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
>
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
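
The HTTP endpoint also serves server and per-model metadata, which is useful for confirming that a particular model loaded (the "simple" model here comes from the example repository):

bash

# Server metadata
$ curl -s localhost:8000/v2 | python3 -m json.tool

# Metadata for the example "simple" model
$ curl -s localhost:8000/v2/models/simple | python3 -m json.tool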

Test Inference with Client

bash

# Pull client SDK image
$ docker pull nvcr.io/nvidia/tritonserver:24.01-py3-sdk

# Run inference on example image
$ docker run -it --rm --net=host \
 -v $(pwd):/workspace \
 nvcr.io/nvidia/tritonserver:24.01-py3-sdk \
 /workspace/install/bin/image_client \
 -m densenet_onnx -c 3 -s INCEPTION \
 /workspace/images/mug.jpg

Request 0, batch size 1
Image '/workspace/images/mug.jpg':
   15.349568 (504) = COFFEE MUG
   13.227468 (968) = CUP
   10.424893 (505) = COFFEEPOT
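
After a few requests, you can also spot-check the Prometheus-format metrics Triton exposes on port 8002 (the metric names shown are the commonly documented ones and may differ slightly between versions):

bash

$ curl -s localhost:8002/metrics | grep -E "nv_inference_request_success|nv_gpu_utilization"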

Troubleshooting:

  • Port already in use: Change ports: -p8010:8000 -p8011:8001 -p8012:8002
  • Model load failures: Check model format compatibility and file permissions
  • Out of memory: Reduce concurrent model instances in config.pbtxt
  • Performance issues: Enable GPU metrics: --metrics-port=8002

Step 8: Optional - NVIDIA AI Enterprise Setup

For production deployments with enterprise support, consider NVIDIA AI Enterprise.

Prerequisites for AI Enterprise

  1. License Requirements:
    • Active NVIDIA Enterprise Account
    • Valid AI Enterprise license
    • Access to NGC Enterprise catalog
  2. Setup Cloud License Service (CLS):

bash

  # Install license client
  $ sudo apt-get install -y nvidia-license-client

  # Configure license token
  $ sudo nvidia-license-client --token-config /etc/nvidia/license-client.tok

Pull Enterprise Containers

bash

# Login to NGC with enterprise credentials
$ docker login nvcr.io
Username: $oauthtoken
Password: <your-api-key>

# Pull enterprise containers
$ docker pull nvcr.io/nvidia/tensorflow:24.01-tf2-py3
$ docker pull nvcr.io/nvidia/pytorch:24.01-py3
$ docker pull nvcr.io/nvidia/rapids/rapids:24.01-cuda12.0-runtime-ubuntu22.04-py3.10
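
Once the pulls finish, confirm the images are available locally:

bash

$ docker images | grep nvcr.io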

Deploy NIM (NVIDIA Inference Microservices)

bash

# Example: Deploy Llama model via NIM
$ docker run --gpus all --rm \
 -p 8000:8000 \
 -e NGC_API_KEY=$NGC_API_KEY \
 nvcr.io/nvidia/nim/llama-2-7b-chat:latest

Step 9: System Validation and Monitoring

Complete System Check

Create a validation script:

bash

$ cat > validate_system.sh << 'EOF'
#!/bin/bash

echo "=== System Validation Report ==="
echo ""

# Check NVIDIA Driver
echo "1. NVIDIA Driver Check:"
if nvidia-smi &>/dev/null; then
   echo "   ✓ Driver installed"
   nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
else
   echo "   ✗ Driver not detected"
fi
echo ""

# Check Docker
echo "2. Docker Check:"
if docker --version &>/dev/null; then
   echo "   ✓ Docker installed"
   docker --version
else
   echo "   ✗ Docker not installed"
fi
echo ""

# Check GPU in Docker
echo "3. Docker GPU Access:"
if docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi &>/dev/null; then
   echo "   ✓ GPU accessible in containers"
else
   echo "   ✗ GPU not accessible in containers"
fi
echo ""

# Check DCGM
echo "4. DCGM Check:"
if dcgmi discovery -l &>/dev/null; then
   echo "   ✓ DCGM installed and running"
   dcgmi discovery -l | head -3
else
   echo "   ✗ DCGM not available"
fi
echo ""

# Check NGC
echo "5. NGC CLI Check:"
if ngc --version &>/dev/null; then
   echo "   ✓ NGC CLI installed"
   ngc --version
else
   echo "   ✗ NGC CLI not installed"
fi
echo ""

# Check Triton
echo "6. Triton Server Check:"
if curl -s localhost:8000/v2/health/ready &>/dev/null; then
   echo "   ✓ Triton server is running"
else
   echo "   ✗ Triton server not running or not accessible"
fi
echo ""

echo "=== Validation Complete ==="
EOF

$ chmod +x validate_system.sh
$ ./validate_system.sh

Set Up Continuous Monitoring

Create a monitoring dashboard script:

bash

$ cat > gpu_monitor.sh << 'EOF'
#!/bin/bash

while true; do
   clear
   echo "=== GPU Monitoring Dashboard ==="
   echo "Timestamp: $(date)"
   echo ""
   
   # GPU Utilization
   nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total \
               --format=csv,noheader,nounits
   
   echo ""
   echo "=== Running Docker Containers ==="
   docker ps --format "table {{.Names}}\t{{.Status}}"
   
   echo ""
   echo "=== DCGM Health Status ==="
   dcgmi health -c | grep "Health Status"
   
   echo ""
   echo "Press Ctrl+C to exit"
   sleep 5
done
EOF

$ chmod +x gpu_monitor.sh
$ ./gpu_monitor.sh

Common Issues and Solutions

Driver Issues

Problem: NVIDIA-SMI shows "No devices were found"

bash

# Check if driver is loaded
$ lsmod | grep nvidia

# If empty, try loading manually
$ sudo modprobe nvidia

# Check for errors
$ dmesg | grep -i nvidia | tail -20

Problem: Driver version mismatch with CUDA

bash

# Check compatibility
$ nvidia-smi
# Note the CUDA Version shown (e.g., 12.4)
# Use matching CUDA container versions
$ docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Docker GPU Access Issues

Problem: "could not select device driver" error

bash

# Reconfigure container runtime
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker

# Test again
$ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Problem: Container loses GPU access after system updates

bash

# This is a known systemd issue. Fix:
$ sudo systemctl restart docker
$ sudo nvidia-ctk runtime configure --runtime=docker

Performance Optimization

Enable Persistence Mode for Better Performance:

bash

# Enable persistence mode
$ sudo nvidia-smi -pm 1

# Set power limit (example: 250W)
$ sudo nvidia-smi -pl 250

# Set GPU clocks for compute (optional)
$ sudo nvidia-smi -ac 5001,1500
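
Supported clock pairs vary by GPU (and consumer GeForce cards may not accept application clocks at all), so list them before calling -ac:

bash

# List supported memory/graphics application clock combinations
$ sudo nvidia-smi -q -d SUPPORTED_CLOCKS | head -20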

Configure Memory Growth for TensorFlow:

python

import tensorflow as tf

# Allow memory growth
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
   tf.config.experimental.set_memory_growth(gpu, True)

Maintenance and Updates

Regular Maintenance Tasks

Weekly:

  • Check GPU health: dcgmi health -c
  • Review Docker disk usage: docker system df
  • Clean unused containers: docker system prune -a

Monthly:

  • Update NVIDIA drivers (if needed): sudo apt update && sudo apt upgrade
  • Update Docker images: re-pull the specific tags you use, e.g. docker pull nvcr.io/nvidia/tritonserver:24.01-py3
  • Run comprehensive diagnostics: dcgmi diag -r 3
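
These recurring checks are easy to automate with cron. A minimal sketch, where the schedule, dcgmi path, and log locations are assumptions to adapt:

bash

# Edit root's crontab
$ sudo crontab -e

# Then add entries like:
# 0 6 * * 1  /usr/bin/dcgmi health -c  >> /var/log/dcgm-health.log 2>&1   # weekly, Mondays 06:00
# 0 6 1 * *  /usr/bin/dcgmi diag -r 3  >> /var/log/dcgm-diag.log 2>&1     # monthly, 1st at 06:00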

Backup Important Configurations

bash

# Create backup directory
$ mkdir -p ~/gpu-setup-backup

# Backup configurations
$ cp /etc/docker/daemon.json ~/gpu-setup-backup/
$ cp /etc/modprobe.d/blacklist-nouveau.conf ~/gpu-setup-backup/
$ cp ~/.ngc/config ~/gpu-setup-backup/ngc-config

Conclusion

You now have a professional-grade GPU computing environment similar to NVIDIA DGX systems. This setup provides:

  • Production-ready inference serving with Triton
  • Enterprise monitoring via DCGM
  • Container-based workflows with Docker and GPU support
  • Access to NVIDIA's ecosystem through NGC

Next Steps

  1. Explore NGC Catalog: Browse pre-trained models and optimized containers
  2. Deploy Your Models: Convert and optimize your models for Triton
  3. Set Up Kubernetes: For multi-node deployments, consider GPU Operator
  4. Implement Monitoring: Integrate DCGM with Prometheus/Grafana
  5. Optimize Performance: Profile and tune your specific workloads

Additional Resources

Getting Help

  • NVIDIA Developer Forums: https://forums.developer.nvidia.com/
  • GitHub Issues: Report toolkit-specific issues on respective GitHub repos
  • Enterprise Support: Available with NVIDIA AI Enterprise license

Remember to keep your system updated and regularly check NVIDIA's documentation for the latest best practices and security updates. Happy computing!
