Set Up an NVIDIA GPU Computing Stack
Docker, DCGM, Triton Server & Enterprise Tools on Ubuntu 22.04
This comprehensive guide will walk you through setting up a professional-grade NVIDIA GPU computing environment on Ubuntu 22.04, similar to what you'd find on NVIDIA DGX systems. Whether you're using a laptop with a single GPU or a workstation, this guide will help you create a robust platform for AI/ML workloads, inference serving, and GPU-accelerated computing.
By the end of this guide, you'll have the NVIDIA driver, Docker with GPU support, DCGM monitoring, the NGC CLI, and Triton Inference Server installed and validated.
Before starting, ensure you have an Ubuntu 22.04 system with an NVIDIA GPU, sudo privileges, a working internet connection, and enough free disk space for several multi-gigabyte container images.
Secure Boot can prevent NVIDIA kernel modules from loading. You have two options:
Option 1: Disable Secure Boot (Recommended for beginners). Reboot into your UEFI/BIOS settings and turn Secure Boot off.
Option 2: MOK (Machine Owner Key) Enrollment. If you must keep Secure Boot enabled, you'll need to sign the kernel modules yourself. This is an advanced topic beyond this guide's scope.
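If you're not sure whether Secure Boot is currently enabled, a quick check (assuming the mokutil package is present, as it is on most Ubuntu installs) is:
bash
$ mokutil --sb-state
SecureBoot enabled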
First, let's confirm your system can see the GPU:
bash
$ lspci | grep -i nvidia
Expected Output:
01:00.0 VGA compatible controller: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile]
If you see no output, check that the card is properly seated (on desktops), that the GPU is enabled in your BIOS/UEFI, and that hybrid-graphics laptops haven't disabled the discrete GPU.
NVIDIA drivers compile kernel modules using DKMS, which requires matching kernel headers:
bash
$ uname -r
5.15.0-91-generic
$ sudo apt update
$ sudo apt install -y linux-headers-$(uname -r)
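To double-check that the headers package for your running kernel actually landed, an optional verification is:
bash
$ dpkg -s linux-headers-$(uname -r) | grep -E 'Package|Status'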
Troubleshooting:
- If no headers are found for your running kernel, run sudo apt upgrade and reboot.
- To see which header packages are installed: ls /usr/src | grep linux-headers

The open-source Nouveau driver must be disabled before installing NVIDIA proprietary drivers.
bash
$ lsmod | grep nouveau
If you see output, Nouveau is active and needs to be disabled.
bash
$ cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
bash
$ sudo update-initramfs -u
update-initramfs: Generating /boot/initrd.img-5.15.0-91-generic
bash
$ sudo reboot
After reboot:
bash
$ lsmod | grep nouveau
Should return no output.
Troubleshooting:
- Confirm the blacklist file exists: cat /etc/modprobe.d/blacklist-nouveau.conf
- On systems using systemd-boot, run sudo bootctl update after initramfs generation

Ubuntu 22.04 offers two driver options. Choose based on your GPU generation. Note that both metapackages below (nvidia-open and cuda-drivers) come from NVIDIA's CUDA apt repository; if apt can't find them, add that repository first via the cuda-keyring package shown in the DCGM section, or install Ubuntu's own nvidia-driver-535 / nvidia-driver-535-open packages instead.
For RTX 20-series, RTX 30-series, RTX 40-series, and newer:
bash
$ sudo apt update
$ sudo apt install -y nvidia-open
Reading package lists... Done
Building dependency tree... Done
...
Setting up nvidia-open (560.35.03-0ubuntu1) ...
For Maxwell, Pascal, Volta, or older GPUs:
bash
$ sudo apt update
$ sudo apt install -y cuda-drivers
bash
$ sudo reboot
bash
$ nvidia-smi
Expected Output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| N/A 45C P8 8W / 80W | 1MiB / 6144MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
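Beyond the default table, nvidia-smi can emit machine-readable output, which is handy for scripts; a minimal example:
bash
# One CSV row per GPU, similar to the values in the table above
$ nvidia-smi --query-gpu=name,driver_version,memory.total,temperature.gpu --format=csv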
Troubleshooting:
- Check dmesg | grep -i nvidia for errors
- If the install failed, remove it with sudo apt remove nvidia-* and try the other driver option
- Make sure build tools are installed: sudo apt install build-essential
- List installed NVIDIA packages: dpkg -l | grep nvidia

We'll use Docker's official repository for the latest stable version.
bash
# Install prerequisites
$ sudo apt-get update
$ sudo apt-get install -y ca-certificates curl
# Create keyrings directory
$ sudo install -m 0755 -d /etc/apt/keyrings
# Add Docker's official GPG key
$ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
-o /etc/apt/keyrings/docker.asc
$ sudo chmod a+r /etc/apt/keyrings/docker.asc
# Set up the repository
$ echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
bash
$ sudo apt-get update
$ sudo apt-get install -y docker-ce docker-ce-cli containerd.io \
docker-buildx-plugin docker-compose-plugin
bash
$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled)
Active: active (running) since Mon 2024-01-15 10:30:00 EST
$ sudo docker run hello-world
Hello from Docker!
This message shows that your installation appears to be working correctly.
bash
$ sudo usermod -aG docker $USER
$ newgrp docker
# Test without sudo
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
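If docker ps still reports a permission error after a fresh login, confirm that your user actually landed in the docker group (a quick sanity check):
bash
# The printed group list should include "docker"
$ id -nG "$USER"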
Troubleshooting:
- If Docker fails to start, check the logs: sudo journalctl -xeu docker.service

The NVIDIA Container Toolkit is the critical component that enables Docker containers to access your GPU.
bash
# Add the GPG key
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add the repository
$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
bash
$ sudo apt-get update
# Install specific version for stability
$ export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
$ sudo apt-get install -y \
nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
bash
$ sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading docker config from /etc/docker/daemon.json
INFO[0000] Config file does not exist, creating new one
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended to restart the Docker daemon
$ sudo systemctl restart docker
bash
$ sudo docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
You should see the same nvidia-smi output as on your host system.
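The --gpus flag also accepts device selectors, which is useful on multi-GPU systems when you want to pin a container to a particular GPU; for example:
bash
# Expose only GPU 0 to the container
$ docker run --rm --gpus device=0 nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
# Or expose a specific number of GPUs
$ docker run --rm --gpus 1 nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi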
Troubleshooting:
- If containers can't see the GPU, restart Docker: sudo systemctl restart docker
- Confirm the runtime was registered: cat /etc/docker/daemon.json
- Test with a minimal image: docker run --rm --gpus all ubuntu nvidia-smi
- Running systemctl daemon-reload can cause containers to lose GPU access. Solution: restart Docker and reconfigure the runtime.

DCGM provides enterprise-grade GPU monitoring and diagnostics.
bash
# Download and install CUDA keyring
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
# Update package lists
$ sudo apt-get update
bash
$ sudo apt-get install -y datacenter-gpu-manager
bash
$ sudo systemctl enable nvidia-dcgm
$ sudo systemctl start nvidia-dcgm
# Verify service status
$ sudo systemctl status nvidia-dcgm
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled)
Active: active (running)
bash
# Quick diagnostic (Level 1)
$ dcgmi diag -r 1
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
| Deployment | Pass |
| Blacklist | Pass |
| NVML Library | Pass |
| CUDA Library | Pass |
+---------------------------+------------------------------------------------+
# Comprehensive diagnostic (Level 3)
$ dcgmi diag -r 3
bash
# Set active health checks
$ dcgmi health -s a
Health monitor systems set successfully
# Check health status
$ dcgmi health -c
Health Status: Healthy
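For ad-hoc monitoring you can also stream individual metrics with dcgmi dmon. The field IDs below (150 = temperature, 155 = power, 203 = GPU utilization) are the commonly used ones, but consult the DCGM documentation for the full field list on your version:
bash
# List GPUs known to DCGM
$ dcgmi discovery -l
# Stream temperature, power, and utilization once per second, five samples
$ dcgmi dmon -e 150,155,203 -c 5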
Troubleshooting:
- If the host nvidia-dcgm service isn't available, DCGM can also be run from a container:
bash
docker run --gpus all --rm -v /var/run/docker.sock:/var/run/docker.sock \
nvidia/dcgm:latest
- To reset a hung GPU: sudo nvidia-smi --gpu-reset -i 0

NGC (NVIDIA GPU Cloud) CLI provides access to enterprise containers and models.
bash
# For .deb package (Ubuntu/Debian)
$ sudo dpkg -i ngc-cli_3.31.0_amd64.deb
# Or extract and add to PATH for portable installation
$ tar -xvf ngc-cli_3.31.0_linux.tar.gz
$ export PATH=$PATH:$(pwd)/ngc-cli
bash
$ ngc config set
Enter API key [no-apikey]: <your-api-key>
Enter CLI output format type [ascii]: ascii
Enter org [no-org]: <your-org>
Enter team [no-team]: <your-team>
Enter ace [no-ace]:
Configuration saved to /home/user/.ngc/config
bash
$ ngc user who
User Information
Username: your.email@example.com
Org: your-organization
Team: your-team
ACE: nv-us-west-2
$ ngc registry model list --format_type csv | head -5
"Name","Repository","Latest Version","Application","Framework","Precision","Use Case"
"BERT Large","nvidia/bert","1.0","Natural Language Processing","PyTorch","FP16","Question Answering"
Troubleshooting:
- If authentication fails, generate a fresh API key at ngc.nvidia.com and rerun ngc config set.
Triton is NVIDIA's production-grade inference serving solution.
bash
# Clone Triton server repository
$ git clone https://github.com/triton-inference-server/server.git
$ cd server/docs/examples
# Fetch example models
$ ./fetch_models.sh
Downloading densenet_onnx model...
Downloading inception_graphdef model...
Downloading simple model...
Model repository prepared at: ./model_repository
bash
# Start Triton with GPU support
$ docker run --gpus=1 --rm \
-p8000:8000 -p8001:8001 -p8002:8002 \
-v $(pwd)/model_repository:/models \
nvcr.io/nvidia/tritonserver:24.01-py3 \
tritonserver --model-repository=/models
I0115 10:00:00.000000 1 server.cc:650]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0115 10:00:00.000000 1 server.cc:677]
+--------+------+--------+
| Model | Ver | Status |
+--------+------+--------+
| simple | 1 | READY |
+--------+------+--------+
I0115 10:00:00.000000 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
I0115 10:00:00.000000 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
I0115 10:00:00.000000 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
In a new terminal:
bash
# Check if server is ready
$ curl -v localhost:8000/v2/health/ready
* Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
>
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
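Triton also publishes Prometheus metrics on port 8002, which is worth a quick sanity check before wiring up monitoring:
bash
# Inspect the first few Prometheus metrics exposed by Triton
$ curl -s localhost:8002/metrics | head -n 15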
bash
# Pull client SDK image
$ docker pull nvcr.io/nvidia/tritonserver:24.01-py3-sdk
# Run inference on example image
$ docker run -it --rm --net=host \
-v $(pwd):/workspace \
nvcr.io/nvidia/tritonserver:24.01-py3-sdk \
/workspace/install/bin/image_client \
-m densenet_onnx -c 3 -s INCEPTION \
/workspace/images/mug.jpg
Request 0, batch size 1
Image '/workspace/images/mug.jpg':
15.349568 (504) = COFFEE MUG
13.227468 (968) = CUP
10.424893 (505) = COFFEEPOT
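You can also hit the HTTP/REST inference API directly with curl. The sketch below assumes the example "simple" model keeps its stock signature (two INT32 inputs of shape [1,16] named INPUT0 and INPUT1); check the model metadata first if in doubt:
bash
# Inspect the model's inputs and outputs
$ curl -s localhost:8000/v2/models/simple
# Send a KServe v2 inference request
$ curl -s -X POST localhost:8000/v2/models/simple/infer \
  -H 'Content-Type: application/json' \
  -d '{"inputs":[
        {"name":"INPUT0","shape":[1,16],"datatype":"INT32","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},
        {"name":"INPUT1","shape":[1,16],"datatype":"INT32","data":[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}]}'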
Troubleshooting:
- If ports 8000-8002 are already in use, map different host ports, for example -p8010:8000 -p8011:8001 -p8012:8002
- Metrics are served on port 8002 by default (--metrics-port=8002)

For production deployments with enterprise support, consider NVIDIA AI Enterprise.
bash
# Install license client
$ sudo apt-get install -y nvidia-license-client
# Configure license token
$ sudo nvidia-license-client --token-config /etc/nvidia/license-client.tok
bash
# Login to NGC with enterprise credentials
$ docker login nvcr.io
Username: $oauthtoken
Password: <your-api-key>
# Pull enterprise containers
$ docker pull nvcr.io/nvidia/tensorflow:24.01-tf2-py3
$ docker pull nvcr.io/nvidia/pytorch:24.01-py3
$ docker pull nvcr.io/nvidia/rapids/rapids:24.01-cuda12.0-runtime-ubuntu22.04-py3.10
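A quick smoke test that a pulled framework container actually sees the GPU (using the PyTorch image pulled above):
bash
# Should print True plus the detected GPU name
$ docker run --gpus all --rm nvcr.io/nvidia/pytorch:24.01-py3 \
  python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"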
bash
# Example: Deploy Llama model via NIM
$ docker run --gpus all --rm \
-p 8000:8000 \
-e NGC_API_KEY=$NGC_API_KEY \
nvcr.io/nvidia/nim/llama-2-7b-chat:latest
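Once the NIM container is up, recent NIM images expose an OpenAI-compatible API on port 8000. A hedged sketch of querying it; the model ID must match whatever the /v1/models endpoint reports for your container:
bash
# List the model IDs served by this NIM container
$ curl -s localhost:8000/v1/models
# Send a chat completion request (replace the model ID with one from the list above)
$ curl -s -X POST localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"<model-id-from-above>","messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'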
Create a validation script:
bash
$ cat > validate_system.sh << 'EOF'
#!/bin/bash
echo "=== System Validation Report ==="
echo ""
# Check NVIDIA Driver
echo "1. NVIDIA Driver Check:"
if nvidia-smi &>/dev/null; then
echo " ✓ Driver installed"
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
else
echo " ✗ Driver not detected"
fi
echo ""
# Check Docker
echo "2. Docker Check:"
if docker --version &>/dev/null; then
echo " ✓ Docker installed"
docker --version
else
echo " ✗ Docker not installed"
fi
echo ""
# Check GPU in Docker
echo "3. Docker GPU Access:"
if docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi &>/dev/null; then
echo " ✓ GPU accessible in containers"
else
echo " ✗ GPU not accessible in containers"
fi
echo ""
# Check DCGM
echo "4. DCGM Check:"
if dcgmi discovery -l &>/dev/null; then
echo " ✓ DCGM installed and running"
dcgmi discovery -l | head -3
else
echo " ✗ DCGM not available"
fi
echo ""
# Check NGC
echo "5. NGC CLI Check:"
if ngc --version &>/dev/null; then
echo " ✓ NGC CLI installed"
ngc --version
else
echo " ✗ NGC CLI not installed"
fi
echo ""
# Check Triton
echo "6. Triton Server Check:"
if curl -sf localhost:8000/v2/health/ready &>/dev/null; then
echo " ✓ Triton server is running"
else
echo " ✗ Triton server not running or not accessible"
fi
echo ""
echo "=== Validation Complete ==="
EOF
$ chmod +x validate_system.sh
$ ./validate_system.sh
Create a monitoring dashboard script:
bash
$ cat > gpu_monitor.sh << 'EOF'
#!/bin/bash
while true; do
clear
echo "=== GPU Monitoring Dashboard ==="
echo "Timestamp: $(date)"
echo ""
# GPU Utilization
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total \
--format=csv,noheader,nounits
echo ""
echo "=== Docker Containers Using GPU ==="
docker ps --filter "label=com.nvidia.gpu" --format "table {{.Names}}\t{{.Status}}"
echo ""
echo "=== DCGM Health Status ==="
dcgmi health -c | grep "Health Status"
echo ""
echo "Press Ctrl+C to exit"
sleep 5
done
EOF
$ chmod +x gpu_monitor.sh
$ ./gpu_monitor.sh
Problem: NVIDIA-SMI shows "No devices were found"
bash
# Check if driver is loaded
$ lsmod | grep nvidia
# If empty, try loading manually
$ sudo modprobe nvidia
# Check for errors
$ dmesg | grep -i nvidia | tail -20
Problem: Driver version mismatch with CUDA
bash
# Check compatibility
$ nvidia-smi
# Note the CUDA Version shown (e.g., 12.4)
# Use matching CUDA container versions
$ docker run --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Problem: "could not select device driver" error
bash
# Reconfigure container runtime
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
# Test again
$ docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
Problem: Container loses GPU access after system updates
bash
# This is a known systemd issue. Fix:
$ sudo systemctl restart docker
$ sudo nvidia-ctk runtime configure --runtime=docker
Enable Persistence Mode for Better Performance:
bash
# Enable persistence mode
$ sudo nvidia-smi -pm 1
# Set power limit (example: 250W)
$ sudo nvidia-smi -pl 250
# Set GPU clocks for compute (optional)
$ sudo nvidia-smi -ac 5001,1500
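The memory,graphics pairs accepted by -ac vary per GPU model (and some GeForce cards don't allow setting application clocks at all); you can list what your card actually supports with:
bash
# Show supported memory/graphics clock combinations
$ nvidia-smi -q -d SUPPORTED_CLOCKS | head -n 20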
Configure Memory Growth for TensorFlow:
python
import tensorflow as tf

# Allow memory growth so TensorFlow doesn't grab all GPU memory up front
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
Weekly:
- Check GPU health: dcgmi health -c
- Review Docker disk usage: docker system df
- Clean up unused images and containers: docker system prune -a

Monthly:
- Update system packages: sudo apt update && sudo apt upgrade
- Refresh your CUDA base images, e.g. docker pull nvidia/cuda:12.4.0-base-ubuntu22.04
- Run a full diagnostic: dcgmi diag -r 3

bash
# Create backup directory
$ mkdir -p ~/gpu-setup-backup
# Backup configurations
$ cp /etc/docker/daemon.json ~/gpu-setup-backup/
$ cp /etc/modprobe.d/blacklist-nouveau.conf ~/gpu-setup-backup/
$ cp ~/.ngc/config ~/gpu-setup-backup/ngc-config
You now have a professional-grade GPU computing environment similar to NVIDIA DGX systems. This setup provides GPU-accelerated Docker containers, enterprise-grade monitoring with DCGM, production inference serving with Triton, and access to NVIDIA's NGC catalog of containers and models.
Remember to keep your system updated and regularly check NVIDIA's documentation for the latest best practices and security updates. Happy computing!