Date Created: 2025-03-29
By: 16BitMiker
Ubuntu 24.04 LTS "Noble Numbat" brings significant improvements for AI enthusiasts, particularly those running large language models locally. This guide walks through setting up Ollama, an open-source LLM server, with GPU acceleration on the latest Ubuntu release. Whether you have NVIDIA or AMD hardware, we'll cover everything from basic installation to advanced performance tweaks specifically optimized for kernel 6.8+ in Ubuntu 24.04.
Before diving into GPU specifics, let's ensure Ollama is properly installed:
# Download and run the Ollama installation script
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation by checking the version
ollama --version
# Ensure the API server is running properly
curl http://localhost:11434/api/tags
The installation script performs several key tasks:
Downloads the appropriate Ollama binary for your system
Sets up the necessary directory structure for model storage
Configures permissions and creates a systemd service
Starts the Ollama service on the default port (11434)
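Assuming the installer registered the systemd unit under its default name, ollama, you can confirm the service is up before moving on:
# Check that the Ollama service is active
systemctl status ollama
# Follow its logs if something looks wrong
journalctl -fu ollama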
NVIDIA GPUs offer excellent performance for running LLMs, especially with the optimizations in Ubuntu 24.04's newer kernel.
# Check which drivers are recommended for your specific GPU
sudo ubuntu-drivers devices
# Install the recommended driver (the 550 series is a common recommendation on 24.04)
sudo apt install nvidia-driver-550
# After installation, reboot your system
sudo reboot
The ubuntu-drivers devices command is crucial as it analyzes your specific GPU hardware and recommends the most compatible driver. Ubuntu 24.04's newer kernel (6.8+) has improved compatibility with the latest NVIDIA drivers, reducing common installation issues found in earlier versions.
CUDA is essential for Ollama to communicate with your NVIDIA GPU:
# Add the CUDA repository keys (note the 24.04-specific package)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
# Update repositories and install the CUDA toolkit
sudo apt-get update && sudo apt-get install cuda-toolkit -y
The CUDA toolkit provides:
Runtime libraries for GPU computation
Development tools for GPU programming
Debugging utilities and performance profilers
Libraries that interface with the NVIDIA kernel driver installed earlier
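To verify the toolkit itself is in place, check the compiler version; the NVIDIA packages install under /usr/local/cuda, which is not added to your PATH by default:
# Confirm the CUDA compiler is installed (path assumes the default install location)
/usr/local/cuda/bin/nvcc --version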
After installation, confirm your GPU is properly recognized:
# Display GPU information including model, driver version, and usage
nvidia-smi
A successful output shows your GPU's:
Model name and architecture
CUDA version compatibility
Memory capacity and usage
Temperature and power draw
Any processes currently using the GPU
If this command fails, see the troubleshooting section below.
Flash Attention dramatically improves performance for transformer-based models:
# Create a systemd service file for Ollama with optimized parameters
sudo nano /etc/systemd/system/ollama.service
Add this content to the file (replace YOUR_USERNAME with your actual username):
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/bin/ollama serve
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart=always
RestartSec=3
User=YOUR_USERNAME
[Install]
WantedBy=multi-user.target
Flash Attention is an optimization algorithm that:
Reduces memory usage by 2-8x by minimizing data movement between levels of the GPU memory hierarchy
Accelerates processing by up to 4x, especially beneficial for longer context windows
Works by "tiling" calculations to leverage faster on-chip SRAM instead of slower VRAM
Enables processing longer contexts (up to 16K tokens) on consumer GPUs
Maintains full mathematical accuracy unlike some other optimization techniques
After creating the service file:
# Reload systemd configuration
sudo systemctl daemon-reload
# Enable the service to start on boot
sudo systemctl enable ollama
# Start/restart the service now
sudo systemctl restart ollama
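To confirm the new environment actually reached the running service, ask systemd directly:
# Show the environment systemd passes to the Ollama process
systemctl show ollama --property=Environment
# Scan recent logs for flash attention mentions (exact wording varies by Ollama version)
journalctl -u ollama -n 50 | grep -i flash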
AMD's ROCm platform provides CUDA-like capabilities for Radeon GPUs.
# Download and install the AMD GPU driver package for Ubuntu 24.04
wget https://repo.radeon.com/amdgpu-install/24.04/ubuntu/noble/amdgpu-install_6.1.50404-1_all.deb
sudo apt install ./amdgpu-install_6.1.50404-1_all.deb
# Install ROCm with the recommended configuration
sudo amdgpu-install --usecase=rocm
ROCm 6.1+ includes specific optimizations for Ubuntu 24.04's newer kernel, providing:
Better memory management for newer Radeon cards
Improved compatibility with the 6.8+ kernel
More efficient compute scheduling
Enhanced driver stability for long inference sessions
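After installing (and rebooting), confirm ROCm can actually see your card. The utilities below are normally installed by the rocm use case; you may also need to add your user to the render and video groups:
# Grant your user GPU access (log out and back in afterwards)
sudo usermod -a -G render,video $USER
# Confirm ROCm detects the GPU
rocm-smi
rocminfo | grep -i "marketing name"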
For AMD GPUs, you need the ROCm-enabled Ollama libraries in addition to the standard install:
# Download and extract the additional ROCm-enabled Ollama package
curl -L https://ollama.com/download/ollama-linux-amd64-rocm.tgz | sudo tar -C /usr -xzf -
This archive provides libraries compiled against ROCm instead of CUDA, enabling Radeon GPUs to accelerate LLM inference.
If your AMD GPU isn't officially supported by ROCm:
# Add this environment variable to your Ollama service file
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
The HSA_OVERRIDE_GFX_VERSION variable tells ROCm to treat your GPU as a specific architecture. The "11.0.0" value targets RDNA 3 cards; RDNA 2 cards typically need "10.3.0" instead. Adjust the value based on your specific GPU model.
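To choose a sensible override value, first check which gfx target your card reports:
# Show the gfx architecture ROCm reports for your GPU
rocminfo | grep -i gfx
# As a rule of thumb, gfx1030 maps to 10.3.0 and gfx1100 maps to 11.0.0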
Running Ollama in Docker provides isolation and simplified deployment.
# Install Docker from Ubuntu 24.04 repositories
sudo apt install docker.io
Ubuntu 24.04 ships with an updated Docker version that includes:
Improved container runtime
Better resource management
Enhanced security features
More efficient image storage
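Before wiring in the GPU, confirm Docker itself works; optionally add your user to the docker group so later commands don't need sudo:
# Verify the Docker daemon is functional
sudo docker run --rm hello-world
# Optional: run docker without sudo (log out and back in afterwards)
sudo usermod -aG docker $USER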
To use your NVIDIA GPU from within Docker:
# Add the repository GPG key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Add the NVIDIA Container Toolkit repository (the same stable repo serves recent Ubuntu releases, including 24.04)
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install the toolkit and configure Docker
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
The NVIDIA Container Toolkit:
Creates a bridge between Docker containers and your physical GPU
Passes through CUDA capabilities to containerized applications
Maps appropriate device files and libraries
Handles proper isolation of GPU resources between containers
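A quick smoke test confirms containers can reach the GPU before you start Ollama in one:
# Run nvidia-smi from inside a disposable container
sudo docker run --rm --gpus=all ubuntu nvidia-smi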
# Run Ollama in a Docker container with GPU access
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
This command:
Creates a detached container (-d) that runs in the background
Passes all GPUs to the container (--gpus=all)
Creates a persistent volume for model storage (-v ollama:/root/.ollama)
Exposes the API port (-p 11434:11434)
Names the container "ollama" for easy reference
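With the container running, models are pulled and run through docker exec:
# Open an interactive session with a model inside the container
sudo docker exec -it ollama ollama run llama3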
Ubuntu 24.04 offers several opportunities to maximize Ollama's performance.
# Install GPU and system monitoring utilities
sudo apt install nvtop htop
These tools provide real-time insights:
nvtop shows GPU utilization, temperature, memory usage, and processes
htop offers an enhanced view of CPU, memory, and process management
Both are especially valuable when debugging performance issues
# Set CPU governor to performance mode
sudo apt install cpufrequtils
echo 'GOVERNOR="performance"' | sudo tee /etc/default/cpufrequtils
sudo systemctl restart cpufrequtils
# Optional: disable turbo boost for more consistent clocks and less thermal throttling (Intel CPUs with intel_pstate only)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
By default, Ubuntu 24.04 uses the "powersave" governor which dynamically adjusts CPU frequency to save energy. The "performance" governor keeps your CPU running at maximum frequency, providing:
Consistent processing power for model inference
Reduced latency during token generation
More predictable performance during long running sessions
Better handling of concurrent requests
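You can confirm the governor switch took effect on every core:
# Show the active governor for each CPU core
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c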
# Optimize SSD I/O scheduling (adjust nvme0n1 to match your drive)
sudo apt install util-linux
echo 'mq-deadline' | sudo tee /sys/block/nvme0n1/queue/scheduler
The "deadline" I/O scheduler:
Prioritizes reducing latency over maximizing throughput
Helps with rapid loading of large model files from disk
Works especially well with NVMe drives common in newer systems
Takes advantage of Ubuntu 24.04's improved I/O subsystem
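The echo above resets at reboot. One way to persist the setting is a small udev rule (the filename is arbitrary):
# Persist the scheduler choice for NVMe drives across reboots
echo 'ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="mq-deadline"' | sudo tee /etc/udev/rules.d/60-ioscheduler.rules
# Apply without rebooting
sudo udevadm control --reload && sudo udevadm trigger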
# Limit the number of models Ollama keeps loaded
export OLLAMA_MAX_LOADED_MODELS=1
This setting forces Ollama to unload models when switching between them, which:
Frees up system RAM for the active model
Allows for larger context windows
Reduces the chance of out-of-memory errors
Improves system responsiveness
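Keep in mind that an export only affects an ollama serve process started from the same shell. If Ollama runs as the systemd service configured earlier, set the variable there instead, for example with a drop-in override:
# Create a drop-in override for the Ollama service
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Then apply the change
sudo systemctl daemon-reload && sudo systemctl restart ollama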
# Pull the latest Llama model
ollama pull llama3.3
# Different model sizes for various GPU capabilities
ollama pull llama3:8b # 8GB+ VRAM (e.g., RTX 3060)
ollama pull mistral:7b # 8GB+ VRAM
ollama pull llama3:70b # 40GB+ VRAM (A100/H100 class)
Model size considerations:
Smaller models (7-8B parameters) run well on consumer GPUs with 8GB+ VRAM
Medium models (13-30B parameters) typically require 16GB+ VRAM
Large models (70B+ parameters) need data center GPUs or multi-GPU setups
Quantized versions (Q4_K_M) reduce VRAM requirements by approximately 75%
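Once a model is loaded, you can check how much of it actually landed on the GPU:
# Show loaded models and their GPU/CPU split
ollama ps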
# For NVIDIA: select specific GPUs by index
export CUDA_VISIBLE_DEVICES=0,1
# For AMD: select specific GPUs by index
export ROCR_VISIBLE_DEVICES=0,1
These environment variables control which physical GPUs your system makes available to Ollama:
Setting "0,1" uses only the first two GPUs in your system
This works well for systems with mixed GPUs, allowing you to dedicate specific cards
Ollama automatically distributes model layers across available GPUs
In multi-GPU setups, interconnect speed (NVLink/PCIe) becomes important for performance
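To see which index corresponds to which card before setting these variables:
# List NVIDIA GPUs with their indices
nvidia-smi -L
# For AMD cards, the first column of rocm-smi output shows each device's index
rocm-smi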
# Check if Secure Boot is enabled
mokutil --sb-state
# Force modeset for NVIDIA drivers
echo 'options nvidia-drm modeset=1' | sudo tee /etc/modprobe.d/nvidia.conf
sudo update-initramfs -u
Common causes of GPU detection issues:
Secure Boot may prevent unsigned drivers from loading
Mismatched driver and CUDA versions
Incorrect kernel module loading order
Hardware compatibility issues with older GPUs
The "modeset=1" option forces the NVIDIA kernel module to handle display mode setting, which often resolves detection problems on Ubuntu 24.04's newer graphics stack.
# Check Ollama logs for detailed error information
journalctl -xeu ollama
# Restart the service after configuration changes
sudo systemctl restart ollama
Ubuntu 24.04's journald includes improved log formatting and filtering. The command flags:
-x provides additional explanatory help text
-e jumps to the end of the logs (most recent entries)
-u ollama filters logs to show only the Ollama service
# Allow access from other devices on your network
export OLLAMA_HOST=0.0.0.0
By default, Ollama only listens on localhost (127.0.0.1), preventing external connections. Setting OLLAMA_HOST=0.0.0.0:
Makes Ollama listen on all network interfaces
Allows connections from other devices on your network
Enables integration with web UIs or applications running on different machines
Facilitates distributed workflows
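If you run UFW, open the port, then test from another machine on your network (replace SERVER_IP with the Ollama host's address):
# Allow the Ollama API port through the firewall
sudo ufw allow 11434/tcp
# From another machine on the LAN
curl http://SERVER_IP:11434/api/tags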
Ubuntu 24.04's kernel 6.8+ includes significant improvements for AI workloads:
Better GPU memory management
Improved power state handling
Updated device drivers with enhanced compatibility
More efficient I/O scheduling optimized for NVMe storage
# Check if you're using Wayland (recommended)
echo $XDG_SESSION_TYPE
Wayland (Ubuntu 24.04's default display server):
Uses less CPU than X11
Has better GPU compositing, leaving more resources for ML tasks
Provides improved security and stability
Offers better scaling on high-DPI displays
# Use nala for better package management UI
sudo apt install nala
sudo nala install <package-name>
Nala is a frontend for APT that offers:
Parallel downloads for faster package installation
Cleaner, more readable output
History tracking of installed packages
Easily revertible package operations
# Reduce VRAM usage with KV Cache Quantization
export OLLAMA_KV_CACHE_TYPE=q4_0
This compresses the Key-Value cache to 4-bit precision, which:
Reduces the KV cache's VRAM footprint by roughly 75% compared to FP16
Enables running larger models on consumer GPUs
Allows for longer context windows
Causes minimal quality degradation in most use cases
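KV cache quantization is designed to work together with flash attention, so in practice both variables end up in the Ollama service configuration:
# Add both variables to the [Service] section of the Ollama unit, then restart
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
sudo systemctl daemon-reload && sudo systemctl restart ollama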
# Monitor GPU temperature in real-time
watch nvidia-smi
GPUs automatically reduce clock speeds when they get too hot (thermal throttling):
Maintaining temperatures below 80°C is ideal for performance
Consider improved cooling for sustained workloads
Undervolting can reduce heat with minimal performance impact
Custom fan curves can help balance noise and cooling
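For a focused view of the values that matter for throttling, nvidia-smi can be queried directly:
# Poll temperature, clocks, and power draw every 5 seconds
nvidia-smi --query-gpu=temperature.gpu,clocks.sm,power.draw --format=csv -l 5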
# Update Ollama to the latest version
curl -fsSL https://ollama.com/install.sh | sh
Ollama development moves quickly, with frequent updates that:
Add support for new models and architectures
Improve performance and stability
Fix bugs and security issues
Optimize for newer GPU hardware
# Limit CPU thread usage from inside an interactive session
ollama run llama3
>>> /set parameter num_thread 8
The num_thread parameter controls how many CPU threads Ollama uses for inference:
Prevents Ollama from consuming all available CPU resources
Leaves processing power for other applications
Helps maintain system responsiveness during inference
Can improve performance on systems with many cores by avoiding scheduling overhead
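To make the limit stick across sessions, one option is baking it into a custom model with a Modelfile (the name llama3-8threads below is just an example):
# Create a Modelfile that pins the thread count
cat > Modelfile <<'EOF'
FROM llama3
PARAMETER num_thread 8
EOF
# Build and run the customized model
ollama create llama3-8threads -f Modelfile
ollama run llama3-8threads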
Ubuntu 24.04 LTS provides an excellent platform for running Ollama with GPU acceleration. The newer kernel, updated drivers, and improved system components offer better performance and stability than previous LTS releases. Whether you're using consumer-grade hardware or data center GPUs, following this guide should help you achieve optimal performance for local LLM inference.