🚀 Ultimate Guide to Running Ollama with GPU Acceleration on Ubuntu 24.04 LTS

Date Created: 2025-03-29
By: 16BitMiker

👀 Introduction

Ubuntu 24.04 LTS "Noble Numbat" brings significant improvements for AI enthusiasts, particularly those running large language models locally. This guide walks through setting up Ollama, an open-source LLM server, with GPU acceleration on the latest Ubuntu release. Whether you have NVIDIA or AMD hardware, we'll cover everything from basic installation to advanced performance tweaks specifically optimized for kernel 6.8+ in Ubuntu 24.04.

📋 Basic Ollama Installation

Before diving into GPU specifics, let's ensure Ollama is properly installed:
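The official install script from ollama.com handles everything in one step:

```bash
# Download and run the official Ollama install script
curl -fsSL https://ollama.com/install.sh | sh

# Confirm the binary works and the service is running
ollama --version
systemctl status ollama --no-pager
```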

The installation script performs several key tasks: it downloads the latest Ollama binary, creates a dedicated ollama system user, and registers a systemd service so the server starts automatically at boot.

🖥️ GPU Setup for NVIDIA Cards

NVIDIA GPUs offer excellent performance for running LLMs, especially with the optimizations in Ubuntu 24.04's newer kernel.

๐Ÿ” Installing the Right Driver

The ubuntu-drivers devices command is crucial as it analyzes your specific GPU hardware and recommends the most compatible driver. Ubuntu 24.04's newer kernel (6.8+) has improved compatibility with the latest NVIDIA drivers, reducing common installation issues found in earlier versions.

📦 Installing CUDA Toolkit

CUDA is essential for Ollama to communicate with your NVIDIA GPU:
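A minimal route is Ubuntu's packaged toolkit; NVIDIA's own apt repository works too if you need a newer release:

```bash
# Install the CUDA toolkit from Ubuntu's repositories
sudo apt update
sudo apt install -y nvidia-cuda-toolkit

# Verify the compiler is available
nvcc --version
```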

The CUDA toolkit provides the compiler (nvcc), runtime libraries, and profiling tools that GPU-accelerated applications build on.

✅ Verifying GPU Detection

After installation, confirm your GPU is properly recognized:
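```bash
# Query driver version, GPU model, VRAM, and running processes
nvidia-smi
```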

A successful output shows your GPU's model name, driver and CUDA versions, temperature, and current memory usage.

If this command fails, see the troubleshooting section below.

⚡ Enabling Flash Attention

Flash Attention dramatically improves performance for transformer-based models:
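One way to apply it is to edit the unit the installer created (a drop-in via sudo systemctl edit also works):

```bash
# Open the Ollama service unit for in-place editing
sudo systemctl edit --full ollama.service
```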

Add this content to the file (replace YOUR_USERNAME with your actual username):
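A sketch of the unit contents; the User/Group lines and the binary path are assumptions to adapt (the stock installer runs the server as its own ollama user):

```ini
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=YOUR_USERNAME
Group=YOUR_USERNAME
Restart=always
# Enable Flash Attention for transformer models
Environment="OLLAMA_FLASH_ATTENTION=1"

[Install]
WantedBy=multi-user.target
```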

Flash Attention is an optimization that restructures the attention computation to minimize reads and writes to GPU memory, which lowers VRAM usage and speeds up inference, especially at long context lengths.

After creating the service file:
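```bash
# Reload systemd's view of the unit and restart Ollama
sudo systemctl daemon-reload
sudo systemctl restart ollama
```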

🔴 GPU Setup for AMD Cards

AMD's ROCm platform provides CUDA-like capabilities for Radeon GPUs.

📦 Installing ROCm
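A sketch of AMD's documented route via the amdgpu-install helper; the package version below is a placeholder, so fetch the current .deb for Ubuntu 24.04 (noble) from repo.radeon.com:

```bash
# Install the amdgpu-install helper downloaded from
# https://repo.radeon.com/amdgpu-install/
sudo apt install -y ./amdgpu-install_VERSION_all.deb   # VERSION is a placeholder

# Install the ROCm userspace stack and kernel driver
sudo amdgpu-install --usecase=rocm

# Give your user access to the GPU device nodes, then reboot
sudo usermod -aG render,video $USER
sudo reboot
```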

ROCm 6.1+ includes specific optimizations for Ubuntu 24.04's newer kernel, providing better stability and broader GPU coverage than earlier releases.

🔄 Installing Ollama with ROCm Support

For AMD GPUs, you need the ROCm-specific Ollama build:
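The official install script detects AMD hardware and fetches the ROCm build automatically; Ollama's Linux install notes also document a manual tarball route, sketched here:

```bash
# Scripted install: detects AMD GPUs and installs the ROCm build
curl -fsSL https://ollama.com/install.sh | sh

# Manual alternative: the ROCm-enabled tarball
curl -L https://ollama.com/download/ollama-linux-amd64-rocm.tgz | sudo tar -zx -C /usr
```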

This version is compiled specifically to use ROCm libraries instead of CUDA, enabling Radeon GPUs to accelerate LLM inference.

⚙️ Support for Unofficial Cards

If your AMD GPU isn't officially supported by ROCm:
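A sketch using a systemd drop-in so the override applies to the service:

```bash
# Apply the override via a drop-in
sudo systemctl edit ollama.service
# ...then add these lines in the editor:
#   [Service]
#   Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"

sudo systemctl daemon-reload
sudo systemctl restart ollama
```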

The HSA_OVERRIDE_GFX_VERSION variable tells ROCm to treat your GPU as a specific architecture. The "11.0.0" value maps to RDNA 3 (gfx1100); RDNA 2 cards usually want "10.3.0", so adjust the value to match your specific GPU model.

๐Ÿ‹ Docker Setup

Running Ollama in Docker provides isolation and simplified deployment.

๐Ÿณ Installing Docker

Ubuntu 24.04 ships an up-to-date Docker release in its repositories, so no third-party repository is strictly required for running Ollama in a container.

🛠️ Installing NVIDIA Container Toolkit

To use your NVIDIA GPU from within Docker:
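These steps follow NVIDIA's published apt setup for the toolkit; verify the repository lines against their current install guide:

```bash
# Add NVIDIA's container toolkit repository and signing key
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -sL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit and point Docker's runtime at it
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```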

The NVIDIA Container Toolkit exposes the host's GPU driver and device nodes to containers, so CUDA workloads inside Docker can use the GPU without bundling their own driver.

🚀 Running Ollama in Docker
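The standard launch command from Ollama's Docker Hub page:

```bash
# Start Ollama with access to all GPUs, persistent model storage,
# and the API published on port 11434
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```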

This command runs the container in the background (-d), grants it every GPU (--gpus=all), persists downloaded models in a named volume, and publishes the API on port 11434.

🔥 Performance Optimization

Ubuntu 24.04 offers several opportunities to maximize Ollama's performance.

📊 Monitoring Tools
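nvtop and btop are both in Ubuntu 24.04's repositories, and nvidia-smi ships with the driver:

```bash
# GPU and system monitors from the Ubuntu archive
sudo apt install -y nvtop btop

# Live GPU utilization, VRAM, and temperature
nvtop

# Or poll nvidia-smi once per second
watch -n 1 nvidia-smi
```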

These tools provide real-time insight into GPU utilization, VRAM consumption, temperatures, and overall CPU and memory load while a model is generating.

💪 CPU Optimization

By default, Ubuntu 24.04 uses the "powersave" governor, which dynamically adjusts CPU frequency to save energy. The "performance" governor keeps your CPU running at maximum frequency, providing steadier clocks and more consistent token throughput for the CPU-bound parts of inference.
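A sketch using the sysfs interface (cpupower from linux-tools is an alternative); the change does not persist across reboots:

```bash
# Check the current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Switch every core to the performance governor (until reboot)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```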

💾 Disk Optimization

The "deadline" I/O scheduler:

🧠 RAM Management
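A sketch assuming OLLAMA_MAX_LOADED_MODELS=1 as the setting in question; OLLAMA_KEEP_ALIVE=0 is the related knob that unloads a model as soon as a request finishes:

```bash
# Limit Ollama to one resident model via a systemd drop-in
sudo systemctl edit ollama.service
# ...and add:
#   [Service]
#   Environment="OLLAMA_MAX_LOADED_MODELS=1"

sudo systemctl daemon-reload
sudo systemctl restart ollama
```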

This setting forces Ollama to unload models when switching between them, which releases VRAM and system RAM immediately instead of keeping idle models resident.

🤖 Running Models

📥 Model Installation
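llama3 here is just an example tag; browse ollama.com/library for the current catalog:

```bash
# Download a model
ollama pull llama3

# Chat with it interactively
ollama run llama3

# See what's installed and how large each model is
ollama list
```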

Model size considerations: for full GPU offload the quantized weights must fit in VRAM. As a rough rule, a 4-bit 7B model needs 4-5 GB, a 13B model needs 8-10 GB, and 70B models need 40+ GB or partial CPU offload at a steep speed cost.

🎮 Multi-GPU Configuration
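A sketch of the selection variables: CUDA_VISIBLE_DEVICES for NVIDIA and ROCR_VISIBLE_DEVICES for AMD, with indices following nvidia-smi / rocminfo ordering:

```bash
# In a systemd drop-in (sudo systemctl edit ollama.service):
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=0,1"   # NVIDIA: expose GPUs 0 and 1
#   Environment="ROCR_VISIBLE_DEVICES=0"     # AMD equivalent

# Or one-off, when running the server by hand:
CUDA_VISIBLE_DEVICES=0,1 ollama serve
```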

These environment variables control which physical GPUs your system makes available to Ollama; any device not listed is invisible to the server.

🔧 Troubleshooting

🚫 GPU Not Detected

Common causes of GPU detection issues include a missing or mismatched driver, Secure Boot blocking the unsigned NVIDIA kernel module, and a stale initramfs left over from a driver upgrade.

The "modeset=1" option forces the NVIDIA kernel module to handle display mode setting, which often resolves detection problems on Ubuntu 24.04's newer graphics stack.

🔄 Service Issues
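Typical first steps when the service misbehaves:

```bash
# Check whether the service is running and why it last exited
systemctl status ollama --no-pager

# Follow the server log live
journalctl -u ollama -f --no-pager
```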

Ubuntu 24.04's journald includes improved log formatting and filtering. In the commands above, -u restricts output to the ollama unit, -f follows new entries live, and --no-pager prints straight to the terminal.

๐ŸŒ Network Access

By default, Ollama only listens on localhost (127.0.0.1), preventing external connections. Setting OLLAMA_HOST=0.0.0.0 binds the API to all interfaces so other machines can reach it; Ollama has no built-in authentication, so only do this on a network you trust.
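```bash
# Expose the API on all interfaces via a systemd drop-in
sudo systemctl edit ollama.service
# ...and add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"

sudo systemctl daemon-reload
sudo systemctl restart ollama

# Test from another machine (the IP is a hypothetical example)
curl http://192.168.1.50:11434/api/tags
```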

💡 Ubuntu 24.04 Pro Tips

Kernel Improvements

Ubuntu 24.04's kernel 6.8+ includes significant improvements for AI workloads, largely through newer upstream GPU driver code and scheduler refinements.
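A quick sanity check that you are actually on the kernel this guide assumes:

```bash
# Print the running kernel version (expect 6.8 or newer)
uname -r
```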

Desktop Environment Optimization

Wayland, Ubuntu 24.04's default display server, is generally lighter on GPU memory than X11. On a dedicated inference machine you can reclaim even more VRAM by skipping the desktop entirely.
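One way to do that is booting to a text console (revert with graphical.target):

```bash
# Boot without the desktop to keep VRAM free for models
sudo systemctl set-default multi-user.target
sudo reboot
```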

Package Management

Nala is a frontend for APT that offers parallel downloads, cleaner output, and a transaction history you can inspect and undo.
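```bash
# Install nala and use it in place of apt
sudo apt install -y nala
sudo nala update
sudo nala upgrade
```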

Memory Optimization
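A sketch assuming OLLAMA_KV_CACHE_TYPE=q4_0 as the setting described below; it only takes effect with flash attention enabled:

```bash
# Quantize the KV cache to 4-bit via a systemd drop-in
sudo systemctl edit ollama.service
# ...and add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q4_0"

sudo systemctl daemon-reload
sudo systemctl restart ollama
```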

This compresses the key-value cache to 4-bit precision, which cuts the cache's VRAM footprint to roughly a quarter of the default f16 at a small cost in output quality (q8_0 is the middle-ground option).

Temperature Management

GPUs automatically reduce clock speeds when they get too hot (thermal throttling):
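Monitoring and, on NVIDIA cards, capping board power are the usual levers; the 250 W figure is only an example to adjust for your card's supported range:

```bash
# Watch temperature, clocks, and power draw once per second
watch -n 1 nvidia-smi --query-gpu=temperature.gpu,clocks.sm,power.draw --format=csv

# Cap board power to reduce heat (example value; resets at reboot)
sudo nvidia-smi -pl 250
```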

Regular Updates

Ollama development moves quickly, with frequent updates that add model architectures, improve GPU offload, and fix bugs, so it's worth upgrading regularly:
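```bash
# Re-running the install script upgrades an existing installation
curl -fsSL https://ollama.com/install.sh | sh

# Check the installed version
ollama --version
```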

Thread Management

Thread count is set through the num_thread parameter (Ollama's equivalent of llama.cpp's -t flag), which controls how many CPU threads handle any layers that don't fit on the GPU:
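A sketch of both places you can set it; 8 is an arbitrary example, and your physical core count is a sensible starting point:

```bash
# Interactively, inside an ollama run session:
#   /set parameter num_thread 8

# Or per-request through the API:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Hello",
  "options": { "num_thread": 8 }
}'
```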

📚 Conclusion

Ubuntu 24.04 LTS provides an excellent platform for running Ollama with GPU acceleration. The newer kernel, updated drivers, and improved system components offer better performance and stability than previous LTS releases. Whether you're using consumer-grade hardware or data center GPUs, following this guide should help you achieve optimal performance for local LLM inference.

📖 References and Additional Resources