Our Pull Request to Ollama Trims 13% CPU Overhead from Each Inference

Every time you send a request to Ollama, even when your model is already loaded in GPU memory, the server re-reads the model manifest from disk and re-parses the GGUF metadata. On our hardware that took about 300 milliseconds: pure overhead before a single token is generated.

We found this while optimizing our own local inference pipeline for NLA (Natural Language Autoencoding) research, where we needed to make 10,000 sequential inference requests. The 300ms overhead alone added 50 minutes to our batch job.

How We Found It

We started with a simple question: why does load_duration show 250-350ms on warm requests when the model is already in VRAM?
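To check the same number on another setup: the final JSON object returned by Ollama's /api/generate endpoint reports total_duration, load_duration and eval_duration in nanoseconds. The sketch below reads those fields and computes how much of the request was spent loading. The struct and function names are hypothetical, and the sample body is made up for illustration rather than captured from a server.

```go
package main

import (
    "encoding/json"
    "fmt"
    "time"
)

// generateStats holds the timing fields of a final /api/generate response.
// Ollama reports these durations in nanoseconds, which maps directly onto
// time.Duration.
type generateStats struct {
    TotalDuration time.Duration `json:"total_duration"`
    LoadDuration  time.Duration `json:"load_duration"`
    EvalCount     int           `json:"eval_count"`
    EvalDuration  time.Duration `json:"eval_duration"`
}

// parseStats decodes the timing fields from a response body.
func parseStats(body []byte) (generateStats, error) {
    var s generateStats
    err := json.Unmarshal(body, &s)
    return s, err
}

// loadShare returns load_duration as a fraction of total_duration.
func loadShare(s generateStats) float64 {
    if s.TotalDuration == 0 {
        return 0
    }
    return float64(s.LoadDuration) / float64(s.TotalDuration)
}

func main() {
    // Hypothetical sample body, not real server output.
    body := []byte(`{"total_duration":506000000,"load_duration":303000000,"eval_count":10,"eval_duration":203000000}`)
    s, err := parseStats(body)
    if err != nil {
        panic(err)
    }
    fmt.Printf("load=%v total=%v share=%.0f%%\n", s.LoadDuration, s.TotalDuration, 100*loadShare(s))
}
```

A load share that stays high on repeated requests to an already-loaded model is the symptom described here.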

Step 1: strace. We traced the system calls during inference. Every syscall completed in under 0.1ms. The overhead wasn't in the kernel — it was in Go userspace.

Step 2: pprof. We noticed that server/routes.go references http.DefaultServeMux specifically to get pprof "for free" — but the net/http/pprof import was missing. The comment said it was there; the code said otherwise. We added the one-line import, rebuilt, and profiled.
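For context, the blank import works through a side effect: net/http/pprof registers its handlers on http.DefaultServeMux when the package is initialized. The standalone sketch below (not Ollama code; pprofStatus is a hypothetical helper) shows that the import alone is enough to make /debug/pprof/ respond.

```go
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    _ "net/http/pprof" // side effect: registers /debug/pprof/ handlers on http.DefaultServeMux
)

// pprofStatus sends an in-memory GET for the pprof index page to
// http.DefaultServeMux and returns the HTTP status code.
// Without the blank import above, the mux has no matching route and
// answers 404.
func pprofStatus() int {
    req := httptest.NewRequest(http.MethodGet, "/debug/pprof/", nil)
    rec := httptest.NewRecorder()
    http.DefaultServeMux.ServeHTTP(rec, req)
    return rec.Code
}

func main() {
    fmt.Println("GET /debug/pprof/ ->", pprofStatus())
}
```

Once a server exposes DefaultServeMux, a CPU profile can be collected with go tool pprof pointed at the /debug/pprof/profile endpoint of that server; the host and port depend on how the server is configured.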

Step 3: The smoking gun. 85% of CPU time was in gguf.readKeyValue — Ollama was parsing the GGUF file's metadata section on every single request. For models with large tokenizer vocabularies (30K+ entries), this means reading and allocating megabytes of strings per request, only to throw them away.

The Root Cause

Three code paths contribute to the overhead:

Model.Capabilities() calls gguf.Open(m.ModelPath) to check if the model supports completion, embedding, or vision. This is called multiple times per request from different handlers. Each call opens and parses the entire GGUF metadata section.

GetModel() reads the model manifest and config from disk — JSON parsing, template compilation, layer iteration. Called on every request even when nothing has changed.

scheduleRunner() calls GetModel() again, separately from the generate handler. Two uncached disk reads per request.

The Fix

Two caches and one call-site change, zero behavioral changes:

// 1. Cache Capabilities() on the Model struct
type Model struct {
    // ... existing fields ...
    cachedCapabilities []model.Capability
}

func (m *Model) Capabilities() []model.Capability {
    if m.cachedCapabilities != nil {
        return m.cachedCapabilities
    }
    // ... existing code ...
    m.cachedCapabilities = capabilities
    return capabilities
}

// 2. Cache GetModel() via sync.Map
var modelCache sync.Map

func getCachedModel(name string) (*Model, error) {
    if cached, ok := modelCache.Load(name); ok {
        return cached.(*Model), nil
    }
    m, err := GetModel(name)
    if err != nil {
        return nil, err
    }
    modelCache.Store(name, m)
    return m, nil
}

The third change replaces GetModel(name) with getCachedModel(name) in scheduleRunner(), eliminating the second uncached read.
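The effect of this kind of cache is easy to demonstrate in isolation. The sketch below is not the code from the pull request: record, cache, newCountingCache and invalidate are hypothetical names, and the loader is a stand-in for the disk read and GGUF parse. It counts how often the expensive loader runs across repeated lookups. The invalidate method is an assumption about what a long-running server would need when a model is re-pulled or deleted; the snippets above do not show whether or how the patch handles that.

```go
package main

import (
    "fmt"
    "sync"
)

// record is a hypothetical stand-in for parsed model metadata.
type record struct{ name string }

// cache memoizes the result of an expensive loader, keyed by model name.
type cache struct {
    entries sync.Map // name -> *record
    load    func(name string) (*record, error)
}

// get returns the cached record, calling the loader only on a miss.
func (c *cache) get(name string) (*record, error) {
    if v, ok := c.entries.Load(name); ok {
        return v.(*record), nil
    }
    r, err := c.load(name)
    if err != nil {
        return nil, err // errors are not cached, so the next call retries
    }
    // LoadOrStore keeps the first stored value if two goroutines race here.
    v, _ := c.entries.LoadOrStore(name, r)
    return v.(*record), nil
}

// invalidate drops an entry so the next get reloads it (assumed requirement).
func (c *cache) invalidate(name string) { c.entries.Delete(name) }

// newCountingCache builds a cache whose loader increments *calls on each run.
func newCountingCache(calls *int) *cache {
    return &cache{load: func(name string) (*record, error) {
        *calls++ // stands in for the manifest read and metadata parse
        return &record{name: name}, nil
    }}
}

func main() {
    calls := 0
    c := newCountingCache(&calls)
    for i := 0; i < 3; i++ {
        if _, err := c.get("qwen2.5:3b"); err != nil {
            panic(err)
        }
    }
    fmt.Println("loader calls after 3 lookups:", calls)
}
```

Three lookups trigger a single load; every later request for the same name is a map lookup.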

Results

A/B benchmark on a Tesla P4, same hardware, same model (qwen2.5:3b), same GPU temperature:

Metric	Vanilla v0.20.7	Patched v0.20.7	Improvement
Total time (100 tokens)	2333 ms	2032 ms	13% faster
Total time (10 tokens)	506 ms	205 ms	59% faster
Total time (embedding)	323 ms	22 ms	93% faster
load_duration (overhead)	303 ms	2 ms	301ms eliminated
Inference tok/s	49.3	49.3	No change

The inference speed is identical — we didn't touch the GPU path. The improvement comes from eliminating ~300ms of wasted overhead per request. The impact scales inversely with response length: short responses and embedding requests benefit most, because the overhead was a larger fraction of their total time.
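The percentages in the table follow directly from the before and after totals: the same ~301 ms is removed from each request, so its share depends only on how long the rest of the request takes. A small sketch of that arithmetic, using the figures from the table (savedPercent is a hypothetical helper name):

```go
package main

import "fmt"

// savedPercent returns the percentage of total request time removed
// when the request drops from beforeMs to afterMs.
func savedPercent(beforeMs, afterMs float64) float64 {
    return 100 * (beforeMs - afterMs) / beforeMs
}

func main() {
    // Before/after totals in milliseconds, taken from the benchmark table.
    rows := []struct {
        name          string
        before, after float64
    }{
        {"100 tokens", 2333, 2032},
        {"10 tokens", 506, 205},
        {"embedding", 323, 22},
    }
    for _, r := range rows {
        fmt.Printf("%-10s %.1f%% less time\n", r.name, savedPercent(r.before, r.after))
    }
}
```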

Who This Affects

Every Ollama user. The magnitude varies with disk speed:

HDD/SAS disks: 200-400ms overhead → 2ms (our case)
SATA SSD: 50-100ms overhead → 2ms
NVMe SSD: 20-50ms overhead → 2ms

For interactive use (single requests), you might not notice 50ms. But for batch workloads (embeddings, automated pipelines, agent tool calls) it adds up fast. At 10,000 requests, the overhead alone costs about 50 minutes on HDD (at 300ms per request) or up to roughly 8 minutes on NVMe (at 50ms per request).
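The batch figures are simply the per-request overhead multiplied by the request count. A short sketch for plugging in other numbers (batchOverhead is a hypothetical helper; the per-request values are the ones listed above):

```go
package main

import (
    "fmt"
    "time"
)

// batchOverhead returns the total time spent on fixed per-request
// overhead across a batch of sequential requests.
func batchOverhead(requests int, perRequest time.Duration) time.Duration {
    return time.Duration(requests) * perRequest
}

func main() {
    const requests = 10000
    cases := []struct {
        disk       string
        perRequest time.Duration
    }{
        {"HDD, 300ms", 300 * time.Millisecond},
        {"NVMe, 50ms", 50 * time.Millisecond},
        {"NVMe, 20ms", 20 * time.Millisecond},
        {"cached, 2ms", 2 * time.Millisecond},
    }
    for _, c := range cases {
        fmt.Printf("%-12s %v\n", c.disk, batchOverhead(requests, c.perRequest))
    }
}
```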

The Broader Lesson

The performance bottleneck wasn't in the GPU kernels. It wasn't in the model architecture. It was in the request scaffolding: the Go code that sets up each request before inference begins. The actual GPU inference runs at 49 tokens/second; for short responses, the surrounding infrastructure spent longer on setup than the GPU spent on computation (about 301ms of overhead against roughly 205ms of remaining work for a 10-token response, per the benchmark above).

This pattern is common in ML infrastructure: the compute is fast, the plumbing is slow. Profiling the plumbing — not just the model — is where the easy wins hide.

A Note on Native Compilation

Separately from the caching fix, we also discovered that Ollama's default builds compile CUDA kernels for sm_75, sm_100, and sm_120 architectures. Our Tesla P4 (sm_61, Pascal) was running these kernels via PTX JIT translation — the CUDA driver compiles the code at runtime for the actual GPU architecture. By building Ollama with native sm_61 support via a Nix flake, we measured a 2.6x inference speedup (19 tok/s → 49.7 tok/s) independent of the caching fix.

If you're running Ollama on older GPU hardware, a native build for your specific SM architecture may give you a significant speedup for free.

Links

Pull Request #16161 — our fix (3 files changed, 30 insertions, 6 deletions)
GitHub Issue #12443 — the upstream issue reporting this symptom
Our profiling was done with net/http/pprof and strace on NixOS with a custom Ollama build