Hacker News

to be fair, llama.cpp has gotten much easier to use lately with llama-server -hf <model name>. That said, the need to compile it yourself is still a pretty big barrier for most people.
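For reference, the invocation being described looks something like this (the Hugging Face repo name here is only an illustrative example, not one from the thread):

```shell
# Fetch a GGUF model from Hugging Face (cached after the first run)
# and serve it locally; the repo name is just an example
llama-server -hf ggml-org/gemma-3-1b-it-GGUF

# llama-server exposes an OpenAI-compatible API, by default on port 8080
curl http://localhost:8080/v1/models
```

The point of `-hf` is that it collapses the old download-convert-serve dance into one command, which is most of what made ollama attractive in the first place.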


I started with ollama and now I'm using llama-server's Router Mode (part of llama.cpp), which lets you manage multiple models through a single server instance.

One thing I haven't figured out: subjectively, ollama's model loading felt nearly instant, while with llama.cpp I always seem to be waiting for models to load. That doesn't make sense, since ollama is ultimately running the same software underneath. Maybe I should try ollama again to convince myself that I'm not crazy and that its model loading wasn't actually instant.


You don't need to compile it yourself, though? Unless you want CUDA support on Linux, I guess; dunno why you'd need such a silly thing though:

https://github.com/ggml-org/llama.cpp/releases


> dunno why you'd need such a silly thing though

I'm not sure I follow, what alternative to CUDA on Linux offers similar performance?


Ah, 'twas a mere jest: a sarcastic jab that, of all the manifold builds provided, the most useful one is missing, doubtless for good and practical reasons.

Nevertheless, worth looking at the Vulkan builds. They work on all GPUs!
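Something like the following, as a sketch; the release filename is purely illustrative, since asset names change with every build:

```shell
# Grab a Vulkan build from the GitHub releases page
# (the filename below is just an example of the naming pattern)
unzip llama-bNNNN-bin-ubuntu-vulkan-x64.zip -d llama.cpp

# Offload as many layers as possible to the GPU via -ngl
./llama.cpp/llama-server -hf <model name> -ngl 99
```

The Vulkan backend is the portable option here: one binary runs on NVIDIA, AMD, and Intel GPUs alike, at some performance cost relative to a native CUDA build.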


> That said, the need to compile it yourself is still a pretty big barrier for most people.

My distro (NixOS) has binary packages though...

And there are packages in the AUR (Arch), GURU (Gentoo), and even Debian Unstable. Now, these might be a little behind, but if you care that much, you can download binaries from GitHub directly.
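For instance, roughly like this (package and attribute names vary by distro and channel, so treat these as assumptions to verify locally):

```shell
# Nix / NixOS: the nixpkgs attribute is llama-cpp
nix-shell -p llama-cpp

# Arch, via an AUR helper (exact AUR package name may differ)
yay -S llama.cpp
```

Either way, the distro route means someone else did the compiling, which addresses the "compile it yourself" barrier from the top of the thread.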



