Hacker News

to be fair, llama.cpp has gotten much easier to use lately with llama-server -hf <model name>. That said, the need to compile it yourself is still a pretty big barrier for most people.
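For reference, the invocation being described looks something like this (the Hugging Face repo name here is only an illustrative example, not one from the thread):

```shell
# Fetch a GGUF model from Hugging Face (cached after the first run)
# and serve it locally; the repo name is just an example
llama-server -hf ggml-org/gemma-3-1b-it-GGUF

# llama-server exposes an OpenAI-compatible API, by default on port 8080
curl http://localhost:8080/v1/models
```

The point of `-hf` is that it collapses the old download-convert-serve dance into one command, which is most of what made ollama attractive in the first place.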


I started with ollama and now I'm using llama-server's Router Mode (part of llama.cpp), which lets you manage multiple models through a single server instance.

One thing I haven't figured out: subjectively, ollama's model loading felt nearly instant, while with llama.cpp I always seem to be waiting for models to load. That doesn't make sense, since ollama is ultimately running the same software underneath. Maybe I should try ollama again to convince myself that I'm not crazy and that its model loading wasn't actually instant.


You don't need to compile it yourself, though? Unless you want CUDA support on Linux, I guess; dunno why you'd need such a silly thing though:

https://github.com/ggml-org/llama.cpp/releases


> dunno why you'd need such a silly thing though

I'm not sure I follow, what alternative to CUDA on Linux offers similar performance?


Ah, 'twas a mere jest: a sarcastic jab that, of all the manifold builds provided, the most useful one is missing, doubtless for good and practical reasons.

Nevertheless, worth looking at the Vulkan builds. They work on all GPUs!
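Something like the following, as a sketch; the release filename is purely illustrative, since asset names change with every build:

```shell
# Grab a Vulkan build from the GitHub releases page
# (the filename below is just an example of the naming pattern)
unzip llama-bNNNN-bin-ubuntu-vulkan-x64.zip -d llama.cpp

# Offload as many layers as possible to the GPU via -ngl
./llama.cpp/llama-server -hf <model name> -ngl 99
```

The Vulkan backend is the portable option here: one binary runs on NVIDIA, AMD, and Intel GPUs alike, at some performance cost relative to a native CUDA build.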


> That said, the need to compile it yourself is still a pretty big barrier for most people.

My distro (NixOS) has binary packages though...

And there are packages in the AUR (Arch), GURU (Gentoo), and even Debian Unstable. Now, these might be a little behind, but if you care that much, you can download binaries from GitHub directly.
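For instance, roughly like this (package and attribute names vary by distro and channel, so treat these as assumptions to verify locally):

```shell
# Nix / NixOS: the nixpkgs attribute is llama-cpp
nix-shell -p llama-cpp

# Arch, via an AUR helper (exact AUR package name may differ)
yay -S llama.cpp
```

Either way, the distro route means someone else did the compiling, which addresses the "compile it yourself" barrier from the top of the thread.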



