
Just since I'm curious, what exact models and quantization are you using? In my own experience, anything smaller than ~32B is basically useless, and any quantization below Q8 absolutely trashes the models.

Sure, for a single use-case you could get away with a ~20B model if you fine-tune it for a very narrow task, but at that point there are usually better solutions than LLMs in the first place. For anything general, 32B+ at Q8 is probably the bare minimum for local models, even the "SOTA" ones available today.
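As a rough back-of-the-envelope for why 32B at Q8 is a demanding floor: weight memory is roughly parameter count times bits per weight. A sketch, assuming ~8.5 effective bits/weight for Q8 and ~4.8 for Q4_K_M (quantization formats store scales alongside weights, so the effective rate is a bit above the nominal one; these figures are illustrative, not exact):

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # Weights only; runtime overhead and the KV cache come on top of this.
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 32B model: Q8 (~8.5 bits/weight) vs Q4_K_M (~4.8 bits/weight)
print(round(model_size_gb(32, 8.5), 1))  # ~34.0 GB
print(round(model_size_gb(32, 4.8), 1))  # ~19.2 GB
```

So a 32B Q8 doesn't even fit in 32GB of shared memory before you've allocated a single token of context, which is why people on consumer hardware end up trading down on quantization instead.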



I haven’t tried any Qwen yet, but so far I’m sticking with gpt-oss-20B.

In terms of what I’m using, I’ve looked at anything that will fit on a MacBook Pro with 32GB RAM (so with shared memory) - LFM2, Llama, Mistral, Ministral, Devstral, Phi, and Nemotron.

As for quantisation, I aim for the biggest that will fit while also not being too slow - so it all depends on the model. But I’ll skip a model if I can’t at least use a Q4_K_M.

Also, I bump my context to at least 32K, because tooling sucks when the tool definitions alone come close to filling 4096 tokens!
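The context bump isn't free, though: the KV cache grows linearly with context length. A sketch of the usual estimate (2 tensors, K and V, per layer per position), using hypothetical dimensions for a 20B-class GQA model - the layer count, KV-head count, and head size here are illustrative, not taken from any specific model card:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    # 2x for the K and V tensors; fp16 cache = 2 bytes per element.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# Hypothetical model: 48 layers, 8 KV heads (GQA), head_dim 128, fp16 cache
print(round(kv_cache_gb(48, 8, 128, 32_768), 1))  # ~6.4 GB at 32K context
print(round(kv_cache_gb(48, 8, 128, 4_096), 1))   # ~0.8 GB at 4K context
```

On a shared-memory machine that extra ~5-6 GB comes straight out of the same budget as the weights, which is part of why a bigger quant and a bigger context fight each other.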

I can’t wait for RAM prices to come down!




