Just out of curiosity, what exact models and quantizations are you using? In my experience, anything smaller than ~32B is basically useless, and any quantization below Q8 absolutely trashes the model.
Sure, for a single use-case you could get by with a ~20B model if you fine-tune it for a very narrow task, but at that point there are usually better solutions than LLMs in the first place. For anything general, 32B+ at Q8 is probably the bare minimum for local models, even the "SOTA" ones available today.
I haven’t tried any Qwen yet, but so far I’m sticking with gpt-oss-20B.
In terms of what I’m using, I’ve looked at anything that will fit on a MacBook Pro with 32GB RAM (so with shared memory) - LFM2, Llama, Mistral, Ministral, Devstral, Phi, and Nemotron.
As for quantisation, I aim for the biggest that will fit while also not being too slow - so it all depends on the model. But I’ll skip a model if I can’t run at least a Q4_K_M.
I also bump my context to at least 32K, because tool calling falls apart when the tool definitions alone come close to the default 4096-token context!
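For anyone wondering what "fits in 32GB" actually means: a rough sanity check is parameter count times the quant's average bits per weight, plus whatever the KV cache and the OS need on top. The bits-per-weight figures below are approximate averages for common llama.cpp GGUF quants, not exact numbers:

```python
# Back-of-envelope memory estimate for a GGUF-quantized model.
# BITS_PER_WEIGHT holds rough average bits-per-weight for common
# llama.cpp quant types (approximate, varies per model).
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}

def est_model_gb(params_b: float, quant: str) -> float:
    """Approximate model size in GB for params_b billion parameters."""
    return params_b * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

# A ~32B model at Q8 is already over 32GB of shared memory by itself,
# before the OS, the app, and the KV cache take their share:
print(round(est_model_gb(32, "Q8_0"), 1))    # ~34.0 GB
print(round(est_model_gb(20, "Q4_K_M"), 1))  # ~12.0 GB
```

Which is basically why the 32B-at-Q8 floor from the parent comment and a 32GB MacBook don't mix - on this hardware you're choosing between a smaller model or a lower quant.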