
Many of the use cases you're describing, though not all, can be done offline, such as when the phone is charging overnight. An email autoresponder could still work in real time at these token rates, and it would still be faster than most humans at responding to an email.

7 hours * 3600 sec/hr * 21 tokens/sec ≈ 530,000 tokens per night on this hardware, assuming no thermal throttling. (I don't have data on the sustained rate; throttling could happen.)
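The overnight budget above can be sketched as a quick calculation. The `throttle` factor is hypothetical, there to show how a sustained-rate discount (which we have no data on) would scale the total:

```python
def overnight_tokens(hours=7, tokens_per_sec=21, throttle=1.0):
    """Back-of-the-envelope overnight token budget.

    21 tokens/sec is the rate quoted above; `throttle` is a
    hypothetical sustained-rate multiplier to account for
    possible thermal throttling.
    """
    return int(hours * 3600 * tokens_per_sec * throttle)

print(overnight_tokens())              # 529200, i.e. ~530k tokens
print(overnight_tokens(throttle=0.7))  # 370440, if throttling cut the rate 30%
```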



Agreed on overnight batch processing, although my vision is a sort of local service that provides "intelligence" on demand for other apps, which might request it concurrently. At that point, double-digit throughput might become limiting.

There are other reasons to want higher throughput. For retrieval or a chain-of-thought approach, you typically need to run several prompts per user prompt, which directly impacts the user-perceived performance of your LLM-based solution.
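The fan-out effect described above can be made concrete: if each user request triggers several internal prompts, the throughput the user effectively experiences is roughly the raw rate divided by the fan-out. The numbers here are illustrative, not measured:

```python
def effective_tokens_per_sec(raw_tps, prompts_per_request):
    """Rough user-perceived throughput when one user request
    fans out into several internal prompts (retrieval query
    rewriting, chain-of-thought steps, etc.).

    Ignores per-prompt overhead such as prompt-processing
    latency, so this is an optimistic estimate.
    """
    return raw_tps / prompts_per_request

# 21 tok/s raw, 4 internal prompts per user request:
print(effective_tokens_per_sec(21, 4))  # 5.25 tok/s as seen by the user
```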



