
Many of the use cases you're describing, though not all, can be done offline, such as when the phone is charging overnight. An email autoresponder could still work in real time at these token rates, and it would still be faster than most humans at responding to an email.

7 hours * 3600 sec/hr * 21 tokens/sec ≈ 530,000 tokens per night on this hardware, assuming no thermal throttling. (I don't have data on the sustained rate; throttling could happen.)
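The overnight budget above can be sketched as a quick calculation. The `throttle` factor is hypothetical, there to show how a sustained-rate discount (which we have no data on) would scale the total:

```python
def overnight_tokens(hours=7, tokens_per_sec=21, throttle=1.0):
    """Back-of-the-envelope overnight token budget.

    21 tokens/sec is the rate quoted above; `throttle` is a
    hypothetical sustained-rate multiplier to account for
    possible thermal throttling.
    """
    return int(hours * 3600 * tokens_per_sec * throttle)

print(overnight_tokens())              # 529200, i.e. ~530k tokens
print(overnight_tokens(throttle=0.7))  # 370440, if throttling cut the rate 30%
```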



Agreed on overnight batch processing, although my vision is a sort of local service that provides "intelligence" on demand for other apps, which might request it concurrently. At that point, double-digit throughput might become limiting.

There are other reasons to want higher throughput. For retrieval or a chain-of-thought approach, you typically need to run several prompts per user prompt, which directly impacts the user-perceived performance of your LLM-based solution.
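The fan-out effect described above can be made concrete: if each user request triggers several internal prompts, the throughput the user effectively experiences is roughly the raw rate divided by the fan-out. The numbers here are illustrative, not measured:

```python
def effective_tokens_per_sec(raw_tps, prompts_per_request):
    """Rough user-perceived throughput when one user request
    fans out into several internal prompts (retrieval query
    rewriting, chain-of-thought steps, etc.).

    Ignores per-prompt overhead such as prompt-processing
    latency, so this is an optimistic estimate.
    """
    return raw_tps / prompts_per_request

# 21 tok/s raw, 4 internal prompts per user request:
print(effective_tokens_per_sec(21, 4))  # 5.25 tok/s as seen by the user
```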



