Running 26B locally is impressive, but the latency math gets rough once you're doing anything beyond chat. We switched from local inference to API calls for image generation specifically because cold start plus generation time on consumer hardware made it impractical for any kind of automated workflow.
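To make that latency math concrete, here's a rough back-of-envelope sketch. Every number below is a hypothetical illustration, not a benchmark from our setup:

```python
# Hypothetical timings, purely illustrative -- not measurements.
LOCAL_COLD_START_S = 90   # loading the model into VRAM on consumer hardware
LOCAL_GEN_S = 25          # per-image generation time, local
API_CALL_S = 8            # per-image round trip to a hosted endpoint

def batch_latency(n_images: int) -> tuple[int, int]:
    """Total seconds for n images: (local with one cold start, hosted API)."""
    local = LOCAL_COLD_START_S + n_images * LOCAL_GEN_S
    api = n_images * API_CALL_S
    return local, api

local, api = batch_latency(4)
print(f"local: {local}s, api: {api}s")  # prints "local: 190s, api: 32s"
```

The cold start is the killer for scheduled jobs: even if per-image times were comparable, paying it on every cron invocation dominates small batches.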
Local is great for experimentation, but production workloads that need to run reliably at specific times still favor API imo. That said, for privacy-sensitive use cases where data can't leave the machine, setups like this are invaluable.