Case Study · Confidential enterprise client (regulated industry)
How we replaced a $4.2K/month OpenAI bill with a fully on-device LLM workflow using Ollama, llama.cpp, and a FastAPI orchestration layer — keeping 100% of customer data on the user's laptop while delivering sub-second responses.

An enterprise client in a regulated industry was using a hosted LLM through their internal tooling, but every quarter their security review team flagged the same blocker: customer-identifiable text was being sent to a US cloud endpoint. Legal had paused the wider rollout, the OpenAI bill had drifted past $4,200/month with only a fraction of the planned user base, and the product team had been asked the impossible question: "can we keep all of this AI capability without any data leaving the user's laptop?"
They didn't need a research project. They needed a shippable workflow that an internal employee could install in an afternoon, that ran entirely on their existing M-series MacBooks and Windows ThinkPads, and that performed well enough that nobody would resent the privacy upgrade.
We designed and shipped a fully on-device LLM stack: Ollama as the model runtime, llama.cpp under the hood for quantized GPU/CPU inference, a small FastAPI orchestration layer for tool-use and retrieval, and a Next.js desktop UI shipped via a thin Tauri wrapper. Every byte of context stays on the device. There is no remote inference, no telemetry callback, no proxy.
We benchmarked seven open-weight models (Llama 3.1 8B, Qwen 2.5 7B/14B, Phi-3, Mistral, and two domain-tuned variants) across the client's real prompts and shipped a per-task model router: a small fast model for classification and chat, a larger reasoning model on demand. RAG runs against a local ChromaDB index built from the user's own documents, with every embedding computed on-device.
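A per-task router of this shape is a small lookup in front of Ollama's local HTTP API. The sketch below is illustrative, not the client's code: the task labels and model tags are assumptions (standard Ollama registry tags for the models named above), while the endpoint is Ollama's documented local default.

```python
import json
import urllib.request

# Illustrative task -> model mapping; actual tags depend on what is pulled locally.
ROUTES = {
    "classify": "phi3:mini",    # small and fast for classification
    "chat":     "llama3.1:8b",  # general-purpose default
    "reason":   "qwen2.5:14b",  # larger model, loaded on demand
}

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # Ollama's default local endpoint

def pick_model(task: str) -> str:
    """Map a task label to a local model tag, falling back to the chat default."""
    return ROUTES.get(task, ROUTES["chat"])

def generate(task: str, prompt: str) -> str:
    """Send the prompt to the routed model; every byte stays on loopback."""
    body = json.dumps({"model": pick_model(task), "prompt": prompt, "stream": False})
    req = urllib.request.Request(OLLAMA_URL, data=body.encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running local Ollama daemon with the models pulled):
#   generate("classify", "Route this ticket: my invoice total is wrong.")
```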
The result is a workflow that's measurably faster than the cloud version on the median prompt (no network round-trip), passes the client's security review on first submission, and scales to every laptop in the company at zero per-seat inference cost.
We collected a representative set of ~400 real prompts from the existing OpenAI logs (sanitized), then benchmarked candidate open-weight models on the client's actual hardware mix — M1, M2, and M3 MacBooks plus a Windows ThinkPad reference machine — measuring tokens-per-second, p95 first-token latency, and quality vs. the GPT-4 baseline using a rubric the client owned.
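The latency side of such a harness reduces to two numbers per prompt: throughput and first-token latency. A minimal stdlib sketch of the metric computation (the per-prompt measurements below are invented for illustration):

```python
import math
import statistics

def tokens_per_second(token_count: int, total_seconds: float) -> float:
    """Throughput for a single generation run."""
    return token_count / total_seconds

def p95(samples: list[float]) -> float:
    """95th percentile of a latency sample, nearest-rank method."""
    ordered = sorted(samples)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

# Hypothetical per-prompt measurements: (tokens generated, wall-clock s, first-token s)
runs = [(212, 4.1, 0.38), (180, 3.6, 0.41), (305, 6.2, 0.35), (150, 2.9, 0.52)]
print(f"median tok/s: {statistics.median(tokens_per_second(n, t) for n, t, _ in runs):.1f}")
print(f"p95 first-token latency: {p95([ft for _, _, ft in runs]):.2f}s")
```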
We picked Ollama as the runtime (clean lifecycle, model versioning, GPU-aware quantization), then built a thin FastAPI orchestration layer that routes each task to the right model, handles retrieval against a local ChromaDB index, and exposes a stable HTTP API the desktop UI can call. The whole stack runs as three local processes managed by the desktop app.
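On the retrieval side, building a local index of this kind typically starts by splitting each document into overlapping windows before computing on-device embeddings. A stdlib sketch of that step (the window sizes are illustrative defaults, not the client's settings):

```python
def chunk(text: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Split text into overlapping character windows for local embedding."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]

doc = "Customer contract, section 1. " * 60   # ~1800 characters of sample text
pieces = chunk(doc)
# Consecutive chunks share `overlap` characters, so retrieval never loses the
# sentence boundary a chunk split would otherwise cut.
```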
Engineering happened in two-week sprints with a real evaluation harness — every change was scored against a held-out prompt set so quality regressions surfaced immediately. We added a one-click ingest flow for the user's own documents, a model-update channel that respects the user's network policy, and a privacy panel that shows exactly what the model has access to.
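The promotion logic behind such a harness can be as simple as a mean-score gate over the held-out set. A hedged sketch (the threshold and scores are hypothetical, not the client's rubric):

```python
from statistics import mean

QUALITY_BAR = 0.85  # hypothetical promotion threshold on a 0-1 rubric scale

def should_promote(candidate_scores: list[float], bar: float = QUALITY_BAR) -> bool:
    """Promote a model or prompt change only if its mean held-out score clears the bar."""
    return bool(candidate_scores) and mean(candidate_scores) >= bar

# Hypothetical rubric scores from the held-out prompt set
print(should_promote([0.90, 0.88, 0.91, 0.79, 0.93]))  # prints True
```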
We packaged the stack as a signed installer (notarized on macOS, signed on Windows), wrote a short threat model the security team could read in 20 minutes, and shipped to a 25-user pilot before company-wide rollout. The security review passed on first submission; the killer feature was a verifiable network allowlist that proves no LLM traffic ever leaves the device.
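One half of making an allowlist like that verifiable is a check the orchestration layer can run against its own configuration: every inference endpoint must resolve to loopback. A simplified stdlib sketch (the endpoint URLs are examples; a real deployment also enforces this at the OS firewall layer):

```python
import ipaddress
from urllib.parse import urlparse

def is_local(url: str) -> bool:
    """True only if the endpoint host is loopback, i.e. traffic stays on-device."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # any non-loopback hostname fails the check

# Example endpoints for the local stack (ports are illustrative)
endpoints = ["http://127.0.0.1:11434/api/generate",  # Ollama's default port
             "http://localhost:8000/rag"]            # FastAPI orchestration layer
assert all(is_local(u) for u in endpoints), "non-local LLM endpoint configured"
```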

“We thought private AI meant a worse product. The UnlockLive build is faster than the cloud version on most of what our team does, and our security review took 20 minutes instead of three months.”
Can open-weight models really match the hosted frontier models?
For most enterprise tasks — classification, summarization, RAG over the user's own documents, structured extraction — yes, a well-chosen open-weight model in the 7B-14B range matches or beats GPT-3.5 and gets within 10-15% of GPT-4 on quality. The trick is honest evaluation: we always score candidate models on the client's real prompts, not generic benchmarks.

Which models do you recommend?
We default to Llama 3.1 8B for general chat and Qwen 2.5 14B for reasoning-heavy tasks, with Phi-3 mini for ultra-fast classification. Final choice depends on the user's hardware (M2/M3 Macs handle 14B comfortably; older Intel laptops do better with 7B quantized) and the workload mix.

Is Ollama mature enough for enterprise deployment?
Ollama itself is permissively licensed and has a stable model lifecycle, GPU-aware quantization, and a clean HTTP API. We pair it with a thin FastAPI orchestration layer we own, a signed/notarized installer, and a verifiable network allowlist — that combination has passed multiple enterprise security reviews on first submission.

How do you keep model updates from degrading quality?
We ship a per-tenant evaluation harness with the deployment. Every model upgrade is scored against a held-out prompt set the client owns, and only promoted if it meets a quality bar. Model updates roll out through a canary channel the user controls.

When does on-device beat a hosted API on cost?
Typical breakeven is 6-9 months. The build is a one-time engineering cost (8-14 weeks for a focused workflow), then per-seat inference cost drops to zero. Hosted APIs win for spiky, low-volume use; on-device wins for daily-active enterprise tools.
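The breakeven arithmetic is a one-liner: months to recover the build equals the one-time cost divided by the retired monthly bill. The $4,200/month figure comes from this case study; the build cost below is a hypothetical placeholder chosen to land inside the 6-9 month range:

```python
def breakeven_months(build_cost: float, monthly_api_bill: float) -> float:
    """Months until the one-time build cost is recovered by the retired API bill."""
    return build_cost / monthly_api_bill

# $4,200/month is the bill from this case study; $30,000 is a placeholder build cost.
print(round(breakeven_months(30_000, 4_200), 1))  # prints 7.1
```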
Talk to the same team that built “Private, On-Device LLM Deployment for a Privacy-Sensitive Enterprise (Ollama + llama.cpp)”. We’ll scope your project, give you a fixed-price proposal, and show you the closest analog from our portfolio.
Book a strategy call