The challenge
A North American university integrated 30+ academic and administrative systems with Canvas LMS through a nightly batch sync that had grown into a fragile 6-hour job. Whenever Canvas added a feature, the job broke; whenever enrollment surged, the job hit Canvas API rate limits and failed silently halfway through; whenever a downstream system needed fresher data, the answer was "tomorrow morning, maybe."
The central IT team had three concrete pains. First, instructors were complaining that gradebook changes took up to 24 hours to reach the analytics dashboards. Second, the sync job consumed so much of the API quota that other integrations were being rate-limited by Canvas. Third, the job had no recovery story — if it failed at 03:14, an engineer had to wake up, find where it died, and replay it manually. The team needed an integration that was real-time enough for instructors, gentle enough on Canvas to live within shared rate limits, and observable enough that an on-call engineer could trust it.
Our solution
We replaced the nightly job with an event-driven Canvas LMS integration on Python and FastAPI built around three primitives: a Canvas Live Events consumer, a Canvas API client with quota-aware concurrency control, and an idempotent change-event bus the downstream systems subscribe to.
Canvas Live Events flow into a FastAPI webhook receiver, where they are verified, persisted to an inbox table, and then processed by a Celery worker that fans them out to the appropriate downstream handlers. For state Canvas doesn't push (course content, deep enrollment data, large rosters), a polling layer picks GraphQL where it takes fewer requests and REST where it doesn't. Every call goes through a single Canvas client that reads the `X-Rate-Limit-Remaining` header and slows down before Canvas starts throttling.
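The verify-then-persist step can be sketched with the standard library alone. This is a minimal illustration, not the production receiver: the signing scheme (HMAC-SHA256 over the raw body with a shared secret), the `id` and `name` event fields, the `inbox` table layout, and the use of SQLite are all assumptions made for the example.

```python
import hashlib
import hmac
import json
import sqlite3

SHARED_SECRET = b"example-secret"  # hypothetical; load from a secret store in practice


def sign(body: bytes, secret: bytes = SHARED_SECRET) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed signing scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(body: bytes, signature: str, secret: bytes = SHARED_SECRET) -> bool:
    # compare_digest runs in constant time, so it does not leak how much matched.
    expected = sign(body, secret).encode("utf-8")
    return hmac.compare_digest(expected, signature.encode("utf-8"))


def open_inbox(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # event_id is the primary key, so a redelivered event cannot be stored twice.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS inbox (
               event_id   TEXT PRIMARY KEY,
               event_name TEXT NOT NULL,
               payload    TEXT NOT NULL,
               processed  INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def receive(conn: sqlite3.Connection, body: bytes, signature: str) -> str:
    """Return 'rejected', 'stored', or 'duplicate' for one delivery."""
    if not verify(body, signature):
        return "rejected"
    event = json.loads(body)
    cur = conn.execute(
        "INSERT OR IGNORE INTO inbox (event_id, event_name, payload) VALUES (?, ?, ?)",
        (event["id"], event["name"], body.decode("utf-8")),
    )
    conn.commit()
    # rowcount is 0 when the insert was ignored because the event_id already exists.
    return "stored" if cur.rowcount == 1 else "duplicate"
```

Persisting before any processing means the receiver can acknowledge quickly, and a worker can pick up rows where `processed = 0` at its own pace.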
The downstream systems no longer talk to Canvas directly. They subscribe to our internal change-event bus, which is idempotent and ordered per entity (per student, per course). That single architectural choice killed the duplicate-call problem: 30+ systems used to ask Canvas the same question every night; now they all consume one normalized event. Daily Canvas API call volume settled at around 50K, well inside the university's quota with predictable headroom, while the propagation latency for a gradebook change dropped from up to 24 hours to under 90 seconds.
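To make "idempotent and ordered per entity" concrete, here is a small in-memory sketch. It is an illustration under assumptions, not the bus described above: the `ChangeEvent` fields, the `entity_key` format, and the per-entity sequence numbers starting at 1 are all hypothetical, and a real bus would keep this state in durable storage.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ChangeEvent:
    event_id: str    # unique per change; used for deduplication
    entity_key: str  # e.g. "student:42" or "course:7" (hypothetical format)
    sequence: int    # per-entity sequence number, assumed to start at 1
    payload: dict


class ChangeEventBus:
    def __init__(self) -> None:
        self._subscribers: list[Callable[[ChangeEvent], None]] = []
        self._seen: set[str] = set()
        self._next_seq: dict[str, int] = defaultdict(lambda: 1)
        # Events that arrived early, held until their predecessors show up.
        self._held: dict[str, dict[int, ChangeEvent]] = defaultdict(dict)

    def subscribe(self, handler: Callable[[ChangeEvent], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: ChangeEvent) -> int:
        """Accept an event; return how many events were delivered as a result."""
        if event.event_id in self._seen:
            return 0  # idempotent: a repeated event_id is ignored
        self._seen.add(event.event_id)
        key = event.entity_key
        if event.sequence < self._next_seq[key]:
            return 0  # stale: this position was already delivered
        self._held[key][event.sequence] = event
        delivered = 0
        # Release events strictly in sequence order for this entity only.
        while self._next_seq[key] in self._held[key]:
            ready = self._held[key].pop(self._next_seq[key])
            for handler in self._subscribers:
                handler(ready)
            self._next_seq[key] += 1
            delivered += 1
        return delivered
```

Ordering is enforced per entity rather than globally, so a delayed event for one student never blocks delivery for another student or course.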
- Canvas Live Events consumer with signed-payload verification and inbox persistence
- Quota-aware Canvas API client honoring X-Rate-Limit-Remaining and adaptive concurrency
- Mixed REST + GraphQL strategy — GraphQL where it cuts request volume
- OAuth2 token lifecycle with automatic rotation and refresh-on-401 fallback
- Idempotent, per-entity-ordered change-event bus that downstream systems subscribe to
- Built-in replay tool — re-process any window of Canvas events without coordination
- Pagination optimization: bookmark-style cursors in place of deep-offset page requests
- Datadog dashboards for end-to-end latency, Canvas quota headroom, per-system lag
- PagerDuty alerts on quota burn rate, event backlog, and OAuth token health
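
The quota-aware client in the list above reads `X-Rate-Limit-Remaining` and backs off before Canvas throttles it. One simple way to do that is a delay that grows linearly as the remaining quota falls below a threshold. The sketch below shows that idea only; it uses a delay rather than adaptive concurrency, and the `QuotaThrottle` name, the `low_water` threshold, and the `max_delay` cap are assumptions chosen for illustration, not values from the project.

```python
import time
from typing import Callable, Mapping, Optional


class QuotaThrottle:
    """Delay requests as the reported remaining quota drops (illustrative only)."""

    def __init__(
        self,
        low_water: float = 200.0,  # assumed threshold below which we slow down
        max_delay: float = 2.0,    # assumed cap on the delay, in seconds
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.low_water = low_water
        self.max_delay = max_delay
        self._sleep = sleep
        self._remaining: Optional[float] = None

    def observe(self, headers: Mapping[str, str]) -> None:
        """Record the quota from a response; pass a case-insensitive mapping if possible."""
        raw = headers.get("X-Rate-Limit-Remaining")
        if raw is None:
            return
        try:
            self._remaining = float(raw)  # Canvas reports a decimal value
        except ValueError:
            pass  # keep the last known value if the header is malformed

    def delay(self) -> float:
        """Seconds to wait before the next request: 0 when quota is healthy."""
        if self._remaining is None or self._remaining >= self.low_water:
            return 0.0
        deficit = (self.low_water - max(self._remaining, 0.0)) / self.low_water
        return self.max_delay * deficit

    def wait(self) -> float:
        pause = self.delay()
        if pause > 0:
            self._sleep(pause)
        return pause
```

A client would call `wait()` before each request and `observe(response.headers)` after it, so every worker sharing the throttle slows down together as headroom shrinks.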