Engineering · Jonas K. · 12 min

Sub-100 ms PMS sync: anatomy of the real-time registry.

A room state that's wrong for thirty seconds is an agent sent into an occupied room. Latency isn't an engineer's vanity here: it's the difference between a correct plan and a guest-facing incident.

The problem: room-state drift

Every PMS — Hostaway, Beds24, Smoobu, Cloudbeds, Mews — holds its own truth about arrivals, departures and extensions. If Hostik only learns of a late checkout at the next poll, the cleaning assignment runs on stale data. The visible symptom (an agent at a door that won't open) has an invisible cause: a few seconds of divergence between two systems.

So the goal wasn't "sync fast for the elegance of it," but to shrink the window during which our two worlds diverge enough that a human acts on something false.

One PMS port, several transports

We didn't want a Hostaway connector, a Beds24 connector, a Smoobu connector — each with its own retry logic and its own bugs. We defined a single port (a contract: `fetchEvents`, `drainPages`, token handling) and as many thin adapters as there are providers. The registry core doesn't know which PMS is speaking; it only sees normalized events.

Concrete benefit: an incoming webhook and a paginated backfill travel the exact same code path once normalized. The day we add a provider, we write a transport adapter, not a second half of the application.
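
To make the shape of that contract concrete, here is a minimal sketch in Python. It is not our production code: the method names are Python spellings of `fetchEvents` and `drainPages`, while `parse_webhook`, the `RoomEvent` fields, the `ExampleAdapter` provider and its payload format are all invented for illustration. Token handling is left out of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Iterator, Protocol


@dataclass(frozen=True)
class RoomEvent:
    """Normalized event: the only shape the registry core ever sees."""
    key: str          # stable event key (hypothetical field name)
    room_id: str
    status: str       # e.g. "occupied" or "vacant"
    occurred_at: int  # provider timestamp, assumed to be epoch milliseconds


class PmsPort(Protocol):
    """The single contract that every provider adapter implements."""

    def parse_webhook(self, payload: dict) -> list[RoomEvent]: ...  # hypothetical
    def fetch_events(self, since: int) -> list[RoomEvent]: ...
    def drain_pages(self) -> Iterator[RoomEvent]: ...


class ExampleAdapter:
    """Thin adapter for a made-up provider; the raw payload shape is invented."""

    def __init__(self, pages: list[list[dict]]) -> None:
        self._pages = pages

    def _normalize(self, raw: dict) -> RoomEvent:
        return RoomEvent(raw["id"], raw["room"], raw["state"], raw["ts"])

    def parse_webhook(self, payload: dict) -> list[RoomEvent]:
        return [self._normalize(payload)]

    def fetch_events(self, since: int) -> list[RoomEvent]:
        return [e for e in self.drain_pages() if e.occurred_at > since]

    def drain_pages(self) -> Iterator[RoomEvent]:
        for page in self._pages:
            for raw in page:
                yield self._normalize(raw)


class Registry:
    """Registry core: it never learns which provider produced an event."""

    def __init__(self) -> None:
        self.log: list[RoomEvent] = []

    def ingest(self, events: Iterable[RoomEvent]) -> None:
        # Single entry point: webhook and backfill both end up here.
        self.log.extend(events)


adapter: PmsPort = ExampleAdapter(
    pages=[[{"id": "e1", "room": "101", "state": "occupied", "ts": 100}]]
)
registry = Registry()
registry.ingest(adapter.drain_pages())  # paginated backfill
registry.ingest(  # incoming webhook
    adapter.parse_webhook({"id": "e2", "room": "101", "state": "vacant", "ts": 200})
)
print([event.key for event in registry.log])  # ['e1', 'e2']
```

In this sketch a new provider means one more class shaped like `ExampleAdapter`; `Registry` does not change.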

Idempotency before speed

PMS webhooks arrive twice, out of order, and sometimes after a backfill that already contains the same event. Going fast on a non-idempotent stream is just being wrong faster. Every event carries a stable key; an already-applied event is ignored, and an event older than current state never overwrites it.
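
The two rules in that paragraph fit in a few lines. The sketch below is an illustration under assumptions, not the registry itself: it supposes each event exposes a stable `key` and a provider timestamp `occurred_at`, both hypothetical field names, and that "older" means a strictly smaller timestamp.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class RoomEvent:
    key: str          # stable event key (hypothetical field name)
    room_id: str
    status: str
    occurred_at: int  # provider timestamp (assumed to be comparable)


@dataclass
class RoomRegistry:
    state: dict[str, RoomEvent] = field(default_factory=dict)
    applied: set[str] = field(default_factory=set)

    def apply(self, event: RoomEvent) -> bool:
        """Return True only if the event changed the registry."""
        if event.key in self.applied:
            return False  # duplicate delivery: ignore it
        self.applied.add(event.key)
        current = self.state.get(event.room_id)
        if current is not None and event.occurred_at < current.occurred_at:
            return False  # older than current state: never overwrite
        self.state[event.room_id] = event
        return True


registry = RoomRegistry()
checkout = RoomEvent("evt-2", "101", "vacant", occurred_at=200)
checkin = RoomEvent("evt-1", "101", "occupied", occurred_at=100)

# Deliveries arrive out of order, and the checkout arrives twice.
results = [registry.apply(checkout), registry.apply(checkout), registry.apply(checkin)]
print(results)                       # [True, False, False]
print(registry.state["101"].status)  # vacant
```

Two events with the same timestamp would need a tie-break rule, which this sketch does not define.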

Only once that property was guaranteed could we afford to be aggressive about propagation: a validated change is pushed to connected clients immediately, because we know a duplicate will break nothing.
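
As a rough illustration of why the guard makes immediate propagation safe, the sketch below notifies subscribers only when an event actually changes the state. The `LiveRegistry` class, its `subscribe` and `ingest` methods and the listener signature are hypothetical names, and a plain in-process callback stands in for whatever transport reaches connected clients.

```python
from __future__ import annotations

from typing import Callable

Listener = Callable[[str, str], None]  # receives (room_id, status)


class LiveRegistry:
    def __init__(self) -> None:
        self._status: dict[str, tuple[int, str]] = {}  # room_id -> (occurred_at, status)
        self._applied: set[str] = set()
        self._listeners: list[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def ingest(self, key: str, room_id: str, status: str, occurred_at: int) -> None:
        if key in self._applied:
            return  # duplicate: nothing to push
        self._applied.add(key)
        current = self._status.get(room_id)
        if current is not None and occurred_at < current[0]:
            return  # stale: clients already hold newer state
        self._status[room_id] = (occurred_at, status)
        for listener in self._listeners:
            listener(room_id, status)  # push right away, no batching


pushed: list[tuple[str, str]] = []
live = LiveRegistry()
live.subscribe(lambda room_id, status: pushed.append((room_id, status)))

live.ingest("evt-7", "204", "vacant", occurred_at=500)
live.ingest("evt-7", "204", "vacant", occurred_at=500)    # webhook redelivery
live.ingest("evt-3", "204", "occupied", occurred_at=400)  # late, older event
print(pushed)  # [('204', 'vacant')]
```

Clients see one push for three deliveries, so the propagation step needs no deduplication of its own.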

The traps we paid for once

The backfill that overwrites real time: on first connecting a property, the historical pagination can land after newer live events. The rule "never apply an older state" fixes this — but it took an incident to write it down.
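
One way to pin that rule down is to check that the final state does not depend on arrival order. The sketch below replays a small, made-up set of backfill and live events in every possible order. The tuple layout is hypothetical, and breaking timestamp ties on the event key is an assumption added so the result is deterministic.

```python
from __future__ import annotations

from itertools import permutations

# (key, room_id, status, occurred_at): hypothetical event layout
Event = tuple[str, str, str, int]


def replay(events: tuple[Event, ...]) -> dict[str, str]:
    """Apply events in the given order and return the status per room."""
    state: dict[str, tuple[int, str, str]] = {}  # room_id -> (occurred_at, key, status)
    applied: set[str] = set()
    for key, room_id, status, occurred_at in events:
        if key in applied:
            continue  # already applied
        applied.add(key)
        current = state.get(room_id)
        if current is not None and (occurred_at, key) < current[:2]:
            continue  # never apply an older state
        state[room_id] = (occurred_at, key, status)
    return {room_id: entry[2] for room_id, entry in state.items()}


backfill: list[Event] = [
    ("b1", "101", "occupied", 100),
    ("b2", "101", "vacant", 200),
    ("b2", "101", "vacant", 200),  # the same event delivered twice
]
live: list[Event] = [("l1", "101", "occupied", 300)]

outcomes = {
    tuple(sorted(replay(order).items())) for order in permutations(backfill + live)
}
print(outcomes)  # {(('101', 'occupied'),)}
```

Every ordering converges on the live event, including the ones where the whole backfill lands after it.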

Tokens that expire in silence: an adapter whose token has lapsed fails politely and leaves the state frozen. We centralized token handling in the port rather than in each adapter, so failure is visible and repairable in one place.
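
Here is a sketch of what centralizing could look like, assuming a refresh callable that returns a token and its lifetime in seconds. `TokenManager`, `TokenRefreshError` and the `healthy` flag are hypothetical names, not the port's actual interface.

```python
from __future__ import annotations

import time
from typing import Callable, Optional


class TokenRefreshError(Exception):
    """Raised instead of letting an adapter fail quietly on a lapsed token."""


class TokenManager:
    def __init__(
        self,
        refresh: Callable[[], tuple[str, float]],  # returns (token, ttl_seconds)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._refresh = refresh
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0
        self.last_error: Optional[Exception] = None

    @property
    def healthy(self) -> bool:
        return self.last_error is None

    def token(self) -> str:
        """Return a valid token, refreshing it when missing or expired."""
        if self._token is None or self._clock() >= self._expires_at:
            try:
                self._token, ttl = self._refresh()
            except Exception as exc:
                self.last_error = exc  # keep the failure observable
                raise TokenRefreshError("token refresh failed") from exc
            self._expires_at = self._clock() + ttl
            self.last_error = None
        return self._token


manager = TokenManager(refresh=lambda: ("example-token", 3600.0))
print(manager.token(), manager.healthy)  # example-token True
```

Every adapter asks the same manager for its token, so a lapsed credential raises in one place and `healthy` can feed an alert instead of leaving the state silently frozen.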

The regressions we avoided

We refused to expose latency as a marketing promise before we had a test that measured it end to end. "Under 100 ms" isn't a slogan: it's the target the webhook → normalization → propagation path must hold, and a regression fails the build rather than production.
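
The shape of such a guard might look like the sketch below. It is an in-process stand-in only: `normalize`, `propagate` and the payload are invented, and a real end-to-end test would presumably also cover the network hop and a connected client.

```python
from __future__ import annotations

import time

BUDGET_SECONDS = 0.100  # the "under 100 ms" target


def normalize(payload: dict) -> dict:
    # Stand-in for adapter normalization; the payload shape is invented.
    return {"key": payload["id"], "room_id": payload["room"], "status": payload["state"]}


def propagate(event: dict, sink: list[dict]) -> None:
    # Stand-in for pushing to connected clients.
    sink.append(event)


def handle_webhook(payload: dict, sink: list[dict]) -> None:
    propagate(normalize(payload), sink)


def measure_worst_latency(runs: int = 200) -> float:
    """Return the slowest webhook -> normalization -> propagation run, in seconds."""
    worst = 0.0
    for i in range(runs):
        sink: list[dict] = []
        start = time.perf_counter()
        handle_webhook({"id": f"evt-{i}", "room": "101", "state": "vacant"}, sink)
        elapsed = time.perf_counter() - start
        assert len(sink) == 1  # the event reached the client side
        worst = max(worst, elapsed)
    return worst


def test_webhook_path_stays_under_budget() -> None:
    # A regression past the budget fails here, in the build.
    assert measure_worst_latency() < BUDGET_SECONDS


test_webhook_path_stays_under_budget()
```

Asserting on the worst run rather than the average is a deliberate choice in this sketch, since the slow outlier is the case where someone acts on stale state.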

The real architectural lesson isn't in the millisecond saved. It's in the order of priorities: one contract first, idempotency next, speed last. Reverse the order and you get a fast system that lies.