The silent killer in tracking systems
Carrier events don’t arrive exactly once. They get retried on network blips. Webhooks redeliver. Providers replay days or weeks of history during outages or migrations.
If your processor isn’t idempotent, one replay can duplicate milestones, trigger duplicate notifications, flip statuses back and forth, and quietly erode trust in the entire tracking dataset.
I owned making sure that never happened.
The approach I took
I designed idempotency into the event ingestion pipeline at multiple levels instead of treating it as an afterthought.
The foundation was a composite idempotency key built from provider ID, shipment reference, event type, and normalized timestamp (with tolerance for minor clock skew). Every incoming event was checked against this key before any business logic ran.
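As a minimal sketch of that composite key, assuming hypothetical field names and a hypothetical skew tolerance (the real key format and bucket size aren't specified in this write-up), timestamp normalization can be done by bucketing event times so small clock differences collapse to the same key:

```python
from datetime import datetime, timezone

# Assumed tolerance window for minor clock skew; the real value would be
# tuned to observed provider behavior.
SKEW_BUCKET_SECONDS = 120

def idempotency_key(provider_id: str, shipment_ref: str,
                    event_type: str, event_time: datetime) -> str:
    """Build a composite idempotency key. Timestamps are normalized to UTC
    and bucketed, so events differing only by small clock skew map to the
    same key. (A production version might also check adjacent buckets to
    catch skew that straddles a boundary.)"""
    ts = event_time.astimezone(timezone.utc)
    bucket = int(ts.timestamp()) // SKEW_BUCKET_SECONDS
    return f"{provider_id}:{shipment_ref}:{event_type}:{bucket}"
```

The check-before-processing step then reduces to a single lookup of this string against the store of seen keys.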
For safety, I added payload hashing on top of the key. Even when providers changed identifiers or slightly altered payloads during replays, identical content would still be recognized and skipped.
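One way to make that content-level check concrete is to hash a canonical form of the payload, after stripping fields that providers rewrite on every delivery. This is an illustrative sketch; the volatile field names are assumptions, not the actual provider schema:

```python
import hashlib
import json

# Hypothetical fields that change on every redelivery and must not
# affect the content fingerprint.
VOLATILE_FIELDS = {"delivery_id", "received_at"}

def payload_fingerprint(payload: dict) -> str:
    """Hash a canonicalized payload so replays with reordered keys or
    regenerated delivery identifiers still match on content."""
    canonical = {k: v for k, v in payload.items() if k not in VOLATILE_FIELDS}
    # sort_keys + compact separators give a stable byte representation
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Storing this fingerprint alongside the composite key means a replay is caught even when the key fields themselves were altered.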
I stored seen keys in a fast, indexed store with a TTL tuned to real observed replay windows (sometimes days, occasionally longer for major backfills). Writes used atomic “insert-if-not-exists” patterns so race conditions couldn’t create duplicates.
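The "insert-if-not-exists" pattern can be sketched with an in-memory stand-in for the keyed store (in practice this would be something like a Redis `SET key value NX EX ttl` or a SQL `INSERT ... ON CONFLICT DO NOTHING`; the class below is a simplified illustration, not the production store):

```python
import threading
import time

class SeenKeys:
    """In-memory stand-in for a fast TTL'd key store. mark_if_new is
    atomic under the lock: exactly one caller wins per key per window."""

    def __init__(self, ttl_seconds: float):
        self._ttl = ttl_seconds
        self._keys: dict[str, float] = {}  # key -> expiry (monotonic time)
        self._lock = threading.Lock()

    def mark_if_new(self, key: str) -> bool:
        now = time.monotonic()
        with self._lock:
            expiry = self._keys.get(key)
            if expiry is not None and expiry > now:
                return False   # duplicate within the TTL window: skip it
            self._keys[key] = now + self._ttl
            return True        # first delivery: caller may run business logic
```

The important property is that the check and the write happen as one atomic step, so two concurrent deliveries of the same event can't both pass the check.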
Where possible, I also made the downstream logic itself idempotent: “ensure this milestone exists” instead of “add another milestone,” “set status to X” instead of “increment status,” etc. That way, even if deduplication missed something, the damage was contained.
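The difference between those two styles of write can be shown in a few lines, using hypothetical in-memory structures in place of the real database:

```python
def ensure_milestone(timelines: dict, shipment_ref: str, milestone: str) -> None:
    """Idempotent write: 'ensure this milestone exists'. Re-applying the
    same milestone is a no-op because sets collapse duplicates."""
    timelines.setdefault(shipment_ref, set()).add(milestone)

def set_status(statuses: dict, shipment_ref: str, status: str) -> None:
    """Absolute assignment ('set status to X'), not an increment or
    toggle, so replaying the same event leaves the state unchanged."""
    statuses[shipment_ref] = status
```

In SQL terms this is the same idea as an upsert keyed on `(shipment_ref, milestone)` versus a plain `INSERT`: the write converges to the same state no matter how many times it runs.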
Finally, I added rich observability. We could see deduplication rates spike during a provider replay and know the system was handling it correctly instead of panicking.
How I proved it worked
I ran deliberate replay tests using both synthetic duplicates and real historical event streams from carriers. Only the first delivery caused state changes or side effects.
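The shape of such a replay test can be sketched as follows; the event fields and dedup key are illustrative stand-ins for the real pipeline, and the point is only the assertion at the end, that a doubled stream produces the same side effects as a single pass:

```python
def process_events(events: list[dict]) -> list[str]:
    """Toy processor: dedup check in front, side effects recorded so a
    test can assert they fired exactly once per unique event."""
    seen: set[str] = set()
    side_effects: list[str] = []
    for ev in events:
        key = f"{ev['provider']}:{ev['ref']}:{ev['type']}"
        if key in seen:
            continue                 # replayed delivery: skipped silently
        seen.add(key)
        side_effects.append(key)     # stand-in for notifications, writes, etc.
    return side_effects

# Simulate a provider replaying the same slice of history twice.
history = [
    {"provider": "dhl", "ref": "S1", "type": "PICKUP"},
    {"provider": "dhl", "ref": "S1", "type": "IN_TRANSIT"},
]
assert process_events(history + history) == process_events(history)
```

The same harness works with recorded historical streams: feed them through once, then again, and assert the resulting state and side-effect log are identical.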
In production, we monitored:
- Deduplication hit rate (how often replays were caught)
- Duplicate milestone counts in timelines (dropped to near zero)
- Processing latency impact (kept minimal)
The most satisfying validation was watching a major carrier replay several days of events during one of their incidents — our timelines stayed perfectly clean while other teams dealt with chaos.
Outcomes
- Shipment timelines became trustworthy again. Operations stopped seeing phantom duplicates and questioning the data.
- Provider replays and retries turned from data-quality incidents into non-events.
- New integrations inherited idempotency by default, speeding up future carrier onboarding.
- The system became fundamentally more resilient to the messiness of real-world external data sources.
Tradeoffs and lessons
Tradeoffs
Every lookup added a small amount of latency, and we paid for storage of the idempotency keys. I tuned the TTL aggressively based on actual replay patterns to keep costs reasonable. We also accepted that 100% perfect deduplication forever isn’t realistic at scale — the goal was “good enough that it never hurts the business.”
Lessons I took away
- Idempotency is a core architectural property, not a bolt-on feature. Design for at-least-once from day one.
- Composite keys plus payload hashing give you the best coverage with acceptable performance.
- Observability is what turns a “black box that sometimes duplicates things” into a system you can trust and improve.
What I’d do next
- Provider-specific strategies (some now send explicit sequence numbers or replay headers we could leverage)
- Per-provider deduplication effectiveness dashboards
- Continuous automated replay testing against production-like event streams
- Tighter integration with exactly-once patterns for the highest-value events
If you’re dealing with flaky external event sources and need tracking data that stays correct no matter how many times providers retry or replay, let’s talk.