Reliability · 3 min read

Retry, Backoff & Fallback That Won’t Create Duplicates

Production retry patterns for logistics APIs: idempotent operations, exponential backoff with jitter, payload hashing, circuit breakers, and safe fallbacks.

The Real Risk With Retries in Logistics

Retries are table stakes for any system talking to carrier APIs, tracking providers, or external rate engines. But in logistics, a single duplicate can mean two bookings for the same load, double-charged accessorials, or conflicting inventory counts. I’ve seen the mess that creates.

Over multiple projects I rebuilt retry, backoff, and fallback behavior for quote generation, shipment tracking ingestion, and carrier integrations. The north star was simple: survive transient failures gracefully, but never introduce data integrity problems.

The Problems I Had to Solve

Naive retry logic fails in predictable ways:

  • A timed-out POST that actually succeeded on the carrier side → duplicate booking
  • Retried payment or rating calls → duplicate financial transactions
  • Thundering herd after an outage → the recovering service gets crushed again
  • Partial failures leaving systems in inconsistent states

The hard part wasn’t adding retries. It was making them safe.

What I Built

I moved from ad-hoc retry code to a structured, layered approach:

Operation classification became the foundation. I split every external call into four categories:

  • Idempotent reads → retry aggressively
  • Idempotent writes → retry with idempotency keys
  • Non-idempotent writes → limited retries + escalation
  • Destructive or money-impacting operations → no auto-retry, immediate alert

This single decision made on-call life dramatically easier.
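For flavor, here's a minimal sketch of how that classification can drive a policy table. The class names, attempt counts, and flags are illustrative defaults, not the exact values from any one integration:

```python
from enum import Enum, auto

class OpClass(Enum):
    IDEMPOTENT_READ = auto()       # e.g. fetch tracking status
    IDEMPOTENT_WRITE = auto()      # e.g. upsert keyed on a stable reference
    NON_IDEMPOTENT_WRITE = auto()  # e.g. create booking without idempotency support
    DESTRUCTIVE = auto()           # e.g. cancel, charge, refund

# Illustrative policy table: how many attempts, whether an idempotency key is
# required, and whether exhausting retries should page someone.
RETRY_POLICY = {
    OpClass.IDEMPOTENT_READ:      dict(max_attempts=6, needs_key=False, alert_on_exhaustion=False),
    OpClass.IDEMPOTENT_WRITE:     dict(max_attempts=4, needs_key=True,  alert_on_exhaustion=False),
    OpClass.NON_IDEMPOTENT_WRITE: dict(max_attempts=2, needs_key=False, alert_on_exhaustion=True),
    OpClass.DESTRUCTIVE:          dict(max_attempts=1, needs_key=False, alert_on_exhaustion=True),  # no auto-retry
}
```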

Idempotency done right. For APIs that supported it, I generated persistent idempotency keys and attached them across retries. For APIs that didn’t, I built client-side deduplication using payload hashes (operation fingerprint + key fields).
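A simplified sketch of both halves. `seen_before` and `submit` are placeholders for your persistence and transport layers, and the key fields are examples, not a real carrier schema:

```python
import hashlib
import json
import uuid

def idempotency_key(existing=None) -> str:
    # Generate once, persist it next to the pending operation, and reuse the
    # same key on every retry so the upstream can deduplicate on its side.
    return existing or str(uuid.uuid4())

def payload_fingerprint(operation: str, payload: dict, key_fields: list) -> str:
    # Client-side dedup for APIs without idempotency support: hash the operation
    # name plus only the business-identifying fields, not volatile metadata.
    material = {"op": operation, **{f: payload.get(f) for f in sorted(key_fields)}}
    return hashlib.sha256(json.dumps(material, sort_keys=True, default=str).encode()).hexdigest()

# Usage (placeholders: seen_before, submit, booking payload fields):
# fp = payload_fingerprint("create_booking", booking, ["load_id", "carrier_code", "pickup_date"])
# if not seen_before(fp):
#     submit(booking, headers={"Idempotency-Key": idempotency_key()})
```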

Exponential backoff with jitter. No more linear retries. I added full jitter to spread load and prevent thundering herds.
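Full jitter looks roughly like this; `TransientError` stands in for whatever timeout, connection-reset, or 5xx exceptions your HTTP client raises:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for timeouts, connection resets, and 5xx responses."""

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    # Full jitter: pick uniformly between 0 and an exponentially growing ceiling,
    # so clients recovering from the same outage don't retry in lockstep.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(fn, max_attempts: int):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```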

Defense-in-depth duplicate detection. Even after submission, ingestion layers checked payload hashes and content signatures before processing. This caught cases where the upstream ignored our idempotency key.
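In the ingestion layer that check looks roughly like the sketch below; the significant fields and the `store` interface are illustrative:

```python
import hashlib
import json

SIGNIFICANT_FIELDS = ("shipment_id", "status_code", "event_time")  # example fields

def content_signature(event: dict) -> str:
    # Sign only the fields that define "the same update", not the whole envelope,
    # so a retransmission with a fresh message ID still collapses to one record.
    material = {f: event.get(f) for f in SIGNIFICANT_FIELDS}
    return hashlib.sha256(json.dumps(material, sort_keys=True, default=str).encode()).hexdigest()

def ingest(event: dict, store) -> bool:
    sig = content_signature(event)
    if store.already_processed(sig):   # e.g. a unique index on the signature column
        return False                   # duplicate: count the hit and drop it
    store.save(event, signature=sig)
    return True
```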

Circuit breakers + graceful fallbacks. Sustained failures triggered breakers to protect both sides. When retries were exhausted, the system chose the right fallback based on business impact: stale-but-labeled tracking data for customers, queued manual review for anything involving money or commitments.
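A minimal breaker in the same spirit; thresholds, cooldowns, and the half-open probe behavior are simplified, and production versions usually track failure rates per endpoint:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a probe again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a request through after the cooldown to test recovery.
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures, self.opened_at = 0, None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

# If breaker.allow() is False: serve stale-but-labeled tracking data, or queue
# the operation for manual review when money or commitments are involved.
```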

How I Proved It Worked

  • Unit and integration tests that simulated timeouts, 5xx responses, partial successes, and retry storms (one example is sketched after this list)
  • Production metrics on retry rate, duplicate detection hits, circuit breaker trips, and recovery time
  • Regular review of retry logs per endpoint to tune policies (some carriers needed gentler backoff; others tolerated more aggressive retry schedules)
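To give a flavor of those tests, here's one written against the `call_with_retries` and `TransientError` helpers sketched earlier; it's illustrative, not lifted from a real suite:

```python
def test_recovers_from_transient_failures_without_duplicates():
    attempts = []

    def flaky_submit():
        attempts.append(1)
        if len(attempts) < 3:
            raise TransientError("simulated timeout")  # first two attempts fail
        return "booked"

    assert call_with_retries(flaky_submit, max_attempts=4) == "booked"
    assert len(attempts) == 3  # retried, but exactly one successful submission
```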

Outcomes

The results were concrete and sustained:

  • Duplicate processing incidents from retries fell to near zero
  • Transient failures recovered automatically in the vast majority of cases
  • No more duplicate bookings or charges caused by retry logic
  • Operations teams gained clear visibility and stopped waking up to surprise data inconsistencies

Tradeoffs & Lessons Learned

  • Idempotency tracking added database writes and some latency. Worth it.
  • Backoff increases end-to-end latency on failing calls. Still better than being wrong.
  • Circuit breakers create brief windows of unavailability. Tuning the thresholds is an art.
  • Most important lesson: Retry policy for anything that touches money, commitments, or customer trust is a product decision, not just an engineering default. Treating it as one prevented a lot of painful incidents.

You can see a similar tension between strictness and flow in my data import validation work.

What I’d Build Next

  1. Adaptive backoff that learns from each upstream’s historical recovery patterns
  2. A centralized, persistent retry queue for long-running workflows
  3. Better distributed tracing that ties original request → retries → fallback together
  4. Chaos-style automated failure injection in staging to keep the system honest

If you want to see how I approach production reliability end-to-end, check out Resilient API Integrations.

Need resilient systems that handle failure without creating new problems? Let’s talk.

Working through a similar system?

If one of these patterns maps to the mess in front of you, I can help sort through the architecture, the tradeoffs, and the next step.

Talk it through