The problem every logistics engineer knows
Carrier tracking data is a nightmare. One carrier calls it “DEPARTURE”, another “Departed from Port”, a third “OUTGATE”. Timestamps arrive in UTC, in local time, or with no timezone at all. Location data ranges from clean IATA codes to vague free-text strings.
Downstream teams were spending hours every week manually interpreting events, chasing inconsistencies, and fixing broken alerts and customer updates.
I owned the fix: build a normalization layer that turns this chaos into a clean, consistent set of internal milestones everyone could trust.
What I built
I designed and implemented a multi-stage event normalization pipeline.
At the core was a canonical milestone model — a small, opinionated set of business events (Departed Origin, Arrived at Port, Departed Port, In Transit, Available for Pickup, Delivered, etc.) with strict fields: type, location, timestamp (always UTC), source, and confidence.
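A minimal sketch of what such a canonical model could look like, in Python with hypothetical names and values (the original doesn’t specify a language or exact field types):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum

class MilestoneType(Enum):
    DEPARTED_ORIGIN = "departed_origin"
    ARRIVED_AT_PORT = "arrived_at_port"
    DEPARTED_PORT = "departed_port"
    IN_TRANSIT = "in_transit"
    AVAILABLE_FOR_PICKUP = "available_for_pickup"
    DELIVERED = "delivered"

@dataclass(frozen=True)
class Milestone:
    type: MilestoneType
    location: str          # standardized location code, e.g. "USNYC"
    timestamp: datetime    # must be timezone-aware UTC
    source: str            # which provider emitted the raw event
    confidence: float      # 0.0-1.0, lower when the mapping was ambiguous

    def __post_init__(self):
        # Reject naive timestamps and non-UTC offsets at construction time,
        # so "always UTC" is enforced by the model itself.
        if self.timestamp.utcoffset() != timedelta(0):
            raise ValueError("timestamp must be timezone-aware UTC")

# Example: a valid milestone (hypothetical values)
m = Milestone(MilestoneType.DELIVERED, "USNYC",
              datetime(2024, 1, 1, tzinfo=timezone.utc), "project44", 0.95)
```

Making the model enforce its own invariants (frozen, UTC-only) is what makes "strict fields" stick: bad data fails loudly at the boundary instead of leaking downstream.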
For each carrier or provider, I created dedicated mappers. Some were simple declarative lookup tables or regex patterns. Others required more sophisticated logic for sequence validation and context-aware classification.
Each provider brought distinct challenges:
- Project44 delivered high-volume event streams, but milestones often arrived out of order with contradictory status fields and inconsistent identifiers across shipment types. The main work was semantic tightening — preventing the system from implying certainty when the source data was ambiguous or incomplete.
- Ocean Insights provided valuable ocean-tracking data, but event completeness varied significantly by route and carrier. Pagination quirks in their API affected import reliability. The normalization layer had to be deliberately conservative so sparse event sets wouldn’t produce misleading timeline states.
- Shipsgo offered container-level ocean visibility, but transient API instability and replayed payloads were common. Position data sometimes arrived future-dated or with gaps. The mapper needed tight deduplication and conservative current-position handling to keep timelines honest under unreliable conditions.
I also added:
- Robust timestamp parsing and timezone normalization
- Location standardization (port codes, city names, fuzzy matching)
- A reconciliation engine that detects conflicting or impossible event sequences and flags them for review instead of guessing
- Detailed audit logging so every normalization decision was traceable: “Carrier X sent event Y → mapped to milestone Z via rule W”
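The timestamp-normalization piece of the list above can be sketched like this, assuming ISO-ish carrier timestamps and a caller-supplied fallback zone for naive values (a simplification of whatever parsing the real pipeline did):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def normalize_timestamp(raw: str, fallback_tz: str = "UTC") -> datetime:
    """Parse an ISO-ish carrier timestamp and return an aware UTC datetime.

    Naive timestamps are interpreted in fallback_tz (e.g. the port's local
    zone) rather than silently assumed to be UTC.
    """
    dt = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=ZoneInfo(fallback_tz))
    return dt.astimezone(timezone.utc)
```

The important design point is the explicit fallback zone: treating a naive local timestamp as UTC is exactly the kind of silent error that broke alerts in the first place.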
The entire pipeline was built to be fast, observable, and easy to extend when new carriers came online.
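The reconciliation engine's core check can be sketched as an ordering pass over an assumed linear milestone lifecycle (a simplification; real shipments can branch, which is exactly why conflicts are flagged rather than auto-corrected):

```python
# Hypothetical rank of each milestone along a shipment's lifecycle.
MILESTONE_ORDER = {
    "departed_origin": 0,
    "arrived_at_port": 1,
    "departed_port": 2,
    "in_transit": 3,
    "available_for_pickup": 4,
    "delivered": 5,
}

def find_sequence_conflicts(milestones: list[tuple[str, str]]) -> list[str]:
    """milestones: (type, timestamp) pairs sorted by timestamp.
    Returns human-readable flags; impossible sequences are surfaced
    for review, never silently reordered."""
    flags = []
    furthest_type, furthest_rank = "", -1
    for mtype, ts in milestones:
        rank = MILESTONE_ORDER[mtype]
        if rank < furthest_rank:
            flags.append(f"{mtype} at {ts} occurs after "
                         f"{furthest_type}; flagged for review")
        else:
            furthest_type, furthest_rank = mtype, rank
    return flags
```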
How I validated it
I started with exhaustive sample testing against real historical payloads from each carrier. Every mapper had a growing test suite of known inputs and expected canonical outputs.
In production, I tracked:
- Percentage of events successfully mapped (steadily increased as we refined rules)
- Rate of unmapped or ambiguous events (the long tail we monitored weekly)
- Downstream error rates in dashboards and notifications (noticeably lower after rollout)
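The first two metrics in that list reduce to a simple tally over per-event mapping outcomes; a sketch with assumed outcome labels:

```python
from collections import Counter

def mapping_metrics(outcomes) -> dict[str, float]:
    """outcomes: iterable of 'mapped', 'unmapped', or 'ambiguous' per event."""
    counts = Counter(outcomes)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {
        "mapped_pct": 100.0 * counts["mapped"] / total,
        "unmapped_pct": 100.0 * counts["unmapped"] / total,
        "ambiguous_pct": 100.0 * counts["ambiguous"] / total,
    }
```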
The best validation was operational: operations and customer teams stopped asking “what does this event actually mean?” and started trusting the milestone timeline.
Outcomes
- Downstream systems (UI, alerting, ETAs, customer portal) now consume clean, consistent milestone data no matter which carrier is involved.
- Adding support for new carriers became dramatically faster — often just a new mapper config + tests.
- Debugging tracking issues went from painful detective work to reading clear audit logs.
Most importantly, tracking data shifted from a constant source of friction to a reliable platform capability.
Tradeoffs & lessons
Tradeoffs
Normalization is inherently lossy, so we kept every raw event for audit and debugging. The extra processing step added some latency, which we mitigated with caching and careful async design. Perfect accuracy is impossible — some edge cases still need human review — but we aimed for high automation with safe fallbacks.
Lessons
- A strong canonical model is the single highest-leverage decision in data integration.
- Declarative mappings (where possible) beat hard-coded logic for long-term maintainability.
- Observability is non-negotiable. If you can’t explain why the system made a decision, you can’t improve it.
What’s next
I’d like to add:
- Feedback loops so operations can flag bad mappings and auto-suggest rule updates
- Better detection of carrier schema drift
- ML-assisted classification for the long tail of rare events
- Public documentation of our canonical model to help partners send cleaner data upstream
If you’re building logistics platforms and need clean, trustworthy tracking data instead of carrier spaghetti, let’s talk.