Executive summary. In enterprise-scale logistics, the line between profitability and loss is thin. We replaced legacy linear programming and univariate forecasting with a two-stage AI engine: a Temporal Fusion Transformer (TFT) for forecasting volatile demand, and a Proximal Policy Optimization (PPO) agent for proactive inventory routing. This piece walks through the architecture, the challenges we hit in deployment, and how we solved them.
1. The Business Problem: A Constrained Network
The supply-chain network is made up of thousands of interacting nodes: hubs, regional DCs and fulfilment centres. The core challenge: a localised spike in demand drains inventory at one node, and the resulting bottleneck cascades back to the central hub. Reacting after the fact is too late, so the system has to dictate repositioning actions preemptively.
Network Dynamics Simulation
Simulation legend: the large white node is the central hub; grey nodes are regional centres. Particles represent inventory flow. Demand spikes (white flashes) temporarily drain nodes, forcing the hub to react.
- Capacity Constraint: nodes have max volumes.
- Logistics Limits: edges have throughput caps.
- Cascading Bottlenecks: local stockouts cause global strain.
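The three constraints above can be made concrete with a small feasibility check. This is a minimal sketch, not the production model: the `Node` class, the `feasible_shipment` helper and all numbers are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: float   # maximum volume the node can hold
    inventory: float  # current on-hand volume


def feasible_shipment(src: Node, dst: Node, requested: float, edge_cap: float) -> float:
    """Largest shipment not exceeding `requested` that respects every limit."""
    headroom = dst.capacity - dst.inventory  # free space at the destination
    # The binding constraint is whichever limit is smallest.
    return max(0.0, min(requested, src.inventory, edge_cap, headroom))


# Hypothetical example: the regional DC is nearly full, so its headroom
# (not the edge throughput cap) ends up limiting the shipment.
hub = Node("hub", capacity=10_000, inventory=6_000)
dc = Node("regional_dc", capacity=1_500, inventory=1_200)
shipped = feasible_shipment(hub, dc, requested=800, edge_cap=500)
```

When the source node is empty the function returns zero, which is how a local stockout propagates: every downstream request on that edge goes unfilled until the hub replenishes it.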
2. Data Engineering: Establishing the “State”
An RL agent is only as good as what it observes. Legacy batch pipelines left the agent acting on stale state. The real-time streaming architecture described below provides a unified offline/online state; building it also surfaced a critical challenge with out-of-order data.
High-Throughput Telemetry
Amazon Kinesis Data Streams ingests high-throughput telemetry from every network node — inventory scans, truck telematics, and queue depths at loading docks.
Changes in network topology (e.g. a trucking route closed by weather) are captured via CDC (Change Data Capture) using AWS DMS, flowing from operational databases into Kafka (Amazon MSK).
Challenge: The "Phantom Inventory" Problem
The problem: the RL agent made erratic decisions based on inventory that didn't exist. "Shipment departed" events from legacy ERPs arrived before "shipment picked" events, creating negative inventory spikes.
The solution: custom stateful processing in Spark using mapGroupsWithState. Out-of-order events sit in a 10-minute holding buffer. Events still unmatched when the buffer times out are sent to a Dead Letter Queue, and we impute the affected state from a Redis cache.
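The buffering logic can be illustrated outside Spark. The sketch below is a plain-Python stand-in for the per-key state update, not the Spark API itself: `ReorderBuffer`, the event names, and the dictionary standing in for the Redis cache are all assumptions made for the example.

```python
from dataclasses import dataclass, field

HOLD_SECONDS = 600  # the 10-minute holding buffer


@dataclass
class ReorderBuffer:
    pending: dict = field(default_factory=dict)       # shipment_id -> arrival time of early "departed"
    picked: set = field(default_factory=set)          # shipments whose pick has been seen
    dead_letters: list = field(default_factory=list)  # shipments flagged on timeout

    def on_event(self, shipment_id, kind, ts):
        """Return the events that are now safe to apply, in causal order."""
        if kind == "picked":
            self.picked.add(shipment_id)
            released = [(shipment_id, "picked")]
            if shipment_id in self.pending:
                # The early "departed" can be released right after its pick.
                del self.pending[shipment_id]
                released.append((shipment_id, "departed"))
            return released
        if kind == "departed":
            if shipment_id in self.picked:
                return [(shipment_id, "departed")]
            self.pending[shipment_id] = ts  # hold until the matching pick arrives
            return []
        raise ValueError(f"unknown event kind: {kind}")

    def expire(self, now, cache):
        """Flag held events older than the buffer and impute state from the cache."""
        imputed = {}
        for sid, ts in list(self.pending.items()):
            if now - ts >= HOLD_SECONDS:
                del self.pending[sid]
                self.dead_letters.append(sid)
                imputed[sid] = cache.get(sid)  # last known value, if any
        return imputed
```

Because a "departed" event is never applied before its "picked" counterpart, inventory is not decremented for stock the system has not yet seen, which is what produced the negative spikes.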
Bridging the Offline/Online Gap
To solve training–serving skew, we use Amazon SageMaker Feature Store.
- Online Store (ElastiCache/Redis): sub-10ms latency so the inference pipeline can grab the latest node states and 1-hour moving averages.
- Offline Store (Amazon S3 & Glue): complex temporal aggregations for model training, without touching production latency.
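One way to see why a shared feature definition removes skew: if training and serving both call the same "as of time t" lookup, they cannot disagree about what the agent would have observed. The sketch below uses only the standard library; `as_of`, `moving_average` and the sample history are hypothetical and do not reflect the Feature Store API.

```python
from bisect import bisect_right


def as_of(history, ts):
    """Latest value whose event time is <= ts, or None if nothing is known yet.

    `history` is a list of (event_time_seconds, value) sorted by time.
    """
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i else None


def moving_average(history, ts, window=3600):
    """Mean of values observed in the window (ts - window, ts]."""
    vals = [v for t, v in history if ts - window < t <= ts]
    return sum(vals) / len(vals) if vals else None


# Hypothetical inventory readings for one node, every 30 minutes.
history = [(0, 100.0), (1800, 80.0), (3600, 60.0), (5400, 90.0)]

latest = as_of(history, 4000)            # ignores the reading at t=5400
hourly = moving_average(history, 4000)   # uses only t=1800 and t=3600
```

At training time `ts` is a historical timestamp; at serving time it is the current time. Neither path can read a value from the future, so the feature is computed the same way in both.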
3. Core Data Science: Two-Stage Modeling
Forcing a single model to both predict future demand and optimise routing proved computationally intractable, so we decoupled the problem. The two models below show how we handled forecasting uncertainty and stopped the RL agent from “hacking” our logistics costs.
Temporal Fusion Transformers
Before making a plan we need the demand trajectory. TFTs handle heterogeneous inputs: static node types, known future promotions, and real-time inventory whose future values are unknown.
Challenge: Overfitting to Viral Spikes
Point forecasts overfit to viral social-media spikes and extrapolated them into permanent exponential growth. Solution: quantile regression. We output the 10th, 50th and 90th percentiles; when the spread between the 10th and 90th is wide (high uncertainty), the RL agent learns to act conservatively.
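Quantile regression is usually trained with the pinball (quantile) loss, which penalises under- and over-prediction asymmetrically. The sketch below shows the loss for a single observation; the forecast values are made up, and using the band width as an uncertainty feature is an illustrative assumption about how the spread could be passed to the agent.

```python
def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Quantile (pinball) loss for one observation at quantile level q."""
    diff = y_true - y_pred
    # Under-prediction (diff > 0) is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1) * diff)


# Hypothetical forecast for one node and one horizon step.
forecast = {0.1: 90.0, 0.5: 110.0, 0.9: 150.0}
actual = 120.0

# Training minimises the sum of the per-quantile losses.
total_loss = sum(pinball_loss(actual, pred, q) for q, pred in forecast.items())

# Width of the 10th-90th band: a simple scalar measure of forecast uncertainty.
spread = forecast[0.9] - forecast[0.1]
```

For the 90th percentile, predicting too low costs nine times as much as predicting too high by the same amount, which is what pushes that output towards the upper edge of the demand distribution.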
Figure: Demand Forecast, Quantile Bands
Proximal Policy Optimization
The PPO agent takes the current state plus the TFT probabilistic forecast and produces a continuous routing matrix across the network.
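The text does not say how the routing matrix is parameterised. One common choice, shown here purely as an assumption, is to let the policy emit unconstrained scores and convert each row into shares of that node's on-hand stock with a softmax, so the plan can never ship more than a node holds. The function name and numbers are hypothetical.

```python
import math


def routing_matrix(scores, inventory):
    """Turn raw policy scores into a shipment plan.

    scores[i][j] is the policy's raw preference for sending stock from node i
    to destination j; row i of the result sums to inventory[i].
    """
    plan = []
    for row, stock in zip(scores, inventory):
        peak = max(row)                                # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        plan.append([stock * e / total for e in exps])
    return plan


# Hypothetical two-node example with three destinations each.
scores = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
inventory = [300.0, 100.0]
plan = routing_matrix(scores, inventory)
```

Equal scores split stock evenly, while a higher score concentrates it on one destination. Edge throughput caps and destination capacity would still need to be enforced after this step.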
Challenge: Reward Hacking & Hoarding
Initially the agent "hacked" the system to guarantee zero stockouts — hoarding inventory at expensive central hubs and using premium air freight. Solution: dense reward shaping with holding penalties, proportional routing costs and thrashing penalties.
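To make the difference concrete, the sketch below compares a naive stockout-only reward with a shaped one. All weights, unit costs and scenario numbers are hypothetical, and "thrashing" is interpreted here as the volume of the plan that changed since the previous step, which is an assumption.

```python
def naive_reward(stockout_units):
    """Penalise stockouts only: the objective the agent learned to exploit."""
    return -float(stockout_units)


def shaped_reward(stockout_units, held_units, shipped_units, unit_cost, plan_change,
                  w_stockout=10.0, w_hold=0.1, w_route=1.0, w_thrash=0.5):
    """Dense per-step reward combining four cost terms (weights are illustrative)."""
    return -(
        w_stockout * stockout_units            # unmet demand
        + w_hold * held_units                  # holding penalty for idle stock
        + w_route * shipped_units * unit_cost  # routing cost proportional to volume and mode
        + w_thrash * plan_change               # penalty for churn between consecutive plans
    )


# Hoarding policy: zero stockouts, large hub inventory, premium air freight.
hoard = dict(stockout_units=0, held_units=5000, shipped_units=1000, unit_cost=4.0, plan_change=0)
# Balanced policy: a few stockouts, lean inventory, ground freight.
balanced = dict(stockout_units=20, held_units=1000, shipped_units=1000, unit_cost=0.5, plan_change=100)

naive_prefers_hoarding = naive_reward(hoard["stockout_units"]) > naive_reward(balanced["stockout_units"])
shaped_prefers_balanced = shaped_reward(**balanced) > shaped_reward(**hoard)
```

Under the naive reward the hoarding policy scores higher; once holding and freight costs enter the reward, the balanced policy wins despite tolerating a small number of stockouts.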
Figure: Cost, Naive vs. Shaped Reward
4. AWS Architecture: Micro-Batch Serving
Because our SLA is to generate a network-wide repositioning plan every 15 minutes, we opted for an asynchronous micro-batch architecture rather than strict real-time APIs. This flow orchestrates the interaction between the Feature Store, the TFT forecaster, and the PPO agent.
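A single planning cycle can be sketched as four steps run on a fixed 15-minute schedule. This is a schematic under stated assumptions: `run_cycle`, `next_tick` and the four callables are hypothetical placeholders for the Feature Store read, the TFT forecaster, the PPO policy and whatever publishes the plan; they are not AWS APIs.

```python
PERIOD_SECONDS = 15 * 60  # one plan every 15 minutes, per the SLA


def next_tick(now, period=PERIOD_SECONDS):
    """Epoch time of the next period boundary strictly after `now`."""
    return (int(now) // period + 1) * period


def run_cycle(fetch_state, forecast, policy, publish):
    """One micro-batch: read state, forecast demand, choose a plan, publish it."""
    state = fetch_state()             # latest node states from the online store
    quantiles = forecast(state)       # probabilistic demand forecast
    plan = policy(state, quantiles)   # network-wide repositioning plan
    publish(plan)                     # hand off asynchronously to downstream systems
    return plan


# Hypothetical stubs standing in for the real components.
published = []
plan = run_cycle(
    fetch_state=lambda: {"inventory": 5},
    forecast=lambda state: {"p50": 7},
    policy=lambda state, q: {"ship": q["p50"] - state["inventory"]},
    publish=published.append,
)
```

Aligning runs to period boundaries, not to "15 minutes after the last run finished", keeps the schedule from drifting when a cycle runs long.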
Key Takeaways
- Decouple prediction from optimisation. Separating the forecaster from the optimiser let us debug overfitting and bad policy decisions independently.
- Reward shaping is critical. An RL agent will exploit any loophole in its reward logic; dense, multi-objective reward functions were what made ours safe to run in production.
- Invest in feature stores. Complex rolling-window features must be computed consistently in both the Spark pipelines and the inference endpoints; a feature store bridges the two and prevents state-consistency issues.