1 · Executive Overview
This document explains, from a technical perspective, what we built for the Intermodal Operations Hub (IO Hub) predictive analytics proof-of-concept — the problem we are solving, the data we have, the models we chose and why, how long training takes, and how predictions are produced and kept fresh. Every technical section is paired with a plain-language summary so a non-specialist can follow along.
The one-paragraph version
We forecast two operational things for every terminal in the network: how full the yard will get over the next 24 hours (so we can warn about congestion before it happens) and how many containers will move through the gate over the next 14 days (inbound, outbound, and total — for staffing and equipment planning). For each terminal and each metric we let six different algorithms compete and we keep whichever one is most accurate on data it has never seen. The whole fleet — 127 models across 32 terminals — trains in about 17.1 minutes on a single laptop.
2 · Problem Statement
Problem statement
Intermodal terminals are choke points. If the yard fills past its working capacity, cranes and hostlers slow down, trucks queue at the gate, and dwell time explodes. Today these problems are noticed after they happen. The business question is simple:
“Can we see congestion and demand before it arrives, per terminal, accurately enough to act on — using only the operational data we already collect?”
What we turned that into, technically
Two supervised time-series forecasting problems:
| Use case | Target (what we predict) | Horizon | Granularity |
|---|---|---|---|
| UC1 — Yard inventory | Containers on the ground (cumulative in − out) | Next 24 hours | Hourly |
| UC2 — Gate throughput (total) | Total gate moves per day | Next 14 days | Daily |
| UC2 — Gate inbound | Containers entering per day | Next 14 days | Daily |
| UC2 — Gate outbound | Containers leaving per day | Next 14 days | Daily |
A congestion breach flag is derived on top of UC1: if the forecast yard inventory crosses a capacity threshold, we raise a risk alert for that terminal.
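A minimal sketch of how the UC1 target and the breach flag relate, assuming inventory is the running sum of inbound minus outbound moves as the table states. The function names, the `opening` balance, the sample counts, the threshold value, and the choice of `>=` for "crosses" are all illustrative assumptions, not the project's actual code.

```python
from itertools import accumulate

def yard_inventory(inbound, outbound, opening=0):
    """Containers on the ground: opening balance plus running sum of (in - out)."""
    net = [i - o for i, o in zip(inbound, outbound)]
    return [opening + total for total in accumulate(net)]

def breach_hours(forecast, capacity_threshold):
    """0-based hours ahead at which the forecast meets or exceeds the threshold."""
    return [h for h, value in enumerate(forecast) if value >= capacity_threshold]

# Hypothetical hourly gate counts for one terminal.
inbound = [12, 30, 25, 5]
outbound = [10, 5, 8, 20]
inventory = yard_inventory(inbound, outbound, opening=100)

# Hypothetical threshold; a real value would come from terminal capacity data.
hours = breach_hours(inventory, capacity_threshold=140)
breach_flag = bool(hours)
```

In this toy series the inventory peaks in the third hour, so that is the only hour flagged.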
3 · The Data We Have
The data we actually have
- Source: equipmentActivityReported, the event feed every terminal already emits when a container/chassis is moved, gated, grounded, or picked up.
- Volume: ~3.75 GB raw, 845,277 events across 32 terminals.
- Window: roughly 2.5 months of continuous history (mid-March → end-May 2026).
- Shape: one big MongoDB extended-JSON array → streamed into 9 NDJSON shards → aggregated into tidy gold tables with DuckDB.
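The sharding step can be sketched as follows, assuming events already arrive one at a time from some incremental parser (not shown here; reading a multi-gigabyte JSON array without loading it whole needs a streaming parser). The function name, shard file naming, and event fields are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def write_ndjson_shards(events, out_dir, events_per_shard):
    """Write an iterable of dicts to numbered NDJSON files, one JSON object per line."""
    out_dir = Path(out_dir)
    paths, handle = [], None
    try:
        for n, event in enumerate(events):
            if n % events_per_shard == 0:
                # Start a new shard every `events_per_shard` events.
                if handle:
                    handle.close()
                path = out_dir / f"shard_{len(paths):03d}.ndjson"
                handle = path.open("w", encoding="utf-8")
                paths.append(path)
            handle.write(json.dumps(event) + "\n")
    finally:
        if handle:
            handle.close()
    return paths

# Hypothetical events; the field names are illustrative, not the real feed schema.
events = ({"terminal": "T01", "seq": i} for i in range(10))
with tempfile.TemporaryDirectory() as tmp:
    shards = write_ndjson_shards(events, tmp, events_per_shard=4)
    shard_sizes = [len(p.read_text(encoding="utf-8").splitlines()) for p in shards]
```

Because only one event is held in memory at a time, memory use stays flat regardless of the size of the input.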
The gold tables we train on
| Table | Grain | Feeds |
|---|---|---|
| yard_hourly | terminal × hour | UC1 yard inventory |
| gate_flow_hourly | terminal × hour | UC2 gate (rolled up to daily) |
| dwell | per container | dwell distribution / quality |
| equipment_quality | per unit | idle / data-quality checks |
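To make the terminal × hour grain concrete, here is a pure-Python stand-in for the kind of group-by that builds an hourly gate table and then rolls it up to daily. The project does this in DuckDB; the event fields (`terminal`, `ts`, `direction`) and function names below are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime

def hourly_gate_flow(events):
    """Count events per (terminal, hour bucket, direction)."""
    counts = Counter()
    for e in events:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(e["ts"]).replace(minute=0, second=0, microsecond=0)
        counts[(e["terminal"], hour, e["direction"])] += 1
    return counts

def daily_gate_flow(hourly_counts):
    """Roll hourly counts up to (terminal, date, direction)."""
    daily = Counter()
    for (terminal, hour, direction), n in hourly_counts.items():
        daily[(terminal, hour.date(), direction)] += n
    return daily

# Hypothetical events, not real feed data.
events = [
    {"terminal": "T01", "ts": "2026-04-01T08:05:00", "direction": "in"},
    {"terminal": "T01", "ts": "2026-04-01T08:40:00", "direction": "in"},
    {"terminal": "T01", "ts": "2026-04-01T08:59:59", "direction": "out"},
    {"terminal": "T01", "ts": "2026-04-01T09:00:00", "direction": "in"},
    {"terminal": "T02", "ts": "2026-04-01T08:15:00", "direction": "out"},
]
hourly = hourly_gate_flow(events)
daily = daily_gate_flow(hourly)
```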
Is this “big data, lots of moving parts”?
It is a healthy amount of raw data (3.75 GB), but after aggregation each model only sees a compact table: roughly 2.5 months of hourly or daily points per terminal, on the order of 1,800 hourly rows or 75 daily rows. That is exactly why training is fast. The heavy lifting is the one-time extract & aggregate step, not the modelling.
4 · Why This Isn't an NLP Project
Why this is NOT like an NLP / large-language-model project
A very common (and reasonable) worry is: “ML training takes forever — GPUs, huge model repos, weeks of compute.” That is true for deep learning on unstructured data (text, images, audio). Our problem is fundamentally different:
| Aspect | NLP / LLM / deep learning | Our forecasting models |
|---|---|---|
| Input | Unstructured text/images, millions–billions of tokens | Structured numeric tables (a few thousand rows) |
| Model size | Millions–billions of parameters | Thousands of parameters (trees / coefficients) |
| Hardware | GPUs / clusters | One CPU / a laptop |
| Training time | Hours → weeks | Minutes for the whole fleet |
| Pre-trained weights / repos | Required (HuggingFace, etc.) | None — we train from our own data |
| Explainability | Hard (black box) | Easy (feature importance, coefficients) |
We deliberately chose classical, tabular machine-learning: gradient-boosted trees, random forests, linear regression, and dedicated statistical time-series models. These are the right tool for structured operational data and they are fast, cheap, explainable, and sustainable.
5 · The Six Models & Why
The six algorithms in the competition
For every (terminal, metric) we train all six, score them on a clean hold-out, and keep the winner. Four are regression models (they predict from engineered features); two are time-series models (they read the raw series directly).
| # | Algorithm | Family | Why it's in the contest |
|---|---|---|---|
| 1 | HistGradientBoosting | Regression (boosted trees) | Strong default for tabular data; captures non-linear patterns; very fast. |
| 2 | RandomForest | Regression (bagged trees) | Robust, low-variance, hard to over-fit; good when signal is noisy. |
| 3 | XGBoost | Regression (boosted trees) | Industry-standard gradient boosting; often the most accurate on structured data. |
| 4 | Ridge | Regression (linear) | Simple, fast, hard to beat on smooth/linear trends; a strong baseline. |
| 5 | Prophet | Time-series (additive) | Built for business series with weekly seasonality & trend; great for gate flow. |
| 6 | ARIMA | Time-series (statistical) | Classic autocorrelation model; captures momentum & mean-reversion. |
Why a competition instead of one chosen model?
Different terminals behave differently. A high-volume hub looks linear and trendy (Ridge wins); a spiky low-volume terminal has strong weekly seasonality (Prophet wins); another has momentum (ARIMA wins). Rather than force one model on everyone, we let the data decide per terminal. The leaderboard in the next section bears this out: no single model wins everywhere.
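The selection rule itself is small. The sketch below uses three toy forecasters as stand-ins for the six real algorithms (none of them is Prophet, ARIMA, or a tree model) and picks whichever has the lowest MAE on a held-out tail of the series. All function names and the sample series are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def last_value(train, horizon):
    return [train[-1]] * horizon

def overall_mean(train, horizon):
    return [sum(train) / len(train)] * horizon

def seasonal_naive(train, horizon, period=7):
    # Repeat the value observed one full period earlier.
    return [train[len(train) - period + (h % period)] for h in range(horizon)]

def pick_winner(series, horizon, candidates):
    """Score every candidate on the last `horizon` points and return the best."""
    train, holdout = series[:-horizon], series[-horizon:]
    scores = {name: mae(holdout, fn(train, horizon)) for name, fn in candidates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical daily gate series with a weekly rhythm (quiet weekend).
week = [100, 120, 130, 125, 110, 40, 30]
series = week * 3
candidates = {
    "last_value": last_value,
    "overall_mean": overall_mean,
    "seasonal_naive": seasonal_naive,
}
winner, scores = pick_winner(series, horizon=7, candidates=candidates)
```

On this perfectly weekly series the seasonal forecaster wins; on a smooth trending series a different candidate would, which is the point of running the contest per terminal.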
6 · The Leaderboard (Real Results)
The leaderboard — who won, fleet-wide
Across all 127 trained models, here is how often each algorithm produced the most accurate forecast and therefore got deployed:
| Algorithm | Series it won | Win share |
|---|---|---|
| 🥇 Prophet | 45 | 35.4% |
| 🥈 Ridge | 21 | 16.5% |
| 🥉 HistGradientBoosting | 19 | 15.0% |
| ARIMA | 18 | 14.2% |
| RandomForest | 16 | 12.6% |
| XGBoost | 8 | 6.3% |
How to read this
- Prophet dominates the daily gate series — expected, because gate traffic has a strong weekly rhythm (quiet weekends, busy mid-week).
- Ridge wins most yard-inventory series — hourly inventory moves smoothly, so a clean linear model with lag features is hard to beat.
- XGBoost / RandomForest / HistGradientBoosting / ARIMA each win the awkward terminals the others can't — which is exactly why we keep them in the contest.
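As a quick consistency check, the win shares follow directly from the win counts in the leaderboard: each share is the algorithm's wins divided by the 127 deployed models.

```python
# Win counts copied from the leaderboard table.
wins = {
    "Prophet": 45,
    "Ridge": 21,
    "HistGradientBoosting": 19,
    "ARIMA": 18,
    "RandomForest": 16,
    "XGBoost": 8,
}
total = sum(wins.values())
shares = {name: round(100 * n / total, 1) for name, n in wins.items()}
```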
7 · How Training Works & How Long
How training actually works (the loop)
- Pull the gold table for a terminal (hourly yard or daily gate).
- Build features — lag values, rolling averages, hour/day-of-week — for the regressors.
- Cross-validate each regressor with rolling-origin TimeSeriesSplit (always train on the past, test on the future; never peek ahead).
- Hold-out test: chop off the most recent stretch, predict it, measure error (MAE / MAPE).
- Time-series models (Prophet, ARIMA) forecast that same hold-out directly.
- Pick the winner (lowest hold-out MAE), re-fit on all available history, and save it to models/<terminal>_<series>.joblib.
- Repeat for every terminal × every series.
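The feature-building and cross-validation steps above can be sketched in plain Python. The lag set, window length, and feature names are assumptions, and `rolling_origin_splits` is a hand-rolled stand-in that is meant to mirror the expanding-window behaviour of scikit-learn's `TimeSeriesSplit`, not the project's actual code.

```python
from datetime import datetime, timedelta

def build_features(timestamps, values, lags=(1, 24), window=3):
    """Build one feature row per target hour using only earlier observations."""
    start = max(max(lags), window)  # first index with every lag and window available
    rows, targets = [], []
    for t in range(start, len(values)):
        row = {f"lag_{k}": values[t - k] for k in lags}
        row[f"roll_mean_{window}"] = sum(values[t - window:t]) / window
        row["hour"] = timestamps[t].hour
        row["day_of_week"] = timestamps[t].weekday()  # Monday = 0
        rows.append(row)
        targets.append(values[t])
    return rows, targets

def rolling_origin_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx); training data always precedes the test fold."""
    first_test = n_samples - n_splits * test_size
    if first_test <= 0:
        raise ValueError("not enough samples for the requested splits")
    for test_start in range(first_test, n_samples, test_size):
        yield list(range(test_start)), list(range(test_start, test_start + test_size))

# Hypothetical series: 48 hourly observations starting on a Monday.
timestamps = [datetime(2026, 4, 6) + timedelta(hours=h) for h in range(48)]
values = list(range(48))
rows, targets = build_features(timestamps, values)
splits = list(rolling_origin_splits(len(rows), n_splits=3, test_size=4))
```

Every fold trains on indices strictly earlier than the ones it tests on, which is what "never peek ahead" means in practice. The rolling mean likewise stops at `t - 1`, so no feature leaks the value being predicted.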
How long does it take — and why so fast?
The entire fleet — 127 models across 32 terminals — trains in about 17.1 minutes on a single laptop CPU. It's fast because each model studies a small, aggregated table (weeks of points, not millions of rows), the algorithms are lightweight, and there are no GPUs or pre-trained weights to load.
| Step | One-time? | Cost |
|---|---|---|
| Extract 3.75 GB raw → NDJSON shards | Yes | Minutes (streamed, low memory) |
| Aggregate → gold tables (DuckDB) | Per data refresh | Seconds–minutes |
| Train whole fleet (6-way contest) | Per retrain | ~17.1 minutes |
| Draw a prediction (serve) | Every request | Milliseconds |
8 · How a Prediction Is Drawn
How a prediction is drawn at serving time
Predictions are not recomputed from scratch. We load the already-trained winner and ask it for the next horizon:
- Load models/<terminal>_<series>.joblib (the saved winner + its feature list).
- Assemble the latest features (most recent lags / rolling windows) from current data.
- Call model.predict(...) → the 24-hour or 14-day forecast.
- For UC1, compare the forecast to the capacity threshold → raise a congestion breach flag if it crosses.
This is the same call the live app makes (/api/predict/<terminal>/<series>), and it returns in milliseconds.
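A minimal sketch of that serving path, under stated assumptions: the saved artifact is modelled as a dict with hypothetical `model` and `features` keys, `LagRegressor` is a toy stand-in for a fitted regressor, and `StoredForecast` is a hypothetical stand-in for a wrapper that holds a precomputed forecast. Loading from disk with joblib is omitted so the example stays self-contained.

```python
class StoredForecast:
    """Hypothetical wrapper: serves a precomputed horizon forecast via predict()."""
    def __init__(self, forecast):
        self._forecast = list(forecast)

    def predict(self, X=None):
        return list(self._forecast)

class LagRegressor:
    """Toy stand-in for a fitted regressor: a fixed weighted sum of its inputs."""
    def __init__(self, weights):
        self.weights = weights

    def predict(self, X):
        return [sum(w * x for w, x in zip(self.weights, row)) for row in X]

def serve(artifact, latest_features=None, capacity_threshold=None):
    """Order the features as the model expects, predict, and optionally flag a breach."""
    X = None
    if latest_features is not None:
        X = [[row[name] for name in artifact["features"]] for row in latest_features]
    forecast = artifact["model"].predict(X)
    response = {"forecast": forecast}
    if capacity_threshold is not None:
        # Assumption: a breach means any forecast point at or above the threshold.
        response["breach"] = any(v >= capacity_threshold for v in forecast)
    return response

# Hypothetical artifacts and inputs.
regression_artifact = {"model": LagRegressor([0.5, 0.5]), "features": ["lag_1", "lag_24"]}
timeseries_artifact = {"model": StoredForecast([10, 12, 11]), "features": []}

yard = serve(regression_artifact, [{"lag_1": 100, "lag_24": 120, "unused": 1}], capacity_threshold=105)
gate = serve(timeseries_artifact)
```

The serving function never needs to know which family of model it is holding; it only relies on `predict()`.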
Time-series winners (Prophet, ARIMA) are wrapped in a TSWrapper whose predict() returns the stored forecast, so every saved model, regression or time-series, exposes the same predict() interface to the serving layer. The wrapper lives in a shared importable module so joblib can unpickle it in the app process.
9 · Ongoing Training & Fabric
Ongoing / continuous training in production
Two loops run at different speeds, with an always-on drift watch alongside them:
| Loop | What it does | How often |
|---|---|---|
| Inference loop | Load saved models, produce fresh forecasts as new data lands | Continuously / hourly |
| Retrain loop | Re-run the 6-way contest on the newest window, re-pick winners | Nightly or weekly |
| Drift watch | Track live error (MAE/MAPE); if it degrades, trigger a retrain | Always-on |
Because a full retrain is only ~17.1 minutes, we can afford to retrain often — even nightly — which keeps the models aligned with the most recent operational reality (seasonal shifts, volume changes, new equipment).
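One way the drift watch could decide when to trigger a retrain is to compare a rolling live MAE against the hold-out MAE recorded at training time. The class name, the 1.5× tolerance, and the window length below are assumptions chosen for illustration, not tuned or confirmed values.

```python
from collections import deque

class DriftWatch:
    """Signals a retrain when rolling live MAE exceeds baseline_mae * tolerance."""
    def __init__(self, baseline_mae, tolerance=1.5, window=24):
        self.baseline_mae = baseline_mae
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)  # keeps only the most recent errors

    def observe(self, actual, predicted):
        """Record one realised forecast error; return True if a retrain is due."""
        self.errors.append(abs(actual - predicted))
        if len(self.errors) < self.errors.maxlen:
            return False  # wait for a full window before judging
        live_mae = sum(self.errors) / len(self.errors)
        return live_mae > self.baseline_mae * self.tolerance

# Hypothetical baseline and observations.
watch = DriftWatch(baseline_mae=10, tolerance=1.5, window=3)
signals = [
    watch.observe(100, 95),
    watch.observe(100, 92),
    watch.observe(100, 90),
    watch.observe(100, 70),
]
```

Requiring a full window before signalling avoids retraining on a single bad hour.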
How this maps to Microsoft Fabric (production)
| Local POC | Fabric production |
|---|---|
| NDJSON shards | Eventstream / Data Factory ingest |
| DuckDB gold parquet | Lakehouse Delta tables (medallion) |
| train_fleet.py | Spark notebook on a schedule |
| .joblib winners | MLflow model registry |
| Flask app + docs | Power BI + Copilot |
10 · Sustainability & World-Class
Is the approach sustainable?
Yes — deliberately so. The design favours simple, explainable, CPU-only models that retrain in minutes, store as small files, and run anywhere (laptop → Fabric) with the same code path. There is no GPU bill, no giant model repo, no opaque black box. That is what makes it cheap to run and easy to trust.
What would make it genuinely world-class
| Upgrade | What it adds |
|---|---|
| Prediction intervals | Not just “52,000 containers” but “52,000 ± 1,500, 90% confidence” — so operators see risk, not just a point. |
| Exogenous drivers | Feed in vessel ETAs, rail schedules, holidays, weather — the things that actually push volume. |
| Automated hyper-parameter tuning | Optuna/sweeps per terminal to squeeze out the last few % of accuracy. |
| Backtesting harness | Replay months of history to certify accuracy before trusting a model in production. |
| Drift detection + auto-retrain | Trigger retrains on data/concept drift automatically (close the loop). |
| Global / hierarchical models | One model that shares signal across terminals, reconciled to per-terminal totals. |
| MLflow registry + CI/CD | Versioned models, staged promotion, reproducible deploys. |
| Anomaly & data-quality gates | Catch the negative-inventory / zero-volume edge cases before they reach a forecast. |
Honest limitations today
- A handful of terminals show negative cumulative inventory (outbound > inbound in the window because units arrived before our data starts) — their breach flags are artifacts, not real congestion.
- Low-volume gate series have high percentage error (small denominators) even when absolute error is tiny.
- 2.5 months of history is enough for weekly patterns but not yet for yearly seasonality — which the same pipeline captures automatically once a full year of data is available.
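The low-volume limitation is easy to see with a worked example. The two hypothetical series below have the same absolute error, yet the percentage error differs by orders of magnitude because the quiet terminal's actuals are small. Skipping zero actuals in `mape` is an assumption about how zero-volume days are handled.

```python
def mae(actual, predicted):
    """Mean absolute error, in containers."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; zero actuals are skipped."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)

# Hypothetical daily gate moves: a busy terminal and a quiet one.
busy_actual, busy_pred = [1000, 1200, 900], [1002, 1198, 902]
quiet_actual, quiet_pred = [4, 2, 5], [6, 4, 3]

busy_mae, quiet_mae = mae(busy_actual, busy_pred), mae(quiet_actual, quiet_pred)
busy_mape, quiet_mape = mape(busy_actual, busy_pred), mape(quiet_actual, quiet_pred)
```

Both forecasts are off by two containers a day on average, but the quiet terminal's MAPE is above 60% while the busy terminal's is well under 1%.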