IO Hub — Technical White Paper

1 · Executive Overview

This document explains, from a technical perspective, what we built for the Intermodal Operations Hub (IO Hub) predictive analytics proof-of-concept — the problem we are solving, the data we have, the models we chose and why, how long training takes, and how predictions are produced and kept fresh. Every technical section is paired with a plain-language summary so a non-specialist can follow along.

127 models trained (one per terminal × series)

32 terminals (the whole fleet)

17.1 min training time (not hours)

6 algorithms competing (best one wins per series)

The one-paragraph version

We forecast two operational things for every terminal in the network: how full the yard will get over the next 24 hours (so we can warn about congestion before it happens) and how many containers will move through the gate over the next 14 days (inbound, outbound, and total — for staffing and equipment planning). For each terminal and each metric we let six different algorithms compete and we keep whichever one is most accurate on data it has never seen. The whole fleet — 127 models across 32 terminals — trains in about 17.1 minutes on a single laptop.

In plain words: Think of it like a weather forecast for the container yard and the gate. We don't guess; we learn the patterns from the last couple of months of real activity and project them forward. And instead of trusting one forecaster, we run a small contest between six of them and publish the winner for each location.

2 · Problem Statement

Intermodal terminals are choke points. If the yard fills past its working capacity, cranes and hostlers slow down, trucks queue at the gate, and dwell time explodes. Today these problems are noticed after they happen. The business question is simple:

“Can we see congestion and demand before it arrives, per terminal, accurately enough to act on — using only the operational data we already collect?”

What we turned that into, technically

Two supervised time-series forecasting problems:

Use case	Target (what we predict)	Horizon	Granularity
UC1 — Yard inventory	Containers on the ground (cumulative in − out)	Next 24 hours	Hourly
UC2 — Gate throughput (total)	Total gate moves per day	Next 14 days	Daily
UC2 — Gate inbound	Containers entering per day	Next 14 days	Daily
UC2 — Gate outbound	Containers leaving per day	Next 14 days	Daily
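
The UC1 target in the table above is a running balance of gate flows. A minimal sketch of that arithmetic, assuming hourly inbound and outbound counts are already available as plain lists (the function name and the `opening` parameter are illustrative, not taken from the pipeline):

```python
from itertools import accumulate

def yard_inventory(inbound, outbound, opening=0):
    """Running count of containers on the ground, one value per hour."""
    net = [i - o for i, o in zip(inbound, outbound)]
    # accumulate() with an initial value yields the opening balance first; drop it.
    return list(accumulate(net, initial=opening))[1:]

inbound = [5, 3, 0, 4]
outbound = [1, 2, 6, 0]
inventory = yard_inventory(inbound, outbound, opening=10)
```

With an opening balance of zero the same flows dip below zero in the third hour, which is how a series can go negative when units were already in the yard before the data window began.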

A congestion breach flag is derived on top of UC1: if the forecast yard inventory crosses a capacity threshold, we raise a risk alert for that terminal.
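
The breach flag can be sketched as a simple comparison. This assumes "crosses" means strictly greater than the threshold and that the capacity figure is supplied per terminal; both are assumptions, as are the names used here:

```python
def breach_hours(forecast, capacity):
    """Return the hour offsets (1 = next hour) where the forecast exceeds capacity."""
    return [h for h, value in enumerate(forecast, start=1) if value > capacity]

forecast = [900, 950, 1010, 1040, 980]   # illustrative values, not real data
hours = breach_hours(forecast, capacity=1000)
alert = bool(hours)                      # raise the risk alert if any hour breaches
```

Returning the offending hours rather than a bare boolean lets a dashboard say when the breach is expected, not only that one is coming.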

In plain words: We boiled a messy operational worry down to four concrete numbers we can forecast: how full the yard will be each hour, and how many boxes go in, out, and total through the gate each day. Everything else (the red 'congestion risk' alerts) is just comparing those forecasts to a capacity line.

3 · The Data We Have

The data we actually have

Source: equipmentActivityReported — the event feed every terminal already emits when a container/chassis is moved, gated, grounded, or picked up.
Volume: ~3.75 GB raw, 845,277 events across 32 terminals.
Window: roughly 2.5 months of continuous history (mid-March → end-May 2026).
Shape: one big MongoDB extended-JSON array → streamed into 9 NDJSON shards → aggregated into tidy gold tables with DuckDB.

The gold tables we train on

Table	Grain	Feeds
`yard_hourly`	terminal × hour	UC1 yard inventory
`gate_flow_hourly`	terminal × hour	UC2 gate (rolled up to daily)
`dwell`	per container	dwell distribution / quality
`equipment_quality`	per unit	idle / data-quality checks
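
To make the terminal × hour grain concrete, here is a standard-library stand-in for the aggregation that the pipeline performs in DuckDB. The event field names (`terminal`, `timestamp`, `direction`) and the flat ISO timestamp are assumptions for illustration; the real feed is MongoDB extended JSON and will look different.

```python
import json
from collections import Counter

def aggregate_hourly(ndjson_lines):
    """Count events per (terminal, hour, direction) from NDJSON lines."""
    counts = Counter()
    for line in ndjson_lines:
        if not line.strip():
            continue                      # tolerate blank lines in a shard
        event = json.loads(line)
        hour = event["timestamp"][:13]    # "2026-03-15T08:12:00Z" -> "2026-03-15T08"
        counts[(event["terminal"], hour, event["direction"])] += 1
    return counts

lines = [
    '{"terminal": "T01", "timestamp": "2026-03-15T08:12:00Z", "direction": "in"}',
    '{"terminal": "T01", "timestamp": "2026-03-15T08:47:00Z", "direction": "in"}',
    '{"terminal": "T01", "timestamp": "2026-03-15T08:59:00Z", "direction": "out"}',
    '{"terminal": "T02", "timestamp": "2026-03-15T09:05:00Z", "direction": "in"}',
]
hourly = aggregate_hourly(lines)
```

Because the function consumes one line at a time, it can be fed a file handle directly and never holds a whole shard in memory.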

Is this “big data, lots of moving parts”?

It is a healthy amount of raw data (3.75 GB), but after aggregation each model only sees a compact table: roughly 2.5 months of hourly or daily points for one terminal. That is exactly why training is fast. The heavy lifting is the extract-and-aggregate step, not the modelling.

In plain words: We are not collecting anything new; this is data the terminals already produce. We squeeze 3.75 GB of raw events down into small, clean per-terminal tables. Once it's tidy, the actual learning is quick because each model studies a short, focused history rather than millions of rows.

4 · Why This Isn't an NLP Project

Why this is NOT like an NLP / large-language-model project

A very common (and reasonable) worry is: “ML training takes forever — GPUs, huge model repos, weeks of compute.” That is true for deep learning on unstructured data (text, images, audio). Our problem is fundamentally different:

Aspect	NLP / LLM / deep learning	Our forecasting models
Input	Unstructured text/images, millions–billions of tokens	Structured numeric tables (a few thousand rows)
Model size	Millions–billions of parameters	Thousands of parameters (trees / coefficients)
Hardware	GPUs / clusters	One CPU / a laptop
Training time	Hours → weeks	Minutes for the whole fleet
Pre-trained weights / repos	Required (HuggingFace, etc.)	None — we train from our own data
Explainability	Hard (black box)	Easy (feature importance, coefficients)

We deliberately chose classical, tabular machine learning: gradient-boosted trees, random forests, regularised linear regression, and dedicated statistical time-series models. These are the right tool for structured operational data, and they are fast, cheap, explainable, and sustainable.

Under the hood: Tree ensembles (HistGradientBoosting, RandomForest, XGBoost) and a regularised linear model (Ridge) operate on a lag/rolling-window feature matrix. Prophet and ARIMA operate directly on the raw series. None of them trains a neural network, needs a GPU, or starts from a pre-trained checkpoint. The largest artifact we persist is a few hundred KB of joblib per model.
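
A lag/rolling-window feature matrix can be sketched in a few lines. The lag set and window length below are illustrative assumptions, not the values used in the fleet, and calendar features are left out for brevity:

```python
def make_lag_features(series, lags=(1, 2, 24), window=3):
    """Return (X, y): each row of X holds the lagged values plus the mean
    of the `window` points immediately before the target."""
    start = max(max(lags), window)        # first index with a full history
    X, y = [], []
    for t in range(start, len(series)):
        row = [series[t - lag] for lag in lags]
        row.append(sum(series[t - window:t]) / window)
        X.append(row)
        y.append(series[t])
    return X, y

X, y = make_lag_features(list(range(10)), lags=(1, 2), window=3)
```

Every feature in a row is computed from points strictly before the target, which keeps the matrix free of look-ahead leakage.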

In plain words: The scary 'AI takes forever and needs giant computers' story is about chatbots and image models. Forecasting how full a yard gets is a much smaller, well-understood maths problem. It runs on a normal computer in minutes, it doesn't need any downloaded 'AI brain', and we can actually explain why it predicts what it predicts.

5 · The Six Models & Why

The six algorithms in the competition

For every (terminal, metric) we train all six, score them on a clean hold-out, and keep the winner. Four are regression models (they predict from engineered features); two are time-series models (they read the raw series directly).

#	Algorithm	Family	Why it's in the contest
1	HistGradientBoosting	Regression (boosted trees)	Strong default for tabular data; captures non-linear patterns; very fast.
2	RandomForest	Regression (bagged trees)	Robust, low-variance, hard to over-fit; good when signal is noisy.
3	XGBoost	Regression (boosted trees)	Industry-standard gradient boosting; often the most accurate on structured data.
4	Ridge	Regression (linear)	Simple, fast, hard to beat on smooth/linear trends; a strong baseline.
5	Prophet	Time-series (additive)	Built for business series with weekly seasonality & trend; great for gate flow.
6	ARIMA	Time-series (statistical)	Classic autocorrelation model; captures momentum & mean-reversion.

Why a competition instead of one chosen model?

Different terminals behave differently. A high-volume hub shows a smooth, near-linear trend (Ridge wins); a spiky low-volume terminal has strong weekly seasonality (Prophet wins); another has momentum (ARIMA wins). Rather than force one model on everyone, we let the data decide per terminal. The leaderboard in section 6 bears this out: no single model wins everywhere.

Under the hood: Regressors use lag features (t-1, t-2, ...), rolling means, and calendar features. Each regressor is cross-validated with rolling-origin TimeSeriesSplit, then re-fit and scored on a held-out tail. Prophet/ARIMA forecast the same hold-out horizon directly and are scored with the same metric (MAE), so the comparison is apples-to-apples. Lowest hold-out MAE wins and is persisted.
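
The selection rule itself is small. The sketch below runs the same "lowest hold-out MAE wins" contest with three deliberately simple stand-in forecasters instead of the six real algorithms; all function names are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def last_value(train, horizon):
    return [train[-1]] * horizon

def seasonal_naive(train, horizon, season=7):
    # Repeat the most recent full season (here: the last 7 days).
    return [train[-season + (h % season)] for h in range(horizon)]

def historical_mean(train, horizon):
    return [sum(train) / len(train)] * horizon

def run_contest(series, horizon, candidates):
    """Score every candidate on the held-out tail and return the best one."""
    train, holdout = series[:-horizon], series[-horizon:]
    scores = {name: mae(holdout, fn(train, horizon)) for name, fn in candidates.items()}
    return min(scores, key=scores.get), scores

daily_gate = [10, 12, 14, 13, 11, 4, 3] * 4   # synthetic series with a weekly rhythm
winner, scores = run_contest(
    daily_gate,
    horizon=7,
    candidates={
        "last_value": last_value,
        "seasonal_naive": seasonal_naive,
        "historical_mean": historical_mean,
    },
)
```

On this synthetic weekly series the seasonal candidate wins, which mirrors the intuition given for daily gate traffic.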

In plain words: We hire six forecasters, give them the same exam (predict a stretch of recent days they were never shown), and keep whoever scores best for each terminal. Some terminals are predictable and favour the simple forecaster; others are seasonal and favour the seasonal one. Letting the data pick the winner is why the system stays accurate across very different locations.

6 · The Leaderboard (Real Results)

The leaderboard — who won, fleet-wide

Across all 127 trained models, here is how often each algorithm produced the most accurate forecast and therefore got deployed:

Algorithm	Series it won	Win share
🥇 Prophet	45	35.4%
🥈 Ridge	21	16.5%
🥉 HistGradientBoosting	19	15.0%
ARIMA	18	14.2%
RandomForest	16	12.6%
XGBoost	8	6.3%
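
The win shares follow directly from the win counts, and the counts add up to the 127 trained models. A quick check of that arithmetic:

```python
wins = {
    "Prophet": 45,
    "Ridge": 21,
    "HistGradientBoosting": 19,
    "ARIMA": 18,
    "RandomForest": 16,
    "XGBoost": 8,
}
total = sum(wins.values())
# Win share as a percentage of all trained models, rounded to one decimal.
shares = {name: round(100 * count / total, 1) for name, count in wins.items()}
```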

Fleet model wins chart — Win share by algorithm across the fleet.

How to read this

Prophet dominates the daily gate series — expected, because gate traffic has a strong weekly rhythm (quiet weekends, busy mid-week).
Ridge wins most yard-inventory series — hourly inventory moves smoothly, so a clean linear model with lag features is hard to beat.
XGBoost / RandomForest / HistGradientBoosting / ARIMA each win the awkward terminals the others can't — which is exactly why we keep them in the contest.

In plain words: No single forecaster is best everywhere. The seasonal one wins the gate; the simple one wins the yard; the others mop up the tricky locations. Keeping all six is what makes the fleet accurate as a whole.

7 · How Training Works & How Long

How training actually works (the loop)

1. Pull the gold table for a terminal (hourly yard or daily gate).
2. Build features for the regressors: lag values, rolling averages, hour/day-of-week.
3. Cross-validate each regressor with rolling-origin TimeSeriesSplit (always train on the past, test on the future, never peek ahead).
4. Hold-out test: chop off the most recent stretch, predict it, measure error (MAE / MAPE).
5. Time-series models (Prophet, ARIMA) forecast that same hold-out directly.
6. Pick the winner (lowest hold-out MAE), re-fit on all available history, and save it to models/<terminal>_<series>.joblib.
7. Repeat for every terminal × every series.
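
The "never peek ahead" rule in the cross-validation step comes from how the folds are cut. The hand-rolled generator below follows the expanding-window scheme that scikit-learn's TimeSeriesSplit uses by default, as far as we understand it; it is a dependency-free illustration, not the fleet's training code.

```python
def rolling_origin_splits(n_samples, n_splits=3):
    """Yield (train_indices, test_indices) with an expanding training window.

    Each test block sits immediately after its training block, so a model
    is always evaluated on points later than anything it was fitted on.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        yield list(range(0, start)), list(range(start, start + test_size))

splits = list(rolling_origin_splits(12, n_splits=3))
```

With 12 points and 3 splits, the folds train on the first 3, 6, and 9 points and test on the 3 points that follow each.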

How long does it take — and why so fast?

The entire fleet (127 models across 32 terminals) trains in about 17.1 minutes on a single laptop CPU, which works out to roughly 8 seconds per saved model on average, contest included. It's fast because each model studies a small, aggregated table (weeks of points, not millions of rows), the algorithms are lightweight, and there are no GPUs or pre-trained weights to load.

Step	How often	Cost
Extract 3.75 GB raw → NDJSON shards	Once	Minutes (streamed, low memory)
Aggregate → gold tables (DuckDB)	Per data refresh	Seconds–minutes
Train whole fleet (6-way contest)	Per retrain	~17.1 minutes
Draw a prediction (serve)	Every request	Milliseconds

In plain words: Training the whole network takes minutes, not hours, because each forecaster reads a short, tidy history instead of a mountain of raw events. The slow part is the one-time tidy-up of the raw feed; after that, retraining is cheap and producing a single prediction is instant.

8 · How a Prediction Is Drawn

How a prediction is drawn at serving time

Nothing is retrained at serving time. We load the already-trained winner and ask it for the next horizon:

1. Load models/<terminal>_<series>.joblib (the saved winner + its feature list).
2. Assemble the latest features (most recent lags / rolling windows) from current data.
3. Call model.predict(...) → the 24-hour or 14-day forecast.
4. For UC1, compare the forecast to the capacity threshold → raise a congestion breach flag if it crosses.
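
The steps above do not say how a lag-based regressor produces a multi-step forecast. One common approach, shown here purely as an assumption, is recursive forecasting: predict one step, append it to the history, rebuild the lags, and repeat. `MeanOfLags` is a hypothetical stand-in for a loaded model.

```python
class MeanOfLags:
    """Hypothetical stand-in for a saved regressor: predicts the mean of its lag features."""
    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]

def recursive_forecast(model, history, horizon, lags=(1, 2, 3)):
    """Roll a one-step model forward `horizon` steps, feeding predictions back in as lags."""
    values = list(history)
    forecast = []
    for _ in range(horizon):
        row = [values[-lag] for lag in lags]   # most recent lags, newest first
        next_value = model.predict([row])[0]
        forecast.append(next_value)
        values.append(next_value)              # the prediction becomes history
    return forecast

forecast = recursive_forecast(MeanOfLags(), history=[3, 6, 9], horizon=2)
```

Errors can compound with this scheme over long horizons; fitting one model per horizon step is a known alternative.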

This is the same call the live app makes (/api/predict/<terminal>/<series>), and it returns in milliseconds.

Under the hood: Prophet/ARIMA winners are wrapped in a small TSWrapper whose predict() returns the stored forecast, so every saved model, regression or time-series, exposes the same predict() interface to the serving layer. The wrapper lives in a shared importable module so joblib can unpickle it in the app process.
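
A wrapper of this kind might look like the sketch below. Only the class name and the predict() behaviour come from the description above; the constructor, the ignored argument, and the `ConstantRegressor` stand-in are assumptions.

```python
class TSWrapper:
    """Gives a pre-computed time-series forecast the same predict() interface as a regressor."""
    def __init__(self, forecast):
        self.forecast = list(forecast)

    def predict(self, X=None):
        # X is accepted only for interface compatibility and is ignored.
        return list(self.forecast)

class ConstantRegressor:
    """Hypothetical stand-in for a feature-based winner."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]

models = {
    "gate_total": TSWrapper([120.0, 118.5, 131.0]),
    "yard": ConstantRegressor(42.0),
}
# The serving layer can call predict() without knowing which family won.
outputs = {name: model.predict([[0.0]]) for name, model in models.items()}
```

Defining the class in a module that both the training script and the app import is what lets pickle-based loaders resolve it by name at load time.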

Yard inventory forecast — Example: 24-hour yard inventory forecast vs. the capacity line.

Gate total forecast — Example: 14-day gate throughput forecast.

In plain words: To answer 'how full will the yard be tonight?' we don't re-train anything; we open the forecaster we already saved for that terminal, feed it the latest readings, and it instantly returns the next 24 hours. If that crosses the capacity line, the dashboard turns red.

9 · Ongoing Training & Fabric

Ongoing / continuous training in production

Two loops run at different speeds, with a drift watch alongside them:

Loop	What it does	How often
Inference loop	Load saved models, produce fresh forecasts as new data lands	Continuously / hourly
Retrain loop	Re-run the 6-way contest on the newest window, re-pick winners	Nightly or weekly
Drift watch	Track live error (MAE/MAPE); if it degrades, trigger a retrain	Always-on

Because a full retrain is only ~17.1 minutes, we can afford to retrain often — even nightly — which keeps the models aligned with the most recent operational reality (seasonal shifts, volume changes, new equipment).
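
A drift watch can be as simple as comparing a rolling live MAE with the hold-out MAE recorded at training time. The sketch below assumes a fixed tolerance multiplier and a fixed window; both values and all names are illustrative.

```python
from collections import deque

class DriftWatch:
    """Track recent absolute errors and signal when live MAE degrades past a tolerance."""
    def __init__(self, baseline_mae, tolerance=1.5, window=24):
        self.baseline_mae = baseline_mae
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)   # keeps only the most recent errors

    def record(self, actual, predicted):
        self.errors.append(abs(actual - predicted))

    def live_mae(self):
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def needs_retrain(self):
        # Wait for a full window so a single bad hour cannot trigger a retrain.
        full = len(self.errors) == self.errors.maxlen
        return full and self.live_mae() > self.tolerance * self.baseline_mae

watch = DriftWatch(baseline_mae=10.0, tolerance=1.5, window=3)
for actual, predicted in [(100, 95), (100, 90), (100, 92)]:
    watch.record(actual, predicted)
healthy = watch.needs_retrain()

for actual, predicted in [(100, 60), (100, 65), (100, 70)]:
    watch.record(actual, predicted)
degraded = watch.needs_retrain()
```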

How this maps to Microsoft Fabric (production)

Local POC	Fabric production
NDJSON shards	Eventstream / Data Factory ingest
DuckDB gold parquet	Lakehouse Delta tables (medallion)
`train_fleet.py`	Spark notebook on a schedule
`.joblib` winners	MLflow model registry
Flask app + docs	Power BI + Copilot

In plain words: In production two things happen on a clock: every hour the system makes fresh forecasts from saved models, and every night (or week) it re-runs the contest on the latest data so the winners never go stale. If accuracy ever slips, it retrains itself. Because retraining is so cheap, the system stays current without anyone babysitting it.

10 · Sustainability & World-Class

Is the approach sustainable?

Yes — deliberately so. The design favours simple, explainable, CPU-only models that retrain in minutes, store as small files, and run anywhere (laptop → Fabric) with the same code path. There is no GPU bill, no giant model repo, no opaque black box. That is what makes it cheap to run and easy to trust.

What would make it genuinely world-class

Upgrade	What it adds
Prediction intervals	Not just “52,000 containers” but “52,000 ± 1,500, 90% confidence” — so operators see risk, not just a point.
Exogenous drivers	Feed in vessel ETAs, rail schedules, holidays, weather — the things that actually push volume.
Automated hyper-parameter tuning	Optuna/sweeps per terminal to squeeze out the last few % of accuracy.
Backtesting harness	Replay months of history to certify accuracy before trusting a model in production.
Drift detection + auto-retrain	Trigger retrains on data/concept drift automatically (close the loop).
Global / hierarchical models	One model that shares signal across terminals, reconciled to per-terminal totals.
MLflow registry + CI/CD	Versioned models, staged promotion, reproducible deploys.
Anomaly & data-quality gates	Catch the negative-inventory / zero-volume edge cases before they reach a forecast.
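
As an indication of how light the first upgrade could be, one simple way to get a 90% band is to shift the point forecast by the 5th and 95th percentiles of past hold-out residuals. This is a sketch of one possible technique, not something the POC does today, and the residuals below are synthetic.

```python
from statistics import quantiles

def interval_90(point_forecast, residuals):
    """Shift the point forecast by the 5th and 95th percentiles of past residuals."""
    cuts = quantiles(residuals, n=20)   # 19 cut points: 5%, 10%, ..., 95%
    return point_forecast + cuts[0], point_forecast + cuts[-1]

# Residuals are actual minus predicted on a hold-out period (synthetic here).
residuals = list(range(-10, 11))
low, high = interval_90(100.0, residuals)
```

This assumes future errors resemble past ones; methods such as quantile regression or conformal prediction give stronger guarantees.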

Honest limitations today

A handful of terminals show negative cumulative inventory (outbound > inbound in the window, because units arrived before our data window begins); their breach flags are artifacts, not real congestion.
Low-volume gate series have high percentage error (small denominators) even when absolute error is tiny.
2.5 months of history is enough for weekly patterns but not yet for yearly seasonality; the same pipeline should be able to pick that up once at least a full year of data is available.
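
The small-denominator effect in the second limitation is easy to show with numbers. In the sketch below, a busy and a quiet gate series have the same absolute error but very different percentage errors; the values are made up for illustration.

```python
def mae(actual, predicted):
    """Mean absolute error, in the same units as the series."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; zero actuals are skipped."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)

busy_actual, busy_pred = [500, 520, 480], [498, 522, 482]
quiet_actual, quiet_pred = [4, 5, 4], [2, 7, 6]

busy_mae, quiet_mae = mae(busy_actual, busy_pred), mae(quiet_actual, quiet_pred)
busy_mape, quiet_mape = mape(busy_actual, busy_pred), mape(quiet_actual, quiet_pred)
```

Both forecasts miss by two containers a day, yet the quiet terminal's percentage error is more than a hundred times larger, which is why MAE rather than MAPE decides the contest.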

In plain words: We chose the boring-but-reliable route on purpose: small, explainable models that are cheap to run and easy to trust. To go from 'solid proof-of-concept' to 'world-class', the next steps are confidence ranges on every number, plugging in real-world drivers like ship arrivals and weather, and letting the system tune and retrain itself automatically. None of that needs exotic AI; it's careful engineering on the same sustainable foundation.

Generated 30 May 2026, 15:57 from real fleet results — 127 models, 32 terminals, ~17.1 min training. Rebuild with python build_technical.py.