1 · Executive Overview

This document explains, from a technical perspective, what we built for the Intermodal Operations Hub (IO Hub) predictive analytics proof-of-concept — the problem we are solving, the data we have, the models we chose and why, how long training takes, and how predictions are produced and kept fresh. Every technical section is paired with a plain-language summary so a non-specialist can follow along.

  • Models trained: 127 (one per terminal × series)
  • Terminals: 32 (the whole fleet)
  • Training time: 17.1 min (not hours)
  • Algorithms competing: 6 (best one wins per series)

The one-paragraph version

We forecast two operational things for every terminal in the network: how full the yard will get over the next 24 hours (so we can warn about congestion before it happens) and how many containers will move through the gate over the next 14 days (inbound, outbound, and total — for staffing and equipment planning). For each terminal and each metric we let six different algorithms compete and we keep whichever one is most accurate on data it has never seen. The whole fleet — 127 models across 32 terminals — trains in about 17.1 minutes on a single laptop.

In plain words: Think of it like a weather forecast for the container yard and the gate. We don't guess; we learn the patterns from the last two and a half months of real activity and project them forward. And instead of trusting one forecaster, we run a small contest between six of them and publish the winner for each location.

2 · Problem Statement

Intermodal terminals are choke points. If the yard fills past its working capacity, cranes and hostlers slow down, trucks queue at the gate, and dwell time explodes. Today these problems are noticed after they happen. The business question is simple:

“Can we see congestion and demand before it arrives, per terminal, accurately enough to act on — using only the operational data we already collect?”

What we turned that into, technically

Two supervised time-series forecasting problems:

Use case | Target (what we predict) | Horizon | Granularity
UC1: Yard inventory | Containers on the ground (cumulative in − out) | Next 24 hours | Hourly
UC2: Gate throughput (total) | Total gate moves per day | Next 14 days | Daily
UC2: Gate inbound | Containers entering per day | Next 14 days | Daily
UC2: Gate outbound | Containers leaving per day | Next 14 days | Daily

A congestion breach flag is derived on top of UC1: if the forecast yard inventory crosses a capacity threshold, we raise a risk alert for that terminal.

In plain words: We boiled a messy operational worry down to four concrete numbers we can forecast: how full the yard will be each hour, and how many boxes go in, out, and total through the gate each day. Everything else (the red 'congestion risk' alerts) is just comparing those forecasts to a capacity line.
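The breach flag described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function name, the sample forecast, and the threshold value are all made up, and treating "crosses" as "strictly greater than" is an assumption.

```python
from typing import List, Optional

def first_breach_hour(forecast: List[float], capacity: float) -> Optional[int]:
    """Return how many hours ahead (0-based) the forecast first exceeds
    capacity, or None if it stays at or below the threshold."""
    for hour, value in enumerate(forecast):
        if value > capacity:  # assumption: "crosses" means strictly above
            return hour
    return None

# Hypothetical 24-hour yard forecast (containers on the ground): 410, 418, ..., 594
forecast_24h = [410 + 8 * h for h in range(24)]
capacity_threshold = 550.0  # hypothetical working capacity for this terminal

breach_at = first_breach_hour(forecast_24h, capacity_threshold)
breach_flag = breach_at is not None  # this is what would drive the risk alert
```

Returning the hour of the first breach, rather than only a yes/no flag, lets an alert say how much warning time is left.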

3 · The Data We Have

The data we actually have

  • Source: equipmentActivityReported — the event feed every terminal already emits when a container/chassis is moved, gated, grounded, or picked up.
  • Volume: ~3.75 GB raw, 845,277 events across 32 terminals.
  • Window: roughly 2.5 months of continuous history (mid-March → end-May 2026).
  • Shape: one big MongoDB extended-JSON array → streamed into 9 NDJSON shards → aggregated into tidy gold tables with DuckDB.

The gold tables we train on

Table | Grain | Feeds
yard_hourly | terminal × hour | UC1 yard inventory
gate_flow_hourly | terminal × hour | UC2 gate (rolled up to daily)
dwell | per container | dwell distribution / quality
equipment_quality | per unit | idle / data-quality checks
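To make the yard_hourly grain concrete, here is a rough sketch of how raw events could be rolled up into a cumulative in − out inventory per terminal and hour. The POC does this step in DuckDB; the pure-Python version below is only an illustration, and the field names (terminal, ts, direction) are hypothetical rather than the real equipmentActivityReported schema.

```python
from collections import defaultdict
from datetime import datetime
from itertools import accumulate

# Hypothetical, simplified events; the real feed has a richer schema.
events = [
    {"terminal": "T01", "ts": "2026-03-15T08:05:00", "direction": "in"},
    {"terminal": "T01", "ts": "2026-03-15T08:40:00", "direction": "in"},
    {"terminal": "T01", "ts": "2026-03-15T09:10:00", "direction": "out"},
    {"terminal": "T01", "ts": "2026-03-15T10:30:00", "direction": "in"},
]

def yard_hourly(events):
    """Return {terminal: [(hour_start, inventory), ...]} sorted by hour."""
    net = defaultdict(lambda: defaultdict(int))
    for e in events:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(e["ts"]).replace(minute=0, second=0, microsecond=0)
        net[e["terminal"]][hour] += 1 if e["direction"] == "in" else -1
    table = {}
    for terminal, by_hour in net.items():
        hours = sorted(by_hour)
        # Running sum of net moves = containers on the ground, counted from zero.
        running = accumulate(by_hour[h] for h in hours)
        table[terminal] = list(zip(hours, running))
    return table

gold = yard_hourly(events)
```

Two simplifications are worth noting: hours with no events are skipped rather than filled forward, and the running sum starts from zero, so containers already on the ground before the data window are not counted.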

Is this “big data, lots of moving parts”?

It is a healthy amount of raw data (3.75 GB) but after aggregation each model only sees a compact table: roughly 2.5 months of hourly or daily points for one terminal. That is exactly why training is fast. The heavy lifting is the one-time extract & aggregate step, not the modelling.

In plain words: We are not collecting anything new; this is data the terminals already produce. We squeeze 3.75 GB of raw events down into small, clean per-terminal tables. Once it's tidy, the actual learning is quick because each model studies a short, focused history rather than millions of rows.

4 · Why This Isn't an NLP Project

Why this is NOT like an NLP / large-language-model project

A very common (and reasonable) worry is: “ML training takes forever — GPUs, huge model repos, weeks of compute.” That is true for deep learning on unstructured data (text, images, audio). Our problem is fundamentally different:

Aspect | NLP / LLM / deep learning | Our forecasting models
Input | Unstructured text/images, millions–billions of tokens | Structured numeric tables (a few thousand rows)
Model size | Millions–billions of parameters | Thousands of parameters (trees / coefficients)
Hardware | GPUs / clusters | One CPU / a laptop
Training time | Hours → weeks | Minutes for the whole fleet
Pre-trained weights / repos | Required (HuggingFace, etc.) | None; we train from our own data
Explainability | Hard (black box) | Easy (feature importance, coefficients)

We deliberately chose classical, tabular machine-learning: gradient-boosted trees, random forests, linear regression, and dedicated statistical time-series models. These are the right tool for structured operational data and they are fast, cheap, explainable, and sustainable.

Under the hood: Tree ensembles (HistGradientBoosting, RandomForest, XGBoost) and a regularised linear model (Ridge) operate on a lag/rolling-window feature matrix. Prophet and ARIMA operate directly on the raw series. None of them require backprop, GPUs, or pre-trained checkpoints. The largest artifact we persist is a few hundred KB of joblib per model.
In plain words: The scary 'AI takes forever and needs giant computers' story is about chatbots and image models. Forecasting how full a yard gets is a much smaller, well-understood maths problem. It runs on a normal computer in minutes, it doesn't need any downloaded 'AI brain', and we can actually explain why it predicts what it predicts.
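As an illustration of what a lag/rolling-window feature matrix looks like, the sketch below builds one from a plain list of values. The function name, the chosen lags (1, 2, 24), and the window length are assumptions for the example, not the settings used in the POC, and calendar features are left out for brevity.

```python
from statistics import mean

def make_features(series, lags=(1, 2, 24), window=3):
    """Build one feature row per time step that has enough history.

    Returns (X, y) where X[i] holds the lagged values followed by the mean
    of the previous `window` points, and y[i] is the value to predict.
    """
    start = max(max(lags), window)  # first index with a full history
    X, y = [], []
    for t in range(start, len(series)):
        row = [series[t - k] for k in lags]    # lag features: t-1, t-2, t-24
        row.append(mean(series[t - window:t])) # rolling mean over past values only
        X.append(row)
        y.append(series[t])
    return X, y

series = list(range(30))  # toy hourly series: 0, 1, ..., 29
X, y = make_features(series)
```

Every feature is computed from values strictly before time t, so a model trained on X never sees the value it is asked to predict.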

5 · The Six Models & Why

The six algorithms in the competition

For every (terminal, metric) we train all six, score them on a clean hold-out, and keep the winner. Four are regression models (they predict from engineered features); two are time-series models (they read the raw series directly).

# | Algorithm | Family | Why it's in the contest
1 | HistGradientBoosting | Regression (boosted trees) | Strong default for tabular data; captures non-linear patterns; very fast.
2 | RandomForest | Regression (bagged trees) | Robust, low-variance, hard to over-fit; good when signal is noisy.
3 | XGBoost | Regression (boosted trees) | Industry-standard gradient boosting; often the most accurate on structured data.
4 | Ridge | Regression (linear) | Simple, fast, hard to beat on smooth/linear trends; a strong baseline.
5 | Prophet | Time-series (additive) | Built for business series with weekly seasonality & trend; great for gate flow.
6 | ARIMA | Time-series (statistical) | Classic autocorrelation model; captures momentum & mean-reversion.

Why a competition instead of one chosen model?

Different terminals behave differently. A high-volume hub looks linear and trendy (Ridge wins); a spiky low-volume terminal has strong weekly seasonality (Prophet wins); another has momentum (ARIMA wins). Rather than force one model on everyone, we let the data decide per terminal. The leaderboard in section 6 bears this out: no single model wins everywhere.

Under the hood: Regressors use lag features (t-1, t-2, ...), rolling means, and calendar features. Each regressor is cross-validated with rolling-origin TimeSeriesSplit, then re-fit and scored on a held-out tail. Prophet/ARIMA forecast the same hold-out horizon directly and are scored with the same MAE metric, so the comparison is apples-to-apples. Lowest hold-out MAE wins and is persisted.
In plain words: We hire six forecasters, give them the same exam (predict a stretch of recent days they were never shown), and keep whoever scores best for each terminal. Some terminals are predictable and favour the simple forecaster; others are seasonal and favour the seasonal one. Letting the data pick the winner is why the system stays accurate across very different locations.
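The contest logic itself is small. The sketch below shows the "same exam, lowest hold-out MAE wins" idea using three deliberately trivial stand-in forecasters instead of the six real algorithms; all names and the toy series are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Stand-in forecasters (NOT the six real algorithms): each takes the
# training history and a horizon and returns `horizon` predictions.
def last_value(train, horizon):
    return [train[-1]] * horizon

def train_mean(train, horizon):
    return [sum(train) / len(train)] * horizon

def seasonal_naive(train, horizon, period=7):
    # Repeat the last full week of the training data.
    return [train[-period + (h % period)] for h in range(horizon)]

CANDIDATES = {
    "last_value": last_value,
    "train_mean": train_mean,
    "seasonal_naive": seasonal_naive,
}

def run_contest(series, horizon):
    """Hold out the last `horizon` points, score every candidate, pick the best."""
    train, holdout = series[:-horizon], series[-horizon:]
    scores = {name: mae(holdout, f(train, horizon)) for name, f in CANDIDATES.items()}
    winner = min(scores, key=scores.get)  # lowest hold-out MAE wins
    return winner, scores

# Toy daily gate series with a weekly rhythm (busy mid-week, quiet weekend).
weekly = [100, 120, 130, 125, 110, 40, 30] * 6
winner, scores = run_contest(weekly, horizon=14)
```

On this perfectly periodic toy series the seasonal stand-in wins with zero error, which mirrors why a seasonality-aware model tends to do well on gate flow.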

6 · The Leaderboard (Real Results)

The leaderboard — who won, fleet-wide

Across all 127 trained models, here is how often each algorithm produced the most accurate forecast and therefore got deployed:

Algorithm | Series it won | Win share
🥇 Prophet | 45 | 35.4%
🥈 Ridge | 21 | 16.5%
🥉 HistGradientBoosting | 19 | 15.0%
ARIMA | 18 | 14.2%
RandomForest | 16 | 12.6%
XGBoost | 8 | 6.3%
[Figure: Fleet model wins chart. Win share by algorithm across the fleet.]

How to read this

  • Prophet dominates the daily gate series — expected, because gate traffic has a strong weekly rhythm (quiet weekends, busy mid-week).
  • Ridge wins most yard-inventory series — hourly inventory moves smoothly, so a clean linear model with lag features is hard to beat.
  • XGBoost / RandomForest / HistGradientBoosting / ARIMA each win the awkward terminals the others can't — which is exactly why we keep them in the contest.
In plain words: No single forecaster is best everywhere. The seasonal one wins the gate; the simple one wins the yard; the others mop up the tricky locations. Keeping all six is what makes the fleet accurate as a whole.
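For anyone who wants to check the leaderboard arithmetic, the win shares follow directly from the win counts reported above. The snippet only reuses those published counts; the variable names are arbitrary.

```python
# Win counts taken from the leaderboard table.
wins = {
    "Prophet": 45,
    "Ridge": 21,
    "HistGradientBoosting": 19,
    "ARIMA": 18,
    "RandomForest": 16,
    "XGBoost": 8,
}

total = sum(wins.values())  # should equal the number of trained models
shares = {name: round(100 * n / total, 1) for name, n in wins.items()}
```

The counts add up to the 127 trained models, and each share is simply wins divided by that total.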

7 · How Training Works & How Long

How training actually works (the loop)

  1. Pull the gold table for a terminal (hourly yard or daily gate).
  2. Build features — lag values, rolling averages, hour/day-of-week — for the regressors.
  3. Cross-validate each regressor with rolling-origin TimeSeriesSplit (always train on the past, test on the future — never peek ahead).
  4. Hold-out test: chop off the most recent stretch, predict it, measure error (MAE / MAPE).
  5. Time-series models (Prophet, ARIMA) forecast that same hold-out directly.
  6. Pick the winner (lowest hold-out MAE), re-fit on all available history, and save it to models/<terminal>_<series>.joblib.
  7. Repeat for every terminal × every series.
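Step 3 in the loop above relies on rolling-origin splitting. The sketch below is a hand-rolled illustration of that idea (an expanding training window followed by a test block that always lies in the future). It is not scikit-learn's TimeSeriesSplit implementation, and the function name and parameters are chosen for this example only.

```python
def rolling_origin_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs with an expanding train window.

    Every test block comes strictly after its training block, so the model
    is always evaluated on data it could not have seen.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

# Ten time steps, three folds, two test points per fold.
splits = list(rolling_origin_splits(n_samples=10, n_splits=3, test_size=2))
```

With these settings the folds train on steps 0-3, 0-5, and 0-7 and test on steps 4-5, 6-7, and 8-9 respectively, which is the "never peek ahead" property the training loop depends on.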

How long does it take — and why so fast?

The entire fleet (127 models across 32 terminals) trains in about 17.1 minutes on a single laptop CPU, which averages out to roughly 8 seconds per terminal-series contest. It's fast because each model studies a small, aggregated table (weeks of points, not millions of rows), the algorithms are lightweight, and there are no GPUs or pre-trained weights to load.

Step | One-time? | Cost
Extract 3.75 GB raw → NDJSON shards | Yes | Minutes (streamed, low memory)
Aggregate → gold tables (DuckDB) | Per data refresh | Seconds–minutes
Train whole fleet (6-way contest) | Per retrain | ~17.1 minutes
Draw a prediction (serve) | Every request | Milliseconds
In plain words: Training the whole network takes minutes, not hours, because each forecaster reads a short, tidy history instead of a mountain of raw events. The slow part is the one-time tidy-up of the raw feed; after that, retraining is cheap and producing a single prediction is instant.

8 · How a Prediction Is Drawn

How a prediction is drawn at serving time

Predictions are not recomputed from scratch. We load the already-trained winner and ask it for the next horizon:

  1. Load models/<terminal>_<series>.joblib (the saved winner + its feature list).
  2. Assemble the latest features (most recent lags / rolling windows) from current data.
  3. Call model.predict(...) → the 24-hour or 14-day forecast.
  4. For UC1, compare the forecast to the capacity threshold → raise a congestion breach flag if it crosses.

This is the same call the live app makes (/api/predict/<terminal>/<series>), and it returns in milliseconds.

Under the hood: Prophet/ARIMA winners are wrapped in a small TSWrapper whose predict() returns the stored forecast, so every saved model, regression or time-series, exposes the same predict() interface to the serving layer. The wrapper lives in a shared importable module so joblib can unpickle it in the app process.
[Figure: Yard inventory forecast. Example: 24-hour yard inventory forecast vs. the capacity line.]
[Figure: Gate total forecast. Example: 14-day gate throughput forecast.]
In plain words: To answer 'how full will the yard be tonight?' we don't re-train anything. We open the forecaster we already saved for that terminal, feed it the latest readings, and it instantly returns the next 24 hours. If that crosses the capacity line, the dashboard turns red.
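The uniform predict() interface mentioned above might look roughly like this. Only the name TSWrapper comes from the text; its constructor, the stand-in regressor, and the serve() helper are hypothetical, and the joblib load step is omitted so the sketch stays self-contained.

```python
class TSWrapper:
    """Gives a pre-computed time-series forecast the same predict()
    surface as a fitted regressor."""

    def __init__(self, forecast):
        self.forecast = list(forecast)

    def predict(self, X=None):
        # X is accepted for interface compatibility but ignored:
        # the forecast was already produced at training time.
        return list(self.forecast)


class LastLagRegressor:
    """Hypothetical stand-in for a fitted regressor: predicts the first
    feature (the t-1 lag) of each row."""

    def predict(self, X):
        return [row[0] for row in X]


def serve(model, features=None):
    """The serving layer only needs to know about predict()."""
    return model.predict(features)


ts_forecast = serve(TSWrapper([120.0, 118.5, 131.0]))
reg_forecast = serve(LastLagRegressor(), [[5, 4], [6, 5]])
```

Because both kinds of winner answer the same call, the serving code never has to branch on which algorithm won for a given terminal.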

9 · Ongoing Training & Fabric

Ongoing / continuous training in production

Two loops run at different speeds, with a drift watch running alongside them:

Loop | What it does | How often
Inference loop | Load saved models, produce fresh forecasts as new data lands | Continuously / hourly
Retrain loop | Re-run the 6-way contest on the newest window, re-pick winners | Nightly or weekly
Drift watch | Track live error (MAE/MAPE); if it degrades, trigger a retrain | Always-on

Because a full retrain is only ~17.1 minutes, we can afford to retrain often — even nightly — which keeps the models aligned with the most recent operational reality (seasonal shifts, volume changes, new equipment).

How this maps to Microsoft Fabric (production)

Local POC | Fabric production
NDJSON shards | Eventstream / Data Factory ingest
DuckDB gold parquet | Lakehouse Delta tables (medallion)
train_fleet.py | Spark notebook on a schedule
.joblib winners | MLflow model registry
Flask app + docs | Power BI + Copilot
In plain words: In production two things happen on a clock: every hour the system makes fresh forecasts from saved models, and every night (or week) it re-runs the contest on the latest data so the winners never go stale. If accuracy ever slips, it retrains itself. Because retraining is so cheap, the system stays current without anyone babysitting it.
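One simple way to implement the drift watch is to compare a rolling live MAE against the hold-out MAE recorded at training time. The sketch below assumes exactly that rule; the class name, the window length, and the 1.5x tolerance are illustrative choices, not values from the POC.

```python
from collections import deque

class DriftWatch:
    """Signals a retrain when the rolling live MAE exceeds
    tolerance x the baseline (hold-out) MAE."""

    def __init__(self, baseline_mae, window=24, tolerance=1.5):
        self.baseline_mae = baseline_mae
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)  # keeps only the most recent errors

    def observe(self, actual, predicted):
        """Record one live error; return True when a retrain should fire."""
        self.errors.append(abs(actual - predicted))
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough evidence yet
        live_mae = sum(self.errors) / len(self.errors)
        return live_mae > self.tolerance * self.baseline_mae

# Small window so the example is easy to follow by hand.
watch = DriftWatch(baseline_mae=10.0, window=3, tolerance=1.5)
observations = [(100, 95), (100, 90), (100, 70), (100, 60)]
signals = [watch.observe(actual, predicted) for actual, predicted in observations]
```

Waiting for a full window before signalling avoids retraining on a single bad hour, at the cost of reacting a little later to genuine drift.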

10 · Sustainability & World-Class

Is the approach sustainable?

Yes — deliberately so. The design favours simple, explainable, CPU-only models that retrain in minutes, store as small files, and run anywhere (laptop → Fabric) with the same code path. There is no GPU bill, no giant model repo, no opaque black box. That is what makes it cheap to run and easy to trust.

What would make it genuinely world-class

Upgrade | What it adds
Prediction intervals | Not just “52,000 containers” but “52,000 ± 1,500, 90% confidence”, so operators see risk, not just a point.
Exogenous drivers | Feed in vessel ETAs, rail schedules, holidays, weather: the things that actually push volume.
Automated hyper-parameter tuning | Optuna/sweeps per terminal to squeeze out the last few % of accuracy.
Backtesting harness | Replay months of history to certify accuracy before trusting a model in production.
Drift detection + auto-retrain | Trigger retrains on data/concept drift automatically (close the loop).
Global / hierarchical models | One model that shares signal across terminals, reconciled to per-terminal totals.
MLflow registry + CI/CD | Versioned models, staged promotion, reproducible deploys.
Anomaly & data-quality gates | Catch the negative-inventory / zero-volume edge cases before they reach a forecast.

Honest limitations today

  • A handful of terminals show negative cumulative inventory (outbound > inbound in the window because units arrived before our data starts) — their breach flags are artifacts, not real congestion.
  • Low-volume gate series have high percentage error (small denominators) even when absolute error is tiny.
  • 2.5 months of history is enough for weekly patterns but not yet for yearly seasonality, which the same pipeline can only start to capture once at least a full year of data is available.
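The low-volume limitation above is easy to see with a toy comparison. Both series below are invented for illustration: each forecast is off by exactly two moves per day, yet the percentage error differs by orders of magnitude.

```python
def mae(actual, predicted):
    """Mean absolute error, in the same units as the series."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.
    Undefined when any actual value is 0 (division by zero)."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical busy terminal: about a thousand gate moves per day.
busy_actual, busy_pred = [1000, 1200, 1100], [1002, 1198, 1102]
# Hypothetical quiet terminal: a handful of gate moves per day.
quiet_actual, quiet_pred = [4, 5, 4], [6, 3, 6]

busy_mae, busy_mape = mae(busy_actual, busy_pred), mape(busy_actual, busy_pred)
quiet_mae, quiet_mape = mae(quiet_actual, quiet_pred), mape(quiet_actual, quiet_pred)
```

The absolute error is identical for both terminals, while the quiet terminal's MAPE is far larger purely because its denominators are small. This is one reason to read MAE alongside MAPE for low-volume series.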
In plain words: We chose the boring-but-reliable route on purpose: small, explainable models that are cheap to run and easy to trust. To go from 'solid proof-of-concept' to 'world-class', the next steps are confidence ranges on every number, plugging in real-world drivers like ship arrivals and weather, and letting the system tune and retrain itself automatically. None of that needs exotic AI; it's careful engineering on the same sustainable foundation.