1 · What exactly is a “model”?
This is the single most important idea in the whole project, so we start here and we go slowly. When the app says “127 models”, here is precisely what that means.
The one-sentence answer
A model is not magic, and it is not downloaded
We did not download a model from the internet. We did not use ChatGPT. Each model is built from your own data using a mathematical procedure called supervised regression. Think of it like this:
| Everyday analogy | What it maps to here |
|---|---|
| You notice traffic is always bad at 5pm on weekdays. | The model notices yard inventory rises on certain hours/days. |
| After weeks of commuting you can predict tomorrow’s 5pm traffic. | The model predicts the next hour’s / next day’s value. |
| Your “experience” lives in your head. | The model’s “experience” lives in a .joblib file on disk. |
What is literally inside one .joblib file
When you click “Load & run the real .joblib model” in the app, Python opens a file like models/BED_yard.joblib. Inside it is a Python dictionary with four things:
{
"model": <the trained estimator>, # e.g. a Ridge regression or a RandomForest
"features": ["lag_1", "lag_24", "roll_mean_24", "hour", "dow", ...],
"best_name": "Ridge", # which algorithm won for this terminal
"target": "inventory" # the number it predicts
}
So a “model” = a trained estimator + the list of inputs it expects + a label of what it predicts. That is the whole thing. It is typically a few hundred kilobytes. It loads in milliseconds. It does not need the internet.
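A minimal sketch of that save-and-load round trip, assuming scikit-learn and joblib are installed. The synthetic training rows, the temporary file path and the shortened three-item feature list are illustrative assumptions, not the project's real data; only the four dictionary keys come from the description above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical stand-in data: 50 rows, 3 features, a simple linear target.
features = ["lag_1", "lag_24", "hour"]
rng = np.random.default_rng(0)
X = rng.normal(size=(50, len(features)))
y = X @ np.array([0.7, 0.2, 0.1])

# The bundle mirrors the four keys described above.
bundle = {
    "model": Ridge(alpha=1.0).fit(X, y),
    "features": features,
    "best_name": "Ridge",
    "target": "inventory",
}

# Save to a throwaway directory (the real files live under models/).
path = os.path.join(tempfile.mkdtemp(), "BED_yard.joblib")
joblib.dump(bundle, path)

# Load it back and predict for one row, with columns in the stored feature order.
loaded = joblib.load(path)
row = np.array([[120.0, 118.0, 17.0]])
prediction = loaded["model"].predict(row)
print(loaded["best_name"], loaded["target"], float(prediction[0]))
```

Storing the feature list next to the estimator is what lets the loader build its input columns in the order the model was trained on.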
Why 127 of them?
Every terminal behaves differently. BED is huge and busy; a small terminal is calm. One global model would be a blurry average that fits nobody. So we train a separate specialist model for each terminal and each thing we predict:
- 32 terminals × up to 4 predicted series (yard inventory, gate total, gate inbound, gate outbound)
- = 127 models in total (a few terminals lacked enough history for every series, which is why it is 127 and not the full 128).
That is the fleet. The app is simply a window onto those 127 trained files and the accuracy numbers we recorded when we built them.
The three algorithms that compete
For each terminal-series we don’t guess which method is best — we let three compete and keep the winner:
| Algorithm | Plain description | Good at |
|---|---|---|
| Ridge | A smart straight-line (linear) fit with a stability control. | Smooth, trending signals like a slowly filling yard. |
| RandomForest | 300 decision trees whose predictions are averaged. | Bumpy, irregular signals like gate traffic. |
| HistGradientBoosting | Trees built in sequence, each fixing the last one’s mistakes. | Subtle non-linear patterns. |
Across the fleet the winners were: RandomForest 60, HistGradientBoosting 35, Ridge 32 wins. No single method dominated — which is exactly why letting them compete was worth it.
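As a quick arithmetic check on those win counts, the short sketch below confirms that they add up to the 127 models and shows each algorithm's share. The counts are the ones quoted above; nothing else is assumed.

```python
# Win counts as reported for the fleet.
wins = {"RandomForest": 60, "HistGradientBoosting": 35, "Ridge": 32}

total = sum(wins.values())
shares = {name: round(100 * n / total, 1) for name, n in wins.items()}

print(total)   # 127
print(shares)  # {'RandomForest': 47.2, 'HistGradientBoosting': 27.6, 'Ridge': 25.2}
```

Even the most frequent winner takes fewer than half of the terminal-series, which is the sense in which no method dominated.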
2 · What the app is actually showing you
The screen at http://127.0.0.1:5000 is a live window onto the 127 trained models. Here is every element, decoded.
The left sidebar — the 32 terminals
Every button is one terminal (BED, DET, FAB, …). A ⚠ risk tag means that terminal’s yard is forecast to cross its congestion line. Clicking a terminal asks the app: “show me everything you learned about this place.”
The KPI strip at the top
| KPI | What it means | Where it comes from |
|---|---|---|
| Current yard inventory | How many units are sitting in the yard right now. | Last real value in your data. |
| Forecast +24h | What the model expects 24 hours from now. | The trained model’s prediction. |
| Yard accuracy (MAPE) | Typical % error of that forecast. | Measured on held-out data. |
| Congestion status | OVER or HEALTHY. | Forecast vs the congestion line. |
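The congestion status in the last row can be sketched as a single comparison. The function name, the strict "greater than" rule and the example numbers are assumptions for illustration; the guide only states that the forecast is compared against the terminal's congestion line.

```python
def congestion_status(forecast_24h: float, congestion_line: float) -> str:
    """Return 'OVER' when the +24h forecast exceeds the line, else 'HEALTHY'."""
    # Assumption: a forecast exactly on the line still counts as healthy.
    return "OVER" if forecast_24h > congestion_line else "HEALTHY"


# Hypothetical numbers, not taken from any real terminal.
print(congestion_status(forecast_24h=1480.0, congestion_line=1200.0))  # OVER
print(congestion_status(forecast_24h=950.0, congestion_line=1200.0))   # HEALTHY
```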
Each card = one thing we predict
For a terminal you see up to four cards. Each card shows the same anatomy:
- Last actual — the most recent real number.
- Forecast (next) — the model’s prediction for the next period.
- MAE — Mean Absolute Error: on average, how many units the forecast was off by. Lower = better.
- MAPE — the same error as a percentage. 1.8% means typically within 1.8% of the truth.
- Leaderboard — the three algorithms and their error scores; the 🏆 marks the one we kept and now serve.
- ▶ Load & run the real .joblib model — proves it is real: it opens the actual saved file from disk and runs it in front of you.
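For reference, the two error measures on each card have the following textbook definitions, where y_i is an actual value, ŷ_i the matching forecast and n the number of held-out points. That the project computes them in exactly this form is an assumption.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert
```

MAE is expressed in the same units as the series (units in the yard, units through the gate), while MAPE is unit-free.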
A worked example: BED
That single sentence — “BED will stay congested, and we’re 99.9% confident in the size of the number” — is the entire point of the system.
3 · What we predict, and the data behind it
Functionally: we are giving each terminal a short-range operational weather forecast.
The four predicted series
| Series | Question it answers | Horizon |
|---|---|---|
| Yard inventory | Will my yard fill up / stay congested? | next 24 hours |
| Gate total | How busy will the gate be overall? | next 14 days |
| Gate inbound | How many units arriving? | next 14 days |
| Gate outbound | How many units leaving? | next 14 days |
Where the numbers come from
The raw input is 845,277 equipment-activity events spanning 2026-03-16 → 2026-05-29 across 32 terminals — roughly two and a half months of history. Each event is one container/chassis doing something (arriving, leaving, being moved). We roll those raw events up into clean time-series tables:
| Gold table | What it holds |
|---|---|
| gate_flow_hourly | Events / inbound / outbound per terminal per hour. |
| yard_hourly | Running yard inventory per terminal per hour. |
| dwell | How long each unit sat (idle / dwell time). |
| equipment_quality | Data-quality flags (bad timestamps etc.). |
Those tables are what the models actually learn from.
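The roll-up from raw events to hourly tables can be sketched in plain Python. The event shape (terminal, ISO timestamp, direction), the function names and the opening inventory of zero are all assumptions for illustration; the real pipeline builds these tables with DuckDB.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (terminal, ISO timestamp, direction).
events = [
    ("BED", "2026-03-16T08:05:00", "inbound"),
    ("BED", "2026-03-16T08:40:00", "inbound"),
    ("BED", "2026-03-16T08:55:00", "outbound"),
    ("BED", "2026-03-16T09:10:00", "outbound"),
    ("DET", "2026-03-16T08:20:00", "inbound"),
]


def gate_flow_hourly(events):
    """Count inbound/outbound events per (terminal, hour)."""
    flow = defaultdict(lambda: {"inbound": 0, "outbound": 0})
    for terminal, ts, direction in events:
        if direction not in ("inbound", "outbound"):
            continue  # other activity (e.g. yard moves) is not gate flow
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        flow[(terminal, hour)][direction] += 1
    return dict(flow)


def yard_hourly(flow, terminal, opening_inventory=0):
    """Running inventory: previous level + inbound - outbound, hour by hour."""
    level = opening_inventory
    levels = {}
    for term, hour in sorted(flow):
        if term == terminal:
            counts = flow[(term, hour)]
            level += counts["inbound"] - counts["outbound"]
            levels[hour] = level
    return levels


flow = gate_flow_hourly(events)
print(yard_hourly(flow, "BED"))
```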
4 · How training actually works (semi-technical)
No hand-waving. This is the exact recipe the code runs for every single terminal-series.
Step 1 — turn history into a “study sheet”
A model can’t learn from a raw line on a chart. We convert the series into rows of features → answer. The features are engineered with no look-ahead (we never let the model peek at the future):
| Feature | Meaning |
|---|---|
| lag_1, lag_2, lag_3, lag_24 | The value 1, 2, 3 and 24 periods ago. |
| roll_mean_24, roll_std_24 | Recent average & volatility (shifted so it’s past-only). |
| hour, dow, is_weekend, month_day | Calendar context. |
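A small sketch of how such past-only features can be built, using only the standard library. The function names are hypothetical, the rolling standard deviation uses the sample formula, and reading month_day as "day of the month" is an assumption; the project's own feature code may differ in these details.

```python
from datetime import datetime
from statistics import mean, stdev


def build_rows(series, lags=(1, 2, 3, 24), window=24):
    """Turn a list of values into feature rows that only look backwards."""
    rows = []
    start = max(max(lags), window)  # first index with enough history
    for t in range(start, len(series)):
        past = series[t - window:t]  # values strictly before t
        row = {f"lag_{k}": series[t - k] for k in lags}
        row[f"roll_mean_{window}"] = mean(past)
        row[f"roll_std_{window}"] = stdev(past)
        row["target"] = series[t]  # the answer the model must learn
        rows.append(row)
    return rows


def calendar_features(ts: datetime):
    """Calendar context for one timestamp."""
    dow = ts.weekday()  # Monday = 0 ... Sunday = 6
    return {"hour": ts.hour, "dow": dow, "is_weekend": int(dow >= 5), "month_day": ts.day}


rows = build_rows(list(range(30)))
print(rows[0])
print(calendar_features(datetime(2026, 3, 16, 17)))
```

Because the rolling window ends at t - 1, the value being predicted never enters its own features.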
Step 2 — three candidates compete (cross-validation)
We use rolling-origin TimeSeriesSplit cross-validation. Crucially, each fold trains on the past and tests on the future, never the reverse, because shuffling time would leak future values into training. Each of the three algorithms is scored on the same folds; the lowest average error wins.
Step 3 — refit the winner & score it honestly
The winner is refit on all the training data, then measured on a clean hold-out it has never seen. That hold-out error is the MAE / MAPE you see in the app — an honest estimate of real-world accuracy, not a self-graded test.
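A sketch of that final scoring step, assuming the hold-out is simply the most recent slice of the series. The 80/20 split, the synthetic data and the choice of Ridge as the "winner" are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical series and lag features (values 1 and 24 steps ago).
rng = np.random.default_rng(1)
n = 200
t = np.arange(n)
series = 500 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, n)
X = np.column_stack([np.roll(series, 1), np.roll(series, 24)])[24:]
y = series[24:]

# Chronological split: the last 20% is the hold-out, never shuffled.
cut = int(len(y) * 0.8)
X_train, X_hold = X[:cut], X[cut:]
y_train, y_hold = y[:cut], y[cut:]

# Refit the winning algorithm on the training portion only.
winner = Ridge(alpha=1.0).fit(X_train, y_train)

# Score on rows the model has never seen.
pred = winner.predict(X_hold)
mae = mean_absolute_error(y_hold, pred)
mape = 100 * mean_absolute_percentage_error(y_hold, pred)  # sklearn returns a fraction
print(f"hold-out MAE={mae:.2f}  MAPE={mape:.2f}%")
```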
Step 4 — persist it
The winning model (plus its feature list) is saved with joblib.dump() to models/<terminal>_<series>.joblib. That saved file is exactly what the app re-loads later, and what would be registered in Fabric/MLflow in production.
5 · How accurate is it — honestly?
We report the good and the bad. This is measured accuracy on held-out data, not marketing.
Fleet scorecard (median across terminals)
| Series | Median MAE | Median MAPE | Read it as |
|---|---|---|---|
| Yard inventory | 26.36 | 1.8% | excellent — usually within ~2%. |
| Gate total | 120.79 | 50.8% | directional — good for planning, not exact counts. |
| Gate inbound | 22.04 | n/a* | solid on absolute units. |
| Gate outbound | 9.38 | n/a* | solid on absolute units. |
*MAPE is “n/a” when some hours have zero activity: the percentage error divides by the actual value, so it is undefined when the actual is zero. MAE is the reliable metric there.
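The footnote can be made concrete with a small pure-Python sketch. Returning None for the "n/a" case is an assumption about how one might represent it, and the numbers are made up.

```python
def mae(actual, forecast):
    """Mean absolute error, in the same units as the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error, or None when any actual is zero."""
    if any(a == 0 for a in actual):
        return None  # dividing by a zero actual is undefined, so report n/a
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical gate counts for three hours, one of them with no activity.
actual = [10, 0, 4]
forecast = [8, 1, 4]
print(mae(actual, forecast))   # 1.0
print(mape(actual, forecast))  # None
```

MAE stays well defined in the quiet hour, which is why it is the metric reported for the inbound and outbound series.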
Why yard is so good and gate is harder
- Yard inventory is a smooth, slowly-moving level — easy to forecast tightly (median ~1.8% error).
- Gate traffic is spiky and low-volume at many terminals — a few units off becomes a big percentage. The MAE (absolute units) stays small, so it’s still useful for staffing decisions.
The honest caveat: ~11 terminals need a data fix
6 · Will it be accurate with one year of data, in Fabric?
Short answer: yes, and accuracy is expected to hold or improve. Here is exactly why, and what changes.
Why more history improves accuracy
| Today (~2.5 months) | With ~1 year |
|---|---|
| Models see only spring weeks. | Models see all seasons — peaks, holidays, slow months. |
| Can’t learn weekly + seasonal cycles fully. | Learns day-of-week, month, and seasonal patterns. |
| Negative-inventory artifact at some terminals. | Long window captures true inflow/outflow → artifact disappears. |
| Yard MAPE already ~1.8%. | Expected to hold or improve, and gate forecasts stabilise. |
Does the method change? No.
The exact same pipeline — features, three-way model competition, time-series cross-validation, hold-out scoring, persisted winners — runs unchanged on one year of data. You simply point it at more history. The code is already written to train per terminal, so it scales by repeating the same recipe.
Does it scale to Fabric? Yes — this is the production home
The local Python you’re running maps 1:1 onto Microsoft Fabric. Nothing is throwaway:
| Local POC (now) | Microsoft Fabric (production) |
|---|---|
| Raw JSON → NDJSON shards | Data Factory / Eventstream ingest into OneLake |
| DuckDB → gold parquet tables | Lakehouse medallion (Bronze→Silver→Gold Delta tables) |
| train_fleet.py (scikit-learn) | Spark notebooks running the same code, parallel across terminals |
| .joblib files in models/ | MLflow model registry (versioned, champion/challenger) |
| Flask app + this guide | Power BI semantic model + dashboards + Copilot |
How it becomes “always on”
- Inference (cheap): recompute yard forecasts hourly, gate forecasts daily.
- Retraining (heavier): refresh models weekly/monthly as new data lands.
- Drift watch: if accuracy degrades or input patterns shift, auto-retrain.
- One year of rolling history keeps each model current with real seasonality.
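The drift watch in the list above could be as simple as comparing recent error with the error recorded at training time. The function name and the 1.5x tolerance are hypothetical; the guide does not specify the actual trigger.

```python
def needs_retrain(baseline_mae: float, recent_mae: float, tolerance: float = 1.5) -> bool:
    """Flag a model when its recent error exceeds tolerance x its hold-out error."""
    return recent_mae > tolerance * baseline_mae


# Baseline taken from the fleet's median yard MAE; recent values are made up.
print(needs_retrain(baseline_mae=26.36, recent_mae=30.0))  # False
print(needs_retrain(baseline_mae=26.36, recent_mae=45.0))  # True
```

A relative threshold like this adapts to each terminal's scale, so a busy terminal and a quiet one can share the same rule.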
7 · Does it cover everything? The full picture
Yes. Here is the entire system on one page, from a raw event to a decision a planner makes.
RAW EVENTS 845,277 equipment-activity events (32 terminals, 2.5 months)
|
v
INGEST / SHARD stream huge JSON -> NDJSON shards (Fabric: Eventstream/Data Factory)
|
v
GOLD TABLES gate_flow_hourly, yard_hourly, dwell, quality (Fabric: Lakehouse Delta)
|
v
FEATURE BUILD lags + rolling stats + calendar (no look-ahead)
|
v
TRAIN PER TERMINAL 3 algorithms compete -> pick winner by time-series CV
| 127 models, 32 terminals
v
SCORE (hold-out) MAE / MAPE recorded honestly per model
|
v
PERSIST models/<terminal>_<series>.joblib (Fabric: MLflow registry)
|
v
SERVE Flask app + this guide (Fabric: Power BI + Copilot)
|
v
DECISION "BED will stay congested in 24h" -> planner acts
Every aspect, checked
| Aspect | Covered by |
|---|---|
| What a model is | Chapter 1 |
| What the app shows | Chapter 2 |
| What we predict + data | Chapter 3 |
| How training works | Chapter 4 |
| Real accuracy (good & bad) | Chapter 5 |
| One year of data + Fabric scale | Chapter 6 |
| End-to-end flow | this chapter |
| How to run it yourself | Chapter 8 |
8 · Run it yourself
Three things you can do right now.
See live predictions (the app)
python app.py
# then open http://127.0.0.1:5000
Pick a terminal, read its forecasts, click ▶ Load & run the real .joblib model to prove the model loads from disk and runs.
Re-train the whole fleet
python train_fleet.py
# writes models/*.joblib, fleet_results.json, fleet_scorecard.json
Rebuild the documents
python build_doc.py # the slide-style POC book -> docs/documentation.html
python build_guide.py # THIS explanatory guide -> docs/guide.html
documentation.html is the presentation-style book; this guide.html is the long-form explanatory reference you’re reading now.