1 · What exactly is a “model”?

This is the single most important idea in the whole project, so we start here and we go slowly. When the app says “127 models”, here is precisely what that means.

The one-sentence answer

A model is a small saved file that has learned the pattern of one number at one terminal (for example, “how full BED’s yard usually gets”), so it can predict the next value before it happens.

A model is not magic, and it is not downloaded

We did not download a model from the internet. We did not use ChatGPT. Each model is built from your own data using a mathematical procedure called supervised regression. Think of it like this:

Everyday analogy	What it maps to here
You notice traffic is always bad at 5pm on weekdays.	The model notices yard inventory rises on certain hours/days.
After weeks of commuting you can predict tomorrow’s 5pm traffic.	The model predicts the next hour’s / next day’s value.
Your “experience” lives in your head.	The model’s “experience” lives in a `.joblib` file on disk.

What is literally inside one `.joblib` file

When you click “Load & run the real .joblib model” in the app, Python opens a file like models/BED_yard.joblib. Inside it is a Python dictionary with four things:

{
  "model":   <the trained estimator>,   # e.g. a Ridge regression or a RandomForest
  "features": ["lag_1", "lag_24", "roll_mean_24", "hour", "dow", ...],
  "best_name": "Ridge",                  # which algorithm won for this terminal
  "target":  "inventory"                 # the number it predicts
}

So a “model” = a trained estimator + the list of inputs it expects + the name of the algorithm that won + a label of what it predicts. That is the whole thing. It is typically a few hundred kilobytes. It loads in milliseconds. It does not need the internet.
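To make the role of the four keys concrete, here is a minimal sketch of how a loaded bundle could be used. It is not the app's code: `MeanOfInputs` is a hypothetical stand-in for the trained estimator (so the example runs without any saved file), and `predict_from_bundle` is an illustrative helper name. The point is only that the `features` list fixes the order of the inputs handed to `predict()`.

```python
from typing import Mapping, Sequence


class MeanOfInputs:
    """Hypothetical stand-in for a trained estimator with a scikit-learn style predict()."""

    def predict(self, rows: Sequence[Sequence[float]]) -> list:
        return [sum(row) / len(row) for row in rows]


def predict_from_bundle(bundle: Mapping, latest: Mapping[str, float]) -> float:
    """Order the inputs the way the bundle's feature list expects, then predict one value."""
    missing = [name for name in bundle["features"] if name not in latest]
    if missing:
        raise KeyError(f"missing features: {missing}")
    row = [float(latest[name]) for name in bundle["features"]]
    return bundle["model"].predict([row])[0]


# Same four keys as the dictionary above; the values here are illustrative.
bundle = {
    "model": MeanOfInputs(),
    "features": ["lag_1", "lag_24", "roll_mean_24"],
    "best_name": "Ridge",
    "target": "inventory",
}

# Extra keys (like "hour") are ignored; key order in `latest` does not matter.
latest = {"roll_mean_24": 90.0, "lag_24": 110.0, "lag_1": 100.0, "hour": 14}
forecast = predict_from_bundle(bundle, latest)
print(bundle["target"], forecast)
```

Keeping the feature list next to the estimator means the caller never has to remember the column order used at training time.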

Why 127 of them?

Every terminal behaves differently. BED is huge and busy; a small terminal is calm. One global model would be a blurry average that fits nobody. So we train a separate specialist model for each terminal and each thing we predict:

32 terminals × up to 4 predicted series (yard inventory, gate total, gate inbound, gate outbound)
= 127 models. The full grid would be 32 × 4 = 128; not every terminal had enough history for every series, which is why the count is 127.

That is the fleet. The app is simply a window onto those 127 trained files and the accuracy numbers we recorded when we built them.

The three algorithms that compete

For each terminal-series we don’t guess which method is best — we let three compete and keep the winner:

Algorithm	Plain description	Good at
Ridge	A smart straight-line (linear) fit with a stability control.	Smooth, trending signals like a slowly filling yard.
RandomForest	300 decision trees that vote on the answer.	Bumpy, irregular signals like gate traffic.
HistGradientBoosting	Trees built in sequence, each fixing the last one’s mistakes.	Subtle non-linear patterns.

Across the fleet, RandomForest won 60 terminal-series, HistGradientBoosting 35, and Ridge 32 (127 in total). No single method dominated, which is exactly why letting them compete was worth it.
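As a quick sanity check on those win counts, the arithmetic can be sketched in a few lines. The counts are the ones quoted above; the variable names are illustrative only.

```python
# Win counts as reported for the fleet.
wins = {"RandomForest": 60, "HistGradientBoosting": 35, "Ridge": 32}

total = sum(wins.values())
shares = {name: round(100 * count / total, 1) for name, count in wins.items()}
top = max(wins, key=wins.get)

print(total)    # one winner per trained model
print(shares)   # percentage of the fleet won by each algorithm
print(top)
```

The largest share stays below half of the fleet, which is the numeric version of "no single method dominated".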

2 · What the app is actually showing you

The screen at http://127.0.0.1:5000 is a live window onto the 127 trained models. Here is every element, decoded.

The left sidebar — the 32 terminals

Every button is one terminal (BED, DET, FAB, …). A ⚠ risk tag means that terminal’s yard is forecast to cross its congestion line. Clicking a terminal asks the app: “show me everything you learned about this place.”

The KPI strip at the top

KPI	What it means	Where it comes from
Current yard inventory	How many units are sitting in the yard right now.	Last real value in your data.
Forecast +24h	What the model expects 24 hours from now.	The trained model’s prediction.
Yard accuracy (MAPE)	Typical % error of that forecast.	Measured on held-out data.
Congestion status	OVER or HEALTHY.	Forecast vs the congestion line.

Each card = one thing we predict

For a terminal you see up to four cards. Each card shows the same anatomy:

Last actual — the most recent real number.
Forecast (next) — the model’s prediction for the next period.
MAE — Mean Absolute Error: on average, how many units the forecast was off by. Lower = better.
MAPE — the same error as a percentage. 1.8% means typically within 1.8% of the truth.
Leaderboard — the three algorithms and their error scores; the 🏆 marks the one we kept and now serve.
▶ Load & run the real .joblib model — proves it is real: it opens the actual saved file from disk and runs it in front of you.
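For readers who want the two error metrics spelled out, here is a minimal sketch using the standard definitions. Returning `None` when any actual value is zero is an assumption made for this example (it mirrors the "n/a" convention used in the scorecard later in this guide), not necessarily how the training script handles it.

```python
from typing import Optional, Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error: average size of the miss, in the series' own units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> Optional[float]:
    """Mean Absolute Percentage Error, or None when any actual value is zero."""
    if any(a == 0 for a in actual):
        return None  # dividing by a zero actual is undefined
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


# Illustrative numbers, not real terminal data.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print(mae(actual, predicted))
print(mape(actual, predicted))
print(mape([0.0, 5.0], [1.0, 5.0]))
```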

A worked example: BED

BED’s yard currently holds 52,348 units. Its congestion line (the threshold marking the busiest 10% of its own history) is 45,566. The trained Ridge model forecasts 52,423 in 24 hours, which is over the line, so BED is flagged as a risk. And it is an accurate warning: that model’s typical error is only 0.1%.

That single sentence, “BED will stay congested, and the forecast is typically within about 0.1% of the true number”, is the entire point of the system.
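The BED check can be sketched as a threshold comparison. The nearest-rank percentile below is one plausible reading of "busiest 10% of its own history"; the exact method the pipeline uses is not stated here, so treat `congestion_line` as an assumption. The BED figures are the ones from the worked example.

```python
import math


def congestion_line(history, top_percent=10):
    """Nearest-rank threshold: roughly the busiest `top_percent` of values lie at or above it."""
    ordered = sorted(history)
    rank = math.ceil((100 - top_percent) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def congestion_status(forecast, line):
    """OVER when the forecast is above the congestion line, otherwise HEALTHY."""
    return "OVER" if forecast > line else "HEALTHY"


# BED, using the numbers from the worked example.
bed_line = 45_566
bed_forecast = 52_423
print(congestion_status(bed_forecast, bed_line), bed_forecast - bed_line)

# Hypothetical history of 100 hourly levels, only to show the threshold step.
print(congestion_line(list(range(1, 101))))
```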

3 · What we predict, and the data behind it

Functionally: we are giving each terminal a short-range operational weather forecast.

The four predicted series

Series	Question it answers	Horizon
Yard inventory	Will my yard fill up / stay congested?	next 24 hours
Gate total	How busy will the gate be overall?	next 14 days
Gate inbound	How many units arriving?	next 14 days
Gate outbound	How many units leaving?	next 14 days

Where the numbers come from

The raw input is 845,277 equipment-activity events spanning 2026-03-16 → 2026-05-29 across 32 terminals — roughly two and a half months of history. Each event is one container/chassis doing something (arriving, leaving, being moved). We roll those raw events up into clean time-series tables:

Gold table	What it holds
`gate_flow_hourly`	Events / inbound / outbound per terminal per hour.
`yard_hourly`	Running yard inventory per terminal per hour.
`dwell`	How long each unit sat (idle / dwell time).
`equipment_quality`	Data-quality flags (bad timestamps etc.).

Those tables are what the models actually learn from.
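As a rough illustration of the roll-up from events to hourly tables, here is a small sketch in plain Python. The real pipeline uses DuckDB and the real event schema is not shown in this guide, so the tuple layout `(terminal, timestamp, direction)` and the function bodies are assumptions; only the table names come from the guide.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (terminal, ISO timestamp, direction).
events = [
    ("BED", "2026-03-16T08:05:00", "in"),
    ("BED", "2026-03-16T08:40:00", "in"),
    ("BED", "2026-03-16T08:55:00", "out"),
    ("BED", "2026-03-16T09:10:00", "out"),
    ("BED", "2026-03-16T09:20:00", "out"),
]


def gate_flow_hourly(events):
    """Count inbound and outbound events per (terminal, hour)."""
    table = defaultdict(lambda: {"inbound": 0, "outbound": 0})
    for terminal, stamp, direction in events:
        hour = datetime.fromisoformat(stamp).replace(minute=0, second=0, microsecond=0)
        key = "inbound" if direction == "in" else "outbound"
        table[(terminal, hour)][key] += 1
    return dict(table)


def yard_hourly(flow, start=0):
    """Running inventory per terminal: previous level + inbound - outbound."""
    level, out = {}, {}
    for terminal, hour in sorted(flow):
        counts = flow[(terminal, hour)]
        level[terminal] = level.get(terminal, start) + counts["inbound"] - counts["outbound"]
        out[(terminal, hour)] = level[terminal]
    return out


flow = gate_flow_hourly(events)
yard = yard_hourly(flow)
for key in sorted(yard):
    print(key[0], key[1].isoformat(), flow[key], yard[key])
```

Note that the running level depends on the starting value: with `start=0` the second hour in this toy data goes negative, because more units left than the window saw arrive.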

Figure: Yard inventory over time (history + forecast) for the demo terminal.

4 · How training actually works (semi-technical)

No hand-waving. This is the exact recipe the code runs for every single terminal-series.

Step 1 — turn history into a “study sheet”

A model can’t learn from a raw line on a chart. We convert the series into rows of features → answer. The features are engineered with no look-ahead (we never let the model peek at the future):

Feature	Meaning
`lag_1, lag_2, lag_3, lag_24`	The value 1, 2, 3 and 24 periods ago.
`roll_mean_24, roll_std_24`	Recent average & volatility (shifted so it’s past-only).
`hour, dow, is_weekend, month_day`	Calendar context.
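The "no look-ahead" rule for the lag and rolling features can be sketched as follows. This is a plain-Python illustration, not the project's feature code: `build_rows` is a hypothetical helper, the sample standard deviation is an assumed choice, and the calendar features are left out because they need timestamps.

```python
from statistics import mean, stdev


def build_rows(series, lags=(1, 2, 3, 24), window=24):
    """Return (features, target) pairs; every feature uses values strictly before index t."""
    rows = []
    start = max(max(lags), window)       # first index with a full history behind it
    for t in range(start, len(series)):
        past = series[t - window:t]      # slice ends at t-1, so the target is excluded
        features = {f"lag_{k}": series[t - k] for k in lags}
        features[f"roll_mean_{window}"] = mean(past)
        features[f"roll_std_{window}"] = stdev(past)
        rows.append((features, series[t]))
    return rows


series = [float(v) for v in range(30)]   # toy series: 0, 1, 2, ..., 29
rows = build_rows(series)
first_features, first_target = rows[0]
print(len(rows), first_features, first_target)
```

Because the rolling window stops one step before the target, the value being predicted never leaks into its own inputs.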

Step 2 — three candidates compete (cross-validation)

We use rolling-origin TimeSeriesSplit cross-validation. Crucially, it always trains on the past and tests on the future — never the reverse — because shuffling time would be cheating. Each of the three algorithms gets scored on the same folds; lowest average error wins.
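The train-on-past, test-on-future idea can be sketched without any library. The fold arithmetic below is intended to mirror the default expanding-window layout of scikit-learn's `TimeSeriesSplit`, but it is a hand-written illustration; the per-fold error numbers are made up purely to show how the winner is picked.

```python
def rolling_origin_folds(n_samples, n_splits=3):
    """Yield (train_indices, test_indices); each test block sits right after its training block."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        yield list(range(test_start)), list(range(test_start, test_start + test_size))


folds = list(rolling_origin_folds(12, n_splits=3))
for train_idx, test_idx in folds:
    print("train", train_idx, "test", test_idx)

# Hypothetical per-fold MAE for each candidate, scored on the same folds.
fold_errors = {
    "Ridge": [4.0, 3.5, 3.0],
    "RandomForest": [3.0, 3.2, 3.1],
    "HistGradientBoosting": [3.6, 3.4, 3.3],
}
mean_error = {name: sum(errs) / len(errs) for name, errs in fold_errors.items()}
winner = min(mean_error, key=mean_error.get)
print(winner)
```

Every training block ends before its test block begins, so no fold ever learns from data that comes after what it is asked to predict.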

Step 3 — refit the winner & score it honestly

The winner is refit on all the training data, then measured on a clean hold-out it has never seen. That hold-out error is the MAE / MAPE you see in the app — an honest estimate of real-world accuracy, not a self-graded test.
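One simple way to get a hold-out the model has never seen is to set aside the most recent slice of the time-ordered rows before any training happens. The sketch below assumes that layout and a 20% hold-out; both are assumptions for illustration, since the guide does not state the actual split used.

```python
def chronological_holdout(rows, holdout_percent=20):
    """Split time-ordered rows into (training, hold-out); the hold-out is the most recent slice."""
    if not 0 < holdout_percent < 100:
        raise ValueError("holdout_percent must be between 0 and 100")
    n_holdout = max(1, len(rows) * holdout_percent // 100)
    return rows[:-n_holdout], rows[-n_holdout:]


rows = list(range(100))   # stand-in for 100 time-ordered feature rows
train, holdout = chronological_holdout(rows)

# Cross-validation and the final refit would use `train` only;
# MAE / MAPE would then be computed once on `holdout`.
print(len(train), len(holdout), train[-1] < holdout[0])
```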

Step 4 — persist it

The winning model (plus its feature list) is saved with joblib.dump() to models/<terminal>_<series>.joblib. That saved file is exactly what the app re-loads later — and what would be registered in Fabric/MLflow in production.

The whole fleet — 127 models across 32 terminals — trained in about 7.3 minutes on a normal machine.

Figure: Which algorithm won, across the whole fleet.

5 · How accurate is it — honestly?

We report the good and the bad. This is measured accuracy on held-out data, not marketing.

Fleet scorecard (median across terminals)

Series	Median MAE	Median MAPE	Read it as
Yard inventory	26.36	1.8%	excellent — usually within ~2%.
Gate total	120.79	50.8%	directional — good for planning, not exact counts.
Gate inbound	22.04	n/a*	solid on absolute units.
Gate outbound	9.38	n/a*	solid on absolute units.

*MAPE is “n/a” when some hours have zero activity: MAPE divides each error by the actual value, and you can’t divide by zero. MAE is the reliable metric there.

Why yard is so good and gate is harder

Yard inventory is a smooth, slowly-moving level — easy to forecast tightly (median ~1.8% error).
Gate traffic is spiky and low-volume at many terminals — a few units off becomes a big percentage. The MAE (absolute units) stays small, so it’s still useful for staffing decisions.
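A tiny worked example shows why the same absolute miss reads so differently on the two kinds of series. The yard level is BED's figure from the worked example in chapter 2; the quiet gate hour of 10 units is a hypothetical number chosen for illustration.

```python
def pct_error(actual, predicted):
    """Absolute error as a percentage of the actual value."""
    return 100 * abs(actual - predicted) / abs(actual)


yard_pct = pct_error(52_348, 52_348 + 5)   # five units off on a large yard level
gate_pct = pct_error(10, 10 + 5)           # five units off on a quiet gate hour

print(round(yard_pct, 4), gate_pct)
```

The miss is five units in both cases, so the MAE contribution is identical; only the percentage view changes.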

The honest caveat: ~11 terminals need a data fix

Some terminals (e.g. NWO, SYR, FAB) show negative cumulative inventory, because more units left than arrived inside our 2.5-month window — those units arrived before the data starts. Their congestion flags are artifacts, not real risk. The genuine high-risk terminals are BED, NSH, GRE, DET, WOR, INR, CHA. The fix is to seed each terminal’s true starting inventory — which is exactly the kind of thing more history solves (see the next chapter).
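The seeding idea can be sketched like this. Note the limitation: from the window alone you can only compute the smallest starting inventory that keeps the running level from going negative, which is a lower bound and not the true starting inventory. The function names and the net-flow numbers are hypothetical.

```python
from itertools import accumulate


def minimum_seed(net_flows):
    """Smallest starting inventory that keeps the running level at or above zero."""
    lowest = min(accumulate(net_flows), default=0)
    return max(0, -lowest)


def running_inventory(net_flows, start=0):
    """Running level after each period, given a starting inventory."""
    return [start + level for level in accumulate(net_flows)]


net_flows = [-3, -2, 4, -1, 2]   # hypothetical hourly inbound minus outbound

unseeded = running_inventory(net_flows)
seed = minimum_seed(net_flows)
seeded = running_inventory(net_flows, start=seed)

print(unseeded)
print(seed, seeded)
```

A terminal whose unseeded series dips below zero is a candidate for the artifact described above; a real fix would replace the lower bound with a measured starting count.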

Figure: Terminals ranked by forecast congestion headroom.

6 · Will it be accurate with one year of data, in Fabric?

Short answer: yes, and accuracy is expected to hold or improve. Here is exactly why, and what changes.

Why more history improves accuracy

Today (~2.5 months)	With ~1 year
Models see only spring weeks.	Models see all seasons — peaks, holidays, slow months.
Can’t learn weekly + seasonal cycles fully.	Learns day-of-week, month, and seasonal patterns.
Negative-inventory artifact at some terminals.	Long window captures true inflow/outflow → artifact disappears.
Yard MAPE already ~1.8%.	Expected to hold or improve, and gate forecasts stabilise.

Does the method change? No.

The exact same pipeline — features, three-way model competition, time-series cross-validation, hold-out scoring, persisted winners — runs unchanged on one year of data. You simply point it at more history. The code is already written to train per terminal, so it scales by repeating the same recipe.

Does it scale to Fabric? Yes — this is the production home

The local Python you’re running maps 1:1 onto Microsoft Fabric. Nothing is throwaway:

Local POC (now)	Microsoft Fabric (production)
Raw JSON → NDJSON shards	Data Factory / Eventstream ingest into OneLake
DuckDB → gold parquet tables	Lakehouse medallion (Bronze→Silver→Gold Delta tables)
`train_fleet.py` (scikit-learn)	Spark notebooks running the same code, parallel across terminals
`.joblib` files in `models/`	MLflow model registry (versioned, champion/challenger)
Flask app + this guide	Power BI semantic model + dashboards + Copilot

How it becomes “always on”

Inference (cheap): recompute yard forecasts hourly, gate forecasts daily.
Retraining (heavier): refresh models weekly/monthly as new data lands.
Drift watch: if accuracy degrades or input patterns shift, auto-retrain.
One year of rolling history keeps each model current with real seasonality.
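The accuracy half of the drift watch could look something like the sketch below. The 1.5× tolerance is a made-up threshold, `needs_retrain` is a hypothetical helper, and the check covers only degrading accuracy, not shifts in input patterns. The 1.8% baseline is the fleet's median yard MAPE quoted earlier in this guide.

```python
def needs_retrain(recent_mape, baseline_mape, tolerance=1.5):
    """Flag a model when its recent error exceeds the hold-out baseline by `tolerance` times."""
    if recent_mape is None or baseline_mape is None:
        return False   # no percentage metric available (e.g. zero-activity series)
    return recent_mape > tolerance * baseline_mape


baseline = 1.8   # median yard MAPE from the fleet scorecard, in percent

print(needs_retrain(2.0, baseline))   # small wobble
print(needs_retrain(3.5, baseline))   # clear degradation
```

For series where MAPE is "n/a", the same comparison could be made on MAE instead.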

Bottom line: one year of data in Fabric means the same models you see today, trained on richer history, served continuously, versioned, and surfaced in Power BI — more accurate and fully operational.

7 · Does it cover everything? The full picture

Yes. Here is the entire system on one page, from a raw event to a decision a planner makes.

 RAW EVENTS                845,277 equipment-activity events (32 terminals, 2.5 months)
     |
     v
 INGEST / SHARD            stream huge JSON  ->  NDJSON shards        (Fabric: Eventstream/Data Factory)
     |
     v
 GOLD TABLES              gate_flow_hourly, yard_hourly, dwell, quality   (Fabric: Lakehouse Delta)
     |
     v
 FEATURE BUILD            lags + rolling stats + calendar (no look-ahead)
     |
     v
 TRAIN PER TERMINAL       3 algorithms compete  ->  pick winner by time-series CV
     |                    127 models, 32 terminals
     v
 SCORE (hold-out)         MAE / MAPE recorded honestly per model
     |
     v
 PERSIST                  models/<terminal>_<series>.joblib    (Fabric: MLflow registry)
     |
     v
 SERVE                    Flask app + this guide                 (Fabric: Power BI + Copilot)
     |
     v
 DECISION                 "BED will stay congested in 24h"  ->  planner acts

Every aspect, checked

Aspect	Covered by
What a model is	Chapter 1
What the app shows	Chapter 2
What we predict + data	Chapter 3
How training works	Chapter 4
Real accuracy (good & bad)	Chapter 5
One year of data + Fabric scale	Chapter 6
End-to-end flow	this chapter
How to run it yourself	Chapter 8

8 · Run it yourself

Three things you can do right now.

See live predictions (the app)

python app.py
# then open  http://127.0.0.1:5000

Pick a terminal, read its forecasts, click ▶ Load & run the real .joblib model to prove the model loads from disk and runs.

Re-train the whole fleet

python train_fleet.py
# writes models/*.joblib, fleet_results.json, fleet_scorecard.json

Rebuild the documents

python build_doc.py     # the slide-style POC book  -> docs/documentation.html
python build_guide.py   # THIS explanatory guide     -> docs/guide.html

Two documents, two purposes: documentation.html is the presentation-style book; this guide.html is the long-form explanatory reference you’re reading now.
