1 · What exactly is a “model”?

This is the single most important idea in the whole project, so we start here and we go slowly. When the app says “127 models”, here is precisely what that means.

The one-sentence answer

A model is a small saved file that has learned the pattern of one number at one terminal (for example, “how full BED’s yard usually gets”), so it can predict the next value before it happens.

A model is not magic, and it is not downloaded

We did not download a model from the internet. We did not use ChatGPT. Each model is built from your own data using a mathematical procedure called supervised regression. Think of it like this:

Everyday analogy | What it maps to here
You notice traffic is always bad at 5pm on weekdays. | The model notices yard inventory rises on certain hours/days.
After weeks of commuting you can predict tomorrow’s 5pm traffic. | The model predicts the next hour’s / next day’s value.
Your “experience” lives in your head. | The model’s “experience” lives in a .joblib file on disk.

What is literally inside one .joblib file

When you click “Load & run the real .joblib model” in the app, Python opens a file like models/BED_yard.joblib. Inside it is a Python dictionary with four things:

{
  "model":   <the trained estimator>,   # e.g. a Ridge regression or a RandomForest
  "features": ["lag_1", "lag_24", "roll_mean_24", "hour", "dow", ...],
  "best_name": "Ridge",                  # which algorithm won for this terminal
  "target":  "inventory"                 # the number it predicts
}

So a “model” = a trained estimator + the list of inputs it expects + a label of what it predicts. That is the whole thing. It is typically a few hundred kilobytes. It loads in milliseconds. It does not need the internet.
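As a rough illustration of what opening such a file involves, the sketch below builds a toy bundle with the same four keys, saves it, reloads it and makes one prediction. The three-feature subset, the training data and the file name DEMO_yard.joblib are made up for the example; only the dictionary layout and the use of joblib come from the description above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-in for a trained bundle (assumption: real bundles carry more features).
feature_names = ["lag_1", "lag_24", "hour"]
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, len(feature_names)))
y_train = X_train @ np.array([0.8, 0.1, 0.05]) + 100.0

bundle = {
    "model": Ridge(alpha=1.0).fit(X_train, y_train),
    "features": feature_names,
    "best_name": "Ridge",
    "target": "inventory",
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "DEMO_yard.joblib"  # hypothetical file name
    joblib.dump(bundle, str(path))
    loaded = joblib.load(str(path))        # no network access involved

# Build one input row in exactly the order the bundle says it expects.
latest = {"hour": 14.0, "lag_1": 1.2, "lag_24": 0.9}
row = np.array([[latest[name] for name in loaded["features"]]])
forecast = float(loaded["model"].predict(row)[0])
print(loaded["best_name"], loaded["target"], round(forecast, 2))
```

Storing the feature list next to the estimator is what lets the caller assemble inputs in the right column order without guessing.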

Why 127 of them?

Every terminal behaves differently. BED is huge and busy; a small terminal is calm. One global model would be a blurry average that fits nobody. So we train a separate specialist model for each terminal and each thing we predict:

  • 32 terminals × up to 4 predicted series (yard inventory, gate total, gate inbound, gate outbound)
  • = 128 possible models, of which 127 were trained (one terminal-series combination lacked enough history, which is why it is 127 and not 128).

That is the fleet. The app is simply a window onto those 127 trained files and the accuracy numbers we recorded when we built them.

The three algorithms that compete

For each terminal-series we don’t guess which method is best — we let three compete and keep the winner:

Algorithm | Plain description | Good at
Ridge | A smart straight-line (linear) fit with a stability control. | Smooth, trending signals like a slowly filling yard.
RandomForest | 300 decision trees whose predictions are averaged. | Bumpy, irregular signals like gate traffic.
HistGradientBoosting | Trees built in sequence, each fixing the last one’s mistakes. | Subtle non-linear patterns.

Across the fleet the winners were: RandomForest 60, HistGradientBoosting 35, Ridge 32 wins. No single method dominated — which is exactly why letting them compete was worth it.

2 · What the app is actually showing you

The screen at http://127.0.0.1:5000 is a live window onto the 127 trained models. Here is every element, decoded.

The left sidebar — the 32 terminals

Every button is one terminal (BED, DET, FAB, …). A ⚠ risk tag means that terminal’s yard is forecast to cross its congestion line. Clicking a terminal asks the app: “show me everything you learned about this place.”

The KPI strip at the top

KPI | What it means | Where it comes from
Current yard inventory | How many units are sitting in the yard right now. | Last real value in your data.
Forecast +24h | What the model expects 24 hours from now. | The trained model’s prediction.
Yard accuracy (MAPE) | Typical % error of that forecast. | Measured on held-out data.
Congestion status | OVER or HEALTHY. | Forecast vs the congestion line.

Each card = one thing we predict

For a terminal you see up to four cards. Each card shows the same anatomy:

  • Last actual — the most recent real number.
  • Forecast (next) — the model’s prediction for the next period.
  • MAE (Mean Absolute Error): on average, how many units the forecast was off by. Lower = better.
  • MAPE — the same error as a percentage. 1.8% means typically within 1.8% of the truth.
  • Leaderboard — the three algorithms and their error scores; the 🏆 marks the one we kept and now serve.
  • ▶ Load & run the real .joblib model — proves it is real: it opens the actual saved file from disk and runs it in front of you.

A worked example: BED

BED’s yard currently holds 52,348 units. Its congestion line (the level reached only in the busiest 10% of its own history) is 45,566. The trained Ridge model forecasts 52,423 in 24 hours, which is over the line, so BED is flagged as a risk. The warning is well supported: on held-out data, that model’s typical error (MAPE) is only about 0.1%.

That single sentence, “BED will stay congested, and the forecast is typically within about 0.1% of the true value”, is the entire point of the system.
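The flag logic in this example can be sketched in a few lines. The sketch assumes that “the busiest 10% of its own history” means the 90th percentile of past hourly yard levels; the history below is synthetic, and the function name congestion_status is hypothetical. Only the BED figures (45,566 and 52,423) come from the text.

```python
import numpy as np

def congestion_status(forecast: float, line: float) -> tuple[str, float]:
    """Return (status, headroom); negative headroom means over the line."""
    status = "OVER" if forecast > line else "HEALTHY"
    return status, line - forecast

# Part 1: deriving a congestion line from history (synthetic data, assumed 90th percentile).
rng = np.random.default_rng(1)
history = rng.normal(loc=40_000, scale=4_000, size=24 * 60)
demo_line = float(np.quantile(history, 0.90))
share_above = float((history > demo_line).mean())  # about 10% by construction

# Part 2: the BED numbers quoted above.
bed_status, bed_headroom = congestion_status(forecast=52_423, line=45_566)
print(bed_status, bed_headroom)
```

A negative headroom of 6,857 units is what puts BED in the OVER state.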

3 · What we predict, and the data behind it

Functionally: we are giving each terminal a short-range operational weather forecast.

The four predicted series

Series | Question it answers | Horizon
Yard inventory | Will my yard fill up / stay congested? | next 24 hours
Gate total | How busy will the gate be overall? | next 14 days
Gate inbound | How many units arriving? | next 14 days
Gate outbound | How many units leaving? | next 14 days

Where the numbers come from

The raw input is 845,277 equipment-activity events spanning 2026-03-16 → 2026-05-29 across 32 terminals — roughly two and a half months of history. Each event is one container/chassis doing something (arriving, leaving, being moved). We roll those raw events up into clean time-series tables:

Gold table | What it holds
gate_flow_hourly | Events / inbound / outbound per terminal per hour.
yard_hourly | Running yard inventory per terminal per hour.
dwell | How long each unit sat (idle / dwell time).
equipment_quality | Data-quality flags (bad timestamps etc.).

Those tables are what the models actually learn from.
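To make the roll-up concrete, here is a minimal pandas sketch of how raw events could be counted into an hourly gate-flow table. The event schema is not shown in this guide, so the column names terminal, ts and direction and the sample rows are assumptions; the project itself builds these tables with DuckDB.

```python
import pandas as pd

# Hypothetical raw events: one row per unit arriving ("in") or leaving ("out").
events = pd.DataFrame({
    "terminal": ["BED", "BED", "BED", "BED", "DET"],
    "ts": pd.to_datetime([
        "2026-03-16 08:05", "2026-03-16 08:40", "2026-03-16 08:55",
        "2026-03-16 09:10", "2026-03-16 08:30",
    ]),
    "direction": ["in", "in", "out", "in", "out"],
})

# Bucket each event into its hour, then count per terminal, hour and direction.
events["hour"] = events["ts"].dt.floor("60min")
gate_flow_hourly = (
    events.groupby(["terminal", "hour", "direction"]).size()
    .unstack("direction", fill_value=0)
    .rename(columns={"in": "inbound", "out": "outbound"})
    .reset_index()
)
gate_flow_hourly["events"] = gate_flow_hourly["inbound"] + gate_flow_hourly["outbound"]
print(gate_flow_hourly)
```

Each output row is one terminal-hour, which is the grain the forecasting models work with.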

Yard inventory over time (history + forecast) for the demo terminal.
Gate throughput: actual vs forecast.

4 · How training actually works (semi-technical)

No hand-waving. This is the exact recipe the code runs for every single terminal-series.

Step 1 — turn history into a “study sheet”

A model can’t learn from a raw line on a chart. We convert the series into rows of features → answer. The features are engineered with no look-ahead (we never let the model peek at the future):

Feature | Meaning
lag_1, lag_2, lag_3, lag_24 | The value 1, 2, 3 and 24 periods ago.
roll_mean_24, roll_std_24 | Recent average & volatility (shifted so it’s past-only).
hour, dow, is_weekend, month_day | Calendar context.
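The feature table can be sketched with pandas as follows. This is an illustration rather than the project’s code: the function name build_features is hypothetical, month_day is assumed to mean day of the month, and the input is a synthetic hourly series. The key detail is the shift(1) before the rolling window, which keeps the current value out of its own features.

```python
import numpy as np
import pandas as pd

def build_features(series: pd.Series) -> pd.DataFrame:
    """Turn an hourly series into rows of past-only features plus the answer y."""
    df = pd.DataFrame({"y": series})
    for k in (1, 2, 3, 24):
        df[f"lag_{k}"] = df["y"].shift(k)        # value k periods ago
    past = df["y"].shift(1)                      # exclude the current value
    df["roll_mean_24"] = past.rolling(24).mean()
    df["roll_std_24"] = past.rolling(24).std()
    idx = df.index
    df["hour"] = idx.hour
    df["dow"] = idx.dayofweek
    df["is_weekend"] = (idx.dayofweek >= 5).astype(int)
    df["month_day"] = idx.day                    # assumption: day of month
    return df.dropna()                           # drop warm-up rows without full history

index = pd.date_range("2026-03-16", periods=24 * 14, freq="60min")
series = pd.Series(np.arange(len(index), dtype=float), index=index)
table = build_features(series)
print(table.head(3))
```

With a steadily rising toy series, every rolling mean is below the current value, which is a quick sanity check that no future information leaked in.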

Step 2 — three candidates compete (cross-validation)

We use rolling-origin TimeSeriesSplit cross-validation. Crucially, it always trains on the past and tests on the future — never the reverse — because shuffling time would be cheating. Each of the three algorithms gets scored on the same folds; lowest average error wins.

Step 3 — refit the winner & score it honestly

The winner is refit on all the training data, then measured on a clean hold-out it has never seen. That hold-out error is the MAE / MAPE you see in the app — an honest estimate of real-world accuracy, not a self-graded test.
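The refit-and-score step might look like the sketch below. It assumes an 80/20 chronological split and uses Ridge as a stand-in winner on synthetic data; the guide does not state the actual hold-out size.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Synthetic series with past-only features (previous value, value 24 steps back).
rng = np.random.default_rng(0)
n = 400
t = np.arange(n)
y = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, n)
X = np.column_stack([np.roll(y, 1), np.roll(y, 24)])[24:]
y_target = y[24:]

# Chronological split: the last 20% is never seen during fitting (assumed ratio).
split = int(len(X) * 0.8)
X_train, X_hold = X[:split], X[split:]
y_train, y_hold = y_target[:split], y_target[split:]

winner = Ridge(alpha=1.0).fit(X_train, y_train)   # refit on all training rows
pred = winner.predict(X_hold)

mae = float(mean_absolute_error(y_hold, pred))
mape_pct = float(100 * mean_absolute_percentage_error(y_hold, pred))
print(f"hold-out MAE={mae:.2f} units, MAPE={mape_pct:.2f}%")
```

Slicing by position rather than shuffling keeps the hold-out strictly later in time than anything the model was fitted on.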

Step 4 — persist it

The winning model (plus its feature list) is saved with joblib.dump() to models/<terminal>_<series>.joblib. That saved file is exactly what the app re-loads later — and what would be registered in Fabric/MLflow in production.

The whole fleet — 127 models across 32 terminals — trained in about 7.3 minutes on a normal machine.
Which algorithm won, across the whole fleet.

5 · How accurate is it — honestly?

We report the good and the bad. This is measured accuracy on held-out data, not marketing.

Fleet scorecard (median across terminals)

Series | Median MAE | Median MAPE | Read it as
Yard inventory | 26.36 | 1.8% | excellent: usually within ~2%.
Gate total | 120.79 | 50.8% | directional: good for planning, not exact counts.
Gate inbound | 22.04 | n/a* | solid on absolute units.
Gate outbound | 9.38 | n/a* | solid on absolute units.

*MAPE is “n/a” when some periods have zero activity: MAPE divides each error by the actual value, so it is undefined when that value is zero. MAE is the reliable metric there.
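The two metrics, and the reason MAPE breaks down on quiet series, can be shown with a few numbers. The helper names and the choice to return None for the undefined case are assumptions made for this sketch; the sample values are invented.

```python
import numpy as np

def mae(actual, predicted) -> float:
    """Mean Absolute Error in the series' own units."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(a - p)))

def mape_percent(actual, predicted):
    """MAPE in percent, or None when any actual value is zero (undefined)."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    if np.any(a == 0):
        return None
    return float(100 * np.mean(np.abs((a - p) / a)))

# A busy series: errors of a few units are a small percentage.
busy_actual, busy_pred = [200, 250, 300], [190, 260, 297]
# A quiet series with an idle period: MAE is fine, MAPE is undefined.
quiet_actual, quiet_pred = [0, 2, 4], [1, 1, 5]

print(mae(busy_actual, busy_pred), mape_percent(busy_actual, busy_pred))
print(mae(quiet_actual, quiet_pred), mape_percent(quiet_actual, quiet_pred))
```

On the quiet series every forecast is off by just one unit, yet a miss of 1 against an actual of 2 is already a 50% error, which is why small-volume gate series look poor in percentage terms.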

Why yard is so good and gate is harder

  • Yard inventory is a smooth, slowly-moving level — easy to forecast tightly (median ~1.8% error).
  • Gate traffic is spiky and low-volume at many terminals — a few units off becomes a big percentage. The MAE (absolute units) stays small, so it’s still useful for staffing decisions.

The honest caveat: ~11 terminals need a data fix

Some terminals (e.g. NWO, SYR, FAB) show negative cumulative inventory, because more units left than arrived inside our 2.5-month window — those units arrived before the data starts. Their congestion flags are artifacts, not real risk. The genuine high-risk terminals are BED, NSH, GRE, DET, WOR, INR, CHA. The fix is to seed each terminal’s true starting inventory — which is exactly the kind of thing more history solves (see the next chapter).
Terminals ranked by forecast congestion headroom.
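The negative-inventory artifact and a possible seeding fix can be sketched as below. The hourly counts are invented, and the seed used here is only the smallest starting inventory that keeps the running total non-negative, which is an assumption for illustration; a real fix would use each terminal’s actual opening yard count.

```python
import numpy as np

# Invented hourly gate counts for one terminal, starting mid-stream.
inbound = np.array([3, 1, 0, 2, 5, 4])
outbound = np.array([5, 4, 2, 1, 1, 2])

# Running inventory if the yard is (wrongly) assumed empty at the start.
running = np.cumsum(inbound - outbound)

# Units that left before any matching arrival must have been there already,
# so the most negative level gives a lower bound on the starting inventory.
seed = int(max(0, -running.min()))
seeded = seed + running

print("unseeded:", running.tolist())
print("seed:", seed, "seeded:", seeded.tolist())
```

The shape of the curve is unchanged by seeding; only its level shifts, which is what matters when comparing against a congestion line.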

6 · Will it be accurate with one year of data, in Fabric?

Short answer: very likely yes, and accuracy should hold or improve. Here is why, and what changes.

Why more history improves accuracy

Today (~2.5 months) | With ~1 year
Models see only spring weeks. | Models see all seasons: peaks, holidays, slow months.
Can’t learn weekly + seasonal cycles fully. | Learns day-of-week, month, and seasonal patterns.
Negative-inventory artifact at some terminals. | Long window captures true inflow/outflow → artifact disappears.
Yard MAPE already ~1.8%. | Expected to hold or improve, and gate forecasts stabilise.

Does the method change? No.

The exact same pipeline — features, three-way model competition, time-series cross-validation, hold-out scoring, persisted winners — runs unchanged on one year of data. You simply point it at more history. The code is already written to train per terminal, so it scales by repeating the same recipe.

Does it scale to Fabric? Yes — this is the production home

The local Python you’re running maps 1:1 onto Microsoft Fabric. Nothing is throwaway:

Local POC (now) | Microsoft Fabric (production)
Raw JSON → NDJSON shards | Data Factory / Eventstream ingest into OneLake
DuckDB → gold parquet tables | Lakehouse medallion (Bronze→Silver→Gold Delta tables)
train_fleet.py (scikit-learn) | Spark notebooks running the same code, parallel across terminals
.joblib files in models/ | MLflow model registry (versioned, champion/challenger)
Flask app + this guide | Power BI semantic model + dashboards + Copilot

How it becomes “always on”

  • Inference (cheap): recompute yard forecasts hourly, gate forecasts daily.
  • Retraining (heavier): refresh models weekly/monthly as new data lands.
  • Drift watch: if accuracy degrades or input patterns shift, auto-retrain.
  • One year of rolling history keeps each model current with real seasonality.
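One simple way to express the drift watch is to compare recent forecast errors with the MAE recorded at training time. The sketch below is only an illustration: the function name needs_retrain, the 1.5× tolerance and the sample error values are hypothetical, and the baseline reuses the fleet median yard MAE of 26.36 purely as an example figure.

```python
import numpy as np

def needs_retrain(recent_abs_errors, baseline_mae: float, tolerance: float = 1.5):
    """Flag a model when its recent MAE exceeds tolerance x its hold-out MAE."""
    recent_mae = float(np.mean(recent_abs_errors))
    return recent_mae > tolerance * baseline_mae, recent_mae

baseline = 26.36  # example baseline: fleet median yard MAE from the scorecard

stable_flag, stable_mae = needs_retrain([20, 25, 30, 28], baseline)
drift_flag, drift_mae = needs_retrain([40, 55, 60, 70], baseline)

print("stable:", stable_flag, stable_mae)
print("drifting:", drift_flag, drift_mae)
```

A check like this is cheap enough to run alongside every scheduled inference, leaving the heavier retraining for the cases it flags.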
Bottom line: one year of data in Fabric means the same models you see today, trained on richer history, served continuously, versioned, and surfaced in Power BI — more accurate and fully operational.

7 · Does it cover everything? The full picture

Yes. Here is the entire system on one page, from a raw event to a decision a planner makes.

 RAW EVENTS                845,277 equipment-activity events (32 terminals, 2.5 months)
     |
     v
 INGEST / SHARD            stream huge JSON  ->  NDJSON shards        (Fabric: Eventstream/Data Factory)
     |
     v
 GOLD TABLES              gate_flow_hourly, yard_hourly, dwell, quality   (Fabric: Lakehouse Delta)
     |
     v
 FEATURE BUILD            lags + rolling stats + calendar (no look-ahead)
     |
     v
 TRAIN PER TERMINAL       3 algorithms compete  ->  pick winner by time-series CV
     |                    127 models, 32 terminals
     v
 SCORE (hold-out)         MAE / MAPE recorded honestly per model
     |
     v
 PERSIST                  models/<terminal>_<series>.joblib    (Fabric: MLflow registry)
     |
     v
 SERVE                    Flask app + this guide                 (Fabric: Power BI + Copilot)
     |
     v
 DECISION                 "BED will stay congested in 24h"  ->  planner acts

Every aspect, checked

Aspect | Covered by
What a model is | Chapter 1
What the app shows | Chapter 2
What we predict + data | Chapter 3
How training works | Chapter 4
Real accuracy (good & bad) | Chapter 5
One year of data + Fabric scale | Chapter 6
End-to-end flow | this chapter
How to run it yourself | Chapter 8

8 · Run it yourself

Three things you can do right now.

See live predictions (the app)

python app.py
# then open  http://127.0.0.1:5000

Pick a terminal, read its forecasts, click ▶ Load & run the real .joblib model to prove the model loads from disk and runs.

Re-train the whole fleet

python train_fleet.py
# writes models/*.joblib, fleet_results.json, fleet_scorecard.json

Rebuild the documents

python build_doc.py     # the slide-style POC book  -> docs/documentation.html
python build_guide.py   # THIS explanatory guide     -> docs/guide.html
Two documents, two purposes: documentation.html is the presentation-style book; this guide.html is the long-form explanatory reference you’re reading now.