Knowledge Pack · Pre-Solution
Intermodal Operations Hub — Predictive Analytics
The complete picture of what we have and what we are trying to solve — before we talk about how to solve it.
1 · Problem Statement
An Intermodal Operations Hub (IO Hub) moves containers and trailers across the country using rail for the long haul and trucks for first/last mile. Every day, thousands of containers, chassis, and railcars flow through dozens of terminals — gating in, sitting in the yard, mounting onto trains, traveling, and gating out again.
The operation is reactive. Teams find out a yard is full after it is full, learn a gate is congested while trucks are already queued, and discover a chassis shortage when a driver is already waiting. There is rich event data, but it is not yet turned into foresight.
In plain words
We have a detailed minute-by-minute diary of every container's movements, but no one has used that diary to see a few hours into the future — so the yard, the gate, and the equipment are all managed by reacting instead of planning.
The core problem: we cannot reliably answer simple forward-looking questions such as —
- “How full will Terminal X's yard be in 6 hours? Will it breach capacity?”
- “How many trucks will hit the gate in the next 1–3 hours?”
- “Which containers are sitting too long (dwell), and which equipment is idle or bad-ordered?”
- “Will we have enough empties / chassis / railcars where and when they're needed?”
2 · What We're Trying to Achieve
Turn the existing equipmentActivityReported event stream into predictive, actionable insight — moving the operation from reactive to proactive.
Scope of THIS document
This is the problem & source brief. It establishes the business, the operations, the use cases, and exactly what the source data contains. The solution (data pipeline, models, forecasts) is discussed separately.
3 · The Business — Intermodal 101
Intermodal transportation moves freight using more than one mode — primarily rail (long-haul) and truck (first/last mile) — without handling the freight itself when switching modes, because it stays inside the same container.
Core assets
- Containers (CSXU / UMAX …)
- Chassis (wheeled platforms)
- Railcars & Trains
The primary objective (verbatim from the business doc)
“Ensure the right equipment (containers + chassis) is available at the right terminal, at the right time, to meet customer demand while optimizing cost and utilization.”
End-to-end flow of a container
- Reservation / booking is placed.
- Empty container staged at the terminal.
- Container enters the terminal by truck → added to yard inventory.
- Container moved from yard, mounted/lifted onto a railcar.
- Loaded train travels enroute between terminals.
- Containers removed from railcars into the yard/chassis.
- Customer/driver picks up → inventory reduced.
- Container is reused or repositioned for the next cycle.
Container states
| State | Meaning |
|---|---|
| Empty | Available for booking |
| Loaded | Carrying a customer shipment |
| Mounted Empty | Empty container on a chassis (ready) |
| Mounted Load | Loaded container on a chassis |
| Grounded | Not on a chassis (stacked/stored) |
Business tension: too many mounted empties = wasted cost; too few = service delay. Grounded = storage; Mounted = readiness.
4 · Terminal Operations
A terminal's day revolves around three things: the gate (entry/exit), the yard (storage), and the train (load/unload). These are exactly the flows our data captures.
Gate operations — the inventory valve
The single most important equation
Net Inventory = Gate-In − Gate-Out (plus train arrivals − train departures). This is what drives yard capacity, demand planning, and repositioning decisions.
Yard, chassis & train
- Yard: arrival → storage → assignment to a booking → mount/lift → load or gate-out.
- Chassis: bare → mounted → used → returned → reused. Availability directly limits pickups.
- Load to train: reservations set demand; units staged & lifted; track Units Loaded vs Left-to-Load.
- Train unload: arrivals removed into yard/chassis → feeds the next demand cycle.
- Enroute: loads & empties already moving between terminals.
Operational challenges called out in the source: over-/under-usage of chassis, bad-order equipment reducing usable capacity, and yard overflow.
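The net-inventory equation from the gate discussion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual logic: the event kinds (`gate_in`, `gate_out`, `train_arrival`, `train_departure`), the `hourly_inventory` helper, and the opening count are hypothetical names chosen for the example, and the sample events are made up.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical event kinds and their effect on the yard count.
# Mapping raw feed events onto these four kinds is an assumption of this sketch.
DELTA = {"gate_in": +1, "gate_out": -1, "train_arrival": +1, "train_departure": -1}

def hourly_inventory(events, opening_count=0):
    """Return [(hour, count at end of hour)] from (timestamp, kind) pairs.

    Hours with no events are omitted rather than filled in.
    """
    net_per_hour = Counter()
    for timestamp, kind in events:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        net_per_hour[hour] += DELTA[kind]

    result, count = [], opening_count
    for hour in sorted(net_per_hour):
        count += net_per_hour[hour]  # Net Inventory = ins minus outs, accumulated
        result.append((hour, count))
    return result

def utc(*parts):
    return datetime(*parts, tzinfo=timezone.utc)

# Made-up events for one terminal.
events = [
    (utc(2026, 5, 29, 20, 5), "gate_in"),
    (utc(2026, 5, 29, 20, 40), "gate_in"),
    (utc(2026, 5, 29, 20, 54), "train_arrival"),
    (utc(2026, 5, 29, 21, 10), "gate_out"),
    (utc(2026, 5, 29, 21, 30), "train_departure"),
    (utc(2026, 5, 29, 21, 45), "gate_in"),
]

inventory = hourly_inventory(events, opening_count=100)
for hour, count in inventory:
    print(hour.isoformat(), count)
```

The opening count matters: the event stream only records changes, so a running total is only as good as the starting balance it is added to.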
5 · The 8 Predictive Use-Case Groups
The IO Hub vision defines 8 groups of predictive & analytics use cases. They are the menu of value we want to unlock from the data.
Inventory Prediction (Yard Capacity Forecast) · POC focus
Objective: Predict yard congestion and optimize asset utilization.
- Forecast containers/trailers in the yard over time (next 6–24 h)
- Identify yard blocks/zones likely to reach capacity
- Predict when congestion will occur
- Empty-container availability & dwell-time prediction
- Inbound surge detection; chassis availability / shortage / turnaround / idle / bad-order
Equipment Usage & Availability
Objective: Understand how equipment is utilized and when it will be short.
- Predict equipment utilization trends
- Forecast downtime from maintenance & repairs
- Identify potential equipment shortages
Intermodal Crew · future scope
Objective: Crew planning and assignment.
- Future scope — not in this dataset's reach yet
Railcar Demand & Availability Forecast
Objective: Balance railcar demand and supply.
- Demand: railcars required (reservations / projected volumes)
- Supply: railcars in yard, en route, being unloaded
- Gap analysis: shortage / surplus + recommendations
Gate Operations Analytics · POC focus
Objective: Optimize gate flow and reduce truck wait times.
- Gate congestion: truck arrivals in next 1–3 h
- Inbound vs outbound flow (drop vs pickup)
- Throughput forecast (transactions / hour / day)
- Peak-hour patterns, lane-level optimization, gate dwell, driver scoring
Intermodal Train Analytics
Objective: Improve train performance & visibility.
- Avg train footage by destination
- Container distribution by destination/shipper
- On-time performance, load/unload times
Safety Prediction · future scope
Objective: Proactively identify and mitigate safety risks.
- Predict high-risk zones, incident patterns, unsafe-condition alerts (future scope)
Conversational Analytics (Chat Module) · future scope
Objective: Let business users ask questions in plain English.
- “Which yard block will be full in the next 4 hours?”
- “What is the expected gate congestion at 5 PM?”
- “How many railcars are currently idle?”
Reading the badges
POC focus = directly supported by the data we already have · future scope = needs more sources or later phases.
6 · The Source Data — equipmentActivityReported
Everything above is powered by one real dataset: an export of equipment-activity events from the container-history system (IPRO_CONTAINER_HISTORY).
What one record looks like
It is MongoDB extended JSON — one big array of event objects. Each event is a snapshot of a single piece of equipment at a moment in time. A trimmed real example:
{
  "equipmentId": "UNKN324740",
  "equipmentTypeCode": "H", // H = chassis
  "loadEmptyStatus": "E", // Empty
  "currentStatus": "On Ramp",
  "trainMoveCategory": "Storage",
  "eventDescription": "Created",
  "eventDateTime": "2026-05-29T20:54:21Z", // the heartbeat
  "terminal": { "code": "WOR", "name": "WORCESTER", "region": "North" },
  "shipperName": "JB HUNT TRANSPORT SERVICES INC",
  "flatcarId": "BNSF211633",
  "holdFlag": 0, "damageFlag": -1,
  "header": { "uuid": "36395582-…", "messageTypeId": "imod_equipment_activity_reported" }
}
The full proto definition lists every field; the Container Inventory — Important Fields doc highlights the ones that matter for reporting and prediction, catalogued next.
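As a rough illustration of how one such record might be read in Python, the sketch below parses a copy of the trimmed example and decodes a few coded fields. It assumes `eventDateTime` arrives as an ISO-8601 string, as shown in the example; the real export is MongoDB extended JSON and may wrap some values differently. The `summarize` helper and its output keys are hypothetical, and the `//` comments are stripped because plain JSON does not allow them.

```python
import json
from datetime import datetime

# Copy of the trimmed record above, without the // comments and with a few fields dropped.
RAW_EVENT = """
{
  "equipmentId": "UNKN324740",
  "equipmentTypeCode": "H",
  "loadEmptyStatus": "E",
  "currentStatus": "On Ramp",
  "trainMoveCategory": "Storage",
  "eventDescription": "Created",
  "eventDateTime": "2026-05-29T20:54:21Z",
  "terminal": {"code": "WOR", "name": "WORCESTER", "region": "North"},
  "holdFlag": 0,
  "damageFlag": -1
}
"""

# Code tables taken from the field catalog.
EQUIPMENT_TYPES = {"C": "Container", "T": "Trailer", "H": "Chassis"}
LOAD_STATUS = {"L": "Loaded", "E": "Empty"}

def summarize(event):
    """Pick out the fields most analyses start from (hypothetical helper)."""
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    event_time = datetime.fromisoformat(event["eventDateTime"].replace("Z", "+00:00"))
    damage = event.get("damageFlag")
    return {
        "equipment_id": event["equipmentId"],
        "type": EQUIPMENT_TYPES.get(event["equipmentTypeCode"], "Unknown"),
        "load_status": LOAD_STATUS.get(event["loadEmptyStatus"], "Unknown"),
        "terminal": event["terminal"]["code"],
        "event_time": event_time,
        # 1 = damaged, -1 = unknown; reading any other value as "not damaged" is an assumption.
        "damaged": None if damage in (-1, None) else damage == 1,
    }

summary = summarize(json.loads(RAW_EVENT))
print(summary["type"], summary["terminal"], summary["event_time"].isoformat())
```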
7 · Field Catalog — what each column tells us
The 119 raw fields group into seven meaningful families. Below are the key fields for tracking, reporting, and (later) prediction.
Identity & Ownership — Uniquely identify the equipment and who owns it.
| Field | Type | What it is | Why it matters |
|---|---|---|---|
| equipmentId | string | Unique container/trailer ID (prefix + number), e.g. UNKN324740. | Primary key — tracks one asset across every event. |
| unitPrefix | string | Equipment owner/type prefix (CSXU, UMXU, UNKN…). | Ownership & type classification. |
| chassisId | string | ID of the chassis paired with the container. | Container ↔ chassis pairing. |
| chassisPrefix | string | Chassis owner prefix (e.g. TSFZ). | Chassis ownership/type tracking. |
| railcarPrefix / railcarClass / railcarType | string | Railcar owner & class (e.g. DTTX / S635 / 3W). | Railcar-level reporting. |
| flatcarId | string | Flatcar the unit rides on (e.g. BNSF211633). | Links the unit to a physical railcar. |
| equipmentTypeCode | char | C = Container, T = Trailer, H = Chassis. | Segments the dataset by asset type. |
| isoSizeType / isoTypeCode / chassisSize | string/int | ISO size-type (U_53), container/chassis size (53 ft). | Capacity & equipment-mix analysis. |
Shipment & Business Context — Who the shipment is for and where it goes.
| Field | Type | What it is | Why it matters |
|---|---|---|---|
| shipperName | string | Shipper from the waybill (e.g. JB HUNT TRANSPORT SERVICES INC). | Customer-level reporting. |
| consigneeName | string | Party receiving the shipment. | Delivery / customer visibility. |
| originCity / destinationCity | string | Origin and waybill destination city. | Lane & flow analysis. |
| finalDestination / finalDestinationCity | string | End destination (incl. outside the network). | End-to-end shipment visibility. |
| unitAgreementName | string | Intermodal contract/agreement name. | Billing & contract analytics. |
| billingPatronName / billingPatronCode | string | Billing party. | Revenue / customer attribution. |
| bookingNumber / billOfLadingNumber | string | Booking & bill-of-lading references. | Ties an event to a reservation. |
Movement & Train Classification — How the unit is moving through the rail network.
| Field | Type | What it is | Why it matters |
|---|---|---|---|
| trainMoveCategory | string | Inbound, Outbound, Transrail, Through, Storage. | The core operational category — drives gate & inventory KPIs. |
| equipmentMoveType | string | Move between locations (Y-R yard→rail, R-Y rail→yard…). | Tracks yard ↔ rail transitions. |
| intermodalActivityType | string | Activity classification (e.g. UNSPECIFIED). | Event typing. |
| arrivalTrain6 / departureTrain6 | string | Inbound & outbound train IDs. | Links the unit to train operations. |
| fromRailcar / toRailcar | string | Railcar a unit came off / goes onto. | Load/unload planning & execution. |
| railcarSequence / trainProcessingTrack | int/string | Position in train, processing track. | Load-plan & yard-track detail. |
Status & Condition — The current state and health of the unit.
| Field | Type | What it is | Why it matters |
|---|---|---|---|
| loadEmptyStatus | char | L = Loaded, E = Empty. | Core capacity & utilization metric. |
| currentStatus | string | Location/state — Yard, On Ramp, Offsite… | Real-time inventory tracking. |
| holdFlag / holdReason | int/string | On hold? + reason (CUSTOMS, STOPORDER…). | Exception & delay analysis. |
| damageFlag / chassisDamageFlag | int | 1 = damaged (-1 = unknown). | Bad-order & maintenance tracking. |
| priorityShipmentFlag | string | High-priority shipment indicator. | SLA / expedite handling. |
| coneStatus / equipmentLocationOnRailcar | string | Securement & stack position (BOTTOM…). | Loading detail & safety. |
Timing & Lifecycle — When the key events happened.
| Field | Type | What it is | Why it matters |
|---|---|---|---|
| eventDateTime | datetime (UTC) | Timestamp of the event — the heartbeat of all analytics. | Every time-series & forecast is built on this. |
| yardArrivalTime | datetime | Physical arrival at the terminal. | Start of dwell-time calculation. |
| yardDepartureTime | datetime | Departure from the terminal. | End of dwell & throughput. |
| creationTime / lastUpdateTime | datetime | Record created / last updated. | Data freshness & lineage. |
| groundedDate / notifyDate / holdReleaseDate | datetime | Grounded, customer-notify, hold-release times. | Lifecycle & gate-out readiness. |
Location & Source — Where, and from which system, the event occurred.
| Field | Type | What it is | Why it matters |
|---|---|---|---|
| terminal.{code,name,region} | object | Terminal code (WOR), name (WORCESTER), region (North). | Location-based KPIs — the unit of analysis. |
| terminal.{fsac,milepost,city,state} | object | Terminal geography & rail milepost. | Geospatial & network context. |
| eventCity / eventState | string | City/state where the event fired. | Location cross-check. |
| parkingLocation / trainProcessingTrack | string | Physical spot in the yard / track. | Yard-zone & congestion detail. |
| eventSource / eventDescription | string | Source system (IPRO_CONTAINER_HISTORY) & event label (Created, Arrived, Notified…). | Event classification & filtering. |
| header.{uuid,time,messageTypeId} | object | Message envelope: unique id, emit time, type. | Idempotency & streaming lineage. |
Physical Attributes — Size and weight of the unit and its cargo.
| Field | Type | What it is | Why it matters |
|---|---|---|---|
| containerHeight / containerWidth / equipmentLength | int (mm) | Physical dimensions. | Capacity & clearance. |
| grossWeight / tareWeight / cargoWeight | int | Total, empty, and cargo weight. | Weight planning & load limits. |
| equipmentCastingType | string | Casting type (ISO…). | Equipment classification. |
8 · Data Reality & Caveats
Being honest about the source is part of the complete picture.
Known data realities
- Null sentinels: date -2208988800000 (1900-01-01) and N/A / -1 values mean unset, not real data — must be treated as missing.
- Events ≠ true inventory: these are activity events; yard inventory must be derived from arrival/departure deltas (assumptions to be documented in the solution).
- Single export, finite window: ~2.5 months of history — any forecast horizon must be scaled to what the history can support.
- Sparse fields: many of the 119 columns are mostly N/A for a given event type (e.g. railcar fields on a pure yard event).
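One way to treat the null sentinels as missing is sketched below. It assumes date fields can arrive as epoch milliseconds (the form in which the sentinel is quoted above); the helper names `clean_timestamp`, `clean_text`, and `clean_flag` are hypothetical, and the sample record is made up.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SENTINEL_EPOCH_MS = -2208988800000  # decodes to 1900-01-01T00:00:00Z
SENTINEL_TEXT = "N/A"
SENTINEL_NUMBER = -1

def clean_timestamp(epoch_ms):
    """Return an aware UTC datetime, or None for the 1900-01-01 placeholder."""
    if epoch_ms is None or epoch_ms == SENTINEL_EPOCH_MS:
        return None
    return EPOCH + timedelta(milliseconds=epoch_ms)

def clean_text(value):
    """Return the string, or None when it is the N/A placeholder."""
    if value is None or value.strip() == SENTINEL_TEXT:
        return None
    return value

def clean_flag(value):
    """Return True/False, or None when the flag is the -1 placeholder."""
    if value is None or value == SENTINEL_NUMBER:
        return None
    return bool(value)

# Made-up record mixing real-looking values and placeholders.
record = {
    "yardArrivalTime": 1780088061000,
    "yardDepartureTime": -2208988800000,
    "holdReason": "N/A",
    "holdFlag": 0,
    "damageFlag": -1,
}

cleaned = {
    "yardArrivalTime": clean_timestamp(record["yardArrivalTime"]),
    "yardDepartureTime": clean_timestamp(record["yardDepartureTime"]),
    "holdReason": clean_text(record["holdReason"]),
    "holdFlag": clean_flag(record["holdFlag"]),
    "damageFlag": clean_flag(record["damageFlag"]),
}
print(cleaned)
```

Converting placeholders to `None` at the edge keeps them out of averages and dwell calculations, where a 1900 date would otherwise look like a century-long stay.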
In plain words
The data is rich and real, but it's a diary of events, not a tidy ledger of "how many are in the yard right now." We'll have to reconstruct the counts we care about — and we must ignore the placeholder values that only look like data.
9 · POC Scope & the Next Step
Of the 8 use-case groups, three are directly supported by the data we already have — no extra source systems required. These are the natural POC focus:
| Priority | Use case | What it answers | Source group |
|---|---|---|---|
| 1 | Yard Inventory / Capacity | Hourly yard count + forecast + congestion flag | Group 1 · Inventory |
| 2 | Gate / Throughput | Inbound vs outbound flow + near-term forecast | Group 5 · Gate Ops |
| 3 | Dwell & Equipment Quality | Dwell-time, idle equipment, hold/damage rates | Groups 1 / 2 |
Where this goes next
With the problem and source now fully framed, the solution discussion covers: how we ingest & curate the data, how we derive the metrics, the forecasting approach, accuracy, and the path to production on Microsoft Fabric. That is a separate conversation.
Document synthesized from: Container Inventory — Important Fields, Intermodal Business & Terminal Operations, IO Hub Predictive & Analytics Use Cases, and the live equipmentActivityReported source. Generated May 31, 2026.
PART TWO · LAYMAN → SEMI-TECHNICAL · READ TOP TO BOTTOM
What do all these numbers mean — and how did we get them?
You saw cards full of numbers and a “leaderboard” of models. This part explains every piece, in plain English, then a little deeper.
Start here — the one big idea
We have a giant diary of what every container did over the last ~2.5 months. We use that history to teach a computer to guess the near future — how full a yard will get, how many trucks will arrive, how long things will sit.
The whole thing in one sentence
Learn from the past → guess the future → check how close the guess was → keep the model that guessed best.
That's it. Everything below is just the detail behind those four steps.
10 · Every number on the card, explained
Here's a real Yard card and what each line is telling you:
Yard inventory (+24h) · winning model: Ridge · 1739 train pts
Last actual: 52,348
Forecast (next): 52,422.5
MAE (error): 64.56
Accuracy (MAPE): 0.1% # = 99.9% accurate
Congestion threshold: 45,566
Forecast vs threshold: OVER # congestion predicted
| What you see | What it means | In plain words |
|---|---|---|
| Last actual | The most recent real, measured value. | “Right now there are 52,348 containers in the yard.” |
| Forecast (next) | What the model predicts for the next period. | “In 24 hours I expect ~52,422.” |
| MAE — Mean Absolute Error | Average miss, in the same units (containers, trucks). Lower is better. | “On a fair test, I was off by about 65 containers on average.” |
| MAPE — Mean Absolute % Error | Average miss as a percentage. Accuracy = 100 − MAPE. | “0.1% error → 99.9% accurate.” |
| train pts | How many historical points the model learned from. More = better. | Yard had 1,739 hourly points; Gate had only 67 daily points. |
| Congestion threshold | A chosen ‘full’ capacity line for the yard. | 45,566 = the line we don't want to cross. |
| Forecast vs threshold | The action signal: is the forecast above the line? | OVER → congestion predicted → prepare space/staff. |
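The MAE, MAPE, accuracy, and threshold rows in the table can be reproduced with a few lines of arithmetic. The sketch below uses made-up actual and forecast values; only the forecast (52,422.5) and threshold (45,566) passed to `congestion_flag` come from the card. The function names and the `"OK"` label for the non-congested case are assumptions, since the card only shows `OVER`.

```python
def mae(actual, forecast):
    """Mean absolute error, in the same units as the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def congestion_flag(forecast_value, threshold):
    """Action signal: is the forecast above the chosen capacity line?"""
    return "OVER" if forecast_value > threshold else "OK"  # "OK" is a hypothetical label

# Made-up hold-out values for a yard-sized series.
actual = [52000, 52100, 52300]
forecast = [52050, 52040, 52380]

error_units = mae(actual, forecast)       # average miss in containers
error_percent = mape(actual, forecast)    # average miss in percent
accuracy = 100 - error_percent            # the "accuracy" shown on the card

print(f"MAE={error_units:.2f}  MAPE={error_percent:.2f}%  accuracy={accuracy:.2f}%")
print(congestion_flag(52422.5, 45566))    # values from the card above
```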
The one trick to remember
MAE = how far off in real units (containers). MAPE = how far off in percent. Accuracy = 100 − MAPE. When MAE is small but MAPE is big, it just means the numbers themselves are small (more on that in §15).
11 · The “model race” (the leaderboard)
For each thing we predict, we don't pick one method and hope. We enter six different algorithms into a race, give them all the same data, and let the most accurate one win.
How the race is judged
We hide a recent slice of real history. Each algorithm predicts it. We measure who came closest (lowest error). Lowest number wins. The winner is saved and used for the live forecast.
Your Yard leaderboard, read as a race result (lower = better):
ARIMA 117.79 / Prophet 119.12 — close, time-series methods
RandomForest 5,362 · XGBoost 5,804 · HistGB 6,162 — tree models lost badly here
Why did the powerful tree models lose on yard? Because yard inventory is a smooth, slowly-drifting line — a straight-line method (Ridge) fits it better than complex trees, which overcomplicate it. Different problems, different winners.
One honest footnote
The leaderboard number (Ridge 233.63) comes from cross-validation during training; the card's MAE (64.56) comes from the final hold-out test. Same idea, measured at two stages on different slices of data; that's why the two numbers differ.
12 · What each algorithm actually does
The six racers, in plain words — and what kind of problem each is good at.
Ridge · Linear regression · 🏆 Yard, Inbound
What it does: Draws the best straight-line / smooth relationship through the data.
In plain words
Like drawing a steady trend line through dots that mostly drift in one direction.
Best for: Smooth, slowly-changing numbers — e.g. yard inventory.
RandomForest · Tree ensemble · 🏆 Outbound
What it does: Asks hundreds of yes/no question-trees and averages their votes.
In plain words
Like polling 300 experts who each look at the data differently, then taking the average.
Best for: Bumpy, non-linear patterns and sudden jumps.
HistGradientBoosting · Boosted trees · strong all-rounder
What it does: Builds trees one after another, each fixing the last one's mistakes.
In plain words
Like a student who reviews every wrong answer and studies exactly that next.
Best for: Complex patterns; fast on large data.
XGBoost · Boosted trees · strong all-rounder
What it does: Same idea as HistGradientBoosting, in a very popular, highly tuned implementation.
In plain words
The competition-winning cousin of gradient boosting.
Best for: Complex tabular patterns.
Prophet · Time-series (by Meta) · 🏆 Gate total
What it does: Splits a series into trend + weekly/yearly seasonality + holidays.
In plain words
Like saying ‘busier on Mondays, slower in summer’ and projecting that forward.
Best for: Seasonal, calendar-driven series — e.g. gate traffic.
ARIMA · Classic statistics · baseline
What it does: Predicts the next value from recent values and recent errors.
In plain words
Like guessing tomorrow mostly from the last few days.
Best for: Short-term, self-correlated series.
Why have six at all?
No single method wins everywhere. Smooth series love Ridge; seasonal series love Prophet; spiky series love trees. Racing them means each terminal automatically gets the model that fits it best — no manual guessing.
13 · What data (tables) we have
It all starts from one source: 845,277 equipment-activity events (119 fields each, 32 terminals, ~2.5 months). From that raw stream we build four clean tables the models read:
| Table | What it holds | Used for | Built from |
|---|---|---|---|
| yard_hourly | Hourly count of equipment sitting in each terminal's yard. | Yard inventory & congestion forecast | derived from gate-in/out + train arrivals/departures |
| gate_flow_hourly | Hourly trucks gating in and out per terminal. | Gate throughput, inbound vs outbound flow | from gate-in / gate-out events |
| dwell | How long each unit stayed in the yard (arrival → departure). | Dwell-time, slow-mover & idle detection | yardArrivalTime → yardDepartureTime |
| equipment_quality | Hold and damage flags per terminal. | Bad-order / exception analytics | holdFlag, damageFlag, holdReason |
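As an illustration of how a table like `dwell` might be derived, the sketch below computes arrival-to-departure hours per unit and flags slow movers. It is a plain-Python stand-in for the real summarization step, on made-up units; the `dwell_hours` helper, the unit IDs, and the 72-hour slow-mover cutoff are all hypothetical.

```python
from datetime import datetime, timezone
from statistics import median

SLOW_MOVER_HOURS = 72  # hypothetical cutoff, not a business rule from the source

def dwell_hours(arrival, departure):
    """Hours from yard arrival to departure, or None if it cannot be computed."""
    if arrival is None or departure is None or departure < arrival:
        return None
    return (departure - arrival).total_seconds() / 3600

def utc(*parts):
    return datetime(*parts, tzinfo=timezone.utc)

# Made-up units; None stands for a missing (or sentinel) departure time.
units = [
    {"equipmentId": "UNIT-A", "yardArrivalTime": utc(2026, 5, 1, 8), "yardDepartureTime": utc(2026, 5, 2, 8)},
    {"equipmentId": "UNIT-B", "yardArrivalTime": utc(2026, 5, 1, 10), "yardDepartureTime": utc(2026, 5, 5, 10)},
    {"equipmentId": "UNIT-C", "yardArrivalTime": utc(2026, 5, 3, 6), "yardDepartureTime": None},
]

dwell = {
    u["equipmentId"]: dwell_hours(u["yardArrivalTime"], u["yardDepartureTime"])
    for u in units
}
completed = [hours for hours in dwell.values() if hours is not None]
median_dwell = median(completed)
slow_movers = [uid for uid, hours in dwell.items()
               if hours is not None and hours > SLOW_MOVER_HOURS]

print(dwell, median_dwell, slow_movers)
```

Units without a usable departure time are left out of the median here; whether to count them as still dwelling is a modelling choice for the solution discussion.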
Under the hood
Raw JSON → streamed into shards → summarized with DuckDB into these Parquet tables (the “gold” layer). The models never touch the 3.75 GB raw file — they read these tidy hourly summaries.
14 · How we train and pick a winner
- Shape the data: turn events into an hourly/daily time-series per terminal.
- Add clues (features): hour-of-day, day-of-week, and recent past values (“lags”).
- Split fairly: train on older data, test on a recent slice the model never saw.
- Race the six: each predicts the hidden slice; we score the error.
- Crown the winner: lowest error wins and is saved as a .joblib file.
- Forecast: the winner predicts the next 24h (yard) or next period (gate).
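The split, race, and crown steps above can be sketched with the standard library alone. To keep it dependency-free, the sketch does not use the six real algorithms: three simple stand-in forecasters built on lag values take their place, and the hourly series is synthetic. All names (`make_series`, `CANDIDATES`, `race`) are hypothetical, scoring is one step ahead rather than +24h, and saving the winner to a `.joblib` file is left out.

```python
import math

def make_series(n_hours=240):
    """Synthetic hourly yard count: slow drift plus a daily cycle (made-up data)."""
    return [1000 + 0.5 * t + 20 * math.sin(2 * math.pi * t / 24) for t in range(n_hours)]

# Stand-in forecasters. Each one sees only the history before the hour it predicts.
CANDIDATES = {
    "last_value": lambda history: history[-1],            # lag-1
    "same_hour_yesterday": lambda history: history[-24],  # lag-24
    "mean_last_24h": lambda history: sum(history[-24:]) / 24,
}

def race(series, test_size=48):
    """Score every candidate on the most recent slice and return the lowest-MAE one."""
    split = len(series) - test_size  # older data before the split, hidden slice after it
    scores = {}
    for name, predict in CANDIDATES.items():
        errors = [abs(series[t] - predict(series[:t])) for t in range(split, len(series))]
        scores[name] = sum(errors) / len(errors)  # MAE on the hidden slice
    winner = min(scores, key=scores.get)
    return winner, scores

series = make_series()
winner, scores = race(series)
next_forecast = CANDIDATES[winner](series)  # the winner forecasts the next hour

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:22s} MAE={score:.2f}")
print("winner:", winner, "next forecast:", round(next_forecast, 1))
```

The key design point is that the split is by time, never random: every prediction for the hidden slice uses only values that came before it. The stand-ins need no fitting; a real candidate would be fitted on the data before the split only.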
In plain words
It's like studying past exams (training), then sitting a mock exam on questions you've never seen (the test). Whoever scores best on the mock is trusted with the real one.
We did this for every terminal × every metric — 127 models in ~15 minutes on a laptop CPU. The “Load & run the real .joblib” button proves each one is a genuine saved model, not a hardcoded number.
15 · Why Gate accuracy looks “low” (and what's really going on)
Two honest reasons Gate is weaker:
- Far less data: 67 points vs 1,739. Hard to learn a pattern from so few.
- Much longer horizon: guessing 14 days ahead is far harder than 24 hours ahead.
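To see why a percentage metric is harsh on low-volume series, the sketch below applies the same absolute miss to a large series and a small one. The numbers are made up for illustration and the function names are hypothetical.

```python
def mae(actual, forecast):
    """Mean absolute error, in the units of the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

MISS = 40  # the same absolute miss applied to every point (made-up)

high_volume_actual = [52000, 52100, 52300]  # yard-sized numbers
low_volume_actual = [160, 50, 100]          # small, spiky flow numbers

high_volume_forecast = [a + MISS for a in high_volume_actual]
low_volume_forecast = [a + MISS for a in low_volume_actual]

high_mae = mae(high_volume_actual, high_volume_forecast)
low_mae = mae(low_volume_actual, low_volume_forecast)
high_mape = mape(high_volume_actual, high_volume_forecast)
low_mape = mape(low_volume_actual, low_volume_forecast)

print(f"high volume: MAE={high_mae:.1f}  MAPE={high_mape:.2f}%")
print(f"low volume:  MAE={low_mae:.1f}  MAPE={low_mape:.2f}%")
```

Both series are off by the same number of units, yet the low-volume MAPE is hundreds of times larger, purely because each miss is divided by a small actual value.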
The “outbound 41.8%” trap
Outbound numbers are small and spiky (159, 51…). MAPE divides the error by the actual value — so a modest miss on a small number looks like a huge percentage. Low-volume series always look bad in MAPE even when the model is fine. Judge them by MAE (real units) instead.
In plain words
Yard is like predicting the level of a big lake — slow and smooth, easy. Gate is like predicting how many people walk through a door on a random day two weeks from now — noisy and hard, especially with only a couple months of history.
16 · What MORE data would help (and how much)
More data is the single biggest lever — especially for gate & flows.
| Data we could add | Today | Why it helps | Impact |
|---|---|---|---|
| More history (1+ year) | Only ~2.5 months today. | Lets models learn weekly & seasonal cycles. Biggest single win, especially for gate. | high |
| Train schedules / ETAs | Not in current feed. | Arrivals drive yard surges — huge for congestion timing. | high |
| Booking / reservation volumes | Partial today. | Forward demand signal for railcars & empties. | high |
| Calendar & holidays | Derivable. | Explains predictable busy/quiet days for gate. | medium |
| Weather | External. | Storms shift truck arrivals & delays. | medium |
| Chassis pool / availability feed | Partial. | Enables chassis shortage & turnaround prediction. | medium |
| Labor / shift rosters | External. | Connects forecasts to staffing decisions. | low |
If you only do one thing
Get 1+ year of history and add train schedules / ETAs. Those two unlock seasonal patterns and explain the surges that drive both yard and gate.
17 · Other algorithms we could add to the race
The current six cover the main families. With more data, these are worth entering:
| Algorithm | What it adds | Effort |
|---|---|---|
| LightGBM / CatBoost | Faster, often more accurate boosted trees (CatBoost handles categories natively). | easy add |
| SARIMA / SARIMAX | ARIMA with seasonality + external drivers (e.g. train ETAs). | easy add |
| Exponential Smoothing (ETS) | Lightweight, strong baseline for trend + seasonality. | easy add |
| LSTM / GRU (neural nets) | Deep nets for long sequences — needs lots of data & GPU. | needs 1yr+ |
| Temporal Fusion Transformer | State-of-the-art deep forecaster with feature attention. | needs 1yr+ / GPU |
| Ensembling / stacking | Blend several winners — often beats any single model. | next step |
Under the hood
Easy adds (LightGBM, CatBoost, SARIMA, ETS) plug straight into the existing race today. Deep-learning (LSTM, Transformers) only pays off with a year+ of data and ideally a GPU — otherwise it overfits and underperforms the simple models you already have. Stacking (blending winners) is usually the smartest next gain.
18 · If I add 1 year of data — is it production-ready?
| Metric | Today | With 1 year |
|---|---|---|
| Yard inventory | Production-grade (99.9%) | Rock solid |
| Gate total | Demo (≈80%) | Much better — learns weekly/seasonal cycles |
| Inbound / Outbound | Noisy (low volume) | Better; report in MAE |
1 year is necessary, but not the whole story
To be truly production-ready you also need: (1) the right horizon — forecast gate 1–3 hours ahead (the real business question), not 14 days; (2) a retraining schedule as new data arrives; (3) honest metrics (MAE for low-volume flows); (4) monitoring that alerts when error grows.
Bottom line
Yard is ready now. Gate becomes production-grade with 1 year of data + a realistic short horizon + a retrain loop — and the biggest quick win is simply asking gate the right question (next few hours, not next two weeks).
19 · How to read the screen, quickly
- Find the 🏆 on the leaderboard — that's the chosen model. Lower number = better.
- Compare MAE and MAPE. Both small → trust it. Small MAE but big MAPE → it's a low-volume series, not a broken model.
- Read “Forecast vs threshold.” OVER = act now (prepare yard space / staffing).
- Click “Load & run the real .joblib.” Confirms it's a genuine saved model, live.
Merged from the explained.html companion. Generated May 31, 2026.