
LongTail E2E Driving

End-to-end driving on 1,000 rare scenarios. Submissions are ranked by Multi-Maneuver Score (MMS, 0–10), a metric that correlates significantly more strongly with closed-loop DrivingScore than standard L2 error does. Only the best submission per method is shown.

Leaderboard

Live · fetched from HuggingFace · best submission per method
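
The "best submission per method" rule can be sketched as follows. This is a minimal illustration, not the leaderboard's actual code: the `Submission` fields, the `best_per_method` helper, and the example entries are all hypothetical placeholders, and the scores are made up rather than real results.

```python
from typing import NamedTuple


class Submission(NamedTuple):
    # Hypothetical record layout; the real leaderboard schema may differ.
    method: str
    submission_id: str
    mms: float  # Multi-Maneuver Score, 0 to 10, higher is better


def best_per_method(submissions):
    """Keep the highest-MMS submission of each method, ranked by MMS descending."""
    best = {}
    for sub in submissions:
        current = best.get(sub.method)
        if current is None or sub.mms > current.mms:
            best[sub.method] = sub
    return sorted(best.values(), key=lambda s: s.mms, reverse=True)


# Made-up placeholder entries, not real leaderboard results.
submissions = [
    Submission("method_a", "a-v1", 4.2),
    Submission("method_a", "a-v2", 5.1),
    Submission("method_b", "b-v1", 4.8),
]

ranked = best_per_method(submissions)
for rank, sub in enumerate(ranked, start=1):
    print(rank, sub.method, sub.submission_id, sub.mms)
```

In this sketch, ties within a method keep the earlier submission; a real leaderboard might break ties differently.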


Metrics

MMS

↑ higher is better

Multi-Maneuver Score (0–10). Composite score covering trajectory accuracy and semantic compliance across scenario types. Defined in the KITScenes LongTail paper (arXiv:2603.23607).

S. Coherence

↑ higher is better

Semantic Coherence. Measures whether the driving actions described in a model's reasoning trace match its planned trajectory. Computed via an LLM judge that compares the textual reasoning against the predicted waypoints. Defined in the KITScenes LongTail paper (arXiv:2603.23607).

KIT · FZI · TU Delft · UC3M · UPM · University of Toronto