End-to-end driving on 1,000 rare scenarios. Methods are ranked by Multi-Maneuver Score (MMS, 0–10), a metric that correlates significantly more strongly with closed-loop DrivingScore than standard L2 error does. The best submission per method is shown.
Live · fetched from HuggingFace · best submission per method
MMS
↑ higher is better
Multi-Maneuver Score (0–10). Composite score covering trajectory accuracy and semantic compliance across scenario types. Defined in the KITScenes LongTail paper. arXiv:2603.23607 ↗
S. Coherence
↑ higher is better
Semantic Coherence. Measures whether the driving actions described in a model's reasoning trace match its planned trajectory. Computed via an LLM judge that compares the textual reasoning against the predicted waypoints. Defined in the KITScenes LongTail paper. arXiv:2603.23607 ↗
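The "best submission per method" ranking by MMS can be illustrated with a minimal sketch. It assumes each submission is a simple record; the field names (`method`, `submission_id`, `mms`) and the helper `best_per_method` are hypothetical and are not taken from the leaderboard's actual code or data schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Submission:
    # Hypothetical record layout, for illustration only.
    method: str
    submission_id: str
    mms: float  # Multi-Maneuver Score, expected in the range 0-10


def best_per_method(submissions: Iterable[Submission]) -> List[Submission]:
    """Keep the highest-MMS submission of each method, ranked by MMS descending."""
    best: Dict[str, Submission] = {}
    for sub in submissions:
        if not 0.0 <= sub.mms <= 10.0:
            raise ValueError(f"MMS out of range for {sub.submission_id}: {sub.mms}")
        current = best.get(sub.method)
        # Higher MMS is better, so replace only on a strictly higher score.
        if current is None or sub.mms > current.mms:
            best[sub.method] = sub
    return sorted(best.values(), key=lambda s: s.mms, reverse=True)
```

On ties within a method, this sketch keeps the submission seen first; the real leaderboard may break ties differently.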