Long-tail driving scenarios with expert reasoning traces for evaluating vision-language and action models.
KITScenes LongTail is a public dataset for end-to-end autonomous driving and vision-language model evaluation, targeting the rare driving situations that existing benchmarks systematically underrepresent. Its 1,000 nine-second scenarios include specifically selected challenging cases, construction zones, adverse weather, nighttime driving, overtaking, and complex intersection scenarios that enable testing instruction-following capabilities when multiple maneuvers are viable.
Each scenario provides synchronized six-view 360° video at 5 Hz, multiple trajectories, and richly detailed high-level driving instructions that go beyond simple route commands — for example, "overtake the truck on the right" rather than just "turn left." This combination of multi-view observation, grounded instructions, and diverse rare events directly enables in-context learning and few-shot generalization for vision-language and action models.
A second contribution is KITScenes LongTail's multilingual expert reasoning traces: autonomous driving researchers verbally answered five structured questions per scenario in English, Spanish, and Chinese, a first in public autonomous driving datasets. These traces support research on cross-lingual driving reasoning and enable a semantic coherence metric, which measures whether the driving actions described in a model's reasoning trace match its planned trajectory. To evaluate models holistically, the dataset also introduces the Multi-Maneuver Score (MMS), a computationally efficient approximation of closed-loop planning scores.
KITScenes LongTail is released under CC BY-NC 4.0. The test split and three training samples are already public on HuggingFace; full splits will follow.
Question: Imagine you are driving the car in the video. Your instruction is to drive straight on. What do you notice?
I'm driving in a construction zone behind another car at about 20 kilometres per hour. The road is wet from the rain, visibility is reduced by water droplets on the windshield. I'm decelerating because I have to steer to the right to follow the road and because there's part of the road without asphalt in front of me.
Expert reasoning trace — English
Conditions underrepresented in standard datasets — selected to stress-test vision-language and action models.
Comparison of self-driving datasets used to benchmark end-to-end driving methods, VLMs, and VLAs. A half-filled circle (◑) indicates a feature is partially available — for example, long-tail scenarios selected by trajectory variation rather than scenario class, or a reduced instruction set {right, left, straight}.
| Dataset | Long-tail data | Expert reasoning | Planning horizon [s] | Multi-maneuver evaluation | Driving comfort evaluation | Real video data | High-level instructions | Main locations |
|---|---|---|---|---|---|---|---|---|
| nuScenes | ✗ | ✗ | 3 | ✗ | ✗ | ✓ | ◑ | Boston, Singapore |
| NAVSIM | ✗ | ✗ | 4 | ✓ | ✗ | ✓ | ◑ | Boston, Singapore |
| Bench2Drive | ◑ | ✗ | var. | ✓ | ✗ | ✗ | ✓ | CARLA cities (simulation) |
| Waymo Open E2E | ✓ | ✗ | 5 | ✓ | ✗ | ✗ | ◑ | 12 U.S. cities |
| DriveLM-Data | ✗ | ◑ | 3 | ✗ | ✗ | ◑ | ◑ | Boston, Singapore, CARLA |
| CoVLA-Dataset | ◑ | ✗ | 3 | ✗ | ✗ | ✓ | ✓ | Tokyo |
| KITScenes LongTail | ✓ | ✓ | 5 | ✓ | ✓ | ✓ | ✓ | Karlsruhe, Heidelberg, Mannheim, Black Forest |
✓ fully available · ◑ partially available · ✗ not available. Planning horizon measured in seconds. KITScenes LongTail is the only dataset combining real video, expert reasoning traces, multi-maneuver and comfort evaluation, and genuine long-tail scenario classes.
Percentages of total category assignments (1,039 across 1,000 scenarios; scenarios can belong to multiple types).
Trajectories are ranked by similarity to reference trajectories across 5 categories. Comfort penalties apply for excess jerk or tortuosity.
| Category | Comfort penalty | MMS |
|---|---|---|
| Expert-like trajectory | none | 10 |
| | jerk XOR tortuosity | 9 |
| | jerk AND tortuosity | 8 |
| Wrong speed | none | 7 |
| | jerk XOR tortuosity | 6 |
| | jerk AND tortuosity | 5 |
| Neglect instruction | none | 4 |
| | jerk XOR tortuosity | 3 |
| | jerk AND tortuosity | 2 |
| Driving off road w/o crash | not considered | 1 |
| Crash | not considered | 0 |
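
The scoring table can be read as a base score per category minus one point for each comfort violation. The sketch below encodes that reading as a lookup. The category keys and the function name are hypothetical, and the two boolean flags are assumed to be computed elsewhere, since the thresholds for excess jerk and tortuosity are not given here.

```python
# Base scores for categories where comfort penalties apply (from the table).
BASE_MMS = {
    "expert_like": 10,
    "wrong_speed": 7,
    "neglect_instruction": 4,
}

# Categories where comfort is not considered (from the table).
FIXED_MMS = {
    "off_road_no_crash": 1,
    "crash": 0,
}


def multi_maneuver_score(category: str,
                         excess_jerk: bool = False,
                         excess_tortuosity: bool = False) -> int:
    """Map a behaviour category and comfort flags to an MMS value.

    The category keys are illustrative names, not identifiers from the dataset.
    """
    if category in FIXED_MMS:
        return FIXED_MMS[category]
    if category not in BASE_MMS:
        raise ValueError(f"unknown category: {category!r}")
    # One violation (jerk XOR tortuosity) costs 1 point, both cost 2.
    penalty = int(excess_jerk) + int(excess_tortuosity)
    return BASE_MMS[category] - penalty
```

For instance, a trajectory in the "Wrong speed" category with both comfort violations maps to 5 under this reading, matching the corresponding table row.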
These questions record reasoning traces about traffic scenarios and driving actions. The answers (with actions prepended) serve as expert reasoning traces. Questions 2–5 are generated from the trajectory; the example below is for a highway lane change.
Question 1 (open-ended)
Imagine you are driving the car in the video. Your instruction is: use the right lane. What do you notice?
I'm driving on a highway in the middle lane at about 110 km/h. I just overtook a truck driving in the right lane. In front of me, there is a lot of space in my lane and in the right lane.
Question 2 (0–3 s, speed)
In the next 3 seconds, why are you going to maintain the current speed?
(I'm going to maintain the current speed) to perform a lane change and follow my instruction.
Question 3 (0–3 s, steering)
In the next 3 seconds, why are you going to steer slightly to the right?
(I'm going to steer slightly to the right) to perform a smooth lane change to the right lane.
Question 4 (3–5 s, speed)
In the last 2 seconds, why are you going to maintain the current speed?
(I'm going to maintain the current speed) to finish the lane change.
Question 5 (3–5 s, steering)
In the last 2 seconds, why are you going to steer slightly to the left?
(I'm going to steer slightly to the left) to center the car in the right lane.
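
Questions 2 to 5 follow a fixed pattern: a time-window prefix plus the action taken in that window. The sketch below reproduces that pattern as string templates. It assumes the action phrases have already been derived from the trajectory (that step is not described here), and the names `WINDOWS` and `build_questions` are hypothetical.

```python
# Hypothetical templates modelled on the wording of Questions 2-5 above.
WINDOWS = {
    "0-3 s": "In the next 3 seconds",
    "3-5 s": "In the last 2 seconds",
}


def build_questions(actions: dict[str, tuple[str, str]]) -> list[str]:
    """Return the speed and steering question for each time window, in order.

    `actions` maps a window key to a (speed action, steering action) pair.
    """
    questions = []
    for window, prefix in WINDOWS.items():
        speed_action, steering_action = actions[window]
        for action in (speed_action, steering_action):
            questions.append(f"{prefix}, why are you going to {action}?")
    return questions


# Action phrases taken from the highway lane-change example.
lane_change = {
    "0-3 s": ("maintain the current speed", "steer slightly to the right"),
    "3-5 s": ("maintain the current speed", "steer slightly to the left"),
}
questions = build_questions(lane_change)
```

With the lane-change actions, the four generated strings match the wording of Questions 2 to 5 in the example.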
Each scenario is annotated with reference trajectories spanning five behaviour categories — from expert-like driving to crash — each carrying a defined Multi-Maneuver Score (MMS). A model prediction is matched to the closest reference via trajectory similarity and inherits its score. The examples below illustrate how MMS is evaluated across three representative scenario types. Point clouds shown for visualization; not required to compute MMS.
Potential rear-end crash
Instruction: Drive straight on.
Reference trajectories
Snow and reduced visibility
Instruction: Drive straight on.
Reference trajectories
Adverse weather, left turn
Instruction: Turn left.
Reference trajectories
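
The matching step described above, where a prediction inherits the score of its closest reference, can be sketched as a nearest-neighbour lookup. The trajectory similarity measure is not specified here, so this example substitutes the average point-wise Euclidean distance as a stand-in. It also assumes that all trajectories are sampled at the same timestamps; the function names are hypothetical.

```python
import math
from typing import Sequence

Point = tuple[float, float]


def average_displacement(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean Euclidean distance between corresponding points (assumed metric)."""
    if len(a) != len(b) or not a:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def score_prediction(prediction: Sequence[Point],
                     references: Sequence[tuple[Sequence[Point], int]]) -> int:
    """Return the MMS of the reference trajectory closest to the prediction."""
    closest = min(references,
                  key=lambda ref: average_displacement(prediction, ref[0]))
    return closest[1]


# Toy references: one straight trajectory and one drifting to the side.
references = [
    ([(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)], 10),
    ([(0.0, 0.0), (5.0, 2.0), (10.0, 4.0)], 4),
]
prediction = [(0.0, 0.0), (5.0, 0.2), (10.0, 0.3)]
score = score_prediction(prediction, references)
```

Here the prediction stays close to the straight reference, so `score` is 10. A different similarity measure would only change `average_displacement`; the lookup itself stays the same.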
Do reasoning traces match planned trajectories? Beyond MMS trajectory scoring, KITScenes LongTail measures whether a model's reasoning trace is consistent with its predicted trajectory. Semantic coherence compares acceleration and steering actions classified from the text with those derived from the trajectory, reporting their match rate over 0-3 s and 3-5 s.
| Model | Avg. 0-5 s ↑ | Acceleration ↑ 0-3 s / 3-5 s | Steering ↑ 0-3 s / 3-5 s |
|---|---|---|---|
| Qwen3-VL 8B | 0.51 | 0.83 / 0.79 | 0.22 / 0.18 |
| Gemma 3 12B | 0.30 | 0.46 / 0.41 | 0.17 / 0.15 |
| Pixtral 12B | 0.27 | 0.32 / 0.51 | 0.12 / 0.13 |
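
A match rate of this kind can be computed once both sources have been reduced to action labels. The sketch below assumes the labels already exist: classifying actions from text and deriving them from trajectories are separate steps not shown here, and the label strings are hypothetical. The averaging helper reflects an observation rather than a documented definition: the reported averages are consistent with an unweighted mean of the four per-window rates.

```python
from statistics import fmean
from typing import Sequence


def match_rate(text_actions: Sequence[str],
               trajectory_actions: Sequence[str]) -> float:
    """Fraction of samples where the action named in the reasoning trace
    equals the action derived from the predicted trajectory."""
    if len(text_actions) != len(trajectory_actions) or not text_actions:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(t == r for t, r in zip(text_actions, trajectory_actions))
    return matches / len(text_actions)


def average_coherence(rates: Sequence[float]) -> float:
    """Unweighted mean over the per-window rates (assumed aggregation)."""
    return fmean(rates)


# Toy labels for four samples in one time window.
from_text = ["maintain", "accelerate", "decelerate", "maintain"]
from_trajectory = ["maintain", "accelerate", "maintain", "maintain"]
rate = match_rate(from_text, from_trajectory)  # 3 of 4 labels agree
```

Applied to the four Qwen3-VL 8B rates in the table (0.83, 0.79, 0.22, 0.18), `average_coherence` gives about 0.505, in line with the reported 0.51.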
@misc{wagner2026longtaildrivingscenariosreasoning,
title = {LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset},
author = {Royden Wagner and Omer Sahin Tas and Jaime Villa and Felix Hauser and Yinzhe Shen and Marlon Steiner and Dominik Strutz
and Carlos Fernandez and Christian Kinzig and Guillermo S. Guitierrez-Cabello and Hendrik Königshof and Fabian Immel and
Richard Schwarzkopf and Nils Alexander Rack and Kevin Rösch and Kaiwen Wang and Jan-Hendrik Pauls and Martin Lauer and
Igor Gilitschenski and Holger Caesar and Christoph Stiller},
year = {2026},
eprint = {2603.23607},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2603.23607}
}

The research leading to these results is partially funded by the German Federal Ministry for Economic Affairs and Energy within the project "NXT GEN AI METHODS". The authors gratefully acknowledge the computing time provided on the high-performance computer HoreKa by the National High-Performance Computing Center at KIT (NHR@KIT). This center is jointly supported by the Federal Ministry of Education and Research and the Ministry of Science, Research and the Arts of Baden-Württemberg, as part of the National High-Performance Computing (NHR) joint funding program. HoreKa is partly funded by the German Research Foundation (DFG).