KITScenes LongTail

Long-tail driving scenarios with expert reasoning traces for evaluating vision-language and action models.

1,000 driving scenarios
360° surround camera coverage
7 long-tail scenario types
3 reasoning languages · EN / ES / ZH

KITScenes LongTail is a public dataset for end-to-end autonomous driving and vision-language model evaluation, targeting the rare driving situations that existing benchmarks systematically underrepresent. Its 1,000 nine-second scenarios cover specifically selected challenging cases, construction zones, adverse weather, nighttime driving, overtaking, and complex intersections, which allow testing instruction-following capabilities when multiple maneuvers are viable.

Each scenario provides synchronized six-view 360° video at 5 Hz, multiple trajectories, and richly detailed high-level driving instructions that go beyond simple route commands — for example, "overtake the truck on the right" rather than just "turn left." This combination of multi-view observation, grounded instructions, and diverse rare events directly enables in-context learning and few-shot generalization for vision-language and action models.

A second contribution is KITScenes LongTail's multilingual expert reasoning traces: autonomous driving researchers verbally answered five structured questions per scenario in English, Spanish, and Chinese, a first in public autonomous driving datasets. These traces support research on cross-lingual driving reasoning and enable a semantic coherence metric, which measures whether the driving actions described in a model's reasoning trace match its planned trajectory. To evaluate models holistically, the dataset also introduces the Multi-Maneuver Score (MMS), a computationally efficient approximation of closed-loop planning scores.

KITScenes LongTail is released under CC BY-NC 4.0. The test split and three training samples are already public on HuggingFace; full splits will follow.

[Figure: comparison of nuScenes, Waymo E2E, CoVLA, and KITScenes LongTail across long-tail data, expert reasoning, number of scenarios, detailed instructions, geographic diversity, camera FoV, and video data.]
Heavy rain and construction zone scenario

Question: Imagine you are driving the car in the video. Your instruction is to drive straight on. What do you notice?

I'm driving in a construction zone behind another car at about 20 kilometres per hour. The road is wet from the rain, visibility is reduced by water droplets on the windshield. I'm decelerating because I have to steer to the right to follow the road and because there's part of the road without asphalt in front of me.

Expert reasoning trace — English

Long-Tail Scenarios

Conditions underrepresented in standard datasets — selected to stress-test vision-language and action models.

Dataset Comparison

Comparison of self-driving datasets used to benchmark end-to-end driving methods, VLMs, and VLAs. A half-filled circle (◑) indicates a feature is partially available — for example, long-tail scenarios selected by trajectory variation rather than scenario class, or a reduced instruction set {right, left, straight}.

Compared features: long-tail data, expert reasoning, multi-maneuver evaluation, driving comfort evaluation, real video data, and high-level instructions. Planning horizon and main locations per dataset:

Dataset | Planning horizon [s] | Main locations
nuScenes | 3 | Boston, Singapore
NAVSIM | 4 | Boston, Singapore
Bench2Drive | var. | CARLA cities (simulation)
Waymo Open E2E | 5 | 12 U.S. cities
DriveLM-Data | 3 | Boston, Singapore, CARLA
CoVLA-Dataset | 3 | Tokyo
KITScenes LongTail | 5 | Karlsruhe, Heidelberg, Mannheim, Black Forest

✓ fully available  ·  ◑ partially available  ·  ✗ not available. Planning horizon measured in seconds. KITScenes LongTail is the only dataset combining real video, expert reasoning traces, multi-maneuver and comfort evaluation, and genuine long-tail scenario classes.

Dataset Details

Distribution of Scenario Types

Percentages of total category assignments (1,039 across 1,000 scenarios; scenarios can belong to multiple types).

Specifically selected: 19.8%
Intersection: 29.6%
Overtake / lane change: 22.7%
Construction zone: 9.4%
Heavy rain: 7.1%
Snow & wintry mix: 6.2%
Nighttime: 5.1%

Reference Multi-Maneuver Scores

Trajectories are ranked by similarity to reference trajectories across 5 categories. Comfort penalties apply for excess jerk or tortuosity.

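Category | Comfort penalty | MMS
Expert-like trajectory | none | 10
Expert-like trajectory | jerk XOR tortuosity | 9
Expert-like trajectory | jerk AND tortuosity | 8
Wrong speed | none | 7
Wrong speed | jerk XOR tortuosity | 6
Wrong speed | jerk AND tortuosity | 5
Neglect instruction | none | 4
Neglect instruction | jerk XOR tortuosity | 3
Neglect instruction | jerk AND tortuosity | 2
Driving off road w/o crash | not considered | 1
Crash | not considered | 0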
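The table above can be encoded compactly: each category has a base score, and each comfort violation (excess jerk, excess tortuosity) costs one point unless comfort is not considered. The following is a minimal sketch of that lookup; the names `Category`, `BASE_SCORE`, and `reference_mms` are hypothetical and not part of any released tooling.

```python
from enum import Enum


class Category(Enum):
    EXPERT_LIKE = "expert-like trajectory"
    WRONG_SPEED = "wrong speed"
    NEGLECT_INSTRUCTION = "neglect instruction"
    OFF_ROAD = "driving off road w/o crash"
    CRASH = "crash"


# Scores without a comfort penalty, copied from the table.
BASE_SCORE = {
    Category.EXPERT_LIKE: 10,
    Category.WRONG_SPEED: 7,
    Category.NEGLECT_INSTRUCTION: 4,
    Category.OFF_ROAD: 1,
    Category.CRASH: 0,
}

# Categories for which the table lists comfort as "not considered".
COMFORT_NOT_CONSIDERED = {Category.OFF_ROAD, Category.CRASH}


def reference_mms(category, excess_jerk=False, excess_tortuosity=False):
    """Return the reference MMS for a category and its comfort flags."""
    base = BASE_SCORE[category]
    if category in COMFORT_NOT_CONSIDERED:
        return base
    # One point per violation: XOR costs 1, AND costs 2.
    return base - int(excess_jerk) - int(excess_tortuosity)


print(reference_mms(Category.WRONG_SPEED, excess_jerk=True))
```

How excess jerk and tortuosity are detected (thresholds, smoothing) is not covered by this sketch; the flags are taken as given.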

Context and Questions Asked to Domain Experts

These questions record reasoning traces about traffic scenarios and driving actions. The answers (with actions prepended) serve as expert reasoning traces. Questions 2–5 are generated from the trajectory; the example below is for a highway lane change.

Question 1 (open-ended)

Imagine you are driving the car in the video. Your instruction is: use the right lane. What do you notice?

I'm driving on a highway in the middle lane at about 110 km/h. I just overtook a truck driving in the right lane. In front of me, there is a lot of space in my lane and in the right lane.

Question 2 (0–3 s, speed)

In the next 3 seconds, why are you going to maintain the current speed?

(I'm going to maintain the current speed) to perform a lane change and follow my instruction.

Question 3 (0–3 s, steering)

In the next 3 seconds, why are you going to steer slightly to the right?

(I'm going to steer slightly to the right) to perform a smooth lane change to the right lane.

Question 4 (3–5 s, speed)

In the last 2 seconds, why are you going to maintain the current speed?

(I'm going to maintain the current speed) to finish the lane change.

Question 5 (3–5 s, steering)

In the last 2 seconds, why are you going to steer slightly to the left?

(I'm going to steer slightly to the left) to center the car in the right lane.
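The question pattern in this example can be sketched as simple string templates filled with the instruction and the actions derived from the trajectory. The function name, the dictionary keys, and the exact template wording are assumptions read off the example above, not the authors' generation code.

```python
def build_questions(instruction, actions):
    """Build the five questions for one scenario.

    `actions` maps hypothetical keys to action phrases, e.g.
    {"speed_0_3": "maintain the current speed", ...}.
    """
    first = "In the next 3 seconds, why are you going to {}?"
    last = "In the last 2 seconds, why are you going to {}?"
    return [
        # Question 1 is open-ended and only depends on the instruction.
        "Imagine you are driving the car in the video. "
        f"Your instruction is: {instruction}. What do you notice?",
        first.format(actions["speed_0_3"]),
        first.format(actions["steer_0_3"]),
        last.format(actions["speed_3_5"]),
        last.format(actions["steer_3_5"]),
    ]


def prepend_action(action, answer):
    """Prefix an expert answer with its action, as in the reasoning traces."""
    return f"(I'm going to {action}) {answer}"


questions = build_questions(
    "use the right lane",
    {
        "speed_0_3": "maintain the current speed",
        "steer_0_3": "steer slightly to the right",
        "speed_3_5": "maintain the current speed",
        "steer_3_5": "steer slightly to the left",
    },
)
for question in questions:
    print(question)
```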

Trajectory Annotation and Multi-Maneuver Score

Each scenario is annotated with reference trajectories spanning five behaviour categories — from expert-like driving to crash — each carrying a defined Multi-Maneuver Score (MMS). A model prediction is matched to the closest reference via trajectory similarity and inherits its score. The examples below illustrate how MMS is evaluated across three representative scenario types. Point clouds shown for visualization; not required to compute MMS.


Traffic light turns red

Potential rear-end crash

Instruction: Drive straight on.

Reference trajectories

expert MMSref = 10
wrong speed MMSref = 7
crash MMSref = 0
prediction MMS = 0 → crash

Curvy road and wintry mix

Snow and reduced visibility

Instruction: Drive straight on.

Reference trajectories

expert MMSref = 10
wrong speed MMSref = 7
crash MMSref = 0
prediction MMS = 10 → expert

Heavy rain at an intersection

Adverse weather, left turn

Instruction: Turn left.

Reference trajectories

expert MMSref = 10
wrong speed MMSref = 7
neglect instruction MMSref = 4
prediction MMS = 5.3 → wrong speed
(sim = 0.75)
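The matching step in these examples can be sketched as a nearest-reference lookup. The similarity measure used by the dataset is not specified here, so the sketch substitutes a hypothetical one (an exponential of the mean point-wise distance, with an arbitrary `scale`); all function names are illustrative. In the third example, the reported MMS of 5.3 with sim = 0.75 against a reference score of 7 is close to 7 × 0.75 = 5.25, which suggests the reference score may also be weighted by the similarity. That reading is an assumption, so the sketch returns the similarity alongside the reference score rather than combining them.

```python
import math


def average_displacement(a, b):
    """Mean Euclidean distance between two equally sampled (x, y) paths."""
    if len(a) != len(b) or not a:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def similarity(a, b, scale=1.0):
    """Hypothetical similarity in (0, 1]; 1.0 means identical trajectories."""
    return math.exp(-average_displacement(a, b) / scale)


def match_prediction(prediction, references):
    """Pick the most similar reference.

    `references` is a list of (label, mms_ref, trajectory) tuples.
    Returns (label, mms_ref, similarity to that reference).
    """
    label, mms_ref, trajectory = max(
        references, key=lambda ref: similarity(prediction, ref[2])
    )
    return label, mms_ref, similarity(prediction, trajectory)


# Toy straight-line references that differ only in speed (metres per step).
steps = range(1, 6)
references = [
    ("expert", 10, [(1.0 * t, 0.0) for t in steps]),
    ("wrong speed", 7, [(2.0 * t, 0.0) for t in steps]),
    ("crash", 0, [(3.0 * t, 0.0) for t in steps]),
]
prediction = [(1.9 * t, 0.0) for t in steps]
label, mms_ref, sim = match_prediction(prediction, references)
print(label, mms_ref)
```

In this toy case the prediction drives almost as fast as the wrong-speed reference, so it is matched to that reference.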

Semantic Coherence of Model Outputs

Do reasoning traces match planned trajectories? Beyond MMS trajectory scoring, KITScenes LongTail measures whether a model's reasoning trace is consistent with its predicted trajectory. Semantic coherence compares acceleration and steering actions classified from the text with those derived from the trajectory, reporting their match rate over 0-3 s and 3-5 s.

Model | Avg. 0-5 s ↑ | Acceleration ↑ (0-3 s / 3-5 s) | Steering ↑ (0-3 s / 3-5 s)
Qwen3-VL 8B | 0.51 | 0.83 / 0.79 | 0.22 / 0.18
Gemma 3 12B | 0.30 | 0.46 / 0.41 | 0.17 / 0.15
Pixtral 12B | 0.27 | 0.32 / 0.51 | 0.12 / 0.13
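The match rate behind these numbers can be sketched as follows. The label vocabularies, slot names, and input layout are assumptions for illustration; how actions are classified from text or derived from trajectories is not specified here. The reported averages are consistent with the plain mean of the four match rates, for example (0.83 + 0.79 + 0.22 + 0.18) / 4 = 0.505 for Qwen3-VL 8B, and the sketch averages the same way.

```python
# Hypothetical slot names: action type and time window.
SLOTS = (
    "acceleration_0_3",
    "acceleration_3_5",
    "steering_0_3",
    "steering_3_5",
)


def match_rate(text_actions, trajectory_actions):
    """Fraction of samples where the text action equals the trajectory action."""
    if len(text_actions) != len(trajectory_actions) or not text_actions:
        raise ValueError("inputs must be non-empty and equally long")
    matches = sum(t == a for t, a in zip(text_actions, trajectory_actions))
    return matches / len(text_actions)


def semantic_coherence(samples):
    """Per-slot match rates plus their mean.

    Each sample is a dict with "text" and "trajectory" entries,
    each mapping a slot name to an action label.
    """
    rates = {}
    for slot in SLOTS:
        text = [s["text"][slot] for s in samples]
        trajectory = [s["trajectory"][slot] for s in samples]
        rates[slot] = match_rate(text, trajectory)
    rates["average"] = sum(rates[slot] for slot in SLOTS) / len(SLOTS)
    return rates


consistent = {
    "acceleration_0_3": "maintain",
    "acceleration_3_5": "decelerate",
    "steering_0_3": "right",
    "steering_3_5": "left",
}
# Second sample: the text claims straight driving, the trajectory steers.
text_only_straight = dict(consistent, steering_0_3="straight", steering_3_5="straight")
samples = [
    {"text": consistent, "trajectory": consistent},
    {"text": text_only_straight, "trajectory": consistent},
]
rates = semantic_coherence(samples)
print(rates)
```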

Citation

@misc{wagner2026longtaildrivingscenariosreasoning,
  title         = {LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset},
  author        = {Royden Wagner and Omer Sahin Tas and Jaime Villa and Felix Hauser and Yinzhe Shen and Marlon Steiner and Dominik Strutz and Carlos Fernandez and Christian Kinzig and Guillermo S. Gutierrez-Cabello and Hendrik Königshof and Fabian Immel and Richard Schwarzkopf and Nils Alexander Rack and Kevin Rösch and Kaiwen Wang and Jan-Hendrik Pauls and Martin Lauer and Igor Gilitschenski and Holger Caesar and Christoph Stiller},
  year          = {2026},
  eprint        = {2603.23607},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.23607}
}

Team

Royden Wagner
KIT

Joint First Author · Concept, Methodology, Data Recording

Omer Sahin Tas
FZI KIT

Joint First Author · Concept, Methodology

Jaime Villa
UC3M

Joint First Author · Data Recording, Annotation, Evaluation

Felix Hauser
KIT FZI

Annotation Tools, Trajectory Annotation, Data Recording

Yinzhe Shen
KIT

Reasoning Annotation, Post-processing, Baseline Models

Marlon Steiner
KIT

Trajectory Generation

Dominik Strutz
KIT

Visualization

Carlos Fernandez
KIT

Trajectory Annotation, Data Recording

Christian Kinzig
KIT

Image Stitching

Guillermo S. Gutierrez-Cabello
UPM

Reasoning Annotation

Hendrik Königshof
FZI KIT

Reasoning Annotation, Data Recording

Fabian Immel
FZI KIT

Image Post-processing

Richard Schwarzkopf
FZI KIT

Vehicle Setup

Nils Alexander Rack
KIT

Image Post-processing

Kevin Rösch
FZI KIT

Vehicle Setup

Kaiwen Wang
KIT

Camera Calibration

Jan-Hendrik Pauls
KIT

Sensing and Data Acquisition Stack

Martin Lauer
KIT

Co-Advisor

Igor Gilitschenski
U of T

Co-Advisor

Holger Caesar
TU Delft

Co-Advisor

Christoph Stiller
KIT FZI

Principal Investigator

Acknowledgements

The research leading to these results is partially funded by the German Federal Ministry for Economic Affairs and Energy within the project "NXT GEN AI METHODS". The authors gratefully acknowledge the computing time provided on the high-performance computer HoreKa by the National High-Performance Computing Center at KIT (NHR@KIT). This center is jointly supported by the Federal Ministry of Education and Research and the Ministry of Science, Research and the Arts of Baden-Württemberg, as part of the National High-Performance Computing (NHR) joint funding program. HoreKa is partly funded by the German Research Foundation (DFG).
