KITScenes LongTail – Long-tail Driving Scenarios with Reasoning Traces

KITScenes LongTail is a public dataset for evaluating end-to-end autonomous driving methods and vision-language models, targeting the rare driving situations that existing benchmarks systematically underrepresent. Its 1,000 nine-second scenarios cover specifically selected challenging cases, construction zones, adverse weather, nighttime driving, overtaking, and complex intersections, which makes it possible to test instruction-following capabilities when multiple maneuvers are viable.

Each scenario provides synchronized six-view 360° video at 5 Hz, multiple trajectories, and richly detailed high-level driving instructions that go beyond simple route commands — for example, "overtake the truck on the right" rather than just "turn left." This combination of multi-view observation, grounded instructions, and diverse rare events directly enables in-context learning and few-shot generalization for vision-language and action models.

A second contribution is KITScenes LongTail's multilingual expert reasoning traces: autonomous driving researchers verbally answered five structured questions per scenario in English, Spanish, and Chinese, a first in public autonomous driving datasets. These traces support research on cross-lingual driving reasoning and enable a semantic coherence metric, which measures whether the driving actions described in a model's reasoning trace match its planned trajectory. To evaluate models holistically, the dataset also introduces the Multi-Maneuver Score (MMS), a computationally efficient approximation of closed-loop planning scores.

KITScenes LongTail is released under CC BY-NC 4.0. The test split and three training samples are already public on HuggingFace; full splits will follow.



Heavy rain and construction zone scenario

Question: Imagine you are driving the car in the video. Your instruction is to drive straight on. What do you notice?

I'm driving in a construction zone behind another car at about 20 kilometres per hour. The road is wet from the rain, visibility is reduced by water droplets on the windshield. I'm decelerating because I have to steer to the right to follow the road and because there's part of the road without asphalt in front of me.

Expert reasoning trace — English

Long-Tail Scenarios

Conditions underrepresented in standard datasets — selected to stress-test vision-language and action models.

Dataset Comparison

Comparison of self-driving datasets used to benchmark end-to-end driving methods, VLMs, and VLAs. A half-filled circle (◑) indicates a feature is partially available — for example, long-tail scenarios selected by trajectory variation rather than scenario class, or a reduced instruction set {right, left, straight}.

Dataset	Long-tail data	Expert reasoning	Planning horizon [s]	Multi-maneuver evaluation	Driving comfort evaluation	Real video data	High-level instructions	Main locations
nuScenes	✗	✗	3	✗	✗	✓	◑	Boston, Singapore
NAVSIM	✗	✗	4	✓	✗	✓	◑	Boston, Singapore
Bench2Drive	◑	✗	var.	✓	✗	✗	✓	CARLA cities (simulation)
Waymo Open E2E	✓	✗	5	✓	✗	✗	◑	12 U.S. cities
DriveLM-Data	✗	◑	3	✗	✗	◑	◑	Boston, Singapore, CARLA
CoVLA-Dataset	◑	✗	3	✗	✗	✓	✓	Tokyo
KITScenes LongTail	✓	✓	5	✓	✓	✓	✓	Karlsruhe, Heidelberg, Mannheim, Black Forest

✓ fully available · ◑ partially available · ✗ not available. Planning horizon measured in seconds. KITScenes LongTail is the only dataset combining real video, expert reasoning traces, multi-maneuver and comfort evaluation, and genuine long-tail scenario classes.

Dataset Details

Distribution of Scenario Types

Percentages of total category assignments:

Scenario type	Share
Specifically selected	19.8%
Nighttime	5.1%
Snow & wintry mix	6.2%
Heavy rain	7.1%
Construction zone	9.4%
Overtake / lane change	22.7%
Intersection	29.6%

Reference Multi-Maneuver Scores

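Predicted trajectories are scored by their similarity to reference trajectories from five categories. Within the first three categories, excess jerk and excess tortuosity each lower the score by one point; comfort is not considered for off-road driving or crashes.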

Category	Comfort penalty	MMS
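Expert-like trajectory	none	10
Expert-like trajectory	jerk XOR tortuosity	9
Expert-like trajectory	jerk AND tortuosity	8
Wrong speed	none	7
Wrong speed	jerk XOR tortuosity	6
Wrong speed	jerk AND tortuosity	5
Neglect instruction	none	4
Neglect instruction	jerk XOR tortuosity	3
Neglect instruction	jerk AND tortuosity	2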
Driving off road w/o crash	not considered	1
Crash	not considered	0
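The table can be read as a base score per category minus one point per comfort violation. The following is a minimal sketch of that lookup, not the dataset's official implementation; the function name, the category strings, and the boolean flags are assumptions made for illustration, and how "excess" jerk or tortuosity is thresholded is not specified here.

```python
# Base scores for the three categories in which comfort is evaluated,
# read off the reference table.
BASE_SCORE = {
    "expert-like": 10,
    "wrong speed": 7,
    "neglect instruction": 4,
}

# Categories for which the table lists comfort as "not considered".
FIXED_SCORE = {
    "off-road": 1,
    "crash": 0,
}


def reference_mms(category: str,
                  excess_jerk: bool = False,
                  excess_tortuosity: bool = False) -> int:
    """Return the reference MMS for a category and its comfort flags.

    Hypothetical helper: names and signature are illustrative only.
    """
    if category in FIXED_SCORE:
        # Comfort penalties are ignored for off-road driving and crashes.
        return FIXED_SCORE[category]
    if category not in BASE_SCORE:
        raise ValueError(f"unknown category: {category!r}")
    # One violation (XOR) costs one point, both violations (AND) cost two.
    return BASE_SCORE[category] - int(excess_jerk) - int(excess_tortuosity)
```

For example, `reference_mms("wrong speed", excess_jerk=True)` gives 6, matching the "jerk XOR tortuosity" row of the wrong-speed category.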

Context and Questions Asked to Domain Experts

These questions record reasoning traces about traffic scenarios and driving actions. The answers (with actions prepended) serve as expert reasoning traces. Questions 2–5 are generated from the trajectory; the example below is for a highway lane change.

Question 1 (open-ended)

Imagine you are driving the car in the video. Your instruction is: use the right lane. What do you notice?

I'm driving on a highway in the middle lane at about 110 km/h. I just overtook a truck driving in the right lane. In front of me, there is a lot of space in my lane and in the right lane.

Question 2 (0–3 s, speed)

In the next 3 seconds, why are you going to maintain the current speed?

(I'm going to maintain the current speed) to perform a lane change and follow my instruction.

Question 3 (0–3 s, steering)

In the next 3 seconds, why are you going to steer slightly to the right?

(I'm going to steer slightly to the right) to perform a smooth lane change to the right lane.

Question 4 (3–5 s, speed)

In the last 2 seconds, why are you going to maintain the current speed?

(I'm going to maintain the current speed) to finish the lane change.

Question 5 (3–5 s, steering)

In the last 2 seconds, why are you going to steer slightly to the left?

(I'm going to steer slightly to the left) to center the car in the right lane.
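Because Questions 2–5 follow a fixed pattern, they can be produced from per-window action labels. The sketch below only mirrors the wording of the example above; the function names and the way actions are passed in are assumptions, and the step that derives action labels from the trajectory is not shown because it is not described here.

```python
def build_questions(instruction: str,
                    speed_actions: tuple[str, str],
                    steering_actions: tuple[str, str]) -> list[str]:
    """Build the five questions for one scenario.

    speed_actions and steering_actions hold one action phrase for the
    0-3 s window and one for the 3-5 s window (hypothetical interface).
    """
    questions = [
        "Imagine you are driving the car in the video. "
        f"Your instruction is: {instruction}. What do you notice?"
    ]
    windows = ("In the next 3 seconds", "In the last 2 seconds")
    for i, window in enumerate(windows):
        # Speed question first, then steering, as in Questions 2-5.
        for action in (speed_actions[i], steering_actions[i]):
            questions.append(f"{window}, why are you going to {action}?")
    return questions


def prepend_action(action: str, answer: str) -> str:
    """Turn a spoken answer into a trace sentence with the action prepended."""
    return f"I'm going to {action} {answer}"


questions = build_questions(
    "use the right lane",
    speed_actions=("maintain the current speed", "maintain the current speed"),
    steering_actions=("steer slightly to the right", "steer slightly to the left"),
)
trace = prepend_action("maintain the current speed", "to finish the lane change.")
```

With the highway lane-change inputs, `questions[2]` reproduces the wording of Question 3 and `trace` reproduces the answer to Question 4 with its action prepended.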

Trajectory Annotation and Multi-Maneuver Score

Each scenario is annotated with reference trajectories spanning five behaviour categories, from expert-like driving to crash, each carrying a defined Multi-Maneuver Score (MMS). A model prediction is matched to the closest reference via trajectory similarity and inherits its score. The examples below illustrate how MMS is evaluated across three representative scenario types. Point clouds are shown for visualization only; they are not required to compute MMS.


Traffic light turns red

Potential rear-end crash

Instruction: Drive straight on.

Reference trajectories

expert MMS_ref = 10

wrong speed MMS_ref = 7

crash MMS_ref = 0

prediction MMS = 0 → crash


Curvy road and wintry mix

Snow and reduced visibility

Instruction: Drive straight on.

Reference trajectories

expert MMS_ref = 10

wrong speed MMS_ref = 7

crash MMS_ref = 0

prediction MMS = 10 → expert


Heavy rain at an intersection

Adverse weather, left turn

Instruction: Turn left.

Reference trajectories

expert MMS_ref = 10

wrong speed MMS_ref = 7

neglect instruction MMS_ref = 4

prediction MMS = 5.3 → wrong speed
(sim = 0.75)
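The matching step in these examples can be sketched as follows. The exact scoring formula is not spelled out on this page. The third example (MMS = 5.3 for a wrong-speed match with sim = 0.75) is numerically consistent with multiplying the reference score by the similarity, since 7 × 0.75 = 5.25, so the sketch assumes that reading. The similarity measure itself is not defined here and is therefore taken as a precomputed input; the function name and the similarity values for the non-matching references are made up for illustration.

```python
def score_prediction(similarities: dict[str, float],
                     reference_scores: dict[str, float]) -> tuple[str, float]:
    """Match a prediction to its most similar reference and score it.

    similarities: reference label -> similarity in [0, 1] (precomputed).
    reference_scores: reference label -> MMS_ref.
    Assumption: the prediction's MMS is MMS_ref scaled by the similarity.
    """
    if not similarities:
        raise ValueError("at least one reference trajectory is required")
    # The closest reference is the one with the highest similarity.
    best = max(similarities, key=similarities.get)
    sim = similarities[best]
    if not 0.0 <= sim <= 1.0:
        raise ValueError("similarity is expected to lie in [0, 1]")
    return best, reference_scores[best] * sim


# Heavy-rain intersection example; only sim = 0.75 comes from the page,
# the other two similarities are illustrative placeholders.
label, mms = score_prediction(
    {"expert": 0.40, "wrong speed": 0.75, "neglect instruction": 0.20},
    {"expert": 10, "wrong speed": 7, "neglect instruction": 4},
)
```

Here `label` is "wrong speed" and `mms` is 5.25, which the page displays rounded as 5.3. Under the same assumption, a perfect match to the expert reference yields 10 and any match to a crash reference yields 0, as in the first two examples.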

Semantic Coherence of Model Outputs

Do reasoning traces match planned trajectories? Beyond MMS trajectory scoring, KITScenes LongTail measures whether a model's reasoning trace is consistent with its predicted trajectory. Semantic coherence compares acceleration and steering actions classified from the text with those derived from the trajectory, reporting their match rate over 0–3 s and 3–5 s.

Model	Avg. 0–5 s ↑	Acceleration ↑ 0–3 s / 3–5 s	Steering ↑ 0–3 s / 3–5 s
Qwen3-VL 8B	0.51	0.83 / 0.79	0.22 / 0.18
Gemma 3 12B	0.30	0.46 / 0.41	0.17 / 0.15
Pixtral 12B	0.27	0.32 / 0.51	0.12 / 0.13
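A match rate of this kind can be sketched as below. This is an illustration under assumptions, not the benchmark's evaluation code: the slot names, the label vocabulary, and the input layout are hypothetical, and the classifiers that turn text and trajectories into action labels are not shown. The average is taken as the unweighted mean of the four slots, which agrees with the table (for Qwen3-VL 8B, the four rates sum to 2.02, giving 0.505, reported as 0.51).

```python
from statistics import mean

# One slot per (action type, time window); names are illustrative.
SLOTS = ("acc_0_3", "acc_3_5", "steer_0_3", "steer_3_5")


def semantic_coherence(text_actions: list[dict[str, str]],
                       traj_actions: list[dict[str, str]]) -> dict[str, float]:
    """Per-slot match rate between text-derived and trajectory-derived labels.

    Both arguments hold one dict per scenario, mapping each slot to a label.
    """
    if not text_actions or len(text_actions) != len(traj_actions):
        raise ValueError("need the same non-zero number of scenarios in both lists")
    rates = {}
    for slot in SLOTS:
        matches = [t[slot] == g[slot] for t, g in zip(text_actions, traj_actions)]
        rates[slot] = sum(matches) / len(matches)
    # Unweighted mean over the four slots (assumed to be the "Avg." column).
    rates["avg"] = mean(rates[slot] for slot in SLOTS)
    return rates


# Two toy scenarios with made-up labels.
from_text = [
    {"acc_0_3": "maintain", "acc_3_5": "maintain", "steer_0_3": "right", "steer_3_5": "left"},
    {"acc_0_3": "accelerate", "acc_3_5": "maintain", "steer_0_3": "straight", "steer_3_5": "straight"},
]
from_trajectory = [
    {"acc_0_3": "maintain", "acc_3_5": "maintain", "steer_0_3": "right", "steer_3_5": "straight"},
    {"acc_0_3": "decelerate", "acc_3_5": "maintain", "steer_0_3": "straight", "steer_3_5": "straight"},
]
rates = semantic_coherence(from_text, from_trajectory)
```

In this toy case the text and trajectory disagree once on early acceleration and once on late steering, so two slots have a match rate of 0.5, two have 1.0, and the average is 0.75.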

Citation

@misc{wagner2026longtaildrivingscenariosreasoning,
  title     = {LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset},
  author    = {Royden Wagner and Omer Sahin Tas and Jaime Villa and Felix Hauser and Yinzhe Shen and Marlon Steiner and Dominik Strutz
               and Carlos Fernandez and Christian Kinzig and Guillermo S. Gutierrez-Cabello and Hendrik Königshof and Fabian Immel and
               Richard Schwarzkopf and Nils Alexander Rack and Kevin Rösch and Kaiwen Wang and Jan-Hendrik Pauls and Martin Lauer and
               Igor Gilitschenski and Holger Caesar and Christoph Stiller},
  year      = {2026},
  eprint    = {2603.23607},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url       = {https://arxiv.org/abs/2603.23607}
}

Team

Royden Wagner

KIT

Joint First Author · Concept, Methodology, Data Recording

Omer Sahin Tas

FZI KIT

Joint First Author · Concept, Methodology

Jaime Villa

UC3M

Joint First Author · Data Recording, Annotation, Evaluation

Felix Hauser

KIT FZI

Annotation Tools, Trajectory Annotation, Data Recording

Yinzhe Shen

KIT

Reasoning Annotation, Post-processing, Baseline Models

Marlon Steiner

KIT

Trajectory Generation

Dominik Strutz

KIT

Visualization

Carlos Fernandez

KIT

Trajectory Annotation, Data Recording

Christian Kinzig

KIT

Image Stitching

Guillermo S. Gutierrez-Cabello

UPM

Reasoning Annotation

Hendrik Königshof

FZI KIT

Reasoning Annotation, Data Recording

Fabian Immel

FZI KIT

Image Post-processing

Richard Schwarzkopf

FZI KIT

Vehicle Setup

Nils Alexander Rack

KIT

Image Post-processing

Kevin Rösch

FZI KIT

Vehicle Setup

Kaiwen Wang

KIT

Camera Calibration

Jan-Hendrik Pauls

KIT

Sensing and Data Acquisition Stack

Martin Lauer

KIT

Co-Advisor

Igor Gilitschenski

U of T

Co-Advisor

Holger Caesar

TU Delft

Co-Advisor

Christoph Stiller

KIT FZI

Principal Investigator

Acknowledgements

The research leading to these results is partially funded by the German Federal Ministry for Economic Affairs and Energy within the project "NXT GEN AI METHODS". The authors gratefully acknowledge the computing time provided on the high-performance computer HoreKa by the National High-Performance Computing Center at KIT (NHR@KIT). This center is jointly supported by the Federal Ministry of Education and Research and the Ministry of Science, Research and the Arts of Baden-Württemberg, as part of the National High-Performance Computing (NHR) joint funding program. HoreKa is partly funded by the German Research Foundation (DFG).