Multimodal E2E Driving

Paper Results

ADE and FDE follow the standard protocols. Drivable-surface survival rate, collision-free rate, and centerline distance are computed using KITScenes HD maps together with a LiDAR-based occupancy layer.

Model	FDE (m) ↓	ADE avg (m) ↓	ADE selected ↓	ADE constr. ↓	ADE overtake ↓	ADE inters. ↓	ADE night ↓	ADE nominal ↓	Drv. surv. (%) ↑	Coll.-free (%) ↑	CL dist. (m) ↓
Camera-based
UniAD	4.85	2.43	3.37	1.96	2.27	2.29	4.87	2.26	55.5	80.9	0.84
DMAD	4.49	2.30	3.59	1.78	2.27	2.06	5.23	2.09	58.4	85.0	0.59
SSR (non-temp.)	7.57	3.97	6.36	2.07	4.50	3.06	8.54	3.96	65.9	78.0	0.68
SSR (temporal)	5.05	2.49	4.25	2.49	2.30	2.59	5.16	1.99	67.6	79.8	0.78
Generative
Epona (AR, 10)	7.70	3.62	4.31	5.47	3.93	3.25	6.51	3.44	63.0	81.5	0.62
Epona (AR, 100)	6.04	2.86	3.57	4.27	3.24	2.56	5.48	2.63	57.2	82.1	0.66
Epona (SS, 10)	3.98	1.99	2.71	2.57	2.14	1.85	4.43	1.73	81.5	97.7	0.46
Epona (SS, 100)	3.99	1.97	2.63	2.67	2.17	1.83	4.41	1.71	78.6	98.3	0.47

SSR: non-temporal uses only the current keyframe; temporal aggregates BEV features across multiple frames. Epona: AR = autoregressive rollout; SS = single-step prediction; the number (10 or 100) is the number of diffusion denoising steps. @3 s: all metrics are evaluated on predicted trajectories at the 3-second horizon.
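On the original page the best and second-best entry in each column were marked by bold and underline. That ranking can be recomputed from the table values. The sketch below copies the numbers from the table and takes each column's direction from its ↑/↓ marker; the `COLUMNS`/`ROWS` layout and the `top_two` helper are illustrative names, not part of any released tooling.

```python
from typing import Dict, List, Tuple

# (column name, higher_is_better) in table order, taken from the ↑/↓ markers.
COLUMNS: List[Tuple[str, bool]] = [
    ("FDE", False), ("ADE avg", False), ("ADE selected", False),
    ("ADE constr.", False), ("ADE overtake", False), ("ADE inters.", False),
    ("ADE night", False), ("ADE nominal", False),
    ("Drv. surv.", True), ("Coll.-free", True), ("CL dist.", False),
]

# Values copied from the results table, one tuple per model, in column order.
ROWS: Dict[str, Tuple[float, ...]] = {
    "UniAD": (4.85, 2.43, 3.37, 1.96, 2.27, 2.29, 4.87, 2.26, 55.5, 80.9, 0.84),
    "DMAD": (4.49, 2.30, 3.59, 1.78, 2.27, 2.06, 5.23, 2.09, 58.4, 85.0, 0.59),
    "SSR (non-temp.)": (7.57, 3.97, 6.36, 2.07, 4.50, 3.06, 8.54, 3.96, 65.9, 78.0, 0.68),
    "SSR (temporal)": (5.05, 2.49, 4.25, 2.49, 2.30, 2.59, 5.16, 1.99, 67.6, 79.8, 0.78),
    "Epona (AR, 10)": (7.70, 3.62, 4.31, 5.47, 3.93, 3.25, 6.51, 3.44, 63.0, 81.5, 0.62),
    "Epona (AR, 100)": (6.04, 2.86, 3.57, 4.27, 3.24, 2.56, 5.48, 2.63, 57.2, 82.1, 0.66),
    "Epona (SS, 10)": (3.98, 1.99, 2.71, 2.57, 2.14, 1.85, 4.43, 1.73, 81.5, 97.7, 0.46),
    "Epona (SS, 100)": (3.99, 1.97, 2.63, 2.67, 2.17, 1.83, 4.41, 1.71, 78.6, 98.3, 0.47),
}


def top_two(column: int, higher_is_better: bool) -> Tuple[str, str]:
    # Rank models on one column. sorted() is stable even with reverse=True,
    # so exact ties keep table order instead of being reported as ties.
    ranked = sorted(ROWS, key=lambda model: ROWS[model][column], reverse=higher_is_better)
    return ranked[0], ranked[1]


leaders = {name: top_two(i, hib) for i, (name, hib) in enumerate(COLUMNS)}
for name, (best, second) in leaders.items():
    print(f"{name}: best={best}, second={second}")
```

Ties are not flagged by this sketch; if two models share a value at the top, both would need to be marked by hand.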

Metrics

ADE / FDE

↓ lower is better

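Average Displacement Error (ADE) is the mean Euclidean (L2) distance, in metres, between predicted and ground-truth waypoints over the 3-second prediction horizon; Final Displacement Error (FDE) is the same distance at the final (3 s) waypoint. Both follow the standard protocol.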
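A minimal sketch of the two displacement metrics for a single trajectory, assuming predicted and ground-truth trajectories are lists of (x, y) waypoints in metres sampled at the same timestamps up to the 3 s horizon. The helper name `ade_fde` is hypothetical and not part of the KITScenes evaluation code.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def ade_fde(pred: Sequence[Point], gt: Sequence[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) in metres for one predicted / ground-truth pair.

    Hypothetical helper: assumes both trajectories share the same timestamps.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and equally long")
    # Per-waypoint Euclidean (L2) displacement.
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    ade = sum(errors) / len(errors)  # mean over all waypoints up to the horizon
    fde = errors[-1]                 # displacement at the final (3 s) waypoint
    return ade, fde


# Toy example: the prediction drifts 0, 1 and 2 m away from ground truth.
pred = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
gt = [(1.0, 0.0), (2.0, 1.0), (3.0, 2.0)]
print(ade_fde(pred, gt))  # (1.0, 2.0)
```

Dataset-level ADE and FDE are then typically the mean of these per-sample values over all evaluated samples.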

Drv. surv. / Coll.-free

↑ higher is better

Drivable-surface survival rate and collision-free rate of the predicted trajectories, computed from the KITScenes HD maps and the LiDAR-based occupancy layer.
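The exact definitions are not spelled out here, so the following is only a sketch under stated assumptions: the HD map and the LiDAR occupancy layer are rasterized into boolean grids, a trajectory "survives" if every waypoint falls on a drivable cell, and it is collision-free if no waypoint falls on an occupied cell. Waypoints are treated as points rather than the full vehicle footprint, and the grid convention and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Grid = List[List[bool]]  # grid[row][col]; assumed rasterization of map / occupancy


def to_cell(p: Point, resolution: float, origin: Point) -> Tuple[int, int]:
    # Assumed convention: x maps to columns, y maps to rows.
    col = int((p[0] - origin[0]) // resolution)
    row = int((p[1] - origin[1]) // resolution)
    return row, col


def cell_value(grid: Grid, cell: Tuple[int, int]) -> bool:
    # Cells outside the grid count as False (not drivable / not occupied).
    row, col = cell
    if 0 <= row < len(grid) and 0 <= col < len(grid[row]):
        return grid[row][col]
    return False


def survives(traj: Sequence[Point], drivable: Grid, resolution: float, origin: Point) -> bool:
    # Survival (assumed): every waypoint lies on a drivable cell.
    return all(cell_value(drivable, to_cell(p, resolution, origin)) for p in traj)


def collision_free(traj: Sequence[Point], occupancy: Grid, resolution: float, origin: Point) -> bool:
    # Collision-free (assumed): no waypoint lies on an occupied cell.
    return not any(cell_value(occupancy, to_cell(p, resolution, origin)) for p in traj)


def rates(trajs, drivable, occupancy, resolution=1.0, origin=(0.0, 0.0)):
    """Return (survival %, collision-free %) over a set of trajectories."""
    if not trajs:
        raise ValueError("need at least one trajectory")
    n = len(trajs)
    n_surv = sum(survives(t, drivable, resolution, origin) for t in trajs)
    n_free = sum(collision_free(t, occupancy, resolution, origin) for t in trajs)
    return 100.0 * n_surv / n, 100.0 * n_free / n


# Toy 3x3 map: columns 0-1 are road, column 2 is off-road; one obstacle at (row 2, col 1).
drivable = [[True, True, False], [True, True, False], [True, True, False]]
occupancy = [[False, False, False], [False, False, False], [False, True, False]]
trajs = [
    [(0.5, 0.5), (1.5, 0.5)],  # stays on road, no obstacle
    [(0.5, 0.5), (2.5, 0.5)],  # leaves the road
    [(0.5, 1.5), (1.5, 2.5)],  # on road but hits the obstacle
    [(2.5, 2.5)],              # off-road but clear of obstacles
]
print(rates(trajs, drivable, occupancy))  # (50.0, 75.0)
```

A fuller version would test the ego vehicle's footprint polygon against both layers instead of single waypoints.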

CL dist.

↓ lower is better

Centerline distance: mean lateral deviation from the nearest HD map centerline in metres.
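A sketch of the centerline-distance metric, assuming each HD map centerline is a polyline of (x, y) points in metres and approximating lateral deviation by the Euclidean distance from each predicted waypoint to the nearest centerline segment. The helper names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    # Distance from p to segment ab: project p onto the line, clamp to the segment.
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment (a == b)
        return math.hypot(p[0] - ax, p[1] - ay)
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (ax + t * dx), p[1] - (ay + t * dy))


def nearest_centerline_distance(p: Point, centerlines: Sequence[Sequence[Point]]) -> float:
    # Minimum over every segment of every centerline polyline.
    return min(
        point_segment_distance(p, a, b)
        for line in centerlines
        for a, b in zip(line, line[1:])
    )


def mean_centerline_distance(traj: Sequence[Point], centerlines: Sequence[Sequence[Point]]) -> float:
    """Mean distance (m) from the trajectory's waypoints to the nearest centerline."""
    if not traj:
        raise ValueError("trajectory must be non-empty")
    return sum(nearest_centerline_distance(p, centerlines) for p in traj) / len(traj)


centerlines = [
    [(0.0, 0.0), (10.0, 0.0)],  # lane 1 centerline
    [(0.0, 5.0), (10.0, 5.0)],  # lane 2 centerline
]
traj = [(1.0, 0.5), (2.0, 1.5), (3.0, 4.0)]  # 0.5 m, 1.5 m, 1.0 m from nearest lane
print(mean_centerline_distance(traj, centerlines))  # 1.0
```

For straight segments, when the projection falls inside the segment, this distance equals the perpendicular (lateral) offset; on curved lanes a Frenet-frame projection onto the matched lane would be a closer match.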
