Monocular Depth Estimation

Range-stratified evaluation of state-of-the-art monocular depth methods on KITScenes Multimodal. Current models trained on existing datasets fail to generalize beyond 200 m: in the paper results below, δ₁ stays under 2% for every evaluated method in the >200 m range. Our long-range LiDAR ground truth exposes this systematic gap.

Leaderboard

Stay tuned for the KITScenes Multimodal Challenges!

Community leaderboard coming soon.

Preview the dataset on HuggingFace ↗

Paper Results

Absolute relative error (AbsRel ↓) and threshold accuracy (δ₁ ↑, in %), stratified by ground-truth distance range. The ground truth comes from KITScenes Multimodal's long-range LiDAR sensor (effective range >400 m).

Method	0–100 m AbsRel ↓	0–100 m δ₁ (%) ↑	100–200 m AbsRel ↓	100–200 m δ₁ (%) ↑	>200 m AbsRel ↓	>200 m δ₁ (%) ↑	Overall AbsRel ↓	Overall δ₁ (%) ↑
MapAnything	0.149	83.04	0.485	16.34	0.772	0.03	0.156	81.70
Depth Anything 3	0.278	48.64	0.472	12.32	0.689	0.86	0.282	47.91
UniDAC	0.386	24.12	0.302	40.17	0.540	1.78	0.384	24.36
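
The range stratification used in the table can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: the function name `stratified_absrel`, the half-open bin boundaries, the validity rule (finite, positive LiDAR depth), and pooling over pixels rather than averaging per image are all assumptions.

```python
import numpy as np

# Bin edges in metres. Half-open intervals [lo, hi) are an assumption;
# the source does not state how boundary pixels are assigned.
BINS = {
    "0-100 m": (0.0, 100.0),
    "100-200 m": (100.0, 200.0),
    ">200 m": (200.0, np.inf),
}

def stratified_absrel(pred, gt):
    """Return {bin name: (AbsRel, pixel count)} plus an 'Overall' entry."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Assumed validity rule: a pixel counts only if it has a finite, positive LiDAR depth.
    valid = np.isfinite(gt) & (gt > 0)
    # Per-pixel relative error; invalid pixels get a dummy divisor and are masked out later.
    err = np.abs(pred - gt) / np.where(valid, gt, 1.0)
    out = {}
    for name, (lo, hi) in BINS.items():
        mask = valid & (gt >= lo) & (gt < hi)
        n = int(mask.sum())
        out[name] = (float(err[mask].mean()) if n else float("nan"), n)
    n_all = int(valid.sum())
    out["Overall"] = (float(err[valid].mean()) if n_all else float("nan"), n_all)
    return out

# Toy data: five valid LiDAR returns and one pixel without a return (gt == 0).
gt = np.array([10.0, 50.0, 80.0, 150.0, 250.0, 0.0])
pred = np.array([11.0, 45.0, 80.0, 120.0, 100.0, 5.0])
result = stratified_absrel(pred, gt)
for name, (value, count) in result.items():
    print(f"{name}: AbsRel={value:.3f} over {count} pixels")
```

Under the pooling assumption, the overall AbsRel is the pixel-count-weighted mean of the per-bin values. If most valid LiDAR pixels lie within 100 m, the overall score will sit close to the 0–100 m score, which would be consistent with the table above.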

Metrics

AbsRel

↓ lower is better

Mean absolute relative error: |pred − GT| / GT, averaged over valid LiDAR pixels.
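
Written out as a formula, with notation that is assumed here rather than taken from the source: $\hat{d}_i$ is the predicted depth at pixel $i$, $d_i$ is the LiDAR depth at that pixel, and $V$ is the set of valid LiDAR pixels.

```latex
\mathrm{AbsRel} = \frac{1}{|V|} \sum_{i \in V} \frac{\lvert \hat{d}_i - d_i \rvert}{d_i}
```

For example, a prediction of 120 m at a 150 m LiDAR return contributes 30 / 150 = 0.2 to the average.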

δ₁

↑ higher is better

Threshold accuracy: share of valid LiDAR pixels where max(pred/GT, GT/pred) < 1.25, reported as a percentage in the results table.
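
A minimal sketch of the δ₁ computation follows. The function name `delta1`, the validity rule (positive LiDAR depth), and the choice to count non-positive predictions as misses are assumptions, not details confirmed by the source.

```python
import numpy as np

def delta1(pred, gt, threshold=1.25):
    """Percentage of valid pixels whose symmetric depth ratio is below the threshold."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # assumed validity rule: pixel has a positive LiDAR depth
    p, g = pred[valid], gt[valid]
    # The symmetric ratio penalises over- and under-estimation equally.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.maximum(p / g, g / p)
    # Non-positive predictions are counted as misses (an assumption).
    hit = (p > 0) & (ratio < threshold)
    return 100.0 * float(hit.mean())

# Ratios for the three valid pixels: 1.2 (hit), 1.3 (miss), 40/33 ≈ 1.21 (hit).
gt = np.array([10.0, 20.0, 40.0, 0.0])
pred = np.array([12.0, 26.0, 33.0, 5.0])
score = delta1(pred, gt)
print(f"delta1 = {score:.2f} %")
```

Taking the maximum of the two ratios makes the metric symmetric: predicting 33 m for a 40 m return is scored by 40/33, not 33/40.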