Monocular Depth Estimation

Range-stratified evaluation of state-of-the-art monocular depth methods on KITScenes Multimodal. Models trained on existing datasets fail to generalize beyond 200 m: every evaluated method scores a δ₁ below 2% in that range, a systematic gap exposed by our long-range LiDAR ground truth.

Leaderboard

Stay tuned for the KITScenes Multimodal Challenges!

Community leaderboard coming soon.

Preview the dataset on HuggingFace

Paper Results

Absolute relative error (AbsRel, lower is better) and threshold accuracy δ₁ (in %, higher is better), stratified by distance range. Ground truth comes from the long-range LiDAR sensor of KITScenes Multimodal (effective range >400 m).

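| Method | 0–100 m AbsRel ↓ | 0–100 m δ₁ (%) ↑ | 100–200 m AbsRel ↓ | 100–200 m δ₁ (%) ↑ | >200 m AbsRel ↓ | >200 m δ₁ (%) ↑ | Overall AbsRel ↓ | Overall δ₁ (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| MapAnything | 0.149 | 83.04 | 0.485 | 16.34 | 0.772 | 0.03 | 0.156 | 81.70 |
| Depth Anything 3 | 0.278 | 48.64 | 0.472 | 12.32 | 0.689 | 0.86 | 0.282 | 47.91 |
| UniDAC | 0.386 | 24.12 | 0.302 | 40.17 | 0.540 | 1.78 | 0.384 | 24.36 |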

Metrics

AbsRel

↓ lower is better

Mean absolute relative error: |pred − GT| / GT, averaged over valid LiDAR pixels.
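
Written out as a formula, the definition above might look as follows. The symbols are notation assumed here for illustration, not taken from the paper: V is the set of valid LiDAR pixels, \hat{d}_i the predicted depth and d_i the ground-truth depth at pixel i.

```latex
\mathrm{AbsRel} = \frac{1}{|V|} \sum_{i \in V} \frac{\lvert \hat{d}_i - d_i \rvert}{d_i}
```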

δ₁

↑ higher is better

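Threshold accuracy: fraction of pixels where max(pred/GT, GT/pred) < 1.25, reported as a percentage in the results table. For example, at GT = 300 m a prediction counts as accurate only if it lies between 240 m and 375 m.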
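
The two metrics and the range stratification can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the benchmark's evaluation code. The names `RANGE_BINS`, `depth_metrics` and `stratified_metrics` are hypothetical, and so are the convention that a ground-truth value of 0 or less marks a pixel without a LiDAR return and the treatment of each bin as (lo, hi]. Inputs are flat sequences of depths in metres; a real pipeline would likely vectorize this over image arrays.

```python
from math import inf

# Range bins in metres, mirroring the table columns. Treating each bin as
# (lo, hi] is an assumption; the source does not specify edge handling.
RANGE_BINS = {
    "0-100 m": (0.0, 100.0),
    "100-200 m": (100.0, 200.0),
    ">200 m": (200.0, inf),
    "Overall": (0.0, inf),
}


def depth_metrics(pred, gt, lo=0.0, hi=inf, threshold=1.25):
    """Return (AbsRel, delta_1 in percent, n_valid) for pixels with lo < gt <= hi.

    Returns None when no valid pixel falls inside the range.
    """
    abs_rel_sum = 0.0
    hits = 0
    n_valid = 0
    for p, g in zip(pred, gt):
        # Assumption: gt <= 0 marks a pixel without a LiDAR return.
        if g <= 0.0 or not (lo < g <= hi):
            continue
        abs_rel_sum += abs(p - g) / g
        # A non-positive prediction can never satisfy the ratio test.
        ratio = max(p / g, g / p) if p > 0.0 else inf
        if ratio < threshold:
            hits += 1
        n_valid += 1
    if n_valid == 0:
        return None
    return abs_rel_sum / n_valid, 100.0 * hits / n_valid, n_valid


def stratified_metrics(pred, gt):
    """Evaluate every range bin on flat sequences of predicted and GT depths."""
    return {
        name: depth_metrics(pred, gt, lo, hi)
        for name, (lo, hi) in RANGE_BINS.items()
    }


if __name__ == "__main__":
    # Toy values in metres, not taken from the dataset.
    pred = [10.0, 55.0, 120.0, 300.0, 7.0]
    gt = [10.0, 50.0, 160.0, 400.0, 0.0]
    for name, result in stratified_metrics(pred, gt).items():
        print(name, result)
```

In this sketch the overall figure is averaged per pixel, so it is dominated by whichever bin contributes the most valid pixels.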

KIT, FZI, TU Delft, UC3M, UPM, University of Toronto