Range-stratified evaluation of state-of-the-art monocular depth methods on KITScenes Multimodal. Models trained on existing datasets fail to generalize beyond 200 m: δ₁ drops below 2% for every method evaluated, a systematic gap exposed by our long-range LiDAR ground truth.
Stay tuned for the KITScenes Multimodal Challenges!
Community leaderboard coming soon.
Preview the dataset on HuggingFace.

Absolute relative error (AbsRel ↓) and threshold accuracy δ₁ (↑), stratified by distance range. LiDAR ground truth comes from KITScenes Multimodal's long-range sensor (effective range >400 m).
| Method | 0–100 m AbsRel ↓ | 0–100 m δ₁ ↑ | 100–200 m AbsRel ↓ | 100–200 m δ₁ ↑ | >200 m AbsRel ↓ | >200 m δ₁ ↑ | Overall AbsRel ↓ | Overall δ₁ ↑ |
|---|---|---|---|---|---|---|---|---|
| MapAnything | 0.149 | 83.04 | 0.485 | 16.34 | 0.772 | 0.03 | 0.156 | 81.70 |
| Depth Anything 3 | 0.278 | 48.64 | 0.472 | 12.32 | 0.689 | 0.86 | 0.282 | 47.91 |
| UniDAC | 0.386 | 24.12 | 0.302 | 40.17 | 0.540 | 1.78 | 0.384 | 24.36 |
AbsRel (↓ lower is better)
Mean absolute relative error: |pred − GT| / GT, averaged over valid LiDAR pixels.
δ₁ (↑ higher is better)
Threshold accuracy: percentage of valid LiDAR pixels where max(pred/GT, GT/pred) < 1.25.
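
The two definitions above, combined with the distance bins used in the table, can be sketched in plain Python as follows. This is an illustrative sketch only, not the benchmark's evaluation code: the names `RANGES` and `stratified_metrics` are hypothetical, invalid LiDAR pixels are assumed to be encoded as `GT <= 0`, and bins are assumed to be half-open (a pixel at exactly 100 m falls into the 100–200 m bin). δ₁ is returned in percent.

```python
from math import inf

# Distance bins in metres, mirroring the table columns.
# Assumption: the lower edge is inclusive and the upper edge is exclusive.
RANGES = {
    "0-100 m": (0.0, 100.0),
    "100-200 m": (100.0, 200.0),
    ">200 m": (200.0, inf),
    "Overall": (0.0, inf),
}


def stratified_metrics(pred, gt, ranges=RANGES, threshold=1.25):
    """Return {bin name: (AbsRel, delta_1 in percent)}, or None for an empty bin.

    pred and gt are flat sequences of per-pixel depths in metres.
    """
    results = {}
    for name, (lo, hi) in ranges.items():
        abs_rel_sum = 0.0
        hits = 0
        count = 0
        for p, g in zip(pred, gt):
            # Assumption: pixels without a LiDAR return are stored as gt <= 0.
            if g <= 0 or not (lo <= g < hi):
                continue
            abs_rel_sum += abs(p - g) / g
            # A non-positive prediction can never satisfy the ratio test.
            ratio = max(p / g, g / p) if p > 0 else inf
            if ratio < threshold:
                hits += 1
            count += 1
        if count == 0:
            results[name] = None
        else:
            results[name] = (abs_rel_sum / count, 100.0 * hits / count)
    return results
```

Calling `stratified_metrics(pred, gt)` on flattened depth maps gives one `(AbsRel, δ₁)` pair per column group of the table. Because each bin is averaged over its own pixels, the "Overall" value is dominated by whichever range holds the most LiDAR returns.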