Novel View Synthesis

Novel view synthesis (NVS) evaluation on KITScenes Multimodal combines standard photometric metrics with a map-based geometric fidelity test: traffic sign recall at lateral offsets probes whether synthesized views preserve 3D structure, a quality that PSNR and SSIM alone do not reveal.

Leaderboard

Stay tuned for the KITScenes Multimodal Challenges!

Community leaderboard coming soon.

Preview the dataset on HuggingFace ↗

Paper Results

Photometric Metrics

ReconDrive is evaluated on the KITScenes NVS benchmark (140 sequences, 216 windows) under three protocols: held-out camera novel view synthesis, ego-view reconstruction, and ego-view novel view synthesis.

| Method | Protocol | Description | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|
| ReconDrive | Held-out Cam NVS | Novel view from a withheld camera | 23.51 | 0.783 | 0.318 |
| ReconDrive | Ego Recon | Ego-view reconstruction (training views) | 32.42 | 0.951 | 0.073 |
| ReconDrive | Ego NVS | Ego-view novel view synthesis | 22.61 | 0.678 | 0.352 |

Photometric metrics measure perceptual quality on the original trajectory and do not capture geometric consistency at novel lateral poses.
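
For orientation, PSNR is the simplest of the three photometric metrics to write down: it is a log-scaled mean squared error between a rendered image and its reference. The sketch below is a minimal illustration and not the benchmark's evaluation code. It assumes pixel values normalized to [0, 1] and passed as flat sequences; the function name `psnr` and the `max_value` parameter are illustrative choices, and the page does not state the data range, color space, or averaging scheme behind the reported numbers.

```python
import math
from typing import Sequence


def psnr(rendered: Sequence[float], reference: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Assumption: both images are flattened to sequences of pixel values
    in the range [0, max_value]. Higher is better.
    """
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("images must be non-empty and have the same size")

    # Mean squared error over all pixel values.
    mse = sum((r - g) ** 2 for r, g in zip(rendered, reference)) / len(rendered)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0.0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the score depends only on per-pixel differences against a reference image at a known pose, it says nothing about views for which no reference photograph exists, which is the gap the traffic sign recall test targets.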

Traffic Sign Recall

Traffic sign recall on the front camera at seven lateral offsets (−3 m to +3 m). "Photo" is the detector's recall on the real photograph (upper bound). Degradation beyond ±1 m exposes the failure of current NVS methods to maintain 3D structural integrity — a limitation invisible to photometric metrics.

| Method | Resolution | Photo ↑ | −3 m | −2 m | −1 m | 0 m | +1 m | +2 m | +3 m |
|---|---|---|---|---|---|---|---|---|---|
| ReconDrive | Low: 280 × 518 px (model scale) | 19.7 | 4.1 (−79.2%) | 6.7 (−66.0%) | 11.4 (−42.1%) | 18.2 (−7.6%) | 11.0 (−44.2%) | 5.5 (−72.1%) | 3.7 (−81.2%) |
| ReconDrive | High: 1600 × 2844 px (sensor scale) | 21.6 | 3.4 (−84.3%) | 5.5 (−74.5%) | 9.5 (−56.0%) | 15.6 (−27.8%) | 9.4 (−56.5%) | 4.6 (−78.7%) | 3.0 (−86.1%) |

Percentages give the drop relative to photo recall. At ±3 m, recall degrades by roughly 80% or more (79.2% to 86.1%): current generalizable NVS methods cannot maintain structural integrity at lateral translations critical for autonomous driving simulation.
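
The relative drops in the traffic sign recall table can be reproduced from the recall values themselves. The sketch below assumes the drop is the signed change relative to photo recall, (recall − photo) / photo, expressed in percent. That formula matches every entry in the table to one decimal place, but the names `relative_drop`, `OFFSETS_M`, `LOW_RES`, and `HIGH_RES` are illustrative and not taken from the benchmark code.

```python
from typing import Dict, List

# Lateral offsets in metres, in table column order.
OFFSETS_M: List[int] = [-3, -2, -1, 0, 1, 2, 3]

# Recall values copied from the table (ReconDrive, front camera).
LOW_RES = {"photo": 19.7, "recall": [4.1, 6.7, 11.4, 18.2, 11.0, 5.5, 3.7]}
HIGH_RES = {"photo": 21.6, "recall": [3.4, 5.5, 9.5, 15.6, 9.4, 4.6, 3.0]}


def relative_drop(recall: float, photo_recall: float) -> float:
    """Signed change of recall relative to photo recall, in percent.

    Negative values mean the synthesized view yields lower recall
    than the real photograph.
    """
    if photo_recall <= 0:
        raise ValueError("photo recall must be positive")
    return 100.0 * (recall - photo_recall) / photo_recall


def drops_by_offset(row: dict) -> Dict[int, float]:
    """Map each lateral offset to its relative drop, rounded to 0.1%."""
    return {
        offset: round(relative_drop(value, row["photo"]), 1)
        for offset, value in zip(OFFSETS_M, row["recall"])
    }
```

Calling `drops_by_offset(LOW_RES)` and `drops_by_offset(HIGH_RES)` gives one percentage per offset, which makes the roughly symmetric fall-off on either side of 0 m easy to compare across the two resolutions.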

KIT FZI TU Delft UC3M UPM University of Toronto