Any Resolution Any Geometry: From Multi-View To Multi-Patch

CVPR 2026

Wenqing Cui1, Zhenyu Li1,*, Mykola Lavreniuk2,*, Jian Shi1, Ramzi Idoughi1, Xiangjun Tang1, Peter Wonka1
1KAUST, 2Space Research Institute NASU-SSAU
*Equal contribution

On in-the-wild high-resolution images, our method produces depth and normal maps with sharp boundaries and globally consistent geometry, preserving fine detail while keeping the depth and normal predictions consistent with each other.

Input RGB · Our Depth Prediction · Our Normal Prediction (Scene 1, 8K)

👀 Interactive Comparison




In-the-Wild Samples

Depth Estimation



RGB · Ours · Depth-Anything V2

Surface Normal Estimation



RGB · Ours · Metric3D V2



In-Domain Samples from UnrealStereo4K

Depth Estimation



RGB · Ours · Depth-Anything V2


Surface Normal Estimation



RGB · Ours · Metric3D V2


Method


We introduce a multi-patch framework for high-resolution monocular geometry estimation, delivering sharp and globally consistent depth and surface normals at any resolution (e.g., 2K, 4K, 8K) from a single RGB image.
The main ideas are:

  1. Reformulating high-resolution prediction as a multi-patch refinement task: we divide the input image into spatial patches, augment each patch with coarse depth and normal priors, and process all patches jointly with a unified transformer backbone.
  2. Employing cross-patch attention with global positional encoding to propagate information across distant regions, enforcing seamless boundaries and coherent geometry across the entire image.
  3. Introducing a Variable Multi-Patch Training (GridMix) strategy that samples different patch-grid configurations during training, improving robustness to image resolution and spatial layout and yielding strong zero-shot performance on real-world benchmarks.
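Idea 1 can be illustrated with a minimal patch-splitting sketch. This is an assumption-laden illustration, not the paper's actual code: the names `Patch` and `split_into_patches` are hypothetical, and the real pipeline additionally attaches coarse depth and normal priors to each patch before the transformer processes them jointly. The key detail shown here is that every patch records its global pixel offset, which is what a global positional encoding can later be built from.

```python
# Illustrative sketch (hypothetical names): tile an image into a grid of
# patches while recording each patch's global offset in the full image.
from dataclasses import dataclass
from typing import List

@dataclass
class Patch:
    row: int     # grid row index
    col: int     # grid column index
    top: int     # global pixel offset (y) in the full image
    left: int    # global pixel offset (x) in the full image
    height: int
    width: int

def split_into_patches(height: int, width: int,
                       grid_rows: int, grid_cols: int) -> List[Patch]:
    """Partition a height x width image into grid_rows x grid_cols patches.

    Remainder pixels are absorbed by the last row/column so the patches
    tile the image exactly, with no gaps or overlap.
    """
    patches = []
    base_h, rem_h = divmod(height, grid_rows)
    base_w, rem_w = divmod(width, grid_cols)
    top = 0
    for r in range(grid_rows):
        h = base_h + (rem_h if r == grid_rows - 1 else 0)
        left = 0
        for c in range(grid_cols):
            w = base_w + (rem_w if c == grid_cols - 1 else 0)
            patches.append(Patch(r, c, top, left, h, w))
            left += w
        top += h
    return patches
```

In the framework described above, each such patch (concatenated with its coarse geometry priors) becomes one input to the shared transformer backbone.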
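For idea 2, the essential ingredient is that tokens from all patches share one global coordinate frame: a token's position is its patch offset plus its local position, and the encoding is computed from those global coordinates before tokens from different patches attend to each other. The sketch below shows a standard sinusoidal encoding over global (y, x) coordinates; the function name, dimension, and `max_extent` scale are assumptions for illustration, not the paper's actual encoding.

```python
# Illustrative global positional encoding (hypothetical parameters):
# encode a token's *global* image coordinates so that tokens from
# different patches are directly comparable under cross-patch attention.
import math

def global_position_encoding(y: float, x: float,
                             dim: int = 8, max_extent: float = 8192.0):
    """Sinusoidal encoding of global coordinates (y, x) into `dim` values.

    `dim` must be divisible by 4 (sin/cos for each of y and x per band);
    `max_extent` bounds the lowest frequency, roughly the largest image
    side expected (e.g. 8K inputs).
    """
    enc = []
    for i in range(dim // 4):
        freq = 1.0 / (max_extent ** (4 * i / dim))
        enc += [math.sin(y * freq), math.cos(y * freq),
                math.sin(x * freq), math.cos(x * freq)]
    return enc
```

Because the encoding depends only on global position, two tokens adjacent across a patch boundary receive neighboring encodings, which is what lets cross-patch attention enforce seamless boundaries.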
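Idea 3 (GridMix) can be sketched as sampling a different patch-grid configuration at every training step, so the model sees many resolutions and spatial layouts. The candidate grid list and function name below are illustrative assumptions; the paper's actual set of grid configurations is not specified here.

```python
# Illustrative GridMix-style sampler (assumed grids, not the paper's
# training configuration): draw one grid layout per training step.
import random

CANDIDATE_GRIDS = [(1, 1), (2, 2), (2, 3), (3, 3), (4, 4)]

def sample_step_layout(height: int, width: int, rng: random.Random):
    """Pick a grid for this step; return it with the base patch size."""
    rows, cols = rng.choice(CANDIDATE_GRIDS)
    return (rows, cols), (height // rows, width // cols)
```

Varying the grid this way exposes the backbone to patches of many sizes and positions, which is credited above for the robustness to image resolution and the zero-shot transfer to real-world benchmarks.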


Framework diagram