
RayRoPE

Projective Ray Positional Encoding for Multi-view Attention

Carnegie Mellon University    Apple
Abstract

We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows SE(3)-invariant attention with multi-frequency similarity, and can be adaptive to the geometry of the underlying scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet the above desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions based on associated rays but leverages a predicted point along the ray instead of the direction for a geometry-aware encoding. To achieve SE(3) invariance, RayRoPE computes query-frame projective coordinates for computing multi-frequency similarity. Lastly, as the 'predicted' 3D point along a ray may not be precise, RayRoPE presents a mechanism to analytically compute the expected position encoding under uncertainty. We validate RayRoPE on the tasks of novel-view synthesis and stereo depth estimation and show that it consistently improves over alternate position encoding schemes (e.g. 15% relative improvement on LPIPS in Co3D). We also show that RayRoPE can seamlessly incorporate RGB-D input, resulting in even larger gains over alternatives that cannot positionally encode this information.

Problem: How should we design positional encoding for multi-view transformers?

We live in a 3D world, and the positional encoding of transformers should be 3D-aware as well. In this project, we propose RayRoPE, a novel relative positional encoding mechanism designed for multi-view attention.

Two-row figure comparing different models and their positional encodings

What makes a good positional encoding in multi-view?

We argue that an ideal positional encoding for multi-view attention should satisfy four desirable properties illustrated below:

Four desirable properties for 3D positional encoding
Comparison of methods on the four properties (SE(3) invariance, uniqueness, geometric adaptiveness, multi-frequency): Plücker Raymap, PRoPE (only for patch indices), and RayRoPE (Ours), which is the only one satisfying all four.
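
To make the SE(3) invariance property concrete: an encoding built from coordinates expressed in the query camera's frame depends only on relative geometry, so re-expressing all poses and points in a different world frame leaves it unchanged. Below is a minimal NumPy check of this fact; the helper names and the random-pose construction are illustrative, not from the paper.

```python
import numpy as np

def random_se3(rng):
    """Random rigid transform (4x4) from a random rotation and translation."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    Q *= np.sign(np.linalg.det(Q))            # make it a proper rotation (det = +1)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = Q, rng.normal(size=3)
    return T

def query_frame_coords(p_world, T_world_from_query):
    """Express a world-frame point in the query camera's frame."""
    T = np.linalg.inv(T_world_from_query)     # query <- world
    return T[:3, :3] @ p_world + T[:3, 3]

rng = np.random.default_rng(0)
T_wq = random_se3(rng)                         # query camera pose (camera-to-world)
p_w = rng.normal(size=3)                       # 3D point associated with a key token

G = random_se3(rng)                            # arbitrary change of world frame
p_w_new = G[:3, :3] @ p_w + G[:3, 3]           # same point, re-expressed
T_wq_new = G @ T_wq                            # same camera, re-expressed

# Query-frame coordinates, and hence any encoding built from them,
# are identical under both world frames.
assert np.allclose(query_frame_coords(p_w, T_wq),
                   query_frame_coords(p_w_new, T_wq_new))
```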

Method: RayRoPE Encodings

  • Insight 1: Depth predictions help the positional encoding adapt to the scene geometry.
  • Insight 2: Projecting rays into the query camera's frame ensures independence from world coordinates.
  • Insight 3: Modeling uncertainty in depth via an expected RoPE produces more stable encodings (see the sketch below).
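
A minimal NumPy sketch of the three insights follows. It assumes a Gaussian depth estimate per ray and standard per-coordinate rotary phases, and it uses plain 3D query-frame coordinates rather than the projective coordinates described in the paper; all function and variable names are illustrative, not the paper's implementation.

```python
import numpy as np

def expected_rope_features(origin_w, dir_w, depth_mu, depth_sigma,
                           T_query_from_world, freqs):
    """Multi-frequency rotary phases for one token: geometry-aware (uses the
    predicted depth), expressed in the query camera's frame, and averaged
    analytically over Gaussian depth uncertainty."""
    d = dir_w / np.linalg.norm(dir_w)              # unit ray direction (world frame)
    p_w = origin_w + depth_mu * d                  # predicted 3D point along the ray
    R, t = T_query_from_world[:3, :3], T_query_from_world[:3, 3]
    p_q = R @ p_w + t                              # point in the query camera's frame
    d_q = R @ d                                    # ray direction in the query frame
    phase_mu = p_q[:, None] * freqs[None, :]       # (3, F) mean rotary phases
    phase_std = depth_sigma * d_q[:, None] * freqs[None, :]
    # For x ~ N(mu, sigma^2): E[exp(i*w*x)] = exp(i*w*mu) * exp(-(w*sigma)^2 / 2),
    # so uncertain depths attenuate the high-frequency components the most.
    return np.exp(1j * phase_mu) * np.exp(-0.5 * phase_std ** 2)

# Illustrative use: a positional similarity between two tokens as the real part
# of a Hermitian inner product; with depth_sigma -> 0 this reduces to standard
# rotary phases of the predicted 3D coordinates.
freqs = 1.0 / (100.0 ** (np.arange(8) / 8))        # illustrative frequency bands
q = expected_rope_features(np.zeros(3), np.array([0.0, 0.0, 1.0]), 2.0, 0.1,
                           np.eye(4), freqs)
k = expected_rope_features(np.array([0.5, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]),
                           2.0, 0.3, np.eye(4), freqs)
similarity = np.real(np.sum(q * np.conj(k)))
```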

RayRoPE Improves Novel View Synthesis

We train the LVSM model with different positional encodings and compare their results below.

Interactive viewer: the left and right videos can each show PRoPE or RayRoPE results for a selected scene and its input views (Scenes 0–3, 5, and 6 with four reference views each; re10k scenes 0–5 with two reference views each).

RayRoPE Improves Stereo Depth Estimation

We evaluate stereo depth estimation performance with the UniMatch model. The 3D point clouds below show the predicted depth for each method:

Point-cloud panels: Ground Truth, UniMatch, PRoPE, RayRoPE (Ours).

Scene selector (reference and target views): Scenes 0001, 0048, 0071, 0150.

Emergent Depth

RayRoPE predicts depths and uncertainties, which are used to compute the positional encodings. Even without depth supervision during training, reasonable depth predictions emerge, especially in the later layers. Depth prediction accuracy and the predicted uncertainties are inversely correlated.

Emergent Depth Visualization

Scene selector: Scenes 101, 106, 108, 109, 110.