
RayRoPE

Projective Ray Positional Encoding for Multi-view Attention

Carnegie Mellon University · Apple
Abstract

We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows SE(3)-invariant attention with multi-frequency similarity, and can be adaptive to the geometry of the underlying scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet the above desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions based on associated rays but leverages a predicted point along the ray instead of the direction for a geometry-aware encoding. To achieve SE(3) invariance, RayRoPE computes query-frame projective coordinates for computing multi-frequency similarity. Lastly, as the 'predicted' 3D point along a ray may not be precise, RayRoPE presents a mechanism to analytically compute the expected position encoding under uncertainty. We validate RayRoPE on the tasks of novel-view synthesis and stereo depth estimation and show that it consistently improves over alternate position encoding schemes (e.g. 15% relative improvement on LPIPS in Co3D). We also show that RayRoPE can seamlessly incorporate RGB-D input, resulting in even larger gains over alternatives that cannot positionally encode this information.

How should we design positional encoding for multi-view transformers?

We live in a 3D world, and the positional encoding of transformers should be 3D-aware as well. In this project, we propose RayRoPE, a novel relative positional encoding mechanism designed for multi-view attention.


What makes a good positional encoding in multi-view?

We argue that an ideal positional encoding for multi-view attention should satisfy four desirable properties illustrated below:

Four desirable properties for 3D positional encoding: SE(3) invariance, uniqueness, geometric adaptiveness, and multi-frequency similarity. Prior schemes satisfy only a subset: Plücker raymaps, RoPE on raymaps, and GTA/PRoPE (whose multi-frequency similarity applies only to patch indices) each fall short, while RayRoPE (Ours) satisfies all four.

Method: RayRoPE Encodings

  • Insight 1: We parametrize patch positions via their associated rays, using a predicted point along each ray to adapt to scene geometry.
  • Insight 2: Projecting rays into the query camera's frame ensures independence from the world coordinate system.
  • Insight 3: Modeling depth uncertainty via an expected RoPE produces more stable encodings.
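The three insights above can be sketched in a few lines. This is a simplified, hypothetical illustration (not the authors' implementation): `ray_point` back-projects a patch center to a 3D point at a predicted depth along its ray, `world_to_query` re-expresses that point in the query camera's frame so the coordinates do not depend on the world frame, and `apply_rope` rotates feature pairs at multiple frequencies.

```python
import numpy as np

def ray_point(K, cam_to_world, pixel, depth):
    """Back-project a patch center (u, v) to a 3D world point at the
    predicted depth along its camera ray."""
    u, v = pixel
    direction = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray in camera frame
    p_cam = depth * direction                             # point along the ray
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return R @ p_cam + t                                  # world coordinates

def world_to_query(p_world, query_cam_to_world):
    """Express a world point in the query camera's frame; the result is
    invariant to a global SE(3) transform applied to all cameras."""
    R, t = query_cam_to_world[:3, :3], query_cam_to_world[:3, 3]
    return R.T @ (p_world - t)

def rope_angles(p_query, freqs):
    """One rotation angle per (coordinate, frequency) pair."""
    return np.outer(p_query, freqs).ravel()

def apply_rope(x, angles):
    """Standard RoPE: rotate consecutive feature pairs of x by the angles."""
    pairs = x.reshape(-1, 2)
    c, s = np.cos(angles), np.sin(angles)
    return np.stack([c * pairs[:, 0] - s * pairs[:, 1],
                     s * pairs[:, 0] + c * pairs[:, 1]], axis=-1).ravel()
```

Because the encoded coordinates live in the query camera's frame, re-posing the entire scene by a rigid transform leaves the angles (and hence the attention logits) unchanged.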

RayRoPE Improves Novel View Synthesis

We train the LVSM model with different positional encodings and compare their results below.

Left video: PRoPE · Right video: RayRoPE (Ours)


RayRoPE Improves Stereo Depth Estimation

We evaluate stereo depth estimation performance with the UniMatch model. The 3D point clouds below show the predicted depth for each method:

Point clouds: Ground Truth · UniMatch · PRoPE · RayRoPE (Ours)


Emergent Depth

RayRoPE predicts depths and uncertainties, which are used to compute the positional encodings. Even without depth supervision during training, reasonable depth predictions emerge, especially in the later layers, and the predicted uncertainties are inversely correlated with depth prediction accuracy.
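The expected positional encoding under uncertainty has a simple closed form if the predicted depth is modeled as Gaussian; this is an illustrative assumption, and the paper's exact uncertainty model may differ. For d ~ N(μ, σ²), the expectation of each rotary pair damps the rotation by exp(−ω²σ²/2), so uncertain depths automatically suppress high-frequency components:

```python
import numpy as np

def expected_rope_pair(mu, sigma, omega):
    """Closed-form expectation of a rotary pair (cos wd, sin wd) when the
    depth d is modeled as Gaussian N(mu, sigma^2):
        E[cos(wd)] = exp(-w^2 s^2 / 2) * cos(w mu)
        E[sin(wd)] = exp(-w^2 s^2 / 2) * sin(w mu)
    Large sigma (high uncertainty) or large omega (high frequency) shrinks
    the rotation toward zero, damping unreliable components."""
    damp = np.exp(-0.5 * (omega * sigma) ** 2)
    return damp * np.cos(omega * mu), damp * np.sin(omega * mu)
```

In the certain limit (σ → 0) this recovers the plain rotation (cos ωμ, sin ωμ), so confident depth predictions behave exactly like standard RoPE.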

Emergent Depth Visualization


Interactive Attention Similarity Demo

Explore how RayRoPE biases attention similarity as the relative camera pose and the predicted depth change. The query camera (orange) is fixed at the identity, and the heatmap shows the attention similarity between the central query token and all key tokens. We set the $Q$ and $K$ features to a constant value of 1, so the similarity is determined purely by the positional encodings; multi-frequency is disabled.

Camera Positions (3D View)
Attention Similarity Heatmap (Key View)
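With constant all-ones features, a single-frequency rotary similarity reduces to a function of the relative rotation angle alone, which is what the heatmap visualizes. A minimal sketch of the 2D single-pair case (hypothetical, for intuition only):

```python
import numpy as np

def rope_similarity(theta_q, theta_k):
    """Attention logit between a query and key whose features are all ones,
    under a single 2D rotary pair. Rotating q by theta_q and k by theta_k
    gives q . k = 2 * cos(theta_q - theta_k): the similarity depends only
    on the relative rotation, not on either absolute angle."""
    q = np.array([1.0, 1.0])
    k = np.array([1.0, 1.0])
    def rot(t):
        c, s = np.cos(t), np.sin(t)
        return np.array([[c, -s], [s, c]])
    return (rot(theta_q) @ q) @ (rot(theta_k) @ k)
```

Sweeping `theta_k` over the angles induced by each key token's projective coordinates reproduces a one-frequency version of the heatmap above.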

Acknowledgements

This work was supported by Apple. We thank Zihan Wang and Qitao Zhao for insightful discussions throughout the project. Additionally, our implementation is built upon the open-source frameworks PRoPE and Unimatch; we are grateful to the authors for making their code available.

Citation

@article{wu2026rayrope,
  title={RayRoPE: Projective Ray Positional Encoding for Multi-view Attention},
  author={Wu, Yu and Jeon, Minsik and Chang, Jen-Hao Rick and Tuzel, Oncel and Tulsiani, Shubham},
  journal={arXiv preprint arXiv:2601.15275},
  year={2026}
}