The original was posted on /r/machinelearning by /u/AuspiciousApple on 2025-01-12 19:30:24+00:00.
Original Title: [D] Is a ViT with local window attention (SAM-style) not that much more efficient than a vanilla ViT with global attention in all layers? Especially at high resolution where global attention should be super expensive.
I was reading this blog post by Lucas Beyer.
When he compares ViT-B/16 with the SAM variant that uses mostly local attention (window size 14), I was a bit surprised that the throughput improvement is slight (left plot) and that the SAM variant actually requires more peak memory.
Now, this is inference only, so maybe the difference is larger during training, but I naively would have thought that local attention is still much faster, especially at high resolutions.
At 1024x1024 with patch size 16, we should have 1024/16 = 64, i.e. a 64x64 grid of 4096 tokens, so global attention means a 4096x4096 attention matrix per head per layer, while a 14x14 window only ever attends over 196 tokens at a time - surely the global attention operation should be extremely expensive by comparison? Am I missing something?
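To sanity-check my surprise, here's a rough back-of-envelope sketch (my own numbers using standard ViT-B dimensions, not figures from the post) counting multiply-accumulates per encoder layer for global vs. 14x14-windowed attention at this resolution:

```python
# Rough per-layer multiply-accumulate (MAC) counts for ViT-B/16 at 1024x1024.
# My own back-of-envelope estimates, not numbers from the blog post.

img, patch, win, dim = 1024, 16, 14, 768   # input size, patch, SAM window, ViT-B width

n = (img // patch) ** 2                    # 64 * 64 = 4096 tokens

# Global self-attention: QK^T and AV each cost n^2 * dim MACs.
attn_global = 2 * n**2 * dim               # ~25.8 G-MACs

# Windowed attention: SAM pads the 64x64 token grid to 70x70 so that
# 14 divides it evenly, giving 5x5 windows of 14*14 = 196 tokens each.
n_windows = (70 // win) ** 2               # 25 windows
w_tokens = win * win                       # 196 tokens per window
attn_window = 2 * n_windows * w_tokens**2 * dim  # ~1.5 G-MACs, ~17x cheaper

# Per-layer costs that are identical for both variants:
proj = 4 * n * dim**2                      # QKV (3x) + output (1x) projections
mlp = 8 * n * dim**2                       # two matmuls with 4x expansion

for name, attn in [("global", attn_global), ("windowed", attn_window)]:
    print(f"{name:8s}: attn {attn/1e9:5.1f}G + proj/mlp "
          f"{(proj + mlp)/1e9:5.1f}G = {(attn + proj + mlp)/1e9:5.1f} G-MACs/layer")
```

If this arithmetic is right, the attention matmuls are only about half of the per-layer compute even at 4096 tokens (the QKV/output projections and the MLP make up the rest), so making attention ~17x cheaper only cuts per-layer MACs by roughly 1.8x, and the SAM variant still keeps global attention in some layers on top of that. Is that the whole story?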