Yes, with VAPE - Vector Addition Positional Encoding.
I’ve been exploring a new approach to positional encoding that I’m calling VAPE - Vector Addition Positional Encoding.
The Method:
- borrow some number of channels from queries and keys,
- run a cumulative (prefix) sum across sequence length on these borrowed channels (add vectors together),
- normalize - divide by the square root of the vector’s magnitude,
- we now have position-aware channels,
- concatenate them back onto the queries and keys (a rough code sketch of these steps follows right after this list).
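Here is a minimal sketch of those steps in PyTorch, based only on the description above. The function name `vape`, the `pos_channels` argument, and the `eps` clamp are my own illustrative choices, not the author's code:

```python
import torch

def vape(q, k, pos_channels=1, eps=1e-6):
    """Add positional information by prefix-summing a few borrowed q/k channels."""
    def encode(x):
        # x: (batch, heads, seq_len, head_dim)
        # 1. Borrow the first `pos_channels` channels.
        borrowed, rest = x[..., :pos_channels], x[..., pos_channels:]
        # 2. Cumulative (prefix) sum across the sequence dimension.
        summed = borrowed.cumsum(dim=-2)
        # 3. Normalize: divide by the square root of the vector's magnitude.
        mag = summed.norm(dim=-1, keepdim=True).clamp_min(eps)
        summed = summed / mag.sqrt()
        # 4. Concatenate the position-aware channels back onto the rest.
        return torch.cat([summed, rest], dim=-1)
    return encode(q), encode(k)
```

If this reading is right, `q, k = vape(q, k)` would slot in where RoPE is normally applied, just before the attention scores are computed.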
What’s intriguing is that this method works effectively with just a single channel per head. With a single channel the prefix sum runs over scalars rather than vectors, and the method still works.
VAPE features:
- No Extra Parameters: VAPE introduces positional information without adding any new parameters to the model, preserving its simplicity and efficiency.
- Performance: Early tests indicate that VAPE outperforms methods like RoPE in final perplexity.
- Extrapolation: Early tests suggest that VAPE extrapolates beyond the training context length quite nicely, since no explicit positional information is injected the way it is in RoPE.
- Compatibility with Flash Attention: Since VAPE only modifies the queries and keys themselves, with no extra bias on the attention scores, it’s fully compatible with Flash Attention.
- Efficiency: Only a small number of channels per head is used for positional encoding, so the extra compute is just a prefix sum and a normalization over those channels.
- Inference Speed: VAPE caches the last positional state for queries and keys - a bit like an SSM/RNN, where only the last state is needed to compute the next one (a minimal sketch follows after this list).
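To illustrate the cached-state idea from the last bullet, here is a hedged single-token sketch, again an assumed interface rather than the author's implementation: the running prefix sum of the borrowed channels is the only state carried between decoding steps.

```python
import torch

def vape_step(x_t, state, pos_channels=1, eps=1e-6):
    # x_t: (batch, heads, 1, head_dim) for the newly generated token.
    # state: running sum of the borrowed channels over all previous tokens
    #        (initialize with zeros of shape (batch, heads, 1, pos_channels)).
    borrowed, rest = x_t[..., :pos_channels], x_t[..., pos_channels:]
    state = state + borrowed                              # update the prefix sum
    mag = state.norm(dim=-1, keepdim=True).clamp_min(eps)
    encoded = state / mag.sqrt()                          # same normalization as above
    return torch.cat([encoded, rest], dim=-1), state
```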
Seeking Your Insight:
- What benchmarks or specific comparisons would best demonstrate VAPE’s value to you?
- Do you know of any methods similar to VAPE?
Benchmarks:
I’ve run some early tests that look very promising for causal language modeling, but I have quite limited resources for benchmarking, so before I put serious effort into it I think it’s better to ask the community how best to go about it.