Yes, with VAPE - Vector Addition Positional Encoding.
I’ve been exploring a new approach to positional encoding that I’m calling VAPE - Vector Addition Positional Encoding.
The Method:
- borrow some number of channels from queries and keys,
- run a cumulative (prefix) sum across sequence length on these borrowed channels (add vectors together),
- normalize - divide by the square root of the vector’s magnitude,
- we now have position-aware channels,
- concatenate them back onto the queries and keys (a rough code sketch of these steps follows right after this list).
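Here is a minimal sketch of those steps in PyTorch, based only on the description above. The function name `vape`, the `pos_channels` argument, and the `eps` clamp are my own illustrative choices, not the author's code:

```python
import torch

def vape(q, k, pos_channels=1, eps=1e-6):
    """Add positional information by prefix-summing a few borrowed q/k channels."""
    def encode(x):
        # x: (batch, heads, seq_len, head_dim)
        # 1. Borrow the first `pos_channels` channels.
        borrowed, rest = x[..., :pos_channels], x[..., pos_channels:]
        # 2. Cumulative (prefix) sum across the sequence dimension.
        summed = borrowed.cumsum(dim=-2)
        # 3. Normalize: divide by the square root of the vector's magnitude.
        mag = summed.norm(dim=-1, keepdim=True).clamp_min(eps)
        summed = summed / mag.sqrt()
        # 4. Concatenate the position-aware channels back onto the rest.
        return torch.cat([summed, rest], dim=-1)
    return encode(q), encode(k)
```

If this reading is right, `q, k = vape(q, k)` would slot in where RoPE is normally applied, just before the attention scores are computed.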
What’s intriguing is that this method works effectively with just a single channel per head. With a single channel the prefix sum runs over scalars rather than vectors, and the method still works.
VAPE features:
- No Extra Parameters: VAPE introduces positional information without adding any new parameters to the model, preserving its simplicity and efficiency.
- Performance: Early tests indicate that VAPE outperforms methods like RoPE in final perplexity.
- Extrapolation: Early tests suggest that VAPE extrapolates beyond the training context length quite nicely, since no explicit positional information is injected the way it is in RoPE.
- Compatibility with Flash Attention: Since VAPE only modifies the queries and keys themselves, with no extra bias on the attention scores, it’s fully compatible with Flash Attention.
- Efficiency: Only a small number of channels per head is used for positional encoding, so the extra compute is just a prefix sum and a normalization over those channels.
- Inference Speed: VAPE caches the last positional state for queries and keys - a bit like an SSM/RNN, where only the last state is needed to compute the next one (a minimal sketch follows after this list).
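To illustrate the cached-state idea from the last bullet, here is a hedged single-token sketch, again an assumed interface rather than the author's implementation: the running prefix sum of the borrowed channels is the only state carried between decoding steps.

```python
import torch

def vape_step(x_t, state, pos_channels=1, eps=1e-6):
    # x_t: (batch, heads, 1, head_dim) for the newly generated token.
    # state: running sum of the borrowed channels over all previous tokens
    #        (initialize with zeros of shape (batch, heads, 1, pos_channels)).
    borrowed, rest = x_t[..., :pos_channels], x_t[..., pos_channels:]
    state = state + borrowed                              # update the prefix sum
    mag = state.norm(dim=-1, keepdim=True).clamp_min(eps)
    encoded = state / mag.sqrt()                          # same normalization as above
    return torch.cat([encoded, rest], dim=-1), state
```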
Seeking Your Insight:
- What benchmarks or specific comparisons would best demonstrate VAPE’s value to you?
- Do you know of any methods similar to VAPE?
Benchmarks:
I’ve run some early tests that look very promising for causal language modeling, but I have quite limited resources for benchmarking, so before I put serious effort into it I think it’s better to ask the community how best to go about it.