The original was posted on /r/machinelearning by /u/danielhanchen on 2024-10-21 19:37:38+00:00.


Hey r/MachineLearning folks! Just an update on the gradient accumulation bug - the fix should be in nightly transformers and also in Unsloth's trainers, so definitely update them! As a recap, grad accumulation in most trainers was calculated incorrectly, causing the loss curves to differ from full-batch training.

Recap of the gradient accumulation bug

Gradient accumulation is used to mimic large-batch training by splitting a batch into smaller mini-batches to reduce GPU VRAM usage. So if your batch size was 32, you could use a batch size of 8 and run 4 mini-steps, accumulating the gradients along the way. The key constraint is that ga * bsz is held constant, so you can trade the two numbers off against each other.

So the trick of grad accum is that you can add up all the mini-batch gradients in-place, and after some scaling, you get back the same gradient as if you had done 1 full batch.
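Here's a minimal PyTorch sketch of that loop with a toy model (not the actual trainer code) - each mini-step's gradients are summed in-place into .grad, and dividing each mini-batch loss by the number of accumulation steps does the scaling:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the real model/data, just to show the accumulation mechanics.
model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()          # returns the *mean* loss over a mini-batch

full_bsz, accum_steps = 32, 4
mini_bsz = full_bsz // accum_steps       # ga * bsz stays constant: 4 * 8 = 32

x = torch.randn(full_bsz, 16)
y = torch.randint(0, 4, (full_bsz,))

optimizer.zero_grad()
for i in range(accum_steps):
    xb = x[i * mini_bsz:(i + 1) * mini_bsz]
    yb = y[i * mini_bsz:(i + 1) * mini_bsz]
    loss = loss_fn(model(xb), yb)
    (loss / accum_steps).backward()      # gradients are summed in-place into .grad
optimizer.step()                         # one update, as if we had run the full batch of 32
```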

The issue is that the original 2017 paper only showed this works in expectation, but a common misconception was that GA is exactly equivalent to full-batch training, i.e. bsz=32, ga=1 should be mathematically identical to bsz=1, ga=32. Benjamin first reported that the training losses did not match up, and in fact the problem had gone unsolved for some 4-5 years.
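Roughly speaking, the mismatch comes down to how the losses are normalised: each mini-batch's loss is a mean over its own (non-padded) token count, and averaging those means is not the same as taking one mean over the full batch when the counts differ. A toy numeric sketch with made-up numbers:

```python
# Made-up token-level losses for two mini-batches with different numbers
# of non-padded tokens (3 tokens vs 1 token).
chunk_a = [2.0, 1.0, 3.0]
chunk_b = [10.0]

# Naive grad accum: average the per-mini-batch mean losses.
naive = (sum(chunk_a) / len(chunk_a) + sum(chunk_b) / len(chunk_b)) / 2   # (2.0 + 10.0) / 2 = 6.0

# Full batch: one mean over all tokens.
full = (sum(chunk_a) + sum(chunk_b)) / (len(chunk_a) + len(chunk_b))      # 16.0 / 4 = 4.0

print(naive, full)   # 6.0 vs 4.0 - only equal when every chunk has the same token count
```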

Is the gradient accumulation bug serious?

If you simply plot the L2 norm of the difference between the gradient-accumulated gradients and the full-batch gradients, you get error plots like the one below:

There is some 0.03 L2 difference as you increase the number of gradient accumulation steps, whilst it's supposed to stay flat. After the fix, the error reduces to 0.0005 ish, and we show there are some numerical precision issues from accumulating gradients, albeit not much.
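If you want to reproduce the flavour of that check, here's a toy sketch (tiny linear model with equal-length mini-batches, not the actual LLM experiment) comparing the accumulated gradient against the full-batch gradient via the L2 norm of their difference:

```python
import torch
import torch.nn as nn

def grad_vector(model):
    # Flatten every parameter's gradient into a single vector for comparison.
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

torch.manual_seed(0)
model = nn.Linear(16, 4)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))

# Full-batch gradient.
model.zero_grad()
loss_fn(model(x), y).backward()
full = grad_vector(model)

# Gradient-accumulated gradient: 4 mini-steps of batch size 8.
model.zero_grad()
for xb, yb in zip(x.chunk(4), y.chunk(4)):
    (loss_fn(model(xb), yb) / 4).backward()
accum = grad_vector(model)

# Essentially zero here - only float32 round-off remains in this equal-size toy case.
print(torch.linalg.norm(accum - full).item())
```

In this toy case the mini-batches are the same size, so the difference is essentially floating-point noise - the small residual error mentioned above.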

But it's worse - I also showcase that LoRA on Wikitext incurs a significant penalty when using grad accum.

I listed all the experiments here. So it was much worse than I first anticipated.

Getting the bug fix & more details

The bug fix should be in nightly transformers now! The fix is also already inside Unsloth - there's a Colab for it as well.

More details are in the blog post, which also has a bit of maths proofs and stuff! I also talk about it in a lecture I gave on the GPU MODE / CUDA MODE server.

If anyone has any questions, feel free to ask! Thanks!