This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/lambda-research on 2024-10-16 18:57:34+00:00.


Hey All,

We’ve been writing a technical guide on how to scale training code from a single GPU all the way up to multiple nodes.

It’s centered on training LLMs and covers DDP (DistributedDataParallel), FSDP (FullyShardedDataParallel), diagnosing errors, logging, and more.

We tried to make the code and explanations as clear and simple as possible. Let us know if you find it helpful!

Contributions are welcome, and feel free to open issues with feature requests or bug reports.