This is an automated archive made by the Lemmit Bot.

The original was posted on /r/rust by /u/ksyiros on 2025-04-23 19:52:10+00:00.


We’re releasing Burn 0.17.0 today, a massive update that improves the deep learning framework in every aspect! Enhanced hardware support, new acceleration features, faster kernels, and better compilers - all for better performance and reliability.

Broader Support

Mac users will be happy, as we’ve created a custom Metal compiler for our WGPU backend to leverage tensor core instructions, speeding up matrix multiplication by up to 3x. It builds on our revamped C++ compiler, where we introduced dialects for CUDA, Metal, and HIP (ROCm for AMD) and fixed some memory errors that destabilized training and inference. This is all part of our CubeCL backend in Burn, where all kernels are written purely in Rust.
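To make this concrete, here is a minimal sketch (not taken from the release notes) of running a matrix multiplication through the WGPU backend, which targets Metal on macOS. The exact type aliases and feature flags are assumptions and may differ between Burn versions.

```rust
// Minimal sketch: a matmul on the WGPU backend (Metal on macOS).
// Type aliases and defaults are assumptions about the current API surface.
use burn::backend::wgpu::{Wgpu, WgpuDevice};
use burn::tensor::{Distribution, Tensor};

fn main() {
    // On Apple Silicon this device typically resolves to a Metal adapter.
    let device = WgpuDevice::default();

    let a = Tensor::<Wgpu, 2>::random([2048, 2048], Distribution::Default, &device);
    let b = Tensor::<Wgpu, 2>::random([2048, 2048], Distribution::Default, &device);

    // Dispatches to the CubeCL matrix-multiplication kernels described above.
    let c = a.matmul(b);
    println!("{:?}", c.dims());
}
```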

A lot of effort went into our main compute-bound operations, namely matrix multiplication and convolution. Matrix multiplication has been heavily refactored, with an improved double-buffering algorithm that boosts performance across a variety of matrix shapes. We also added support for NVIDIA’s Tensor Memory Accelerator (TMA) on their latest GPU lineup, fully integrated into our matrix multiplication system. Since that system is very flexible, it is also reused in our convolution implementations, which likewise saw impressive speedups since the last version of Burn.

All of these optimizations are available for every backend built on top of CubeCL. Here’s a summary of the platforms and precisions supported:

Supported precisions: f16, bf16, flex32, tf32, f32, and f64 across the CUDA, ROCm, Metal, Wgpu, and Vulkan backends (exact coverage varies by backend).
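As a hedged illustration, precision is typically chosen through the backend’s generic float parameter. The `Wgpu<f16, i32>` alias and the `burn::tensor::f16` re-export below are assumptions about the current API and may not match every backend or version.

```rust
// Sketch: selecting half precision via the backend's generic float type.
// The generics of the `Wgpu` alias are an assumption and may vary by version.
use burn::backend::wgpu::{Wgpu, WgpuDevice};
use burn::tensor::{f16, Distribution, Tensor};

// Backend parameterized over f16 instead of the default f32.
type HalfBackend = Wgpu<f16, i32>;

fn main() {
    let device = WgpuDevice::default();
    let x = Tensor::<HalfBackend, 2>::random([1024, 1024], Distribution::Default, &device);
    let y = x.clone().matmul(x);
    println!("{:?}", y.dims());
}
```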

Fusion

In addition, we spent a lot of time optimizing Burn’s tensor-operation fusion compiler, which fuses memory-bound operations into compute-bound kernels. This release increases the number of fusable memory-bound operations and, more importantly, handles mixed vectorization factors, broadcasting, indexing operations, and more. Here’s a table of all memory-bound operations that can be fused:

Version      | Tensor Operations
Since v0.16  | Add, Sub, Mul, Div, Powf, Abs, Exp, Log, Log1p, Cos, Sin, Tanh, Erf, Recip, Assign, Equal, Lower, Greater, LowerEqual, GreaterEqual, ConditionalAssign
New in v0.17 | Gather, Select, Reshape, SwapDims
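For intuition, here is a rough sketch of the kind of element-wise chain the fusion compiler can trace and emit as a single kernel. It assumes the backend is built with its fusion feature enabled; feature names and defaults may differ between versions.

```rust
// Sketch: a chain of memory-bound ops that fusion can collapse into one kernel
// (assuming the backend's fusion feature is enabled).
use burn::backend::wgpu::{Wgpu, WgpuDevice};
use burn::tensor::{Distribution, Tensor};

fn main() {
    let device = WgpuDevice::default();
    let x = Tensor::<Wgpu, 2>::random([4096, 4096], Distribution::Default, &device);

    // Each operation below is memory-bound on its own; fused, they become a
    // single kernel instead of four separate global-memory round trips.
    let y = x.mul_scalar(2.0).add_scalar(1.0).exp().tanh();
    println!("{:?}", y.dims());
}
```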

Right now we have three classes of fusion optimizations:

  • Matrix-multiplication
  • Reduction kernels (Sum, Mean, Prod, Max, Min, ArgMax, ArgMin)
  • No-op, where we fuse a series of memory-bound operations together that aren’t tied to any compute-bound kernel
Each class is evaluated on two axes: fuse-on-read (operations fused into the kernel’s input reads) and fuse-on-write (operations fused into its output writes).
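As a sketch of the reduction class, element-wise operations can fuse on read into the reduction itself, so intermediates never need to be materialized in global memory. As above, the types and fusion behaviour shown are assumptions, not an excerpt from Burn’s docs.

```rust
// Sketch: Sub and Powf fusing on read into a Mean reduction, so the squared
// difference never hits global memory (assuming fusion is enabled).
use burn::backend::wgpu::{Wgpu, WgpuDevice};
use burn::tensor::{Distribution, Tensor};

fn main() {
    let device = WgpuDevice::default();
    let a = Tensor::<Wgpu, 2>::random([8, 4096], Distribution::Default, &device);
    let b = Tensor::<Wgpu, 2>::random([8, 4096], Distribution::Default, &device);

    // Per-row mean squared error: the element-wise ops fuse into the reduction.
    let mse_per_row = (a - b).powf_scalar(2.0).mean_dim(1);
    println!("{:?}", mse_per_row.dims());
}
```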

We plan to make more compute-bound kernels fusable, including convolutions, and add even more comprehensive broadcasting support, such as fusing a series of broadcasted reductions into a single kernel.

Benchmarks

The benchmarks speak for themselves. Here are results for standard models using f32 precision with the CUDA backend, measured on an NVIDIA GeForce RTX 3070 Laptop GPU. Similar speedups are expected across all of the backends mentioned above.

Version | Benchmark                   | Median time | Fusion speedup | Version improvement
--------|-----------------------------|-------------|----------------|--------------------
0.17.0  | ResNet-50 inference (fused) | 6.318ms     | 27.37%         | 4.43x
0.17.0  | ResNet-50 inference         | 8.047ms     | -              | 3.48x
0.16.1  | ResNet-50 inference (fused) | 27.969ms    | 3.58%          | 1x (baseline)
0.16.1  | ResNet-50 inference         | 28.970ms    | -              | 0.97x
0.17.0  | RoBERTa inference (fused)   | 19.192ms    | 20.28%         | 1.26x
0.17.0  | RoBERTa inference           | 23.085ms    | -              | 1.05x
0.16.1  | RoBERTa inference (fused)   | 24.184ms    | 13.10%         | 1x (baseline)
0.16.1  | RoBERTa inference           | 27.351ms    | -              | 0.88x
0.17.0  | RoBERTa training (fused)    | 89.280ms    | 27.18%         | 4.86x
0.17.0  | RoBERTa training            | 113.545ms   | -              | 3.82x
0.16.1  | RoBERTa training (fused)    | 433.695ms   | 3.67%          | 1x (baseline)
0.16.1  | RoBERTa training            | 449.594ms   | -              | 0.96x
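This is not the harness behind the numbers above, but as a rough sketch, a hand-rolled median-time measurement could look like the code below. The workload closure is a stand-in (any `model.forward` call is hypothetical), and on a GPU backend a device synchronization would be needed before stopping the timer.

```rust
// Hypothetical median-time benchmark sketch; not Burn's own benchmarking tool.
use std::time::Instant;

fn median_ms(mut samples: Vec<f64>) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    samples[samples.len() / 2]
}

/// Runs `work` `iters` times and returns the median wall-clock time in ms.
fn benchmark(mut work: impl FnMut(), iters: usize) -> f64 {
    let mut timings = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        // For a GPU backend, the closure should also force a device sync
        // (e.g. by reading the output back) so the timing is meaningful.
        work();
        timings.push(start.elapsed().as_secs_f64() * 1e3);
    }
    median_ms(timings)
}

fn main() {
    // Stand-in CPU workload; replace with e.g. a model forward pass.
    let median = benchmark(
        || {
            std::hint::black_box((0..1_000_000u64).sum::<u64>());
        },
        10,
    );
    println!("median: {median:.3} ms");
}
```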

Another advantage of carrying optimizations across runtimes: our optimized WGPU memory management appears to have a big impact on Metal. For long-running training, our Metal backend executes 4 to 5 times faster than LibTorch. If you’re on Apple Silicon, try training a transformer model with LibTorch on GPU and then with our Metal backend.

Full Release Notes: