This is an automated archive made by the Lemmit Bot.
The original was posted on /r/stablediffusion by /u/woct0rdho on 2025-10-15 09:05:21+00:00.
I’ve merged the patch that lets torch.compile work with fp8 on Ampere GPUs; let’s see how it rolls out: https://github.com/woct0rdho/triton-windows/pull/140
I hoped this would be superseded by GGUF + better torch.compile, or by Nunchaku. But as of PyTorch 2.9, on my machine, fp8 + the block swap in ComfyUI-WanVideoWrapper (or ComfyUI-wanBlockswap for native workflows) runs faster and causes fewer recompilations than GGUF + the block swap in ComfyUI-GGUF.
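For readers unfamiliar with block swap: the idea is to keep most transformer blocks in CPU RAM and move each one onto the GPU only while it runs, trading transfer time for VRAM. This is a hypothetical, simplified sketch of that pattern (the `BlockSwapRunner` name and structure are my own illustration, not the actual ComfyUI-WanVideoWrapper code), runnable on CPU as written:

```python
import torch
import torch.nn as nn

class BlockSwapRunner:
    """Illustrative block swap: blocks live on CPU and are moved to the
    compute device one at a time during the forward pass."""

    def __init__(self, blocks: nn.ModuleList, device: str = "cpu"):
        self.blocks = blocks    # resident in CPU RAM
        self.device = device    # "cuda" on a real GPU machine

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(self.device)
        for block in self.blocks:
            block.to(self.device)   # swap in
            x = block(x)
            block.to("cpu")         # swap out, freeing device memory
        return x

blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
runner = BlockSwapRunner(blocks)
out = runner.forward(torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 16])
```

The per-block `.to()` calls are what interact badly with torch.compile guards in some setups, which is why the number of recompilations matters in the comparison above.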
This is the first feature in the ‘core’ part (rather than the Windows support code) that’s deliberately different from the official Triton. It should also work on Linux, but I’m not sure of the best way to publish Linux wheels.
I’m not an expert on PTX, so help with optimizing that PTX code is welcome.
triton-windows 3.2.0.post21 has also been released, which supports fp8 on RTX 20xx.