Diffusion models quantization benchmark

Diffusion models have become the de facto standard in image generation. While remarkably powerful, they rely on a multi-step process of iterative denoising, which takes time. Text generation is also sequential, but text can be streamed and displayed token by token, giving a natural, real-time feel similar to how humans read or write. With images, we must wait for the final denoising step to complete before a usable result appears. This makes optimizing diffusion models for speed and efficiency a critical challenge.

Diffusion models can be broadly categorized by their underlying architecture:

Pure convolutional diffusion models
Hybrid (convolution + transformer) diffusion models
Fully transformer-based diffusion models

This architectural distinction has a direct impact on which optimization strategies are most effective. Transformer-based diffusion models benefit greatly from quantization - especially fp8 and nf4 on newer NVIDIA GPUs with native hardware acceleration. Convolutional models, by contrast, respond better to pruning and typically demand less VRAM. Regardless of architecture, reducing the number of denoising steps remains a universal approach for faster generation.

This article presents a benchmark for comparing state-of-the-art text-to-image diffusion models across four dimensions: generation speed, image quality, VRAM consumption, and quantization efficiency.

Models compared

For the purpose of this benchmark, multiple models were selected to span the full design space of modern text-to-image diffusion models: from the lightweight and widely-deployed SD 1.5, through the SDXL family and its distilled variants (Turbo, Lightning, Hyper, LCM), to the latest transformer-based Flux.1 architecture. Step counts range from 1 to 30, covering both standard and accelerated inference regimes.

Model	Steps	Architecture	FP16 VRAM
SD 1.5	30	UNet	~4 GB
SDXL	30	UNet	~11 GB
SDXL Turbo	1-4	UNet (ADD)	~11 GB
SDXL Lightning (2-step)	2	UNet LoRA	~11 GB
SDXL Lightning (4-step)	4	UNet LoRA	~11 GB
Hyper-SDXL (4-step)	4	UNet LoRA	~11 GB
LCM-SDXL	4	UNet	~11 GB
Flux.1 Schnell	4	DiT (12B)	~24 GB
Flux.1 Dev	28	DiT (12B)	~24 GB

Table 1. Datasheet of benchmarked models.

Comparison dimensions

Speed - total generation time, time per denoising step, step count
Quality - CLIP score (text alignment), Laplacian sharpness, colorfulness (Hasler & Süsstrunk), Shannon entropy
VRAM - model weight footprint and peak generation VRAM
Quantization - FP16 / BF16 / INT8 / NF4 / INT4 / INT4-GEMM / FP8-W / FP8 / FP8_static (via bitsandbytes and torchao)
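
The article does not list the exact metric implementations, so the following is only a minimal sketch of three of the quality metrics under their common definitions: colorfulness as defined by Hasler & Süsstrunk, Shannon entropy of the gray-level histogram, and sharpness as the variance of a Laplacian response. The function names and the plain-list image representation are assumptions made for illustration; a real pipeline would more likely rely on NumPy or OpenCV.

```python
import math
from statistics import mean, pstdev


def colorfulness(pixels):
    """Hasler & Süsstrunk colorfulness for a flat list of (R, G, B) tuples."""
    rg = [r - g for r, g, b in pixels]              # red-green opponent channel
    yb = [0.5 * (r + g) - b for r, g, b in pixels]  # yellow-blue opponent channel
    std_root = math.hypot(pstdev(rg), pstdev(yb))
    mean_root = math.hypot(mean(rg), mean(yb))
    return std_root + 0.3 * mean_root


def shannon_entropy(gray_values, levels=256):
    """Entropy (in bits) of the histogram of integer gray levels."""
    counts = [0] * levels
    for v in gray_values:
        counts[v] += 1
    total = len(gray_values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def laplacian_sharpness(gray):
    """Variance of a 4-neighbour Laplacian over the interior of a 2D gray image."""
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pstdev(responses) ** 2


# Toy inputs: a flat gray patch versus a checkerboard with hard edges.
flat = [[5] * 4 for _ in range(4)]
checker = [[255 if (x + y) % 2 else 0 for x in range(4)] for y in range(4)]
print(laplacian_sharpness(flat), laplacian_sharpness(checker))
```

Higher Laplacian variance means more high-frequency detail, which is why a collapsed, blurry output scores far lower on sharpness than a healthy one.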

Quantization scope: weights-only vs weights + compute

Not all quantization formats speed up inference. The key distinction is whether the GPU executes matrix multiplications in the lower-precision format, or only stores the weights compactly and dequantizes them back to fp16/bf16 before each operation. The weights-only approach reduces VRAM pressure, allowing inference on devices with less VRAM or a larger batch size (higher throughput) on the same device. Weights + compute offers both faster inference and a larger batch size, but requires newer GPU architectures.
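
To make the weights-only path concrete, here is a minimal pure-Python sketch, assuming symmetric per-tensor INT8 quantization. It is not the bitsandbytes or torchao implementation, and all function names are hypothetical. The point is that the weights are stored as small integers plus one scale, but every call first expands them back to floats, so the multiply itself is no cheaper.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: returns integer codes and a single scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Expand integer codes back to floating-point weights."""
    return [c * scale for c in codes]


def linear_weight_only(x, codes, scale):
    """Weights-only path: dequantize first, then multiply in full precision."""
    w = dequantize(codes, scale)  # extra work paid on every call
    return sum(xi * wi for xi, wi in zip(x, w))


weights = [0.5, -1.0, 0.25, 0.003]
x = [1.0, 2.0, 3.0, 4.0]
codes, scale = quantize_int8(weights)
exact = sum(xi * wi for xi, wi in zip(x, weights))
approx = linear_weight_only(x, codes, scale)
print(codes, exact, approx)
```

Each code fits in one byte instead of the two bytes of an fp16 weight, which is where the VRAM saving comes from.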

Level	Weights	Compute (GEMM)	Quantized GEMM	VRAM vs FP16	Speed vs FP16	GPU requirement
fp16 / bf16	fp16 / bf16	fp16 / bf16	No	1×	baseline	Any CUDA GPU
int8	INT8	dequantized → fp16	No	~0.55×	slower	Any CUDA GPU
nf4	NF4 4-bit	dequantized → fp16	No	~0.30×	slower	Any CUDA GPU
int4	NF4 4-bit + double-quant	dequantized → bf16	No	~0.27×	slower	Any CUDA GPU
int4_gemm	INT4 symmetric	INT4 tensor cores	Yes	~0.27×	faster	Ampere+ (RTX 3090+)
fp8_w	FP8	dequantized → bf16	No	~0.50×	same	Any CUDA GPU
fp8	FP8	FP8 tensor cores	Yes	~0.50×	faster	Ada/Blackwell (RTX 4090+)
fp8_static	FP8	FP8 tensor cores	Yes	~0.50×	faster	Ada/Blackwell (RTX 4090+)

Table 2. Comparison of quantisation modes and supported hardware.

When quantization speeds things up

By leveraging native Tensor Core instructions, int4_gemm, fp8, and fp8_static perform matrix multiplication directly on quantized values and fuse the dequantization into the multiply, eliminating the separate dequantization overhead (weights are still dequantized before activations in the model). int4_gemm uses symmetric INT4 and runs on any Ampere or newer GPU. fp8 uses dynamic FP8 activation quantization (activation scales are recomputed on every forward pass), while fp8_static uses pre-calibrated, frozen scales (no per-step overhead). Both require Ada Lovelace or Blackwell (RTX 4090+).

`int8` vs `fp8_w` - both 8-bit weight-only, different number formats

Both store weights in 8 bits and dequantize before compute, so neither speeds up inference. The difference is what those 8 bits represent:

int8 stores weights as 8-bit integers: a uniform linear grid from -128 to 127. Each block of weights gets one scale factor. The grid spacing is constant, which is a poor fit for neural network weights that cluster near zero with rare large outliers. Before compute, the weights are dequantized to fp16.
fp8_w stores weights as 8-bit floating point (E4M3: 4 exponent bits, 3 mantissa bits). Because it has an exponent, the grid is non-uniform - denser near zero, sparser at large magnitudes. This matches the distribution of trained weights much better. Before compute, the weights are dequantized to bf16.

fp8_w should give marginally better quality than int8 at the same VRAM tier, for the same reason NF4 outperforms uniform int4: a non-uniform grid that is denser near zero matches the bell-curve distribution of neural network weights.
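
The difference between the two grids can be checked numerically. The sketch below enumerates the E4M3 values (assuming the common variant whose largest finite value is 448) and compares round-trip errors against a uniform INT8 grid for one small and one large weight. It assumes a single per-tensor scale set by an outlier of magnitude 1.0; the helper names are hypothetical, and this emulation is not how any particular library stores FP8.

```python
def e4m3_values():
    """All non-negative finite E4M3 values (variant with a maximum of 448)."""
    vals = set()
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # this bit pattern encodes NaN
            if exp == 0:
                vals.add(man / 8 * 2.0 ** -6)  # subnormal range
            else:
                vals.add((1 + man / 8) * 2.0 ** (exp - 7))
    return sorted(vals)


E4M3 = e4m3_values()


def roundtrip_fp8(w, amax):
    """Scale so that amax maps to 448, snap to the nearest E4M3 value, scale back."""
    scale = 448.0 / amax
    target = abs(w) * scale
    nearest = min(E4M3, key=lambda v: abs(v - target))
    return (nearest if w >= 0 else -nearest) / scale


def roundtrip_int8(w, amax):
    """Uniform grid with constant spacing amax / 127."""
    scale = amax / 127
    return round(w / scale) * scale


amax = 1.0  # assumed outlier that fixes the representable range
for w in (0.01, 0.9):
    err_int8 = abs(roundtrip_int8(w, amax) - w)
    err_fp8 = abs(roundtrip_fp8(w, amax) - w)
    print(f"w={w}: int8 error={err_int8:.6f}, fp8 error={err_fp8:.6f}")
```

Under these assumptions the small weight is reproduced more accurately by E4M3, while the large one is reproduced more accurately by INT8, which is the "denser near zero, sparser at large magnitudes" trade-off in action.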

Results

All benchmarks were run on an RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7). This matters for interpreting the quantization results: fp8 and fp8_static speedups require Ada Lovelace or newer (RTX 4090+), and int4_gemm requires Ampere or newer (RTX 3090+). Results for weights-only formats - nf4, int4, int8, fp8_w - generalize to any CUDA GPU, since they affect VRAM and quality but not compute throughput.

Model	Quant	Avg Time (s)	s/step	Steps	VRAM peak (GB)	VRAM model (GB)	Sharpness	Colorfulness	CLIP score
SDXL Lightning (2-step)	fp16	0.35	0.1756	2	11.5	7.3	507.1	52.5	0.2729
SDXL Lightning (4-step)	fp16	0.47	0.1167	4	11.5	7.3	333.8	54.9	0.2799
SD 1.5	fp16	0.87	0.0291	30	3.4	2.7	1460.9	75.1	0.2704
Hyper-SDXL (4-step)	fp16	0.48	0.1191	4	11.5	7.3	1046.4	61.5	0.2692
SDXL Turbo	fp16	0.22	0.0547	4	8.1	6.9	588.3	58.9	0.2827
LCM-SDXL	fp16	0.49	0.1233	4	11.1	6.9	669.1	48.3	0.2850
LCM-SDXL	fp8	1.15	0.2879	4	7.3	4.7	579.3	48.5	0.2854
LCM-SDXL	fp8_w	0.58	0.1446	4	7.3	4.7	590.4	48.3	0.2855
LCM-SDXL	fp8_static	0.73	0.1826	4	7.3	4.7	566.1	48.6	-
LCM-SDXL	int8	1.09	0.2724	4	8.9	4.7	74.5	27.0	0.1348
LCM-SDXL	int4_gemm	1.00	0.2501	4	11.4	7.3	817.2	47.3	0.2869
LCM-SDXL	nf4	0.50	0.1239	4	7.9	3.7	548.8	46.7	0.2844
LCM-SDXL	int4	0.62	0.1560	4	7.8	3.6	545.1	46.8	0.2842
SDXL	fp16	3.70	0.1235	30	11.1	6.9	431.4	53.7	0.2924
SDXL	fp8_w	4.39	0.1464	30	7.3	4.7	404.1	53.4	0.2920
SDXL	fp8_static	4.86	0.1620	30	7.3	4.7	446.2	53.9	-
SDXL	int8	8.21	0.2738	30	8.9	4.7	448.2	53.5	0.2909
SDXL	fp8	8.62	0.2873	30	7.3	4.7	425.5	53.6	0.2915
SDXL	int4_gemm	13.49	0.4497	30	11.4	7.3	611.2	54.2	0.2940
SDXL	nf4	3.91	0.1304	30	7.9	3.7	456.7	51.9	0.2901
SDXL	int4	4.79	0.1596	30	7.8	3.6	462.8	51.9	0.2888
Flux.1 Schnell	fp16	1.51	0.3763	4	38.3	35.7	942.0	68.0	0.2807
Flux.1 Schnell	fp8_static	0.81	0.2021	4	24.4	21.8	959.9	69.2	-
Flux.1 Schnell	fp8	1.54	0.3858	4	24.4	21.8	937.8	69.0	0.2776
Flux.1 Schnell	int8	1.67	0.4172	4	26.4	23.9	941.5	68.0	0.2805
Flux.1 Schnell	fp8_w	2.05	0.5127	4	24.4	21.8	949.5	69.2	0.2778
Flux.1 Schnell	int4_gemm	8.74	2.1851	4	25.1	22.6	737.5	66.4	0.2763
Flux.1 Schnell	int4	1.68	0.4211	4	20.7	18.1	1048.1	68.3	0.2768
Flux.1 Schnell	nf4	1.53	0.3835	4	21.2	18.6	1052.0	69.1	0.2793
Flux.1 Dev	fp16	9.45	0.3375	28	38.3	35.8	943.5	61.4	0.2740
Flux.1 Dev	fp8_static	4.75	0.1698	28	24.4	21.8	860.0	63.6	-
Flux.1 Dev	int8	10.62	0.3794	28	26.4	23.9	860.2	60.9	0.2731
Flux.1 Dev	fp8	9.68	0.3456	28	24.4	21.8	948.0	61.5	0.2697
Flux.1 Dev	fp8_w	13.35	0.4767	28	24.4	21.9	941.9	60.6	0.2703
Flux.1 Dev	int4	10.41	0.3717	28	20.7	18.1	893.8	62.4	0.2730
Flux.1 Dev	nf4	9.75	0.3482	28	21.2	18.7	903.3	62.1	0.2735
Flux.1 Dev	int4_gemm	32.93	1.1759	28	25.2	22.6	470.8	62.5	0.2676

Table 3. Results obtained for RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7)

Python 3.11 · PyTorch 2.11 · diffusers 0.38 · 13 prompts · 3 seeds each (SD 1.5, SDXL Turbo, SDXL, LCM-SDXL, Flux.1 Schnell, Flux.1 Dev, SDXL Lightning 2-step & 4-step, Hyper-SDXL 4-step).
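
As a reading aid for Table 3, the short script below computes the speedup relative to fp16 for a few rows. The timings are copied from the table; the dictionary layout and the function name are only illustrative.

```python
# Average generation time in seconds, copied from Table 3.
avg_time = {
    ("Flux.1 Dev", "fp16"): 9.45,
    ("Flux.1 Dev", "fp8"): 9.68,
    ("Flux.1 Dev", "fp8_static"): 4.75,
    ("SDXL", "fp16"): 3.70,
    ("SDXL", "fp8"): 8.62,
    ("SDXL", "fp8_static"): 4.86,
}


def speedup_vs_fp16(model, quant):
    """Values above 1 mean faster than fp16, values below 1 mean slower."""
    return avg_time[(model, "fp16")] / avg_time[(model, quant)]


for model, quant in avg_time:
    if quant != "fp16":
        print(f"{model:<10} {quant:<10} {speedup_vs_fp16(model, quant):.2f}x")
```

On these numbers fp8_static roughly halves generation time for Flux.1 Dev, dynamic fp8 is close to breakeven, and both fp8 variants are slower than fp16 on SDXL.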

Why `fp8` and `fp8_w` are slower on UNet models (Stable Diffusion family) and similarly fast to `fp16` on DiT models (Flux)

FP8 tensor cores only accelerate matrix multiplies (GEMM). The two architectures have fundamentally different operation profiles:

UNet (SD family): the backbone is dominated by Conv2d layers, not linear projections. fp8 quantizes the linear layers inside attention blocks, but the Conv2d operations, which do the majority of the compute, are untouched and stay in bf16. Additionally, Float8DynamicActivationFloat8WeightConfig, used for quantization, measures and scales activations dynamically on every forward pass, adding overhead to every quantized layer, while the Conv2d layers gain nothing.
DiT (Flux): a pure transformer with no convolutions - nearly every operation is attention or a linear projection. FP8 quantizes essentially 100% of the compute, and FP8 tensor cores accelerate it all. FP8 tensor cores do run at ~2× fp16 throughput on the GEMM itself, but the dynamic activation quantization overhead consumes most of that gain. Net result: roughly breakeven, with overhead tipping the balance slightly negative (~2.5% slower).

Float8DynamicActivationFloat8WeightConfig recomputes activation scales on every forward pass. For each linear layer it:

1. Scans activations to find amax(|x|) - one extra read over the activation tensor
2. Multiplies activations by the derived scale - one extra write
3. Runs the FP8 GEMM
4. Dequantizes the output back to bf16

Steps 1-2 are pure overhead that fp16 does not pay. The 2× GEMM speedup must first absorb this cost before any net gain appears.
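
The four steps can be emulated in plain Python for a single output value of a linear layer. This is only a sketch under stated assumptions: FP8 is simulated by rounding to the nearest E4M3 value (maximum 448), one scale is used per tensor, and the helper names are hypothetical. It does not reproduce torchao's kernels.

```python
import math

FP8_MAX = 448.0  # largest finite E4M3 value


def round_to_e4m3(x):
    """Emulate FP8 E4M3 rounding (3 mantissa bits), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_MAX)
    exp = max(math.frexp(mag)[1] - 1, -6)  # exponents below -6 are subnormal
    step = 2.0 ** (exp - 3)                # 8 grid points per power of two
    return sign * min(round(mag / step) * step, FP8_MAX)


def quantize_weights(w):
    """Done once, offline: weights get a fixed scale."""
    w_scale = FP8_MAX / max(abs(v) for v in w)
    return [round_to_e4m3(v * w_scale) for v in w], w_scale


def fp8_dynamic_linear(x, w_fp8, w_scale):
    # Step 1: scan the activations for amax(|x|) (extra read).
    amax = max(abs(v) for v in x) or 1.0
    # Step 2: scale the activations into FP8 range (extra write).
    x_scale = FP8_MAX / amax
    x_fp8 = [round_to_e4m3(v * x_scale) for v in x]
    # Step 3: the FP8 GEMM, here a dot product with high-precision accumulation.
    acc = sum(a * b for a, b in zip(x_fp8, w_fp8))
    # Step 4: dequantize the output.
    return acc / (x_scale * w_scale)


x = [0.5, -2.0, 1.0]
w = [0.1, 0.2, -0.3]
w_fp8, w_scale = quantize_weights(w)
exact = sum(a * b for a, b in zip(x, w))
print(exact, fp8_dynamic_linear(x, w_fp8, w_scale))
```

Steps 1 and 2 touch the whole activation tensor before any multiply happens, which is the per-layer cost that the fp16 path never pays.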

What actually unlocks fp8 speedup for transformer-based models: `fp8_static`

The dynamic activation quantization overhead is the key problem: in regular fp8 quantization, the activation scale factors are recomputed for each layer on every inference run. This extra work stalls compute-bound hardware, where data arrives fast enough but the arithmetic units are already saturated. fp8_static solves this with an offline calibration step: it computes the scale factors during a calibration run and reuses them at runtime. Because FP8 has a limited dynamic range, precision-sensitive operations such as LayerNorm, softmax, and residual additions must stay in bf16 to prevent numerical instability. Only the linear projections run in fp8.

	fp8 (dynamic)	fp8_static
Activation scales	Recomputed every step	Calibrated once at load time
Per-step overhead	amax scan + scale multiply per layer	None
Accuracy	Exact dynamic range	Slight approximation (frozen scales)
Best for	Memory-bandwidth-bound GPUs	Compute-bound GPUs (RTX PRO 6000, H100)

Table 4. fp8 vs fp8_static: when to use which.
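
The calibration idea behind fp8_static can be sketched in a few lines. The example assumes a single per-tensor activation scale derived from the largest magnitude seen in a handful of calibration batches; the function names are hypothetical and the rounding to FP8 itself is omitted for brevity.

```python
FP8_MAX = 448.0  # largest finite E4M3 value


def calibrate(batches):
    """Offline: one pass over calibration activations to fix the scale."""
    amax = max(abs(v) for batch in batches for v in batch)
    return FP8_MAX / amax


def apply_static(x, scale):
    """Runtime: no amax scan, just multiply by the frozen scale and saturate."""
    return [max(-FP8_MAX, min(FP8_MAX, v * scale)) for v in x]


calibration_batches = [[0.5, -1.0], [2.0, 0.1]]
frozen_scale = calibrate(calibration_batches)

# A runtime activation larger than anything seen in calibration is clipped.
print(apply_static([1.0, 4.0], frozen_scale))
```

The clipping of out-of-range values is the "slight approximation" listed in Table 4: frozen scales are only as good as the calibration data that produced them.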

Speeding up conv-based UNets beyond quantization

The results in Table 3 show that none of the evaluated quantization formats speeds up the convolution-based diffusion models. Quantization targets nn.Linear layers: the attention projections and feed-forward blocks. In SD-family UNets, the dominant compute is in ResNet blocks (Conv2d + GroupNorm + SiLU), which are untouched by every quantization format evaluated here. Other techniques are therefore needed to boost inference speed for SD models.

torch.compile. It generates optimized CUDA kernels that fuse Conv2d + GroupNorm + SiLU into a single kernel dispatch, eliminating Python-level overhead between operations. On SD, this typically yields a 20-40% wall-clock reduction at no quality or VRAM cost.
Channels-last memory format. PyTorch stores tensors in NCHW format by default. cuDNN's Conv2d kernels are faster in NHWC (channels-last) because GPU memory access patterns align better with the convolution sliding window.
TensorRT with INT8 calibration. TensorRT's PTQ calibration can quantize Conv2d layers to INT8, unlike torchao, which is limited to nn.Linear. This is the only quantization approach that delivers real throughput gains on conv-dominated models. The trade-off is setup complexity: ONNX export, calibration passes, and GPU-architecture-specific engine compilation that cannot be reused across GPU generations.
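
To illustrate what the channels-last layout changes, the sketch below computes flat memory offsets for the same logical element under NCHW and NHWC. It is a plain-Python illustration with hypothetical helper names; in PyTorch the layout itself is requested with `model.to(memory_format=torch.channels_last)`, and `torch.compile(model)` can be applied on top of it.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) when channels are the outer dimension."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat offset of the same element when channels are the innermost dimension."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 4, 8, 8

# Distance in memory between two adjacent channels of the same pixel.
stride_c_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
stride_c_nhwc = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)

print(stride_c_nchw, stride_c_nhwc)
```

In NCHW the channels of one pixel are a full image plane apart, while in NHWC they sit next to each other, so a convolution that sums over channels reads contiguous memory.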

Visualisations

The following samples illustrate the results obtained for different quantisation levels and strategies.

[Image grid. Columns: SDXL, LCM-SDXL, Flux.1 Schnell. Rows: fp16, fp8_w, fp8, nf4, int8, int4.]

Table 5. Close-up portrait of a woman. Comparison for different diffusion model types, quantisation levels and diffusion steps.

[Image grid. Columns: SDXL Turbo, SD 1.5, SDXL Lightning (2-step), Hyper-SDXL (4-step). Row: fp16.]

Table 6. Close-up portrait of a woman. Comparison of different convolution-based diffusion models (lighter or optimised versions of Stable Diffusion). The quality of these models lags behind SDXL, LCM-SDXL and Flux.1 Schnell.

Conclusions

The most universal and effective optimization is reducing the number of denoising iterations. Some models are designed to produce high-quality results in as few as 4 steps, making this the single biggest win available regardless of architecture.

Quantization consistently reduces VRAM footprint across all model types, enabling deployment on resource-constrained devices or improving throughput at scale. Its effect on speed, however, is architecture-dependent: for transformer-based models, it accelerates inference only on newer GPU architectures with native hardware support, while for convolutional models, it can actually add processing overhead.

For convolutional diffusion models specifically, the most effective optimization strategies are kernel fusion, the channels-last memory format, pruning, and TensorRT INT8 calibration.

Reviewed by: Michał Zaręba
