Diffusion models have become the de facto standard in image generation. While remarkably powerful, they rely on a multi-step process of iterative denoising, which takes time. Text generation is also sequential, but text can be streamed and displayed token by token, which gives a natural, real-time feel similar to how humans read or write. With images, we must wait for the final denoising step to complete before a usable result appears. This makes optimizing diffusion models for speed and efficiency a critical challenge.
Diffusion models can be broadly categorized by their underlying architecture:
- Pure convolutional diffusion models
- Hybrid (convolution + transformer) diffusion models
- Fully transformer-based diffusion models
This architectural distinction has a direct impact on which optimization strategies are most effective. Transformer-based diffusion models benefit greatly from quantization, especially fp8 and nf4 on newer NVIDIA GPUs with native hardware acceleration. Convolutional models, by contrast, respond better to pruning and typically demand less VRAM. Regardless of architecture, reducing the number of denoising steps remains a universal approach for faster generation.
This article presents a benchmark for comparing state-of-the-art text-to-image diffusion models across four dimensions: generation speed, image quality, VRAM consumption, and quantization efficiency.
Models compared
For the purpose of this benchmark, multiple models were selected to span the full design space of modern text-to-image diffusion models: from the lightweight and widely-deployed SD 1.5, through the SDXL family and its distilled variants (Turbo, Lightning, Hyper, LCM), to the latest transformer-based Flux.1 architecture. Step counts range from 1 to 30, covering both standard and accelerated inference regimes.
| Model | Steps | Architecture | FP16 VRAM |
|---|---|---|---|
| SD 1.5 | 30 | UNet | ~4 GB |
| SDXL | 30 | UNet | ~11 GB |
| SDXL Turbo | 1-4 | UNet (ADD) | ~11 GB |
| SDXL Lightning (2-step) | 2 | UNet LoRA | ~11 GB |
| SDXL Lightning (4-step) | 4 | UNet LoRA | ~11 GB |
| Hyper-SDXL (4-step) | 4 | UNet LoRA | ~11 GB |
| LCM-SDXL | 4 | UNet | ~11 GB |
| Flux.1 Schnell | 4 | DiT (12B) | ~24 GB |
| Flux.1 Dev | 28 | DiT (12B) | ~24 GB |
Table 1. Datasheet of benchmarked models.
Comparison dimensions
- Speed - total generation time, time per denoising step, step count
- Quality - CLIP score (text alignment), Laplacian sharpness, colorfulness (Hasler & Süsstrunk), Shannon entropy
- VRAM - model weight footprint and peak generation VRAM
- Quantization - FP16 / BF16 / INT8 / NF4 / INT4 / INT4-GEMM / FP8-W / FP8 (bitsandbytes + torchao), FP8_static
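The three no-reference quality metrics above can be illustrated on a tiny image stored as nested lists. This is a minimal sketch, not the benchmark's own code: it assumes that Laplacian sharpness means the variance of a 4-neighbour Laplacian response and that entropy is taken over the 8-bit grayscale histogram. All function names are hypothetical, and CLIP score is left out because it needs a pretrained model.

```python
import math
from collections import Counter
from statistics import mean, pvariance


def to_gray(img):
    """RGB image (rows of (r, g, b) tuples, 0-255) -> rows of luma values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in img]


def laplacian_sharpness(gray):
    """Variance of the 4-neighbour Laplacian response over interior pixels."""
    h, w = len(gray), len(gray[0])
    response = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(response)


def colorfulness(img):
    """Hasler & Süsstrunk colorfulness from the opponent channels rg and yb."""
    pixels = [p for row in img for p in row]
    rg = [r - g for r, g, _ in pixels]
    yb = [0.5 * (r + g) - b for r, g, b in pixels]
    std_root = math.sqrt(pvariance(rg) + pvariance(yb))
    mean_root = math.sqrt(mean(rg) ** 2 + mean(yb) ** 2)
    return std_root + 0.3 * mean_root


def shannon_entropy(gray):
    """Entropy in bits of the grayscale histogram (values rounded to 0-255)."""
    levels = [min(255, max(0, round(v))) for row in gray for v in row]
    n = len(levels)
    return -sum(c / n * math.log2(c / n) for c in Counter(levels).values())


# Hypothetical 4x4 test image: a black/white checkerboard.
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
checker = [[WHITE if (x + y) % 2 else BLACK for x in range(4)] for y in range(4)]
gray = to_gray(checker)
print(laplacian_sharpness(gray), colorfulness(checker), shannon_entropy(gray))
```

A checkerboard is maximally sharp but colorless, so it scores high on sharpness and zero on colorfulness. A real evaluation would run the same formulas on full-resolution arrays.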
Quantization scope: weights-only vs weights + compute
Not all quantization formats speed up inference. The key distinction is whether the GPU executes matrix multiplications in the lower-precision format, or only stores weights compactly and dequantizes them back to fp16/bf16 before each operation. The weights-only approach reduces VRAM pressure, which allows inference on devices with less VRAM or a larger batch size (higher throughput) on the same device. Weights + compute quantization delivers both faster inference and a larger batch size, but requires newer GPU architectures.
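The difference between the two scopes can be sketched in plain Python for a single dot product. The per-tensor absmax int8 scheme and all function names below are assumptions chosen for brevity; real libraries use block-wise scales and fused GPU kernels.

```python
def quantize_int8(values):
    """Absmax-quantize a list of floats to int8 codes plus one scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def linear_weights_only(x, w_codes, w_scale):
    """Weights-only: expand the codes to floats first, then multiply in float."""
    w = [c * w_scale for c in w_codes]  # dequantization work on every call
    return sum(xi * wi for xi, wi in zip(x, w))


def linear_quantized_compute(x, w_codes, w_scale):
    """Weights + compute: multiply integer codes, rescale the result once."""
    x_codes, x_scale = quantize_int8(x)
    acc = sum(xc * wc for xc, wc in zip(x_codes, w_codes))  # integer accumulate
    return acc * x_scale * w_scale


weights = [0.5, -1.0, 0.25, 0.75]  # hypothetical layer weights
x = [1.0, 2.0, -1.0, 0.5]          # hypothetical activations
w_codes, w_scale = quantize_int8(weights)

exact = sum(xi * wi for xi, wi in zip(x, weights))
print(exact, linear_weights_only(x, w_codes, w_scale), linear_quantized_compute(x, w_codes, w_scale))
```

Both paths store the weights in 8 bits, so both save memory. Only the second keeps the multiply itself in the low-precision domain, which is the part that dedicated tensor-core instructions can accelerate.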
| Level | Weights | Compute (GEMM) | Quantized GEMM | VRAM vs FP16 | Speed vs FP16 | GPU requirement |
|---|---|---|---|---|---|---|
| fp16 / bf16 | fp16 / bf16 | fp16 / bf16 | No | 1× | baseline | Any CUDA GPU |
| int8 | INT8 | dequantized → fp16 | No | ~0.55× | slower | Any CUDA GPU |
| nf4 | NF4 4-bit | dequantized → fp16 | No | ~0.30× | slower | Any CUDA GPU |
| int4 | NF4 4-bit + double-quant | dequantized → bf16 | No | ~0.27× | slower | Any CUDA GPU |
| int4_gemm | INT4 symmetric | INT4 tensor cores | Yes | ~0.27× | faster | Ampere+ (RTX 3090+) |
| fp8_w | FP8 | dequantized → bf16 | No | ~0.50× | same | Any CUDA GPU |
| fp8 | FP8 | FP8 tensor cores | Yes | ~0.50× | faster | Ada/Blackwell (RTX 4090+) |
| fp8_static | FP8 | FP8 tensor cores | Yes | ~0.50× | faster | Ada/Blackwell (RTX 4090+) |
Table 2. Comparison of quantisation modes and supported hardware.
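As a back-of-the-envelope check, the ratios in Table 2 can be applied to a model's fp16 weight size. The sketch below assumes 2 bytes per parameter at fp16 and counts only the denoiser weights; the helper name is hypothetical.

```python
# Approximate weight-memory ratios relative to fp16, copied from Table 2.
VRAM_RATIO = {
    "fp16": 1.0,
    "int8": 0.55,
    "nf4": 0.30,
    "int4": 0.27,
    "int4_gemm": 0.27,
    "fp8_w": 0.50,
    "fp8": 0.50,
    "fp8_static": 0.50,
}


def weight_footprint_gb(n_params, level):
    """Estimated weight memory in GB (10^9 bytes) for a given quantization level."""
    fp16_bytes = n_params * 2  # fp16 stores 2 bytes per parameter
    return fp16_bytes * VRAM_RATIO[level] / 1e9


# A 12B-parameter transformer such as the Flux.1 DiT listed in Table 1.
for level in ("fp16", "fp8", "nf4"):
    print(level, round(weight_footprint_gb(12_000_000_000, level), 1), "GB")
```

The fp16 estimate reproduces the ~24 GB listed for Flux.1 in Table 1. Measured pipeline figures can be higher, presumably because text encoders, the VAE, and activations also occupy VRAM.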
When quantization speeds things up
By leveraging native Tensor Core instructions, int4_gemm, fp8, and fp8_static perform the matrix multiplication directly on quantized values and fold the rescaling into the multiply, which removes the separate dequantization pass that weights-only formats pay before every operation (the output is still converted back to bf16 before the rest of the model consumes it). int4_gemm uses symmetric INT4 and runs on any Ampere or newer GPU. fp8 uses dynamic FP8 activation quantization (activation scales are recomputed on every forward pass), while fp8_static uses pre-calibrated, frozen scales (no per-step overhead). Both require Ada Lovelace or Blackwell (RTX 4090+).
int8 vs fp8_w - both 8-bit weight-only, different number formats
Both store weights in 8 bits and dequantize before compute, so neither speeds up inference. The difference is what those 8 bits represent:
- int8 stores weights as 8-bit integers: a uniform linear grid from -128 to 127. Each block of weights gets one scale factor. The grid spacing is constant, which is a poor fit for neural network weights that cluster near zero with rare large outliers. Weights are dequantized to fp16 before compute.
- fp8_w stores weights as 8-bit floating point (E4M3: 4 exponent bits, 3 mantissa bits). Because the format has an exponent, the grid is non-uniform: denser near zero, sparser at large magnitudes. This matches the distribution of trained weights much better. Weights are dequantized to bf16 before compute.
fp8_w should give marginally better quality than int8 at the same VRAM tier, for the same reason NF4 outperforms uniform int4: a grid that is denser near zero fits the bell-curve distribution of neural network weights more closely.
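The contrast between the two grids can be made concrete by enumerating them. The sketch below assumes the common E4M3 variant without infinities (exponent bias 7, largest finite value 448) and compares it with a symmetric int8 grid stretched over the same range; the helper names are hypothetical.

```python
def e4m3_values():
    """All non-negative finite values representable in FP8 E4M3 (max 448)."""
    values = []
    for exp in range(16):          # 4 exponent bits
        for man in range(8):       # 3 mantissa bits
            if exp == 15 and man == 7:
                continue           # this bit pattern encodes NaN
            if exp == 0:
                values.append(man / 8 * 2.0 ** -6)               # subnormal
            else:
                values.append((1 + man / 8) * 2.0 ** (exp - 7))  # normal
    return sorted(values)


def int8_values(absmax):
    """Non-negative points of a symmetric int8 grid covering [0, absmax]."""
    step = absmax / 127
    return [i * step for i in range(128)]


fp8_grid = e4m3_values()
int8_grid = int8_values(absmax=fp8_grid[-1])

# How many grid points fall into the lowest 10% of the representable range?
cut = 0.1 * fp8_grid[-1]
fp8_near_zero = sum(v <= cut for v in fp8_grid)
int8_near_zero = sum(v <= cut for v in int8_grid)
print(f"fp8:  {fp8_near_zero} of {len(fp8_grid)} values below {cut:.1f}")
print(f"int8: {int8_near_zero} of {len(int8_grid)} values below {cut:.1f}")
print("smallest fp8 step:", fp8_grid[1], "largest fp8 step:", fp8_grid[-1] - fp8_grid[-2])
print("constant int8 step:", int8_grid[1])
```

On this grid, 100 of the 127 E4M3 values sit in the lowest 10% of the range, against 13 of 128 for int8, which is where most trained weights live once the tensor is scaled to its largest outlier.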
Results
All benchmarks were run on an RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7). This matters for interpreting the quantization results: fp8 and fp8_static speedups require Ada Lovelace or newer (RTX 4090+), and int4_gemm requires Ampere or newer (RTX 3090+). Results for weights-only formats - nf4, int4, int8, fp8_w - generalize to any CUDA GPU, since they affect VRAM and quality but not compute throughput.
| Model | Quant | Avg Time (s) | s/step | Steps | VRAM peak (GB) | VRAM model (GB) | Sharpness | Colorfulness | CLIP score |
|---|---|---|---|---|---|---|---|---|---|
| SDXL Lightning (2-step) | fp16 | 0.35 | 0.1756 | 2 | 11.5 | 7.3 | 507.1 | 52.5 | 0.2729 |
| SDXL Lightning (4-step) | fp16 | 0.47 | 0.1167 | 4 | 11.5 | 7.3 | 333.8 | 54.9 | 0.2799 |
| SD 1.5 | fp16 | 0.87 | 0.0291 | 30 | 3.4 | 2.7 | 1460.9 | 75.1 | 0.2704 |
| Hyper-SDXL (4-step) | fp16 | 0.48 | 0.1191 | 4 | 11.5 | 7.3 | 1046.4 | 61.5 | 0.2692 |
| SDXL Turbo | fp16 | 0.22 | 0.0547 | 4 | 8.1 | 6.9 | 588.3 | 58.9 | 0.2827 |
| LCM-SDXL | fp16 | 0.49 | 0.1233 | 4 | 11.1 | 6.9 | 669.1 | 48.3 | 0.2850 |
| LCM-SDXL | fp8 | 1.15 | 0.2879 | 4 | 7.3 | 4.7 | 579.3 | 48.5 | 0.2854 |
| LCM-SDXL | fp8_w | 0.58 | 0.1446 | 4 | 7.3 | 4.7 | 590.4 | 48.3 | 0.2855 |
| LCM-SDXL | fp8_static | 0.73 | 0.1826 | 4 | 7.3 | 4.7 | 566.1 | 48.6 | - |
| LCM-SDXL | int8 | 1.09 | 0.2724 | 4 | 8.9 | 4.7 | 74.5 | 27.0 | 0.1348 |
| LCM-SDXL | int4_gemm | 1.00 | 0.2501 | 4 | 11.4 | 7.3 | 817.2 | 47.3 | 0.2869 |
| LCM-SDXL | nf4 | 0.50 | 0.1239 | 4 | 7.9 | 3.7 | 548.8 | 46.7 | 0.2844 |
| LCM-SDXL | int4 | 0.62 | 0.1560 | 4 | 7.8 | 3.6 | 545.1 | 46.8 | 0.2842 |
| SDXL | fp16 | 3.70 | 0.1235 | 30 | 11.1 | 6.9 | 431.4 | 53.7 | 0.2924 |
| SDXL | fp8_w | 4.39 | 0.1464 | 30 | 7.3 | 4.7 | 404.1 | 53.4 | 0.2920 |
| SDXL | fp8_static | 4.86 | 0.1620 | 30 | 7.3 | 4.7 | 446.2 | 53.9 | - |
| SDXL | int8 | 8.21 | 0.2738 | 30 | 8.9 | 4.7 | 448.2 | 53.5 | 0.2909 |
| SDXL | fp8 | 8.62 | 0.2873 | 30 | 7.3 | 4.7 | 425.5 | 53.6 | 0.2915 |
| SDXL | int4_gemm | 13.49 | 0.4497 | 30 | 11.4 | 7.3 | 611.2 | 54.2 | 0.2940 |
| SDXL | nf4 | 3.91 | 0.1304 | 30 | 7.9 | 3.7 | 456.7 | 51.9 | 0.2901 |
| SDXL | int4 | 4.79 | 0.1596 | 30 | 7.8 | 3.6 | 462.8 | 51.9 | 0.2888 |
| Flux.1 Schnell | fp16 | 1.51 | 0.3763 | 4 | 38.3 | 35.7 | 942.0 | 68.0 | 0.2807 |
| Flux.1 Schnell | fp8_static | 0.81 | 0.2021 | 4 | 24.4 | 21.8 | 959.9 | 69.2 | - |
| Flux.1 Schnell | fp8 | 1.54 | 0.3858 | 4 | 24.4 | 21.8 | 937.8 | 69.0 | 0.2776 |
| Flux.1 Schnell | int8 | 1.67 | 0.4172 | 4 | 26.4 | 23.9 | 941.5 | 68.0 | 0.2805 |
| Flux.1 Schnell | fp8_w | 2.05 | 0.5127 | 4 | 24.4 | 21.8 | 949.5 | 69.2 | 0.2778 |
| Flux.1 Schnell | int4_gemm | 8.74 | 2.1851 | 4 | 25.1 | 22.6 | 737.5 | 66.4 | 0.2763 |
| Flux.1 Schnell | int4 | 1.68 | 0.4211 | 4 | 20.7 | 18.1 | 1048.1 | 68.3 | 0.2768 |
| Flux.1 Schnell | nf4 | 1.53 | 0.3835 | 4 | 21.2 | 18.6 | 1052.0 | 69.1 | 0.2793 |
| Flux.1 Dev | fp16 | 9.45 | 0.3375 | 28 | 38.3 | 35.8 | 943.5 | 61.4 | 0.2740 |
| Flux.1 Dev | fp8_static | 4.75 | 0.1698 | 28 | 24.4 | 21.8 | 860.0 | 63.6 | - |
| Flux.1 Dev | int8 | 10.62 | 0.3794 | 28 | 26.4 | 23.9 | 860.2 | 60.9 | 0.2731 |
| Flux.1 Dev | fp8 | 9.68 | 0.3456 | 28 | 24.4 | 21.8 | 948.0 | 61.5 | 0.2697 |
| Flux.1 Dev | fp8_w | 13.35 | 0.4767 | 28 | 24.4 | 21.9 | 941.9 | 60.6 | 0.2703 |
| Flux.1 Dev | int4 | 10.41 | 0.3717 | 28 | 20.7 | 18.1 | 893.8 | 62.4 | 0.2730 |
| Flux.1 Dev | nf4 | 9.75 | 0.3482 | 28 | 21.2 | 18.7 | 903.3 | 62.1 | 0.2735 |
| Flux.1 Dev | int4_gemm | 32.93 | 1.1759 | 28 | 25.2 | 22.6 | 470.8 | 62.5 | 0.2676 |
Table 3. Results obtained for RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7)
Python 3.11 · PyTorch 2.11 · diffusers 0.38 · 13 prompts · 3 seeds each (SD 1.5, SDXL Turbo, SDXL, LCM-SDXL, Flux.1 Schnell, Flux.1 Dev, SDXL Lightning 2-step & 4-step, Hyper-SDXL 4-step).
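The Avg Time and s/step columns are tied together by the step count, which makes it easy to separate the effect of fewer steps from the effect of a faster step. The sketch below uses the fp16 rows of Table 3 and assumes total time is simply steps × s/step, which matches the table to rounding.

```python
# (s/step, steps) for selected fp16 rows of Table 3.
FP16_ROWS = {
    "SDXL": (0.1235, 30),
    "LCM-SDXL": (0.1233, 4),
    "Flux.1 Dev": (0.3375, 28),
    "Flux.1 Schnell": (0.3763, 4),
}


def total_time(model):
    """Total generation time in seconds, modelled as steps * seconds per step."""
    per_step, steps = FP16_ROWS[model]
    return per_step * steps


def speedup(slow, fast):
    """How many times faster `fast` generates an image than `slow`."""
    return total_time(slow) / total_time(fast)


for slow, fast in (("SDXL", "LCM-SDXL"), ("Flux.1 Dev", "Flux.1 Schnell")):
    print(f"{slow} -> {fast}: {speedup(slow, fast):.1f}x faster")
```

The per-step cost of each distilled model is close to that of its base model, so nearly all of the gain comes from running fewer steps.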
Why fp8 and fp8_w are slower on UNet models (Stable Diffusion family) and about as fast as fp16 on DiT models (Flux)
FP8 tensor cores only accelerate matrix multiplies (GEMM). The two architectures have fundamentally different operation profiles:
- UNet (SD family): the backbone is dominated by Conv2d layers, not linear projections. fp8 quantizes the linear layers inside attention blocks, but the Conv2d operations, which do the majority of the compute, are untouched and stay in bf16. Additionally, Float8DynamicActivationFloat8WeightConfig, used for quantization, measures and scales activations dynamically on every forward pass, adding overhead to every quantized layer while the Conv2d layers gain nothing.
- DiT (Flux): a pure transformer with no convolutions - every operation is attention + linear projections. FP8 quantizes essentially 100% of the compute, and FP8 GEMM tensor cores accelerate it all. FP8 tensor cores do run at ~2× fp16 throughput on the GEMM itself, but the dynamic activation quantization overhead consumes most of that gain. Net result: roughly breakeven, with overhead tipping the balance slightly negative (~2.5% slower).

Float8DynamicActivationFloat8WeightConfig recomputes activation scales on every forward pass. For each linear layer it:
1. Scans activations to find amax(|x|) - one extra read over the activation tensor
2. Multiplies activations by the derived scale - one extra write
3. Runs the FP8 GEMM
4. Dequantizes the output back to bf16
Steps 1-2 are pure overhead that fp16 does not pay. The 2× GEMM speedup must first absorb this cost before any net gain appears.
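The four steps can be simulated in plain Python for a single dot product. This is a sketch under stated assumptions, not torchao's implementation: it uses per-tensor scales, emulates FP8 by rounding to the nearest E4M3 value, and all helper names are hypothetical.

```python
import bisect


def e4m3_grid():
    """Non-negative finite E4M3 values (bias 7, max 448), sorted ascending."""
    grid = []
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # NaN encoding
            grid.append(man / 8 * 2.0 ** -6 if exp == 0 else (1 + man / 8) * 2.0 ** (exp - 7))
    return sorted(grid)


GRID = e4m3_grid()
FP8_MAX = GRID[-1]


def to_fp8(x):
    """Round x to the nearest representable E4M3 value, saturating at +-448."""
    mag = min(abs(x), FP8_MAX)
    i = bisect.bisect_left(GRID, mag)
    nearest = min(GRID[max(i - 1, 0):i + 1], key=lambda g: abs(g - mag))
    return nearest if x >= 0 else -nearest


def quantize_weights(w):
    """Done once at load time: scale the weights into FP8 range and round."""
    scale = FP8_MAX / (max(abs(v) for v in w) or 1.0)
    return [to_fp8(v * scale) for v in w], scale


def dynamic_fp8_linear(x, w_fp8, w_scale):
    amax = max(abs(v) for v in x) or 1.0            # step 1: extra read over x
    x_scale = FP8_MAX / amax
    x_fp8 = [to_fp8(v * x_scale) for v in x]        # step 2: extra write
    acc = sum(a * b for a, b in zip(x_fp8, w_fp8))  # step 3: the FP8 "GEMM"
    return acc / (x_scale * w_scale)                # step 4: dequantize output


weights = [0.5, -1.0, 0.25, 0.75]  # hypothetical layer weights
x = [1.0, 2.0, -1.0, 0.5]          # hypothetical activations
w_fp8, w_scale = quantize_weights(weights)
print(dynamic_fp8_linear(x, w_fp8, w_scale), "vs exact", sum(a * b for a, b in zip(x, weights)))
```

Steps 1 and 2 touch every activation before any multiply happens, which is the cost a plain fp16 layer never pays.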
What actually unlocks fp8 speedup for transformer-based models: fp8_static
The dynamic activation quantization overhead is the key problem: calculating the required scale factors on the fly costs time on every forward pass. fp8_static solves this with an offline calibration step. Because FP8 has a limited dynamic range, precision-sensitive operations such as LayerNorm, softmax, and residual additions must stay in bf16 to remain numerically stable. Only the linear projections run in fp8. In regular fp8 quantization, the activation scale factor is recomputed for each layer on each inference run. This adds work that stalls compute-bound hardware (data arrives fast enough, but the arithmetic units are already saturated). fp8_static calculates those scale factors during a calibration run and reuses them at runtime.
| | fp8 (dynamic) | fp8_static |
|---|---|---|
| Activation scales | Recomputed every step | Calibrated once at load time |
| Per-step overhead | amax scan + scale multiply per layer | None |
| Accuracy | Exact dynamic range | Slight approximation (frozen scales) |
| Best for | Memory-bandwidth-bound GPUs | Compute-bound GPUs (RTX PRO 6000, H100) |
Table 4. Fp8 vs fp8_static. When to use which one.
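The trade-off in Table 4 can be sketched with two small scale providers. The class names, the per-tensor amax calibration rule, and the sample data are assumptions made for illustration; real calibration typically runs representative prompts through the full model.

```python
FP8_MAX = 448.0  # largest finite E4M3 value


def amax(x):
    return max(abs(v) for v in x)


class DynamicScale:
    """fp8-style: rescans the activations on every call."""

    def scale_for(self, x):
        return FP8_MAX / (amax(x) or 1.0)


class StaticScale:
    """fp8_static-style: amax is measured once on calibration data, then frozen."""

    def __init__(self):
        self.amax = None

    def calibrate(self, batches):
        self.amax = max(amax(x) for x in batches) or 1.0

    def scale_for(self, x):
        if self.amax is None:
            raise ValueError("call calibrate() first")
        return FP8_MAX / self.amax  # no scan of x at runtime


def scale_and_saturate(x, scale):
    """Scale into FP8 range; anything beyond +-448 is clipped."""
    return [max(-FP8_MAX, min(FP8_MAX, v * scale)) for v in x]


calibration = [[0.5, -1.0], [2.0, 0.3]]  # hypothetical calibration activations
static = StaticScale()
static.calibrate(calibration)
dynamic = DynamicScale()

runtime = [4.0, 1.0]  # outlier twice as large as anything seen in calibration
print("dynamic:", scale_and_saturate(runtime, dynamic.scale_for(runtime)))
print("static: ", scale_and_saturate(runtime, static.scale_for(runtime)))
```

The static path skips the amax scan entirely, but an activation larger than anything seen during calibration is clipped. That is the "slight approximation" listed in the table.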
Speeding up conv-based UNets beyond quantization
The benchmark results in Table 3 show that quantization does not speed up convolution-based diffusion models. Quantization targets nn.Linear layers: the attention projections and feed-forward blocks. In SD-family UNets, the dominant compute is in ResNet blocks (Conv2d + GroupNorm + SiLU), which are untouched by every quantization format evaluated here. Other techniques are therefore needed to speed up inference for SD models.
- torch.compile generates optimized CUDA kernels that fuse Conv2d + GroupNorm + SiLU into a single kernel dispatch, eliminating Python-level overhead between operations. On SD, this typically yields 20-40% wall-clock reduction at no quality or VRAM cost.
- Channels-last memory format. PyTorch stores tensors in NCHW format by default. cuDNN's Conv2d kernels are faster in NHWC (channels-last) because GPU memory access patterns align better with the convolution sliding window.
- TensorRT with INT8 calibration. TensorRT's PTQ calibration can quantize Conv2d layers to INT8, unlike torchao, which is limited to nn.Linear. This is the only quantization approach that delivers real throughput gains on conv-dominated models. The trade-off is setup complexity: ONNX export, calibration passes, and GPU-architecture-specific engine compilation that cannot be reused across GPU generations.
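The memory-access argument behind channels-last can be seen from the flat offsets alone. The sketch below is plain Python with hypothetical helper names and a made-up tensor shape; in PyTorch itself the layout is selected with .to(memory_format=torch.channels_last) on the model and its inputs.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat memory offset of element (n, c, h, w) in NCHW layout."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat memory offset of the same element in NHWC (channels-last) layout."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 4, 8, 8  # hypothetical small feature map

# A convolution reads every input channel of a pixel to produce one output value.
# These are the offsets touched for pixel (h=2, w=3) of the first image.
nchw = [offset_nchw(0, c, 2, 3, C, H, W) for c in range(C)]
nhwc = [offset_nhwc(0, c, 2, 3, C, H, W) for c in range(C)]

print("NCHW offsets:", nchw)  # H*W elements apart
print("NHWC offsets:", nhwc)  # adjacent in memory
```

In NCHW the channels of one pixel are H×W elements apart, while in NHWC they are contiguous, so a single wide memory read can fetch all of them.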
Visualisations
The following samples illustrate the results obtained for different quantisation levels and strategies.
[Image grid not reproduced. Columns: SDXL, LCM-SDXL, Flux.1 Schnell. Rows: fp16, fp8_w, fp8, nf4, int8, int4.]
Table 5. Close-up portrait of a woman. Comparison for different diffusion model types, quantisation levels and diffusion steps.
[Image grid not reproduced. Columns: SDXL Turbo, SD 1.5, SDXL Lightning (2-step), Hyper-SDXL (4-step). Row: fp16.]
Table 6. Close-up portrait of a woman. Comparison of different convolution-based diffusion models (lighter or optimised versions of Stable Diffusion). The quality of these models lags behind SDXL, LCM-SDXL and Flux.1 Schnell.
Conclusions
The most universal and effective optimization is reducing the number of denoising iterations. Some models are designed to produce high-quality results in as few as 4 steps, making this the single biggest win available regardless of architecture.
Quantization consistently reduces VRAM footprint across all model types, enabling deployment on resource-constrained devices or improving throughput at scale. Its effect on speed, however, is architecture-dependent: for transformer-based models, it accelerates inference only on newer GPU architectures with native hardware support, while for convolutional models, it can actually add processing overhead.
For convolutional diffusion models specifically, the most effective optimization strategies are kernel fusion, channels-last memory format, pruning, and TensorRT INT8 calibration.
Reviewed by: Michał Zaręba



