v5.11.0
Release v5.11.0
Published: Jun 10, 2026

Release Notes

New Model additions

DiffusionGemma

DiffusionGemma is designed to reduce the sequential bottleneck of standard causal language models by using an encoder-decoder architecture optimized for inference speed. During inference it uses multi-canvas sampling: rather than generating one token at a time, the model iteratively denoises a full block of tokens with a diffusion sampler. This block-autoregressive approach generates text faster than traditional token-by-token decoding.
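
To make the idea of block-wise iterative denoising concrete, here is a toy sketch in plain Python. It is not DiffusionGemma's sampler and does not use the Transformers API: `toy_denoiser`, `denoise_block`, and `generate` are hypothetical names, and the "model" is a deterministic stand-in that predicts a counting sequence. It only illustrates the control flow: each block starts fully masked, every step proposes tokens for all masked slots in parallel, and the most confident proposals are committed before the next step.

```python
MASK = None  # marks a position in the block that has not been denoised yet


def toy_denoiser(prefix_len, block):
    """Hypothetical stand-in for a model forward pass.

    Returns {position: (token, confidence)} for every masked slot.
    The "prediction" is the absolute position modulo 10, and confidence
    decays with distance from the start of the block.
    """
    proposals = {}
    for i, tok in enumerate(block):
        if tok is MASK:
            proposals[i] = ((prefix_len + i) % 10, 1.0 / (1 + i))
    return proposals


def denoise_block(prefix_len, block_size, steps):
    """Fill one block by committing the most confident proposals at each step."""
    block = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceiling division
    for _ in range(steps):
        proposals = toy_denoiser(prefix_len, block)
        if not proposals:
            break  # block is already fully denoised
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_step]:
            block[i] = proposals[i][0]
    return block


def generate(num_blocks, block_size=4, steps=2):
    """Blocks are produced left to right; tokens inside a block are filled in parallel."""
    tokens = []
    for _ in range(num_blocks):
        tokens.extend(denoise_block(len(tokens), block_size, steps))
    return tokens


print(generate(num_blocks=2, block_size=4, steps=2))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

In this sketch, eight tokens take four denoising steps rather than eight sequential decoding steps, which is the kind of saving the block-wise approach is after.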

Links: Documentation

  • GPU go brr (#46540) by @gante in #46540

DeepSeek-V3.2

DeepSeek-V3.2-Exp is an experimental model from DeepSeek-AI that introduces DeepSeek Sparse Attention (DSA), a trainable, fine-grained sparse attention mechanism designed to improve training and inference efficiency in long-context scenarios. Built on top of DeepSeek-V3.1-Terminus with a 685B-parameter Mixture-of-Experts backbone, it reduces the quadratic cost of attention over long sequences by attending only to a selected subset of past tokens while maintaining virtually identical benchmark performance. The work was extended in DeepSeek-V3.2, which pairs DSA with scalable reinforcement learning and achieves gold-medal-level results on competition math and competitive programming benchmarks.
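
The core idea of attending only to a selected subset of past tokens can be sketched with a small top-k attention function in plain Python. This is a generic illustration, not the DSA implementation: `sparse_attention` is a hypothetical helper, and `index_scores` is an assumed input standing in for whatever trainable mechanism ranks past tokens. The point is that the softmax and weighted sum run over `top_k` tokens instead of the whole history.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attention(query, keys, values, index_scores, top_k):
    """Attend from one query to only the top_k past tokens.

    index_scores[i] is an assumed relevance score for past token i; only the
    top_k highest-scoring tokens take part in the softmax.
    """
    k = min(top_k, len(keys))
    ranked = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    selected = sorted(ranked[:k])  # restore positional order
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in selected])
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return selected, output


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
selected, out = sparse_attention(
    query=[1.0, 0.0],
    keys=keys,
    values=values,
    index_scores=[0.9, 0.1, 0.8, 0.2],
    top_k=2,
)
print(selected, out)  # -> [0, 2] [1.5, 1.0]
```

With `top_k` fixed, the per-query cost of the attention step no longer grows with the number of past tokens, although computing the ranking scores still touches every past token.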

Links: Documentation | Paper

  • Add deepseek 3.2 exp (#41251) by @ArthurZucker in #41251

Kernels

The KernelConfig API was extended to support n-to-1 module fusion and parameter transformation, simplifying how custom kernels are integrated with Transformers modules. Other changes resolve a dtype mismatch in the Mamba2 CUDA kernel path for NemotronH/Zamba2, add fine-grained fp8/fp4 Triton kernel support, and correct the FalconMamba fast-path warning to recommend pip install kernels instead of mamba-ssm.

  • Extended & simplified n-to-1 kernel fusion via KernelConfig (#46339) by @michaelbenayoun in [#46339]
  • Triton finegrained fp8/fp4 (#46407) by @IlyasMoutawwakil in [#46407]
  • Fix dtype mismatch in NemotronH/Zamba2 Mamba2 CUDA-kernel path (out_proj) (#46487) by @yuekaizhang in [#46487]
  • fix(falcon_mamba): recommend pip install kernels in fast-path warning (#46343) by @Anai-Guo in [#46343]

Parallelization

Fixed model parallel beam search bugs in the Qwen2-VL, Qwen2.5-VL, and Qwen3-VL MoE model families, and added documentation for tensor parallelism support with continuous batching.

  • [docs] tp for continuous batching (#46019) by @stevhliu in [#46019]
  • revisit history parallel beam search tests to avoid unnecessary fix (#46495) by @kaixuanliu in [#46495]
  • fix qwen series VL model's model parallel bug (#46316) by @kaixuanliu in [#46316]

Bugfixes and improvements

  • Fix the offsets in processing (#46525) by @zucchini-nlp in [#46525]
  • Fix buggy action sha pin (#46534) by @ydshieh in [#46534]
  • Fix trailing comma bug in DataCollatorForLanguageModeling example (#46527) by @JemmaUZH in [#46527]
  • Fix missing Gemma4Processor._compute_audio_num_tokens (#46416) by @csantosbh in [#46416]
  • Fix InternVL models (#46524) by @hmellor in [#46524]
  • fix(afmoe): reduce tokens in test_compile_static_cache to avoid flaky bfloat16 drift (#46521) by @ydshieh in [#46521]
  • [CB] Add a "max_requests_per_batch" parameter (#46434) by @remi-or in [#46434]
  • revamp cv docs and fix rf-detr (#46219) by @merveenoyan in [#46219]
  • Update hub metadata (#46379) by @zucchini-nlp in [#46379]
  • extend DeepseekV4FlashIntegrationTest to non-cuda device (#46517) by @sywangyi in [#46517]
  • [docs] deepgemm (#46361) by @stevhliu in [#46361]
  • [fix] regression introduced by #45534 (#46456) by @eustlb in [#46456]
  • Use torchvision's native LANCZOS interpolation instead of PIL fallback (#46496) by @NicolasHug in [#46496]
  • Add debugging info in pr-ci-caller.yml (#46505) by @ydshieh in [#46505]
  • Fix tests: 'Cohere2MoeModel' object has no attribute 'hf_device_map' (#46337) by @kaixuanliu in [#46337]
  • Bump the actions group across 1 directory with 19 updates (#46414) by @dependabot[bot] in [#46414]
  • Log some information in .github/workflows/pr-ci-post-dashboard-link.yml (#46499) by @ydshieh in [#46499]
  • feat(quantizers): support non-weight param names in TorchAo safetensors loading (#46325) by @agesf in [#46325]
  • docs: fix typo in make_list_of_images docstring (#46469) by @ramkumar27072006 in [#46469]
  • add XPU expectation for deepseek_ocr2 model tests (#46492) by @kaixuanliu in [#46492]
  • Fix sapiens2 tests: add XPU device expectations (#46488) by @kaixuanliu in [#46488]
  • Add vLLM smoke test to CI (#46383) by @hmellor in [#46383]
  • extend deepseek v4 test to xpu (#46366) by @sywangyi in [#46366]
  • Added cosmos3 model (#46146) by @MaciejBalaNV in [#46146]
  • fbgemm_fp8:Keep the current device aligned with the input tensor (#46403) by @kaixuanliu in [#46403]
  • [Modular] Add no_inherit_decorators and fixup wrong RoPE related inheritances (#46440) by @Bissmella in [#46440]
  • skip deepgemm test except cuda (#46090) by @jiqing-feng in [#46090]
  • Fix/video classification pipeline video processor (#46256) by @J3r3myPerera in [#46256]
  • ci: less flaky test_assisted_decoding_matches_greedy_search_1_same (#46445) by @ydshieh in [#46445]
  • Fix flip_back graph break (#46344) by @guarin in [#46344]
  • Add the other processors to auto-mappings (#46046) by @zucchini-nlp in [#46046]
  • fix: compatibility with torch<=2.7 (#46393) by @andylin-hao in [#46393]
  • fix: remove dynamic per-actor Slack ID lookup in ssh-runner workflow (#46327) by @ydshieh in [#46327]
  • [docs] Romanian translation of pipeline_tutorial.md, pipeline_gradio.md, pipeline_webserver.md and add_new_pipeline.md. (#46388) by @filipinescu in [#46388]
  • [docs] gemma4 typos (#46351) by @stevhliu in [#46351]
  • [docs] padding-free training (#46333) by @stevhliu in [#46333]
  • fix[vLLM x v5]: Default untied embeddings in AudioFlamingo3 and VibeVoice (#46400) by @harshaljanjani in [#46400]
  • Fix deepspeed docker (#46108) by @SunMarc in [#46108]
  • Fix conversion for clip models (#46406) by @zucchini-nlp in [#46406]
  • ci: mention code quality failure in CI dashboard comment (#46415) by @ydshieh in [#46415]
  • Fix noisy logging from image_processing module aliases issue - 46298 (#46350) by @skshmjn in [#46350]
  • Raise tqdm minimum to 4.60 to match tqdm.contrib.logging import (#46397) by @n0gu-furiosa in [#46397]
  • fix(gemma4_unified): conversion script and config bugs (#46398) by @douglas-reid in [#46398]
  • [docs] remove sparsity from compressed-tensors (#46387) by @stevhliu in [#46387]
  • [CB] Fix crashes when fork is not possible (#46251) by @remi-or in [#46251]
  • Improve CI dashboard comment: rename and deduplicate (#46412) by @ydshieh in [#46412]
  • Fix missing f-string prefixes in error messages (#46354) by @joaopedroassad in [#46354]
  • Add workflow to post CI Grafana dashboard link to PR (#46410) by @ydshieh in [#46410]
  • [docs] Romanian translation of fast_tokenizers.md, custom_tokenizers.md, tokenizer_summary.md, image_processors.md and video_processors.md. (#46356) by @filipinescu in [#46356]
  • Clean up new models after release (#46092) by @zucchini-nlp in [#46092]

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ArthurZucker
    • Add deepseek 3.2 exp (#41251)
  • @gante
    • GPU go brr (#46540)
  • @merveenoyan
    • revamp cv docs and fix rf-detr (#46219)
  • @sgerrard
    • Quantization for small models (#46449)
  • @MaciejBalaNV
    • Added cosmos3 model (#46146)
  • @J3r3myPerera
    • Fix/video classification pipeline video processor (#46256)
  • @filipinescu
    • [docs] Romanian translation of pipeline_tutorial.md, pipeline_gradio.md, pipeline_webserver.md and add_new_pipeline.md. (#46388)
    • [docs] Romanian translation of fast_tokenizers.md, custom_tokenizers.md, tokenizer_summary.md, image_processors.md and video_processors.md. (#46356)