Release Notes
Release v5.12.0
New Model additions
MiniMax-M3-VL
MiniMax-M3-VL is the vision-language member of the MiniMax-M3 family. It pairs a CLIP-style vision tower (with 3D rotary position embeddings) with the MiniMax-M3 text backbone. The decoder is a mixed dense/sparse Mixture-of-Experts stack with SwiGLU-OAI gated experts and a lightning indexer for block-sparse attention. Images are embedded through a Conv3d patch embedding, and the model includes specialized components for efficient multimodal understanding and generation.
Links: Documentation
- Add minimax m3vl (#46600) by @ArthurZucker in #46600
PP-OCRv6: update documentation and slow tests (#46576)
The official weights for PP-OCRv6 are out. PP-OCRv6 is a lightweight OCR system that combines architectural innovation with data-centric optimization. It redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge.
- PP-OCRv6: update documentation and slow tests (#46576) by @zhang-prog
Add Parakeet-RNNT (#46331)
ParakeetForRNNT: a Fast Conformer Encoder + an RNN-T (RNN Transducer) decoder
- RNN-T Decoder: a standard neural transducer:
  - LSTM prediction network maintains language context across token predictions.
  - Joint network combines encoder and decoder outputs.
  - Greedy transducer decoding for inference: a blank emission advances the encoder frame by one, a non-blank emission stays on the same frame.
- Add Parakeet-RNNT (#46331) by @eustlb
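
The greedy decoding rule described above (blank advances the encoder frame, non-blank stays on it) can be illustrated with a minimal pure-Python sketch. This is not the `ParakeetForRNNT` implementation: the function `greedy_transducer_decode`, the callables `joint_argmax` and `update_state`, and the `max_symbols_per_frame` safety cap are all hypothetical names introduced here for illustration, and the joint and prediction networks are replaced by a toy stand-in.

```python
from typing import Callable, List, Sequence, Tuple

History = Tuple[int, ...]


def greedy_transducer_decode(
    frames: Sequence[Tuple[int, ...]],
    joint_argmax: Callable[[Tuple[int, ...], History], int],
    update_state: Callable[[History, int], History],
    init_state: History = (),
    blank_id: int = 0,
    max_symbols_per_frame: int = 10,  # assumed safety cap, not from the release notes
) -> List[int]:
    """Greedy transducer loop: blank -> next frame, non-blank -> same frame."""
    tokens: List[int] = []
    state = init_state
    t = 0
    emitted_here = 0
    while t < len(frames):
        # Stand-in for joint network + argmax over the vocabulary.
        token = joint_argmax(frames[t], state)
        if token == blank_id or emitted_here >= max_symbols_per_frame:
            # Blank (or cap reached): advance the encoder frame by one.
            t += 1
            emitted_here = 0
        else:
            # Non-blank: emit, update the prediction state, stay on this frame.
            tokens.append(token)
            state = update_state(state, token)
            emitted_here += 1
    return tokens


# Toy stand-ins: each "frame" lists the full token prefix that should have
# been emitted by the end of that frame; the "state" is the emitted history.
def toy_joint_argmax(frame: Tuple[int, ...], history: History) -> int:
    return frame[len(history)] if len(history) < len(frame) else 0


def toy_update_state(history: History, token: int) -> History:
    return history + (token,)


frames = [(3,), (3,), (3, 5, 7)]
decoded = greedy_transducer_decode(frames, toy_joint_argmax, toy_update_state)
print(decoded)
```

In this toy run the first frame emits one token before a blank, the second frame emits only a blank, and the last frame emits two tokens in a row without advancing, which is the behavior the bullet above describes.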
Bugfixes and improvements
- [CI] don't export OTELs within the tests (#46602) by @tarekziade in [#46602]
- [CI] capture checkers output in OTEL (#46601) by @tarekziade in [#46601]
- Lfm2: thread `seq_idx` through ShortConv for packed/varlen inputs (#46588) by @ChangyiYang in [#46588]
- put output_hidden_states into filter_output_hidden_states (#46422) by @molbap in [#46422]
- a11 for checkers (#46599) by @tarekziade in [#46599]
- Fix stop string matching for byte-fragment tokens (#46530) by @Incheonkirin in [#46530]
- [DiffusionGemma] better docs and links (#46569) by @gante in [#46569]
- Require `trust_remote_code` to run a local-directory `custom_generate` (#46483) by @LinZiyuu in [#46483]
- Fix torchaudio version not tied to torch version in docker file (#46594) by @ydshieh in [#46594]
- [CI] Enable PR CI for all fork PRs via security gate (#46591) by @ydshieh in [#46591]
- [CB] [Minor] Add parameter to tune default compile level (#46533) by @remi-or in [#46533]
- Make DiffusionGemma trainable (#46568) by @kashif in [#46568]
- docs: 🌐 add Turkish translation for README file (#46312) by @onuralpszr in [#46312]
- fix-trainer-tests (#46541) by @SunMarc in [#46541]
- Remove unnecessary expand_as in get_placeholder_mask across VLMs (#44907) by @syncdoth in [#44907]
- [CI] Catch all shell/process execution issues in security gate via Bandit JSON report (#46560) by @ydshieh in [#46560]
- Honor a concrete dtype in AutoModel for composite checkpoints (#46514) by @qflen in [#46514]
- [CI] Implement real security check in PR CI security gate (#46557) by @ydshieh in [#46557]
- [CI] Add 60s delay in security gate for flow observation (#46555) by @ydshieh in [#46555]
- [TBC] [CI] Auto-approve PR CI for fork PRs via security gate (#46553) by @ydshieh in [#46553]
- [CI] fix and make less flaky (#46543) by @zucchini-nlp in [#46543]
- Fix hf_hub_download not placing file in current dir for url_to_local_path (#46545) by @ydshieh in [#46545]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ArthurZucker
  - Add minimax m3vl (#46600)
- @eustlb
  - Add Parakeet-RNNT (#46331)