[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
Updated Jan 17, 2026 · CUDA
URL: http://github.com/topics/efficient-attention
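As a rough illustration of the quantized-attention idea behind the entry above, here is a minimal PyTorch sketch that quantizes Q and K to INT8 with per-tensor scales and dequantizes the scores before the softmax. All names are illustrative, the integer matmul is emulated in float for portability, and this does not reflect the repository's fused CUDA kernels or its accuracy-preserving tricks.

```python
import torch
import torch.nn.functional as F

def quantize_int8(x):
    # Symmetric per-tensor quantization: map the largest magnitude to 127.
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def quantized_attention(q, k, v):
    # q, k, v: (batch, heads, seq, head_dim)
    d = q.shape[-1]
    q_i8, q_scale = quantize_int8(q)
    k_i8, k_scale = quantize_int8(k)
    # Emulate the INT8 matmul in float (int8 values are exact in float32);
    # a real kernel would run this on INT8 tensor cores.
    scores = q_i8.float() @ k_i8.float().transpose(-1, -2)
    scores = scores * (q_scale * k_scale) / d ** 0.5  # dequantize and rescale
    return F.softmax(scores, dim=-1) @ v

# Quick check against full-precision attention on random inputs.
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
out = quantized_attention(q, k, v)
ref = F.scaled_dot_product_attention(q, k, v)
print((out - ref).abs().max())
```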
PyTorch implementation of Efficient Infinite Context Transformers with Infini-attention + QwenMoE implementation + training script + 1M-context passkey retrieval
The official PyTorch implementation for CascadedGaze: Efficiency in Global Context Extraction for Image Restoration, TMLR'24.
Unofficial PyTorch implementation of the paper "cosFormer: Rethinking Softmax In Attention".
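Several entries here build on linear (kernelized) attention, so a short sketch of that skeleton may help. cosFormer specifically uses a ReLU feature map together with a cosine-based re-weighting; the sketch below keeps only the generic ReLU linear-attention part and omits the re-weighting, so it is the underlying skeleton, not the full cosFormer method.

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, heads, seq, dim); phi(x) = relu(x) as the feature map.
    q, k = q.relu(), k.relu()
    # Associativity: phi(Q) (phi(K)^T V) costs O(n * d^2) instead of the
    # O(n^2 * d) of (phi(Q) phi(K)^T) V.
    kv = torch.einsum('bhnd,bhne->bhde', k, v)                         # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + eps)   # row normalizer
    return torch.einsum('bhnd,bhde,bhn->bhne', q, kv, z)
```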
PyTorch implementation of "Compact Global Descriptor for Neural Networks" (CGD).
Implementation of Hydra Attention: Efficient Attention with Many Heads (https://arxiv.org/abs/2209.07484)
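For the Hydra Attention entry, the paper's key observation is that with as many heads as feature dimensions and a cosine-similarity kernel, attention collapses to an element-wise gate computed from a single globally aggregated vector, giving O(n·d) cost. A minimal sketch of that formula follows; names are illustrative and do not mirror the linked repo's API.

```python
import torch
import torch.nn.functional as F

def hydra_attention(q, k, v):
    # q, k, v: (batch, seq, dim); phi = L2 normalization over the feature dim.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    kv = (k * v).sum(dim=1, keepdim=True)   # (batch, 1, dim): global summary
    return q * kv                           # broadcast gate over all tokens
```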
Official Implementation of SEA: Sparse Linear Attention with Estimated Attention Mask (ICLR 2024)
Nonparametric Modern Hopfield Models
Official repository for "SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space"
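For the sparse-attention entries above (SEA, SSA), a generic top-k sketch shows the basic idea of masking all but the largest scores per query. The actual methods estimate or align the sparse mask instead of computing full scores; this sketch still materializes the dense score matrix, so it illustrates the sparsity pattern, not the speedup.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=32):
    # q, k, v: (batch, heads, seq, head_dim)
    d = q.shape[-1]
    scores = q @ k.transpose(-1, -2) / d ** 0.5
    # Keep only the top-k scores in each row; mask the rest before the softmax.
    kth = scores.topk(min(topk, scores.shape[-1]), dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float('-inf'))
    return F.softmax(scores, dim=-1) @ v
```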
Minimal implementation of Samba by Microsoft in PyTorch
Resources and references on solved and unsolved problems in attention mechanisms.
🤖 Build a customizable, reliable Discord bot with Sage, designed for flexibility to enhance your server's interaction and engagement.