Sparse attention has been proposed as a way to alleviate the quadratic cost of transformers, a central bottleneck in long-context training. A promising line of work is α-entmax attention, a differentiable sparse alternative to softmax that enables …
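The excerpt does not give the paper's exact entmax formulation, but a minimal NumPy sketch of sparsemax (the α = 2 special case of α-entmax) illustrates how a sparse alternative to softmax assigns exactly zero probability to low-scoring entries; the example scores are made up.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (alpha = 2 case of alpha-entmax): Euclidean projection of the
    score vector z onto the probability simplex. Unlike softmax, low-scoring
    entries receive exactly zero probability."""
    z = np.asarray(z, dtype=np.float64)
    z_sorted = np.sort(z)[::-1]                    # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum            # entries kept in the support
    k_z = k[support][-1]                           # support size
    tau = (cumsum[support][-1] - 1) / k_z          # threshold
    return np.maximum(z - tau, 0.0)

scores = np.array([1.5, 1.0, 0.2, -1.0])
print(sparsemax(scores))                        # [0.75 0.25 0.   0.  ]  -- sparse
print(np.exp(scores) / np.exp(scores).sum())    # softmax: every entry strictly positive
```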
Internal modelling of the world — predicting transitions between previous states X and next states Y under actions Z — is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled …
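The excerpt does not describe the paper's architecture, so the following is only a toy PyTorch sketch of a learned transition model f_θ(X, Z) → Y, where X is the previous state, Z an action, and Y the next state; all dimensions, names, and the MLP itself are illustrative placeholders standing in for the costly action-labelled supervision the passage mentions.

```python
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    """Toy world model: predict the next state Y from the previous state X
    and an action Z, all represented here as fixed-size embeddings."""
    def __init__(self, state_dim=256, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 512),
            nn.GELU(),
            nn.Linear(512, state_dim),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))   # predicted next state

model = TransitionModel()
x = torch.randn(8, 256)   # previous states X
z = torch.randn(8, 32)    # action embeddings Z
y = torch.randn(8, 256)   # ground-truth next states Y (action-labelled data)
loss = nn.functional.mse_loss(model(x, z), y)   # standard supervised objective
```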
Recent advances in generative AI have been largely driven by large language models (LLMs), deep neural networks that operate over discrete units called tokens. To represent text, the vast majority of LLMs use words or word fragments as the tokens, …
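To make "words or word fragments as the tokens" concrete, here is a toy greedy longest-match tokeniser over a hand-made vocabulary; real LLM tokenisers instead learn vocabularies of tens of thousands of fragments (BPE, unigram, etc.), so this is only a sketch.

```python
# Toy subword tokenisation: greedy longest-match against an invented vocabulary.
VOCAB = {"token", "isation", "un", "believ", "able"}

def tokenize(word, vocab=VOCAB):
    tokens, i = [], 0
    while i < len(word):
        # take the longest vocabulary entry that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])   # unknown span: fall back to a single character
            i += 1
    return tokens

print(tokenize("tokenisation"))   # ['token', 'isation']
print(tokenize("unbelievable"))   # ['un', 'believ', 'able']
```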
Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), including byte-level LLMs, which are tokeniser-free but prohibitively slow. However, existing MTP methods often sacrifice …
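The excerpt does not specify the paper's particular MTP design, but a common pattern is k output heads that each predict one of the next k tokens from the current hidden state, so a single forward pass proposes several tokens (or bytes) at once; the sketch below is a minimal PyTorch illustration of that idea, with invented shapes.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Minimal multi-token prediction sketch: k linear heads over the last
    hidden state, each predicting one of the next k tokens. Real MTP systems
    differ (extra layers per head, verification passes, byte-level vocabularies);
    this only shows the basic mechanism."""
    def __init__(self, hidden_dim=768, vocab_size=256, k=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, vocab_size) for _ in range(k))

    def forward(self, h_last):
        # h_last: (batch, hidden_dim) hidden state at the current position
        # returns (batch, k, vocab_size): logits for the next k tokens/bytes
        return torch.stack([head(h_last) for head in self.heads], dim=1)

mtp = MultiTokenHead()
logits = mtp(torch.randn(2, 768))
draft = logits.argmax(dim=-1)    # (2, 4): four draft tokens per sequence per step
```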
Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key–value (KV) cache, rather than the number …
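A worked example of why the KV cache, rather than sequence count alone, bounds generation: cache size is roughly 2 (keys and values) × layers × KV heads × head dimension × context length × batch × bytes per element. The Llama-2-7B-like shapes below are illustrative, not taken from the abstract.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV-cache memory: keys + values stored for every layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative Llama-2-7B-like shapes (32 layers, 32 KV heads, head_dim 128), fp16.
per_seq = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=1)
print(per_seq / 2**30)   # 2.0 GiB of cache for a single 4k-token sequence
# Parallel sampling scales this linearly: 16 parallel chains -> ~32 GiB of cache.
```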
Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically …
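To pin down the inputs and outputs of the FDP task as described in the excerpt, here is a minimal interface sketch: the previous observation is an image, the action is natural language, and the target is the future image. The class and protocol names are placeholders, not any specific model's API.

```python
from dataclasses import dataclass
from typing import Protocol
import numpy as np

@dataclass
class FDPExample:
    """One forward dynamics prediction instance."""
    observation: np.ndarray        # previous state, e.g. an (H, W, 3) RGB frame
    action: str                    # action in language, e.g. "push the red block left"
    next_observation: np.ndarray   # ground-truth future frame (target)

class ForwardDynamicsModel(Protocol):
    """Interface a unified VLM would need to expose to be evaluated on FDP."""
    def predict(self, observation: np.ndarray, action: str) -> np.ndarray: ...
```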