AdaSplash-2: Faster Differentiable Sparse Attention

Sparse attention has been proposed as a way to alleviate the quadratic cost of transformers, a central bottleneck in long-context training. A promising line of work is α-entmax attention, a differentiable sparse alternative to softmax that enables …
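As a quick illustration of what α-entmax computes, here is a minimal NumPy sketch that finds the entmax threshold by bisection, the generic algorithm from the entmax literature rather than anything specific to AdaSplash-2; the function name and iteration count are illustrative choices.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=50):
    """Alpha-entmax over a 1-D score vector via bisection on the threshold tau.

    Solves p_i = max((alpha - 1) * z_i - tau, 0) ** (1 / (alpha - 1)) for the tau
    that makes p sum to 1. alpha = 2 gives sparsemax; the limit alpha -> 1 is
    softmax (not handled by this sketch).
    """
    z = (alpha - 1.0) * np.asarray(z, dtype=np.float64)
    tau_lo = z.max() - 1.0                      # at this tau, sum(p) >= 1
    tau_hi = z.max() - len(z) ** (1.0 - alpha)  # at this tau, sum(p) <= 1
    for _ in range(n_iter):
        tau = 0.5 * (tau_lo + tau_hi)
        p = np.clip(z - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            tau_lo = tau
        else:
            tau_hi = tau
    p = np.clip(z - tau_lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum()  # small renormalisation against residual bisection error

print(entmax_bisect([2.0, 1.0, -1.0], alpha=1.5))  # low scores get exactly zero weight
```

Unlike softmax, scores far enough below the maximum receive exactly zero probability, which is what makes the attention map sparse while remaining differentiable almost everywhere.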

Self-Improving World Modelling with Latent Actions

Internal modelling of the world — predicting transitions between previous states X and next states Y under actions Z — is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled …
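One common way to learn such a model without action labels is to treat the action as a latent variable: an inverse model reads an observed (X, Y) pair and produces a latent action, and a forward model reconstructs Y from X and that latent. The sketch below shows this generic setup in PyTorch; the architecture and dimensions are assumptions for illustration, not this paper's model.

```python
import torch
import torch.nn as nn

class LatentActionWorldModel(nn.Module):
    """Generic latent-action world model (illustrative sketch only).

    inverse(x, y)        -> a_hat : latent stand-in for the unobserved action Z
    forward_model(x, a)  -> y_hat : predicted next state
    """
    def __init__(self, state_dim=256, action_dim=16):
        super().__init__()
        self.inverse = nn.Sequential(
            nn.Linear(2 * state_dim, 512), nn.ReLU(), nn.Linear(512, action_dim))
        self.forward_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, 512), nn.ReLU(), nn.Linear(512, state_dim))

    def forward(self, x, y):
        a_hat = self.inverse(torch.cat([x, y], dim=-1))
        y_hat = self.forward_model(torch.cat([x, a_hat], dim=-1))
        return y_hat, a_hat

model = LatentActionWorldModel()
x, y = torch.randn(8, 256), torch.randn(8, 256)
y_hat, _ = model(x, y)
loss = nn.functional.mse_loss(y_hat, y)  # self-supervised: no action labels required
```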

nvidia/Qwen3-8B-DMS-8x

8x KV cache compression without quality degradation. Ideal for inference-time scaling.

Bolmo: Byteifying the Next Generation of Language Models

Recent advances in generative AI have been largely driven by large language models (LLMs), deep neural networks that operate over discrete units called tokens. To represent text, the vast majority of LLMs use words or word fragments as the tokens, …
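The byte-level alternative is easy to state concretely: text is mapped to its UTF-8 bytes, giving a fixed 256-symbol vocabulary with no tokeniser to train, at the cost of much longer sequences. The snippet below is a plain-Python illustration of that trade-off, independent of Bolmo's actual tokenisation.

```python
text = "Byteifying ≈ bytes"
byte_ids = list(text.encode("utf-8"))  # fixed vocabulary of 256 byte values
print(len(text), len(byte_ids))        # 18 characters -> 20 bytes ("≈" needs 3 bytes)
print(byte_ids[:8])                    # [66, 121, 116, 101, 105, 102, 121, 105]
```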

Fast and Expressive Multi-Token Prediction with Probabilistic Circuits

Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), including byte-level LLMs, which are tokeniser-free but prohibitively slow. However, existing MTP methods often sacrifice …
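For contrast, the simplest MTP baseline attaches k independent prediction heads to one hidden state, drafting k future tokens in a single forward pass; treating those tokens as independent limits the joint distributions such heads can express. The sketch below shows that baseline (an illustrative assumption, not the probabilistic-circuit parameterisation proposed here).

```python
import torch
import torch.nn as nn

class IndependentMTPHeads(nn.Module):
    """Baseline multi-token prediction: k linear heads share one hidden state and
    each predicts the token at a different future offset, independently of the
    others. Illustrative only; not the probabilistic-circuit model."""
    def __init__(self, hidden_dim=512, vocab_size=32000, k=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, vocab_size) for _ in range(k))

    def forward(self, h):  # h: (batch, hidden_dim)
        return torch.stack([head(h) for head in self.heads], dim=1)  # (batch, k, vocab)

mtp = IndependentMTPHeads()
logits = mtp(torch.randn(2, 512))
draft_tokens = logits.argmax(-1)  # k draft tokens per sequence in one pass
```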

Inference-Time Hyper-Scaling with KV Cache Compression

Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key–value (KV) cache, rather than the number …
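The bottleneck is visible from back-of-the-envelope arithmetic. The configuration below (36 layers, 8 grouped KV heads, head dimension 128, fp16) is an assumed Qwen3-8B-like setup used only for illustration.

```python
# Per-token KV cache footprint under an assumed Qwen3-8B-like config (fp16 = 2 bytes).
layers, kv_heads, head_dim, bytes_per_elem = 36, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # factor 2: keys and values
print(per_token / 1024)             # 144.0 KiB of cache per generated token

seq_len = 32_768                    # one long reasoning trace
print(per_token * seq_len / 2**30)  # ~4.5 GiB per sequence before any compression
```

At these sizes the cache, not the arithmetic, dominates generation cost, which is what compressing it by 8x directly attacks.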

Bootstrapping Action-Grounded Visual Dynamics in Unified Vision-Language Models

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically …
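The task signature itself is simple to write down; the schematic below only illustrates the FDP input/output structure, with hypothetical field names.

```python
from dataclasses import dataclass

@dataclass
class FDPExample:
    """Forward dynamics prediction: (previous image, language action) -> next image."""
    previous_observation: bytes  # encoded RGB frame of the current scene
    action: str                  # e.g. "push the red block to the left edge of the table"
    next_observation: bytes      # target frame the VLM should generate
```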