Publications

Sparse attention has been proposed as a way to alleviate the quadratic cost of transformers, a central bottleneck in long-context …

Internal modelling of the world — predicting transitions between previous states X and next states Y under actions Z — is essential to …

Recent advances in generative AI have been largely driven by large language models (LLMs), deep neural networks that operate over …

Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), including …

Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in …

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) …

Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy …

Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts …

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to …

Transformers achieve unrivalled performance in modelling language, but remain inefficient in terms of memory and time complexity. A …