ELF: Diffusion Models for Language (2026)

Explains ELF's continuous text-diffusion approach, 32–64 step sampling, JAX/TPU requirements, perplexity and task results, and practical trade-offs.

After three hours of trying to reproduce GPT-style results on my laptop with a 3090, I started wondering if there was another way to handle language generation without maxing out my patience or my GPU. That is when ELF landed on my radar. Built by Kaiming He's team at MIT, ELF skips the usual token-by-token slog of autoregressive models and replaces it with something that feels more like image diffusion, but for text.

The claim? ELF can hit GPT-like perplexities with a fraction of the training tokens and just 32 generation steps. It is not just a tweak to the usual diffusion setup either. ELF stays in a continuous embedding space almost the entire time, only snapping back to discrete tokens at the very last step. If you have been frustrated by the inefficiency of discrete diffusion or the rigidity of autoregressive pipelines, ELF promises to be a fresh alternative.

This post covers how ELF works, what it does well, and where it might fall short. It also looks at whether the JAX-only implementation is worth the effort for TPU users and what GPU folks should expect if they are waiting for the PyTorch port. No fluff, just the trade-offs and results.

Text diffusion: A new paradigm for LLMs

ELF and the Problem It Solves

ELF Embedded Language Flows architecture overview from the MIT paper

ELF vs autoregressive vs discrete diffusion comparison table for training tokens and sampling steps — ELF vs Autoregressive vs Discrete Diffusion: Key Metrics Compared

Autoregressive models, like GPT, generate text one token at a time. While effective for many tasks, this sequential process can be slow and struggles with operations that require flexibility, such as infilling, bidirectional editing, or parallel decoding. These limitations highlight why ELF's continuous approach is interesting.

Why Diffusion Models Matter for Language

ELF takes advantage of continuous embeddings throughout the generation process, only converting to discrete tokens at the final step (t = 1). Unlike earlier diffusion-based methods, such as Diffusion-LM, which mapped to discrete vocabularies at every step, ELF maintains the flow dynamics that make diffusion methods effective in other domains, like image generation.

The results speak for themselves: ELF-B achieves a perplexity of 24 in just 32 steps, compared to up to 1,024 steps required by discrete diffusion models. And it does this after training on only 45 billion tokens, a fraction of the 500 billion tokens needed for comparable discrete diffusion models. Its continuous framework also natively supports Classifier-Free Guidance (CFG), a technique that is notoriously difficult to implement in discrete or autoregressive models.

Feature	Autoregressive (GPT-style)	Discrete Diffusion (MDLM)	ELF (Continuous)
Generation	Sequential	Iterative unmasking	Iterative flow matching
Modeling Space	Discrete tokens	Discrete tokens	Continuous embeddings
Training Tokens	High	~500B	~45B
Sampling Steps	Sequence length	Up to 1,024	32 to 64
CFG Support	Difficult	Limited	Native

Who Built ELF and Why It Matters

ELF was developed at MIT in Kaiming He's lab, the same researcher responsible for the ResNet architecture. The project's co-first authors, Keya Hu and Linlu Qiu, are PhD students at MIT, with additional input from MIT's EECS, CSAIL, and Tsinghua University's Yao Class. The team's vision is captured in their statement:

"The problem may not be that 'language must be discrete,' but that prior works simply never let the continuous approach be continuous all the way." (Kaiming He's Team, MIT)

This highlights their belief that earlier attempts at continuous diffusion for language were hindered by prematurely forcing discretization.

What the ELF GitHub Repository Contains

ELF's repository is designed for modern TPU environments and showcases its technical innovations. The official repository, hosted at github.com/lillian039/ELF, was launched on May 13, 2026. It features a JAX-based implementation optimized for TPU v5p-64, with a PyTorch version planned. The repository includes three pretrained models (ELF-B at 105M, ELF-M at 342M, and ELF-L at 652M) available on HuggingFace under the "embedded-language-flows" organization.

The repository structure is straightforward:

src/: Contains core utilities for training, evaluation, and data processing (e.g., train.py, eval.py, data_utils.py).
configs/: YAML files for training configurations, such as train_owt_ELF-B.yml.
requirements.txt: Lists dependencies for JAX and TPU environments.

Evaluation tools for perplexity and entropy are provided, along with configurations for benchmarks like OpenWebText, WMT14 (German-English translation), and XSum (news summarization). These resources directly support ELF's claims of efficiency and quality.

How ELF Works: Core Architecture

ELF's framework revolves around three key design choices: operating within a continuous embedding space, adopting x-prediction flow matching, and postponing discretization. These elements collectively set ELF apart from both autoregressive models and earlier diffusion-based language models. Here is each component in detail.

Continuous Embedding Space and Frozen Encoders

ELF does not directly manipulate token IDs. Instead, it transforms discrete tokens into continuous vector representations through a frozen, pretrained T5 encoder. This encoder acts solely as a static embedding function, delivering high-dimensional, bidirectional contextual representations of input tokens. The denoising and decoding tasks are handled by a single network, which switches between the two functions using a binary "mode" token. During training, the loss is allocated as follows: 80% for denoising, measured by mean squared error, and 20% for decoding, measured by cross-entropy. Notably, the T5 encoder is only used during training and is excluded during inference.

Flow Matching and Delayed Discretization

With its continuous embedding setup, ELF employs rectified flows to define the noisy latent state at time t using a linear interpolation: z_t = t * x + (1 - t) * eps. While traditional flow matching predicts the velocity field (v-prediction), ELF takes a different route by predicting the clean embedding directly (x-prediction). This approach proves more stable when dealing with high-dimensional representations, such as 768 dimensions, by focusing directly on the desired clean output.

Discretization is deferred until t = 1, keeping the model in the continuous space until the very last step. As the ELF paper explains:

"ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network." (Keya Hu et al.)

This delayed discretization method avoids the inefficiencies and rigidity of token-level supervision at intermediate steps, a common limitation in earlier continuous diffusion models. ELF also incorporates self-conditioning, where predictions from the previous step are used as additional context. This enables classifier-free guidance natively without requiring extra forward passes during inference.

Performance Benchmarks and Model Variants

ELF comes in three sizes, all trained on a dataset of 45 billion tokens. Its performance metrics are summarized below:

Model	Parameters	Sampling Steps	Gen. PPL (OpenWebText)
ELF-B	105M	32	24.1
ELF-M	342M	64	21.7
ELF-L	652M	64	23.3

Among these, ELF-M shows the best perplexity at 21.7. Interestingly, ELF-L shows diminishing returns with its slightly higher perplexity of 23.3, suggesting that scaling beyond 342 million parameters may not yield proportional improvements at this training scale. On conditional tasks, ELF-B achieves a BLEU score of 26.4 for WMT14 German-English translation, outperforming an autoregressive baseline of similar size, which scores 25.2. For the XSum summarization task, ELF leads across ROUGE-1, ROUGE-2, and ROUGE-L metrics compared to other diffusion-based models. Additionally, its SDE-inspired sampler consistently delivers better perplexity results than the ODE variant under the same step constraints.

Getting Started with ELF from GitHub

Once you are familiar with ELF's architecture and performance, here is how to set up your environment and start using ELF effectively.

Hardware, Software, and Dependency Requirements

ELF is built using JAX and designed specifically for TPU v5p-64 hardware. The model's performance and results, as detailed in the paper, were achieved using a TPU v5p-64 cluster. While a PyTorch version was announced on May 13, 2026, it has not yet been released.

Dependencies for ELF are managed through a requirements.txt file, which includes JAX, TPU-specific libraries, and Hugging Face Transformers. For tracking experiments, you will need a WandB API key, though you can disable logging by setting use_wandb: false in the CLI.

There are some practical constraints to consider. For example:

ELF-B (105M parameters) can run with a global batch size of 512.
Larger models like ELF-M (342M) and ELF-L (652M) require reducing the batch size to 64 to fit on the same hardware.

Training ELF-B for five epochs on OpenWebText (approximately 95,000 steps) takes about 1.5 hours per epoch on a TPU v5p-64 cluster.

Cloning the Repository and Understanding Its Structure

To get started, clone the repository and install the necessary dependencies:

git clone https://github.com/lillian039/ELF
cd ELF
pip install -r requirements.txt
wandb login  # skip if use_wandb: false

The repository's structure includes the following key directories:

src/: Contains the main scripts you will use (train.py and eval.py).
configs/: Divided into training_configs/ for task and model-specific YAML overrides, and sampling_configs/ for parameters like ODE/SDE schedules and guidance scales.
assets/: Includes visualizations of denoising trajectories, which are useful for understanding the model's behavior but not required for running the code.

Scripts should always be executed from the src/ directory so relative paths resolve correctly.

Once the repository is set up, you can load pretrained models and tweak configuration files to start experimenting with ELF.

Loading Pretrained Models and Configuration Files

ELF's official checkpoints are hosted on Hugging Face under the embedded-language-flows organization. These are automatically fetched at runtime, so there is no need for manual downloads. Use the --checkpoint_path flag to specify the desired model:

cd src/
python eval.py --checkpoint_path embedded-language-flows/ELF-B-owt

Here is a quick look at the available checkpoints:

Model	Parameters	Task	Hugging Face ID
ELF-B	105M	Unconditional (OpenWebText)	`embedded-language-flows/ELF-B-owt`
ELF-M	342M	Unconditional (OpenWebText)	`embedded-language-flows/ELF-M-owt`
ELF-L	652M	Unconditional (OpenWebText)	`embedded-language-flows/ELF-L-owt`
ELF-B-de-en	105M	Translation (WMT14 De-En)	`embedded-language-flows/ELF-B-de-en`
ELF-B-xsum	105M	Summarization (XSum)	`embedded-language-flows/ELF-B-xsum`

The configuration system is structured in two layers:

A base Config dataclass in configs/config.py.
YAML overrides for specific tasks in configs/training_configs/.

Sampling parameters, such as step counts, SDE vs. ODE, and guidance scale, are managed separately in configs/sampling_configs/. You can override any parameter directly from the command line without modifying the YAML files. For instance:

--config_override global_batch_size=64

To confirm your setup is working, run a quick sanity check with ELF-B on OpenWebText. If eval.py completes successfully and returns a generative perplexity close to 24.1, your environment is ready to go.

Using ELF for Language Tasks

Unconditional Text Generation

To generate text with ELF, you will use the eval.py script, which serves as the main entry point for both evaluation and generation tasks. The command below will get you started:

cd src/
python eval.py \
  --config configs/training_configs/train_owt_ELF-B.yml \
  --checkpoint_path embedded-language-flows/ELF-B-owt

By default, this generates 1,000 samples and evaluates them against a pretrained GPT-2 Large model, calculating metrics like Gen. PPL (perplexity) and unigram entropy. Sampling configurations are stored in configs/sampling_configs/uncond_sampling_configs.yml. This file includes two predefined schedules: a 32-step SDE with gamma=1.5 and a 64-step SDE with gamma=1.0, both using a self-conditioning CFG scale of 3. Adjusting the CFG scale impacts the balance between perplexity and entropy, reflecting the trade-off between quality and diversity.

For larger models like ELF-M and ELF-L, include an override for global batch size to avoid running out of memory:

--config_override global_batch_size=64

Once you have mastered unconditional generation, you can move on to conditioning ELF for specific language tasks.

Conditional Tasks: Translation and Summarization

Conditional text generation uses the same eval.py script but requires task-specific configurations and checkpoints. For instance, to perform German-to-English translation using the WMT14 dataset:

cd src/
python eval.py \
  --config configs/training_configs/train_de-en_ELF-B.yml \
  --checkpoint_path embedded-language-flows/ELF-B-de-en

Here, the model conditions its output by prepending clean embeddings of the source text and using self-attention for guidance. Discretization is still limited to the first time step (t=1). The default conditional sampling configuration (cond_sampling_configs.yml) opts for a 64-step ODE schedule with a CFG scale of 2 and a self-conditioning CFG scale of 1. This setup is more constrained than the unconditional one, as the source sequence already narrows the generation scope. Using this method, ELF-B achieves 26.4 BLEU on WMT14 De-En and 36.0 ROUGE-1 on XSum.

"This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG)." (lillian039/ELF Repository)

For summarization tasks, you will need to swap in the appropriate configuration and checkpoint for XSum:

python eval.py \
  --config configs/training_configs/train_xsum_ELF-B.yml \
  --checkpoint_path embedded-language-flows/ELF-B-xsum

If you are working with new datasets or language pairs, tokenize the data using the T5 tokenizer and save it as a Hugging Face Arrow dataset. Make sure the dataset includes input_ids (target) and condition_input_ids (source) columns. Since the T5-small encoder is frozen by default, there is no need to retrain for these tasks.

With these tools, ELF can handle a variety of text generation tasks and fit into broader systems.

Adding ELF to Larger AI Pipelines

ELF's modular architecture simplifies its integration into larger AI workflows. The system relies on a frozen T5-small encoder for tokenization and embedding, a flow matching network for denoising, and a shared-weight unembedding head for discretization at the final step. Unlike traditional models, ELF does not use a separately trained decoder. However, its generation process spans 32 to 64 ODE/SDE steps before returning a single token, so latency planning is essential.

For high-throughput scenarios, increase batching. On a TPU v5p-64, ELF-B supports a global batch size of 512, generating a large volume of samples at 32 SDE steps. For smaller setups or when using ELF-M/ELF-L, reduce the batch size to 64, which will naturally increase wall-clock time per batch. The 32-step SDE schedule is better suited for latency-sensitive applications, while the 64-step schedule is ideal for offline or asynchronous tasks where quality takes precedence.

As of May 2026, ELF does not have a PyTorch version. This means it cannot be directly integrated into a PyTorch-based serving stack. Instead, you will need to run ELF as a separate inference service, communicating via HTTP or a message queue rather than importing it as a library.

Advanced Configuration and Customization

Tuning Flow Matching and Sampling Parameters

In ELF, sampling operates independently from training, giving you the flexibility to adjust inference settings without touching the model weights. These settings are managed through YAML files located in configs/sampling_configs/.

Two key factors to focus on are the sampler type and the CFG scale. The SDE sampler generally outperforms the ODE Euler sampler in generative perplexity at comparable step counts. For example, ELF-B achieves a generative perplexity of around 24.1 with just 32 SDE steps on OpenWebText. Below is a table summarizing recommended configurations for different task types:

Task	Sampler	Steps	CFG Scale	Self-Conditioning CFG Scale
Unconditional (OWT)	SDE	32 / 64	N/A	3
Translation (De-En)	ODE	64	2	1
Summarization (XSum)	ODE	64	2	1

For unconditional generation, the noise injection scale (gamma) plays a critical role. At 32 steps, the default value is gamma=1.5, while for 64 steps, it drops to gamma=1.0. Adjusting the CFG scale can help tighten perplexity and improve coherence, though it may reduce output diversity.

Two parameters related to training configurations also affect how the model allocates attention across the flow trajectory: denoiser_p_mean (default -1.5) and denoiser_p_std (default 0.8). These parameters control the logit-normal time schedule. Moving denoiser_p_mean closer to zero emphasizes mid-trajectory denoising, while more negative values focus on early-stage noise removal. Additionally, increasing decoder_prob from its default of 0.2 can boost fluency by raising the weight of token-level cross-entropy loss during the final discretization step.

Once you have optimized these sampling parameters, you can proceed to fine-tune ELF for specific domains.

Fine-Tuning ELF on Domain-Specific Data

Fine-tuning builds on the base setup, letting you adapt ELF to specialized tasks while keeping its core structure intact. Start by pre-tokenizing your datasets using the T5 tokenizer and saving them as Hugging Face Arrow datasets. For unconditional tasks, include an input_ids column. For conditional tasks, add both input_ids and condition_input_ids columns. Note that the T5-small encoder remains frozen throughout the process.

Fine-tuning configurations are managed through task-specific YAML files that override the default Config dataclass in configs/config.py. Begin by duplicating an existing configuration file, such as train_owt_ELF-B.yml for general-domain, unconditional tasks, and then modify parameters like data_path, output_dir, and relevant hyperparameters. The default optimizer is Muon, with a base learning rate (blr) of 0.001, resulting in an effective learning rate of 0.002 for a global batch size of 512. Running ELF-B on OpenWebText for one epoch on a TPU v5p-64 takes approximately 1.5 hours at this batch size, so plan your compute budget accordingly.

For smaller setups, gradient accumulation can help maintain an effective large batch size. Note that the upstream ELF training code targets TPU v5p-64; running on GPUs is not officially supported until the PyTorch port lands. Keep an eye on validation loss. If it stagnates or increases before the scheduled epoch count, consider stopping early. Fine-tuning on narrow domain datasets can lead to catastrophic forgetting, and the current ELF repository does not include built-in safeguards against this.

Measuring and Evaluating ELF Output

Once the model is fine-tuned, validate its performance using established benchmarks. Start by running eval.py with the official Hugging Face checkpoints to make sure your environment reproduces the published results. Only then should you compare your fine-tuned versions.

Published benchmarks for ELF models include:

ELF-B (105M): ~24.1 generative perplexity, ~5.15 unigram entropy; 26.4 BLEU on WMT14 De-En; 27.8 ROUGE-L on XSum
ELF-M (342M): 21.7 generative perplexity, 5.18 unigram entropy

These serve as baselines. If your fine-tuned model performs significantly worse on general benchmarks, it might be overfitting to the domain-specific data.

For specialized areas like legal or medical text, standard metrics like ROUGE and BLEU may not capture subtle nuances. In such cases, semantic similarity metrics like BERTScore are more effective for evaluating paraphrasing quality. For high-stakes scenarios, you can even use advanced models as judges, scoring outputs based on a custom rubric to catch subtle issues that traditional metrics might miss.

Verdict: Should You Use ELF?

ELF delivers impressive results with compact model sizes ranging from 105M to 652M parameters, while matching the performance of discrete diffusion models trained on approximately 500 billion tokens, despite using only 45 billion tokens for training. For instance, on the WMT14 De-En translation benchmark, ELF-B achieves a BLEU score of 26.4, outperforming similar-scale autoregressive models (25.2) and discrete diffusion competitors like MDLM (18.4). Yet, while these metrics highlight its potential, practical limitations in its implementation are worth noting.

Currently, ELF's official implementation is restricted to JAX and has been tested exclusively on TPU v5p-64. A PyTorch version is not yet available, which means NVIDIA GPU users will need to hold off for now. To stay updated on progress, you can monitor the ELF GitHub repository before committing to any integration.

Here is a quick breakdown of who might benefit from ELF now and who should wait:

Who should use ELF now	Who should wait
Researchers using Google Cloud TPUs	Teams reliant on NVIDIA GPUs awaiting PyTorch support
Teams investigating non-autoregressive generation methods	Projects requiring models at billion-parameter scale
Applications needing parallel decoding and global revision	Pipelines that depend on standard likelihood evaluation

That said, ELF does come with some technical constraints. Its reliance on a frozen T5-small encoder can limit domain adaptation, as mismatches between the encoder and decoder can arise. Another challenge is catastrophic forgetting when fine-tuning on narrow datasets, an issue that remains unresolved in the current codebase.

If you are exploring alternatives to autoregressive generation or require parallel decoding with bidirectional context, ELF provides a training-efficient option. As Linlu Qiu, Co-first Author from MIT, aptly remarked:

"Language is discrete, but language models don't have to be."

For teams proficient in JAX and working on TPUs, ELF is a strong contender for non-autoregressive generation tasks. However, GPU-only teams are better off waiting for the PyTorch port.

FAQs

How does ELF turn continuous embeddings back into tokens at the end?

ELF uses a single shared-weight network that handles both denoising and decoding, switching modes via a binary "mode" token. Only at the final time step (t=1) does it map the denoised continuous embedding back to a discrete token. Earlier methods like Diffusion-LM mapped to discrete vocabularies at every step, which ELF's authors argue prevents the continuous flow dynamics from working properly.

What hardware is needed to run or fine-tune ELF today?

As of May 2026, the official ELF codebase is JAX-only and was tested on a TPU v5p-64 cluster. The paper's reported runtimes (about 1.5 hours per OpenWebText epoch for ELF-B at global batch size 512) assume that hardware. NVIDIA GPU users have to wait for the announced PyTorch port; trying to run the current JAX code on a single GPU is not supported and the larger models (ELF-M at 342M, ELF-L at 652M) require batch size 64 even on a full TPU v5p-64.

When should I pick the SDE sampler vs. the ODE sampler?

Use the SDE sampler for unconditional generation. It consistently delivers lower generative perplexity than ODE at the same step count (ELF-B hits 24.1 PPL at 32 SDE steps on OpenWebText). Use the ODE sampler for conditional tasks like WMT14 translation and XSum summarization, where the default config is 64 ODE steps with CFG scale 2; the source sequence already constrains the output, so the deterministic trajectory is more efficient.

How many training tokens does ELF need vs. discrete diffusion?

ELF reaches comparable generative perplexity on OpenWebText using ~45 billion training tokens, compared to roughly 500 billion tokens for discrete diffusion baselines like MDLM. That is roughly an 11x reduction, which the authors attribute to staying in the continuous embedding space until the final discretization step instead of forcing token-level supervision at every diffusion step.

ELF: Kaiming He's MIT Team Just Brought Diffusion Models to Language

Text diffusion: A new paradigm for LLMs

ELF and the Problem It Solves

Why Diffusion Models Matter for Language

Who Built ELF and Why It Matters

What the ELF GitHub Repository Contains

How ELF Works: Core Architecture

Continuous Embedding Space and Frozen Encoders

Flow Matching and Delayed Discretization

Performance Benchmarks and Model Variants

Getting Started with ELF from GitHub

Hardware, Software, and Dependency Requirements

Cloning the Repository and Understanding Its Structure

Loading Pretrained Models and Configuration Files

Using ELF for Language Tasks

Unconditional Text Generation

Conditional Tasks: Translation and Summarization

Adding ELF to Larger AI Pipelines

Advanced Configuration and Customization

Tuning Flow Matching and Sampling Parameters

Fine-Tuning ELF on Domain-Specific Data

Measuring and Evaluating ELF Output

Verdict: Should You Use ELF?

FAQs

How does ELF turn continuous embeddings back into tokens at the end?

What hardware is needed to run or fine-tune ELF today?

When should I pick the SDE sampler vs. the ODE sampler?

How many training tokens does ELF need vs. discrete diffusion?

Author

Harry Richter

On this page

Related Posts

How to Make Xiaohongshu Carousels with Claude (Guizang)

PilotDeck Review: OpenBMB's Open-Source Agent OS Explained

GSD Redux Tutorial: Claude Code Without Context Rot

Text diffusion: A new paradigm for LLMs

ELF and the Problem It Solves

Why Diffusion Models Matter for Language

Who Built ELF and Why It Matters

What the ELF GitHub Repository Contains

How ELF Works: Core Architecture

Continuous Embedding Space and Frozen Encoders

Flow Matching and Delayed Discretization

Performance Benchmarks and Model Variants

Getting Started with ELF from GitHub

Hardware, Software, and Dependency Requirements

Cloning the Repository and Understanding Its Structure

Loading Pretrained Models and Configuration Files

Using ELF for Language Tasks

Unconditional Text Generation

Conditional Tasks: Translation and Summarization

Adding ELF to Larger AI Pipelines

Advanced Configuration and Customization

Tuning Flow Matching and Sampling Parameters

Fine-Tuning ELF on Domain-Specific Data

Measuring and Evaluating ELF Output

Verdict: Should You Use ELF?

FAQs

How does ELF turn continuous embeddings back into tokens at the end?

What hardware is needed to run or fine-tune ELF today?

When should I pick the SDE sampler vs. the ODE sampler?

How many training tokens does ELF need vs. discrete diffusion?

Related Blog Posts

Comments

Author

Harry Richter

On this page

Related Posts

How to Make Xiaohongshu Carousels with Claude (Guizang)

PilotDeck Review: OpenBMB's Open-Source Agent OS Explained

GSD Redux Tutorial: Claude Code Without Context Rot