Our approach is structured into three distinct stages:
In Stage 0, we train VAE models (similar to the SDXL VAE) with continuous latents to perceptually compress images into compact 2D token grids, thereby reducing computational complexity.
Then, in Stage 1, the FlexTok tokenizer converts these continuous 2D grids into discrete 1D token sequences. This stage leverages a Transformer encoder, FSQ discretization, and a rectified flow decoder—combined with nested dropout—to produce a hierarchical bottleneck representation.
Finally, in Stage 2, we train autoregressive models for class-conditional and text-conditional image generation, which enables us to evaluate tokenizer design choices using the downstream generative performance. The following sections provide detailed insights into the FlexTok architecture and the evaluation design choices employed in these stages.
The FlexTok architecture features a Transformer encoder and decoder trained with an autoencoding objective, using 256 register tokens that act as a 1D bottleneck representation of the input image. A 6-dimensional FSQ discretization step with levels [8, 8, 8, 5, 5, 5] is applied to the registers, resulting in an effective vocabulary size of 64,000. To learn an ordering, we perform nested dropout on the register tokens during training, which encourages the encoder to encode image content in a hierarchical manner and enables the decoder to reconstruct images from any leading subsequence (prefix) of the tokens.
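For intuition, here is a minimal PyTorch sketch of FSQ-style quantization with these levels and of nested dropout over the register tokens. It is a simplification rather than the FlexTok implementation: the half-level offset that FSQ uses for even numbers of levels is omitted, and the cutoff distribution for nested dropout is assumed to be uniform here.

```python
import torch

FSQ_LEVELS = torch.tensor([8, 8, 8, 5, 5, 5])  # 8*8*8*5*5*5 = 64,000 codes


def fsq_quantize(z: torch.Tensor) -> torch.Tensor:
    """Bound each of the 6 channels, snap it to its grid of levels, and pass
    gradients to the encoder with a straight-through estimator."""
    half = (FSQ_LEVELS - 1) / 2.0
    bounded = torch.tanh(z) * half                    # channel i lies in [-half_i, half_i]
    quantized = torch.round(bounded)                  # nearest grid point
    return bounded + (quantized - bounded).detach()   # straight-through gradient


def nested_dropout(registers: torch.Tensor) -> torch.Tensor:
    """Zero out all register tokens after a randomly sampled cutoff, so that
    earlier tokens must carry the coarse image content."""
    b, n, _ = registers.shape
    keep = torch.randint(1, n + 1, (b,))              # per-sample cutoff k
    mask = torch.arange(n)[None, :] < keep[:, None]   # keep tokens [0, k)
    return registers * mask[..., None]


registers = torch.randn(4, 256, 6)                    # (batch, registers, FSQ dims)
tokens = nested_dropout(fsq_quantize(registers))
```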
To ensure high-fidelity outputs no matter the number of tokens used, the decoder is a rectified flow model that receives noised VAE latent patches and the (randomly masked) registers as input to predict the flow. AdaLN-zero conditions the patches and registers separately on the current timestep. In addition to the rectified flow objective, we apply a REPA inductive bias loss (using DINOv2-L) on the intermediate decoder features, which greatly accelerates convergence.
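A minimal sketch of the rectified flow objective is shown below; the `decoder` signature is a placeholder, and the REPA term, timestep weighting, and AdaLN-zero details are omitted.

```python
import torch
import torch.nn.functional as F


def rectified_flow_loss(decoder, vae_latents, registers):
    """One training step of the flow objective: interpolate between the clean
    VAE latents and Gaussian noise, and regress the velocity."""
    noise = torch.randn_like(vae_latents)
    t = torch.rand(vae_latents.shape[0], device=vae_latents.device)
    t_ = t.view(-1, 1, 1, 1)                       # broadcast over latent dims
    x_t = (1.0 - t_) * vae_latents + t_ * noise    # noised input to the decoder
    target = noise - vae_latents                   # velocity along the straight path
    pred = decoder(x_t, registers, t)              # registers may be truncated/masked
    return F.mse_loss(pred, target)
```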
The use of 2x2 patchification in both encoder and decoder, combined with the VAE's 8x downsampling, yields a total 16x downsampling from pixels to patch tokens. All models are trained at a resolution of 256x256 pixels. We train three FlexTok sizes, parameterized by the number of layers d in the encoder and decoder. Setting the width w = 64d, we train models with encoder and decoder depth d12-d12, d18-d18, and d18-d28.
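The following snippet summarizes these configurations and the token-grid arithmetic; exact widths, head counts, and other hyperparameters follow the paper, and the w = 64d rule is assumed to apply to encoder and decoder independently.

```python
# Illustrative summary of the three FlexTok sizes described above.
SIZES = {
    "d12-d12": {"enc_depth": 12, "dec_depth": 12},
    "d18-d18": {"enc_depth": 18, "dec_depth": 18},
    "d18-d28": {"enc_depth": 18, "dec_depth": 28},
}


def width(d: int) -> int:
    return 64 * d  # e.g. d = 18 gives a 1152-dimensional Transformer


# 256x256 pixels -> 32x32 VAE latents (8x) -> 16x16 patch tokens (2x2 patchify),
# i.e. 256 patch tokens enter the encoder.
grid = 256 // 8 // 2
assert grid == 16 and grid * grid == 256
```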
To evaluate different design choices and compare to relevant baselines, we measure both reconstruction and conditional image generation performance. We train autoregressive Transformers for class-conditional generation on ImageNet-1k and text-to-image generation on DFN-2B. Our autoregressive Transformer follows a Llama-inspired architecture that employs pre-normalization with RMSNorm and a SwiGLU feedforward. Since our tokens lack a 2D grid structure, we use learned absolute positional embeddings instead of 2D RoPE.
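A minimal sketch of the pre-norm RMSNorm and SwiGLU feedforward pattern follows; the attention sub-block and exact hidden sizes are omitted, and this is illustrative rather than the exact implementation used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization, used for pre-normalization."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """Gated feedforward: silu(x W1) * (x W2), projected back to the model width."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))


class PreNormFFNBlock(nn.Module):
    """Pre-norm residual block; the attention sub-block is omitted for brevity."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden)

    def forward(self, x):
        return x + self.ffn(self.norm(x))
```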
For class conditioning, we add a learned class embedding to an [SOI] token and concatenate it with the image token sequence. The AR models, ranging from 49M to 1.3B parameters, predict the token sequences produced by the FlexTok tokenizer.
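A sketch of how such a conditioning prefix could be assembled is shown below; the embedding dimension and module names are illustrative, not the actual implementation.

```python
import torch
import torch.nn as nn


class ClassConditionedPrefix(nn.Module):
    """Builds the AR input sequence: ([SOI] + class embedding) followed by the
    embedded FlexTok tokens, plus learned absolute positional embeddings."""
    def __init__(self, num_classes: int = 1000, dim: int = 1024, seq_len: int = 256):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, dim)
        self.soi = nn.Parameter(torch.zeros(1, 1, dim))                # [SOI] token
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len + 1, dim))  # learned absolute positions

    def forward(self, class_ids, token_embs):
        prefix = self.soi + self.class_emb(class_ids)[:, None, :]      # (B, 1, dim)
        x = torch.cat([prefix, token_embs], dim=1)                     # prepend to image tokens
        return x + self.pos_emb[:, : x.shape[1]]
```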
For text-conditioned generation, the AR decoder cross-attends to text embeddings from FLAN-T5-XL, which are projected to the model dimension via an MLP. We scale these models up to 3B parameters using μP to maintain consistent behavior across scales.
Following standard practice, we employ conditioning dropout during training to enable classifier-free guidance at inference, with text-conditioned models randomly replacing text inputs with an empty string.
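A minimal sketch of conditioning dropout and guided decoding follows; `ar_model`, the embedding arguments, the dropout probability, and the guidance scale are placeholders.

```python
import torch


def maybe_drop_text(captions, p_drop=0.1):
    """Training-time conditioning dropout: replace a caption with the empty
    string with probability p_drop."""
    return ["" if torch.rand(()).item() < p_drop else c for c in captions]


@torch.no_grad()
def cfg_logits(ar_model, prefix_tokens, text_emb, null_emb, scale=2.0):
    """Inference-time classifier-free guidance on the next-token logits."""
    cond = ar_model(prefix_tokens, text_emb)      # conditioned on the prompt
    uncond = ar_model(prefix_tokens, null_emb)    # conditioned on the empty string
    return uncond + scale * (cond - uncond)
```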
We show that FlexTok can effectively compress images into 1D sequences of flexible length, establishing a novel “visual vocabulary” where images can be specified and generated in a coarse-to-fine manner. To validate this capability, we evaluate the reconstruction performance on nested token sequences of varying lengths, both qualitatively and quantitatively.
Consider the visual examples below. We take out-of-distribution images generated with Midjourney 6.1, and encode them into a 1D sequence with a FlexTok d18-d28 model that was trained on DFN. We then show reconstructions from the rectified flow decoder using different truncation thresholds, ranging from 1 to 256 tokens. Notice how most of the images' semantic and geometric content is captured by fewer than 16 tokens. The first tokens already capture the high-level semantic concepts (e.g., gray bird, people in colorful garments, mountain scene, yellow flower, etc.), while more tokens are required to reconstruct more intricate scene details (e.g., position and clothing of every person, brushstroke placement, etc.). When using all 256 tokens, depending on the image complexity, the reconstructions are nearly indistinguishable from the original images. We note here that more complex images may require significantly more tokens to reconstruct well.
Using FlexTok models trained on ImageNet-1k, we measure reconstruction performance across a range of token counts, from a single token up to 256 tokens, measuring rFID, MAE, and DreamSim scores. This analysis gives us an idea of the rate-distortion tradeoff provided by FlexTok.
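This sweep can be summarized as follows; `flextok.encode`, `flextok.decode`, and the metric computations are placeholders for the actual implementations, and only MAE is shown.

```python
import torch


@torch.no_grad()
def reconstruction_sweep(flextok, images, token_counts=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Encode once, then reconstruct from nested prefixes of the token sequence."""
    tokens = flextok.encode(images)               # (batch, 256) ordered 1D tokens
    results = {}
    for k in token_counts:
        recon = flextok.decode(tokens[:, :k])     # decode from the first k tokens only
        results[k] = {"mae": (recon - images).abs().mean().item()}
        # rFID and DreamSim would be accumulated over the full evaluation set here.
    return results
```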
Our reconstruction FID (rFID) evaluations indicate that the FlexTok rectified flow decoder is capable of generating visually plausible images at all compression rates, even when given as few as a single token as conditioning. It is worth noting that image fidelity improves significantly as we scale the model size, and we expect that more powerful generative decoders could further flatten this curve.
The per-image metrics MAE and DreamSim indicate that as more tokens are added, the reconstructions become increasingly fine-grained, with improved alignment to the original image. For both metrics we observe a roughly linear improvement with each doubling of the number of tokens. In contrast to the rFID evaluation, scaling the encoder and decoder sizes is not as crucial for improving MAE and DreamSim scores. This suggests that scaling the number of tokens, i.e. increasing the bottleneck size, is a more effective way of improving reconstruction performance for complex images than scaling the model size further.
FlexTok's architecture consists of a register encoder and decoder, similar to TiTok. Unlike TiTok, which requires training a separate model for each desired compression rate, FlexTok enables flexible encoding and decoding with a single model. FlexTok is conceptually similar to other concurrent work, namely ElasticTok, ALIT, and One-D-Piece. We discuss the differences to those works in detail in the paper, and show some visual comparisons below.
In this part, we explore three key aspects of autoregressive image generation using FlexTok tokens. First, we examine how FlexTok token sequences serve as a “visual vocabulary” that enables coarse-to-fine generation, where images are progressively refined with increasing specificity. Second, we investigate the effect of simple versus complex conditioning on token requirements, revealing that different conditioning signals necessitate different numbers of tokens. Third, we discuss the impact of scaling the autoregressive model size on generation quality. Together, these illustrate how predicting FlexTok token sequences can yield high-fidelity, condition-specific images.
As discussed in the Flexible-length tokenization section above, FlexTok compresses images into ordered token sequences, which leads us to explore the implications of predicting these sequences for autoregressive image generation. By training class- and text-conditional models, we find that FlexTok token sequences act as a “visual vocabulary”, allowing autoregressive models to describe images with increasing levels of specificity. Unlike conventional autoregressive models that generate images in a fixed raster-scan order on 2D token grids, our approach enables progressive refinement of image details. In the figure below, we demonstrate that images generated by class- or text-conditional models become increasingly more specific to their conditioning as more tokens are produced.
Quantitative results using 1.3B AR models confirm that the alignment between the conditioning signal and the generated images (i.e., the specificity) improves with higher token counts. In our experiments, the left subplot of the figure below shows that for class-conditional generation using a 1.3B AR model, the top-1 classification accuracy (as measured by DINOv2-L on the predictions) improves with additional tokens but plateaus around 32 tokens. In contrast, the center plot demonstrates that for text-conditioned generation with a 3B AR model, the image-text alignment (as measured by CLIPScore) continues to improve as more tokens are generated, indicating that more complex conditioning benefits from longer token sequences.
The right subplot shows that generation quality, measured by gFID for both class- and text-conditional cases, remains relatively consistent across all token sequence lengths, which we attribute to our rectified flow decoder. Overall, these results confirm that simple conditions like ImageNet class labels can be fulfilled with as few as 16 tokens, while open-ended text prompts benefit from generating up to 256 tokens.
Compared to training a 3B AR model on more classical 2D-grid tokens, we find that AR models predicting FlexTok tokens perform similarly when the full 256 tokens are used. That said, they are more versatile, as they allow for generating images that match the input conditioning with a significantly smaller number of tokens. This is especially useful for tasks where the conditioning is simple and image generation should be fast and efficient.
As demonstrated above, simple conditions (like ImageNet-1k class labels) can be satisfied by predicting very few tokens, while more complex ones (like free-form text prompts) can benefit from generating up to the maximum number of tokens. We further investigate this relationship between conditioning complexity and token count with two example prompts, one simple (top rows) and one complex (bottom rows).
The left part of the figure below visually highlights two key observations. First, the number of tokens required to fulfill the conditioning signal varies significantly between simple and detailed prompts. The simple prompt is fulfilled with as few as 4-16 tokens, while the detailed prompt requires predicting at least 64 tokens. Second, with increasing token count, the variation between images decoded from the same token sequence but with different random seeds decreases quickly for simple prompts, while detailed prompts maintain meaningful variation even with more tokens. In other words, when a given token sequence is underspecified for a detailed prompt, the FlexTok rectified flow decoder compensates for the lack of specificity and still produces realistic outputs, albeit with greater semantic variation between images decoded with different random seeds.
We demonstrate the same observations quantitatively on the right side of the figure. The top subplot shows that image-text alignment (measured by CLIPScore) improves with more tokens but plateaus earlier for the simple prompt ("a single red apple on a white background") than for the detailed prompt ("graffiti of a rocket ship on a brick wall"). The bottom subplot shows that image variation between different generations (measured by pairwise DreamSim scores) decreases more rapidly with increasing token counts for simple prompts than for detailed prompts. This discrepancy underscores an intrinsic relationship between the complexity of the conditioning signal and the number of tokens required to faithfully generate that specific image.
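This variation measurement can be sketched as follows; `flextok.decode` and `dreamsim` are placeholders for the decoder and the metric implementation used.

```python
import itertools
import torch


@torch.no_grad()
def seed_variation(flextok, token_prefix, dreamsim, num_seeds=4):
    """Decode the same truncated token sequence with different seeds and average
    the pairwise DreamSim distances between the resulting images."""
    decodes = []
    for seed in range(num_seeds):
        torch.manual_seed(seed)                      # different decoder noise per seed
        decodes.append(flextok.decode(token_prefix))
    dists = [dreamsim(decodes[i], decodes[j])
             for i, j in itertools.combinations(range(num_seeds), 2)]
    return torch.stack(dists).mean()
```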
We now investigate the scaling behavior of autoregressive class-conditional models trained on FlexTok tokens, focusing on how model size impacts image-caption alignment and image fidelity. As depicted in the figure below, increasing the AR model size consistently reduces training loss. Notably, for the prediction of the first few tokens (1-8), generation quality (as measured by gFID) remains effectively independent of AR model size, suggesting that even smaller models can capture the coarse image details. However, for longer sequences (beyond 128 tokens), both gFID and CLIPScore improve significantly with larger AR models, indicating that these extended sequences require more powerful models to maintain strong performance. This trend underscores a key trade-off: while FlexTok's rectified flow decoder delivers high-quality outputs with few tokens, a strong AR model becomes crucial as more tokens are generated to better match the given condition.
In this work, we demonstrate the potential of a flexible sequence length tokenizer for image reconstruction and generation, enabling high-fidelity outputs with very few tokens. Our experiments show that FlexTok token sequences form a “visual vocabulary” that supports coarse-to-fine generation, where simpler conditions can be met with few tokens and more complex ones require generating longer sequences.
Looking ahead, we anticipate that FlexTok-like tokenizers, which adapt to the intrinsic complexity of the input data, could be applicable to other domains with high redundancy, such as audio and video. Training generative models on representations that can be both very compact and semantic, or very long and detailed, may enable further explorations into long-horizon video generation, understanding, as well as visual reasoning.
@article{flextok,
title={{FlexTok}: Resampling Images into 1D Token Sequences of Flexible Length},
author={Roman Bachmann and Jesse Allardice and David Mizrahi and Enrico Fini and O{\u{g}}uzhan Fatih Kar and Elmira Amirloo and Alaaeldin El-Nouby and Amir Zamir and Afshin Dehghan},
journal={arXiv 2025},
year={2025},
}
We thank Justin Lazarow and Miguel Angel Bautista Martin for their valuable feedback on earlier versions of this work.