Our approach is structured into three distinct stages:
In Stage 0, we train VAE models (similar to the SDXL VAE) with continuous latents to perceptually compress images into compact 2D token grids, thereby reducing computational complexity.
Then, in Stage 1, the FlexTok tokenizer converts these continuous 2D grids into discrete 1D token sequences. This stage leverages a Transformer encoder, FSQ discretization, and a rectified flow decoder—combined with nested dropout—to produce a hierarchical bottleneck representation.
Finally, in Stage 2, we train autoregressive models for class-conditional and text-conditional image generation, which enables us to evaluate tokenizer design choices using the downstream generative performance. The following sections provide detailed insights into the FlexTok architecture and the evaluation design choices employed in these stages.
The FlexTok architecture features a Transformer encoder and decoder trained with an autoencoding objective, using 256 register tokens that act as a 1D bottleneck representation of the input image. A 6-dimensional FSQ discretization step with levels [8, 8, 8, 5, 5, 5] is applied to the registers, resulting in an effective vocabulary size of 64,000. To learn an ordering, we perform nested dropout on the register tokens during training, which encourages the encoder to encode image content in a hierarchical manner and enables the decoder to reconstruct images from any leading subsequence (prefix) of the tokens.
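For intuition, here is a minimal PyTorch sketch of FSQ-style quantization with these levels and of nested dropout over the register tokens. It is a simplification rather than the FlexTok implementation: the half-level offset that FSQ uses for even numbers of levels is omitted, and the cutoff distribution for nested dropout is assumed to be uniform here.

```python
import torch

FSQ_LEVELS = torch.tensor([8, 8, 8, 5, 5, 5])  # 8*8*8*5*5*5 = 64,000 codes


def fsq_quantize(z: torch.Tensor) -> torch.Tensor:
    """Bound each of the 6 channels, snap it to its grid of levels, and pass
    gradients to the encoder with a straight-through estimator."""
    half = (FSQ_LEVELS - 1) / 2.0
    bounded = torch.tanh(z) * half                    # channel i lies in [-half_i, half_i]
    quantized = torch.round(bounded)                  # nearest grid point
    return bounded + (quantized - bounded).detach()   # straight-through gradient


def nested_dropout(registers: torch.Tensor) -> torch.Tensor:
    """Zero out all register tokens after a randomly sampled cutoff, so that
    earlier tokens must carry the coarse image content."""
    b, n, _ = registers.shape
    keep = torch.randint(1, n + 1, (b,))              # per-sample cutoff k
    mask = torch.arange(n)[None, :] < keep[:, None]   # keep tokens [0, k)
    return registers * mask[..., None]


registers = torch.randn(4, 256, 6)                    # (batch, registers, FSQ dims)
tokens = nested_dropout(fsq_quantize(registers))
```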
To ensure high-fidelity outputs no matter the number of tokens used, the decoder is a rectified flow model that receives noised VAE latent patches and the (randomly masked) registers as input to predict the flow. AdaLN-zero conditions the patches and registers separately on the current timestep. In addition to the rectified flow objective, we apply a REPA inductive bias loss (using DINOv2-L) on the intermediate decoder features, which greatly accelerates convergence.
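A minimal sketch of the rectified flow objective is shown below; the `decoder` signature is a placeholder, and the REPA term, timestep weighting, and AdaLN-zero details are omitted.

```python
import torch
import torch.nn.functional as F


def rectified_flow_loss(decoder, vae_latents, registers):
    """One training step of the flow objective: interpolate between the clean
    VAE latents and Gaussian noise, and regress the velocity."""
    noise = torch.randn_like(vae_latents)
    t = torch.rand(vae_latents.shape[0], device=vae_latents.device)
    t_ = t.view(-1, 1, 1, 1)                       # broadcast over latent dims
    x_t = (1.0 - t_) * vae_latents + t_ * noise    # noised input to the decoder
    target = noise - vae_latents                   # velocity along the straight path
    pred = decoder(x_t, registers, t)              # registers may be truncated/masked
    return F.mse_loss(pred, target)
```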
The use of 2x2 patchification in both encoder and decoder, combined with the VAE's 8x downsampling, yields a total 16x downsampling from pixels to patch tokens. All models are trained at a resolution of 256x256 pixels. We train three FlexTok sizes, parameterized by the number of layers d in the encoder and decoder. Setting the width w = 64d, we train models with encoder and decoder depth d12-d12, d18-d18, and d18-d28.
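The following snippet summarizes these configurations and the token-grid arithmetic; exact widths, head counts, and other hyperparameters follow the paper, and the w = 64d rule is assumed to apply to encoder and decoder independently.

```python
# Illustrative summary of the three FlexTok sizes described above.
SIZES = {
    "d12-d12": {"enc_depth": 12, "dec_depth": 12},
    "d18-d18": {"enc_depth": 18, "dec_depth": 18},
    "d18-d28": {"enc_depth": 18, "dec_depth": 28},
}


def width(d: int) -> int:
    return 64 * d  # e.g. d = 18 gives a 1152-dimensional Transformer


# 256x256 pixels -> 32x32 VAE latents (8x) -> 16x16 patch tokens (2x2 patchify),
# i.e. 256 patch tokens enter the encoder.
grid = 256 // 8 // 2
assert grid == 16 and grid * grid == 256
```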
To evaluate different design choices and compare to relevant baselines, we measure both reconstruction and conditional image generation performance. We train autoregressive Transformers for class-conditional generation on ImageNet-1k and text-to-image generation on DFN-2B. Our autoregressive Transformer follows a Llama-inspired architecture that employs pre-normalization with RMSNorm and a SwiGLU feedforward. Since our tokens lack a 2D grid structure, we use learned absolute positional embeddings instead of 2D RoPE.
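A minimal sketch of the pre-norm RMSNorm and SwiGLU feedforward pattern follows; the attention sub-block and exact hidden sizes are omitted, and this is illustrative rather than the exact implementation used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization, used for pre-normalization."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """Gated feedforward: silu(x W1) * (x W2), projected back to the model width."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(dim, hidden, bias=False)
        self.w3 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))


class PreNormFFNBlock(nn.Module):
    """Pre-norm residual block; the attention sub-block is omitted for brevity."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden)

    def forward(self, x):
        return x + self.ffn(self.norm(x))
```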
For class conditioning, we add a learned class embedding to an [SOI] token and concatenate it with the image token sequence. The AR models, ranging from 49M to 1.3B parameters, predict the token sequences produced by the FlexTok tokenizer.
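A sketch of how such a conditioning prefix could be assembled is shown below; the embedding dimension and module names are illustrative, not the actual implementation.

```python
import torch
import torch.nn as nn


class ClassConditionedPrefix(nn.Module):
    """Builds the AR input sequence: ([SOI] + class embedding) followed by the
    embedded FlexTok tokens, plus learned absolute positional embeddings."""
    def __init__(self, num_classes: int = 1000, dim: int = 1024, seq_len: int = 256):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, dim)
        self.soi = nn.Parameter(torch.zeros(1, 1, dim))                # [SOI] token
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len + 1, dim))  # learned absolute positions

    def forward(self, class_ids, token_embs):
        prefix = self.soi + self.class_emb(class_ids)[:, None, :]      # (B, 1, dim)
        x = torch.cat([prefix, token_embs], dim=1)                     # prepend to image tokens
        return x + self.pos_emb[:, : x.shape[1]]
```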
For text-conditioned generation, the AR decoder cross-attends to text embeddings from FLAN-T5-XL, which are projected to the model dimension via an MLP. We scale these models up to 3B parameters using μP to maintain consistent behavior across scales.
Following standard practice, we employ conditioning dropout during training to enable classifier-free guidance at inference, with text-conditioned models randomly replacing text inputs with an empty string.
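A minimal sketch of conditioning dropout and guided decoding follows; `ar_model`, the embedding arguments, the dropout probability, and the guidance scale are placeholders.

```python
import torch


def maybe_drop_text(captions, p_drop=0.1):
    """Training-time conditioning dropout: replace a caption with the empty
    string with probability p_drop."""
    return ["" if torch.rand(()).item() < p_drop else c for c in captions]


@torch.no_grad()
def cfg_logits(ar_model, prefix_tokens, text_emb, null_emb, scale=2.0):
    """Inference-time classifier-free guidance on the next-token logits."""
    cond = ar_model(prefix_tokens, text_emb)      # conditioned on the prompt
    uncond = ar_model(prefix_tokens, null_emb)    # conditioned on the empty string
    return uncond + scale * (cond - uncond)
```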
We show that FlexTok can effectively compress images into 1D sequences of flexible length, establishing a novel “visual vocabulary” where images can be specified and generated in a coarse-to-fine manner. To validate this capability, we evaluate the reconstruction performance on nested token sequences of varying lengths, both qualitatively and quantitatively.
Consider the visual examples below. We take out-of-distribution images generated with Midjourney 6.1, and encode them into a 1D sequence with a FlexTok d18-d28 model that was trained on DFN. We then show reconstructions from the rectified flow decoder using different truncation thresholds, ranging from 1 to 256 tokens. Notice how most of the images' semantic and geometric content is captured by fewer than 16 tokens. The first tokens already capture the high-level semantic concepts (e.g., gray bird, people in colorful garments, mountain scene, yellow flower, etc.), while more tokens are required to reconstruct more intricate scene details (e.g., position and clothing of every person, brushstroke placement, etc.). When using all 256 tokens, depending on the image complexity, the reconstructions are nearly indistinguishable from the original images. We note here that more complex images may require significantly more tokens to reconstruct well.
Using FlexTok models trained on ImageNet-1k, we measure reconstruction performance across a range of token counts, from a single token up to 256 tokens, measuring rFID, MAE, and DreamSim scores. This analysis gives us an idea of the rate-distortion tradeoff provided by FlexTok.
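This sweep can be summarized as follows; `flextok.encode`, `flextok.decode`, and the metric computations are placeholders for the actual implementations, and only MAE is shown.

```python
import torch


@torch.no_grad()
def reconstruction_sweep(flextok, images, token_counts=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Encode once, then reconstruct from nested prefixes of the token sequence."""
    tokens = flextok.encode(images)               # (batch, 256) ordered 1D tokens
    results = {}
    for k in token_counts:
        recon = flextok.decode(tokens[:, :k])     # decode from the first k tokens only
        results[k] = {"mae": (recon - images).abs().mean().item()}
        # rFID and DreamSim would be accumulated over the full evaluation set here.
    return results
```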
Our reconstruction FID (rFID) evaluations indicate that the FlexTok rectified flow decoder is capable of generating visually plausible images at all compression rates, even when given as few as a single token as conditioning. It is worth noting that image fidelity improves significantly as we scale the model size, and we expect that more powerful generative decoders could further flatten this curve.
The per-image metrics MAE and DreamSim indicate that as more tokens are added, the reconstructions become increasingly fine-grained, with improved alignment to the original image. For both metrics we observe a roughly linear improvement with each doubling of the number of tokens. In contrast to the rFID evaluation, scaling the encoder and decoder sizes is not as crucial for improving MAE and DreamSim scores. This suggests that scaling the number of tokens, i.e. increasing the bottleneck size, is a more effective way of improving reconstruction performance for complex images than scaling the model size further.
FlexTok's architecture consists of a register encoder and decoder, similar to TiTok. Unlike TiTok, which requires training a separate model for each desired compression rate, FlexTok enables flexible encoding and decoding with a single model. FlexTok is conceptually similar to other concurrent work, namely ElasticTok, ALIT, and One-D-Piece. We discuss the differences to those works in detail in the paper, and show some visual comparisons below.
In this part, we explore three key aspects of autoregressive image generation using FlexTok tokens. First, we examine how FlexTok token sequences serve as a “visual vocabulary” that enables coarse-to-fine generation, where images are progressively refined with increasing specificity. Second, we investigate the effect of simple versus complex conditioning on token requirements, revealing that different conditioning signals necessitate different numbers of tokens. Third, we discuss the impact of scaling the autoregressive model size on generation quality. Together, these illustrate how predicting FlexTok token sequences can yield high-fidelity, condition-specific images.
As discussed in the Flexible-length tokenization section above, FlexTok compresses images into ordered token sequences, which leads us to explore the implications of predicting these sequences for autoregressive image generation. By training class- and text-conditional models, we find that FlexTok token sequences act as a “visual vocabulary”, allowing autoregressive models to describe images with increasing levels of specificity. Unlike conventional autoregressive models that generate images in a fixed raster-scan order on 2D token grids, our approach enables progressive refinement of image details. In the figure below, we demonstrate that images generated by class- or text-conditional models become increasingly more specific to their conditioning as more tokens are produced.
Quantitative results using 1.3B AR models confirm that the alignment between the conditioning signal and the generated images (i.e., the specificity) improves with higher token counts. In our experiments, the left subplot of the figure below shows that for class-conditional generation using a 1.3B AR model, the top-1 classification accuracy (as measured by DINOv2-L on the predictions) improves with additional tokens but plateaus around 32 tokens. In contrast, the center plot demonstrates that for text-conditioned generation with a 3B AR model, the image-text alignment (as measured by CLIPScore) continues to improve as more tokens are generated, indicating that more complex conditioning benefits from longer token sequences.
The right subplot shows that generation quality, measured by gFID for both class- and text-conditional cases, remains relatively consistent across all token sequence lengths, which we attribute to our rectified flow decoder. Overall, these results confirm that simple conditions like ImageNet class labels can be fulfilled with as few as 16 tokens, while open-ended text prompts benefit from generating up to 256 tokens.
Compared to training a 3B AR model on more classical 2D-grid tokens, we find that AR models predicting FlexTok tokens perform similarly when the full 256 tokens are used. That said, they are more versatile, as they allow for generating images that match the input conditioning with a significantly smaller number of tokens. This is especially useful for tasks where the conditioning is simple and image generation should be fast and efficient.
As demonstrated above, simple conditions (like ImageNet-1k class labels) can be satisfied by predicting very few tokens, while more complex ones (like free-form text prompts) can benefit from generating up to the maximum number of tokens. We further investigate this relationship between conditioning complexity and token count with two example prompts, one simple (top rows) and one complex (bottom rows).
The left part of the figure below visually highlights two key observations. First, the number of tokens required to fulfill the conditioning signal varies significantly between simple and detailed prompts. The simple prompt is fulfilled with as few as 4-16 tokens, while the detailed prompt requires predicting at least 64 tokens. Second, with increasing token count, the variation between images decoded from the same token sequence but with different random seeds decreases quickly for simple prompts, while detailed prompts maintain meaningful variation even with more tokens. In other words, when a given token sequence is underspecified for a detailed prompt, the FlexTok rectified flow decoder compensates for the lack of specificity and still produces realistic outputs, albeit with greater semantic variation between images decoded with different random seeds.
We demonstrate the same observations quantitatively on the right side of the figure. The top subplot shows that image-text alignment (measured by CLIPScore) improves with more tokens but plateaus earlier for the simple prompt ("a single red apple on a white background") than for the detailed prompt ("graffiti of a rocket ship on a brick wall"). The bottom subplot shows that image variation between different generations (measured by pairwise DreamSim scores) decreases more rapidly with increasing token counts for simple prompts than for detailed prompts. This discrepancy underscores an intrinsic relationship between the complexity of the conditioning signal and the number of tokens required to faithfully generate that specific image.
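This variation measurement can be sketched as follows; `flextok.decode` and `dreamsim` are placeholders for the decoder and the metric implementation used.

```python
import itertools
import torch


@torch.no_grad()
def seed_variation(flextok, token_prefix, dreamsim, num_seeds=4):
    """Decode the same truncated token sequence with different seeds and average
    the pairwise DreamSim distances between the resulting images."""
    decodes = []
    for seed in range(num_seeds):
        torch.manual_seed(seed)                      # different decoder noise per seed
        decodes.append(flextok.decode(token_prefix))
    dists = [dreamsim(decodes[i], decodes[j])
             for i, j in itertools.combinations(range(num_seeds), 2)]
    return torch.stack(dists).mean()
```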
We now investigate the scaling behavior of autoregressive class-conditional models trained on FlexTok tokens, focusing on how model size impacts image-caption alignment and image fidelity. As depicted in the figure below, increasing the AR model size consistently reduces training loss. Notably, for the prediction of the first few tokens (1-8), generation quality (as measured by gFID) remains effectively independent of AR model size, suggesting that even smaller models can capture the coarse image details. However, for longer sequences (beyond 128 tokens), both gFID and CLIPScore improve significantly with larger AR models, indicating that these extended sequences require more powerful models to maintain strong performance. This trend underscores a key trade-off: while FlexTok's rectified flow decoder delivers high-quality outputs with few tokens, a strong AR model becomes crucial as more tokens are generated to better match the given condition.
In this work, we demonstrate the potential of a flexible sequence length tokenizer for image reconstruction and generation, enabling high-fidelity outputs with very few tokens. Our experiments show that FlexTok token sequences form a “visual vocabulary” that supports coarse-to-fine generation, where simpler conditions can be met with few tokens and more complex ones require generating longer sequences.
Looking ahead, we anticipate that FlexTok-like tokenizers, which adapt to the intrinsic complexity of the input data, could be applicable to other domains with high redundancy, such as audio and video. Training generative models on representations that can be both very compact and semantic, or very long and detailed, may enable further explorations into long-horizon video generation, understanding, as well as visual reasoning.
@article{flextok,
title={{FlexTok}: Resampling Images into 1D Token Sequences of Flexible Length},
author={Roman Bachmann and Jesse Allardice and David Mizrahi and Enrico Fini and O{\u{g}}uzhan Fatih Kar and Elmira Amirloo and Alaaeldin El-Nouby and Amir Zamir and Afshin Dehghan},
journal={arXiv 2025},
year={2025},
}
We thank Justin Lazarow and Miguel Angel Bautista Martin for their valuable feedback on earlier versions of this work.