NextStep-1 is a large-scale text-to-image system built around a 14B-parameter autoregressive Transformer paired with a lightweight 157M-parameter flow-matching head. Instead of relying on external diffusion pipelines or lossy vector quantization, it models continuous visual tokens directly, predicting them patch by patch, which preserves fine detail and improves compositional consistency.
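To make the two-part design concrete, here is a minimal PyTorch sketch of the pattern: a causal Transformer emits one hidden state per patch, and a small MLP head is trained with a flow-matching objective to regress the velocity that carries Gaussian noise to the next patch's continuous token. All module names, dimensions, and the linear interpolation path are illustrative assumptions, not NextStep-1's released code.

```python
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Small MLP predicting the velocity v(x_t, t | h) for one patch token."""
    def __init__(self, token_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x_t, t, h):
        # x_t: (N, token_dim) noisy token, t: (N, 1) time, h: (N, cond_dim) AR state
        return self.net(torch.cat([x_t, t, h], dim=-1))

class ARImageModel(nn.Module):
    def __init__(self, token_dim=16, cond_dim=256, n_layers=2, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(token_dim, cond_dim)
        layer = nn.TransformerEncoderLayer(cond_dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = FlowMatchingHead(token_dim, cond_dim)

    def forward(self, patches, t, noise):
        # patches: (B, L, token_dim) continuous visual tokens in raster order.
        B, L, D = patches.shape
        causal = nn.Transformer.generate_square_subsequent_mask(L).to(patches.device)
        h = self.backbone(self.in_proj(patches), mask=causal)
        # The state at position i conditions the prediction of token i+1.
        x1 = patches[:, 1:, :].reshape(-1, D)          # target tokens
        x0 = noise[:, 1:, :].reshape(-1, D)            # source Gaussian noise
        tt = t.repeat_interleave(L - 1).unsqueeze(-1)  # per-token time in [0, 1]
        x_t = (1 - tt) * x0 + tt * x1                  # linear interpolation path
        v_pred = self.head(x_t, tt, h[:, :-1, :].reshape(-1, h.size(-1)))
        return ((v_pred - (x1 - x0)) ** 2).mean()      # flow-matching loss

model = ARImageModel()
patches = torch.randn(2, 9, 16)   # e.g. a 3x3 grid of patch tokens per image
loss = model(patches, torch.rand(2), torch.randn_like(patches))
loss.backward()
```

At sampling time the same head would be queried repeatedly, integrating the predicted velocity from noise to a clean token for each patch in sequence.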
The design unifies language and vision tokens under a single next-token prediction objective, yielding a streamlined generation process that avoids the complexity of multi-stage pipelines. Extensive evaluation across benchmarks such as WISE, GenAI-Bench, DPG-Bench, and OneIG-Bench shows strong performance in world-knowledge reasoning and high-fidelity synthesis.
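Under that unified objective, one causal pass can supervise both modalities at once. The sketch below shows the idea: cross-entropy at positions whose next token is text, and the same flow-matching regression at positions whose next token is a continuous image patch. The function signature, mask layout, and head names are assumptions for illustration, not the project's actual training code.

```python
import torch
import torch.nn.functional as F

def unified_next_token_loss(hidden, text_targets, image_targets, next_is_text,
                            text_head, flow_head, t, noise):
    """State hidden[:, i] predicts token i+1; the mask routes it to a head.

    hidden:        (B, L, H) causal Transformer states
    text_targets:  (B, L)    token ids, valid where next_is_text is True
    image_targets: (B, L, D) continuous patch tokens, valid elsewhere
    next_is_text:  (B, L)    bool mask over positions
    t, noise:      flow-matching time (B, L) and Gaussian noise (B, L, D)
    """
    # Text positions: ordinary next-token cross-entropy.
    logits = text_head(hidden[next_is_text])                # (N_txt, vocab)
    loss_text = F.cross_entropy(logits, text_targets[next_is_text])

    # Image positions: flow-matching regression on the continuous token,
    # conditioned on the very same backbone states.
    h_img = hidden[~next_is_text]
    x1 = image_targets[~next_is_text]                       # target tokens
    x0 = noise[~next_is_text]                               # source noise
    tt = t[~next_is_text].unsqueeze(-1)
    x_t = (1 - tt) * x0 + tt * x1                           # interpolation path
    v_pred = flow_head(x_t, tt, h_img)
    loss_image = ((v_pred - (x1 - x0)) ** 2).mean()

    return loss_text + loss_image
```

Because both losses read the same backbone states, scaling the Transformer strengthens text understanding and image synthesis together rather than through separate stages.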
A companion variant, NextStep-1-Edit, is fine-tuned for image editing and achieves competitive scores on GEdit-Bench and ImgEdit-Bench. The project is open source, with code and models freely available for research and development.