Layered AI Images: Inside Qwen-Image-Layered Editing

Author: Łukasz Grochal

Qwen-Image-Layered is a diffusion-based model from Alibaba’s Qwen team that turns a single raster image into multiple clean RGBA layers, a bit like getting Photoshop-style layers out of a flat JPG. It aims to fix the usual “everything melts together” problem in AI editing by separating the background, main subjects, text, and other elements into semantically meaningful layers that can be edited independently while the rest of the image stays intact.

Under the hood, the system uses an RGBA VAE, a VLD-MMDiT architecture, and multi-stage training to adapt a pretrained generator into a multilayer decomposer, and it supports a variable number of layers depending on scene complexity: typically three, and up to around eight. In practice this lets users swap or remove objects, change backgrounds, adjust colors, or tweak text with much better geometric and semantic consistency than classic inpainting, and a layer can even be recursively decomposed again when finer control is needed.
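To make the layered representation concrete, here is a minimal sketch of the editing workflow such a decomposition enables: modify one RGBA layer, leave the others alone, and recomposite with standard alpha-over blending. This is plain Pillow, not part of Qwen-Image-Layered itself, and the file names, layer count, and stacking order are illustrative assumptions rather than the model’s actual output format.

```python
from PIL import Image

# Hypothetical RGBA layers from a decomposition pass, listed
# bottom-to-top; the file names are illustrative only.
layer_paths = ["background.png", "subject.png", "text.png"]
layers = [Image.open(p).convert("RGBA") for p in layer_paths]

# Edit one layer independently: swap the background for a flat
# fill while the subject and text layers stay untouched.
layers[0] = Image.new("RGBA", layers[0].size, (24, 32, 48, 255))

# Recomposite with alpha-over: paint each layer onto a
# transparent canvas in stacking order.
canvas = Image.new("RGBA", layers[0].size, (0, 0, 0, 0))
for layer in layers:
    canvas = Image.alpha_composite(canvas, layer)

canvas.convert("RGB").save("recomposited.png")
```

Because each element lives on its own layer with its own alpha channel, a background swap like this cannot bleed into the subject’s edges, which is exactly the failure mode that mask-based inpainting struggles with.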

The code and models are released openly under Apache-style licensing on GitHub, Hugging Face, and ModelScope, and the authors pitch the work as a step toward more structured, design-tool-friendly image representations rather than a replacement for existing raster workflows.
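For those who want to try the release themselves, the sketch below fetches the weights with huggingface_hub. The repository id is an assumption inferred from the project name; check the actual model card for the real identifier and the intended inference API.

```python
from huggingface_hub import snapshot_download

# Hypothetical repo id inferred from the project name; confirm the
# real identifier on the Hugging Face model card before relying on it.
local_dir = snapshot_download(repo_id="Qwen/Qwen-Image-Layered")
print(f"Weights downloaded to: {local_dir}")
```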