# Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation
If you find our research helpful, please consider giving us a star ⭐ to support the latest updates.

## 🔥 News
- **Coming Soon** 🚧 We will release the Plan2Gen agent implementation, along with full code for analysis and ablation studies, enabling complete reproduction and future extensions of our framework. Stay tuned!
- **2025.06.07** 🎉🎉🎉 To further advance AGI-level T2I, we've added a structured summary of key insights to GitHub, including both in-paper highlights and new reflections. 📖 Check more details in our paper: Draw ALL Your Imagine. 👋 Feel free to open an issue to discuss with us!
- **2025.06.03** 🎉🎉🎉 The LongBench-T2I dataset is now available on Hugging Face 🤗 at https://huggingface.co/datasets/YCZhou/LongBench-T2I. Explore, evaluate, and build on top of it!

  ```python
  from datasets import load_dataset

  # Login using e.g. `huggingface-cli login` to access this dataset
  ds = load_dataset("YCZhou/LongBench-T2I")
  ```
- **2025.05.31** 🎉🎉🎉 We open-sourced the LongBench-T2I dataset and evaluation toolkit on GitHub, now available for the community! ⭐ Take the LongBench-T2I Challenge! 🔥
- **2025.05.30** 🎉🎉🎉 We release the paper Draw ALL Your Imagine, a holistic benchmark and agent framework for complex instruction-based image generation. Please check it out for more details! 📄
## 📖 Citation
If you find our work useful for your research, please kindly cite our paper as follows:
```bibtex
@article{zhou2025draw,
  title={Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation},
  author={Zhou, Yucheng and Yuan, Jiahao and Wang, Qianning},
  journal={arXiv preprint arXiv:2505.24787},
  year={2025}
}
```

## 🔍 Overview
LongBench-T2I is a comprehensive benchmark and agent framework for evaluating and improving complex instruction-based text-to-image generation, pushing toward AGI-level capabilities in controllable visual synthesis.
- 📦 Installation
- 🗂️ Project Structure
- 🛠️ How to Run
- 📌 Key Insights
- 🎯 Case Study Comparison
- 📄 License
## 📦 Installation
```shell
git clone https://github.com/yczhou001/LongBench-T2I.git
cd LongBench-T2I
conda create -n LB python=3.10 -y
conda activate LB
pip install -r requirements.txt
```

## 🗂️ Project Structure
```
.
├── data/
│   └── instruction.jsonl   # Input instructions + labels
├── utils/                  # Utility modules for LLM/VLM interaction and evaluation
│   ├── evaluator.py        # Evaluation interface (Gemini / InternVL)
│   ├── prompt.py           # Prompt templates
│   └── utils.py            # General helper functions
├── evaluate.py             # Evaluation script for final outputs
├── LICENSE
└── README.md
```

## 🛠️ How to Run
### 🔍 Evaluation: Assessing Final Image Quality
```shell
python evaluate.py \
    --method "plan2gen" \
    --eval_folder "./eval" \
    --object_file "data/instruction.jsonl" \
    --evaluator "gemini-2.0-flash" \
    --Gemni_API_Key "<your_api_key>"
```

✅ Evaluation results will be saved as a `.jsonl` file in the specified `--eval_folder`, with per-image scores, comments, and an overall statistical summary.
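The `--Gemni_API_Key` flag accepts multiple keys, which are rotated across requests. As a rough illustration of round-robin rotation (the key strings below are placeholders, and the repository's actual rotation logic in `utils/evaluator.py` may differ), one could do:

```python
from itertools import cycle

# Placeholder keys for illustration only; pass real keys via --Gemni_API_Key.
api_keys = ["key_one", "key_two", "key_three"]
key_pool = cycle(api_keys)

def next_key():
    """Return the next key in round-robin order, e.g. to retry after a rate limit."""
    return next(key_pool)
```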
Example entry:
```json
{
  "idx": ...,
  "image": ".../generated_image_....jpg",
  "objects": [
    {
      "category_name": "...",
      "description": "...",
      "score": ...,
      "evaluation": "..."
    },
    ...
  ],
  "average_score": ...
}
```

### 📋 Hyperparameters Explanation
| Hyperparameter | Type | Description | Default Value |
|---|---|---|---|
| `--method` | `str` | Name of the image generation method. Determines the subdirectory under `./outputs/` to evaluate. | `"plan2gen_3"` |
| `--eval_folder` | `str` | Directory to save evaluation output results (`.jsonl` format). | `"./eval"` |
| `--object_file` | `str` | Path to the input `.jsonl` file containing object instruction labels. | `"data/instruction.jsonl"` |
| `--evaluator` | `str` | Evaluation model to use. Choices: `"gemini-2.0-flash"`, `"OpenGVLab/InternVL3-78B"`. | `"gemini-2.0-flash"` |
| `--Gemni_API_Key` | `List[str]` | API key(s) for accessing Gemini models. Multiple keys supported for rotation. | Required (for Gemini) |
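The per-image records in the evaluation `.jsonl` can be aggregated into a benchmark-level mean with a short script. This is a minimal sketch: the file path is a hypothetical example, and the `average_score` field follows the example entry shown above.

```python
import json

# Hypothetical output path; use the file written to your --eval_folder.
eval_path = "eval/plan2gen.jsonl"

def summarize(path):
    """Read one evaluation record per line and return the mean average_score."""
    scores = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            scores.append(record["average_score"])
    return sum(scores) / len(scores) if scores else 0.0

# mean = summarize(eval_path)
```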
## 📌 Key Insights

**Key Insight 1: Diffusion-based vs. AR-based Models**

AR-based models outperform diffusion-based models in complex instruction-following by offering better structure, coherence, and efficiency, while diffusion models still lead in visual detail and richness.

**Key Insight 2: Text Encoder-based vs. LLM Framework-based Models**

LLM framework-based models significantly outperform text encoder-based models, especially in composition, text understanding, and background quality, confirming the advantage of LLM-guided planning in handling complex image generation prompts.

**Key Insight 3: Language Understanding ≠ Visual Quality**

Surprisingly, higher perplexity sometimes correlates with better image quality, especially in smaller models, revealing a disconnect between language understanding and visual generation.
## 🎯 Case Study Comparison
## 📄 License
This project is licensed under the MIT License.