AutoRubric-T2I learns a compact set of weighted natural-language rubrics from image preference data, enabling interpretable VLM-based reward modeling without fine-tuning. It uses less than 0.01% of the annotated preference data that standard reward models require.
AutoRubric-T2I operates in two phases: seed rule generation from a small set of preference pairs, followed by iterative refinement that discovers and retains the most predictive rubrics through sparse weighting and hard-pair mining.
A framework that learns a compact, weighted set of natural-language rubrics from image preference data.
Rubric selection formulated as $\ell_1$-regularized logistic regression over VLM-scored rubric features, with iterative expansion through hard-pair mining.
Achieves leading preference prediction on MMRB2 and improves downstream T2I-RL on TIIF and UniGenBench++ using Flow-GRPO.
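The rubric-selection step can be sketched in a few lines. The block below is a minimal illustration, not the authors' implementation. It assumes each preference pair is represented by the difference between the VLM's rubric scores for the preferred and the rejected image, fits $\ell_1$-regularized logistic regression by proximal gradient descent (one of several possible solvers), and treats any pair the weighted rubrics still misrank as a hard pair. All function names, hyperparameters, and the toy data are hypothetical.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the l1 penalty: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_weights(pair_features, l1_strength=0.05, step_size=0.5, num_steps=500):
    """Fit l1-regularized logistic regression by proximal gradient descent.

    pair_features[i][k] is rubric k's score on the preferred image minus its
    score on the rejected image (an assumed encoding), so every pair has label 1.
    """
    num_pairs = len(pair_features)
    num_rubrics = len(pair_features[0])
    weights = [0.0] * num_rubrics
    for _ in range(num_steps):
        gradient = [0.0] * num_rubrics
        for features in pair_features:
            margin = sum(w * x for w, x in zip(weights, features))
            # Gradient of log(1 + exp(-margin)) w.r.t. the margin; the clamp avoids overflow.
            coeff = -1.0 / (1.0 + math.exp(min(margin, 50.0)))
            for k, x in enumerate(features):
                gradient[k] += coeff * x / num_pairs
        # Gradient step on the logistic loss, then shrinkage for the l1 term.
        weights = [
            soft_threshold(w - step_size * g, step_size * l1_strength)
            for w, g in zip(weights, gradient)
        ]
    return weights


def mine_hard_pairs(pair_features, weights):
    """Return indices of pairs that the current weighted rubrics rank incorrectly."""
    return [
        i
        for i, features in enumerate(pair_features)
        if sum(w * x for w, x in zip(weights, features)) <= 0.0
    ]


# Toy data: rubric 0 usually agrees with the human preference, rubric 1 is
# uncorrelated with it, and rubric 2 never distinguishes the two images.
toy_pairs = [
    [1.0, 1.0, 0.0],
    [1.0, -1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, -1.0, 0.0],
    [-1.0, 0.0, 0.0],
]
toy_weights = fit_sparse_weights(toy_pairs)
print(toy_weights)                              # only rubric 0 keeps a nonzero weight
print(mine_hard_pairs(toy_pairs, toy_weights))  # [4]
```

In a refinement loop of the kind described above, the mined hard pairs would be handed back to the rubric generator to propose new candidate rubrics, and the weights would then be refit over the enlarged set.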
Standard scalar reward models compress semantic fidelity, object correctness, spatial layout, and perceptual quality into a single number. During RL optimization, the policy can exploit spurious shortcuts, such as adding human subjects or aesthetic flourishes, that inflate the scalar reward without satisfying the prompt's constraints. AutoRubric-T2I mitigates this failure mode by providing decomposed, interpretable feedback: each rubric is scored separately, so a change in reward can be traced to the criteria that produced it.
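To make the contrast with a single scalar concrete, here is a small sketch of a decomposed reward. It assumes the VLM judge returns one score in [0, 1] per rubric and that the reward is the weighted sum of those scores; the `Rubric` class, the rubric texts, and the weights are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    text: str      # natural-language criterion shown to the VLM judge
    weight: float  # learned importance of this criterion


def rubric_reward(rubrics, scores):
    """Return the weighted reward and the contribution of each rubric."""
    if len(rubrics) != len(scores):
        raise ValueError("expected exactly one judge score per rubric")
    breakdown = {r.text: r.weight * s for r, s in zip(rubrics, scores)}
    return sum(breakdown.values()), breakdown


# Hypothetical rubrics and judge scores for one generated image.
rubrics = [
    Rubric("All objects named in the prompt are present", 0.5),
    Rubric("Spatial relations match the prompt", 0.25),
    Rubric("Rendered text is spelled correctly", 0.25),
]
reward, breakdown = rubric_reward(rubrics, [1.0, 0.0, 1.0])
print(reward)     # 0.75
print(breakdown)  # the spatial-relations rubric contributes 0.0
```

Because the breakdown is kept alongside the total, an image that raises the reward without meeting a prompt constraint shows up as a zero contribution on the corresponding rubric rather than being hidden inside one number.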
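We evaluate on the out-of-domain MMRB2 benchmark (covering EvalMuse, OneIG-Bench, R2I-Bench, RealUnify, and WISE) and on the in-domain PickScore and HPSv3 test sets. On MMRB2 overall, AutoRubric-T2I outperforms both the fine-tuned scalar reward models and the existing rubric-generation baselines under every VLM judge shown. On the in-domain test sets, the fine-tuned scalar models remain competitive, and HPSv3 keeps the best score on its own test set (74.0).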
| Model | EvalMuse | OneIG | R2I | RealUnify | WISE | Overall | PickScore | HPSv3 |
|---|---|---|---|---|---|---|---|---|
| Fine-Tuned Scalar Reward Models (Qwen2.5-VL-7B) | ||||||||
| HPSv3 | 54.0 | 60.4 | 68.0 | 68.9 | 56.7 | 59.4 | 67.3 | 74.0 |
| UnifiedReward | 56.9 | 62.1 | 56.8 | 67.8 | 56.0 | 59.8 | 68.8 | 65.8 |
| Pointwise VLM Judge: Qwen3-VL-8B | ||||||||
| AutoRule (HPSv3) | 56.9 | 60.4 | 64.1 | 62.4 | 55.0 | 59.1 | 61.1 | 62.8 |
| AutoRubric (HPSv3) | 54.1 | 61.9 | 61.7 | 59.1 | 52.3 | 57.5 | 57.1 | 58.1 |
| AutoRubric-T2I (HPSv3) | 58.5 | 67.3 | 64.1 | 65.6 | 60.4 | 62.5 | 61.7 | 63.9 |
| AutoRubric-T2I (PickScore) | 60.8 | 66.2 | 65.6 | 61.3 | 55.9 | 62.4 | 63.2 | 61.5 |
| Pointwise VLM Judge: Gemini-3-Flash | ||||||||
| AutoRule (HPSv3) | 67.7 | 68.4 | 67.2 | 72.0 | 62.2 | 67.6 | 67.2 | 62.6 |
| AutoRubric (HPSv3) | 61.2 | 62.8 | 66.6 | 74.2 | 58.7 | 63.8 | 65.0 | 62.2 |
| AutoRubric-T2I (HPSv3) | 70.2 | 71.3 | 70.8 | 78.7 | 64.9 | 70.8 | 69.0 | 70.0 |
| AutoRubric-T2I (PickScore) | 70.7 | 71.9 | 71.1 | 79.5 | 66.0 | 71.4 | 70.3 | 66.8 |
Table 1 (abridged). Comparison across MMRB2 out-of-domain and in-domain benchmarks. Bold indicates best within each VLM pointwise block.
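For readers unfamiliar with how a pointwise judge is evaluated on pairwise preference data, the sketch below shows one common way to compute such a score: each image in a pair is scored independently, and a pair counts as correct when the human-preferred image receives the higher score. Counting ties as half-correct is an assumption made here for illustration; the source does not state how the benchmarks handle ties, and the function name is hypothetical.

```python
def preference_accuracy(scored_pairs):
    """Percentage of pairs where the preferred image gets the higher score.

    scored_pairs is a list of (score_preferred, score_rejected) tuples.
    Ties are counted as half-correct (an illustrative assumption).
    """
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    correct = 0.0
    for score_preferred, score_rejected in scored_pairs:
        if score_preferred > score_rejected:
            correct += 1.0
        elif score_preferred == score_rejected:
            correct += 0.5
    return 100.0 * correct / len(scored_pairs)


# Two correct pairs, one wrong pair, and one tie.
print(preference_accuracy([(0.9, 0.2), (0.4, 0.6), (0.5, 0.5), (0.8, 0.1)]))  # 62.5
```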
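We apply AutoRubric-T2I as the reward signal for fine-tuning SD-3.5-Medium with Flow-GRPO. The learned rubrics provide more targeted feedback than scalar rewards: on TIIF, the largest gains over the base model come in reasoning (66.8 to 74.6) and text generation (45.7 to 65.2), while the relational score stays near the baseline (76.2 vs. 76.6) where the HPSv3 and AutoRule rewards lower it.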
| Model | Attr. | Rel. | Reason. | Text | Overall |
|---|---|---|---|---|---|
| SD-3.5-Medium | 78.0 | 76.6 | 66.8 | 45.7 | 65.3 |
| + HPSv3 | 83.5 | 75.5 | 65.8 | 62.0 | 68.8 |
| + AutoRule (HPSv3) | 85.0 | 72.4 | 70.0 | 55.2 | 69.1 |
| + Ours (HPSv3) | 84.5 | 76.2 | 74.6 | 65.2 | 71.6 |
TIIF (short prompts). AutoRubric-T2I raises the overall score from 65.3% for the SD-3.5-Medium baseline to 71.6%.
| Model | Style | Rel. | Compound | Layout | Overall |
|---|---|---|---|---|---|
| SD-3.5-Medium | 91.7 | 70.4 | 62.2 | 74.6 | 64.0 |
| + HPSv3 | 82.7 | 69.2 | 60.1 | 73.3 | 62.6 |
| + AutoRule (HPSv3) | 91.4 | 70.6 | 66.3 | 75.6 | 66.3 |
| + Ours (HPSv3) | 90.7 | 71.0 | 66.9 | 78.3 | 66.9 |
UniGenBench++ (long prompts). Strongest gains over the SD-3.5-Medium baseline in layout (74.6 to 78.3) and compound reasoning (62.2 to 66.9).
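The reward enters Flow-GRPO training through group-relative advantages. As a rough sketch of that step only: GRPO-style methods sample several images for the same prompt, score each one, and normalize the rewards within the group so that an image is reinforced only when it beats its siblings. The helper below is hypothetical, uses the population standard deviation (implementations differ on this), and omits the policy update itself.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sample group to zero mean.

    rewards holds one reward per image sampled for the same prompt.
    eps keeps the division finite when every image gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rubric rewards for four images generated from one prompt.
advantages = group_relative_advantages([0.25, 0.75, 0.5, 0.5])
print(advantages)  # negative for the first image, positive for the second, zero for the rest
```

When all images in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is one reason a reward that separates near-identical samples along specific criteria can be useful.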
If you find this work useful, please cite:
@article{kao2026autorubric,
title = {AutoRubric-T2I: Robust Rule-Based Reward Model
for Text-to-Image Alignment},
author = {Kao, Kuei-Chun and Huo, Daixuan and Ban, Yuanhao
and Hsieh, Cho-Jui},
journal = {arXiv preprint arXiv:2605.17602},
year = {2026}
}