Introduction
Large language models (LLMs) excel at understanding and generating code, but their ability to write symbolic graphics programs (SGPs), which produce images when rendered, remains underexplored. SVG, as a widely used SGP format, provides an ideal testbed for evaluating LLMs in this area. We address two main research questions:
(1) How well can LLMs draw using SGPs?
(2) How can we improve their ability to generate SGPs?
For the first question, we introduce SGP-GenBench, a comprehensive benchmark that evaluates LLMs' SGP generation from three perspectives: object, scene, and composition. For the second, we propose a reinforcement learning approach that uses similarity scores between visual-encoder outputs for the rendered image and the input text description as reward signals. Our experiments show that RL significantly boosts the symbolic graphics programming abilities of LLMs.
SGP as Visual Synthesis
Symbolic graphics programming (SGP) aims to generate structured graphics code, such as SVG, from natural language instructions. Unlike conventional text-to-image models that produce pixel-based images, SGP translates prompts into interpretable, parametric programs, enabling precise and controllable visual synthesis.
SVG is widely used and serves as a natural bridge between language and vision, making it an ideal format for testing how well LLMs ground prompt semantics in code. SGPs are both:
(a) parametric, allowing exact geometric and appearance control;
(b) procedural, supporting hierarchical and compositional scene construction.
These properties enable fine-grained, interpretable, and flexible visual generation, and allow us to directly analyze the complexity and structure of generated scenes.
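To make these two properties concrete, here is a hand-written toy example (not drawn from SGP-GenBench): exact numeric attributes control geometry and appearance, while the nested `<g>` group composes an object hierarchically from primitives. The rendering call uses cairosvg, one common SVG rasterizer.

```python
import cairosvg  # one common SVG rasterizer; any renderer works

# Toy SVG scene (hand-written for illustration). Numeric attributes give
# parametric control; the nested <g> group builds a "tree" object
# procedurally from primitive shapes.
svg_scene = """
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">
  <rect x="0" y="150" width="200" height="50" fill="#7cba5d"/>   <!-- ground -->
  <g id="tree" transform="translate(100,150)">                   <!-- composed object -->
    <rect x="-8" y="-40" width="16" height="40" fill="#8b5a2b"/> <!-- trunk -->
    <circle cx="0" cy="-55" r="28" fill="#2e8b57"/>              <!-- canopy -->
  </g>
</svg>
"""

# Rasterize the program to pixels, as is done whenever generations are scored.
cairosvg.svg2png(bytestring=svg_scene.encode(), write_to="scene.png")
```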
SGP-GenBench
Construction
- Object generation: Assessed on SGP-Object-val, a set of 930 single-object examples with captions, focusing on the model's ability to render individual objects accurately.
- Scene generation: Evaluated on COCO-val, covering 80 object categories with complex scene descriptions; we sample 1,024 images for efficient and representative assessment.
- Compositional generation: Tested on SGP-CompBench with 3,200 prompts, targeting attribute binding, spatial relationships, and numeracy using 80 common objects.
Method
Given a text description, we sample a group of SVG programs from the model and render them as images. Each program is scored by the alignment between its rendered image and the text description; advantages are then computed from these scores and used to update the model.
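A minimal sketch of this loop is given below, assuming a GRPO-style group-relative advantage (the text above states only that advantages are computed from the scores). The helpers `sample_svg`, `render_svg`, `text_image_score`, and `policy_update` are hypothetical placeholders for the policy LLM, an SVG rasterizer, the vision-model reward, and the policy-gradient update.

```python
import numpy as np

# --- Placeholders for components outside this sketch (names are assumptions) ---
def sample_svg(prompt: str) -> str: ...            # one SVG program from the policy LLM
def render_svg(svg: str) -> bytes: ...             # rasterize, e.g. via cairosvg.svg2png
def text_image_score(prompt: str, image: bytes) -> float: ...  # vision-model alignment
def policy_update(svgs: list, advantages: np.ndarray) -> None: ...

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Center and scale rewards within the group sampled for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rl_step(prompt: str, group_size: int = 8) -> None:
    # 1) Sample a group of candidate SVG programs for the prompt.
    svgs = [sample_svg(prompt) for _ in range(group_size)]
    # 2) Render each program to an image.
    images = [render_svg(svg) for svg in svgs]
    # 3) Score text-image alignment to get per-sample rewards.
    rewards = np.array([text_image_score(prompt, img) for img in images])
    # 4) Convert scores to advantages and update the policy.
    policy_update(svgs, group_relative_advantages(rewards))
```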
Discussion of Method
Distilling from vision models. We use external vision models (e.g., DINO, CLIP) to define rewards, guiding the LLM to align its generations with strong visual and semantic priors. This implicit distillation improves both visual understanding and cross-modal alignment.
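As one concrete possibility, a CLIP-based alignment reward could be computed as below. This is a sketch, not necessarily the exact configuration used in training; the `openai/clip-vit-base-patch32` checkpoint is an assumption.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint, for illustration
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def clip_reward(text: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP embeddings of the rendered image and
    the text description, usable directly as an RL reward."""
    inputs = processor(text=[text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()
```

When a reference image is available, an image-image similarity from a self-supervised encoder such as DINO can play the same role.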
Training without ground truth SGPs. Our RL method learns directly from images using feedback from vision models, removing the need for paired image-program data and enabling scalable, diverse training.
Aligning language and vision. The RL process ensures that generated SGPs are consistent with both linguistic input and visual model judgments, grounding language in visual evidence and enhancing cross-modal reasoning.
Results
Overview of SGP-GenBench
- SGP-GenBench reflects general model abilities. The ranking of models on our benchmark aligns well with their perceived general capabilities, especially in code generation. For example, Claude 3.7 Sonnet Thinking generally outperforms o3, which in turn surpasses Gemini 2.5 Pro Preview, followed by open-source systems like DeepSeek-R1 and Qwen-2.5-7B. This consistent ordering suggests that SVG generation is a reliable indicator of broader model competence.
- Closed-source models remain strongest. Frontier systems achieve the best results not only on compositional reasoning tasks such as attribute binding and numeracy, where Claude 3.7 Sonnet Thinking reaches 90.5 on color binding and 89.4 on numeracy, but also on scene and object fidelity, where Gemini 2.5 Pro Preview attains the top DINO object score of 0.653 and strong VQA scene performance of 0.554.
- Our RL-trained model substantially narrows the gap. The RL post-trained Qwen-2.5-7B raises its overall compositional score from 8.8 to 60.8, outperforming all other open-source counterparts such as DeepSeek-R1 and QwQ-32B. It also achieves the best VQA score across all models at 0.596, slightly higher than Claude 3.7 Sonnet Thinking, demonstrating that reinforcement learning enables open-source models to approach the closed-source frontier.
Object & Scene
| Model | CLIP Sce. ↑ | CLIP Obj. ↑ | CLIP Avg. ↑ | DINO Sce. ↑ | DINO Obj. ↑ | DINO Avg. ↑ | VQA Sce. ↑ | VQA Obj. ↑ | VQA Avg. ↑ | HPS Sce. ↑ | HPS Obj. ↑ | HPS Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Frontier closed-source LLMs** | | | | | | | | | | | | |
| GPT-4o-mini | 0.207 | 0.278 | 0.243 | 0.021 | 0.573 | 0.297 | 0.295 | 0.465 | 0.380 | 0.118 | 0.174 | 0.146 |
| GPT-4o | 0.219 | 0.284 | 0.252 | 0.031 | 0.602 | 0.316 | 0.338 | 0.497 | 0.417 | 0.125 | 0.182 | 0.153 |
| o1-mini | 0.221 | 0.289 | 0.255 | 0.028 | 0.603 | 0.315 | 0.330 | 0.508 | 0.419 | 0.121 | 0.185 | 0.153 |
| o1 | 0.220 | 0.285 | 0.252 | 0.031 | 0.607 | 0.319 | 0.354 | 0.520 | 0.437 | 0.122 | 0.185 | 0.153 |
| o3-mini | 0.231 | 0.293 | 0.262 | 0.036 | 0.608 | 0.322 | 0.379 | 0.520 | 0.450 | 0.128 | 0.187 | 0.158 |
| o3 | 0.253 | 0.283 | 0.268 | 0.067 | 0.595 | 0.331 | 0.521 | 0.482 | 0.502 | 0.153 | 0.180 | 0.166 |
| o4-mini | 0.246 | 0.296 | 0.271 | 0.052 | 0.629 | 0.340 | 0.469 | 0.536 | 0.503 | 0.143 | 0.193 | 0.168 |
| Gemini 2.0 Flash | 0.204 | 0.275 | 0.240 | 0.023 | 0.588 | 0.306 | 0.293 | 0.468 | 0.381 | 0.116 | 0.176 | 0.146 |
| Gemini 2.5 Flash Preview | 0.222 | 0.286 | 0.254 | 0.033 | 0.603 | 0.318 | 0.349 | 0.498 | 0.424 | 0.125 | 0.183 | 0.154 |
| Gemini 2.5 Pro Preview | 0.256 | 0.302 | 0.279 | 0.088 | **0.653** | **0.371** | 0.554 | 0.572 | 0.563 | 0.154 | 0.199 | 0.177 |
| Claude 3.5 Sonnet | 0.240 | 0.293 | 0.266 | 0.055 | 0.624 | 0.340 | 0.428 | 0.528 | 0.478 | 0.140 | 0.190 | 0.165 |
| Claude 3.7 Sonnet | **0.262** | **0.306** | **0.284** | 0.088 | 0.647 | 0.368 | 0.581 | 0.567 | 0.574 | **0.165** | **0.200** | **0.183** |
| Claude 3.7 Sonnet Thinking | **0.262** | 0.305 | **0.284** | **0.090** | 0.642 | 0.366 | **0.594** | **0.574** | **0.584** | 0.164 | 0.199 | 0.181 |
| **Open-source LLMs** | | | | | | | | | | | | |
| QwQ-32B | 0.219 | 0.272 | 0.245 | 0.031 | 0.549 | 0.290 | 0.334 | 0.456 | 0.395 | 0.123 | 0.172 | 0.147 |
| DeepSeek-R1 | 0.228 | 0.278 | 0.253 | 0.042 | **0.594** | 0.318 | 0.416 | 0.508 | 0.462 | 0.134 | **0.180** | 0.157 |
| Qwen-2.5-7B | 0.155 | 0.213 | 0.184 | 0.008 | 0.400 | 0.204 | 0.265 | 0.385 | 0.325 | 0.103 | 0.148 | 0.125 |
| Qwen-2.5-7B w/ RL (ours) | **0.258** | **0.286** | **0.272** | **0.102** | 0.566 | **0.334** | **0.632** | **0.560** | **0.596** | **0.150** | 0.177 | **0.164** |
Bold indicates the best in each group; all metrics are higher-is-better (↑).
Composition
| Model | Color ↑ | Shape ↑ | Texture ↑ | Attr. Avg. ↑ | 2D ↑ | 3D ↑ | Implicit ↑ | Spat. Avg. ↑ | Total ↑ | Item ↑ | CPI ↑ | Num. Overall ↑ | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Frontier closed-source LLMs** | | | | | | | | | | | | | |
| GPT-4o | 62.2 | 48.7 | 34.3 | 48.4 | 49.7 | 37.3 | 49.2 | 45.4 | 85.9 | 25.5 | 51.1 | 52.7 | 48.3 |
| o1 | 70.8 | 25.2 | 53.0 | 49.6 | 54.6 | 39.4 | 46.4 | 46.8 | 66.4 | 20.1 | 41.7 | 42.0 | 46.7 |
| o3 | 88.9 | 73.6 | 71.7 | 78.1 | **81.6** | 62.0 | 84.5 | 76.0 | 91.6 | 59.8 | 81.1 | 78.8 | 77.5 |
| o4-mini | 82.4 | 62.1 | 69.6 | 71.4 | 71.0 | 57.9 | 76.5 | 68.5 | 90.3 | 52.9 | 76.1 | 74.3 | 71.0 |
| Gemini 2.5 Flash Preview | 63.6 | 45.0 | 56.9 | 55.2 | 46.0 | 38.9 | 57.1 | 47.3 | 82.8 | 34.5 | 62.0 | 59.8 | 53.4 |
| Gemini 2.5 Pro Preview | 88.1 | 65.7 | 74.9 | 76.2 | 77.4 | 59.1 | 80.0 | 72.2 | 94.7 | 68.0 | 83.8 | 82.3 | 76.2 |
| Claude 3.7 Sonnet | 89.3 | 82.8 | 77.3 | 83.1 | 75.9 | 59.4 | 73.7 | 69.7 | 91.4 | 65.5 | 85.5 | 82.5 | 77.9 |
| Claude 3.7 Sonnet Thinking | **90.5** | **85.6** | **82.4** | **86.2** | 80.2 | **74.4** | **86.4** | **80.3** | **94.9** | **78.9** | **91.4** | **89.4** | **84.8** |
| **Open-source LLMs** | | | | | | | | | | | | | |
| QwQ-32B | 54.3 | 51.0 | 31.4 | 45.6 | 43.6 | 33.5 | 46.0 | 41.0 | 79.9 | 21.1 | 51.4 | 50.9 | 45.2 |
| DeepSeek-R1 | 72.6 | 62.7 | **48.4** | 61.2 | **59.3** | 43.8 | 58.2 | 53.7 | **83.5** | 35.4 | **60.4** | **57.4** | 57.4 |
| Qwen-2.5-7B | 7.1 | 10.0 | 1.7 | 6.3 | 5.2 | 5.8 | 8.1 | 6.4 | 42.6 | 5.8 | 10.7 | 16.1 | 8.8 |
| Qwen-2.5-7B w/ RL (ours) | **84.3** | **71.3** | 46.0 | **67.2** | 55.7 | **53.9** | **61.7** | **57.1** | 63.4 | **47.5** | 57.6 | 56.8 | **60.8** |
Bold indicates the best in each group; all metrics are higher-is-better (↑). Color/Shape/Texture and their Avg. measure attribute binding; 2D/3D/Implicit and their Avg. measure spatial relations; Total/Item/CPI/Overall measure numeracy.
Analysis
Reinforcement Learning vs. Best-of-N Sampling
How SGP Generation Capabilities Evolve during Training
Strategy 1: Decomposition into basic components.
Strategy 2: Contextual optional details.
Bibtex
@article{chen2025sgp,
  title={Symbolic Graphics Programming with Large Language Models},
  author={Chen, Yamei and Zhang, Haoquan and Huang, Yangyi and Qiu, Zeju and Zhang, Kaipeng and Wen, Yandong and Liu, Weiyang},
  journal={arXiv preprint arXiv:2509.01000},
  year={2025}
}