Released September 9, 2025

Symbolic Graphics Programming with Large Language Models

AUTHORS
Yamei Chen1,2,*, Haoquan Zhang1,3,*, Yangyi Huang1,2, Zeju Qiu4, Kaipeng Zhang3, Yandong Wen2, Weiyang Liu1,4
* Equal contribution
AFFILIATIONS
1 CUHK
2 Westlake University
3 Shanghai AI Laboratory
4 MPI-IS

Introduction

Large language models (LLMs) excel at understanding and generating code, but their ability to create symbolic graphics programs (SGPs), which produce images when executed, remains underexplored. SVG, as a widely used SGP format, provides an ideal testbed for evaluating LLMs in this area. We address two main research questions:

(1) How well can LLMs draw using SGPs?
To address the first question, we introduce SGP-GenBench, a comprehensive benchmark for evaluating LLMs' ability to generate SGPs from three perspectives: object, scene, and composition.

(2) How can we improve their ability to generate SGPs?
To improve performance, we propose a reinforcement learning (RL) approach that uses the similarity between a visual encoder's embedding of the rendered image and the input text description as the reward signal. Our experiments show that RL significantly boosts the symbolic graphics programming abilities of LLMs.

SGP as Visual Synthesis

SVG visual synthesis

Symbolic graphics programming (SGP) aims to generate structured graphics code, such as SVG, from natural language instructions. Unlike conventional text-to-image models that produce pixel-based images, SGP translates prompts into interpretable, parametric programs, enabling precise and controllable visual synthesis.

SVG is widely used and serves as a natural bridge between language and vision, making it an ideal format for evaluating how well LLMs ground prompt semantics in code. SGPs are both:

(a) PARAMETRIC - allowing exact geometric and appearance control
(b) PROCEDURAL - supporting hierarchical and compositional scene construction.

These properties enable fine-grained, interpretable, and flexible visual generation, and allow us to directly analyze the complexity and structure of generated scenes.
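To make these two properties concrete, here is a small illustrative Python snippet (ours, not from the paper; all helper names are hypothetical) that assembles an SVG scene. Each primitive is fully parametric, and <g> groups compose sub-scenes procedurally:

def circle(cx: float, cy: float, r: float, fill: str) -> str:
    # Parametric: each primitive is fully determined by explicit
    # geometric and appearance parameters.
    return f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="{fill}"/>'

def snowman(x: float, y: float, scale: float = 1.0) -> str:
    # Procedural: a <g> group applies one transform to a whole
    # sub-scene, so objects can be placed, scaled, and reused.
    body = "".join(
        circle(0, -dy * scale, r * scale, "white")
        for dy, r in [(0, 30), (45, 22), (80, 15)]
    )
    return f'<g transform="translate({x},{y})">{body}</g>'

svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
    '<rect width="200" height="200" fill="skyblue"/>'
    + snowman(60, 160) + snowman(140, 160, scale=0.7)
    + "</svg>"
)
print(svg)

Because the scene is a program rather than pixels, its structure (number of primitives, grouping depth, parameter choices) can be inspected and edited directly.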

SGP-GenBench

Construction

  • Scene generation: Evaluated on COCO-val, covering 80 object categories with complex scene descriptions. We sample 1,024 images for efficient and representative assessment.
  • Object generation: Assessed on SGP-Object-val, a set of 930 single-object examples with captions, focusing on the model's ability to render individual objects accurately.
  • Compositional generation: Tested on SGP-CompBench with 3,200 prompts, targeting attribute binding, spatial relationships, and numeracy using 80 common objects for compositional evaluation.

Method


Given a text description, we sample a group of SVG programs from the model and render them as images. Each program is scored by the alignment between its rendered image and the text description; advantages are then computed from these scores and used to update the model.
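A minimal sketch of this sampling-and-scoring loop is below. It is our illustration, not the paper's exact implementation: a group-normalized, GRPO-style advantage estimator is assumed, and policy, render, and alignment are hypothetical interfaces.

import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    # Normalize each sample's reward against its own group's statistics
    # (a GRPO-style choice; assumed here, not confirmed by the paper).
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mean) / (std + 1e-6) for r in rewards]

# One training step for a single text description:
# svgs    = [policy.sample(prompt) for _ in range(G)]       # sample a group of SVG programs
# images  = [render(svg) for svg in svgs]                   # rasterize each program
# rewards = [alignment(image, prompt) for image in images]  # text-image similarity scores
# advs    = group_advantages(rewards)                       # weights for the policy update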

Discussion of Method

Distilling from vision models. We use external vision models (e.g., DINO, CLIP) to define rewards, guiding the LLM to align its generations with strong visual and semantic priors. This implicit distillation improves both visual understanding and cross-modal alignment.
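The sketch below shows one way such a reward can be computed, rasterizing the SVG with cairosvg and scoring text-image similarity with an off-the-shelf CLIP checkpoint from Hugging Face. The specific checkpoint and any reward weighting are our assumptions; the paper's reward may combine several vision models.

import io
import cairosvg
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: a standard public CLIP checkpoint stands in for the
# paper's reward model(s).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(svg_code: str, prompt: str) -> float:
    # Rasterize the SVG program to a fixed-size RGB image.
    png = cairosvg.svg2png(bytestring=svg_code.encode(),
                           output_width=224, output_height=224)
    image = Image.open(io.BytesIO(png)).convert("RGB")
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Cosine similarity between the image and text embeddings.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())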

Training without ground truth SGPs. Our RL method learns directly from images using feedback from vision models, removing the need for paired image-program data and enabling scalable, diverse training.

Aligning language and vision. The RL process ensures that generated SGPs are consistent with both linguistic input and visual model judgments, grounding language in visual evidence and enhancing cross-modal reasoning.

Results

Overview of SGP-GenBench

  • SGP-GenBench reflects general model abilities. The ranking of models on our benchmark aligns well with their perceived general capabilities, especially in code generation. For example, Claude 3.7 Sonnet Thinking generally outperforms o3, which in turn surpasses Gemini 2.5 Pro Preview, followed by open-source systems like DeepSeek-R1 and Qwen-2.5-7B. This consistent ordering suggests that SVG generation is a reliable indicator of broader model competence.
  • Closed-source models remain strongest. Frontier systems achieve the best results not only on compositional reasoning tasks such as attribute binding and numeracy, where Claude 3.7 Sonnet Thinking reaches 90.5 on color binding and 89.4 on numeracy, but also on scene and object fidelity, where Gemini 2.5 Pro Preview attains the top DINO object score of 0.653 and strong VQA scene performance of 0.554.
  • Our RL-trained model substantially narrows the gap. The RL post-trained Qwen-2.5-7B raises its overall compositional score from 8.8 to 60.8, outperforming all other open-source counterparts such as DeepSeek-R1 and QwQ-32B. It also achieves the best VQA score across all models at 0.596, slightly higher than Claude 3.7 Sonnet Thinking, demonstrating that reinforcement learning enables open-source models to approach the closed-source frontier.

Object & Scene

Model                         CLIP ↑               DINO ↑               VQA ↑                HPS ↑
                              Sce.  Obj.  Avg.     Sce.  Obj.  Avg.     Sce.  Obj.  Avg.     Sce.  Obj.  Avg.

Frontier closed-source LLMs
GPT-4o-mini                   0.207 0.278 0.243    0.021 0.573 0.297    0.295 0.465 0.380    0.118 0.174 0.146
GPT-4o                        0.219 0.284 0.252    0.031 0.602 0.316    0.338 0.497 0.417    0.125 0.182 0.153
o1-mini                       0.221 0.289 0.255    0.028 0.603 0.315    0.330 0.508 0.419    0.121 0.185 0.153
o1                            0.220 0.285 0.252    0.031 0.607 0.319    0.354 0.520 0.437    0.122 0.185 0.153
o3-mini                       0.231 0.293 0.262    0.036 0.608 0.322    0.379 0.520 0.450    0.128 0.187 0.158
o3                            0.253 0.283 0.268    0.067 0.595 0.331    0.521 0.482 0.502    0.153 0.180 0.166
o4-mini                       0.246 0.296 0.271    0.052 0.629 0.340    0.469 0.536 0.503    0.143 0.193 0.168
Gemini 2.0 Flash              0.204 0.275 0.240    0.023 0.588 0.306    0.293 0.468 0.381    0.116 0.176 0.146
Gemini 2.5 Flash Preview      0.222 0.286 0.254    0.033 0.603 0.318    0.349 0.498 0.424    0.125 0.183 0.154
Gemini 2.5 Pro Preview        0.256 0.302 0.279    0.088 0.653 0.371    0.554 0.572 0.563    0.154 0.199 0.177
Claude 3.5 Sonnet             0.240 0.293 0.266    0.055 0.624 0.340    0.428 0.528 0.478    0.140 0.190 0.165
Claude 3.7 Sonnet             0.262 0.306 0.284    0.088 0.647 0.368    0.581 0.567 0.574    0.165 0.200 0.183
Claude 3.7 Sonnet Thinking    0.262 0.305 0.284    0.090 0.642 0.366    0.594 0.574 0.584    0.164 0.199 0.181

Open-source LLMs
QwQ-32B                       0.219 0.272 0.245    0.031 0.549 0.290    0.334 0.456 0.395    0.123 0.172 0.147
DeepSeek-R1                   0.228 0.278 0.253    0.042 0.594 0.318    0.416 0.508 0.462    0.134 0.180 0.157
Qwen-2.5-7B                   0.155 0.213 0.184    0.008 0.400 0.204    0.265 0.385 0.325    0.103 0.148 0.125
Qwen-2.5-7B w/ RL (ours)      0.258 0.286 0.272    0.102 0.566 0.334    0.632 0.560 0.596    0.150 0.177 0.164


Composition

Model                         Attribute Binding ↑           Spatial Relation ↑             Numeracy ↑                   Avg. ↑
                              Color  Shape  Texture  Avg.   2D     3D     Implicit  Avg.   Total  Item   CPI   Overall

Frontier closed-source LLMs
GPT-4o                        62.2   48.7   34.3     48.4   49.7   37.3   49.2      45.4   85.9   25.5   51.1  52.7     48.3
o1                            70.8   25.2   53.0     49.6   54.6   39.4   46.4      46.8   66.4   20.1   41.7  42.0     46.7
o3                            88.9   73.6   71.7     78.1   81.6   62.0   84.5      76.0   91.6   59.8   81.1  78.8     77.5
o4-mini                       82.4   62.1   69.6     71.4   71.0   57.9   76.5      68.5   90.3   52.9   76.1  74.3     71.0
Gemini 2.5 Flash Preview      63.6   45.0   56.9     55.2   46.0   38.9   57.1      47.3   82.8   34.5   62.0  59.8     53.4
Gemini 2.5 Pro Preview        88.1   65.7   74.9     76.2   77.4   59.1   80.0      72.2   94.7   68.0   83.8  82.3     76.2
Claude 3.7 Sonnet             89.3   82.8   77.3     83.1   75.9   59.4   73.7      69.7   91.4   65.5   85.5  82.5     77.9
Claude 3.7 Sonnet Thinking    90.5   85.6   82.4     86.2   80.2   74.4   86.4      80.3   94.9   78.9   91.4  89.4     84.8

Open-source LLMs
QwQ-32B                       54.3   51.0   31.4     45.6   43.6   33.5   46.0      41.0   79.9   21.1   51.4  50.9     45.2
DeepSeek-R1                   72.6   62.7   48.4     61.2   59.3   43.8   58.2      53.7   83.5   35.4   60.4  57.4     57.4
Qwen-2.5-7B                   7.1    10.0   1.7      6.3    5.2    5.8    8.1       6.4    42.6   5.8    10.7  16.1     8.8
Qwen-2.5-7B w/ RL (ours)      84.3   71.3   46.0     67.2   55.7   53.9   61.7      57.1   63.4   47.5   57.6  56.8     60.8


Analysis

Reinforcement Learning vs. Best-of-N Sampling

Best-of-N (BoN) sampling approaches the RL-trained model's results only when N is extremely large, showing that RL's gains cannot easily be matched by simply sampling more at inference time.
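For reference, the BoN baseline can be sketched as follows (policy.sample and reward are hypothetical interfaces; unlike RL, no parameter update ever occurs):

def best_of_n(policy, prompt: str, reward, n: int) -> str:
    # Draw n independent SVG samples and keep the highest-scoring one.
    # Gains come purely from extra inference-time compute, which is why
    # BoN needs very large n to approach the RL-trained model.
    candidates = [policy.sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda svg: reward(svg, prompt))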

How SGP Generation Capabilities Evolve during Training

Strategy 1: Decomposition into basic components.

Strategy 2: Contextual optional details.

BibTeX

@article{chen2025sgp,
  title={Symbolic Graphics Programming with Large Language Models},
  author={Chen, Yamei and Zhang, Haoquan and Huang, Yangyi and Qiu, Zeju and Zhang, Kaipeng and Wen, Yandong and Liu, Weiyang},
  journal={arXiv preprint arXiv:2509.01000},
  year={2025}
}