Introduction
Large language models (LLMs) excel at understanding and generating code, but their ability to write symbolic graphics programs (SGPs), which produce images when rendered, remains underexplored. SVG, as a widely used SGP format, provides an ideal testbed for evaluating LLMs in this area. We address two main research questions:
(1) How well can LLMs draw using SGPs?
(2) How can we improve their ability to generate SGPs?
For the first question, we introduce SGP-GenBench, a comprehensive benchmark that evaluates LLMs' SGP generation from three perspectives: object, scene, and composition. For the second, we propose a reinforcement learning approach that uses similarity scores between visual-encoder outputs for the rendered image and the input text description as reward signals. Our experiments show that RL significantly boosts the symbolic graphics programming abilities of LLMs.
SGP as Visual Synthesis
Symbolic graphics programming (SGP) aims to generate structured graphics code, such as SVG, from natural language instructions. Unlike conventional text-to-image models that produce pixel-based images, SGP translates prompts into interpretable, parametric programs, enabling precise and controllable visual synthesis.
SVG is widely used and serves as a natural bridge between language and vision, making it an ideal format for testing how well LLMs ground prompt semantics in code. SGPs are both:
(a) parametric, allowing exact geometric and appearance control;
(b) procedural, supporting hierarchical and compositional scene construction.
These properties enable fine-grained, interpretable, and flexible visual generation, and allow us to directly analyze the complexity and structure of generated scenes.
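To make these two properties concrete, here is a hand-written toy example (not drawn from SGP-GenBench): exact numeric attributes control geometry and appearance, while the nested `<g>` group composes an object hierarchically from primitives. The rendering call uses cairosvg, one common SVG rasterizer.

```python
import cairosvg  # one common SVG rasterizer; any renderer works

# Toy SVG scene (hand-written for illustration). Numeric attributes give
# parametric control; the nested <g> group builds a "tree" object
# procedurally from primitive shapes.
svg_scene = """
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">
  <rect x="0" y="150" width="200" height="50" fill="#7cba5d"/>   <!-- ground -->
  <g id="tree" transform="translate(100,150)">                   <!-- composed object -->
    <rect x="-8" y="-40" width="16" height="40" fill="#8b5a2b"/> <!-- trunk -->
    <circle cx="0" cy="-55" r="28" fill="#2e8b57"/>              <!-- canopy -->
  </g>
</svg>
"""

# Rasterize the program to pixels, as is done whenever generations are scored.
cairosvg.svg2png(bytestring=svg_scene.encode(), write_to="scene.png")
```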
SGP-GenBench
Construction
- Object generation: Assessed on SGP-Object-val, a set of 930 single-object examples with captions, focusing on the model's ability to render individual objects accurately.
- Scene generation: Evaluated on COCO-val, covering 80 object categories with complex scene descriptions; we sample 1,024 images for efficient and representative assessment.
- Compositional generation: Tested on SGP-CompBench with 3,200 prompts, targeting attribute binding, spatial relationships, and numeracy using 80 common objects.
Method
Given a text description, we sample a group of SVG programs from the model and render them as images. Each program is scored by the alignment between its rendered image and the text description; advantages are then computed from these scores and used to update the model.
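A minimal sketch of this loop is given below, assuming a GRPO-style group-relative advantage (the text above states only that advantages are computed from the scores). The helpers `sample_svg`, `render_svg`, `text_image_score`, and `policy_update` are hypothetical placeholders for the policy LLM, an SVG rasterizer, the vision-model reward, and the policy-gradient update.

```python
import numpy as np

# --- Placeholders for components outside this sketch (names are assumptions) ---
def sample_svg(prompt: str) -> str: ...            # one SVG program from the policy LLM
def render_svg(svg: str) -> bytes: ...             # rasterize, e.g. via cairosvg.svg2png
def text_image_score(prompt: str, image: bytes) -> float: ...  # vision-model alignment
def policy_update(svgs: list, advantages: np.ndarray) -> None: ...

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Center and scale rewards within the group sampled for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rl_step(prompt: str, group_size: int = 8) -> None:
    # 1) Sample a group of candidate SVG programs for the prompt.
    svgs = [sample_svg(prompt) for _ in range(group_size)]
    # 2) Render each program to an image.
    images = [render_svg(svg) for svg in svgs]
    # 3) Score text-image alignment to get per-sample rewards.
    rewards = np.array([text_image_score(prompt, img) for img in images])
    # 4) Convert scores to advantages and update the policy.
    policy_update(svgs, group_relative_advantages(rewards))
```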
Discussion of Method
Distilling from vision models. We use external vision models (e.g., DINO, CLIP) to define rewards, guiding the LLM to align its generations with strong visual and semantic priors. This implicit distillation improves both visual understanding and cross-modal alignment.
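As one concrete possibility, a CLIP-based alignment reward could be computed as below. This is a sketch, not necessarily the exact configuration used in training; the `openai/clip-vit-base-patch32` checkpoint is an assumption.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed checkpoint, for illustration
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def clip_reward(text: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP embeddings of the rendered image and
    the text description, usable directly as an RL reward."""
    inputs = processor(text=[text], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()
```

When a reference image is available, an image-image similarity from a self-supervised encoder such as DINO can play the same role.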
Training without ground truth SGPs. Our RL method learns directly from images using feedback from vision models, removing the need for paired image-program data and enabling scalable, diverse training.
Aligning language and vision. The RL process ensures that generated SGPs are consistent with both linguistic input and visual model judgments, grounding language in visual evidence and enhancing cross-modal reasoning.
Results
Overview of SGP-GenBench
- SGP-GenBench reflects general model abilities. The ranking of models on our benchmark aligns well with their perceived general capabilities, especially in code generation. For example, Claude 3.7 Sonnet Thinking generally outperforms o3, which in turn surpasses Gemini 2.5 Pro Preview, followed by open-source systems like DeepSeek-R1 and Qwen-2.5-7B. This consistent ordering suggests that SVG generation is a reliable indicator of broader model competence.
- Closed-source models remain strongest. Frontier systems achieve the best results not only on compositional reasoning tasks such as attribute binding and numeracy, where Claude 3.7 Sonnet Thinking reaches 90.5 on color binding and 89.4 on numeracy, but also on scene and object fidelity, where Gemini 2.5 Pro Preview attains the top DINO object score of 0.653 and strong VQA scene performance of 0.554.
- Our RL-trained model substantially narrows the gap. The RL post-trained Qwen-2.5-7B raises its overall compositional score from 8.8 to 60.8, outperforming all other open-source counterparts such as DeepSeek-R1 and QwQ-32B. It also achieves the best VQA score across all models at 0.596, slightly higher than Claude 3.7 Sonnet Thinking, demonstrating that reinforcement learning enables open-source models to approach the closed-source frontier.
Object & Scene
| Model | CLIP Sce. ↑ | CLIP Obj. ↑ | CLIP Avg. ↑ | DINO Sce. ↑ | DINO Obj. ↑ | DINO Avg. ↑ | VQA Sce. ↑ | VQA Obj. ↑ | VQA Avg. ↑ | HPS Sce. ↑ | HPS Obj. ↑ | HPS Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Frontier closed-source LLMs** | | | | | | | | | | | | |
| GPT-4o-mini | 0.207 | 0.278 | 0.243 | 0.021 | 0.573 | 0.297 | 0.295 | 0.465 | 0.380 | 0.118 | 0.174 | 0.146 |
| GPT-4o | 0.219 | 0.284 | 0.252 | 0.031 | 0.602 | 0.316 | 0.338 | 0.497 | 0.417 | 0.125 | 0.182 | 0.153 |
| o1-mini | 0.221 | 0.289 | 0.255 | 0.028 | 0.603 | 0.315 | 0.330 | 0.508 | 0.419 | 0.121 | 0.185 | 0.153 |
| o1 | 0.220 | 0.285 | 0.252 | 0.031 | 0.607 | 0.319 | 0.354 | 0.520 | 0.437 | 0.122 | 0.185 | 0.153 |
| o3-mini | 0.231 | 0.293 | 0.262 | 0.036 | 0.608 | 0.322 | 0.379 | 0.520 | 0.450 | 0.128 | 0.187 | 0.158 |
| o3 | 0.253 | 0.283 | 0.268 | 0.067 | 0.595 | 0.331 | 0.521 | 0.482 | 0.502 | 0.153 | 0.180 | 0.166 |
| o4-mini | 0.246 | 0.296 | 0.271 | 0.052 | 0.629 | 0.340 | 0.469 | 0.536 | 0.503 | 0.143 | 0.193 | 0.168 |
| Gemini 2.0 Flash | 0.204 | 0.275 | 0.240 | 0.023 | 0.588 | 0.306 | 0.293 | 0.468 | 0.381 | 0.116 | 0.176 | 0.146 |
| Gemini 2.5 Flash Preview | 0.222 | 0.286 | 0.254 | 0.033 | 0.603 | 0.318 | 0.349 | 0.498 | 0.424 | 0.125 | 0.183 | 0.154 |
| Gemini 2.5 Pro Preview | 0.256 | 0.302 | 0.279 | 0.088 | **0.653** | **0.371** | 0.554 | 0.572 | 0.563 | 0.154 | 0.199 | 0.177 |
| Claude 3.5 Sonnet | 0.240 | 0.293 | 0.266 | 0.055 | 0.624 | 0.340 | 0.428 | 0.528 | 0.478 | 0.140 | 0.190 | 0.165 |
| Claude 3.7 Sonnet | **0.262** | **0.306** | **0.284** | 0.088 | 0.647 | 0.368 | 0.581 | 0.567 | 0.574 | **0.165** | **0.200** | **0.183** |
| Claude 3.7 Sonnet Thinking | **0.262** | 0.305 | **0.284** | **0.090** | 0.642 | 0.366 | **0.594** | **0.574** | **0.584** | 0.164 | 0.199 | 0.181 |
| **Open-source LLMs** | | | | | | | | | | | | |
| QwQ-32B | 0.219 | 0.272 | 0.245 | 0.031 | 0.549 | 0.290 | 0.334 | 0.456 | 0.395 | 0.123 | 0.172 | 0.147 |
| DeepSeek-R1 | 0.228 | 0.278 | 0.253 | 0.042 | **0.594** | 0.318 | 0.416 | 0.508 | 0.462 | 0.134 | **0.180** | 0.157 |
| Qwen-2.5-7B | 0.155 | 0.213 | 0.184 | 0.008 | 0.400 | 0.204 | 0.265 | 0.385 | 0.325 | 0.103 | 0.148 | 0.125 |
| Qwen-2.5-7B w/ RL (ours) | **0.258** | **0.286** | **0.272** | **0.102** | 0.566 | **0.334** | **0.632** | **0.560** | **0.596** | **0.150** | 0.177 | **0.164** |
Bold indicates the best in each group; all metrics are higher-is-better (↑).
Composition
| Model | Color ↑ | Shape ↑ | Texture ↑ | Attr. Avg. ↑ | 2D ↑ | 3D ↑ | Implicit ↑ | Spat. Avg. ↑ | Total ↑ | Item ↑ | CPI ↑ | Num. Overall ↑ | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Frontier closed-source LLMs** | | | | | | | | | | | | | |
| GPT-4o | 62.2 | 48.7 | 34.3 | 48.4 | 49.7 | 37.3 | 49.2 | 45.4 | 85.9 | 25.5 | 51.1 | 52.7 | 48.3 |
| o1 | 70.8 | 25.2 | 53.0 | 49.6 | 54.6 | 39.4 | 46.4 | 46.8 | 66.4 | 20.1 | 41.7 | 42.0 | 46.7 |
| o3 | 88.9 | 73.6 | 71.7 | 78.1 | **81.6** | 62.0 | 84.5 | 76.0 | 91.6 | 59.8 | 81.1 | 78.8 | 77.5 |
| o4-mini | 82.4 | 62.1 | 69.6 | 71.4 | 71.0 | 57.9 | 76.5 | 68.5 | 90.3 | 52.9 | 76.1 | 74.3 | 71.0 |
| Gemini 2.5 Flash Preview | 63.6 | 45.0 | 56.9 | 55.2 | 46.0 | 38.9 | 57.1 | 47.3 | 82.8 | 34.5 | 62.0 | 59.8 | 53.4 |
| Gemini 2.5 Pro Preview | 88.1 | 65.7 | 74.9 | 76.2 | 77.4 | 59.1 | 80.0 | 72.2 | 94.7 | 68.0 | 83.8 | 82.3 | 76.2 |
| Claude 3.7 Sonnet | 89.3 | 82.8 | 77.3 | 83.1 | 75.9 | 59.4 | 73.7 | 69.7 | 91.4 | 65.5 | 85.5 | 82.5 | 77.9 |
| Claude 3.7 Sonnet Thinking | **90.5** | **85.6** | **82.4** | **86.2** | 80.2 | **74.4** | **86.4** | **80.3** | **94.9** | **78.9** | **91.4** | **89.4** | **84.8** |
| **Open-source LLMs** | | | | | | | | | | | | | |
| QwQ-32B | 54.3 | 51.0 | 31.4 | 45.6 | 43.6 | 33.5 | 46.0 | 41.0 | 79.9 | 21.1 | 51.4 | 50.9 | 45.2 |
| DeepSeek-R1 | 72.6 | 62.7 | **48.4** | 61.2 | **59.3** | 43.8 | 58.2 | 53.7 | **83.5** | 35.4 | **60.4** | **57.4** | 57.4 |
| Qwen-2.5-7B | 7.1 | 10.0 | 1.7 | 6.3 | 5.2 | 5.8 | 8.1 | 6.4 | 42.6 | 5.8 | 10.7 | 16.1 | 8.8 |
| Qwen-2.5-7B w/ RL (ours) | **84.3** | **71.3** | 46.0 | **67.2** | 55.7 | **53.9** | **61.7** | **57.1** | 63.4 | **47.5** | 57.6 | 56.8 | **60.8** |
Bold indicates the best in each group; all metrics are higher-is-better (↑). Color/Shape/Texture and their Avg. measure attribute binding; 2D/3D/Implicit and their Avg. measure spatial relations; Total/Item/CPI/Overall measure numeracy.
Analysis
Reinforcement Learning vs. Best-of-N Sampling
How SGP Generation Capabilities Evolve during Training
Strategy 1: Decomposition into basic components.
Strategy 2: Contextual optional details.
Bibtex
@article{chen2025sgp,
  title={Symbolic Graphics Programming with Large Language Models},
  author={Chen, Yamei and Zhang, Haoquan and Huang, Yangyi and Qiu, Zeju and Zhang, Kaipeng and Wen, Yandong and Liu, Weiyang},
  journal={arXiv preprint arXiv:2509.01000},
  year={2025}
}