Crucible — Autonomous Sprite Synthesis Pipeline

Overview

Crucible is an autonomous pipeline built to procedurally generate combined 16-bit RPG sprites. It replaces conventional sequential processing with a cooperative multi-agent pipeline — perception, reasoning, and rendering as distinct role-agents — orchestrated in Python, with the Google Agent Development Kit (ADK) driving the reasoning agent and carrying the state handoff between stages.

The Problem: Raw Pixels vs. Semantic Fusion

The Goal: How do we logically combine two game items to make a completely new one?

If you rely on traditional image processing and mathematically average the raw pixels of a “Sword” and a “Green Gem”, you don’t get an Emerald Sword—you get a blurry, brown mess. We needed a system that could combine the underlying concepts of the items rather than their surface-level pixel data. We needed semantic fusion.

The Research Tradeoff: Why Not VAE Latent Interpolation?

A natural systems-level approach to sprite fusion is to train a Variational Autoencoder (VAE) on the sprite dataset. You encode both sprites into latent vectors (z_a and z_b) and decode an interpolated point between them:

z_fused = (1 - alpha) * z_a + alpha * z_b, alpha in [0, 1]

A well-trained VAE enforces a smooth posterior (q(z|x) ~ N(mu, sigma^2 I)) via the Kullback-Leibler (KL) divergence term in the Evidence Lower Bound (ELBO) objective:

L = E[log p(x|z)] - D_KL(q(z|x) || p(z))

As a result, points between z_a and z_b on the learned manifold are more likely to decode into coherent images than interpolations in raw pixel space.
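As a rough illustration of the two formulas above, here is a minimal sketch in plain Python. It assumes a standard normal prior p(z) = N(0, I) and a diagonal Gaussian posterior, neither of which is spelled out above, and the function names are ours for illustration only.

```python
import math


def interpolate_latents(z_a, z_b, alpha):
    """Linear interpolation: z_fused = (1 - alpha) * z_a + alpha * z_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same dimension")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I)).

    Assumption: the prior is a standard normal, the usual VAE choice.
    """
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )
```

With alpha = 0 the result is z_a, with alpha = 1 it is z_b, and the KL term is zero only when the posterior already equals the prior.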

The Bottleneck: After our cleaning heuristics, only ~1,460 structurally sound sprites survived (from ~89,400 raw). At this scale the reconstruction term dominates the KL regularizer: the encoder memorizes individual sprites rather than generalizing, so the latent space is not smooth enough for reliable cross-category item interpolation.

Instead of fighting mathematical dataset limitations, we sidestepped the problem entirely. We reframed sprite fusion from a latent manifold problem into a language grounding problem.

Architecture: Pipeline Decomposition & ADK

By utilizing the RPG Framework (Recaption, Plan, Generate) proposed by Yang et al., 2024, we built a cooperative multi-agent pipeline. Instead of a monolithic script, we separated perception (eyes), reasoning (brain), and rendering (brush) into distinct role-agents communicating through shared state. Only the reasoning agent runs on Google ADK as an LlmAgent; the perception and rendering agents are role-defined units in plain Python, coordinated sequentially by a Forge orchestrator.

Figure: Crucible Multi-Agent Pipeline Architecture
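The sequential handoff through shared state can be sketched in plain Python as below. This is an illustration under assumptions, not the actual Forge orchestrator: the three stage functions are hard-coded stand-ins, the state keys are hypothetical, and the real Smith runs as an ADK LlmAgent rather than a local function.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Stage = Callable[[State], State]


def run_forge(stages: List[Tuple[str, Stage]], state: State) -> State:
    """Run role-agents in order; each reads shared state and returns new keys."""
    for name, stage in stages:
        updates = stage(state)
        if not updates:
            raise RuntimeError(f"stage {name!r} produced no output")
        state = {**state, **updates}
    return state


# Hypothetical stand-ins for the three role-agents.
def appraiser(state: State) -> State:
    return {"appraisal": {"type": "sword", "appearance": "grey steel"}}


def smith(state: State) -> State:
    item_type = state["appraisal"]["type"]
    return {"blueprint": {"archetype": item_type, "parts": {"blade": "steel"}}}


def forger(state: State) -> State:
    return {"prompt": f"pixel art {state['blueprint']['archetype']}"}


PIPELINE = [("appraiser", appraiser), ("smith", smith), ("forger", forger)]
```

Calling run_forge(PIPELINE, {}) returns a state dict holding the appraisal, the blueprint, and the final prompt, so each stage only ever sees what earlier stages wrote.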

1. Agent 1: The Appraiser (Vision)

Identity and appearance are split across two specialist vision models, because a small generative VLM asked “what is this?” about a 16×16 icon tends to hallucinate (it once called a blue gem a “health potion”):

Identity — SigLIP zero-shot (SO400M): the sprite is scored against a fixed RPG vocabulary (sword, gemstone, potion bottle, …) with prompt ensembling. Closed-set classification is far more reliable than open generation at this resolution. (This is the SigLIP paper finally doing real work in the pipeline.)
Appearance — Moondream2 (1.8B VLM): asked for colours, materials, and textures only — never to name the item or list parts. The structured type + appearance appraisal is written to shared state for the Smith.
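The closed-set identity step can be sketched as follows. This is a minimal illustration, not the project's code: score_fn is a stand-in for a SigLIP image-text similarity call, the prompt templates are hypothetical, and averaging per-template scores is only one common form of prompt ensembling.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

# Hypothetical prompt templates; the real ensemble is not listed here.
TEMPLATES = ("a pixel art sprite of a {}", "a 16-bit RPG icon of a {}")


def classify_zero_shot(
    image: Any,
    vocabulary: Sequence[str],
    score_fn: Callable[[Any, str], float],
    templates: Sequence[str] = TEMPLATES,
) -> Tuple[str, Dict[str, float]]:
    """Score the image against every label, averaging over prompt templates."""
    scores = {}
    for label in vocabulary:
        per_template = [score_fn(image, t.format(label)) for t in templates]
        scores[label] = sum(per_template) / len(per_template)
    best = max(scores, key=scores.get)
    return best, scores


def build_appraisal(item_type: str, appearance: str) -> Dict[str, str]:
    """Structured type + appearance record handed to the Smith."""
    return {"type": item_type, "appearance": appearance}
```

Because the answer is always drawn from the fixed vocabulary, the classifier cannot invent a label such as “health potion” unless that label is in the list.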

2. Agent 2: The Master Smith (Reasoning)

A text-only ADK LlmAgent reads the {appraisal} state and runs an explicit four-step chain of thought that picks the fused archetype and then emits a part-by-part material blueprint: for every component (blade, grip, gem, …) a concrete material, color, source (A/B/fused), and surface detail. It does not write the diffusion prompt; it emits a structured parts object validated by a schema. This is true semantic compositing (e.g. “the blade is polished gold inherited from B, the grip is oak wood from A”) rather than a vague texture blend.
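One way to picture the validated parts object is the dataclass sketch below. The field names and the parse_blueprint helper are hypothetical; the actual schema and validation library used by the Smith may differ.

```python
from dataclasses import dataclass
from typing import Dict

VALID_SOURCES = {"A", "B", "fused"}


@dataclass(frozen=True)
class Part:
    material: str
    color: str
    source: str  # which input item the material comes from
    surface_detail: str

    def __post_init__(self) -> None:
        if self.source not in VALID_SOURCES:
            raise ValueError(f"source must be one of {sorted(VALID_SOURCES)}")
        for field_name in ("material", "color", "surface_detail"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} must not be empty")


def parse_blueprint(raw: dict) -> Dict[str, Part]:
    """Validate the Smith's raw output into typed parts, keyed by component."""
    parts = raw.get("parts")
    if not parts:
        raise ValueError("blueprint has no parts")
    return {name: Part(**spec) for name, spec in parts.items()}
```

Rejecting a malformed blueprint at this boundary keeps bad reasoning output from silently reaching the renderer.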

3. Agent 3: The Forger (Generative Execution)

The Forger assembles the Smith’s blueprint into a diffusion prompt deterministically — one localized clause per part — so every per-part material choice is guaranteed to reach the model, then renders via Flux.1 (Pollinations) and runs the output through our custom post-processing engine. An optional ControlNet-Canny path (SDXL, in the Colab notebook) conditions generation on the structure_source silhouette so the fused item keeps the chosen shape.
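The deterministic prompt assembly can be sketched like this. The clause wording, the default style string, and the dictionary keys are assumptions for illustration; only the one-clause-per-part idea comes from the description above.

```python
from typing import Dict


def assemble_prompt(
    archetype: str,
    parts: Dict[str, Dict[str, str]],
    style: str = "16-bit pixel art RPG item sprite",  # hypothetical style prefix
) -> str:
    """Build one localized clause per part, in blueprint order."""
    clauses = [
        f"the {name} is {spec['surface_detail']} {spec['color']} {spec['material']}"
        for name, spec in parts.items()  # dicts keep insertion order
    ]
    return f"{style}, {archetype}: " + "; ".join(clauses)
```

Since no model is involved in this step, the same blueprint always yields the same prompt string, and every part is guaranteed a clause.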

Post-Processing & Optimization

Getting a massive diffusion model to spit out a clean, 16-color RPG sprite required strict post-processing and hardware management.

1. Memory Bounding

The two local vision models (SigLIP SO400M and Moondream2) run in float16 and are loaded and unloaded in sequence, so peak VRAM stays at ~3.5 GB, within the 6 GB budget of a consumer GPU (GTX 1660 Super). The reasoning and diffusion stages are API-only and use no local GPU. Each generation samples a fresh random seed (logged to the console for traceability), so the same fusion can be re-rolled into effectively unlimited variations.
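The load-use-unload pattern can be sketched as below. This is an assumption-laden illustration: the loader and task callables are placeholders for the real model code, and the cache release via torch.cuda.empty_cache() only runs if PyTorch happens to be installed.

```python
import gc
from typing import Any, Callable, List


def _empty_cuda_cache() -> None:
    """Release cached GPU memory if PyTorch with CUDA is present (optional)."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_then_unload(
    loader: Callable[[], Any],
    task: Callable[[Any], Any],
    events: List[str],
    name: str,
) -> Any:
    """Load one model, run a task with it, and free it before returning."""
    model = loader()
    events.append(f"load:{name}")
    try:
        return task(model)
    finally:
        del model  # the only reference lives in this frame
        gc.collect()
        _empty_cuda_cache()
        events.append(f"unload:{name}")
```

Keeping the model reference inside the helper matters: if the caller held on to the first model, it would still occupy memory while the second one loads.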

2. The Pixel-Lattice Layer

To convert high-fidelity diffusion outputs into accurate game assets, the Forger pipeline applies two strict operations:

Pixelation Crunch: The image is crushed down to 32x32 using Image.NEAREST filtering, then scaled back up to 512x512. This violently destroys gradient blending and forces a hard block grid regardless of the source image content.
Palette Quantization: We apply Image.quantize(colors=16, method=MEDIANCUT) after the pixelation crunch. This ensures the 16-color palette is snapped perfectly to the already-hard pixel edges, preventing the muddy intermediate colors that plague standard downscaling.
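The two operations above can be sketched with Pillow as follows. The function name and its default arguments are ours for illustration, and the input is converted to RGB first because median-cut quantization in Pillow does not accept RGBA images.

```python
from PIL import Image


def pixel_lattice(img, grid=32, out_size=512, colors=16):
    """Pixelation crunch followed by palette quantization."""
    rgb = img.convert("RGB")
    # Crunch: nearest-neighbour downscale, then upscale to a hard block grid.
    small = rgb.resize((grid, grid), Image.NEAREST)
    blocky = small.resize((out_size, out_size), Image.NEAREST)
    # Quantize after the crunch so the palette snaps to the block edges.
    return blocky.quantize(colors=colors, method=Image.MEDIANCUT)
```

The result is a palette-mode ("P") image of out_size x out_size pixels in which every 16x16 block shares a single colour index.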

What Shipped

Semantic Material Compositing. The Master Smith evolved from drafting vague texture prompts into a true semantic planner: it emits a localized, per-part material blueprint (material, colour, source item, surface detail) that the Forger renders deterministically — real material logic, not a blended-prompt average.
Reliable identity via SigLIP. Item recognition moved to a SigLIP zero-shot classifier over a fixed RPG vocabulary, fixing the small-VLM hallucinations (a blue gem read as a “health potion”).
Structural fidelity via ControlNet. An optional SDXL + ControlNet-Canny path conditions generation on the structure_source silhouette, so a fused item keeps the intended shape.

Recommendations & Future Work

This was a completed mini-project; the following are directions for anyone extending it rather than active work.

Asynchronous decoupling. The diffusion-bound Forger dominates latency while the lightweight agents idle. A message broker (Redis Pub/Sub or RabbitMQ), with perception/reasoning as producers and diffusion as an independent consumer, would let the pipeline clear a backlog at full throughput instead of blocking on each render.
A multimodal Smith. The reasoning agent is currently text-only. Since Gemini is multimodal, feeding it the actual sprites would ground the plan on pixels and fully match the RPG paper’s multimodal CoT.
ControlNet as a core stage. Promote the notebook-only ControlNet path into the main pipeline so structure_source always drives the silhouette.
Ground-truth identity. The dataset ships labels; using them for known sprites would give exact identity and reserve SigLIP zero-shot for unlabeled inputs.
Robustness. Fail loudly when the Smith returns nothing (instead of rendering a generic fallback), and add an optional fixed seed for reproducible fusions.
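The robustness recommendation could look roughly like the sketch below. Both helpers are hypothetical and not part of the shipped pipeline; the log format and the 32-bit seed range are assumptions.

```python
import random
from typing import Optional


def resolve_seed(fixed_seed: Optional[int] = None) -> int:
    """Use the caller's seed for reproducibility, else sample a fresh one."""
    if fixed_seed is not None:
        seed = fixed_seed
    else:
        seed = random.SystemRandom().randrange(2**32)
    print(f"[forge] seed={seed}")  # logged for traceability
    return seed


def require_blueprint(blueprint: Optional[dict]) -> dict:
    """Fail loudly instead of rendering a generic fallback."""
    if not blueprint or not blueprint.get("parts"):
        raise RuntimeError("Smith returned no blueprint; refusing to render")
    return blueprint
```

Passing the same fixed seed twice should then reproduce a fusion, provided the rendering backend honours the seed.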

References

Yang et al. (2024). Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs.
Qin et al. (2024). DiffusionGPT: LLM-Driven Text-to-Image Generation System.
Zhai et al. (2023). Sigmoid Loss for Language Image Pre-Training (SigLIP).
Chen, Y.-C., & Jhala, A. (2025). GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation.