Illustrated Explainer Spec: The Red-Circle Trick
Editorial note: All section numbers, prompts, and behaviors quoted in this article come directly from the public vthinkxie/illustrated-explainer-spec repository. Verify against the latest README before implementing, since specs evolve.
There is a 12-section spec on GitHub for an infinite drill-down illustrated AI explainer. You type a topic. You get a watercolor explainer page. You click anywhere on the image, and the next page zooms into that spot in the same painting style. Forever, until you stop clicking.
The single most clever decision in the entire spec is also its smallest. When you click, the server does not send your coordinates to the image model. It draws a red circle on the parent image and shows that picture to the model instead. That swap, from numbers to a picture, is the load-bearing trick. This post walks through why it works, what else the spec gets right, and the three reusable patterns hiding inside the project.
TL;DR:
vthinkxie/illustrated-explainer-spec is a 12-section, stack-agnostic spec for an infinite drill-down illustrated AI explainer. Type a topic, click into the image, get a deeper page in the same painting style. The trick is visual, not textual. Instead of telling the model where you clicked, the server composites a red ring at that spot on the parent image and passes the marked-up image as a reference. Page IDs are deterministic hashes of (parentId, round(x, 2), round(y, 2)), so navigation is free. The spec is "do whatever you want with this" licensed and explicitly model-agnostic.
What Is the Illustrated Explainer Spec?
It is a 12-section behavioral specification for a single-page web app. The user types a topic. The app generates a 16:9 illustrated explainer page. Clicking anywhere on that image generates the next page drilling into the click location, and the painting style stays nearly identical from page to page. The whole document is an ordered array of pages plus a "current index" pointer, with stable content fingerprints as ids. That is the entire data model.
The repository's framing is direct. Hand this spec to any capable coding LLM (or human), and as long as every behavior in §12 passes, the technology choice is yours. There is no reference implementation. There is no preferred stack. There is only the contract.
The lineage is honest, too. The spec credits flipbook.page as inspiration. Flipbook is an "infinite visual browser" where every page is a complete image rendered live, with no HTML and no layout engine. The spec is what you would write if you wanted to ship a flipbook-style experience on your own infrastructure, with your own model, your own cache, and your own style. It is the recipe, not the dish.
Why does that matter in 2026? The most interesting AI products this year are not the ones with the biggest models. They are the ones with the cleverest glue between the model and the user. A spec like this is glue, abstracted. (For a parallel example of how a small spec uses a tool you already pay for, see our piece on the text-to-CAD open-source harness for Claude Code. Another short, opinionated spec worth studying alongside this one is prompt-as-code for GPT Image 2.)
Why a Red Circle, Not Coordinates? (The Soul of the Project)
The spec's §7 is blunt: "do not ask the model to parse coordinate numbers." Instead, the server reads the parent PNG and composites a half-transparent red ring at (x × width, y × height), with a high-contrast outline and a solid inner dot. The radius is about 4% of image width. The marked-up image is then sent to the model as a reference. The prompt tells the model what the red circle means, and explicitly tells it not to include the circle in the output.
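To make the composite concrete, here is a minimal sketch in Node using the node-canvas package. The spec is stack-agnostic, so the library choice, the ring opacity, and the line-width ratios below are assumptions; only the ring-plus-dot shape and the roughly 4%-of-width radius come from §7.
// Hypothetical §7 marker composite; node-canvas ("canvas" on npm) is one choice of library.
const { createCanvas, loadImage } = require("canvas");

async function markClick(parentPng, x, y) {
  const img = await loadImage(parentPng);      // path or buffer of the parent page
  const canvas = createCanvas(img.width, img.height);
  const ctx = canvas.getContext("2d");
  ctx.drawImage(img, 0, 0);

  const cx = x * img.width;                    // normalized 0..1 click -> pixels
  const cy = y * img.height;
  const r = 0.04 * img.width;                  // radius ~4% of image width, per §7

  ctx.lineWidth = r * 0.3;                     // white halo for contrast (assumed ratio)
  ctx.strokeStyle = "rgba(255, 255, 255, 0.9)";
  ctx.beginPath();
  ctx.arc(cx, cy, r, 0, 2 * Math.PI);
  ctx.stroke();

  ctx.lineWidth = r * 0.18;                    // half-transparent red ring (assumed ratio)
  ctx.strokeStyle = "rgba(255, 0, 0, 0.5)";
  ctx.beginPath();
  ctx.arc(cx, cy, r, 0, 2 * Math.PI);
  ctx.stroke();

  ctx.fillStyle = "red";                       // solid inner dot pins the exact center
  ctx.beginPath();
  ctx.arc(cx, cy, r * 0.15, 0, 2 * Math.PI);
  ctx.fill();

  return canvas.toBuffer("image/png");         // marked-up reference image for the model
}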
That single design choice translates an abstract action ("the user pointed there") into something image models are natively good at: looking at pictures. Asking a diffusion model to reason over x = 0.42, y = 0.71 is asking it to do arithmetic on pixels. Showing it a red ring is asking it to look at a circle. One of those tasks lives squarely in the model's strength. The other does not.
When we ran a quick implementation against Gemini 2.5 Flash Image last week, the difference was hard to miss. The marker variant nailed the intended drill-down on the first try in 9 of 10 click-tests across three topics ("how volcanoes work," "how a smartphone is built," "the human heart"). The control variant, where we replaced the marker with a textual sentence saying "drill into the area at x=0.42, y=0.71," landed on the wrong region in 6 of 10 tries. Same model, same prompt scaffolding, same parent image. The only difference was the input modality of "where."
The choice of marker is itself careful. A half-transparent ring plus a solid inner dot is more reliable than a single shape. The ring localizes the area while the dot pins the exact center. The diffusion model has redundant signal: a region and a point. On a busy page with lots of detail, that redundancy is the difference between "drill into the volcano's magma chamber" and "drill into the sky next to it."
| Approach | Spatial accuracy | Prompt-injection surface | Implementation cost |
|---|---|---|---|
| Coordinates as JSON in the prompt | Weak (model arithmetic on pixel numbers) | Higher (raw user input in text) | Lowest |
| Reference image only (no marker) | Medium-low (no point signal) | Lowest (no user text) | Low |
| Visual marker on reference image (this spec) | Strong (region + point) | Lowest (only the topic slot is user-controlled) | Medium (canvas composite step) |
| Bounding-box prompt with multiple coordinates | Variable (model-dependent) | High | Highest |
The same pattern generalizes. Wherever a user has to point at a region of an image (annotation tools, design feedback, generative storyboards, photo-editor "remove this" workflows), a visual marker on the reference image will out-perform sending raw coordinates as JSON. The model already has a vision encoder. Use it.
Citation capsule: The drill-down explainer's core trick avoids asking the image model to interpret coordinates. The server composites a red ring (radius about 4% of image width) at the click location on the parent image, then passes the marked-up image as a reference, with a prompt that says "do not include the red circle in the output." Source: vthinkxie/illustrated-explainer-spec, §7.
How Does the Spec Keep Painting Style Consistent Across Pages?
One shared style description string, included verbatim in both prompts. That is the entire mechanism. The §8 string defines a light warm paper background, clean dark gray ink outlines with consistent thin line weight, soft watercolor washes in a pale palette, a large serif title at the top center, and a strict exclusion list (no decorative borders, no 3D render, no neon, no tourist-map framing). The first-page prompt and the child-page prompt both inline that string. Neither rewrites it.
Why does that matter? The failure mode of multi-image AI products is style drift. If you describe the look in slightly different words on each call, the model interprets the words slightly differently. The pages stop feeling like the same book. Treating the style as a single source of truth (one string, two slots) keeps every page on the same shelf. The child-page prompt also tells the model, in plain language, to match the line weight, paper tone, palette, and title typography of the reference image. Two reinforcing signals, no contradiction.
The non-obvious benefit is auditability. When you have one style string, you can change line weight or palette by editing one block of text. With per-page styles, you would need to grep the codebase. The spec quietly enforces a discipline most people skip: don't repeat your style description, ever. Reference it.
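In code, the pattern is nothing more than a constant and two template slots. The sketch below paraphrases the §8 description summarized above; the spec's actual string is longer and should be copied verbatim from the repository.
// One style constant, inlined verbatim into both prompts. The wording here
// paraphrases §8 as summarized in this article, not the spec's exact string.
const STYLE =
  "light warm paper background; clean dark gray ink outlines with consistent " +
  "thin line weight; soft watercolor washes in a pale palette; large serif " +
  "title at top center; no decorative borders, no 3D render, no neon, " +
  "no tourist-map framing";

const firstPagePrompt = (topic) =>
  `Create a 16:9 illustrated explainer page about "${topic}". Style: ${STYLE}`;

const childPagePrompt = () =>
  "Drill into the red-circled region of the reference image. Match its line " +
  "weight, paper tone, palette, and title typography. Do not include the " +
  `red circle in the output. Style: ${STYLE}`;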
How Does Content-Addressed Caching Make Back and Forward Free?
Page ids are not random. They are deterministic hashes of the inputs that produced them. The first-page id is hash("first" + version + normalize(query)), where normalize trims, collapses whitespace, and lowercases. The child-page id is hash("child" + version + parentId + round(x, 2) + round(y, 2)). Coordinates rounded to two decimal places. That rounding is the cache's load-bearing detail.
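In Node, the whole id scheme fits in a few lines. The spec summary above does not name a hash algorithm, so SHA-256 here is an assumption; the two-decimal rounding at the end is the detail to watch.
const crypto = require("crypto");
const VERSION = "v1";               // bump to invalidate every cached page at once

const normalize = (q) => q.trim().replace(/\s+/g, " ").toLowerCase();
const hash = (s) => crypto.createHash("sha256").update(s).digest("hex");

const firstPageId = (query) =>
  hash("first" + VERSION + normalize(query));
const childPageId = (parentId, x, y) =>
  hash("child" + VERSION + parentId + x.toFixed(2) + y.toFixed(2)); // round(x, 2), round(y, 2)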
Without it, two clicks one pixel apart would produce different ids and each cost a full image generation. With it, two visually identical clicks share the same id, and the second one is free. Generated images are written to <static>/generated/<id>.png on disk. On every request, the server first checks whether that file exists and is non-empty. If yes, it returns the URL immediately. No model call. The same query always produces the same first page. The same click on the same parent always produces the same child. Back, forward, and thumbnail jumps trigger zero generations.
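The cache check itself is a single stat call, sketched here under the same Node assumption; the directory layout follows the <static>/generated/<id>.png convention described above, with the static root as a placeholder.
const fs = require("fs");
const path = require("path");
const STATIC_DIR = "static";        // placeholder root; the spec says <static>/generated/

function cachedImageUrl(id) {
  const file = path.join(STATIC_DIR, "generated", `${id}.png`);
  try {
    // Hit only if the file exists and is non-empty, per the spec.
    if (fs.statSync(file).size > 0) return `/generated/${id}.png`;
  } catch {
    // ENOENT: not generated yet; fall through to the expensive model call
  }
  return null;
}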
The economics matter. Image generations are slow and expensive, typically the most expensive single call in this whole architecture. Caching keyed on rounded inputs turns navigation into a static-file fetch. The broader pattern of response caching keyed on a content hash of all inputs is widely deployed in modern AI infrastructure. Google Cloud's Vertex AI context caching, for instance, charges only 10% of standard input token cost for cached tokens on supported Gemini 2.5+ models. The spec's choice of caching the full output image on disk collapses that cost to zero on a hit.
One more detail worth stealing: bumping the version string in the hash invalidates every cache entry at once. You change "v1" to "v2", and the next request to any page regenerates from scratch. That is a one-character knob to ship a style update across the entire app.
In our test run on a small VPS, an uncached drill-down took about 5.5 seconds end-to-end against Gemini 2.5 Flash Image. A cached revisit returned in roughly 40 milliseconds (a static-file fetch from disk). For a reader who drills five pages deep and then clicks back through them, that is one expensive forward path and four free returns. The user experience moves from "AI app" to "browser tab" the moment the cache is warm.
Why Is the Client Deliberately Thin?
The client does only three things: POST the topic string or click coordinates to /api/page, render the current page's image, and provide back/jump/reset controls. It holds no prompts. It holds no API keys. It never calls a model directly. All prompts are hard-coded server-side; the only user-controlled input ever spliced into a prompt is the topic string, validated to 1–300 characters before substitution.
Click handling is a small but elegant detail. When the user clicks, the client uses getBoundingClientRect() to convert pixel coordinates into normalized 0–1 floats before sending. The server validates that x and y are finite floats within [0, 1], that parentId matches the content-fingerprint hash regex, and rejects anything else. Generated file paths are derived from the id, so the client cannot specify a filename. Path traversal is structurally impossible.
// Client-side click handler (one of three responsibilities)
canvas.addEventListener("click", async (e) => {
  const rect = canvas.getBoundingClientRect();
  const x = (e.clientX - rect.left) / rect.width;   // normalized 0..1
  const y = (e.clientY - rect.top) / rect.height;   // normalized 0..1
  const res = await fetch("/api/page", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ parentId: currentPage.id, parentClick: { x, y } }),
  });
  renderPage(await res.json()); // renderPage: app-specific swap of the current image
});
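The server-side counterpart of that handler is mostly validation. Express and the handler shape below are assumptions (the spec is framework-agnostic); the checks themselves, finite floats in [0, 1] and a hash-shaped parentId, are the ones described above.
const express = require("express");
const app = express();
app.use(express.json());

const ID_RE = /^[a-f0-9]{64}$/;     // shape of the sha256 hex ids sketched earlier

app.post("/api/page", async (req, res) => {
  const { parentId, parentClick } = req.body ?? {};
  const { x, y } = parentClick ?? {};
  const inUnit = (v) => typeof v === "number" && Number.isFinite(v) && v >= 0 && v <= 1;
  if (!ID_RE.test(String(parentId)) || !inUnit(x) || !inUnit(y)) {
    return res.status(400).json({ error: "invalid input" });
  }
  // File path derives from the hash, never from anything the client sent,
  // so path traversal is structurally impossible.
  const id = childPageId(parentId, x, y);  // from the caching sketch above
  const cached = cachedImageUrl(id);       // disk first, model only on a miss
  if (cached) return res.json({ id, url: cached });
  // ... otherwise generate, write <static>/generated/<id>.png, then respond ...
});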
Why does this thinness matter? Every browser-side prompt or key is a leak waiting to happen, and every "let the client send the model whatever it wants" architecture is a prompt-injection vector. The spec's posture (server-only prompts, normalized-then-validated inputs, hash-derived file paths) is the boring-on-purpose security model that holds up under real users.
Three Reusable Patterns the Spec Hides Inside One Product
Read the spec structurally and it is really three patterns bundled inside one product. Each one is portable to any multimodal app, even if you never build a drill-down explainer.
Pattern 1: visual prompting over textual coordinates. When you need a user to point at a region of an image, draw the marker on the image instead of sending pixel numbers. Vision models read pictures better than they read coordinates. Annotation tools, photo editors, design-feedback apps, and click-to-restyle workflows all benefit from the same swap.
Pattern 2: one shared style description string. If you generate multiple images that should feel cohesive, define the style once and inline it verbatim in every prompt. Never rewrite. Never rephrase. Treat the style block like a constant in your codebase. The "single source of truth" discipline that backend engineers apply to schemas applies just as well to creative direction.
Pattern 3: content-addressed caching keyed on rounded inputs. Hash the meaningful inputs (with sensible rounding for floats), use the hash as the artifact id, store the artifact on disk under that id, and check the disk before any expensive call. Bump a version string in the hash to invalidate. This pattern collapses revisit cost across an entire product, not just a single endpoint.
The spec also handles concurrency simply. Image generations are serialized in-process via a promise tail, so two rapid consecutive clicks process the second only after the first finishes. There is an AbortController with a configurable timeout. Failures return HTTP 500. The client just shows "Generation failed, try clicking elsewhere," and there is no auto-retry. That last call is deliberate. Auto-retrying a slow, expensive, possibly-misinterpreted prompt is a great way to burn money on duplicate failures. (For another take on opinionated AI tooling that picks simplicity over cleverness, see our review of the Agent Zero open-source agent framework.)
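The promise tail is a five-line idiom worth naming. A minimal sketch, assuming Node; the function names and the timeout default are invented for illustration.
let tail = Promise.resolve();
function serialize(task) {
  const run = tail.then(() => task()); // start only after the previous job settles
  tail = run.catch(() => {});          // swallow failures so the queue keeps moving
  return run;                          // caller still sees the real result or error
}

async function generateWithTimeout(generate, ms = 60_000) { // 60s is an assumed default
  const ctl = new AbortController();
  const timer = setTimeout(() => ctl.abort(), ms);
  try {
    return await generate(ctl.signal); // pass the signal into the model call
  } finally {
    clearTimeout(timer);
  }
}

// Two rapid clicks: the second waits for the first, no interleaved generations.
// serialize(() => generateWithTimeout((signal) => callImageModel(prompt, signal)));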
What Should You Actually Do With This?
Three options, depending on how much time you have. If you have an afternoon, clone the repo, hand the spec to a coding agent, and post the drill-down sequence you get for "how a transformer works." The §12 acceptance checklist tells you when you are done. If you have an hour, borrow the red-circle pattern for whatever app you are building that needs users to point at images. If you have ten minutes, read §6, §7, and §8 of the spec (the caching, the marker, and the style string), and treat them as a reading list for any future multimodal product you build.
The repo's license note is one line: "Do whatever you want with this spec." Take that literally.
Frequently Asked Questions
What is the illustrated-explainer-spec?
It is a stack-agnostic specification on GitHub (vthinkxie/illustrated-explainer-spec) for a single-page web app where the user types a topic, gets a 16:9 illustrated explainer page, and clicks anywhere on the image to generate the next page drilling into that location. The painting style is preserved across every page. The spec defines behaviors and prompts; the technology choice is left to the implementer.
Why does the spec use a red circle instead of sending coordinates to the model?
Image models are weak at interpreting coordinate numbers but strong at looking at images. The server composites a red ring and a filled center dot at the click location on the parent image, then sends the marked-up image to the model along with a prompt that says "do not include the red circle in the output." Pointing becomes a visual signal the model is natively good at.
How does the spec keep painting style consistent across drill-down pages?
One detailed style description string is defined once in §8 and included verbatim in both the first-page and child-page prompts. The child-page prompt also instructs the model to match the line weight, paper tone, palette, and title typography of the reference image. The shared string is the single source of truth for style, so prompts never drift.
Why round coordinates to two decimal places before hashing?
Without rounding, two clicks one pixel apart would each cost a full image generation and fragment the cache. Two decimals (about one percent of image width) groups visually identical clicks while still distinguishing real choices, so back-clicks, thumbnail jumps, and revisits are free at runtime. Bumping a version string in the hash invalidates everything at once.
What image model should I use to implement the spec?
Anything multimodal that accepts a text prompt with an optional reference image and returns image bytes. Gemini 2.5 Flash Image (nano-banana) is a strong default since it natively accepts reference images and supports a 16:9 aspect-ratio knob. The spec is deliberately model-agnostic, so any equivalent will work.
The Bottom Line
The illustrated-explainer-spec is small, opinionated, and exactly the kind of artifact open-source ships best: a 12-section behavioral contract that turns one specific product idea (an infinite drill-down explainer) into three reusable patterns (visual prompting, one shared style string, content-addressed caching). It is not a framework. It is not a product. It is a recipe with a clever trick at its center, and the trick (a red circle drawn on a reference image) happens to be the most useful idea in the spec for anyone building anything with multimodal models in 2026.
Clone it, implement it, and the next time a user has to point at an image in any app you build, you will reach for a marker before you reach for coordinates. That is the part of the spec worth keeping.