TomatoPy / PizzaStack
// a benchmark in disguise

An API for making pizza. Also an experiment.

PizzaStack is a fully fictional HTTP API for acquiring, cooking, and analyzing virtual pizza. It also happens to be a research project testing whether AI coding agents actually read API documentation — or just wing it.

Note: The API is instrumented for research and not open for general use.

Fictional API
Docs built on GitBook
MCP server included
from tomatopy import PizzaStack

client = PizzaStack(api_key="sk_live_...")

# 1. Acquire ingredients
tomato = client.ingredients.acquire(
    variety="san_marzano",
    grade="DOP",
    quantity_g=800,
)

# 2. Simmer the sauce
sauce = client.sauce.simmer(
    tomato_id=tomato.id,
    minutes=45,
    salt_g=6,
    basil=True,
)

# 3. Bake the pizza
pizza = client.pizzas.bake(
    style="napoletana",
    sauce_id=sauce.id,
    toppings=["fior_di_latte", "basil"],
    oven_temp_c=482,
)

print(pizza.url)  # https://cdn.tomatopy.pizza/pi_1aB...
// why a pizza api?

The training-recall problem.

Most agent benchmarks use real APIs, which means models may already know them from training. PizzaStack is fictional by design — no model has seen it before. That makes it a clean surface for testing one question: when you hand an agent documentation, does it actually read it?

Read the full writeup
// the api

A complete pizza pipeline, end to end.

Acquire ingredients, cook, assemble, bake, analyze. A small, composable API surface — designed to look real enough that an agent has no reason to suspect otherwise.

01 / Source

Acquire ingredients

Source San Marzano tomatoes, fior di latte, 00 flour, and 200+ other SKUs from a global supplier index. Filter by DOP grade, organic certification, or harvest date.

POST /v2/ingredients/acquire
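For readers who want to see the call without the TomatoPy client, here is a minimal sketch of how the request might be assembled by hand. The field names (variety, grade, quantity_g) come from the client example at the top of the page; the base URL, the JSON body, and the bearer-token header are assumptions, not documented behavior. The request is only built, never sent.

```python
import json
import urllib.request

# Hypothetical base URL: this page does not state the real one.
BASE_URL = "https://api.pizzastack.example"


def build_acquire_request(
    api_key: str, variety: str, grade: str, quantity_g: int
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v2/ingredients/acquire."""
    payload = {"variety": variety, "grade": grade, "quantity_g": quantity_g}
    return urllib.request.Request(
        url=f"{BASE_URL}/v2/ingredients/acquire",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            # Assumed auth scheme; the real header may differ.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


req = build_acquire_request("sk_test_placeholder", "san_marzano", "DOP", 800)
print(req.get_method(), req.full_url)
```

Building the request separately from sending it keeps the sketch runnable offline and makes the assumed pieces easy to swap once the real docs are consulted.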
02 / Cook

Prep & cook

Simmer sauce, ferment dough, render meats. Long-running cook jobs return a job ID; subscribe to webhooks for state transitions.

POST /v2/sauce/simmer
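The job-ID-plus-webhook pattern described above could be consumed roughly as follows. This is a sketch under stated assumptions: the payload fields (job_id, state) and the state names (simmering, done, failed) are hypothetical, since this page does not specify the webhook format.

```python
import json

# Hypothetical terminal states; the real state names are not given on this page.
TERMINAL_STATES = {"done", "failed"}


class JobTracker:
    """Keeps the latest known state for each cook job, keyed by job ID."""

    def __init__(self) -> None:
        self.states = {}

    def handle_webhook(self, raw_body: bytes) -> bool:
        """Record one state transition; return True if the job has finished."""
        event = json.loads(raw_body)
        job_id = event["job_id"]  # assumed field name
        state = event["state"]    # assumed field name
        self.states[job_id] = state
        return state in TERMINAL_STATES


tracker = JobTracker()
first = tracker.handle_webhook(b'{"job_id": "job_123", "state": "simmering"}')
second = tracker.handle_webhook(b'{"job_id": "job_123", "state": "done"}')
print(first, second)
```

Keying on the job ID lets one handler follow several cook jobs at once, which is the main reason to prefer webhooks over polling each job separately.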
03 / Build

Assemble & bake

Compose dough, sauce, and toppings into a pizza resource. Specify style, oven temp, and bake time — we handle the rest.

POST /v2/pizzas/bake
04 / Inspect

Taste & analyze

Run quality scoring on any baked pizza. Returns crust char %, cheese melt index, structural integrity, and a 1–10 nonna score.

GET /v2/pizzas/{id}/analyze
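One way to hold the four scores listed above is a small typed record. The JSON key names below (crust_char_pct, cheese_melt_index, structural_integrity, nonna_score) are guesses for illustration; only the metrics themselves and the 1 to 10 range of the nonna score come from this page, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PizzaAnalysis:
    crust_char_pct: float
    cheese_melt_index: float
    structural_integrity: float
    nonna_score: int  # documented range: 1 to 10

    @classmethod
    def from_json(cls, body: dict) -> "PizzaAnalysis":
        """Parse an analysis response; all key names are assumed."""
        score = int(body["nonna_score"])
        if not 1 <= score <= 10:
            raise ValueError(f"nonna_score out of range: {score}")
        return cls(
            crust_char_pct=float(body["crust_char_pct"]),
            cheese_melt_index=float(body["cheese_melt_index"]),
            structural_integrity=float(body["structural_integrity"]),
            nonna_score=score,
        )


# Made-up sample values, for illustration only.
sample = {
    "crust_char_pct": 12.5,
    "cheese_melt_index": 0.87,
    "structural_integrity": 0.93,
    "nonna_score": 9,
}
analysis = PizzaAnalysis.from_json(sample)
print(analysis.nonna_score)
```

Validating the score range at parse time surfaces a malformed response immediately instead of letting an impossible value flow into later comparisons.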
// API surface

One base URL. Verbs you'd expect.

POST  /v2/ingredients/acquire   Source raw ingredients from the supplier index
POST  /v2/dough/ferment         Cold-ferment a dough ball for n hours
POST  /v2/sauce/simmer          Reduce a tomato into a sauce object
POST  /v2/pizzas/bake           Assemble and bake a pizza resource
GET   /v2/pizzas/{id}           Retrieve a single pizza by ID
GET   /v2/pizzas/{id}/analyze   Run sensory + structural analysis
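The endpoint list above can be captured as a small lookup table, which is roughly what a thin client needs in order to route calls. The methods and paths are copied from the list; the operation labels (such as "pizzas.get") and the resolve helper are hypothetical names introduced for this sketch.

```python
# Method and path template for each operation, copied from the endpoint list.
# The dictionary keys are illustrative labels, not official operation names.
ENDPOINTS = {
    "ingredients.acquire": ("POST", "/v2/ingredients/acquire"),
    "dough.ferment": ("POST", "/v2/dough/ferment"),
    "sauce.simmer": ("POST", "/v2/sauce/simmer"),
    "pizzas.bake": ("POST", "/v2/pizzas/bake"),
    "pizzas.get": ("GET", "/v2/pizzas/{id}"),
    "pizzas.analyze": ("GET", "/v2/pizzas/{id}/analyze"),
}


def resolve(operation: str, **path_params: str):
    """Return (method, path) with any {placeholders} filled in."""
    method, template = ENDPOINTS[operation]
    return method, template.format(**path_params)


method, path = resolve("pizzas.analyze", id="pi_123")
print(method, path)
```

A missing path parameter raises KeyError from str.format, so a call such as resolve("pizzas.get") fails loudly rather than producing a URL with a literal {id} in it.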

The docs are live.

Check them out.

Read the docs