
Experiments and their role

To systematically understand and improve how your system behaves, you need a way to isolate cause and effect. That's what experiments give you. You pick one variable, run your dataset through two versions of your system, and compare what comes out. The result tells you whether a change actually helped, and by how much. To understand by how much, you also need to evaluate your experiment output (see evaluate TODO). This section covers the systematic experimentation part that comes before evaluation.

The AI engineering loop

The anatomy of an experiment

Every experiment has four components.

  • Baseline. Your current production system — the control condition everything else gets measured against. Keep it fixed while you vary one thing.
  • Dataset. The inputs you run both conditions against. Keep the same dataset across experiments so results are comparable over time.
  • Variable. The single thing you're changing — model, prompt, context, tool access, or agent architecture. See On variables below.
  • Outputs to compare. What your system produces under each condition. Comparing these is the actual work of running an experiment.
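In code, the anatomy above can be sketched as a small config object. This is a minimal sketch; `Experiment` and its field names are illustrative, not part of any particular framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Experiment:
    """One experiment: a fixed baseline, a fixed dataset, one variable changed."""
    baseline: dict   # e.g. {"model": "model-a", "prompt": "v1"}
    candidate: dict  # should differ from baseline in exactly one key
    dataset: list    # inputs both conditions run against

    def changed_variables(self) -> list[str]:
        """List the keys where candidate differs from baseline."""
        return [k for k in self.baseline if self.baseline.get(k) != self.candidate.get(k)]

exp = Experiment(
    baseline={"model": "model-a", "prompt": "v1"},
    candidate={"model": "model-b", "prompt": "v1"},
    dataset=["input-1", "input-2"],
)
assert exp.changed_variables() == ["model"]  # exactly one variable changed
```

Checking `changed_variables()` before a run is a cheap guard against the two-things-at-once mistake described above.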

Change two things at once and you can't tell which caused the difference.

On variables

  • Model. The AI model you are using. Thinking models, cheap models, and fast models each come with different tradeoffs in result quality and cost.
  • Prompt. The most common lever. Before running a prompt experiment, ask: is the failure a specification problem (ambiguous or incomplete prompt) or a generalization problem (model applies clear instructions inconsistently)? The latter is worth measuring.
  • Context. What information you include in the prompt: retrieved documents, conversation history, user metadata.
  • Tool access. Adding or removing tools changes what paths your system can take.
  • Agent architecture. Single agent vs. multi-agent, which framework, how tasks are decomposed. The biggest bets, the hardest to isolate.

How is it used?

The core flow: pick a variable & form a hypothesis, run both conditions against your dataset, compare outputs, learn something & repeat.

Experimentation loop
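The flow above can be sketched as a simple loop. This is a minimal sketch, assuming a hypothetical `run_system(config, input)` callable that stands in for whatever invokes your system:

```python
def run_experiment(baseline_cfg, candidate_cfg, dataset, run_system):
    """Run the same inputs through both conditions and pair up the outputs.

    run_system(config, input) -> output is a placeholder for however
    your application is actually invoked.
    """
    results = []
    for item in dataset:
        results.append({
            "input": item,
            "baseline": run_system(baseline_cfg, item),
            "candidate": run_system(candidate_cfg, item),
        })
    return results

# Toy stand-in for a real system call:
fake_system = lambda cfg, item: f"{cfg['model']}:{item}"
rows = run_experiment({"model": "a"}, {"model": "b"}, ["x", "y"], fake_system)
assert rows[0] == {"input": "x", "baseline": "a:x", "candidate": "b:x"}
```

Because both conditions see identical inputs, any difference in the paired outputs is attributable to the one variable you changed.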

Typical questions include:

  • A new model is released: will it improve the performance of our system?
  • Does my prompt change improve the output quality of our system?
  • Is our new agent harness architecture producing better results than our multi-agent system?

Start qualitative: same input, both conditions, traces side by side. That is how you learn what "better" means for your app; without reading real outputs regularly, metrics are easy to misread.
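One low-tech way to do this is to format the two outputs next to each other for manual reading. A sketch; `side_by_side` is a hypothetical helper, not a library function:

```python
def side_by_side(row, width=40):
    """Format one input's baseline and candidate outputs as paired lines."""
    b = row["baseline"].splitlines() or [""]
    c = row["candidate"].splitlines() or [""]
    lines = [f"INPUT: {row['input']}"]
    for i in range(max(len(b), len(c))):
        left = b[i] if i < len(b) else ""
        right = c[i] if i < len(c) else ""
        lines.append(f"{left:<{width}}| {right}")
    return "\n".join(lines)

view = side_by_side({
    "input": "q1",
    "baseline": "short answer",
    "candidate": "longer answer\nwith detail",
})
print(view)
```

The point isn't the tooling; it's forcing yourself to read the same input under both conditions before any metric exists.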

Scores then make comparison concrete—win rates, whether wins are spread across inputs or concentrated, and cost or latency tradeoffs. Quality, price, and speed rarely move together; experiments show those pulls in your data instead of in the abstract.
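A win-rate summary can be as simple as counting per-input judgments. A minimal sketch; the `summarize` helper and its label strings are illustrative:

```python
from collections import Counter

def summarize(judgments):
    """judgments: one of "baseline", "candidate", or "tie" per input."""
    counts = Counter(judgments)
    decided = counts["baseline"] + counts["candidate"]
    win_rate = counts["candidate"] / decided if decided else 0.0
    return {"win_rate": win_rate, "ties": counts["tie"], "n": len(judgments)}

summary = summarize(["candidate", "candidate", "baseline", "tie"])
assert summary["n"] == 4 and summary["ties"] == 1
```

Keeping the per-input judgments around (rather than only the aggregate) is what lets you check whether wins are spread across inputs or concentrated in one slice of the dataset.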

Where to start

TODO: Vibe patch and error analyse before

Don't set up the full evaluation pipeline before running anything. A few traces read side by side will teach you more in the first hour than a week of infrastructure work.

  1. Get 20–30 real examples. Pull them from production traces. They don't need to cover everything, just a real slice of what your application handles.
  2. Change one thing and run both versions. Keep everything else identical.
  3. Read traces side by side. No evaluator needed yet. Just read. What's different? Which one is actually better and why? Pay attention to the type of failure — is the prompt unclear, or is the model applying clear instructions inconsistently? That distinction tells you what kind of fix to try next.
  4. Add an evaluator once you have intuition. After a few manual rounds you'll know what you're looking for. Encode it. Now you can scale.
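Encoding that intuition can start with a few programmatic checks. A sketch with made-up rules; your real checks come from what you noticed while reading traces:

```python
def evaluate(output: str) -> dict:
    """Illustrative checks only — replace with what 'better' means for your app."""
    return {
        "non_empty": bool(output.strip()),
        "cites_source": "[source]" in output,   # hypothetical house convention
        "under_limit": len(output) <= 500,
    }

def score(output: str) -> float:
    """Fraction of checks passed, so outputs become comparable at scale."""
    checks = evaluate(output)
    return sum(checks.values()) / len(checks)

assert score("Answer with [source] attached.") == 1.0
assert evaluate("")["non_empty"] is False
```

Once checks like these exist, the manual step 3 above scales: you score every row of an experiment instead of reading each one.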

What comes next

To see whether your experiment led to an improvement, you need to evaluate your results. Learn more about evaluation methods in the next section.

