Building qareen: My Experience with Multi-Agent Coding
A few weeks ago I shipped qareen, a framework for picking better few-shot examples. The algorithm is useful, but honestly? The interesting part was how I built it: by conducting a small orchestra of AI coding agents.
Imagine pair programming, except the pair consists of five different AIs with strong opinions about code style. (They also have a mysterious tendency to import libraries that don’t exist.)
The gist of qareen
When you prompt an LLM with examples (few-shot learning), the examples matter. A lot. Give it five variations of the same thing and it’ll parrot those patterns. Give it carefully selected, diverse examples and it generalizes better.
qareen picks examples that are relevant to a task but different enough from each other to actually teach something. It blends text and image signals—like making a mixtape where every song is both thematically appropriate and introduces something new.
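To make that concrete, here’s a minimal sketch of the underlying idea: greedy selection that trades relevance against redundancy. This is not qareen’s actual selection code; the function name and its inputs are illustrative.
def pick_examples(relevance, pairwise_sim, k=5, diversity=0.5):
    # relevance[i]: how well candidate i matches the task
    # pairwise_sim[i][j]: how similar candidates i and j are to each other
    chosen = []
    candidates = list(range(len(relevance)))
    while candidates and len(chosen) < k:
        # Reward relevance, penalize similarity to anything already picked
        best = max(
            candidates,
            key=lambda i: (1 - diversity) * relevance[i]
            - diversity * max((pairwise_sim[i][j] for j in chosen), default=0.0),
        )
        chosen.append(best)
        candidates.remove(best)
    return chosen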
flowchart LR
planner["📋 Planner<br/>break this into<br/>small tasks"]
builders["🔨 Builders<br/>Write the actual code<br/>(sometimes it works!)"]
evaluator["📊 Evaluator<br/>Runs the metrics<br/>Is this actually better?"]
critic["🔍 Critic<br/>Did you even run lint?"]
human["👤 Me<br/>Ship it! or nope"]
planner --> builders
planner --> evaluator
builders --> critic
evaluator --> critic
critic --> human
human -.->|"Actually, let's try something else..."| planner
style planner fill:#dbeafe,stroke:#3b82f6
style builders fill:#dcfce7,stroke:#22c55e
style evaluator fill:#ffedd5,stroke:#f97316
style critic fill:#fef9c3,stroke:#eab308
style human fill:#fce7f3,stroke:#ec4899
My new job: reviewer in chief
The biggest shift wasn’t technical. Rather, it was how I spent my time. Before agents, I wrote code. Now I mostly review it. I went from keyboard-forward developer to someone who spends more time saying “wait, why would you do it that way?” to a robot.
This sounds like a downgrade until you realize how wild the development speed is. What used to be “set up an experiment, go get coffee, come back, realize I made a typo” became “propose five experiments, agents run all of them, pick the winner by lunch.” For instance, the blending step those experiments converged on is only a few lines:
from qareen.sampler import rr_rank, normalize_scores

def rerank_candidates(text_scores, image_scores, alpha=0.6):
    text = normalize_scores(text_scores)
    image = normalize_scores(image_scores)
    # Blend signals. Alpha was tuned through more experiments
    # than I want to admit.
    return rr_rank(text, image, weight=alpha)
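Calling it is equally small. The exact score format is whatever normalize_scores expects; here I’m assuming plain per-candidate lists of floats, which is an assumption, not documented qareen API.
# Hypothetical scores for three candidate examples (illustrative values only)
text_scores = [0.82, 0.41, 0.77]
image_scores = [0.35, 0.90, 0.60]
ranking = rerank_candidates(text_scores, image_scores, alpha=0.6)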
Patterns that actually worked
I tried a lot of things. Most didn’t work. Here’s what survived:
flowchart TD
root["Patterns That Worked"]
root --> PlannerBuilders
root --> CriticPass
root --> ToolRouter
root --> ReplayBuffer
root --> UIFeedback
root --> HumanTaste
PlannerBuilders["📋→🔨 Planner to Builders<br/>One brain, many hands<br/>Small tasks = fewer mistakes"]
CriticPass["🔍 Critic Pass<br/>Linting before human eyes<br/>Catches the obvious stuff"]
ToolRouter["🔀 Tool Router<br/>Right tool for the job<br/>No more hammer for screws"]
ReplayBuffer["📼 Replay Buffer<br/>Log every agent action<br/>What changed? → check tape"]
UIFeedback["🎛️ UI Feedback Loop<br/>Sliders > staring at JSON<br/>See problems, not just numbers"]
HumanTaste["👤 Human Taste Check<br/>Metrics aren't everything<br/>Does this feel right?"]
style root fill:#f1f5f9,stroke:#64748b,stroke-width:2px,color:#0f172a
style PlannerBuilders fill:#f8fafc,stroke:#94a3b8,stroke-width:1px,color:#1e293b
style CriticPass fill:#f8fafc,stroke:#94a3b8,stroke-width:1px,color:#1e293b
style ToolRouter fill:#f8fafc,stroke:#94a3b8,stroke-width:1px,color:#1e293b
style ReplayBuffer fill:#f8fafc,stroke:#94a3b8,stroke-width:1px,color:#1e293b
style UIFeedback fill:#f8fafc,stroke:#94a3b8,stroke-width:1px,color:#1e293b
style HumanTaste fill:#f8fafc,stroke:#94a3b8,stroke-width:1px,color:#1e293b
Planner → Builders. One agent breaks issues into small, specific tasks. Others pick them up and execute. This sounds obvious, but getting the granularity right took some trial and error. Too big and the agents get confused. Too small and you’re managing a to-do list the length of a CVS receipt.
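As a sketch of what “small and specific” looked like in practice, here’s roughly the task shape I’d aim for. The names are hypothetical, not qareen internals; the point is that every task carries its own acceptance criteria.
from dataclasses import dataclass, field

@dataclass
class BuilderTask:
    title: str                  # one sentence, one concern
    files: list[str]            # where the change is expected to land
    acceptance: list[str] = field(default_factory=list)  # what the critic checks

# An illustrative task, not a real qareen issue
task = BuilderTask(
    title="Clamp blended scores to [0, 1] in rerank_candidates",
    files=["qareen/sampler.py"],
    acceptance=["pytest passes", "no new lint warnings"],
)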
Critic in the loop. Before code hits my screen, a critic agent runs linting and tests. It’s like having a very literal-minded coworker who catches the obvious stuff so I can focus on the subtle stuff.
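Mechanically, the critic pass doesn’t need to be much more than this sketch. It assumes ruff and pytest are the project’s linter and test runner; swap in whatever the repo actually uses.
import subprocess

def critic_pass(repo_dir):
    problems = []
    checks = [("lint", ["ruff", "check", "."]), ("tests", ["pytest", "-q"])]
    for name, cmd in checks:
        result = subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)
        if result.returncode != 0:
            problems.append(f"{name} failed:\n{result.stdout}{result.stderr}")
    return problems  # empty means it's ready for human eyes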
Keep logs of everything. Every agent run dumps its commands and outputs to a log. When something breaks—and it will—I can replay the sequence to figure out what changed. Think git blame, but for agent decisions.
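The logging itself is deliberately boring: roughly an append-only JSONL file, one record per agent action. This is an illustrative sketch, not qareen’s actual logger.
import json
import time

def log_action(log_path, agent, command, output):
    # One line per action, so the whole run can be replayed in order later
    record = {"ts": time.time(), "agent": agent, "command": command, "output": output}
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")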
Gradio for instant feedback. I wired up a quick UI with sliders for the ranking weights. Being able to see the reranker’s decisions made tuning dramatically faster than staring at JSON outputs.
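The UI wiring is about this small. It’s a sketch: it assumes the rerank_candidates function from earlier in the post is in scope, and uses a fixed pair of made-up score lists to play with.
import gradio as gr

text_scores = [0.82, 0.41, 0.77]    # illustrative values
image_scores = [0.35, 0.90, 0.60]

def preview(alpha):
    # Re-rank with the slider's weight and show the resulting order
    return str(rerank_candidates(text_scores, image_scores, alpha=alpha))

demo = gr.Interface(
    fn=preview,
    inputs=gr.Slider(0.0, 1.0, value=0.6, step=0.05, label="alpha (text vs. image)"),
    outputs="text",
)
demo.launch()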
What I’d do differently
A few lessons learned the hard way:
- Vague tasks = creative interpretations. When I said “improve the reranker,” one agent decided the way to do that was to rewrite the entire module in a different framework. Specific acceptance criteria are essential.
- Don’t trust imports. Agents will confidently import libraries that don’t exist, or that exist but do something completely different. The critic pass caught most of these, but not all (see the sketch after this list).
- Human taste still matters. Agents can optimize for metrics, but metrics don’t always capture “does this actually feel right?” I kept a human checkpoint before anything shipped.
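One cheap guard for that imports problem (a sketch, not something qareen ships): statically check that agent-written code’s top-level imports actually resolve before running it.
import ast
import importlib.util

def missing_imports(source):
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # find_spec returns None when the top-level package isn't installed
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return missing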
The takeaway
Multi-agent coding isn’t a toy or a demo. It’s a genuinely different way to work. Not faster in every way (debugging agent confusion takes time), but faster in enough ways that the overall velocity goes up. You trade writing code for reviewing it, and if you’re okay with that shift, it’s pretty great.
qareen is open source if you want to try the framework. And if you’re building something with a swarm of agents, I’d love to hear which patterns worked for you—and which ones went hilariously wrong.