Patch notes | Execudo

why-multi-agent-systems-fail.pdf

PAPER / arXiv 2503.13657 / 2025 / multi-agent systems

Why Do Multi-Agent LLM Systems Fail?/

The first serious autopsy of multi-agent systems.

Berkeley researchers dissected over 200 failed multi-agent runs and built a taxonomy of fourteen failure modes. The headline: most failures come from specification, inter-agent coordination and verification (fuzzy roles, lost context, weak checks), not from the models themselves. It matches every post-mortem in our drawer: fix the pipes and the contracts before blaming the model.

external / arxiv.org / 2025

orchestration-traces.pdf

PAPER / arXiv 2605.02801 / 2026 / orchestration

RL for Multi-Agent Systems through Orchestration Traces/

Orchestration, finally given a grammar.

Reframes orchestration as five learnable decisions — when to spawn, whom to delegate to, how to communicate, how to aggregate, when to stop — and trains routers on execution traces from real industrial systems. Why it matters: that is the exact decision loop our gateways run all day, and learning it from traces is the first credible path to making it self-improving.

external / arxiv.org / 2026

dynamic-runtime-graphs.pdf

PAPER / arXiv 2603.22386 / 2026 / workflows

From Static Templates to Dynamic Runtime Graphs/

Static pipelines are quietly losing.

A survey of workflow optimization for LLM agents that draws a clean line between graphs fixed at design time and graphs built at runtime, task by task — with the evidence tilting toward the latter on hard, variable workloads. Important because every hardcoded pipeline is a bet that the task never changes. It always changes.

external / arxiv.org / 2026

on-time-within-budget.pdf

PAPER / arXiv 2605.06110 / 2026 / budgets & planning

On Time, Within Budget/

A goal is not enough. Give the agent a bill.

Monte Carlo Portfolio Planning assigns models to subtasks by simulating executions against a deadline and a hard budget, maximizing the odds that the workflow lands on time and on cost. We cap every production job in euros and minutes; this is the theory of what we do by hand — and a preview of it becoming automatic.
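A rough sketch of the idea, not the paper's algorithm: simulate each candidate assignment of models to subtasks many times, and keep the one most likely to finish inside both limits. The `ModelProfile` fields, the exponential latency model, the retry cap and the brute-force search over assignments are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ModelProfile:
    name: str
    mean_minutes: float   # assumed average latency per attempt
    cost_eur: float       # assumed flat cost per attempt
    success_rate: float   # assumed chance that one attempt succeeds


def simulate_once(plan, rng, max_attempts=3):
    """Simulate one execution of the plan; return (minutes, euros, finished)."""
    minutes = euros = 0.0
    for model in plan:                      # one model per subtask, run in sequence
        for _ in range(max_attempts):
            minutes += rng.expovariate(1.0 / model.mean_minutes)
            euros += model.cost_eur
            if rng.random() < model.success_rate:
                break                       # subtask done, move to the next one
        else:
            return minutes, euros, False    # retries exhausted, workflow failed
    return minutes, euros, True


def on_time_within_budget(plan, deadline_min, budget_eur, runs=2000, seed=0):
    """Estimate P(finished and on time and on cost) by repeated simulation."""
    rng = random.Random(seed)               # fixed seed keeps comparisons repeatable
    hits = 0
    for _ in range(runs):
        minutes, euros, finished = simulate_once(plan, rng)
        if finished and minutes <= deadline_min and euros <= budget_eur:
            hits += 1
    return hits / runs


def best_plan(models, n_subtasks, deadline_min, budget_eur):
    """Brute-force every assignment; only viable for a handful of subtasks."""
    return max(
        product(models, repeat=n_subtasks),
        key=lambda plan: on_time_within_budget(plan, deadline_min, budget_eur),
    )
```

The exhaustive search grows exponentially with the number of subtasks, so anything beyond a toy workflow would need a smarter search over assignments.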

external / arxiv.org / 2026

compiling-workflows.pdf

PAPER / arXiv 2605.22502 / 2026 / efficiency

Compiling Agentic Workflows into LLM Weights/

The workflow is the model now.

Instead of orchestrating many steps around a frontier model, the authors fine-tune the whole workflow into a small model's weights — near-frontier quality at roughly a hundredth of the inference cost. For anyone who pays a per-task bill, this is the most important cost curve of the year.

external / arxiv.org / 2026

storage-to-experience.pdf

PAPER / arXiv 2605.06716 / 2026 / memory

From Storage to Experience/

Memory is becoming the moat.

Maps agent memory as an evolution in three stages — storage, then reflection, then experience — and shows what each stage unlocks, from recall to genuine learning on the job. Most production agents, ours included, live in stage two. Stage three, where a system learns from its own history, is where the compounding starts.

external / arxiv.org / 2026

end-of-transformers.pdf

PAPER / arXiv 2510.05364 / 2025 / architectures

The End of Transformers?/

What actually threatens attention.

A sober audit of the sub-quadratic challengers — state-space models, modern recurrence, hybrid stacks — and where each genuinely beats attention on long inputs. The verdict: hybrids, not revolutions. It matters for one practical reason: context length drives the infra bill, and the next architecture decides it.

external / arxiv.org / 2025

qwen3-coder-next.pdf

PAPER / arXiv 2603.00729 / 2026 / new models

Qwen3-Coder-Next Technical Report/

Big model, small bill.

An 80-billion-parameter coding model that activates only three billion parameters per call, trained on synthetic tasks with environment feedback. Sparse-when-it-runs is what production budgets have been waiting for: frontier-shaped skills at commodity cost, with open weights.

external / arxiv.org / 2026

evals-are-the-new-unit-tests.md

v5.2 / Research / 2026

Evals are the new unit tests/

You would not ship code without tests. Stop shipping prompts without evals.

Every system we run carries an eval suite next to its code: a few hundred real cases, scored on every deploy. When Barrique changes its refund prompt, 412 historical tickets replay before the change reaches a single customer. If the suite fails, the deploy stops. Same discipline as unit tests, same place in the pipeline.

The hard part is not the harness, it is the cases. We collect them from production: every human correction becomes a test. After a year, the suite knows the job better than any prompt author. That is the quiet compounding nobody talks about.
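A minimal sketch of such a gate, assuming the system under test can be called as a plain function from ticket text to decision and that a case passes on exact match. The names `EvalCase`, `case_from_correction` and `deploy_gate` are hypothetical, not the harness described above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    ticket_id: str
    input_text: str
    expected: str   # the outcome a human confirmed or corrected to


def case_from_correction(ticket_id: str, input_text: str, human_answer: str) -> EvalCase:
    """Turn a human correction from production into a permanent test case."""
    return EvalCase(ticket_id, input_text, human_answer)


def run_suite(system: Callable[[str], str], cases: list) -> list:
    """Replay every case and return the ids of the tickets that failed."""
    return [c.ticket_id for c in cases if system(c.input_text) != c.expected]


def deploy_gate(system: Callable[[str], str], cases: list, min_pass_rate: float = 1.0):
    """Return (allowed, failures); the deploy proceeds only if allowed is True."""
    failures = run_suite(system, cases)
    pass_rate = (1.0 - len(failures) / len(cases)) if cases else 1.0
    return pass_rate >= min_pass_rate, failures
```

Exact-match scoring is the simplest possible grader; free-text answers would need a rubric or a scoring function in place of the `!=` comparison.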

markdown / 2.1 KB / Last edit 2026

most-agent-failures-are-queue-failures.md

v5.1 / Note / 2026

Most agent failures are queue failures/

When an agent breaks in production, look at the pipes before the prompt.

Post-mortems across our 23 systems tell one story: the model was rarely the problem. Retries without backoff, poison messages that block a lane, timeouts that fire mid-action, two workers grabbing the same job. Boring distributed-systems failures, wearing an AI costume.

So we treat agents like any other worker on a queue: idempotent actions, dead-letter lanes, budgets per job, replay from log. The intelligence is rented. The plumbing is ours, and the plumbing is what fails.
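The list above fits in one small worker loop. The sketch below is an in-memory illustration under stated assumptions: the `Job` and `Worker` names, the fixed cost per attempt and the exponential backoff schedule are invented for the example, and the backoff delay is only recorded in the log, not actually slept.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    payload: str
    attempts: int = 0
    spent_eur: float = 0.0


class Worker:
    def __init__(self, handler, max_attempts=3, budget_eur=1.0,
                 cost_per_attempt_eur=0.1, base_delay_s=1.0):
        self.handler = handler                    # hypothetical callable; raises on failure
        self.max_attempts = max_attempts
        self.budget_eur = budget_eur              # hard cap per job
        self.cost_per_attempt_eur = cost_per_attempt_eur
        self.base_delay_s = base_delay_s
        self.done = set()                         # completed ids: the idempotency record
        self.dead_letter = []                     # poison jobs parked off the main lane
        self.log = []                             # append-only trail, usable for replay

    def backoff(self, attempts):
        """Exponential backoff: 1x, 2x, 4x the base delay."""
        return self.base_delay_s * 2 ** (attempts - 1)

    def process(self, queue):
        while queue:
            job = queue.popleft()
            if job.job_id in self.done:           # same job delivered twice: do nothing
                self.log.append((job.job_id, "duplicate-skipped"))
                continue
            over_budget = job.spent_eur + self.cost_per_attempt_eur > self.budget_eur
            if job.attempts >= self.max_attempts or over_budget:
                self.dead_letter.append(job)      # stop retrying, keep the lane moving
                self.log.append((job.job_id, "dead-lettered"))
                continue
            job.attempts += 1
            job.spent_eur += self.cost_per_attempt_eur
            try:
                self.handler(job.payload)
            except Exception:
                delay = self.backoff(job.attempts)
                self.log.append((job.job_id, f"retry-after-{delay}s"))
                queue.append(job)                 # back of the lane, behind healthy jobs
                continue
            self.done.add(job.job_id)
            self.log.append((job.job_id, "done"))
```

A real deployment would keep `done` and `log` in durable storage and rely on the broker for delayed redelivery; the in-memory versions here only show the control flow.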

markdown / 1.8 KB / Last edit 2026

making-human-in-the-loop-humane.md

v4.8 / Field manual / 2025

The approval budget/

A human who approves four hundred things a day approves nothing.

Human-in-the-loop fails by volume. If the system escalates everything, the human becomes a rubber stamp with a sore wrist. So every Execudo system has an approval budget: a hard daily cap on what it may ask a person to review. Vasseur gets forty. Albane gets nine.

The budget forces the system to spend escalations like money: only the genuinely ambiguous cases reach a human, with full context attached and a one-tap decision. Everything else must be safe enough to act, log and reverse. Scarcity is what keeps the human's judgement sharp.
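As a sketch, the budget is a counter that resets each day and a router that spends from it. The ambiguity score, the 0.8 threshold and the "hold" outcome for actions that are neither escalated nor reversible are assumptions added for the example; the text above only fixes the daily cap.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ApprovalBudget:
    daily_cap: int                  # e.g. 40 for one system, 9 for another
    used: int = 0
    day: Optional[date] = None

    def try_spend(self, today: date) -> bool:
        """Spend one escalation if any are left today; reset on a new day."""
        if today != self.day:
            self.day, self.used = today, 0
        if self.used >= self.daily_cap:
            return False
        self.used += 1
        return True


def route(ambiguity: float, reversible: bool, budget: ApprovalBudget,
          today: date, threshold: float = 0.8) -> str:
    """Decide what happens to one case: escalate, act-and-log, or hold."""
    if ambiguity >= threshold and budget.try_spend(today):
        return "escalate"           # genuinely ambiguous, and budget remains
    if reversible:
        return "act-and-log"        # safe enough to act, log and reverse
    return "hold"                   # assumed fallback: do nothing irreversible
```

Checking the threshold before spending means clear-cut cases never touch the budget, which is the point of the cap.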

markdown / 1.9 KB / Last edit 2025

five-years-of-self-hosting-the-gateway.md

v4.6 / Talk / 2025

Five years of self-hosting the gateway/

One door for every model call. We own the door.

Since 2021, every model call in every system passes through one gateway we host: routing, caching, budgets, logging, evals, kill switches. Providers changed, models changed, prices changed. The door stayed. Swapping a model behind it is a config line, not a migration.
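A toy version of the door, to show why a model swap is one line. Everything here is hypothetical: the `Gateway` class, the task-to-model `routes` table and the backends as plain callables stand in for real provider clients, and budgets and evals are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    backends: dict                  # model name -> callable(prompt) -> answer
    routes: dict                    # task name -> model name (the "config line")
    cache: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    killed: set = field(default_factory=set)

    def call(self, task: str, prompt: str) -> str:
        if task in self.killed:     # kill switch: refuse before any model is touched
            raise RuntimeError(f"task {task!r} is disabled")
        model = self.routes[task]
        key = (model, prompt)
        cached = key in self.cache
        if not cached:
            self.cache[key] = self.backends[model](prompt)
        # Every call leaves a receipt, cached or not.
        self.log.append({"task": task, "model": model,
                         "prompt": prompt, "cached": cached})
        return self.cache[key]
```

Callers only ever name a task, so changing `routes["refunds"]` from one model to another reroutes every caller without touching their code.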

The unglamorous payoff is the log. Five years of every prompt, every answer, every cost, every failure, in one place we can query. When a client asks why the system decided something in March 2023, we answer with the receipt. That is what owning the stack buys: memory.

markdown / 2.0 KB / Last edit 2025