Workshop X account is live
Follow @FAGENWorkshop for deadline reminders, accepted-paper highlights, and program updates.
Reproducible Triggers, Trace Diagnostics, and Verified Fixes
The submission portal is now live. Submission deadline May 8 (AOE); notifications by May 15.
Reliability has been studied in ML for a long time, mostly through robustness benchmarks, adversarial evaluation, and red-teaming on chat-style language models. Foundation-model agents push the question somewhere harder. An agent run goes for hundreds of steps, each step depending on tool calls and memory writes from the steps before it. When the run breaks, it rarely breaks at the obvious moment. A bad assumption at step 3 quietly contaminates step 50, and by step 200 the agent has been wrong for a while without noticing. It might have spent its budget on the wrong subtask. It might be reading from memory it polluted itself. Or it might have landed on an answer at step 12 and spent the rest of the run defending it.
FAGEN is a place to take these failures seriously. The workshop is organized around four kinds of contributions. Definitions matter: what does "failure" actually mean here, beyond the loose way the term gets thrown around? Reproducible triggers matter at least as much. We want the smallest setup that breaks the agent the same way every time, so other groups can build on the case. Diagnostics should look at the trace itself, not just the final score, because final-score evaluation hides almost everything interesting. And the fixes worth presenting are the ones that admit what they cost in latency, in capability, or in how well they generalize.
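What a "reproducible trigger" means in practice can be made concrete with a toy sketch: a deterministic agent loop where a single bad memory write at an early step contaminates every step after it, so the divergence point is the same on every run. All names and the trace format below are hypothetical illustrations, not an API this workshop provides.

```python
# Toy reproducible trigger: a deterministic agent loop where one bad memory
# write (a silent unit flip) at an early step contaminates all later steps.

def run_agent(steps: int, corrupt_at: int = -1) -> list[str]:
    """Simulate an agent where every step reads from shared memory."""
    memory = {"unit": "meters"}  # the agent's ground-truth assumption
    trace = []
    for t in range(steps):
        if t == corrupt_at:
            memory["unit"] = "feet"  # the trigger: a quiet bad write
        # every subsequent step consumes the possibly-polluted memory
        trace.append(f"step {t}: computed length in {memory['unit']}")
    return trace

clean = run_agent(steps=6)
broken = run_agent(steps=6, corrupt_at=2)

# Because the setup is deterministic, the first divergence is always step 2,
# which is exactly what makes the failure case reusable by other groups.
first_divergence = next(
    t for t, (a, b) in enumerate(zip(clean, broken)) if a != b
)
print(first_divergence)  # 2
```

The point of the sketch is the property, not the scenario: a trigger is useful when the break is minimal, the divergence step is fixed, and another group can rerun it and see the same trace.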
Format
Submit on OpenReview
Topics of Interest
We welcome submissions on:
Reproducible triggers: operational definitions, triggering preconditions, minimal reproductions, composable failure primitives, and falsifiable mechanistic hypotheses.
Trace diagnostics: long-horizon evaluation protocols, interpretable process metrics, counterfactual tests, and logging tools that expose failures beyond terminal success.
Verified fixes: mitigations, recovery strategies, tool and memory interface improvements, reward and budget design, and repair mechanisms with verifiable trade-offs.
Well-documented negative results are in scope when the analysis is careful and the lesson transfers.
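To illustrate what a process metric "beyond terminal success" might look like, here is a hedged sketch: even when a run's final answer scores as correct, a trace-level check can expose an agent that settled on its answer early and spent the rest of its budget re-asserting it. The trace format and the metric name are illustrative assumptions, not a standard.

```python
# Sketch of an interpretable process metric over an agent trace.
# Assumes a trace is just a list of action strings; real traces would
# carry richer structure (tool calls, memory ops, timestamps).

from collections import Counter

def commitment_ratio(trace: list[str]) -> float:
    """Fraction of steps spent repeating the single most common action.

    A high value suggests the agent landed on an answer early and then
    defended it instead of exploring, a failure a final score can hide."""
    if not trace:
        return 0.0
    most_common_count = Counter(trace).most_common(1)[0][1]
    return most_common_count / len(trace)

# A run that ends "correct" but spent 8 of 10 steps repeating itself:
trace = ["search docs", "read file"] + ["assert answer A"] * 8
print(round(commitment_ratio(trace), 2))  # 0.8
```

A terminal-success metric would score this run 1.0; the process metric surfaces that most of the budget went to defending one answer, which is the kind of signal trace diagnostics are meant to recover.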
Submission deadline
May 8, 2026 (AOE)
Opening remarks
Keynote 1
TBD
Keynote 2
TBD
Contributed spotlights
Five to eight short talks from accepted contributions
Coffee break
Keynote 3
TBD
Keynote 4
TBD
Lunch and posters
Keynote 5
TBD
Keynote 6
TBD
Coffee break
Panel discussion
Panelists TBD
Closing remarks and awards
Reach out for submissions, sponsorship, speaker logistics, or collaboration.