Ask HN: How do you debug multi-step AI workflows when the output is wrong?

I’ve been building multi-step AI workflows with multiple agents (planning, reasoning, tool use, etc.), and I sometimes run into cases where the final output is incorrect even though nothing technically fails. There are no runtime errors - just wrong results.

The main challenge is figuring out where things went wrong. The issue could be an early reasoning step, the way context is passed between steps, or a subtle mistake that propagates through the system. By the time I see the final output, it’s not obvious which step caused the problem.
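For concreteness, the workflows look roughly like the sketch below (placeholder step functions, not real code). One obvious idea would be to attach a cheap sanity check to each step, so a bad intermediate result is flagged where it first appears instead of only showing up as a wrong final answer:

    # Rough sketch, placeholders only: each step gets a cheap sanity check so a
    # suspicious intermediate result fails loudly at the step that produced it.

    def plan_step(ctx):      # stand-in for a planning agent
        return {"subtasks": ["look up X", "summarize findings"]}

    def answer_step(ctx):    # stand-in for an answering agent
        return {"text": "final answer built from " + str(ctx["plan"]["subtasks"])}

    STEPS = [
        ("plan",   plan_step,   lambda out: bool(out.get("subtasks"))),
        ("answer", answer_step, lambda out: len(out.get("text", "")) > 20),
    ]

    def run_pipeline(task):
        ctx = {"task": task}
        for name, fn, check in STEPS:
            out = fn(ctx)
            if not check(out):   # fail at the offending step, not at the end
                raise ValueError(f"step {name!r} produced suspicious output: {out!r}")
            ctx[name] = out      # later steps read earlier results from ctx
        return ctx["answer"]["text"]

    print(run_pipeline("why is the sky blue?"))

Writing good checks is of course its own problem, since it means knowing what "wrong" looks like at each step.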

I’ve been using Langfuse for tracing, which helps capture each step’s inputs and outputs, but in practice I still end up manually inspecting the steps one by one to diagnose issues, and that gets tiring quickly.
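Something like a crude scripted first pass over the exported trace might cut down the clicking, i.e. only look closely at the steps a heuristic flags. The trace schema below is made up; a real export from Langfuse (or any other tool) will look different:

    # First-pass triage over an exported trace instead of eyeballing every step.
    # The records below are a made-up stand-in for whatever your tracing tool exports.

    trace = [
        {"name": "plan",     "input": "why is the sky blue?", "output": "1. look up scattering 2. summarize"},
        {"name": "retrieve", "input": "look up scattering",   "output": ""},
        {"name": "answer",   "input": "scattering notes",     "output": "The sky is blue because of water."},
    ]

    def suspicious(step):
        out = (step.get("output") or "").strip()
        if not out:
            return "empty output"
        if len(out) < 0.1 * len(step.get("input", "")):
            return "output much shorter than its input"
        return None

    for i, step in enumerate(trace):
        reason = suspicious(step)
        if reason:
            print(f"step {i} ({step['name']}): {reason}")   # only flagged steps get a second look

Heuristics like this only catch shallow failures, though; they wouldn’t notice that the "answer" step above is factually wrong.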

I’m curious how others are approaching this. Are there better ways to structure or instrument these workflows to make failures easier to localize? Any patterns, tools, or techniques that have worked well for you?

3 points | by terryjiang2020 13 hours ago

3 comments

  • tucaz 4 hours ago
    Keep doing what you’re doing, but dump the tracing output into an LLM agent (cowork, code, opencode, etc.) and ask it to take a first pass. It’ll at least narrow things down for you. Use a smart model and it should be helpful. (Rough sketch of this below the thread.)
  • BlueHotDog2 12 hours ago
    Just releasing something in this direction: a git-like tool for agents.
  • newzino 8 hours ago
    [dead]
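
A rough sketch of what tucaz describes above, assuming an OpenAI-compatible client; the model name, trace file, and prompt are placeholders rather than anyone’s actual setup:

    # Hand the whole trace to a strong model and ask for a first-pass guess at
    # which step went wrong. File name, model, and prompt are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    with open("trace.json") as f:   # export from your tracing tool
        trace = json.load(f)

    prompt = (
        "Below is a trace of a multi-step agent run that produced a wrong final answer. "
        "Identify the earliest step whose output looks incorrect or inconsistent with its "
        "input, and explain why.\n\n" + json.dumps(trace, indent=2)
    )

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any strong model should do
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp.choices[0].message.content)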