Why AI Agents Cannot Change Software Systems

(phroneses.com)

43 points | by jhevans 1 hour ago

22 comments

  • fzeindl 1 hour ago
    I was originally sceptical of LLMs and am far from the „agents will magically fix our future“-crowd, but sentences like these trip me up:

    > „But pattern‑matching is not system understanding, and plausibility is not correctness.“

    Why not? Who says that? Who proved that system understanding is not just more complex pattern matching?

    > „LLMs predict tokens, not consequences“

    Same here. LLMs output tokens but who says that they don’t form some internal group of token-predicting tensors that move together and constitute the internal model of a „consequence“? It is like saying humans don’t have thoughts, they just have electrical impulses moving their tongues.

    I too think that LLMs seem to be a very specific form of intelligence, maybe resembling the parts of our brain that do language-processing, but it is a fact that they at least fake intelligence very convincingly. And that we actually don’t know how they do it.

    • pjm331 1 hour ago
      > > "But pattern‑matching is not system understanding, and plausibility is not correctness."

      > Why not? Who says that? Who proved that system understanding is not just more complex pattern matching?

      I'm not in the camp of "system understanding is just more complex pattern matching"

      but I am absolutely in the camp of "there are many tasks where pattern matching is just as effective as actual understanding"

      • fzeindl 1 hour ago
        > but I am absolutely in the camp of "there are many tasks where pattern matching is just as effective as actual understanding"

        What if „being effective at something with pattern matching but not understanding it“ just means that you have identified only 90% of patterns and keep failing to learn the rest for whatever reason.

        • soco 24 minutes ago
          Aren't we humans functioning in the same way, failing for 10% (take a random number) whatever we learn because we can forget, or be tired, or distracted? And what is the practical effect of "actual understanding" other than actually getting the 90% right (or more, or less, whatever)? I cannot tell what's inside my neighbor's head so for all practical matters they could be an AI, so why should I care whether the AI has a real understanding (good luck proving that) or not? I only care whether they take away enough jobs (mine included) that I cannot live a peaceful life anymore because it sends me foraging for roots or I must defend my roots parcel against hungry foragers. And for AI to achieve that it definitely doesn't need "actual understanding" just following some less or better formulated goals and having the right tools under their "hands".

          What I want to say is, yeah fascinating topic about real understanding, but I think we have more pressing issues.

    • justincormack 1 hour ago
      The whole post is written by AI anyway, so it's not worth engaging with.
    • amelius 1 hour ago
      > Why not? Who says that? Who proved that system understanding is not just more complex pattern matching?

      I think the naysayers already decided that the burden of proof is on the other side.

      • pjc50 1 hour ago
        That is the traditional "null hypothesis", yes.
        • ertgbnm 31 minutes ago
          The null hypothesis isn't just the opposite of whatever your opposition believes.

          For LLMs the null hypothesis would be that there is no relationship between the input and output tokens. Something that is so obviously not true that it's not even worth calculating the number of sigmas away from the null hypothesis that LLMs are.

          So clearly we discarded the null hypothesis sometime in 2017. Now we have a system that is really really good at pattern matching and seems to understand consequences. Is that "seeming" just a ruse or does it really understand stuff? A proper scientist would look at that evidence and put forward the hypothesis that maybe it really does understand stuff and begin working on experiments that would disprove that alternative hypothesis, moving forward with the assumption that the hypothesis is true until disproven or a better hypothesis is proposed that explains previous evidence more accurately. Naysayers saying "you haven't proven that pattern matching becomes understanding to my satisfaction" is not a rebuttal. They need an alternative hypothesis that can make predictions that better fit the model and can be tested.

          The only rebuttals I've heard are "AI can't actually understand stuff and therefore can't do X" which is a testable hypothesis at least. But invariably AI eventually does X, just in a different way than anyone really expected.

    • locknitpicker 1 hour ago
      > Why not? Who says that? Who proved that system understanding is not just more complex pattern matching?

      Yes indeed. That's a perplexing statement considering that a central concept of software engineering is architecture patterns.

      • thesz 1 hour ago
        > central concept of software engineering is architecture patterns.
        
        Both RUP and PSP/TSP do stand on the ground of defect prevention. All sorts of defects, from incorrect sets of requirements to memory corruption.

        Architecture patterns can be of help in that regard and they also can be very error-prone, as right now I am in the process of removing a bug introduced through misunderstanding of one rather old singleton.

  • baq 1 hour ago
    > LLMs generate statistically plausible continuations of text. This works well for self-contained tasks like writing a function or drafting documentation because these are pattern‑extension problems. But pattern‑matching is not system understanding, and plausibility is not correctness.

    closes tab

    • Wowfunhappy 1 hour ago
      Yeah, this is basically the same "LLMs are just next-word predictors", right?

      It's obviously true... and yet when the next word is the completion of a chat template, suddenly they can talk to you. I don't know how far that will ultimately go, but "they're fundamentally just X" isn't providing useful information anymore.

    • Bnjoroge 1 hour ago
      Yup. Just shows me that they are either oblivious about how much better they’ve gotten or simply unaware. Either way, hard to take their opinions seriously
  • passive 1 hour ago
    I think this does a very good job of describing the real gaps agents are hitting in practical usage, along with a fairly compelling rationale for why those gaps aren't likely to disappear any time soon.

    If we're going to stabilize the software industry, we need to have more discussions like this that identify what constraints apply. (We should have had those discussions before pushing AI out this widely, but that wouldn't have gotten anyone rich.)

    I actually think that there's a world of software systems agents can change, but it's materially different from the one we have now, and has a different set of constraints that we've also mostly done a poor job identifying. So hopefully the discussion can help those of us on both sides. ;)

  • liampulles 44 minutes ago
    Developing software is as much about the journey as the destination. I build a lot of my understanding of the actual problem in the pursuit of solving it.

    There are many times when writing a feature that my spidey senses flare up and tell me that this thing is a lot more painful to code than I was expecting (and will be painful to maintain) and that a more elegant process may actually solve the problem, at which point I'll draw up an alternative option and talk to the product owner.

    I've definitely started to see the consequences of the converse, which is large amounts of shite brittle code that solved the original spec narrowly, but is now an elephant on our back when we need to add other concerns to the system that cross over.

    (BTW, this isn't against the use of coding agents entirely, it's more against high-level agentic usage. I tend to use Claude Code to do little well-defined tasks whilst I reflect on it.)

  • DanielHB 1 hour ago
    One thing I realized is just how much the harnesses are geared towards _not_ parsing files and taking shortcuts. And even then I am very unimpressed by the speed at which these systems output code, and the amount of tokens you consume doing fairly basic stuff is quite high.

    My gut feeling is that it will take at least a couple of orders of magnitude improvements before these LLMs can even hold large systems fully in their context, much less understand them holistically. And I don't see an order of magnitude improvement coming any time soon, it feels the last one was GPT 3.5.

  • adamtaylor_13 1 hour ago
    This is a lot of words to confirm what we already know: we have exosuits, not robots.

    Use them as capability enhancers, not drones who go do all the things without review.

  • danielpardo 59 minutes ago
    I used Claude Code to migrate from Electron to Node + React across ~6k LOC. It handled the mechanical parts well but anything that has to do with creativity or field of interest required human judgment.

    AI has no judgement or critical thinking even if it seems so, so we have to be wary to not let AI do this bc it will be poor quality and 0 innovative

  • jgbuddy 1 hour ago
    How to get to HN front page: 1) AI generate an article about why AI sucks 2) Profit
  • dvh 1 hour ago
    If you spend $x amount of tokens to "produce a PR ready diff", how much of the $x are you willing to give upstream for incorporating your diff and maintaining it in the future? So far AI folks seem to expect it to be $0. That's my only issue so far.
  • lubujackson 1 hour ago
    The thing is AI can maintain systems. The key point is that it can't do this without human intent, but human intent can be encoded into skills and tied together with orchestration.

    Rough example: have an LLM generate a plan. Have a skill that refines the plan considering security risks, another that ensures codebase structures are followed, another that considers the infrastructure and usage demands, etc. Then write code and tests. Another process to validate the tests, validate all the above, simplify the logic, etc.
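
    The chain of skills described above can be sketched as a list of plan-refining functions applied in order. This is only an illustration under stated assumptions: `Plan`, `security_review`, `structure_review` and `run_pipeline` are hypothetical names, and each skill is a plain stub standing in for what would be an LLM call with its own prompt.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Plan:
    steps: List[str]
    notes: List[str] = field(default_factory=list)


# A "skill" is any callable that takes a plan and returns a refined plan.
Skill = Callable[[Plan], Plan]


def security_review(plan: Plan) -> Plan:
    # Hypothetical stub: a real skill would prompt an LLM with security guidance.
    return Plan(plan.steps + ["validate all external input"],
                plan.notes + ["security: reviewed"])


def structure_review(plan: Plan) -> Plan:
    # Hypothetical stub: a real skill would check the plan against codebase conventions.
    return Plan(list(plan.steps), plan.notes + ["structure: reviewed"])


def run_pipeline(plan: Plan, skills: List[Skill]) -> Plan:
    # Apply each skill in order; every stage sees the output of the previous one.
    for skill in skills:
        plan = skill(plan)
    return plan


draft = Plan(steps=["add endpoint", "write tests"])
final = run_pipeline(draft, [security_review, structure_review])
```

    Keeping each skill as a separate function means the human intent behind one filter can change without touching the others.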

    The key is that an LLM can do every task capably, even in a complex system. We simply have not built reasonable orchestration of all the human intent behind each filter, and many of them are constantly in flux. It may be that some elements resist encoding because the complexity of encoding is not worth the hassle to maintain.

    For better or worse, managing intent, orchestrating narrow agentic tasks and solidifying patterns into deterministic code (i.e. validation/tests) is going to be the focus of engineers going forward.

  • jvanderbot 1 hour ago
    TFA falls into a few traps, like a reductio argument about text prediction. There's no reason text prediction can't do these things, fundamentally.

    But I pretty much agree with what they are saying. The missing "thing" is the developer context. Each agent I kick off needs a nonlinearly increasing amount of coaching, as a function of feature complexity. The sweet spot for productivity is currently the first 3 steps (from TFA), to get things into _my head_, then using the writing abilities more as ubersed or ubergrep with LSP integration. Love it for that.

    For example, I'll often write the first fifth to a third of a feature by hand, then ask the agent to extrapolate from there. The "Core" contains the important bits but in a large system there are a lot of corner cases and wiring, and agents are good at discovering those. I interrupt when it tries to fix things by departing from the design and instead nudge or write a better solution quickly.

    I absolutely hate the "Spin a cadre of agents to design/implement a feature from a concise spec" workflow. It involves so much planning to get the automatic execution working that it's often just easier to switch to hybrid planning/execution with both AI and people.

    • cautiouscat 1 hour ago
      I’ve also been finding that “Spin a cadre of agents to design/implement a feature from a concise spec” is really difficult. It’s been faster (for me) to do what you said and do a hybrid.

      I’ve been trying out this cadre of agents idea with PR stacking and while I think it’s going to end up working fine, it took so much massaging to get it to where I needed it to be. Whereas with the hybrid approach, the problem space is a lot narrower and easier for me to define and the LLM to implement.

  • injidup 1 hour ago
    Why do people keep writing this drivel? Obviously written by an LLM itself. What they are describing, and what doesn't work, is one-shotting a fix. Almost no human can one-shot a fix to a significant working system.

    The human / LLM needs to have some form of error correction signal: either a corpus of tests or a proof system that prevents regressions.
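
    A minimal sketch of such an error-correction loop, assuming a test suite is the signal: propose a fix, run the tests, and feed the failures into the next attempt. All names here (`fix_with_feedback`, `toy_propose`, `toy_run_tests`) are hypothetical, and the proposer is a toy stub rather than a real model call.

```python
from typing import Callable, List, Optional

Candidate = Callable[[int, int], int]


def fix_with_feedback(
    propose: Callable[[List[str]], Candidate],
    run_tests: Callable[[Candidate], List[str]],
    max_attempts: int = 3,
) -> Optional[Candidate]:
    feedback: List[str] = []
    for _ in range(max_attempts):
        candidate = propose(feedback)    # the first call is the "one shot"
        failures = run_tests(candidate)  # the tests are the error signal
        if not failures:
            return candidate
        feedback = failures              # failures inform the next proposal
    return None                          # give up after max_attempts


def toy_propose(feedback: List[str]) -> Candidate:
    # Toy stub: wrong on the first try, corrected once any failure is reported.
    if not feedback:
        return lambda a, b: a - b
    return lambda a, b: a + b


def toy_run_tests(candidate: Candidate) -> List[str]:
    # Return a list of failure messages; an empty list means all tests pass.
    got = candidate(2, 3)
    return [] if got == 5 else [f"add(2, 3) returned {got}, expected 5"]


fixed = fix_with_feedback(toy_propose, toy_run_tests)
one_shot = fix_with_feedback(toy_propose, toy_run_tests, max_attempts=1)
```

    With a single attempt the toy proposer fails, which mirrors the one-shot case; with feedback from the tests it converges on the second try.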

    If you have a working system with no tests or validation and let a human loose on it then it will break. How is this different?

  • r_lee 1 hour ago
    LLM slop article about LLM slop. amazing how this stuff just gets instantly to the front page
  • EGreg 1 hour ago
    "But agentic work is global and transformative: the LLM must change the system itself, which requires understanding dependencies, invariants, interactions, and downstream consequences.

    This is causal reasoning, not pattern extension. LLMs predict tokens, not consequences — and that is why the leap from writing code to producing a safe, system‑aware PR‑ready diff is not incremental but a shift into a fundamentally different problem space."

    This is well said. We need a new paradigm. I could go into the shortcomings of the current agent-oriented approaches but it would turn into a huge post. If you want to read it, I wrote it up here: http://safebots.ai/agents.html

  • taintlord223 1 hour ago
    I would simplify to: Why agents cannot meaningfully contribute
  • christkv 1 hour ago
    I find it works if you do it in small parts of the system but systemwide really creates a lot of slop.
    • cold_harbor 1 hour ago
      the slop has a mechanism: once you cross ~15 files the invariant set doesn't fit in context. locally correct edits, globally broken.
  • antirez 1 hour ago
    > LLMs generate statistically plausible continuations of text

    Jesus, it's fucking 2026. Even LeCun would never say this again.

    • thesz 52 minutes ago
      LLMs are Markov Chains [1]. "Emergent abilities" of LLMs can be explained by decrease of perplexity in text prediction [2].

        [1] https://arxiv.org/abs/2410.02724
        [2] https://arxiv.org/abs/2304.15004