Scaling long-running autonomous coding

(simonwillison.net)

94 points | by srameshc 8 hours ago

9 comments

  • Agent_Builder 2 hours ago
    This matches what I’ve seen with long-running agents. The failures usually aren’t one big mistake, but small assumptions compounding over time.

    What helped for me was forcing the agent into short, explicitly scoped steps. Each step declares what it can read, what it can do, and what it’s allowed to output, then that context gets torn down before the next step.

    I’ve been using GTWY for this kind of setup and it made long-running coding agents much more boring and predictable, which is exactly what you want at scale.

    Curious how you’re handling state reset and permission drift as runtimes get longer.

  • simonw 6 hours ago
    One of the big open questions for me right now concerns how library dependencies are used.

    Most of the big ones are things like skia, harfbuzz, wgpu - all totally reasonable IMO.

    The two that stand out for me as more notable are html5ever for parsing HTML and taffy for handling CSS grids and flexbox - that's vendored with an explanation of some minor changes here: https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...

    Taffy a solid library choice, but it's probably the most robust ammunition for anyone who wants to argue that this shouldn't count as a "from scratch" rendering engine.

    I don't think it detracts much if at all from FastRender as an example of what an army of coding agents can help a single engineer achieve in a few weeks of work.

    • sealeck 6 hours ago
      I think the other question is how far away this is from a "working" browser. It isn't impossible to render a meaningful subset of HTML (especially when you use external libraries to handle a lot of this). The real difficulty is doing this (a) quickly, (b) correctly and (c) securely. All of those are very hard problems, and also quite tricky to verify.

      I think this kind of approach is interesting, but it's a bit sad that Cursor didn't discuss how they close the feedback loop: testing/verification. As generating code becomes cheaper, I think effort will shift to how we can more cheaply and reliably determine whether an arbitrary piece of code meets a desired specification. For example did they use https://web-platform-tests.org/, fuzz testing (e.g. feed in random webpages and inform the LLM when the fuzzer finds crashes), etc? I would imagine truly scaling long-running autonomous coding would have an emphasis on this.

      Of course Cursor may well have done this, but it wasn't super deeply discussed in their blog post.

      I really enjoy reading your blog and it would be super cool to see you look at approaches people have to ensuring that LLM-produced code is reliable/correct.

      • simonw 6 hours ago
        Yeah, I'm hoping they publish a lot more about this project! It deserves way more then the few sentences they've shared about it so far.
    • shubhamjain 4 hours ago
      Why attempt something that has abundant number of libraries to pick and choose? To me, however impressive it is, 'browser build from scratch' simply overstates it. Why not attempt something like a 3D game where it's hard to find open source code to use?
      • XenophileJKO 21 minutes ago
        There are a lot of examples out there. Funny that you mention this. I literally just last night started a "play" project having Claude Code build a 3D web assembly/webgl game using no frameworka. It did it, but it isn't fun yet.

        I think the current models are at a capability level that could create a decent 3D game. The challenges are creating graphic assets and debugging/Qa. The debugging problem is you need to figure out a good harness to let the model understand when something is working, or how it is failing.

      • Banditoz 3 hours ago
        Is something like a 3D game engine even hard to find source code for? There's gotta lots of examples/implementations scattered around.
      • cheevly 1 hour ago
        Assets are very hard to produce and largely unsolved by AI at the moment.
    • janoelze 5 hours ago
      Any views on the nature of "maintainability" shifting now? If a fleet of agents demonstrated the ability to bootstrap a project like that, would that be enough indication to you that orchestration would be able to carry the code base forward? I've seen fully llm'd codebases hit a certain critical weight where agents struggled to maintain coherent feature development, keeping patterns aligned, as well as spiralling into quick fixes.
      • simonw 5 hours ago
        Almost no idea at all. Coding agents are messing with all 25+ years of my existing intuitions about what features cost to build and maintain.

        Features that I'd normally never have considered building because they weren't worth the added time and complexity are now just a few well-structured prompts away.

        But how much will it cost to maintain those features in the future? So far the answer appears to be a whole lot less than I would previously budget for, but I don't have any code more than a few months old that was built ~100% by coding agents, so it's way too early to judge how maintenance is going to work over a longer time period.

      • brianjeong 4 hours ago
        I think there's a somewhat valid perspective that the Nth+1 model can simply clean up the previous models mess.

        Essentially a bet that the rate of model improvement is going to be faster than the rate of decay from bad coding.

        Now this hurts me personally to see as someone who actually enjoys having quality code but I don't see why it doesn't have a decent chance of holding

    • teaearlgraycold 3 hours ago
      It looks like JS execution is outsourced to QuickJS?
  • vedmakk 2 hours ago
    After reading that post it feels so basic to sit here, watching my single humble claude code agent go along with its work... confident, but brittle and so easily distracted.
  • light_hue_1 28 minutes ago
    Browsers are pretty much the best case scenario for autonomous coding agents. A totally unique situation that mostly doesn't occur in the real world.

    At a minimum:

    1. You've got an incredibly clearly defined problem at the high level. 2. Extremely thorough tests for every part that build up in complexity. 3. Libraries, APIs, and tooling that are all compatible with one another because all of these technologies are built to work together already. 4. It's inherently a soft problem, you can make partial progress on it. 5. There's a reference implementation you can compare against. 6. You've got extremely detailed documentation and design docs. 7. It's a problem that inherently decomposes into separate components in a clear way. 8. The models are already trained not just on examples for every module, but on example browsers as a whole. 9. This done condition for this isn't a working browser, it's displaying something.

    This isn't a realistic setup for anything that 99.99% of people work on. It's not even a realistic setup for what actual developers of browsers do who must implement new or fuzzy things that aren't in the specs.

    Note 9. That's critical. Getting to the point where you can show simple pages is one thing. Getting to the point where you have a working production browser engine, that's not just 80% more work, it's probably considerably more than 100x more work.

  • retinaros 57 minutes ago
    Agentic coding is a card castle built on another card castle (test time compute) built on another card castle (token prediction) the mere fact that using lot of iterations and compute works maybe tells us that nothing is really elegant about the things we craft.
  • halfcat 5 hours ago
    So AI makes it cheaper to remix anything already-seen, or anything with a stable pattern, if you’re willing to throw enough resources at it.

    AI makes it cheap (eventually almost free) to traverse the already-discovered and reach the edge of uncharted territory. If we think of a sphere, where we start at the center, and the surface is the edge of uncharted territory, then AI lets you move instantly to the surface.

    If anything solved becomes cheap to re-instantiate, does R&D reach a point where it can’t ever pay off? Why would one pay for the long-researched thing when they can get it for free tomorrow? There will be some value in having it today, just like having knowledge about a stock today is more valuable than the same knowledge learned tomorrow. But does value itself go away for anything digital, and only remain for anything non-copyable?

    The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?

    • ramraj07 4 hours ago
      The fundamental idea that modern LLMs can only ever remix, even if its technically true (doubt), in my opinion only says to me that all knowledge is only ever a remix, perhaps even mathematically so. Anyone who still keeps implying these are statistical parrots or whatever is just going to regret these decisions in the future.
      • pseudosavant 3 hours ago
        But all of my great ideas are purely from my own original inspiration, and not learning or pattern matching. Nothing derivative or remixed. /sarcasm
      • mrbungie 3 hours ago
        > Anyone who still keeps implying these are statistical parrots or whatever is just going to regret these decisions in the future.

        You know this is a false dichotomy right? You can treat and consider LLMs statistical parrots and at the same time take advantage of them.

      • heavyset_go 3 hours ago
        Yeah, Yann LeCun is just some luddite lol
        • NitpickLawyer 3 hours ago
          I don't think he's a luddite at all. He's brilliant in what he does, but he can also be wrong in his predictions (as are all humans from time to time). He did have 3 main predictions in ~23-24 that turned out to be wrong in hindsight. Debatable why they were wrong, but yeah.

          In a stage interview (a bit after the "sparks of agi in gpt4" paper came out) he made 3 statemets:

          a) llms can't do math. They can trick us with poems and subjective prose, but at objective math they fail.

          b) they can't plan

          c) by the nature of their autoregressive architecture, errors compound. so a wrong token will make their output irreversibly wrong, and spiral out of control.

          I think we can safely say that all of these turned out to be wrong. It's very possible that he meant something more abstract, and technical at its core, but in the real life all of these things were overcome. So, not a luddite, but also not a seer.

          • gjadi 2 hours ago
            Have this shortcomings of llms been addressed by better models or by better integration with other tools? Like, are they better at coding because the models are truly better or because the agentic loops are better designed?
            • NitpickLawyer 2 hours ago
              100% by better models. Since his talk models have gained more context windows (up to usable 1M), and RL (reinforcement learning) has been amazing at both picking out good traces, and taught the LLMs how to backtrack and overcome earlier wrong tokens. On top of that, RLAIF (RL with AI feedback) made earlier models better and RLVR (RL with verifiable rewards) has made them very good at both math and coding.

              The harnesses have helped in training the models themselves (i.e. every good trace was "baked in" the model) and have improved in enabling test time compute. But at the end of the day this is all put back into the models, and they become better.

              The simplest proof of this is on benchmarks like terminalbench and swe-bench with simple agents. The current top models are much better than their previous versions, when put in a loop with just a "bash tool". There's a ~100LoC harness called mini-swe-agent [1] that does just that.

              So current models + minimal loop >> previous gen models with human written harnesses + lots of glue.

              > Gemini 3 Pro reaches 74% on SWE-bench verified with mini-swe-agent!

              [1] - https://github.com/SWE-agent/mini-swe-agent

    • tornikeo 3 hours ago
      > The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?

      It's nearly frictionless, not frictionless because someone has to use the output (or at least verify it works). Also, why do you think the "shape" of the knowledge is spherical? I don't assume to know the shape but whatever it is, it has to be a fractal-like, branching, repeating pattern.

    • ukuina 3 hours ago
      Single-idea implementations ("one-trick ponies") will die off, and composites that are harder to disassemble will be worth more.
  • tinyhouse 5 hours ago
    Well, software is measured over time. The devil is always in the details.
    • aronowb14 4 hours ago
      Yeah curious what would happen if they asked for an additional big feature on top of the original spec
  • anilgulecha 6 hours ago
    That's a wild idea-a browser from scratch! And ladybird has been moving at snails pace for a long time..

    I think a good abstractions design and good test suite will make it break success of future coding projects.

  • vivzkestrel 5 hours ago
    I am waiting for that guy or a team that uses LLMs to write the most optimal version of Windows in existence, something that even surpasses what Microsoft has done over the years and honestly looking at the current state of Windows 11, it really feels like it shouldn't even be that hard to make something more user friendly
    • kimixa 4 hours ago
      Considering Microsoft's significant (and vocal) investment in LLMs, I fear the current state of Windows 11 is related to a team trying to do exactly that.
      • g947o 4 hours ago
        I noticed that dialog that has worked correctly for the past 10+ years is using a new and apparently broken layout. Elements don't even align properly.

        It's hard to imagine a human developer misses something so obvious.