The State of AI Coding Report 2025

(greptile.com)

63 points | by dakshgupta 6 hours ago

13 comments

  • wessorh 4 minutes ago
    clearly selling the report to business people who don't code. Like most things in the AI arena today, the report is BS about a system that mostly creates technical debt and is sold as intelligence.
  • zkmon 2 hours ago
    I take these "code-output" metrics with a pinch of salt. Of course, a machine can generate 1000 times more lines of code, much as a power loom does. However, the comparison with the power loom ends there.

    How maintainable is this code output? I saw an SPA HTML file produced by a model that looked almost like assembly code. So if the code can only be maintained by a model, then an appropriate metric should be based on the long-term maintainability achieved, not on the instant generation of code.

    • a_imho 2 hours ago
      My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.

      As a dev I very much subscribe to this line of thought, but I also have to admit most of the business class people would disagree.

      • order-matters 51 minutes ago
        From a business perspective, the developer is the expert in lines of code and the assumption is that expertise should agree on the necessity of a line of code. To create lines of code that do not need to be there is akin to simply not doing your job in this perspective. The finished product should have X lines of code

        so from a business standpoint, if equivalent expertise amongst staff is assumed then productivity comes down to lines of code created. Just like how you might measure productivity of a warehouse employee by the number of items moved per hour. Of course if someone just throws things across the warehouse or moves things that don't need to be moved they will maximize this metric, but that would be doing the job wrong - which is not a productivity measurement problem. though admittedly the incentive structures and competition make these things often related

        the bigger issue to highlight, imo, is that the business side of things has no idea if coders are doing the job sufficiently well or not, and the lack of understanding is amplified by the reality that productivity contribution varies wildly per line, some requiring much more work to conjure than others. The person they need to rely on to validate this difference per instance is the same person who is responsible for creating the lines. So there is a catch-22 on the business side. An unproductive employee can claim productivity no matter what the measurement is.

        if the variance of work required per line could be understood by the business side then it could be managed for. I used to manage productivity metrics for a medical coding company, and some charts are more dense and harder to code than others. I did not know how to code a medical chart but I could still manage productivity by charts per hour while still understanding this caveat

        the point isn't to use the productivity metric as a one stop shop for promoting and firing people but as a filter for attention, where all the middle of the pack stuff will more or less even out and not require too much direct attention. you then just need to get an understanding of how the average difficulty per item varies by product/project.

        that said, maybe lines edited is still a step better - so that refactoring in a way that reduces the size of the codebase can still be seen as productive. 1 point for each line deleted and 1 point for each line added.
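
A minimal sketch of the "lines edited" scoring described above, next to the plain net line count it is meant to improve on. The `Change` record and the sample numbers are hypothetical; in practice the added and deleted counts would come from whatever diff statistics the version control system reports.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """One merged change; the field names are illustrative only."""
    added: int
    deleted: int


def lines_edited(change: Change) -> int:
    # One point per line added plus one point per line deleted.
    return change.added + change.deleted


def net_lines(change: Change) -> int:
    # The plain "lines produced" count, which goes negative when code shrinks.
    return change.added - change.deleted


# Hypothetical sample data: a feature that grows the codebase and a
# refactor that shrinks it.
feature = Change(added=120, deleted=10)
cleanup = Change(added=15, deleted=200)

for name, change in [("feature", feature), ("cleanup", cleanup)]:
    print(name, lines_edited(change), net_lines(change))
```

Under this scoring the size-reducing refactor earns more points than the feature, whereas a net count would book it as negative output. Neither number says anything about whether the lines were needed.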

        I understand that every line should be viewed as a liability, not an asset, but that's the job responsibility of the hired expert to figure out how many need to exist. it's not the job of the business side of things to manage.

        I wouldn't tell my foundation guys how much concrete to use, or my electrician how much wire to use, but if one team can handle more concrete per hour than another and they are both qualified professionals, it really doesn't seem unreasonable to start off conversations with an assumption that one is more productive than the other. Lazy people do exist everywhere, it's usually a matter of magnitude of laziness between people more than it is a matter of actual full earnest capability

        • Talanes 29 minutes ago
          "Just like how you might measure productivity of a warehouse employee by the number of items moved per hour. Of course if someone just throws things across the warehouse or moves things that dont need to be moved they will maximize this metric, but that would be doing the job wrong - which is not a productivity measurement problem."

          I fail to see how having a measurement that clearly doesn't measure what is actually produced isn't exactly a productivity measurement problem. If your measurement is defeated by someone doing their job badly, what use is it?

    • hvb2 2 hours ago
      Agreed, I stopped reading at that point. You can't take yourself seriously if you create a report and use LOC as your measure.

      I feel like we humans try to separate things and keep things short. We do this not because we think it's pretty, we do it so our human brains can still reason about a big system. As a result LOC is a bad measure as being concise then hurts your productivity????

      • dakshgupta 2 hours ago
        We're careful not to draw any conclusions from LoC. The fact is LoCs are higher, which by itself is interesting. This could be a good or bad thing depending on code quality, which itself varies wildly person-to-person and agent-to-agent.
        • mrdependable 2 hours ago
          Can you expand on why it is interesting?
          • zed31726 2 hours ago
            Because it's different. Change is important to track
    • dakshgupta 1 hour ago
      How would you measure code quality? Would persistence be a good measure?
      • scuff3d 1 hour ago
        That question has been baffling product managers, scrum masters, and C-suite assholes for decades. Along with how you measure engineering productivity.
      • epicureanideal 1 hour ago
        Bad code can persist because nobody wants to touch it.

        Unfortunately I’m not sure there are good metrics.

    • scuff3d 1 hour ago
      It shouldn't be taken with a pinch of salt, it should be disregarded entirely. It's an utterly useless metric, and the fact that the report leads with it makes the entire thing suspect.
    • apercu 1 hour ago
      When I was first learning Perl after being a shell scripter/sysadmin I produced a lot of code. 2-3 years later the same tasks would be way less code. So is more code good?

      Also, my anecdotal experience is that LLM code is flat wrong sometimes. Like a significant percentage. I can't quote a number really, because I rarely do the same thing/similar thing twice. But it's a double digit percentage.

  • dakshgupta 6 hours ago
    Hi, I'm Daksh, a co-founder of Greptile. We're an AI code review agent used by 2,000 companies from startups like PostHog, Brex, and Partiful, to F500s and F10s.

    About a billion lines of code go through Greptile every month, and we're able to do a lot of interesting analysis on that data.

    We decided to compile some of the most interesting findings into a report. This is the first time we've done this, so any feedback would be great, especially around what analytics we should include next time.

    • ChrisbyMe 2 hours ago
      Hey! Thanks for publishing this.

      Would be interested in seeing the breakdown between uplift vs company size.

      e.g. I work in a FAANG and have seen an uptick in the number of lines on PRs, partially due to AI coding tools and partially due to incentives for performance reviews.

      • dakshgupta 1 hour ago
        This is a good one, wish we had included it. I'd run some analysis on this a while ago and it was pretty interesting.

        An interesting subtrend is that Devin and other full async agents write the highest proportion of code at the largest companies. Ticket-to-PR hasn't worked nearly as well for startups as it has for the F500.

    • neom 2 hours ago
      If AI tools are making teams 76% faster with 100% more bugs, one would presume you're not more productive, you're just punting more debt. I'm no expert on this stuff, but coupling it with some type of defect density insights might be helpful. Would also be interested to know what percentage of AI assisted code is "rolled back" or "reverted" within 48 hours. Has there been any change in number of review iterations over time?
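
One way the "reverted within 48 hours" figure asked about above could be computed, as a rough sketch. The input shape (pairs of merge time and revert time) and the sample timestamps are assumptions made for illustration; nothing here reflects how Greptile stores or analyzes its data.

```python
from datetime import datetime, timedelta


def revert_rate(merges, window=timedelta(hours=48)):
    """Fraction of merges reverted within `window` of being merged.

    `merges` is a list of (merged_at, reverted_at) pairs, where
    reverted_at is None for changes that were never reverted.
    """
    if not merges:
        return 0.0
    quick = sum(
        1
        for merged_at, reverted_at in merges
        if reverted_at is not None and reverted_at - merged_at <= window
    )
    return quick / len(merges)


# Hypothetical sample: four merges, one reverted after 5 hours,
# one reverted after 10 days, two never reverted.
t0 = datetime(2025, 1, 6, 9, 0)
sample = [
    (t0, t0 + timedelta(hours=5)),
    (t0, t0 + timedelta(days=10)),
    (t0, None),
    (t0, None),
]
print(revert_rate(sample))
```

Only the first sample merge falls inside the window, so the late revert does not count toward the rate. Tracking the same ratio separately for AI-assisted and hand-written changes would need a reliable way to label them, which this sketch does not attempt.
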
      • apercu 23 minutes ago
        Right? I want to see the problem ticket variance year over year with something to qualify the data if release velocity is more frequent.
    • jacekm 1 hour ago
      > About a billion lines of code go through Greptile every month, and we're able to do a lot of interesting analysis on that data.

      Which stats in the report come from such analysis? I see that most metrics are based on either data from your internal teams or publicly available stats from npm and PyPI.

      Regardless of the source, it's still an interesting report, thank you for this!

      • dakshgupta 1 hour ago
        Thanks! The first 4 charts as well as Chart 2.3 are all from our data!
    • wrs 2 hours ago
      It’s hard to reach any conclusion from the quantitative code metrics in the first section, because as we all know, more code is not necessarily better. “Quantity” is not actually the same as “velocity”. And that gets to the most important question people have about AI assistance: does it help you maintain a codebase long term, or does it help you fly headlong into a ditch?

      So, do you have any quality metrics to go with these?

      • dakshgupta 2 hours ago
        We weren’t able to find a good quality measure. LLM-as-judge didn't feel right. You’re correct that without that the data is interesting but not particularly insightful.
    • chis 1 hour ago
      Wish you'd show data from past years too! It's hard to know if these are seasonal trends or random variance without that.

      Super interesting report though.

  • magicloop 1 hour ago
    Your graphs roughly marry up with my anecdotal experience. After a while, when you know when and how to utilize LLMs/agents, coding does become more productive. There is a discernible improvement in productivity at the same quality level.

    Also I notice it when the LLMs are offline. It feels a bit like when the internet connection fails. You remember the old days of lower productivity.

    Of course, there are a lot of junk/silly ways to approach these tools but all tools are just a lever, and need judgement/skill to use them well.

  • locusofself 2 hours ago
    This is definitely interesting information and I plan to take a deeper look at it.

    What a lot of us must be wondering though is:

    - how maintainable is the code being outputted

    - how much is this newfound productivity saving (costing) on compute, given that we are definitely seeing more code

    - how many livesite/security incidents will be caused by AI generated code that hasn't been reviewed properly

    • dakshgupta 2 hours ago
      We weren’t able to agree on a good way to measure this. Curious - what’s your opinion on code churn as a metric? If code simply persists over some number of months, is that indication it’s good quality code?
      • arcwhite 1 hour ago
        I've seen code persist a long time because it is unmaintainable gloop that takes forever to understand and nobody is brave enough to rebuild it.

        So no, I don't think persistence-through-time is a good metric. Probably better to look at cyclomatic complexity, and maybe for a given code path or module or class hierarchy, how many calls it makes within itself vs to things outside the hierarchy - some measure of how many files you need to jump between to understand it
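
For readers unfamiliar with cyclomatic complexity, here is a minimal approximation using Python's standard `ast` module: one plus the number of decision points in each function. The set of node types counted is a simplifying assumption, nested functions are also counted toward their enclosing function, and dedicated tools apply more careful rules. The `SAMPLE` source is made up for the demonstration.

```python
import ast

# Node types treated as one extra decision point each (a simplification).
_BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def approx_complexity(source: str) -> dict:
    """Map each function name to 1 + the number of decision points inside it."""
    tree = ast.parse(source)
    scores = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = 1
            for child in ast.walk(node):
                if isinstance(child, _BRANCH_NODES):
                    score += 1
                elif isinstance(child, ast.BoolOp):
                    # "a and b and c" adds two extra paths.
                    score += len(child.values) - 1
            scores[node.name] = score
    return scores


SAMPLE = '''
def simple(x):
    return x + 1

def branchy(items, limit):
    total = 0
    for item in items:
        if item > 0 and item < limit:
            total += item
        elif item == 0:
            continue
    return total
'''

print(approx_complexity(SAMPLE))
```

Here `simple` scores 1 and `branchy` scores 5 (the loop, the `if`, the `and`, and the `elif`, plus the base path). The second idea in the comment, counting calls that leave a module or class hierarchy, would need import and call resolution and is not covered by this sketch.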

      • wordpad 1 hour ago
        I've seen code entropy as the suggested heuristic to measure.
  • TuringNYC 2 hours ago
    Kudos to the designer, this site is beautiful.
    • a1ff00 2 hours ago
      Was going to comment the same. Love the dot matrix paper look.
    • dionian 2 hours ago
      agreed. was it AI ?! not that i care - ive been doing a lot of tailwind apps in ai with great success. AI is great for the web, takes all the tedium out of it
  • simonw 2 hours ago
    > Lines of code per developer grew from 4,450 to 7,839 as AI coding tools act as a force multiplier.

    Is that a per-year number?

    If a year has 200 working days that's still only about 40 lines of code a day.

    When I'm in full-blown work mode with a decent coding agent (usually Claude Code) I'm genuinely producing 1,000+ lines of (good, tested, reviewed) code a day.

    Maybe there is something to those absurd 10x multiplier claims after all!

    (I still think there's plenty of work done by software engineers that isn't crunching out code, much of which isn't accelerated by AI assistance nearly as much. 40 lines of code per day felt about right for me a few years ago.)
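
The arithmetic in this comment, spelled out under both readings of the chart. The 200 working days per year comes from the comment itself. A reply further down the thread says the figure is per month, and the 21 working days per month used for that reading is an assumption added here, not a number from the report.

```python
LINES_PER_DEVELOPER = 7_839  # figure quoted from the report

# Reading 1: the figure is per year, with 200 working days (the comment's assumption).
per_day_if_yearly = LINES_PER_DEVELOPER / 200

# Reading 2: the figure is per month; 21 working days per month is an
# assumption made for this sketch.
per_day_if_monthly = LINES_PER_DEVELOPER / 21

print(round(per_day_if_yearly), round(per_day_if_monthly))
```

The per-year reading gives roughly 39 lines a day, matching the "about 40" in the comment, while the per-month reading gives roughly 373 lines a day.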

    • observationist 2 hours ago
      If you actually work, the amount of work you do is absurdly more than the amount of work most others do, and a lot of the time, both the high and low productivity people assume everyone just does as much as they do, in both directions.

      A lot of people are oblivious to Zipf distributions in effort and output, and if you ever catch on to it as a productive person, it really reframes ideas about fairness and policy and good or bad management.

      It also means that you can recognize a good team, and when a bunch of high performers are pushing and supporting each other and being held to account out in the open, amazing things happen that just make other workplaces look ridiculous.

      My hope for AI is that instead of 20% of the humans doing 80% of the work, you end up with force multipliers, and a ramping up, so that more workplaces look like high-functioning teams, making everything more fair and engaging and productive. But I suspect that once people get better with AI, at least up to the point of AGI, we're going to see the same distribution but at 10x or 50x the productivity.

      • Garlef 42 minutes ago
        My experience with coding agents points in the other direction: It's mentally very taxing!

        Usually, you have a lot of time to think on the side while coding on what to do next, strategize, etc. But if you work in small increments with an LLM agent, this time is reduced and you have to be ready for the next thing once one increment is done.

        So I don't see this as an equalizer. Rather, those who can constantly push forward are getting much more than those who don't.

    • lumost 2 hours ago
      There is a long tail of engineers working on mature/stable code bases where there are fewer extremely large diffs, or the review burden is extremely high. If you work on core software - then you can never say that a line of code was wrong "because of the AI." e.g. places where you might need 2-3x code approvers or more.
    • rnewme 2 hours ago
      1k loc per day or 1k git additions? I don't think one person can consistently review 1k loc, and grow a codebase at that speed and size, and classify it as good, tested and reviewed. Can you tell us more about your process?
      • simonw 2 hours ago
        I'm effectively no longer typing code by hand: I decide what change I want to make and then prompt Claude Code to describe that change. Sometimes I'll have it figure out the fix too.

        An example from earlier today: https://github.com/simonw/llm-gemini/commit/fa6d147f5cff9ea9...

        That commit added 33 lines and removed 13 - so I'm already at a 20-lines-a-day level just from that one commit (and I shipped a few more plus a release of llm-gemini: https://github.com/simonw/llm-gemini/commits/a2bdec13e03ca8a...)

        It took about 3.5 minutes. I started from this issue someone had filed against my repo:

        Then I opened Claude Code and said:

          Run this command: uv run llm -m gemma-3-27b-it hi 
        
        That ran the command and returned the error message. I then said:

          Yes, fix that - the gemma models do not support media resolution
        
        Which was enough for it to figure out the fix and run the tests to confirm it hadn't broken anything.

        I ran "git diff", thought about the change it had made for a moment, then committed and pushed it.

        Here's the full Claude Code transcript: https://gistpreview.github.io/?62d090551ff26676dfbe54d8eebbc...

        I verified the fix myself by running:

          uv run llm -m gemma-3-27b-it hi
        
        I pasted the result into an issue comment to prove to myself (and anyone else who cares) that I had manually verified the fix: https://github.com/simonw/llm-gemini/issues/116#issuecomment...

        Here's a more detailed version of the transcript including timestamps, showing my first prompt at 10:01:13am and the final response at 10:04:55am. https://tools.simonwillison.net/claude-code-timeline?url=htt...

        I built that claude-code-timeline application this morning too, and that thing is 2284 lines of code: https://github.com/simonw/tools/commits/main/claude-code-tim... - but that was much more of a vibe-coded thing, I hardly reviewed the code that was written at all and shipped it as soon as it appeared to work correctly. Since it's a standalone HTML file there's not too much that can go wrong if it has bugs in it.

        • WhyOhWhyQ 2 hours ago
          Whenever I start reviewing code produced by Claude I find hundreds of ways to improve it.

          I don't know if code quality really matters to most people or to the bottom line, but a good software engineer writes better code than Claude. It is a testament to library maintainers that Claude is able to code at all, in my opinion. One reason is that Claude uses APIs in wacky ways. For instance by reading the SDL2 documentation I was able to find many ways that Claude writes SDL2 using archaic patterns from the SDL days.

          I think there are a lot of hidden ways AI booster types benefit from basic software engineering practices that they actively promote damaging ideas about. Maybe it will only be 10 years from now that we learn that having good engineers is actually important.

          • simonw 1 hour ago
            > Whenever I start reviewing code produced by Claude I find hundreds of ways to improve it.

            Same here. So I tell it what improvements I want to make and watch it make them.

            I've gained enough experience at prompting it that it genuinely is faster for me to tell it the change I want to make than it is for me to make that change myself, 90% of the time.

          • HDThoreaun 53 minutes ago
            Ok then you just make review comments and it fixes them. Still faster than writing code yourself
    • leothetechguy 2 hours ago
      I couldn't in good conscience work like that, I believe the risk of bad AI generated code due to the tiniest of output variation is far too high. Especially in systems that need to maintain a large state governed by many rules and edge cases.
    • noosphr 2 hours ago
      I'm a good aerospace engineer, my rockets weigh an extra 50kg after every day I work on them.
    • WhyOhWhyQ 2 hours ago
      You're writing Python and JavaScript, right? Those languages are extremely easy to write in (which conversely means the legibility is likely to be poor). People maintaining legacy systems in systems level languages aren't going to be able to produce as much code as people writing Python and JavaScript.
      • simonw 1 hour ago
        Yes, mostly Python and JavaScript and SQL. I'm dabbling a little more with Go these days too.
    • CrzyLngPwd 2 hours ago
      1,000 lines of debt that you didn't review and probably have no idea what they do.
      • AlexandrB 2 hours ago
        Yeah, I don't get it. It's well known that "LOC" is not a good metric of developer productivity. But now that AI is writing those lines of code, it's fine as a metric?
        • noosphr 2 hours ago
          Senior developers know that every line of code is debt. Junior developers think that every line of code is wealth.
    • dakshgupta 2 hours ago
      This is per month, I see now that's not super clear on the chart!
    • cmdtab 2 hours ago
      I saw your example and it was a simple cli tool. Of course you can have claude make commits effectively to it!
      • simonw 2 hours ago
        Totally. I have dozens of "simple CLI tools" that I work on - and small plugins, and HTML+JavaScript utilities.

        If I was hacking on the Linux kernel I would be delighted with myself for producing 40 lines of landed code in a single day.

        • eikenberry 1 hour ago
          They are obviously talking about writing code against expectations greater than these simple tools. Why troll with the hyperbole?
    • waterproof 2 hours ago
      Looks like it's a monthly number.
  • vb-8448 1 hour ago
    In the engineering team velocity section, the most important metric is missing: change rate of new code, or how many times it is changed before being fully consolidated.
    • dakshgupta 1 hour ago
      This is a great suggestion. I'll note it down for next year's. Curious, do you think this would be a good proxy for code quality?
      • all2 1 hour ago
        I would consider feature complete with robust testing to be a great proxy for code quality. Specifically, that if a chunk of code is feature complete and well tested and now changing slowly, it means -- as far as I can tell -- that the abstractions contained are at least ok at modeling the problem domain.

        I would expect code that continually changes and deprecates and creates new features is still looking for a good problem domain fit.

        • dakshgupta 1 hour ago
          Most of our customers are enterprises, so I feel relatively comfortable assuming they have some decent testing and QA in place. Perhaps I am too optimistic?
      • vb-8448 1 hour ago
        It's tricky, but one can assume that code written once and not touched in a while is good code (didn't cause any issues, performance is good enough, etc.).

        I guess you can already derive this value if you sum the total lines changed by all PRs and divide it by (SLOC end - SLOC start). Ideally it should be a value slightly greater than 1.
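
The ratio proposed above, written out as a small sketch. The function name and the sample numbers are hypothetical, and whether "lines changed" counts deletions as well as additions is left to the caller, since the comment does not specify it.

```python
def change_ratio(pr_lines_changed, sloc_start, sloc_end):
    """Total lines changed across all PRs divided by net codebase growth."""
    growth = sloc_end - sloc_start
    if growth <= 0:
        # The ratio has no useful meaning if the codebase did not grow.
        raise ValueError("ratio is undefined unless the codebase grew")
    return sum(pr_lines_changed) / growth


# Hypothetical period: four PRs touching 1,100 lines in total while the
# codebase grew from 10,000 to 11,000 source lines.
ratio = change_ratio([400, 250, 150, 300], sloc_start=10_000, sloc_end=11_000)
print(ratio)
```

A value near 1, such as the 1.1 in this sample, means most lines were written once and kept; a much larger value means new code was rewritten several times before it settled. The ratio breaks down for periods in which the codebase shrinks or stays flat, which is why the sketch raises an error in that case.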

      • sillyfluke 1 hour ago
        It depends on how well you vetted your samples.

        fyi: You headline with "cross-industry", lead with fancy engineering productivity graphics, then caption it with small print saying it's from your internal team data. Unless I'm completely missing something, it comes off as a little misleading and disingenuous. Maybe intro with what your company does and your data collection approach.

        • dakshgupta 1 hour ago
          Apologies, that is poor wording on our part. It's internal data from engineers who use Greptile, tens of thousands of people from a variety of industries, as opposed to external, public data, which is where some of the charts come from.
  • superchris 1 hour ago
    This thing that can't be measured is up 76%. Eyeroll
  • nekooooo 2 hours ago
    i'm a designer and even i know not to measure 'lines of code' as meaningful output or impact. are we really doing this?
    • dakshgupta 2 hours ago
      We expressly did not conclude that more lines = better. You could easily argue more lines = worse. All we wanted to show is that there are more lines.
      • poliphili 1 hour ago
        Language like "productivity gains", "output" and "force multiplier" isn't neutral like you're claiming here, and does imply that the line count metric indicates value being delivered for the business.
  • nik0xffff 2 hours ago
    [flagged]
  • psunavy03 2 hours ago
    Sigh . . . once again I see "velocity" as something to be increased.

    This makes me metaphorically stabby.

    • dakshgupta 2 hours ago
      We were trying not to insinuate that, because we don’t have a good way to measure quality, without which velocity is useless.