While there's not a lot of meat on the bone for this post, one section of it reflects the overall problem with the idea of Claude-as-everything:
> I spent weeks casually trying to replicate what took years to build. My inability to assess the complexity of the source material was matched by the inability of the models to understand what it was generating.
When the trough of disillusionment hits, I anticipate this will become collective wisdom, and we'll tailor LLMs to the subset of uses where they can be more helpful than hurtful. Until then, we'll try to use AI to replace in weeks what took us years to build.
If LLMs stopped improving today, I'm sure you would be correct. As it is, I think it's very hard to predict what the future holds and where the advancements will take us.
I don’t see a particularly good reason why LLMs wouldn’t be able to do most programming tasks, with the limitation being our ability to specify the problem sufficiently well.
I feel like we've been hearing this for 4 years now. The improvements to programming (IME) haven't come from improved models; they've come from agents, tooling, and environment integrations.
I would think/hope that code-assist LLMs would be optimizing towards supportable, legible code overall - mostly in the sense that they can at least provide a jumping-off point, while accepting that more often than not they won't be able to produce complete, finished solutions.
Funny to see this show up today since coincidentally I've had Claude Code running for the past ~15 hours attempting to port MicroQuickJS to pure dependency-free Python, mainly as an experiment in how far a porting project can go, but also because a sandboxed (memory-constrained, with time limits) JavaScript interpreter that runs in Python is something I really want to exist.
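(For anyone wondering how a pure-Python interpreter can enforce limits like that deterministically: the usual trick is a step/fuel budget checked inside the eval loop, plus caps on allocations. A minimal sketch of the idea, not the actual port - it evaluates arithmetic, not JavaScript, and every name in it is made up:)

    # Sketch only: a deterministic "time limit" via a step budget checked on
    # every evaluation step. This evaluates arithmetic expressions, not JS.
    import ast

    class BudgetExceeded(Exception):
        pass

    def eval_expr(source: str, max_steps: int = 10_000) -> float:
        steps = 0

        def tick():
            nonlocal steps
            steps += 1
            if steps > max_steps:
                raise BudgetExceeded(f"evaluation exceeded {max_steps} steps")

        def ev(node):
            tick()
            if isinstance(node, ast.Expression):
                return ev(node.body)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp):
                left, right = ev(node.left), ev(node.right)
                if isinstance(node.op, ast.Add): return left + right
                if isinstance(node.op, ast.Sub): return left - right
                if isinstance(node.op, ast.Mult): return left * right
                if isinstance(node.op, ast.Div): return left / right
            raise ValueError("disallowed syntax")

        return ev(ast.parse(source, mode="eval"))

    print(eval_expr("1 + 2 * (3 - 4) / 5"))  # prints 0.6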
I'm currently torn on whether to actually release it - it's in a private GitHub repository at the moment. It's super-interesting and I think complies just fine with the MIT licenses on MicroQuickJS so I'm leaning towards yes.
It's got to 402 tests with 2 failing - the big unlock was the test suite from MicroQuickJS: https://github.com/bellard/mquickjs/tree/main/tests
> I think you halucinated this up. (Quote from the original comment, before it was maliciously edited.)
No point in responding to a troll, but for the other people who may be reading this comment chain, he's used LLMs for various tasks. Not to mention that he founded TextSynth, an entire service that revolves around them.
> TextSynth provides access to large language, text-to-image, text-to-speech or speech-to-text models such as Mistral, Llama, Stable Diffusion, Whisper thru a REST API and a playground. They can be used for example for text completion, question answering, classification, chat, translation, image generation, speech generation, speech to text transcription, ...
Using a neural network to compress text is not using an LLM.
https://bellard.org/nncp/
I hate when people make up shit.
https://textsynth.com/
https://bellard.org/ts_sms/
???
TI had a similar idea with the TI-99/4: running interpreted BASIC programs using a BASIC interpreter written in a special interpreted language (GPL) running in its own virtual machine, with the actual CPU machine code executing from RAM accessible through a single-byte window of the video processor. Really brilliant system, turtles all the way down.
It's amusing to think that claude might be better at generating ascii diagrams than generating code to generate diagrams, despite it being nominally better at generating code.
I'm generating a lot of PDFs* in claude, so it does ascii diagrams for those, and it's generally very good at it, but it likely has a lot of such diagrams in its training set. What it then doesn't do very well is aligning them under modification. It can one-shot the diagram, but it can't update it very well.
The euphoric breakthrough into frustration of so-called vibe-coding is well recognised at this point. Sometimes you just have to step back and break the task down smaller. Sometimes you just have to wait a few months for an even better model which can now do what the previous one struggled at.
* Well, generating Typst mark-up, anyway.
I won't deny OP learned something in this process, but I can't help but wonder: if they had spent the same time and effort just porting the code themselves, how much more would they have learned?
Especially considering that the output would be essentially the same: a bunch of code that doesn't work.
I guess it depends on how well people want to know things like "Perl (and C) library to web" skills. Personally, there are languages I don't want to learn, but for one reason or another, I have to change some details in a project that happens to use that language. Sure, I could sit down and learn enough of the language so I can do the thing, but if I don't like or want to use that language, the knowledge will eventually atrophy anyways, so why bother?
As always, the answer is "divide & conquer". Works for humans, works for LLMs. Divide the task into as small, easy to verify steps as possible, ideally steps you can automatically verify by running one command. Once done, either do it yourself or offload to LLM, if the design and task splitting is done properly, it shouldn't really matter. Task too difficult? Divide into smaller steps.
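(A minimal sketch of the "one command per step" idea in Python - the step names and npm commands are made up for illustration:)

    # Sketch: each step in the plan carries one shell command that proves it is
    # done. Step names and npm commands below are invented placeholders.
    import subprocess

    STEPS = [
        ("port string helpers",  "npm test -- string_utils"),
        ("port date handling",   "npm test -- dates"),
        ("port the CLI wrapper", "npm test -- cli"),
    ]

    for name, command in STEPS:
        if subprocess.run(command, shell=True).returncode != 0:
            print(f"step failed: {name} - fix it or split it further")
            break
        print(f"step verified: {name}")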
You just need to know what you are doing. In this case, the problem is not "rewriting the logic" but "mapping Perl syntax to Typescript syntax" and "mapping Perl libs to Typescript libs". In other words, you'd be better off with an old-fashioned script that merely works on syntax mangling along with careful selection of dependencies (and maybe some manual labor around fixing the APIs of the consumers).
This is easy work, made hard by the "allure" of LLMs, which go from emphatic to emetic in the blink of an eye.
If you don't know what you are doing, you should stay away from LLMs if there is anything at all at stake.
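(For illustration, a minimal sketch of the kind of syntax-mangling script suggested above, in Python. It only covers a few mechanical Perl-to-TypeScript rewrites; real Perl needs an actual parser, and everything else stays manual:)

    # Sketch: regex-level Perl -> TypeScript mangling. Only trivial, mechanical
    # constructs are handled; anything real needs a proper parser and a human.
    import re

    RULES = [
        (re.compile(r"\buse strict;\s*"), ""),              # drop pragmas
        (re.compile(r"\bmy\s+(?=[$@%])"), "let "),          # my $x -> let $x
        (re.compile(r"\belsif\b"), "else if"),
        (re.compile(r"->\{(\w+)\}"), r".\1"),               # $h->{key} -> $h.key
        (re.compile(r"[$@%](\w+)"), r"\1"),                 # strip sigils
        (re.compile(r"\beq\b"), "==="),                     # string comparisons
        (re.compile(r"\bne\b"), "!=="),
        (re.compile(r"\bsub\s+(\w+)\s*\{"), r"function \1() {"),
    ]

    def mangle(perl_source: str) -> str:
        out = perl_source
        for pattern, replacement in RULES:
            out = pattern.sub(replacement, out)
        return out

    print(mangle('sub greet { my $name = shift; return "hi " . $name; }'))
    # -> function greet() { let name = shift; return "hi " . name; }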
> A reader (or dare I say a wiser version of me), armed with a future model and dedicated to the task, will succeed with this port where I failed and that makes me uneasy.
I simply cannot come up with tasks the LLMs can't do, when running in agent mode, with a feedback loop available to them. Giving a clear goal, and giving the agent a way to measure its progress towards that goal, is incredibly powerful.
With the problem in the original article, I might have asked it to generate 100 test cases, and run them with the original Perl. Then I'd tell it, "ok, now port that to Typescript, make sure these test cases pass".
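(A minimal sketch of that feedback loop in Python - original.pl, port.ts and cases.txt are placeholder names, and npx tsx is just one way to run the TypeScript side:)

    # Sketch: run every case through the original Perl and the TypeScript port
    # and diff the outputs. original.pl, port.ts and cases.txt are placeholders.
    import subprocess

    def run(cmd: list[str], case: str) -> str:
        return subprocess.run(cmd, input=case, capture_output=True, text=True).stdout

    with open("cases.txt") as f:
        cases = [line.rstrip("\n") for line in f if line.strip()]

    failures = 0
    for case in cases:
        expected = run(["perl", "original.pl"], case)
        actual = run(["npx", "tsx", "port.ts"], case)
        if expected != actual:
            failures += 1
            print(f"MISMATCH for {case!r}: perl={expected!r} ts={actual!r}")

    print(f"{len(cases) - failures}/{len(cases)} cases match")

Point the agent at a script like this and tell it to keep going until every case matches.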
Really, you haven't found a single task they can't do? I like agents, but this seems a little unrealistic? Recently, I asked Codex and Claude both to "give me a single command to capture a performance profile while running a playwright test". Codex worked on this one for at least 2 hours and never succeeded, even though it really isn't that hard.
This is unfortunate. I thought porting code from one language to another was somewhere LLMs were great, but if you need expertise of the source code to know what you are doing that's only an improvement in very specific contexts: Basically just teams doing rewrites of code they already know.
Our team used claude to help port a bunch of python code to java for a critical service rewrite.
As a "skeptic", I found this to demonstrate both strengths and weaknesses of these tools.
It was pretty good at taking raw python functions and turning them into equivalent-looking java methods. It was even able to "intuit" that a python list of strings called "active_set" was a list of functions that it should care about and discard other top-level, unused functions. It gave the functions reasonable names and picked usable data types for every parameter, as the python code was untyped.
That is, uh, the extent of the good.
The bad: It didn't "one-shot" this task. The very first attempt, it generated everything, and then replaced the generated code with a "I'm sorry, I can't do that"! After trying a slightly different prompt it of course worked, but it silently dropped the code that caused the previous problem! There was a function that looked up some strings in the data, and the lookup map included swear words, and apparently real companies aren't allowed to write code that includes "shit" or "f you" or "drug", so claude will be no help writing swear filters!
It picked usable types, but I don't think I know Java well enough to understand the ramifications of choosing Integer instead of int as a parameter type. I'll have to look into it.
It always writes a bunch of utility functions. It refactored simple and direct conditionals into calls to utility functions, which might not make the code very easy to read. These utility functions are often unused or outright redundant. We have one file with like 5 different date parsing functions, and they were all wrong except for the one we quickly and hackily changed to try different date formats (because I suck, and the calling service sometimes slightly changes the timestamp format). So now we have 4 broken date parsing functions and 1 working one, and that will be a pain we have to fix in the new year.
The functions looked right at first glance but often had subtle errors. Other times the ported functions had parts where it just gave up and ignored things? These caused outright bugs for our rewrite. Enough to be annoying.
At first it didn't want to give me the file it generated? Also the code output window in the Copilot online interface doesn't always have all the code it generated!
It didn't help at all with the hard part: actual engineering. I had about 8 hours and needed to find a way to dispatch parameters to all 50ish of these functions, and I needed to do it in a way that didn't involve rebuilding the entire dispatch infrastructure from the python code or the dispatch systems we had in the rest of the service already, and I did not succeed. I hand-wrote manual calls to all the functions, filling in the parameters, which the autocomplete LLM in intellij kept trying to ruin. It would constantly put the wrong parameters in the wrong places and get in my way, which was stupid.
Our use case was extremely laser-focused. We were working from python functions that were designed to be self-contained and fairly trivial, doing just a few simple conditionals and returning some value. Simple translation. To that end it worked well. However, we were only able to focus the tool on this use case because we already had 8 years of experience with the development and engineering of this service, and had already built out the engineering of the new service, with lots of "infrastructure" that these simple functions could be dropped into and easy tooling to debug the outcomes and logic bugs in the functions using tens of thousands of production requests, and that still wasn't enough to kill all errors.
All the times I turned to claude for help on a topic, it let me down. When I thought java reflection was wildly more complicated than it actually is, it provided the exact code I had already started writing, which was trivial. When I turned to it for profiling our spring boot app, it told me to write log statements everywhere. To be fair, that is how I ended up tracking down the slowdown I was experiencing, but that's because I'm an idiot and didn't intuit that hitting a database on the other side of the country takes a long time and I should probably not do that in local testing.
I would pay as much for this tool per year as I pay for Intellij. Unfortunately, last I looked, Jetbrains wasn't a trillion dollar business.
Good luck to Microsoft trying to port a billion lines of C++ mazes to Rust with their bullshit machines, I'm sure they won't give up on that one after half a week
I'm sure the MS plan is not just asking Claude "port this code to rust: <paste>", but it's just fun to think it is :)
0: https://www.theregister.com/2025/12/24/microsoft_rust_codeba...
https://github.com/willtobyte/NES
I consider myself a bit of an expert vibe engineer and the challenge is alluring :D