Hacker News | pron's comments

Yep. The only people I've heard saying that generated code is fine are those who don't read it.

The problem is that the mitigations offered in the article also don't work for long. When designing a system or a component we have ideas that form invariants. Sometimes the invariant is big, like a certain grand architecture, and sometimes it’s small, like the selection of a data structure. You can tell the agent what the constraints are with something like "Views do NOT access other views' state" as the post does.

Except, eventually, you'll want to add a feature that clashes with that invariant. At that point there are usually three choices:

- Don’t add the feature. The invariant is a useful simplifying principle and it’s more important than the feature; it will pay dividends in other ways.

- Add the feature inelegantly or inefficiently on top of the invariant. Hey, not every feature has to be elegant or efficient.

- Go back and change the invariant. You’ve just learnt something new that you hadn’t considered and puts things in a new light, and it turns out there’s a better approach.

Often, only one of these is right. Often, at least one of these is very, very wrong, and with bad consequences.

Picking among them isn’t a matter of context. It’s a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often. I would say no better than random chance.

Even if you have an architecture in mind, and even if the agent follows it, sooner or later it will need to be reconsidered. What I've seen is that if you define the architectural constraints, the agent writes complex, unmaintainable code that contorts itself to it when it needs to change. If you don't read what the agent does very carefully - more carefully than human-written code because the agent doesn't complain about contortious code - you will end up with the same "code that devours itself", only you won't know it until it's too late.


If you know how to write good code, you can force AI to write good code with various techniques. It's 100% doable. You just need to figure out the problems AI has and find solutions that make things easier for it. For example: keep contexts extremely small. Modularize into modules with clear boundaries and only allow the AI to work within those boundaries. Make modules pure (free of IO) so they are easily testable. Hide modules behind interfaces, etc. You can write 100 tests that execute within a second. You can write benchmarks, etc. AI needs boundaries and small contexts to work well. If you fail to give it that, it will perform poorly. You are in charge.
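
(A minimal C sketch of the "pure from IO, behind a boundary" idea; the cart/tax example and names are made up for illustration, not taken from the comment above. The logic has no IO, so a plain assert-based test exercises it instantly, and IO lives in a thin shell at the edge.)

    #include <assert.h>
    #include <stdio.h>

    /* Pure core: no IO, so a plain unit test exercises it instantly. */
    typedef struct { long subtotal_cents; } cart_t;

    static long cart_total_cents(const cart_t *c, int tax_permille) {
      return c->subtotal_cents + (c->subtotal_cents * tax_permille) / 1000;
    }

    /* Thin IO shell kept at the edge, outside the boundary the AI works in. */
    static void print_total(const cart_t *c, int tax_permille) {
      printf("total: %ld cents\n", cart_total_cents(c, tax_permille));
    }

    /* One of the "100 tests that execute within a second". */
    int main(void) {
      cart_t c = { .subtotal_cents = 1000 };
      assert(cart_total_cents(&c, 100) == 1100); /* 10% tax */
      print_total(&c, 100);
      return 0;
    }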

That doesn't quite work, and precisely for the reason I mentioned: You can definitely tell the AI to follow some strategy, but at some point the strategy is likely to change, and the AI won't tell you that (even if you tell it to). Unless you read the code every time you won't know if the AI is following the strategy and producing good results or following it and producing bad results because the strategy has to change. This can happen even in small changes: the AI will follow the strategy even if the change proves it's wrong, and if you don't pay close attention, these mistakes pile up.

So yes, you might get good results in one round, but not over time. What does work is to carefully review the AI's output, although the review needs to be more careful than review of human-written code because the agents are very good at hiding the time bombs they leave behind.


So, basically you need to micro-manage it. Where are your 10x gains now? And is it fun to work like that?

I agree with this. I've been writing a new internal framework at work and migrating consumers of the old framework to the new one.

I had strong principles at the outset of the project and migrated a few consumers by hand, which gave me confidence that it would work. The overall migration is large and expensive enough that it has been deferred for nearly a decade. Bringing down the cost of that migration made me turn to AI to accelerate it.

I found that it was OK at the more mechanical and straightforward cases, which are 80% of the use cases, to be fair. The remaining 20% need changes to the framework. Most of them need very small changes, such as an extra field in an API, but one or two require a partial conceptual redesign.

To oversimplify the problem: the backend for one system can generate certain data in 99% of cases. In a few critical cases, it logically cannot, and that data must be reported to it. Some important optimizations were made with the assumption that this would be impossible.

The AI tooling didn't (yet) detect this scenario and happily added migration logic assuming it would work properly.

Now, because of how this is being rolled out, this wasn't a production bug or anything (yet). However, asking the right questions to partner teams revealed it and unearthed that some others were going to need it as well.

Ultimately, it isn't a big problem to solve in a way that will mostly satisfy everyone, but it would have been a big problem without a human deeper in the weeds.

Over time, this may change. Validation tooling I built may make a future migration of this kind easier to vibe code even if AI functionality doesn't continue to improve. Smarter models with more context will eventually learn these problems in more and more cases.

The code it generates still oscillates between beautiful and broken (or both!), so for now my artistic sensibilities make me keep a close eye on it. I think of the depressed robot from The Hitchhiker's Guide to the Galaxy as the intelligence behind it. Maybe one day it'll be trustworthy.


“The only people I've heard saying that generated code is fine are those who don't read it.” Are you sure these people aren’t busy working rather than chatting? (haha)

But in all seriousness it depends on what you’re doing with it. Writing a quick tool using an LLM is much easier than context changing to write it yourself. If you need the tool, that’s very valuable.


Sure. I'm talking about production software that needs to survive and evolve for a long while.

Can you not review it?

This is the core unspoken bone of contention in most AI arguments, I think: most people either aren't writing code with strict quality requirements or don't realize where their use of AI is violating them.

That said, most of the world's most useful code has strict quality requirements. Even before AI, 90% of SLOC would be tossed away without much, if any, use; 9% was used infrequently, while 1% runs half the world's software.


Also, as a webdev, it writes basic CRUD pretty well. I am tired of having to build forms myself, and the LLMs are usually really good at that.

Been building a new app with lots of policies and whatnot, and instructing an LLM is just much faster than doing the same repetitive shit over and over myself.


If you were tired of writing forms yourself, had you looked at https://jsonforms.io/? Just specify the data you need, or extract it from the API spec, and go. Display the form uniformly every time across your site. No need to burn AI time.

I typically avoid most abstractions or third-party dependencies. Yeah, it could be neat, but I still need a lot of custom logic here and there. Same reason I avoid stuff like GraphQL.

A little update: upon viewing the page on a phone, the "committer" field in the demo is going out of bounds for me... Really doesn't speak well for their product.


This might pair well with something like https://data-atlas.net.

And the solution is the same as when it was outsourced: the "patch" was to fix it by writing a spec. Thus I conclude my TED talk with the statement: LLMs are the new outsourcing, and they run into the same problems.

Not quite, because the architecture often needs to evolve when you learn more as the project evolves. People will complain when they feel the constraints drive them to unnatural workarounds, the agents don't.

You can try telling the agent to stop and ask when a constraint proves problematic, except it doesn't have as good a judgment as humans to know when that's the case. I often find myself saying, "why did you write that insane code instead of raising the alarm about a problem?" and the answer is always, "you're absolutely right; I continued when I should have stopped." Of course, you can only tell when that happens if you carefully review the code.


Don't outsource either then

How about we outsource it to Pakistan and they use LLMs? That way, we do what the LLM people do - many agents, stacked on top of each other.

> Picking among them isn’t a matter of context. It’s a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often. I would say no better than random chance.

Yeah I’m currently working for several months already on a harness that wraps Claude Code and Codex etc to ensure that these types of invariants are captured and enforced (after the first few harness attempts failed), and - while it’s possible - slows down the workflow significantly and burns a lot more tokens. In addition to requiring more human involvement, of course.

I suspect this is the right direction, though, as the alternatives inevitably lead any software project to devolve into a spaghetti-mess maintenance nightmare.


It's not enough to enforce the invariants because they may need to change. You need to follow the invariants when they're right, and go back and reconsider them when they prove unhelpful. Knowing which is the case requires judgment that today's models are simply incapable of (not consistently, at least).

> The only people I've heard saying that generated code is fine are those who don't read it.

Well, that is problematic. I have to assume you are either disinterested or lying, and neither is great for any discourse.


Yeah, their statement just isn't true. With enough instruction, I've been able to get great output from models. I think that's the key: with detailed, pointed instructions, the output will match.

The point of TLA+ is easily readable specs with refinements. That's why it was designed in a novel way rather than in the older executable-with-more-complex-logic style that Lean (and other maths/spec languages) offer. I'm not saying you should prefer TLA+'s approach, but that's what it's meant to accomplish and LLMs don't change that.

How do you have "aggressive error detection" when some of the most common and pernicious mistakes agents make are architectural? The behaviour is fine, but the code is overly defensive, hiding possible bugs and invariant violations, leading to ever more layers of complexity until, ultimately, nothing can be changed without breaking something.

Except people who are learning to leverage it appropriately already know better than to generate important production code by "managing fleets of agents".

None of these execs are AI-native, and the managers tend not to be either

> Over the past year, I’ve watched engineers use AI to ship in days what used to take a team weeks.

No, you didn't. You watched engineers use AI to ship in days something that looks like what used to take a team weeks. After enough rounds of feature evolution, you'll realise that what they actually shipped isn't at all the same. Anthropic's C compiler, which also seemed like a good start that would have taken people much longer to deliver, ended up being impossible to turn into something actually workable.

In a year or so, software developed by "AI-native talent who can manage fleets of agents to drive outsized impact" - which is another way of saying people who ship code they don't understand and therefore haven't fixed the architectural mistakes the agents make - will become impossible to evolve, and then things will get very interesting.

AI can help software developers in many ways, but not like that.


AI definitely leads to some productivity gain but the claims of 10x, 100x, 1000+x are (for now) irrational exuberance. Churning out prototype software has always been quick, and now it's blazing. But these LLMs are like Happy Gilmore. They get to the green in one shot then they orbit the hole with an extremely dubious short game. The virtue is in their parallelizability but you still need to review their work, lest you come back to it wrestling an alligator while a ruined TV tower husk sends spark showers over the pin.

> But these LLMs are like Happy Gilmore. They get to the green in one shot then they orbit the hole with an extremely dubious short game.

Except that he got good at his short game by the end. LLMs will get there sooner than we think.


I don’t think we will, though, because the “short game” is matching the requirements of the agent operator. If we don’t care about the finer details that we let the LLMs infer, then we shouldn’t care if a human infers them (and yet we do).

I think LLMs are great, and I think people who can use them to get to the green in one and take it from there will soar, just like people who could identify a problem and solve it themselves did in the past.


I am an engineer. I hire other engineers. I run a company that ships usable software for small businesses.

We do this every day. I'm sorry to say, we are indeed shipping in days what used to take weeks.


As a software engineer who also hires other software engineers, I’m curious about the disconnect in our experiences.

I do systems programming. Before AI, feature development roughly went: design, implement, test, review, with some back edges and a lot of time spent in test and review.

AI has made the implementation part much faster, at the cost of even more time spent testing and reviewing, though still an improvement overall.

We do not see the weeks to days improvement though. The bottleneck before was testing and reviewing, and they are even bigger bottlenecks now.

What kind of work do you do, and what kind of workflow were you using before and after AI to benefit so much?


> I do systems programming.

I'll stop you right there. AI is not good at systems programming, it's good at CRUD web development, which is where most people are seeing the gains.


I think antirez mentioned somewhere he considered it particularly good at systems programming.

Depends what it's used for; generally I've seen that, due to the paucity of C or Rust etc. training data vs JavaScript and TypeScript, LLMs aren't as good at the former as at the latter.

This is a myth in my experience. LLMs are good at all the kinds of programming I've tried using them on, including many cases that are very far from "CRUD web development".

>95% of software development is crud.

It's really not, though. As soon as systems have to scale, regulatory requirements come in, etc. it becomes more complex.

AI has solved simple CRUD, yes, but CRUD was easy before.


Anytime you hear such wild claims, imagine a typical code sweatshop (not just CRUD apps but templated e-shops/business pages etc.), not a system that will evolve for another 10-20 years beyond the initial implementation and is the backend cornerstone of some part of some corporation. That is, in case it's actually true at all; there is a ton of PR happening here, plus another gigaton of uncritical fanboyism, as with any hot topic.

Now there may be an additional corner case or 20 where it's still valid, but they are not your typical software engineering work.

My experience matches yours; even a 100x code-delivery improvement would barely move the needle on project delivery in our place. Better, more automated integration and end-to-end functional tests which reflect real-world usage/data flows would actually make a much bigger difference, and there's no reason to think LLMs couldn't deliver this in the near future.


Not the OP, but it might be that AI isn't as good at systems programming as it is at other domains, or it might be that you're using it differently than I am. I don't know which one it is (maybe AI just isn't good at writing the language you work with).

For things like web frontends/backends, though, it works beautifully. I ship things in days that would take me weeks to write by hand, and I'm very fast at writing things by hand. The AI also ships many fewer bugs than our average senior programmer, though maybe not fewer bugs than our staff programmers.


In my experience, AI has had far, far more bugs than most of what I'd call senior engineers, but far fewer than juniors.

The boost is for what are glorified CRUD apps, where it 1000x's the tedious work. However, the choices it makes along the way quickly blow up without cleanup. Seniors know how to keep their workstation clean, or they should.


It sounds like we have opposite experiences.

I never touched Kubernetes, and in 1 week I have a few nodes running and I understand a lot of it. Not perfect, but not bad.

I have recently learned Kubernetes without AI and one week is more than enough to understand most of it.

This is definitely not true. But I doubt GP understands "most" of Kubernetes either. They probably have a good working knowledge of the important, commonly used features.

…it definitely is true, I spun up a cluster at home to learn it for a new job and felt comfortable with the basics within a few days.

That was the usual experience pre AI

>AI has made the implementation part much faster, at the cost of even more time spent testing and reviewing,

Maybe they're using AI for testing and reviewing more than you are, not just for coding?


The "AI implementation" step in my workflow includes separate agents dedicated to testing and reviewing changes. The self feedback loop catches a lot of errors and mistakes, but it rarely produces working code in one go.

In my experience, the generated code handles the happy path, but isn't great about edge cases or writing clean code, even with explicit instruction in the initial prompt.

We usually end up doing multiple iterations with what claude/codex output, pointing out issues, asking for changes, etc.



I work on cutting-edge C++ systems programming, and we are using Codex for everything now; it's pretty impressive, honestly, what it can do.

We design and build software systems that our clients' businesses run on. So it's not the product, it's the system that allows them to run their business. Typically, it's less "QuickBooks" and more "Let QuickBooks talk to 10 different systems" and then custom functionality built on that.

It's glue, custom business workflows, and basic web CRUD stuff. We build almost everything on Rails unless there's a critical reason not to (e.g., maintaining an existing system versus building from scratch.)

With very few exceptions our team composition is one senior engineer paired to a business. So we get to avoid a large amount of SDLC busywork which is inter-team communication. This leaves more time for client<->engineer communication which has a host of additional benefits. We also build with a "North Star" methodology which keeps everyone, including the client, laser focused on the work at hand.

To answer your final question about how we're benefiting so much from AI, I think it's primarily that we're leaning into it for implementation, testing, and review alike. I know it's a sin to let AI review AI, but... it works. I'm actively skeptical of it myself, but our error rate and rework rates don't lie.

And we've got clients in various stages of development and/or long-term support. It's not like we're just hammering a bunch of stuff out and then bouncing. Most of these are multi-year tightly-integrated projects with our clients and we don't see a lack of trust or frustration that you'd expect to see if you were shipping slop. Our Honeybadger errors typically stay at zero, our performance metrics are acceptable across the board, and most importantly our clients love the work we're doing.

I can't think of any other way to measure the quality of what we're doing. And by those metrics, AI has made us better, not worse.

I should write a blog post to outline more of this in detail.


The only way you could possibly know that is if you're reviewing the code, which means you're not "managing fleets of agents". If you're not reviewing the code (and you wouldn't be if you're managing fleets of agents), then you have no way to tell what you're shipping.

It’s under-appreciated that a proper review takes at least as long as the actual work: it’s all the same time spent understanding the challenge and coming up with the best solution, minus the time spent typing in your solution (almost never a significant amount), plus the time spent understanding their solution and explaining how to get from theirs to yours.

Correct. We do review the code, and we're not managing "fleets of agents". My experience has generally been that the "fleet" approach is not very effective.

Can you link to a changelog that shows the 5-10x feature increases? I keep hearing this, but I don’t see anything I use ever actually shipping like this, or people backing this up with any sort of proof.

Our projects are closed source due to our clients owning the code, but I can offer an anecdote. We have a client whose business operates on 2-3 very niche SaaS applications in the veterinary/animal medicine space. In a span of about 6 months, we completely ripped out 2 of those 3 and are working on replacing the 3rd one right now. We've done this with a single senior engineer working with the client between 20-40 hours per week with no major regressions. The business has been able to continue working as usual with no disruptions throughout this process.

Obviously it's hard to measure this objectively, but I can't imagine having done this pre-AI with zero downtime and having replaced those SaaS applications in that timeframe.


That reminds me of a chart I saw posted in HN comments recently, which someone created by tracking bullet points per day in Claude Code release notes, and which was cited as "proof of a step change" in AI development over the last year. It showed like a dozen or so per day on average, which jumped to like over 50 one month and stayed around that number.

(Not the exact same chart but similar idea, I guess it's sort of a meme: https://imgur.com/a/YrNGYOR)

So I looked at the most recent CC release notes on Github and the majority look like this:

  Fixed /clear not resetting the terminal tab title after a conversation
  Fixed session title chip from /rename disappearing while a permission or other dialog is active
  Fixed agent panel below the prompt being hidden when subagents are running (regression in 2.1.122)
  Fixed external-editor handoff (Ctrl+G) blanking the conversation history above the prompt
  Fixed /context dumping its rendered ASCII visualization grid into the conversation, wasting ~1.6k tokens per call
  Fixed OAuth refresh race after wake-from-sleep that could log out all running sessions
  Fixed 1-hour prompt cache TTL being silently downgraded to 5 minutes
  Fixed cache-miss warning appearing spuriously after /clear or compaction when changing /effort or /model
I'd be extremely interested to know what percentage of these were just fixing last week's Claude Code written PR that no human ever set eyes on.

But hey, all that churn looks great on charts being circulated on social media as free advertising for their flagship product (and consequently the company's valuation) so never mind, LGTM!


Give an example.

I have an example in my line of work: a full service rewrite in a new language. It would have taken forever without AI; AI makes it easier, faster. The service has better throughput and uses fewer machines. Having a complete, full test harness that allows us to ensure we are meeting all the functionality of the previous service is key. AND we are keeping the old service on standby because we know we don't know what might be wrong with the new one.

What's your example?


From another comment above:

> Our projects are closed source due to our clients owning the code, but I can offer an anecdote. We have a client whose business operates on 2-3 very niche SaaS applications in the veterinary/animal medicine space. In a span of about 6 months, we completely ripped out 2 of those 3 and are working on replacing the 3rd one right now. We've done this with a single senior engineer working with the client between 20-40 hours per week with no major regressions. The business has been able to continue working as usual with no disruptions throughout this process.

> Obviously it's hard to measure this objectively, but I can't imagine having done this pre-AI with zero downtime and having replaced those SaaS applications in that timeframe.


Yeah that validates my experience. It's best / mostly preferable for ground up rewrites and greenfield work.

I worry we haven't had to maintain vibecoded applications much and have no idea how difficult they will be to debug (or not).


If you carefully review the code then you're not doing what Armstrong was talking about. If you're not reviewing the code, then you don't really know what it is that the AI built. Of course it passes tests; that's not the problem. The problem is that the code is complicated and obtuse, even if it doesn't seem that way on the surface, and after some rounds of evolution, the agents are no longer able to evolve or maintain the code.

The difference between "it's working now" and "it will continue working in two years" is exactly the problem with AI-generated code, because the tests can't tell you that, and you don't know which one you have if you don't look really carefully.


I was pretty clear that we did not review all the code, and we have kept the original service on standby exactly because we are aware code is complicated and could have obscure failure modes while passing our whole test suite.

Does what you ship involve hundreds of lines of HTML/CSS by any chance? Do you care about accessibility?

It does indeed. Most of what we build are web applications used internally by our clients (e.g., inside their business, not customer facing.)

Because of that, we don't typically spend a lot of time on accessibility because it's internal facing software. As far as I'm aware, these businesses don't have individuals who need those accommodations. Of course, if that changed, it is something we'd need to consider.


> I am an engineer. I hire other engineers. I run a company that ships usable software for small businesses.

> We do this every day. I'm sorry to say, we are indeed shipping in days what used to take weeks.

I've been searching for months for evidence of this kinda thing. Do you have receipts you can share? Or is it more of the same "just trust me bro"?


I should put together a blog post to share more, but unfortunately it is more "trust me bro" at this stage. You can see a few other comments where I replied: we do have subjective evidence that seems to suggest to me that we're moving much faster than we could've moved in the past.

Of course, it's not just shipping, it's shipping stably in a way that doesn't disrupt the day-to-day operations of the businesses we're working for. One client that comes to mind has 2-3 niche SaaS applications that they used independently for various workloads. We completely replaced 2 of those without any disruptions to their business in about 6 months (no, we did not replace it feature-for-feature; we just built what they needed.)


What you are shipping is not the same as what Coinbase is shipping. These are vastly different things. Making a shiny app with AI is great, I'm doing it as I type this. But I am under no delusion that what I make can sustain a multi-million dollar or even billion dollar business in the case of Coinbase. That's plain silly.

I agree with you. I didn't intend to make the argument that what my company does and what Coinbase does are on the same level, if that's what came across.

Shipping garbage.

We have zero Honeybadger errors, performance is acceptable for all our routes in the application, and all of our key stakeholders are ecstatic about what we've built.

Is there some other metric I should be measuring our code by?


Yeah, absolutely embarrassing take. If I had a nickel for every time someone sent me some AI garbage that was supposedly "thoroughly vetted and cross-checked agent output", I'd be at least a thousandaire (gotta keep it real).

There are strengths, but if you think it's writing streams of code and just using them as-is, I would LOVE to compete against you.


Ever notice how people making this claim never come with receipts?

This bothered me some months ago, so I began posting online all the non-sensitive LLM sessions I made. That way, when someone makes an assertion, I can present evidence.

I commented this yesterday, I’ll repeat it again - what do you guys think organizations that have heavily leaned into AI are shipping nowadays?

Most devs aren’t working on cutting-edge, low-level, mission-critical systems. AI is great for that. Every company I personally know has been fast-shipping features that are being used daily by millions of people for the past 7 months.

We have the same thing on my team, and we also understand the limitations of AI generated code. If you’re more or less experienced, you can easily see the “good” and “bad” sides of it. So you kinda plan it out in a way that you can “evolve AI generated software”. I wouldn’t have said the same thing in January 2025, but it’s much different times now. Things are already working.


> If you’re more or less experienced, you can easily see the “good” and “bad” sides of it. So you kinda plan it out in a way that you can “evolve AI generated software”.

If you're truly "managing fleets of agents" there's no way you're able to sift through the good and the bad in the output. If your AI-generated code is evolvable (which is hard to tell right now) then you're not writing it with "fleets of agents". If you are writing it with fleets of agents, I would bet it's not evolvable; you just haven't reached the breaking point yet.


We’re not managing fleets of agents. They’re not productive for our workflows yet. It’s usually a couple of CC CLIs running and going back and forth on specific tasks we closely control.

My point is that they're not productive for any workflow, because they don't produce sustainable software, yet that's exactly what Armstrong is calling for. They don't work, and people experienced with AI workflows already know that.

If you review the code and tell the agent to revert when it gets things wrong (not functionally but architecturally) you're fine. That's not what I was responding to.


You're just wrong on this though, and I don't know why you aren't realizing it's a skill issue on your part

Nah, it's a skill issue on the part of those who believe in "agent swarms" (in fact, that's how I recognise AI noobs: they think swarms work). Studies (like this one [1]) and Anthropic's experiments have told us they don't. We do experiments with software-correctness and formal-methods experts who actually dive deep into "swarm outputs" and try to put evolutionary pressure on them. Swarms simply cannot (yet) produce viable software. They do, however, produce software that passes tests for a while. What I think is happening is that people who believe swarms work just look at test results. But obviously, every software engineer has known for decades that tests can only tell you whether your software works today; they can't tell you that it will work tomorrow. And the people who say that unreviewed agent output will work tomorrow are those who didn't review it closely enough, so they have no idea, either.

[1]: https://arxiv.org/abs/2603.03823


You're successfully beating the shit out of the strawman you've created. People are using LLMs to see massive productivity benefits and ship production code right now.

If you aren't, it's a skill issue on your part


Oh, I think you just haven't read my comments. What I wrote, and I quote was: "AI can help software developers in many ways, but not like that."

I was saying how much more productive LLMs make developers unless you use them in the way Armstrong advocates. Coding agents are amazingly helpful but not when you use them through "fleets" or "swarms". People who know how to be most productive with coding agents know that, but Armstrong doesn't.


Most of the people making this argument vastly overestimate the quality of engineering and discipline behind the software powering most corporations. CRUD apps are likely the most prominent type of application across industries, and most of them are crud.

If the code is really simple, it's cheap to read it. When people don't read it (and when they need to use "fleets of agents"), it's because it's not so simple, and then the people who trust the outcome are those who don't know what it is that they've committed into the codebase. Their logic is no more than: the system hasn't collapsed under the load of 50 (or 500) changes so it probably won't collapse under the load of the next 500 (or 5000). Because that's how engineered systems work, right? If they're fine under light stress, they're fine under heavier stress.

> Because that's how engineered systems work, right? If they're fine under light stress, they're fine under heavier stress.

Isn't this wrong? I thought engineered systems meant something designed with limits.


I was being sarcastic.

Yes, it can. I do this regularly.

I have literally built and shipped multiple things that would have taken me many many months to do and I’ve done it in under a week.

Many of these are LLM-heavy features where the LLM can literally self-evaluate and self-optimize. I start with a general feature; it will generate adverse, synthetic data, build the feature, optimize it, then figure out new places to improve. A year ago, this would have taken an entire team months to do; now, it's 2 or 3 days of work.


The C compiler was a prime example of an application where the LLM can self-evaluate/optimise, with one of the best sets of tests one could imagine. Yet the end result was a mess.

I have experienced areas where high productivity can be had without much loss in quality. So I can believe it. But it really depends on what you’re doing and I firmly believe many companies will run out of easy stuff that we can blaze through with AI fairly quickly. At least that’s where we seem to be heading


What's an example of such a thing? Just curious

And your parents must be proud of you. You’re just another cog

People that manage AI agents are not engineers as they do no engineering but are instead just supervisors.

only dorks care this much about being an "engineer" or "artist". Who gives a shit if misanthropes on websites consider you a real engineer?

What is the difference between supervisor and an architect in tech products area?

Early in my career, people said this about programmers who (weakly) insisted on using assemblers.

Then, about people using high-level languages like C.

Then, about people using C++.

Then, about people using "toy"/"scripting" languages like PHP and Python.

About people who use ORMs instead of writing SQL directly.

About people who use JavaScript ("not a real programming language" was the dis).

People used to argue that it was the mark of a tourist to use anything more visual than Emacs.

This slight won't stick, nobody cares, and it might end up sounding stupid later. You can't usefully insult a professional engineer in 2026 by pointing out that they haven't memorized ASCII or the Arm instruction set.


> In a year or so

Look at the best models from Spring 2025, and compare with now (and similarly for Springs 2024 and 2025). Armstrong and lots of others are betting that this trend will continue, and if it does, the LLMs will ship code the LLMs understand, and whether any human specifically understands any particular part will mostly not matter.


> the LLMs will ship code the LLMs understand, and whether any human specifically understands any particular part will mostly not matter.

I find this particularly funny. There were more than a couple of Star Trek episodes where some alien planet depends on some advanced AI or other technology that they no longer understand, and it turns out the AI is actually slowly killing them, making them sterile, etc. (e.g. https://en.wikipedia.org/wiki/When_the_Bough_Breaks_(Star_Tr... )

Sure, Star Trek is fiction, but "humans rely on a technology that they forget how to make" is a pretty recurrent theme in human history. The FOGBANK saga was pretty recent: https://en.wikipedia.org/wiki/Fogbank

It just amazes me that people think "Sure, this AI generated code is kinda broken now, but all we need is just more AI code to fix it at some unknowable point in the future because humans won't be able to understand it!"


If you'd told me 20-30 years ago we'd actually get the Star Trek computer in the mid-2020s and it still wouldn't be actually AGI, I would have thought that very strange and unlikely, so who knows?

> wouldn't be actually AGI

Not sure that's going to age well.


Well, I meant existing or previous here in May 2026, though "mid-2020s" could definitely be interpreted to mean 2023-2027 or so.

> It just amazes me that people think "Sure, this AI generated code is kinda broken now, but all we need is just more AI code to fix it at some unknowable point in the future because humans won't be able to understand it!"

This is not even limited to code. I've seen people justifying AI datacenters using fossil fuels because AI will solve fusion power plants at some unknowable point in the future.


So nothing about the last 3 years has caused you to update your beliefs on this stuff? feels like bitter cope

And if the trend doesn't continue? I understand that a company with Coinbase's performance has little to lose and not many options, but many companies are in a better position.

The problem is that executives could take the 15-20% productivity boost and be content, but they read stuff like this, get greedy, and they don't understand the risk they're taking.


Even if the trend doesn’t continue, the current models are very very good. They’re better than the average programmer in the industry, already.

I don't know how anyone who carefully and closely reviews their output could possibly think that. Much of the time their code is fine, but every now and again they make a catastrophic (though often well-hidden) mistake that is so bad that all the tests pass but the codebase will be bricked if enough of those go in. They make such disastrous mistakes frequently enough that a decent-sized codebase can't last for more than 18-24 months.

If the average programmer is this bad, then there must be better-than-average programmers reviewing the code. The problem with agents is that they can produce code at a far higher volume than the average programmer.

Anyway, I don't know how well the average programmer programs, but if you commit agent-generated code without careful review, your codebase will be cooked in a year or two.


Maybe at some coding benchmark. Certainly not at actually shipping and maintaining production grade software.

Agreed! That will be an... "interesting" outcome, if so, for a lot of these companies.

> and whether any human specifically understands any particular part will mostly not matter.

This is how I feel. It’s building things for me that work. I don’t care how it works under the hood in many cases.


It's not about caring how it works. It's about caring that it keeps working at all even after you add stuff to it for a year or three (and nearly all software written by companies is software they evolve).

And who’s to say it won’t? It’s working now. I’m adding stuff and it’s still working. Why won’t that continue in year 3?

If you carefully read the agent's output you'll see why. It adds layers upon layers of workarounds and defences that hide serious problems, until the codebase reaches a point where the agent can no longer understand it and work with it. All the tests pass right up until the moment when adding a feature or fixing a bug causes another bug, and then nothing and no one can save the codebase anymore.

Maybe a year ago? Right now the LLMs I mainly use (GPT 5.5, Opus 4.7) will intuit exactly what I need from my brief specs and universally go above and beyond in creating code that is not only extremely high-quality, but catches, in advance, a ton of the gotchas I would have stumbled on.

Just a minute ago 5.5 looked at some human-written code of mine from last year and while it was making the changes I asked for it determined the existing code was too brittle (it was) and rewrote it better. It didn't mention this in its summary at the end, I only know because I often watch the thinking output as it goes past before it hides it all behind a pop-open.


Interesting that we’ve had such different experiences. I was working with both those models today, and on several occasions they proposed some pretty poor solutions.

I also find I need to run an LLM code review or two against any code it produces to even get to the point where it's ready for human review.

In any case they served as an extremely valuable tool.


I use GPT 5.5. Sometimes it does what you say. It certainly finds silly mistakes in my code better than I could. But frequently enough it makes catastrophic architectural mistakes in its own code.

Maintaining software is like 80% of the job.

Because the APIs it uses will change? Nothing in tech is static. And that’s just going to get worse re: this whole AI thing.

Turns out it's not infinitely spawnable after all.

There are a lot of flaws with their fantasy world; that's not even the most prominent one.

> But I would say if you know a value will logically always be >= 0, better to have a type that reflects that.

Except that's not quite what unsigned types do. They are not (just) numbers that will always be >= 0, but numbers where the value of `1 - 2` is > 1 and depends on the type. This is not an accident but how these types are intended to behave because what they express is that you want modular arithmetic, not non-negative integers.
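
(For illustration, a small C example of that modular behaviour, assuming a typical 32-bit unsigned int:)

    #include <stdio.h>

    int main(void) {
      unsigned a = 1, b = 2;
      /* Unsigned subtraction wraps modulo 2^N instead of going negative: */
      printf("%u\n", a - b);        /* prints 4294967295 with 32-bit unsigned */
      printf("%d\n", (a - b) > a);  /* prints 1: here "1 - 2" is greater than 1 */
      return 0;
    }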

> e.g. "x--" won't compile without explicitly supplying a case for x==0

If you want non-negative types (which, again, is not what unsigned types are for) you also run into difficulties with `x - y`. It's not so simple.

There are many useful constraints that you might think it's "better to have a type that reflects that" - what about variables that can only ever be even? - but it's often easier said than done.


This is true, which means that a language has to be designed from the ground up to deal with these problems, or there will always be inscrutable bugs due to misuse of arithmetic results. A simple example in a C-like language would be that the following function would not compile:

    unsigned foo(unsigned a, unsigned b) { return a - b; }
but this would:

    unsigned foo(unsigned a, unsigned b) {
      auto c = a - b;          // inferred as the range type [-0xffffffff, 0xffffffff]
      return c >= 0 ? c : 0;   // the underflow case is handled explicitly, so this compiles
    }
Assuming 32-bit unsigned and int, the type of c should be computed as the range [-0xffffffff, 0xffffffff], which is different from int's range of [-0x80000000, 0x7fffffff]. Subtle things like this are why I think it is generally a mistake to type-annotate the result of a numerical calculation when the compiler can compute it precisely for you.

First, your code is about having unsigned types represent the notion of non-negative values, but this is not the intent of unsigned types in C/C++. They represent modular arithmetic types.

Second, it's not as simple as you present. What is the type of c? Obviously it needs to be signed so that you could compare it to zero, but how many bits does it have? What if a and b are 64 bit? What if they're 128 bit?

You could do it without storing the value and by carrying a proof that a >= b, but that is not so simple, either (I mean, the compiler can add runtime checks, but languages like C don't like invisible operations).


That's true for signed numbers too though? `int_min - 2 > int_min`

I agree they're a bit more error-prone in practice, but I suspect a huge part of that is because people are so used to signed numbers because they're usually the default (and thus most examples assume signed, if they handle extreme values correctly at all (much example code does not)). And, legitimately, zero is a more commonly-encountered value... but that can push errors to occur sooner, which is generally a desirable thing.


> That's true for signed numbers too though? `int_min - 2 > int_min`

As someone else already pointed out, that's undefined behaviour in C and C++ (in Java they wrap), but the more important point is that the vast majority of integers used in programs are much closer to zero than to int_min/max. Sizes of buffers etc. tend to be particularly small. There are, of course, overflow problems with signed integers, but they're not as common.


> That's true for signed numbers too though? `int_min - 2 > int_min`

No, that's undefined behavior in C, and if you care about correctness, you run at least your testsuite in CI with -ftrapv so it turns into an abort().


Which makes them even less safe than unsigned, where it is defined, yes? The optimizations it can lead to are incredibly hard to predict.

Besides, for safety there are much clearer options, like wrapping_add / saturating_add. Aborting is great as a safety tool though, agreed - it'd be nice if more code used it.


You can keep the trap in production, and then it is safer. If you need to catch the problem at run time, there are checked-integer options in C that you can use.
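
(For example, a sketch of a runtime-checked subtraction, assuming GCC or Clang, which provide the __builtin_*_overflow checked-arithmetic builtins; the helper name here is made up:)

    #include <stdio.h>
    #include <stdlib.h>

    /* Subtraction with an explicit runtime check; the trap (abort) stays on in production. */
    static unsigned checked_usub(unsigned a, unsigned b) {
      unsigned result;
      if (__builtin_sub_overflow(a, b, &result)) {  /* GCC/Clang checked-arithmetic builtin */
        fprintf(stderr, "underflow in %u - %u\n", a, b);
        abort();
      }
      return result;
    }

    int main(void) {
      printf("%u\n", checked_usub(5, 3));  /* prints 2 */
      /* checked_usub(3, 5) would abort instead of silently wrapping */
      return 0;
    }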

> you also run into difficulties with `x - y`.

If you have "uint x" and "uint y", then for "x - y", the programmer should explicitly write two cases (a) no underflow, i.e. x >= y, and (b) underflow, x < y. The syntax for that... that is an open question.

> what about variables that can only ever be even

Yes, maybe you should have an "EvenInt" type, if that is important. Maybe you should be able to declare a variable to be 7...13, just like a "uint8" can declare something 0...255. Of course, the type-checker can get complicated, and perhaps simply fail to type-check some things. But, having compile-time constraints to what you know your variables will be is good, IMHO.


Note that in Zig, unsigned integers have the same semantics as signed integers on overflow (trap or wrap or UB). You also have operators providing wrapping. That is the correct solution.

In Java, unsigned arithmetic is available through an API and, as you said, it is pretty much only needed when marshalling to certain wire protocols or for FFI. Built-in unsigned types are useful primarily for bitfields or similar tiny types with up to 6 bits or so.

I miss them for doing bit juggling like file headers or networking packets.

However I do concede writing a few helper methods isn't that much of a burden.


I think all the unsigned arithmetic you need is already offered. Unsigned shift right is an operator; the primitive wrappers offer compareUnsigned, divideUnsigned, and remainderUnsigned, as well as conversion methods; unsigned exponentiation is offered in Math (because signed types in Java wrap, there's no need for special unsigned addition/subtraction).

The Turing test pits a human against a machine, each trying to convince a human questioner that the other is the machine. If the machine knows how humans generally behave, for a proper test, the human contestant should know how the machine behaves. I think that this YouTube channel clearly shows that none of today's models pass the Turing test: https://www.youtube.com/@FatherPhi

How have you used the Curry Howard correspondence to make proving the correctness of non-trivial algorithms easier (than, say, Isabelle/HOL or TLA+ proofs)?

I hardly use automated formal methods. Disappointing, I know. I use it for thinking through C and LabVIEW programs. It helps with recognizing patterns in data structures and reasoning through code.

For example, malloc returns either null or a pointer. That is an "or" type, but C can't represent that. I use an if statement to decide which (or-elimination), and then call exit() in case of a null. exit() returns an empty type, but C can't represent that properly (maybe a _Noreturn function attribute). I wrap all of this in my own malloc_or_error function, and I conclude that it will only return a valid pointer.
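
(A minimal sketch of such a malloc_or_error in C; the comment doesn't give the actual code, so the details here are assumed:)

    #include <stdio.h>
    #include <stdlib.h>

    /* malloc returns "valid pointer or NULL" - an or-type C can't express directly.
       The if statement is the or-elimination; exit() covers the NULL case,
       so callers only ever see a valid pointer. */
    static void *malloc_or_error(size_t n) {
      void *p = malloc(n);
      if (p == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);  /* never returns */
      }
      return p;
    }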

Instead of automating a correctness proof in a different language, I run it in my own head. I can make mistakes, but it still helps me write better code.


Oh, so I have used formal methods for many years (and have written about them [1]), including proof assistants, and have never found that constructive logic in general and type theory in particular makes proofs of program correctness any easier. The Curry-Howard correspondence is a cute observation (and it is at the core of Agda), but it's not really practically useful as far as proving algorithm correctness is concerned.

[1]: https://pron.github.io


I think for a cute observation, the metaphor helps me grasp where I can apply logic. I'll read your blog in my free time, thanks for sharing.
