Part of my job is trying to make these models productive for the large corporation I work for. It's a lot of throwing tomatoes at a wall, and to a degree I see the issue he is talking about: output seemingly has a certain ceiling.
At the same time, nowhere in his post is there a code snippet or anything to latch on to of "the model performed poorly here when it should have done this" - this style of criticism seems to be a pattern in most of these "the LLMs will never work" style posts on blogs and twitter.
They obviously can perform better than autocomplete, and in my own day-to-day development they build out huge portions of a codebase at the level I would have expected from a junior or mid-level engineer.
How are we really supposed to grasp their actual capabilities when no one will actually cite specifically what mistakes they are making?
> How are we really supposed to grasp their actual capabilities when no one will actually cite specifically what mistakes they are making?
The mistakes they make are pretty subtle. Coding with LLMs can be like that scene in Whiplash – <excellent drumming>, not quite my tempo, <excellent drumming>, downbeat on 18, <excellent drumming>, you’re rushing, <excellent drumming>, dragging, …
Like yeah it produces working code almost always and the code usually does what you asked. And yet it makes you want to throw a chair because it’s not quite right in frustrating ways and it doesn’t even have the taste to know how it’s wrong.
Yeah. It does this. Pretty consistently and replicably depending on the issue, in fact! Yet I can point to exactly where it fails.
Why are we not showing the bad choices? On my computer I have hundreds of diffs stored by my agent code review tool that point to style/architecture failures (and in the end, the result of that iteration on the AI output)
I'm not quite sure how people are generating unsalvageable outputs. I'd never ship the result of a first AI pass, either. I review all the code and the architecture, within reason (e.g. in Rust I don't preoccupy myself anymore with precisely scoping pub, or whatever, unless I'm making a library crate). I send a "changes requested" prompt+json to my agent, and it interactively fixes everything (even style, even comments with manual patches from my in-review-tool editor)
Well again that is just a "vibes" explanation with nothing concrete.
I feel like with LLMs, it's like a situation where you are close to some feature or project and have a pretty good idea in your head already of how you'd implement it yourself "I'd do this and have an API with that and a database table foo for storing bar with index on baz" and you're keen to get started on it ...but then someone else gets assigned to work on it not you.
They do it a totally different way than you would have thought of doing it, and the code feels alien and weird because it doesn't follow your "design" and decisions you already had in your head before they started work on it. Is it "bad" or just not how you'd have done it?
I think that is ok. So long as the code works and meets all stated requirements and is secure and performant and uses good abstractions and is not full of hacks, then it's ok to let go. Sure maybe you'd have done it a different way but ultimately that doesn't matter.
> So long as the code works and meets all stated requirements and is secure and performant and uses good abstractions and is not full of hacks, then it's ok to let go
That is the problem. The code often is full of hacks and bad abstractions. LLMs write code like a junior or mid-level engineer – perfectly overfitted to today’s request. Oh you need to work on this code tomorrow and there’s a laundry list of future requirements? Throw away and rewrite, I guess.
You can most easily see this when you ask LLMs to write tests. They have a tendency to write convoluted tests that absolutely definitely pass. Even when you know the code has a bug, they’ll write the test in a way that fits the code as written and passes. Because they know tests should pass.
Getting an LLM to write a failing test against a currently working function because you know the business requirements have changed is like pulling teeth.
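To make the test complaint concrete, here is a minimal made-up sketch in Python. Everything in it is hypothetical: the discount rule, `discounted_total`, and both checks are invented for illustration and do not come from any real codebase or model transcript. It assumes a spec that says orders of 100.00 or more get 10% off, and an implementation with an off-by-one comparison at the boundary.

```python
# Hypothetical spec: orders of 100.00 or more get 10% off.
def discounted_total(total: float) -> float:
    # Bug: uses > instead of >=, so an order of exactly 100.00 gets no discount.
    if total > 100.0:
        return round(total * 0.9, 2)
    return total


def fitted_test() -> None:
    # Written by reading the code, not the spec: the expected value at the
    # boundary is whatever the function currently returns, so this passes
    # even though the behaviour is wrong.
    assert discounted_total(100.0) == 100.0
    assert discounted_total(150.0) == 135.0


def requirement_test() -> None:
    # Written from the spec: exactly 100.00 qualifies for the discount.
    # This fails against the buggy code above, which is the whole point.
    assert discounted_total(100.0) == 90.0


def passes(test) -> bool:
    """Run a check function and report whether it passed."""
    try:
        test()
    except AssertionError:
        return False
    return True
```

Calling `passes(fitted_test)` and `passes(requirement_test)` shows the difference: the first check is green and says nothing about the bug, while the second is red because it encodes the requirement rather than the current behaviour.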
You don’t see writing about this stuff because it doesn’t neatly fit in an article or video (I’ve tried). Plus it goes against the zeitgeist so you’d never get traction (even if people write these posts, we don’t see them)
The unit test example has been my team's experience as well. The unit tests look good on the surface, but their passing or failing has little predictive value on whether there are actually bugs in the code.
Some people have suggested you write the unit tests by hand to basically "check" the LLM's work and keep it honest, but to write good unit tests you have to understand the underlying code, which takes time (since you didn't write it), so to me this is another bullet point that suggests LLMs will eventually be relegated to "StackOverflow+" duty - give me snippets, but I'll still write effectively all the code.
Last week I helped a co-worker with some flaky tests, where the code and tests were generated by one of the models. While looking at how one of the tests worked, I spotted a place in the code where a boolean condition was backwards in a way a human would never have written (and on top of it, there was a confidently-incorrect block comment above it, so it was easy to assume it was correct) - so even if he'd fixed the test's flakiness, it would end up always failing instead of sometimes failing. He'd spent hours trying to figure out what was going on.
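Not the actual code from that incident, but a made-up Python sketch of the same shape of mistake: a comment that reads as correct sitting above a condition that is inverted. The names `should_retry_buggy`, `should_retry`, and `mismatches` are hypothetical and exist only for this illustration.

```python
def should_retry_buggy(attempts: int, max_attempts: int) -> bool:
    # Retry only while we still have attempts left.
    # (The comment is plausible; the condition below says the opposite.)
    return attempts >= max_attempts


def should_retry(attempts: int, max_attempts: int) -> bool:
    # Retry only while we still have attempts left.
    return attempts < max_attempts


def mismatches(fn, max_attempts: int = 3) -> list[int]:
    """Return the attempt counts where fn disagrees with the stated intent.

    The intent is taken from the comment: retry while attempts < max_attempts.
    """
    return [
        attempts
        for attempts in range(max_attempts + 2)
        if fn(attempts, max_attempts) != (attempts < max_attempts)
    ]
```

A deterministic sweep over the boundary like `mismatches` flags every input for the inverted version and none for the corrected one, which is much easier to act on than a test that only fails some of the time.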
When people write blog posts about how LLMs failed for some particular task, the responses from boosters invariably fall along the lines of "just use this other model/just tweak your prompt like so/you're just not skilled enough—you can't make fundamental arguments about AI by citing specific examples."
So we can't make arguments by citing specific examples, and also can't make arguments by not citing specific examples. Whelp, I guess that's the ball game.
(yes yes, I'm committing a group attribution error, but still)
I think we should investigate the backgrounds of those making claims one way or another and rely on those backgrounds for determining credibility. I suspect that we'd find that those who are saying LLMs write great, bulletproof code with "100% unit test coverage" (true story- a coworker was bragging about 100% unit test coverage) are not really qualified to be software engineers. This is a trend I have noticed in my org. Those drinking the most LLM kool aid do NOT have an engineering/comp sci degree, have relatively little experience, resumes are incredibly weak (e.g., generic stuff that we've all done as software engineers).
We no longer have the luxury of welcoming bootcamp engineers into our field with open arms. We need to protect our craft. Call these fools out or they'll keep spreading hype/FOMO.
This is an excellent point, and as a novice using LLMs for projects I could never previously dream of doing, I find myself looking for the same: examples or citations of what exactly agents are writing incorrectly and how a human would do it better. I'm sure they're out there; maybe someone can point to some good content showing such examples.
I have no doubt the top nth percent of coders could write circles around Claude or Codex, but how much worse are they than your average schnook?
Reality: the top nth percent of coders are seeing absurd, dramatic gains in productivity using LLMs. See: antirez, Simon Willison, Steve Yegge.
The more experience you bring to the table, the more value you get from these tools.
Look, about 12 years ago, articles about how if you're not pair programming you're doing it wrong were on HN's home page every day. Doing well-prompted plan -> agent -> debug cycles is like pair programming with someone who knows every SDK and API intuitively and doesn't have to pick up their kids from daycare at 4pm.
While I don't actually disagree - to me, Gas Town sounds literally insane - I suspect that if you reframe his work to compare it against the cost of developing a new medication or chip fabrication technique, you can make a strong argument that he's putting his money where his mouth is to see how far he can take a new technology. He's doing science! And I think that's admirable, even if nothing comes of it.
When I think of how much money gets wasted on gambling apps and how much human potential gets wasted watching reality television and compare that to Steve going full Alexander Shulgin with LLMs, the comparison really falls flat.
The problem is what they do to large existing systems: subtle misunderstandings mean subtle bugs are constantly being introduced, and very few shops have adequate systems in place to receive reports of subtle issues at the rates they occurred 10 years ago, let alone today. And don't even get me started on llm-assisted support that some might suggest as a solution.
That's what I did recently. Got a cheap Xiaomi ($120 used) with a 120Hz OLED display and a mid-range CPU and flashed PixelExperience on it: it feels like an $800 Pixel phone.
Your work (and gyroscope/stethscope/other aggregators) has inspired me to also start my own work at creating a centralized aggregator to ingest all of my data.
Getting a job in Paris is like getting a job anywhere else - there are stupid aspects of the system and there are meritocratic aspects. If you're looking for answers to "why is it so hard for me to find a job in Paris?", you will certainly find enough excuses - though I think the same can be said of almost anywhere.
That being said, here are a few things to know:
1) France's unemployment rate is north of 10% (closer to 25% if you're under 25) and it's not for lack of companies. It's because large companies are essentially on a hiring freeze until the economy balances out, because employment comes with a lot of strings attached (hard to fire, expensive to fire, high social taxes on top of employees).
2) Startup employees don't work 35 hours a week - I've never once seen this, and I'd be curious to meet teams who did. If you want to work at a public tech company, or Google France, you may work 35 hours/week, just like you would at EA or Oracle in the US. The fact is that the 35-hour work week is a law to protect employees, but the 6-month trial period that precedes your 'protection' usually sets the tempo for work life. Those who can't handle working 45-50 hours a week, or at the pace set by the company, are usually kicked out with 24 hours' notice within the first 6 months.
3) Getting a job in Paris is easy, especially for an American. Your CV is "I'm an American" and any startup that has raised 1 million+ would be a fool not to take you. So, if you're looking for a job in Paris, just tweet it, and you'll have a job offer by the end of the day. If you're a little less technical, law/accounting firms hire international people with visas all the time to cover their intl. clients.
Well there is a huge difference. As a Software Engineer you want to work in San Francisco because it's where your job happens.
In Paris, nothing happens on that level. If you do want to come to Paris it's because of the amazing food, lifestyle, public transportation system, and certainly not because of the amazing startups we should have.
Not to mention working in Paris is like working part time here. 35 hour weeks!!! Mandated by law too. Of course this might also be why there isn't a vibrant startup culture.
Companies can easily work around the 35-hour week with the "cadre" status, which basically nullifies the regulated-hours constraint, and they do.
I don't know any developer in Paris who does a 35-hour week.
Actually it's even the contrary: French managers have a strong culture of evaluating employees' performance based on presence. So you see a lot of people doing 10-12 hours a day, or coming back to the office on Saturday.
This. I don't know any French developer who isn't on a day rate rather than an hourly rate. An awful lot of people on an hourly rate have a number of extra hours factored into their salary by default.
It's still better than the US: "cadre" status (but not top management) is officially around 180-190 days of work per year (depending on the unions).
> I'd say that in Paris, a referral is necessary and often sufficient. Le "piston".
+1.
I think it's a deep, core cultural issue in France. The whole system is completely corrupted, people are used to getting a "passe-droit" (not sure about the translation) for anything, and even the top political leaders act this way. Politicians asking cops not to fine them for something illegal they just did, a friend of a friend asking the school president to let their kid attend the school, etc.
The whole French society is based on this. Many say it's a Monarchy, and not a Republic.
I recently spent the better part of a day trying to figure out this seemingly-straightforward problem, so if you do go down this road I'd recommend a tool called wkhtmltopdf: http://code.google.com/p/wkhtmltopdf/
Great work! The only thing is that this needs to get updated based on the changes in the new 0.9.9 release. Specifically, you do not need to create a custom dispose function; remove will now remove listeners bound to the view.