Unfortunately, it is not, and many of its attempts at mathematical proofs have major flaws. You shouldn't trust its proofs unless you are already able to evaluate them--which I think is pretty much all the OP is saying.
Trust isn’t binary: I can trust something enough to use it without understanding it. OP was talking about needing to understand, which is quite a bit above the level of being able to validate something well enough to use it for a task.
This jibes with what I've experienced in the brief time I had access to 5.5 Pro. It's the very first LLM that I feel like I can wrangle into solving tedious, but straightforward, problems correctly. It still makes a ton of mistakes and needs to be very rigidly guided, but it does a pretty good job of tracing its own reasoning and correcting itself in a way that the other models do not.
The downside (not noted in the article, but noted by others here) is cost. It burns tokens at an insane rate, the tokens cost a lot, and the subagent flows you need to have it tackle large problems with high accuracy cost even more. It is also much "slower" on large-scale problems because of context limitations -- it has to constantly rediscover context for each part of the problem, and to keep it accurate you need to wipe its context before moving on to the next small part, or launch even more agents. For mathematical proofs like these, where the context needed beyond what's already in its training set is small and the problems are considered "important" enough, this might not be a problem, but for many of the tasks I would like to use it for (ensuring correctness of code that affects large codebases, or validating subtle assumptions) it definitely is one.
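To illustrate the shape of that workflow, here is a minimal sketch in Python, assuming a hypothetical run_agent() helper that stands in for whatever API or CLI actually drives the model:

    # Sketch of the "wipe context between parts" flow described above.
    # run_agent() is a hypothetical stand-in for whatever drives the model;
    # only the shape matters: each subtask starts from an empty context,
    # and results get stitched together outside the model.

    def run_agent(task: str) -> str:
        # Hypothetical: spawn one agent with a fresh, empty context.
        return f"<result of independently solving: {task}>"

    def solve_in_parts(subtasks: list[str]) -> list[str]:
        results = []
        for task in subtasks:
            # A brand-new agent per subtask, so earlier steps can't pollute
            # later ones -- but each agent must rediscover context on its
            # own, which is where the token cost and slowness come from.
            results.append(run_agent(task))
        return results

    if __name__ == "__main__":
        for r in solve_in_parts(["audit module A", "audit module B"]):
            print(r)

The isolation is what buys the accuracy; the repeated rediscovery of context is what makes it slow and expensive.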
So I think it will be a while before the impressive capabilities of these models really percolate into our lives as programmers, unless you're one of the lucky ones given unlimited access to 5.5 Pro.
> It's the very first LLM that I feel like I can wrangle into solving tedious, but straightforward, problems correctly. It still makes a ton of mistakes and needs to be very rigidly guided, but it does a pretty good job of tracing its own reasoning and correcting itself in a way that the other models do not.
I swear that people have said the same thing with effectively every new model that came out in the last six months.
I think it's because people walk every model up to its limits and become very aware of a task they can't make work. They do a lot of work simplifying and understanding limitations at that boundary. Then an improved model comes out, they immediately push against that same barrier, and they make swift progress. They also notice that the new model natively does tricks they had been doing manually.
The reality is likely that everyone is hitting similar barriers and the solutions are somewhat generalizable and get added to training new models.
Eventually people will reach the new limits and the cycle repeats.
> I swear that people have said the same thing with effectively every new model
That is definitely true, and at the same time, we can measure progress by who is making that claim. When Timothy Gowers, a Fields Medalist, says that models are now capable of "producing a piece of PhD-level research in an hour or so, with no serious mathematical input from me," we can be pretty confident that we are getting into seriously interesting territory.
If we are having this meta-discussion: you can usually guess a person's age by which letter they elongate. The Millennial generation stretches the vowel (as above), but gen alpha stretches the final consonant -- "longggg". Doesn't add anything to the convo, just an interesting tidbit.
The only thing worse than complaining about this is being the guy complaining about the guy complaining about this. So congratulations on being second most annoying.
> can wrangle into solving tedious, but straightforward, problems correctly. It still makes a ton of mistakes and needs to be very rigidly guided,
I don’t know about the rest of y’all, but I find “rigidly guiding” LLMs incredibly tedious and frustrating, in the same way that seeing the same error thrown for the 40th time, two hours into troubleshooting something on my computer, is frustrating. It also feels somewhat like micromanaging a direct report. I don’t find that process fun or enjoyable in the slightest, and it teaches me little in the process. It’s just trading styles of work, and I guess the response to that is “some people prefer that style of work.” I just don’t like being told by the world that we all have to work that way now, I guess.
I agree. I find it endlessly frustrating and kind of hate what programming has become. But at least for me it now meets the minimum bar of "it works if you push things." With past models, under no circumstances could I get them to semi-reliably solve these kinds of problems correctly without giving them so many "hints" that they weren't actually saving me time. The kind of reasoning I'm talking about is stuff like "can you actually construct a trace from program start for this condition that looks locally reachable?" Past models simply cannot reliably answer such questions once the control flow involves enough hops or requires tracing through enough function calls.
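To make the kind of question concrete, here is a minimal, entirely hypothetical Python example: read handle() in isolation and the negative branch looks reachable, but a trace from program start shows every call site clamps the input first. Once that guard sits a few hops away, past models could not answer this reliably.

    def normalize(n: int) -> int:
        # Every caller funnels input through this clamp first.
        return max(n, 0)

    def handle(n: int) -> str:
        if n < 0:
            # Locally this branch looks perfectly reachable...
            return "negative input"
        # ...but no trace from program start ever passes n < 0 in here.
        return "ok"

    def main() -> None:
        for raw in (-5, 0, 7):
            print(handle(normalize(raw)))  # prints "ok" three times

    main()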
Roblox made a platform. The people on that platform used it to build gambling games and to groom children.
I guess the worst you can say about Roblox is that it incentivizes that with the way they sell Robux, but that’s also the only way their platform can work.
For some reason people are perfectly able to understand this in the context of, say, cursive, calculator use, etc., but when it comes to their own skillset somehow it's going to be really different.
No, it hasn't. I did not have a problem before AI with people sending in gigantic pull requests that made absolutely no sense, and justifying them with generated responses that they clearly did not understand. This is not a thing that used to happen. That's not to say people wouldn't have done it if it were possible, but there was a barrier to submitting a pull request that no longer exists.
I'm mostly surprised that people found the output quality of Opus 4.6 good enough... 4.7 so far is a pretty sizable improvement for the stuff I care about. I don't really care how cheap 4.6 was per task when 90% of the tasks weren't actually being done correctly. Or maybe it's that people like the LLM agreeing with them blindly while sneakily doing something else under the hood? Did people enjoy Claude routinely disregarding their instructions? Not really sure I understand; I truly found 4.6 immensely frustrating (from the get-go, not just the "pre-nerf" version, whatever that means). 4.7 is a buggy mess, it's slow, and it costs a lot per token. It's also a huge breath of fresh air, because it actually seems to make a good-faith effort at doing the thing you asked it to do, and doesn't waste your time with irrelevant nonsense just to look busy or because it thinks you want that nonsense (I mean, it still does all of these things to some extent, but so far it seems to do them much less than 4.6 did).
Disclaimer: I'm always running on max and don't really have token limits so I am in a position not to care about cost per token. But I am not surprised by the improved benchmark results at all, 4.6 was really not nearly as strong of a model as people seem to remember it being.
It's not reality. I'm really not a fan of the way that people excuse the really terrible code LLMs write by claiming that people write code just as bad. Even if that were true, it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later.
Yes and both are right. It’s a matter of which is working as expected and making fewer mistakes more often. And as someone using Claude Code heavily now, I would say we’re already at a point where AI wins.
> it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later.
I had a coworker who did more or less exactly that. You'd leave a comment in a ticket about something extra to be done, he'd answer "yes, sure," and after a few days he'd close the ticket without doing the thing you asked. Depending on how much work you had at the moment, you might not notice until a few months later, when the missing thing would bite you back in bitter revenge.
You may have had one. It clearly made a pretty negative impression on you because you are still complaining about them years later. I find it pretty misanthropic when people ascribe this kind of antisocial behavior to all of their coworkers.
It's still relatively recent. Anyway, I'm not saying everyone is like this, absolutely not (not even a significant fraction), but such people do exist.
At the same time it's not true that current LLMs only write terrible code.
"Even if that were true, it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later."
The point is, that's not the typical experience and people like that can be replaced. We don't willingly bring people like that on our teams, and we certainly don't aim to replace entire teams with clones of this terrible coworker prototype.
Not only have I never had a coworker as bad as these people describe, but the point is, as you say: why would I want an LLM that works like these people's shitty coworkers?
My worst coworkers right now are the ones using Claude to write every word of code and don't test it. These are people who never produced such bad code on their own.
So the LLMs aren't just as bad as the bad coworkers, they're turning good coworkers into bad ones!
A couple of reasons, but mainly speed and availability.
I can give Claude a job anytime and it will do it immediately.
And yes, I will have to double-check anything important, but I am way better and faster at checking than at doing it myself.
So obviously I don't want a shitty LLM as a coworker; I want a competent one. But the progress they've made is pretty astonishing, and they're good enough now that I've started really integrating them.
In the long run, good code makes everyone much happier than code that is bad because people are being "nice" and letting things slide in code review to avoid confrontation.
...but seriously... there was the "up until 1850" LLM or whatever... can we make an "up until 1920 => 1990 [pre-internet] => present day" series, and then keep prodding the "older ones" until they "invent their way" to the newer years?
We knew more in 1920 than we did in 1850, but can a "thinking machine" with 1850 knowledge invent 1860s knowledge via infinite monkeys theorem/practice?
The same way that in 2025/2026, Knuth has just invented his way to 2027-knowledge with this paper/observation/finding? If I only had a beowulf cluster of these things... ;-)
But a query optimizer only matters once you have an established business with large customers.
You seem to be implying Salesforce’s business is successful because they have their own query optimizer. But the causality is reversed. Salesforce has their own query optimizer because they’ve built a successful business.
My point is that a lot of people think it'd be really easy to build the next Salesforce until they actually try to compete with Salesforce in the market. Like it or not, if you want to build a Salesforce competitor (or try to get your company to build its own) you're going to be compared to actual Salesforce, not the version of Salesforce that existed when the market was new.