Unfortunately, it is not, and many of its attempts at mathematical proofs have major flaws. You shouldn't trust its proofs unless you are already able to evaluate them--which I think is pretty much all the OP is saying.
Trust isn’t binary: I can trust something enough to use it without understanding it. OP was talking about needing to understand, which is quite a bit above the level of being able to validate something well enough to use it for a task.
This jibes with what I've experienced in the brief time I had access to 5.5 Pro. It's the very first LLM that I feel like I can wrangle into solving tedious, but straightforward, problems correctly. It still makes a ton of mistakes and needs to be very rigidly guided, but it does a pretty good job of tracing its own reasoning and correcting itself in a way that the other models do not.
The downside (not noted in the article, but noted by others here) is cost. It burns tokens at an insane rate, the tokens cost a lot, and the subagent flows you need to have it tackle large problems with high accuracy cost even more. It is also much "slower" on large-scale problems because of context limitations -- it has to constantly rediscover context for each part of the problem, and to keep it accurate you need to wipe its context before moving on to the next small part, or launch even more agents. For mathematical proofs like these, where the context needed beyond what's already in its training set is small and the problems are considered "important" enough, this might not be a problem, but for many of the tasks I would like to use it for (ensuring correctness of code that affects large codebases, or validating subtle assumptions) it definitely is one.
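To illustrate the shape of that workflow, here is a minimal sketch in Python, assuming a hypothetical run_agent() helper that stands in for whatever API or CLI actually drives the model:

    # Sketch of the "wipe context between parts" flow described above.
    # run_agent() is a hypothetical stand-in for whatever drives the model;
    # only the shape matters: each subtask starts from an empty context,
    # and results get stitched together outside the model.

    def run_agent(task: str) -> str:
        # Hypothetical: spawn one agent with a fresh, empty context.
        return f"<result of independently solving: {task}>"

    def solve_in_parts(subtasks: list[str]) -> list[str]:
        results = []
        for task in subtasks:
            # A brand-new agent per subtask, so earlier steps can't pollute
            # later ones -- but each agent must rediscover context on its
            # own, which is where the token cost and slowness come from.
            results.append(run_agent(task))
        return results

    if __name__ == "__main__":
        for r in solve_in_parts(["audit module A", "audit module B"]):
            print(r)

The isolation is what buys the accuracy; the repeated rediscovery of context is what makes it slow and expensive.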
So I think it will be a while before the impressive capabilities of these models really percolate into our lives as programmers, unless you're one of the lucky ones given unlimited access to 5.5 Pro.
> It's the very first LLM that I feel like I can wrangle into solving tedious, but straightforward, problems correctly. It still makes a ton of mistakes and needs to be very rigidly guided, but it does a pretty good job of tracing its own reasoning and correcting itself in a way that the other models do not.
I swear that people have said the same thing with effectively every new model that came out in the last six months.
I think it's because people walk every model up to its limits and become very aware of a task they can't make work. They do a lot of work simplifying and understanding limitations at that boundary. Then an improved model comes out, they immediately push against that same barrier, and they make swift progress. They also notice that the new model natively does tricks they had been doing manually.
The reality is likely that everyone is hitting similar barriers and the solutions are somewhat generalizable and get added to training new models.
Eventually people will reach the new limits and the cycle repeats.
> I swear that people have said the same thing with effectively every new model
That is definitely true, and at the same time, we can measure progress by who is making that claim. When Timothy Gowers, a Fields Medalist, says that models are now capable of "producing a piece of PhD-level research in an hour or so, with no serious mathematical input from me," we can be pretty confident that we are getting into seriously interesting territory.
If we are having this meta-discussion: you can usually guess a person's age by which letter they elongate. The Millennial generation stretches the vowel (as above), but gen alpha stretches the final consonant -- "longggg". Doesn't add anything to the convo, just an interesting tidbit.
The only thing worse than complaining about this is being the guy complaining about the guy complaining about this. So congratulations on being second most annoying.
> can wrangle into solving tedious, but straightforward, problems correctly. It still makes a ton of mistakes and needs to be very rigidly guided,
I don’t know about the rest of y’all, but I find “rigidly guiding” LLMs incredibly tedious and frustrating, in the same way that seeing the same error thrown for the 40th time, two hours into troubleshooting something on my computer, is frustrating. It also feels somewhat like micromanaging a direct report. I don’t find that process fun or enjoyable in the slightest, and it teaches me little in the process. It’s just trading styles of work, and I guess the response to that is “some people prefer that style of work.” I just don’t like being told by the world that we all have to work that way now, I guess.
I agree. I find it endlessly frustrating and kind of hate what programming has become. But at least for me it now meets the minimum bar of "it works if you push things." With past models, under no circumstances could I get them to semi-reliably solve these kinds of problems correctly without giving them so many "hints" that they weren't actually saving me time. The kind of reasoning I'm talking about is stuff like "can you actually construct a trace from program start for this condition that looks locally reachable?" Past models simply cannot reliably answer such questions once the control flow involves enough hops or requires tracing through enough function calls.
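To make the kind of question concrete, here is a minimal, entirely hypothetical Python example: read handle() in isolation and the negative branch looks reachable, but a trace from program start shows every call site clamps the input first. Once that guard sits a few hops away, past models could not answer this reliably.

    def normalize(n: int) -> int:
        # Every caller funnels input through this clamp first.
        return max(n, 0)

    def handle(n: int) -> str:
        if n < 0:
            # Locally this branch looks perfectly reachable...
            return "negative input"
        # ...but no trace from program start ever passes n < 0 in here.
        return "ok"

    def main() -> None:
        for raw in (-5, 0, 7):
            print(handle(normalize(raw)))  # prints "ok" three times

    main()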
Roblox made a platform. The people on that platform used it to build gambling games and to groom children.
I guess the worst you can say about Roblox is that it incentivizes that with the way they sell Robux, but that’s also the only way their platform can work.
For some reason people are perfectly able to understand this in the context of, say, cursive, calculator use, etc., but when it comes to their own skillset somehow it's going to be really different.
No, it hasn't. I did not have a problem before AI with people sending in gigantic pull requests that made absolutely no sense, and justifying them with generated responses that they clearly did not understand. This is not a thing that used to happen. That's not to say people wouldn't have done it if it were possible, but there was a barrier to submitting a pull request that no longer exists.
I'm mostly surprised that people found the output quality of Opus 4.6 good enough... 4.7 so far is a pretty sizable improvement for the stuff I care about. I don't really care how cheap 4.6 was per task when 90% of the tasks weren't actually being done correctly. Or maybe it's that people like the LLM agreeing with them blindly while sneakily doing something else under the hood? Did people enjoy Claude routinely disregarding their instructions? Not really sure I understand; I truly found 4.6 immensely frustrating (from the get-go, not just the "pre-nerf" version, whatever that means). 4.7 is a buggy mess, it's slow, and it costs a lot per token. It's also a huge breath of fresh air, because it actually seems to make a good-faith effort at doing the thing you asked it to do, and doesn't waste your time with irrelevant nonsense just to look busy or because it thinks you want that nonsense (I mean, it still does all of these things to some extent, but so far it seems to do them much less than 4.6 did).
Disclaimer: I'm always running on max and don't really have token limits so I am in a position not to care about cost per token. But I am not surprised by the improved benchmark results at all, 4.6 was really not nearly as strong of a model as people seem to remember it being.
It's not reality. I'm really not a fan of the way that people excuse the really terrible code LLMs write by claiming that people write code just as bad. Even if that were true, it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later.
Yes and both are right. It’s a matter of which is working as expected and making fewer mistakes more often. And as someone using Claude Code heavily now, I would say we’re already at a point where AI wins.
> it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later.
I had a coworker who did more or less exactly that. You'd leave a comment in a ticket about something extra to be done, he'd answer "yes, sure," and after a few days he'd close the ticket without doing the thing you asked. Depending on how much work you had at the moment, you might not notice until a few months later, when the missing thing would bite you back in bitter revenge.
You may have had one. It clearly made a pretty negative impression on you because you are still complaining about them years later. I find it pretty misanthropic when people ascribe this kind of antisocial behavior to all of their coworkers.
It's still relatively recent. Anyway, I'm not saying everyone is like this, absolutely not (not even a significant fraction), but such people do exist.
At the same time it's not true that current LLMs only write terrible code.
"Even if that were true, it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later."
The point is, that's not the typical experience and people like that can be replaced. We don't willingly bring people like that on our teams, and we certainly don't aim to replace entire teams with clones of this terrible coworker prototype.
Not only have I never had a coworker as bad as these people describe, but the point is, as you say: why would I want an LLM that works like these people's shitty coworkers?
My worst coworkers right now are the ones using Claude to write every word of code and don't test it. These are people who never produced such bad code on their own.
So the LLMs aren't just as bad as the bad coworkers, they're turning good coworkers into bad ones!
A couple of reasons, but mainly speed and availability.
I can give Claude a job anytime and it will do it immediately.
And yes, I will have to double-check anything important, but I am way better and faster at checking than at doing it myself.
So obviously I don't want a shitty LLM as a coworker; I want a competent one. But the progress they've made is pretty astonishing, and they're good enough now that I've started really integrating them.
In the long run, good code makes everyone much happier than code that is bad because people are being "nice" and letting things slide in code review to avoid confrontation.
...but seriously... there was the "up until 1850" LLM or whatever... can we make an "up until 1920 => 1990 [pre-internet] => present day" series, and then keep prodding the "older ones" until they "invent their way" to the newer years?
We knew more in 1920 than we did in 1850, but can a "thinking machine" with 1850 knowledge invent 1860s knowledge via infinite monkeys theorem/practice?
The same way that in 2025/2026, Knuth has just invented his way to 2027-knowledge with this paper/observation/finding? If I only had a beowulf cluster of these things... ;-)
But a query optimizer only matters once you have an established business with large customers.
You seem to be implying Salesforce’s business is successful because they have their own query optimizer. But the causality is reversed. Salesforce has their own query optimizer because they’ve built a successful business.
My point is that a lot of people think it'd be really easy to build the next Salesforce until they actually try to compete with Salesforce in the market. Like it or not, if you want to build a Salesforce competitor (or try to get your company to build its own) you're going to be compared to actual Salesforce, not the version of Salesforce that existed when the market was new.