I really don't understand how you don't understand that your site is completely misleading. Everyone here is telling you that lumping API reliability in with actual model performance is nonsense.
I agree that it's confusing. I have already implemented a reliability score, but it will only apply to new tests from now on.
I have already re-tested DeepSeek v4, so it no longer has any API error issues.
API errors are quite rare; most tested models have at most one API Error failure, so fixing them won't change the rankings much: https://aibenchy.com/fail/api-error/
I will try to retest all models that had API errors, so the score is determined only by correct/wrong answers, and the reliability score will be an extra metric that just indicates how the API performs.
That being said, the reliability of the API is still a huge factor for production use-cases.
Some API errors are actually not about reliability; they happen because that specific API doesn't support some common features (e.g. specific structured_output formats).
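For what it's worth, here is a minimal sketch of how the two metrics could be kept separate. This is purely hypothetical, not the site's actual scoring code; the `Result` shape and field names are made up for illustration:

```python
# Hypothetical sketch: score correctness only over answered questions,
# and report API reliability as a separate metric.
from dataclasses import dataclass

@dataclass
class Result:
    correct: bool | None      # None means the call never returned an answer
    api_error: bool = False   # True when the failure was an API error

def score(results: list[Result]) -> dict[str, float]:
    # Accuracy is computed only over calls that actually returned an answer.
    answered = [r for r in results if not r.api_error]
    accuracy = sum(bool(r.correct) for r in answered) / len(answered) if answered else 0.0
    # Reliability is the fraction of calls that did not fail with an API error.
    reliability = len(answered) / len(results) if results else 0.0
    return {"accuracy": accuracy, "reliability": reliability}

# Example: 8 correct, 1 wrong, 1 API error
# -> accuracy 8/9 ≈ 0.889, reliability 9/10 = 0.9
results = [Result(True)] * 8 + [Result(False)] + [Result(None, api_error=True)]
print(score(results))
```

With this split, an API hiccup lowers the reliability number but no longer drags down the correctness ranking.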
Carbon offsetting is risky. You plant a tree and you don't know if it will die. You create a swampy area to absorb CO2 and 10 years later it dries out due to global warming. Offsetting should be used only if there is no other way to reduce emissions in the first place. Same is true for sucking carbon out of the air and storing it somewhere... it's expensive and it should not be the default - we need offsetting and carbon sequestration for the really unavoidable stuff
Sucking carbon out of the air using fully renewable energy (solar/wind) is a great thing to do! ... once we've fully decarbonized all other energy use and we have extra, left-over renewable energy.
You give a trinket to a near dictator in order to not have your company, which you're responsible for, dragged over the coals and attacked by a psychopathic government. In the grand scheme of things this was a completely genius play and did no harm to anyone.
That's a reasonable take - Apple had a gun to their head regarding tariffs and exposure to China, but I'd still love to know how Steve would have played the same hand.
The LLVM community used this model for years with Phabricator before it was EOL'd and moving to GH and PRs was forced. It's a proven model and works very well in complex code bases with multiple components and dependencies that can have very different reviewer groups. E.g.:
1) A foundational change to the IR is the baseline commit
2) Then some tweaks on top to lay the groundwork for uses of that change
3) Some implementation of a new feature that uses the new IR change
4) A final change that flips the feature flag on to enable it by default.
Each of these changes is dependent on the last. Without stacked PRs you have only one PR, and reviewing it is huge. Maybe thousands of lines of complex code. Worse, some reviewers only need to see some parts of it and not the rest.
Stacked diffs were a godsend and the LLVM community's number one complaint about moving to GitHub was losing this feature.
Shit. Really? You mean they modified their frontier model to improve it and make it better and just called it a day? That their benchmarks, which show step-change improvements, are just the result of successive changes on an EXISTING MODEL?
Say it isn't so! I for one like to start from scratch each time I release my version of my compiler toolchain.
No one seems to have actually read the system card all the way through.
The reason they didn't publish it was that it's orders of magnitude more successful at writing exploits vs Opus 4.6, which only managed it something like 2% of the time.
If anything I'm seeing too much skepticism and not enough alarm. People burying their heads in the sand, fingers in their ears, denying where this is all going. Unbelievable, except it's exactly what I expect from humans.
Forgive me, but this is probably the 29th world-destroying model I've seen in the last 4 years that will change everything, take all the jobs, cure all the cancers and eat all the puppies.