I really don't understand how you don't understand that your site is completely misleading. Everyone here is telling you that lumping API reliability in with actual model performance is nonsense.
I agree that it's confusing. I have already implemented a reliability score, but it will only apply to new tests from now on.
I have already re-tested DeepSeek v4, so it no longer has any API error issues.
API errors are quite rare; most tested models have at most one API Error failure, so fixing them won't change the rankings much: https://aibenchy.com/fail/api-error/
I will try to retest all models that had API errors, so the score is determined only by correct/wrong answers, and the reliability score will be an extra metric that just indicates how the API performs.
That being said, the reliability of the API is still a huge factor for production use-cases.
Some API errors are actually not about reliability; they happen because that specific API doesn't support some common features (e.g. specific structured_output formats).
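For what it's worth, here is a minimal sketch of how the two metrics could be kept separate. This is purely hypothetical, not the site's actual scoring code; the `Result` shape and field names are made up for illustration:

```python
# Hypothetical sketch: score correctness only over answered questions,
# and report API reliability as a separate metric.
from dataclasses import dataclass

@dataclass
class Result:
    correct: bool | None      # None means the call never returned an answer
    api_error: bool = False   # True when the failure was an API error

def score(results: list[Result]) -> dict[str, float]:
    # Accuracy is computed only over calls that actually returned an answer.
    answered = [r for r in results if not r.api_error]
    accuracy = sum(bool(r.correct) for r in answered) / len(answered) if answered else 0.0
    # Reliability is the fraction of calls that did not fail with an API error.
    reliability = len(answered) / len(results) if results else 0.0
    return {"accuracy": accuracy, "reliability": reliability}

# Example: 8 correct, 1 wrong, 1 API error
# -> accuracy 8/9 ≈ 0.889, reliability 9/10 = 0.9
results = [Result(True)] * 8 + [Result(False)] + [Result(None, api_error=True)]
print(score(results))
```

With this split, an API hiccup lowers the reliability number but no longer drags down the correctness ranking.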
Carbon offsetting is risky. You plant a tree and you don't know if it will die. You create a swampy area to absorb CO2 and 10 years later it dries out due to global warming. Offsetting should be used only if there is no other way to reduce emissions in the first place. Same is true for sucking carbon out of the air and storing it somewhere... it's expensive and it should not be the default - we need offsetting and carbon sequestration for the really unavoidable stuff
Sucking carbon out of the air using fully renewable energy (solar/wind) is a great thing to do! ... once we've fully decarbonized all other energy use and we have extra, left-over renewable energy.
You give a trinket to a near dictator in order to not have your company, which you're responsible for, dragged over the coals and attacked by a psychopathic government. In the grand scheme of things this was a completely genius play and did no harm to anyone.
That's a reasonable take - Apple had a gun to their head regarding tariffs and exposure to China, but I'd still love to know how Steve would have played the same hand.
The LLVM community used this model for years with Phabricator before it was EOL'd and moving to GH and PRs was forced. It's a proven model and works very well in complex code bases with multiple components and dependencies that can have very different reviewer groups. E.g.:
1) A foundational change to the IR is the baseline commit
2) Then some tweaks on top to lay the groundwork for uses of that change
3) Some implementation of a new feature that uses the new IR change
4) A final change that flips the feature flag on to enable it by default.
Each of these changes is dependent on the last. Without stacked PRs you have only one PR, and reviewing it is huge. Maybe thousands of lines of complex code. Worse, some reviewers only need to see some parts of it and not the rest.
Stacked diffs were a godsend and the LLVM community's number one complaint about moving to GitHub was losing this feature.
Shit. Really? You mean they modified their frontier model to improve it and make it better and just called it a day? That their benchmarks, which show step-change improvements, are just the result of successive changes on an EXISTING MODEL?
Say it isn't so! I for one like to start from scratch each time I release my version of my compiler toolchain.
No one seems to have actually read the system card all the way through.
The reason they didn't publish it was that it's orders of magnitude more successful at writing exploits vs Opus 4.6, which only managed it something like 2% of the time.
If anything I'm seeing too much skepticism and not enough alarm. People burying their heads in the sand, fingers in their ears, denying where this is all going. Unbelievable, except it's exactly what I expect from humans.
Forgive me, but this is probably the 29th world-destroying model I've seen in the last 4 years that will change everything, take all the jobs, cure all the cancers and eat all the puppies.