pineapple_opus's comments

Eye-catching: the Claude Code session success rate on "open-ended problems" jumped from 20% (before the Opus 4.5 release) to 70% some time after Opus 4.6 was released.


Yeah, this seems true. Claude Code is famously dubbed the best AI coding agent, but Google doesn't care about that niche, I guess. Somehow, I still rely on Google Search, as they have diversified it.

If you ask a question, it enables the "AI Overview", but if you search for a particular object or platform, like "Google stock" or "bbc news", it gives the old classic search experience and you don't need to swallow the "AI Overview" pill in that case.


I tried using Gemini CLI to sort some code issues for me, ran out of tokens mid-way through, even though I have Gemini Pro.

Turns out licensing is separate for "code" and "pro"...


Same happened to me. That was the death knell for Gemini as a coding agent to me. I even paid for a whole year...

I highly suspect they opaquely lowered usage limits on me.


> This is the opposite of the “10x productivity” slop-cannon style of development that most people imagine when they think of vibe coding, but I find it very satisfying.

I can relate to this. When I spend time writing unit tests, even ones that add only 1% of code coverage, it is honestly a wholesome moment for me to ship confidently.


All I see are mentions of how various models generate an image of a "pelican riding bicycle(s)".


Yes, the "pelican riding a bicycle" is the ultimate test of not understanding how LLMs work.

Well, a combination of that and believing that replication of test data is a good measure of progress.


Spicy — why does it show ultimate non-understanding?


Because success comes from reproducing a memorized pattern rather than transferable reasoning?

At the same time failure proves little because most humans also could not manually create a correct SVG of a pelican riding a bicycle.

What is it exactly that such a test is testing?

In which situation would you measure the "competence" of a human being by asking them to write an SVG of a pelican riding a bicycle?


> most humans also could not manually create a correct SVG of a pelican riding a bicycle.

Most humans absolutely can create this with a suitable vector graphics tool such as Inkscape or Illustrator.

Surely, you're not suggesting that a fair comparison would be using a text editor?

If so, would you suggest that an equivalent raster-based task would only be fair if the human were manually assigning RGB values to each pixel?
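
For anyone unsure what "using a text editor" means here, this is a rough sketch of SVG markup with every coordinate typed by hand, which is roughly the position the benchmark puts a model in. The shapes, coordinates, and the helper names `handwritten_svg` and `count_shapes` are all made up for illustration; it is a crude bicycle-plus-bird outline, not anyone's actual benchmark prompt or output.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def handwritten_svg() -> str:
    """Return SVG markup where every number was chosen by hand (illustrative only)."""
    return f"""<svg xmlns="{SVG_NS}" viewBox="0 0 200 120" width="200" height="120">
  <circle cx="50" cy="90" r="25" fill="none" stroke="black"/>
  <circle cx="150" cy="90" r="25" fill="none" stroke="black"/>
  <path d="M50 90 L90 50 L150 90 M90 50 L80 90" fill="none" stroke="black"/>
  <ellipse cx="95" cy="35" rx="22" ry="14" fill="white" stroke="black"/>
  <path d="M115 30 L150 36 L115 40 Z" fill="orange"/>
</svg>"""


def count_shapes(svg_text: str) -> int:
    """Parse the markup (raises ParseError if malformed) and count non-root elements."""
    root = ET.fromstring(svg_text)
    return sum(1 for element in root.iter() if element is not root)


if __name__ == "__main__":
    # Two wheels, a frame, a body, and a beak: no GUI tool involved.
    print(count_shapes(handwritten_svg()), "hand-placed shapes")
```

In Inkscape or Illustrator the same drawing is a few mouse drags, which is the gap between the two comparisons being argued about.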


We all know the true test of AI is Will Smith eating spaghetti.


Wait, are you saying you don't handcraft svgs of pelicans riding bicycles?


As you said, it's distributed across people, conversations, AI agents, tooling, etc. Can't an LLM knowledge base/wiki (a.k.a. the org's second brain) solve this? I think that if a second brain exists, no one needs to pay the cognitive debt.


I used to do this and then test manually to validate that everything worked as expected in my small open source project. But over time I saw that some bugs had crept in that I was unable to track down, since I was only doing manual testing. So I wrote some e2e tests with Playwright, and I think that gives a bit of relief (at least).

