pineapple_opus's comments

pineapple_opus · 2026-06-05T06:17:52 1780640272

Eye-catching: the Claude Code session success rate on "open-ended problems" jumped from 20% (before the Opus 4.5 release) to 70% sometime after Opus 4.6 was released.

pineapple_opus · 2026-06-02T08:23:11 1780388591

Yeah, this seems true. Claude Code is famously dubbed the best AI coding agent, but Google doesn't care about that niche, I guess. Somehow, I still rely on Google Search, as they have diversified it.

If you ask a question, it shows the "AI Overview", but if you search for a particular entity or platform like "Google stock" or "bbc news", it gives the old classic search experience and you don't need to swallow the "AI Overview" pill in that case.

marysol5 · 2026-06-02T09:46:46 1780393606

I tried using Gemini CLI to sort some code issues for me, ran out of tokens mid-way through, even though I have Gemini Pro.

Turns out licensing is separate for "code" and "pro"...

freedomben · 2026-06-02T12:17:33 1780402653

Same happened to me. That was the death knell for Gemini as a coding agent for me. I even paid for a whole year...

I highly suspect they opaquely lowered usage limits on me.

pineapple_opus · 2026-05-26T06:00:45 1779775245

> This is the opposite of the “10x productivity” slop-cannon style of development that most people imagine when they think of vibe coding, but I find it very satisfying.

I can relate to this. When I spend time writing a unit test, even one that accounts for only 1% of code coverage, it is honestly a wholesome moment for me to ship with confidence.

pineapple_opus · 2026-05-19T06:39:54 1779172794

All I see are mentions of how various models generate an image of a "pelican riding bicycle(s)".

emil-lp · 2026-05-19T06:47:12 1779173232

Yes, the "pelican riding a bicycle" is the ultimate test of not understanding how LLMs work.

Well, a combination of that and believing that replication of test data is a good measure of progress.

vessenes · 2026-05-19T09:27:08 1779182828

Spicy — why does it show ultimate non-understanding?

JohnKemeny · 2026-05-19T13:02:20 1779195740

Because success comes from reproducing a memorized pattern rather than transferable reasoning?

At the same time, failure proves little, because most humans also could not manually create a correct SVG of a pelican riding a bicycle.

What is it exactly that such a test is testing?

In which situation would you measure the "competence" of a human being by asking them to write an SVG of a pelican riding a bicycle?

okamiueru · 2026-05-20T18:04:06 1779300246

> most humans also could not manually create a correct SVG of a pelican riding a bicycle.

Most humans absolutely can write this with a suitable vector graphics tool such as Inkscape or Illustrator.

Surely, you're not suggesting that a fair comparison would be using a text editor?

If so, would you suggest that an equivalent raster-based task would only be fair if the human were manually assigning RGB values to each pixel?

ClikeX · 2026-05-19T06:57:54 1779173874

We all know the true test of AI is Will Smith eating spaghetti.

ActionHank · 2026-05-19T19:17:19 1779218239

Wait, are you saying you don't handcraft SVGs of pelicans riding bicycles?

pineapple_opus · 2026-05-05T06:10:56 1777961456

As you said, it's distributed across people, conversations, AI agents, tooling, etc. Can't an LLM knowledge base/wiki (a.k.a. the org's second brain) solve this? I think if a second brain exists, no one needs to pay the cognitive debt.

pineapple_opus · 2026-05-05T06:00:13 1777960813

I used to do this and then test manually to validate that everything worked as expected in my small open-source project. But over time I saw that some bugs crept in, which I was unable to track since I was doing manual testing. So I wrote some e2e tests with Playwright, and I think that gives a bit of relief (at least).