
bisonbear · 2026-06-08T01:21:26 1780881686

beat saber is the only game I play on it and it's incredible

bisonbear · 2026-06-07T23:27:08 1780874828

The most salient point here is the societal acceptance of consuming slop - somehow we've gotten to a point where the majority of people are ok with mediocre art. I feel that this is a trend that AI has only amplified. The commodification of attention has gradually led us to a point where we're optimizing for engagement instead of for intrinsic value of the content itself.

Personally, I will continue seeking out high-quality music/art/movies/books that speak to me, and most of my friends do the same. There will always be a demand for human-created art, regardless of any plagiarism or replication by labs.

bisonbear · 2026-06-05T22:54:08 1780700048

Agree - all of this is based on vibes (I also use TDD based on vibes FWIW). The only way to settle "does TDD / caveman / [insert random skill here] help" is to replay real PRs from your repo and measure quality

bisonbear · 2026-05-28T04:05:08 1779941108

> Seems like the progressive disclosure approach is the best for context efficiency; I wound up with a somewhat tight generic AGENTS.md, and the .cursor/rules individual files with glob matching for file names. Cursor honored those well.

This is also generally where I've landed - keep the AGENTS.md super light, and link out to docs as needed. Same idea with skills as well. Basically, preserve the context window at all costs.

The part I'm curious about is, when we're making the sorts of behavior changes you're describing on shared repos, how do we actually measure and quantify impact? It's one thing to tell the team that the agent should perform better, and it's another to say that you made the agent 5% better across a variety of tasks for every dev in the repo.
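One way to make the "5% better" question concrete is to put an uncertainty range around a measured pass rate. The sketch below is only an illustration, assuming each replayed task yields a plain pass/fail result; the function name and the 14-of-20 numbers are hypothetical, not taken from any tool mentioned in this thread.

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly a 95% interval)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical result: 14 of 20 replayed tasks pass.
low, high = wilson_interval(14, 20)
print(f"observed 0.70, plausible range {low:.2f} to {high:.2f}")
```

With only 20 tasks the range spans well over 30 percentage points, so a 5-point claim would need a much larger task set or a paired comparison on the same tasks.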

fuckinpuppers · 2026-05-28T18:30:12 1779993012

I didn’t have to share it or quantify it… so I didn’t care.

I just relied on different agents/models and kept giving them a thorough prompt along the lines of “analyze the agents.md, cursorrules, etc. and ensure it's token efficient and enforces everything” (it was very specifically worded; I may have even asked an agent how to ask agents for it). I kept jumping between the 3 big models at medium and high thinking. Each one kept finding little things, and at one point things moved entirely from one strategy to another, if I remember right.

Once I felt good enough, I started using it as the setup for my application, and it’s been pretty good without any modifications or tweaks. Originally I decided to do this because I got tired of realizing that it wasn’t honoring things I told it to do. For example, “restart the application after every modification to the server code”, and it would often “forget” to do that… somehow now I’ve got it really well tuned for my particular codebase and approach to developing.

bisonbear · 2026-05-28T04:01:59 1779940919

> we lack common tools to assess and compare

This has been bothering me for a while - the entire dev community is running on vibes when talking about AI. We're operating in an old paradigm, thinking that smart and logical additions to AGENTS.md result in good agent behavior, when in fact agent behavior is such a black box that measurement is necessary.

> Even when all the rigging is controlled. (Which implies we need multiple experiments to compare against.)

Even the rigging is hard to control - Anthropic has an interesting piece on this here https://www.anthropic.com/engineering/infrastructure-noise

bisonbear · 2026-05-28T03:58:26 1779940706

Yes, agree that low n makes overclaiming a real risk with this sort of optimization loop. Low-n results can be useful directionally, but you can't claim superiority without expanding the dataset. If I were running this for a shared repo with real consequences / value to improving AGENTS.md, instead of just as an experiment, I would expand n severalfold for training / holdout, depending on the expected variation across tasks.
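A minimal sketch of the training / holdout idea, assuming each task has a stable string id (the `pr-N` ids and the 30% holdout fraction here are made up for illustration). Hashing the id keeps a task in the same split as the dataset grows, so the holdout set is never accidentally tuned against.

```python
import hashlib

def assign_split(task_id: str, holdout_fraction: float = 0.3) -> str:
    """Deterministically assign a task to 'train' or 'holdout' by hashing its id."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_fraction else "train"

# Hypothetical task ids; in practice these might be merged PR numbers.
task_ids = [f"pr-{i}" for i in range(1, 41)]
splits = {tid: assign_split(tid) for tid in task_ids}
print(sum(1 for s in splits.values() if s == "holdout"), "of", len(splits), "tasks held out")
```

Instructions are then tuned only against the train tasks, and the holdout tasks are run once at the end to check that the improvement generalizes.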

I'm also noticing similar patterns with needing to update AGENTS.md / skills per model release. E.g., going from Opus 4.6 -> 4.7, the model became much more instruction-adherent, so instructions written for the prior model generation might cause unexpected behavior in the new generation. I'm also convinced that an optimal AGENTS.md for Codex is not the same file as an optimal CLAUDE.md for Claude - the model personalities and behaviors are so different that we probably need to tune the instructions differently as well.

bisonbear · 2026-05-16T15:12:40 1778944360

AGENTS.md is extremely important - it's probably the highest leverage thing you can give your agent. It's injected into every turn, and the agents are trained to follow instructions. If anything, I think people are under-investing into AGENTS.md and going purely based on vibes.

For example, if I write a bad AGENTS.md for a repo with 100 engineers actively working in it, then every agent for every engineer gets worse, without anyone really noticing.

I think we should move towards data-based tuning of AGENTS.md, testing out changes, gathering data, and then making a decision on whether or not to ship it.
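As a rough sketch of what "gather data, then decide" could look like: run the same tasks with the old and new AGENTS.md, record pass/fail per task, and compare only the tasks where the outcome changed. Everything here is an assumption for illustration (the task names, the pass/fail dicts, and the choice of an exact sign test); it is not a description of any particular tool.

```python
from math import comb

def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test: how likely a split this lopsided is if the change had no effect."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Paired comparison over the tasks both runs share."""
    shared = baseline.keys() & candidate.keys()
    wins = sum(1 for t in shared if candidate[t] and not baseline[t])
    losses = sum(1 for t in shared if baseline[t] and not candidate[t])
    return {
        "wins": wins,
        "losses": losses,
        "ties": len(shared) - wins - losses,
        "p_value": sign_test_p_value(wins, losses),
    }

# Hypothetical results for five replayed tasks.
baseline = {"a": True, "b": False, "c": False, "d": False, "e": True}
candidate = {"a": True, "b": True, "c": True, "d": True, "e": False}
print(compare(baseline, candidate))
```

Here the new file wins 3 tasks and loses 1, which looks good but gives a p-value of 0.625, so with this few tasks the result is directional at best.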

david_d8912 · 2026-05-16T16:00:29 1778947229

The project directory, sure. I'm talking more about behavior rules here. What are you guys writing, and is it effective?

bisonbear · 2026-05-16T16:20:16 1778948416

My advice, from doing this myself and reading best practices, would be:

- Keep it concise, use progressive disclosure / nested AGENTS.md for information expansion
- Give agent the high level repo structure if necessary
- Have a "why" section to align the agent, high level, what your code is doing
- Keep behavior instructions positive where possible, e.g. Always clarify intent before acting

david_d8912 · 2026-05-16T16:24:23 1778948663

cool. I agree with all the earlier points. On the last part, "Keep behavior instructions positive where possible", have you had good experience with it? I'm only asking since in my own experience they're constantly not followed.

bisonbear · 2026-05-16T17:26:41 1778952401

Yeah, I've found that to be more effective. Going with the example "Always clarify intent before acting" > "Never act without getting intent first", seemingly because telling the agent NOT to do something sometimes primes it to do that exact thing

bisonbear · 2026-05-16T14:47:08 1778942828

I've been building a tool to do this - build a dataset based on tasks from your repo, then A/B test the agent with whatever change you're making to determine the impact prior to actually shipping it. If you want to check it out - stet.sh

bisonbear · 2026-05-14T20:57:51 1778792271

Not the OP, but I've been thinking about this problem a lot - as devs we're overly reliant on vibes for evaluating coding agents. This is already a problem, and especially so if you're working in an engineering organization where a bad edit to AGENTS.md can cause silent regressions for everyone in the codebase.

To solve this, I've built an agent-native tool to run evaluations based on merged PRs in your codebase. Basically you can ask Claude to evaluate whether the skill made things better/worse on real tasks, and to then iteratively improve it

Stalking your profile (sorry..) I see you're pretty deep in the eval space, so I'm super curious what your approach has been to being rigorous for things like skill changes?

alexhans · 2026-05-20T08:14:52 1779264892

Hey, sorry for the late reply.

I looked superficially at your site/repo and based on that initial impression:

- Your approach of comparing different parts of the "black box" which affects agent behaviour (harness, foundation model, skills, context (in your case, the context loaded from AGENTS.md)) is closely aligned with how I both think and operate.
- You're tackling both the "regression" and the "answer hypothesis easily" problems.

> Stalking your profile (sorry..) I see you're pretty deep in the eval space, so I'm super curious what your approach has been to being rigorous for things like skill changes?

It depends on the level of automation and risk profile. For skills I use this framework of thinking [1] and encourage evals/ground truth as soon as possible so that you can have automatic feedback loops for the markdown part and for the deterministic part (scripts). Once you have the eval/ground truth pair, you're almost doing TDD or Eval Driven Development (which is quite hard the first times you try and realize you actually need to think about intent). The scripts should definitely have their own unit tests for the "skill iteration" in the event that a mutation is desired to cover new behaviour/fix wrong behaviour.

On Agent Skills, it may seem tempting to want more "openness" for the AI to solve the problem creatively but, more often than not, you've described a repeatable workflow and you want predictability and stability instead of novelty, so it's really about 1) How can I freeze it to keep being good enough as much as possible? 2) How can I know if something happened somewhere which changed the black box (e.g. coding harness auto model picking screws things up)? 3) How can I make the skill itself ETC (Easy to change), to keep control? Local Models can be a great tool for stability in some scenarios.

In particular, I prefer pass/fail (binary) outcomes instead of scoring, which doesn't help regression decisions. Defining "good enough" should be very clear. Flakiness is not a good thing to accept if the outcomes are consequential.
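A small sketch of why binary outcomes make regression decisions simple, assuming each eval produces a named pass/fail result (the eval names below are invented for illustration): anything that passed before and does not pass now is a regression, and the gate is just "no regressions".

```python
def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Evals that passed before the change and fail (or are missing) after it."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )

def gate(baseline: dict[str, bool], candidate: dict[str, bool]) -> bool:
    """True only if nothing that used to pass has stopped passing."""
    return not regressions(baseline, candidate)

# Hypothetical eval results before and after a skill change.
before = {"parses_invoice": True, "handles_empty_input": True, "formats_dates": False}
after = {"parses_invoice": True, "handles_empty_input": False, "formats_dates": True}

print(regressions(before, after))
print("ship" if gate(before, after) else "hold")
```

In this example one new eval passes but a previously passing one fails, so the gate holds the change; an averaged score could have hidden that trade.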

Anything actually risky should be solid RBAC/policy which doesn't really depend on the LLM.

I had a site, ai-evals.io, meant to build a community around this, but I didn't manage to get it visibility on HN. I've since interacted with a few people, developed further insights and given some private talks, but I need to get back to publishing publicly and contacting more people interested in this space, because it's absolutely critical. There's a lot of nuance in how different environments think about the eval problem: for some it's all about tracing and course correcting after launch, for others it's about simulations, sandboxing, security, automatic eval generation, etc.

In any case, I'll try to be more present from now on, and especially from June onwards to try to exchange insights in the open with people who are exploring different solutions in this space.

[1] - https://alexhans.github.io/posts/series/evals/building-agent...

bisonbear · 2026-05-11T02:38:05 1778467085

Agree, it's impossible to tell if someone else's workflow works with your codebase without actually trying it, which takes time/tokens. I've been thinking about how to make running quick, directional evals easier / more efficient to give more confidence in using / developing skills. Basically, how do we go from vibes to data?