This is where most of my productivity gains have come from. I have a special harness that I now move from project to project to do my testing orchestration. A lot of my workday is setting up a prompt or two early and just letting them loop until they return evidence that the feature is working, having gone through the big QA loop.
I've slowly been optimizing for token use through the stack and Claude ends up making very tight for loops for most of the process and keeping token count even lower. It's been nice. A lot of my toil at work is just gone.
for us it's (usually) very easy as I work on performance optimization. a non-negligible part of this is correctness and verifiability, so we already have some of that.
to give you an example: just recently I coded a feature that, for our shuffle operation, can report which channel the bytes flowed through (the PR giving us the plumbing underneath landed upstream recently). what this basically means is that you run the shuffle, you know you've shuffled X bytes (because you have stats on both ends), and then you need to attribute them to different layers. on the first iteration, the count was off. the agent went, debugged, fixed, iterated, and then it was 1.5% off. again, it went, iterated, ... and now we're fine.
part of the task description was that the breakdown must match the known amount of bytes we're shuffling, so the agent took this on as a self-verification point. so besides running our normal, boring unit tests, integration tests and end-to-end verification harnesses (which it not only has programmatic/CLI/API access to, but which are also documented in .md files for projects), it could use this criterion on top to verify.
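a minimal sketch of that kind of self-verification check, in Python. the function name, the channel names and the tolerance parameter are all made up for illustration; the real harness described above isn't shown here and surely looks different.

```python
from typing import Mapping


def check_attribution(
    total_bytes: int,
    per_channel: Mapping[str, int],
    tolerance: float = 0.0,
) -> float:
    """Compare the per-channel byte breakdown against the known total.

    Returns the relative error and raises AssertionError when it exceeds
    `tolerance`. All names here are hypothetical.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    attributed = sum(per_channel.values())
    rel_err = abs(attributed - total_bytes) / total_bytes
    if rel_err > tolerance:
        raise AssertionError(
            f"attributed {attributed} of {total_bytes} bytes "
            f"(relative error {rel_err:.2%})"
        )
    return rel_err


if __name__ == "__main__":
    # Hypothetical channels; an exact match passes with zero error.
    print(check_attribution(1_000, {"channel_a": 600, "channel_b": 400}))
```

a breakdown that is 1.5% short, like the intermediate iteration above, would fail this check at the default tolerance of 0, which is the signal the agent needs to keep iterating.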
looking at /usage, my API duration was 2h 43m, and on top of that:
Definitely agree that performance optimization is a good use case for LLMs. Here you have both a measurable goal / objective function and guardrails against functional regressions. It kind of closes the loop in that regard.
One thing, however, is that a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature. Therefore you could still possibly get code degradation.
> One thing, however, is that a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature.
Not in the world of AI - if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough. There's no excuse at this point not to have an incredibly comprehensive test suite to go with your other agent feedback loop constraints.
>> if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough.
Maybe I misunderstand, but this seems like a fairly low bar in that the test suite only covers existing bugs.
I'd argue that if you aren't going to look at the code, you actually need a fully comprehensive test suite - in the sense that if the tests pass, the code is correct and you don't have to look at it at all. The problem is that such a suite doesn't seem very quick to create. Of course, if there is a way to do it quickly, in a way that is reproducible by others, I'd love to hear about it.
I don't mean just bugs, I mean any known issues. I test infra, I test UI, I test binary protocols, you name it. There is certainly no fast way to do it, even with AI (an AI generated suite is better than nothing but not as good), and it's a serious investment, but it's worth it. Testing becomes a process of correctness checking that snowballs over time, making everything else easier and better (or else the tests need further adjustment!)
Right. You mean all behaviors are tested, essentially.
So if you / your team are going to implement a new feature, what does that look like? Do you write Gherkin or similar, unit tests, or both? Can you provide an example of what that might look like? How much of this has changed for you since the pre-AI days?
I have it record a series of gifs or videos that I look over. If something looks off I'll dig into it, but I break down work into very very small chunks that are usually easily verifiable or don't require multiple steps.
Another thing I have in the general SDLC process is having it add enough logging to verify that features are turned on and configured as we expected, and that becomes enough feedback for most of my features.
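A rough sketch of what checking that logging could look like, in Python. The log line format (`feature=<name> enabled=<true|false>`) and the feature names are assumptions for illustration only, not what my actual services emit.

```python
import re
from typing import Iterable, List

# Assumed (hypothetical) log format: "... feature=<name> enabled=<true|false> ..."
FEATURE_LINE = re.compile(r"feature=(?P<name>[\w.-]+)\s+enabled=(?P<state>true|false)")


def features_not_enabled(log_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected features that the log does not report as enabled."""
    state = {}
    for match in FEATURE_LINE.finditer(log_text):
        # The last line seen for a feature wins, mirroring a config reload.
        state[match.group("name")] = match.group("state") == "true"
    return [name for name in expected if not state.get(name, False)]


if __name__ == "__main__":
    sample = (
        "2024-01-01 INFO startup feature=new_checkout enabled=true\n"
        "2024-01-01 INFO startup feature=audit_log enabled=false\n"
    )
    print(features_not_enabled(sample, ["new_checkout", "audit_log", "dark_mode"]))
```

An empty list is the "evidence" the agent hands back; anything else names the features it still has to chase down.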
I've mostly been focusing on being able to replicate this across stacks (more than 3 projects so far), with the eventual goal of having an agent be able to orchestrate our complete infra stack, and this being a large component of a DR plan to rebuild.
None of this is really new for us. I'm just the most knowledgeable in my group about how the different products across teams glue together, so I've been creating these Rube Goldbergs as a prototype, and then having it iterate on codifying the parts that don't need a constant LLM. We were blessed to have an engineer a decade ago build out tooling for local container automation that matches 95% of the deployed infra stack. That last 5% sucks when you fall into it, but that's always been a truth. I've added to and expanded the tool over the years, making it act more like the deployed environment networking-wise, but a lot of things don't end up working well in Docker containers on M-series Macs when most of our complicated virtualization in our private cloud can't run on them yet...
I’ve been building this out too, and your comment made me realize the missing piece for me. I’ve given the agents tools to validate their own work, but I haven’t improved the experience of humans verifying the agents’ work.
I have also been finding the MCP auth story to be really lacking. I was excited to see OAuth 2 support until I tried to get it to work at work and realized our IdP implementation didn't support 2.1, then went into the spec and started wondering if anyone had had a good experience yet. Luckily most of our environment can settle on an OAuth token env var standard until that's all in order.
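For what it's worth, the env var convention can be as small as the sketch below (Python). The variable name `MCP_OAUTH_TOKEN` is a placeholder I made up for this example, not any kind of standard, and the `environ` parameter only exists to make the helper easy to test.

```python
import os
from typing import Dict, Mapping, Optional


def bearer_headers(
    env_var: str = "MCP_OAUTH_TOKEN",  # hypothetical variable name
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build an Authorization header from a token stored in the environment."""
    env = os.environ if environ is None else environ
    token = env.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set or is empty")
    return {"Authorization": f"Bearer {token}"}


if __name__ == "__main__":
    # Pass a fake environment so the example runs without touching real secrets.
    print(bearer_headers(environ={"MCP_OAUTH_TOKEN": "example-token"}))
```

Failing loudly on a missing token beats sending an empty bearer header and debugging a 401 later.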
A lot of how well it works or doesn't depends on your clients, as not all clients have support for things like RFC 9728 (Protected Resource Metadata). Assuming your client has good support for most of the OAuth 2.0 standards that MCP uses (you don't need DCR, as you can statically register clients, assuming that is viable for your environment), it is now possible with most IdPs to get an OAuth 2.0 auth code flow working just fine. You would then do a token exchange to the upstream to ensure you get the appropriate new audience and rescope/downscope as necessary. Gateways can also help a lot here: instead of baking all of the auth concerns into your MCP servers and upstreams, you delegate that to the MCP gateway. Again, gateway here means different things to different vendors, but Kong, for example, has the ability to act as an MCP proxy (gateway), expose tools based on consumer role or group, apply OAuth 2.0 to it and do an upstream token exchange, while also acting as a regular API gateway that can protect an endpoint with OAuth 2.0/OIDC.
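To make the token exchange step concrete, here is a minimal Python sketch that only builds the form-encoded request body, assuming the exchange follows RFC 8693 (OAuth 2.0 Token Exchange). The audience URL, scope names and token value are placeholders, and whether your IdP accepts `audience`, `resource`, or both varies, so treat it as a starting point rather than a working integration.

```python
from typing import Sequence
from urllib.parse import parse_qs, urlencode

# Identifiers defined by RFC 8693.
TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def build_token_exchange_body(
    subject_token: str,
    audience: str,
    scopes: Sequence[str] = (),
) -> str:
    """Return the x-www-form-urlencoded body for a token exchange request."""
    params = {
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": subject_token,
        "subject_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,  # the upstream the new token is meant for
    }
    if scopes:
        # Request only the scopes the upstream call needs (downscoping).
        params["scope"] = " ".join(scopes)
    return urlencode(params)


if __name__ == "__main__":
    body = build_token_exchange_body(
        "incoming-access-token",         # placeholder token
        "https://upstream.example.com",  # placeholder audience
        ["tools:read"],                  # placeholder scope
    )
    print(parse_qs(body))
```

The body would be POSTed to the IdP's token endpoint along with client authentication; that part is left out because it depends entirely on the IdP and on how the client is registered.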
Exactly, and for a given task you don't need to recall what your friend's brother's name is to do a git commit and push. There's a pull for more context to make these things better, but also the pull to make these execute in such a small context effectively when appropriate.
I'm more on team small tasks because of my love of Unix piping. I keep telling folks that, as an old Linux dude, seeing subagents work together for the first time felt like learning to pipe sed and awk for the first time. I realized how powerful these could be, and we still seem to be going in that direction.
I kept reading about how bad Android Auto was for years, but we finally bought a more modern used car and I can't believe they would ship that experience to customers. I had a week where I just had to unpair and re-pair every time I got in the car.
I would love to read about why that stuff is the way it is from the engineers, hmm that might be a good spelunking. I really must be missing something that makes it harder than I think it really should be.
Maybe they prioritized it low, assigned it to a group of 3 devs, and it functioned the first time they demoed it to management, so it shipped. All other devs would’ve been working on their in-house software.
Not the guy you’re responding to, but Quanta articles are invariably horribly written, horribly explained, and constantly do this thing where they are simultaneously pretentious and overcomplicate things while also belabouring simple, elementary concepts. Essentially it’s the worst of every world.
Hah, same exact setup: one brick, two ports, and it charges everything, even my laptop! I've been eyeing some of the ones with built-in batteries, but I get a lot of mileage out of one brick in the bag.
The Steam Deck forced me to finally pay attention to the USB-C ecosystem, and I can only imagine how some non-tech people must feel when they run into mysteriously bad or slow charging.
I find it crazy that Apple went back to MagSafe in the M4 (maybe earlier, but that's the machine I have at work). But at least you can still charge over USB-C.
I can't get myself to do the battery-built-in-to-charger thing. I've always treated portable power banks as semi-disposable since they do eventually get worse and fail, and it feels icky to me to tie ~immortal charging gear to something that will die.
I did have the same feeling about flashlights for camping/hiking with lithium batteries, though, until someone walked me through just how much better they are than lugging around AAs.
When I've been working on stuff that requires an SSO login, I noticed that it makes what I consider hostile, anti-user choices in defaulting to tracking pieces of information I didn't want to track and hadn't mentioned.
Fair enough that I didn't explicitly instruct it to make more pro-user choices; it just seemed to think slurping as much information as possible into the backend was a default intention. I wasted a few more tokens iterating on it to remove things, but it was IMO interesting enough that I finally submitted feedback around what I imagine is an interesting training problem.
I would love to work for NASA so much, even at a significant pay cut, but almost everything I've read in the past said they still do drug screenings for a lot of the positions I was interested in. Maybe someday they will pull their heads out of the dark ages.
Normally I would agree, but I get it with regard to NASA. They do life-and-death stuff that has like zero margin for error. They probably shouldn't be in the business of hiring people whose edible might be lasting longer than they expected.