Perturbation of the dataset used for training can introduce adversarial behavior even without adding any other data, and the idea is quite simple: you train on two batches from the dataset and keep the model with the more probable adversarial behavior. The more batches that get processed with this posterior selection, the more probable the adversarial behavior becomes.
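A toy simulation of that selection pressure, with everything hypothetical: the "adversarial score" is just a number, and each candidate checkpoint is modeled as the current score plus independent noise. Repeatedly keeping the higher-scoring of two candidates drifts the score upward even though no step adds anything on average.

```python
import random

def select_by_trait(steps=1000, seed=0):
    """Toy posterior selection: at each step, produce two candidate
    checkpoints (current trait value plus independent noise) and keep
    whichever scores higher on a hypothetical 'adversarial' metric."""
    rng = random.Random(seed)
    trait = 0.0
    for _ in range(steps):
        a = trait + rng.gauss(0, 0.1)  # candidate trained on batch 1
        b = trait + rng.gauss(0, 0.1)  # candidate trained on batch 2
        trait = max(a, b)              # keep the more 'adversarial' one
    return trait

print(select_by_trait())  # drifts far above the starting value of 0.0
```

The max of two zero-mean noisy copies has positive expectation, so selection alone supplies a steady upward push, which is the whole point of the attack.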
By determining whether a model gets better or not on a given benchmark, OpenAI selects models against benchmarks, implicitly using them in training.
The page does a very poor job tokenizing the phrase "Noinceolik fiyulnabmed fyvaproldge" into "Noinceolik fiyulnabm ed fyvaproldge", factoring out only the "ed" suffix. As if made-up words such as "noinceolik" were so common they are part of a 100K-token vocabulary.
Actually applying the GPT-5 tokenizer at [1] to my made-up phrase results in 14 tokens; only two of them are four characters long, and there are tokens containing spaces.
I appreciate the feedback. I did notice that as well, and I thought it was perhaps not worth fixing since I have a link to tiktokenizer. I decided to remove it and just add a more prominent link to tiktokenizer.
Actually, a euphoric mood disorder may make one hear voices telling one to feel great, do good, help all the grandmas of the world across the crossing, etc.
The "focus" and "get back to work" parts are hard, though.
To have some confidence in the consistency of results (a p-value), one has to start from a cohort of around 30, if I remember correctly. That is a 1.5-order-of-magnitude increase in the computing power needed to find (the absence of) consistent changes in an agent's behavior.
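As a sketch of why ~30 runs is roughly the floor: assume each trial is a paired win/loss comparison between two agent variants (an assumption; the actual test depends on the experiment), and apply a two-sided binomial test built from the stdlib. With n = 30 you need about 21 wins before the change is significant at the 5% level.

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Two-sided binomial test p-value: probability, under the null,
    of any outcome at least as extreme (no more probable) than
    observing k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    return sum(x for x in pmf if x <= observed + 1e-12)

# With n = 30 paired comparisons, ~21 wins are needed for p < 0.05:
print(binom_two_sided_p(21, 30))  # ~0.043, significant
print(binom_two_sided_p(19, 30))  # ~0.20, not significant
```

With only a handful of runs, no outcome can reach p < 0.05 at all, which is why the cohort size (and the compute bill) has to grow by an order of magnitude or more.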
I apologize for the potato quality of these links; however, I have been working tirelessly to wrap my head around how to reason about how agents and LLMs work. They are more than just a black box.
The first tries to answer what happens when I give the models harder and harder arithmetic problems, to the point that Sonnet will burn 200k tokens over 20 minutes. [0]
The other is a very deep dive into the math of a reasoning model in the only way I could think to approach it: with data visualizations, seeing the computation of the model in real time in relation to all the parts. [1]
Two things I've learned. One is that the behavior of an agent that reverse engineers a website and the behavior of an agent that does arithmetic are the same: the probability that either will solve its intended task is, for the given agent and task, a distribution. The other is that models have a blind spot: creating a red-team adversarial bug-hunter agent will not surface a bug if the same model originally wrote the code.
Understanding that, and knowing that I can verify at the end or use a majority of votes (MoV), using agents to automate extremely complicated tasks can be very reliable, with a quantifiable amount of certainty.
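A minimal sketch of majority voting over independent agent runs (the `runs` list below is hypothetical):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer across independent agent runs,
    plus the fraction of runs that agreed with it."""
    counts = Counter(answers)
    best, n = counts.most_common(1)[0]
    return best, n / len(answers)

# Hypothetical runs of the same agent on one task:
runs = ["42", "42", "41", "42", "42"]
print(majority_vote(runs))  # ('42', 0.8)
```

If each independent run succeeds with probability above 1/2, the majority over more runs succeeds with strictly higher probability, which is where the extra certainty comes from.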
> The other is that models have a blind spot: creating a red-team adversarial bug-hunter agent will not surface a bug if the same model originally wrote the code.
This is very interesting, if true. It follows that one can generate several instances of the code, choose the one with the bug, and the bug will not be found. Mythos can be used to fool Mythos.
If I may (and if I correctly understand what is going on inside Fil-C), it would not be so hard to add support for software transactional memory by adding some library calls.
This would greatly reduce coordination bugs in parallel programs and may even speed things up.
And yet the KL divergence after changing all that stuff remains remarkably similar between different models, regardless of the specific hyperparameters and block diagrams employed at pretraining time. Some choices are better, some worse, but they all succeed at the game of next-token prediction to a similar extent.
To me, that suggests that transformer pretraining creates some underlying structure or geometry that hasn't yet been fully appreciated, and that may be more reusable than people think.
Ultimately, I also doubt that the model weights are going to turn out to be all that important. Not compared to the toolchains as a whole.
That "underappreciated underlying structure or geometry" may just be an artifact of the same tokenization being used with different models.
Tokenization breaks up collocations and creates new ones that were not always present in the original text. Most probably, the first byte pair found by a simple byte-pair encoding algorithm on enwik9 will be two spaces next to each other. Is that a true collocation? BPE thinks so. Humans may disagree.
What concerns me here is that it is very hard to ablate tokenization artifacts.
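The two-spaces claim is easy to check in miniature: the first merge a naive BPE makes is just the most frequent adjacent byte pair. Below is a sketch run on a toy wiki-ish sample (a stand-in for enwik9, which I have not run this on); the indentation makes the space-space pair win.

```python
from collections import Counter

def most_common_pair(data: bytes):
    """First merge a naive BPE would make: the most frequent
    adjacent byte pair in the data."""
    pairs = Counter(zip(data, data[1:]))
    return pairs.most_common(1)[0]

# Toy stand-in for enwik9: wiki-style markup with indentation runs.
sample = b"  <page>\n    <title>Example</title>\n    <text>Some text.</text>\n  </page>\n"
print(most_common_pair(sample))  # ((32, 32), 8): the pair (space, space)
```

Whether "two spaces" deserves to be a unit of meaning is exactly the kind of question the merge statistics never ask.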
None of that is true, at least in theory. You can trivially change layer size simply by adding extra columns initialized as 0, effectively embedding your smaller network in a larger network. You can add layers in a similar way, and in fact LLMs are surprisingly robust to having layers added and removed - you can sometimes actually improve performance simply by duplicating some middle layers[0]. Tokenization is probably the hardest but all the layers between the first and last just encode embeddings; it's probably not impossible to retrain those while preserving the middle parts.
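The zero-initialized embedding is simple enough to show with a toy linear layer in plain Python (no framework; `matvec` is a stand-in for a real layer's matrix multiply). Padding the weight matrix with zero rows and columns preserves the function on the original inputs, giving training a larger network that starts from the smaller one's behavior.

```python
def matvec(W, x):
    """Plain matrix-vector product standing in for a linear layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# A tiny 2->2 linear layer:
W_small = [[1.0, 2.0],
           [3.0, 4.0]]

# Embed it in a 3->3 layer by padding with zero rows/columns. On inputs
# whose extra coordinate is 0, the widened layer computes the same
# outputs, plus an extra output that stays 0 until training moves it.
W_big = [[1.0, 2.0, 0.0],
         [3.0, 4.0, 0.0],
         [0.0, 0.0, 0.0]]

x = [5.0, 6.0]
print(matvec(W_small, x))        # [17.0, 39.0]
print(matvec(W_big, x + [0.0]))  # [17.0, 39.0, 0.0]
```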
You took the simple path, embedding the smaller network into the larger one. What if you need to reduce the number of layers and/or the width of the hidden layers? How will you embed the larger into the smaller? As for the "addition of same layers": would the process of selecting which layers to add be considered training?
What if you still have to obtain the best result possible for given coefficient/tokenization budget?
I think that my comment expresses the general case, while yours provides some exceptions.
The general case is that our current relative ignorance of the best way to use and adapt pretrained weights is a short-lived anomaly, caused by an abundance of funding to train models from scratch, rapid evolution of training strategies and architectures, and a mad rush to ship hot new LLMs as fast as possible. But even as it is, the things you mentioned are not impossible; they are easy, and we are only going to get better at them.
> What if you need to reduce number of layers
Delete some.
> and/or width of hidden layers?
Randomly drop x% of parameters. No doubt there are better methods that entail distillation but this works.
> would the process of "layers to add" selection be considered training?
Er, no?
> What if you still have to obtain the best result possible for given coefficient/tokenization budget?
We don't know how to get "the best result possible", or even how to define such a thing. We only know how to throw compute at an existing network to get a "better" network, with diminishing returns. Re-using existing weights lowers the amount of compute you need to get to level X.
From what I remember, the typical rate for new, debugged C++ code is about 20-25K lines per year: lines that are non-blank, non-comment, and not completely verifiable by the compiler. E.g., a standalone bracket, comma, or semicolon is not a line of code, and a function header is not one either, but computations, conditions, and loops are. This is from old IBM statistics; I learned about it circa 2007.
If we assume 50 weeks per year, this gives about 400-500 lines of code per week. Even at a long average of 65 characters per line, that comes to no more than 33K bytes per week. Your comment is about 1250 bytes long; if you write four such comments per day for a whole week, you would exceed that 33K-byte limit.
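Spelling the arithmetic out (the figures are the ones quoted above):

```python
lines_per_year = 25_000      # upper end of the quoted IBM figure
weeks_per_year = 50
chars_per_line = 65          # a long average line

lines_per_week = lines_per_year / weeks_per_year   # 500.0
bytes_per_week = lines_per_week * chars_per_line   # 32500.0, ~33K

comment_bytes = 1250
comments_per_week = 4 * 7 * comment_bytes          # 35000

print(bytes_per_week, comments_per_week)  # 32500.0 35000
```

So four 1250-byte comments a day, seven days a week, indeed slightly exceeds the weekly byte budget of debugged code.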
I mean this genuinely and in good faith in case you didn’t already know it: the term for “non-blank, non-comment…” in programming is usually “Source Lines Of Code” or SLOC.
In what way? You're either very young or very old, right? Voice-to-text has been a common way to input text online since the iPhone. Someone commenting on HN doesn't mean they typed that many words with their fingers.
I strongly believe you can use voice-to-text for coding.
If the person I replied to does use voice-to-text, their mention of carpal tunnel syndrome is moot, and this is amusing. If they do not use voice-to-text, it is still amusing in the sense of my previous comment.
Or, you know, it's far easier to input natural language with voice-to-text than coding with voice-to-text, so even if they can write long comments on HN, coding is still a problem?
Nah, impossible. They must be making up their carpal syndrome because nothing is ever real.
Yes, I checked their history of comments before posting. It made me confident that I hit the right note.
My software engineering experience spans almost 37 years now (December will be the anniversary), six to seven years more than the median age of Earth's human population. I had two burnouts during that time, but no carpal tunnel syndrome symptoms at all. When I code, I prefer to factor subproblems out; it reduces typing and support costs.
> I think over-thinking is only solved by thinking more, not less.
Despite "thinking" tokens being determined by the preceding tokens, they are still drawn from some probability distribution, just a complex one. This means that at each token-selection step there is a probability P_e of an error, i.e., of selecting a wrong token.
These errors compound: the probability of selecting at least one wrong token over N steps is 1-(1-P_e)^N, which approaches 1 as N grows.
The shorter "thinking" is, the less is the probability of it going astray.
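A minimal sketch of the compounding, assuming an independent per-token error probability (a simplification; real sampling errors are not independent):

```python
def p_any_error(p_e, n):
    """Probability of sampling at least one 'wrong' token in n steps,
    assuming an independent per-token error probability p_e."""
    return 1 - (1 - p_e) ** n

for n in (100, 1_000, 10_000):
    print(n, round(p_any_error(0.001, n), 3))
# 100 0.095
# 1000 0.632
# 10000 1.0
```

Even a 0.1% per-token error rate makes at least one bad token near-certain over a long enough chain, which is the argument for shorter thinking traces.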
> The shorter "thinking" is, the less is the probability of it going astray
As long as the error introduced by more steps is less than the compounding error of sub-optimal token sampling, I would expect a better result.
I think your choice of "wrong" is extreme, suggesting such a token can catastrophically spoil the result. The modern reality is more that the model is able to recover.