Perturbation of the dataset used for training can introduce adversarial behavior even without adding any other data, and the idea is quite simple: you train on two batches from the dataset and keep the model with the more probable adversarial behavior. The more batches that get processed with this posterior selection, the more probable the adversarial behavior becomes.
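A toy simulation of that selection pressure, with everything hypothetical: the "adversarial score" is just a number, and each candidate checkpoint is modeled as the current score plus independent noise. Repeatedly keeping the higher-scoring of two candidates drifts the score upward even though no step adds anything on average.

```python
import random

def select_by_trait(steps=1000, seed=0):
    """Toy posterior selection: at each step, produce two candidate
    checkpoints (current trait value plus independent noise) and keep
    whichever scores higher on a hypothetical 'adversarial' metric."""
    rng = random.Random(seed)
    trait = 0.0
    for _ in range(steps):
        a = trait + rng.gauss(0, 0.1)  # candidate trained on batch 1
        b = trait + rng.gauss(0, 0.1)  # candidate trained on batch 2
        trait = max(a, b)              # keep the more 'adversarial' one
    return trait

print(select_by_trait())  # drifts far above the starting value of 0.0
```

The max of two zero-mean noisy copies has positive expectation, so selection alone supplies a steady upward push, which is the whole point of the attack.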
By determining whether a model gets better or not on a given benchmark, OpenAI selects models against benchmarks, implicitly using them in training.
The page does a very poor job tokenizing the phrase "Noinceolik fiyulnabmed fyvaproldge" into "Noinceolik fiyulnabm ed fyvaproldge", factoring out only the "ed" suffix. As if made-up words such as "noinceolik" were so common they are part of a 100K-token vocabulary.
Actually applying the GPT-5 tokenizer at [1] to my made-up phrase results in 14 tokens; only two of them are four characters long, and there are tokens containing spaces.
I appreciate the feedback. I did notice that as well, and I thought it was perhaps not worth fixing since I have a link to tiktokenizer. I decided to remove it and just add a more prominent link to tiktokenizer.
Actually, a euphoric mood disorder may make one hear voices telling one to feel great, do good, help all the grandmas of the world across the crossing, etc.
The "focus" and "get back to work" parts are hard, though.
To have some confidence in the consistency of results (a p-value), one has to start from a cohort of around 30, if I remember correctly. That is a 1.5-order-of-magnitude increase in the computing power needed to find (the absence of) consistent changes in an agent's behavior.
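As a sketch of why ~30 runs is roughly the floor: assume each trial is a paired win/loss comparison between two agent variants (an assumption; the actual test depends on the experiment), and apply a two-sided binomial test built from the stdlib. With n = 30 you need about 21 wins before the change is significant at the 5% level.

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Two-sided binomial test p-value: probability, under the null,
    of any outcome at least as extreme (no more probable) than
    observing k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    return sum(x for x in pmf if x <= observed + 1e-12)

# With n = 30 paired comparisons, ~21 wins are needed for p < 0.05:
print(binom_two_sided_p(21, 30))  # ~0.043, significant
print(binom_two_sided_p(19, 30))  # ~0.20, not significant
```

With only a handful of runs, no outcome can reach p < 0.05 at all, which is why the cohort size (and the compute bill) has to grow by an order of magnitude or more.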
I apologize for the potato quality of these links; however, I have been working tirelessly to wrap my head around how to reason about how agents and LLMs work. They are more than just a black box.
The first tries to answer what happens when I give the models harder and harder arithmetic problems, to the point that Sonnet will burn 200k tokens over 20 minutes. [0]
The other is a very deep dive into the math of a reasoning model in the only way I could think to approach it: with data visualizations, seeing the computation of the model in real time in relation to all the parts. [1]
Two things I've learned. One is that the behavior of an agent that reverse engineers a website and the behavior of an agent that does arithmetic are the same: the probability that either will solve its intended task is, for the given agent and task, a distribution. The other is that models have a blind spot: creating a red-team adversarial bug-hunter agent will not surface a bug if the same model originally wrote the code.
Understanding that, and knowing that I can verify at the end or use a majority of votes (MoV), using agents to automate extremely complicated tasks can be very reliable, with a quantifiable amount of certainty.
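A minimal sketch of majority voting over independent agent runs (the `runs` list below is hypothetical):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer across independent agent runs,
    plus the fraction of runs that agreed with it."""
    counts = Counter(answers)
    best, n = counts.most_common(1)[0]
    return best, n / len(answers)

# Hypothetical runs of the same agent on one task:
runs = ["42", "42", "41", "42", "42"]
print(majority_vote(runs))  # ('42', 0.8)
```

If each independent run succeeds with probability above 1/2, the majority over more runs succeeds with strictly higher probability, which is where the extra certainty comes from.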
> The other is that models have a blind spot: creating a red-team adversarial bug-hunter agent will not surface a bug if the same model originally wrote the code.
This is very interesting, if true. It follows that one can generate several instances of the code, choose the one with the bug, and the bug will not be found. Mythos can be used to fool Mythos.
If I may (and if I correctly understand what is going on inside Fil-C), it would not be so hard to add support for software transactional memory by adding some library calls.
This would greatly reduce coordination bugs in parallel programs and may even speed things up.
And yet the KL divergence after changing all that stuff remains remarkably similar between different models, regardless of the specific hyperparameters and block diagrams employed at pretraining time. Some choices are better, some worse, but they all succeed at the game of next-token prediction to a similar extent.
To me, that suggests that transformer pretraining creates some underlying structure or geometry that hasn't yet been fully appreciated, and that may be more reusable than people think.
Ultimately, I also doubt that the model weights are going to turn out to be all that important. Not compared to the toolchains as a whole.
That "underappreciated underlying structure or geometry" may just be an artifact of the same tokenization being used with different models.
Tokenization breaks up collocations and creates new ones that were not always present in the original text. Most probably, the first byte pair found by a simple byte-pair encoding algorithm on enwik9 will be two spaces next to each other. Is that a true collocation? BPE thinks so. Humans may disagree.
What concerns me here is that it is very hard to ablate tokenization artifacts.
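The two-spaces claim is easy to check in miniature: the first merge a naive BPE makes is just the most frequent adjacent byte pair. Below is a sketch run on a toy wiki-ish sample (a stand-in for enwik9, which I have not run this on); the indentation makes the space-space pair win.

```python
from collections import Counter

def most_common_pair(data: bytes):
    """First merge a naive BPE would make: the most frequent
    adjacent byte pair in the data."""
    pairs = Counter(zip(data, data[1:]))
    return pairs.most_common(1)[0]

# Toy stand-in for enwik9: wiki-style markup with indentation runs.
sample = b"  <page>\n    <title>Example</title>\n    <text>Some text.</text>\n  </page>\n"
print(most_common_pair(sample))  # ((32, 32), 8): the pair (space, space)
```

Whether "two spaces" deserves to be a unit of meaning is exactly the kind of question the merge statistics never ask.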
None of that is true, at least in theory. You can trivially change layer size simply by adding extra columns initialized as 0, effectively embedding your smaller network in a larger network. You can add layers in a similar way, and in fact LLMs are surprisingly robust to having layers added and removed - you can sometimes actually improve performance simply by duplicating some middle layers[0]. Tokenization is probably the hardest but all the layers between the first and last just encode embeddings; it's probably not impossible to retrain those while preserving the middle parts.
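The zero-initialized embedding is simple enough to show with a toy linear layer in plain Python (no framework; `matvec` is a stand-in for a real layer's matrix multiply). Padding the weight matrix with zero rows and columns preserves the function on the original inputs, giving training a larger network that starts from the smaller one's behavior.

```python
def matvec(W, x):
    """Plain matrix-vector product standing in for a linear layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# A tiny 2->2 linear layer:
W_small = [[1.0, 2.0],
           [3.0, 4.0]]

# Embed it in a 3->3 layer by padding with zero rows/columns. On inputs
# whose extra coordinate is 0, the widened layer computes the same
# outputs, plus an extra output that stays 0 until training moves it.
W_big = [[1.0, 2.0, 0.0],
         [3.0, 4.0, 0.0],
         [0.0, 0.0, 0.0]]

x = [5.0, 6.0]
print(matvec(W_small, x))        # [17.0, 39.0]
print(matvec(W_big, x + [0.0]))  # [17.0, 39.0, 0.0]
```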
You took the simple path, embedding the smaller network into the larger one. What if you need to reduce the number of layers and/or the width of the hidden layers? How will you embed the larger into the smaller? As for the "addition of same layers": would the process of selecting which layers to add be considered training?
What if you still have to obtain the best result possible for given coefficient/tokenization budget?
I think that my comment expresses the general case, while yours provides some exceptions.
The general case is that our current relative ignorance of the best way to use and adapt pretrained weights is a short-lived anomaly, caused by an abundance of funding to train models from scratch, rapid evolution of training strategies and architectures, and a mad rush to ship hot new LLMs as fast as possible. But even as it is, the things you mentioned are not impossible; they are easy, and we are only going to get better at them.
> What if you need to reduce number of layers
Delete some.
> and/or width of hidden layers?
Randomly drop x% of parameters. No doubt there are better methods that entail distillation but this works.
> would the process of "layers to add" selection be considered training?
Er, no?
> What if you still have to obtain the best result possible for given coefficient/tokenization budget?
We don't know how to get "the best result possible", or even how to define such a thing. We only know how to throw compute at an existing network to get a "better" network, with diminishing returns. Re-using existing weights lowers the amount of compute you need to get to level X.
From what I remember, the typical rate for new, debugged C++ code is about 20-25K lines per year: lines that are non-blank, non-comment, and not completely verifiable by the compiler. E.g., a standalone bracket, comma, or semicolon is not a line of code, and a function header is not one either, but computations, conditions, and loops are. This is from old IBM statistics; I learned about it circa 2007.
If we assume 50 weeks per year, this gives about 400-500 lines of code per week. Even at a long average of 65 characters per line, that comes to no more than 33K bytes per week. Your comment is about 1250 bytes long; if you write four such comments per day for a whole week, you would exceed that 33K-byte limit.
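Spelling the arithmetic out (the figures are the ones quoted above):

```python
lines_per_year = 25_000      # upper end of the quoted IBM figure
weeks_per_year = 50
chars_per_line = 65          # a long average line

lines_per_week = lines_per_year / weeks_per_year   # 500.0
bytes_per_week = lines_per_week * chars_per_line   # 32500.0, ~33K

comment_bytes = 1250
comments_per_week = 4 * 7 * comment_bytes          # 35000

print(bytes_per_week, comments_per_week)  # 32500.0 35000
```

So four 1250-byte comments a day, seven days a week, indeed slightly exceeds the weekly byte budget of debugged code.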
I mean this genuinely and in good faith in case you didn’t already know it: the term for “non-blank, non-comment…” in programming is usually “Source Lines Of Code” or SLOC.
In what way? You're either very young or very old, right? Voice-to-text has been a common way to input text online since the iPhone. Someone commenting on HN doesn't mean they typed that many words with their fingers.
I strongly believe you can use voice-to-text for coding.
If the person I replied to does use voice-to-text, their mention of carpal tunnel syndrome is moot, and this is amusing. If they do not use voice-to-text, it is still amusing in the sense of my previous comment.
Or, you know, it's far easier to input natural language with voice-to-text than coding with voice-to-text, so even if they can write long comments on HN, coding is still a problem?
Nah, impossible. They must be making up their carpal syndrome because nothing is ever real.
Yes, I checked their history of comments before posting. It made me confident that I hit the right note.
My software engineering experience spans almost 37 years now (December will be the anniversary), six to seven years more than the median age of Earth's human population. I had two burnouts during that time, but no carpal tunnel syndrome symptoms at all. When I code, I prefer to factor subproblems out; it reduces typing and support costs.
> I think over-thinking is only solved by thinking more, not less.
Despite "thinking" tokens being determined by the preceding tokens, they are still drawn from some probability distribution, just a complex one. This means that at each token-selection step there is a probability P_e of an error, i.e., of selecting a wrong token.
These errors compound: the probability of selecting at least one wrong token over N steps is 1-(1-P_e)^N, which approaches 1 as N grows.
The shorter "thinking" is, the less is the probability of it going astray.
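A minimal sketch of the compounding, assuming an independent per-token error probability (a simplification; real sampling errors are not independent):

```python
def p_any_error(p_e, n):
    """Probability of sampling at least one 'wrong' token in n steps,
    assuming an independent per-token error probability p_e."""
    return 1 - (1 - p_e) ** n

for n in (100, 1_000, 10_000):
    print(n, round(p_any_error(0.001, n), 3))
# 100 0.095
# 1000 0.632
# 10000 1.0
```

Even a 0.1% per-token error rate makes at least one bad token near-certain over a long enough chain, which is the argument for shorter thinking traces.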
> The shorter "thinking" is, the less is the probability of it going astray
As long as the error introduced by more steps is less than the compounding error of sub-optimal token sampling, I would expect a better result.
I think your choice of "wrong" is extreme, suggesting such a token can catastrophically spoil the result. The modern reality is more that the model is able to recover.