They also sometimes flag stuff in their reasoning and then think themselves out of mentioning it in the response, when it would actually have been a very welcome flag.
This can result in some funny interactions. I don't know if Claude will say anything, but I've had some models act "surprised" when I commented on something in their thinking, or even deny saying anything about it until I insisted that I can see their reasoning output.
AI-assisted, I can see. I believe it doesn’t have to be that way, though. If you use AI as a grounding tool - essentially something that can take your stream of consciousness and parse it into a series of concrete and pointed search terms for real-time research, instead of falling back on what’s in the weights - then it’s honestly hard to think of a technology in the history of the species with the potential to be more useful: it gives you much more direct access to both your unknown unknowns and your unknown knowns.
That is, of course, provided that you pay attention to whether it actually does the research. In their current state, LLMs are practically useless for this purpose for the vast majority of users, as no one knows how they work, what to watch out for, what the failure modes look like, or how to tell nonsense from facts when both are presented with an equal amount of conviction. That’s not a user problem, it’s an education problem.
> Jai Das, president of investment firm Sapphire Ventures (who has no stake in either company), told the FT he saw OpenAI as “the Netscape of AI,” a reference to the once-dominant browser that was overtaken by Microsoft and eventually absorbed by AOL.
One can only hope and pray, I’d say. May they be absorbed by a company with just as much staying power as AOL.
I’m more of a prosumer than a professional, but when I look for sounds, I look for individual ones; never for packs. What I’d appreciate more than anything else is the choice of either buying individual sounds for less money, or loading up on a sub or credits if I have more of a bulk need.
Basically, look at FL Cloud and do exactly what they’re doing, haha. Image-Line is the prime example of a company worth trusting, and they get to reap the rewards of that trust as a result.
The point of an encyclopedia is that you can visit a very specific page under a very specific name and receive information that you know has been vetted and properly researched. You get precisely zero of any of this with an LLM, so they just seem like they’re fundamentally the wrong tool to even consider something like this for.
Assuming widespread adoption, what’s to stop this from turning into a closed “elite club” walled garden? I’ve found that oftentimes, as soon as you attempt to distribute the decision of who’s to be trusted, you’re definitely going to end up filtering for something, but it’ll rarely be what you initially intended to filter for.
It works better in maintainer-led allow/trustlists because the people who need the trust hold all the levers that decide what “trust” means. But when you distribute those levers to everyone, it makes sense that it’d eventually drift into almost a self-defined system.
I agree that this is an issue. We are currently trying a combination of two things. First, every human can verify themselves via the passport, and this comes with some "voting rights". This way, we keep it open. Second, there is a per-person, per-content limit on voting power. This aims to prevent some people from becoming too powerful. The exact limit needs to be adapted over time. But yeah, we haven't found the perfect solution here, and we need to keep working on it and improving it. That's also why we are looking for as much feedback as possible.
If I’m reading this right, this is pretty wild. They turned a Qwen autoregressor into a diffuser by using a bunch of really clever techniques, and they vastly outperform any “native diffuser,” actually being competitive with the base model they were trained from. The obvious upside here is the massive speedup in generation.
And then through a LoRA adapter, you can ground the diffuser on the base model’s distribution (essentially have it “compare” its proposals against what the base model would’ve generated), which effectively means: exact same byte-for-byte output for the same seed, just roughly twice as fast (which should improve even more for batched tasks).
I’m not an expert, more of a “practicing enthusiast,” so I might be missing something, but at first glance, this reads super exciting to me.
I think your excitement is justified. The paper is claiming a serious bridge between AR quality and parallel decoding, and the lossless LoRA-assisted mode is the wildest part.
Because of the nature of transformers, running a bunch of pregenerated tokens through them is a parallel operation, not an autoregressive one. That's how it works at training time, and speculative decoding uses it at inference time. So if you just want to check whether a set of known tokens is "likely" given the base model, you can run them all through at once and get probability distributions, no need to sample.
It's the same reason there's a difference in speed between "prompt processing" and "generation". The former is just taking the pre-generated prompt and building the KV cache, which is parallel rather than autoregressive, and therefore way faster.
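A toy sketch of the idea (my own illustration, not from the paper — a fixed bigram table stands in for a real transformer): each position's probability depends only on its prefix, so known tokens can all be scored together in one batched pass, and the result is identical to scoring them one at a time.

```python
# Toy illustration: a stand-in "model" given by a fixed bigram table.
# The point: each position's probability depends only on the prefix,
# so known tokens can all be scored in one batched pass instead of
# one forward pass per token.

BIGRAMS = {
    "what": {"is": 0.9, "was": 0.1},
    "is":   {"2+2": 0.5, "it": 0.5},
    "2+2":  {"it": 0.8, "equals": 0.2},
}

def next_token_probs(prefix):
    # Stand-in for one transformer forward pass at the last position.
    return BIGRAMS.get(prefix[-1], {})

def score_sequentially(tokens):
    # Autoregressive style: one "forward pass" per position.
    out = []
    for i in range(1, len(tokens)):
        out.append(next_token_probs(tokens[:i]).get(tokens[i], 0.0))
    return out

def score_in_parallel(tokens):
    # Batched style: every prefix is scored "at once" (simulated here);
    # with a real transformer this is a single forward pass.
    return [next_token_probs(tokens[:i]).get(tokens[i], 0.0)
            for i in range(1, len(tokens))]

seq = ["what", "is", "2+2", "it"]
assert score_sequentially(seq) == score_in_parallel(seq) == [0.9, 0.5, 0.8]
```

With a real model the "parallel" path is where the speedup lives: one big matrix-matrix pass instead of many matrix-vector passes.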
I haven't read TFA yet but a common technique is speculative decoding where a fast draft model will generate X tokens, which are then verified by the larger target model. The target model may accept some Y <= X tokens but the speedup comes from the fact that this can be done in parallel as a prefill operation due to the nature of transformers.
So let's say a draft model generates 5 tokens, all 5 of these can be verified in parallel with a single forward pass of the target model. The target model may only accept the first 4 tokens (or whatever) but as long as the 5 forward passes of the draft model + 1 prefill of the target model is faster than 4 forward passes of the target, you will have a speedup while maintaining the exact output distribution as the target.
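A greedy-matching sketch of that accept/verify loop (my own toy code, not from any library; real speculative decoding accepts/rejects via rejection sampling over the two distributions to preserve the target's exact output distribution, and the target's checks happen in one prefill pass rather than a Python loop):

```python
def verify_draft(target_next, prefix, draft):
    # target_next(seq) -> the token the target model would emit next.
    # In a real system all len(draft) positions are checked with a single
    # prefill pass of the target; the loop here just shows the rule.
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft token matches: keep it
        else:
            accepted.append(expected)  # mismatch: take the target's
            break                      # token and discard the rest
    return accepted

# Toy deterministic "target model": next token from a lookup table.
STEP = {"a": "b", "b": "c", "c": "d", "d": "e"}
target_next = lambda seq: STEP[seq[-1]]

# Draft got the first two tokens right, then went off the rails:
# the target still yields three usable tokens from one verification.
assert verify_draft(target_next, ["a"], ["b", "c", "x", "y"]) == ["b", "c", "d"]
```

Note that even on a full rejection you get one valid token out of the verification pass, so the scheme never produces less than plain decoding would.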
Same reason why prompt processing is faster than text generation.
When you already know the tokens ahead of time, you can calculate the probabilities of all of them in one batched pass, yielding significant bandwidth savings. This won't work if you're already compute bound, so people with Macs etc. won't get as much benefit from this.
Are Macs/etc compute bound with their 'it fits in unified memory' language models? Certainly by the time you're streaming weights from SSD you must be back in a bandwidth-bound regime.
From what I understood, if we’re talking a single user on a Mac (not batching), you’re rarely compute bound in the first place. More rows per pass are nearly free when the cores would otherwise be sitting idle anyway.
If that’s wrong I would certainly appreciate being corrected, though. But if it’s right, a 2.9x speed-up after rejected tokens, nearly for free, sounds amazing.
That will depend on the model, but they'll hit compute limits before a typical GPU in almost all cases. Macs will still see a speedup from this, just not one as big as the one reported.
Eh. There is nothing diffusion about this. Nothing to do with denoising. This setup is still purely causal, making it quite a dishonest framing IMO. There is no more introspection here than what happens in MTP + SD setups.
Let me explain what is going on here. This is basically a form of multi-token prediction, plus speculative decoding at inference. See my earlier post[1] to understand what that is. TL;DR, in multi-token prediction you train separate LM heads to predict the next token, the next-to-next token, and so on, up to a chosen k-th next token. Training multiple LM heads is expensive and can be unnecessary, so what people typically do is share a common base for all k heads, explained further in [1]. These guys do another variant.
Here is what they do mechanically, given a sequence p consisting of five tokens PE([p1, p2, p3, p4, p5]). Where PE(.) adds relative position info to each token.
1. Create an augmented sequence PE([p1 MASK MASK MASK MASK]). Do a training pass on that with the ground truth sequence p1..p5. Here it is trained, for example, to predict p3 given p1+pos=-2 MASK+pos=-1 MASK+pos=0, loosely notating.
2. Then separately[2], train it as usual on PE([p1 p2 p3 p4 p5]).
Step (1) teaches it to do multi-token prediction, essentially the single LM head will (very very loosely speaking) condition on the position `k` of the special MASK token and "route" it to the "implicit" k'th LM head.
Step (2) teaches it to be a usual LLM and predict the next token. No MASK tokens involved.
So far, you have trained a multi-token predictor.
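The construction of step (1)'s training pairs can be sketched like so (toy code, my own illustration of the setup described above, not the paper's implementation):

```python
MASK = "<mask>"

def mtp_training_pair(tokens, n_keep):
    # Step (1): keep the first n_keep tokens and mask out the rest.
    # The targets are still the original tokens, so the single LM head
    # must predict the k-th next token purely from the MASK's position.
    inputs = tokens[:n_keep] + [MASK] * (len(tokens) - n_keep)
    return inputs, tokens

inp, tgt = mtp_training_pair(["p1", "p2", "p3", "p4", "p5"], n_keep=1)
assert inp == ["p1", MASK, MASK, MASK, MASK]
assert tgt == ["p1", "p2", "p3", "p4", "p5"]
```

Step (2) is then just the ordinary pair `(tokens, tokens)` with no masks involved.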
Now, during inference:
You use this for speculative decoding. You generate 5 tokens ahead at once with MASK tokens. And then you run that sequence through the LLM again. This has the same benefits as usual speculative decoding, namely that you can do matrix-matrix multiplication as opposed to matrix-vector. The former is more memory-bandwidth efficient due to higher arithmetic intensity.
here is an example,
query = ["what", "is", "2+2"]
prompt = PE([...query, MASK*3])
you run output = LLM(prompt). Say output is ["what", "is", "2+2", "it", "is", "4"]. Note that the NN is trained to predict the k-th next token when faced with positionally encoded MASK tokens, so you get all 3 in one go. To be precise, it learns to predict "4" given ["what", "is", "2+2", MASK, MASK]. Since it does not need the "it" and "is" explicitly, generating "4" can happen in parallel with generating the "it" and the "is". "is" is predicted given ["what", "is", "2+2", MASK], for example, and that also doesn't depend on the explicit "it" being there, so it can likewise be done in parallel with generating "it", which is just ordinary next-token prediction given the query. You then use this as a draft in your speculative decoding setup.
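That one-pass draft step can be sketched like this (toy code, mine; `toy_step` is a hypothetical stand-in for the trained position-conditioned predictor, not anything from the paper):

```python
MASK = "<mask>"

def draft_with_masks(model_step, query, k):
    # One simulated "forward pass": the model sees query + k MASK tokens
    # and fills every masked position independently, in parallel.
    padded = query + [MASK] * k
    return [model_step(padded, pos) for pos in range(len(query), len(padded))]

# Hypothetical learned behaviour: the "implicit k-th head" answers the
# query based on how far past the last real token each MASK sits.
COMPLETION = {0: "it", 1: "is", 2: "4"}

def toy_step(padded, pos):
    first_mask = padded.index(MASK)
    return COMPLETION[pos - first_mask]

assert draft_with_masks(toy_step, ["what", "is", "2+2"], k=3) == ["it", "is", "4"]
```

The drafted tokens then go through a normal verification pass, exactly as in standard speculative decoding.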
Their claim is that using a multi-token predictor this way as a draft model works really well. To be clear, this is still causal; the reason diffusion models have hype is that they are capable of global refinement, and this is not. In the same thread as [1], I explain how increasing the number of MASK tokens, i.e. increasing `k`, the number of tokens you predict at once, quickly leads to poor quality. This paper agrees with that: they try out k=2,3,4,8 and see a drop in quality already at 8. So finally, this is 4-token prediction with self-speculative decoding (sans LayerSkip or the like), removing seemingly no existing limitation of such setups. It is definitely an interesting way to train MTP, though.
[2] Note that it is computationally a single forward pass. Attention masks help you fuse steps 1 and 2 into a single operation. However, you still have 2 separate loss values.
After trying to understand their method, I think you're right. Doesn't seem like anything that I would personally call "diffusion". Much closer to MTP + speculative decoding.
Then again, their results with it are great. It would be interesting to benchmark it against standard SD on a model that already uses MTP.
Yeah, I think it's a super neat way to do MTP. Conceptually much more pleasing and simple than existing methods. Especially since this way scaling `k` as models get better will be easier. Wish it had been presented as such.
This reminds me a lot of the tricks to turn BERT into a generative model. I guess the causal masking that keeps it essentially autoregressive is an important difference, though. Kind of best of both worlds.
> More likely, advertisers will need you to insert a “bootloader” that fetches their code and passes it to eval().
Sounds like legal precedent waiting to be set. “Run our code so that it looks like your code, acts like your code, and has all the same access as your code” seems like it should be a slam dunk if said code ends up doing a Very Bad Thing to your visitors.
But of course that’s assuming common sense, and the law’s relationship with that isn’t always particularly apparent.
There is already plenty of precedent for real-time-served ads that are annoying, or malicious, or install malware, or outright exploit vulnerabilities in the browser.