
It does seem to match transformers, but I wouldn't say it meaningfully outperforms them in terms of quality vs parameters.

Model: #Params, SlimPajama (15B tokens) perplexity (lower is better)

- GPT-3: 356M, 14.26

- Llama: 407M, 14.25

- H3: 420M, 18.23

- Mamba: 423M, 13.70

- Hyena: 435M, 17.59

- RWKV-4: 430M, 15.62

- RWKV-5: 456M, 16.53

- RWKV-6: 442M, 17.40

- RetNet: 431M, 16.23

- HGRN: 411M, 21.83

- GLA: 412M, 19.56

- HGRN2: 411M, 16.77

- xLSTM[1:0]: 409M, 13.43

- xLSTM[7:1]: 408M, 13.48

There are more detailed perplexity and task benchmarks in the paper. Overall, all the architectures perform very similarly on every benchmark; xLSTM is sometimes slightly ahead, but not always, and the difference is not really meaningful.
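For a rough sense of how big those gaps are, here is a quick sketch that computes relative perplexity differences using only numbers from the list above. The `ppl` dictionary and the `relative_gap` helper are illustrative names, not anything from the paper.

```python
# Perplexities copied from the list above (SlimPajama; lower is better).
ppl = {
    "GPT-3": 14.26,
    "Llama": 14.25,
    "Mamba": 13.70,
    "xLSTM[1:0]": 13.43,
    "xLSTM[7:1]": 13.48,
}


def relative_gap(baseline: str, candidate: str) -> float:
    """Percent by which the candidate's perplexity is lower than the baseline's."""
    return 100.0 * (ppl[baseline] - ppl[candidate]) / ppl[baseline]


for baseline in ("Llama", "Mamba"):
    gap = relative_gap(baseline, "xLSTM[1:0]")
    print(f"xLSTM[1:0] vs {baseline}: {gap:.1f}% lower perplexity")
```

The gaps come out to a few percent at most, and this says nothing about run-to-run variance, which the list does not report.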

This is great news, though: it means we are not losing anything by switching to xLSTM, and we get important advantages like the scalable context window.

I'm quite excited about this because we can potentially have the LLM remember what you say and do few-shot persistent learning from user interaction (updating "itself", the state vector). It would be very interesting if LLMs were no longer static. Although I'm sure it will be a challenge to train the model to keep such learnings in its memory long-term.

The paper: https://arxiv.org/abs/2405.04517



> It would be very interesting if LLMs were no longer static.

A little bit of a nightmare too. Instructions keep piling up that you can no longer openly access and remove.


Linear scaling with context length is also a big deal. Flash attention partially solved this for transformers, but xLSTM seems promising!
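To put rough numbers on the scaling difference, here is a toy cost model. It only counts operations and is not how xLSTM or any attention kernel is actually implemented; both function names are made up for illustration. Causal self-attention scores every token against itself and all earlier tokens, while a recurrent model does one fixed-size state update per token.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Token pairs scored by causal self-attention over a whole sequence.

    Token t attends to itself and the t-1 tokens before it, so the total
    is 1 + 2 + ... + n = n * (n + 1) / 2, which grows quadratically.
    """
    return n_tokens * (n_tokens + 1) // 2


def recurrent_step_count(n_tokens: int) -> int:
    """State updates for a recurrent model: one per token, so linear."""
    return n_tokens


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: attention pairs = {attention_pair_count(n):>13,}, "
          f"recurrent steps = {recurrent_step_count(n):>7,}")
```

Flash attention reduces the memory traffic of attention but leaves the number of pairs unchanged, which is why it only partially addresses the problem.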



