It does seem to match transformers, but I wouldn't say it meaningfully outperforms them in terms of quality vs parameters.
Model: #Params (M), SlimPajama (15B) ppl ↓
- GPT-3: 356M, 14.26
- Llama: 407M, 14.25
- H3: 420M, 18.23
- Mamba: 423M, 13.70
- Hyena: 435M, 17.59
- RWKV-4: 430M, 15.62
- RWKV-5: 456M, 16.53
- RWKV-6: 442M, 17.40
- RetNet: 431M, 16.23
- HGRN: 411M, 21.83
- GLA: 412M, 19.56
- HGRN2: 411M, 16.77
- xLSTM[1:0]: 409M, 13.43
- xLSTM[7:1]: 408M, 13.48
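To put those numbers in perspective, here is a quick sketch that computes how far the best xLSTM result is from the closest baselines. The perplexities are copied from the list above; the helper name `relative_gap` is just something made up for this illustration.

```python
# Perplexities copied from the list above (SlimPajama, 15B tokens; lower is better).
ppl = {
    "GPT-3": 14.26,
    "Llama": 14.25,
    "Mamba": 13.70,
    "xLSTM[1:0]": 13.43,
    "xLSTM[7:1]": 13.48,
}


def relative_gap(baseline: str, candidate: str) -> float:
    """Fraction by which `candidate` lowers perplexity relative to `baseline`."""
    return (ppl[baseline] - ppl[candidate]) / ppl[baseline]


# Model with the lowest perplexity among the ones listed here.
best = min(ppl, key=ppl.get)

for name in ("GPT-3", "Llama", "Mamba"):
    print(f"{best} vs {name}: {relative_gap(name, best):.1%} lower perplexity")
```

On these figures the gap to Llama is roughly 6% and the gap to Mamba roughly 2%, at slightly different parameter counts.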
There are more detailed perplexity and task benchmarks in the paper. Overall, all the architectures perform very similarly on every benchmark. Sometimes xLSTM is slightly ahead, but not always, and the difference is not really meaningful.
This is great news though: it means we are not losing anything by switching to xLSTM, and we get important advantages like the scalable context window.
I'm quite excited about this because we can potentially have the LLM remember what you say and do few-shot persistent learning from user interaction (updating "itself", the state vector). It would be very interesting if LLMs were no longer static, although I'm sure it will be a challenge to train the model to keep such learnings in its memory long-term.
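To make the "state vector" idea concrete, here is a toy sketch of a fixed-size recurrent state carried over from one session to the next. This is not xLSTM's actual cell: the update rule, the constant `forget` value, and the names `step` and `run_session` are all assumptions made up for illustration.

```python
def step(state, x, forget=0.9):
    """One toy gated update: decay the old state, then mix in the new input."""
    return [forget * s + (1.0 - forget) * xi for s, xi in zip(state, x)]


def run_session(state, inputs):
    """Feed a sequence of inputs through the recurrence and return the final state."""
    for x in inputs:
        state = step(state, x)
    return state


# Session 1: the "user" repeatedly shows the model feature 0.
state = run_session([0.0, 0.0], [[1.0, 0.0], [1.0, 0.0]])

# Session 2 starts from the saved state instead of zeros, so the earlier
# interaction is still present, although partially decayed.
state_after = run_session(state, [[0.0, 1.0]])

print(state, state_after)
```

The state stays the same size no matter how many inputs it has seen, which is where the scalable context comes from. The same decay that keeps it bounded also erodes older information, which hints at why long-term retention would be hard.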
The paper: https://arxiv.org/abs/2405.04517