> Doesn't Language itself encode multimodal experiences
Of course it does. We immediately encode pictures/words/everything into vectors anyway. In practice we don't have great text datasets to describe many things in enough detail, but there isn't any reason we couldn't.
The embedding layer in an llm deals with much more than the meaning. It has to capture syntax, grammar, morphology, style and sentiment cues, phonetic and orthographic relationships and 500 other things that humans can't even reason about but exist in words combinations.
Of course it does. We immediately encode pictures/words/everything into vectors anyway. In practice we don't have great text datasets to describe many things in enough detail, but there isn't any reason we couldn't.