This weekend I was reading this paper on programming the Cerebras wafer scale engine, https://arxiv.org/html/2405.07898v1 . Data movement is the expensive part of computing, and some algorithms like stencils only require nearest-neighbor data movement per cycle. Cerebras wafers have very low-energy data transfer between neighboring processing elements on the same wafer, so the authors came up with a language called Tungsten that focuses on this exchange primitive in the kernel programming model.
I thought the challenge of programming 100,000s of cores using a mesh would be interesting so I wrote a simulator, simple compiler, and a few simple kernels for the wafer scale engine using publicly available documents.
I'm used to CUDA, so I asked: "How would you map something like CUDA onto a machine like this?" Well, I use something like malloc to allocate global memory, memcpy to move data between host and device memory, and a queue of thread block launches. This time, though, communication within a thread block uses nearest-neighbor send/recv instructions instead of shared memory on a streaming multiprocessor. This is inspired by the stencils in Tungsten.
The whole program is made up of a bulk synchronous kernel of many thread blocks.
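To make the model above concrete, here is a minimal Python sketch of one bulk synchronous step of a 3-point averaging stencil. It is not the simulator's code: the names `stencil_step` and `launch`, the zero-valued boundary, and the 1D layout are all assumptions made for illustration. Each list element stands for one processing element, and each list is one thread block.

```python
from typing import List


def stencil_step(block: List[float]) -> List[float]:
    """One bulk synchronous step of a 3-point averaging stencil on one block.

    Hypothetical model: block[i] is the value held by processing element i.
    """
    # Send/recv phase: every element receives the value of its left and
    # right neighbor. Elements on the edge of the block receive 0.0
    # (an assumed boundary condition, chosen only for this sketch).
    from_left = [0.0] + block[:-1]
    from_right = block[1:] + [0.0]
    # Compute phase: runs only after all exchanges are done, which is the
    # bulk synchronous part. No element reads beyond its direct neighbors.
    return [(l + c + r) / 3.0 for l, c, r in zip(from_left, block, from_right)]


def launch(blocks: List[List[float]], steps: int) -> List[List[float]]:
    """Run `steps` stencil steps over a queue of independent blocks."""
    for _ in range(steps):
        # Blocks never exchange data with each other, so the order in
        # which they are processed does not matter.
        blocks = [stencil_step(b) for b in blocks]
    return blocks


if __name__ == "__main__":
    # Blocks can have different sizes; nothing in the model fixes them.
    print(launch([[0.0, 3.0, 0.0], [6.0]], steps=1))
```

Since the only communication is between direct neighbors, the same `stencil_step` works unchanged for a block of one element or of many thousands.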
I think it is interesting because CUDA has some hard limits on thread block sizes, but this mesh perspective lets you grow or shrink the blocks significantly.
Note that some information about Cerebras wafer scale engines, like the ISA, is not public (as far as I know). In this code, I just guessed what it could be.
So this should not be taken as a faithful or accurate simulation of the wafer scale engine. It is more like a point in the design space that is similar in that it includes a wafer-sized mesh of processing elements.
I sincerely wonder why. Chinese censorship is only really relevant if you're doing anti-China stuff, which is to say never, while the Western kind of model censorship (a combination of copyright and general fairness) is something everyone's had to work around at least once, even if just for writing an interesting story.
It’s about enterprises who care about supply chain risk and having a throat to choke if they have a problem.
Here’s a real example.
I’m in a design meeting talking about a model use case. We have a question about the data pipeline or the prompt format that would benefit from knowing about how the model was trained. The enterprise team lead calls the dev tech engineer from the company who produced the model. He is already in the office and walks into the meeting to answer the question.
https://split-brain-ui.scalarxlm.com/docs/clients
I expect Claude to train on my general tokens. I train my own model on my IP related tokens.