Why surely? E.g. on GPU-style chips today you'd tile to the warp size (or half of it, or something like that) to target warp-cooperative primitives (the so-called tensor cores). On AMD and NV that's something like 16x16 or 32x32, depending on the data type. That's not that far from 4x4, and it's not like every chip in the world has 32-lane warps. Anyway, if a trick is good enough and people start trying to shoehorn it into everything (not saying this one is), then the next gen (or the one after that, or the one after) will just have units to support the trick (it's an expensive way to gain ground on MLPerf rankings).
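To make the tiling point concrete, here's a rough pure-Python sketch of a matmul broken into tile-sized multiply-accumulates. The names `matmul_tiled` and `tile_ops` are made up for illustration, and the 16x16 default is just the example size mentioned above, not a claim about any specific chip's instruction shape.

```python
def matmul_tiled(a, b, tile=16):
    """Multiply a (m x k) by b (k x n), given as lists of lists,
    one tile x tile block at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # One tile-level multiply-accumulate: roughly the unit of
                # work a warp-cooperative matrix instruction would cover.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


def tile_ops(m, n, k, tile=16):
    """Count the tile-level multiply-accumulates needed, rounding each
    dimension up to a whole number of tiles."""
    def tiles(x):
        return -(-x // tile)  # ceiling division
    return tiles(m) * tiles(n) * tiles(k)
```

With `tile=16`, a 4x4x4 problem still costs one full tile op (`tile_ops(4, 4, 4)` is 1), so most of that tile is padding, while a 32x32x32 problem takes 8.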
Blackwell is doing matrix multiplies across thread blocks these days, and doesn’t understand anything smaller than 64x16x16. I assume large matrix multiplies are using the bigger sizes, like 128x256x16 or whatever.