Except that's not how it played out. Through Maxwell, every Tesla GPU was just a different binning of its GeForce counterpart, usually clocked lower for passive cooling and better stability (sure, the K80 is an exception, but the K80 is a strange and tragic evolutionary dead end IMO).
Tesla P100 was the first real HW-level divergence, with its 2x-rate FP16 support. But because we still can't have nice things, the GTX 1080 was the first GPU with fast INT8/INT16 instructions, followed by the mostly identical, but much more expensive, Tesla P40. So we ended up with the marchitecture nonsense that the P100 was for training while the P40 was for inference, despite the two chips being mostly identical apart from those instructions.
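For a sense of what that divergence actually buys you: the fast INT8 path on sm_61 parts (GTX 1080, P40) is exposed through the __dp4a intrinsic, which GP100 (sm_60) simply doesn't have. A minimal sketch (kernel name and launch shape are mine, not from any NVIDIA sample):

```cuda
// Build with: nvcc -arch=sm_61 dp4a_demo.cu
// 4-way INT8 dot product with INT32 accumulate: a single instruction
// on sm_61 (GTX 1080 / P40), not available on sm_60 (P100).
__global__ void dot_i8(const int* a, const int* b, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each int packs four signed 8-bit values; __dp4a computes
        // a0*b0 + a1*b1 + a2*b2 + a3*b3 + acc in one op.
        out[i] = __dp4a(a[i], b[i], 0);
    }
}
```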
I'll assume Volta unifies INT8/INT16/FP16? And I think it's OK if the Tesla card has higher tensor core performance, but if the tensor cores on GeForce are slower than its native FP16 path, I can only conclude NVIDIA now hates its own developers and has decided to sniff its own exhaust pipe. Isn't having to refactor all existing warp-level code for Volta's independent thread scheduling (threads within a warp no longer in lockstep) enough complication for one GPU generation?
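To make the refactoring complaint concrete, here's what the change looks like for a bog-standard warp reduction; a sketch assuming CUDA 9's _sync intrinsics, not anything out of NVIDIA's own samples:

```cuda
// Pre-Volta warp-synchronous code relied on implicit lockstep:
//   v += __shfl_down(v, offset);
// With independent thread scheduling, every warp intrinsic needs an
// explicit participation mask, i.e. the _sync variants.
__device__ float warp_sum(float v)
{
    const unsigned full = 0xffffffffu;  // all 32 lanes participate
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(full, v, offset);
    return v;  // lane 0 ends up holding the warp total
}
```

Multiply that by every warp-synchronous kernel ever written and you see why one disruption per generation is plenty.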
Also, if consumer Volta ends up with craptastic FP16 support (à la the 1/64-rate FP16 on GP102 vs. GP100, slower than just emulating it with FP16 loads and FP32 math), NVIDIA will create a genuine opening for AMD to become the other GPU provider in deep learning.
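For reference, the emulation I mean is the usual FP16-storage/FP32-math pattern on the 1/64-rate parts; a sketch (kernel name mine) of the workaround consumer Volta had better not force on us again:

```cuda
#include <cuda_fp16.h>

// On GP102-class parts, native half arithmetic runs at 1/64 rate, so
// keep tensors in half for bandwidth but do the math in full-rate FP32.
__global__ void axpy_fp16_storage(int n, float alpha,
                                  const __half* x, __half* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xf = __half2float(x[i]);   // FP16 load, widen to FP32
        float yf = __half2float(y[i]);
        yf = alpha * xf + yf;            // full-rate FP32 math
        y[i] = __float2half(yf);         // narrow back for storage
    }
}
```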