"GPU-accelerated" is a poor name because it actually uses a hardware video decoder (a small CPU+DSP) that just happens to be on the GPU. In fact, video encoding/decoding is very poorly suited to GPUs, and CPUs aren't that bad at it although they aren't very power-efficient.
Indeed, video decode is offloaded to dedicated silicon blocks on the GPU (e.g. AMD UVD/VCN or Nvidia NVDEC) rather than running entirely on compute blocks.
Implementation-wise though, I don't think hardware decoders would use general-purpose CPU cores under the hood (someone correct me if I'm wrong; it's been a long time since I've worked on this stuff), but rather dedicated fixed-function blocks. On hardware without a full fixed-function decoder, I would expect it to fall back to the compute blocks, with fixed-function blocks for partial acceleration. Some other implementations might make use of customized CPU cores, like those from Tensilica. General-purpose CPU cores might instead be used for things like managing HDCP, etc.
H.264 decoders have to be bit-exact, so there's no difference in the output they produce, but the hardware ones use a lot less battery to do it, and they free the main CPU to do other things.
Just to spell out what this means for those who are unfamiliar: since software decoding is general-purpose code, decoders are available that can handle almost any valid H.264 video file. Hardware decoders, on the other hand, are purpose-built and frequently handle only a subset of valid files. For instance, a decoder might (1) only handle levels below 5, as used by Blu-ray, (2) only support the most common bit depth of 8, (3) only support 4:2:0 chroma-subsampled video and not 4:4:4. And so on.
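To make that concrete, here is a minimal sketch (assuming ffprobe is on your PATH) that probes a file and checks it against some illustrative limits before choosing the hardware path; the limits below are examples, not the spec sheet of any particular decoder:

```python
# Sketch: probe a file with ffprobe and decide whether a (hypothetical)
# fixed-function decoder with typical limits could handle it.
import json
import subprocess

def can_use_hw_decoder(path):
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_streams", "-select_streams", "v:0", path],
        capture_output=True, text=True, check=True).stdout
    stream = json.loads(out)["streams"][0]
    # Illustrative limits: 8-bit 4:2:0 only, levels below 5 (Blu-ray territory).
    return (stream.get("pix_fmt") == "yuv420p"
            and float(stream.get("level", 99)) / 10 < 5.0)

print("hardware decode ok" if can_use_hw_decoder("input.mp4") else "fall back to software")
```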
I do have some experience on the Linux side of this.
You can compile ffmpeg with support for a bunch of hardware encoders such as nvidia's nvcuvid/nvenc and intel's quicksync and so on. The build script ends up being perhaps a hundred lines, so you want to make sure you don't rebuild it too often and keep working binaries around.
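One way to guard against a stale build (just a sketch) is to ask the binary itself what it was built with before kicking off a long job; `-hwaccels` and `-encoders` are standard ffmpeg options:

```python
# Sketch: query an ffmpeg binary for the hardware pieces it was built with.
import subprocess

def ffmpeg_supports(binary="ffmpeg"):
    hwaccels = subprocess.run([binary, "-hide_banner", "-hwaccels"],
                              capture_output=True, text=True).stdout
    encoders = subprocess.run([binary, "-hide_banner", "-encoders"],
                              capture_output=True, text=True).stdout
    return {
        "cuda": "cuda" in hwaccels,
        "qsv": "qsv" in hwaccels,
        "h264_nvenc": "h264_nvenc" in encoders,
        "hevc_nvenc": "hevc_nvenc" in encoders,
    }

print(ffmpeg_supports())
```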
The speedup on nvidia cards is amazing, for both h.264 and h.265 (no experience with others). Note that nvidia cards do have two distinct limits: the number of hardware encoding engines on your card and the number of simultaneous streams they will allow. quadro / professional cards often allow unlimited streams but have similar numbers of encoders, so they are more flexible but not necessarily faster.
At least from my testing on Pascal-generation cards, simultaneous encodes are not faster than serial ones because the encoding throughput is split between them. So if you're batching, go ahead and do it serially to keep things simple.
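Something like this is what I mean by batching serially; h264_nvenc is the usual ffmpeg name for the NVENC encoder, and the file names are just placeholders:

```python
# Sketch: encode a batch of files one at a time with NVENC (h264_nvenc),
# since parallel encodes on the same engine just split its throughput.
import pathlib
import subprocess

def encode_serial(files, outdir="encoded"):
    pathlib.Path(outdir).mkdir(exist_ok=True)
    for src in files:
        dst = pathlib.Path(outdir) / (pathlib.Path(src).stem + ".mp4")
        subprocess.run(
            ["ffmpeg", "-y", "-hwaccel", "cuda", "-i", str(src),
             "-c:v", "h264_nvenc", "-c:a", "copy", str(dst)],
            check=True)

encode_serial(["clip1.mkv", "clip2.mkv"])  # hypothetical inputs
```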
If you're doing machine learning on video, there is a nifty library called "decord" which does GPU-accelerated decoding and zero-copy reading of frames into your DL framework. That can be nice, but remember that (1) your video samples may blow up the GPU RAM if the file is too large and (2) keeping lots of video/frames on the GPU limits the memory available to your networks. In addition, it may actually be faster to do decoding on the CPU, because you then use both a fast CPU and the GPU simultaneously. YMMV.
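For reference, a rough sketch of the decord pattern I mean (assuming a decord build with GPU/NVDEC support; check the docs for the exact API of your version):

```python
# Sketch of GPU decoding with decord and handoff of frames to PyTorch.
import decord
from decord import VideoReader, gpu

decord.bridge.set_bridge("torch")            # return frames as torch tensors
vr = VideoReader("sample.mp4", ctx=gpu(0))   # decode on GPU 0

# Grab a sparse batch of frames instead of the whole clip, so the decoded
# frames don't crowd out the memory your networks need.
frames = vr.get_batch(list(range(0, len(vr), 30)))  # every 30th frame
print(frames.shape)  # (n, height, width, 3)
```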
Do you know if you can use Intel's on-CPU video decoders concurrently with an nVidia GPU? At least the last time I tried that on Windows, it wasn't possible to use QuickSync while also having a discrete GPU plugged into the machine, or at least there was no easy way to enable it.
I haven't tried it, and to be honest, I do not see a large need to. The nvidia encoders are typically much faster and have higher quality. My experiments with quicksync were not very favorable (though that was some generations ago).
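If someone does want to test both paths on their own box, a crude approach is to simply try each encoder on a short clip and see which one the driver stack accepts; h264_qsv and h264_nvenc are the standard ffmpeg encoder names, and whether QSV is usable at all alongside a discrete GPU may still depend on BIOS/driver settings:

```python
# Sketch: try both the QuickSync and NVENC encoders on the same input and
# report which ones actually work on this machine.
import subprocess

def try_encoder(encoder, src="input.mp4"):
    result = subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-t", "10",   # encode only 10 seconds
         "-c:v", encoder, "-f", "null", "-"],     # discard the output
        capture_output=True, text=True)
    return result.returncode == 0

for enc in ("h264_qsv", "h264_nvenc"):
    print(enc, "works" if try_encoder(enc) else "failed")
```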
I also first assumed it was gonna be about Linux, since that's the only platform where in-browser hardware video decode never works out of the box.
It sucks so badly that you need to Google tutorials and fiddle with command-line settings, depending on your GPU and drivers (do you use VA-API, VDPAU, or NVDEC?), just to get hardware decode in the browser working (if you even manage to), when this works out of the box even on $99 Chinese Android phones, which, ironically, have Linux underneath.
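For what it's worth, one quick sanity check before fiddling with browser flags is to probe which of those stacks responds at all (this assumes the vainfo / vdpauinfo / nvidia-smi tools are installed):

```python
# Sketch: see which of the usual Linux hardware-decode stacks is present.
import shutil
import subprocess

def probe(cmd):
    if shutil.which(cmd[0]) is None:
        return False
    return subprocess.run(cmd, capture_output=True).returncode == 0

print("VA-API :", probe(["vainfo"]))
print("VDPAU  :", probe(["vdpauinfo"]))
print("NVDEC  :", probe(["nvidia-smi"]))   # nvidia driver present at least
```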
Why does it make sense to compare RGB32 to YUV?
I would’ve thought simple RGB24 to be a more apt comparison in the context below:
With RGB32, you need 32 bits per pixel. With a YUV 4:2:0 format, for 4 pixels you need to store a total of 6 samples, so 6x8 = 48 bits. That is effectively 48/4 = 12 bits per pixel, so only 37.5% of RGB32. That matters.
This was in the context of using the video as a texture. As I tried to explain, the RGB texture formats with 8-bit precision use 32-bit words; there are no 24-bit texture formats available.
NV12 is also a native texture format, so it makes sense to compare these two formats in this context.
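The arithmetic for a single 1080p frame makes the gap concrete:

```python
# Quick arithmetic for one 1920x1080 frame: RGB32/RGBA vs NV12 (YUV 4:2:0).
w, h = 1920, 1080

rgba_bytes = w * h * 4                 # 32 bits per pixel
nv12_bytes = w * h + (w * h) // 2      # full-res Y plane + half-res interleaved UV

print(rgba_bytes, nv12_bytes, nv12_bytes / rgba_bytes)  # ratio is 0.375
```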
The writer mentions ffmpeg as a simple hardware-accelerated video player, but has there ever been a study of how much overhead it adds for low-latency applications compared to a lower-level library such as VA-API?