Having spent a good bit of time in HPC from a practical and academic perspective, I'll second that claim.
One example I like to bring up is the textbook hotplate, where the temperature for a single cell is equal to the average of its temperature and the temperatures around it. Each iteration is embarrassingly parallel, but the result of the iteration is necessary to proceed. Instead of synchronizing the entire cluster between iterations, each node computes the results for an extra N cells in each direction, allowing N iterations to occur before synchronization is necessary (after the first iteration, the outermost cells are inaccurate, after the 2nd iteration, the outer two layers are inaccurate, etc. Eventually, you want to sync before the cells we care about are affected). This leads to massively diminished returns as you break the problem into smaller chunks. But even for smaller chunks, doubling or tripling the work per node is still faster than synchronizing.
My sole contribution to HPC was inventing a graph algorithm (circa 2009) for breadth-first search that didn't require the usual iterated barrier synchronization super-step you have with BSP algorithms. Instead, every node would run free with almost no synchronization but a clever error correction mechanism backed out all the incorrect traversals for a completely negligible cost (both space and time) relative to the ability to let the computation run as though it was embarrassingly parallel. It basically allowed every node to always be doing (mostly) constructive work without waiting on another node.
But yeah, burning a bit of CPU to eliminate synchronization is frequently a huge win. This is where knowing how much compute you can do within a given latency window is helpful.
One example I like to bring up is the textbook hotplate, where the temperature for a single cell is equal to the average of its temperature and the temperatures around it. Each iteration is embarrassingly parallel, but the result of the iteration is necessary to proceed. Instead of synchronizing the entire cluster between iterations, each node computes the results for an extra N cells in each direction, allowing N iterations to occur before synchronization is necessary (after the first iteration, the outermost cells are inaccurate, after the 2nd iteration, the outer two layers are inaccurate, etc. Eventually, you want to sync before the cells we care about are affected). This leads to massively diminished returns as you break the problem into smaller chunks. But even for smaller chunks, doubling or tripling the work per node is still faster than synchronizing.