> A lot of our servers have 10-40gbit links that we saturate for minutes/hours at a time, which I suspect would be expensive without the kind of topology optimization we do in our datacenters.
I think everyone does something of this sort nowadays; that's why networking is ~free within data centers :)
> but an extra 1ms (say) between datacenters is generally bad for us and measurably reduces performance of some systems
That speaks to me. You will always be just the n-th client unless you own the cross-datacenter links (i.e. have full autonomy in deciding the priority of the traffic). It's similar to the COVID provisioning problems you had mentioned.
> One obvious example is that EC2 has "scheduled maintenance events" where they force you to reboot your box.
Yeah, like others pointed out - that's just what "cloud" is, and it's generally a good idea. You're supposed to handle a certain % of your machines going dark without warning, without violating any SLO (or even worse, a certain % of your machines "pretending" they're up but actually being ridiculously slow for one reason or another; and don't even get me started on CPU/RAM bitflips).
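The standard trick for tolerating those "slow but pretending to be up" machines is request hedging: if the first replica hasn't answered within a small budget, fire the same request at a backup and take whichever responds first. A minimal sketch (the replica names, delays, and timeouts here are all made-up illustration, not any real service's API):

```python
import concurrent.futures
import time

def call_replica(replica_id, payload):
    # Simulated backend call: one replica is "gray-failed" and slow.
    delay = 0.5 if replica_id == "slow-replica" else 0.01
    time.sleep(delay)
    return f"{replica_id}:{payload}"

def hedged_request(replicas, payload, hedge_after=0.05, timeout=2.0):
    """Ask the first replica; if it hasn't answered within hedge_after
    seconds, send the same request to the backups and return whichever
    response arrives first."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(call_replica, replicas[0], payload)]
        done, _ = concurrent.futures.wait(futures, timeout=hedge_after)
        if not done:
            # Primary is suspiciously slow -- hedge to the backups.
            futures += [pool.submit(call_replica, r, payload)
                        for r in replicas[1:]]
        done, _ = concurrent.futures.wait(
            futures, timeout=timeout,
            return_when=concurrent.futures.FIRST_COMPLETED)
        if not done:
            raise TimeoutError("all replicas exceeded the deadline")
        return next(iter(done)).result()
    finally:
        # Don't block on the straggler; abandon its in-flight work.
        pool.shutdown(wait=False, cancel_futures=True)

print(hedged_request(["slow-replica", "fast-replica"], "ping"))
# -> fast-replica:ping
```

The trade-off is extra load: every hedge duplicates work, so the hedge threshold is usually set near the tail (e.g. the p95 latency) rather than the median.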
It sounds to me like you run a highly sensitive service, one where paying for true ownership of the hardware just makes sense, to remove the kinds of risks most services don't care about. At the end of the day, "cloud" is a shared resource, and no resource-separation effort will be 100% effective.