Or, to be more specific, you don't like your company's implementation of their i...

potamic · on Aug 26, 2024

What metrics do you alert on? How do you distinguish between an error caused by a faulty database client and one caused by a database disk failure?

majewsky · on Aug 26, 2024

Taking my managed container image registry service as an example.

- The only critical alert that can actually page people fires when the blackbox test fails. Every 30 seconds, it downloads a test image, and if the contents don't match the expectation, an alert is raised (with some delay).

- Warning alerts are mostly for errors returned from background tasks, but these are only monitored during business hours.
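
A rough sketch of what such a blackbox probe could look like, not the actual implementation. `BlackboxProbe`, `fetch`, and `failures_before_alert` are hypothetical names; comparing a SHA-256 digest and modelling the "some delay" as a number of consecutive failed probes are both assumptions.

```python
import hashlib
from typing import Callable


class BlackboxProbe:
    """Hypothetical probe: download a known artifact and compare its contents."""

    def __init__(
        self,
        fetch: Callable[[], bytes],
        expected_sha256: str,
        failures_before_alert: int = 3,
    ) -> None:
        self.fetch = fetch  # assumed callable that downloads the test image
        self.expected_sha256 = expected_sha256
        # Assumption: the alert delay is expressed as consecutive failed probes.
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def run_once(self) -> bool:
        """Run one probe and return True if the alert should be firing."""
        try:
            content = self.fetch()
            ok = hashlib.sha256(content).hexdigest() == self.expected_sha256
        except Exception:
            # A failed download counts the same as mismatching contents.
            ok = False

        if ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_before_alert
```

A scheduler would call `run_once()` every 30 seconds and page only while it returns `True`, so a single failed download does not wake anyone up.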

perfect_wave · on Aug 26, 2024

I don't see how that is separated from the underlying infra. If the network/server/some dependency goes down, the blackbox test will fail and you'll get paged.

silisili · on Aug 26, 2024

You can test for this. For example, we had routines that were called on repeated HTTP failures and would then fetch 5 or so of the top US websites. If those fail too, the problem gets reclassified from an application error to an infra one.
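
A minimal sketch of that check, under stated assumptions: `classify_failure` and `http_reachable` are made-up names, the reference URLs are placeholders rather than the sites actually used, and "those fail too" is read here as "none of the reference sites is reachable".

```python
import urllib.error
import urllib.request
from typing import Callable, Sequence

# Placeholder reference sites; the original list is not known.
REFERENCE_URLS = ["https://example.com", "https://example.org", "https://example.net"]


def http_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the site answers at all, even with an HTTP error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, so the network path to it works.
        return True
    except OSError:
        # DNS failure, refused connection, timeout, and similar.
        return False


def classify_failure(
    reference_urls: Sequence[str] = REFERENCE_URLS,
    is_reachable: Callable[[str], bool] = http_reachable,
) -> str:
    """Call after repeated HTTP failures in the application itself."""
    if not reference_urls:
        raise ValueError("need at least one reference URL")
    if not any(is_reachable(url) for url in reference_urls):
        # Nothing on the outside is reachable either: likely network/infra.
        return "infrastructure"
    return "application"
```

Passing `is_reachable` as a parameter keeps the classification logic testable without touching the network.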

dullcrisp · on Aug 26, 2024

Define SLOs based on what can realistically be achieved with underlying infrastructure, only alert if those SLOs are breached?
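
One way to picture this, as a sketch only: track the success ratio over a rolling window of recent probes and alert only when it drops below the target. `SloWindow`, the window size, and the target value are illustrative assumptions, not recommendations.

```python
from collections import deque


class SloWindow:
    """Hypothetical rolling window of probe results checked against an SLO."""

    def __init__(self, window_size: int, target: float) -> None:
        self.results = deque(maxlen=window_size)  # oldest results drop off
        self.target = target  # e.g. 0.999 for "three nines"

    def record(self, success: bool) -> None:
        self.results.append(success)

    def breached(self) -> bool:
        """True once the window is full and the success ratio is below target."""
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge
        ratio = sum(self.results) / len(self.results)
        return ratio < self.target
```

Setting the target to what the underlying infrastructure can realistically deliver means that expected, infrequent infra blips stay inside the budget and do not page anyone.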

sgarland · on Aug 26, 2024

If your endpoint is failing, it might be you. If everyone’s endpoint is failing, it’s almost certainly not you.