
Or, to be more specific, you don't like your company's implementation of their idea.

We have the same setup in my org, but we get to define alerts ourselves. All our own alerts are built so that they fire only when there's something we can actually do at our level, and stay quiet when the underlying infra is borked. We are kept honest because there is a big kerfuffle whenever an incident is reported by customers first (instead of being caught by alerting).



What metrics do you alert on? How do you distinguish between an error due to a faulty database client and an error due to a database disk failure?


Taking my managed container image registry service as an example.

- The only critical alert that can actually page people is if the blackbox test fails. Every 30 seconds, it downloads a test image and if the contents don't match the expectation, an alert is raised (with some delay).

- Warning alerts are mostly for any errors being returned from background tasks, but these are only monitored during business hours.
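
A minimal sketch of the blackbox check described above, not the actual implementation. The class name, the SHA-256 comparison, and the `failures_before_alert` threshold (standing in for "with some delay") are all assumptions; `fetch` is a hypothetical callable that would download the test image.

```python
import hashlib
from typing import Callable


class BlackboxCheck:
    """Hypothetical probe: download a known artifact and compare its digest."""

    def __init__(
        self,
        fetch: Callable[[], bytes],
        expected_sha256: str,
        failures_before_alert: int = 3,  # assumed stand-in for "some delay"
    ) -> None:
        self.fetch = fetch
        self.expected_sha256 = expected_sha256
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def run_once(self) -> bool:
        """Run one probe. Return True if the alert should be firing."""
        try:
            data = self.fetch()
            ok = hashlib.sha256(data).hexdigest() == self.expected_sha256
        except Exception:
            # A failed download counts the same as wrong contents.
            ok = False
        # Reset on success, otherwise count one more failure in a row.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failures_before_alert
```

A scheduler would call `run_once()` every 30 seconds and page only while it returns True, so a single failed probe does not wake anyone up.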


I don't see how that is separated from the underlying infra. If the network/server/some dependency goes down, the blackbox test will fail and you'll get paged.


You can test for this. For example, we had routines, triggered on repeated HTTP failures, that would then fetch 5 or so of the top US websites. If those failed too, the error was reclassified from an application error to an infra one.
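
Roughly, that kind of check could look like the sketch below. The function names and the reference URLs are placeholders chosen for illustration (the original list of sites is not given), and treating "all references failed" as the infra signal is an assumption.

```python
import urllib.error
import urllib.request
from typing import Callable, Sequence

# Placeholder reference sites; the sites actually used are not known.
REFERENCE_URLS = (
    "https://www.google.com",
    "https://www.wikipedia.org",
    "https://www.amazon.com",
)


def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Return True if the host answered at all, even with an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, so the network path works.
        return True
    except OSError:
        # Covers URLError, DNS failures, timeouts, refused connections.
        return False


def classify_failure(
    reference_urls: Sequence[str] = REFERENCE_URLS,
    probe: Callable[[str], bool] = can_reach,
) -> str:
    """Call after repeated failures against your own endpoint."""
    if not reference_urls:
        # Nothing to compare against, so assume the problem is ours.
        return "application"
    if not any(probe(url) for url in reference_urls):
        return "infra"
    return "application"
```

Only an `"application"` result would page the service owners; an `"infra"` result would be routed to whoever owns the network or platform.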


Define SLOs based on what can realistically be achieved with the underlying infrastructure, and only alert if those SLOs are breached?
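
As a rough illustration of that idea, assuming a simple success-ratio SLO over some window (the function name and the 99.9% figure in the usage note are made up for the example):

```python
def slo_breached(good_requests: int, total_requests: int, target: float) -> bool:
    """Return True if the observed success ratio is below the SLO target.

    `target` should be no higher than what the underlying infrastructure
    can deliver, so infra-level noise alone does not trigger an alert.
    """
    if total_requests == 0:
        # No traffic in the window means nothing to judge.
        return False
    return good_requests / total_requests < target
```

With a hypothetical target of 0.999, a window with 998 successes out of 1000 requests would alert, while 9995 out of 10000 would not.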


If your endpoint is failing, it might be you. If everyone’s endpoint is failing, it’s almost certainly not you.



