Since part of Testing in Production (TiP) is monitoring in production, designing a flexible and accurate alert system is necessary. The alert system should combine information from different data sources and be adaptive enough to open tickets for the right people at the right time. Do you have good references on designing such an alert system? (I imagine this is more challenging, and that it should be fully integrated with the TiP system.) When designing an alert system, I am particularly interested in the following:
1) How many alerts are false positives (noise), and how many are false negatives (i.e., the system is too conservative and fails to open a ticket)?
2) Are the alerts actionable?
3) Do we apply heuristics or statistical analysis before sending an alert?
4) Do we combine data from multiple sources in the alert, such as rollout information, internal service logs, etc.?
5) Who is the recipient of the alert? If it is a human, how many conversations and how many hops does it take, and how long, before the alert reaches the right team and gets resolved? Is the resolution manual or automatic? If it is an automatic process (which is what I am looking for), do we have guidelines for designing such a process?
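
To make the first question concrete, here is a minimal sketch of how I would measure alert noise, assuming each past event has been labelled after the fact with whether an alert fired and whether there was a real issue. The names `Outcome` and `alert_quality` are hypothetical, chosen only for illustration; they do not come from any existing TiP or alerting tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    alerted: bool     # did the alert system open a ticket?
    real_issue: bool  # was there a real problem (labelled afterwards)?


def alert_quality(outcomes: list[Outcome]) -> dict[str, Optional[float]]:
    """Count false positives/negatives and derive precision and recall."""
    tp = sum(o.alerted and o.real_issue for o in outcomes)
    fp = sum(o.alerted and not o.real_issue for o in outcomes)  # noise
    fn = sum(not o.alerted and o.real_issue for o in outcomes)  # missed issues
    return {
        "false_positives": fp,
        "false_negatives": fn,
        # Share of opened tickets that pointed at a real issue.
        "precision": tp / (tp + fp) if tp + fp else None,
        # Share of real issues that actually produced a ticket.
        "recall": tp / (tp + fn) if tp + fn else None,
    }


if __name__ == "__main__":
    history = [
        Outcome(alerted=True, real_issue=True),
        Outcome(alerted=True, real_issue=False),
        Outcome(alerted=False, real_issue=True),
        Outcome(alerted=True, real_issue=True),
    ]
    print(alert_quality(history))
```

Low precision would mean the alerts are noisy, and low recall would mean the system is too conservative, so tracking both over time seems like one way to compare alert designs.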