Here are my thoughts on external monitoring. Please share any comments or suggestions.
1) What is the difference between external monitoring and running tests in production?
The goals are different. External monitoring tracks service health, such as reliability, latency, and availability; running tests in production verifies the functionality of a feature. Both are valid testing strategies, but the metrics and the frequency of running the "tests" differ. The latter is also very important: production is the only source of truth, and running a test in production is the only way to confirm that a feature works there.
2) How many kinds of external monitoring do we have?
We have three kinds of external monitoring: 1. "ping"-like monitoring, which measures the availability or latency of a service access point; 2. running basic canonical scenarios against your service to measure the availability of its functionality; 3. simulating a typical user scenario in production to make sure the workflow of your service works. Note that the big difference between 1 and 2 is the metric we collect to judge whether our service has issues. In some cases, verification for 2 will be much harder than for 1. For example, suppose we are monitoring a web-based e-mail service. The ping monitor calls either the web server address or the e-mail service API to make sure we get a response. The canonical scenario tries to log in as a user and makes sure the login succeeds. The user scenario simulation logs in, reads an e-mail, replies to an e-mail, and logs out.
3) What is the typical frequency of running external monitoring tests?
The frequency depends on the SLA of your service. For example, if you aim to detect broken-connection issues within 5 minutes, then your external monitoring tests must run at least every 5 minutes. If your service sits behind 10 load-balanced machines, then you need to issue at least 10 requests at a time, every 5 minutes, in order to detect a failure in one of those machines.
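To put rough numbers on the load-balancer example, here is a minimal sketch. It assumes each request is routed to a machine uniformly at random and independently of the others (my assumption; a strict round-robin balancer would behave differently). The function names are made up for illustration.

```python
import math


def detection_probability(machines: int, bad: int, requests: int) -> float:
    """Chance that at least one of `requests` probes lands on a bad machine,
    assuming uniform, independent random routing (an assumption)."""
    healthy_fraction = (machines - bad) / machines
    return 1.0 - healthy_fraction ** requests


def requests_needed(machines: int, bad: int, target: float) -> int:
    """Smallest number of probes whose detection probability reaches `target`."""
    healthy_fraction = (machines - bad) / machines
    if healthy_fraction == 0.0:
        return 1  # every machine is bad, so a single probe is enough
    return math.ceil(math.log(1.0 - target) / math.log(healthy_fraction))


# One bad machine out of ten, ten probes per run.
print(round(detection_probability(10, 1, 10), 3))
# Probes per run needed to hit the bad machine with 95% probability.
print(requests_needed(10, 1, 0.95))
```

Under this routing assumption, 10 requests per run is a lower bound rather than a guarantee: the probes can easily all miss the one bad machine, so the request count has to grow with the confidence you want.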
4) How do we design good external monitoring tests?
First, keep in mind that an external monitoring test is a monitor, not a functional test. The idea of "write once, run everywhere" does not apply to external monitoring tests: when we run a test on OneBox, we measure functionality; when we run a monitoring test in production, we monitor our service's reliability and latency. The goals are different, so the ways of writing the tests are totally different. It is possible to share some of the scenarios, but I'd prefer that you write separate tests.
Due to the nature of cloud services, errors happen randomly, so external monitoring tests need to take probability, frequency, and false failures into consideration. A statistical model might also need to be built in order to get this right.
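One simple form such a statistical model could take is a binomial tail test: given a known baseline rate of false failures, alert only when the number of failed probes in a run would be very unlikely under that baseline. This is just one possible sketch, not the only model; the baseline rate, the threshold `alpha`, and the function names are all assumptions for illustration.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    failures in n probes if each probe fails independently with rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def should_alert(probes: int, failures: int, baseline_rate: float,
                 alpha: float = 0.01) -> bool:
    """Alert when the observed failure count is too unlikely to be explained
    by the baseline false-failure rate alone (hypothetical threshold alpha)."""
    return binomial_tail(probes, failures, baseline_rate) < alpha


# With a 1% baseline failure rate and 100 probes per run:
print(should_alert(100, 1, 0.01))  # a single failure is expected noise
print(should_alert(100, 5, 0.01))  # five failures are hard to explain by noise
```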
5) What are the major issues related to external monitoring tests?
Case 1 of external monitoring, i.e., ping-like external tests, is always a valid and good indicator of your service's health, and every service should expose such a simple ping endpoint to allow others to measure it.
Note that case 1 mainly aims to find service-wide issues, not component-level or machine-level issues; i.e., if a service is totally or partially inaccessible, case 1 can easily detect this.
For case 2 and case 3, since we are running scenario-based external monitoring, the main issues come down to two questions:
a) How can we know whether the service has an issue?
Take the previous e-mail service as an example. Suppose our external monitor tries to log in as a user and checks that the login succeeds. A successful login only means that this account could log in once; there is no guarantee that other customers can log in. Likewise, a failed login does not mean that we have a service issue. What if we retry a second and a third time and, if all attempts fail, treat it as a service issue? Still no, because we don't know whether the failure affects one account or many accounts at the same time. How about creating many test accounts and monitoring them all at once? That seems to work if we create enough accounts. Suppose that one out of ten front-end machines has failed, that requests to that machine always fail, and that each request goes to a randomly chosen front-end machine. How many accounts or requests should we issue per run to detect the bad machine? We would have to issue about 100 requests for 1 of them, on average, to fail twice in a row, and can we then assume that we have hit this issue? What if we require three failed attempts? The probability is even lower. In conclusion, we can only reliably detect global service outages in this case.
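The retry arithmetic above can be checked with a small sketch. It assumes every attempt, including each retry, is routed to a random front-end machine independently of the previous attempts (an assumption; sticky sessions would change the numbers). The function names are made up for illustration.

```python
def p_all_attempts_fail(bad_fraction: float, attempts: int) -> float:
    """Chance that one probe fails on every attempt, when each attempt
    independently hits a bad machine with probability `bad_fraction`."""
    return bad_fraction ** attempts


def p_any_probe_exhausts_retries(bad_fraction: float, attempts: int,
                                 probes: int) -> float:
    """Chance that at least one of `probes` probes fails on all its attempts."""
    per_probe = p_all_attempts_fail(bad_fraction, attempts)
    return 1.0 - (1.0 - per_probe) ** probes


# One bad machine out of ten, 100 probes per run.
for attempts in (1, 2, 3):
    per_probe = p_all_attempts_fail(0.1, attempts)
    per_run = p_any_probe_exhausts_retries(0.1, attempts, 100)
    print(attempts, per_probe, round(per_run, 3))
```

Each extra attempt cuts the per-probe signal by a factor of ten in this setup, which is why retries that suppress false alarms also hide a single bad machine.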
b) How do you diagnose a service issue when your external monitoring fails?
Suppose our external monitoring finds a global service issue and fires an alert to the right team. Does the external monitoring result help the team find the root cause (RCA) and resolve the issue quickly? In my experience, external monitoring has no internal knowledge of the service. In other words, its result is the same as what our customers get, such as a timeout, a service-unavailable error code, or a trace ID, no more and no less. How should we handle this case? This brings me to my next question:
6) How can we combine external monitoring, internal monitoring, log analysis, and live-site troubleshooting tools?
This is the main point of my post. Keep in mind that no single solution can solve service issues; the key is a combination of all these solutions plus a well-designed health model and alert system. While investing in external monitoring seems cool, we still need internal monitoring, real-time log analysis, and live-site troubleshooting or diagnostic tools. In my opinion, the latter might be more important than external monitoring, since they are the key to gaining insight into the service. For example, if you don't have good troubleshooting tools, you will find it very hard to debug external monitoring failures. If you don't have a real-time log analysis tool, you don't know which service or which component has errors. Let me give an example of how real-time log analysis can resolve the issue I described in the previous section. Suppose that there is one customer who always has login issues (not because of a wrong password, but because of a service bug that only affects him or her). In this case, no external monitoring can detect the problem, although the service is 100% unavailable for that customer. Most of our services have rich logs, and all exceptions and errors are recorded. If we apply data mining algorithms to these logs, it is not hard to find such a customer, and the root cause as well.
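As a very small stand-in for the log mining idea, the sketch below groups login records by user and flags users whose every attempt failed. The record shape `(user_id, succeeded)` is hypothetical, as is the `min_attempts` cutoff; real logs would need parsing first, and wrong-password failures would have to be filtered out before this step.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def users_always_failing(records: Iterable[Tuple[str, bool]],
                         min_attempts: int = 5) -> List[str]:
    """Return users with at least `min_attempts` logins, all of which failed.

    Each record is a hypothetical (user_id, succeeded) pair taken from logs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for user_id, succeeded in records:
        totals[user_id] += 1
        if not succeeded:
            failures[user_id] += 1
    return sorted(
        user for user, count in totals.items()
        if count >= min_attempts and failures[user] == count
    )


sample = (
    [("alice", True)] * 4 + [("alice", False)]  # occasional failure only
    + [("bob", False)] * 6                      # fails on every attempt
    + [("carol", False)] * 2                    # too few attempts to judge
)
print(users_always_failing(sample))
```

An external monitor using its own test accounts would never see the second user's problem, while the service's own logs expose it directly.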
The above examples show the need to define a hierarchical service health model, with levels such as service level, machine level, and user level, and to map your monitoring and alerting solutions to the different levels with different priorities, routing each alert to the right team.
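A hierarchical health model like this could be represented as a simple lookup from level to priority and owning team. The sketch below is only illustrative: the priority numbers and team names are invented, and a real model would carry much more than a routing table.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    SERVICE = "service"
    MACHINE = "machine"
    USER = "user"


@dataclass(frozen=True)
class Route:
    priority: int  # 1 is the most urgent (hypothetical scale)
    team: str      # hypothetical team name


# Hypothetical mapping from health-model level to alert routing.
ROUTES = {
    Level.SERVICE: Route(priority=1, team="service-oncall"),
    Level.MACHINE: Route(priority=2, team="infrastructure"),
    Level.USER: Route(priority=3, team="customer-support"),
}


def route_alert(level: Level, message: str) -> str:
    """Format an alert with the priority and team for its health level."""
    route = ROUTES[level]
    return f"[P{route.priority}] {route.team}: {message}"


print(route_alert(Level.SERVICE, "login failing for all probe accounts"))
print(route_alert(Level.USER, "one customer cannot log in"))
```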
In conclusion, don't think of external monitoring as a separate process or as the most important area to invest in. It is necessary to think about the big picture and combine all the other techniques to provide a well-defined solution.