The past four months have been amazing for me. I started a grassroots project called "SQL Azure Dashboard" and had a first version implemented while my manager was on leave (since he thought this area should belong to another team). The basic idea of the project is to periodically scan all our clusters to collect telemetry data and build a live dashboard. That is exactly what the first version did, and even such a basic data pipeline surprised many people on our team: we now had a single, up-to-date view of the status of our clusters.
Guess what? I did not stop there, because I knew we could get more out of it. One month later, we had our first alert system built on top of the dashboard. Even though it only sent email to our rapid response team and Ops team, we found that it opened a door to real insight into our service. The data I gathered played a very important role in decision making. For example, when we had an outage in one of the clusters due to an exception, we simply scanned the data I had collected (which contains all exceptions that happen in production) and learned that the issue affected only that cluster.
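To make the outage example concrete, here is a minimal sketch of that kind of impact check. It is not the actual dashboard code: the record layout, the cluster names, the exception names, and the `clusters_affected_by` helper are all hypothetical and only illustrate the idea of counting one exception type per cluster.

```python
from collections import defaultdict

# Hypothetical exception records: (cluster name, exception type).
# The real data collected by the dashboard is not shown in this post.
exceptions = [
    ("cluster-a", "TimeoutException"),
    ("cluster-a", "TimeoutException"),
    ("cluster-b", "DeadlockException"),
    ("cluster-a", "DeadlockException"),
]


def clusters_affected_by(records, exception_type):
    """Count occurrences of one exception type per cluster."""
    counts = defaultdict(int)
    for cluster, exc in records:
        if exc == exception_type:
            counts[cluster] += 1
    return dict(counts)


impact = clusters_affected_by(exceptions, "TimeoutException")
print(impact)
```

If the result contains a single cluster, the issue is most likely local to that cluster rather than service-wide.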
My dashboard was so successful that people realized the importance of data-driven decisions, and I provided the data they wanted on time. Another month later, we had a daily machine reboot history with detailed causes, a per-day load balancing history, a summary of user login failures per user per 15 minutes, etc.
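As an illustration of the "per user per 15 minutes" summary, the sketch below groups login failure events into 15-minute buckets. The event format, the sample users and timestamps, and the function names `bucket_start` and `summarize_login_failures` are assumptions made for this example, not the real pipeline.

```python
from collections import Counter
from datetime import datetime


def bucket_start(ts, minutes=15):
    """Round a timestamp down to the start of its bucket within the hour."""
    floored_minute = ts.minute - ts.minute % minutes
    return ts.replace(minute=floored_minute, second=0, microsecond=0)


def summarize_login_failures(events, minutes=15):
    """Count login failures per (user, bucket start) pair."""
    return Counter((user, bucket_start(ts, minutes)) for user, ts in events)


# Hypothetical login failure events: (user, time of failure).
events = [
    ("alice", datetime(2013, 1, 1, 10, 2)),
    ("alice", datetime(2013, 1, 1, 10, 14)),
    ("alice", datetime(2013, 1, 1, 10, 16)),
    ("bob", datetime(2013, 1, 1, 10, 59)),
]

summary = summarize_login_failures(events)
for (user, start), count in sorted(summary.items()):
    print(user, start.isoformat(), count)
```

Rounding each timestamp down to its bucket start keeps the grouping key simple, so the same approach would work for other fixed windows that divide an hour evenly.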
In my next couple of blog posts, I would like to share more of what I learned from this project, and also share our strategies for testing Windows Azure SQL Database (i.e., SQL Azure). Of course, you need to leave more comments to encourage me to write. So how about one blog post per comment?