My thoughts on the data pipeline

We learned many lessons while developing and using the SQL Azure Dashboard for live site troubleshooting.  We also studied how other teams build data pipelines and make data-driven decisions.  In this document, we provide some initial thoughts on how we can improve data-driven decision making for SQL Azure.

Business Requirements

In this section, we describe the high-level business goals of data-driven decision making.  We want to make sure that any technical investment results in better ROI, and thus provides value to our customers.

Make data flow and make use of data

One of the business requirements is to make use of our existing data to help people make decisions.  To achieve this goal, we need to build a complete system that allows raw data to be efficiently processed and to flow to the right people at the right time.  Here are a couple of examples of such requirements:

Reporting

SQL Server Reporting is a simple way to summarize data for people to consume.  We have seen significant usage of reports in recent months, which is a good sign that we are becoming more data focused.  However, developing a “good” report is still hard.  Our daily availability report is a very good example of how a report evolves as requirements change.  We used to measure the availability of a cluster by the percentage of login failures per cluster, categorized by error category.  Our PMs did not leave the report static; they constantly refined it to make it more useful.  A couple of actions we took (a minimal sketch of the underlying availability query follows this list):

  1. We discovered several login categorization bugs and gaps, and worked very hard to correct them. Data correctness is very important; otherwise people may be driven to the wrong conclusion.
  2. We used to exclude one category of unavailability, Client Closed, because the customer’s connection time was too short (15 seconds).  After we realized it was not right to ignore this, we exposed it in the daily report and started to see many client-close issues every day. This gained a lot of traction with leadership, which led us to drive several work items to fix them and improved our availability significantly. The goal of reporting is not generating a report, but using the report to drive business decisions.  We need to constantly look back to see the value returned by our reports.
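
To make the report concrete, below is a minimal sketch of the kind of query behind it, assuming a hypothetical dbo.LoginEvent table with ClusterName, EventTime, IsSuccess, and ErrorCategory columns; the table and column names are illustrative, not our actual schema.

    -- Minimal sketch of the daily availability report: login failure percentage per
    -- cluster, broken down by error category. dbo.LoginEvent and its columns are
    -- assumed names for illustration only.
    ;WITH ClusterLogins AS
    (
        SELECT ClusterName, COUNT(*) AS TotalLogins
        FROM   dbo.LoginEvent
        WHERE  EventTime >= DATEADD(DAY, -1, GETUTCDATE())
        GROUP BY ClusterName
    ),
    ClusterFailures AS
    (
        SELECT ClusterName, ErrorCategory, COUNT(*) AS FailedLogins
        FROM   dbo.LoginEvent
        WHERE  EventTime >= DATEADD(DAY, -1, GETUTCDATE())
          AND  IsSuccess = 0
        GROUP BY ClusterName, ErrorCategory
    )
    SELECT f.ClusterName,
           f.ErrorCategory,
           f.FailedLogins,
           100.0 * f.FailedLogins / l.TotalLogins AS FailurePercent
    FROM   ClusterFailures f
    JOIN   ClusterLogins   l ON l.ClusterName = f.ClusterName
    ORDER BY f.ClusterName, FailurePercent DESC;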

Live site incident alerting and resolution

The data pipeline should provide enough data, within a limited time, for people to detect and resolve live site incidents.  Every live site incident has business impact and might hurt our reputation if we don’t handle it well.  Mean time to detection (MTTD) and mean time to mitigation (MTTM) are the important metrics for live site incidents.  If there is a signal of a live site incident, the data should flow to the right person with low latency so that we can reduce both metrics.
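
As an illustration of how such a signal could be surfaced quickly, here is a hedged sketch of a detection query that could feed an alert, reusing the hypothetical dbo.LoginEvent table from the reporting example; the 15-minute window and the threshold are assumptions, not tuned values.

    -- Sketch of an alerting query: notify the on-call when login failures spike.
    -- Table, columns, and the threshold of 500 are illustrative assumptions.
    SELECT ClusterName,
           COUNT(*) AS FailuresLast15Min
    FROM   dbo.LoginEvent
    WHERE  IsSuccess = 0
      AND  EventTime >= DATEADD(MINUTE, -15, GETUTCDATE())
    GROUP BY ClusterName
    HAVING COUNT(*) > 500;  -- alert threshold (assumption)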

Trend analysis and prediction

 

We all know that analytics and prediction are very important for our service and our business.  Depending on the requirement, we might need very low latency to get data. Today, the trend in running a service is to predict what will happen in the future given the current state.  The sooner we predict it, the better we can prevent bad things from happening.  For example, if we observe that the log file on one machine has been growing significantly faster over the last 15 minutes, we can predict when we will run out of log space and bring the machine down.  Whenever something happens, perf counters are usually the best indicator of the issue.  Having a very fast ETL pipeline for perf counter data enables us to predict issues in a short time.
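
To show what such a prediction could look like once perf counter data is flowing with low latency, here is a minimal sketch that linearly extrapolates log-space growth over the last 15 minutes; dbo.PerfCounterSample, its columns, the counter name, and the 100 GB capacity are assumptions for illustration.

    -- Sketch: estimate minutes until log space is exhausted, by extrapolating the
    -- growth observed in the last 15 minutes. Assumes CounterValue grows roughly
    -- monotonically within the window; all names and the capacity are illustrative.
    DECLARE @LogCapacityMB decimal(18, 2) = 100000;  -- assumed log capacity (~100 GB)

    ;WITH Recent AS
    (
        SELECT MachineName, SampleTime, CounterValue
        FROM   dbo.PerfCounterSample
        WHERE  CounterName = 'Log Space Used (MB)'
          AND  SampleTime >= DATEADD(MINUTE, -15, GETUTCDATE())
    ),
    Trend AS
    (
        SELECT MachineName,
               MAX(CounterValue) - MIN(CounterValue)              AS GrowthMB,
               DATEDIFF(MINUTE, MIN(SampleTime), MAX(SampleTime)) AS WindowMinutes,
               MAX(CounterValue)                                  AS CurrentUsedMB
        FROM   Recent
        GROUP BY MachineName
    )
    SELECT MachineName,
           GrowthMB * 1.0 / NULLIF(WindowMinutes, 0)              AS GrowthMBPerMinute,
           @LogCapacityMB - CurrentUsedMB                         AS RemainingMB,
           (@LogCapacityMB - CurrentUsedMB)
               / NULLIF(GrowthMB * 1.0 / NULLIF(WindowMinutes, 0), 0) AS EstimatedMinutesToFull
    FROM   Trend
    WHERE  GrowthMB > 0
    ORDER BY EstimatedMinutesToFull;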

 

Data driven decision

Technical Requirements

Data accessibility and discoverability

Many people are comfortable using SQL queries, which are declarative and flexible, so our data pipeline should provide SQL access to the data or build solutions on top of SQL queries. Even though Hadoop and other big data platforms are popular, many teams treat them as raw data storage and usually clean up and aggregate the data into relational tables stored in SQL Server, which are easy to access and integrate with existing systems.
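
As an illustration of that clean-up-and-aggregate step, here is a minimal sketch that folds raw login events (landed from the big data platform) into a relational summary table; dbo.RawLoginEvent and dbo.LoginFailureSummary are assumed names, not our actual schema.

    -- Sketch: aggregate raw events into a relational summary table that is easy to
    -- query and join. All table and column names are illustrative assumptions.
    INSERT INTO dbo.LoginFailureSummary (ClusterName, ErrorCategory, EventDate, FailureCount)
    SELECT ClusterName,
           ErrorCategory,
           CAST(EventTime AS date) AS EventDate,
           COUNT(*)                AS FailureCount
    FROM   dbo.RawLoginEvent
    WHERE  IsSuccess = 0
    GROUP BY ClusterName, ErrorCategory, CAST(EventTime AS date);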

One problem we face today is that we have multiple data sources, and the data our pipelines generate is stored in many different DWs. This makes it hard for people to find data and correlate it across sources.  As a requirement, we should try our best to co-locate data in a single data source and provide an interface to correlate data from different data sources.
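
One possible shape of such an interface, sketched under the assumption that the login DW and the perf counter DW are databases on the same SQL Server instance (LoginDW, PerfCounterDW, and the join keys are hypothetical):

    -- Sketch: a view that presents login failures correlated with the perf counters
    -- captured around the same time. Database, table, and column names are assumptions.
    CREATE VIEW dbo.ClusterHealthCorrelated
    AS
    SELECT l.ClusterName,
           l.EventTime,
           l.ErrorCategory,
           p.CounterName,
           p.CounterValue
    FROM   LoginDW.dbo.LoginEvent              AS l
    JOIN   PerfCounterDW.dbo.PerfCounterSample AS p
           ON  p.ClusterName = l.ClusterName
           AND p.SampleTime BETWEEN DATEADD(MINUTE, -5, l.EventTime) AND l.EventTime
    WHERE  l.IsSuccess = 0;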

Latency

Latency is an important factor for alerting, live site troubleshooting, and prediction.  Many teams have clearly defined data latency requirements for their data pipelines.  For example, the Windows Telemetry team refreshes all their telemetry data every 30 minutes, so people can see their service trends with a maximum latency of 30 minutes.  In some cases we require even lower latency.  For example, for a customer database outage, the sooner we find out, the better chance we have to resolve it.

As a general goal, I would say that data should be available to end users within 30 minutes after the event/data was originally generated.  Critical data should be exposed to end users ASAP.
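
A simple way to keep ourselves honest about this goal is to monitor freshness directly. The sketch below assumes each pipeline table carries both the original event time and the time the row landed in the DW (EventTime and LoadedTime are assumed column names).

    -- Sketch: measure how far behind a pipeline table is and its average ingestion
    -- latency over the last hour. Table and column names are illustrative assumptions.
    SELECT 'dbo.LoginEvent'                                    AS TableName,
           MAX(EventTime)                                      AS NewestEventTime,
           DATEDIFF(MINUTE, MAX(EventTime), GETUTCDATE())      AS MinutesBehind,
           AVG(DATEDIFF(MINUTE, EventTime, LoadedTime) * 1.0)  AS AvgIngestionLatencyMinutes
    FROM   dbo.LoginEvent
    WHERE  LoadedTime >= DATEADD(HOUR, -1, GETUTCDATE());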

Data pipeline authoring

Take our opstore data authoring as an example: it is very simple for a person to author such data. He or she only needs to 1) trace the data, 2) define the opstore table schema, and 3) define a rule to map the two. It usually takes a developer about a day to write such a data pipeline.
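
As an illustration of step 2, the sketch below shows what an opstore-style table for a traced event might look like; the table and columns are made up for this example, and I am not reproducing the exact syntax of the tracing or mapping-rule steps, which depend on the internal tooling.

    -- Illustrative only: a made-up opstore-style table for a traced node restart event.
    CREATE TABLE dbo.NodeRestartEvent
    (
        ClusterName   nvarchar(128) NOT NULL,
        MachineName   nvarchar(128) NOT NULL,
        EventTime     datetime2(3)  NOT NULL,
        RestartReason nvarchar(256) NULL
    );
    -- Step 3 would then be a mapping rule from the trace fields emitted in step 1 to
    -- these columns; its concrete format depends on the opstore tooling.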

For example, the Windows Telemetry team provides five kinds of generic plugins, such as T-SQL and Perf Counter, to allow people to define their KPIs in a simple UI, leaving the actual data collection as a black box.

In summary, we should provide the best possible experience so that end users can author data pipelines more efficiently.

Hosting and Security

Due to the FISIM requirement, everyone needs a smartcard to access a public SAWA cluster. This makes it hard for self-hosted troubleshooting and data collection tools to work automatically.  The data pipeline system should provide a way to access data in a secure manner with better accessibility; i.e., we should not need a smartcard to access data that has already been collected and stored outside of the cluster.

Impact on live system

There are a couple of questions about the relationship between the data pipeline and the live system, such as whether data gathering should happen inside or outside the cluster, and whether data storage should be within the cluster or not.  Some of the key principles are: 1) the data pipeline should have a loose dependence on the live system, so that we can update either one independently and the availability of one does not impact the other; 2) the data pipeline’s availability should have a loose dependence on the live system’s availability, so that if the live system has issues we can still collect data; and 3) the data pipeline should have low impact on the live system.

Data Volume

Our system will generate a large amount of data, so the design of the pipeline system needs to take data volume and retention policy into consideration so that we don’t hit scale-out problems.
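
For example, a retention policy can be as simple as a sliding window enforced by a scheduled job. The sketch below assumes a 30-day window on the hypothetical dbo.PerfCounterSample table and deletes in small batches to keep the impact on the server low.

    -- Sketch: enforce a 30-day retention window in small batches. The table name,
    -- window, and batch size are illustrative assumptions.
    DECLARE @cutoff datetime2(3) = DATEADD(DAY, -30, GETUTCDATE());

    WHILE 1 = 1
    BEGIN
        DELETE TOP (10000)
        FROM   dbo.PerfCounterSample
        WHERE  SampleTime < @cutoff;

        IF @@ROWCOUNT = 0 BREAK;  -- nothing older than the retention window remains
    END;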

 

Summary

Below is the list of requirements that I think are important:

  • The data pipeline should provide enough data, within a limited time, for people to detect and resolve live site incidents.
  • Having a very fast ETL pipeline for perf counter data enables us to predict issues in a short time.
  • The data pipeline should provide SQL access to the data, or solutions should be built on top of SQL queries.
  • Try our best to co-locate data in a single data source and provide an interface to correlate data from different data sources.
  • Data should be available to end users within 30 minutes after the event/data was originally generated; critical data should be exposed ASAP.
  • Provide the best experience for end users to author data pipelines more efficiently.
  • The data pipeline should have a loose dependence on the live system.
  • The data pipeline’s availability should have a loose dependence on the live system’s availability.
  • The data pipeline should have low impact on the live system.
  • The design of the pipeline system needs to take data volume and retention policy into consideration.

If we think back to our opstore, it satisfies a lot of the requirements above, such as data completeness, the data authoring experience, and a well-defined retention policy. What if we took opstore outside of the cluster and provided a much better query interface that allows querying across partitions and clusters?  I think it would solve all of our problems while making great use of our existing investment.