Most of companies in the world invest lot of money & resources into collecting logs. There are various kinds of logs, for example:
- server logs
- client application logs,
- event driven systems logs etc.
Lets take a step back and ask why do companies care about logs. There are multiple reasons but few top reasons are:
- Monitoring (is my system up & running, understand peak load patterns)
- Understanding the usage pattern of customers
- Derive insights which helps them identify areas to grow business
- A/B test etc.
Rest of the post explains one of the ways to build data engineering systems for solving 'Derive Insights which helps them identify areas to grow business' . There are many different design/architectural solutions out there, hence I want to call it as one of the solutions. I will try to be implementation free as much as possible.
There will be multiple teams needed to take logs and convert them into Insights.
- Instrumentation framework team : Write framework which can be used by product or feature teams to log things consistently
- Logs Collection team : Write framework which can go to client, servers, api's and collect logs into Big Data environment.
- Big Data environment : Ecosystem of Hadoop (Map Reduce) and Hive etc.
- Data Processing team (typically called as DataWarehouse team) :
- Once data is made available in Big Data environment, write code to shape data from unstructured to structured , apply schema adjustors etc.
- Once data is available in structured shape, apply business logics, process it to different aggregates ( and/or traditionally facts or dimensions)
- Build DataWarehouse in either big data environment (Hive) or Push data to Massively Parallel Processing (MPP) systems. Typically MPP systems have concurrency limits so you will need to scale out them with multiple clusters but they give you good query experience. Some companies keeps their DataWarehouse in Hive & do not invest in MPP systems layers. I think both have advantages and disadvantages.
- OLAP layer : Build OLAP layer for self service BI
Data volume decreases from PB to TB to GB as you come from Big Data Environment to MPP to OLAP tools. Typically lot of users would just care about DataWarehouse in Hive or MPP systems to derive most of insights, however in these solution, they can always go back to Big Data Environment if they want to look at more detailed data. For Leadership, executives, business folks they would love to connect to OLAP tools where they do not have to write any code to get insights. This solution stack caters to highly technically engineering teams & non-technical people both. They get to choose which layer they want to access data, and make good use of their skills, time to get Insights.
Happy reading, feel free to comment and let me know any feedback.