Since I started working at Microsoft in 2016, I’ve wanted to share more about this topic, and I intend to turn it into a presentation I can give in public forums. I’ll happily take feedback and welcome critique as I refine the message and make it applicable to as many people as possible. Please comment and tell me your thoughts.
Note: In an effort to share more often and react to feedback I might receive, I’m going to break this up into a series of posts rather than writing a book.
The origin story
Years ago, I worked with a company that was implementing Hadoop as a means of speeding up data transformation processes on petabytes of information received from client organizations. We had used traditional ETL for years, until massive datasets & increasing business demands meant we needed to look at other options. Through many meetings, I introduced Hadoop to our technical data & infrastructure teams, and we started down the path of setting up some initial POC clusters. Eventually we had environments & architectures designed for development, test, and production, as well as NoSQL-based solutions for indexing & searching decades of communication and transactional data for major retail customers in the US, Canada, Latin America, Europe, and Asia.
Hadoop & why parallelism is great
At the core of our solution was a traditional Hadoop cluster using the Hadoop Distributed File System (HDFS) for data storage, paired with MapReduce-based processing; in our case, this meant Apache Hive queries. This worked great: we took some ETL workloads that had run for 160+ hours down to less than a day. It also gave us improved data management capabilities, as appending hundreds of GB of data to existing tables was now as easy as uploading new files to existing HDFS directories, thanks to the schema-on-read nature of Hadoop & Hive.
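The schema-on-read idea above can be sketched without a Hadoop cluster: a "table" is just a directory of files, and appending data is just dropping another file into it; the schema is applied only when the data is read. A minimal Python analogy (the directory layout and the `id`/`amount` column names are hypothetical, chosen only for illustration):

```python
import csv
import glob
import os
import tempfile

def read_table(table_dir):
    """Schema-on-read: the 'table' is whatever files currently sit in the
    directory, and the column names are imposed at read time, not load time."""
    rows = []
    for path in sorted(glob.glob(os.path.join(table_dir, "*.csv"))):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f, fieldnames=["id", "amount"]))
    return rows

# Build a toy 'table' directory with one data file.
table_dir = tempfile.mkdtemp()
with open(os.path.join(table_dir, "part-0001.csv"), "w") as f:
    f.write("1,10\n2,20\n")
print(len(read_table(table_dir)))  # 2 rows

# 'Appending to the table' is just adding a file; no schema migration needed.
with open(os.path.join(table_dir, "part-0002.csv"), "w") as f:
    f.write("3,30\n")
print(len(read_table(table_dir)))  # 3 rows
```

In Hive terms, the second file plays the role of a new file uploaded into an existing HDFS table directory: the next query simply sees more rows.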
Since our on-premises and co-lo data centers were the only data centers we used, we went from POC to production with physical servers that lived in a rack in our datacenter, and we built backup & disaster recovery strategies around dual-pipeline processes & cross-cluster data replication from the Dev/Test & Prod clusters to a DR cluster in another state. We had a short-lived experiment with virtual machine nodes, but a Hadoop vendor quickly convinced us that physical was best, and in those days it was hard to argue that VMs would provide a better solution.
HDFS & Hive-on-MapReduce let us take large datasets and introduce parallelism into our processing very easily. We didn’t have to think about partition keys or segmenting data within our cluster: we copied a 50 GB data file to the cluster, HDFS split the file into blocks for us, and we could trust that when we ran our Hive query against that data there would likely be some degree of parallelism (though sometimes this was very small until we did think about partition keys and data segmentation) and a corresponding performance gain. At the end of the day we implemented Hadoop, saw impressive performance gains, and everyone was a hero to the business.
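The "free" parallelism described above falls out of simple arithmetic: HDFS splits a file into fixed-size blocks, and each block can be handled by its own map task, so the block count bounds the parallelism available from a single input file. A quick sketch, assuming the common 128 MB default block size (earlier Hadoop releases defaulted to 64 MB, and clusters can configure their own):

```python
import math

def map_task_parallelism(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Roughly one map task per HDFS block: the number of blocks a file
    occupies bounds the parallelism you get from that file."""
    return math.ceil(file_size_bytes / block_size_bytes)

# A 50 GB file, as in the example above:
fifty_gb = 50 * 1024 ** 3
print(map_task_parallelism(fifty_gb))  # 400 potential map tasks
```

This is also why the parallelism was "sometimes very small": a handful of small files (or a tiny partition) yields only a handful of blocks, and therefore only a handful of map tasks, no matter how large the cluster is.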
And then the cloud happened…
After I helped the company establish solutions to its priority challenges and felt the team was set on a solid path, I moved on to other challenges. At the same time, the rest of the world had been, and still was, awakening to this thing called the cloud. Architecting a Hadoop-based solution in the cloud should look different than one on premises, and the purpose of this series is to dive into those reasons, explore what they mean, and ultimately make the case for why I think Hadoop will evolve into only an API (if it hasn’t already).