Hadoop is the hot buzzword of the Big Data world, and many IT people are being told "go create a Hadoop cluster and do some magic". It's hard to know where to start or which projects are a good fit. The information available online is sparse, often conflicting, and usually focused on how to solve a technical problem rather than a business problem. So let's look at this from a business perspective.
A business using Hadoop for the first time is most likely to succeed with a project centered on data exploration, analytics and reporting, or the search for new, actionable, data-driven insights. In many ways Hadoop is "just another data source." Most businesses will not start by replacing existing, high-functioning OLTP implementations. Instead, you will likely see the highest initial return on investment (ROI) from adding on to those existing systems: pull some of the existing data into Hadoop, add new data, and look for new ways to use that data. Keep the goal clearly focused on taking action based on the data-driven insights you uncover. The characteristics below generally indicate a good first project.
- Goals include innovation, exploration, iteration, and experimentation. Hadoop allows you to ask lots of "what-if" questions cheaply, to "fail fast" so you can try out many potential hypotheses, and to look for that one cool thing everyone else has missed that can really impact your business.
- New data or data variations will be explored. Some of it may be loosely structured. Hadoop, especially in the cloud, allows you to import and experiment with data much more quickly and cheaply than with traditional systems. Hadoop on Azure in particular has the WASB (Windows Azure Storage Blob) option to make data ingestion even easier and faster.
- You are looking for the "Unknown Unknowns". There are always lurking things that haven't come to your attention before but which may be sparks for new actions. You know you don't know what you want or what to ask for and will use that to spur innovation.
- Flexible, fast scaling without the need to change your code is important. Hadoop is built on the premise of scaling out - you simply add more nodes when you need more processing power. In the cloud you can also scale your storage and compute separately and more easily scale down during slow periods.
- You are looking to gain a competitive advantage faster than your competition based on data-driven actions. This goes back to the previous points: you are using Hadoop to look for something new that can change your business or help you be first to market with something.
- There is a low number of direct, concurrent users of the Hadoop system itself. The more jobs you have running at the same time, the more robust and expensive your head node(s) must be, and often the larger your cluster must be. This changes the cost/benefit ratio quickly. Once data is processed and curated in Hadoop, it can be sent to systems that are less batch-oriented and more available and familiar to the average power user or data steward.
- Archiving data in a low-cost manner is important. Often historical data is kept in Hadoop while more interactive data is kept in a relational system.
Quite often I hear people proposing Hadoop for projects that are not an ideal use for it, at least not while you are still learning it and looking for quick successes to bolster confidence in the new technology. The characteristics below are generally indicators that you do NOT want to use Hadoop in a project.
- You plan to replace an existing system whose pain points don't align with Hadoop's strengths.
- There are OLTP business requirements, especially if they are adequately met by an existing system. Yes, there are some components of Hadoop that can meet OLTP requirements, and those features are growing and expanding rapidly. If you have an OLTP scenario that requires ACID properties and fast interactive response time, it is possible Hadoop could be a fit, but it's usually not a good first project for learning Hadoop and truly using its strengths.
- Data is well-known and the schema is static. Generally speaking, though the tipping point is changing rapidly, when you can use an index for a query it will likely be faster in a relational system. When you do the equivalent of a table scan across a large volume of data and provide enough scaled-out nodes it is likely faster on a Big Data system such as Hadoop. Well-known, well-structured data is highly likely to have well-known, repeated queries that have supporting indexes.
- A large number of users will need to directly access the system and they have interactive response time requirements (response within seconds).
- Your first project, and therefore your learning, is on a mission-critical system or application. Learn on something new, something that makes Hadoop's strengths really apparent and easy to see.
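To make the index-versus-scan point above a little more concrete, here is a minimal Python sketch. It is not Hadoop code: the `rows` data, the function names, and the use of threads to stand in for cluster nodes are all assumptions made for illustration. It only shows the shape of the two access patterns, a selective lookup answered from an index versus a full scan split into partitions that are processed independently and then combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (order_id, amount) rows standing in for a large table.
rows = [(order_id, order_id % 7) for order_id in range(100_000)]

# Relational style: an index maps a key straight to its value, so a
# selective lookup touches one entry instead of reading every row.
index = {order_id: amount for order_id, amount in rows}

def lookup(order_id):
    """Answer a selective query from the index."""
    return index[order_id]

def partition(data, parts):
    """Split the data into roughly equal chunks, one per worker."""
    size = (len(data) + parts - 1) // parts
    return [data[i:i + size] for i in range(0, len(data), size)]

def chunk_total(chunk):
    """Map step: scan one partition and return its partial sum."""
    return sum(amount for _, amount in chunk)

def scan_total(data, workers=4):
    """Full scan: map over partitions, then reduce the partial sums.

    Threads here only stand in for cluster nodes; this shows the
    structure of a scaled-out scan, not a real speedup.
    """
    chunks = partition(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(chunk_total, chunks))
    return sum(partials)
```

Calling `lookup(10)` reads a single entry, while `scan_total(rows)` has to read every row no matter how it is organized. The second pattern is the one that benefits from adding nodes, which is why well-indexed, repeated queries tend to stay in the relational system.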
And in Conclusion
Choosing the right first project for your dive into Hadoop is crucial. Make it bite-sized, clearly outline your goals, make sure it has some of the above success criteria, and avoid the anti-patterns. Make learning Hadoop a key goal of the project. Budget time for everyone to really learn not only how things work but why they work that way and whether there are better ways to do certain things. Hadoop is becoming ubiquitous, and avoiding it completely is not an option. Jump in, but do so with your eyes wide open and make some good up-front decisions. Happy Big Data-ing!