In my earlier blog I wrote about the basics of the Big Data concept, and now I am continuing from there.
In my previous blog I talked about 2 layers: 1) the Storage Layer and 2) the Query Layer. Hadoop typically uses a master and slave architecture for distributed storage and parallel query. In a typical Hadoop cluster there are 3 types of nodes: the first is the Name node, the second is the Secondary/backup node and the third is the Data node. A Hadoop cluster can have as many data nodes as we want.
Name/Master Node: - Any call to the Hadoop cluster reaches the Name node first. As I mentioned, Hadoop distributes data across the data nodes; this is called HDFS (Hadoop Distributed File System). The Name node stores metadata, such as how the data is distributed and which data nodes it is sent to. The actual data is never stored on the Name node. We can think of the Name node as a management node that is responsible only for data and query management.
Now let's say a client application requests to store data in the Hadoop cluster. The Name node receives the call and tells the client which data nodes the data should be written to, along with the block size. The data is broken down into blocks. Each block is 64 MB by default, but the size is configurable. So if an application requests to store 1 GB of data, which is 1024 MB, HDFS creates 16 blocks (assuming the default block size). By default HDFS keeps 3 copies of each block (this can be changed) for the sake of data availability: if one of the data nodes fails, the data can still be read from another copy.
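To make the arithmetic above concrete, here is a minimal sketch in Python. It only illustrates the block count and the idea of keeping each block on 3 different data nodes. The function names, the node names and the round-robin placement are my own assumptions for illustration; real HDFS uses its own placement policy.

```python
import math

def block_count(file_size_mb, block_size_mb=64):
    """Number of blocks needed for a file; the last block may be partly filled."""
    return math.ceil(file_size_mb / block_size_mb)

def place_replicas(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes.

    Hypothetical round-robin placement, used here only for illustration.
    It is NOT the placement policy that HDFS actually uses.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement

# 1 GB file with the default 64 MB block size
blocks = block_count(1024)
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(blocks, "blocks,", blocks * 3, "block copies in total")
print("block 0 is stored on:", placement[0])
```

Because every block lives on 3 different nodes, losing any single data node still leaves 2 copies of each block it held.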
Secondary/Backup Node: - It's recommended to run the backup node on a different machine. The backup node takes a snapshot of the HDFS metadata from the Name node at a regular interval (defined in the configuration file). When the Name node fails for any reason, the backup node comes into the picture. However, failover doesn't happen automatically, which means there is downtime involved.
Data Node: - As I mentioned earlier, a file is split into blocks and each block is stored on data nodes. Data nodes are also used for processing queries. Each data node also sends heartbeats to the Name node so that the Name node knows whether the data node is alive or not.
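The heartbeat idea can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the class name, the 30 second timeout and the node names are hypothetical, and timestamps are passed in explicitly so the example is easy to follow. It is not how the Name node is actually implemented.

```python
class HeartbeatMonitor:
    """Toy model of a Name node tracking which data nodes are alive."""

    def __init__(self, timeout_seconds=30.0):
        # Hypothetical timeout; a node silent for longer is treated as dead.
        self.timeout = timeout_seconds
        self.last_seen = {}

    def record_heartbeat(self, node, now):
        """Remember the time of the latest heartbeat from a data node."""
        self.last_seen[node] = now

    def live_nodes(self, now):
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def dead_nodes(self, now):
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)

monitor = HeartbeatMonitor(timeout_seconds=30.0)
monitor.record_heartbeat("dn1", now=0.0)
monitor.record_heartbeat("dn2", now=0.0)
monitor.record_heartbeat("dn1", now=40.0)   # dn2 has gone silent

print("alive:", monitor.live_nodes(now=50.0))
print("dead: ", monitor.dead_nodes(now=50.0))
```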
The Map/Reduce framework is used for querying the data in the distributed storage.
Basically, it runs the query in parallel on different data nodes. In the picture above, the Job tracker monitors all the tasks running on the data nodes. Once a task completes, it reports back to the Job tracker. Running the tasks in parallel helps to get the data faster. If a task does not respond, the Job tracker retries it multiple times before allocating the task to another node.
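Here is a small Python sketch of the flow described above, using a word count as the query. It is a simplification under my own assumptions: the tasks run one after another instead of in parallel, and the function names, node names, the set of failed nodes and the limit of 2 attempts per node are all hypothetical. It is not the Hadoop API.

```python
from collections import Counter

FAILED_NODES = {"dn2"}  # hypothetical: pretend this data node is not responding

def map_task(block_text):
    """Map step: count the words in one block."""
    return Counter(block_text.split())

def run_on_node(node, block_text):
    """Simulate running the map task on a data node that may be down."""
    if node in FAILED_NODES:
        raise RuntimeError(node + " did not respond")
    return map_task(block_text)

def run_with_retries(block_text, nodes, max_attempts=2):
    """Retry on the same node a few times, then move to the next node holding the block."""
    for node in nodes:
        for _ in range(max_attempts):
            try:
                return node, run_on_node(node, block_text)
            except RuntimeError:
                continue
    raise RuntimeError("task failed on every node")

def reduce_task(partial_counts):
    """Reduce step: merge the per-block counts into one result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Each block is listed with the data nodes that hold a copy of it.
block_locations = [
    ("big data big cluster", ["dn1", "dn2"]),
    ("data node data", ["dn2", "dn3"]),
]

ran_on = []
partials = []
for text, nodes in block_locations:
    node, counts = run_with_retries(text, nodes)
    ran_on.append(node)
    partials.append(counts)

word_counts = reduce_task(partials)
print("tasks ran on:", ran_on)
print("word counts:", dict(word_counts))
```

The second block is first tried on dn2, which fails twice, so the task is handed to dn3, which holds another copy of the same block.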
In my next blog I will describe Microsoft's offerings for setting up a Hadoop cluster and walk through setting up a cluster using one of those offerings.
*Disclaimer: - The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.