
What is Hadoop all about?
While studying Big Data, you will inevitably come across the term “Hadoop” quite frequently. But do you know what this cute yellow elephant is all about?

So, What exactly is Hadoop?
It is truly said that ‘Necessity is the mother of invention’, and Hadoop is amongst the finest inventions in the world of Big Data! Hadoop had to be developed sooner or later, as there was an acute need for a framework that could handle and process Big Data efficiently.
Technically speaking, Hadoop is an open-source software framework that supports data-intensive distributed applications. It is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally published by Google about its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project, built and used by a global community of contributors. Hadoop was created by Doug Cutting and Michael J. Cafarella. And the charming yellow elephant you see is named after Doug’s son’s toy elephant!

Hadoop Ecosystem:
Once you are familiar with ‘What is Hadoop’, let’s probe into its ecosystem. The Hadoop ecosystem is simply the set of components that makes Hadoop so powerful, among which HDFS and MapReduce are the core components!

1. HDFS:
The Hadoop Distributed File System (HDFS) is a very robust feature of Apache Hadoop. HDFS is designed to store gigantic amounts of data reliably, to move that data at high speed between nodes, and to let the system keep working smoothly even if individual nodes fail. It achieves this by splitting files into large blocks and replicating each block across several machines, so the loss of a single node does not lose data. In fact, HDFS manages around 40 petabytes of data at Yahoo! The key components of HDFS are the NameNode, the DataNodes and the Secondary NameNode.
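To make this concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API. The NameNode address and file path below are placeholders, not values from this tutorial:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:8020" is a placeholder; point it at your NameNode.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS replicates its blocks across DataNodes.
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same file back through the file system abstraction.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```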

2. MapReduce:
It all started with Google applying the concepts of functional programming to solve the problem of managing large amounts of data on the web. Google named the result the ‘MapReduce’ system and described it in a paper published in 2004. MapReduce is what helped Google search and index its enormous collection of web pages in a matter of seconds, or even a fraction of a second. With the ever-increasing amount of data generated on the web, Yahoo stepped in to develop Hadoop, which implements the MapReduce technique as open source. The key components of MapReduce are the JobTracker, the TaskTrackers and the JobHistoryServer.
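The classic illustration of the model is word counting: the map function emits a (word, 1) pair for every word it sees, and the reduce function sums the counts that arrive grouped by word. A compact version using Hadoop’s Java MapReduce API looks like this (input and output paths are supplied on the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts delivered for each distinct word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```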

3. Apache Pig:
Apache Pig is another component of Hadoop: a platform for evaluating huge data sets using a high-level language. In fact, Pig was initiated with the idea of creating and executing commands on Big Data sets. The basic attribute of Pig programs is ‘parallelization’, which helps them manage large data sets. Apache Pig consists of a compiler that generates a series of MapReduce programs and a ‘Pig Latin’ language layer that facilitates SQL-like queries over distributed data in Hadoop.
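As a rough sketch of how this looks in practice, Pig Latin statements can be run from Java through Pig’s embedded PigServer API; the log file name and field layout below are hypothetical:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for experimentation; use ExecType.MAPREDUCE on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each Pig Latin statement extends a data-flow plan; Pig's compiler
        // turns the plan into MapReduce jobs when results are requested.
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH grouped GENERATE group AS url, COUNT(logs) AS n;");

        // Triggering a STORE launches the compiled jobs and writes the output.
        pig.store("hits", "url_hit_counts");
    }
}
```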

4. Apache Hive:
As the name suggests, Hive is Hadoop’s data warehouse system. It enables quick data summarization, handles ad-hoc queries, and evaluates huge data sets located in Hadoop’s file systems, while maintaining full support for map/reduce. Another striking feature of Apache Hive is that it provides indexes, such as bitmap indexes, in order to speed up queries. Apache Hive was originally developed by Facebook, but it is now developed and used by other companies too, including Netflix.
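For illustration, a Hive query can be issued from Java over JDBC via HiveServer2; the host, port, and access_logs table below are assumptions, not part of this tutorial:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (needs the hive-jdbc jar on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2's JDBC endpoint; host, port, and database are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL, but Hive compiles it into jobs that
            // scan the data files stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS hits FROM access_logs GROUP BY url");
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```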

5. Apache HCatalog:
Apache HCatalog is another important component of Apache Hadoop, which provides a table and storage management service for data created with Apache Hadoop. HCatalog offers a shared schema and data type mechanism, a table abstraction for users, and smooth interoperability across other components of Hadoop such as Pig, MapReduce, Streaming, and Hive.
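As a hedged sketch of the idea, HCatalog’s Java client can read the shared table schema that Pig, Hive, and MapReduce jobs all see. The database and table names below are hypothetical, and the exact client API may vary between HCatalog versions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hive.hcatalog.api.HCatClient;
import org.apache.hive.hcatalog.api.HCatTable;
import org.apache.hive.hcatalog.data.schema.HCatFieldSchema;

public class HCatalogExample {
    public static void main(String[] args) throws Exception {
        // Connect to the shared metadata service; "default"/"access_logs"
        // are placeholder database and table names.
        Configuration conf = new Configuration();
        HCatClient client = HCatClient.create(conf);
        HCatTable table = client.getTable("default", "access_logs");

        // The same column names and types are visible to Pig, Hive,
        // and MapReduce jobs, which is the point of the shared schema.
        for (HCatFieldSchema field : table.getCols()) {
            System.out.println(field.getName() + "\t" + field.getTypeString());
        }
        client.close();
    }
}
```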

6. Apache HBase:
HBase is short for Hadoop Database. HBase is a distributed, column-oriented database that uses HDFS for storage. On one hand it manages batch-style computations using MapReduce, and on the other hand it handles point queries (random reads). The key components of Apache HBase are the HBase Master and the RegionServers.
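Here is a minimal sketch of a point write followed by a point query (random read) using the HBase Java client. It assumes a table named users with a column family info already exists, for example created beforehand in the HBase shell:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Point write: one row keyed "user42", one cell in family "info".
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Point query: random read straight back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```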

7. Apache Zookeeper:
Apache ZooKeeper is another significant part of the Hadoop ecosystem. Its major function is to keep a record of configuration information and naming, and to provide distributed synchronization and group services, all of which are immensely crucial for various distributed systems. In fact, HBase depends upon ZooKeeper for its functioning.
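A small sketch of that record-keeping role using the ZooKeeper Java client: the code stores a piece of configuration at a znode and reads it back. The connection string and znode path are placeholders:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Wait until the client has actually connected before using it.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of configuration at a znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        }

        // Read it back; every connected client sees the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```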
All these components together make Hadoop a real solution for facing the challenges of Big Data!
