Hadoop Ecosystem Installation - PART 1

Hey Guys!
This article might be useful for beginners who wish to learn Big Data frameworks and build a career in Big Data analytics, data science or Big Data administration. Conveniently, most of the tools and frameworks involved are open source.

Let us start with the definition of Big Data.
Big Data is the domain that deals with data sets whose volume, velocity, variety and veracity push them into the range of mega-, giga-, tera-, peta-, exa- and even zettabytes. Examples include social media analysis (recommenders), Online Transaction Processing (OLTP), e-commerce data (eBay, Amazon, Flipkart), ERP, Online Analytical Processing (OLAP), financial data, XML and JSON objects, Customer Relationship Management (CRM), sparse data from non-IT devices such as sensors, and large volumes of image, video and audio data (Netflix and Amazon Prime), etc.




To start with hands-on practice, three methods are suggested.


1. The first method
You can set up your own virtual machine with Apache Hadoop and the Hadoop Distributed File System (HDFS), plus the required ecosystem components for data ingestion, data processing, data analytics and the user interface, depending on whether the data store is SQL or NoSQL. (NoSQL stands for "Not Only SQL" and provides mechanisms for storing and retrieving data that is modeled in ways other than the tabular relations used in relational databases.)
Hadoop installation requires Java and SSH (Secure Shell, a cryptographic network protocol for operating network services securely over an unsecured network) as essentials.
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
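
Before following the Apache guide above, it is worth confirming that the two prerequisites are actually in place. The small Python sketch below is only an illustration and assumes a Linux/macOS host where passwordless SSH to localhost has already been configured as the guide describes; you can of course run the equivalent checks directly in a terminal.

import shutil
import subprocess

# Confirm that the java and ssh binaries are on the PATH.
for tool in ("java", "ssh"):
    print(tool, "->", shutil.which(tool) or "NOT FOUND")

# Show the Java version; Java 8 is a safe choice for recent Hadoop releases.
subprocess.run(["java", "-version"])

# The start-dfs.sh / start-yarn.sh scripts need passwordless SSH to localhost.
# BatchMode=yes makes the command fail instead of prompting for a password.
subprocess.run(["ssh", "-o", "BatchMode=yes", "localhost", "echo", "passwordless SSH OK"])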

2. The second method
You can use the Cloudera QuickStart VM on VirtualBox.
The Cloudera QuickStart VMs (single-node clusters) make it easy to get hands-on with CDH quickly for testing, demo and self-learning purposes, and they include Cloudera Manager for managing the cluster. The QuickStart VM also includes a tutorial, sample data and scripts for getting started.
Download VirtualBox from https://www.virtualbox.org/wiki/Downloads, select your host operating system, then download and install it.
Download the Cloudera QuickStart VM from https://www.cloudera.com/downloads/quickstart_vms/5-13.html, select VirtualBox as your platform and click Download. You will need to create a Cloudera account, which is free.

3. The third method
You can use the Hortonworks Sandbox on VirtualBox.
To begin your hands-on work and deploy the sandbox on VirtualBox, follow this guide:
https://hortonworks.com/tutorial/sandbox-deployment-and-install-guide/section/1/

Mindmap for Hadoop Ecosystem 

Since Hadoop and its ecosystem include a vast number of tools and frameworks, I would recommend you start with the following:


Be familiar with any one of the following NoSQL databases:
MongoDB, CouchDB, Cassandra, HBase
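
As a taste of what "being familiar" means in practice, here is a minimal Python sketch using MongoDB. It assumes the pymongo package is installed and a MongoDB server is running locally on the default port; the database, collection and document contents are made up purely for illustration.

from pymongo import MongoClient

# Connect to a local MongoDB instance (default port 27017).
client = MongoClient("mongodb://localhost:27017/")
db = client["practice_db"]  # hypothetical database name

# Insert one JSON-like document and read it back.
db.customers.insert_one({"name": "Asha", "city": "Chennai", "orders": 3})
print(db.customers.find_one({"name": "Asha"}))

client.close()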

Be familiar with any one of the following SQL databases:
Oracle, MySQL, PostgreSQL, IBM Informix & DB2

Data Ingestion Tools
One data ingestion tool for SQL sources (Ex. Sqoop)
One data ingestion tool for NoSQL sources (Ex. Flume)
Data Streaming
Spark Streaming or Apache Storm (a small Spark Streaming sketch follows this list)
Data Processing
SQL data processing tools (Ex. HDFS & MapReduce; a Python word-count sketch follows this list)
NoSQL data processing tool (Ex. HBase)
Data Analytics
SQL processing (Ex. Hive)
NoSQL processing (Ex. Pig Latin)
Machine Learning Algorithms
Spark Machine Learning Library (MLlib; a small training sketch follows this list)
Mahout framework
User Interface
Hadoop User Experience (HUE) is recommended for programmers
Cloudera Search is recommended for non-IT users
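
For the Data Streaming item above, here is a minimal PySpark Streaming sketch, assuming a local Spark installation. It counts words arriving on a TCP socket; in another terminal you can type test sentences into nc -lk 9999 and watch the per-batch counts appear.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one to receive data, one to process it.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Read lines from a socket, split into words and count per batch.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()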
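
For the Data Processing item, MapReduce jobs are normally written in Java, but Hadoop Streaming lets you sketch one quickly in Python. Below is a hypothetical word-count pair (mapper.py and reducer.py) that reads standard input and writes tab-separated key/value pairs, which is the contract Hadoop Streaming expects.

# mapper.py - emit (word, 1) for every word on standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py - sum the counts for each word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

A typical submission looks something like: hadoop jar /path/to/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /input -output /output (the exact JAR path depends on your distribution).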
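
And for the Machine Learning item, the sketch below trains a tiny logistic regression model with Spark MLlib (the DataFrame-based pyspark.ml API); the four hand-made rows are dummy data purely for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# A tiny, hand-made training set: (label, feature vector).
training = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1, 0.1])),
    (1.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5])),
], ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
print("Coefficients:", model.coefficients)

spark.stop()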

Once you become versatile in using the above-mentioned tools/frameworks, you can move on to the following:

Tableau, Oozie, ZooKeeper, Impala, Hive on Tez, Spark, Phoenix, Kafka, and security, authentication and management tools, etc.

The links below might be useful for self-learning on the Hadoop ecosystem.

http://hadooptutorial.info

https://data-flair.training/blogs/hadoop-tutorial/




Good article for MR1 and MR2 (YARN)

Excellent website for a list of NoSQL databases


Sqoop Hands-on
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html


Suggested YouTube Channels
edureka!
https://www.youtube.com/user/edurekaIN

Intellipaat 
https://www.youtube.com/user/intellipaaat/videos


Stay tuned for future episodes - Learn Hadoop Ecosystem
