Hadoop Ecosystem Installation - PART 1

Hey Guys!
This article might be useful for beginners who wish to learn Big Data frameworks and build a career in Big Data analytics, data science or Big Data administration. Conveniently, most of the tools and frameworks involved are open source.

Let us start with the definition of Big Data.
Big Data is the domain that deals with data sets whose volume, velocity, variety and veracity push them into the range of mega-, giga-, tera-, peta-, exa- and even zettabytes. Examples include social media analysis (recommenders), Online Transaction Processing (OLTP), e-commerce data (eBay, Amazon, Flipkart), ERP, Online Analytical Processing (OLAP), financial data, XML and JSON objects, Customer Relationship Management (CRM), sparse data from non-IT devices such as sensors, and large volumes of image, video and audio data (Netflix and Amazon Prime), etc.




To start with hands-on practice, three methods are suggested.


1. The first method
You can set up your own virtual machine with Apache Hadoop and the Hadoop Distributed File System (HDFS), plus the required ecosystem components for data ingestion, data processing, data analytics and the user interface, depending on whether the data store is SQL or NoSQL. (NoSQL stands for "Not Only SQL" and provides mechanisms for storing and retrieving data that is modeled in ways other than the tabular relations used in relational databases.)
Hadoop installation requires Java and SSH (Secure Shell, a cryptographic network protocol for operating network services securely over an unsecured network) as essentials.
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
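
Before following the Apache guide above, it is worth confirming that the two prerequisites are actually in place. The small Python sketch below is only an illustration and assumes a Linux/macOS host where passwordless SSH to localhost has already been configured as the guide describes; you can of course run the equivalent checks directly in a terminal.

import shutil
import subprocess

# Confirm that the java and ssh binaries are on the PATH.
for tool in ("java", "ssh"):
    print(tool, "->", shutil.which(tool) or "NOT FOUND")

# Show the Java version; Java 8 is a safe choice for recent Hadoop releases.
subprocess.run(["java", "-version"])

# The start-dfs.sh / start-yarn.sh scripts need passwordless SSH to localhost.
# BatchMode=yes makes the command fail instead of prompting for a password.
subprocess.run(["ssh", "-o", "BatchMode=yes", "localhost", "echo", "passwordless SSH OK"])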

2. The second method
You can use the Cloudera QuickStart VM on VirtualBox.
The Cloudera QuickStart VMs (single-node clusters) make it easy to get hands-on with CDH quickly for testing, demo and self-learning purposes, and they include Cloudera Manager for managing the cluster. The QuickStart VM also includes a tutorial, sample data and scripts for getting started.
Download VirtualBox from https://www.virtualbox.org/wiki/Downloads, select your host operating system, then download and install it.
Download the Cloudera QuickStart VM from https://www.cloudera.com/downloads/quickstart_vms/5-13.html, select VirtualBox as your platform and click Download. You will need to create a Cloudera account, which is free.

3. The third method
You can use the Hortonworks Sandbox on VirtualBox.
To begin your hands-on work and deploy the sandbox on VirtualBox, follow this guide:
https://hortonworks.com/tutorial/sandbox-deployment-and-install-guide/section/1/

Mindmap for Hadoop Ecosystem 

Since Hadoop and its ecosystem include a vast number of tools and frameworks, I would recommend you start with the following:


Be familiar with any one of the following NoSQL databases:
MongoDB, CouchDB, Cassandra, HBase
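
As a taste of what "being familiar" means in practice, here is a minimal Python sketch using MongoDB. It assumes the pymongo package is installed and a MongoDB server is running locally on the default port; the database, collection and document contents are made up purely for illustration.

from pymongo import MongoClient

# Connect to a local MongoDB instance (default port 27017).
client = MongoClient("mongodb://localhost:27017/")
db = client["practice_db"]  # hypothetical database name

# Insert one JSON-like document and read it back.
db.customers.insert_one({"name": "Asha", "city": "Chennai", "orders": 3})
print(db.customers.find_one({"name": "Asha"}))

client.close()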

Be familiar with any one of the following SQL databases:
Oracle, MySQL, PostgreSQL, IBM Informix & DB2

Data Ingestion Tools
One data ingestion tool for SQL sources (Ex. Sqoop)
One data ingestion tool for NoSQL sources (Ex. Flume)
Data Streaming
Spark Streaming or Apache Storm (a small Spark Streaming sketch follows this list)
Data Processing
SQL data processing tools (Ex. HDFS & MapReduce; a Python word-count sketch follows this list)
NoSQL data processing tool (Ex. HBase)
Data Analytics
SQL processing (Ex. Hive)
NoSQL processing (Ex. Pig Latin)
Machine Learning Algorithms
Spark Machine Learning Library (MLlib; a small training sketch follows this list)
Mahout framework
User Interface
Hadoop User Experience (HUE) is recommended for programmers
Cloudera Search is recommended for non-IT users
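
For the Data Streaming item above, here is a minimal PySpark Streaming sketch, assuming a local Spark installation. It counts words arriving on a TCP socket; in another terminal you can type test sentences into nc -lk 9999 and watch the per-batch counts appear.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one to receive data, one to process it.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Read lines from a socket, split into words and count per batch.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()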
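
For the Data Processing item, MapReduce jobs are normally written in Java, but Hadoop Streaming lets you sketch one quickly in Python. Below is a hypothetical word-count pair (mapper.py and reducer.py) that reads standard input and writes tab-separated key/value pairs, which is the contract Hadoop Streaming expects.

# mapper.py - emit (word, 1) for every word on standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py - sum the counts for each word (input arrives sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

A typical submission looks something like: hadoop jar /path/to/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /input -output /output (the exact JAR path depends on your distribution).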
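
And for the Machine Learning item, the sketch below trains a tiny logistic regression model with Spark MLlib (the DataFrame-based pyspark.ml API); the four hand-made rows are dummy data purely for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# A tiny, hand-made training set: (label, feature vector).
training = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1, 0.1])),
    (1.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5])),
], ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)
print("Coefficients:", model.coefficients)

spark.stop()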

Once you become versatile in using the above-mentioned tools/frameworks, you can move on to the following:

Tableau, Oozie, ZooKeeper, Impala, Hive on Tez, Spark, Phoenix, Kafka, and security, authentication and management tools, etc.

The links below might be useful for self-learning on the Hadoop ecosystem.

http://hadooptutorial.info

https://data-flair.training/blogs/hadoop-tutorial/




Good article for MR1 and MR2 (YARN)

Excellent website for a list of NoSQL databases


Sqoop Hands-on
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html


Suggested YouTube Channels
edureka!
https://www.youtube.com/user/edurekaIN

Intellipaat 
https://www.youtube.com/user/intellipaaat/videos


Stay tuned for future episodes - Learn Hadoop Ecosystem
