Hadoop Introduction & Terminologies - PART 2

Please refer to the previous article for continuity. Click here for the link.

Fundamentals
The foundation of the Big Data and Hadoop ecosystem rests on distributed operating systems, distributed file systems, data structures, and database management systems. A distributed system is a model in which components on networked computers communicate and coordinate their actions by passing messages.
How Does a Distributed System Work?
A single machine has multiple I/O channels, and each channel is capable of streaming data at roughly 100 MB/s. A distributed system parallelizes work across many such machines, so the aggregate throughput grows with the number of nodes.
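As a rough, illustrative calculation (treating the 100 MB/s figure above as a ballpark): reading 1 TB (about 1,000,000 MB) through a single 100 MB/s channel takes roughly 10,000 seconds, close to 3 hours. Spread the same read across 100 machines working in parallel and it drops to about 100 seconds. That aggregate throughput is the whole point of distributing the work.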

In recent times, traditional distributed systems have largely been replaced by Hadoop. Hadoop overcomes their shortfalls, such as a high chance of system failure, limited bandwidth, and high programming complexity.

Hadoop
Hadoop is a framework that allows distributed processing of large datasets across clusters of computers using a simple programming model. Doug Cutting created Hadoop and named it after his son's yellow toy elephant. Hadoop attracted commercial users and the IT industry after Google published its technical white papers on GFS and MapReduce. Hadoop runs on Linux platforms such as Ubuntu.

Apache Hadoop (an open-source framework that runs on Linux or Ubuntu platforms) is the most important framework for working with Big Data. It scales seamlessly from a single node to thousands of nodes. Hadoop runs applications on the basis of MapReduce, where data is processed in parallel to accomplish statistical analysis over large amounts of data.


Hadoop is a framework based on Java. It is designed to scale from a single server to thousands of machines, each offering local computation and storage, and it supports large data sets in a distributed computing environment. The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.
Much of the Hadoop codebase has been contributed by Yahoo, IBM, Facebook, and Cloudera.
Hadoop has two major core components: Hadoop HDFS (Hadoop Distributed File System), which is modeled on the Google File System (GFS), and Hadoop MapReduce (MR). MR is a parallel programming model used for processing and writing large amounts of data into a data warehouse.

Hadoop Architectures 
Hadoop 1.0 --> HDFS + MapReduce (era of Silicon Valley Hadoop: let there be batch processing)
Hadoop 2.0 --> HDFS + MapReduce + YARN (era of enterprise Hadoop: let there be YARN apps)
Hadoop 3.0 --> JDK 8 is the minimum runtime version of Java; improved fault tolerance with support for erasure coding; improved scalability and reliability; intra-DataNode balancing introduced; supports two or more NameNodes; task heap sizes are derived automatically from mapreduce.*.memory.mb.

Erasure Coding: a storage-efficiency technique. The replication method consumes a great deal of storage space; this is reduced drastically with the help of erasure coding, which was traditionally used for less frequently accessed data. To know more, click here:
https://blog.knoldus.com/hdfs-erasure-coding-hadoop-3-0/
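A quick worked comparison, assuming the Reed-Solomon RS(6,3) layout that Hadoop 3 ships as the RS-6-3-1024k policy: with 3x replication, storing 1 GB of data consumes 3 GB of raw disk (200% overhead). With RS(6,3), every 6 data blocks get 3 parity blocks, so the same 1 GB consumes only about 1.5 GB (50% overhead) while still tolerating the loss of any 3 of the 9 blocks.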

Hadoop Programming

You can code in C, C++, Perl, Python, R, Java, Ruby, etc. You can program against the Hadoop framework in any of these languages, but it is better to code in Java, as it gives you lower-level control over the code and its bytecode.
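To make the Java programming model concrete, here is a minimal sketch of the classic WordCount job against the org.apache.hadoop.mapreduce API (the class names and the two command-line paths are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Notice that the mapper and reducer contain only per-record logic; the framework handles distribution, shuffling, and fault tolerance.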

Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located; in Hadoop, the program is sent to the data. Hadoop works best when the data size is big: it can process and store a large amount of data easily and effectively. Hadoop can store a variety of data: structured, semi-structured, unstructured, and sparse.

Structured Data: rows-and-columns data, such as tables, RDBMS schemas, Excel sheets, and CSV files

Semi-Structured: XML scripts and JavaScript Object Notation (JSON) documents

Unstructured: Audio, Video and Images

Sparse Data: data generated from non-IT devices such as sensors (IoT)

Hadoop runs code across a cluster of computers and performs the following tasks:
  • Data is initially divided into files and directories. Files are divided into uniformly sized blocks, typically 128 MB or 64 MB (see the sketch after this list).
  • The blocks are then distributed across various cluster nodes for further processing of the data.
  • The JobTracker starts scheduling programs on the individual nodes.
  • Once all the nodes are done processing, the output is returned.
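As a hedged sketch (the file path is hypothetical; the FileSystem and BlockLocation calls are from the standard HDFS Java client), this is how a client can see a file's block size and the DataNodes holding each block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    // Connects to the cluster described by the configuration on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());

    Path file = new Path("/data/sample.txt");  // hypothetical path
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size: " + status.getBlockSize());

    // One BlockLocation per block, listing the DataNodes that hold it.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Offset " + block.getOffset()
          + " hosts: " + String.join(", ", block.getHosts()));
    }
  }
}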
There is a standardized methodology that Big Data follows, highlighting the usage of ETL.
ETL stands for Extract, Transform, and Load.
Extract – fetching the data from multiple sources
Transform – converting the existing data to fit the analytical need
Load – loading the transformed data into the right systems to derive value from it
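A toy, self-contained sketch of the ETL pattern in plain Java; the file names and the transformation rules are invented for illustration:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class MiniEtl {
  public static void main(String[] args) throws IOException {
    // Extract: fetch raw records from a source (here, a local CSV file).
    List<String> raw = Files.readAllLines(Path.of("sales_raw.csv"));

    // Transform: reshape the data to fit the analytical need
    // (trim whitespace, normalise case, drop empty rows).
    List<String> clean = raw.stream()
        .map(String::trim)
        .filter(line -> !line.isEmpty())
        .map(line -> line.toLowerCase(Locale.ROOT))
        .collect(Collectors.toList());

    // Load: write the cleaned records into the target system
    // (here, another file standing in for a warehouse table).
    Files.write(Path.of("sales_clean.csv"), clean);
  }
}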

Hadoop Characteristics
Scalable: data can flow both horizontally (parallel streaming/processing across nodes) and vertically (growing volumes, from MBs to GBs and beyond)
Flexible: stores a lot of data in raw form and enables you to decide how to use it later
Reliable: stores three copies of the data on different machines, making it resistant to hardware failure (fault tolerant)
Master-Slave Architecture: Hadoop uses a master-slave model; the HDFS Name Node acts as the master, whereas the Data Nodes act as slaves. It is always recommended that master and slave nodes be kept separate, because slave nodes are frequently decommissioned (withdrawn from service) for maintenance.

Consider the following Problems:
  1. If one of the nodes fails, the data stored at that node goes missing.
  2. Network failure issues.
  3. Single point of failure: the Name Node, which is the heart of the Hadoop ecosystem.

Solution:

  1. Hadoop solves this problem by replicating every data block three times and saving the copies on different nodes, so that even if one node fails, the data can still be recovered.
  2. Network failure remains an important issue, as a lot of shuffling happens in day-to-day activity; failed transfers and tasks are simply retried.
  3. Name Node failure was initially mitigated by storing the Name Node metadata on an NFS (Network File System) mount, from which it can be recovered in case of a Name Node crash.
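A small hedged sketch of how replication can be inspected or changed per file through the HDFS Java client (the path is hypothetical; the cluster-wide default comes from dfs.replication in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/important.log");  // hypothetical path

    // Read the current replication factor of the file.
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("Current replication: " + current);

    // Ask HDFS to keep three copies of every block of this file.
    fs.setReplication(file, (short) 3);
  }
}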

Hadoop Cluster
A cluster is a group of systems connected through a Local Area Network (LAN).
1. A cluster is an interconnection of multiple nodes.
2. In a large cluster, nodes are placed in racks.
3. Within a rack, all the nodes are connected to a switch.
4. All the rack switches are connected to a main switch.
5. Switches are normally of high bandwidth, with 24-48 ports per switch.
6. The bandwidth within a rack is normally 3 to 4 GB/s.
7. The bandwidth between racks is 1 GB/s.
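Taking the figures above at face value: moving 100 GB between two nodes in the same rack at 3 GB/s takes roughly 33 seconds, while moving it across racks at 1 GB/s takes about 100 seconds. This gap is why Hadoop prefers to run computation on the node, or at least in the rack, where the data already lives.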

Types of Cluster Modes
Hadoop basically runs in one of three modes: standalone (local) mode, pseudo-distributed mode, or fully distributed mode.
Fully Distributed Mode: the Hadoop daemons run on a cluster of machines.

Standalone or Local Mode: no daemons; everything runs in a single Java Virtual Machine (JVM) against the local filesystem, with no need for HDFS. Suitable for running MapReduce programs during development.

Pseudo-Distributed Mode: all the Hadoop daemons run on a single local machine, simulating a small cluster.
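The mode is effectively selected by configuration rather than by code. As an illustrative sketch, these are the conventional fs.defaultFS values (normally set in core-site.xml; shown here programmatically only for compactness, and the localhost port is the conventional one):

import org.apache.hadoop.conf.Configuration;

public class ModeConfigDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Standalone/local mode: the local filesystem, no daemons at all.
    conf.set("fs.defaultFS", "file:///");

    // Pseudo-distributed (or fully distributed) mode: point clients
    // at a running NameNode; localhost for pseudo-distributed.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");

    System.out.println("Default FS: " + conf.get("fs.defaultFS"));
  }
}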

HDFS Daemons 
HDFS is Hadoop's distributed storage layer. It runs the following daemons:
Name Node (FS Image, Edit Log),
Data Node, and
Secondary Name Node (FS Image).
Data is distributed across nodes and is natively redundant, and the Name Node tracks the block locations.



Notes: 
DAEMON
In multitasking computer operating systems, a daemon (/ˈdiːmən/ or /ˈdeɪmən/) is a computer program that runs as a background process, rather than being under the direct control of an interactive user. Traditionally, the process names of daemons end with the letter d, to clarify that the process is in fact a daemon and to differentiate it from a normal computer program. For example, syslogd is the daemon that implements the system logging facility, and sshd is a daemon that serves incoming SSH connections.

FILE SYSTEM (FS) IMAGE
  • The fsimage is a point-in-time snapshot of HDFS's namespace (the collection of names within the system); file names with their paths are maintained by the Name Node.
  • The edit log records every change since the last snapshot; the last snapshot itself is stored in the fsimage.
Name Node
The Name Node is the master server that maintains the metadata about the files in the system, whereas the Data Nodes store the actual data.
Metadata
Metadata maintained by the Name Node includes the name of the file (including its path), size, owner, group, permissions, block size, etc. The Name Node also gets a report from each Data Node, called a block report, that contains the location of each block of a file within that Data Node.
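A brief hedged sketch reading that same metadata through the HDFS Java client (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical

    // The same metadata the Name Node keeps in its namespace:
    System.out.println("Path:        " + status.getPath());
    System.out.println("Size:        " + status.getLen());
    System.out.println("Owner:       " + status.getOwner());
    System.out.println("Group:       " + status.getGroup());
    System.out.println("Permissions: " + status.getPermission());
    System.out.println("Block size:  " + status.getBlockSize());
  }
}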
FS Image
The Name Node stores the metadata about the files both in memory and on disk. On disk, this metadata is stored in a file called the fsimage.
The fsimage is read from disk when the Name Node starts and is then maintained in memory.
Edit Logs
Any changes made to the filesystem (adding a file, removing a file, etc.) are not written to the fsimage immediately; they are maintained in a separate file on disk called the edit log. When the Name Node starts, it merges the edit log changes into the old fsimage file and writes out an updated copy.
Secondary Name Node
This process of merging the edit log changes into the fsimage file can also be performed periodically by a checkpointing process run by the Secondary Name Node.
Data Node

The Data Node stores the actual data blocks.

Web User Interface (UI)
Hadoop exposes web UIs for monitoring the Name Node status and the Resource Manager status.

Big data processing has four major stages: data ingestion, data processing, data analytics, and data access. We will look at each stage in detail, along with the corresponding Hadoop 2.0 ecosystem components, in a forthcoming article.
Stay tuned, and please share this with your friends and circles.
