Hadoop Introduction & Terminologies - PART 2

Please refer to the previous article for continuity. Click here for the link.

Fundamentals
The foundation of the Big Data and Hadoop ecosystem rests on distributed operating systems, distributed file systems, data structures, and database management systems. A distributed system is a model in which components on networked computers communicate and coordinate their actions by passing messages.
How Does a Distributed System Work?
A single machine has multiple I/O channels, and each channel is capable of streaming data at roughly 100 MB/s. A distributed system parallelizes work across many such machines, so the aggregate throughput grows with the number of nodes.
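As a rough, illustrative calculation (treating the 100 MB/s figure above as a ballpark): reading 1 TB (about 1,000,000 MB) through a single 100 MB/s channel takes roughly 10,000 seconds, close to 3 hours. Spread the same read across 100 machines working in parallel and it drops to about 100 seconds. That aggregate throughput is the whole point of distributing the work.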

In recent times, traditional distributed systems have largely been replaced by Hadoop. Hadoop overcomes their shortfalls, such as a high chance of system failure, limited bandwidth, and high programming complexity.

Hadoop
Hadoop is a framework that allows distributed processing of large datasets across clusters of computers using a simple programming model. Doug Cutting created Hadoop and named it after his son's yellow toy elephant. Hadoop attracted commercial users and the IT industry after Google published its technical white papers on GFS and MapReduce. Hadoop runs on Linux platforms such as Ubuntu.

Apache Hadoop (an open-source framework that runs on Linux or Ubuntu platforms) is the most important framework for working with Big Data. It scales seamlessly from a single node to thousands of nodes. Hadoop runs applications on the basis of MapReduce, where data is processed in parallel to accomplish statistical analysis over large amounts of data.


Hadoop is a framework based on Java. It is designed to scale from a single server to thousands of machines, each offering local computation and storage, and it supports large data sets in a distributed computing environment. The Apache Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models.
Much of the Hadoop codebase has been contributed by Yahoo, IBM, Facebook, and Cloudera.
Hadoop has two major core components: Hadoop HDFS (Hadoop Distributed File System), which is modeled on the Google File System (GFS), and Hadoop MapReduce (MR). MR is a parallel programming model used for processing and writing large amounts of data into a data warehouse.

Hadoop Architectures 
Hadoop 1.0 --> HDFS + MapReduce (era of Silicon Valley Hadoop: let there be batch processing)
Hadoop 2.0 --> HDFS + MapReduce + YARN (era of enterprise Hadoop: let there be YARN apps)
Hadoop 3.0 --> JDK 8 is the minimum runtime version of Java; improved fault tolerance with support for erasure coding; improved scalability and reliability; intra-DataNode balancing introduced; supports two or more NameNodes; task heap sizes are derived automatically from mapreduce.*.memory.mb.

Erasure Coding: a storage-efficiency technique. The replication method consumes a great deal of storage space; this is reduced drastically with the help of erasure coding, which was traditionally used for less frequently accessed data. To know more, click here:
https://blog.knoldus.com/hdfs-erasure-coding-hadoop-3-0/
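A quick worked comparison, assuming the Reed-Solomon RS(6,3) layout that Hadoop 3 ships as the RS-6-3-1024k policy: with 3x replication, storing 1 GB of data consumes 3 GB of raw disk (200% overhead). With RS(6,3), every 6 data blocks get 3 parity blocks, so the same 1 GB consumes only about 1.5 GB (50% overhead) while still tolerating the loss of any 3 of the 9 blocks.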

Hadoop Programming

You can code in C, C++, Perl, Python, R, Java, Ruby, etc. You can program against the Hadoop framework in any of these languages, but it is better to code in Java, as it gives you lower-level control over the code and its bytecode.
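To make the Java programming model concrete, here is a minimal sketch of the classic WordCount job against the org.apache.hadoop.mapreduce API (the class names and the two command-line paths are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Notice that the mapper and reducer contain only per-record logic; the framework handles distribution, shuffling, and fault tolerance.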

Hadoop initially distributes the data to multiple systems and later runs the computation wherever the data is located; in Hadoop, the program is sent to the data. Hadoop works best when the data size is big: it can process and store a large amount of data easily and effectively. Hadoop can store a variety of data: structured, semi-structured, unstructured, and sparse.

Structured Data: rows-and-columns data, such as tables, RDBMS schemas, Excel sheets, and CSV files

Semi-Structured: XML scripts and JavaScript Object Notation (JSON) documents

Unstructured: Audio, Video and Images

Sparse Data: data generated from non-IT devices such as sensors (IoT)

Hadoop runs code across a cluster of computers and performs the following tasks:
  • Data is initially divided into files and directories. Files are divided into uniformly sized blocks, typically 128 MB or 64 MB (see the sketch after this list).
  • The blocks are then distributed across various cluster nodes for further processing of the data.
  • The JobTracker starts scheduling programs on the individual nodes.
  • Once all the nodes are done processing, the output is returned.
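As a hedged sketch (the file path is hypothetical; the FileSystem and BlockLocation calls are from the standard HDFS Java client), this is how a client can see a file's block size and the DataNodes holding each block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    // Connects to the cluster described by the configuration on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());

    Path file = new Path("/data/sample.txt");  // hypothetical path
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size: " + status.getBlockSize());

    // One BlockLocation per block, listing the DataNodes that hold it.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Offset " + block.getOffset()
          + " hosts: " + String.join(", ", block.getHosts()));
    }
  }
}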
There is a standardized methodology that Big Data follows, highlighting the usage of ETL.
ETL stands for Extract, Transform, and Load.
Extract – fetching the data from multiple sources
Transform – converting the existing data to fit the analytical need
Load – loading the transformed data into the right systems to derive value from it
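A toy, self-contained sketch of the ETL pattern in plain Java; the file names and the transformation rules are invented for illustration:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class MiniEtl {
  public static void main(String[] args) throws IOException {
    // Extract: fetch raw records from a source (here, a local CSV file).
    List<String> raw = Files.readAllLines(Path.of("sales_raw.csv"));

    // Transform: reshape the data to fit the analytical need
    // (trim whitespace, normalise case, drop empty rows).
    List<String> clean = raw.stream()
        .map(String::trim)
        .filter(line -> !line.isEmpty())
        .map(line -> line.toLowerCase(Locale.ROOT))
        .collect(Collectors.toList());

    // Load: write the cleaned records into the target system
    // (here, another file standing in for a warehouse table).
    Files.write(Path.of("sales_clean.csv"), clean);
  }
}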

Hadoop Characteristics
Scalable: data can flow both horizontally (parallel streaming/processing across nodes) and vertically (growing volumes, from MBs to GBs and beyond)
Flexible: stores a lot of data in raw form and enables you to decide how to use it later
Reliable: stores three copies of the data on different machines, making it resistant to hardware failure (fault tolerant)
Master-Slave Architecture: Hadoop uses a master-slave model; the HDFS Name Node acts as the master, whereas the Data Nodes act as slaves. It is always recommended that master and slave nodes be kept separate, because slave nodes are frequently decommissioned (withdrawn from service) for maintenance.

Consider the following Problems:
  1. If one of the nodes fails, the data stored at that node goes missing.
  2. Network failure issues.
  3. Single point of failure: the Name Node, which is the heart of the Hadoop ecosystem.

Solution:

  1. Hadoop solves this problem by replicating every data block three times and saving the copies on different nodes, so that even if one node fails, the data can still be recovered.
  2. Network failure remains an important issue, as a lot of shuffling happens in day-to-day activity; failed transfers and tasks are simply retried.
  3. Name Node failure was initially mitigated by storing the Name Node metadata on an NFS (Network File System) mount, from which it can be recovered in case of a Name Node crash.
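A small hedged sketch of how replication can be inspected or changed per file through the HDFS Java client (the path is hypothetical; the cluster-wide default comes from dfs.replication in hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/important.log");  // hypothetical path

    // Read the current replication factor of the file.
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("Current replication: " + current);

    // Ask HDFS to keep three copies of every block of this file.
    fs.setReplication(file, (short) 3);
  }
}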

Hadoop Cluster
A cluster is a group of systems connected through a Local Area Network (LAN).
1. A cluster is an interconnection of multiple nodes.
2. In a large cluster, nodes are placed in racks.
3. Within a rack, all the nodes are connected to a switch.
4. All the rack switches are connected to a main switch.
5. Switches are normally of high bandwidth, with 24-48 ports per switch.
6. The bandwidth within a rack is normally 3 to 4 GB/s.
7. The bandwidth between racks is 1 GB/s.
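Taking the figures above at face value: moving 100 GB between two nodes in the same rack at 3 GB/s takes roughly 33 seconds, while moving it across racks at 1 GB/s takes about 100 seconds. This gap is why Hadoop prefers to run computation on the node, or at least in the rack, where the data already lives.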

Types of Cluster Modes
Hadoop basically runs in one of three modes: standalone (local) mode, pseudo-distributed mode, or fully distributed mode.
Fully Distributed Mode: the Hadoop daemons run on a cluster of machines.

Standalone or Local Mode: no daemons; everything runs in a single Java Virtual Machine (JVM) against the local filesystem, with no need for HDFS. Suitable for running MapReduce programs during development.

Pseudo-Distributed Mode: all the Hadoop daemons run on a single local machine, simulating a small cluster.
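The mode is effectively selected by configuration rather than by code. As an illustrative sketch, these are the conventional fs.defaultFS values (normally set in core-site.xml; shown here programmatically only for compactness, and the localhost port is the conventional one):

import org.apache.hadoop.conf.Configuration;

public class ModeConfigDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Standalone/local mode: the local filesystem, no daemons at all.
    conf.set("fs.defaultFS", "file:///");

    // Pseudo-distributed (or fully distributed) mode: point clients
    // at a running NameNode; localhost for pseudo-distributed.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");

    System.out.println("Default FS: " + conf.get("fs.defaultFS"));
  }
}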

HDFS Daemons 
HDFS is Hadoop's distributed storage layer. It runs the following daemons:
Name Node (FS Image, Edit Log),
Data Node, and
Secondary Name Node (FS Image).
Data is distributed across nodes and is natively redundant, and the Name Node tracks the block locations.



Notes: 
DAEMON
In multitasking computer operating systems, a daemon (/ˈdiːmən/ or /ˈdeɪmən/) is a computer program that runs as a background process, rather than being under the direct control of an interactive user. Traditionally, the process names of daemons end with the letter d, to clarify that the process is in fact a daemon and to differentiate it from a normal computer program. For example, syslogd is the daemon that implements the system logging facility, and sshd is a daemon that serves incoming SSH connections.

FILE SYSTEM (FS) IMAGE
  • The fsimage is a point-in-time snapshot of HDFS's namespace (the collection of names within the system); file names with their paths are maintained by the Name Node.
  • The edit log records every change since the last snapshot; the last snapshot itself is stored in the fsimage.
Name Node
The Name Node is the master server that maintains the metadata about the files in the system, whereas the Data Nodes store the actual data.
Metadata
Metadata maintained by the Name Node includes the name of the file (including its path), size, owner, group, permissions, block size, etc. The Name Node also gets a report from each Data Node, called a block report, that contains the location of each block of a file within that Data Node.
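A brief hedged sketch reading that same metadata through the HDFS Java client (the path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical

    // The same metadata the Name Node keeps in its namespace:
    System.out.println("Path:        " + status.getPath());
    System.out.println("Size:        " + status.getLen());
    System.out.println("Owner:       " + status.getOwner());
    System.out.println("Group:       " + status.getGroup());
    System.out.println("Permissions: " + status.getPermission());
    System.out.println("Block size:  " + status.getBlockSize());
  }
}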
FS Image
The Name Node stores the metadata about the files both in memory and on disk. On disk, this metadata is stored in a file called the fsimage.
The fsimage is read from disk when the Name Node starts and is then maintained in memory.
Edit Logs
Any changes made to the filesystem (adding a file, removing a file, etc.) are not written to the fsimage immediately; they are maintained in a separate file on disk called the edit log. When the Name Node starts, it merges the edit log changes into the old fsimage file and writes out an updated copy.
Secondary Name Node
This process of merging the edit log changes into the fsimage file can also be performed periodically by a checkpointing process run by the Secondary Name Node.
Data Node

The Data Node stores the actual data blocks.

Web User Interface (UI)
Hadoop exposes web UIs for monitoring the Name Node status and the Resource Manager status.

Big data processing has four major stages: data ingestion, data processing, data analytics, and data access. We will look at each stage in detail, along with the corresponding Hadoop 2.0 ecosystem components, in a forthcoming article.
Stay tuned, and please share this with your friends and circles.
