Hadoop Data Format, Ingestion, Streaming - PART 3

To read the previous related articles:


Hadoop Eco System Installation - PART 1 



Hadoop Introduction & Terminologies - PART 2



Data Formats
A data format is a way of representing and storing raw data on secondary storage devices. Each file format has its own pros and cons depending on the business context and use case. Common options range from plain text files (CSV, TSV, XML or JSON) and binary files to rich file formats like Avro, ORC and Parquet (a short example follows the link below).

To know more about data formats, click here: https://techmagie.wordpress.com/category/big-data/data-formats/
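As a quick, hedged illustration (not part of the original article), the PySpark snippet below reads a plain CSV file and rewrites it in the columnar Parquet format; the file paths are placeholders:

    from pyspark.sql import SparkSession

    # Start a local Spark session (assumes PySpark is installed).
    spark = SparkSession.builder.appName("format-demo").getOrCreate()

    # Read a plain-text CSV file, inferring column types from the data.
    df = spark.read.csv("/data/raw/events.csv", header=True, inferSchema=True)

    # Rewrite the same data as Parquet: columnar, compressed and splittable.
    df.write.mode("overwrite").parquet("/data/warehouse/events.parquet")

    # Reading Parquet back preserves the schema with no re-inference needed.
    spark.read.parquet("/data/warehouse/events.parquet").printSchema()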

Data Ingestion & Data Streaming
You can ingest data in batches or stream it in real time. Both processes allow you to collect, load, transfer, integrate and process data from a wide range of data sources.


Data Ingestion

Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. To ingest data is to take it in, typically in batches or chunks.


Data Streaming
Data streaming means collecting and processing data continuously, in real time, as it is generated. (Note that Hadoop also ships a separate utility called Hadoop Streaming, which uses UNIX standard streams as the interface between Hadoop and a program written in any language that can read standard input and write standard output; despite the name, it is a MapReduce tool rather than a real-time streaming tool.)


Data Ingestion/Streaming Tools 

Apache Sqoop, 
Apache Flume, 
Apache Storm, 
Spark Streaming, 
Apache Kafka (LinkedIn's general-purpose messaging system), 
Amazon Kinesis, 
Apache Samza, 
Cloudera Morphlines, 
White Elephant, 
Apache Chukwa, 
DataTorrent, 
Gobblin, 
Syncsort, 
Wavefront, 
Fluentd, 
Scribe, 
Heka, 
Databus, 
Apache NiFi, 
Talend etc. 

How to Choose a Data Ingestion/Streaming Tool?
A data ingestion or streaming tool/framework should be chosen based on parameters like:

Data Velocity: the speed at which data is generated from different sources like machines, networks, servers, human interaction (clicks and messaging), media sites and social media.
Movement of Data: whether the data arrives as massive one-time loads or as a continuous flow.
Data Size: megabytes, gigabytes, terabytes, petabytes, exabytes or zettabytes.
Data Frequency: batch or real-time.
Data Format: structured, semi-structured or unstructured.
Network Bandwidth: the bandwidth available across heterogeneous technologies and systems.

Although many tools and frameworks are available in both open source and licensed versions, Apache Flume, Apache Sqoop, Apache Storm and Spark Streaming are the most widely used in big data analysis. These tools are otherwise called Hadoop ETL (Extract, Transform, Load) tools.

Batch Processing Tools

Apache Sqoop: 
  • Data ingestion tool for structured data sources like Oracle DBMS, PostgreSQL (an object-relational DBMS, or ORDBMS), MySQL, IBM Informix and DB2
  • It transfers data from SQL to Hadoop and from Hadoop back to SQL
  • Useful for batch loads
  • Supports a Command Line Interface (CLI), as the sketch below illustrates
  • Supports data imports into HDFS, Hive, HBase, HCatalog and Accumulo
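As a rough, hedged sketch (not from the original article), the Python snippet below drives a Sqoop import through its CLI; the JDBC URL, credentials, table name and HDFS paths are all placeholders to replace with your own:

    import subprocess

    # Hypothetical connection details -- substitute your own database and cluster paths.
    sqoop_cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com:3306/sales",  # placeholder JDBC URL
        "--username", "etl_user",                  # placeholder database user
        "--password-file", "/user/etl/.sqoop_pw",  # keeps the password off the command line
        "--table", "orders",                       # source table to pull
        "--target-dir", "/data/raw/orders",        # HDFS directory to land the rows in
        "--num-mappers", "4",                      # parallel map tasks for the transfer
    ]

    # Run the import and raise if Sqoop exits with a non-zero status.
    subprocess.run(sqoop_cmd, check=True)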

Spark Streaming:
  • Built on Spark's batch engine, so it is also useful for batch processing
  • It can also perform micro-batching, slicing a live stream into small batches (see the sketch below)
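As a hedged illustration (not from the original article), the classic PySpark Streaming word count below micro-batches a TCP text stream into two-second intervals; the host and port are placeholders (you can feed it locally with "nc -lk 9999"):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="microbatch-demo")
    ssc = StreamingContext(sc, 2)  # each micro-batch covers 2 seconds of the stream

    # Placeholder source: a plain text stream on localhost:9999.
    lines = ssc.socketTextStream("localhost", 9999)

    # Count words within each micro-batch.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.pprint()          # print the first few counts of every batch
    ssc.start()              # begin consuming the stream
    ssc.awaitTermination()   # run until explicitly stopped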


Stream Processing Tools
Apache Flume:
  • Data ingestion tool for unstructured and streamed data sources that generate data continuously in a Hadoop environment
  • Flume helps to collect data from a variety of sources like logs, Java Message Service (JMS) queues, directories etc. Multiple Flume agents can be configured to collect high volumes of data (a sample agent configuration follows this list).
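As a hedged example (not from the original article), a minimal Flume agent that tails a log file into HDFS might be configured roughly like this; the agent name, log path and HDFS path are placeholders:

    # Name the source, channel and sink of an agent called a1.
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: tail a placeholder application log.
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log

    # Channel: buffer events in memory between the source and the sink.
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Sink: write events into a placeholder HDFS directory.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /data/logs/app
    a1.sinks.k1.hdfs.fileType = DataStream

    # Wire the pieces together.
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1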


Apache Storm:
  • Handles large volumes of high-velocity data like clicks on the web, stream data and dashboards
  • It can also do micro-batching using its Trident API


To know more about data ingestion and streaming, click here: 




Stay Tuned for Further Episodes. Please share this article in your circles. 





