Hadoop Data Format, Ingestion, Streaming - PART 3

To read the previous related articles:


Hadoop Eco System Installation - PART 1 



Hadoop Introduction & Terminologies - PART 2



Data Formats
A data format is a way of representing and storing raw data on secondary storage devices. Each file format has its own pros and cons depending on the business context and use case. Common options range from plain text files (CSV, TSV, XML or JSON) and binary files to rich file formats like Avro, ORC and Parquet (a short example follows the link below).

To know more about data formats, click here: https://techmagie.wordpress.com/category/big-data/data-formats/
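As a quick, hedged illustration (not part of the original article), the PySpark snippet below reads a plain CSV file and rewrites it in the columnar Parquet format; the file paths are placeholders:

    from pyspark.sql import SparkSession

    # Start a local Spark session (assumes PySpark is installed).
    spark = SparkSession.builder.appName("format-demo").getOrCreate()

    # Read a plain-text CSV file, inferring column types from the data.
    df = spark.read.csv("/data/raw/events.csv", header=True, inferSchema=True)

    # Rewrite the same data as Parquet: columnar, compressed and splittable.
    df.write.mode("overwrite").parquet("/data/warehouse/events.parquet")

    # Reading Parquet back preserves the schema with no re-inference needed.
    spark.read.parquet("/data/warehouse/events.parquet").printSchema()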

Data Ingestion & Data Streaming
You can ingest data in batches or stream it in real time. Both processes allow you to collect, load, transfer, integrate and process data from a wide range of data sources.


Data Ingestion

Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. To ingest data is to take it in, typically in batches or chunks.


Data Streaming
Data streaming means collecting and processing data continuously, in real time, as it is generated. (Note that Hadoop also ships a separate utility called Hadoop Streaming, which uses UNIX standard streams as the interface between Hadoop and a program written in any language that can read standard input and write standard output; despite the name, it is a MapReduce tool rather than a real-time streaming tool.)


Data Ingestion/Streaming Tools 

Apache Sqoop, 
Apache Flume, 
Apache Storm, 
Spark Streaming, 
Apache Kafka (LinkedIn's general-purpose messaging system), 
Amazon Kinesis, 
Apache Samza, 
Cloudera Morphlines, 
White Elephant, 
Apache Chukwa, 
DataTorrent, 
Gobblin, 
Syncsort, 
Wavefront, 
Fluentd, 
Scribe, 
Heka, 
Databus, 
Apache NiFi, 
Talend etc. 

How to Choose a Data Ingestion/Streaming Tool?
A data ingestion or streaming tool/framework should be chosen based on parameters like:

Data Velocity: the speed at which data is generated from different sources like machines, networks, servers, human interaction (clicks and messaging), media sites and social media.
Movement of Data: whether the data arrives as massive one-time loads or as a continuous flow.
Data Size: megabytes, gigabytes, terabytes, petabytes, exabytes or zettabytes.
Data Frequency: batch or real-time.
Data Format: structured, semi-structured or unstructured.
Network Bandwidth: the bandwidth available across heterogeneous technologies and systems.

Although many tools and frameworks are available in both open source and licensed versions, Apache Flume, Apache Sqoop, Apache Storm and Spark Streaming are the most widely used in big data analysis. These tools are otherwise called Hadoop ETL (Extract, Transform, Load) tools.

Batch Processing Tools

Apache Sqoop: 
  • Data ingestion tool for structured data sources like Oracle DBMS, PostgreSQL (an object-relational DBMS, or ORDBMS), MySQL, IBM Informix and DB2
  • It transfers data from SQL to Hadoop and from Hadoop back to SQL
  • Useful for batch loads
  • Supports a Command Line Interface (CLI), as the sketch below illustrates
  • Supports data imports into HDFS, Hive, HBase, HCatalog and Accumulo
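As a rough, hedged sketch (not from the original article), the Python snippet below drives a Sqoop import through its CLI; the JDBC URL, credentials, table name and HDFS paths are all placeholders to replace with your own:

    import subprocess

    # Hypothetical connection details -- substitute your own database and cluster paths.
    sqoop_cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com:3306/sales",  # placeholder JDBC URL
        "--username", "etl_user",                  # placeholder database user
        "--password-file", "/user/etl/.sqoop_pw",  # keeps the password off the command line
        "--table", "orders",                       # source table to pull
        "--target-dir", "/data/raw/orders",        # HDFS directory to land the rows in
        "--num-mappers", "4",                      # parallel map tasks for the transfer
    ]

    # Run the import and raise if Sqoop exits with a non-zero status.
    subprocess.run(sqoop_cmd, check=True)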

Spark Streaming:
  • Built on Spark's batch engine, so it is also useful for batch processing
  • It can also perform micro-batching, slicing a live stream into small batches (see the sketch below)
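As a hedged illustration (not from the original article), the classic PySpark Streaming word count below micro-batches a TCP text stream into two-second intervals; the host and port are placeholders (you can feed it locally with "nc -lk 9999"):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="microbatch-demo")
    ssc = StreamingContext(sc, 2)  # each micro-batch covers 2 seconds of the stream

    # Placeholder source: a plain text stream on localhost:9999.
    lines = ssc.socketTextStream("localhost", 9999)

    # Count words within each micro-batch.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.pprint()          # print the first few counts of every batch
    ssc.start()              # begin consuming the stream
    ssc.awaitTermination()   # run until explicitly stopped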


Stream Processing Tools
Apache Flume:
  • Data ingestion tool for unstructured and streamed data sources that generate data continuously in a Hadoop environment
  • Flume helps to collect data from a variety of sources like logs, Java Message Service (JMS) queues, directories etc. Multiple Flume agents can be configured to collect high volumes of data (a sample agent configuration follows this list).
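As a hedged example (not from the original article), a minimal Flume agent that tails a log file into HDFS might be configured roughly like this; the agent name, log path and HDFS path are placeholders:

    # Name the source, channel and sink of an agent called a1.
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: tail a placeholder application log.
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log

    # Channel: buffer events in memory between the source and the sink.
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # Sink: write events into a placeholder HDFS directory.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = /data/logs/app
    a1.sinks.k1.hdfs.fileType = DataStream

    # Wire the pieces together.
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1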


Apache Storm:
  • Handles large volumes of high-velocity data like clicks on the web, stream data and dashboards
  • It can also do micro-batching using its Trident API


To know more about data ingestion and streaming, click here: 




Stay Tuned for Further Episodes. Please share this article in your circles. 





