Hadoop Data Formats, Ingestion & Streaming - PART 3
To read the previous articles in this series:
Hadoop Eco System Installation - PART 1
Hadoop Introduction & Terminologies - PART 2
Data Formats
Data Formats: The way raw data is represented and stored on secondary storage devices.
Each file format has its own pros and cons depending on the business context and use case.
Common options include plain text files (CSV, TSV, XML or JSON), binary files, and rich file formats like Avro, ORC and Parquet.
To know more about data formats, click here: https://techmagie.wordpress.com/category/big-data/data-formats/
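To make the trade-offs concrete, here is a minimal Python sketch (assuming pandas with the pyarrow engine installed; file names are illustrative) that writes the same records as plain text CSV and as columnar Parquet:

import pandas as pd

# The same small dataset stored as a plain text format (CSV) and as a
# rich columnar format (Parquet). File names are illustrative.
df = pd.DataFrame({"id": [1, 2, 3], "event": ["click", "view", "click"]})

df.to_csv("events.csv", index=False)  # human-readable, row-oriented text
df.to_parquet("events.parquet")       # binary, columnar, compressed

print(pd.read_parquet("events.parquet"))  # round-trip check

Plain text is easy to inspect and exchange; columnar formats like Parquet trade readability for compression and fast column-level scans.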
Data Ingestion & Data Streaming
Data can be ingested in batches or streamed in real time. Both processes allow you to collect, load, transfer, integrate and process data from a wide range of data sources.
Data Ingestion
Data ingestion is the process of obtaining and importing data for immediate use or for storage in a database; the data is typically taken in batches or discrete chunks.
Data Streaming
Data streaming is the continuous transfer of data from its sources so that records can be processed as they arrive, rather than being accumulated into batches first.
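As a toy illustration of the difference (plain Python, not tied to any Hadoop tool): a batch reader returns only after the whole input is consumed, while a stream reader hands each record downstream as soon as it arrives.

# Toy sketch contrasting batch ingestion with streaming (illustrative only).

def batch_ingest(path):
    """Batch: read the whole dataset first, then process it in one go."""
    with open(path) as f:
        records = f.readlines()
    return [r.strip().upper() for r in records]

def stream_ingest(path):
    """Stream: yield each record downstream as soon as it is read."""
    with open(path) as f:
        for line in f:
            yield line.strip().upper()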
Data Ingestion/Streaming Tools
Apache Sqoop,
Apache Flume,
Apache Storm,
Spark Streaming,
Apache Kafka (originally LinkedIn's general-purpose messaging system),
Amazon Kinesis,
Apache Samza,
Cloudera Morphlines,
White Elephant,
Apache Chukwa,
DataTorrent,
Gobblin,
Syncsort,
Wavefront,
Fluentd,
Scribe,
Heka,
Databus,
Apache NiFi,
Talend etc.
Data ingestion or streaming tools/frameworks should be chosen based on parameters such as:
Data Velocity: the speed at which data is generated from different sources like machines, networks, servers, human interaction (clicks and messaging), media sites and social media
Movement of Data: whether data arrives as massive bulk loads or as a continuous flow
Data Size: megabytes through gigabytes, terabytes, petabytes, exabytes and zettabytes
Data Frequency: batch or real-time
Data Format: structured, semi-structured or unstructured
Network Bandwidth: the bandwidth available across heterogeneous technologies and systems
Although many tools and frameworks are available in both open-source and licensed versions, Apache Flume, Apache Sqoop, Apache Storm and Spark Streaming are the most widely used in big data analysis. These tools are otherwise called Hadoop ETL (Extract, Transform, Load) tools.
Batch Processing Tools
Apache Sqoop:
- Data ingestion tool for structured data sources like Oracle DBMS, PostgreSQL (an object-relational DBMS, ORDBMS), MySQL, IBM Informix and DB2
- It transfers data from SQL to Hadoop and from Hadoop back to SQL
- Useful for batch loads
- Supports a Command Line Interface (CLI); a minimal import sketch follows this list
- Supports importing data into HDFS, Hive, HBase, HCatalog and Accumulo
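As a rough sketch, a Sqoop import can be launched from Python by shelling out to the CLI. Everything below (host, database, table, paths) is hypothetical; it assumes Sqoop is installed on a Hadoop client node with the relevant JDBC driver available.

import subprocess

# Hypothetical Sqoop import: pull the "orders" table from MySQL into HDFS.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",  # source database (assumed)
    "--username", "etl_user",
    "--password-file", "/user/etl/.sqoop-pass",     # safer than a plain-text --password
    "--table", "orders",                            # table to import
    "--target-dir", "/data/raw/orders",             # HDFS destination directory
    "--num-mappers", "4",                           # parallel map tasks
]
subprocess.run(sqoop_import, check=True)

Each mapper pulls a slice of the table in parallel, which is what makes Sqoop effective for bulk batch loads.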
Spark Streaming:
- Useful for batch processing
- It can also perform micro-batching (a minimal sketch follows)
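A minimal micro-batching sketch using PySpark's classic DStream API, assuming a local Spark installation and a text source on localhost port 9999 (e.g. started with nc -lk 9999):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchSketch")
ssc = StreamingContext(sc, batchDuration=5)  # group the stream into 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
word_counts = (lines.flatMap(lambda line: line.split(" "))
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
word_counts.pprint()  # print each micro-batch's word counts

ssc.start()
ssc.awaitTermination()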
Stream Processing Tools
Apache Flume:
- Data ingestion tool for unstructured and streamed data sources that generate data continuously in a Hadoop environment
- Flume helps collect data from a variety of sources like logs, Java Message Service (JMS), directories etc. Multiple Flume agents can be configured to collect high volumes of data; a configuration sketch follows.
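Flume agents are wired together in a properties file. The Python sketch below writes a minimal configuration (agent name, log path and HDFS destination are hypothetical) that tails an application log into HDFS through an in-memory channel:

# Write a minimal Flume agent configuration (properties format).
flume_conf = """\
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/logs/app
a1.sinks.k1.channel = c1
"""

with open("flume-agent.conf", "w") as f:
    f.write(flume_conf)

# The agent would then be started with:
#   flume-ng agent --conf conf --conf-file flume-agent.conf --name a1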
Apache Storm:
- Handles large volumes of high-velocity data like clicks on the web, streaming data and dashboard feeds
- It can also do micro-batching using Trident
To know more about data ingestion and streaming, click here:
Stay Tuned for Further Episodes. Please share this article in your circles.