In this module, you will understand what big data means, why traditional systems are limited in their ability to handle big data, and how the Hadoop ecosystem helps solve this problem. You will learn about the various components of the Hadoop ecosystem and their roles.
Characteristics of big data
Traditional data management systems and their limitations
After this module you will be able to install, set up, and configure a Hadoop cluster, and use the basic Hadoop shell commands. You will also learn about HDFS, Hadoop's distributed file storage system: why it is used, how it differs from traditional file systems, and how files are read from and written to it. You will work hands-on to implement what is taught in this module.
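As a taste of the shell commands covered here, the sketch below shows a typical round trip of copying a file into HDFS and reading it back. It assumes a running Hadoop cluster; the paths and file names are illustrative, not part of the course material.

```shell
# List the root of HDFS (requires a running cluster)
hdfs dfs -ls /

# Create a directory and copy a local file into it
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put localfile.txt /user/hadoop/input/

# Read the file back from HDFS, or copy it to the local file system
hdfs dfs -cat /user/hadoop/input/localfile.txt
hdfs dfs -get /user/hadoop/input/localfile.txt copy.txt
```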
In this module, you will understand the MapReduce framework, how it works on top of HDFS, and learn the basics of MapReduce programming and data flow. (Basic Java knowledge is required for the MapReduce modules.)
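The MapReduce data flow (map, then shuffle/sort, then reduce) can be imitated locally with ordinary shell tools. The word-count pipeline below is only an illustration of the data flow, not how a real Hadoop job is written:

```shell
# Map: emit a (word, 1) pair per word.
# Shuffle: sort brings equal keys together.
# Reduce: sum the counts for each key.
printf 'apple banana\napple\n' |
  tr -s ' ' '\n' |
  awk '{print $0 "\t1"}' |
  sort |
  awk -F '\t' '{count[$1] += $2} END {for (w in count) print w "\t" count[w]}' |
  sort
```

For the sample input this prints `apple` with count 2 and `banana` with count 1, one word per line.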
Hive is data warehouse software for managing and querying large-scale datasets. It uses a SQL-like language, HiveQL, to query the data. Pig is a platform for analysing large data sets through a high-level language. In this module you will learn to use both tools to query and analyse large amounts of data stored in distributed storage systems.
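To contrast the two styles, the sketch below runs the same aggregation in HiveQL and in Pig Latin. The table name `page_views` and its columns are hypothetical, and both commands assume a configured Hadoop cluster:

```shell
# HiveQL: SQL-like aggregation over a hypothetical page_views table
hive -e "SELECT user_id, COUNT(*) AS visits FROM page_views GROUP BY user_id;"

# Pig Latin: the same aggregation expressed as a data-flow script
pig -e "views = LOAD 'page_views' AS (user_id:chararray, url:chararray);
        grouped = GROUP views BY user_id;
        counts = FOREACH grouped GENERATE group, COUNT(views);
        DUMP counts;"
```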
HBase is a column-oriented database management system that runs on top of HDFS. ZooKeeper is a centralised coordination service that HBase depends on. Sqoop is a tool designed to transfer data between Hadoop and relational databases. Learn the basics of HBase, ZooKeeper, and Sqoop in this module.
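As a sketch of these two tools in action, the commands below import a relational table into HDFS with Sqoop and store a value in HBase. The database host, credentials, and table names are hypothetical, and both commands assume a running cluster:

```shell
# Sqoop: import a hypothetical 'orders' table from MySQL into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user \
  --table orders \
  --target-dir /user/hadoop/orders \
  --num-mappers 4

# HBase shell: create a table with one column family, then write and read a cell
echo "create 'users', 'info'
put 'users', 'user1', 'info:name', 'alice'
get 'users', 'user1'
exit" | hbase shell
```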
Oozie is a workflow scheduler system for managing Hadoop jobs. Flume is a service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS. YARN is the second generation of MapReduce, while HDFS Federation is the second generation of HDFS. Learn the basics of Oozie, Flume, and Hadoop 2.2 in this module.
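For a flavour of how these services are driven from the command line, the sketch below submits an Oozie workflow and starts a Flume agent. The host name, configuration files, and agent name are illustrative, and both commands assume the services are installed and running:

```shell
# Oozie: submit and run a workflow whose job.properties points at a
# workflow.xml already uploaded to HDFS
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# Flume: start an agent named 'a1' defined in a local flume.conf
flume-ng agent --name a1 --conf ./conf --conf-file flume.conf
```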