Course Outline

  • Introduction
    • History and core concepts of Hadoop
    • The Hadoop ecosystem
    • Various distributions
    • High-level architecture overview
    • Common Hadoop myths
    • Hadoop challenges (hardware and software)
    • Labs: Discuss your Big Data projects and challenges
  • Planning and installation
    • Selecting software and Hadoop distributions
    • Sizing the cluster and planning for future growth
    • Selecting appropriate hardware and network infrastructure
    • Rack topology design
    • Installation procedures
    • Multi-tenancy considerations
    • Directory structure and log management
    • Benchmarking performance
    • Labs: Perform cluster installation and run performance benchmarks
  • HDFS operations
    • Core concepts (horizontal scaling, replication, data locality, and rack awareness)
    • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
    • Health monitoring protocols
    • Administration via command-line and browser interfaces
    • Adding storage capacity and replacing defective drives
    • Labs: Familiarise yourself with the HDFS command line
  • Data ingestion
    • Using Flume to ingest logs and other data into HDFS
    • Using Sqoop for importing data from SQL databases to HDFS, and exporting back to SQL
    • Implementing Hadoop data warehousing with Hive
    • Copying data between clusters using distcp
    • Leveraging S3 as a complementary solution to HDFS
    • Best practices and architectures for data ingestion
    • Labs: Set up and utilise Flume and Sqoop
  • MapReduce operations and administration
    • Parallel computing before MapReduce: comparing HPC with Hadoop administration
    • Managing MapReduce cluster loads
    • Nodes and daemons (JobTracker, TaskTracker)
    • Walk-through of the MapReduce UI
    • MapReduce configuration options
    • Job configuration specifics
    • Strategies for optimising MapReduce performance
    • Preparing for MapReduce success: guidance for programmers
    • Labs: Execute MapReduce examples
  • YARN: New architecture and capabilities
    • YARN design goals and implementation architecture
    • New actors: ResourceManager, NodeManager, ApplicationMaster
    • Installing YARN
    • Job scheduling within YARN
    • Labs: Investigate job scheduling mechanisms
  • Advanced topics
    • Hardware monitoring techniques
    • Comprehensive cluster monitoring
    • Adding and removing servers, and upgrading Hadoop versions
    • Backup, recovery, and business continuity planning
    • Oozie job workflows
    • Hadoop High Availability (HA)
    • Hadoop Federation
    • Securing your cluster with Kerberos
    • Labs: Set up monitoring systems
  • Optional tracks
    • Cloudera Manager for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are conducted within the Cloudera distribution environment (CDH5).
    • Ambari for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are performed within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0).

Requirements

  • Comfort with basic Linux system administration
  • Basic scripting skills

Prior knowledge of Hadoop and distributed computing is not required; both topics are introduced and explained during the course.

Lab environment

Zero Install: There is no need to install Hadoop software on your personal machines! A functional Hadoop cluster will be provided for use by all students.

Students will require the following tools:

  • An SSH client (Linux and Mac systems have one built in; for Windows, PuTTY is recommended)
  • A browser to access the cluster. We recommend using the Firefox browser with the FoxyProxy extension installed.

Duration: 21 hours
