Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Quick overview of Python and Scala

Core Concepts (Theory):

  • System Architecture
  • Resilient Distributed Datasets (RDDs)
  • Transformations and Actions
  • Stages, Tasks, and Dependencies

Foundational Concepts via Databricks (Hands-on Workshop):

  • RDD API exercises
  • Essential action and transformation functions
  • PairRDDs
  • Join operations
  • Caching strategies
  • DataFrame API exercises
  • SparkSQL
  • DataFrame operations: select, filter, group, sort
  • User-Defined Functions (UDFs)
  • Introduction to the Dataset API
  • Streaming

Deployment on AWS Environment (Hands-on Workshop):

  • AWS Glue basics
  • Differences between AWS EMR and AWS Glue
  • Sample jobs on both platforms
  • Advantages and disadvantages of each

Additional Topics:

  • Introduction to Apache Airflow for orchestration

Requirements

  • Programming proficiency (preferably in Python or Scala)
  • Basic SQL knowledge

Duration: 21 Hours
