Full Stack Data Engineering (Azure + Databricks + AWS) | Self Paced | With Career Support
About Course
This is a complete Data Engineering training course built around Big Data with Hadoop and Spark. The course focuses on various aspects of Big Data frameworks like Hadoop and Spark. We will be learning about many tools in the Hadoop ecosystem, such as Hive, Sqoop, Flume, Spark, and Kafka.
Course Content:
- Azure Data Engineering
- AWS Data Engineering
- Databricks Data Engineering
- 3 End to End Projects
- SparkStreaming
- Python Programming
- Apache Hadoop
- Apache Hive
- PySpark 500 Hands On Exercises
- SparkSQL
- Kafka
- NoSQL
What Will You Learn?
- Job interview preparation
- Covers most of the content for the "Databricks Certified Developer For Apache Spark 3.0" certification
- In-depth understanding of Hadoop ecosystem components.
- Resume support.
- Enhanced understanding through hands-on exercises.
Course Content
M1 – Course Introduction
-
Roles and Responsibilities of a Data Engineer | Demo 1
01:10:20 -
Introduction to Big data | Demo 2
01:16:10 -
Hands On | Distributed Storage | Distributed Processing | Introduction | Demo 3
01:16:37 -
Case Study to understand Data Engineering Domain
44:14 -
Your First 5 Years as a Data Engineer
16:13
M2 – Hadoop Ecosystem (HISTORY LESSONS)
To understand how data engineering practices have evolved, you may review the following legacy sessions.
For a modern, industry-aligned learning path, I recommend the sequence below:
SQL → Python → PySpark → PySpark Projects
Before beginning this path, I also suggest covering Hadoop fundamentals up to the YARN architecture, as it provides helpful context for distributed processing.
This sequence will give you a strong foundation for the upcoming modules.
The legacy sessions are included for those working with older systems who may still find them useful.
-
WARNING IMPORTANT !!!!!!!!!
03:07 -
Hadoop Session 1 – Introduction to Hadoop
01:06:10 -
Hadoop Session 2 – Hadoop Components and Daemons
01:30:57 -
Hadoop Session 3 – File Blocks | Replication | Rack Awareness
01:12:31 -
Hadoop Session 4 – Rack Awareness | HA and Federation
01:19:56 -
Hadoop Session 5 – YARN Architecture – I
01:26:57 -
Hadoop Session 6 – YARN Architecture Doubts and Seminars
01:14:47 -
Hadoop Session 7 – Necessary Setups and Discussion
01:30:26 -
Hadoop Session 8 – MR workflow [Deprecated]
01:27:24 -
Hadoop Session 9 – YARN and MR QnA and Revision | Safe Mode | Load Balancer
01:07:57 -
Hadoop Session 10 – MR | File Blocks vs Input Splits [Deprecated]
02:13:29 -
Hadoop Session 11 – MR Workflow Revision | WordCount Hands On [Deprecated]
02:14:08 -
Hadoop Session 12 – Combiner and Partitioner [Deprecated]
02:39:59 -
Hadoop Session 13 – Reduce side join 1 [Deprecated]
01:32:32 -
Hadoop Session 14 – Reduce side join 2 [Deprecated]
01:32:10 -
Hadoop Session 15 – Map Side Join [Deprecated]
01:36:28 -
[Optional] Setting up a single-node Hadoop cluster
00:00 -
AWS EMR Session 1 – EC2 introduction
23:37 -
AWS EMR Session 2 – IAM Roles
24:27 -
AWS EMR Session 3 – Starting EMR Cluster | Connecting to EMR from your System
58:26 -
AWS EMR Session 4 – Accessing EMR Web UIs
58:26 -
Hive Session 1 – Hive Introduction and Architecture
01:25:03 -
Hive Session 2 – Hive Basic Commands
01:27:23 -
Hive Session 3 – Internal vs External Tables
58:35 -
Hive Session 4 – Design level optimizations (Partitioning)
01:06:31 -
Hive Session 5 – Design level optimizations (Bucketing) | Logical Joins
01:15:13 -
Hive Session 6 – Bucketing Scenarios | Hive SerDe
01:06:59 -
Hive Session 7 – SerDe Correction
02:55 -
Hive Session 8 – Join strategies in Hive | MR
01:13:03 -
Hive Session 9 – Hive Project
22:29 -
Hive Session 10 – Join Optimizations
01:34:31 -
Hive Session 11 – Hive Transactional Tables
01:19:36 -
Hive Session 12 – Hive Transactional Tables Materialized
01:15:29 -
Hive Session 13 – CBO | Vectorization | Resource level optimization | Materialized Views
01:44:59 -
Hive Session 14 – Vectorization in Hive
12:48 -
Hive Session 15 – MSCK repair
06:09 -
Hive Session 16 – UDF, UDAF, UDTF
01:30:37 -
Hive Exercise Documents and Data Files
00:00 -
Hive Notes
00:00 -
Sqoop Session 1 – Sqoop Introduction
01:19:28 -
Sqoop Session 2 – Sqoop Incremental Import
01:21:12 -
Flume Session 1 – Flume Introduction
01:39:31 -
Flume Session 2 – Flume Configuration
02:11:07
M3 – Python for PySpark
-
Python | Pre requisites
03:58 -
Session 1 | Setup and Cloning repo
01:03:20 -
Session 2 | Python | Variables and Printing
53:55 -
Session 3 | Python | Formatting and Escape Sequences
29:21 -
Session 4 | Python | Decision making | if elif else
50:10 -
Session 5 | Python | List Tuple Set Dictionaries
51:53 -
Session 6 | Python | Loops
46:29 -
Session 7 | Python | Gaming using loops
33:51 -
Session 8 | Python | List comprehension | String slicing
45:56 -
Session 9 | Python | Slicing and step
45:57 -
Session 10 | String Manipulation
51:57 -
Session 11 | String Manipulation 2
01:14:27 -
Session 12 | list comprehension
56:52 -
Session 13 | Functions Introduction
01:05:46 -
Session 14 | Functions 2
01:09:21 -
Session 15 | lambda functions
01:00:29 -
Session 16 | File IO
01:08:12 -
Session 17 | Modules | Exception handling
01:08:12 -
Session 18 | Regular Expressions
01:25:24 -
Session 19 | RE | OOP Introduction
01:25:24 -
Session 20 | Class Instance Attributes
53:10 -
Session 21 | Class Methods vs Instance Methods vs Static Methods
01:04:13 -
Session 22 | Dunder methods | Operator overloading
01:04:13 -
Session 23 | Property Decorator
54:43 -
Session 24 | Encapsulation and Private attributes
54:39 -
Session 25 | Abstraction and Spark Introduction
01:47:44
M4 – PySpark Essentials For Data Engineering
-
[Overview] Spark Session 1 – Introduction
16:37 -
[ClassRec] Spark Session 1 – Introduction
37:34 -
[Overview] Spark Session 2 – Spark Cluster vs Application Architecture
12:07 -
[ClassRec] Spark Session 2 – Spark Cluster vs Application Architecture
47:42 -
[Overview] Spark Session 3 – RDD Terminologies and Features
47:20 -
[ClassRec] Spark Session 3 – RDD Terminologies and Features
52:01 -
[Overview] Spark Session 4 – App vs Job vs Stage vs Task
33:41 -
[ClassRec] Spark Session 4 – App vs Job vs Stage vs Task
49:23 -
[Overview] Spark Session 5 – Spark Cluster vs Client Mode
12:47 -
[ClassRec] Spark Session 5 – Spark Cluster vs Client Mode
35:05 -
[Overview] Spark Session 6 – Spark Architecture
39:24 -
[ClassRec] Spark Session 6 – Spark Architecture
48:59 -
[Overview] Spark Session 7 – Spark Distributed Shared Variables
43:39 -
[ClassRec] Spark Session 7 – Spark Distributed Shared Variables
36:21 -
[Overview] Spark Session 8 – SparkSQL Introduction – RDD vs DF vs DS
40:58 -
[ClassRec] Spark Session 8 – SparkSQL Introduction – RDD vs DF vs DS
45:47 -
[Overview] Spark Session 9 – Spark Catalyst Optimizer
26:01 -
[ClassRec] Spark Session 9 – Spark Catalyst Optimizer
10:06 -
[Overview] Spark Session 10 – SparkContext vs SparkSession
39:32 -
[ClassRec] Spark Session 10 – SparkContext vs SparkSession
46:56 -
[Installation] Spark 3.5 Installation
20:35 -
[Overview] Spark Session 11 – Setup For Exercises
12:55 -
[ClassRec] Spark Session 12 : Ways to Create RDDs
00:00 -
[ClassRec] Spark Session 13 – RDD Creations Practice and Good Practices
16:34 -
[Overview] Spark Session 14 – map, mapPartitions, mapPartitionsWithIndex, glom
35:56 -
[ClassRec] Spark Session 14 – map, mapPartitions, mapPartitionsWithIndex, glom
27:08 -
[ClassRec] Spark Session 15 – map vs flatMap
13:11 -
[ClassRec] Spark Session 16 – groupByKey vs reduceByKey
40:50 -
[Overview] Spark Session 17 – Creating DFs from CSV files
38:11 -
[ClassRec] Spark Session 17 – Creating DFs from CSV files
27:18 -
[Overview] Spark Session 18 – Creating DF From JSON and XML Files
07:08 -
[ClassRec] Spark Session 18 – Creating DF from JSON, nested JSON, MultiChar and Custom Delimiter
22:15 -
[Overview] Spark Session 19 – Creating DFs from Binary files
11:42 -
[ClassRec] Spark Session 19 – Creating DFs from Binary files
33:36 -
[Overview] Spark Session 20 – Referring Columns, select, selectExpr, filter
15:50 -
[ClassRec] Spark Session 20 – Referring Columns, select, selectExpr, filter
17:28 -
[ClassRec] Spark Session 21 – sort / orderBy
17:38 -
[Overview] Spark Session 22 – groupBy and Aggregations
21:07 -
[ClassRec] Spark Session 22 – groupBy and Aggregations
19:23 -
[Overview] Spark Session 23 – Joins (inner, outer, left, right, left semi, left anti, cross, self)
18:58 -
[ClassRec] Spark Session 23 – Joins (inner, outer, left, right, left semi, left anti, cross, self)
39:20 -
[ClassRec] Spark Session 24 – Joins Revision
20:15 -
[Overview] Spark Session 25 – Window Functions | Ranking Functions
27:34 -
[ClassRec] Spark Session 25 – Window Functions | Ranking Functions
32:13 -
[Overview] Spark Session 26 – Window Analytical and Aggregate Functions
29:07 -
[ClassRec] Spark Session 26 – Window Aggregate Functions
21:56 -
[ClassRec] Spark Session 27 – Window Analytical Functions
17:10 -
[Overview] Spark Session 28 – Dealing With NULL Values
16:17 -
[Overview] Spark Session 29 – Dealing With Duplicate Records
05:18 -
[Overview] Spark Session 30 – Pivot and UnPivot
32:15 -
[Overview] Spark Session 31 – UDFs in PySpark
23:39
M5 – Spark Advanced | Optimization Techniques | Scenarios
-
Executor Memory Architecture
01:28:19 -
AQE | Cache VS Persist
01:48:42 -
Cache Doubts | SerDe
53:09 -
Resource Calculations for Spark Application | DRA
01:48:42 -
Coalesce vs Repartition | Dealing with Data Skew
35:02 -
Garbage Collection Tuning
01:12:20 -
Join Strategies 1 : Broadcast Join
35:02 -
Join Strategies | Broadcast | Shuffle Hash | Sort Merge
01:10:32
Industry Level PySpark | Scenarios and Databricks Certification Practice
-
Session : 1
01:07:53 -
Session : 2
01:02:28 -
Session : 3
01:16:19 -
Session : 4
51:23 -
Session : 5
47:50 -
Session : 6
35:49 -
Session : 7
40:39 -
Session : 8
29:37 -
Practice Exercises and Solutions Document
00:00
M6 – Kafka Essentials For Data Engineering
-
Kafka Installation
18:04 -
Session 1
48:44 -
Session 2
39:55 -
Session 3
15:00 -
Session 4
19:09 -
Session 5
01:04:18 -
Flume to Kafka
36:44
M7 – Industry Level PySpark | Spark Streaming
-
Spark streaming – I
47:14 -
Spark streaming – II
01:48:41 -
Spark Streaming – III
01:49:49
M8 – Data Modelling Essentials
-
Sessions will be updated as soon as they’re done in the Live Batch
M9 – Data Engineering Using Databricks
-
Sessions Will Be Added As Soon As They’re Covered in Live batch
M10 – Azure Data Engineering Complete Course
-
Sessions will be updated as soon as they’re covered in Live Class
M11 – AWS Data Engineering Complete Course
-
Big Data With AWS || Demo Session
36:43 -
AWS IAM User, Group and Policies
01:26:35 -
Big Data With AWS | IAM Roles
01:23:47 -
AWS Cloud Infrastructure
31:39 -
Big Data With AWS | S3 Session 1
01:40:25 -
S3 Session 2
01:00:49 -
S3 Session 3
01:00:45 -
AWS Glue | Crawlers And Jobs
01:23:58 -
Glue Scenarios | Glue Workflows
01:24:01 -
Glue Scenarios
01:15:03 -
EMR basics EC2 introduction
23:36 -
EMR basics IAM Role
24:26 -
Starting EMR cluster | Connecting to EMR Cluster
58:25 -
AWS EMR | Starting and Deploying Spark Application on EMR
58:04 -
AWS EMR | Cluster Mode Deployment | Accessing Web UIs | Steps Introduction
58:48 -
BDA | Deploying Spark Application in Cluster Mode on EMR
11:15 -
Deploying Spark Application using Steps on AWS EMR
01:02:03 -
Athena Basics
01:23:48 -
Athena on Command Line
59:43 -
Using Athena through Python code
41:27 -
Redshift And Data Warehousing Introduction
41:34 -
Redshift Clusters | Snapshots | S3 Copy
59:16 -
Creating Redshift cluster and making it publicly accessible
13:58 -
Redshift connect using SQL Workbench and Python script
38:55 -
DIST keys and SORT keys in Redshift
24:30 -
DIST Keys hands on
30:38 -
Redshift Federated Queries
40:58 -
What is Streaming Data | Streaming Data Terminologies
23:24 -
Kinesis Data Streams | Kinesis Architecture and terminologies
40:05 -
Streaming Data using Console Producer | Python Producer and Python Consumer
39:48
M12 – MongoDB NoSQL For Data Engineering
-
Introduction to NoSQL | Use Cases | Types
58:09 -
MongoDB Essential Elements | CRUD Operations
37:06 -
MongoDB Indexing
01:27:02
M13 – Complete Airflow For Data Engineering
-
SL | Airflow | Session 1
-
SL | Airflow | Session 2
-
SL | Airflow | Session 3
-
Coming Soon
M14 – DevOps in DE | Version Control System Essentials
M15 – CI/CD for Data Engineering Pipelines
Course End Projects | Live Projects
-
Sessions will be added as soon as they’re covered in Live Batch
Course Material
-
tg_vm Setup
20:56 -
Presentation Slides 22 july
00:00 -
Presentation Slides 7 Jan
00:00