Data Engineering with AWS, Azure Databricks, Apache Spark 3, Kafka, Hive, Hadoop, Airflow
About Course
Complete Data Engineering Course
Live classroom training for the Complete Data Engineering course with Big Data Hadoop and Spark. The course focuses on Big Data frameworks such as Hadoop and Spark, and covers many tools in the Hadoop ecosystem, including Hive, Sqoop, Flume, Spark, and Kafka.
Course Content:
- Azure Databricks
- AWS EMR, Glue, Lambda, Kinesis
- Databricks Certified Spark 3 Developer
- Spark Streaming
- Python Programming
- Apache Hadoop
- Apache Hive
- PySpark 500 Hands On Exercises
- SparkSQL
- Kafka
- NoSQL: MongoDB
What Will You Learn?
- Job interview preparation
- Covers most of the content for the "Databricks Certified Developer For Apache Spark 3.0" certification
- In-depth understanding of Hadoop ecosystem components
- Resume support
- Enhanced understanding through hands-on exercises
Course Content
Big Data Hadoop and Spark
- Introductory Session (01:04:49)
Module 1 : Introduction to Big Data
- Big Data Fundamentals (01:13:02)
Module 2 : Introduction to Hadoop
- Introduction to Hadoop (01:48:07)
- High Availability and Federation (01:40:17)
Module 3 : HDFS (Hadoop Distributed File System)
- HA Federation Revision | File blocks replication | Rack Awareness (01:38:06)
- [old] Rack awareness, Read/Write Anatomy (01:26:57)
Module 4 : YARN (Yet Another Resource Negotiator)
- YARN Architecture (01:38:05)
- YARN Architecture Failure Domains | MR Workflow (01:07:56)
- YARN and MR QnA and Revision | Safe Mode | Load Balancer (01:07:56)
- HDFS Commands and MR Job Execution (52:59)
Module 5 : Setting up Pseudo Distributed Hadoop Cluster
- Document (PDF File) (00:00)
Module 6 : Map Reduce [Deprecated]
- MapReduce Workflow (01:50:52)
- WordCount MR Program (02:11:12)
- MR and Input Splits (02:12:32)
- Transactions use case (01:38:21)
Module 7 : Advanced MapReduce [Deprecated]
- Combiner and Partitioner (02:39:59)
- Reduce side join 1 (01:32:32)
- Reduce side join 2 (01:32:10)
- Map Side Join (01:36:28)
AWS EMR | Complete Guide for Setting Up and Working with AWS EMR
- EMR basics EC2 introduction (23:37)
- EMR basics IAM Role (24:27)
- Starting EMR Cluster | Connecting to EMR from your System (58:26)
- Accessing EMR web UIs (58:26)
Module 8 : Hive (SQL on top of Hadoop)
- Hive Introduction and Architecture (01:25:03)
- Hive Basic Commands (01:27:23)
- Internal vs External Tables (58:35)
- Design level optimizations (Partitioning) (01:06:31)
- Design level optimizations (Bucketing) | Logical Joins (01:15:13)
- Bucketing Scenarios | Hive SerDe (01:06:59)
- SerDe Correction (02:55)
- Join strategies in Hive | MR (01:13:03)
- Hive Project (22:29)
- Join Optimizations (01:34:31)
- Hive Transactional Tables (01:19:36)
- Hive Transactional Tables Materialized (01:15:29)
- CBO | Vectorization | Resource level optimization | Materialized Views (01:44:59)
- UDF, UDAF, UDTF (01:30:37)
- Hive Exercise Documents and Data Files (00:00)
- Hive Notes (00:00)
- Vectorization in Hive (12:48)
- MSCK repair (06:09)
Module 9 : Sqoop [Deprecated]
- Sqoop Introduction (01:19:28)
- Sqoop Incremental Import (01:21:12)
Module 10 : Flume [Deprecated]
- Flume Introduction (01:39:31)
- Flume Configuration (02:11:07)
PreRequisite – 3 | Python for Pyspark
- Python | Prerequisites (03:58)
- Session 1 | Setup and Cloning repo (01:03:20)
- Session 2 | Python | Variables and Printing (53:55)
- Session 3 | Python | Formatting and Escape Sequences (29:21)
- Session 4 | Python | Decision making | if elif else (50:10)
- Session 5 | Python | List Tuple Set Dictionaries (51:53)
- Session 6 | Python | Loops (46:29)
- Session 7 | Python | Gaming using loops (33:51)
- Session 8 | Python | List comprehension | String slicing (45:56)
- Session 9 | Python | Slicing and step (45:57)
- Session 10 | String Manipulation (51:57)
- Session 11 | String Manipulation 2 (01:14:27)
- Session 12 | List comprehension (56:52)
- Session 13 | Functions Introduction (01:05:46)
- Session 14 | Functions 2 (01:09:21)
- Session 15 | Lambda functions (01:00:29)
- Session 16 | File IO (01:08:12)
- Session 17 | Modules | Exception handling (01:08:12)
- Session 18 | Regular Expressions (01:25:24)
- Session 19 | RE | OOP Introduction (01:25:24)
- Session 20 | Class Instance Attributes (53:10)
- Session 21 | Class Methods vs Instance Methods vs Static Methods (01:04:13)
- Session 22 | Dunder methods | Operator overloading (01:04:13)
- Session 23 | Property Decorator (54:43)
- Session 24 | Encapsulation and Private attributes (54:39)
- Session 25 | Abstraction and Spark Introduction (01:47:44)
Module 11 : Apache Spark
- [base] Spark Introduction (01:25:24)
- [base] Spark Architecture Overview | Driver, Cluster Manager, Executor, RDD (01:14:37)
- [base] RDD Operations | Types of Transformations | DAG (01:14:37)
- [base] Spark Application | Job | Stages | Tasks (01:48:23)
- [base] Spark Architecture End to End (01:32:37)
- [base] Spark Deploy Modes (14:53)
- PySpark Installation on Windows (24:26)
- Spark Revision and Discussion (01:05:49)
- [Current] Spark Introduction (25:41)
- [Current] Spark Architecture | Driver, Executor, CM (01:26:40)
- [Current] Spark Architecture | RDD, Features, Operations (01:28:06)
- [Current] Spark App, Job, Stages, Tasks (01:43:52)
- [Current] Spark Architecture (01:06:16)
Module 12 : Spark Programming Model
- RDD Basics (01:03:08)
- RDD exercises Hands On | PySpark (01:31:10)
- RDD practice | PySpark (01:21:36)
- Broadcast Variables and Accumulators (01:18:05)
- Broadcast Variables | PySpark (46:16)
- Accumulator and SparkSQL Introduction | PySpark (01:08:20)
- [Current] RDD basics (01:16:04)
- [Current] KV pair RDD | groupByKey vs reduceByKey (01:23:43)
- [Current] Broadcast Variables and Accumulators (01:33:33)
Module 13 : Spark SQL
- Creating DataFrames from various Sources (01:10:40)
- DF vs DS, Catalyst Optimizer | PySpark (01:06:16)
- SC vs SS | TempView vs GlobalTempView | PySpark (01:27:34)
- DF Structured Transformations | PySpark (29:54)
- Hands on | Creating DFs | PySpark (43:06)
- SparkSQL – I | Scala (01:29:22)
- SparkSQL – II | Scala (01:46:55)
- SparkSQL – III | Scala (01:32:29)
- Spark Hive Integration | Scala (01:11:46)
- Deploy modes and Resource calculations | Scala (01:34:47)
- [current] SparkSQL Introduction | DF vs DS | Catalyst Optimizer (14:53)
- [current] SparkContext vs SparkSession (01:44:49)
- [current] Executor memory architecture (01:44:49)
pyspark-practice-hands-on-200
- PySpark 200 Setup (01:13:39)
- Creating DFs using textFiles (44:14)
- Creating DF using Binary file Formats (24:55)
- Creating DF doubts (01:03:34)
- Creating DF using MySQL, S3 (03:14)
- Select SelectExpr (35:06)
- Select Filter Intermediate (17:00)
- withColumn and withColumnRenamed (01:25:12)
- sort orderBy (34:25)
- groupBy and aggregate (01:19:08)
- Join Operations (35:59)
- Set operations (35:06)
- Window Operations (01:19:08)
- Data Cleaning (54:26)
- distinct and dropDuplicates (15:26)
- pivot unpivot (37:37)
- UDF (01:19:08)
PySpark DF Scenarios and Databricks Certification Practice
- Session 1 (01:07:53)
- Session 2 (01:02:28)
- Session 3 (01:16:19)
- Session 4 (51:23)
- Session 5 (47:50)
- Session 6 (35:49)
- Session 7 (40:39)
- Session 8 (29:37)
- Practice Exercises and Solutions Document (00:00)
SparkSQL Advanced
- Executor Memory Architecture (01:28:19)
- AQE | Cache vs Persist (01:48:42)
- Cache Doubts SER DE (53:09)
- Resource Calculations for Spark Application | DRA (01:48:42)
- Coalesce vs Repartition | Dealing with Data Skew (35:02)
- Garbage Collection Tuning (01:12:20)
- Join Strategies 1 : Broadcast Join (35:02)
- Join Strategies | Broadcast | Shuffle Hash | Sort Merge (01:10:32)
Module 14 : Spark Streaming
- Spark Streaming – I (47:14)
- Spark Streaming – II (01:48:41)
- Spark Streaming – III (01:49:49)
Module 15 : Kafka Sessions
- Kafka Installation (18:04)
- Session 1 (48:44)
- Session 2 (39:55)
- Session 3 (15:00)
- Session 4 (19:09)
- Session 5 (01:04:18)
- Flume to Kafka (36:44)
Module 16 : MongoDB NoSQL
- Introduction to NoSQL | Use Cases | Types (58:09)
- MongoDB Essential Elements | CRUD Operations (37:06)
- MongoDB Indexing (01:27:02)
Module 17 : Airflow
- SL | Airflow | Session 1
- SL | Airflow | Session 2
- SL | Airflow | Session 3
End To End PySpark Project
- Project Prerequisites (38:36)
- Code Walkthrough (58:07)
- Logging in Python (41:36)
Course Material
- tg_vm Setup (20:56)
- Presentation Slides 22 July (00:00)
- Presentation Slides 7 Jan (00:00)
AWS Data Engineering
- Big Data With AWS | Demo Session (36:43)
- AWS IAM User, Group and Policies (01:26:35)
- Big Data With AWS | IAM Roles (01:23:47)
- AWS Cloud Infrastructure (31:39)
- Big Data With AWS | S3 Session 1 (01:40:25)
- S3 Session 2 (01:00:49)
- S3 Session 3 (01:00:45)
- AWS Glue | Crawlers And Jobs (01:23:58)
- Glue Scenarios | Glue Workflows (01:24:01)
- Glue Scenarios (01:15:03)
- EMR basics EC2 introduction (23:36)
- EMR basics IAM Role (24:26)
- Starting EMR cluster | Connecting to EMR Cluster (58:25)
- AWS EMR | Starting and Deploying Spark Application on EMR (58:04)
- AWS EMR | Cluster Mode Deployment | Accessing Web UIs | Steps Introduction (58:48)
- BDA | Deploying Spark Application in Cluster Mode on EMR (11:15)
- Deploying Spark Application using Steps on AWS EMR (01:02:03)
- Athena Basics (01:23:48)
- Athena on Command Line (59:43)
- Using Athena through Python code (41:27)
- Redshift And Data Warehousing Introduction (41:34)
- Redshift Clusters | Snapshots | S3 Copy (59:16)
- Creating Redshift cluster and making it publicly accessible (13:58)
- Redshift connect using SQL Workbench and Python script (38:55)
- Dist keys and sort keys in Redshift (24:30)
- DIST Keys hands on (30:38)
- Redshift Federated Queries (40:58)
- What is Streaming Data | Streaming Data Terminologies (23:24)
- Kinesis Data Streams | Kinesis Architecture and terminologies (40:05)
- Streaming Data using Console Producer | Python Producer and Python Consumer (39:48)