Pyspark - Python Spark Hadoop Coding Framework & Testing

Posted By: ELK1nG

Last updated 11/2021
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 KHz
Language: English | Size: 1.46 GB | Duration: 3h 33m

Big Data Python Spark (PySpark) coding framework: logging, error handling, unit testing, PyCharm, and a PostgreSQL/Hive data pipeline

What you'll learn
Python Spark (PySpark) industry-standard coding practices: logging, error handling, reading configuration, unit testing
Building a data pipeline using Hive, Spark and PostgreSQL
Python Spark Hadoop development using PyCharm
Requirements
Basic programming skills
Basic database skills
Entry-level Hadoop knowledge
Description
This course will bridge the gap between your academic and real-world knowledge and prepare you for an entry-level Big Data Python Spark developer role. You will learn the following:

Python Spark coding best practices
Logging
Error handling
Reading configuration from a properties file
Doing development work using PyCharm
Using your local environment as a Hadoop Hive environment
Reading and writing to a Postgres database using Spark
The Python unit testing framework
Building a data pipeline using Hadoop, Spark and Postgres

Prerequisites:

Basic programming skills
Basic database knowledge
Entry-level Hadoop knowledge

Overview

Section 1: Introduction

Lecture 1 Introduction

Lecture 2 What is Big Data Spark?

Section 2: Setting up Hadoop Spark development environment

Lecture 3 Environment setup steps

Lecture 4 Installing Python

Lecture 5 Installing PyCharm

Lecture 6 Creating a project in the main Python environment

Lecture 7 Installing JDK

Lecture 8 Installing Spark 3 & Hadoop

Lecture 9 Running PySpark in the Console

Lecture 10 PyCharm PySpark Hello DataFrame

Lecture 11 PyCharm Hadoop Spark programming

Lecture 12 Special instructions for Mac users

Lecture 13 Quick tips - winutils permission

Lecture 14 Python basics
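
A taste of what this setup enables: a minimal "hello DataFrame" sketch of the kind written in PyCharm in this section (the column names and rows are illustrative, not taken from the course):

    from pyspark.sql import SparkSession

    # create (or reuse) a local SparkSession
    spark = (SparkSession.builder
             .appName("HelloDataFrame")
             .master("local[*]")
             .getOrCreate())

    # build a small DataFrame from in-memory data (illustrative values)
    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
    df.show()

    spark.stop()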

Section 3: Creating a PySpark coding framework

Lecture 15 Structuring code with classes and methods

Lecture 16 How Spark works

Lecture 17 Creating and reusing SparkSession

Lecture 18 Spark DataFrame

Lecture 19 Separating out Ingestion, Transformation and Persistence code
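
The lectures above structure the code into classes with a reusable SparkSession and separate ingestion, transformation and persistence steps; a minimal sketch of that shape (the class and method names are illustrative assumptions, not the course's actual code):

    from pyspark.sql import SparkSession

    def get_spark_session(app_name="pipeline"):
        # getOrCreate() returns the already-running session if there is one,
        # so every caller shares a single SparkSession
        return SparkSession.builder.appName(app_name).getOrCreate()

    class Pipeline:
        def run(self):
            spark = get_spark_session()
            df = self.ingest(spark)
            transformed = self.transform(df)
            self.persist(transformed)

        def ingest(self, spark):
            # illustrative in-memory source; the course ingests from Hive
            return spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

        def transform(self, df):
            return df.filter(df.id > 1)

        def persist(self, df):
            df.show()

    if __name__ == "__main__":
        Pipeline().run()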

Section 4: Logging and Error Handling

Lecture 20 Python Logging

Lecture 21 Managing log level through a configuration file

Lecture 22 Having a custom logger for each Python class

Lecture 23 Error Handling with try except and raise

Lecture 24 Logging using log4p and log4python packages
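
The logging lectures use Python's standard logging module with a logger per class, and the error handling lecture wraps risky calls in try/except with re-raise; a minimal sketch along those lines (logger configuration and messages are illustrative, and the log4p/log4python packages from Lecture 24 are not shown):

    import logging

    # configure logging once at application startup
    logging.basicConfig(level=logging.INFO)

    class Ingest:
        def __init__(self):
            # a custom logger named after the class
            self.logger = logging.getLogger(self.__class__.__name__)

        def read_file(self, path):
            self.logger.info("Reading %s", path)
            try:
                with open(path) as f:
                    return f.read()
            except FileNotFoundError:
                self.logger.error("File not found: %s", path)
                raise  # re-raise so the caller can decide how to recover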

Section 5: Creating a Data Pipeline with Hadoop Spark and PostgreSQL

Lecture 25 Ingesting data from Hive

Lecture 26 Transforming ingested data

Lecture 27 Installing PostgreSQL

Lecture 28 Spark PostgreSQL interaction with Psycopg2 adapter

Lecture 29 Spark PostgreSQL interaction with JDBC driver

Lecture 30 Persisting transformed data in PostgreSQL
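
A minimal sketch of the Hive-to-PostgreSQL flow this section builds: ingest from a Hive table with Spark SQL, transform, then persist through the PostgreSQL JDBC driver. The table names, connection URL and credentials are placeholders, and the PostgreSQL JDBC driver jar must be available to Spark:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("HiveToPostgres")
             .enableHiveSupport()   # lets Spark query Hive tables
             .getOrCreate())

    # ingest (placeholder database/table names)
    df = spark.sql("SELECT * FROM course_db.course_table")

    # transform (illustrative filter)
    transformed = df.where("some_column IS NOT NULL")

    # persist (placeholder connection details)
    (transformed.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/mydb")
        .option("dbtable", "public.course_table")
        .option("user", "postgres")
        .option("password", "password")
        .mode("overwrite")
        .save())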

Section 6: Reading configuration from properties file

Lecture 31 Organizing code further

Lecture 32 Reading configuration from a property file
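
Reading settings from a properties file is typically done with Python's built-in configparser; a minimal sketch, assuming a hypothetical pipeline.ini with a [DATABASE] section (the course's own file name and keys may differ):

    import configparser

    # hypothetical pipeline.ini:
    # [DATABASE]
    # url = jdbc:postgresql://localhost:5432/mydb
    # user = postgres

    config = configparser.ConfigParser()
    config.read("pipeline.ini")

    db_url = config.get("DATABASE", "url")
    db_user = config.get("DATABASE", "user")
    print(db_url, db_user)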

Section 7: Unit testing PySpark application

Lecture 33 Python unittest framework

Lecture 34 Unit testing PySpark transformation logic

Lecture 35 Unit testing an error
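
A minimal sketch of unit testing a PySpark transformation with the standard unittest module: build a small input DataFrame, run the transformation, and assert on the collected rows. The transformation under test here is a stand-in, not the course's:

    import unittest
    from pyspark.sql import SparkSession

    def keep_adults(df):
        # stand-in transformation under test
        return df.filter(df.age >= 18)

    class TransformTest(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            # one local SparkSession shared by all tests in the class
            cls.spark = SparkSession.builder.master("local[1]").getOrCreate()

        @classmethod
        def tearDownClass(cls):
            cls.spark.stop()

        def test_keep_adults(self):
            df = self.spark.createDataFrame([("a", 17), ("b", 21)], ["name", "age"])
            rows = keep_adults(df).collect()
            self.assertEqual(len(rows), 1)
            self.assertEqual(rows[0]["name"], "b")

    if __name__ == "__main__":
        unittest.main()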

Section 8: spark-submit

Lecture 36 PySpark spark-submit

Lecture 37 Thank you
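
Lecture 36 covers running the application outside the IDE with spark-submit; a typical invocation looks like the following, where the master setting, driver memory and script name are placeholders:

    spark-submit --master local[*] --driver-memory 2g pipeline.py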

Section 9: Appendix - PySpark on Colab and DataFrame deep dive

Lecture 38 Running Python Spark 3 on Google Colab

Lecture 39 Spark SQL and DataFrame deep dive on Colab
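
For the Colab appendix, PySpark can be installed into the notebook environment with pip and run in local mode; a minimal sketch (the package version is not pinned here, while the course targets Spark 3):

    # in a Colab cell
    !pip install pyspark

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("colab").getOrCreate()
    spark.sql("SELECT 'hello colab' AS greeting").show()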

Section 10: Appendix - Big Data Hadoop Hive for beginners

Lecture 40 Big Data concepts

Lecture 41 Hadoop concepts

Lecture 42 Hadoop Distributed File System (HDFS)

Lecture 43 Understanding Google Cloud (GCP) Dataproc

Lecture 44 Signing up for a Google Cloud free trial

Lecture 45 Storing a file in HDFS

Lecture 46 MapReduce and YARN

Lecture 47 Hive

Lecture 48 Querying HDFS data using Hive

Lecture 49 Deleting the Cluster

Lecture 50 Analyzing a billion records with Hive

Who this course is for
Students looking to move from an academic Big Data Spark background to a real-world developer role