PySpark - Python Spark Hadoop Coding Framework & Testing
Last updated 11/2021
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 KHz
Language: English | Size: 1.46 GB | Duration: 3h 33m
Big data Python Spark PySpark coding framework logging error handling unit testing PyCharm PostgreSQL Hive data pipeline
What you'll learn
Python Spark PySpark industry standard coding practices - Logging, Error Handling, reading configuration, unit testing
Building a data pipeline using Hive, Spark and PostgreSQL
Python Spark Hadoop development using PyCharm
Requirements
Basic programming skills
Basic database skills
Hadoop entry level knowledge
Description
This course will bridge the gap between your academic and real-world knowledge and prepare you for an entry-level Big Data Python Spark developer role. You will learn the following:
Python Spark coding best practices
Logging
Error handling
Reading configuration from a properties file
Doing development work using PyCharm
Using your local environment as a Hadoop Hive environment
Reading from and writing to a Postgres database using Spark
Python unit testing framework
Building a data pipeline using Hadoop, Spark and Postgres
Prerequisites:
Basic programming skills
Basic database knowledge
Hadoop entry level knowledge
Overview
Section 1: Introduction
Lecture 1 Introduction
Lecture 2 What is Big Data Spark?
Section 2: Setting up Hadoop Spark development environment
Lecture 3 Environment setup steps
Lecture 4 Installing Python
Lecture 5 Installing PyCharm
Lecture 6 Creating a project in the main Python environment
Lecture 7 Installing JDK
Lecture 8 Installing Spark 3 & Hadoop
Lecture 9 Running PySpark in the Console
Lecture 10 PyCharm PySpark Hello DataFrame
Lecture 11 PyCharm Hadoop Spark programming
Lecture 12 Special instructions for Mac users
Lecture 13 Quick tips - winutils permission
Lecture 14 Python basics
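As a rough illustration of what Lectures 9-11 build towards, here is a minimal "Hello DataFrame" sketch you can run locally once Python, the JDK, Spark and winutils are in place (app name and sample data are placeholders, not taken from the course):
from pyspark.sql import SparkSession

# Local SparkSession; master("local[*]") uses all cores on the development machine.
spark = SparkSession.builder.master("local[*]").appName("HelloDataFrame").getOrCreate()

# A tiny in-memory DataFrame to confirm the Spark + Python setup works.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.show()

spark.stop()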
Section 3: Creating a PySpark coding framework
Lecture 15 Structuring code with classes and methods
Lecture 16 How Spark works
Lecture 17 Creating and reusing SparkSession
Lecture 18 Spark DataFrame
Lecture 19 Separating out Ingestion, Transformation and Persistence code
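The coding framework in Section 3 separates ingestion, transformation and persistence into their own classes that share one SparkSession. A minimal sketch of that shape, with class names and sample data invented for illustration rather than taken from the course, might look like this:
from pyspark.sql import SparkSession

def get_spark_session(app_name="CourseFramework"):
    # getOrCreate() returns the running session if one exists, so every
    # class can call this helper and reuse the same SparkSession.
    return SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()

class Ingest:
    def __init__(self, spark):
        self.spark = spark
    def ingest_data(self):
        # Placeholder source; the course reads from Hive in a later section.
        return self.spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])

class Transform:
    def __init__(self, spark):
        self.spark = spark
    def transform_data(self, df):
        return df.filter(df.age > 26)

class Persist:
    def __init__(self, spark):
        self.spark = spark
    def persist_data(self, df):
        df.show()  # stand-in for writing to a database

if __name__ == "__main__":
    spark = get_spark_session()
    df = Ingest(spark).ingest_data()
    df = Transform(spark).transform_data(df)
    Persist(spark).persist_data(df)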
Section 4: Logging and Error Handling
Lecture 20 Python Logging
Lecture 21 Managing log level through a configuration file
Lecture 22 Having custom logger for each Python class
Lecture 23 Error Handling with try except and raise
Lecture 24 Logging using log4p and log4python packages
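Section 4 covers logging driven by a configuration file, one logger per class, and try/except with re-raise. A minimal sketch using only the standard logging module (the config file path and filter logic are placeholders; the log4p/log4python packages from Lecture 24 are not shown):
import logging
import logging.config

# Log level and format come from a config file rather than being hard-coded.
logging.config.fileConfig("configs/logging.conf")

class Transform:
    def __init__(self):
        # One logger per class, named after the class, so log lines show their origin.
        self.logger = logging.getLogger(self.__class__.__name__)

    def transform_data(self, df):
        try:
            self.logger.info("Starting transformation")
            return df.filter(df.age > 26)
        except Exception:
            self.logger.exception("Transformation failed")
            raise  # re-raise so the caller sees the failure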
Section 5: Creating a Data Pipeline with Hadoop Spark and PostgreSQL
Lecture 25 Ingesting data from Hive
Lecture 26 Transforming ingested data
Lecture 27 Installing PostgreSQL
Lecture 28 Spark PostgreSQL interaction with Psycopg2 adapter
Lecture 29 Spark PostgreSQL interaction with JDBC driver
Lecture 30 Persisting transformed data in PostgreSQL
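Section 5's pipeline reads from Hive, transforms, and persists to PostgreSQL. A minimal sketch of the JDBC route (table names, database name and credentials are placeholders; the PostgreSQL JDBC driver jar must be available to Spark):
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark query tables registered in the local Hive metastore.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("HiveToPostgres")
         .enableHiveSupport()
         .getOrCreate())

# Ingest: read a Hive table (name is illustrative).
df = spark.sql("SELECT * FROM course_db.retail_sales")

# Transform: a simple aggregation as a stand-in for the course's logic.
summary = df.groupBy("city").count()

# Persist: write to PostgreSQL over JDBC.
(summary.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/coursedb")
    .option("dbtable", "public.sales_by_city")
    .option("driver", "org.postgresql.Driver")
    .option("user", "postgres")
    .option("password", "postgres")
    .mode("overwrite")
    .save())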
Section 6: Reading configuration from properties file
Lecture 31 Organizing code further
Lecture 32 Reading configuration from a property file
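Reading configuration from a file usually means Python's configparser. A minimal sketch, assuming an INI-style file with section headers (configparser needs them, unlike plain Java .properties files); the file name, section and keys below are placeholders:
import configparser

# Assumed file configs/pipeline.ini:
# [DB]
# url = jdbc:postgresql://localhost:5432/coursedb
# user = postgres
config = configparser.ConfigParser()
config.read("configs/pipeline.ini")

db_url = config.get("DB", "url")
db_user = config.get("DB", "user")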
Section 7: Unit testing PySpark application
Lecture 33 Python unittest framework
Lecture 34 Unit testing PySpark transformation logic
Lecture 35 Unit testing an error
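Unit testing PySpark transformation logic with the unittest framework typically means building a small input DataFrame and asserting on the transformed output. A minimal sketch (test data and the filter rule are invented for illustration):
import unittest
from pyspark.sql import SparkSession

class TransformTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # One local SparkSession shared by all tests in the class.
        cls.spark = SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_filter_keeps_only_adults(self):
        df = self.spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])
        result = df.filter(df.age > 26).collect()
        self.assertEqual(len(result), 1)
        self.assertEqual(result[0]["name"], "alice")

if __name__ == "__main__":
    unittest.main()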
Section 8: spark-submit
Lecture 36 PySpark spark-submit
Lecture 37 Thank you
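For spark-submit, the application needs a script with a plain entry point; a minimal sketch (the file name pipeline_runner.py and its contents are placeholders, not the course's actual script):
# pipeline_runner.py - the script handed to spark-submit
from pyspark.sql import SparkSession

def run():
    spark = SparkSession.builder.appName("CoursePipeline").getOrCreate()
    df = spark.createDataFrame([(1, "ok")], ["id", "status"])
    df.show()
    spark.stop()

if __name__ == "__main__":
    run()
Such a script would typically be launched from the command line with something like spark-submit pipeline_runner.py, adding --jars for the PostgreSQL driver if the pipeline writes over JDBC.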
Section 9: Appendix - PySpark on Colab and DataFrame deep dive
Lecture 38 Running Python Spark 3 on Google Colab
Lecture 39 Spark SQL and DataFrame deep dive on Colab
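On Colab, PySpark can be installed into the notebook environment and the same DataFrame and Spark SQL APIs used side by side. A minimal sketch (sample data and view name are placeholders):
# In a Colab cell, the leading "!" runs a shell command: !pip install pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("ColabSpark").getOrCreate()

df = spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])
df.createOrReplaceTempView("people")

# Spark SQL and the DataFrame API are two routes to the same result.
spark.sql("SELECT name FROM people WHERE age > 26").show()
df.filter(df.age > 26).select("name").show()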
Section 10: Appendix - Big Data Hadoop Hive for beginners
Lecture 40 Big Data concepts
Lecture 41 Hadoop concepts
Lecture 42 Hadoop Distributed File System (HDFS)
Lecture 43 Understanding Google Cloud (GCP) Dataproc
Lecture 44 Signing up for a Google Cloud free trial
Lecture 45 Storing a file in HDFS
Lecture 46 MapReduce and YARN
Lecture 47 Hive
Lecture 48 Querying HDFS data using Hive
Lecture 49 Deleting the Cluster
Lecture 50 Analyzing a billion records with Hive
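To connect the HDFS and Hive lectures in this appendix, here is a minimal sketch of defining a Hive table over files stored in HDFS and querying it from PySpark (the table name, columns and HDFS path are placeholders; the course itself demonstrates this on a GCP Dataproc cluster):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("HiveOnHdfs")
         .enableHiveSupport()
         .getOrCreate())

# Define an external Hive table over a directory of CSV files in HDFS...
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS orders (
        order_id INT, city STRING, amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///user/course/orders'
""")

# ...then query it like any other table.
spark.sql("SELECT city, COUNT(*) AS order_count FROM orders GROUP BY city").show()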
Who this course is for:
Students looking to move from an academic Big Data Spark background to a real-world developer role