Problem Solving Using PySpark - Regression & Classification

Posted By: ELK1nG

Published 12/2023
MP4 | Video: h264, 1920x1080 | Audio: AAC, 44.1 kHz
Language: English | Size: 1.07 GB | Duration: 1h 49m

Gradient Boosted Trees, XGBoost, Spark NLP, Prophet, Data Cleaning, Descriptive Statistics, Spark SQL

What you'll learn

Data analysis and descriptive statistics with PySpark - Learning to compute essential descriptive statistics for data understanding and summarization

Data Cleaning with PySpark

Predictive modeling with PySpark using Regression

Applying Classification techniques to a real world problem in PySpark

Text analytics using PySpark and Spark NLP

Time-Series modeling with PySpark and Prophet

Introduction to Spark SQL for data querying

Requirements

Basic knowledge of data science and ML principles will be helpful

Familiarity with Python to work with PySpark

A computer with internet access to view the course material

Description

This course is based on real-world problems in PySpark, covering data cleaning, descriptive statistics, and classification and regression modeling.

The first segment introduces descriptive statistics in PySpark: computing fundamental measures such as the mean and standard deviation, and generating an extended statistical summary. The second segment covers data cleaning in PySpark: working with null values and redundant data, and imputing the null values. The third segment is about predictive modeling with PySpark using Gradient Boosted Trees Regression. The fourth and fifth segments apply classification techniques in PySpark: the fourth introduces the Spark XGB Classifier for a classification problem, and the fifth uses a deep learning model for text sentiment classification. The sixth segment is about time series analytics and modeling using PySpark and Prophet. The seventh segment introduces Spark SQL for data querying and analysis.

These segments also include advanced visualization techniques using the Seaborn and Plotly libraries: box plots to understand the distribution of the data and assess outliers, count plots to understand the balance in the proportion of data, a bar chart to represent feature importance in the Gradient Boosted Trees Regression model, a word cloud for text analytics, and an analysis of time series data to extract its seasonality and trend components. Each segment includes a Google Colab notebook that aligns with the lecture.

Overview

Section 1: Introduction

Lecture 1 Introduction

Lecture 2 Problem Solving with PySpark : Regression and Classification

Section 2: Data analysis and descriptive statistics with PySpark

Lecture 3 Setting up PySpark Environment in Google Colab

Lecture 4 Understanding Descriptive Statistics in PySpark

Lecture 5 Understanding Data Filtering and Slicing in PySpark

Lecture 6 Summary of Descriptive Statistics in PySpark and Quiz

Section 3: Data Cleaning with PySpark

Lecture 7 Introduction to Data Cleaning with PySpark

Lecture 8 Setting up PySpark Environment for Data Cleaning on Google Colab

Lecture 9 Understanding the Dataset : Explanatory Analysis and Data Cleaning with PySpark

Lecture 10 PySpark Data Cleaning : Assessment of Null Values and Outliers

Lecture 11 Data Cleaning with PySpark : Imputation Strategy Quiz

Lecture 12 Introduction to Pivot Tables in PySpark

Section 4: Predictive modeling with PySpark using Regression

Lecture 13 Introduction to Regression and Classification Problems in PySpark

Lecture 14 Understanding the Data Set through Explanatory Analysis

Lecture 15 Correlation Analysis and Data Preparation

Lecture 16 Modeling the data using Gradient Boosted Trees Regression

Lecture 17 Understanding Feature Importance

Lecture 18 Gradient Boosted Trees Regression - Quiz

Section 5: Predictive Modeling with PySpark using Classification

Lecture 19 Classification Problem Statement : Supervised Machine Learning

Lecture 20 Data Cleaning and Preparation for XGBoost Classification Model

Lecture 21 XGBoost Classification Model Pipeline using PySpark

Lecture 22 Summary of the segment on Spark XGBoost Classifier

Section 6: Text analytics using PySpark and Spark NLP

Lecture 23 Classification Model for Text Data

Lecture 24 Understanding the Data for Text Classification

Lecture 25 Word Cloud : Text Analytics Quiz

Lecture 26 Spark NLP Pipeline : Classification Model

Section 7: Time Series Analysis and Forecast with PySpark and Prophet

Lecture 27 Introduction to Time Series Analysis : Setting up the Google Colab Notebook

Lecture 28 Explanatory Analysis and Data Cleaning

Lecture 29 Analysis of time series components using advanced visualization techniques

Lecture 30 Use of Prophet Model for Time Series Forecasting

Lecture 31 Time Series Forecasting - Quiz

Section 8: Introduction to Spark SQL

Lecture 32 Introduction to Spark SQL Querying

Lecture 33 Comparison of PySpark statements and Spark SQL Query

Lecture 34 Join in Spark SQL

Lecture 35 Join in Spark SQL - Quiz

This course is suited to anyone interested in analytics using PySpark. It is particularly useful for analysts and engineers interested in Big Data, and for anyone with a basic knowledge of data science and ML principles.