Apache Spark 3 For Data Engineering & Analytics With Python
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 KHz
Language: English (US) | Size: 1.93 GB | Duration: 8h 39m
Learn how to use Python and PySpark 3.0.1 for Data Engineering / Analytics (Databricks) - Beginner to Ninja
What you'll learn
Learn the Spark Architecture
Learn Spark Execution Concepts
Learn Spark Transformations and Actions using the Structured API
Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API
Learn how to set up your own local PySpark Environment
Learn how to interpret the Spark Web UI
Learn how to interpret DAG (Directed Acyclic Graph) for Spark Execution
Learn the RDD (Resilient Distributed Datasets) API (Crash Course)
Learn the Spark DataFrame API (Structured APIs)
Learn Spark SQL
Learn Spark on Databricks
Learn to Visualize (Graphs and Dashboards) Data on Databricks
Requirements
A basic laptop or PC running Windows or macOS with at least 6-8 GB of RAM
Basic programming knowledge
Description
The key objectives of this course are as follows:

Learn the Spark Architecture
Learn Spark Execution Concepts
Learn Spark Transformations and Actions using the Structured API
Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API
Learn how to set up your own local PySpark Environment
Learn how to interpret the Spark Web UI
Learn how to interpret the DAG (Directed Acyclic Graph) for Spark Execution

Learn the RDD (Resilient Distributed Datasets) API (Crash Course)
RDD Transformations
RDD Actions

Learn the Spark DataFrame API (Structured APIs)
Create Schemas and Assign DataTypes
Read and Write Data using the DataFrame Reader and Writer
Read Semi-Structured Data such as JSON
Create New Data Columns in the DataFrame using Expressions
Filter the DataFrame using the "Filter" and "Where" Transformations
Ensure that the DataFrame has Unique Rows
Detect and Drop Duplicates
Augment the DataFrame by Adding New Rows
Combine 2 or More DataFrames
Order the DataFrame by Specific Columns
Rename and Drop Columns from the DataFrame
Clean the DataFrame by Detecting and Removing Missing or Bad Data
Create User-Defined Spark Functions
Read and Write to/from Parquet Files
Partition the DataFrame and Write to a Parquet File
Aggregate the DataFrame using Spark SQL functions (count, countDistinct, max, min, sum, sumDistinct, avg)
Perform Aggregations with Grouping

Learn Spark SQL and Databricks
Create a Databricks Account
Create a Databricks Cluster
Create Databricks SQL and Python Notebooks
Learn Databricks Shortcuts
Create Databases and Tables using Spark SQL
Use DML, DQL, and DDL with Spark SQL
Use Spark SQL Functions
Learn the Differences between Managed and Unmanaged Tables
Read CSV Files from the Databricks File System
Learn to Write Complex SQL
Create Visualisations with Databricks
Create a Databricks Dashboard

The Python Spark projects that we are going to do together (short illustrative code sketches follow the Technology Spec below):

Sales Data
Create a Spark Session
Read a CSV file into a Spark DataFrame
Learn to Infer a Schema
Select data from the Spark DataFrame
Produce analytics that show the topmost sales orders per Region and Country

Convert Fahrenheit to Degrees Centigrade
Create a Spark Session
Read and Parallelize data using the Spark Context into an RDD
Create a Function to Convert Fahrenheit to Degrees Centigrade
Use the Map Function to convert data contained within an RDD
Filter temperatures greater than or equal to 13 degrees Celsius

XYZ Research
Create a set of RDDs that hold Research Data
Use the union transformation to combine RDDs
Learn to use the subtract transformation to remove values from an RDD
Use the RDD API to answer the following questions:
How many research projects were initiated in the first three years?
How many projects were completed in the first year?
How many projects were completed in the first two years?

Sales Analytics
Create the Sales Analytics DataFrame from a set of CSV Files
Prepare the DataFrame by applying a Structure
Remove bad records from the DataFrame (Cleaning)
Generate New Columns in the DataFrame
Write a Partitioned DataFrame to a Parquet Directory
Answer the following questions and create visualizations using Seaborn and Matplotlib:
What was the best month in sales?
Which city sold the most products?
What time should the business display advertisements to maximize the likelihood of customers buying products?
What products are often sold together in the state "NY"?

Technology Spec
Python
Jupyter Notebook
JupyterLab
PySpark (Spark with Python)
Pandas
Matplotlib
Seaborn
Databricks
SQL
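To give a flavour of the Sales Data project, here is a minimal PySpark sketch of the DataFrame workflow it describes. The CSV path and the column names Region, Country, and Total are assumptions for illustration only, not the course's actual dataset.

    # Minimal sketch: read a CSV with an inferred schema and find the top
    # sales order value per Region and Country. Path and column names are assumed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("SalesData").getOrCreate()

    # Read a CSV file into a Spark DataFrame, letting Spark infer the schema
    sales_df = (spark.read
                .option("header", True)
                .option("inferSchema", True)
                .csv("data/sales_records.csv"))

    # Topmost sales orders per Region and Country
    top_sales = (sales_df
                 .groupBy("Region", "Country")
                 .agg(F.max("Total").alias("TopOrderValue"))
                 .orderBy(F.desc("TopOrderValue")))

    top_sales.show(10)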
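The Fahrenheit-to-Centigrade project exercises the RDD API directly. The sketch below follows the steps listed above; the sample temperatures are invented for illustration.

    # Minimal sketch: parallelize data into an RDD, convert with map, filter >= 13 C.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("FahrenheitToCentigrade").getOrCreate()
    sc = spark.sparkContext

    fahrenheit = [59, 57.2, 53.6, 55.4, 51.8, 53.6, 55.4]  # assumed sample data

    def fahrenheit_to_centigrade(f):
        # Standard conversion: C = (F - 32) * 5 / 9
        return (f - 32) * 5.0 / 9.0

    temps_c = (sc.parallelize(fahrenheit)
                 .map(fahrenheit_to_centigrade)
                 .filter(lambda c: c >= 13))

    print(temps_c.collect())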
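The XYZ Research project is built around the union and subtract RDD transformations. The sketch below shows the idea only; the project codes and year groupings are placeholders, not the course's data.

    # Minimal sketch: combine RDDs with union, remove values with subtract.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("XYZResearch").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical project codes initiated in each year
    year1 = sc.parallelize(["P1", "P2", "P3"])
    year2 = sc.parallelize(["P4", "P5"])
    year3 = sc.parallelize(["P6"])
    completed_year1 = sc.parallelize(["P1"])
    completed_year2 = sc.parallelize(["P2", "P4"])

    initiated_first_three_years = year1.union(year2).union(year3)
    completed_first_two_years = completed_year1.union(completed_year2)
    still_open = initiated_first_three_years.subtract(completed_first_two_years)

    print(initiated_first_three_years.count())  # initiated in the first three years
    print(completed_first_two_years.count())    # completed in the first two years
    print(still_open.collect())                 # projects not yet completed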
Who this course is for:
Python Developers who wish to learn how to use the language for Data Engineering and Analytics with PySpark
Aspiring Data Engineering and Analytics Professionals
Data Scientists / Analysts who wish to learn an analytical processing strategy that can be deployed over a big data cluster
Data Managers who want to gain a deeper understanding of managing data over a cluster