Introduction to PySpark

Overview

Instructor: Jian Tao

Time: Friday, March 6, 2020 — 10:00AM-12:30PM CT

Location: SCC 102.B

Prerequisites: Python

PySpark is the Python API for Apache Spark, an open-source, distributed, general-purpose cluster-computing framework. Spark, written in the Scala programming language, provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since (wikipedia.org).

PySpark is a great tool for performing exploratory data analysis (EDA) at scale, building machine learning models, and deploying large-scale data analysis pipelines.

This short course will introduce the functionalities of Apache Spark with its Python APIs and show how to use PySpark to perform common tasks on both laptops and supercomputers.

Agenda

This course covers, among others, the following topics:

Introduction to PySpark
Running PySpark programs in a Jupyter notebook
Resilient Distributed Dataset (RDD)
Spark DataFrame
PySpark SQL
Streaming
Machine Learning Pipeline
Hands-on session

This short course will make use of the Jupyter interactive environment. A brief introduction to Jupyter will be given if necessary.

Course Materials

A quick-start page for PySpark can be found on the Spark website (spark.apache.org): https://spark.apache.org/docs/latest/quick-start.html
PySpark Tutorial for Beginners: Machine Learning Example (guru99.com): https://www.guru99.com/pyspark-tutorial.html