Introduction to PySpark
Overview
Instructor: Jian Tao
Time: Friday, March 6, 2020 10:00AM-12:30PM CT
Location: SCC 102.B
Prerequisites: Python
PySpark is the Python API for Apache Spark, an open-source, distributed, general-purpose cluster-computing framework. Spark, written in the Scala programming language, provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since (wikipedia.org).
PySpark is a great tool for performing exploratory data analysis (EDA) at scale, building machine learning models, and deploying large-scale data analysis pipelines.
This short course introduces the functionality of Apache Spark through its Python API and shows how to use PySpark to perform common tasks on both laptops and supercomputers.
Agenda
This course covers the following topics, among others:
- Introduction to PySpark
- Running PySpark programs in Jupyter notebooks
- Resilient Distributed Dataset (RDD)
- Spark DataFrame
- PySpark SQL
- Streaming
- Machine Learning Pipeline
- Hands-on session
This short course will make use of the Jupyter interactive environment. A brief introduction to Jupyter will be given if necessary.
Course Materials
- A quickstart page for PySpark can be found on the Spark website (spark.apache.org): https://spark.apache.org/docs/latest/quick-start.html
- PySpark Tutorial for Beginners: Machine Learning Example (guru99.com): https://www.guru99.com/pyspark-tutorial.html