PySpark tutorial
To simulate a Big Data workflow, I installed a VM on my local computer, installed Spark, and configured PySpark to work with Jupyter Notebook. For other setup scenarios, check the course Spark and Python for Big Data with PySpark.
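One common way to make `pyspark` launch inside Jupyter is through environment variables. This is a sketch only: the `SPARK_HOME` path below is a placeholder, so adjust it to wherever Spark is installed on your VM.

```shell
# Placeholder path -- point this at your actual Spark install
export SPARK_HOME=/usr/local/spark
export PATH="$SPARK_HOME/bin:$PATH"

# Tell pyspark to start a Jupyter Notebook server as its driver
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

pyspark
```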
The notebook includes:
- DataFrame basics
- DataFrame operations
- DataFrame aggregation
- Missing data
- Dates and timestamps
Dataset
The datasets used (people.json, appl_stock.csv, sales_info.csv, and ContainsNull.csv) can be downloaded from the repository.
Python version
Python 2
Written while taking the course Spark and Python for Big Data with PySpark.