Data Engineering
The amount of data in the world, the forms these data take, and the ways of
interacting with them have all grown rapidly in recent years. The
extraction of useful knowledge from data has long been one of the grand
challenges of computer science, and the dawn of "big data" has transformed the
landscape of data storage, manipulation, and analysis. In this module, we will
look at the tools used to store and interact with data.
The objective of this class is for students to gain:
- First-hand experience with, and detailed knowledge of, computing models, notably cloud computing
- An understanding of distributed programming models and data distribution
- Broad knowledge of many databases and their respective strengths
As part of the Data and Decision Sciences Master's program, this module aims
specifically to provide the tool set students will use for data analysis and
knowledge extraction, applying the skills acquired in the Algorithms of Machine
Learning and Digital Economy and Data Uses classes.
Class structure
The class is structured in four parts:
Data engineering fundamentals
In this primer class, students will cover the basics of Linux command-line
usage, Git, SSH, and data manipulation in Python. The format of this class is
an interactive capture-the-flag event.
Data storage
This module covers database management systems with a focus on SQL systems. For
evaluation, students will install PostgreSQL and MongoDB, manipulate data in
each, and compare the two systems.
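To give a flavor of the kind of comparison involved, the following minimal
sketch contrasts a relational query with a document-style lookup. It is an
illustration only, not the course's evaluation material: Python's built-in
sqlite3 module stands in for PostgreSQL, a list of plain dictionaries stands in
for a MongoDB collection, and the students table, its columns, and the sample
records are all hypothetical.

```python
import sqlite3

# Relational model: a fixed schema declared up front and queried with SQL.
# Assumption: an in-memory SQLite database stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, program TEXT)"
)
conn.executemany(
    "INSERT INTO students (name, program) VALUES (?, ?)",
    [("Ada", "Data Science"), ("Linus", "Computer Science")],
)
cursor = conn.execute(
    "SELECT name FROM students WHERE program = ? ORDER BY name",
    ("Data Science",),
)
sql_names = [row[0] for row in cursor]
conn.close()

# Document model: schema-free records whose fields may differ per document.
# Assumption: a list of dicts stands in for a MongoDB collection.
documents = [
    {"name": "Ada", "program": "Data Science", "skills": ["python", "sql"]},
    {"name": "Linus", "program": "Computer Science"},
]
doc_names = sorted(
    doc["name"] for doc in documents if doc.get("program") == "Data Science"
)

print(sql_names, doc_names)
```

In the relational version the schema is declared before any data is inserted,
while the documents may each carry different fields (here, only one record has
a skills list).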
Data computation
This module gives a technical overview of the computing platforms used in the
data ecosystem. We will briefly cover cluster computing and then go into depth
on cloud computing, using Google Cloud Platform as an example. Finally, a class
on GPU computing will be given in coordination with the deep learning section
of the Algorithms of Machine Learning (AML) class.
Data distribution
In the final module, we cover the distribution of data, with a focus on
distributed programming models. We will introduce functional programming and
MapReduce, then use these concepts in a practical session on Spark. Finally,
students will do a graded exercise with Dask.
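As a rough illustration of how functional programming relates to MapReduce,
here is a minimal single-machine sketch of a word count in plain Python. It is
not Spark or Dask code; the function names (map_phase, shuffle_phase,
reduce_phase, word_count) and the sample input are hypothetical and chosen only
for this example. Frameworks such as Spark apply the same idea, but run the map
and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: fold each group's values into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


# Hypothetical input, made up for this example.
lines = ["big data big tools", "data tools data"]
counts = word_count(lines)
print(counts)
```

Because map_phase looks at one line at a time and reduce_phase at one key at a
time, neither depends on shared state, which is what makes this style easy to
distribute.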