Ryan's Project Notes
- Overall I enjoyed working on this project - there were some interesting challenges with setting up the ELT pipeline, data cleaning, creating staging tables, creating reporting tables, and doing data analysis. I probably spent around 10-15 hours total on this project. Someone could honestly extend this project for another few weeks if they wanted to - there is a lot of interesting data to explore and different ways you can deepen your analysis.
- My project setup is pretty much the same as the starter project. Just make sure to edit .env to change the POSTGRES_URL to your own URL.
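The `.env` reading mentioned above can be sketched in pure Python as follows. This is a minimal stdlib-only illustration of how a `KEY=VALUE` file like `.env` gets into the process environment (the starter project may use the python-dotenv package instead; the `load_env` helper name is hypothetical):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env loader: reads KEY=VALUE lines, skipping blanks and comments.

    Existing environment variables are not overwritten (setdefault), matching
    the usual dotenv convention.
    """
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

After calling `load_env()`, the pipeline can read `os.environ["POSTGRES_URL"]` to build its database connection.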
- My SQL table structure uses source, staging, and reporting tables. The final layer is the IPython notebook, which uses the reporting tables to create data visualizations with Python tools like seaborn and matplotlib.
- In the Data Transformation stage, I create several staging tables (stg_collisions, stg_persons, and stg_vehicles), where I clean and standardize the source data. This makes it easier for us to separate data ingestion, data cleaning, and data analysis.
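The cleaning itself lives in the staging SQL, but the pattern is roughly the following (a hypothetical Python sketch of the same idea, not the project's actual code; the field names are illustrative, not the real schema):

```python
def clean_collision(row: dict) -> dict:
    """Standardize one raw collision record for the staging layer.

    Field names here are illustrative, not the project's actual schema.
    """
    return {
        "collision_id": row.get("collision_id"),
        # Uppercase free-text fields so 'brooklyn' and 'Brooklyn' group together.
        "borough": (row.get("borough") or "UNKNOWN").strip().upper(),
        # Coerce count fields to int, treating missing values as zero.
        "persons_injured": int(row.get("number_of_persons_injured") or 0),
        "persons_killed": int(row.get("number_of_persons_killed") or 0),
    }
```

The same normalize-and-coerce logic maps directly onto SQL expressions like `UPPER(TRIM(borough))` and `COALESCE(..., 0)` in a staging query.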
- If I had more time, I would want to do a bit more to standardize field casing (e.g. using UPPER for all text-based fields). I would also want to check for duplicates using the unique ID fields for each of the tables. I did not get around to these because I was running low on time and based on what I saw in the dataset, it didn't make a huge difference for analysis purposes.
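For reference, the duplicate check described above might look like this (a hypothetical sketch: keep the first record per unique ID and drop later repeats; `collision_id` is assumed to be the unique key):

```python
def dedupe_by_id(rows: list[dict], id_field: str = "collision_id") -> list[dict]:
    """Keep the first record seen for each unique ID, dropping later duplicates."""
    seen = set()
    unique = []
    for row in rows:
        key = row.get(id_field)
        if key in seen:
            continue  # duplicate ID: skip this record
        seen.add(key)
        unique.append(row)
    return unique
```

In SQL the equivalent is typically `ROW_NUMBER() OVER (PARTITION BY collision_id ...)` and keeping only row 1.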
- I also set up several stg_demographics tables where I hardcoded some NYC population-level demographic data (e.g. what percentage of the city is male) so that I could use it in the reporting-table stage for comparison purposes. In the future we may want to adjust this to pull the data from an API provided by a reputable source.
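To illustrate why a hardcoded citywide baseline is useful: the reporting stage can divide an observed share by the baseline share to see whether a group is over- or under-represented in collisions. The numbers and names below are purely illustrative, not the actual stg_demographics values:

```python
# Hypothetical citywide baseline shares (illustrative values only).
CITY_BASELINE = {"male": 0.48, "female": 0.52}

def representation_ratio(observed_share: float, baseline_share: float) -> float:
    """Ratio > 1.0 means the group is over-represented in collisions
    relative to its share of the overall city population."""
    return observed_share / baseline_share
```

For example, if 60% of injured persons were male against a 48% citywide baseline, the ratio would be 1.25.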
- Transform_collisions.sql is where I create all of the reporting tables for analysis.
- The IPython notebook is where I use the reporting tables to create data visualizations using Python tools like seaborn and matplotlib.
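To give a flavor of this layer, a minimal chart against a reporting table might look like the sketch below (assuming pandas and matplotlib; the table shape, column names, and function name are illustrative, not the notebook's actual code):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. outside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

def plot_collisions_by_borough(df: pd.DataFrame, out_path: str) -> None:
    """Bar chart of collision counts per borough from a reporting-table frame."""
    counts = (
        df.groupby("borough")["collision_count"]
        .sum()
        .sort_values(ascending=False)
    )
    ax = counts.plot(kind="bar")
    ax.set_xlabel("Borough")
    ax.set_ylabel("Collisions (2024)")
    ax.set_title("NYC collisions by borough")
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()
```

In the notebook itself, `df` would come from querying the reporting table directly rather than being constructed by hand.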
- Please make sure to read the Key Takeaways & Further Analysis sections below each data visualization. These give some great context about the data and its limitations!
- I have included the data visualization images in the analyze folder as well, in case the IPython notebook has issues for whatever reason.
- The IPython notebook data visualization code was heavily AI-assisted. I found that AI was very good at creating good data visualizations once I had my reporting tables ready and added some additional prompt information with my preferences.
- Throughout all of my code I tried to use good style and include helpful comments.
- Please reach out to me at rmatsumoto1994@gmail.com if you have any questions or if there are issues getting everything to run properly!
- Thanks!
SL/VF Data Engineer Technical Take Home
Build a mini ELT pipeline that extracts recent NYC traffic collision data, loads it into a database, transforms it into analytical tables, and displays the information in a meaningful way.
Evaluation
We are compiling a report of the motor vehicle collisions from 2024 and trying to determine what contributing factors lead to collisions, injuries, and fatalities. We have started the code for you already, but it's up to you to finish the code, transform and model the data, and then use the data to make a report.
- Extract collision data from the NYC Open Data API for 2024 and save the raw data to a .csv file
- Load the raw data from the previously saved .csv files into a local Postgres database
- Transform the data into analytical tables
- Query the transformed tables directly from a Jupyter notebook and display the data in a meaningful way. You can use the notebook provided as a blueprint, or you can choose to display information you find relevant.
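The extract step above can be sketched as follows. This is a minimal example assuming the `requests` package and the Socrata endpoint for the Motor Vehicle Collisions (Crashes) dataset; the dataset id `h9gi-nx95` is believed correct but should be verified against NYC Open Data, and the function names are illustrative:

```python
import csv

# Assumed Socrata endpoint for NYC Motor Vehicle Collisions - Crashes.
API_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"

def query_params(year: int, limit: int = 50000) -> dict:
    """SoQL parameters selecting collisions whose crash_date falls in `year`."""
    return {
        "$where": (
            f"crash_date between '{year}-01-01T00:00:00' "
            f"and '{year}-12-31T23:59:59'"
        ),
        "$limit": limit,
    }

def extract_to_csv(year: int, out_path: str) -> None:
    """Fetch one year of collision records and save them raw to a CSV file."""
    import requests  # imported here so the pure helpers above need no extra deps

    resp = requests.get(API_URL, params=query_params(year), timeout=60)
    resp.raise_for_status()
    rows = resp.json()
    # Union of keys across rows, since Socrata omits empty fields per record.
    fieldnames = sorted({key for row in rows for key in row})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Note that Socrata pages results, so a full-year extract may need to loop with `$offset` until fewer than `$limit` rows come back.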
We have provided starter code. Feel free to use as much or as little of it as you would like. If you decide to use different technologies than what is provided, please leave detailed instructions on how to run your project in a README with your submission.
What we are looking for
- Does it work?
- Is the code clean and accessible to others?
- Decisions on data modeling and transformation
- We want to be able to understand your thought process
- How did you handle cleaning the data?
- SQL and Python knowledge
Time Limit
The purpose of the test is not to measure the speed of code creation. Please try to finish within 5 days of being sent the code test; if you need extra time, that is fine and will not be strictly penalized.
Submitting Your Code
The preferred way to submit your code is to create a fork of this repo, push your changes to the forked repo, and then either:
- open a pull request against the original repo from your forked repo
- grant access to your forked repo to erhowell, so that we can access the code there.
Alternatively, you may submit the code in the form of a zip file and send it to erhowell@swingleft.org.
Please be sure to include a README in your submission with full details on how to set up and run your code, as well as answers to the following questions:
- Roughly how long did this project take you?
- How did you feel about this project, and what issues did you face, if any?
Speed is not what we are evaluating; we are evaluating the process as a whole and the effort it takes to complete it.
Questions or Concerns
If you have any questions at all, feel free to reach out to erhowell@swingleft.org
Running The Code
[If you choose to clone this repo and work from the hello-world sample, please use the directions below. If you implement another solution using a different language or framework, please update these directions to reflect your code.]
Setup
This project requires python. Everyone has their preferred python setup. If you don't, try pyenv. If you're also looking for a way to manage virtual python environments, consider pyenv-virtualenv. Regardless, these instructions assume you have a working python environment.
Set up virtual environment
cd /where/you/like/source/code
# Check to make sure the version of python is correct.
# The starter code is utilizing Python 3.11 to match the environment we are currently on
python -V
python -m venv <env-name>
cd <env-name>
git clone <github-url>
cd <repo-name>  # the directory created by git clone
# Activate your virtualenv so that pip packages are installed
# locally for this project instead of globally.
source ../bin/activate
pip3 install -r requirements.txt
# Register this virtualenv as a Jupyter kernel so the notebook can use it.
python -m ipykernel install --user --name=<env-name> --display-name "Python (NYC Collisions)"
Create your postgres DB.
# Set up the initial state of your DB.
# You can change the name of the db from nyc_collisions to anything you'd like.
# Just be sure to update the postgres url in the .env
createdb <nyc_collisions>
Running the server
# Make sure your environment is running correctly
python main.py
# Working with the notebook
jupyter notebook <path_to_notebook>