lochness-labs/facebook-ingestion
Facebook Ingestion workflow as a Glue Job
Facebook Ads Ingestion for Data Lakes
A Data Lake ingestion connector for the Facebook Ads APIs
The main component of the repository is an AWS Glue job of type pythonshell, which uses the Facebook Marketing API to retrieve Facebook Ads entities (such as ad, ad_set, campaign, ad_image, ad_insights, etc., also known as Facebook extractions) for all advertising accounts under your business account on Facebook. For each API object, the job retrieves the last execution time and fetches all data updated or created since then. It then stores the data with AWS Data Wrangler, which sinks it to S3 and generates a validation metadata file. The Glue job is deployed by the Serverless Framework stack; the script is located at facebook-ingest/src/facebook_ingest.py.
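The per-entity incremental loop described above can be sketched as follows. This is a minimal illustration: the helper names (`read_last_run`, `fetch_since`, `write_parquet`) are hypothetical stand-ins, not the actual functions in facebook_ingest.py.

```python
def ingest_entity(entity, read_last_run, fetch_since, write_parquet):
    """Incrementally ingest one Facebook Ads entity (e.g. 'campaign').

    read_last_run(entity)  -> timestamp of the previous successful pull, or None
    fetch_since(entity, t) -> records updated/created since t (full pull if t is None)
    write_parquet(entity, records) -> persists the records (e.g. to S3 via AWS Data Wrangler)
    """
    last_run = read_last_run(entity)          # None on the very first run
    records = fetch_since(entity, last_run)   # only deltas on subsequent runs
    if records:
        write_parquet(entity, records)
    return len(records)
```

In the real job, `fetch_since` would call the Facebook Marketing API and `write_parquet` would use AWS Data Wrangler to sink to S3; injecting them as callables here just keeps the incremental logic visible in one place.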
The infrastructure is described (IaC) and deployed with Serverless Framework (https://www.serverless.com/framework/). The entry point is facebook-ingest/serverless.yml.
The infrastructure has been developed on the AWS Cloud Platform.
Getting Started
Requirements
- Node.js and NPM: https://nodejs.org/en/download/
- Serverless Framework: https://www.serverless.com/framework/docs/getting-started/
For local development only
- Python: https://www.python.org/downloads/
- virtualenv: https://virtualenv.pypa.io/en/latest/installation.html
Environments setup
The facebook-ingest/env/ contains the environment configuration files, one for each of your AWS environments.
The names of the files correspond to the environment names. For example: substitute example-environment.yml with dev.yml for a development environment.
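As an illustration, a dev.yml could look like the sketch below. The keys shown are inferred from the placeholders mentioned in the deployment steps; check example-environment.yml for the actual field names.

```yaml
# Hypothetical dev.yml -- adapt the keys to match example-environment.yml
account_id: "000000000000"                 # your AWS account ID
region: eu-west-1                          # your AWS region
data_bucket: example-data-s3-bucket-name   # data lake S3 bucket
code_bucket: example-code-s3-bucket-name   # Glue code S3 bucket
secret_name: accessToken-appId-appSecret-businessId/facebookApi/ingestion
```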
Development environment setup
- Create virtualenv: `virtualenv -p python3 venv`
- Activate virtualenv: `source venv/bin/activate`
- Install requirements: `pip install -r requirements.txt`
Deployment instructions
- You need two AWS S3 buckets, one for the Glue code and one as the data lake. If you already have them, keep their names in mind for the next steps; otherwise, create the buckets on S3.
- Make a copy of `facebook-ingest/env/example-environment.yml`, name it as your desired environment's name, and substitute:
  - `000000000000` with your AWS account ID.
  - `example-data-s3-bucket-name` with your data lake AWS S3 bucket.
  - `example-code-s3-bucket-name` with your code AWS S3 bucket.
  - `eu-west-1` with your AWS region.
- Substitute `000000000000` with your AWS Account ID in `facebook-ingest/serverless-parts/resources.yml`.
- Make a secret on AWS Secrets Manager for your Facebook access token and save its name in the `secret_name` field of your environment files located in `facebook-ingest/env/`. For example, we named it `accessToken-appId-appSecret-businessId/facebookApi/ingestion`.
- Check and substitute the S3 bucket and key as needed in the `wr`, `facebook_sdk` and `pandas` fields of your environment files located in `facebook-ingest/env/`.
- Go to the `facebook-ingest` folder: `cd facebook-ingest`.
- Install npm dependencies: `npm install`.
- Deploy on AWS with `sls deploy --stage {stage}`, substituting `{stage}` with one of the available stages, i.e. the names of the YAML files in the `facebook-ingest/env/` directory.
Note: You can set `execute_libraries_upload` to False in facebook-ingest/serverless-parts/custom.yml to speed up the deployment if there are no updates to the libraries.
Usage
You can start the Glue job manually from the AWS console or through any of the supported AWS methods, such as the AWS CLI or the AWS SDKs.
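For example, with the AWS SDK for Python (boto3) a run can be started as shown below. The job name here is a placeholder; use the name of the Glue job deployed by your stack.

```python
def start_ingestion_job(glue_client, job_name="facebook-ingest"):
    """Start a Glue job run and return its run id.

    Pass a real client, e.g. glue_client = boto3.client("glue").
    The default job_name is a placeholder, not the deployed job's actual name.
    """
    response = glue_client.start_job_run(JobName=job_name)
    return response["JobRunId"]
```

The same call is available in the AWS CLI as `aws glue start-job-run --job-name <name>`.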
There is also a triggering schedule enabled by default, described below:
Trigger Schedule
By default, the glue job is triggered by the following rules:
- Every 55 minutes, between 6:00 and 20:00, from Monday to Friday
- At 10:00 on Saturdays and Sundays
You can change the rules on the Glue.triggers YAML property in the facebook-ingest/serverless.yml file.
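In AWS cron syntax, the default rules above correspond to expressions along these lines. The surrounding trigger keys are only a sketch, not copied from the repository; check serverless.yml for the exact serverless-glue schema.

```yaml
# Sketch only -- see Glue.triggers in facebook-ingest/serverless.yml for the real schema
Glue:
  triggers:
    - name: weekday-trigger
      schedule: cron(0/55 6-20 ? * MON-FRI *)   # every 55 minutes, 6:00-20:00, Mon-Fri
    - name: weekend-trigger
      schedule: cron(0 10 ? * SAT,SUN *)        # at 10:00 on Saturdays and Sundays
```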
Contributing
Feel free to contribute! Create an issue and submit PRs (pull requests) in the repository. Contributing to this project assumes a certain level of familiarity with AWS, the Python language and concepts such as virtualenvs, pip, modules, etc.
Try to keep commits inside the rules of https://www.conventionalcommits.org/. The sailr.json file is used for configuration of the commit hook, as per: https://github.com/craicoverflow/sailr.
License
This project is licensed under the Apache License 2.0.
See LICENSE for more information.
Acknowledgements
Many thanks to the maintainers of the open source libraries used in this project:
- Serverless Framework: https://github.com/serverless/serverless
- Serverless Glue: https://github.com/toryas/serverless-glue
- Pandas: https://github.com/pandas-dev/pandas
- AWS Data Wrangler: https://github.com/awslabs/aws-data-wrangler
- Boto3 (AWS SDK for Python): https://github.com/boto/boto3
- Facebook Business SDK for Python: https://github.com/facebook/facebook-python-business-sdk
- Sailr (conventional commits git hook): https://github.com/craicoverflow/sailr/
Contact us if we missed an acknowledgement to your library.
This is a project created by Linkalab and Talent Garden.