# Salary Scrapy
This project crawls Glassdoor and analyzes salaries per profession and country. The professions are declared in an array in `glassdoor_spider.py`, and the countries live in `utils/country_codes.json`; more can be added to both.
## Crawler
The first part is the crawler. `salaryscrape` crawls specific URLs and downloads data like the following:
```python
{
    'country_currency': 'EUR',
    'job_median_payment': '1298',
    'job_percentile10_payment': '869',
    'job_percentile90_payment': '2434',
    'job_title': 'Data Scientist',
    'location': 'Athens, Attica',
    'sample_size': '56'
}
```
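One way the spider might assemble each record into the shape shown above is a small normalizing helper. This is a sketch, not the project's actual code: `build_item` is a hypothetical name, and only the field names come from the example item.

```python
# Hypothetical helper for normalizing one Glassdoor salary record into the
# item shape shown above. The project stores all values as strings.
def build_item(country_currency, median, p10, p90, title, location, sample_size):
    """Return one salary record with the fields the pipeline expects."""
    return {
        'country_currency': country_currency,
        'job_median_payment': str(median),
        'job_percentile10_payment': str(p10),
        'job_percentile90_payment': str(p90),
        'job_title': title,
        'location': location,
        'sample_size': str(sample_size),
    }
```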
## DynamoDB
This project stores its data in AWS DynamoDB.
- Create a new table called `glassdoor` in DynamoDB and set its partition key to `timestamp`.
- In IAM, create a new user group and attach only the `AmazonDynamoDBFullAccess` policy; then create a user and add it to that group.
- The steps above also give you the Access key ID and Secret access key needed to host this on Heroku. Add those as environment variables in Heroku, along with your Glassdoor username and password to authenticate the session.
- The pipeline that stores the data in DynamoDB, along with the connection initialization, is in `pipelines.py`.
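A pipeline like the one in `pipelines.py` could look roughly like this. This is a minimal sketch using boto3: only the `glassdoor` table and `timestamp` partition key come from the setup steps above; the class name and the `with_key` helper are illustrative.

```python
import time

class DynamoDBPipeline:
    """Sketch of a Scrapy pipeline writing items to the "glassdoor" table."""

    def open_spider(self, spider):
        # boto3 reads credentials from the AWS_* environment variables set
        # in Heroku. Imported lazily so the sketch stays self-contained.
        import boto3
        self.table = boto3.resource('dynamodb').Table('glassdoor')

    def process_item(self, item, spider):
        self.table.put_item(Item=self.with_key(dict(item)))
        return item

    @staticmethod
    def with_key(record):
        # "timestamp" is the table's partition key, so every item needs one.
        record['timestamp'] = str(int(time.time()))
        return record
```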
## Scrapy & Heroku & Flask
- `glassdoor_spider` scrapes the data by building the URLs from the information in the `static_files`.
- Scraping from Heroku is not allowed, so proxies must be used instead (see `salaryscrape/settings.py`), which makes the process slower. To speed it up, fetch valid proxies from an API instead of using a static list.
- `pipelines.py` stores each parsed item in the DynamoDB table.
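Routing requests through a proxy pool can be done with a small Scrapy downloader middleware. This is a sketch under assumptions: the class name and the `PROXIES` list are illustrative, not the project's actual settings; the real list (or proxy API) lives in `salaryscrape/settings.py`.

```python
import itertools

# Illustrative static pool; the project keeps its real list in settings.py.
PROXIES = ['http://proxy1:8080', 'http://proxy2:8080']

class RotatingProxyMiddleware:
    """Round-robin over a proxy pool for every outgoing request."""

    def __init__(self, proxies=PROXIES):
        self._pool = itertools.cycle(proxies)

    def process_request(self, request, spider):
        # Scrapy routes the request through whatever request.meta['proxy']
        # holds when the built-in HttpProxyMiddleware runs.
        request.meta['proxy'] = next(self._pool)
```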
## Scheduler
- Posting a request to `/scheduled_crawl` triggers the spider; the scheduler then takes over and keeps re-triggering it.
- The crawl is scheduled to run once every month to collect fresh data.
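The trigger-then-reschedule flow could be sketched as below. Assumptions: the project uses Flask (per the section above), but the scheduler library is not named in the README, so APScheduler and the `run_spider` callable are illustrative.

```python
# Sketch of the /scheduled_crawl flow: fire the spider once, then let a
# background scheduler re-fire it roughly once a month.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

def make_app(run_spider):
    """run_spider is whatever callable kicks off the glassdoor spider."""
    from flask import Flask
    from apscheduler.schedulers.background import BackgroundScheduler

    app = Flask(__name__)
    scheduler = BackgroundScheduler()

    @app.route('/scheduled_crawl', methods=['POST'])
    def scheduled_crawl():
        run_spider()  # trigger immediately...
        # ...then keep re-triggering on a monthly interval.
        scheduler.add_job(run_spider, 'interval', seconds=SECONDS_PER_MONTH)
        scheduler.start()
        return 'scheduled', 200

    return app
```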
## Core Scrape Architecture
## Visualization
Run `app.py` locally.
## How to run
- To run locally, simply change `SPIDER_MODULES`, `NEWSPIDER_MODULE`, and `ITEM_PIPELINES` in `settings.py` to `salaryscrape.spiders`, and do the same for `default` in `scrapy.cfg`. Then run `scrapy crawl glassdoor_spider` from the Scrapy directory.
- To run on Heroku, simply deploy it and hit the `/crawl` endpoint once. Make sure all the environment variables described above are set.
- More jobs/countries can be added in `salaryscrape/utils/country_codes.json`.
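The steps above can be condensed into a short command transcript. This is a sketch: the `AWS_*` and `GLASSDOOR_*` variable names and the app URL placeholder are assumptions, not necessarily the exact names the project reads.

```shell
# Local run: first point SPIDER_MODULES / NEWSPIDER_MODULE / ITEM_PIPELINES
# in settings.py (and "default" in scrapy.cfg) at salaryscrape.spiders, then:
cd salaryscrape
scrapy crawl glassdoor_spider

# Heroku run: set the environment variables (names illustrative), deploy,
# and hit the /crawl endpoint once.
heroku config:set AWS_ACCESS_KEY_ID=... AWS_SECRET_ACCESS_KEY=...
heroku config:set GLASSDOOR_USERNAME=... GLASSDOOR_PASSWORD=...
git push heroku main
curl https://<your-app>.herokuapp.com/crawl
```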
## TODO
- Unit tests
- Add CircleCI for 1) linting and 2) unit tests
- [future perf] Use `.query` instead of `.scan` in the first 2 cases -> https://stackoverflow.com/questions/65282731/dynamodb-select-specific-attributes-using-boto3
Created April 4, 2022
Updated February 22, 2024
