Web scraper to get news article content
Codementor Page
- Build a simple web scraper that returns the content of a news article when given a specific URL. Real products that use similar technology include price-tracking websites and SEO audit tools, which may scrape top search results.
Requirements
Choose one news website - see article examples below for inspiration. Given a specific article URL from the website of your choice, return the title and content of the article to the user.
Example article URLs:
https://www.nytimes.com/2020/09/02/opinion/remote-learning-coronavirus.html
https://www.washingtonpost.com/technology/2020/09/25/privacy-check-blacklight/
https://edition.cnn.com/travel/article/scenic-airport-landings-2020/index.html
For an extra challenge: parse out fields such as the article title, updated date, and byline, and return each one separately to the user.
Suggested Implementation
You can run something like this from the command line:
> python scrape_newyorktimes.py news_url
We suggest using an HTTP library like Requests to get the raw HTML of the URL, then a parsing library like Beautiful Soup to parse the content. Alternatively, you can use a Python scraping framework like Scrapy.
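The Requests plus Beautiful Soup approach could be sketched roughly as follows. This is a minimal illustration, not a finished scraper: the function names `fetch_html` and `parse_article` are hypothetical, and the code assumes the headline sits in the first `<h1>` and the body paragraphs sit inside an `<article>` element, which will not hold for every news site.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the raw HTML for an article URL."""
    # Assumption: some sites reject requests that lack a User-Agent header.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; news-scraper-sketch)"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse_article(html):
    """Return the title and body text found in an article page."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the first <h1> on the page is the headline.
    heading = soup.find("h1")
    title = heading.get_text(strip=True) if heading else ""

    # Assumption: body paragraphs live inside <article>;
    # fall back to every <p> on the page if there is no such element.
    container = soup.find("article") or soup
    paragraphs = [p.get_text(" ", strip=True) for p in container.find_all("p")]
    content = "\n\n".join(text for text in paragraphs if text)

    return {"title": title, "content": content}
```

Calling `parse_article(fetch_html(url))` with an article URL would return a dictionary with `title` and `content` keys. Keeping the download and the parsing in separate functions makes it possible to test the parsing against saved HTML without hitting the website each time.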
References
- You can use XPath to select an element when it has no class or dedicated div to target
- Take note of the Python version you have installed! (reference)
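To illustrate the XPath tip, here is a small sketch that selects paragraphs purely by their position in the tree. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. For real pages, lxml or Scrapy's selectors accept the same style of expression. The sample markup and the `select_paragraphs` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup in which the paragraphs carry no class or id.
SAMPLE = """
<html>
  <body>
    <article>
      <h1>Sample headline</h1>
      <section>
        <p>First paragraph.</p>
        <p>Second <b>bold</b> paragraph.</p>
      </section>
    </article>
  </body>
</html>
"""


def select_paragraphs(markup):
    """Return the text of every <p> nested anywhere under <article>."""
    root = ET.fromstring(markup)
    # ".//article//p" means: any <article> below the root, then any <p> below it.
    nodes = root.findall(".//article//p")
    # itertext() also collects text from nested tags such as <b>.
    return ["".join(node.itertext()).strip() for node in nodes]


print(select_paragraphs(SAMPLE))
```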
Installation
# run the spider
> scrapy runspider news.py
# run the spider and write the scraped items to a CSV file
> scrapy runspider news.py -o nyt.csv