hamedhsn/crawler

Note: For architecture and documentation, see the doc folder.

Requirements:

Fill in the MongoDB connection info in configuration.py
Fill in the Kafka broker IPs in configuration.py
Create a Kafka topic with a large number of partitions and fill in the topic name in configuration.py
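As an illustration only, the three settings above might look like the following in configuration.py. The variable names (MONGO_URI, KAFKA_BROKERS, KAFKA_TOPIC) and all values are assumptions made for this sketch; use the names that the file in the repository actually defines.

```python
# Hypothetical sketch of configuration.py; names and values are placeholders,
# not the repository's actual settings.

# MongoDB connection info (standard MongoDB connection-string format).
MONGO_URI = "mongodb://localhost:27017"

# Kafka broker addresses as host:port strings.
KAFKA_BROKERS = ["10.0.0.1:9092", "10.0.0.2:9092"]

# Name of the topic that was created with a large number of partitions.
KAFKA_TOPIC = "crawl-requests"
```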

How to install it:

Clone the repository
Go to the cloned folder
Install: sudo pip install -e . (or sudo pip3 install -e .)

How to run it:

Run consumer:
python crawler/run.py

Note: For better response time, run several instances of the consumer in separate processes to increase parallelism.
Alternatively, use Docker Swarm or Marathon to start many containers.
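One way to start several consumer processes from a single command is a small launcher script. The sketch below is not part of the repository: the function names build_commands and run_consumers are hypothetical, and it assumes it is run from the repository root so that the relative path crawler/run.py resolves.

```python
import subprocess
import sys


def build_commands(num_consumers, script="crawler/run.py"):
    """Return one argv list per consumer process to start."""
    if num_consumers < 1:
        raise ValueError("num_consumers must be at least 1")
    # sys.executable is the interpreter that runs this launcher, so the
    # consumers use the same Python installation.
    return [[sys.executable, script] for _ in range(num_consumers)]


def run_consumers(num_consumers):
    """Start the consumers and block until all of them exit."""
    procs = [subprocess.Popen(cmd) for cmd in build_commands(num_consumers)]
    try:
        for proc in procs:
            proc.wait()
    except KeyboardInterrupt:
        # On Ctrl+C, ask every child to stop, then wait for them to finish.
        for proc in procs:
            proc.terminate()
        for proc in procs:
            proc.wait()
```

Calling run_consumers(4) would start four consumers. How much this helps probably depends on the number of partitions in the topic, which is why the requirements ask for a large number.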

Start the web service:
python crawler/webservice.py

Submit/query:

To submit a URL, use an API call like this example (the quotes keep the shell from treating & as a command separator):
curl "127.0.0.1:5000/api/v1/crawl/submit?url=https://www.apple.com/uk/iphone&domain=https://www.apple.com"
To query the results:
curl "127.0.0.1:5000/api/v1/crawl/query?url=https://www.apple.com/uk/iphone"
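The same two calls can be made from Python. This is a minimal sketch using only the standard library; the helper names build_url and call are hypothetical, and it assumes the web service decodes percent-encoded query parameters, as web frameworks normally do. The response is returned as plain text because its format is not described here.

```python
import urllib.parse
import urllib.request

# Base address taken from the curl examples above.
BASE_URL = "http://127.0.0.1:5000/api/v1/crawl"


def build_url(action, **params):
    """Build an API URL, percent-encoding the query parameters."""
    return f"{BASE_URL}/{action}?{urllib.parse.urlencode(params)}"


def call(action, **params):
    """Send a GET request and return the response body as text."""
    with urllib.request.urlopen(build_url(action, **params)) as resp:
        return resp.read().decode("utf-8")
```

For example, call("submit", url="https://www.apple.com/uk/iphone", domain="https://www.apple.com") submits a page, and call("query", url="https://www.apple.com/uk/iphone") fetches its results. Encoding the parameters means a crawled URL that itself contains & or ? is still passed as a single value.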

Note: The code is tested with Python 3.
