Note: for architecture details and documentation, see the doc folder.
Requirements:
- Fill in the MongoDB connection info in configuration.py
- Fill in the Kafka broker IPs in configuration.py
- Create a Kafka topic with a large number of partitions, and fill in the topic name in configuration.py
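The three settings above live in configuration.py. A minimal sketch of what that file might contain is shown below; the variable names (MONGODB_URI, KAFKA_BROKERS, KAFKA_TOPIC) and values are illustrative assumptions, so match them to the names actually used in the repository's configuration.py:

```python
# configuration.py -- illustrative sketch; the actual variable names in this
# repository may differ.

# MongoDB connection info (assumed local instance)
MONGODB_URI = "mongodb://localhost:27017"
MONGODB_DATABASE = "crawler"

# Kafka broker IPs as host:port pairs
KAFKA_BROKERS = ["10.0.0.1:9092", "10.0.0.2:9092"]

# Name of the Kafka topic created with a large partition count;
# more partitions let more consumer processes read in parallel.
KAFKA_TOPIC = "crawler-urls"
```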
How to install it:
- Clone the repository
- Go to the cloned folder
- Install:
  sudo pip3 install -e .
How to run it:
- Run the consumer:
  python3 crawler/run.py
  Note: for better response times, run multiple instances of the consumer as separate processes to increase parallelism. Alternatively, use Docker Swarm or Marathon to start many containers.
- Start the web service:
  python3 crawler/webservice.py
Submit/query:
- To submit a URL, use this example API call (quote the URL, since the unescaped & would otherwise be interpreted by the shell):
  curl "127.0.0.1:5000/api/v1/crawl/submit?url=https://www.apple.com/uk/iphone&domain=https://www.apple.com"
- To query the results:
  curl "127.0.0.1:5000/api/v1/crawl/query?url=https://www.apple.com/uk/iphone"
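If you call the API from Python instead of curl, the query parameters should be percent-escaped so characters like ':', '/', and '&' inside the submitted URL do not break the query string. A small sketch of building such request URLs (the base address assumes the web service runs on Flask's default host and port, as in the curl examples above):

```python
from urllib.parse import urlencode

# Assumed default host/port of crawler/webservice.py, matching the curl examples
BASE = "http://127.0.0.1:5000/api/v1/crawl"

def submit_url(url: str, domain: str) -> str:
    # urlencode percent-escapes ':', '/', and '&' so the server sees
    # a single, unambiguous query string
    return f"{BASE}/submit?{urlencode({'url': url, 'domain': domain})}"

def query_url(url: str) -> str:
    return f"{BASE}/query?{urlencode({'url': url})}"
```

The returned strings can be passed directly to curl or to an HTTP client such as requests.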
Note: the code is tested with Python 3.
Created May 1, 2017
Updated May 1, 2017