SagarBiswas-MultiHAT/WebSource-Harvester
WebSource Harvester is an educational web-source harvester that crawls a site (BFS, depth-controlled), downloads browser-visible assets (HTML, CSS, JS, images, fonts, PDFs), and rewrites paths so pages work offline, including nested routes. It enforces same-origin limits and is designed for learning, offline analysis, and safe portfolio demos.
Web Source Code Downloader & Crawler
Author: @SagarBiswas-MultiHAT
Category: Educational Web Crawling & Client-Side Security Analysis
Status: Learning-grade, interview-safe, portfolio-ready
“This project performs depth-controlled crawling and client-side source reconstruction, capturing everything a browser can observe from a given URL, while intentionally respecting server-side trust boundaries.”
Tested example: python sourceDownloader.py https://sagarbiswas-multihat.github.io/ --depth 2
Overview
This project is an educational website source code downloader and crawler that extracts and reconstructs everything a browser can observe from a given URL.
It crawls a site with depth-controlled BFS, downloads client-visible resources (HTML, CSS, JS, images, fonts, PDFs, etc.), and rewrites links so the pages work offline, even on nested paths like /blog/*.
The tool respects server trust boundaries and does not attempt to fetch backend code, databases, or private data.
What this project provides (accurate scope)
This tool captures everything a browser can retrieve from a URL:
- HTML pages (multiple pages via crawling)
- Linked CSS files
- JavaScript files
- Images (including `srcset`)
- Fonts and media files
- PDFs and other static assets
- XML files (e.g., `sitemap.xml`)
- Correct offline reconstruction via path rewriting
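The asset types above are discovered by parsing each page's HTML. A minimal sketch of that discovery step using BeautifulSoup (which the project installs via `requirements.txt`); the `extract_asset_urls` helper is illustrative, not the project's actual function:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_asset_urls(html: str, page_url: str) -> set[str]:
    """Collect absolute URLs of assets referenced via src, href, and srcset."""
    soup = BeautifulSoup(html, "html.parser")
    urls = set()
    for tag in soup.find_all(src=True):
        urls.add(urljoin(page_url, tag["src"]))
    for tag in soup.find_all("link", href=True):
        urls.add(urljoin(page_url, tag["href"]))
    for tag in soup.find_all(srcset=True):
        # srcset entries look like "img-1x.png 1x, img-2x.png 2x"
        for entry in tag["srcset"].split(","):
            parts = entry.strip().split()
            if parts:
                urls.add(urljoin(page_url, parts[0]))
    return urls
```

`urljoin` resolves relative references against the page's own URL, which is what keeps nested pages like `/blog/post.html` pointing at the right files.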
Ideal for:
- Learning how real websites are structured
- Offline inspection and analysis
- Client-side security research
- Understanding exposure and attack surface
- Portfolio demonstrations of crawling logic
What this project intentionally does NOT do
By design, this project does not:
- Download backend source code (PHP, Python, Node.js, etc.)
- Access databases or APIs that require authentication
- Execute JavaScript (SPA/React/Vue rendering)
- Bypass authentication, paywalls, or access controls
- Retrieve secrets, tokens, or server configuration
These limitations are intentional and make the project accurate and interview-safe.
Key features
- Depth-controlled crawling (`--depth 2` or `--depth 1-2`)
- Breadth-First Search (BFS) for reliable depth measurement
- Crawl vs analyze separation to control what gets saved
- Same-origin enforcement (no external domain crawling)
- Offline-safe path rewriting for nested pages
- Asset handling for `src`, `href`, and `srcset`
- URL decoding (`%20` → spaces)
- Query collision handling via hash suffix
- Content-type aware saving for missing extensions
- XML-aware parsing for sitemaps and RSS
- Export visited URLs (`--export-urls`) to a text file for analysis
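Two of these features can be sketched in a few lines with the standard library. The helper names (`is_same_origin`, `local_filename`) are hypothetical stand-ins, not the project's API; they show one way to enforce same-origin crawling and to disambiguate query-string collisions with a hash suffix:

```python
import hashlib
from urllib.parse import urlparse


def is_same_origin(url: str, base_url: str) -> bool:
    """Only crawl URLs on the same scheme + host as the base URL."""
    u, b = urlparse(url), urlparse(base_url)
    return (u.scheme, u.netloc) == (b.scheme, b.netloc)


def local_filename(url: str) -> str:
    """Map a URL path to a local file name; disambiguate query strings
    with a short hash suffix so /page?a=1 and /page?a=2 don't collide."""
    parsed = urlparse(url)
    name = parsed.path.rstrip("/").split("/")[-1] or "index.html"
    if parsed.query:
        digest = hashlib.sha1(parsed.query.encode()).hexdigest()[:8]
        stem, dot, ext = name.rpartition(".")
        name = f"{stem}_{digest}.{ext}" if dot else f"{name}_{digest}"
    return name
```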
How depth works (`--depth`)
Depth is measured in link hops from the base URL:
- `0` → only the base URL
- `1` → base URL + pages directly linked from it
- `2` → links from depth-1 pages
- `1-2` → crawl broadly, analyze only depth 1–2
Example structure:
```
Depth 0
└── https://example.com
Depth 1
├── /about
├── /blog
└── /login
Depth 2
├── /blog/post-1
├── /blog/post-2
└── /about/team
```
Depth control reduces noise and focuses on pages where real-world issues usually live.
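The hop-count semantics above fall out naturally from a BFS queue that carries `(url, depth)` pairs. A minimal sketch, where `fetch_links` is a hypothetical stand-in for the real fetch-and-parse step:

```python
from collections import deque


def bfs_crawl(base_url, fetch_links, max_depth):
    """Visit pages breadth-first; depth is the number of link hops
    from base_url, so max_depth=0 fetches only the base URL itself."""
    visited = {base_url}
    queue = deque([(base_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # don't expand links beyond the depth limit
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Because BFS visits all depth-N pages before any depth-(N+1) page, each URL's recorded depth is its true minimum hop count, which is what makes the `--depth` limit reliable.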
Installation
Recommended: use a virtual environment.
```shell
python -m venv .venv
```
Activate:
Windows (PowerShell):
```shell
.venv\Scripts\Activate.ps1
```
Linux / macOS:
```shell
source .venv/bin/activate
```
Install dependencies:
- Install from the bundled `requirements.txt` (recommended):
```shell
pip install -r requirements.txt
```
- Or install packages individually (equivalent):
```shell
pip install beautifulsoup4 lxml
```
Notes:
- `lxml` is optional but recommended (it provides a robust XML parser and removes XML parsing warnings).
- Use the Python provided in your `.venv` when running the `pip` command to ensure packages install into the virtual environment.
Usage
```shell
python sourceDownloader.py <BASE_URL> --depth <DEPTH> [--export-urls]
```
Options
| Option | Description |
|---|---|
| `url` | Base URL to crawl (include `http://` or `https://`) |
| `--depth` | Crawl depth (e.g., `2` or `1-2`). Default: `0` |
| `--export-urls` | Export all visited URLs to `urls.txt` in the output directory |
Examples
```shell
# Crawl only the base URL
python sourceDownloader.py https://example.com --depth 0

# Crawl base URL and pages directly linked from it
python sourceDownloader.py https://example.com --depth 1

# Crawl depth 1-2, analyze pages at depth 1 and 2
python sourceDownloader.py https://example.com --depth 1-2

# Crawl and export all visited URLs to urls.txt
python sourceDownloader.py https://example.com --depth 2 --export-urls
```
Output structure (example)
```
example_com/
├── index.html
├── assets/
│   ├── css/
│   ├── js/
│   ├── images/
│   └── fonts/
├── blog/
│   ├── post-1.html
│   └── post-2.html
└── urls.txt   (when using --export-urls)
```
All pages open offline without broken CSS or images.
Why nested pages work correctly
Many crawlers rewrite assets relative to the project root, which breaks pages like /blog/post.html.
This project rewrites assets relative to each HTML file’s directory, so both root and nested pages load properly.
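The difference can be reproduced with `os.path.relpath`: computing each asset reference relative to the directory of the specific HTML file being saved keeps the link valid no matter how deeply that page is nested. A sketch with illustrative paths (the helper name is hypothetical):

```python
import os


def rewrite_asset_path(asset_local_path: str, page_local_path: str) -> str:
    """Return the href to write into a saved page: the asset's path
    relative to the directory containing this specific HTML file."""
    page_dir = os.path.dirname(page_local_path)
    rel = os.path.relpath(asset_local_path, start=page_dir)
    return rel.replace(os.sep, "/")  # always emit URL-style separators
```

For a stylesheet saved at `example_com/assets/css/style.css`, the root page gets `assets/css/style.css` while a nested `example_com/blog/post-1.html` gets `../assets/css/style.css`; rewriting relative to the project root instead would break the nested page.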
Output behavior (important)
- Assets are saved as binary, so images and PDFs stay intact.
- Pages are saved as HTML, with rewritten local paths.
- External URLs (GitHub badges, CDNs) are kept external.
- Fragment-only links (`#about`) are ignored to reduce crawl noise.
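The last two rules amount to a small link-classification step. A sketch of one way to implement it with `urllib.parse` (the `classify_link` helper is illustrative, not the project's function):

```python
from urllib.parse import urljoin, urlparse


def classify_link(href: str, page_url: str) -> str:
    """Decide what to do with a link found on page_url:
    'ignore' fragment-only links, keep 'external' URLs untouched,
    and mark same-origin links for 'download'."""
    if href.startswith("#"):
        return "ignore"  # fragment-only: same page, no new content
    absolute = urlparse(urljoin(page_url, href))
    base = urlparse(page_url)
    if (absolute.scheme, absolute.netloc) != (base.scheme, base.netloc):
        return "external"  # e.g. GitHub badges, CDN assets stay remote
    return "download"
```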
Troubleshooting
- Images missing on nested pages → fixed by file-relative rewriting
- Responsive images missing → `srcset` entries are downloaded and rewritten
- Resume/PDF not opening → binary assets are saved directly
- XML warnings → install `lxml`, or ignore them (an HTML-parser fallback is handled)
Ethical & Legal Notice
This tool is for educational and authorized testing only. Always respect terms of service and robots policies.
Future improvements (optional)
- Headless rendering (Playwright) for JS-heavy sites
- robots.txt enforcement
- JSON crawl reports
- Security header analysis
- Rate limiting and concurrency
- Authentication support for authorized environments
Final note
This project is designed to be honest, technically correct, and impressive without exaggeration. It demonstrates strong understanding of web architecture, crawling logic, and security boundaries.

