
SagarBiswas-MultiHAT/WebSource-Harvester

WebSource Harvester is an educational web-source harvester that crawls a site (BFS, depth-controlled), downloads browser-visible assets (HTML, CSS, JS, images, fonts, PDFs), and rewrites paths so pages work offline, including nested routes. It enforces same-origin limits and is designed for learning, offline analysis, and safe portfolio demos.

Web Source Code Downloader & Crawler

Badges: Actions Status · License · Last Commit · Open Issues · Top Language · Release

Author: @SagarBiswas-MultiHAT
Category: Educational Web Crawling & Client-Side Security Analysis
Status: Learning-grade, interview-safe, portfolio-ready

“This project performs depth-controlled crawling and client-side source reconstruction, capturing everything a browser can observe from a given URL, while intentionally respecting server-side trust boundaries.”


(Screenshot: after running the project)


Tested example: python sourceDownloader.py https://sagarbiswas-multihat.github.io/ --depth 2


(Screenshot: downloaded elements)



Overview

This project is an educational website source code downloader and crawler that extracts and reconstructs everything a browser can observe from a given URL.

It crawls a site with depth-controlled BFS, downloads client-visible resources (HTML, CSS, JS, images, fonts, PDFs, etc.), and rewrites links so the pages work offline, even on nested paths like /blog/*.

The tool respects server trust boundaries and does not attempt to fetch backend code, databases, or private data.


What this project provides (accurate scope)

This tool captures everything a browser can retrieve from a URL:

  • HTML pages (multiple pages via crawling)
  • Linked CSS files
  • JavaScript files
  • Images (including srcset)
  • Fonts and media files
  • PDFs and other static assets
  • XML files (e.g., sitemap.xml)
  • Correct offline reconstruction via path rewriting

Ideal for:

  • Learning how real websites are structured
  • Offline inspection and analysis
  • Client-side security research
  • Understanding exposure and attack surface
  • Portfolio demonstrations of crawling logic

What this project intentionally does NOT do

By design, this project does not:

  • Download backend source code (PHP, Python, Node.js, etc.)
  • Access databases or APIs that require authentication
  • Execute JavaScript (SPA/React/Vue rendering)
  • Bypass authentication, paywalls, or access controls
  • Retrieve secrets, tokens, or server configuration

These limitations are intentional and make the project accurate and interview-safe.


Key features

  • Depth-controlled crawling (--depth 2 or --depth 1-2)
  • Breadth-First Search (BFS) for reliable depth measurement
  • Crawl vs analyze separation to control what gets saved
  • Same-origin enforcement (no external domain crawling)
  • Offline-safe path rewriting for nested pages
  • Asset handling for src, href, and srcset
  • URL decoding (%20 → spaces)
  • Query collision handling via hash suffix
  • Content-type aware saving for missing extensions
  • XML-aware parsing for sitemaps and RSS
  • Export visited URLs (--export-urls) to a text file for analysis
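
Two of the features above, URL decoding and query-collision handling via a hash suffix, can be sketched together. The helper below is hypothetical (the actual implementation may differ), but it shows the idea: the query string is hashed into the saved filename so /page?a=1 and /page?a=2 do not overwrite each other:

```python
import hashlib
from urllib.parse import urlsplit, unquote

def local_filename(url):
    """Map a URL to a local filename; query strings get a short hash
    suffix so different query variants land in different files."""
    parts = urlsplit(url)
    path = unquote(parts.path)                     # URL decoding: %20 -> space
    name = path.rstrip("/").split("/")[-1] or "index.html"
    if parts.query:
        digest = hashlib.sha1(parts.query.encode()).hexdigest()[:8]
        stem, dot, ext = name.rpartition(".")
        name = f"{stem}_{digest}.{ext}" if dot else f"{name}_{digest}"
    return name
```

The hash is deterministic, so re-running the crawl maps the same URL to the same file.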

How depth works (--depth)

Depth is measured in link hops from the base URL:

  • 0 → only the base URL
  • 1 → base URL + pages directly linked from it
  • 2 → links from depth-1 pages
  • 1-2 → crawl broadly, analyze only depth 1–2

Example structure:

Depth 0
└── https://example.com

Depth 1
├── /about
├── /blog
└── /login

Depth 2
├── /blog/post-1
├── /blog/post-2
└── /about/team

Depth control reduces noise and focuses on pages where real-world issues usually live.
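
The hop counting above can be sketched as a BFS queue that carries each URL's depth and stops expanding at the limit. This is a minimal sketch with assumed names (get_links stands in for the project's HTML link extraction), not the script's actual code:

```python
from collections import deque
from urllib.parse import urlparse

def bfs_crawl(start_url, get_links, max_depth):
    """Visit pages breadth-first, tagging each URL with its hop count
    from start_url; links beyond max_depth are never followed."""
    origin = urlparse(start_url).netloc
    visited = {start_url: 0}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached: do not expand this page
        for link in get_links(url):
            if urlparse(link).netloc != origin:
                continue  # same-origin enforcement
            if link not in visited:
                visited[link] = depth + 1
                queue.append((link, depth + 1))
    return visited
```

Because BFS visits pages in hop order, each URL's recorded depth is its true minimum distance from the base URL, which is what makes depth limits reliable.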


Installation

Recommended: use a virtual environment.

python -m venv .venv

Activate:

Windows (PowerShell):

.venv\Scripts\Activate.ps1

Linux / macOS:

source .venv/bin/activate

Install dependencies:

  1. Install from the bundled requirements.txt (recommended):
pip install -r requirements.txt
  2. Or install packages individually (equivalent):
pip install beautifulsoup4 lxml

Notes:

  • lxml is optional but recommended (it provides a robust XML parser and removes XML parsing warnings).
  • Run pip from inside the activated .venv so packages install into the virtual environment rather than the system Python.

Usage

python sourceDownloader.py <BASE_URL> --depth <DEPTH> [--export-urls]

Options

Option          Description
url             Base URL to crawl (include http:// or https://)
--depth         Crawl depth (e.g., 2 or 1-2). Default: 0
--export-urls   Export all visited URLs to urls.txt in the output directory
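
The --depth value accepts either a single number or a range. A hypothetical parser matching the semantics described in this README (a bare N is treated as analyzing depths 0 through N, while "lo-hi" analyzes only that band):

```python
def parse_depth(spec):
    """Parse a --depth value: '2' means analyze depths 0..2,
    '1-2' means crawl to depth 2 but analyze only depths 1..2."""
    if "-" in spec:
        lo, hi = (int(part) for part in spec.split("-", 1))
    else:
        lo, hi = 0, int(spec)
    if lo < 0 or lo > hi:
        raise ValueError(f"invalid depth spec: {spec!r}")
    return lo, hi
```

Returning a (lo, hi) pair lets the crawler use hi as the crawl limit and lo..hi as the analyze window.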

Examples

# Crawl only the base URL
python sourceDownloader.py https://example.com --depth 0

# Crawl base URL and pages directly linked from it
python sourceDownloader.py https://example.com --depth 1

# Crawl depth 1-2, analyze pages at depth 1 and 2
python sourceDownloader.py https://example.com --depth 1-2

# Crawl and export all visited URLs to urls.txt
python sourceDownloader.py https://example.com --depth 2 --export-urls

Output structure (example)

example_com/
├── index.html
├── assets/
│   ├── css/
│   ├── js/
│   ├── images/
│   └── fonts/
├── blog/
│   ├── post-1.html
│   └── post-2.html
└── urls.txt          (when using --export-urls)

All pages open offline without broken CSS or images.


Why nested pages work correctly

Many crawlers rewrite assets relative to the project root, which breaks pages like /blog/post.html.

This project rewrites assets relative to each HTML file’s directory, so both root and nested pages load properly.
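
That file-relative rewriting can be illustrated with posixpath.relpath (a sketch, not the project's exact code): the path written into each page is computed from that page's own directory, not from the project root.

```python
import posixpath

def rewrite_asset_path(page_path, asset_path):
    """Return the href/src to write inside the page at page_path so the
    asset at asset_path resolves correctly, wherever the page lives."""
    page_dir = posixpath.dirname(page_path) or "."
    return posixpath.relpath(asset_path, start=page_dir)
```

A root-relative crawler would write assets/css/style.css into every page; the file-relative version correctly emits ../assets/css/style.css for pages one level down.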


Output behavior (important)

  • Assets are saved as binary, so images and PDFs stay intact.
  • Pages are saved as HTML, with rewritten local paths.
  • External URLs (GitHub badges, CDNs) are kept external.
  • Fragment-only links (#about) are ignored to reduce crawl noise.
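
The last two rules can be sketched as a small link classifier (a hypothetical helper; the real script's logic may differ): fragment-only links are skipped, cross-origin URLs are left external, and everything else is resolved to an absolute same-origin URL for crawling.

```python
from urllib.parse import urljoin, urlparse

def classify_link(base_url, href):
    """Decide what to do with a raw href found on base_url:
    skip fragment-only links, leave cross-origin URLs external,
    otherwise return the absolute URL to crawl (fragment stripped)."""
    if not href or href.startswith("#"):
        return ("skip", None)          # fragment-only: same page, no fetch
    absolute = urljoin(base_url, href)
    if urlparse(absolute).netloc != urlparse(base_url).netloc:
        return ("external", absolute)  # kept external, not downloaded
    return ("crawl", absolute.split("#", 1)[0])
```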

Troubleshooting

  • Images missing on nested pages → fixed by file-relative rewriting
  • Responsive images missing → srcset entries are downloaded and rewritten
  • Resume/PDF not opening → binary assets are saved directly
  • XML warnings → install lxml or ignore (HTML fallback is handled)

This tool is for educational and authorized testing only. Always respect terms of service and robots policies.


Future improvements (optional)

  • Headless rendering (Playwright) for JS-heavy sites
  • robots.txt enforcement
  • JSON crawl reports
  • Security header analysis
  • Rate limiting and concurrency
  • Authentication support for authorized environments

Final note

This project is designed to be honest, technically correct, and impressive without exaggeration. It demonstrates strong understanding of web architecture, crawling logic, and security boundaries.
