AR0NICA/project-wpaa
A tool that analyzes the HTML architecture of a web page and visualizes it as a tree
WPAA: WebPage Architecture Analyzer
WPAA analyzes the HTML architecture of web pages and visualizes it as a tree. It supports both static and dynamic (JavaScript-rendered) pages, which makes DOM structure analysis more intuitive and efficient.
Key Features
- Tree Visualization: Hierarchical representation of HTML structure
- Change Detection: Automatic detection and comparison of webpage structure changes
- Web Interface: Intuitive web UI for easy analysis
- Multiple Export Formats: Support for SVG, interactive HTML, CSV, and Markdown
- Performance Optimization: Asynchronous processing, caching, and memory optimization
- Static/Dynamic Analysis: Support for JavaScript-rendered web pages
- Custom Filtering: CSS selector and attribute filtering capabilities
- Performance Monitoring: Track execution time, memory usage, and cache efficiency
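To make the tree visualization and filtering features concrete, here is a minimal sketch that turns HTML into an indented tree. It is not WPAA's implementation: it uses only the standard library's `html.parser`, and the names `TreeBuilder` and `render_tree`, along with their `exclude`, `include_attrs`, and `max_depth` parameters, are hypothetical stand-ins modeled on the command line options.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Collects (depth, label) pairs for every element (hypothetical helper)."""

    def __init__(self, exclude=(), include_attrs=(), max_depth=None):
        super().__init__()
        self.exclude = set(exclude)
        self.include_attrs = tuple(include_attrs)
        self.max_depth = max_depth
        self.stack = []     # currently open tags
        self.skipping = 0   # > 0 while inside an excluded subtree
        self.nodes = []     # (depth, label) in document order

    def handle_starttag(self, tag, attrs):
        is_void = tag in VOID_TAGS
        if self.skipping or tag in self.exclude:
            if not is_void:
                self.skipping += 1
            return
        depth = len(self.stack)
        if self.max_depth is None or depth <= self.max_depth:
            attr_map = dict(attrs)
            shown = [f'{k}="{attr_map[k]}"'
                     for k in self.include_attrs if k in attr_map]
            self.nodes.append((depth, " ".join([tag] + shown)))
        if not is_void:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # void elements were never pushed
        if self.skipping:
            self.skipping -= 1
        elif self.stack and self.stack[-1] == tag:
            self.stack.pop()


def render_tree(html, **options):
    """Return the element tree of `html` as indented text."""
    builder = TreeBuilder(**options)
    builder.feed(html)
    builder.close()
    return "\n".join("  " * depth + label for depth, label in builder.nodes)


if __name__ == "__main__":
    page = ('<html><head><script>x</script></head>'
            '<body><div class="content"><p>Hi</p><br></div></body></html>')
    print(render_tree(page, exclude=["script"], include_attrs=["class"]))
```

The sketch assumes reasonably well-formed HTML; a real analyzer would need a more forgiving parser for unclosed tags.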
Installation
Requirements
Python 3.7+
pip install -r requirements.txt
Required External Programs:
- Graphviz Installation:
  - Download installer from official website
  - Add bin directory to system PATH (e.g., C:\Program Files\Graphviz\bin)
- ChromeDriver Installation (for dynamic page analysis):
  - Download from ChromeDriver website
  - Save to appropriate location and update path in code:
    service = Service('your/path/to/chromedriver')
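Before running an analysis, it can help to confirm that both external programs are reachable. The following is a small sketch, not part of WPAA, that looks them up on the system PATH with the standard library. The helper name `find_external_tools` is hypothetical; `dot` is the Graphviz layout executable, and `chromedriver` is assumed to be the name of the downloaded driver binary.

```python
import shutil


def find_external_tools(names=("dot", "chromedriver")):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_external_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If `chromedriver` is reported as missing but you saved it elsewhere, the explicit path passed to `Service(...)` above still applies; this check only covers executables on PATH.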
Usage
1. Web Interface (Recommended)
python run_web_interface.py
Access http://127.0.0.1:5000 in your browser for intuitive web-based analysis.
Web Interface Features:
- User-friendly web UI
- Real-time analysis progress display
- Download various output formats
- Change comparison functionality
- Performance statistics
2. Command Line Interface
Basic usage:
python wpaa_run.py --urls https://example.com
Advanced options:
python wpaa_run.py --urls https://example.com https://test.com \
--exclude script style \
--include-attrs class href \
--custom-filter "div.content" \
--max-depth 3 \
--export-html \
--compare-changes \
--show-performance
Command Line Options
- --urls: List of webpage URLs to analyze (required)
- --use-selenium: Use Selenium for dynamic content fetching
- --exclude: HTML tags to exclude (e.g., script style)
- --include-attrs: HTML attributes to include in nodes (e.g., class id href)
- --custom-filter: Filter specific elements using CSS selectors (e.g., div.classname)
- --max-depth: Limit maximum tree depth
- --include-text: Include text content
- --output: Choose output format (text or json)
- --visualize: Visualize tree structure as PNG file
- --export-svg: Export to SVG format
- --export-html: Export to interactive HTML
- --export-csv: Export to CSV format
- --export-markdown: Export to Markdown format
- --compare-changes: Compare with previous version
- --show-performance: Display performance report
- --optimize-tree: Optimize tree structure
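The options above map naturally onto an `argparse` parser. The sketch below shows one way such a parser could be declared; it is an illustration, not the actual contents of `wpaa_run.py`. The function name `build_parser`, the `text` default for `--output`, and the empty-list defaults for `--exclude` and `--include-attrs` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="wpaa_run.py")
    parser.add_argument("--urls", nargs="+", required=True,
                        help="webpage URLs to analyze")
    parser.add_argument("--use-selenium", action="store_true",
                        help="fetch dynamic content with Selenium")
    parser.add_argument("--exclude", nargs="*", default=[],
                        help="HTML tags to exclude")
    parser.add_argument("--include-attrs", nargs="*", default=[],
                        help="HTML attributes to include in nodes")
    parser.add_argument("--custom-filter",
                        help="CSS selector used to filter elements")
    parser.add_argument("--max-depth", type=int,
                        help="maximum tree depth")
    parser.add_argument("--include-text", action="store_true",
                        help="include text content")
    # Assumption: 'text' is the default output format.
    parser.add_argument("--output", choices=["text", "json"], default="text")
    # The remaining options are simple on/off switches.
    for flag in ("--visualize", "--export-svg", "--export-html",
                 "--export-csv", "--export-markdown", "--compare-changes",
                 "--show-performance", "--optimize-tree"):
        parser.add_argument(flag, action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this layout, `--exclude script style` parses to the list `["script", "style"]`, and `--max-depth 3` to the integer `3`.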
Examples
Basic Analysis
python wpaa_run.py --urls https://news.ycombinator.com
Dynamic Content Analysis (Using Selenium)
python wpaa_run.py --urls https://www.example.com --use-selenium
Exclude Specific Tags and Visualize
python wpaa_run.py --urls https://www.example.com --exclude script style meta link --visualize
Include Specific Attributes and JSON Output
python wpaa_run.py --urls https://www.example.com --include-attrs class id href --output json
Interactive HTML with Change Comparison
python wpaa_run.py --urls https://www.example.com --export-html --compare-changes --show-performance
Export to Multiple Formats
python wpaa_run.py --urls https://www.example.com --export-svg --export-csv --export-markdown
Architecture Overview
- Caching: Performance optimization for repeated URL analysis
- Asynchronous Processing: Concurrent analysis of multiple URLs
- Error Handling: Consistent error handling through decorators
- Tree Structure: HTML DOM visualization using the anytree library
Development History
MK-II_2523: Feature improvements completed
- Tree comparison functionality for detecting site changes
- Web interface implementation
- Support for more output formats (SVG, interactive HTML)
- Performance optimization and memory usage improvements
Languages: Python 87.6%, HTML 10.8%, CSS 1.6%
MIT License
Created April 10, 2025
Updated December 31, 2025