# MetaCrawler: Universal Sensitive Data Extraction Platform
A comprehensive Python tool for extracting secrets, metadata, and sensitive data from ALL file types.
## Features

### Universal File Support

- **Documents**: PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX
- **Images**: JPG, PNG, GIF, BMP, TIFF, WEBP (with EXIF/metadata extraction)
- **Archives**: ZIP, TAR, GZ, 7Z, RAR (recursive extraction)
- **Certificates**: PEM, KEY, CRT, CER
- **Code**: JavaScript, PHP, HTML, CSS, JSON, XML
- **Data**: CSV, TXT, LOG, SQL, config files
- **Web**: all web-accessible file types
### Comprehensive Pattern Detection

- **API keys**: Google, AWS, Stripe, GitHub, Slack, Twilio, SendGrid
- **Authentication**: JWT, OAuth, bearer tokens, session tokens
- **Cryptographic material**: private keys, SSH keys, PGP keys, certificates
- **Database connections**: MongoDB, PostgreSQL, MySQL, Redis, SQLite
- **PII**: email addresses, credit cards, SSNs, phone numbers, IP addresses
- **Web3/blockchain**: Ethereum addresses, Bitcoin addresses, private keys
- **Financial**: bank accounts, SWIFT codes, IBANs
- **Medical**: medical records, health insurance details
- **Government**: passport numbers, driver's licenses
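Detection of this kind is typically regex-driven. As a minimal sketch of how a few such patterns might be expressed (the names and regexes below are illustrative, not the tool's actual pattern table):

```python
import re

# Hypothetical subset of a pattern table; the real tool ships 50+ entries.
PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "google_api_key": re.compile(r"\bAIza[0-9A-Za-z_\-]{35}\b"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_\-]+\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan(text):
    """Return (pattern_type, matched_text, span) triples found in text."""
    hits = []
    for name, rx in PATTERNS.items():
        for m in rx.finditer(text):
            hits.append((name, m.group(), m.span()))
    return hits

hits = scan("key=AKIAABCDEFGHIJKLMNOP contact admin@example.com")
```

Each hit carries its position so the tool can report where in the file the match occurred.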
### Advanced Web Crawling

- Intelligent website crawling that respects robots.txt
- Automatic file discovery from HTML content
- Configurable depth and file limits
- Asynchronous concurrent processing
## Installation

### Full Installation (Recommended)

```bash
pip install aiohttp beautifulsoup4 PyPDF2 Pillow exifread olefile python-magic
```

## Quick Start

### Analyze a Single File

```bash
python metacrawler.py -f document.pdf
```

### Crawl a Website

```bash
python metacrawler.py -u https://example.com --crawl
```

### Analyze Multiple Files

```bash
python metacrawler.py -f file1.pdf -f file2.docx -f image.jpg
```

### Analyze a Local Directory

```bash
python metacrawler.py -d ./documents
```

### Advanced Usage

```bash
# Crawl with a depth limit and save results
python metacrawler.py -u https://example.com --crawl --crawl-depth 3 -o results.json

# Analyze targets from a list file
python metacrawler.py -l targets.txt --max-files 50
```

## Usage Examples
### Basic File Analysis

```bash
python metacrawler.py -f sensitive_document.pdf
```

### Comprehensive Website Audit

```bash
python metacrawler.py -u https://target-site.com --crawl --crawl-depth 2 -o website_audit.json
```

### Batch Directory Processing

```bash
python metacrawler.py -d ./project_files --max-files 200 -o project_scan.json
```

### Multiple Target Types

```bash
python metacrawler.py -u https://api.example.com -f config.json -d ./src --crawl
```

## Advanced Options
| Option | Description | Default |
|---|---|---|
| `-u, --url` | Target URL to analyze | - |
| `-f, --file` | Local file to analyze | - |
| `-d, --directory` | Directory to analyze recursively | - |
| `-l, --list` | File containing a list of targets | - |
| `-o, --output` | Output file for results (JSON) | - |
| `--crawl` | Enable website crawling | `False` |
| `--crawl-depth` | Maximum crawl depth | `2` |
| `--max-files` | Maximum number of files to analyze | `100` |
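The table above maps naturally onto `argparse`. As a hypothetical reconstruction of the CLI (the actual definitions live in `metacrawler.py` and may differ):

```python
import argparse

def build_parser():
    """Sketch of the CLI from the options table; flag names are assumptions."""
    p = argparse.ArgumentParser(prog="metacrawler")
    p.add_argument("-u", "--url", help="Target URL to analyze")
    p.add_argument("-f", "--file", action="append", help="Local file (repeatable)")
    p.add_argument("-d", "--directory", help="Directory to analyze recursively")
    p.add_argument("-l", "--list", help="File containing a list of targets")
    p.add_argument("-o", "--output", help="Output file for results (JSON)")
    p.add_argument("--crawl", action="store_true", help="Enable website crawling")
    p.add_argument("--crawl-depth", type=int, default=2)
    p.add_argument("--max-files", type=int, default=100)
    return p

args = build_parser().parse_args(["-u", "https://example.com", "--crawl"])
```

Note `action="append"` on `-f`, which is what allows the repeated `-f file1.pdf -f file2.docx` usage shown above.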
## Output Format
Results are provided in detailed JSON format with the following structure:
```json
{
  "filename": "document.pdf",
  "file_type": "pdf",
  "file_size": 102400,
  "md5_hash": "...",
  "sha256_hash": "...",
  "sensitive_patterns": [
    {
      "pattern_type": "aws_access_key",
      "matched_text": "AKIA*****KEY",
      "risk_score": 0.9,
      "position": [120, 140]
    }
  ],
  "extracted_data": {
    "pdf_data": {
      "metadata": {"author": "John Doe", "title": "Secret Document"},
      "text_content": "...",
      "page_count": 5
    }
  },
  "analysis_timestamp": "2024-01-15T10:30:00"
}
```

## Security Features
### Risk Scoring

- **High risk (0.7–1.0)**: private keys, API secrets, credentials
- **Medium risk (0.4–0.7)**: configuration data, tokens
- **Low risk (0.0–0.4)**: public information, test data
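These buckets make it easy to post-process the JSON output, e.g. to keep only high-risk findings. A minimal sketch, assuming findings shaped like the `sensitive_patterns` entries above:

```python
def risk_bucket(score):
    """Map a 0.0-1.0 risk score to the three buckets described above."""
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"

def high_risk(findings):
    # findings: list of dicts with a "risk_score" key, as in the JSON output
    return [f for f in findings if risk_bucket(f["risk_score"]) == "high"]
```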
### False Positive Reduction
- Intelligent pattern validation
- Common test data filtering
- Context-aware detection
### Safe Data Handling
- Sensitive data masking in output
- Secure memory handling
- No data persistence without explicit consent
## Detection Capabilities

### File Type Detection
- Magic byte signatures
- File extension mapping
- Content-based classification
- Fallback binary/text detection
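The fallback chain above can be sketched in a few lines; the signature table and function name here are illustrative, not the tool's actual implementation:

```python
# A few common magic-byte signatures (illustrative subset).
MAGIC = {
    b"%PDF": "pdf",
    b"\x89PNG": "png",
    b"\xff\xd8\xff": "jpg",
    b"PK\x03\x04": "zip",  # also the container for DOCX/XLSX/PPTX
}

def sniff(data, filename=""):
    """Magic bytes first, file extension second, binary/text fallback last."""
    for sig, ftype in MAGIC.items():
        if data.startswith(sig):
            return ftype
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext:
        return ext
    # Heuristic: NUL bytes usually mean a binary payload
    return "binary" if b"\x00" in data else "text"
```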
### Pattern Recognition
- 50+ sensitive data patterns
- Regular expression-based matching
- Multi-format support (Base64, Hex, etc.)
- Contextual validation
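Multi-format support matters because secrets are often Base64-encoded before they land in a file. One way this can work, as a sketch (the regex threshold and function name are assumptions):

```python
import base64
import re

# Runs of 24+ Base64 characters are decode candidates.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def decode_candidates(text):
    """Decode plausible Base64 runs so secrets hidden inside can be re-scanned."""
    out = []
    for m in B64_RUN.finditer(text):
        try:
            # validate=True rejects runs that only look like Base64
            out.append(base64.b64decode(m.group(), validate=True).decode("utf-8", "replace"))
        except ValueError:  # binascii.Error is a ValueError subclass
            continue
    return out

blob = base64.b64encode(b"password=supersecretvalue").decode()
```

The decoded strings would then be fed back through the regular pattern scan.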
### Metadata Extraction
- PDF metadata and text content
- Image EXIF and GPS data
- Office document properties
- Archive contents listing
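Archive listing, for example, needs no third-party dependency for ZIP files. A self-contained sketch using the standard library (the helper name is illustrative):

```python
import io
import zipfile

def list_zip_contents(data: bytes):
    """Return (name, size) pairs for each member of an in-memory ZIP."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]

# Build a small ZIP in memory to demonstrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("config/.env", "API_KEY=test")
listing = list_zip_contents(buf.getvalue())
```

Member names alone are often revealing (`.env`, `id_rsa`, `backup.sql`), even before any member is extracted.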
## Web Crawling Features

### Intelligent Discovery
- HTML parsing for file links
- Script and resource detection
- Sitemap and directory enumeration
- Recursive link following
### Respectful Crawling
- robots.txt compliance
- Configurable delay between requests
- Domain restriction options
- Rate limiting
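robots.txt compliance is available in the standard library via `urllib.robotparser`. A sketch of the check, assuming the robots.txt body has already been fetched (the user-agent string is an assumption):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, url: str, agent: str = "MetaCrawler") -> bool:
    """Check a URL against an already-fetched robots.txt body."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
```

A respectful crawler runs this check before every request and skips disallowed paths.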
### Async Performance
- Concurrent file processing
- Non-blocking network operations
- Configurable connection limits
- Efficient memory usage
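The usual way to cap concurrency in asyncio is a semaphore around each task. A minimal sketch of that pattern (the coroutine body is a placeholder for real file analysis):

```python
import asyncio

async def analyze(path: str, sem: asyncio.Semaphore) -> str:
    # Placeholder for the real per-file analysis coroutine.
    async with sem:
        await asyncio.sleep(0)  # yield control, as real I/O would
        return f"done:{path}"

async def run_all(paths, limit=10):
    """Process files concurrently, never more than `limit` at once."""
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(analyze(p, sem) for p in paths))

results = asyncio.run(run_all(["a.pdf", "b.docx"]))
```

`asyncio.gather` preserves input order, so results line up with the target list regardless of completion order.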
## Troubleshooting

### Common Issues
#### Missing Dependencies

```bash
# Install all optional dependencies
pip install PyPDF2 Pillow exifread olefile python-magic
```

#### SSL Certificate Errors

- The tool automatically bypasses SSL verification for testing
- Use it in controlled environments only

#### Memory Issues with Large Files

- Use `--max-files` to limit processing
- The tool includes safeguards for large archive processing
### Performance Tips

- Use `--max-files` for large directories
- Adjust `--crawl-depth` based on target size
- Process files in batches for memory efficiency
## Adding New Patterns

Edit the `ComprehensivePatternEngine` class to add new detection patterns:

```python
"new_pattern": r"your_regex_pattern_here"
```

## Supporting New File Types
Extend the `UniversalFileParser` class with new parsing methods:

```python
async def _parse_new_format(self, content: bytes) -> Dict[str, Any]:
    # Your parsing logic here
    return extracted_data
```

## License
This project is licensed under the MIT License.
## Disclaimer
This tool is designed for:
- Security research and penetration testing
- Educational purposes