NODECRAW
Nodecraw is a command-line tool for advanced web crawling, suitable for bug bounty programs and general web scraping needs. It supports various crawling techniques and provides flexible options for customization.
Features
- Web Crawling Using Different Techniques:
  - PlaywrightCrawler: Uses Playwright to navigate and crawl web pages.
  - PuppeteerCrawler: Uses Puppeteer to crawl web pages and extract data.
  - CheerioCrawler: Uses Cheerio for efficient and lightweight crawling.
  - Crawler: Uses the 'crawler' module to crawl web pages.
  - Web Archive Crawler: Retrieves archived versions of web pages from the Wayback Machine.
- Recursive Crawling: Follows links found on crawled pages to discover and crawl additional pages.
- Iterative Crawling: Automatically continues crawling URLs found in the current crawl session, useful for in-depth exploration.
- Proxy Support: Supports HTTP, HTTPS, SOCKS4, and SOCKS5 proxies, with optional proxy authentication.
- Exclude File Extensions: Option to exclude certain file types (e.g., images, videos, CSS) from the crawling results.
- Output Customization:
  - Specify the output file to save the crawled URLs.
  - Choose between output formats: TXT or JSON.
- Timeout Functionality: Set a timeout duration to limit the crawling process.
- Command-Line Interface (CLI): User-friendly and powerful command-line interface.
- Support for stdin: Accepts URLs via stdin for streamlined and automated workflows.
Installation
- Clone the repository:
  git clone https://github.com/pikpikcu/nodecraw.git
- Navigate to the project directory:
  cd nodecraw
- Install the dependencies:
  npm install
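The usage examples below invoke nodecraw directly. Whether npm install exposes a global nodecraw command depends on how the package declares its bin entry (an assumption not verified here); if the command is not found after installation, one option is to link the project from its directory:
npm link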
Usage
To start crawling, point nodecraw at a target URL:
nodecraw -u <target-url>
You can view all available options with:
nodecraw -h
Usage: nodecraw [options]
Options:
TARGET:
-u, --url <url> Specify a single target URL
-l, --list <file> Specify a list of target URLs
-s, --scope <scope> Specify the scope for crawling
CONFIGURATION:
-a, --aggressive <maxConcurrency> Set the maximum concurrency for aggressive crawling
-r, --recursive Enable recursive crawling
-t, --timeout <timeout> Specify the timeout duration
-is, --ignore-ssl Ignore SSL certificate errors
-fr, --force-redirect Force redirection of URLs
-ex, --exclude-ext <extensions> Comma-separated list of file extensions to exclude (e.g., png,jpg,gif,css)
-p, --proxy <proxy> Specify a proxy server or a file containing a list of proxies
-ap, --auth-proxy <auth> Specify proxy authentication in the format username:password
OUTPUT:
-o, --output <file> Specify the output file
-json, --json-output Output in JSON format with detailed information
-ic, --iterate-crawl Enable iterative crawling for URLs found in the current crawl
-h, --help display help for command
Crawling a Single URL
To crawl a single URL, use the -u or --url option followed by the target URL:
nodecraw -u <target-url>
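The target URL can be combined with other options from the help output. For example, a minimal illustration (example.com and results.txt are placeholders) that enables recursive crawling and writes the discovered URLs to a file:
nodecraw -u https://example.com -r -o results.txt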
Crawling from a List of URLs
To crawl multiple URLs from a list, use the -l or --list option followed by the path to the file containing the URLs:
nodecraw -l <path-to-file>
This will crawl every URL listed in the specified file (for example, urls.txt).
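You can also skip static assets while crawling a list by adding the -ex option shown in the help output. An illustrative command (urls.txt and results.txt are placeholder filenames):
nodecraw -l urls.txt -ex png,jpg,gif,css -o results.txt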
Using stdin
You can pipe a list of URLs directly into nodecraw using stdin:
cat urls.txt | nodecraw -a 30 -r -is -fr -o output.json -json -ic
This reads the URLs from urls.txt and crawls them with a maximum concurrency of 30 (-a 30), recursive crawling enabled (-r), SSL certificate errors ignored (-is), forced redirects (-fr), iterative crawling enabled (-ic), and the results written to output.json in JSON format (-o output.json -json).
Using a Proxy
To use a proxy, specify the proxy URL using the -p or --proxy option:
nodecraw -u <target-url> -p http://127.0.0.1:8080
You can also specify a list of proxies in a file and pass the file path to the --proxy option.
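For example, an illustrative command that reads targets from urls.txt, routes requests through the proxies listed in proxies.txt (the exact file format expected by --proxy is not documented here, so treat this as an assumption), and supplies proxy credentials in the username:password format shown in the help output:
nodecraw -l urls.txt -p proxies.txt -ap username:password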
Iterative Crawling
Enable iterative crawling using the -ic option to continue crawling URLs found during the current crawl:
nodecraw -u <target-url> -ic
This will automatically crawl all the URLs discovered during the initial crawl.
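Iterative crawling can be combined with other options; for instance, a hedged illustration that also raises the maximum concurrency (the value 50 is arbitrary) and saves detailed JSON output:
nodecraw -u <target-url> -ic -a 50 -o results.json -json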
License
This project is licensed under the MIT License. See the LICENSE file for details.