# Site Scraper with Repomix and GitHub Actions
This project automatically scrapes websites and repositories listed in a CSV file using Repomix and GitHub Actions.
## Prerequisites

- Bash (with arrays and `set -o pipefail`)
- Node.js 18+ (with `npx`)
- Repomix (via `npx`)
- Miller (`mlr`)
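A quick preflight check can confirm these tools are on your `PATH` before a run. This snippet is illustrative, not part of the project:

```shell
#!/usr/bin/env bash
# Preflight check (illustrative): report which required tools are on PATH.
set -u

missing=""
for tool in bash node npx mlr; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done

if [ -z "$missing" ]; then
  echo "all prerequisites found"
else
  echo "missing tools:$missing" >&2
fi
```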
## How it Works

- Reads URLs and configuration from a CSV file (`sites.csv`).
- Uses Repomix (via `npx`) to convert each website or repository to plain text.
- Saves the results in the `repomix-output/` directory, one file per URL (overwriting previous runs for the same URL).
- Logs for each run are saved in the `repomix-logs/` directory (with timestamps for uniqueness).
- Runs automatically every day via GitHub Actions, or can be triggered manually.
- Commits and pushes any changes in `repomix-output/` back to the repository.
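The steps above can be sketched as a small loop. This is a simplified illustration, not the real `scrape.sh`: the Repomix flags shown (`--remote`, `--include`, `--ignore`, `-o`) and the slug naming are assumptions, and the `npx` command is only echoed so the sketch is safe to run offline.

```shell
#!/usr/bin/env bash
# Simplified sketch of the scrape loop (illustration only; scrape.sh may differ).
set -euo pipefail

# A tiny sites.csv for illustration.
cat > sites.csv <<'EOF'
url,directory,include_files,exclude_files
https://github.com/google/adk-docs,,docs/**/*.md,
EOF

mkdir -p repomix-output

# The real script parses the CSV with mlr; a plain `read` works here only
# because the example fields contain no embedded commas.
tail -n +2 sites.csv | while IFS=, read -r url directory include exclude; do
  # Derive a filesystem-safe file name from the URL.
  slug="$(printf '%s' "$url" | tr -c 'A-Za-z0-9' '-')"
  cmd=(npx repomix --remote "$url" -o "repomix-output/${slug}.md")
  if [ -n "$include" ]; then cmd+=(--include "$include"); fi
  if [ -n "$exclude" ]; then cmd+=(--ignore "$exclude"); fi
  echo "would run: ${cmd[*]}"   # replace the echo with "${cmd[@]}" to actually scrape
done
```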
## CSV Format

The `sites.csv` file should have the following columns:
| Column | Description | Required |
|---|---|---|
| `url` | The URL or Git repo to scrape | Yes |
| `directory` | Specific directory within the repo or site | No |
| `include_files` | Comma-separated glob patterns to include | No |
| `exclude_files` | Comma-separated glob patterns to exclude | No |
Example:

```csv
url,directory,include_files,exclude_files
https://github.com/google/adk-docs,,docs/**/*.md,
https://example.com,src,**/*.js,**/*.test.js
```
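Since a malformed row would otherwise fail silently mid-run, it can help to validate the file first. The check below is a hypothetical helper, not something the project ships; it rewrites the example rows above so it is self-contained:

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check sites.csv before a run.
set -euo pipefail

# The example rows from above, written out for illustration.
cat > sites.csv <<'EOF'
url,directory,include_files,exclude_files
https://github.com/google/adk-docs,,docs/**/*.md,
https://example.com,src,**/*.js,**/*.test.js
EOF

# Header must match exactly; every data row needs a non-empty url.
header="$(head -n 1 sites.csv)"
[ "$header" = "url,directory,include_files,exclude_files" ] \
  || { echo "unexpected header: $header" >&2; exit 1; }
awk -F, 'NR > 1 && $1 == "" { print "row " NR ": missing url"; bad = 1 }
         END { exit bad }' sites.csv
echo "sites.csv looks valid"
```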
## Local Usage

To run the scraper locally:

- Make sure you have Node.js 18+ and Miller (`mlr`) installed.
  - Install Miller with `sudo apt-get install -y miller` (Linux) or `brew install miller` (macOS).
- Make the script executable: `chmod +x scrape.sh`
- Run the script: `./scrape.sh`
- Output files will be in `repomix-output/`, logs in `repomix-logs/`.
## GitHub Actions Usage

The workflow is defined in `.github/workflows/scrape.yml` and will:

- Run on a schedule (daily at midnight UTC) or manually via the Actions tab.
- Install Node.js and Miller.
- Run `scrape.sh` to generate outputs and logs.
- Commit and push any changes in `repomix-output/` back to the repository.
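The commit-and-push step boils down to shell commands like the following. This sketch is illustrative (the actual `scrape.yml` may differ) and runs against a throwaway repository so it is safe to execute locally; the final `git push` is left as a comment:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the workflow's commit step, run against a throwaway repo.
set -euo pipefail

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name  "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"

# Stand-in for the scraped output the workflow would have just produced.
mkdir -p repomix-output
echo "scraped content" > repomix-output/example.md

git add repomix-output/
if git diff --cached --quiet; then
  echo "no changes to commit"
else
  git commit -qm "chore: update scraped output"
  # In the real workflow this is followed by: git push
  echo "committed"
fi
```

Committing only when `git diff --cached --quiet` reports staged changes keeps the Action from creating empty commits on days when no output changed.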
## Output

- Scraped content: saved in `repomix-output/`, one `.md` file per URL (overwritten on each run).
- Logs: saved in `repomix-logs/`, one `.txt` file per run per URL (timestamped for uniqueness).
- Only `repomix-output/` is tracked by git; the `repomix-logs/` directory is ignored (see `.gitignore`).
## Requirements

- Node.js 18+
- Repomix (installed via `npx`)
- Miller (`mlr`) for robust CSV parsing
## Troubleshooting

- If the GitHub Action does not commit changes, check that the output in `repomix-output/` actually changed.
- If you see errors about missing files or permissions, ensure the script is executable and all dependencies are installed.
- Check the logs in `repomix-logs/` for detailed error messages.
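A recursive `grep` is a quick way to surface errors across all logs at once. The log file name and contents below are made up purely so the command has something to match:

```shell
#!/usr/bin/env bash
# Scan all logs for errors (log layout assumed: one timestamped .txt per run).
set -u

mkdir -p repomix-logs
# Seed one fake log so the scan has something to match (illustration only).
echo "2024-01-01T00:00:00 ERROR: clone failed" > repomix-logs/example-20240101.txt

# -r: recurse, -i: case-insensitive, -n: show line numbers.
grep -rin "error" repomix-logs/ || echo "no errors found in repomix-logs/"
```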
Feel free to open issues or pull requests for improvements!