tomato414941/web-search
An open-source web search engine with BM25 + vector hybrid search, 600K+ indexed pages, and Japanese NLP support
# PaleBlueSearch

Web Search API for AI Agents — Source-aware search built for LLMs and autonomous agents.
A full-stack search engine with its own crawler, BM25 ranking,
clean content extraction, and Japanese NLP support. Designed to return
source-grounded public information that AI agents can inspect.
## Features
- Source-Aware Ranking: Navigational and reference queries get a thin canonical-source boost, while every hit still carries transparency metadata (`temporal_anchor`, `authorship_clarity`, `factual_density`, `origin_score`).
- Information Origin: Documents classified as spring/river/delta/swamp based on link direction — primary sources rank higher than aggregation.
- Factual Density: Scores verifiable facts per unit of text (numbers, dates, citations, code, named entities) — replaces shallow word-count quality.
- Clean Content Extraction: `trafilatura` strips navigation, footers, and sidebars — only main content is indexed.
- Million-scale Indexing: In-house crawler with robots.txt compliance and authorship metadata extraction.
- Japanese NLP: SudachiPy morphological analysis for high-quality Japanese search.
- Free API: Anonymous access with IP-based rate limiting (100 req/min).
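To make the factual-density idea concrete, here is a toy sketch of that kind of scoring. This is an illustration only, not the project's implementation: it counts a couple of easy fact markers (numbers and `[n]`-style citations) per word, while the real signal also covers dates, code, and named entities, which require proper NLP.

```python
import re

def factual_density(text: str) -> float:
    """Toy factual-density heuristic: verifiable-fact markers per word.

    Hypothetical sketch of the idea described above; counts numbers and
    citation-like brackets only. Not PaleBlueSearch's actual scorer.
    """
    words = text.split()
    if not words:
        return 0.0
    numbers = len(re.findall(r"\b\d+(?:\.\d+)?\b", text))
    citations = len(re.findall(r"\[\d+\]", text))
    # Normalize per word and cap at 1.0 to mimic a 0..1 score.
    return min(1.0, (numbers + citations) / len(words))

dense = "Python 3.12 was released in October 2023 [1], with 7% faster startup."
vague = "This framework is widely considered quite good and very popular."
assert factual_density(dense) > factual_density(vague)
```

The point of the heuristic is that fact-dense text outranks equally long but vague text, which is what "replaces shallow word-count quality" means in practice.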
## Search API

Base URL: `https://palebluesearch.com/api/v1`

### Quick Example
```bash
curl "https://palebluesearch.com/api/v1/search?q=python+web+framework"
```

```json
{
  "query": "python web framework",
  "total": 42,
  "page": 1,
  "per_page": 10,
  "last_page": 5,
  "hits": [
    {
      "url": "https://example.com/fastapi",
      "title": "FastAPI - Modern Python Web Framework",
      "snip": "A modern, fast web framework for building APIs with <mark>Python</mark>...",
      "snip_plain": "A modern, fast web framework for building APIs with Python...",
      "rank": 12.5,
      "indexed_at": "2026-03-01T12:00:00.000000+00:00",
      "published_at": "2026-02-28T09:30:00+00:00",
      "temporal_anchor": 1.0,
      "factual_density": 0.72,
      "origin_score": 0.85,
      "origin_type": "spring"
    }
  ],
  "mode": "bm25",
  "request_id": "a1b2c3d4e5f6"
}
```

### Authentication
Anonymous access is available with IP-based rate limiting (100 req/min).
For higher limits, use an API key via header or query parameter:

```bash
# Header (recommended)
curl -H "X-API-Key: pbs_your_key_here" \
  "https://palebluesearch.com/api/v1/search?q=rust"

# Query parameter
curl "https://palebluesearch.com/api/v1/search?q=rust&api_key=pbs_your_key_here"
```

With a valid key, the response includes usage info:
```json
{
  "usage": { "daily_used": 5, "daily_limit": 1000 }
}
```

### Search Modes
| Mode | Description |
|---|---|
| `bm25` | Classic keyword matching with BM25 scoring (default) |

```bash
curl "https://palebluesearch.com/api/v1/search?q=machine+learning&mode=bm25"
```

### Pagination
```bash
curl "https://palebluesearch.com/api/v1/search?q=python&limit=20&page=2"
```

### Click Tracking
Report user clicks to improve search quality:
```bash
curl -X POST "https://palebluesearch.com/api/v1/search/click" \
  -H "Content-Type: application/json" \
  -d '{"request_id": "a1b2c3d4e5f6", "query": "python", "url": "https://example.com", "rank": 1}'
```

### Error Codes
| Code | Description |
|---|---|
| 401 | Invalid API key |
| 429 | Rate limit exceeded |
## Documentation
Start with Documentation Guide.
Key entry points:
- Product Direction: mission, principles, and anti-goals
- Architecture: current system architecture and service boundaries
- Search Ranking Policy: current ranking behavior
- Search Evaluation: golden set policy and release expectations
- API Reference: current API surface
- Setup Guide: local development and environment setup
## Quick Start

### Prerequisites
- Docker & Docker Compose
### Running the App
```bash
# Build and start the default lightweight stack
docker compose up --build -d
```

Once running, access the following:
- Search UI: http://localhost:8083/
- API Docs: http://localhost:8083/docs
- Indexer API: http://localhost:8081/docs
To start the optional crawler and search stack as well:
```bash
COMPOSE_PROFILES=search,crawler docker compose up --build -d
```

Available profiles:

- `search`: starts `opensearch`
- `search-backfill`: runs `opensearch-backfill` as a one-off backfill job
- `crawler`: starts `crawler`
- `embedding`: starts `embedding-backfill`
- `monitoring`: starts `prometheus` and `grafana`

When the `crawler` profile is enabled, the Crawler API is available at http://localhost:8082/docs.
To run the OpenSearch backfill manually:
```bash
COMPOSE_PROFILES=search,search-backfill docker compose up --build opensearch-backfill
```

To enable the monitoring stack locally:
```bash
COMPOSE_PROFILES=monitoring docker compose up --build -d
```

- Prometheus: http://localhost:9090/targets
- Grafana: http://localhost:3000/ (`admin` / `admin-change-me` by default)
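When scripting around these profiles, it can help to know which extra services a given `COMPOSE_PROFILES` value brings up. The mapping below is copied from the list documented above; the helper function itself is a hypothetical convenience, not part of the repository.

```python
# Profile-to-service mapping as documented in the Quick Start above.
PROFILES = {
    "search": ["opensearch"],
    "search-backfill": ["opensearch-backfill"],
    "crawler": ["crawler"],
    "embedding": ["embedding-backfill"],
    "monitoring": ["prometheus", "grafana"],
}

def extra_services(compose_profiles: str) -> list[str]:
    """Resolve a comma-separated COMPOSE_PROFILES value to the
    additional services it starts, on top of the default stack."""
    services: list[str] = []
    for profile in filter(None, compose_profiles.split(",")):
        services.extend(PROFILES[profile])
    return services

assert extra_services("search,crawler") == ["opensearch", "crawler"]
```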
## Architecture
- Web Node (Frontend): FastAPI (serves UI and Search API, runs BM25 retrieval and thin canonical ranking policy).
- Write Node (Indexer): FastAPI (handles ingestion, metadata/signal scoring, optional embeddings, OpenSearch sync).
- Worker Node (Crawler): Custom Python worker using `aiohttp` and `trafilatura` with metadata extraction.
- Database: PostgreSQL for production, SQLite for local development.
## License
MIT