GitHunt

tomato414941/web-search

An open-source web search engine with BM25 + vector hybrid search, 600K+ indexed pages, and Japanese NLP support

PaleBlueSearch

Web Search API for AI Agents — Source-aware search built for LLMs and autonomous agents.

A full-stack search engine with its own crawler, BM25 ranking,
clean content extraction, and Japanese NLP support. Designed to return
source-grounded public information that AI agents can inspect.

Features

  • Source-Aware Ranking: Navigational and reference queries get a thin canonical-source boost, while every hit still carries transparency metadata (temporal_anchor, authorship_clarity, factual_density, origin_score).
  • Information Origin: Documents classified as spring/river/delta/swamp based on link direction — primary sources rank higher than aggregation.
  • Factual Density: Scores verifiable facts per unit of text (numbers, dates, citations, code, named entities) — replacing shallow word-count quality heuristics.
  • Clean Content Extraction: trafilatura strips navigation, footers, and sidebars — only main content is indexed.
  • Million-scale Indexing: In-house crawler with robots.txt compliance and authorship metadata extraction.
  • Japanese NLP: SudachiPy morphological analysis for high-quality Japanese search.
  • Free API: Anonymous access with IP-based rate limiting (100 req/min).
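The factual-density signal can be approximated outside the engine. The sketch below is a hypothetical stand-in, not the indexer's actual scorer: it only counts numbers and ISO-style dates per word, whereas the real signal also weighs citations, code, and named entities.

```python
import re

def factual_density(text: str) -> float:
    """Toy factual-density score: verifiable tokens per word, capped at 1.0.

    Illustrative approximation only; the engine's real scorer also
    counts citations, code blocks, and named entities.
    """
    words = text.split()
    if not words:
        return 0.0
    # ISO dates (2026-03-01) and bare numbers (42, 3.14) count as "facts".
    facts = len(re.findall(r"\b\d{4}-\d{2}-\d{2}\b|\b\d+(?:\.\d+)?\b", text))
    return min(facts / len(words), 1.0)
```

Dense, citation-heavy pages score near 1.0; purely narrative text scores near 0.0.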

Search API

Base URL: https://palebluesearch.com/api/v1

Quick Example

curl "https://palebluesearch.com/api/v1/search?q=python+web+framework"

Response:

{
  "query": "python web framework",
  "total": 42,
  "page": 1,
  "per_page": 10,
  "last_page": 5,
  "hits": [
    {
      "url": "https://example.com/fastapi",
      "title": "FastAPI - Modern Python Web Framework",
      "snip": "A modern, fast web framework for building APIs with <mark>Python</mark>...",
      "snip_plain": "A modern, fast web framework for building APIs with Python...",
      "rank": 12.5,
      "indexed_at": "2026-03-01T12:00:00.000000+00:00",
      "published_at": "2026-02-28T09:30:00+00:00",
      "temporal_anchor": 1.0,
      "factual_density": 0.72,
      "origin_score": 0.85,
      "origin_type": "spring"
    }
  ],
  "mode": "bm25",
  "request_id": "a1b2c3d4e5f6"
}
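Because responses are plain JSON, agent-side filtering on the transparency metadata is straightforward. A minimal sketch (the payload is the documented example, abbreviated; the helper name is ours) that keeps only hits from near-primary sources:

```python
import json

# Abbreviated copy of the documented example response.
SAMPLE = """
{
  "query": "python web framework",
  "total": 42,
  "hits": [
    {
      "url": "https://example.com/fastapi",
      "title": "FastAPI - Modern Python Web Framework",
      "origin_score": 0.85,
      "origin_type": "spring"
    }
  ]
}
"""

def primary_hits(payload: dict, min_origin: float = 0.8) -> list[dict]:
    """Keep hits whose origin_score marks them as close to a primary source."""
    return [h for h in payload["hits"] if h.get("origin_score", 0.0) >= min_origin]

resp = json.loads(SAMPLE)
hits = primary_hits(resp)
```

Raising min_origin trades recall for provenance: "spring" documents survive, aggregation-heavy "swamp" documents drop out.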

Authentication

Anonymous access is available with IP-based rate limiting (100 req/min).
For higher limits, use an API key via header or query parameter:

# Header (recommended)
curl -H "X-API-Key: pbs_your_key_here" \
  "https://palebluesearch.com/api/v1/search?q=rust"

# Query parameter
curl "https://palebluesearch.com/api/v1/search?q=rust&api_key=pbs_your_key_here"

With a valid key, the response includes usage info:

{
  "usage": { "daily_used": 5, "daily_limit": 1000 }
}

Search Modes

  • bm25: Classic keyword matching with BM25 scoring (default)
curl "https://palebluesearch.com/api/v1/search?q=machine+learning&mode=bm25"

Pagination

curl "https://palebluesearch.com/api/v1/search?q=python&limit=20&page=2"
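The last_page field in the response equals ceil(total / per_page), so a client can enumerate every page of a result set up front. A sketch (the helper name is ours) that generates the limit/page parameters for each page:

```python
import math

def page_params(query: str, total: int, per_page: int = 10):
    """Yield query-parameter dicts covering every result page.

    Mirrors the response fields: the final page number equals
    ceil(total / per_page), matching last_page in the response.
    """
    last_page = max(1, math.ceil(total / per_page))
    for page in range(1, last_page + 1):
        yield {"q": query, "limit": per_page, "page": page}
```

For the documented example (total 42, per_page 10) this yields pages 1 through 5, matching last_page in the sample response.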

Click Tracking

Report user clicks to improve search quality:

curl -X POST "https://palebluesearch.com/api/v1/search/click" \
  -H "Content-Type: application/json" \
  -d '{"request_id": "a1b2c3d4e5f6", "query": "python", "url": "https://example.com", "rank": 1}'
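The click report is a small JSON body tying a result URL back to the request_id of the search that surfaced it. A sketch (the helper name is ours) that encodes a payload matching the schema shown above:

```python
import json

def click_payload(request_id: str, query: str, url: str, rank: int) -> bytes:
    """Encode a click report matching the /search/click body above.

    rank is the 1-based position of the clicked result in the hit list.
    """
    body = {"request_id": request_id, "query": query, "url": url, "rank": rank}
    return json.dumps(body).encode("utf-8")
```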

Error Codes

  • 401: Invalid API key
  • 429: Rate limit exceeded
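A sensible client retries 429 with backoff but fails fast on 401, since a bad key will not fix itself. A minimal sketch; the fetch callable and the retry policy are our own convention, not part of the API:

```python
import time

def fetch_with_retry(fetch, max_retries: int = 3, base_delay: float = 0.01):
    """Call fetch() and retry on HTTP 429 with exponential backoff.

    fetch is any zero-argument callable returning (status, body).
    Non-429 statuses (including 401) are returned immediately.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))
    return status, body
```

Taking fetch as a parameter keeps the policy transport-agnostic and easy to test without hitting the live endpoint.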

Documentation

Start with Documentation Guide.


Quick Start

Prerequisites

  • Docker & Docker Compose

Running the App

# Build and start the default lightweight stack
docker compose up --build -d

Once running, the stack's services are available on localhost.

To start the optional crawler and search stack as well:

COMPOSE_PROFILES=search,crawler docker compose up --build -d

Available profiles:

  • search: starts opensearch
  • search-backfill: runs opensearch-backfill as a one-off backfill job
  • crawler: starts crawler
  • embedding: starts embedding-backfill
  • monitoring: starts prometheus and grafana

With the crawler profile enabled, the Crawler API docs are served at http://localhost:8082/docs.

To run the OpenSearch backfill manually:

COMPOSE_PROFILES=search,search-backfill docker compose up --build opensearch-backfill

To enable the monitoring stack locally:

COMPOSE_PROFILES=monitoring docker compose up --build -d

Architecture

  • Web Node (Frontend): FastAPI (serves UI and Search API, runs BM25 retrieval and thin canonical ranking policy).
  • Write Node (Indexer): FastAPI (handles ingestion, metadata/signal scoring, optional embeddings, OpenSearch sync).
  • Worker Node (Crawler): Custom Python worker using aiohttp and trafilatura with metadata extraction.
  • Database: PostgreSQL for production, SQLite for local development.

License

MIT

Languages

Python 88.6% · HTML 5.6% · Shell 3.9% · CSS 1.0% · Makefile 0.4% · Dockerfile 0.4% · Mako 0.1%


Created January 4, 2026
Updated March 21, 2026