TopicStreams

Real-time news aggregation system that continuously scrapes the Google Search News tab (not the Google News site) for any topic (search keyword) and streams updates via WebSocket.

Why TopicStreams?

The Limitations of Google News & RSS

Google News (https://news.google.com) and Google News RSS (https://news.google.com/rss?search=<keyword>) provide curated news collections based on Google's algorithms. While convenient, they have limitations:

  • Results are not necessarily the latest - articles may be hours or days old
  • Google filters by quality and relevance, potentially missing breaking news
  • No control over what Google considers "newsworthy"

Google News Search result - hours or days old

Google News RSS - same as Google News search

TopicStreams' Approach

TopicStreams scrapes Google Search → News Tab with time filters, giving you:

  • Real-time results - All news indexed by Google, regardless of quality rating
  • Unfiltered access - No curation, you decide what's relevant
  • Near-instant updates - Scrape frequently enough and catch news as it breaks
  • Full control - Customize topics (search keywords) and scrape intervals
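As a rough illustration of what "News Tab with time filters" means at the URL level, the sketch below builds a Google Search URL restricted to the News tab and a recent time window. It relies on the commonly seen `tbm=nws` and `tbs=qdr:<unit>` query parameters; whether TopicStreams builds its URLs this way is an assumption, and the function name `news_tab_url` is hypothetical.

```python
from urllib.parse import urlencode


def news_tab_url(topic: str, window: str = "h") -> str:
    """Build a Google Search URL for the News tab, limited to a recent window.

    window: "h" (past hour), "d" (past day) or "w" (past week).
    Assumption: this mirrors the URL a browser shows for the same filters;
    it is not taken from the TopicStreams source.
    """
    params = {
        "q": topic,               # the topic is used as the search keyword
        "tbm": "nws",             # select the News tab
        "tbs": f"qdr:{window}",   # time filter
    }
    return "https://www.google.com/search?" + urlencode(params)


print(news_tab_url("artificial intelligence"))
```

Opening the printed URL in a browser should show the same kind of result page the scraper loads.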

Google Search News Tab - Latest, Unfiltered Results

Try It Live

Experience TopicStreams in action: http://topicstreams.dongziyu.com

Quick Demo

# Add topics (ensure they exist)
curl -X POST http://topicstreams.dongziyu.com/api/v1/topics \
  -H "Content-Type: application/json" \
  -d '{"name": "Bitcoin"}'

# List all active topics (the list should now include "bitcoin")
curl http://topicstreams.dongziyu.com/api/v1/topics | jq

# Get the latest news for "Bitcoin" (URL quoted so the shell leaves "?" alone)
curl "http://topicstreams.dongziyu.com/api/v1/news/bitcoin?limit=5" | jq

WebSocket Streaming

For real-time news updates, connect via WebSocket:

# Real-time WebSocket news stream for "China" (the topic is added automatically if not present)
websocat ws://topicstreams.dongziyu.com/api/v1/ws/news/china | jq

The WebSocket delivers live news updates as they're scraped, showing the same content you'd see by continuously refreshing Google's news search page.
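For consumers that are not shell scripts, the same stream can be read from Python. This is a minimal sketch that assumes the third-party `websockets` package is installed and that each message is a JSON document (as the `jq` pipeline above suggests); the message fields are not specified here, so the code only decodes and prints them. The helper names are hypothetical.

```python
import json
from urllib.parse import quote_plus


def ws_news_url(host: str, topic: str) -> str:
    """Build the stream URL; spaces become '+' as in the websocat examples."""
    return f"ws://{host}/api/v1/ws/news/{quote_plus(topic)}"


async def stream(host: str, topic: str) -> None:
    """Print every news update pushed for the topic until the socket closes."""
    # Imported lazily so ws_news_url works without the dependency installed.
    import websockets  # third-party: pip install websockets

    async with websockets.connect(ws_news_url(host, topic)) as ws:
        async for raw in ws:
            # Assumption: each frame is one JSON-encoded update.
            print(json.loads(raw))
```

To run it against the live demo, call `asyncio.run(stream("topicstreams.dongziyu.com", "china"))` after `import asyncio`.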

WebSocket Real-time News Stream - Live updates as articles are scraped

What TopicStreams Offers

  • Real-time news streaming on customizable topics (any search keywords)
  • Self-hosted - No third-party news API costs

Limitations

  • Google Dependency - Black box algorithms, no source control, variable indexing speed, geographic filtering
  • Inconsistent Results - Same queries return different results based on IP, geolocation, browser, A/B testing
  • No Quality Control - All news included, credible or not
  • Access Risks - Google may detect scraping and rate-limit or block access (mitigated by Anti-Bot Detection)

Features

  • Real-time News Aggregation - Continuously scrapes Google Search News tab (not Google News site) for the latest articles
  • Multi-Topic Tracking - Monitor multiple news topics simultaneously with configurable scrape intervals
  • WebSocket Streaming - Subscribe to live news updates per topic via WebSocket connections
  • REST API - Manage topics and retrieve historical news entries through HTTP endpoints
  • Anti-Bot Detection - Playwright with stealth patches, realistic browser fingerprinting, and configurable geolocation (details)

Architecture

TopicStreams consists of three main components:

┌─────────────────────────┐
│         Client          │
│ (REST API / WebSocket)  │
└────────────┬────────────┘
             │                               
             ▼                               
┌─────────────────────────┐    ┌──────────────────────────────┐
│     FastAPI Server      │    │      Scraper Service         │
│                         │    │                              │
│  - REST endpoints       │    │  - Playwright browser        │
│  - WebSocket streams    │    │  - BeautifulSoup parser      │
│  - PostgreSQL listener  │    │  - Continuous scraping loop  │
└────────────┬────────────┘    └─────────────┬────────────────┘
             │                               │
             ▼                               ▼
┌─────────────────────────────────────────────────────────────┐
│                   PostgreSQL Database                       │
│                                                             │
│          - Topics (tracked keywords)                        │
│          - News Entries (scraped articles)                  │
│          - Scraper Logs (monitoring)                        │
│          - LISTEN/NOTIFY for real-time updates              │
└─────────────────────────────────────────────────────────────┘

Data Flow

  1. Scraper Service continuously scrapes Google Search News tab for tracked topics
  2. New articles are inserted into PostgreSQL with automatic deduplication
  3. Database triggers send NOTIFY events on new inserts
  4. FastAPI Server listens for these events via PostgreSQL's LISTEN/NOTIFY
  5. Updates are pushed to connected WebSocket clients in real-time
  6. Clients can also fetch historical data via REST API
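Steps 2 to 5 can be illustrated with a small, self-contained stand-in. The sketch below uses an in-memory SQLite table and plain callbacks in place of PostgreSQL triggers and LISTEN/NOTIFY, so it only demonstrates the idea: a uniqueness constraint drops duplicates, and listeners are notified only for rows that were actually inserted. The table name, columns, and uniqueness key are assumptions, not the project's real schema.

```python
import sqlite3
from collections import defaultdict


class NewsStore:
    """Stand-in for the database: dedupes inserts and notifies listeners."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        # Hypothetical schema: an article is unique per (topic, url).
        self.db.execute(
            "CREATE TABLE news_entries ("
            "topic TEXT, url TEXT, title TEXT, UNIQUE (topic, url))"
        )
        self.listeners = defaultdict(list)  # topic -> list of callbacks

    def listen(self, topic, callback) -> None:
        """Register a callback, playing the role of LISTEN on a channel."""
        self.listeners[topic].append(callback)

    def insert(self, topic: str, url: str, title: str) -> bool:
        """Insert an article; return True only if it was new."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO news_entries (topic, url, title) "
            "VALUES (?, ?, ?)",
            (topic, url, title),
        )
        self.db.commit()
        if cur.rowcount != 1:
            return False  # duplicate: nothing stored, nobody notified
        # Plays the role of the trigger's NOTIFY on a new insert.
        for callback in self.listeners[topic]:
            callback({"topic": topic, "url": url, "title": title})
        return True


store = NewsStore()
store.listen("bitcoin", print)
store.insert("bitcoin", "https://example.com/a", "First sighting")
store.insert("bitcoin", "https://example.com/a", "Scraped again")  # ignored
```

In the real system the callback would be the API server pushing the payload to every WebSocket client subscribed to that topic.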

Key Technologies

  • FastAPI - Web framework for REST and WebSocket
  • Playwright - Browser automation with anti-bot-detection measures (see how it works)
  • PostgreSQL - Reliable storage with LISTEN/NOTIFY for real-time events
  • Docker - Container orchestration for easy deployment

Prerequisites

Docker with Docker Compose is all you need. All other dependencies (Python, PostgreSQL, Playwright browsers) are handled inside containers.

Optional: Install websocat for WebSocket testing (used for the demos in this README), or use any WebSocket client you prefer.

Web UI

TopicStreams includes a modern, responsive Web UI that provides a complete dashboard for monitoring and managing your news aggregation system.

Features

  • System Status Dashboard - Real-time monitoring of scraper health and activity
  • Topic Management - Easy add/remove topics with visual feedback
  • Real-time News Feed - Live updates with WebSocket connections
  • Scraper Logs Panel - Historical activity monitoring

Access the Web UI

After completing the Quick Start, open your browser and navigate to:

http://localhost:5000

Note: By default, the Web UI is accessible on port 5000. If you changed HOST_PORT in your .env file (e.g., set to 80 for production), use that port instead (e.g., http://localhost:80).

TopicStreams Web UI - Complete dashboard for real-time news aggregation

Quick Start

1. Clone the Repository

git clone https://github.com/zydo/topicstreams.git
cd topicstreams

2. Configure Environment

Copy .env.example to .env and customize if needed:

cp .env.example .env

Default settings work out-of-the-box.

3. Start Services

docker compose up -d

This will start three containers:

  • postgres - Database
  • scraper - Background scraping service
  • api - FastAPI server at http://localhost:5000 (or the port set by HOST_PORT in .env)

4. Add Topics to Track

# Add a topic (replace 5000 with your HOST_PORT if changed)
curl -X POST http://localhost:5000/api/v1/topics \
  -H "Content-Type: application/json" \
  -d '{"name": "artificial intelligence"}'

Scraping for the new topic starts on the scraper's next iteration.
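If you would rather add topics from a script than from curl, the same POST can be made with Python's standard library. This is a sketch under the assumption that the endpoint accepts exactly the JSON body shown in the curl example; the response format is not described here, so the raw bytes are returned unchanged. The function names are hypothetical.

```python
import json
import urllib.request


def add_topic_request(base_url: str, name: str) -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/topics",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def add_topic(base_url: str, name: str) -> bytes:
    """Send the request and return the raw response body."""
    request = add_topic_request(base_url, name)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```

With the stack running, `add_topic("http://localhost:5000", "artificial intelligence")` performs the same call as the curl command (replace 5000 with your HOST_PORT if changed).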

5. Access Real-Time News

WebSocket (for real-time):

# Using websocat
websocat ws://localhost:5000/api/v1/ws/news/artificial+intelligence

# Or pipe through jq for pretty-printed output
websocat ws://localhost:5000/api/v1/ws/news/artificial+intelligence | jq

REST API (for historical data):

# Get recent news for a topic with pagination (results 11 to 15, newest first)
# The URL is quoted so the shell does not treat "&" as a background operator
curl "http://localhost:5000/api/v1/news/artificial+intelligence?offset=10&limit=5" | jq

# List all actively scraping topics
curl http://localhost:5000/api/v1/topics | jq

# List the 10 most recent scraper logs (each log represents one Google page load, typically up to 10 news entries)
curl "http://localhost:5000/api/v1/logs?limit=10" | jq

See the API Reference section below for complete endpoint documentation.
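The `offset` and `limit` parameters map onto page numbers in the usual way: page `n` with `k` items per page starts at offset `(n - 1) * k`. A small sketch of that arithmetic, using a hypothetical helper name and the news endpoint from the examples above:

```python
from urllib.parse import quote_plus, urlencode


def news_page_url(base_url: str, topic: str, page: int, page_size: int = 5) -> str:
    """Return the news URL for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size  # page 3 of size 5 skips the first 10 results
    query = urlencode({"offset": offset, "limit": page_size})
    return f"{base_url}/api/v1/news/{quote_plus(topic)}?{query}"


# Page 3 with 5 items per page covers results 11 to 15.
print(news_page_url("http://localhost:5000", "artificial intelligence", page=3))
```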

6. Monitor Logs

# Background scraper logs
docker compose logs -f scraper

# FastAPI server logs
docker compose logs -f api

Stop Services

docker compose down

Configuration

For complete configuration documentation including environment variables, YAML files, and browser fingerprinting settings, see Configuration.

Anti-Bot Detection

TopicStreams uses sophisticated techniques to make the scraper appear as a real human user, minimizing the risk of being blocked by Google.

For detailed information about anti-detection strategies (Playwright stealth, browser fingerprinting, random delays, etc.), see Anti-Bot Detection Documentation.

Quick Reference:

  • All anti-detection strategies are configurable via config/anti_detection.yml (created from template on first-time setup)
  • See Configuration for YAML configuration details
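To give a feel for one of the strategies mentioned above, random delays, here is a minimal sketch of jittering the wait between page loads so requests do not arrive at a fixed rhythm. The function name and the `jitter_ratio` parameter are hypothetical and are not keys of config/anti_detection.yml; the actual implementation may differ.

```python
import random


def jittered_delay(base_seconds: float, jitter_ratio: float = 0.3, rng=random) -> float:
    """Scale base_seconds by a random factor in [1 - jitter_ratio, 1 + jitter_ratio]."""
    if not 0 <= jitter_ratio < 1:
        raise ValueError("jitter_ratio must be in [0, 1)")
    return base_seconds * rng.uniform(1 - jitter_ratio, 1 + jitter_ratio)


# With a 60-second base and 30% jitter, each wait falls between 42 and 78 seconds.
print(round(jittered_delay(60.0), 1))
```

A scraping loop would sleep for `jittered_delay(...)` seconds between iterations instead of a constant interval.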

Scraping Behavior

For detailed information about scraping behavior, monitoring, and scaling strategies, see Scraping Behavior.

Authentication & Security

Not implemented yet. For security recommendations and implementation strategies, see Authentication & Security.

WebSocket Scalability

Not implemented yet. For scalability recommendations and implementation strategies, see WebSocket Scalability.

API Reference

For complete API documentation including all endpoints, request/response formats, and examples, see API Reference.

Quick links:

  • Topics - List, add, and delete topics
  • News - Get news entries for topics with pagination
  • Logs - View scraper logs
  • WebSocket - Real-time news updates via WebSocket

Database Access

For database access, common SQL queries, backup, and restore instructions, see Database Access.

License

MIT
