TopicStreams

A real-time news aggregation system that continuously scrapes Google Search (not Google News) for any topic (search keyword) and streams updates over WebSocket.

Why TopicStreams?

The Limitations with Google News & RSS

Google News (https://news.google.com) and Google News RSS (https://news.google.com/rss?search=<keyword>) provide curated news collections based on Google's algorithms. While convenient, they have limitations:

Results are not necessarily the latest - articles may be hours or days old
Google filters by quality and relevance, potentially missing breaking news
No control over what Google considers "newsworthy"

Figure: Google News search result - hours or days old

Figure: Google News RSS - same as Google News search

TopicStreams' Approach

TopicStreams scrapes Google Search → News Tab with time filters, giving you:

Real-time results - All news indexed by Google, regardless of quality rating
Unfiltered access - No curation, you decide what's relevant
Near-instant updates - Scrape frequently enough and catch news as it breaks
Full control - Customize topics (search keywords) and scrape intervals

Figure: Google Search News Tab - latest, unfiltered results

Try It Live

Experience TopicStreams in action: http://topicstreams.dongziyu.com

Quick Demo

# Add topics (ensure they exist)
curl -X POST http://topicstreams.dongziyu.com/api/v1/topics \
  -H "Content-Type: application/json" \
  -d '{"name": "Bitcoin"}'

# List all active topics (should now include "bitcoin")
curl http://topicstreams.dongziyu.com/api/v1/topics | jq

# Get latest news for "Bitcoin"
curl "http://topicstreams.dongziyu.com/api/v1/news/bitcoin?limit=5" | jq
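
The same calls can be made from Python using only the standard library. This is a minimal sketch, not code from the project: the helper names `build_add_topic_request`, `fetch_json`, and `demo` are hypothetical, and nothing is assumed about the responses beyond their being JSON, as the `jq` pipes above suggest.

```python
import json
import urllib.request

# Demo host taken from the curl examples above.
BASE_URL = "http://topicstreams.dongziyu.com/api/v1"


def build_add_topic_request(name: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /topics request that adds a topic (hypothetical helper)."""
    body = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/topics",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request, timeout: float = 10.0):
    """Send a Request object (or a plain URL string) and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo() -> None:
    """Mirror the three curl calls above; requires network access."""
    fetch_json(build_add_topic_request("Bitcoin"))
    print(fetch_json(f"{BASE_URL}/topics"))
    print(fetch_json(f"{BASE_URL}/news/bitcoin?limit=5"))
```

Calling `demo()` performs the live requests; the request builder itself can be inspected offline.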

WebSocket Streaming

For real-time news updates, connect via WebSocket:

# Real-time WebSocket news stream for "China" (the topic is added automatically if not present)
websocat ws://topicstreams.dongziyu.com/api/v1/ws/news/china | jq

The WebSocket delivers live news updates as they're scraped, showing the same content you'd see by continuously refreshing Google's news search page.
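
A Python client can consume the same stream. The sketch below assumes the third-party `websockets` package (`pip install websockets`) and assumes each message is a JSON document, which the `jq` pipe above suggests but this README does not spell out. The function names are hypothetical.

```python
import asyncio
import json
from urllib.parse import quote_plus


def news_stream_url(topic: str, host: str = "localhost:5000") -> str:
    """Build the stream URL, encoding spaces as '+' as the README examples do."""
    return f"ws://{host}/api/v1/ws/news/{quote_plus(topic)}"


async def stream_news(topic: str, host: str = "localhost:5000") -> None:
    """Print every message the server pushes until the connection closes."""
    import websockets  # third-party; imported here so the URL helper works without it

    async with websockets.connect(news_stream_url(topic, host)) as ws:
        async for message in ws:
            # Assumption: each message is a JSON text frame.
            print(json.dumps(json.loads(message), indent=2))


def run(topic: str, host: str = "localhost:5000") -> None:
    """Blocking entry point, e.g. run("china", host="topicstreams.dongziyu.com")."""
    asyncio.run(stream_news(topic, host))
```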

Figure: WebSocket real-time news stream - live updates as articles are scraped

What TopicStreams Offers

Real-time news streaming on customizable topics (any search keywords)
Self-hosted - No third-party news API costs

Limitations

Google Dependency - Black box algorithms, no source control, variable indexing speed, geographic filtering
Inconsistent Results - Same queries return different results based on IP, geolocation, browser, A/B testing
No Quality Control - All news included, credible or not
Access Risks - Google may detect scraping and rate-limit or block access (for mitigation, see Anti-Bot Detection)

Features

Real-time News Aggregation - Continuously scrapes Google Search News tab (not Google News site) for the latest articles
Multi-Topic Tracking - Monitor multiple news topics simultaneously with configurable scrape intervals
WebSocket Streaming - Subscribe to live news updates per topic via WebSocket connections
REST API - Manage topics and retrieve historical news entries through HTTP endpoints
Anti-Bot Detection - Playwright with stealth patches, realistic browser fingerprinting, and configurable geolocation (see Anti-Bot Detection below)

Architecture

TopicStreams consists of three main components:

┌─────────────────────────┐
│         Client          │
│ (REST API / WebSocket)  │
└────────────┬────────────┘
             │                               
             ▼                               
┌─────────────────────────┐    ┌──────────────────────────────┐
│     FastAPI Server      │    │      Scraper Service         │
│                         │    │                              │
│  - REST endpoints       │    │  - Playwright browser        │
│  - WebSocket streams    │    │  - BeautifulSoup parser      │
│  - PostgreSQL listener  │    │  - Continuous scraping loop  │
└────────────┬────────────┘    └─────────────┬────────────────┘
             │                               │
             ▼                               ▼
┌─────────────────────────────────────────────────────────────┐
│                   PostgreSQL Database                       │
│                                                             │
│          - Topics (tracked keywords)                        │
│          - News Entries (scraped articles)                  │
│          - Scraper Logs (monitoring)                        │
│          - LISTEN/NOTIFY for real-time updates              │
└─────────────────────────────────────────────────────────────┘

Data Flow

Scraper Service continuously scrapes Google Search News tab for tracked topics
New articles are inserted into PostgreSQL with automatic deduplication
Database triggers send NOTIFY events on new inserts
FastAPI Server listens for these events via PostgreSQL's LISTEN/NOTIFY
Updates are pushed to connected WebSocket clients in real-time
Clients can also fetch historical data via REST API

Key Technologies

FastAPI - Web framework for REST and WebSocket
Playwright - Browser automation with anti-bot detection (see Anti-Bot Detection)
PostgreSQL - Reliable storage with LISTEN/NOTIFY for real-time events
Docker - Containers, run with Docker Compose, for easy deployment

Prerequisites

Docker installed on your system
- Install Docker

That's it! All dependencies (Python, PostgreSQL, Playwright browsers) are handled inside containers.

Optional: Install websocat for WebSocket testing (used in the demos in this README), or use any WebSocket client you prefer.

Web UI

TopicStreams includes a modern, responsive Web UI that provides a complete dashboard for monitoring and managing your news aggregation system.

Features

System Status Dashboard - Real-time monitoring of scraper health and activity
Topic Management - Easy add/remove topics with visual feedback
Real-time News Feed - Live updates with WebSocket connections
Scraper Logs Panel - Historical activity monitoring

Access the Web UI

After completing Quick Start, open your browser and navigate to:

http://localhost:5000

Note: By default, the Web UI is accessible on port 5000. If you changed HOST_PORT in your .env file (e.g., set to 80 for production), use that port instead (e.g., http://localhost:80).
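
Scripts that talk to a local deployment can derive the port the same way. The sketch below reads `HOST_PORT` from `.env`-style text and falls back to 5000. It is a hypothetical helper, not part of TopicStreams, and it assumes plain `KEY=value` lines.

```python
def read_host_port(env_text: str, default: int = 5000) -> int:
    """Return HOST_PORT from .env-style text, or the default if unset or empty."""
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines that are not KEY=value pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip().strip('"').strip("'")
        if key.strip() == "HOST_PORT" and value:
            return int(value)
    return default


def web_ui_url(env_text: str) -> str:
    """Build the local Web UI address for the configured port."""
    return f"http://localhost:{read_host_port(env_text)}"
```

For example, pass the contents of `.env` (read with `open(".env").read()`) to `web_ui_url` to get the address to open.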

Figure: TopicStreams Web UI - complete dashboard for real-time news aggregation

Quick Start

1. Clone the Repository

git clone https://github.com/zydo/topicstreams.git
cd topicstreams

2. Configure Environment

Copy .env.example to .env and customize if needed:

cp .env.example .env

Default settings work out-of-the-box.

3. Start Services

docker compose up -d

This will start three containers:

postgres - Database
scraper - Background scraping service
api - FastAPI server at http://localhost:5000 (or the port set by HOST_PORT in .env)

4. Add Topics to Track

# Add a topic (replace 5000 with your HOST_PORT if changed)
curl -X POST http://localhost:5000/api/v1/topics \
  -H "Content-Type: application/json" \
  -d '{"name": "artificial intelligence"}'

The scraper picks up the new topic on its next iteration.

5. Access Real-Time News

WebSocket (for real-time):

# Using websocat
websocat ws://localhost:5000/api/v1/ws/news/artificial+intelligence

# Or pipe through jq for pretty-printed output
websocat ws://localhost:5000/api/v1/ws/news/artificial+intelligence | jq

REST API (for historical data):

# Get recent news for a topic with pagination (results 11 to 15, newest first)
# Quote the URL so the shell does not treat "&" as a background operator
curl "http://localhost:5000/api/v1/news/artificial+intelligence?offset=10&limit=5" | jq

# List all actively scraped topics
curl http://localhost:5000/api/v1/topics | jq

# List the 10 most recent scraper logs (each log represents one Google page load, typically up to 10 news entries)
curl "http://localhost:5000/api/v1/logs?limit=10" | jq

See the API Reference section below for complete endpoint documentation.
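
The `offset` and `limit` parameters map to page numbers with simple arithmetic: page `p` of size `s` starts at offset `(p - 1) * s`. The helper below is a hypothetical sketch that builds the news URL for a given page. It encodes spaces in the topic as `+` to match the examples above, and the page-based interface is an illustration, not something the API itself exposes.

```python
from urllib.parse import quote_plus, urlencode


def news_url(topic: str, page: int = 1, page_size: int = 5,
             base_url: str = "http://localhost:5000/api/v1") -> str:
    """Build a paginated news URL; page numbers start at 1."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    # Page 3 with 5 per page -> offset 10, i.e. results 11 to 15.
    query = urlencode({"offset": (page - 1) * page_size, "limit": page_size})
    return f"{base_url}/news/{quote_plus(topic)}?{query}"
```

With `page=3` and `page_size=5` this produces the same URL as the curl example for results 11 to 15.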

6. Monitor Logs

# Background scraper logs
docker compose logs -f scraper

# FastAPI server logs
docker compose logs -f api

Stop Services

docker compose down

Configuration

For complete configuration documentation including environment variables, YAML files, and browser fingerprinting settings, see Configuration.

Quick links:

Environment variables - Database and API settings in .env
Scraper settings - scrape_interval and max_pages
Anti-detection settings - Browser fingerprinting and stealth strategies
Reloading config - How to apply configuration changes

Anti-Bot Detection

TopicStreams combines several techniques (Playwright stealth patches, realistic browser fingerprinting, random delays) to make the scraper look like a real human user, reducing the risk of being blocked by Google.

For detailed information about anti-detection strategies (Playwright stealth, browser fingerprinting, random delays, etc.), see Anti-Bot Detection Documentation.

Quick Reference:

All anti-detection strategies are configurable via config/anti_detection.yml (created from a template on first-time setup)
See Configuration for YAML configuration details

Scraping Behavior

For detailed information about scraping behavior, monitoring, and scaling strategies, see Scraping Behavior.

Quick links:

Sequential execution - How topics are scraped one after another
Scrape interval - How scrape_interval controls timing
Monitoring - Track scraper performance
Proxy rotation - Scaling strategies for high-volume scraping (not implemented yet)

Authentication & Security

Not implemented yet - For security recommendations and implementation strategies, see Authentication & Security.

Quick links:

Current state - Localhost/LAN only, no built-in security
Authentication - API keys, JWT, OAuth2 options
Rate limiting - Protect against abuse and DDoS attacks
Cloudflare - Recommended for public deployment

WebSocket Scalability

Not implemented yet - For scalability recommendations and implementation strategies, see WebSocket Scalability.

Quick links:

Current state - Simple in-memory broadcasting
Limitations - O(n) broadcast cost, single point of failure
Redis Pub/Sub - Horizontal scaling with O(1) publish cost
Apache Kafka - For very large-scale deployments (10K+ subscribers)

API Reference

For complete API documentation including all endpoints, request/response formats, and examples, see API Reference.

Quick links:

Topics - List, add, and delete topics
News - Get news entries for topics with pagination
Logs - View scraper logs
WebSocket - Real-time news updates via WebSocket

Database Access

For database access, common SQL queries, backup, and restore instructions, see Database Access.

Quick links:

psql access - Connect to PostgreSQL database
Common queries - Topics, news entries, scraper logs, and table sizes
Backup & restore - Database backup and restore procedures

License

MIT
