zydo/topicstreams
Real-time news aggregation system that scrapes Google Search for custom topics and streams live updates via WebSocket.
TopicStreams
Real-time news aggregation system that continuously scrapes Google Search (not Google News) for any topic (search keyword) and streams updates via WebSocket.
Why TopicStreams?
The Limitations of Google News & RSS
Google News (https://news.google.com) and Google News RSS (https://news.google.com/rss/search?q=<keyword>) provide curated news collections based on Google's algorithms. While convenient, they have limitations:
- Results are not necessarily the latest - articles may be hours or days old
- Google filters by quality and relevance, potentially missing breaking news
- No control over what Google considers "newsworthy"

Google News Search result - hours or days old

Google News RSS - same as Google News search
TopicStreams' Approach
TopicStreams scrapes Google Search → News Tab with time filters, giving you:
- Real-time results - All news indexed by Google, regardless of quality rating
- Unfiltered access - No curation, you decide what's relevant
- Near-instant updates - Scrape frequently enough and catch news as it breaks
- Full control - Customize topics (search keywords) and scrape intervals

Google Search News Tab - Latest, Unfiltered Results
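As a rough illustration of what "News tab with time filters" can look like at the URL level, the sketch below builds such a search URL. The `tbm=nws` and `tbs=qdr:<unit>` query parameters are commonly observed but undocumented Google Search parameters and are an assumption here; the request TopicStreams actually sends may differ, and `news_search_url` is a hypothetical helper, not part of the project.

```python
from urllib.parse import urlencode


# Hypothetical helper: not taken from the TopicStreams code base.
def news_search_url(topic: str, time_range: str = "h") -> str:
    """Build a Google Search URL restricted to the News tab.

    Assumptions: tbm=nws selects the News tab, and tbs=qdr:<unit> limits
    results to the past hour (h), day (d), or week (w).
    """
    params = {"q": topic, "tbm": "nws", "tbs": f"qdr:{time_range}"}
    # urlencode percent-encodes the value and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode(params)
```

Calling `news_search_url("artificial intelligence")` yields a URL whose query string carries the topic plus the two assumed filter parameters.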
Try It Live
Experience TopicStreams in action: http://topicstreams.dongziyu.com
Quick Demo
# Add topics (ensure they exist)
curl -X POST http://topicstreams.dongziyu.com/api/v1/topics \
-H "Content-Type: application/json" \
-d '{"name": "Bitcoin"}'
# List all active topics (the list should now include "bitcoin")
curl http://topicstreams.dongziyu.com/api/v1/topics | jq
# Get latest news for "Bitcoin"
curl "http://topicstreams.dongziyu.com/api/v1/news/bitcoin?limit=5" | jq
WebSocket Streaming
For real-time news updates, connect via WebSocket:
# Real-time WebSocket news stream for "China" (automatically add topic if not present)
websocat ws://topicstreams.dongziyu.com/api/v1/ws/news/china | jq
The WebSocket delivers live news updates as they're scraped, showing the same content you'd see by continuously refreshing Google's news search page.

WebSocket Real-time News Stream - Live updates as articles are scraped
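For readers who prefer a script over websocat, here is a minimal Python sketch of the same subscription. It assumes the third-party `websockets` package is installed and that each message is a JSON document (the websocat examples pipe the stream through jq); the helper names `topic_ws_url` and `stream_news` are made up for this example.

```python
import asyncio
import json
from urllib.parse import quote_plus


def topic_ws_url(base: str, topic: str) -> str:
    """Build the stream URL, encoding spaces as '+' like the websocat examples."""
    return f"{base}/api/v1/ws/news/{quote_plus(topic)}"


async def stream_news(url: str) -> None:
    # Imported lazily so topic_ws_url works without the dependency.
    # Assumption: the "websockets" package (pip install websockets).
    import websockets

    async with websockets.connect(url) as ws:
        async for message in ws:
            # Assumption: every message is JSON; the exact fields are not
            # documented in this README, so the entry is printed as-is.
            print(json.dumps(json.loads(message), indent=2))


# Usage (blocks until interrupted):
# asyncio.run(stream_news(topic_ws_url("ws://localhost:5000", "china")))
```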
What TopicStreams Offers
- Real-time news streaming on customizable topics (any search keywords)
- Self-hosted - No third-party news API costs
Limitations
- Google Dependency - Black box algorithms, no source control, variable indexing speed, geographic filtering
- Inconsistent Results - Same queries return different results based on IP, geolocation, browser, A/B testing
- No Quality Control - All news included, credible or not
- Access Risks - Google may detect scraping and rate-limit or block access (mitigation: Anti-Bot Detection)
Features
- Real-time News Aggregation - Continuously scrapes Google Search News tab (not Google News site) for the latest articles
- Multi-Topic Tracking - Monitor multiple news topics simultaneously with configurable scrape intervals
- WebSocket Streaming - Subscribe to live news updates per topic via WebSocket connections
- REST API - Manage topics and retrieve historical news entries through HTTP endpoints
- Anti-Bot Detection - Playwright with stealth patches, realistic browser fingerprinting, and configurable geolocation (details)
Architecture
TopicStreams consists of three main components:
┌─────────────────────────┐
│         Client          │
│ (REST API / WebSocket)  │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐    ┌──────────────────────────────┐
│     FastAPI Server      │    │       Scraper Service        │
│                         │    │                              │
│  - REST endpoints       │    │  - Playwright browser        │
│  - WebSocket streams    │    │  - BeautifulSoup parser      │
│  - PostgreSQL listener  │    │  - Continuous scraping loop  │
└────────────┬────────────┘    └─────────────┬────────────────┘
             │                               │
             ▼                               ▼
┌─────────────────────────────────────────────────────────────┐
│                     PostgreSQL Database                     │
│                                                             │
│  - Topics (tracked keywords)                                │
│  - News Entries (scraped articles)                          │
│  - Scraper Logs (monitoring)                                │
│  - LISTEN/NOTIFY for real-time updates                      │
└─────────────────────────────────────────────────────────────┘
Data Flow
- Scraper Service continuously scrapes Google Search News tab for tracked topics
- New articles are inserted into PostgreSQL with automatic deduplication
- Database triggers send NOTIFY events on new inserts
- FastAPI Server listens for these events via PostgreSQL's LISTEN/NOTIFY
- Updates are pushed to connected WebSocket clients in real-time
- Clients can also fetch historical data via REST API
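The hand-off from database notifications to WebSocket clients described above could be sketched roughly as follows. This is not the project's implementation: the `NewsHub` class, the `"topic"` key in the payload, and the channel name are assumptions made for illustration. The callback signature matches what asyncpg passes to listeners registered with `add_listener`, but the sketch itself only needs the standard library.

```python
import asyncio
import json
from collections import defaultdict


class NewsHub:
    """In-memory fan-out from NOTIFY payloads to per-topic subscriber queues."""

    def __init__(self) -> None:
        # Maps topic name -> set of asyncio.Queue, one queue per connected client.
        self._subscribers = defaultdict(set)

    def subscribe(self, topic: str) -> asyncio.Queue:
        queue = asyncio.Queue()
        self._subscribers[topic].add(queue)
        return queue

    def unsubscribe(self, topic: str, queue: asyncio.Queue) -> None:
        self._subscribers[topic].discard(queue)

    def on_notify(self, connection, pid, channel, payload: str) -> None:
        """Listener callback: (connection, pid, channel, payload)."""
        # Assumption: the trigger sends the new row as JSON with a "topic" key.
        entry = json.loads(payload)
        # Cost grows linearly with the number of subscribers of this topic.
        for queue in self._subscribers.get(entry["topic"], ()):
            queue.put_nowait(entry)


# With asyncpg this could be wired up as (channel name is hypothetical):
#   await conn.add_listener("news_updates", hub.on_notify)
```

A WebSocket handler would then call `subscribe`, await `queue.get()` in a loop, forward each entry to its client, and call `unsubscribe` on disconnect.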
Key Technologies
- FastAPI - Web framework for REST and WebSocket
- Playwright - Browser automation with anti-bot detection (see how it works)
- PostgreSQL - Reliable storage with LISTEN/NOTIFY for real-time events
- Docker - Container orchestration for easy deployment
Prerequisites
- Docker installed on your system
That's it! All dependencies (Python, PostgreSQL, Playwright browsers) are handled inside containers.
Optional: Install websocat for WebSocket testing (used in the examples in this README), or use any WebSocket client you prefer.
Web UI
TopicStreams includes a modern, responsive Web UI that provides a complete dashboard for monitoring and managing your news aggregation system.
Features
- System Status Dashboard - Real-time monitoring of scraper health and activity
- Topic Management - Easy add/remove topics with visual feedback
- Real-time News Feed - Live updates with WebSocket connections
- Scraper Logs Panel - Historical activity monitoring
Access the Web UI
After Quick Start, simply open your browser and navigate to:
http://localhost:5000
Note: By default, the Web UI is accessible on port 5000. If you changed HOST_PORT in your .env file (e.g., set it to 80 for production), use that port instead (e.g., http://localhost:80).
TopicStreams Web UI - Complete dashboard for real-time news aggregation
Quick Start
1. Clone the Repository
git clone https://github.com/zydo/topicstreams.git
cd topicstreams
2. Configure Environment
Copy .env.example to .env and customize if needed:
cp .env.example .env
Default settings work out-of-the-box.
3. Start Services
docker compose up -d
This will start three containers:
- postgres - Database
- scraper - Background scraping service
- api - FastAPI server at http://localhost:5000 (or the port set by HOST_PORT in .env)
4. Add Topics to Track
# Add a topic (replace 5000 with your HOST_PORT if changed)
curl -X POST http://localhost:5000/api/v1/topics \
-H "Content-Type: application/json" \
-d '{"name": "artificial intelligence"}'
Scraping of the topic will start on the next iteration.
5. Access Real-Time News
WebSocket (for real-time):
# Using websocat
websocat ws://localhost:5000/api/v1/ws/news/artificial+intelligence
# Or with jq for prettier formatted output
websocat ws://localhost:5000/api/v1/ws/news/artificial+intelligence | jq
REST API (for historical data):
# Get recent news for a topic with pagination (results 11 to 15, newest first)
# The URL is quoted so the shell does not treat "&" as a background operator
curl "http://localhost:5000/api/v1/news/artificial+intelligence?offset=10&limit=5" | jq
# List all actively scraping topics
curl http://localhost:5000/api/v1/topics | jq
# List recent 10 scraper logs (each log represents one Google webpage load - typically up to 10 news entries)
curl "http://localhost:5000/api/v1/logs?limit=10" | jq
See the API Reference section below for complete endpoint documentation.
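The same REST calls can be made from Python with only the standard library. The sketch below mirrors the curl commands in this section; the endpoint paths, the `offset`/`limit` parameters, and the `{"name": ...}` body come from those commands, while the helper names are invented for the example and the response shape is left undecoded beyond JSON parsing because it is not described here.

```python
import json
from urllib.parse import quote_plus, urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:5000/api/v1"  # adjust if HOST_PORT was changed


def news_url(topic: str, offset: int = 0, limit: int = 10) -> str:
    """URL for one page of news entries, spaces in the topic encoded as '+'."""
    query = urlencode({"offset": offset, "limit": limit})
    return f"{BASE_URL}/news/{quote_plus(topic)}?{query}"


def add_topic_request(name: str) -> Request:
    """Prepare the POST request that registers a topic for scraping."""
    body = json.dumps({"name": name}).encode("utf-8")
    return Request(
        f"{BASE_URL}/topics",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(target):
    """Send a URL string or a prepared Request and decode the JSON response."""
    with urlopen(target, timeout=10) as response:
        return json.load(response)


# Usage against a running instance:
# fetch_json(add_topic_request("artificial intelligence"))
# fetch_json(news_url("artificial intelligence", offset=10, limit=5))
```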
6. Monitor Logs
# Background scraper logs
docker compose logs -f scraper
# FastAPI server logs
docker compose logs -f api
Stop Services
docker compose down
Configuration
For complete configuration documentation including environment variables, YAML files, and browser fingerprinting settings, see Configuration.
Quick links:
- Environment variables - Database and API settings in .env
- Scraper settings - scrape_interval and max_pages
- Anti-detection settings - Browser fingerprinting and stealth strategies
- Reloading config - How to apply configuration changes
Anti-Bot Detection
TopicStreams uses sophisticated techniques to make the scraper appear as a real human user, minimizing the risk of being blocked by Google.
For detailed information about anti-detection strategies (Playwright stealth, browser fingerprinting, random delays, etc.), see Anti-Bot Detection Documentation.
Quick Reference:
- All anti-detection strategies are configurable via config/anti_detection.yml (created from a template on first-time setup)
- See Configuration for YAML configuration details
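To give a feel for what browser fingerprinting and geolocation settings can mean in Playwright terms, here is a small sketch. The `PROFILE` values and the helper names are hypothetical and do not reflect the actual keys in config/anti_detection.yml; stealth patches are not shown. The keyword arguments used (`user_agent`, `viewport`, `locale`, `timezone_id`, `geolocation`, `permissions`) are standard options of Playwright's `browser.new_context()`.

```python
import random

# Hypothetical profile; the real settings live in config/anti_detection.yml,
# whose keys are not shown in this README.
PROFILE = {
    "user_agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "viewport": {"width": 1366, "height": 768},
    "locale": "en-US",
    "timezone_id": "America/New_York",
    "geolocation": {"latitude": 40.7128, "longitude": -74.0060},
}


def context_options(profile: dict) -> dict:
    """Translate a profile into keyword arguments for browser.new_context()."""
    options = dict(profile)
    if "geolocation" in options:
        # Pages can only read the location once the permission is granted.
        options["permissions"] = ["geolocation"]
    return options


def human_delay(rng: random.Random, low: float = 2.0, high: float = 6.0) -> float:
    """Pick a randomized pause in seconds between page loads."""
    return rng.uniform(low, high)


def fetch_html(url: str, profile: dict = PROFILE) -> str:
    # Lazy import: needs `pip install playwright` and `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(**context_options(profile))
        page = context.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
        return html
```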
Scraping Behavior
For detailed information about scraping behavior, monitoring, and scaling strategies, see Scraping Behavior.
Quick links:
- Sequential execution - How topics are scraped one after another
- Scrape interval - How scrape_interval controls timing
- Monitoring - Track scraper performance
- Proxy rotation - Scaling strategies for high-volume scraping (not implemented yet)
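Sequential execution combined with a scrape interval can be pictured with the loop below. It is a sketch under one stated assumption: that `scrape_interval` is the target number of seconds between the starts of two consecutive passes over all topics. The linked Scraping Behavior document defines the real semantics, and `run_passes` and `sleep_needed` are hypothetical names.

```python
import time
from typing import Callable, Iterable


def sleep_needed(started: float, finished: float, interval: float) -> float:
    """Seconds to wait so that passes start roughly `interval` seconds apart."""
    return max(0.0, interval - (finished - started))


def run_passes(
    get_topics: Callable[[], Iterable[str]],
    scrape: Callable[[str], None],
    interval: float,
    passes: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    for _ in range(passes):
        started = clock()
        # Topics are re-read on every pass, so a newly added topic is picked
        # up on the next iteration; topics are scraped one after another.
        for topic in get_topics():
            scrape(topic)
        # If a pass took longer than the interval, start the next one at once.
        sleep(sleep_needed(started, clock(), interval))
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.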
Authentication & Security
Not implemented yet - For security recommendations and implementation strategies, see Authentication & Security.
Quick links:
- Current state - Localhost/LAN only, no built-in security
- Authentication - API keys, JWT, OAuth2 options
- Rate limiting - Protect against abuse and DDoS
- Cloudflare - Recommended for public deployment
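Since rate limiting is listed as a recommendation rather than a built-in feature, the following is only an illustration of one common technique, a token bucket, that could sit in front of the API. It is not part of TopicStreams, and the `TokenBucket` class is a hypothetical example.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

One bucket per client (for example keyed by API key or IP address) would let a request handler reject calls with HTTP 429 whenever `allow()` returns False.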
WebSocket Scalability
Not implemented yet - For scalability recommendations and implementation strategies, see WebSocket Scalability.
Quick links:
- Current state - Simple in-memory broadcasting
- Limitations - O(n) broadcast cost, single point of failure
- Redis Pub/Sub - Horizontal scaling with O(1) publish cost
- Apache Kafka - For very large-scale deployments (10K+ subscribers)
API Reference
For complete API documentation including all endpoints, request/response formats, and examples, see API Reference.
Quick links:
- Topics - List, add, and delete topics
- News - Get news entries for topics with pagination
- Logs - View scraper logs
- WebSocket - Real-time news updates via WebSocket
Database Access
For database access, common SQL queries, backup, and restore instructions, see Database Access.
Quick links:
- psql access - Connect to PostgreSQL database
- Common queries - Topics, news entries, scraper logs, and table sizes
- Backup & restore - Database backup and restore procedures
License
MIT