# Daily Telegram Scraper

A modular, high-performance Telegram channel scraper that collects posts, comments, and reactions from public Telegram channels and stores them in ClickHouse. It supports HTTP proxy and MTProto proxy connections, Jalali (Shamsi) calendar conversion, parallel batch processing, and distributed tmux-based workers.
## Features

- **Post Scraping via HTTP**: Scrapes public channel pages (`t.me/s/{channel}`) using HTTP requests and BeautifulSoup — no Telegram client needed for posts.
- **Comment Fetching via Telethon**: Retrieves threaded comments using the Telegram MTProto API with multiple sessions.
- **Reaction Extraction**: Extracts emoji reactions and their counts for each post.
- **ClickHouse Storage**: All data (posts, comments, reactions) is batch-inserted into ClickHouse using the HTTP interface and Buffer tables for high throughput.
- **Jalali Calendar Support**: Converts Gregorian dates to Jalali (Shamsi) and detects Iranian holidays using the `salnama` library.
- **Parallel Batch Processing**: Uses `ProcessPoolExecutor` + `ThreadPoolExecutor` for post scraping across thousands of channels.
- **Distributed tmux Workers**: Launches multiple tmux sessions for concurrent comment/reaction fetching across Telegram sessions.
- **Rate-Limit Handling**: Gracefully handles Telegram's `FloodWaitError` with automatic retry and backoff.
- **Proxy Support**: Supports both HTTP proxies (for web scraping) and MTProto proxies (for the Telegram API).
- **Modular Architecture**: Clean separation of concerns across fetch tools, database models, workers, and configuration.
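The HTTP-based post scraping works because Telegram's public preview pages (`t.me/s/{channel}`) embed each post's channel name and message ID in a `data-post` attribute on the message wrapper. A minimal standard-library sketch of that extraction (the HTML fragment and helper names are illustrative; the project itself parses with BeautifulSoup):

```python
from html.parser import HTMLParser

# A trimmed, illustrative fragment of a t.me/s/{channel} page;
# real pages carry many more nodes and attributes.
SAMPLE_HTML = """
<div class="tgme_widget_message" data-post="example_channel/12345">
  <div class="tgme_widget_message_text">Hello world</div>
  <span class="tgme_widget_message_views">15.2K</span>
</div>
"""

class PostIdParser(HTMLParser):
    """Collects (channel, msgid) pairs from message wrapper divs."""
    def __init__(self):
        super().__init__()
        self.posts = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "div" and "data-post" in a and "tgme_widget_message" in a.get("class", ""):
            channel, _, msgid = a["data-post"].partition("/")
            self.posts.append((channel, int(msgid)))

def parse_views(raw: str) -> int:
    """Normalize Telegram's abbreviated view counts, e.g. '15.2K' -> 15200."""
    raw = raw.strip().upper()
    factor = {"K": 1_000, "M": 1_000_000}.get(raw[-1], 1)
    digits = raw[:-1] if raw[-1] in "KM" else raw
    return int(float(digits) * factor)

parser = PostIdParser()
parser.feed(SAMPLE_HTML)
```

Parsing the preview page this way is what lets Phase 1 run without any Telegram client or API quota.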
## Project Structure

```
daily_telegram/
├── config.py                    # Telegram API credentials, proxy settings, session count
├── get_posts.py                 # Main entry point: batch post scraper (multiprocessing + threading)
├── get_comments_reactions.py    # Splits work & launches tmux workers for comments/reactions
├── run_full_scraper.sh          # Shell script to run post scraper then comments/reactions sequentially
├── requirements.txt             # Python dependencies
├── input/
│   └── channels.json            # JSON file with list of Telegram channel URLs
├── fetch_tools/
│   ├── posts.py                 # HTTP-based post scraper (t.me/s/ pages), parses HTML, inserts to ClickHouse
│   ├── comments.py              # Async Telethon-based comment fetcher
│   ├── reactions.py             # Async Telethon-based reaction fetcher
│   ├── utils.py                 # ClickHouse batch insertion, HTML parsing, holiday checking, media type detection
│   └── configDB.py              # ClickHouse connection settings (URL, auth, DB name, HTTP headers)
├── db/
│   ├── db.py                    # SQLAlchemy engine & session for ClickHouse
│   └── models.py                # SQLAlchemy models: BufferPost, BufferComment, BufferReaction
├── utils/
│   ├── worker_posts.py          # Channel normalization, worker function for post fetching
│   └── worker_comments_reactions.py  # Async worker: connects Telethon client, processes batch file
├── session/
│   ├── create_esssion.py        # Script to create multiple Telethon sessions with MTProto proxy
│   └── telegram_*.session       # Auto-generated Telethon session files
├── output/
│   ├── batches/                 # Batch JSON files with channel → msgid mappings
│   └── tmp_batches/             # Temporary worker batch files
└── logs/                        # Log files from scraper runs
```
## How It Works

### Phase 1: Post Scraping (`get_posts.py`)

- Reads the channel list from `input/channels.json`.
- Splits channels into batches (default: 1000 channels per batch).
- Each batch is processed in a separate process (`ProcessPoolExecutor`).
- Within each batch, channels are scraped concurrently using threads (`ThreadPoolExecutor`).
- Posts are fetched by scraping `https://t.me/s/{channel}` HTML pages via HTTP (with a proxy if configured).
- Each post is parsed for: message ID, text content, view count, date, and media type.
- Gregorian dates are converted to Jalali; Iranian holidays are detected.
- Posts are batch-inserted into the `telegram_posts_buffer` ClickHouse table.
- Collected message IDs are saved to `output/batches/batch_XXX.json`.
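The project delegates the Gregorian-to-Jalali step to the `salnama` library; for reference, the conversion itself is the widely used integer-arithmetic algorithm sketched below (a standalone illustration, not the project's code). The example date matches the sample post record in this README (2025-06-20 → 1404-3-30):

```python
def gregorian_to_jalali(gy: int, gm: int, gd: int) -> tuple:
    """Convert a Gregorian date to Jalali (Shamsi) via integer arithmetic."""
    # Cumulative days before each Gregorian month (non-leap year)
    g_days_in_month = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334]
    gy2 = gy + 1 if gm > 2 else gy
    days = (355666 + 365 * gy + (gy2 + 3) // 4 - (gy2 + 99) // 100
            + (gy2 + 399) // 400 + gd + g_days_in_month[gm - 1])
    jy = -1595 + 33 * (days // 12053)   # 12053 days per 33-year cycle
    days %= 12053
    jy += 4 * (days // 1461)            # 1461 days per 4-year sub-cycle
    days %= 1461
    if days > 365:
        jy += (days - 1) // 365
        days = (days - 1) % 365
    if days < 186:                      # first six Jalali months have 31 days
        jm, jd = 1 + days // 31, 1 + days % 31
    else:                               # remaining months have 30 days
        jm, jd = 7 + (days - 186) // 30, 1 + (days - 186) % 30
    return jy, jm, jd

shdate = "{}-{}-{}".format(*gregorian_to_jalali(2025, 6, 20))
```

The `shdate` string produced here matches the `shdate` column format stored in ClickHouse.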
### Phase 2: Comments & Reactions (`get_comments_reactions.py`)

- Reads all batch JSON files from `output/batches/`.
- Flattens `(channel, msgid)` pairs and splits them into chunks for N workers.
- Each chunk is saved as a text file (`worker_batch_N.txt`).
- tmux sessions are launched, each running `worker_comments_reactions.py` with a Telethon session file.
- Each worker connects via MTProto proxy, iterates over its batch, and:
  - Fetches reactions per message and inserts them into `telegram_reactions_buffer`.
  - Fetches threaded comments and inserts them into `telegram_comments_buffer`.
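The flatten-and-split step above can be sketched like this (`split_for_workers` is an illustrative name, not the actual function in `get_comments_reactions.py`):

```python
def split_for_workers(pairs, n_workers):
    """Split (channel, msgid) pairs into n roughly equal contiguous chunks.

    The first `len(pairs) % n_workers` chunks get one extra item, so chunk
    sizes never differ by more than one.
    """
    pairs = list(pairs)
    size, extra = divmod(len(pairs), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < extra else 0)
        chunks.append(pairs[start:end])
        start = end
    return chunks

chunks = split_for_workers(
    [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5)], 2
)
```

Each chunk would then be written to its own `worker_batch_N.txt` and handed to one tmux worker.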
### Full Pipeline (`run_full_scraper.sh`)

Runs Phase 1, waits for completion, then launches Phase 2 — all via tmux with logging.
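In both phases, rows reach ClickHouse through its HTTP interface as `JSONEachRow` payloads aimed at the Buffer tables. A minimal sketch of that insertion path (the helper names and payload shape are assumptions for illustration; the project's actual insertion logic lives in `fetch_tools/utils.py`):

```python
import json
import urllib.parse
import urllib.request

def build_jsoneachrow(rows):
    """Serialize row dicts into ClickHouse's JSONEachRow format:
    one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)

def insert_batch(rows, table="vazir1.telegram_posts_buffer",
                 url="http://localhost:8123/", auth=None):
    """Build (but do not send) an HTTP POST inserting a batch into ClickHouse.

    A real caller would send it with urllib.request.urlopen(req).
    """
    query = f"INSERT INTO {table} FORMAT JSONEachRow"
    req = urllib.request.Request(
        url + "?query=" + urllib.parse.quote(query),
        data=build_jsoneachrow(rows).encode("utf-8"),
        method="POST",
    )
    if auth:  # ClickHouse's HTTP auth headers
        req.add_header("X-ClickHouse-User", auth[0])
        req.add_header("X-ClickHouse-Key", auth[1])
    return req
```

Writing to the `*_buffer` tables lets ClickHouse absorb many small batches in memory and flush them to the MergeTree tables in large blocks.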
## Setup

### 1. Clone the Repository

```bash
git clone <repository-url>
cd daily_telegram
```

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

### 3. Configure Telegram API Credentials

Edit `config.py` with your API credentials from [my.telegram.org](https://my.telegram.org):

```python
api_id = 123456
api_hash = 'your_api_hash_here'
```

### 4. Configure Proxy Settings

In `config.py`, update the HTTP and MTProto proxy settings:

```python
# HTTP Proxy (for web scraping)
HTTP_PROXY_HOST = 'your_proxy_ip'
HTTP_PROXY_PORT = 8888
HTTP_PROXY_USER = 'user'
HTTP_PROXY_PASS = 'password'

# MTProto Proxy (for Telegram API)
MTPROTO_PROXY = {
    'server': 'your_proxy_ip',
    'port': 443,
    'secret': 'your_hex_secret'
}
```

### 5. Configure ClickHouse

Edit `fetch_tools/configDB.py`:

```python
CLICKHOUSE_HTTP_URL = "http://localhost:8123/"
CLICKHOUSE_AUTH = ("your_user", "your_password")
DB_NAME = "your_database"
```

### 6. Create ClickHouse Tables
Connect to ClickHouse and create the required database and tables:

```sql
CREATE DATABASE IF NOT EXISTS vazir1;

-- Main posts table
CREATE TABLE vazir1.telegram_posts (
    channel String,
    msgid UInt32,
    ptype Nullable(String),
    views UInt64,
    reactions UInt64,
    is_forwarded UInt8,
    forwarded_from Nullable(String),
    txtContent Nullable(String),
    date DateTime('Asia/Tehran'),
    shdate String,
    day_of_week UInt8,
    is_holiday UInt8,
    insert_date DateTime('Asia/Tehran'),
    update_date DateTime('Asia/Tehran')
) ENGINE = MergeTree()
ORDER BY (channel, msgid);

-- Buffer table for posts (high-throughput writes)
CREATE TABLE vazir1.telegram_posts_buffer AS vazir1.telegram_posts
ENGINE = Buffer(vazir1, telegram_posts, 16, 10, 600, 10000, 100000, 10000000, 200000000);

-- Main comments table
CREATE TABLE vazir1.telegram_comments (
    comment_id UInt32,
    channel String,
    msgid UInt32,
    user_id UInt64,
    ptype Nullable(String),
    reactions UInt64,
    is_reply UInt8,
    reply_to Nullable(UInt32),
    txtContent Nullable(String),
    date DateTime('Asia/Tehran'),
    shdate String,
    day_of_week UInt8,
    is_holiday UInt8,
    insert_date DateTime('Asia/Tehran'),
    update_date DateTime('Asia/Tehran')
) ENGINE = MergeTree()
ORDER BY (channel, msgid, comment_id);

-- Buffer table for comments
CREATE TABLE vazir1.telegram_comments_buffer AS vazir1.telegram_comments
ENGINE = Buffer(vazir1, telegram_comments, 16, 10, 600, 10000, 100000, 10000000, 200000000);

-- Main reactions table
CREATE TABLE vazir1.telegram_reactions (
    channel String,
    msgid UInt32,
    reaction String,
    count UInt32,
    insert_date DateTime('Asia/Tehran'),
    update_date DateTime('Asia/Tehran')
) ENGINE = MergeTree()
ORDER BY (channel, msgid, reaction);

-- Buffer table for reactions
CREATE TABLE vazir1.telegram_reactions_buffer AS vazir1.telegram_reactions
ENGINE = Buffer(vazir1, telegram_reactions, 16, 10, 600, 10000, 100000, 10000000, 200000000);
```

### 7. Prepare Input Channels
Edit `input/channels.json`:

```json
{
  "channels": [
    "https://t.me/channel_name_1",
    "https://t.me/channel_name_2"
  ]
}
```

### 8. Create Telegram Sessions

Run the session creator to authenticate your Telegram accounts:

```bash
python3 session/create_esssion.py
```

This will prompt you to enter phone numbers and verification codes for each session.
## Usage

### Run the Full Pipeline

```bash
bash run_full_scraper.sh
```

This will:

1. Scrape posts from all channels (last 1 day by default).
2. Wait for post scraping to complete.
3. Launch tmux workers to fetch comments and reactions.
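The workers launched in the last step retry on Telegram's `FloodWaitError`, sleeping for the number of seconds Telegram requests before trying again. A simplified, synchronous stand-in for that backoff logic (the `FloodWaitError` class below is a local stand-in for `telethon.errors.FloodWaitError`, and the helper is illustrative, not the project's actual code):

```python
import time

class FloodWaitError(Exception):
    """Stand-in for telethon.errors.FloodWaitError, which carries .seconds."""
    def __init__(self, seconds):
        super().__init__(f"wait {seconds}s")
        self.seconds = seconds

def call_with_flood_wait(fn, max_retries=3, sleep=time.sleep):
    """Run fn(), sleeping and retrying whenever Telegram asks us to wait."""
    for attempt in range(max_retries):
        try:
            return fn()
        except FloodWaitError as e:
            sleep(e.seconds)  # honor Telegram's requested cooldown
    return fn()  # final attempt: let any remaining error propagate
```

Injecting `sleep` as a parameter keeps the helper testable without real delays.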
### Run Post Scraper Only

```bash
python3 get_posts.py --channels input/channels.json --days 3
```

Arguments:

| Argument | Default | Description |
|---|---|---|
| `--channels` | `input/channels.json` | Path to the channels JSON file |
| `--days` | `1` | Number of past days to scrape posts for |
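`--days` defines a cutoff: posts dated before `now - days` are skipped. A sketch of that filter (a hypothetical helper, not the actual code in `get_posts.py`):

```python
from datetime import datetime, timedelta

def within_window(post_date: datetime, days: int, now: datetime = None) -> bool:
    """True if post_date falls within the last `days` days."""
    now = now or datetime.now()
    return post_date >= now - timedelta(days=days)

# Fixed "now" so the example is deterministic
now = datetime(2025, 6, 21, 12, 0)
```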
### Run Comments & Reactions Scraper Only

```bash
python3 get_comments_reactions.py --limit 36
```

Arguments:

| Argument | Default | Description |
|---|---|---|
| `--limit` | `36` | Maximum number of comments to fetch per post |
## Output Formats

### Post Record (ClickHouse)

```json
{
  "channel": "example_channel",
  "msgid": 12345,
  "ptype": "text + photo",
  "views": 15200,
  "reactions": 42,
  "is_forwarded": 0,
  "forwarded_from": "",
  "txtContent": "Post content text...",
  "date": "2025-06-20 14:30:00",
  "shdate": "1404-3-30",
  "day_of_week": 4,
  "is_holiday": 0
}
```

### Comment Record (ClickHouse)

```json
{
  "comment_id": 78910,
  "channel": "example_channel",
  "msgid": 12345,
  "user_id": 456789123,
  "ptype": "text",
  "reactions": 3,
  "is_reply": 1,
  "reply_to": 78900,
  "txtContent": "This is a reply comment.",
  "date": "2025-06-20 15:00:00",
  "shdate": "1404-3-30",
  "day_of_week": 4,
  "is_holiday": 0
}
```

### Reaction Record (ClickHouse)

```json
{
  "channel": "example_channel",
  "msgid": 12345,
  "reaction": "🔥",
  "count": 15
}
```

## Docker
### Build and Run with Docker Compose

```bash
docker compose up -d
```

This starts both the ClickHouse database and the scraper application. See `Dockerfile` and `docker-compose.yml` for details.

### Run Interactively (for session creation)

```bash
docker compose run --rm app python3 session/create_esssion.py
```

## Configuration Reference
| File | Purpose |
|---|---|
| `config.py` | Telegram API ID/hash, HTTP proxy, MTProto proxy, session count |
| `fetch_tools/configDB.py` | ClickHouse HTTP URL, auth credentials, database name |
| `input/channels.json` | List of Telegram channel URLs to scrape |
## License

This project is licensed under the MIT License. You are free to use, modify, and contribute.

## Contributing

Bug reports and feature requests are welcome! Please open an issue or submit a pull request.