# shopextract

Extract, compare, and monitor product data from any e-commerce store.
No existing pip package lets you extract structured product data from any store URL with zero config. shopextract does. Point it at a store, get back clean product data -- titles, prices, images, GTINs, variants -- ready for analysis, comparison, or feed generation.
Works on any website -- not just six platforms. Shopify, WooCommerce, Magento, BigCommerce, and Shopware get the fast API path. Everything else (IKEA, Nike, custom stores) goes through the intelligent scraper. JS-heavy sites use LLM extraction with support for 17+ providers, including free local models via Ollama.
## Installation

```bash
pip install shopextract
```

Requires Python 3.10+. Includes everything: extraction, comparison, monitoring, LLM support, pandas export.
Quick Start
import asyncio
import shopextract
async def main():
result = await shopextract.extract("https://example-store.com")
for product in result.products:
print(f"{product.title}: {product.price} {product.currency}")
asyncio.run(main())Three lines. That's it.
## Features

### Extract products from any store
The `extract()` function handles everything -- platform detection, URL discovery, and tiered extraction with automatic fallback.
```python
import asyncio

import shopextract


async def main():
    # Extract from any store URL
    result = await shopextract.extract("https://example-store.com", max_urls=50)

    print(f"Platform: {result.platform}")      # shopify, woocommerce, magento, ...
    print(f"Tier: {result.tier}")              # api, unified_crawl, css
    print(f"Quality: {result.quality_score}")  # 0.0 - 1.0
    print(f"Products: {result.product_count}")

    for p in result.products[:5]:
        print(f"  {p.title} - {p.price} {p.currency}")
        print(f"    GTIN: {p.gtin}  SKU: {p.sku}")
        print(f"    Image: {p.image_url}")


asyncio.run(main())
```

Extract a single product page:
```python
raw = await shopextract.extract_one("https://example-store.com/products/cool-widget")
print(raw)  # {"title": "Cool Widget", "price": "29.99", ...}
```

Use an LLM for hard-to-scrape sites (JS-heavy, no structured data):
```python
# With OpenAI
result = await shopextract.extract(
    "https://hard-to-scrape-store.com",
    llm_api_key="sk-...",
    llm_model="openai/gpt-4o-mini",
)

# With local Ollama (free, no API key)
result = await shopextract.extract(
    "https://hard-to-scrape-store.com",
    llm_model="ollama/llama3.1",
)

# Or set env vars and forget about it
# export OPENAI_API_KEY=sk-...
result = await shopextract.extract("https://any-store.com")
```

Import from a Google Shopping feed:
```python
result = await shopextract.from_feed("https://example-store.com/feed.xml")
print(f"Imported {result.product_count} products from feed")
```

### Detect platform
Identify which e-commerce platform a store runs on, with confidence scoring and detection signals.
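Conceptually, detection boils down to fingerprint checks on response headers and HTML. A simplified sketch of the idea (the `guess_platform` helper and its signals are hypothetical -- the library's actual detector also probes API endpoints and scores confidence across many signals):

```python
def guess_platform(headers: dict[str, str], html: str) -> str:
    """Very rough platform fingerprinting from headers and HTML (illustrative only)."""
    hdrs = {k.lower() for k in headers}
    body = html.lower()
    if "x-shopify-stage" in hdrs or "cdn.shopify.com" in body:
        return "shopify"
    if "/wp-json/" in body or "woocommerce" in body:
        return "woocommerce"
    if "x-magento-tags" in hdrs or "mage/requirejs" in body:
        return "magento"
    return "generic"


print(guess_platform({"X-Shopify-Stage": "production"}, "<html></html>"))  # shopify
```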
```python
import asyncio

import shopextract


async def main():
    result = await shopextract.detect("https://example-store.com")
    print(f"Platform: {result.platform}")      # e.g. Platform.SHOPIFY
    print(f"Confidence: {result.confidence}")  # 0.0 - 1.0
    print(f"Signals: {result.signals}")        # ["header:x-shopify", "cdn:cdn.shopify.com", ...]


asyncio.run(main())
```

### Discover product URLs
Find all product pages on a store without extracting them.
```python
import asyncio

import shopextract


async def main():
    urls = await shopextract.discover("https://example-store.com", max_urls=100)
    print(f"Found {len(urls)} product URLs")
    for url in urls[:10]:
        print(f"  {url}")


asyncio.run(main())
```

Uses a three-phase strategy: platform API pagination, sitemap parsing (with XML safety via defusedxml), and browser-based link crawling as a fallback.
### Compare prices across stores
Search for a product across multiple stores and see who has the best price.
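Under the hood, matching across stores is fuzzy title similarity. A minimal sketch of the idea using `difflib` (the library's actual matcher and threshold semantics may differ):

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] -- one common way to fuzzy-match titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# A near-identical title scores well above a typical 0.8 threshold:
print(title_similarity("Wireless Headphones Pro", "wireless headphones PRO v2"))
```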
```python
import asyncio

import shopextract


async def main():
    result = await shopextract.compare(
        "Wireless Headphones",
        stores=[
            "https://store-a.com",
            "https://store-b.com",
            "https://store-c.com",
        ],
    )
    print(f"Found {len(result.matches)} matches for '{result.query}'")
    if result.cheapest:
        print(f"Cheapest: {result.cheapest.price} at {result.cheapest.store}")
    if result.most_expensive:
        print(f"Most expensive: {result.most_expensive.price} at {result.most_expensive.store}")
    print(f"Average price: {result.avg_price}")
    print(f"Price spread: {result.price_spread}")


asyncio.run(main())
```

Compare two entire catalogs:
```python
diff = await shopextract.compare_catalogs(
    "https://store-a.com",
    "https://store-b.com",
)
print(f"Only in A: {len(diff.only_in_a)}")
print(f"Only in B: {len(diff.only_in_b)}")
print(f"In both: {len(diff.in_both)}")
print(f"Cheaper in A: {len(diff.cheaper_in_a)}")
print(f"Cheaper in B: {len(diff.cheaper_in_b)}")
```

Match products by title similarity or GTIN:
```python
# Fuzzy title matching
matches = shopextract.fuzzy_match(products_a, products_b, threshold=0.8)
for prod_a, prod_b, similarity in matches:
    print(f"{prod_a['title']} <-> {prod_b['title']} ({similarity:.0%})")

# Exact GTIN/SKU matching
found = shopextract.match_gtin("4260442152415", all_products)
```

### Monitor stores for changes
Take snapshots over time and detect price changes, new products, and removals.
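Conceptually, change detection is a diff between two snapshots keyed by product title. A simplified sketch of that idea (the real implementation persists snapshots in SQLite and uses richer change objects):

```python
def diff_snapshots(old: dict[str, float], new: dict[str, float]) -> list[tuple[str, str]]:
    """Compare two {title: price} snapshots and report what changed."""
    changes = []
    for title, price in new.items():
        if title not in old:
            changes.append(("new_product", title))
        elif old[title] != price:
            changes.append(("price_change", title))
    for title in old:
        if title not in new:
            changes.append(("removed_product", title))
    return changes


old = {"Cool Widget": 29.99, "Old Gadget": 9.99}
new = {"Cool Widget": 24.99, "Shiny Thing": 49.00}
print(diff_snapshots(old, new))
# [('price_change', 'Cool Widget'), ('new_product', 'Shiny Thing'), ('removed_product', 'Old Gadget')]
```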
```python
import asyncio

import shopextract


async def main():
    # Take a snapshot (stored in ~/.shopextract/snapshots.db)
    count = await shopextract.snapshot("https://example-store.com")
    print(f"Snapshot saved: {count} products")

    # Later, take another snapshot and check for changes
    await shopextract.snapshot("https://example-store.com")
    detected = shopextract.changes("example-store.com")
    for change in detected:
        if change.change_type == shopextract.ChangeType.PRICE_CHANGE:
            print(f"Price changed: {change.title} {change.old_price} -> {change.new_price}")
        elif change.change_type == shopextract.ChangeType.NEW_PRODUCT:
            print(f"New product: {change.title} ({change.price})")
        elif change.change_type == shopextract.ChangeType.REMOVED_PRODUCT:
            print(f"Removed: {change.title}")


asyncio.run(main())
```

Get price history for a specific product:
```python
history = shopextract.price_history("example-store.com", "Cool Widget Pro")
for timestamp, price in history:
    print(f"  {timestamp.date()}: {price}")
```

Continuous watch mode with an async generator:
```python
async def monitor():
    async for change in shopextract.watch("https://example-store.com", interval=3600):
        print(f"[{change.change_type}] {change.title}")
```

### Analyze catalogs
Get statistical insights from extracted product data.
```python
import asyncio

import shopextract


async def main():
    # Analyze directly from a URL
    stats = await shopextract.analyze("https://example-store.com")
    print(f"Total products: {stats.total_products}")
    print(f"Price range: {stats.price_range[0]} - {stats.price_range[1]}")
    print(f"Average price: {stats.avg_price}")
    print(f"Median price: {stats.median_price}")
    print(f"In stock: {stats.in_stock} / Out of stock: {stats.out_of_stock}")
    print(f"Have GTIN: {stats.has_gtin}")
    print(f"Have images: {stats.has_images}")
    print(f"Completeness score: {stats.completeness_score:.0%}")
    print(f"Top brands: {dict(list(stats.brands.items())[:5])}")


asyncio.run(main())
```

Or analyze an already-extracted product list:
```python
# From raw product dicts
stats = shopextract.analyze_products(result.raw_products)

# Price distribution buckets
dist = shopextract.price_distribution(products)
# {"0-10": 5, "10-25": 12, "25-50": 30, "50-100": 18, "100-250": 8, ...}

# Find pricing outliers (beyond 2 standard deviations)
weird = shopextract.outliers(products, std_multiplier=2.0)
for p in weird:
    print(f"Outlier: {p['title']} at {p['price']}")

# Brand market share
brands = shopextract.brand_breakdown(products)
for brand, pct in brands.items():
    print(f"  {brand}: {pct}%")
```

### Competitive intelligence
Understand where you stand against competitors.
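The rank and percentile numbers can be computed from a plain list of competitor prices. A minimal sketch of one way to do it (the library's exact percentile convention is an assumption here):

```python
def price_rank(my_price: float, competitor_prices: list[float]) -> tuple[int, float]:
    """Rank (1 = cheapest) and percentile of my_price among all observed prices."""
    prices = sorted(competitor_prices + [my_price])
    rank = prices.index(my_price) + 1
    # Fraction of the price range below us, as a percentage
    percentile = 100 * (rank - 1) / (len(prices) - 1) if len(prices) > 1 else 0.0
    return rank, percentile


rank, pct = price_rank(24.99, [19.99, 22.50, 27.00, 31.99])
print(rank, pct)  # 3 50.0
```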
```python
import asyncio

import shopextract


async def main():
    # How does my product's price rank?
    my_product = {"title": "Premium Coffee Beans 1kg", "price": 24.99}
    position = await shopextract.price_position(
        my_product,
        competitors=["https://competitor-a.com", "https://competitor-b.com"],
    )
    print(f"Rank: #{position.rank} of {position.total_competitors + 1}")
    print(f"Percentile: {position.percentile}%")
    print(f"Market average: {position.market_avg}")
    print(f"Cheapest: {position.cheapest}  Most expensive: {position.most_expensive}")

    # What categories and brands am I missing?
    gaps = await shopextract.assortment_gaps(
        "https://my-store.com",
        competitors=["https://competitor-a.com", "https://competitor-b.com"],
    )
    print(f"Missing categories: {gaps.missing_categories}")
    print(f"Missing brands: {gaps.missing_brands}")


asyncio.run(main())
```

Brand coverage across multiple catalogs:
```python
coverage = shopextract.brand_coverage({
    "my-store": my_products,
    "competitor-a": comp_a_products,
    "competitor-b": comp_b_products,
})
for brand, stores in coverage.items():
    print(f"{brand}: {stores}")
# {"Nike": {"my-store": 12, "competitor-a": 25, "competitor-b": 8}, ...}
```

### Validate for marketplaces
Check if your product data meets marketplace requirements before submitting feeds.
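A validation rule is just a predicate over a product dict. A simplified sketch of a few Google Shopping-style required-field checks (an illustrative subset, not the library's actual rule set):

```python
def validate_product(p: dict) -> list[str]:
    """Return a list of error strings for one product dict (empty = valid)."""
    errors = []
    if not p.get("title"):
        errors.append("title: missing")
    price = p.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        errors.append("price: must be a positive number")
    if not str(p.get("image_url", "")).startswith("https://"):
        errors.append("image_url: missing or not https")
    return errors


print(validate_product({"title": "", "price": -5}))
# ['title: missing', 'price: must be a positive number', 'image_url: missing or not https']
```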
```python
import shopextract

products = [
    {"title": "Widget", "price": 29.99, "image_url": "https://...", "product_url": "https://..."},
    {"title": "", "price": -5},  # will fail validation
]

# Validate against Google Shopping, idealo, Amazon, or eBay rules
report = shopextract.validate(products, marketplace="google_shopping")
print(f"Pass rate: {report.pass_rate:.0f}%")
print(f"Valid: {report.valid}  Invalid: {report.invalid}  Warnings: {report.warnings}")
for issue in report.issues:
    severity = "WARN" if issue.severity == "warning" else "ERROR"
    print(f"  [{severity}] #{issue.product_index}: {issue.field} - {issue.error}")
```

Check for broken image URLs:
```python
issues = await shopextract.check_images(products)
for issue in issues:
    print(f"  {issue.product_title}: {issue.error} ({issue.image_url})")
```

Find duplicate products:
```python
# By title similarity
dupes = shopextract.find_duplicates(products, method="title", threshold=0.9)
for idx_a, idx_b, similarity in dupes:
    print(f"  Duplicate: #{idx_a} <-> #{idx_b} ({similarity:.0%})")

# By exact GTIN or SKU
dupes = shopextract.find_duplicates(products, method="gtin")
```

### Export to any format
```python
import shopextract

products = [...]  # list of product dicts

# Standard formats
shopextract.to_csv(products, "products.csv")
shopextract.to_json(products, "products.json")

# Marketplace feeds
shopextract.to_feed(products, "google_feed.xml", format="google_shopping")
shopextract.to_feed(products, "idealo_feed.tsv", format="idealo")

# Data science formats
df = shopextract.to_dataframe(products)
shopextract.to_parquet(products, "products.parquet")
```

## CLI
Every feature is available from the command line.
```bash
# Extract products from a store
shopextract extract https://example-store.com
shopextract extract https://example-store.com -n 50 -f csv -o products.csv

# Detect platform
shopextract detect https://example-store.com

# Discover product URLs
shopextract discover https://example-store.com -n 200

# Compare prices
shopextract compare "Wireless Headphones" -s https://store-a.com -s https://store-b.com

# Monitor a store
shopextract snapshot https://example-store.com
shopextract changes example-store.com
shopextract history example-store.com "Cool Widget Pro"

# Analyze catalog
shopextract analyze https://example-store.com -n 100

# Validate product data
shopextract validate products.json -m google_shopping
shopextract validate products.json -m idealo
```

## Supported Platforms
### API-Detected Platforms (fastest extraction)

| Platform | Market Share | Detection | Extraction Method |
|---|---|---|---|
| Shopify | ~26% | Headers, CDN, `/products.json` | Public REST API |
| WooCommerce | ~36% | Headers, wp-json, plugins | Public Store API |
| Magento 2 | ~2% | Headers, REST API | Public REST API |
| BigCommerce | ~2% | Meta tags, CDN | UnifiedCrawl |
| Shopware 6 | ~1% | Headers, API config | UnifiedCrawl |
### Any Other Website (universal scraping)
| Site Type | Example | Extraction Method |
|---|---|---|
| Sites with JSON-LD | IKEA, Target, Walmart | httpx fast path (no browser) |
| Sites with OG tags | Most retail sites | httpx fast path |
| JS-rendered sites | Custom stores | Browser + markdown parsing |
| Anti-bot / JS-heavy | Zara, H&M | LLM extraction (17+ providers) |
shopextract works on any website with product pages. Platform detection enables the fast API path for known platforms. Everything else goes through the intelligent scraper with automatic fallback through 4 tiers.
## Extraction Tiers
shopextract uses a tiered fallback strategy -- it tries the fastest method first and falls back automatically.
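The fallback has a simple shape: try each tier in order and stop at the first that returns products. A conceptual sketch with stand-in extractors (hypothetical helpers, not the library's internals -- the real pipeline also weighs quality scores):

```python
def extract_with_fallback(url, tiers):
    """Try each (name, extractor) pair in order until one yields products."""
    for name, extractor in tiers:
        products = extractor(url)
        if products:  # good-enough result: stop falling back
            return name, products
    return "none", []


tiers = [
    ("api", lambda url: []),                    # platform API found nothing
    ("unified_crawl", lambda url: ["widget"]),  # structured-data crawl succeeds
    ("css", lambda url: ["widget"]),            # never reached
]
print(extract_with_fallback("https://example-store.com", tiers))
# ('unified_crawl', ['widget'])
```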
| Tier | Method | Speed | Reliability | Cost | Works On |
|---|---|---|---|---|---|
| API | Platform REST APIs | Fast | High | Free | Shopify, WooCommerce, Magento |
| UnifiedCrawl | JSON-LD + OG + markdown parsing | Medium | High | Free | Any site with structured data |
| CSS | Browser-based CSS selectors | Slow | Medium | Free | Any site |
| LLM | AI-powered extraction | Slow | High | Varies | Any site (universal fallback) |
### LLM Tier Configuration
The LLM tier requires an API key (or Ollama for local/free). It supports every major LLM provider via LiteLLM:
```python
# Pass API key directly
result = await shopextract.extract(
    "https://some-store.com",
    llm_api_key="sk-...",
    llm_model="openai/gpt-4o-mini",
)

# Or use environment variables
# export SHOPEXTRACT_LLM_API_KEY=sk-...
# export SHOPEXTRACT_LLM_MODEL=anthropic/claude-sonnet-4-20250514
result = await shopextract.extract("https://some-store.com")

# Local models with Ollama (free, no API key)
result = await shopextract.extract(
    "https://some-store.com",
    llm_model="ollama/llama3.1",
)
```

#### Supported Providers
| Provider | Model Examples | Env Var | Cost |
|---|---|---|---|
| OpenAI | `openai/gpt-4o-mini`, `openai/gpt-4o` | `OPENAI_API_KEY` | ~$0.01-0.03/page |
| Anthropic | `anthropic/claude-sonnet-4-20250514`, `anthropic/claude-haiku-4-5-20251001` | `ANTHROPIC_API_KEY` | ~$0.01-0.02/page |
| Google Gemini | `gemini/gemini-2.0-flash`, `gemini/gemini-2.5-pro-preview-06-05` | `GEMINI_API_KEY` | ~$0.01/page |
| Ollama (local) | `ollama/llama3.1`, `ollama/mistral`, `ollama/qwen2.5`, `ollama/deepseek-r1`, `ollama/phi3` | None needed | Free |
| Mistral | `mistral/mistral-large-latest`, `mistral/mistral-small-latest` | `MISTRAL_API_KEY` | ~$0.01/page |
| DeepSeek | `deepseek/deepseek-chat` | `DEEPSEEK_API_KEY` | ~$0.002/page |
| Groq | `groq/llama-3.1-70b-versatile`, `groq/llama-3.3-70b-versatile` | `GROQ_API_KEY` | Free tier |
| Cohere | `cohere/command-r-plus` | `COHERE_API_KEY` | ~$0.01/page |
| Perplexity | `perplexity/sonar-pro` | `PERPLEXITY_API_KEY` | ~$0.01/page |
| Together AI | `together_ai/meta-llama/...` | `TOGETHER_API_KEY` | Varies |
| AWS Bedrock | `bedrock/anthropic.claude...` | `AWS_ACCESS_KEY_ID` | Varies |
| Google Vertex AI | `vertex_ai/gemini-...` | `GOOGLE_APPLICATION_CREDENTIALS` | Varies |
| Azure OpenAI | `azure/gpt-4o` | `AZURE_API_KEY` | Varies |
| Cloudflare | `cloudflare/...` | `CLOUDFLARE_API_KEY` | Free tier |
| Replicate | `replicate/...` | `REPLICATE_API_TOKEN` | Varies |
| OpenRouter | `openrouter/...` (100+ models) | `OPENROUTER_API_KEY` | Varies |
Any model supported by LiteLLM works.
#### API Key Resolution Order

1. `llm_api_key` parameter (explicit)
2. `SHOPEXTRACT_LLM_API_KEY` environment variable
3. Provider-specific env var (e.g., `OPENAI_API_KEY` for `openai/...` models)
4. For `ollama/*` models -- no key needed (runs locally)
## CLI Reference

| Command | Description | Key Options |
|---|---|---|
| `shopextract extract <url>` | Extract products from a store | `-n` max URLs, `-f` format (json/csv), `-o` output file |
| `shopextract detect <url>` | Detect the e-commerce platform | -- |
| `shopextract discover <url>` | Discover product URLs | `-n` max URLs |
| `shopextract compare <query>` | Compare prices across stores | `-s` store URL (repeatable) |
| `shopextract snapshot <url>` | Save a catalog snapshot | -- |
| `shopextract changes <domain>` | Show changes between snapshots | -- |
| `shopextract history <domain> <product>` | Price history for a product | -- |
| `shopextract analyze <url>` | Catalog statistics | `-n` max products |
| `shopextract validate <file>` | Validate products against marketplace rules | `-m` marketplace |

All commands output JSON by default.
## API Reference

### Core

| Function | Signature | Returns |
|---|---|---|
| `extract` | `async (url, *, platform=None, max_urls=20, shop_url=None, llm_api_key=None, llm_model="openai/gpt-4o-mini", llm_temperature=0.2)` | `ExtractionResult` |
| `extract_one` | `async (url, *, llm_api_key=None, llm_model="openai/gpt-4o-mini")` | `dict` |
| `from_feed` | `async (feed_url, *, shop_url="")` | `ExtractionResult` |
| `detect` | `async (url, *, client=None)` | `PlatformResult` |
| `discover` | `async (url, *, platform=None, max_urls=100, timeout=30.0, client=None)` | `list[str]` |
| `normalize` | `(raw, *, platform=GENERIC, shop_url="")` | `Product \| None` |
| `QualityScorer.score_product` | `(product: dict)` | `float` |
| `QualityScorer.score_batch` | `(products: list[dict])` | `float` |
### Compare

| Function | Signature | Returns |
|---|---|---|
| `compare` | `async (query, stores, *, max_per_store=50, threshold=0.6)` | `ComparisonResult` |
| `compare_catalogs` | `async (store_a, store_b, *, max_products=200, threshold=0.8)` | `CatalogDiff` |
| `fuzzy_match` | `(products_a, products_b, *, threshold=0.8)` | `list[tuple[dict, dict, float]]` |
| `match_gtin` | `(gtin, products)` | `list[dict]` |
### Monitor

| Function | Signature | Returns |
|---|---|---|
| `snapshot` | `async (url, *, db_path="~/.shopextract/snapshots.db", max_urls=200)` | `int` |
| `changes` | `(domain, *, db_path=...)` | `list[Change]` |
| `price_history` | `(domain, product_title, *, db_path=...)` | `list[tuple[datetime, float]]` |
| `watch` | `async (url, *, interval=3600, db_path=...)` | `AsyncGenerator[Change]` |
### Analyze

| Function | Signature | Returns |
|---|---|---|
| `analyze` | `async (url, max_products=500)` | `CatalogStats` |
| `analyze_products` | `(products: list[dict])` | `CatalogStats` |
| `price_distribution` | `(products, buckets=None)` | `dict[str, int]` |
| `outliers` | `(products, std_multiplier=2.0)` | `list[dict]` |
| `brand_breakdown` | `(products: list[dict])` | `dict[str, float]` |
### Competitive Intelligence

| Function | Signature | Returns |
|---|---|---|
| `price_position` | `async (my_product, competitors, *, max_products=200)` | `PricePosition` |
| `assortment_gaps` | `async (my_store, competitors, *, max_products=200)` | `AssortmentGaps` |
| `brand_coverage` | `(catalogs: dict[str, list[dict]])` | `dict[str, dict[str, int]]` |
### Validate

| Function | Signature | Returns |
|---|---|---|
| `validate` | `(products, marketplace="google_shopping")` | `ValidationReport` |
| `check_images` | `async (products, *, timeout=10.0, concurrency=20)` | `list[ImageIssue]` |
| `find_duplicates` | `(products, method="title", threshold=0.9)` | `list[tuple[int, int, float]]` |
### Export

| Function | Signature | Returns |
|---|---|---|
| `to_csv` | `(products, path)` | `None` |
| `to_json` | `(products, path, indent=2)` | `None` |
| `to_feed` | `(products, path, format="google_shopping")` | `None` |
| `to_dataframe` | `(products)` | `pandas.DataFrame` |
| `to_parquet` | `(products, path)` | `None` |
## Data Models

| Model | Description |
|---|---|
| `Product` | Unified product with title, price, currency, description, image_url, gtin, sku, variants, etc. |
| `Variant` | Product variant (variant_id, title, price, sku, in_stock) |
| `ExtractionResult` | Extraction output: products, raw_products, tier, quality_score, platform, errors |
| `ExtractorResult` | Raw extractor output: products, complete, error, page counts |
| `PlatformResult` | Detection result: platform, confidence, signals |
| `Platform` | Enum: SHOPIFY, WOOCOMMERCE, MAGENTO, BIGCOMMERCE, SHOPWARE, GENERIC |
| `ExtractionTier` | Enum: API, UNIFIED_CRAWL, GOOGLE_FEED, CSS, LLM |
| `ComparisonResult` | Price comparison: query, matches, cheapest, most_expensive, avg_price, price_spread |
| `Match` | Matched product: title, price, currency, store, product_url, similarity |
| `CatalogDiff` | Catalog comparison: only_in_a, only_in_b, in_both, cheaper_in_a, cheaper_in_b |
| `Change` | Base change event: change_type, title, detected_at |
| `PriceChange` | Price change: old_price, new_price, currency |
| `NewProduct` | New product detected: price, currency |
| `RemovedProduct` | Product removed: last_price, currency |
| `ChangeType` | Enum: PRICE_CHANGE, NEW_PRODUCT, REMOVED_PRODUCT |
| `CatalogStats` | Catalog statistics: total, price_range, avg, median, brands, categories, completeness |
| `PricePosition` | Competitive pricing: rank, percentile, market_avg, competitor_prices |
| `AssortmentGaps` | Category/brand gaps: missing_categories, missing_brands |
| `ValidationReport` | Validation result: marketplace, total, valid, invalid, issues, pass_rate |
| `ValidationIssue` | Single issue: product_index, field, error, severity |
| `ImageIssue` | Image problem: product_index, image_url, status_code, error |
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `SHOPEXTRACT_LLM_API_KEY` | -- | API key for LLM extraction (any provider) |
| `SHOPEXTRACT_LLM_MODEL` | `openai/gpt-4o-mini` | LLM model identifier |
| `OPENAI_API_KEY` | -- | Auto-detected for `openai/...` models |
| `ANTHROPIC_API_KEY` | -- | Auto-detected for `anthropic/...` models |
| `GEMINI_API_KEY` | -- | Auto-detected for `gemini/...` models |
| `MISTRAL_API_KEY` | -- | Auto-detected for `mistral/...` models |
| `DEEPSEEK_API_KEY` | -- | Auto-detected for `deepseek/...` models |
| `GROQ_API_KEY` | -- | Auto-detected for `groq/...` models |
For Ollama models (`ollama/llama3.1`, etc.), no API key is needed -- just have Ollama running locally.
## Interactive Demo

Try shopextract without installing anything: the notebook demonstrates all features -- extraction, analysis, matching, validation, monitoring, export, quality scoring, and duplicate detection.
## Testing

### Test Stores

The notebooks and tests use public demo stores designed for developer testing:

| Platform | URL | Description |
|---|---|---|
| Shopify | https://hydrogen-preview.myshopify.com | Official Shopify Hydrogen demo store |
| Magento | https://magento.softwaretestingboard.com | Official Magento test store |

These are maintained by their respective platforms for integration testing and will not trigger anti-bot protections.

### Running Tests
Running Tests
pip install -e ".[dev]"
python -m pytest tests/ -qContributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/my-feature`)
3. Install dev dependencies: `pip install -e ".[dev]"`
4. Run tests: `pytest` (308 tests)
5. Submit a pull request
## Legal & Responsible Use
shopextract extracts publicly visible product data (titles, prices, images, SKUs) — factual information that is not copyrightable. It does not extract personal data, bypass authentication, or circumvent CAPTCHAs.
By default:
- `robots.txt` is respected (`check_robots_txt=True`)
- Requests are rate-limited (max 10 concurrent per domain)
- Extraction is capped at 20 URLs by default
- No login bypass or authentication circumvention
Users are responsible for ensuring their use complies with applicable laws, including:
- EU Database Directive (96/9/EC) — extracting a "substantial part" of a protected database may require authorization from the database maker. shopextract is designed for analysis, comparison, and research — not for reproducing entire catalogs.
- GDPR — shopextract does not collect personal data. If you extend it to process personal data, you are responsible for GDPR compliance.
- Website Terms of Service — some websites prohibit automated access in their ToS. Violating ToS is a contractual matter, not criminal, but users should review the terms of sites they extract from.
This library is a tool. Like any tool, it can be used responsibly or irresponsibly. Use it ethically.
## License

MIT -- Copyright (c) 2026 Umer Khan