# shopextract

Extract, compare, and monitor product data from any e-commerce store.
No existing pip package lets you extract structured product data from any store URL with zero config. shopextract does. Point it at a store, get back clean product data -- titles, prices, images, GTINs, variants -- ready for analysis, comparison, or feed generation.
Works on any website -- not just six platforms. Shopify, WooCommerce, Magento, BigCommerce, and Shopware get the fast API path. Everything else (IKEA, Nike, custom stores) goes through the intelligent scraper. JS-heavy sites use LLM extraction with support for 17+ providers, including free local models via Ollama.
## Installation

```bash
pip install shopextract
```

Requires Python 3.10+. Includes everything: extraction, comparison, monitoring, LLM support, pandas export.
Quick Start
import asyncio
import shopextract
async def main():
result = await shopextract.extract("https://example-store.com")
for product in result.products:
print(f"{product.title}: {product.price} {product.currency}")
asyncio.run(main())Three lines. That's it.
## Features

### Extract products from any store
The `extract()` function handles everything -- platform detection, URL discovery, and tiered extraction with automatic fallback.
```python
import asyncio

import shopextract


async def main():
    # Extract from any store URL
    result = await shopextract.extract("https://example-store.com", max_urls=50)

    print(f"Platform: {result.platform}")      # shopify, woocommerce, magento, ...
    print(f"Tier: {result.tier}")              # api, unified_crawl, css
    print(f"Quality: {result.quality_score}")  # 0.0 - 1.0
    print(f"Products: {result.product_count}")

    for p in result.products[:5]:
        print(f"  {p.title} - {p.price} {p.currency}")
        print(f"    GTIN: {p.gtin}  SKU: {p.sku}")
        print(f"    Image: {p.image_url}")


asyncio.run(main())
```

Extract a single product page:
```python
raw = await shopextract.extract_one("https://example-store.com/products/cool-widget")
print(raw)  # {"title": "Cool Widget", "price": "29.99", ...}
```

Use an LLM for hard-to-scrape sites (JS-heavy, no structured data):
```python
# With OpenAI
result = await shopextract.extract(
    "https://hard-to-scrape-store.com",
    llm_api_key="sk-...",
    llm_model="openai/gpt-4o-mini",
)

# With local Ollama (free, no API key)
result = await shopextract.extract(
    "https://hard-to-scrape-store.com",
    llm_model="ollama/llama3.1",
)

# Or set env vars and forget about it
# export OPENAI_API_KEY=sk-...
result = await shopextract.extract("https://any-store.com")
```

Import from a Google Shopping feed:
```python
result = await shopextract.from_feed("https://example-store.com/feed.xml")
print(f"Imported {result.product_count} products from feed")
```

### Detect platform
Identify which e-commerce platform a store runs on, with confidence scoring and detection signals.
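Conceptually, detection boils down to fingerprint checks on response headers and HTML. A simplified sketch of the idea (the `guess_platform` helper and its signals are hypothetical -- the library's actual detector also probes API endpoints and scores confidence across many signals):

```python
def guess_platform(headers: dict[str, str], html: str) -> str:
    """Very rough platform fingerprinting from headers and HTML (illustrative only)."""
    hdrs = {k.lower() for k in headers}
    body = html.lower()
    if "x-shopify-stage" in hdrs or "cdn.shopify.com" in body:
        return "shopify"
    if "/wp-json/" in body or "woocommerce" in body:
        return "woocommerce"
    if "x-magento-tags" in hdrs or "mage/requirejs" in body:
        return "magento"
    return "generic"


print(guess_platform({"X-Shopify-Stage": "production"}, "<html></html>"))  # shopify
```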
```python
import asyncio

import shopextract


async def main():
    result = await shopextract.detect("https://example-store.com")
    print(f"Platform: {result.platform}")      # e.g. Platform.SHOPIFY
    print(f"Confidence: {result.confidence}")  # 0.0 - 1.0
    print(f"Signals: {result.signals}")        # ["header:x-shopify", "cdn:cdn.shopify.com", ...]


asyncio.run(main())
```

### Discover product URLs
Find all product pages on a store without extracting them.
```python
import asyncio

import shopextract


async def main():
    urls = await shopextract.discover("https://example-store.com", max_urls=100)
    print(f"Found {len(urls)} product URLs")
    for url in urls[:10]:
        print(f"  {url}")


asyncio.run(main())
```

Uses a three-phase strategy: platform API pagination, sitemap parsing (with XML safety via defusedxml), and browser-based link crawling as a fallback.
### Compare prices across stores
Search for a product across multiple stores and see who has the best price.
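Under the hood, matching across stores is fuzzy title similarity. A minimal sketch of the idea using `difflib` (the library's actual matcher and threshold semantics may differ):

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] -- one common way to fuzzy-match titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# A near-identical title scores well above a typical 0.8 threshold:
print(title_similarity("Wireless Headphones Pro", "wireless headphones PRO v2"))
```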
```python
import asyncio

import shopextract


async def main():
    result = await shopextract.compare(
        "Wireless Headphones",
        stores=[
            "https://store-a.com",
            "https://store-b.com",
            "https://store-c.com",
        ],
    )
    print(f"Found {len(result.matches)} matches for '{result.query}'")
    if result.cheapest:
        print(f"Cheapest: {result.cheapest.price} at {result.cheapest.store}")
    if result.most_expensive:
        print(f"Most expensive: {result.most_expensive.price} at {result.most_expensive.store}")
    print(f"Average price: {result.avg_price}")
    print(f"Price spread: {result.price_spread}")


asyncio.run(main())
```

Compare two entire catalogs:
```python
diff = await shopextract.compare_catalogs(
    "https://store-a.com",
    "https://store-b.com",
)
print(f"Only in A: {len(diff.only_in_a)}")
print(f"Only in B: {len(diff.only_in_b)}")
print(f"In both: {len(diff.in_both)}")
print(f"Cheaper in A: {len(diff.cheaper_in_a)}")
print(f"Cheaper in B: {len(diff.cheaper_in_b)}")
```

Match products by title similarity or GTIN:
```python
# Fuzzy title matching
matches = shopextract.fuzzy_match(products_a, products_b, threshold=0.8)
for prod_a, prod_b, similarity in matches:
    print(f"{prod_a['title']} <-> {prod_b['title']} ({similarity:.0%})")

# Exact GTIN/SKU matching
found = shopextract.match_gtin("4260442152415", all_products)
```

### Monitor stores for changes
Take snapshots over time and detect price changes, new products, and removals.
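Conceptually, change detection is a diff between two snapshots keyed by product title. A simplified sketch of that idea (the real implementation persists snapshots in SQLite and uses richer change objects):

```python
def diff_snapshots(old: dict[str, float], new: dict[str, float]) -> list[tuple[str, str]]:
    """Compare two {title: price} snapshots and report what changed."""
    changes = []
    for title, price in new.items():
        if title not in old:
            changes.append(("new_product", title))
        elif old[title] != price:
            changes.append(("price_change", title))
    for title in old:
        if title not in new:
            changes.append(("removed_product", title))
    return changes


old = {"Cool Widget": 29.99, "Old Gadget": 9.99}
new = {"Cool Widget": 24.99, "Shiny Thing": 49.00}
print(diff_snapshots(old, new))
# [('price_change', 'Cool Widget'), ('new_product', 'Shiny Thing'), ('removed_product', 'Old Gadget')]
```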
```python
import asyncio

import shopextract


async def main():
    # Take a snapshot (stored in ~/.shopextract/snapshots.db)
    count = await shopextract.snapshot("https://example-store.com")
    print(f"Snapshot saved: {count} products")

    # Later, take another snapshot and check for changes
    await shopextract.snapshot("https://example-store.com")
    detected = shopextract.changes("example-store.com")
    for change in detected:
        if change.change_type == shopextract.ChangeType.PRICE_CHANGE:
            print(f"Price changed: {change.title} {change.old_price} -> {change.new_price}")
        elif change.change_type == shopextract.ChangeType.NEW_PRODUCT:
            print(f"New product: {change.title} ({change.price})")
        elif change.change_type == shopextract.ChangeType.REMOVED_PRODUCT:
            print(f"Removed: {change.title}")


asyncio.run(main())
```

Get price history for a specific product:
```python
history = shopextract.price_history("example-store.com", "Cool Widget Pro")
for timestamp, price in history:
    print(f"  {timestamp.date()}: {price}")
```

Continuous watch mode with an async generator:
```python
async def monitor():
    async for change in shopextract.watch("https://example-store.com", interval=3600):
        print(f"[{change.change_type}] {change.title}")
```

### Analyze catalogs
Get statistical insights from extracted product data.
```python
import asyncio

import shopextract


async def main():
    # Analyze directly from a URL
    stats = await shopextract.analyze("https://example-store.com")
    print(f"Total products: {stats.total_products}")
    print(f"Price range: {stats.price_range[0]} - {stats.price_range[1]}")
    print(f"Average price: {stats.avg_price}")
    print(f"Median price: {stats.median_price}")
    print(f"In stock: {stats.in_stock} / Out of stock: {stats.out_of_stock}")
    print(f"Have GTIN: {stats.has_gtin}")
    print(f"Have images: {stats.has_images}")
    print(f"Completeness score: {stats.completeness_score:.0%}")
    print(f"Top brands: {dict(list(stats.brands.items())[:5])}")


asyncio.run(main())
```

Or analyze an already-extracted product list:
```python
# From raw product dicts
stats = shopextract.analyze_products(result.raw_products)

# Price distribution buckets
dist = shopextract.price_distribution(products)
# {"0-10": 5, "10-25": 12, "25-50": 30, "50-100": 18, "100-250": 8, ...}

# Find pricing outliers (beyond 2 standard deviations)
weird = shopextract.outliers(products, std_multiplier=2.0)
for p in weird:
    print(f"Outlier: {p['title']} at {p['price']}")

# Brand market share
brands = shopextract.brand_breakdown(products)
for brand, pct in brands.items():
    print(f"  {brand}: {pct}%")
```

### Competitive intelligence
Understand where you stand against competitors.
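The rank and percentile numbers can be computed from a plain list of competitor prices. A minimal sketch of one way to do it (the library's exact percentile convention is an assumption here):

```python
def price_rank(my_price: float, competitor_prices: list[float]) -> tuple[int, float]:
    """Rank (1 = cheapest) and percentile of my_price among all observed prices."""
    prices = sorted(competitor_prices + [my_price])
    rank = prices.index(my_price) + 1
    # Fraction of the price range below us, as a percentage
    percentile = 100 * (rank - 1) / (len(prices) - 1) if len(prices) > 1 else 0.0
    return rank, percentile


rank, pct = price_rank(24.99, [19.99, 22.50, 27.00, 31.99])
print(rank, pct)  # 3 50.0
```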
```python
import asyncio

import shopextract


async def main():
    # How does my product's price rank?
    my_product = {"title": "Premium Coffee Beans 1kg", "price": 24.99}
    position = await shopextract.price_position(
        my_product,
        competitors=["https://competitor-a.com", "https://competitor-b.com"],
    )
    print(f"Rank: #{position.rank} of {position.total_competitors + 1}")
    print(f"Percentile: {position.percentile}%")
    print(f"Market average: {position.market_avg}")
    print(f"Cheapest: {position.cheapest}  Most expensive: {position.most_expensive}")

    # What categories and brands am I missing?
    gaps = await shopextract.assortment_gaps(
        "https://my-store.com",
        competitors=["https://competitor-a.com", "https://competitor-b.com"],
    )
    print(f"Missing categories: {gaps.missing_categories}")
    print(f"Missing brands: {gaps.missing_brands}")


asyncio.run(main())
```

Brand coverage across multiple catalogs:
```python
coverage = shopextract.brand_coverage({
    "my-store": my_products,
    "competitor-a": comp_a_products,
    "competitor-b": comp_b_products,
})
for brand, stores in coverage.items():
    print(f"{brand}: {stores}")
# {"Nike": {"my-store": 12, "competitor-a": 25, "competitor-b": 8}, ...}
```

### Validate for marketplaces
Check if your product data meets marketplace requirements before submitting feeds.
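A validation rule is just a predicate over a product dict. A simplified sketch of a few Google Shopping-style required-field checks (an illustrative subset, not the library's actual rule set):

```python
def validate_product(p: dict) -> list[str]:
    """Return a list of error strings for one product dict (empty = valid)."""
    errors = []
    if not p.get("title"):
        errors.append("title: missing")
    price = p.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        errors.append("price: must be a positive number")
    if not str(p.get("image_url", "")).startswith("https://"):
        errors.append("image_url: missing or not https")
    return errors


print(validate_product({"title": "", "price": -5}))
# ['title: missing', 'price: must be a positive number', 'image_url: missing or not https']
```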
```python
import shopextract

products = [
    {"title": "Widget", "price": 29.99, "image_url": "https://...", "product_url": "https://..."},
    {"title": "", "price": -5},  # will fail validation
]

# Validate against Google Shopping, idealo, Amazon, or eBay rules
report = shopextract.validate(products, marketplace="google_shopping")
print(f"Pass rate: {report.pass_rate:.0f}%")
print(f"Valid: {report.valid}  Invalid: {report.invalid}  Warnings: {report.warnings}")
for issue in report.issues:
    severity = "WARN" if issue.severity == "warning" else "ERROR"
    print(f"  [{severity}] #{issue.product_index}: {issue.field} - {issue.error}")
```

Check for broken image URLs:
```python
issues = await shopextract.check_images(products)
for issue in issues:
    print(f"  {issue.product_title}: {issue.error} ({issue.image_url})")
```

Find duplicate products:
```python
# By title similarity
dupes = shopextract.find_duplicates(products, method="title", threshold=0.9)
for idx_a, idx_b, similarity in dupes:
    print(f"  Duplicate: #{idx_a} <-> #{idx_b} ({similarity:.0%})")

# By exact GTIN or SKU
dupes = shopextract.find_duplicates(products, method="gtin")
```

### Export to any format
```python
import shopextract

products = [...]  # list of product dicts

# Standard formats
shopextract.to_csv(products, "products.csv")
shopextract.to_json(products, "products.json")

# Marketplace feeds
shopextract.to_feed(products, "google_feed.xml", format="google_shopping")
shopextract.to_feed(products, "idealo_feed.tsv", format="idealo")

# Data science formats
df = shopextract.to_dataframe(products)
shopextract.to_parquet(products, "products.parquet")
```

## CLI
Every feature is available from the command line.
```bash
# Extract products from a store
shopextract extract https://example-store.com
shopextract extract https://example-store.com -n 50 -f csv -o products.csv

# Detect platform
shopextract detect https://example-store.com

# Discover product URLs
shopextract discover https://example-store.com -n 200

# Compare prices
shopextract compare "Wireless Headphones" -s https://store-a.com -s https://store-b.com

# Monitor a store
shopextract snapshot https://example-store.com
shopextract changes example-store.com
shopextract history example-store.com "Cool Widget Pro"

# Analyze catalog
shopextract analyze https://example-store.com -n 100

# Validate product data
shopextract validate products.json -m google_shopping
shopextract validate products.json -m idealo
```

## Supported Platforms
### API-Detected Platforms (fastest extraction)

| Platform | Market Share | Detection | Extraction Method |
|---|---|---|---|
| Shopify | ~26% | Headers, CDN, `/products.json` | Public REST API |
| WooCommerce | ~36% | Headers, wp-json, plugins | Public Store API |
| Magento 2 | ~2% | Headers, REST API | Public REST API |
| BigCommerce | ~2% | Meta tags, CDN | UnifiedCrawl |
| Shopware 6 | ~1% | Headers, API config | UnifiedCrawl |
### Any Other Website (universal scraping)
| Site Type | Example | Extraction Method |
|---|---|---|
| Sites with JSON-LD | IKEA, Target, Walmart | httpx fast path (no browser) |
| Sites with OG tags | Most retail sites | httpx fast path |
| JS-rendered sites | Custom stores | Browser + markdown parsing |
| Anti-bot / JS-heavy | Zara, H&M | LLM extraction (17+ providers) |
shopextract works on any website with product pages. Platform detection enables the fast API path for known platforms. Everything else goes through the intelligent scraper with automatic fallback through 4 tiers.
## Extraction Tiers
shopextract uses a tiered fallback strategy -- it tries the fastest method first and falls back automatically.
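The fallback has a simple shape: try each tier in order and stop at the first that returns products. A conceptual sketch with stand-in extractors (hypothetical helpers, not the library's internals -- the real pipeline also weighs quality scores):

```python
def extract_with_fallback(url, tiers):
    """Try each (name, extractor) pair in order until one yields products."""
    for name, extractor in tiers:
        products = extractor(url)
        if products:  # good-enough result: stop falling back
            return name, products
    return "none", []


tiers = [
    ("api", lambda url: []),                    # platform API found nothing
    ("unified_crawl", lambda url: ["widget"]),  # structured-data crawl succeeds
    ("css", lambda url: ["widget"]),            # never reached
]
print(extract_with_fallback("https://example-store.com", tiers))
# ('unified_crawl', ['widget'])
```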
| Tier | Method | Speed | Reliability | Cost | Works On |
|---|---|---|---|---|---|
| API | Platform REST APIs | Fast | High | Free | Shopify, WooCommerce, Magento |
| UnifiedCrawl | JSON-LD + OG + markdown parsing | Medium | High | Free | Any site with structured data |
| CSS | Browser-based CSS selectors | Slow | Medium | Free | Any site |
| LLM | AI-powered extraction | Slow | High | Varies | Any site (universal fallback) |
### LLM Tier Configuration
The LLM tier requires an API key (or Ollama for local/free). It supports every major LLM provider via LiteLLM:
```python
# Pass API key directly
result = await shopextract.extract(
    "https://some-store.com",
    llm_api_key="sk-...",
    llm_model="openai/gpt-4o-mini",
)

# Or use environment variables
# export SHOPEXTRACT_LLM_API_KEY=sk-...
# export SHOPEXTRACT_LLM_MODEL=anthropic/claude-sonnet-4-20250514
result = await shopextract.extract("https://some-store.com")

# Local models with Ollama (free, no API key)
result = await shopextract.extract(
    "https://some-store.com",
    llm_model="ollama/llama3.1",
)
```

#### Supported Providers
| Provider | Model Examples | Env Var | Cost |
|---|---|---|---|
| OpenAI | `openai/gpt-4o-mini`, `openai/gpt-4o` | `OPENAI_API_KEY` | ~$0.01-0.03/page |
| Anthropic | `anthropic/claude-sonnet-4-20250514`, `anthropic/claude-haiku-4-5-20251001` | `ANTHROPIC_API_KEY` | ~$0.01-0.02/page |
| Google Gemini | `gemini/gemini-2.0-flash`, `gemini/gemini-2.5-pro-preview-06-05` | `GEMINI_API_KEY` | ~$0.01/page |
| Ollama (local) | `ollama/llama3.1`, `ollama/mistral`, `ollama/qwen2.5`, `ollama/deepseek-r1`, `ollama/phi3` | None needed | Free |
| Mistral | `mistral/mistral-large-latest`, `mistral/mistral-small-latest` | `MISTRAL_API_KEY` | ~$0.01/page |
| DeepSeek | `deepseek/deepseek-chat` | `DEEPSEEK_API_KEY` | ~$0.002/page |
| Groq | `groq/llama-3.1-70b-versatile`, `groq/llama-3.3-70b-versatile` | `GROQ_API_KEY` | Free tier |
| Cohere | `cohere/command-r-plus` | `COHERE_API_KEY` | ~$0.01/page |
| Perplexity | `perplexity/sonar-pro` | `PERPLEXITY_API_KEY` | ~$0.01/page |
| Together AI | `together_ai/meta-llama/...` | `TOGETHER_API_KEY` | Varies |
| AWS Bedrock | `bedrock/anthropic.claude...` | `AWS_ACCESS_KEY_ID` | Varies |
| Google Vertex AI | `vertex_ai/gemini-...` | `GOOGLE_APPLICATION_CREDENTIALS` | Varies |
| Azure OpenAI | `azure/gpt-4o` | `AZURE_API_KEY` | Varies |
| Cloudflare | `cloudflare/...` | `CLOUDFLARE_API_KEY` | Free tier |
| Replicate | `replicate/...` | `REPLICATE_API_TOKEN` | Varies |
| OpenRouter | `openrouter/...` (100+ models) | `OPENROUTER_API_KEY` | Varies |
Any model supported by LiteLLM works.
#### API Key Resolution Order

1. `llm_api_key` parameter (explicit)
2. `SHOPEXTRACT_LLM_API_KEY` environment variable
3. Provider-specific env var (e.g., `OPENAI_API_KEY` for `openai/...` models)
4. For `ollama/*` models -- no key needed (runs locally)
## CLI Reference

| Command | Description | Key Options |
|---|---|---|
| `shopextract extract <url>` | Extract products from a store | `-n` max URLs, `-f` format (json/csv), `-o` output file |
| `shopextract detect <url>` | Detect the e-commerce platform | -- |
| `shopextract discover <url>` | Discover product URLs | `-n` max URLs |
| `shopextract compare <query>` | Compare prices across stores | `-s` store URL (repeatable) |
| `shopextract snapshot <url>` | Save a catalog snapshot | -- |
| `shopextract changes <domain>` | Show changes between snapshots | -- |
| `shopextract history <domain> <product>` | Price history for a product | -- |
| `shopextract analyze <url>` | Catalog statistics | `-n` max products |
| `shopextract validate <file>` | Validate products against marketplace rules | `-m` marketplace |

All commands output JSON by default.
## API Reference

### Core

| Function | Signature | Returns |
|---|---|---|
| `extract` | `async (url, *, platform=None, max_urls=20, shop_url=None, llm_api_key=None, llm_model="openai/gpt-4o-mini", llm_temperature=0.2)` | `ExtractionResult` |
| `extract_one` | `async (url, *, llm_api_key=None, llm_model="openai/gpt-4o-mini")` | `dict` |
| `from_feed` | `async (feed_url, *, shop_url="")` | `ExtractionResult` |
| `detect` | `async (url, *, client=None)` | `PlatformResult` |
| `discover` | `async (url, *, platform=None, max_urls=100, timeout=30.0, client=None)` | `list[str]` |
| `normalize` | `(raw, *, platform=GENERIC, shop_url="")` | `Product \| None` |
| `QualityScorer.score_product` | `(product: dict)` | `float` |
| `QualityScorer.score_batch` | `(products: list[dict])` | `float` |
### Compare

| Function | Signature | Returns |
|---|---|---|
| `compare` | `async (query, stores, *, max_per_store=50, threshold=0.6)` | `ComparisonResult` |
| `compare_catalogs` | `async (store_a, store_b, *, max_products=200, threshold=0.8)` | `CatalogDiff` |
| `fuzzy_match` | `(products_a, products_b, *, threshold=0.8)` | `list[tuple[dict, dict, float]]` |
| `match_gtin` | `(gtin, products)` | `list[dict]` |
### Monitor

| Function | Signature | Returns |
|---|---|---|
| `snapshot` | `async (url, *, db_path="~/.shopextract/snapshots.db", max_urls=200)` | `int` |
| `changes` | `(domain, *, db_path=...)` | `list[Change]` |
| `price_history` | `(domain, product_title, *, db_path=...)` | `list[tuple[datetime, float]]` |
| `watch` | `async (url, *, interval=3600, db_path=...)` | `AsyncGenerator[Change]` |
### Analyze

| Function | Signature | Returns |
|---|---|---|
| `analyze` | `async (url, max_products=500)` | `CatalogStats` |
| `analyze_products` | `(products: list[dict])` | `CatalogStats` |
| `price_distribution` | `(products, buckets=None)` | `dict[str, int]` |
| `outliers` | `(products, std_multiplier=2.0)` | `list[dict]` |
| `brand_breakdown` | `(products: list[dict])` | `dict[str, float]` |
### Competitive Intelligence

| Function | Signature | Returns |
|---|---|---|
| `price_position` | `async (my_product, competitors, *, max_products=200)` | `PricePosition` |
| `assortment_gaps` | `async (my_store, competitors, *, max_products=200)` | `AssortmentGaps` |
| `brand_coverage` | `(catalogs: dict[str, list[dict]])` | `dict[str, dict[str, int]]` |
### Validate

| Function | Signature | Returns |
|---|---|---|
| `validate` | `(products, marketplace="google_shopping")` | `ValidationReport` |
| `check_images` | `async (products, *, timeout=10.0, concurrency=20)` | `list[ImageIssue]` |
| `find_duplicates` | `(products, method="title", threshold=0.9)` | `list[tuple[int, int, float]]` |
### Export

| Function | Signature | Returns |
|---|---|---|
| `to_csv` | `(products, path)` | `None` |
| `to_json` | `(products, path, indent=2)` | `None` |
| `to_feed` | `(products, path, format="google_shopping")` | `None` |
| `to_dataframe` | `(products)` | `pandas.DataFrame` |
| `to_parquet` | `(products, path)` | `None` |
## Data Models

| Model | Description |
|---|---|
| `Product` | Unified product with title, price, currency, description, image_url, gtin, sku, variants, etc. |
| `Variant` | Product variant (variant_id, title, price, sku, in_stock) |
| `ExtractionResult` | Extraction output: products, raw_products, tier, quality_score, platform, errors |
| `ExtractorResult` | Raw extractor output: products, complete, error, page counts |
| `PlatformResult` | Detection result: platform, confidence, signals |
| `Platform` | Enum: SHOPIFY, WOOCOMMERCE, MAGENTO, BIGCOMMERCE, SHOPWARE, GENERIC |
| `ExtractionTier` | Enum: API, UNIFIED_CRAWL, GOOGLE_FEED, CSS, LLM |
| `ComparisonResult` | Price comparison: query, matches, cheapest, most_expensive, avg_price, price_spread |
| `Match` | Matched product: title, price, currency, store, product_url, similarity |
| `CatalogDiff` | Catalog comparison: only_in_a, only_in_b, in_both, cheaper_in_a, cheaper_in_b |
| `Change` | Base change event: change_type, title, detected_at |
| `PriceChange` | Price change: old_price, new_price, currency |
| `NewProduct` | New product detected: price, currency |
| `RemovedProduct` | Product removed: last_price, currency |
| `ChangeType` | Enum: PRICE_CHANGE, NEW_PRODUCT, REMOVED_PRODUCT |
| `CatalogStats` | Catalog statistics: total, price_range, avg, median, brands, categories, completeness |
| `PricePosition` | Competitive pricing: rank, percentile, market_avg, competitor_prices |
| `AssortmentGaps` | Category/brand gaps: missing_categories, missing_brands |
| `ValidationReport` | Validation result: marketplace, total, valid, invalid, issues, pass_rate |
| `ValidationIssue` | Single issue: product_index, field, error, severity |
| `ImageIssue` | Image problem: product_index, image_url, status_code, error |
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `SHOPEXTRACT_LLM_API_KEY` | -- | API key for LLM extraction (any provider) |
| `SHOPEXTRACT_LLM_MODEL` | `openai/gpt-4o-mini` | LLM model identifier |
| `OPENAI_API_KEY` | -- | Auto-detected for `openai/...` models |
| `ANTHROPIC_API_KEY` | -- | Auto-detected for `anthropic/...` models |
| `GEMINI_API_KEY` | -- | Auto-detected for `gemini/...` models |
| `MISTRAL_API_KEY` | -- | Auto-detected for `mistral/...` models |
| `DEEPSEEK_API_KEY` | -- | Auto-detected for `deepseek/...` models |
| `GROQ_API_KEY` | -- | Auto-detected for `groq/...` models |
For Ollama models (`ollama/llama3.1`, etc.), no API key is needed -- just have Ollama running locally.
## Interactive Demo

Try shopextract without installing anything: the notebook demonstrates all features -- extraction, analysis, matching, validation, monitoring, export, quality scoring, and duplicate detection.
## Testing

### Test Stores

The notebooks and tests use public demo stores designed for developer testing:

| Platform | URL | Description |
|---|---|---|
| Shopify | https://hydrogen-preview.myshopify.com | Official Shopify Hydrogen demo store |
| Magento | https://magento.softwaretestingboard.com | Official Magento test store |

These are maintained by their respective platforms for integration testing and will not trigger anti-bot protections.

### Running Tests
Running Tests
pip install -e ".[dev]"
python -m pytest tests/ -qContributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/my-feature`)
3. Install dev dependencies: `pip install -e ".[dev]"`
4. Run tests: `pytest` (308 tests)
5. Submit a pull request
## Legal & Responsible Use
shopextract extracts publicly visible product data (titles, prices, images, SKUs) — factual information that is not copyrightable. It does not extract personal data, bypass authentication, or circumvent CAPTCHAs.
By default:
- `robots.txt` is respected (`check_robots_txt=True`)
- Requests are rate-limited (max 10 concurrent per domain)
- Extraction is capped at 20 URLs by default
- No login bypass or authentication circumvention
Users are responsible for ensuring their use complies with applicable laws, including:
- EU Database Directive (96/9/EC) — extracting a "substantial part" of a protected database may require authorization from the database maker. shopextract is designed for analysis, comparison, and research — not for reproducing entire catalogs.
- GDPR — shopextract does not collect personal data. If you extend it to process personal data, you are responsible for GDPR compliance.
- Website Terms of Service — some websites prohibit automated access in their ToS. Violating ToS is a contractual matter, not criminal, but users should review the terms of sites they extract from.
This library is a tool. Like any tool, it can be used responsibly or irresponsibly. Use it ethically.
## License

MIT -- Copyright (c) 2026 Umer Khan