
umerkhan95/shopextract


shopextract

Extract, compare, and monitor product data from any e-commerce store.

PyPI version
Python versions
License: MIT
Tests
Open In Colab

No existing pip package lets you extract structured product data from any store URL with zero config. shopextract does. Point it at a store, get back clean product data -- titles, prices, images, GTINs, variants -- ready for analysis, comparison, or feed generation.

Works on any website -- not just the detected platforms. Shopify, WooCommerce, Magento, BigCommerce, and Shopware get the fast API path. Everything else (IKEA, Nike, custom stores) goes through the intelligent scraper. JS-heavy sites use LLM extraction with support for 17+ providers, including free local models via Ollama.


Installation

pip install shopextract

Requires Python 3.10+. Includes everything: extraction, comparison, monitoring, LLM support, pandas export.

Try it now: Open In Colab


Quick Start

import asyncio
import shopextract

async def main():
    result = await shopextract.extract("https://example-store.com")
    for product in result.products:
        print(f"{product.title}: {product.price} {product.currency}")

asyncio.run(main())

Three lines. That's it.


Features

Extract products from any store

The extract() function handles everything -- platform detection, URL discovery, and tiered extraction with automatic fallback.

import asyncio
import shopextract

async def main():
    # Extract from any store URL
    result = await shopextract.extract("https://example-store.com", max_urls=50)

    print(f"Platform: {result.platform}")       # shopify, woocommerce, magento, ...
    print(f"Tier: {result.tier}")               # api, unified_crawl, css
    print(f"Quality: {result.quality_score}")    # 0.0 - 1.0
    print(f"Products: {result.product_count}")

    for p in result.products[:5]:
        print(f"  {p.title} - {p.price} {p.currency}")
        print(f"    GTIN: {p.gtin}  SKU: {p.sku}")
        print(f"    Image: {p.image_url}")

asyncio.run(main())

Extract a single product page:

raw = await shopextract.extract_one("https://example-store.com/products/cool-widget")
print(raw)  # {"title": "Cool Widget", "price": "29.99", ...}

Use LLM for hard-to-scrape sites (JS-heavy, no structured data):

# With OpenAI
result = await shopextract.extract(
    "https://hard-to-scrape-store.com",
    llm_api_key="sk-...",
    llm_model="openai/gpt-4o-mini",
)

# With local Ollama (free, no API key)
result = await shopextract.extract(
    "https://hard-to-scrape-store.com",
    llm_model="ollama/llama3.1",
)

# Or set env vars and forget about it
# export OPENAI_API_KEY=sk-...
result = await shopextract.extract("https://any-store.com")

Import from a Google Shopping feed:

result = await shopextract.from_feed("https://example-store.com/feed.xml")
print(f"Imported {result.product_count} products from feed")

Detect platform

Identify which e-commerce platform a store runs on, with confidence scoring and detection signals.

import asyncio
import shopextract

async def main():
    result = await shopextract.detect("https://example-store.com")
    print(f"Platform: {result.platform}")       # e.g. Platform.SHOPIFY
    print(f"Confidence: {result.confidence}")   # 0.0 - 1.0
    print(f"Signals: {result.signals}")         # ["header:x-shopify", "cdn:cdn.shopify.com", ...]

asyncio.run(main())

Discover product URLs

Find all product pages on a store without extracting them.

import asyncio
import shopextract

async def main():
    urls = await shopextract.discover("https://example-store.com", max_urls=100)
    print(f"Found {len(urls)} product URLs")
    for url in urls[:10]:
        print(f"  {url}")

asyncio.run(main())

Uses a three-phase strategy: platform API pagination, sitemap parsing (with XML safety via defusedxml), and browser-based link crawling as a fallback.
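To illustrate the sitemap phase, here is a minimal standalone sketch. shopextract itself parses sitemaps with defusedxml (as noted above); this example uses the stdlib parser on an inline document, and the "/products/" path marker is an illustrative assumption, not the library's actual heuristic.

```python
# Sketch of sitemap-based product URL discovery.
# Assumption: product pages live under a recognizable path marker.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def product_urls_from_sitemap(xml_text: str, marker: str = "/products/") -> list[str]:
    """Collect <loc> entries whose path looks like a product page."""
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]
    return [u for u in urls if marker in u]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example-store.com/products/cool-widget</loc></url>
  <url><loc>https://example-store.com/pages/about</loc></url>
  <url><loc>https://example-store.com/products/mega-widget</loc></url>
</urlset>"""

print(product_urls_from_sitemap(sample))
# ['https://example-store.com/products/cool-widget', 'https://example-store.com/products/mega-widget']
```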

Compare prices across stores

Search for a product across multiple stores and see who has the best price.

import asyncio
import shopextract

async def main():
    result = await shopextract.compare(
        "Wireless Headphones",
        stores=[
            "https://store-a.com",
            "https://store-b.com",
            "https://store-c.com",
        ],
    )

    print(f"Found {len(result.matches)} matches for '{result.query}'")
    if result.cheapest:
        print(f"Cheapest: {result.cheapest.price} at {result.cheapest.store}")
    if result.most_expensive:
        print(f"Most expensive: {result.most_expensive.price} at {result.most_expensive.store}")
    print(f"Average price: {result.avg_price}")
    print(f"Price spread: {result.price_spread}")

asyncio.run(main())

Compare two entire catalogs:

diff = await shopextract.compare_catalogs(
    "https://store-a.com",
    "https://store-b.com",
)
print(f"Only in A: {len(diff.only_in_a)}")
print(f"Only in B: {len(diff.only_in_b)}")
print(f"In both: {len(diff.in_both)}")
print(f"Cheaper in A: {len(diff.cheaper_in_a)}")
print(f"Cheaper in B: {len(diff.cheaper_in_b)}")

Match products by title similarity or GTIN:

# Fuzzy title matching
matches = shopextract.fuzzy_match(products_a, products_b, threshold=0.8)
for prod_a, prod_b, similarity in matches:
    print(f"{prod_a['title']} <-> {prod_b['title']} ({similarity:.0%})")

# Exact GTIN/SKU matching
found = shopextract.match_gtin("4260442152415", all_products)
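For intuition, fuzzy title matching can be sketched with the stdlib alone. This is a naive O(n*m) version using difflib ratios; shopextract's internal scorer may normalize and weight differently.

```python
# Naive fuzzy title matching sketch (not shopextract internals).
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two product titles."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def naive_fuzzy_match(products_a, products_b, threshold=0.8):
    """All cross-catalog pairs whose titles clear the threshold."""
    matches = []
    for pa in products_a:
        for pb in products_b:
            score = title_similarity(pa["title"], pb["title"])
            if score >= threshold:
                matches.append((pa, pb, score))
    return matches

a = [{"title": "Wireless Headphones Pro"}]
b = [{"title": "wireless headphones pro"}, {"title": "Desk Lamp"}]
print(naive_fuzzy_match(a, b, threshold=0.8))
```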

Monitor stores for changes

Take snapshots over time and detect price changes, new products, and removals.

import asyncio
import shopextract

async def main():
    # Take a snapshot (stored in ~/.shopextract/snapshots.db)
    count = await shopextract.snapshot("https://example-store.com")
    print(f"Snapshot saved: {count} products")

    # Later, take another snapshot and check for changes
    await shopextract.snapshot("https://example-store.com")
    detected = shopextract.changes("example-store.com")

    for change in detected:
        if change.change_type == shopextract.ChangeType.PRICE_CHANGE:
            print(f"Price changed: {change.title} {change.old_price} -> {change.new_price}")
        elif change.change_type == shopextract.ChangeType.NEW_PRODUCT:
            print(f"New product: {change.title} ({change.price})")
        elif change.change_type == shopextract.ChangeType.REMOVED_PRODUCT:
            print(f"Removed: {change.title}")

asyncio.run(main())

Get price history for a specific product:

history = shopextract.price_history("example-store.com", "Cool Widget Pro")
for timestamp, price in history:
    print(f"  {timestamp.date()}: {price}")

Continuous watch mode with an async generator:

async def monitor():
    async for change in shopextract.watch("https://example-store.com", interval=3600):
        print(f"[{change.change_type}] {change.title}")
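Conceptually, change detection is a diff between two snapshots. A minimal sketch, assuming a snapshot is just a title-to-price mapping (the library's stored schema is richer than this):

```python
# Snapshot diffing sketch: new products, price changes, removals.
def diff_snapshots(old: dict, new: dict) -> list[tuple]:
    changes = []
    for title, price in new.items():
        if title not in old:
            changes.append(("new_product", title, None, price))
        elif old[title] != price:
            changes.append(("price_change", title, old[title], price))
    for title, price in old.items():
        if title not in new:
            changes.append(("removed_product", title, price, None))
    return changes

old = {"Cool Widget": 29.99, "Mega Widget": 49.99}
new = {"Cool Widget": 24.99, "Nano Widget": 9.99}
for change in diff_snapshots(old, new):
    print(change)
```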

Analyze catalogs

Get statistical insights from extracted product data.

import asyncio
import shopextract

async def main():
    # Analyze directly from a URL
    stats = await shopextract.analyze("https://example-store.com")

    print(f"Total products: {stats.total_products}")
    print(f"Price range: {stats.price_range[0]} - {stats.price_range[1]}")
    print(f"Average price: {stats.avg_price}")
    print(f"Median price: {stats.median_price}")
    print(f"In stock: {stats.in_stock} / Out of stock: {stats.out_of_stock}")
    print(f"Have GTIN: {stats.has_gtin}")
    print(f"Have images: {stats.has_images}")
    print(f"Completeness score: {stats.completeness_score:.0%}")
    print(f"Top brands: {dict(list(stats.brands.items())[:5])}")

asyncio.run(main())

Or analyze an already-extracted product list:

# From raw product dicts
stats = shopextract.analyze_products(result.raw_products)

# Price distribution buckets
dist = shopextract.price_distribution(products)
# {"0-10": 5, "10-25": 12, "25-50": 30, "50-100": 18, "100-250": 8, ...}

# Find pricing outliers (beyond 2 standard deviations)
weird = shopextract.outliers(products, std_multiplier=2.0)
for p in weird:
    print(f"Outlier: {p['title']} at {p['price']}")

# Brand market share
brands = shopextract.brand_breakdown(products)
for brand, pct in brands.items():
    print(f"  {brand}: {pct}%")
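The 2-sigma rule behind outliers() can be sketched in a few lines of stdlib Python; shopextract's implementation may handle edge cases (missing prices, tiny catalogs) differently.

```python
# Standard-deviation outlier detection sketch.
import statistics

def price_outliers(products, std_multiplier=2.0):
    """Products whose price is further than N standard deviations from the mean."""
    prices = [p["price"] for p in products if p.get("price") is not None]
    if len(prices) < 2:
        return []
    mean = statistics.mean(prices)
    std = statistics.pstdev(prices)
    if std == 0:
        return []
    return [p for p in products
            if p.get("price") is not None
            and abs(p["price"] - mean) > std_multiplier * std]

catalog = [{"title": f"Item {i}", "price": 20.0} for i in range(10)]
catalog.append({"title": "Luxury Item", "price": 500.0})
print([p["title"] for p in price_outliers(catalog)])  # ['Luxury Item']
```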

Competitive intelligence

Understand where you stand against competitors.

import asyncio
import shopextract

async def main():
    # How does my product's price rank?
    my_product = {"title": "Premium Coffee Beans 1kg", "price": 24.99}
    position = await shopextract.price_position(
        my_product,
        competitors=["https://competitor-a.com", "https://competitor-b.com"],
    )
    print(f"Rank: #{position.rank} of {position.total_competitors + 1}")
    print(f"Percentile: {position.percentile}%")
    print(f"Market average: {position.market_avg}")
    print(f"Cheapest: {position.cheapest}  Most expensive: {position.most_expensive}")

    # What categories and brands am I missing?
    gaps = await shopextract.assortment_gaps(
        "https://my-store.com",
        competitors=["https://competitor-a.com", "https://competitor-b.com"],
    )
    print(f"Missing categories: {gaps.missing_categories}")
    print(f"Missing brands: {gaps.missing_brands}")

asyncio.run(main())

Brand coverage across multiple catalogs:

coverage = shopextract.brand_coverage({
    "my-store": my_products,
    "competitor-a": comp_a_products,
    "competitor-b": comp_b_products,
})
for brand, stores in coverage.items():
    print(f"{brand}: {stores}")
# {"Nike": {"my-store": 12, "competitor-a": 25, "competitor-b": 8}, ...}

Validate for marketplaces

Check if your product data meets marketplace requirements before submitting feeds.

import shopextract

products = [
    {"title": "Widget", "price": 29.99, "image_url": "https://...", "product_url": "https://..."},
    {"title": "", "price": -5},  # will fail validation
]

# Validate against Google Shopping, idealo, Amazon, or eBay rules
report = shopextract.validate(products, marketplace="google_shopping")
print(f"Pass rate: {report.pass_rate:.0f}%")
print(f"Valid: {report.valid}  Invalid: {report.invalid}  Warnings: {report.warnings}")

for issue in report.issues:
    severity = "WARN" if issue.severity == "warning" else "ERROR"
    print(f"  [{severity}] #{issue.product_index}: {issue.field} - {issue.error}")

Check for broken image URLs:

issues = await shopextract.check_images(products)
for issue in issues:
    print(f"  {issue.product_title}: {issue.error} ({issue.image_url})")

Find duplicate products:

# By title similarity
dupes = shopextract.find_duplicates(products, method="title", threshold=0.9)
for idx_a, idx_b, similarity in dupes:
    print(f"  Duplicate: #{idx_a} <-> #{idx_b} ({similarity:.0%})")

# By exact GTIN or SKU
dupes = shopextract.find_duplicates(products, method="gtin")
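Since GTINs drive matching and deduplication, a check-digit validator can catch typos before relying on them. This is the standard GS1 algorithm (weights of 3 and 1 from the right); whether shopextract validates check digits internally is not documented here.

```python
# GS1 check-digit validation for GTIN-8/12/13/14.
def is_valid_gtin(gtin: str) -> bool:
    if not gtin.isdigit() or len(gtin) not in (8, 12, 13, 14):
        return False
    digits = [int(d) for d in gtin]
    body, check = digits[:-1], digits[-1]
    # From the right (excluding the check digit), weights alternate 3, 1, 3, 1, ...
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check

print(is_valid_gtin("4260442152415"))  # True  (the GTIN used in the examples above)
print(is_valid_gtin("4260442152416"))  # False (wrong check digit)
```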

Export to any format

import shopextract

products = [...]  # list of product dicts

# Standard formats
shopextract.to_csv(products, "products.csv")
shopextract.to_json(products, "products.json")

# Marketplace feeds
shopextract.to_feed(products, "google_feed.xml", format="google_shopping")
shopextract.to_feed(products, "idealo_feed.tsv", format="idealo")

# Data science formats
df = shopextract.to_dataframe(products)
shopextract.to_parquet(products, "products.parquet")

CLI

Every feature is available from the command line.

# Extract products from a store
shopextract extract https://example-store.com
shopextract extract https://example-store.com -n 50 -f csv -o products.csv

# Detect platform
shopextract detect https://example-store.com

# Discover product URLs
shopextract discover https://example-store.com -n 200

# Compare prices
shopextract compare "Wireless Headphones" -s https://store-a.com -s https://store-b.com

# Monitor a store
shopextract snapshot https://example-store.com
shopextract changes example-store.com
shopextract history example-store.com "Cool Widget Pro"

# Analyze catalog
shopextract analyze https://example-store.com -n 100

# Validate product data
shopextract validate products.json -m google_shopping
shopextract validate products.json -m idealo

Supported Platforms

API-Detected Platforms (fastest extraction)

| Platform | Market Share | Detection | Extraction Method |
|---|---|---|---|
| Shopify | ~26% | Headers, CDN, /products.json | Public REST API |
| WooCommerce | ~36% | Headers, wp-json, plugins | Public Store API |
| Magento 2 | ~2% | Headers, REST API | Public REST API |
| BigCommerce | ~2% | Meta tags, CDN | UnifiedCrawl |
| Shopware 6 | ~1% | Headers, API config | UnifiedCrawl |

Any Other Website (universal scraping)

| Site Type | Example | Extraction Method |
|---|---|---|
| Sites with JSON-LD | IKEA, Target, Walmart | httpx fast path (no browser) |
| Sites with OG tags | Most retail sites | httpx fast path |
| JS-rendered sites | Custom stores | Browser + markdown parsing |
| Anti-bot / JS-heavy | Zara, H&M | LLM extraction (17+ providers) |

shopextract works on any website with product pages. Platform detection enables the fast API path for known platforms. Everything else goes through the intelligent scraper with automatic fallback through 4 tiers.


Extraction Tiers

shopextract uses a tiered fallback strategy -- it tries the fastest method first and falls back automatically.

| Tier | Method | Speed | Reliability | Cost | Works On |
|---|---|---|---|---|---|
| API | Platform REST APIs | Fast | High | Free | Shopify, WooCommerce, Magento |
| UnifiedCrawl | JSON-LD + OG + markdown parsing | Medium | High | Free | Any site with structured data |
| CSS | Browser-based CSS selectors | Slow | Medium | Free | Any site |
| LLM | AI-powered extraction | Slow | High | Varies | Any site (universal fallback) |
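The general shape of such a tiered strategy is a fallback chain: try the fastest extractor, treat failures and empty results as misses, and move on. A minimal sketch, where the tier names mirror the table but the extractor functions are stand-ins, not shopextract internals:

```python
# Generic tiered-fallback sketch (illustrative, not library code).
def run_tiers(url, tiers):
    """Try each (name, extractor) in order; return the first success."""
    errors = {}
    for name, extractor in tiers:
        try:
            products = extractor(url)
            if products:  # an empty result counts as a miss; fall through
                return name, products
        except Exception as exc:
            errors[name] = exc
    return None, errors

def api_tier(url):
    raise ConnectionError("no platform API detected")

def unified_crawl_tier(url):
    return [{"title": "Cool Widget", "price": 29.99}]

tier, result = run_tiers("https://example-store.com",
                         [("api", api_tier), ("unified_crawl", unified_crawl_tier)])
print(tier)  # unified_crawl
```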

LLM Tier Configuration

The LLM tier requires an API key (or Ollama for local/free). It supports every major LLM provider via LiteLLM:

# Pass API key directly
result = await shopextract.extract(
    "https://some-store.com",
    llm_api_key="sk-...",
    llm_model="openai/gpt-4o-mini",
)

# Or use environment variables
# export SHOPEXTRACT_LLM_API_KEY=sk-...
# export SHOPEXTRACT_LLM_MODEL=anthropic/claude-sonnet-4-20250514
result = await shopextract.extract("https://some-store.com")

# Local models with Ollama (free, no API key)
result = await shopextract.extract(
    "https://some-store.com",
    llm_model="ollama/llama3.1",
)

Supported Providers

| Provider | Model Examples | Env Var | Cost |
|---|---|---|---|
| OpenAI | openai/gpt-4o-mini, openai/gpt-4o | OPENAI_API_KEY | ~$0.01-0.03/page |
| Anthropic | anthropic/claude-sonnet-4-20250514, anthropic/claude-haiku-4-5-20251001 | ANTHROPIC_API_KEY | ~$0.01-0.02/page |
| Google Gemini | gemini/gemini-2.0-flash, gemini/gemini-2.5-pro-preview-06-05 | GEMINI_API_KEY | ~$0.01/page |
| Ollama (local) | ollama/llama3.1, ollama/mistral, ollama/qwen2.5, ollama/deepseek-r1, ollama/phi3 | None needed | Free |
| Mistral | mistral/mistral-large-latest, mistral/mistral-small-latest | MISTRAL_API_KEY | ~$0.01/page |
| DeepSeek | deepseek/deepseek-chat | DEEPSEEK_API_KEY | ~$0.002/page |
| Groq | groq/llama-3.1-70b-versatile, groq/llama-3.3-70b-versatile | GROQ_API_KEY | Free tier |
| Cohere | cohere/command-r-plus | COHERE_API_KEY | ~$0.01/page |
| Perplexity | perplexity/sonar-pro | PERPLEXITY_API_KEY | ~$0.01/page |
| Together AI | together_ai/meta-llama/... | TOGETHER_API_KEY | Varies |
| AWS Bedrock | bedrock/anthropic.claude... | AWS_ACCESS_KEY_ID | Varies |
| Google Vertex AI | vertex_ai/gemini-... | GOOGLE_APPLICATION_CREDENTIALS | Varies |
| Azure OpenAI | azure/gpt-4o | AZURE_API_KEY | Varies |
| Cloudflare | cloudflare/... | CLOUDFLARE_API_KEY | Free tier |
| Replicate | replicate/... | REPLICATE_API_TOKEN | Varies |
| OpenRouter | openrouter/... (100+ models) | OPENROUTER_API_KEY | Varies |

Any model supported by LiteLLM works.

API Key Resolution Order

  1. llm_api_key parameter (explicit)
  2. SHOPEXTRACT_LLM_API_KEY environment variable
  3. Provider-specific env var (e.g., OPENAI_API_KEY for openai/... models)
  4. For ollama/* models -- no key needed (runs locally)

CLI Reference

| Command | Description | Key Options |
|---|---|---|
| shopextract extract &lt;url&gt; | Extract products from a store | -n max URLs, -f format (json/csv), -o output file |
| shopextract detect &lt;url&gt; | Detect the e-commerce platform | -- |
| shopextract discover &lt;url&gt; | Discover product URLs | -n max URLs |
| shopextract compare &lt;query&gt; | Compare prices across stores | -s store URL (repeatable) |
| shopextract snapshot &lt;url&gt; | Save a catalog snapshot | -- |
| shopextract changes &lt;domain&gt; | Show changes between snapshots | -- |
| shopextract history &lt;domain&gt; &lt;product&gt; | Price history for a product | -- |
| shopextract analyze &lt;url&gt; | Catalog statistics | -n max products |
| shopextract validate &lt;file&gt; | Validate products against marketplace rules | -m marketplace |

All commands output JSON by default.


API Reference

Core

| Function | Signature | Returns |
|---|---|---|
| extract | async (url, *, platform=None, max_urls=20, shop_url=None, llm_api_key=None, llm_model="openai/gpt-4o-mini", llm_temperature=0.2) | ExtractionResult |
| extract_one | async (url, *, llm_api_key=None, llm_model="openai/gpt-4o-mini") | dict |
| from_feed | async (feed_url, *, shop_url="") | ExtractionResult |
| detect | async (url, *, client=None) | PlatformResult |
| discover | async (url, *, platform=None, max_urls=100, timeout=30.0, client=None) | list[str] |
| normalize | (raw, *, platform=GENERIC, shop_url="") | Product \| None |
| QualityScorer.score_product | (product: dict) | float |
| QualityScorer.score_batch | (products: list[dict]) | float |

Compare

| Function | Signature | Returns |
|---|---|---|
| compare | async (query, stores, *, max_per_store=50, threshold=0.6) | ComparisonResult |
| compare_catalogs | async (store_a, store_b, *, max_products=200, threshold=0.8) | CatalogDiff |
| fuzzy_match | (products_a, products_b, *, threshold=0.8) | list[tuple[dict, dict, float]] |
| match_gtin | (gtin, products) | list[dict] |

Monitor

| Function | Signature | Returns |
|---|---|---|
| snapshot | async (url, *, db_path="~/.shopextract/snapshots.db", max_urls=200) | int |
| changes | (domain, *, db_path=...) | list[Change] |
| price_history | (domain, product_title, *, db_path=...) | list[tuple[datetime, float]] |
| watch | async (url, *, interval=3600, db_path=...) | AsyncGenerator[Change] |

Analyze

| Function | Signature | Returns |
|---|---|---|
| analyze | async (url, max_products=500) | CatalogStats |
| analyze_products | (products: list[dict]) | CatalogStats |
| price_distribution | (products, buckets=None) | dict[str, int] |
| outliers | (products, std_multiplier=2.0) | list[dict] |
| brand_breakdown | (products: list[dict]) | dict[str, float] |

Competitive Intelligence

| Function | Signature | Returns |
|---|---|---|
| price_position | async (my_product, competitors, *, max_products=200) | PricePosition |
| assortment_gaps | async (my_store, competitors, *, max_products=200) | AssortmentGaps |
| brand_coverage | (catalogs: dict[str, list[dict]]) | dict[str, dict[str, int]] |

Validate

| Function | Signature | Returns |
|---|---|---|
| validate | (products, marketplace="google_shopping") | ValidationReport |
| check_images | async (products, *, timeout=10.0, concurrency=20) | list[ImageIssue] |
| find_duplicates | (products, method="title", threshold=0.9) | list[tuple[int, int, float]] |

Export

| Function | Signature | Returns |
|---|---|---|
| to_csv | (products, path) | None |
| to_json | (products, path, indent=2) | None |
| to_feed | (products, path, format="google_shopping") | None |
| to_dataframe | (products) | pandas.DataFrame |
| to_parquet | (products, path) | None |

Data Models

| Model | Description |
|---|---|
| Product | Unified product with title, price, currency, description, image_url, gtin, sku, variants, etc. |
| Variant | Product variant (variant_id, title, price, sku, in_stock) |
| ExtractionResult | Extraction output: products, raw_products, tier, quality_score, platform, errors |
| ExtractorResult | Raw extractor output: products, complete, error, page counts |
| PlatformResult | Detection result: platform, confidence, signals |
| Platform | Enum: SHOPIFY, WOOCOMMERCE, MAGENTO, BIGCOMMERCE, SHOPWARE, GENERIC |
| ExtractionTier | Enum: API, UNIFIED_CRAWL, GOOGLE_FEED, CSS, LLM |
| ComparisonResult | Price comparison: query, matches, cheapest, most_expensive, avg_price, price_spread |
| Match | Matched product: title, price, currency, store, product_url, similarity |
| CatalogDiff | Catalog comparison: only_in_a, only_in_b, in_both, cheaper_in_a, cheaper_in_b |
| Change | Base change event: change_type, title, detected_at |
| PriceChange | Price change: old_price, new_price, currency |
| NewProduct | New product detected: price, currency |
| RemovedProduct | Product removed: last_price, currency |
| ChangeType | Enum: PRICE_CHANGE, NEW_PRODUCT, REMOVED_PRODUCT |
| CatalogStats | Catalog statistics: total, price_range, avg, median, brands, categories, completeness |
| PricePosition | Competitive pricing: rank, percentile, market_avg, competitor_prices |
| AssortmentGaps | Category/brand gaps: missing_categories, missing_brands |
| ValidationReport | Validation result: marketplace, total, valid, invalid, issues, pass_rate |
| ValidationIssue | Single issue: product_index, field, error, severity |
| ImageIssue | Image problem: product_index, image_url, status_code, error |

Environment Variables

| Variable | Default | Description |
|---|---|---|
| SHOPEXTRACT_LLM_API_KEY | -- | API key for LLM extraction (any provider) |
| SHOPEXTRACT_LLM_MODEL | openai/gpt-4o-mini | LLM model identifier |
| OPENAI_API_KEY | -- | Auto-detected for openai/... models |
| ANTHROPIC_API_KEY | -- | Auto-detected for anthropic/... models |
| GEMINI_API_KEY | -- | Auto-detected for gemini/... models |
| MISTRAL_API_KEY | -- | Auto-detected for mistral/... models |
| DEEPSEEK_API_KEY | -- | Auto-detected for deepseek/... models |
| GROQ_API_KEY | -- | Auto-detected for groq/... models |

For Ollama models (ollama/llama3.1, etc.), no API key is needed -- just have Ollama running locally.


Interactive Demo

Try shopextract without installing anything:

Open In Colab

The notebook demonstrates all features: extraction, analysis, matching, validation, monitoring, export, quality scoring, and duplicate detection.


Testing

Test Stores

The notebooks and tests use public demo stores designed for developer testing:

| Platform | URL | Description |
|---|---|---|
| Shopify | https://hydrogen-preview.myshopify.com | Official Shopify Hydrogen demo store |
| Magento | https://magento.softwaretestingboard.com | Official Magento test store |

These stores are maintained by their respective platforms for integration testing and are unlikely to trigger anti-bot protections.

Running Tests

pip install -e ".[dev]"
python -m pytest tests/ -q

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Install dev dependencies: pip install -e ".[dev]"
  4. Run tests: pytest (308 tests)
  5. Submit a pull request

Legal & Ethical Use

shopextract extracts publicly visible product data (titles, prices, images, SKUs) -- factual information that is not copyrightable. It does not extract personal data, bypass authentication, or circumvent CAPTCHAs.

By default:

  • robots.txt is respected (check_robots_txt=True)
  • Requests are rate-limited (max 10 concurrent per domain)
  • Extraction is capped at 20 URLs by default
  • No login bypass or authentication circumvention

Users are responsible for ensuring their use complies with applicable laws, including:

  • EU Database Directive (96/9/EC) — extracting a "substantial part" of a protected database may require authorization from the database maker. shopextract is designed for analysis, comparison, and research — not for reproducing entire catalogs.
  • GDPR — shopextract does not collect personal data. If you extend it to process personal data, you are responsible for GDPR compliance.
  • Website Terms of Service — some websites prohibit automated access in their ToS. Violating ToS is a contractual matter, not criminal, but users should review the terms of sites they extract from.

This library is a tool. Like any tool, it can be used responsibly or irresponsibly. Use it ethically.


License

MIT -- Copyright (c) 2026 Umer Khan