duckdb-mongo
Integrates DuckDB with MongoDB, enabling direct SQL queries over MongoDB collections without exporting data or ETL.
Announcement
We currently support DuckDB v1.5.0. The extension is built against the DuckDB submodule in this repo, and that
submodule is updated when new DuckDB releases are validated. See
How to Maintain an Extension Through DuckDB Releases
for the community extension policy.
Note: Community extensions are built and distributed only for the latest stable DuckDB release. Older DuckDB
versions can keep using the last compatible mongo extension build, but they will not receive updates. To keep
getting new mongo extension releases, upgrade DuckDB to the latest stable version. To confirm the version
locally, run `SELECT version();` in DuckDB (or `duckdb --version` if you are using the CLI).
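For example, in the DuckDB shell (the output shown is illustrative; yours will reflect your installed version):
SELECT version();
-- v1.5.0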
Quick Start
-- Attach to MongoDB
ATTACH 'host=localhost port=27017' AS mongo_db (TYPE MONGO);
-- Query your collections
SELECT * FROM mongo_db.mydb.mycollection LIMIT 10;
Using Secrets with MongoDB Atlas (recommended for production):
See the DuckDB Secrets Manager documentation for more details on managing secrets.
-- Create a secret with MongoDB Atlas credentials
CREATE SECRET mongo_creds (
TYPE mongo,
HOST 'cluster0.xxxxx.mongodb.net',
USER 'myuser',
PASSWORD 'mypassword',
SRV 'true'
);
-- Attach using the secret (use readPreference=secondaryPreferred for replica sets)
ATTACH 'dbname=mydb?readPreference=secondaryPreferred' AS atlas_db (TYPE mongo, SECRET 'mongo_creds');
-- Query your collections
SELECT * FROM atlas_db.mydb.mycollection;
Features
- Direct SQL queries over MongoDB collections (no ETL/export)
- MongoDB Atlas support via connection strings or DuckDB Secrets
- TLS/SSL encryption for secure connections
- Flexible schema handling: user-provided schemas, `__schema` document support (Atlas SQL compatibility), or automatic inference
- Nested document flattening with underscore-separated names
- BSON type mapping including Decimal128, arrays, and nested documents (see BSON Type Mapping)
- Query pushdown to reduce data transfer (see Pushdown Strategy):
- Filters (WHERE clauses, complex expressions, semi-join IN)
- Projections (SELECT columns)
- Limits and TopN (ORDER BY _id LIMIT N)
- Aggregations (COUNT, SUM, MIN, MAX, AVG with GROUP BY)
- Read-only (write support may be added)
Installation
The easiest way to install the mongo extension is from the DuckDB community extensions repository:
INSTALL mongo FROM community;
LOAD mongo;
After installation, you can use the extension as described in the Connecting to MongoDB section.
Note: You can also build from source and load the extension binary directly with
LOAD '/path/to/mongo.duckdb_extension';
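To confirm the extension is installed and loaded, you can query DuckDB's built-in extension catalog (standard DuckDB, not specific to this extension):
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'mongo';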
Connecting to MongoDB
Connection String Format
1. Key-value format:
ATTACH 'host=localhost port=27017' AS mongo_db (TYPE MONGO);
ATTACH 'host=localhost port=27017 dbname=mydb' AS mongo_db (TYPE MONGO);
ATTACH 'host=cluster0.xxxxx.mongodb.net user=myuser password=mypass srv=true' AS atlas_db (TYPE MONGO);
2. MongoDB URI format:
ATTACH 'mongodb://user:pass@localhost:27017/mydb' AS mongo_db (TYPE MONGO);
Connection Parameters:
| Name | Description | Default | Applies To Format |
|---|---|---|---|
| `host` | MongoDB hostname or IP address | `localhost` | Both 1 and 2 |
| `port` | MongoDB port number | `27017` | Both 1 and 2 |
| `user` / `username` | MongoDB username | - | Both 1 and 2 |
| `password` | MongoDB password | - | Both 1 and 2 |
| `dbname` / `database` | Specific MongoDB database to connect to | - | Both 1 and 2 |
| `authsource` | Authentication database | - | Both 1 and 2 |
| `srv` | Use SRV connection format (for MongoDB Atlas) | `false` | Both 1 and 2 |
| `tls` / `ssl` | Enable TLS/SSL encryption | `false` | Both 1 and 2 |
| `tls_ca_file` | Path to CA certificate file | - | Both 1 and 2 |
| `tls_allow_invalid_certificates` | Allow invalid certificates (for testing only) | `false` | Both 1 and 2 |
| `options` | Additional MongoDB connection string query parameters | - | Format 1 only |
Tip: For replica sets (including MongoDB Atlas), use `readPreference=secondaryPreferred` to route reads to secondaries.
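For example, passing readPreference through the options parameter in key-value format (a sketch based on the parameter table above; the host and credentials are placeholders):
ATTACH 'host=cluster0.xxxxx.mongodb.net user=myuser password=mypass srv=true options=readPreference=secondaryPreferred' AS atlas_db (TYPE MONGO);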
TLS/SSL Connections
The extension supports TLS/SSL encrypted connections for secure MongoDB access. Enable TLS by setting tls=true or ssl=true:
-- Basic TLS connection
ATTACH 'host=mongodb.example.com port=27017 user=myuser password=mypass tls=true' AS mongo_secure (TYPE MONGO);
-- TLS with custom CA certificate file
ATTACH 'host=mongodb.example.com port=27017 user=myuser password=mypass tls=true tls_ca_file=/path/to/ca.pem' AS mongo_secure (TYPE MONGO);
Using Secrets with TLS:
CREATE SECRET mongo_tls_secret (
TYPE mongo,
HOST 'mongodb.example.com',
PORT '27017',
USER 'myuser',
PASSWORD 'mypass',
TLS 'true',
TLS_CA_FILE '/path/to/ca.pem'
);
ATTACH 'dbname=mydb' AS mongo_secure (TYPE mongo, SECRET 'mongo_tls_secret');
Using DuckDB Secrets
Store credentials securely using DuckDB Secrets instead of embedding them in connection strings:
-- Create a secret with MongoDB credentials
CREATE SECRET mongo_creds (
TYPE mongo,
HOST 'cluster0.xxxxx.mongodb.net',
USER 'myuser',
PASSWORD 'mypassword',
SRV 'true'
);
-- Attach using the secret (options in ATTACH path merge with secret)
ATTACH 'dbname=mydb?readPreference=secondaryPreferred' AS atlas_db (TYPE mongo, SECRET 'mongo_creds');
Default secret: Create an unnamed secret to use as the default for all ATTACH operations:
CREATE SECRET (TYPE mongo, HOST 'localhost', USER 'myuser', PASSWORD 'mypass');
ATTACH '' AS mongo_db (TYPE mongo); -- Uses __default_mongo automatically
ATTACH 'dbname=mydb' AS mongo_db (TYPE mongo); -- Options merge with secret
Note: An explicit database alias (`AS alias_name`) is required. The `dbname` parameter specifies which MongoDB database to connect to, not the DuckDB database name.
Entity Mapping
When using ATTACH to connect to MongoDB, the extension maps MongoDB entities to DuckDB entities as follows:
MongoDB Entity → DuckDB Entity
─────────────────────────────────────────
MongoDB Instance → Catalog (via ATTACH)
MongoDB Database → Schema
MongoDB Collection → Table/View
Default Schema Behavior:
- Without `dbname`: creates a schema for each MongoDB database plus a `main` schema; defaults to `main`
- With `dbname`: creates only the specified database schema; defaults to that schema
ATTACH 'host=localhost port=27017' AS mongo_all (TYPE MONGO);
USE mongo_all; -- Defaults to "main", but all databases available as schemas
ATTACH 'host=localhost port=27017 dbname=mydb' AS mongo_db (TYPE MONGO);
USE mongo_db; -- Defaults to "mydb" (the only schema available)
Querying MongoDB
Setting Up Test Data (For Examples)
Prerequisites:
- MongoDB instance running (e.g., `docker run -d -p 27017:27017 mongo` or a local MongoDB installation)
- `mongosh` installed
To follow along with the examples in this README, you can create a test database with sample data:
Option 1: Use the test script (recommended)
bash test/create-mongo-tables.sh
Option 2: Manual setup with mongosh
mongosh "mongodb://localhost:27017/duckdb_mongo_test" --eval "db.orders.insertMany([{order_id: 'ORD-001', items: [{product: 'Laptop', quantity: 1, price: 999.99}, {product: 'Mouse', quantity: 2, price: 29.99}], total: 1059.97, status: 'completed'}, {order_id: 'ORD-002', items: [{product: 'Desk', quantity: 1, price: 299.99}], total: 299.99, status: 'pending'}, {order_id: 'ORD-003', items: [], total: 0, status: 'cancelled'}, {order_id: 'ORD-004', items: [{product: 'Keyboard', quantity: 1}], total: 79.99, status: 'pending'}]);"Option 3: Interactive mongosh
mongosh "mongodb://localhost:27017/duckdb_mongo_test"Then paste:
db.orders.insertMany([
{ order_id: 'ORD-001', items: [{ product: 'Laptop', quantity: 1, price: 999.99 }, { product: 'Mouse', quantity: 2, price: 29.99 }], total: 1059.97, status: 'completed' },
{ order_id: 'ORD-002', items: [{ product: 'Desk', quantity: 1, price: 299.99 }], total: 299.99, status: 'pending' },
{ order_id: 'ORD-003', items: [], total: 0, status: 'cancelled' },
{ order_id: 'ORD-004', items: [{ product: 'Keyboard', quantity: 1 }], total: 79.99, status: 'pending' }
]);
Basic Queries
-- Attach to MongoDB (using test database from setup above)
ATTACH 'host=localhost port=27017 dbname=duckdb_mongo_test' AS mongo_test (TYPE MONGO);
-- Show attached databases
SHOW DATABASES;
┌───────────────┐
│ database_name │
│ varchar │
├───────────────┤
│ memory │
│ mongo_test │
└───────────────┘
-- List schemas in the attached catalog (only the specified database when using dbname=)
SELECT schema_name FROM information_schema.schemata WHERE catalog_name = 'mongo_test';
┌───────────────────┐
│ schema_name │
│ varchar │
├───────────────────┤
│ duckdb_mongo_test │
└───────────────────┘
-- Select data from a specific collection
SELECT order_id, status, total FROM mongo_test.duckdb_mongo_test.orders;
┌──────────┬───────────┬─────────┐
│ order_id │ status │ total │
│ varchar │ varchar │ double │
├──────────┼───────────┼─────────┤
│ ORD-001 │ completed │ 1059.97 │
│ ORD-002 │ pending │ 299.99 │
│ ORD-003 │ cancelled │ 0.0 │
│ ORD-004 │ pending │ 79.99 │
└──────────┴───────────┴─────────┘
-- Query arrays of objects using list_extract (1-based indexing)
SELECT order_id, list_extract(items, 1).product AS product, list_extract(items, 1).price AS price FROM mongo_test.duckdb_mongo_test.orders;
┌──────────┬──────────┬────────┐
│ order_id │ product │ price │
│ varchar │ varchar │ double │
├──────────┼──────────┼────────┤
│ ORD-001 │ Laptop │ 999.99 │
│ ORD-002 │ Desk │ 299.99 │
│ ORD-003 │ NULL │ NULL │
│ ORD-004 │ Keyboard │ NULL │
└──────────┴──────────┴────────┘
-- Expand arrays into multiple rows using UNNEST
SELECT order_id, UNNEST(items).product AS product, UNNEST(items).price AS price
FROM mongo_test.duckdb_mongo_test.orders
WHERE order_id = 'ORD-001';
┌──────────┬──────────┬─────────┐
│ order_id │ product │ price │
│ varchar │ varchar │ double │
├──────────┼──────────┼─────────┤
│ ORD-001 │ Laptop │ 999.99 │
│ ORD-001 │ Mouse │ 29.99 │
└──────────┴──────────┴─────────┘
-- Query with aggregation
SELECT status, COUNT(*) as count, SUM(total) as total_revenue
FROM mongo_test.duckdb_mongo_test.orders
GROUP BY status
ORDER BY status;
┌───────────┬───────┬───────────────┐
│ status │ count │ total_revenue │
│ varchar │ int64 │ double │
├───────────┼───────┼───────────────┤
│ cancelled │ 1 │ 0.0 │
│ completed │ 1 │ 1059.97 │
│ pending │ 2 │ 379.98 │
└───────────┴───────┴───────────────┘
-- Filter on array element fields using UNNEST
SELECT DISTINCT order_id FROM mongo_test.duckdb_mongo_test.orders, UNNEST(items) AS unnest
WHERE unnest.product = 'Mouse';
┌──────────┐
│ order_id │
│ varchar │
├──────────┤
│ ORD-001 │
└──────────┘
Using mongo_scan Directly
You can also use the mongo_scan table function directly without attaching:
-- Basic usage
SELECT * FROM mongo_scan('mongodb://localhost:27017', 'mydb', 'mycollection');
-- With filter and sample size
SELECT * FROM mongo_scan('mongodb://localhost:27017', 'mydb', 'mycollection',
filter := '{"status": "active"}', sample_size := 200);
-- With explicit schema
SELECT * FROM mongo_scan('mongodb://localhost:27017', 'mydb', 'mycollection',
columns := {'_id': 'VARCHAR', 'name': 'VARCHAR', 'age': 'BIGINT'});
-- With nested path mapping
SELECT * FROM mongo_scan('mongodb://localhost:27017', 'mydb', 'mycollection',
columns := {
'_id': 'VARCHAR',
'name': 'VARCHAR',
'city': {'type': 'VARCHAR', 'path': 'address.city'}
});
Parameters:
- `connection_string`: MongoDB connection string
- `database`: MongoDB database name
- `collection`: MongoDB collection name
- `filter` (optional): MongoDB query filter as JSON string (e.g., `'{"status": "active"}'`)
- `sample_size` (optional): Number of documents to sample for schema inference (default: 100)
- `columns` (optional): Explicit schema definition as a struct (see Schema Resolution for details)
- `schema_mode` (optional): How to handle type mismatches: `'permissive'` (default), `'dropmalformed'`, or `'failfast'` (see Schema Enforcement Modes)
Cache Management
When using ATTACH to connect to MongoDB, the extension caches schema information, collection names, and view metadata to improve query performance. If the MongoDB schema changes (e.g., new collections are added, or collection schemas change), you may need to clear the cache:
-- Clear all MongoDB caches for all attached databases
SELECT * FROM mongo_clear_cache();
This function clears all caches for all attached MongoDB databases:
- Collection names cache
- View info cache (including schema information)
- Schema cache
Note: Currently, cache clearing is all-or-nothing (all databases). Selective cache clearing for specific databases or collections is not yet supported.
After clearing the cache, the next query will re-scan schemas and re-infer collection schemas.
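A typical workflow after a collection is added outside DuckDB (a sketch; the events collection name is illustrative):
-- 'events' was just created in MongoDB and is not yet visible to the attached catalog
SELECT * FROM mongo_clear_cache();  -- drop cached collection names and schemas
SELECT * FROM mongo_db.mydb.events LIMIT 5;  -- the next query re-scans and picks up 'events'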
Reference
BSON Type Mapping
| BSON Type | DuckDB Logical Type | Notes |
|---|---|---|
| String | VARCHAR | |
| Int32, Int64 | BIGINT | |
| Double | DOUBLE | |
| Decimal128 | DOUBLE | High-precision decimals converted to double (may lose precision) |
| Boolean | BOOLEAN | |
| Date | TIMESTAMP / DATE | DATE if time component is midnight UTC, else TIMESTAMP |
| ObjectId | VARCHAR | 24-character hex string |
| Binary | BLOB | |
| Array | LIST or VARCHAR | LIST(STRUCT(...)) for arrays of objects, LIST(primitive) for arrays of primitives, LIST(LIST(...)) for arrays of arrays (see Array Handling) |
| Document | VARCHAR | Nested documents stored as JSON string |
| Null, Undefined | VARCHAR | Type refined from other documents during inference |
| Regex, Code, Symbol, Timestamp, MinKey, MaxKey | VARCHAR | Special BSON types stored as string representation |
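To see how these mappings were applied to a given collection, you can inspect the resolved column types with DuckDB's typeof() function (shown here against the test data from Setting Up Test Data):
SELECT typeof(_id) AS id_type, typeof(total) AS total_type, typeof(items) AS items_type
FROM mongo_test.duckdb_mongo_test.orders
LIMIT 1;
-- ObjectId → VARCHAR, Double → DOUBLE, array of objects → LIST(STRUCT(...))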
Schema Resolution
The extension uses a three-tier schema resolution strategy with the following priority order:
1. User-provided `columns` parameter (highest priority)
2. `__schema` document in collection (for Atlas SQL compatibility)
3. Automatic schema inference (fallback)
User-Provided Schema
You can explicitly specify the schema using the columns parameter when calling mongo_scan:
Simple Format:
SELECT * FROM mongo_scan(
'mongodb://localhost:27017',
'mydb',
'mycollection',
columns := {'_id': 'VARCHAR', 'name': 'VARCHAR', 'age': 'BIGINT', 'active': 'BOOLEAN'}
);Nested Format with Path Mapping:
For nested fields, you can map column names to MongoDB dot notation paths:
SELECT * FROM mongo_scan(
'mongodb://localhost:27017',
'mydb',
'mycollection',
columns := {
'_id': 'VARCHAR',
'name': 'VARCHAR',
'city': {'type': 'VARCHAR', 'path': 'address.city'},
'street': {'type': 'VARCHAR', 'path': 'address.street'}
}
);
The columns parameter accepts:
- Simple format: `'column_name': 'TYPE'` where TYPE is a DuckDB type string (e.g., `'VARCHAR'`, `'BIGINT'`, `'DOUBLE'`, `'BOOLEAN'`, `'DATE'`, `'TIMESTAMP'`)
- Nested format: `'column_name': {'type': 'TYPE', 'path': 'mongo.path'}` for mapping to nested MongoDB fields
__schema Document (Atlas SQL Compatibility)
For MongoDB Atlas SQL compatibility, you can store a schema document in your collection with _id: "__schema". The extension will automatically detect and use this schema.
Simple Format (schema fields directly in document):
{
"_id": "__schema",
"name": "VARCHAR",
"age": "BIGINT",
"email": "VARCHAR"
}
Nested Format (schema in a nested `schema` field):
{
"_id": "__schema",
"schema": {
"name": "VARCHAR",
"age": "BIGINT",
"email": "VARCHAR"
}
}
Path Mapping Format (for nested MongoDB fields):
{
"_id": "__schema",
"name": "VARCHAR",
"city": {"type": "VARCHAR", "path": "address.city"},
"street": {"type": "VARCHAR", "path": "address.street"}
}
Note: When using `ATTACH` to connect to MongoDB, the `__schema` document is cached along with other schema information. Use `mongo_clear_cache()` to invalidate the cache after schema changes.
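To add a __schema document to an existing collection, insert it like any other document (a mongosh sketch in the style of the test setup above; the database, collection, and field names are illustrative):
mongosh "mongodb://localhost:27017/mydb" --eval "db.mycollection.insertOne({_id: '__schema', name: 'VARCHAR', age: 'BIGINT', email: 'VARCHAR'})"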
Schema Inference
When neither a user-provided schema nor a `__schema` document is available, the extension automatically infers the schema by sampling documents (default: 100, configurable via `sample_size`):
- Nested Documents: flattened with underscore-separated names (e.g., `user_address_city`), up to 5 levels deep
- Type Conflicts: frequency-based resolution:
- VARCHAR if >70% of values are strings
- DOUBLE if ≥30% are doubles (or any doubles present)
- BIGINT if ≥30% are integers (when no doubles)
- BOOLEAN/TIMESTAMP if ≥70% match
- Defaults to VARCHAR
- Missing Fields: NULL values
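You can inspect the inferred schema with DuckDB's standard DESCRIBE (shown here against the test data from the setup section):
DESCRIBE SELECT * FROM mongo_test.duckdb_mongo_test.orders;
-- Lists each inferred column with its resolved type, e.g. items as a LIST(STRUCT(...)) column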
Schema Enforcement Modes
When using an explicit schema (via columns parameter or __schema document), you can control how the extension handles documents that don't match the expected types using the schema_mode parameter:
| Mode | Behavior | Use Case |
|---|---|---|
| `permissive` | Set invalid fields to NULL (default) | Exploratory analysis, fault-tolerant pipelines |
| `dropmalformed` | Skip entire rows with schema violations | Data quality filtering, clean datasets |
| `failfast` | Throw error immediately on first mismatch | Production pipelines, data contracts |
Examples:
-- PERMISSIVE (default): Invalid values become NULL
SELECT * FROM mongo_scan(
'mongodb://localhost:27017', 'mydb', 'mycol',
columns := {'name': 'VARCHAR', 'age': 'INTEGER'},
schema_mode := 'permissive'
);
-- DROPMALFORMED: Skip rows where 'age' is not a valid integer
SELECT * FROM mongo_scan(
'mongodb://localhost:27017', 'mydb', 'mycol',
columns := {'name': 'VARCHAR', 'age': 'INTEGER'},
schema_mode := 'dropmalformed'
);
-- FAILFAST: Throw error if any document has invalid 'age' value
SELECT * FROM mongo_scan(
'mongodb://localhost:27017', 'mydb', 'mycol',
columns := {'name': 'VARCHAR', 'age': 'INTEGER'},
schema_mode := 'failfast'
);
Note: Schema enforcement only applies when an explicit schema is provided (via `columns` or `__schema`). Inferred schemas use permissive behavior regardless of the `schema_mode` setting.
When using `dropmalformed` or `failfast`, certain query optimizations are disabled to ensure accurate validation (e.g., aggregate pushdowns run in DuckDB instead of MongoDB, and all schema columns are fetched for validation). For best performance with large collections, use `permissive` (the default) unless strict enforcement is required.
Array Handling
Arrays of Objects:
- Arrays of objects are stored as DuckDB `LIST(STRUCT(...))` types
- Schema Inference: scans up to 10 elements per array to discover all field names across array elements
  - This ensures fields that only exist in later elements are still discovered
  - Example: if `items[0]` has `{product, quantity}` and `items[5]` has `{product, quantity, discount}`, the `discount` field will be included in the STRUCT
  - Creates a LIST type containing a STRUCT with all discovered fields
- Querying Arrays: use `list_extract()` to access specific elements (1-based indexing) or `UNNEST()` to expand arrays into multiple rows. See Basic Queries for examples.
Arrays of Primitives:
- Arrays of primitives (strings, numbers) are stored as `LIST` types
- Example: `tags: ['admin', 'user']` → `LIST(VARCHAR)` containing `['admin', 'user']`
- Can be queried with list_extract (1-based indexing): `list_extract(tags, 1)` returns `'admin'`
- Can be expanded with UNNEST: `SELECT UNNEST(tags) FROM mongo_test.duckdb_mongo_test.users`
Arrays of Arrays:
- Arrays of arrays are stored as `LIST(LIST(...))` types
- Supports nested arrays of any depth (up to 5 levels)
- Example: `matrix: [[1,2], [3,4]]` → `LIST(LIST(BIGINT))` containing `[[1,2],[3,4]]`
- Example: `data: [[[1,2], [3,4]], [[5,6], [7,8]]]` → `LIST(LIST(LIST(BIGINT)))` for 3D arrays
- Arrays of arrays of objects: `data: [[{x: 1}, {x: 2}], [{x: 3}, {x: 4}]]` → `LIST(LIST(STRUCT(...)))`
- Can be queried with nested list_extract (1-based indexing):
  - For 2D arrays: `list_extract(list_extract(matrix, 1), 2)` returns `2` (second element of first row)
  - For 3D arrays: `list_extract(list_extract(list_extract(data, 1), 1), 2)` returns `2` (second element of first row of first layer)
Mixed Array Depths:
- When documents in a collection have arrays of different depths, schema inference uses the deepest depth found across all sampled documents
- Documents with shallower arrays are automatically wrapped to match the expected depth, allowing all arrays to be returned as DuckDB LIST types
- Example: if one document has `data: [[[1,2], [3,4]]]` (3D) and another has `data: [[1,2], [3,4]]` (2D), the schema infers `LIST(LIST(LIST(BIGINT)))` (3D)
  - The 2D array `[[1,2], [3,4]]` is automatically wrapped to `[[[1,2]], [[3,4]]]` to match the 3D schema
  - Both documents return valid LIST values that can be queried using DuckDB's LIST functions
- This ensures data is preserved and queryable even when array structures vary across documents
Limitations
- Read-only
- Schema inference (when used as fallback) samples documents and may miss fields that don't appear in the sample
- Schema re-inferred per query when using `mongo_scan` directly (cached when using `ATTACH`; use `mongo_clear_cache()` to invalidate)
- Decimal128 precision: converted to DOUBLE, which may lose precision for high-precision decimal values
- Nested documents in arrays: stored as VARCHAR (JSON strings) rather than nested STRUCT types
  - Example: `items: [{product: 'Laptop', specs: {cpu: 'Intel', ram: '16GB'}}]` → `specs` field is VARCHAR, not STRUCT
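Because such nested documents arrive as JSON strings, DuckDB's JSON functions can still reach into them. A sketch, assuming the json extension is available and a collection shaped like the example above (the collection name is illustrative):
-- specs is a VARCHAR containing JSON, not a STRUCT
SELECT json_extract_string(list_extract(items, 1).specs, '$.cpu') AS cpu
FROM mongo_db.mydb.orders;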
Advanced Topics
Architecture
The extension enables in-process analytical SQL queries over MongoDB data using DuckDB's embedded analytical engine. Queries execute against live MongoDB data in real-time, with analytical operations (joins, aggregations, window functions) performed locally in memory.
┌─────────────────────────────────────────┐
│ User/Application │
└────────┬───────────────────────┬────────┘
│ ▲
│ SQL Query │ Result set (columnar)
▼ │
┌─────────────────────────────────────────┐
│ DUCKDB ENGINE │
│ ┌───────────────────────────────────┐ │
│ │ Query Planning & Optimization │ │
│ │ - Pushdown & Plan Optimization │ │
│ └───────────────────────────────────┘ │
│ ┌───────────────────────────────────┐ │
│ │ Query Execution │ │
│ │ - Joins, Aggregations │ │
│ │ - Window Functions, CTEs │ │
│ └───────────────────────────────────┘ │
└────────┬───────────────────────┬────────┘
│ ▲
│ mongo_scan │ DataChunks
▼ │
┌────────┴───────────────────────┴────────┐
│ duckdb-mongo Extension │
│ • Schema Resolution │
│ • Pushdown Optimization │
│ • BSON → Columnar Conversion │
└────────┬───────────────────────┬────────┘
│ ▲
│ MQL │ BSON stream
▼ │
┌─────────────────────────────────────────┐
│ MONGODB DATABASE │
│ ┌───────────────────────────────────┐ │
│ │ Document Store Operations │ │
│ │ - Query Execution ($match, $group)│ │
│ │ - Document Streaming (cursor) │ │
│ └───────────────────────────────────┘ │
│ Data stays here (No ETL/Export) │
└─────────────────────────────────────────┘
mongo_scan Execution Flow
┌─────────────────────────────────────────────────────────────────┐
│ mongo_scan Execution │
└─────────────────────────────────────────────────────────────────┘
1. BIND PHASE (happens once per query)
┌────────────────────────────────────────────────────────────┐
│ Parse connection string, database, collection │
│ Create MongoDB connection │
│ │
│ Schema Resolution: │
│ • User-provided, __schema document, or inference │
│ • Build column names and types │
│ │
│ Return schema to DuckDB │
└────────────────────────────────────────────────────────────┘
│
▼
2. INIT PHASE (happens once per query)
┌────────────────────────────────────────────────────────────┐
│ Build MongoDB query: │
│ • Filter pushdown ($match) │
│ • Projection pushdown │
│ • Aggregation/Limit pushdown ($group, $limit) │
│ │
│ Create MongoDB cursor or aggregation pipeline │
└────────────────────────────────────────────────────────────┘
│
▼
3. EXECUTION PHASE (called repeatedly for each chunk)
┌────────────────────────────────────────────────────────────┐
│ Fetch documents from cursor: │
│ • Retrieve BSON documents from MongoDB │
│ │
│ For each document: │
│ • Parse BSON structure │
│ • Extract fields by path │
│ • Convert BSON types → DuckDB types │
│ • Flatten nested structures │
│ • Write to columnar DataChunk │
│ │
│ Return chunk to DuckDB (up to STANDARD_VECTOR_SIZE rows) │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ DuckDB processes chunk: │
│ • Filters already applied in MongoDB (via pushdown) │
│ • Performs aggregations, joins, etc. │
│ • Requests next chunk if needed │
└────────────────────────────────────────────────────────────┘
Pushdown Strategy
The extension uses a selective pushdown strategy: filter at MongoDB (reduce data transfer), analyze in DuckDB (analytical operations).
Pushed Down to MongoDB:
- WHERE clauses (automatic conversion to MongoDB `$match` queries)
- Column projections (only columns used in SELECT are fetched)
- LIMIT clauses: simple `LIMIT N` (cursor limit) and `ORDER BY _id LIMIT N` (aggregation pipeline)
- Manual `filter` parameter (for MongoDB-specific operators like `$elemMatch`)
- Aggregations: `COUNT(*)`, `COUNT(col)`, `SUM`, `MIN`, `MAX`, `AVG` with optional `GROUP BY` (see Aggregation Pushdown)
Kept in DuckDB:
- Joins, window functions, CTEs, subqueries
- ORDER BY (except `ORDER BY _id LIMIT N`, which is pushed down)
Automatic Filter Pushdown
WHERE clauses are automatically converted to MongoDB $match queries. Use EXPLAIN to see which operations are pushed down:
-- Filter pushed down to MongoDB
EXPLAIN SELECT order_id, status FROM mongo_test.duckdb_mongo_test.orders WHERE status = 'completed';
The plan shows filters and projections in MONGO_SCAN, indicating pushdown:
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ PROJECTION │
│ ──────────────────── │
│ order_id │
│ status │
│ │
│ ~1 row │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ MONGO_SCAN │
│ ──────────────────── │
│ Function: MONGO_SCAN │
│ │
│ Projections: │
│ status │
│ order_id │ ← Only SELECT columns fetched
│ │
│ Filters: │
│ status='completed' │ ← Pushed to MongoDB
│ │
│ ~1 row │
└───────────────────────────┘
For aggregations that cannot be pushed down (see Aggregation Pushdown), filters are still pushed to MongoDB while the aggregation itself runs in DuckDB.
Supported Filter Operations:
- Comparison operators: `=`, `!=`, `<`, `<=`, `>`, `>=`
- IN clauses: `WHERE status IN ('active', 'pending')` → MongoDB `{status: {$in: ['active', 'pending']}}`
- NULL checks: `IS NULL` and `IS NOT NULL`
- Multiple conditions: AND/OR combinations merged into efficient MongoDB queries
- Nested fields: flattened fields (e.g., `address_city`) converted to dot notation (`address.city`)
- Complex filters: function calls (e.g., `LENGTH(name) > 5`) and column-to-column comparisons (e.g., `age > balance`) pushed down as MongoDB `$expr` queries (see Complex Filter Pushdown)
Examples:
-- Equality (using test data)
SELECT * FROM mongo_test.duckdb_mongo_test.orders WHERE status = 'completed';
-- MongoDB: {status: 'completed'}
-- Range (example with users collection)
SELECT * FROM mongo_test.duckdb_mongo_test.users WHERE age > 28 AND age < 40;
-- MongoDB: {age: {$gt: 28, $lt: 40}}
-- IN
SELECT * FROM mongo_test.duckdb_mongo_test.orders WHERE status IN ('completed', 'pending');
-- MongoDB: {status: {$in: ['completed', 'pending']}}
-- Nested field (example with users collection)
SELECT * FROM mongo_test.duckdb_mongo_test.users WHERE address_city = 'New York';
-- MongoDB: {'address.city': 'New York'}
Note: When using `mongo_scan` directly, you can provide an optional `filter` parameter (e.g., `filter := '{"status": "active"}'`) for MongoDB-specific operators. If both WHERE clauses and `filter` are present, WHERE clauses take precedence.
Note: Filters on array elements (using `UNNEST`) are not pushed down to MongoDB; they are applied in DuckDB after expanding arrays. This means all documents are fetched from MongoDB, then filtered in DuckDB. For large collections, consider using MongoDB's `$elemMatch` operator via the `filter` parameter in `mongo_scan` to filter at the database level, as shown below. See Basic Queries for array filtering examples.
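For example, a server-side array-element filter via $elemMatch (a sketch against the test data; $elemMatch is standard MongoDB query syntax):
SELECT order_id
FROM mongo_scan('mongodb://localhost:27017', 'duckdb_mongo_test', 'orders',
                filter := '{"items": {"$elemMatch": {"product": "Mouse"}}}');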
LIMIT Pushdown
LIMIT is automatically pushed down when directly above the table scan:
SELECT * FROM mongo_test.duckdb_mongo_test.orders LIMIT 10;
-- MongoDB uses: .limit(10)
Note: When `ORDER BY _id` is present with LIMIT, the extension uses TopN pushdown via an aggregation pipeline (see TopN Pushdown). For other ORDER BY columns, sorting is performed in DuckDB after fetching data.
Projection Pushdown
Projection pushdown automatically fetches only the columns used in the SELECT clause, reducing data transfer and serialization overhead.
How it works:
- DuckDB analyzes the query to identify which columns are needed
- Only those columns are included in the MongoDB projection
- Reduces network transfer and BSON parsing overhead
Example:
-- Only fetches order_id, status, and total (not items or other columns)
SELECT order_id, status, total FROM mongo_test.duckdb_mongo_test.orders WHERE status = 'completed';
-- MongoDB projection: {order_id: 1, status: 1, total: 1, _id: 1}
Use EXPLAIN to see which columns are projected:
EXPLAIN SELECT order_id, status FROM mongo_test.duckdb_mongo_test.orders;
The plan shows Projections: order_id, status in the MONGO_SCAN operator.
Filter Prune Optimization
Filter prune works together with projection pushdown to further reduce data transfer by excluding filter columns that are not used in the SELECT clause when filters are pushed down to MongoDB.
How it works:
- Filters are pushed down to MongoDB (server-side filtering)
- Filter columns not in the SELECT clause are excluded from the projection
- Only columns needed for query results are fetched
Example:
-- Filters on status and total, but only selects order_id
SELECT order_id
FROM mongo_test.duckdb_mongo_test.orders
WHERE status = 'completed' AND total > 500;
-- MongoDB: Filters pushed down, but only order_id is fetched
-- status and total are NOT fetched (filtered server-side)
Use EXPLAIN to verify filter prune is working. The plan shows filters that are pushed down but not included in projections:
EXPLAIN SELECT order_id FROM mongo_test.duckdb_mongo_test.orders WHERE status = 'completed' AND total > 500;
The plan shows Projections: order_id and Filters: status='completed' AND total>500.0 in the MONGO_SCAN operator, indicating filters are pushed down but filter columns are not fetched.
Complex Filter Pushdown
Complex filter pushdown enables pushing complex filter expressions (function calls, column-to-column comparisons, etc.) to MongoDB using $expr queries. This allows MongoDB to filter server-side for expressions that cannot be handled by simple TableFilter pushdown.
Supported Complex Filter Types:
- Function calls: `LENGTH(name) > 5`, `CHAR_LENGTH(name) > 5`
- Substring filters: `SUBSTRING(name, 1, 3) = 'Ann'` (constant start/length only)
- Column-to-column comparisons: `age > balance`, `price >= cost`
- Combined simple and complex filters: `age > 25 AND LENGTH(name) > 5`
How it works:
- Simple filters (column-to-constant comparisons like `age > 25`) are handled by TableFilter pushdown, which produces faster MongoDB native queries
- Complex filters (function calls, column-to-column comparisons) are converted to MongoDB `$expr` format and pushed down to MongoDB
- Both simple and complex filters can be combined in a single query, with each handled by the appropriate pushdown mechanism
Examples:
-- Function call filter (complex) - pushed down as $expr
SELECT name, email FROM mongo_test.duckdb_mongo_test.users WHERE LENGTH(name) > 5;
-- MongoDB: {$expr: {$gt: [{$strLenCP: "$name"}, 5]}}
-- Substring filter (complex) - pushed down as $expr
SELECT name FROM mongo_test.duckdb_mongo_test.users WHERE SUBSTRING(name, 1, 3) = 'Ann';
-- MongoDB: {$expr: {$eq: [{$substrCP: ["$name", 0, 3]}, "Ann"]}}
-- Column-to-column comparison (complex) - pushed down as $expr
SELECT name, age, balance FROM mongo_test.duckdb_mongo_test.users WHERE age > balance;
-- MongoDB: {$expr: {$gt: ["$age", "$balance"]}}
-- Combined simple and complex filters
SELECT name FROM mongo_test.duckdb_mongo_test.users WHERE age > 25 AND LENGTH(name) > 5;
-- MongoDB: {age: {$gt: 25}, $expr: {$gt: [{$strLenCP: "$name"}, 5]}}
Use EXPLAIN to verify complex filter pushdown:
EXPLAIN SELECT name FROM mongo_test.duckdb_mongo_test.users WHERE LENGTH(name) > 5;
The plan shows MONGO_SCAN directly (no FILTER operator above it), indicating the complex filter was pushed down to MongoDB.
Note: Complex filter pushdown works alongside simple filter pushdown. Simple filters are always handled by TableFilter pushdown for optimal performance, while complex filters are handled by `$expr` pushdown when needed.
Note: Substring pushdown requires constant start and length arguments (`SUBSTRING(col, start, length)`), with `start >= 1` and `length >= 0`.
Semi-Join IN Filter Pushdown
Semi-join IN filter pushdown enables DuckDB to push IN filters from semi-joins (subqueries) to MongoDB as $in queries. This optimization works automatically when DuckDB's JoinFilterPushdownOptimizer determines that a semi-join's build side is small enough to push as an IN filter.
How it works:
- DuckDB builds a hash table from the subquery (e.g., `SELECT p_partkey FROM part WHERE p_name LIKE 'forest%'`)
- If the hash table is small, DuckDB generates an IN filter with the matching values
- The extension pushes this IN filter to MongoDB as a `$in` query
Example:
-- Query with IN subquery (semi-join)
SELECT ps_suppkey, ps_availqty
FROM partsupp
WHERE ps_partkey IN (
SELECT p_partkey FROM part WHERE p_name LIKE 'forest%'
);
The extension automatically pushes the `ps_partkey IN (...)` filter to MongoDB as:
{"ps_partkey": {"$in": [123, 456, 789, ...]}}
Supported Filter Types:
The extension supports the following DuckDB filter types for pushdown:
| Filter Type | MongoDB Equivalent | Description |
|---|---|---|
| CONSTANT_COMPARISON | $eq, $ne, $lt, $lte, $gt, $gte | Basic comparisons |
| IN_FILTER | $in | Value in list |
| IS_NULL | {field: null} | Null check |
| IS_NOT_NULL | {$ne: null} | Not null check |
| CONJUNCTION_AND | Merged conditions | AND of filters |
| CONJUNCTION_OR | $or / $in | OR of filters |
| STRUCT_EXTRACT | Dot notation (a.b.c) | Nested field access |
| OPTIONAL_FILTER | Unwraps child | Semi-join IN pushdown |
| DYNAMIC_FILTER | Unwraps child | Runtime filter pushdown |
Aggregation Pushdown
Aggregation pushdown enables pushing COUNT, SUM, MIN, MAX, AVG aggregates (with optional GROUP BY) to MongoDB as aggregation pipelines. This reduces data transfer by computing aggregates server-side rather than fetching all documents to DuckDB.
Supported Aggregates:
| Function | MongoDB Pipeline | Notes |
|---|---|---|
| COUNT(*) | $count | Pushed as optimized $count stage |
| COUNT(col) | $group + $sum + $cond | Counts non-null values |
| SUM(col) | $group + $sum | |
| MIN(col) | $group + $min | |
| MAX(col) | $group + $max | |
| AVG(col) | $group + $avg | |
Requirements for Aggregation Pushdown:
- Aggregate functions must use direct column references (no expressions like `SUM(price * quantity)`)
- `GROUP BY` keys must be direct column references
- No `DISTINCT`, `FILTER`, or `ORDER BY` within aggregates
- Single grouping set only (no `GROUPING SETS`, `ROLLUP`, or `CUBE`)
Examples:
-- COUNT(*) pushed down as $count pipeline
SELECT COUNT(*) FROM mongo_test.duckdb_mongo_test.users WHERE active = true;
-- MongoDB pipeline: [{$match: {active: true}}, {$count: "count"}]
-- GROUP BY with aggregates pushed down as $group pipeline
SELECT status, COUNT(*), SUM(total) FROM mongo_test.duckdb_mongo_test.orders GROUP BY status;
-- MongoDB pipeline: [{$group: {_id: {status: "$status"}, __agg0: {$sum: 1}, __agg1: {$sum: "$total"}}}, ...]
Use EXPLAIN to verify aggregation pushdown:
EXPLAIN SELECT COUNT(*) FROM mongo_test.duckdb_mongo_test.users;
The plan shows MONGO_SCAN with scan_method: aggregate and a pipeline containing $count or $group, indicating the aggregation was pushed down to MongoDB.
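Conversely, an aggregate over an expression fails the requirements above, so the aggregation itself runs in DuckDB (a sketch using the test data; only projections and filters are pushed down in this case):
-- SUM over an expression: computed in DuckDB, not pushed to MongoDB
SELECT status, SUM(total * 1.1) AS total_with_surcharge
FROM mongo_test.duckdb_mongo_test.orders
GROUP BY status;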
TopN Pushdown
TopN pushdown enables pushing ORDER BY _id LIMIT N queries to MongoDB as aggregation pipelines with $sort and $limit stages. This is particularly efficient for paginated queries ordered by the indexed _id field.
Requirements for TopN Pushdown:
- Must order by the `_id` column only (MongoDB's indexed primary key)
- Must have a `LIMIT` clause (no offset)
- No intermediate operations between `ORDER BY` and the table scan (projections are allowed)
Example:
-- TopN pushed down as $sort + $limit pipeline
SELECT _id, name FROM mongo_test.duckdb_mongo_test.users ORDER BY _id LIMIT 10;
-- MongoDB pipeline: [{$sort: {_id: 1}}, {$limit: 10}]
Use EXPLAIN to verify TopN pushdown:
EXPLAIN SELECT _id FROM mongo_test.duckdb_mongo_test.users ORDER BY _id LIMIT 5;
The plan shows MONGO_SCAN with scan_method: aggregate and a pipeline containing $sort and $limit, indicating the TopN operation was pushed down to MongoDB.
Note: TopN pushdown is conservative and only applies to `ORDER BY _id` queries. This ensures MongoDB can use its indexed `_id` field efficiently. Other ORDER BY columns are processed in DuckDB after fetching data.
Contributing
Contributions are welcome! Please open an issue or submit a pull request.
Building from Source
Prerequisites:
- CMake 3.5 or higher
- C++ compiler with C++17 support
- vcpkg (for dependency management)
Build Steps:
- Clone the repository with submodules:
git clone --recurse-submodules https://github.com/stephaniewang526/duckdb-mongo.git
cd duckdb-mongo
- Set up vcpkg (if not already done):
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh # On Windows: .\bootstrap-vcpkg.bat
export VCPKG_TOOLCHAIN_PATH=`pwd`/scripts/buildsystems/vcpkg.cmake
- Install dependencies (first time only):
# Install MongoDB C++ driver via vcpkg
../vcpkg/vcpkg install --triplet arm64-osx  # or x64-osx for Intel Mac
- Build the extension:
# Set vcpkg environment
export VCPKG_TOOLCHAIN_PATH=../vcpkg/scripts/buildsystems/vcpkg.cmake
export VCPKG_TARGET_TRIPLET=arm64-osx # or x64-osx for Intel Mac
# Build
make release
Or use the build script:
bash scripts/build.sh
Built binaries:
- `./build/release/duckdb` - DuckDB shell with the extension pre-loaded
- `./build/release/test/unittest` - Test runner
- `./build/release/extension/mongo/mongo.duckdb_extension` - Loadable extension binary
Loading the Extension (Development)
./build/release/duckdb  # Extension auto-loaded
Or load explicitly:
LOAD '/path/to/mongo.duckdb_extension';
Running Tests
# Set up test database
bash test/create-mongo-tables.sh
# Run tests
MONGODB_TEST_DATABASE_AVAILABLE=1 make test_release
License
See LICENSE file for details.
Third-Party Licenses
This project uses the MongoDB C++ Driver (mongocxx and bsoncxx), which is licensed under the Apache License, Version 2.0. See the NOTICE file for third-party license information and attributions.