Generic Text Embedder

A generic C++ application for generating text embeddings using ONNX models.

This tool is meant for testing only. Because it loads the model on every invocation, a long-running service that loads the model once is a better choice for real workloads.

Caution

The current results are incorrect. I'm using this repository for learning purposes only.

Getting Models

You can download ONNX embedding models from Hugging Face using the huggingface-cli tool.

Install Hugging Face CLI

pip install -U "huggingface_hub[cli]"

See the official documentation for more details.

Download a Model

For example, to download the nomic-embed-text-v1 model:

huggingface-cli download Xenova/nomic-embed-text-v1

This downloads the model to your local Hugging Face cache. You can then set the EMBEDDING_MODEL_PATH environment variable to point to the cached model:

export EMBEDDING_MODEL_PATH=$HOME/.cache/huggingface/hub/models--Xenova--nomic-embed-text-v1/snapshots/0b85f78966a655763985a595b770f221374dda10

Note: The exact snapshot hash (the long string at the end) may vary depending on the model version.
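
Because the snapshot hash varies, it can be convenient to discover the snapshot directory rather than hard-code it. The following is a minimal sketch, not part of the embedder itself: the function name find_snapshot_dir is hypothetical, and it assumes the cache layout shown above (a snapshots directory containing one subdirectory per revision). If several snapshots exist, it returns whichever one the filesystem lists first.

```cpp
#include <filesystem>
#include <optional>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: returns a snapshot directory found under
// <model_cache_dir>/snapshots, or std::nullopt if there is none.
// model_cache_dir is expected to look like
//   ~/.cache/huggingface/hub/models--Xenova--nomic-embed-text-v1
std::optional<fs::path> find_snapshot_dir(const fs::path& model_cache_dir) {
    const fs::path snapshots = model_cache_dir / "snapshots";
    std::error_code ec;
    if (!fs::is_directory(snapshots, ec)) {
        return std::nullopt;  // no snapshots directory at all
    }
    for (const auto& entry : fs::directory_iterator(snapshots, ec)) {
        if (entry.is_directory()) {
            return entry.path();  // iteration order is unspecified
        }
    }
    return std::nullopt;  // snapshots directory exists but is empty
}
```

The returned path could then be used as the value of EMBEDDING_MODEL_PATH.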

Building

Prerequisites:

CMake 3.12+
ONNX Runtime libraries
C++17 compatible compiler

cmake .
make

Usage

The embedder supports both single-text processing and batch processing. Batch mode performs better when embedding multiple texts:

Single Text Processing

The embedder can be used in two ways for single texts:

Method 1: Specify model path as argument (traditional)

./embedder <model_path> <input_text> [--verbose]

Method 2: Use environment variable (new)

export EMBEDDING_MODEL_PATH=/path/to/model
./embedder <input_text> [--verbose]

Batch Processing (NEW)

For better performance when processing multiple texts, use batch mode. Important: Batch mode now uses null bytes (\0) as the default delimiter to safely handle texts containing newlines.

# Batch processing with null delimiter (RECOMMENDED - safe for any text content)
printf "Text 1\0Text with\nnewlines\0Text 3\0" | ./embedder --batch [--verbose]

# Batch processing with custom delimiter
echo "Text 1|||Text 2|||Text 3" | ./embedder --batch --delimiter="|||" [--verbose]

# Batch processing with explicit model path
printf "Text 1\0Text 2\0" | ./embedder <model_path> --batch [--verbose]

# From file with null-delimited content
cat null_delimited_texts.txt | ./embedder --batch [--verbose]

# UNSAFE: Line-based (only use if texts don't contain newlines)
echo -e "Text 1\nText 2\nText 3" | ./embedder --batch --delimiter="\n" [--verbose]

Why null delimiter? Text content often contains newlines, tabs, and other whitespace. Null bytes (\0) are the safest delimiter because they almost never appear in regular text content.
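
To illustrate how delimiter-based splitting keeps embedded newlines intact, here is a minimal sketch. It is not taken from the embedder's source: the function name split_texts is hypothetical, and dropping empty segments (such as the one after a trailing delimiter) is an assumption about the desired behavior.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `input` on every occurrence of `delimiter`.
// Empty segments are skipped (an assumption), so a trailing delimiter
// does not produce an extra empty text.
std::vector<std::string> split_texts(const std::string& input,
                                     const std::string& delimiter) {
    std::vector<std::string> texts;
    if (delimiter.empty()) {
        // Nothing to split on: treat the whole input as one text.
        if (!input.empty()) texts.push_back(input);
        return texts;
    }
    std::size_t start = 0;
    while (start <= input.size()) {
        std::size_t end = input.find(delimiter, start);
        if (end == std::string::npos) end = input.size();
        if (end > start) {
            texts.push_back(input.substr(start, end - start));
        }
        start = end + delimiter.size();
    }
    return texts;
}
```

Note that std::string can hold null bytes, but the delimiter must be constructed with an explicit length, for example std::string(1, '\0'), because a C string literal would stop at the first null byte.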

Arguments

model_path: Path to directory containing the model and vocabulary files (optional if EMBEDDING_MODEL_PATH is set)
input_text: Text to generate embedding for (wrap in quotes if it contains spaces) - single mode only
--batch: Enable batch processing mode (reads texts from stdin using delimiter)
--delimiter=DELIM: Set custom delimiter for batch mode (default: \0 null byte)
--verbose: Optional flag to enable verbose output (shows model info and embedding dimension)
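
As a rough illustration of how the flags above could be separated from the positional arguments, consider the sketch below. The Options struct and parse_args function are hypothetical names, not the embedder's actual code; the sketch only mirrors the flags and the null-byte default documented in this section.

```cpp
#include <string>
#include <vector>

// Hypothetical container for the documented command-line options.
struct Options {
    bool batch = false;
    bool verbose = false;
    std::string delimiter = std::string(1, '\0');  // documented default: null byte
    std::vector<std::string> positionals;          // model_path and/or input_text
};

// Hypothetical parser: `args` holds the arguments after the program name.
Options parse_args(const std::vector<std::string>& args) {
    Options opts;
    const std::string delimiter_prefix = "--delimiter=";
    for (const std::string& arg : args) {
        if (arg == "--batch") {
            opts.batch = true;
        } else if (arg == "--verbose") {
            opts.verbose = true;
        } else if (arg.compare(0, delimiter_prefix.size(), delimiter_prefix) == 0) {
            // Everything after "--delimiter=" is the delimiter, taken literally.
            opts.delimiter = arg.substr(delimiter_prefix.size());
        } else {
            opts.positionals.push_back(arg);
        }
    }
    return opts;
}
```

This sketch leaves the interpretation of the positional arguments (model path versus input text) to the caller, since that depends on the mode and on whether EMBEDDING_MODEL_PATH is set.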

Examples

# Traditional usage with explicit model path
./embedder ./model_directory "Hello world"

# Using environment variable
export EMBEDDING_MODEL_PATH=./model_directory
./embedder "Hello world"

# With verbose output
export EMBEDDING_MODEL_PATH=./model_directory
./embedder "Hello world" --verbose

# Batch processing examples (SAFE - handles texts with newlines)
export EMBEDDING_MODEL_PATH=./model_directory

# Process texts using null delimiter (recommended)
printf "Hello world\0Text with\nnewlines\0Third text\0" | ./embedder --batch

# Process texts using custom delimiter
echo "Text1|||Text2|||Text3" | ./embedder --batch --delimiter="|||"

# From file with null-delimited content
printf "First text\0Second text\nwith newlines\0" > texts.dat
cat texts.dat | ./embedder --batch --verbose

# UNSAFE: Line-based (only if no newlines in text content)
echo -e "Simple1\nSimple2\nSimple3" | ./embedder --batch --delimiter="\n"

# Batch with explicit model path
printf "Text1\0Text2\0" | ./embedder ./model_directory --batch

# Mixing approaches (environment variable as fallback)
export EMBEDDING_MODEL_PATH=./default_model
./embedder ./specific_model "Hello world"  # Uses ./specific_model
./embedder "Hello world"                   # Uses ./default_model
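
The fallback rule in the last example (an explicit model path wins, otherwise EMBEDDING_MODEL_PATH is used) can be sketched as follows. The function name resolve_model_path is hypothetical, and treating an empty value as "not set" is an assumption rather than documented behavior.

```cpp
#include <optional>
#include <string>

// Hypothetical helper: picks the model path to use.
// `cli_path` is the model path given on the command line, if any.
// `env_value` is the raw result of std::getenv("EMBEDDING_MODEL_PATH"),
// which is nullptr when the variable is not set.
std::optional<std::string> resolve_model_path(
        const std::optional<std::string>& cli_path,
        const char* env_value) {
    if (cli_path && !cli_path->empty()) {
        return cli_path;  // explicit argument takes precedence
    }
    if (env_value != nullptr && env_value[0] != '\0') {
        return std::string(env_value);  // fall back to the environment
    }
    return std::nullopt;  // neither source provides a path
}
```

A caller would typically pass std::getenv("EMBEDDING_MODEL_PATH") (from <cstdlib>) as the second argument and report an error when the result is empty.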

Model Directory Structure

The embedder supports two directory structures:

Option 1: Direct model placement

model_directory/
├── model.onnx
└── vocab.txt

Option 2: ONNX subdirectory

model_directory/
├── onnx/
│   └── model.onnx
└── vocab.txt

Output

Single Text Mode

Without --verbose: Outputs the full embedding as space-separated floating-point numbers.

With --verbose: Additionally shows:

Model loading confirmation
Input/output node information
Vocabulary size
Embedding dimension

Batch Processing Mode

Without --verbose: Outputs one embedding per line, each as space-separated floating-point numbers.

With --verbose: Additionally shows:

Batch processing information
Number of texts processed
Output tensor shape
Model and vocabulary info
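
A program that consumes the non-verbose output can read it line by line, treating each line as one embedding. The sketch below assumes exactly the format described above (space-separated floating-point numbers, one embedding per line, no --verbose output mixed in); parse_embeddings is a hypothetical name.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: parses the embedder's non-verbose output.
// Each non-empty line becomes one vector of floats.
std::vector<std::vector<float>> parse_embeddings(std::istream& in) {
    std::vector<std::vector<float>> embeddings;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<float> values;
        float value = 0.0f;
        while (fields >> value) {
            values.push_back(value);
        }
        if (!values.empty()) {
            embeddings.push_back(std::move(values));
        }
    }
    return embeddings;
}
```

In single text mode the result holds one vector; in batch mode it holds one vector per input text, in the order the lines were printed.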

Performance Benefits

Batch processing provides significant performance improvements when processing multiple texts:

Model Loading: The model is loaded only once for the entire batch
Memory Efficiency: Better GPU/CPU memory utilization
Parallel Processing: Takes advantage of vectorized operations
Reduced Overhead: Eliminates per-text setup costs

As a purely illustrative example (not a measured benchmark), processing 100 texts individually might take 10 seconds, while batch processing the same 100 texts could take only 2-3 seconds.

Important: Handling Texts with Newlines

⚠️ Critical Issue: The original implementation used newlines (\n) as delimiters, which breaks on texts that themselves contain newlines, a common occurrence in real-world text data.

✅ Solution: This implementation now uses null bytes (\0) as the default delimiter, which safely handles texts containing newlines, tabs, and other whitespace characters.

Examples of problematic texts (that would break with line-based parsing):

Multi-paragraph text
Code snippets
Formatted text with line breaks
Text with embedded newlines

Safe usage:

# ✅ SAFE: Null-delimited (recommended)
printf "Text 1\0Text with\nnewlines\0Text 3\0" | ./embedder --batch

# ✅ SAFE: Custom delimiter
echo "Text1|||Text2|||Text3" | ./embedder --batch --delimiter="|||"

# ⚠️ UNSAFE: Line-based (only for simple texts without newlines)
echo -e "Text1\nText2\nText3" | ./embedder --batch --delimiter="\n"

phimage/embedder