Generic Text Embedder
A generic C++ application for generating text embeddings using ONNX models.
This tool is intended for testing only. Because it loads the model on every run, a long-running service that keeps the model in memory is a better choice for regular use.
Caution
The current results are incorrect. I am only using this repository for learning purposes.
Getting Models
You can download ONNX embedding models from Hugging Face using the huggingface-cli tool.
Install Hugging Face CLI
pip install -U "huggingface_hub[cli]"
See the official documentation for more details.
Download a Model
For example, to download the nomic-embed-text-v1 model:
huggingface-cli download Xenova/nomic-embed-text-v1
This will download the model to your local cache. You can then set the EMBEDDING_MODEL_PATH environment variable to point to the cached model:
export EMBEDDING_MODEL_PATH=$HOME/.cache/huggingface/hub/models--Xenova--nomic-embed-text-v1/snapshots/0b85f78966a655763985a595b770f221374dda10
Note: The exact snapshot hash (the long string at the end) may vary depending on the model version.
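Because the snapshot hash changes between model versions, it can be looked up instead of hard-coded. The following is a minimal sketch, not part of the embedder: it assumes the usual Hugging Face cache layout, in which the file refs/<revision> inside the model's cache directory holds the hash of the matching snapshots/<hash> directory. The function name find_snapshot_dir is hypothetical.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Resolve the snapshot directory for one cached model.
// repo_dir is the model's cache directory, for example
// ~/.cache/huggingface/hub/models--Xenova--nomic-embed-text-v1
// Assumption: refs/<revision> contains the snapshot hash as plain text.
std::optional<fs::path> find_snapshot_dir(const fs::path& repo_dir,
                                          const std::string& revision = "main") {
    assert(!repo_dir.empty() && "repo_dir must not be empty");

    // Read the hash that the revision currently points to.
    std::ifstream ref(repo_dir / "refs" / revision);
    std::string hash;
    if (!(ref >> hash)) {
        return std::nullopt;  // no such revision, or the file is empty
    }

    // The snapshot itself lives under snapshots/<hash>.
    const fs::path snapshot = repo_dir / "snapshots" / hash;
    std::error_code ec;
    if (!fs::is_directory(snapshot, ec)) {
        return std::nullopt;
    }
    return snapshot;
}
```

The returned path is what EMBEDDING_MODEL_PATH would point to. If the cache layout differs on your machine, listing the snapshots directory by hand remains the reliable fallback.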
Building
Prerequisites:
- CMake 3.12+
- ONNX Runtime libraries
- C++17 compatible compiler
cmake .
make
Usage
The embedder supports single text processing and, for better performance with multiple texts, batch processing:
Single Text Processing
The embedder can be used in two ways for single texts:
Method 1: Specify the model path as an argument (traditional)
./embedder <model_path> <input_text> [--verbose]
Method 2: Use an environment variable (new)
export EMBEDDING_MODEL_PATH=/path/to/model
./embedder <input_text> [--verbose]
Batch Processing (NEW)
For better performance when processing multiple texts, use batch mode. Important: Batch mode now uses null bytes (\0) as the default delimiter to safely handle texts containing newlines.
# Batch processing with null delimiter (RECOMMENDED - safe for any text content)
printf "Text 1\0Text with\nnewlines\0Text 3\0" | ./embedder --batch [--verbose]
# Batch processing with custom delimiter
echo "Text 1|||Text 2|||Text 3" | ./embedder --batch --delimiter="|||" [--verbose]
# Batch processing with explicit model path
printf "Text 1\0Text 2\0" | ./embedder <model_path> --batch [--verbose]
# From file with null-delimited content
cat null_delimited_texts.txt | ./embedder --batch [--verbose]
# UNSAFE: Line-based (only use if texts don't contain newlines)
echo -e "Text 1\nText 2\nText 3" | ./embedder --batch --delimiter="\n" [--verbose]
Why null delimiter? Text content often contains newlines, tabs, and other whitespace. Null bytes (\0) are the safest delimiter because they rarely appear in regular text content.
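To illustrate how delimiter-based batch input can be split, here is a minimal sketch. It is not the embedder's actual code: the function name split_texts is hypothetical, and skipping empty pieces (such as the one after a trailing delimiter) is an assumption about the desired behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split `input` on every occurrence of `delimiter`.
// Works for "\0" as well, because std::string carries its own length.
// Assumption: empty pieces are skipped; the real embedder may differ.
std::vector<std::string> split_texts(const std::string& input,
                                     const std::string& delimiter) {
    assert(!delimiter.empty() && "delimiter must not be empty");

    std::vector<std::string> texts;
    std::size_t start = 0;
    while (start <= input.size()) {
        // Find the next delimiter, or treat the end of input as one.
        std::size_t end = input.find(delimiter, start);
        if (end == std::string::npos) {
            end = input.size();
        }
        if (end > start) {
            texts.emplace_back(input, start, end - start);
        }
        start = end + delimiter.size();
    }
    return texts;
}
```

With a null delimiter, the input "Text 1\0Text with\nnewlines\0" splits into two texts and the embedded newline stays inside the second one. Note that echo appends a newline, so with a custom delimiter such as ||| this sketch would leave that newline at the end of the last text.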
Arguments
- model_path: Path to the directory containing the model and vocabulary files (optional if EMBEDDING_MODEL_PATH is set)
- input_text: Text to generate an embedding for (wrap in quotes if it contains spaces); single mode only
- --batch: Enable batch processing mode (reads texts from stdin, split on the delimiter)
- --delimiter=DELIM: Set a custom delimiter for batch mode (default: \0, the null byte)
- --verbose: Optional flag to enable verbose output (shows model info and embedding dimension)
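The way an explicit model_path takes precedence over EMBEDDING_MODEL_PATH can be sketched as follows. This is an illustration under assumptions, not the embedder's parser: it assumes single mode accepts [model_path] <input_text> and batch mode accepts [model_path], and the names resolve_model_path and check_examples are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Pick the model directory for one invocation.
// `positionals` are the non-flag arguments in order; `env_value` is the
// value of EMBEDDING_MODEL_PATH, or nullptr when it is unset.
std::optional<std::string> resolve_model_path(
    const std::vector<std::string>& positionals, bool batch_mode,
    const char* env_value) {
    // Single mode expects one text argument, batch mode reads stdin.
    const std::size_t text_args = batch_mode ? 0 : 1;

    if (positionals.size() == text_args + 1) {
        return positionals.front();  // explicit path wins over the variable
    }
    if (positionals.size() == text_args && env_value != nullptr &&
        *env_value != '\0') {
        return std::string(env_value);  // fall back to the environment
    }
    return std::nullopt;  // usage error: no model path available
}

// The explicit-path and fallback cases, written as checks.
void check_examples() {
    assert(resolve_model_path({"./specific_model", "Hello world"}, false,
                              "./default_model") == "./specific_model");
    assert(resolve_model_path({"Hello world"}, false, "./default_model") ==
           "./default_model");
    assert(!resolve_model_path({"Hello world"}, false, nullptr).has_value());
}
```

In a real program the third argument would typically be std::getenv("EMBEDDING_MODEL_PATH") from <cstdlib>; passing it in explicitly keeps the function easy to test.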
Examples
# Traditional usage with explicit model path
./embedder ./model_directory "Hello world"
# Using environment variable
export EMBEDDING_MODEL_PATH=./model_directory
./embedder "Hello world"
# With verbose output
export EMBEDDING_MODEL_PATH=./model_directory
./embedder "Hello world" --verbose
# Batch processing examples (SAFE - handles texts with newlines)
export EMBEDDING_MODEL_PATH=./model_directory
# Process texts using null delimiter (recommended)
printf "Hello world\0Text with\nnewlines\0Third text\0" | ./embedder --batch
# Process texts using custom delimiter
echo "Text1|||Text2|||Text3" | ./embedder --batch --delimiter="|||"
# From file with null-delimited content
printf "First text\0Second text\nwith newlines\0" > texts.dat
cat texts.dat | ./embedder --batch --verbose
# UNSAFE: Line-based (only if no newlines in text content)
echo -e "Simple1\nSimple2\nSimple3" | ./embedder --batch --delimiter="\n"
# Batch with explicit model path
printf "Text1\0Text2\0" | ./embedder ./model_directory --batch
# Mixing approaches (environment variable as fallback)
export EMBEDDING_MODEL_PATH=./default_model
./embedder ./specific_model "Hello world" # Uses ./specific_model
./embedder "Hello world" # Uses ./default_model
Model Directory Structure
The embedder supports two directory structures:
Option 1: Direct model placement
model_directory/
├── model.onnx
└── vocab.txt
Option 2: ONNX subdirectory
model_directory/
├── onnx/
│   └── model.onnx
└── vocab.txt
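Locating model.onnx in either layout could look roughly like the sketch below. The function names are hypothetical, and checking the direct placement before the onnx subdirectory is an assumption; the embedder may use a different order.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <optional>
#include <system_error>

namespace fs = std::filesystem;

// True if `p` is a regular file that can be opened for reading.
bool is_readable_file(const fs::path& p) {
    std::error_code ec;
    return fs::is_regular_file(p, ec) && std::ifstream(p).good();
}

// Look for model.onnx in the two supported layouts:
//   Option 1: <model_dir>/model.onnx
//   Option 2: <model_dir>/onnx/model.onnx
// Assumption: Option 1 is preferred when both files exist.
std::optional<fs::path> find_model_file(const fs::path& model_dir) {
    assert(!model_dir.empty() && "model_dir must not be empty");

    for (const fs::path& candidate :
         {model_dir / "model.onnx", model_dir / "onnx" / "model.onnx"}) {
        if (is_readable_file(candidate)) {
            return candidate;
        }
    }
    return std::nullopt;  // neither layout matched
}
```

In both layouts vocab.txt sits directly in model_directory, so the same is_readable_file check on model_dir / "vocab.txt" would cover the vocabulary file.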
Output
Single Text Mode
Without --verbose: Outputs the full embedding as space-separated floating-point numbers.
With --verbose: Additionally shows:
- Model loading confirmation
- Input/output node information
- Vocabulary size
- Embedding dimension
Batch Processing Mode
Without --verbose: Outputs one embedding per line, each as space-separated floating-point numbers.
With --verbose: Additionally shows:
- Batch processing information
- Number of texts processed
- Output tensor shape
- Model and vocabulary info
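For consumers of this output, the following sketch reads the non-verbose format back into vectors. It assumes exactly the format described above (space-separated floats, one embedding per line) and no extra lines; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <istream>
#include <optional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse one output line ("0.12 -0.5 ...") into floats.
// Returns std::nullopt if any token is not a complete number.
std::optional<std::vector<float>> parse_embedding(const std::string& line) {
    std::istringstream in(line);
    std::vector<float> values;
    std::string token;
    while (in >> token) {
        try {
            std::size_t used = 0;
            const float value = std::stof(token, &used);
            if (used != token.size()) {
                return std::nullopt;  // trailing junk such as "1.0abc"
            }
            values.push_back(value);
        } catch (const std::exception&) {
            return std::nullopt;  // not a number, or outside float range
        }
    }
    return values;
}

// Parse batch output: one embedding per line.
// Returns std::nullopt on the first blank or malformed line.
std::optional<std::vector<std::vector<float>>> parse_batch_output(
    std::istream& out) {
    std::vector<std::vector<float>> embeddings;
    std::string line;
    while (std::getline(out, line)) {
        auto embedding = parse_embedding(line);
        if (!embedding || embedding->empty()) {
            return std::nullopt;
        }
        // One model produces vectors of one fixed dimension.
        assert(embeddings.empty() ||
               embedding->size() == embeddings.front().size());
        embeddings.push_back(std::move(*embedding));
    }
    return embeddings;
}
```

Failing on the first bad line, rather than skipping it, keeps the parsed embeddings aligned with the order of the input texts.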
Performance Benefits
Batch processing provides significant performance improvements when processing multiple texts:
- Model Loading: The model is loaded only once for the entire batch
- Memory Efficiency: Better GPU/CPU memory utilization
- Parallel Processing: Takes advantage of vectorized operations
- Reduced Overhead: Eliminates per-text setup costs
For example, processing 100 texts individually might take 10 seconds, while batch processing the same 100 texts could take only 2-3 seconds.
Important: Handling Texts with Newlines
⚠️ Problem: Line-based batch input uses newlines (\n) as delimiters, which breaks when processing texts that contain newlines (common in real-world text data).
✅ Solution: This implementation now uses null bytes (\0) as the default delimiter, which safely handles texts containing newlines, tabs, and other whitespace characters.
Examples of problematic texts (that would break with line-based parsing):
- Multi-paragraph text
- Code snippets
- Formatted text with line breaks
- Text with embedded newlines
Safe usage:
# ✅ SAFE: Null-delimited (recommended)
printf "Text 1\0Text with\nnewlines\0Text 3\0" | ./embedder --batch
# ✅ SAFE: Custom delimiter
echo "Text1|||Text2|||Text3" | ./embedder --batch --delimiter="|||"
# ⚠️ UNSAFE: Line-based (only for simple texts without newlines)
echo -e "Text1\nText2\nText3" | ./embedder --batch --delimiter="\n"
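When the batch input is produced by another program rather than by printf, the same safety rule can be enforced in code. This is a minimal sketch with a hypothetical function name, build_batch_input: it appends the delimiter after every text and refuses any text that already contains the delimiter, since such a text would be split in the wrong place.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Join texts into one stdin payload, each text followed by `delimiter`.
// The default delimiter is a single null byte, matching --batch.
// Returns std::nullopt if any text contains the delimiter itself.
std::optional<std::string> build_batch_input(
    const std::vector<std::string>& texts,
    const std::string& delimiter = std::string(1, '\0')) {
    assert(!delimiter.empty() && "delimiter must not be empty");

    std::string payload;
    for (const std::string& text : texts) {
        if (text.find(delimiter) != std::string::npos) {
            return std::nullopt;  // this text would be split incorrectly
        }
        payload += text;
        payload += delimiter;
    }
    return payload;
}
```

The payload can be written to the embedder's stdin as is, for example with std::cout.write(payload.data(), payload.size()). With "\n" as the delimiter, any multi-line text is rejected, which is the failure mode the line-based form runs into.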