📧 RAG Email Search System

🔍 Semantic search for your email inbox: find exactly what you're looking for using RAG (Retrieval-Augmented Generation) techniques

Transform your Gmail archive into a searchable knowledge base using advanced NLP embeddings and vector similarity search. Ask natural language questions like "What was my latest eBay order?" or "When was my last flight booking?" and get instant, relevant results.

✨ Features

🎯 Semantic Search: Find emails based on meaning, not just keywords
🌍 Multilingual Support: Works with multiple languages using multilingual-e5-large-instruct
⚡ Fast Retrieval: Efficient similarity search using FAISS vectorstore
🔄 Flexible Architecture: Support for both chunked and full-document embeddings
📊 Multiple Approaches: Compare direct cosine similarity vs. vectorstore implementations
🎨 Metadata Enrichment: Preserve sender, subject, date, and labels for enhanced context
🧹 Robust Preprocessing: Handle multipart emails, HTML content, and various encodings
🔁 Cross-Encoder Reranking: Optional reranking for improved result quality
💾 Caching System: Save embeddings for fast subsequent queries

🚀 Quick Start

Prerequisites

Python 3.10+
Gmail MBOX export file (for example, exported via Google Takeout)
8GB+ RAM recommended

Installation

Clone the repository

git clone https://github.com/nsourlos/rag_gmail.git
cd rag_gmail

Create virtual environment using UV (recommended)

# For macOS/Linux
uv venv RAG_email --python 3.10
source RAG_email/bin/activate

# For Windows
uv venv RAG_email --python 3.10
.\RAG_email\Scripts\activate

Install dependencies

uv pip install sentence-transformers==5.1.1 ipykernel langchain==0.3.27 faiss-cpu==1.12.0

# Optional: For GPU support on Windows
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Set up Jupyter kernel

# macOS/Linux
RAG_email/bin/python -m ipykernel install --user --name=RAG_email --display-name "RAG_email"

# Windows
RAG_email\Scripts\python -m ipykernel install --user --name=RAG_email --display-name "RAG_email"

Place your Gmail MBOX file

# Export your Gmail and place the file as:
# ./gmail.mbox
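
Before running the notebook, it can help to confirm that the export is readable. The following is a minimal sketch using Python's standard mailbox module; the helper name count_messages is hypothetical and is not part of the notebook:

```python
import mailbox
from pathlib import Path


def count_messages(path: str) -> int:
    """Return the number of messages in an MBOX file (hypothetical helper)."""
    if not Path(path).is_file():
        raise FileNotFoundError(f"MBOX file not found: {path}")
    # create=False avoids silently creating an empty mailbox on a wrong path
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()
```

Calling count_messages("gmail.mbox") from the project root should return the number of exported emails; a FileNotFoundError means the file is not where the notebook expects it.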

📖 Usage

Basic Example

# Load your emails
mbox = load_mbox('gmail.mbox')

# Process and create embeddings
format_messages_for_embeddings(mbox)

# Load the model and embeddings
doc_embeddings, model = load_embeddings()

# Ask natural language questions
queries = [
    'What was my latest eBay order?',
    'When was my last flight booking?',
    'What was my most recent Amazon purchase?'
]

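# Get results (documents_cleaned is assumed to be the list of cleaned email
# texts created by the preprocessing step in the notebook)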
task = 'Given an email, retrieve relevant passages that answer the query'
similarity_scores = get_similarity_scores(queries, task, model, doc_embeddings, top_k=200)
print_results(similarity_scores, queries, documents_cleaned, top_k=200)

Advanced: Using Chunking

For better retrieval on long emails (the embedding model truncates input beyond 512 tokens):

# Enable chunking
chunked = True
chunk_size = 800
chunk_overlap = 160

# Process with chunks
format_messages_for_embeddings(mbox, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
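
To illustrate what chunk_size and chunk_overlap control, here is a simplified character-based sliding window. This is only a sketch: chunk_text is a hypothetical helper, and the notebook may split text differently (for example, with a LangChain text splitter that respects paragraph boundaries).

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 160) -> list[str]:
    """Split text into character windows; neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the defaults above, each window advances by 640 characters, so the last 160 characters of one chunk reappear at the start of the next. The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.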

Advanced: Using FAISS Vectorstore

For scalability with large email archives:

# Build vectorstore index
index = build_vectorstore_index(doc_embeddings)

# Search using vectorstore
similarity_scores, retrieved_indices = get_similarity_scores(
    queries, task, model, doc_embeddings, index=index, top_k=200
)

🏗️ Architecture

┌─────────────────┐
│  Gmail MBOX     │
│  Export File    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Preprocessing  │
│  • Metadata     │
│  • Body Extract │
│  • Cleaning     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Embedding      │
│  multilingual-  │
│  e5-large       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Store   │
│  (FAISS/Direct) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Query & Search │
│  Top-K Results  │
└─────────────────┘

🔬 Technical Details

Embedding Model

Model: intfloat/multilingual-e5-large-instruct
Dimensions: 1024
Context Length: 512 tokens
Languages: 100+ languages supported
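
According to the model card for multilingual-e5-large-instruct, queries are prefixed with a one-line task instruction, while documents are embedded without any prefix. The helper below reproduces that query format; how the notebook's get_similarity_scores applies the task string internally is an assumption here.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    """Build an instruction-prefixed query string for E5 instruct models."""
    return f"Instruct: {task_description}\nQuery: {query}"
```

For example, combining the task string from the Basic Example with the query 'What was my latest eBay order?' yields a two-line string that starts with "Instruct: " and ends with "Query: What was my latest eBay order?". Only queries get this prefix; email texts are embedded as they are.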

Search Methods

Direct Cosine Similarity - Fast GPU-accelerated search for smaller datasets
FAISS Vectorstore - Scalable CPU-based search with IndexFlatIP
Cross-Encoder Reranking - Optional refinement using ms-marco-MiniLM
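
The first two methods rank documents by the same quantity. FAISS IndexFlatIP performs exact inner-product search, and on L2-normalized vectors the inner product equals cosine similarity. A minimal pure-Python sketch of that ranking follows; normalize and top_k_cosine are illustrative names rather than notebook functions, and the sketch assumes the embeddings are normalized before being indexed.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_cosine(query: list[float], docs: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (doc_index, cosine_score) pairs for the k most similar documents."""
    q = normalize(query)
    scored = []
    for idx, doc in enumerate(docs):
        d = normalize(doc)
        # Inner product of unit vectors equals their cosine similarity
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]
```

This brute-force loop is what an exact flat index does conceptually. In practice, the notebook's GPU matrix multiplication or FAISS performs the same comparison far faster on real 1024-dimensional embeddings.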

Supported Email Features

✅ Multipart emails (text/plain, text/html)
✅ Multiple encodings (UTF-8, Latin-1, etc.)
✅ Gmail labels and threads
✅ HTML content cleaning
✅ Email metadata (sender, date, subject)
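
As a rough illustration of how multipart handling can work, the sketch below uses only Python's standard email and html.parser modules. It prefers the text/plain part and falls back to stripped HTML. The names extract_body and html_to_text are hypothetical, and the notebook's own preprocessing is likely more thorough (this version, for instance, does not drop script or style content).

```python
from email.message import Message
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed tags
    return " ".join(" ".join(parser.parts).split())


def extract_body(msg: Message) -> str:
    """Return the plain-text body if present, else text derived from HTML."""
    plain, html = None, None
    for part in msg.walk():
        if part.is_multipart() or part.get_content_disposition() == "attachment":
            continue
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset label in the header
            text = payload.decode("utf-8", errors="replace")
        if ctype == "text/plain" and plain is None:
            plain = text
        elif ctype == "text/html" and html is None:
            html = text
    if plain is not None:
        return plain.strip()
    return html_to_text(html) if html is not None else ""
```

Messages read from the MBOX file with the standard mailbox module are Message objects, so they can be passed to extract_body directly.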

🎓 Use Cases

📦 Order Tracking: "Show me all my Amazon orders from last year"
✈️ Travel Planning: "When was my flight to Paris?"
💼 Work History: "Find emails about the Q3 project proposal"
🔐 Account Recovery: "What's my Netflix subscription email?"
📅 Event Recall: "When did I receive the wedding invitation?"

🛠️ Customization

Adjust Search Parameters

# More results for better recall
top_k = 500

# Enable chunking for long emails
chunk_size = 800
chunk_overlap = 160

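# Use GPU acceleration when available (CUDA first, then Apple MPS), else CPU
import torch
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"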

Use Different Embedding Models

The notebook includes experiments with:

google/embeddinggemma-300m
Qwen/Qwen3-Embedding-0.6B
Cross-encoders for reranking

📝 Project Structure

rag_gmail/
├── RAG_email.ipynb               # Main notebook
├── README.md                     # This file
└── gmail.mbox                    # Your email export (not included)

🐛 Troubleshooting

Issue: Out of Memory

Solution: Enable chunking or use CPU device instead of GPU

Issue: Encoding Errors

Solution: The notebook includes robust encoding handlers for Greek, UTF-8, and other character sets

Issue: Slow Search

Solution: Use FAISS vectorstore or reduce top_k parameter

Issue: MBOX File Not Found

Solution: Export your Gmail as an MBOX file (for example, via Google Takeout) and place it in the project root as gmail.mbox

🔮 Future Enhancements

Web interface for easier querying
Integration with LLMs for answer generation
Support for attachments and images
Real-time email monitoring
Advanced filtering by date ranges and senders
Export search results to CSV
Multi-account support

📚 References & Acknowledgments

Sentence Transformers - Embedding framework
FAISS - Efficient similarity search
LangChain - RAG framework
multilingual-e5-large-instruct - Embedding model

🤝 Contributing

Contributions are welcome! Feel free to:

🐛 Report bugs
💡 Suggest features
🔧 Submit pull requests

⭐ Star History

If you find this project useful, please consider giving it a star! ⭐

📧 Contact

For questions or feedback, please open an issue on GitHub.

nsourlos/RAG_gmail