nsourlos/RAG_gmail
Intelligent semantic search for email archives using RAG, multilingual embeddings, and FAISS. Ask natural questions, get instant answers from your Gmail inbox.
RAG Email Search System
Intelligent semantic search for your email inbox - Find exactly what you're looking for using state-of-the-art RAG (Retrieval-Augmented Generation) technology
Transform your Gmail archive into a searchable knowledge base using advanced NLP embeddings and vector similarity search. Ask natural language questions like "What was my latest eBay order?" or "When was my last flight booking?" and get instant, relevant results.
Features
- Semantic Search: Find emails based on meaning, not just keywords
- Multilingual Support: Works with multiple languages using multilingual-e5-large-instruct
- Fast Retrieval: Efficient similarity search using FAISS vectorstore
- Flexible Architecture: Support for both chunked and full-document embeddings
- Multiple Approaches: Compare direct cosine similarity vs. vectorstore implementations
- Metadata Enrichment: Preserve sender, subject, date, and labels for enhanced context
- Robust Preprocessing: Handle multipart emails, HTML content, and various encodings
- Cross-Encoder Reranking: Optional reranking for improved result quality
- Caching System: Save embeddings for fast subsequent queries
Quick Start
Prerequisites
- Python 3.10+
- Gmail MBOX export file (How to export Gmail)
- 8GB+ RAM recommended
Installation
- Clone the repository
git clone https://github.com/nsourlos/rag_gmail.git
cd rag_gmail
- Create a virtual environment using UV (recommended)
# For macOS/Linux
uv venv RAG_email --python 3.10
source RAG_email/bin/activate
# For Windows
uv venv RAG_email --python 3.10
.\RAG_email\Scripts\activate
- Install dependencies
uv pip install sentence-transformers==5.1.1 ipykernel langchain==0.3.27 faiss-cpu==1.12.0
# Optional: For GPU support on Windows
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
- Set up Jupyter kernel
# macOS/Linux
RAG_email/bin/python -m ipykernel install --user --name=RAG_email --display-name "RAG_email"
# Windows
RAG_email\Scripts\python -m ipykernel install --user --name=RAG_email --display-name "RAG_email"
- Place your Gmail MBOX file
# Export your Gmail and place the file as:
# ./gmail.mbox
Usage
Basic Example
# Load your emails
mbox = load_mbox('gmail.mbox')
# Process and create embeddings
format_messages_for_embeddings(mbox)
# Load the model and embeddings
doc_embeddings, model = load_embeddings()
# Ask natural language questions
queries = [
'What was my latest eBay order?',
'When was my last flight booking?',
'What was my most recent Amazon purchase?'
]
# Get results
task = 'Given an email, retrieve relevant passages that answer the query'
similarity_scores = get_similarity_scores(queries, task, model, doc_embeddings, top_k=200)
print_results(similarity_scores, queries, documents_cleaned, top_k=200)
Advanced: Using Chunking
For better performance with long emails:
# Enable chunking
chunked = True
chunk_size = 800
chunk_overlap = 160
# Process with chunks
format_messages_for_embeddings(mbox, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
Advanced: Using FAISS Vectorstore
For scalability with large email archives:
# Build vectorstore index
index = build_vectorstore_index(doc_embeddings)  # doc_embeddings as returned by load_embeddings()
# Search using vectorstore
similarity_scores, retrieved_indices = get_similarity_scores(
queries, task, model, doc_embeddings, index=index, top_k=200
)
Architecture
┌─────────────────┐
│   Gmail MBOX    │
│   Export File   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Preprocessing  │
│  • Metadata     │
│  • Body Extract │
│  • Cleaning     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    Embedding    │
│  multilingual-  │
│    e5-large     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Store   │
│ (FAISS/Direct)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Query & Search  │
│ Top-K Results   │
└─────────────────┘
Technical Details
Embedding Model
- Model: intfloat/multilingual-e5-large-instruct
- Dimensions: 1024
- Context Length: 512 tokens
- Languages: 100+ languages supported
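E5 instruct models expect each query to carry a one-line task description, while the documents (here, the email texts) are embedded without any prefix. The sketch below shows that formatting using the template published on the model card. The helper name `format_instruct_query` is hypothetical, and the notebook's `get_similarity_scores` may build the string differently.

```python
def format_instruct_query(task: str, query: str) -> str:
    """Prefix a query with its task description (E5 instruct template)."""
    return f"Instruct: {task}\nQuery: {query}"


task = "Given an email, retrieve relevant passages that answer the query"
queries = ["What was my latest eBay order?", "When was my last flight booking?"]

# Only queries get the instruction; email texts are embedded as-is.
formatted = [format_instruct_query(task, q) for q in queries]
print(formatted[0])
```

Because the context length is 512 tokens, text beyond that limit in either a query or an email is likely to be truncated by the model, which is one motivation for the chunking option.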
Search Methods
- Direct Cosine Similarity - Fast GPU-accelerated search for smaller datasets
- FAISS Vectorstore - Scalable CPU-based search with IndexFlatIP
- Cross-Encoder Reranking - Optional refinement using ms-marco-MiniLM
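The first two methods can rank by the same score: for L2-normalized vectors the inner product equals cosine similarity, which is why an exact inner-product index such as IndexFlatIP can reproduce direct cosine search. The pure-Python sketch below illustrates this on toy vectors. It assumes the embeddings are normalized before scoring, it is not the notebook's GPU or FAISS code, and the function names are hypothetical.

```python
import heapq
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_inner_product(query_vec, doc_vecs, k):
    """Return the k (score, doc_index) pairs with the highest inner product."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, doc)), idx)
        for idx, doc in enumerate(doc_vecs)
    )
    return heapq.nlargest(k, scored)


# Toy 3-dimensional "embeddings"; the real model produces 1024 dimensions.
raw_docs = ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0])
docs = [l2_normalize(v) for v in raw_docs]
query = l2_normalize([2.0, 0.0, 0.0])

results = top_k_inner_product(query, docs, k=2)
for score, idx in results:
    print(f"doc {idx}: {score:.3f}")
```

A smaller `k` only shortens the returned list here; every document is still scored, as in any exact (flat) search.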
Supported Email Features
- Multipart emails (text/plain, text/html)
- Multiple encodings (UTF-8, Latin-1, etc.)
- Gmail labels and threads
- HTML content cleaning
- Email metadata (sender, date, subject)
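As an illustration of the multipart and HTML handling listed above, here is a minimal sketch built only on Python's standard `email` and `html.parser` modules. It is not the notebook's preprocessing code: `extract_body` and `html_to_text` are hypothetical helpers, and preferring text/plain over text/html is an assumption.

```python
from email.message import EmailMessage
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce an HTML fragment to its visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def extract_body(msg) -> str:
    """Prefer the text/plain part; fall back to cleaned text/html."""
    html_fallback = ""
    for part in msg.walk():
        if part.is_multipart():
            continue
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the header
            text = payload.decode("utf-8", errors="replace")
        if ctype == "text/plain" and text.strip():
            return text.strip()
        if ctype == "text/html" and not html_fallback:
            html_fallback = html_to_text(text)
    return html_fallback


# Build a small multipart/alternative message in memory for demonstration.
msg = EmailMessage()
msg["From"] = "shop@example.com"
msg["Subject"] = "Your order"
msg.set_content("Order 123 has shipped.")
msg.add_alternative("<html><body><p>Order <b>123</b> has shipped.</p></body></html>", subtype="html")
print(extract_body(msg))
```

Messages read from an MBOX file with the standard `mailbox.mbox` class expose the same `walk()` and `get_payload()` methods, so the same function should apply to them.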
Use Cases
- Order Tracking: "Show me all my Amazon orders from last year"
- Travel Planning: "When was my flight to Paris?"
- Work History: "Find emails about the Q3 project proposal"
- Account Recovery: "What's my Netflix subscription email?"
- Event Recall: "When did I receive the wedding invitation?"
Customization
Adjust Search Parameters
# More results for better recall
top_k = 500
# Enable chunking for long emails
chunk_size = 800
chunk_overlap = 160
# Use MPS acceleration on Apple Silicon when available, otherwise fall back to CPU
import torch
device = "mps" if torch.backends.mps.is_available() else "cpu"
Use Different Embedding Models
The notebook includes experiments with:
- google/embeddinggemma-300m
- Qwen/Qwen3-Embedding-0.6B
- Cross-encoders for reranking
Project Structure
RAG_email/
├── RAG_email.ipynb    # Main notebook
├── README.md          # This file
└── gmail.mbox         # Your email export (not included)
Troubleshooting
Issue: Out of Memory
Solution: Enable chunking or use CPU device instead of GPU
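Chunking helps with memory because each embedding call then sees a bounded amount of text. The sketch below splits a long email into overlapping windows using the 800/160 values from the usage example. It assumes the sizes are measured in characters and uses a plain sliding window; which unit and splitter the notebook actually uses is not stated here, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 160) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# A 2000-character body yields three windows, each starting 640 characters apart.
email_body = "".join(chr(97 + i % 26) for i in range(2000))
chunks = chunk_text(email_body)
print([len(c) for c in chunks])
```

The overlap means a sentence cut at one chunk boundary still appears intact in the neighboring chunk, at the cost of embedding some text twice.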
Issue: Encoding Errors
Solution: The notebook includes robust encoding handlers for Greek, UTF-8, and other character sets
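One common way to build such a handler is to try the charset declared in the email first and then fall back through a fixed list. The sketch below shows that pattern with the standard library only. The fallback order (UTF-8, then ISO-8859-7 for Greek, then Latin-1) and the name `decode_with_fallback` are assumptions, not the notebook's actual logic.

```python
def decode_with_fallback(raw: bytes, declared=None,
                         fallbacks=("utf-8", "iso-8859-7", "latin-1")) -> str:
    """Try the declared charset first, then each fallback, in order."""
    candidates = ([declared] if declared else []) + list(fallbacks)
    for charset in candidates:
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next one
    # Last resort: never raise, mark undecodable bytes instead.
    return raw.decode("utf-8", errors="replace")


greek_bytes = "Καλημέρα".encode("iso-8859-7")
print(decode_with_fallback(greek_bytes))
print(decode_with_fallback("héllo".encode("utf-8"), declared="x-unknown-charset"))
```

The order matters: single-byte charsets accept almost any input, so a Latin-1 email with no declared charset would be read as Greek under this ordering. Trusting the declared charset first limits that risk.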
Issue: Slow Search
Solution: Use FAISS vectorstore or reduce top_k parameter
Issue: MBOX File Not Found
Solution: Export your Gmail as an MBOX file and place it in the project root as gmail.mbox
Future Enhancements
- Web interface for easier querying
- Integration with LLMs for answer generation
- Support for attachments and images
- Real-time email monitoring
- Advanced filtering by date ranges and senders
- Export search results to CSV
- Multi-account support
References & Acknowledgments
- Sentence Transformers - Embedding framework
- FAISS - Efficient similarity search
- LangChain - RAG framework
- multilingual-e5-large-instruct - Embedding model
Contributing
Contributions are welcome! Feel free to:
- Report bugs
- Suggest features
- Submit pull requests
Star History
If you find this project useful, please consider giving it a star!
Contact
For questions or feedback, please open an issue on GitHub.