asifnoushadsharafudeen/ai_rag_wikipedia
Offline RAG pipeline: Wikipedia → Vector DB → LLM QA using LangChain and FAISS
RAG Wikipedia QA (Offline)
A Retrieval-Augmented Generation (RAG) system that lets you query any Wikipedia topic using a local LLM and vector store, with no API keys required. Once the article has been fetched and saved, question answering runs fully offline.
Features
- Search any Wikipedia topic and save the content locally
- Embed saved documents into a local FAISS vector store
- Ask questions using Retrieval-Augmented Generation (RAG)
- Runs fully offline: no API keys, no cloud dependency
- Lightweight model (sshleifer/tiny-gpt2) that runs even on CPU
- CLI interface for easy interaction
- LangChain deprecation warnings cleaned
Folder Structure
RAG-Wikipedia-QA/
│
├── docs/               # Saved Wikipedia text files
├── embeddings/         # FAISS vector DBs saved here
├── rag_wikipedia.py    # Main script
├── wiki.png            # Image (Step 1 & 2)
├── QA.png              # Image (Step 3)
└── README.md           # This file
How It Works (Step-by-Step)
Step 1: Fetch Wikipedia Content
- The user is prompted to enter a topic name (e.g., India, Python programming language)
- The script fetches the article summary and saves it as a .txt file under /docs
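A minimal sketch of this step, assuming the third-party wikipedia package is used to fetch the summary (the README does not name the library). The helper names topic_to_filename, save_summary, and fetch_summary, and the file-naming scheme, are hypothetical; rag_wikipedia.py may do this differently.

```python
import re
from pathlib import Path

DOCS_DIR = Path("docs")  # folder named in the README


def topic_to_filename(topic: str) -> str:
    """Turn a topic such as 'Python programming language' into a file name."""
    # Hypothetical naming scheme: runs of characters other than letters,
    # digits, or hyphens are collapsed into a single underscore.
    slug = re.sub(r"[^A-Za-z0-9-]+", "_", topic.strip()).strip("_")
    return f"{slug}.txt"


def save_summary(topic: str, summary: str, docs_dir: Path = DOCS_DIR) -> Path:
    """Write the summary text to <docs_dir>/<topic>.txt and return the path."""
    docs_dir.mkdir(parents=True, exist_ok=True)
    path = docs_dir / topic_to_filename(topic)
    path.write_text(summary, encoding="utf-8")
    return path


def fetch_summary(topic: str) -> str:
    """Fetch the article summary. This is the one call that needs network access."""
    # Assumption: the `wikipedia` package (pip install wikipedia).
    import wikipedia

    return wikipedia.summary(topic)
```

With those helpers, the step reduces to save_summary(topic, fetch_summary(topic)) after reading the topic from the user.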
Step 2: Embed Text with FAISS
- Loads the saved .txt file
- Splits the text into chunks using LangChain's CharacterTextSplitter
- Embeds the chunks into vectors using Hugging Face embeddings
- Stores them in a FAISS vector database (.faiss and .pkl) inside /embeddings
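The indexing step could look roughly like the following. The package layout (langchain_text_splitters, langchain_huggingface, langchain_community), the embedding model name, the chunk sizes, and the helper names index_files and build_index are assumptions for illustration and are not taken from rag_wikipedia.py.

```python
from pathlib import Path

EMBEDDINGS_DIR = Path("embeddings")  # folder named in the README


def index_files(name: str, index_dir: Path = EMBEDDINGS_DIR) -> tuple[Path, Path]:
    """Return the two files that FAISS.save_local writes for an index name."""
    return index_dir / f"{name}.faiss", index_dir / f"{name}.pkl"


def build_index(txt_path: Path, index_dir: Path = EMBEDDINGS_DIR) -> tuple[Path, Path]:
    """Split a saved article, embed the chunks, and persist a FAISS index."""
    # Third-party imports are kept local so the helper above works without them.
    from langchain_community.vectorstores import FAISS
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_text_splitters import CharacterTextSplitter

    text = txt_path.read_text(encoding="utf-8")

    # Assumed settings: split on newlines, roughly 500 characters per chunk.
    splitter = CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(text)

    # Assumed embedding model; any sentence-transformers model would fit here.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    db = FAISS.from_texts(chunks, embeddings)
    # Writes <index_dir>/<name>.faiss and <index_dir>/<name>.pkl
    db.save_local(str(index_dir), index_name=txt_path.stem)
    return index_files(txt_path.stem, index_dir)
```

Using the text file's stem as the index name keeps the saved article and its index linked by one shared name, which is what lets a later step find the index from the filename alone.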
Step 3: Ask a Question (RAG)
- Prompts the user to enter the same filename used in the previous steps
- Loads the vector store and retrieves the chunks most relevant to the question
- Feeds the retrieved context and the question into a local GPT-2 model
- Generates and returns an answer offline
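As a hedged sketch, retrieval and generation might be wired together as shown here. The prompt template, the value of k, the embedding model name, and the function names build_prompt and answer are assumptions; only the generation model (sshleifer/tiny-gpt2) and the /embeddings folder come from this README.

```python
def build_prompt(context_chunks: list[str], question: str) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    # Hypothetical template; the actual script may phrase this differently.
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer(question: str, name: str, index_dir: str = "embeddings", k: int = 3) -> str:
    """Retrieve the k most similar chunks and generate an answer locally."""
    from langchain_community.vectorstores import FAISS
    from langchain_huggingface import HuggingFaceEmbeddings
    from transformers import pipeline

    # Must be the same embedding model that was used to build the index.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    # The .pkl file is loaded with pickle, so recent LangChain versions
    # require this flag. Only load indexes you created yourself.
    db = FAISS.load_local(
        index_dir,
        embeddings,
        index_name=name,
        allow_dangerous_deserialization=True,
    )

    docs = db.similarity_search(question, k=k)
    prompt = build_prompt([d.page_content for d in docs], question)

    generator = pipeline("text-generation", model="sshleifer/tiny-gpt2")
    result = generator(prompt, max_new_tokens=50, return_full_text=False)
    return result[0]["generated_text"].strip()
```

A call such as answer("What is the capital of India?", "India") would then read embeddings/India.faiss and embeddings/India.pkl, assuming the index was saved under that name.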
Requirements
Install dependencies using:
pip install -r requirements.txt
Author
Asif Noushad Sharafudeen
- LinkedIn
- GitHub
