Inverted Indexer

A word-level inverted index builder and search tool written in Python.

About

This is an inverted index implementation I wrote for an Information Retrieval course. It builds a word-level index from a corpus of HTML or plain-text documents, so a query is answered by looking up its terms in the index instead of scanning every file. The pipeline parses HTML with BeautifulSoup, tokenizes and removes stopwords with NLTK, and stems the remaining tokens with NLTK's Snowball stemmer. A sample corpus is included in the corpus/ directory.
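
For readers new to the idea, here is a minimal sketch of what a word-level inverted index looks like. It is not the code in indexer.py: the function names and the regex tokenizer are assumptions for illustration, standing in for the NLTK-based preprocessing, and the sketch skips stopword removal and stemming.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (stand-in for the NLTK pipeline)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: [positions]} for every document it occurs in."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    "d1": "The cat sat on the mat",
    "d2": "The dog chased the cat",
}
index = build_index(docs)
print(index["cat"])  # {'d1': [1], 'd2': [4]}
```

Storing positions rather than just document ids is what makes the index word-level: it records where in each document a term occurs, not only that it occurs.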

Features

Builds a word-level inverted index from a directory of documents
Parses HTML documents using BeautifulSoup
Tokenization, stopword removal, and Snowball stemming via NLTK
Ranked retrieval using TF-IDF weighting
Interactive menu: search an existing index, rebuild from a corpus, or exit
Supports nested corpus directories (subdirectories are merged during preprocessing)
Rich terminal UI with progress bars and formatted output
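
The TF-IDF ranking mentioned above can be sketched roughly as follows. This README does not say which TF-IDF variant the tool uses, so the log-scaled term frequency and the plain log(N / df) inverse document frequency here are assumptions, and tfidf_rank is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_rank(query_terms, docs):
    """Rank documents for a query; docs maps doc_id -> list of preprocessed terms."""
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for term in query_terms:
            if tf[term] > 0:
                # Log-scaled term frequency times inverse document frequency.
                score += (1 + math.log(tf[term])) * math.log(n_docs / df[term])
        if score > 0:
            scores[doc_id] = score

    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "d1": ["cat", "sat", "mat"],
    "d2": ["dog", "chased", "cat", "cat"],
    "d3": ["bird", "sang"],
}
results = tfidf_rank(["cat"], docs)
print([doc_id for doc_id, _ in results])  # ['d2', 'd1']
```

With this weighting, a term that appears in every document gets an inverse document frequency of zero and so contributes nothing to the score.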

Built With

Python 3
NLTK — tokenization, stemming, stopwords
Beautiful Soup — HTML parsing
Rich — terminal formatting and progress bars

Getting Started

Prerequisites

Python 3

Installation & Running

git clone https://github.com/mosamaasif/Inverted_Indexer.git
cd Inverted_Indexer
pip3 install nltk rich beautifulsoup4
python3 indexer.py

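On first run, you may need to download the NLTK data:

import nltk
nltk.download('punkt')       # tokenizer models
nltk.download('punkt_tab')   # newer NLTK releases (3.9 and later) look for this instead of 'punkt'
nltk.download('stopwords')   # stopword lists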

Usage

The program presents a menu with three options:

Search Only — search an existing index (must have been built previously)
Rebuild Index and Search — point it to a corpus directory (e.g., corpus/), rebuild the index, then search
Exit
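
Search Only depends on an index saved by an earlier rebuild. This README does not describe how the index is stored, so the sketch below is only an assumption: it uses JSON, and the file name index.json and both function names are hypothetical.

```python
import json
import os
import tempfile


def save_index(index, path):
    """Write the index to disk so a later run can search without rebuilding."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_index(path):
    """Return the saved index, or None if no index has been built yet."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.json")  # hypothetical file name
    missing = load_index(path)              # nothing saved yet
    save_index({"cat": {"d1": [1], "d2": [4]}}, path)
    restored = load_index(path)

print(missing)   # None
print(restored)  # {'cat': {'d1': [1], 'd2': [4]}}
```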

Note: The corpus can have subdirectories. Provide the path to the root directory and each subdirectory will be processed and merged automatically.
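
One way to gather documents from a nested corpus is a recursive walk from the root directory. This is a sketch under assumptions, not the tool's actual preprocessing: collect_documents is a hypothetical name, and the extension filter is a guess based on the corpus being HTML or text files.

```python
import os
import tempfile


def collect_documents(root, extensions=(".html", ".htm", ".txt")):
    """Return sorted paths of all matching files under root, including subdirectories."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)


# Demo: build a small nested corpus in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    sample_files = [
        "top.txt",
        os.path.join("a", "page.html"),
        os.path.join("a", "b", "deep.txt"),
        os.path.join("a", "notes.md"),  # skipped: extension not in the filter
    ]
    for rel in sample_files:
        with open(os.path.join(root, rel), "w", encoding="utf-8") as f:
            f.write("sample")
    found = [os.path.relpath(p, root) for p in collect_documents(root)]

print(found)
```

Files from every level end up in one flat list, which matches the idea of subdirectories being merged into a single corpus before indexing.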