Inverted Indexer

A word-level inverted index builder and search tool written in Python.

License: MIT

About

This is an inverted index implementation I wrote for an Information Retrieval course. It builds a word-level index from a corpus of HTML or text documents, enabling fast full-text search. The pipeline includes HTML parsing with BeautifulSoup, tokenization and stopword removal with NLTK, and stemming with Snowball. A sample corpus is included in the corpus/ directory.
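As a rough illustration of what a word-level index looks like, the sketch below maps each term to the documents and word positions where it occurs. It is not the code in indexer.py: the regex tokenizer, the tiny stopword set, and the names `tokenize` and `build_index` are assumptions standing in for the NLTK steps, and stemming is left out to keep the example dependency-free.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword set; the project uses NLTK's stopword list instead.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text.

    Positions are counted after stopword removal.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    "d1": "The cat sat in the hat",
    "d2": "A cat and a dog",
}
index = build_index(docs)
print(index["cat"])  # {'d1': [0], 'd2': [0]}
```

Looking up a query term is then a single dictionary access, which is what makes search fast compared with scanning every document.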

Features

  • Builds a word-level inverted index from a directory of documents
  • Parses HTML documents using BeautifulSoup
  • Tokenizes text, removes stopwords, and applies Snowball stemming via NLTK
  • Ranks search results using TF-IDF weighting
  • Interactive menu: search an existing index, rebuild from a corpus, or exit
  • Supports nested corpus directories (subdirectories are merged during preprocessing)
  • Rich terminal UI with progress bars and formatted output
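The README does not say which TF-IDF variant the tool uses, so the following is only a minimal sketch of one common form: raw term frequency multiplied by log(N / df), summed over the query terms. The function name `rank` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Score each document by the sum of tf * idf over the query terms.

    docs maps doc_id -> list of already-preprocessed tokens.
    """
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    scores = {}
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for counts in term_counts.values() if term in counts)
        if df == 0:
            continue  # a term absent from the corpus contributes nothing
        idf = math.log(n_docs / df)
        for doc_id, counts in term_counts.items():
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "d1": ["cat", "sat", "hat"],
    "d2": ["cat", "dog", "dog"],
    "d3": ["bird", "sang"],
}
results = rank(["cat", "dog"], docs)
print(results)
```

Here "dog" appears in only one document, so it carries more weight than "cat", and d2 ranks above d1.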

Built With

  • Python 3
  • NLTK
  • BeautifulSoup
  • Rich

Getting Started

Prerequisites

  • Python 3 with pip3

Installation & Running

git clone https://github.com/mosamaasif/Inverted_Indexer.git
cd Inverted_Indexer
pip3 install nltk rich beautifulsoup4
python3 indexer.py

On first run, you may need to download the NLTK tokenizer and stopword data from a Python session (recent NLTK releases may also ask for 'punkt_tab'):

import nltk
nltk.download('punkt')
nltk.download('stopwords')

Usage

The program presents a menu with three options:

  1. Search Only — search an existing index (must have been built previously)
  2. Rebuild Index and Search — point it to a corpus directory (e.g., corpus/), rebuild the index, then search
  3. Exit
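A menu like the one above could be dispatched along these lines. This is a sketch under stated assumptions, not the loop in indexer.py: `run_menu`, `read`, `search`, and `rebuild` are hypothetical names, and the callbacks are injected so the loop can be exercised without a terminal.

```python
def run_menu(read=input, search=None, rebuild=None):
    """Loop over the three menu choices until the user picks Exit.

    read, search and rebuild are injected callables (all hypothetical).
    """
    while True:
        choice = read("1) Search Only  2) Rebuild Index and Search  3) Exit: ").strip()
        if choice == "1":
            search()
        elif choice == "2":
            # Ask for the corpus root, rebuild the index, then search it.
            rebuild(read("Corpus directory: ").strip())
            search()
        elif choice == "3":
            return
        else:
            print("Please enter 1, 2, or 3.")


# Scripted run: rebuild from corpus/, then exit.
calls = []
answers = iter(["2", "corpus/", "3"])
run_menu(
    read=lambda prompt: next(answers),
    search=lambda: calls.append("search"),
    rebuild=lambda path: calls.append(("rebuild", path)),
)
print(calls)  # [('rebuild', 'corpus/'), 'search']
```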

Note: The corpus can have subdirectories. Provide the path to the root directory and each subdirectory will be processed and merged automatically.
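One way to gather documents from a nested corpus is a recursive directory walk, sketched below with pathlib. The helper name `collect_documents` and the accepted file suffixes are assumptions for illustration; the actual tool may select and merge files differently.

```python
import tempfile
from pathlib import Path


def collect_documents(root, suffixes=(".html", ".htm", ".txt")):
    """Return every matching file under root, including subdirectories, sorted by path."""
    root = Path(root)
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )


# Demo on a throwaway directory tree so the example does not touch real data.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    (base / "a.txt").write_text("alpha")
    (base / "sub" / "b.html").write_text("<p>beta</p>")
    (base / "sub" / "notes.md").write_text("ignored")
    found = [p.relative_to(base).as_posix() for p in collect_documents(base)]

print(found)  # ['a.txt', 'sub/b.html']
```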

Screenshots

  • Menu Screen
  • Building Index
  • Storing Index
  • Search Query
  • Search Results

License

Distributed under the MIT License. See LICENSE for details.

Created June 8, 2021
Updated February 5, 2026