# 🤖 Intelligent chunking methods for code documentation RAG
ICM_RAG is an experimental retrieval pipeline for general textual corpora retrieval tasks.
## 🏗 Implementation

### ℹ️ General
ICM_RAG aims to implement a retrieval pipeline with an already defined chunker (FixedTokenChunker) and a dataset of choice: Wikitexts, Chatlogs, or State of the Union. All the datasets can be found here. One may choose the embedding model, but for the specifics of the task, sentence-transformers/all-MiniLM-L6-v2 was chosen. It is straightforward to switch to another model, say sentence-transformers/multi-qa-mpnet-base-dot-v1, by tweaking the command-line arguments.
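To illustrate what fixed-token chunking with overlap does, here is a minimal sketch. It is not the repository's FixedTokenChunker: for simplicity it splits on whitespace rather than real tokenizer tokens, and the function name is hypothetical.

```python
def fixed_token_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into chunks of `chunk_size` tokens, each sharing
    `chunk_overlap` tokens with the previous chunk (toy whitespace tokens)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    stride = chunk_size - chunk_overlap  # how far the window advances each step
    return [
        " ".join(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), stride)
        if tokens[i : i + chunk_size]
    ]

chunks = fixed_token_chunks("one two three four five six seven",
                            chunk_size=4, chunk_overlap=1)
print(chunks)  # → ['one two three four', 'four five six seven', 'seven']
```

Note how each chunk repeats the last token of the previous one; with the defaults (chunk_size 400, chunk_overlap 40), consecutive chunks share 40 tokens.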
### 💻 Command-line arguments
| Argument Name | Description | Value Range | Default Value |
|---|---|---|---|
| exp_name | Experiment name. | str | default_experiment |
| questions_df_path | Path to questions DataFrame. | str | (.env) DEFAULT__QUESTIONS_DF_PATH |
| dataset | Name of the dataset to use. | wikitexts, chatlogs, state_of_the_union | (.env) DEFAULT__QUESTIONS_DF_PATH |
| cache_dir | Path to caching directory. | str | (.env) DEFAULT_CACHE_DIR |
| data_dir | Path to data directory. | str | (.env) DEFAULT__DATA_DIR |
| dataset_dir | Path to dataset directory. | str | (.env) DEFAULT_DATASET_DIR |
| log | Path to (experiment) log file. | str | None |
| ret_type | Type of retriever to use. | cos_sim, chromadb | chromadb |
| chunk_size | Chunk size to use for document chunking. | int | 400 |
| chunk_overlap | Chunk overlap to use for document chunking. | int | 40 |
| emb_model | Embedding model. | sentence-transformers/all-MiniLM-L6-v2, sentence-transformers/multi-qa-mpnet-base-dot-v1 | sentence-transformers/all-MiniLM-L6-v2 |
| batch_size | Batch size for model embedding. | int | 16 |
| k | Retrieve top-k chunks. | int | 10 |
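For the cos_sim retriever type, retrieval amounts to ranking chunk embeddings by cosine similarity against the query embedding and keeping the top k. A minimal sketch with toy vectors, not the repository's actual retriever (the function name and the embeddings are illustrative only):

```python
import numpy as np

def top_k_chunks(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k chunks most cosine-similar to the query."""
    # Normalize so that a plain dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    sims = c @ q
    # Sort descending by similarity and keep the first k indices.
    return np.argsort(sims)[::-1][:k]

chunk_embs = np.array([[1.0, 0.0],   # chunk 0
                       [0.0, 1.0],   # chunk 1
                       [0.7, 0.7]])  # chunk 2
query_emb = np.array([1.0, 0.1])
print(top_k_chunks(query_emb, chunk_embs, k=2))
```

In the real pipeline the vectors would come from the chosen emb_model, and `k` corresponds to the `k` command-line argument above.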
## 🚀 Quickstart
ICM_RAG uses conda for environment management. To clone the repository and set up the environment, i.e. create it and install the dependencies, you may run:
```shell
git clone https://github.com/LukaNedimovic/icm_rag.git
cd icm_rag
source ./setup.sh
```

setup.sh will also export several environment variables, useful for dynamic path creation and, therefore, for setting up experiment configurations.
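The variable names below are the ones referenced in the argument table; the values are purely hypothetical examples of what such exports might look like, not the actual contents of setup.sh:

```shell
# Hypothetical values — adjust the paths to your checkout.
export DEFAULT__QUESTIONS_DF_PATH="$SRC_ROOT/data/questions.csv"
export DEFAULT_CACHE_DIR="$SRC_ROOT/data/cache"
export DEFAULT__DATA_DIR="$SRC_ROOT/data"
export DEFAULT_DATASET_DIR="$SRC_ROOT/data/datasets"
```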
## 🧪 Experiments
ICM_RAG comes with a set of 30 experiments (10 for each dataset).
The main goal was to examine the effects of chunk size, chunk overlap, and the number of top-k retrieved chunks. You may find the experiment results here: Experiment Results Paper.
You may find experiment examples in the experiments directory; here is another quick example:
```shell
./main.py \
    --exp_name "example_experiment" \
    --dataset "wikitexts" \
    --emb_model "sentence-transformers/multi-qa-mpnet-base-dot-v1" \
    --cache_dir "$SRC_ROOT/data/cache" \
    --chunk_size 1500 \
    --chunk_overlap 500 \
    --ret_type "chromadb" \
    --log "$EXPERIMENTS_DIR/wikitexts/experiments.csv" \
    --k 12
```

## 📝 Documentation
To build the documentation, it is enough to run setup.sh and then build_docs.sh:

```shell
source ./setup.sh
./build_docs.sh
```

By default, build_docs.sh will open docs/build/index.html using Firefox.