BMI Local Implementation
Installation
- Clone the repository:
git clone https://github.com/HTAustin/CAL.git
- Install the Sofia-ML package: https://code.google.com/archive/p/sofia-ml/
- Make the kisssdb indexer.
cd CAL && make
- Change the path for Sofia-ML in doAll_Baseline:
SOFIA="/the/path/to/sofia-ml-read-only/src/sofia-ml"
Usage
- Run CAL Auto TAR:
bash doAll_Baseline
- Configure behaviour through environment variables:
MODE - Default: tfidf. Valid values: 4gram, tfidf
MAXTHREADS - number of threads. Default: 4
SOFIA - path to the Sofia-ML binary. Default: ./sofia-ml/src/sofia-ml
CORP - corpus to use. Default: oldreut
CACHE - if set, enables caching of corpus-specific precomputations. Default: not set
Example:
$ MODE=4gram MAXTHREADS=16 SOFIA=/home/nghelani/sofia-ml/sofia-ml CORP=aquaint bash doAll_Baseline
- Important files assumed by the script
Corpus/<CORP>.tgz - Corpus
judgement/<CORP>.topic.stemming.txt - Topics separated by newline (each line is "<topic_id>:<query>")
judgement/qrels.<JUDGECLASS>.list - Relevance judgements for topics (each line is "<topic> 0 <doc> <score>")
- The output of BMI is stored in the result/ folder.
- The gain curve can be plotted by analyzing result/baseline/<corp>/<topic>/<topic>.record.list
- Plot gain curves with gainCurve.py (see python2 gainCurve.py -h)
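The environment variables and input files listed above can be checked before starting a long run. The following is a minimal sketch, not part of the CAL repository: the function names `resolve_config`, `validate_mode`, and `missing_inputs` are hypothetical, and the way `doAll_Baseline` itself applies its defaults may differ. `JUDGECLASS` has no documented default, so it is passed in explicitly.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not part of the CAL repository.

# Fill in the documented defaults for any variable that is unset or empty.
resolve_config() {
    MODE="${MODE:-tfidf}"
    MAXTHREADS="${MAXTHREADS:-4}"
    SOFIA="${SOFIA:-./sofia-ml/src/sofia-ml}"
    CORP="${CORP:-oldreut}"
}

# Succeed only for the two documented MODE values.
validate_mode() {
    case "$1" in
        tfidf|4gram) return 0 ;;
        *) return 1 ;;
    esac
}

# Print each expected input file that is missing under the root directory $1.
# $2 is the JUDGECLASS used in the qrels file name (supplied by the caller).
missing_inputs() {
    for f in "$1/Corpus/$CORP.tgz" \
             "$1/judgement/$CORP.topic.stemming.txt" \
             "$1/judgement/qrels.$2.list"; do
        [ -e "$f" ] || echo "$f"
    done
}
```

A possible use is to call `resolve_config`, reject the run if `validate_mode "$MODE"` fails, and abort if `missing_inputs . "$JUDGECLASS"` prints anything.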
Speedup Tips
- Comment out the ./dofast line if it already completed successfully on a previous run.
- If using qrels for assessment, consider stopping the iterations once you have found the desired number of relevant documents (see the sample snippet):
NUM_REL=$(sort -u "rel.$TOPIC.fil" | wc -l)
TOT_REL=$(grep "^$TOPIC.*[1-9]$" "../judgement/qrels.$JUDGECLASS.list" | cut -d' ' -f3 | sort -u | wc -l)
if [ "$NUM_REL" -eq "$TOT_REL" ]; then
    break
fi
- Lower the number of iterations. The default (100) may be higher than you need.
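For the early-stopping tip, the total number of relevant documents can also be counted by matching the topic field exactly, so that a topic such as 10 does not pick up qrels lines for topic 101. The sketch below assumes the qrels layout described under Usage ("<topic> 0 <doc> <score>") and treats any score greater than zero as relevant, which is an assumption rather than something the repository states. The function name `count_relevant` is hypothetical.

```shell
# count_relevant QRELS_FILE TOPIC
# Prints the number of distinct documents with score > 0 for TOPIC.
# Assumes whitespace-separated lines: "<topic> 0 <doc> <score>".
count_relevant() {
    # Appending "" forces a string comparison on the topic field.
    awk -v topic="$2" '($1 "") == (topic "") && $4 > 0 { print $3 }' "$1" \
        | sort -u | wc -l | tr -d ' '
}
```

Inside the iteration loop this could replace the grep pipeline, for example `TOT_REL=$(count_relevant "../judgement/qrels.$JUDGECLASS.list" "$TOPIC")`.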
Contribute
Please feel free to open issues and report bugs.
