smicksan/TopicModelingChiLitCorpus
This repository features two Jupyter Notebooks: one for text preprocessing and the other showcasing topic modeling using the OCTIS library with Latent Dirichlet Allocation (LDA) applied to the ChiLit Corpus. Dive into the README for detailed instructions, citations, and links to the datasets and libraries used in this project.
Topic Modeling with LDA on the ChiLit Corpus
Author: Sandra Mickwitz
Institution: Università Cattolica del Sacro Cuore
Introduction
The aim of this project is to train a Latent Dirichlet Allocation (LDA) model to analyze and understand the topics in the ChiLit Corpus from mahlberg-lab/corpora.
A key research question during model training was:
What is the ideal number of topics?
The goal was to strike a balance between too many fine-grained topics and too few, overly broad ones.
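One way to make this trade-off concrete is to score each candidate topic count and keep the best one. The sketch below is only an illustration, not the procedure used in the notebook: `pick_num_topics` and `fake_evaluate` are hypothetical names, the product of coherence and diversity is an assumed scoring rule, and the numbers are placeholders rather than results from this project. It also assumes a coherence measure that is non-negative.

```python
from typing import Callable, Dict, Iterable, Tuple


def pick_num_topics(
    candidates: Iterable[int],
    evaluate: Callable[[int], Tuple[float, float]],
) -> Tuple[int, Dict[int, float]]:
    """Return the candidate k with the highest coherence * diversity score.

    `evaluate(k)` is expected to train a model with k topics and return
    a (coherence, diversity) pair. Both are assumed to be non-negative.
    """
    scores: Dict[int, float] = {}
    for k in candidates:
        coherence, diversity = evaluate(k)
        scores[k] = coherence * diversity
    best_k = max(scores, key=scores.get)
    return best_k, scores


def fake_evaluate(k: int) -> Tuple[float, float]:
    """Stand-in for real training; the values are made up for illustration."""
    placeholder = {
        5: (0.40, 0.95),   # few topics: diverse but broad
        10: (0.55, 0.90),
        20: (0.58, 0.70),
        40: (0.50, 0.45),  # many topics: top words start to overlap
    }
    return placeholder[k]


best_k, scores = pick_num_topics([5, 10, 20, 40], fake_evaluate)
print(best_k)  # 10
```

In practice `evaluate` would wrap the actual model training and metric calls. Multiplying the two scores is just one simple choice; a weighted sum or a visual inspection of both curves would work as well.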
ChiLit Corpus Description
The ChiLit Corpus is used in this notebook.
The corpus can be accessed here.
Additionally, a notebook is available for preprocessing the corpus.
Corpus Statistics
| Metric | Unprocessed Files | Processed Files | Why the Change? |
|---|---|---|---|
| Total Tokens | 5,404,761 | 2,191,892 | Stopwords & punctuation removed |
| Unique Tokens | 70,728 | 51,673 | About 27% fewer unique tokens; the more meaningful words remain |
| Lexical Richness | 0.0131 | 0.0236 | Fewer repeated words, so the ratio of unique to total tokens increased |
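The lexical richness values in the table are consistent with a type-token ratio, that is, unique tokens divided by total tokens. Assuming that definition, a minimal sketch of the calculation looks like this (`lexical_richness` is an illustrative helper, not a function from the notebooks):

```python
from typing import Sequence


def lexical_richness(tokens: Sequence[str]) -> float:
    """Type-token ratio: number of unique tokens divided by total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Toy example: 11 tokens, 5 of them unique.
tokens = "the cat saw the dog and the dog saw the cat".split()
toy_ratio = round(lexical_richness(tokens), 4)
print(toy_ratio)  # 0.4545

# The same ratio computed from the counts reported in the table.
unprocessed = round(70_728 / 5_404_761, 4)
processed = round(51_673 / 2_191_892, 4)
print(unprocessed, processed)  # 0.0131 0.0236
```

Because the ratio depends strongly on text length, it is best compared between versions of the same corpus, as in the table, rather than across corpora of very different sizes.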
Topic Modeling with OCTIS
This project uses OCTIS (Optimizing and Comparing Topic Models Is Simple), a Python library designed to facilitate topic model training, evaluation, and comparison.
OCTIS provides:
- Preprocessing tools to clean and prepare text data.
- Multiple topic modeling algorithms, including:
  - LDA (Latent Dirichlet Allocation)
  - ProdLDA (Product of Experts LDA)
  - ETM (Embedded Topic Model)
- Evaluation metrics such as:
  - Topic Diversity
  - Topic Coherence
- Hyperparameter optimization, for example for tuning the number of topics.
For more details, visit the official repository.
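To give an intuition for one of the evaluation metrics listed above, topic diversity is commonly defined as the share of unique words among the top-k words of all topics. The sketch below is an independent plain-Python illustration of that definition, not OCTIS code; the function name `topic_diversity` and the example topics are made up for demonstration.

```python
from typing import List


def topic_diversity(topics: List[List[str]], topk: int = 10) -> float:
    """Share of unique words among the top-k words of all topics.

    A value near 1 means topics rarely share top words;
    a value near 0 means the topics are largely redundant.
    """
    top_words = [word for topic in topics for word in topic[:topk]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Invented topics for illustration; "prince" appears in two of them.
example_topics = [
    ["king", "queen", "castle", "prince"],
    ["ship", "sea", "captain", "island"],
    ["school", "boy", "master", "prince"],
]

diversity = round(topic_diversity(example_topics, topk=4), 4)
print(diversity)  # 0.9167 (11 unique words out of 12)
```

Diversity alone does not say whether topics are interpretable, which is why it is usually read together with a coherence score.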
OCTIS Citation
@inproceedings{terragni2020octis,
title={{OCTIS}: Comparing and Optimizing Topic Models is Simple!},
author={Terragni, Silvia and Fersini, Elisabetta and Galuzzi, Bruno Giovanni and Tropeano, Pietro and Candelieri, Antonio},
year={2021},
booktitle={Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.eacl-demos.31",
pages = "263--270",
}
Short Information About the Complete ChiLit Corpus
The GLARE 19th Century Children’s Literature Corpus is part of CLiC!
📌 Blog post: ChiLit: the GLARE 19th Century Children’s Literature Corpus in CLiC
Corpus Overview
- Developed as part of the MC-Project GLARE (Exploring Gender in Children’s Literature from a Cognitive Corpus Stylistic Perspective).
- Directed by Anna Čermáková and M. Mahlberg.
- 71 books (35 by female, 36 by male authors).
- 38 authors (14 female, 24 male).
- Published between 1826 and 1911.
- 4,480,386 words in total.
- Source: Project Gutenberg (Public Domain).
Reference
Čermáková, A. (2018, 14 February).
ChiLit: the GLARE 19th Century Children’s Literature Corpus in CLiC
Retrieved from University of Birmingham Blog.
📢 Feedback and contributions are welcome!
Feel free to submit issues or pull requests.
