The Alexithymic Language Project
A Psicobōtica Labs research project on discovering patterns in alexithymic discourse.
Raúl Arrabales Moreno (raul@psicobotica.com) / Sept. 2020 / Psicobōtica Labs.
Abstract:
- We asked adult participants to describe what they see in several rather ambiguous visual stimuli.
- We estimated each participant's degree of alexithymia (difficulty identifying emotions) using a standardized test.
- In this project, we analyse the narratives provided by the participants, looking for specific patterns in those with the highest degree of alexithymia.
Most of the code is also designed to serve as an educational resource for junior data scientists, and it is used in several introductory courses on Natural Language Processing (NLP) and Natural Language Understanding (NLU).
Context information:
- What is alexithymia?
- Data collection via citizen science projects.
- Prolexitim NLP: the tool used to present the visual stimuli.
- Prolexitim TAS-20: the tool used to measure the level of alexithymia.
NLP/NLU Analytics Pipeline
This project is intended to contain a fairly complete end-to-end pipeline, covering some of the most popular techniques for dealing with natural language in written form. Here is a summary of the main tasks performed by the Python code included in this project:
- Research design and problem modeling.
- Dataset loading and exploratory analysis.
- Variable transformation according to the research design.
- Pre-processing: Tokenization.
- Pre-processing: Stemming.
- Pre-processing: PoS (Part of Speech) Tagging.
- Pre-processing: NER (Named Entity Recognition) Tagging.
- Pre-processing: DEP (Syntactic Dependencies) Tagging.
- Frequency-Based Feature Engineering: frequencies, diversity scores.
- Bag of Words (BoW) Vector Space Feature Engineering.
- TF/IDF Vector Space Feature Engineering.
- N-Gram models and text generation.
- PoS token counts. Most frequent verbs, nouns, adjectives, etc. per class.
- Inferring personality traits, needs, values and preferences.
- Sentiment analysis (polarity, intensity, etc.).
- Topic Detection with Latent Semantic Analysis (LSA).
- Topic Detection with Latent Dirichlet Allocation (LDA).
- Word2Vec embedding training using CBoW and Skip-Gram.
- Using word vectors for classification (feature vectors as word vector sequences).
- Using a bidirectional LSTM (biLSTM) for classification.
- Using an attention layer for explainability.
- Doc2Vec generation using a pre-trained Word2Vec model (Spanish 3B).
- Doc2Vec using a pre-trained sentence encoder.
- Using sentence vectors for classification.
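As a taste of the frequency-based steps in this pipeline, here is a minimal pure-Python sketch of Bag of Words and TF/IDF vectors. The toy sentences are illustrative only, not from the Prolexitim corpus, and the notebooks use proper tokenizers and vectorizer libraries rather than a naive split:

```python
import math
from collections import Counter

# Toy Spanish "narratives" (illustrative only, not from the Prolexitim corpus)
docs = [
    "el niño mira el violín",
    "la mujer mira por la ventana",
    "el hombre toca el violín",
]

# Tokenization: naive whitespace split (the real notebooks use NLTK/spaCy)
tokens = [d.lower().split() for d in docs]

# Bag of Words: raw term counts per document over a shared vocabulary
vocab = sorted({t for doc in tokens for t in doc})
bow = [[Counter(doc)[w] for w in vocab] for doc in tokens]

# TF/IDF: term frequency weighted by inverse document frequency
def idf(word):
    df = sum(1 for doc in tokens if word in doc)
    return math.log(len(docs) / df)

tfidf = [[Counter(doc)[w] * idf(w) for w in vocab] for doc in tokens]

# "violín" appears in 2 of the 3 documents, so its IDF is log(3/2)
print(round(idf("violín"), 3))
```

Words that occur in every document get an IDF near zero, which is exactly why TF/IDF down-weights uninformative high-frequency terms that dominate the raw BoW counts.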
HTML folder
This folder contains HTML exports of the Jupyter notebooks for quick inspection and visualization.
Data folder
This folder contains the original dataset obtained from the Prolexitim project as well as newly generated datasets with additional processing and features.
Each row of the original dataset provides:
- Anonymous code of participant.
- Degree of alexithymia of that participant (measured using the TAS-20 instrument).
- Identification of a visual stimulus (an ambiguous picture taken from the Thematic Apperception Test (TAT) card set).
- The original Spanish text corresponding to the narrative reported by the participant when presented with the visual stimulus.
The specific description of the variables represented in the CSV files can be found in the data folder's README.
Note that multiple features have been derived from the original documents (the raw text field) and included as additional columns in the tabular dataset files.
A description of all variables contained in the generated datasets is provided as well.
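To illustrate how a class variable can be derived from the TAS-20 score, here is a sketch using pandas. The column names below are hypothetical (see the data folder's README for the real schema), and the cutoffs (≥ 61 alexithymic, ≤ 51 non-alexithymic) are the commonly cited TAS-20 thresholds, not necessarily the exact ones used in this research design:

```python
import pandas as pd

# Hypothetical column names -- see the data folder's README for the real schema
df = pd.DataFrame({
    "participant": ["a1", "a2", "a3"],      # anonymous participant code
    "tas20":       [72, 48, 55],            # TAS-20 total score
    "tat_card":    ["1", "3BM", "13MF"],    # TAT stimulus shown
    "text":        ["...", "...", "..."],   # narrative in Spanish
})

# Commonly cited TAS-20 cutoffs: >= 61 alexithymic, <= 51 non-alexithymic,
# scores in between indicate "possible" alexithymia
def tas20_class(score):
    if score >= 61:
        return "alexithymic"
    if score <= 51:
        return "non-alexithymic"
    return "possible"

df["alex_class"] = df["tas20"].apply(tas20_class)
print(df["alex_class"].tolist())  # ['alexithymic', 'non-alexithymic', 'possible']
```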
Lexicon folder
This folder contains Spanish lexicon datasets:
- Sentiment analysis lexicons.
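A lexicon-based polarity score boils down to averaging the valence of the lexicon words found in a text. The tiny dictionary below is purely illustrative; the actual lexicons in this folder are far larger and may carry intensity weights:

```python
# Minimal lexicon-based polarity scorer (illustrative only; real Spanish
# sentiment lexicons contain thousands of weighted entries)
lexicon = {"triste": -1.0, "feliz": 1.0, "miedo": -0.8, "alegría": 0.9}

def polarity(text):
    """Average valence of lexicon words present in the text (0.0 if none)."""
    tokens = text.lower().split()
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

print(polarity("el niño está triste"))  # -1.0
```

In practice this is refined with lemmatization (so "tristeza" can match), negation handling, and intensity modifiers, as explored in the sentiment notebook.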
NLP folder
This folder contains the interactive notebooks (ipynb) used for data analysis:
- 1-Preprocessing.ipynb: Prolexitim dataset exploration, class variable definition and standard NLP processing (tokenization, stemming, lemmatization, POS, NER, DEP, etc.).
- 1b_SA-Lexicons.ipynb: preparation of Sentiment Analysis lexicons in Spanish.
- 2_Features.ipynb: standard natural language feature engineering (counts, lengths, frequencies, diversity scores, etc.).
- 3_BoW.ipynb: word- and stem-level Bag of Words models. Dimensionality reduction of the BoW vector space (PCA and t-SNE).
- 3b_TF-IDF.ipynb: word- and stem-level TF/IDF models. Dimensionality reduction of the TF/IDF vector space (PCA and t-SNE).
- 3c_N-Grams.ipynb: character- and word-level N-Gram models. N-Gram-based text generation.
- 4_Lexicosemantics.ipynb: PoS frequency in different classes. Semantic analysis.
- 5_Personality.ipynb: Use of the Personality Insights API (IBM Cloud) to annotate text with personality variables.
- 6_Sentiment.ipynb: Sentiment analysis using different techniques (lexicon-based, third-party API, etc.).
- 7_SemanticAnalysis.ipynb: Latent Semantic Analysis LSA (topic detection).
- 8_LatentDirichletAllocation.ipynb: Latent Dirichlet Allocation LDA (topic detection).
- 9_Word2Vec.ipynb: Training neural word vector models on our dataset (CBoW and Skip-Gram).
- 9b_Word2Vec_S3B.ipynb: Using a pre-trained neural vector model, build Doc2Vec and other embedding features.
- 9c_Word2Vec_S3B_ExportVecs.ipynb: Exporting only those vectors corresponding to the lexicon of our corpus.
- 10_Embeddings_USEM.ipynb: Doc2Vec using Google's Universal Sentence Encoder - Multilingual L3.
- 10b_USEM_Check.ipynb: Using encoded vectors for classification.
- 11_biLSTM.ipynb: bidirectional LSTM model using S3B word embeddings for classification.
- 11b_biLSTM_Attention.ipynb: biLSTM model with attention layer and explainer.
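The text generation step in the N-Gram notebook amounts to sampling the next character from counts conditioned on the previous n characters. A minimal sketch (using n=4 for brevity rather than the n=6 of the stored models, and a toy corpus instead of the Prolexitim narratives):

```python
import random
from collections import Counter, defaultdict

# Character-level n-gram model: map each n-char context to next-char counts
def train(text, n=4):
    model = defaultdict(Counter)
    for i in range(len(text) - n):
        model[text[i:i + n]][text[i + n]] += 1
    return model

# Generate text by repeatedly sampling a next char weighted by its counts
def generate(model, seed, length, n=4, rng=None):
    rng = rng or random.Random(0)
    out = seed
    for _ in range(length):
        ctx = out[-n:]
        if ctx not in model:
            break  # unseen context: stop generating
        chars, weights = zip(*model[ctx].items())
        out += rng.choices(chars, weights=weights)[0]
    return out

corpus = "el niño mira el violín y la niña mira la ventana "
model = train(corpus)
print(generate(model, "el n", 20))
```

With a corpus this small the output mostly parrots the training text; on the real narratives, larger n trades fluency against memorization.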
Models folder
This folder contains both pre-trained models from third-party contributors and models trained on our own data.
- Stanford NER tagger model - Spanish. Stanford Entity Recognizer.
- Stanford POS tagger model - Spanish. Stanford POS Tagger.
- Other Stanford CoreNLP models (Spanish). Stanford CoreNLP.
- Word BoW Model.
- Stem BoW Model.
- Char-Based N-Gram (n=6) models.
- Word-Based N-Gram (n=3, trigrams) models.
- Word2Vec models.
- Doc2Vec models.
- Classifier models (biLSTM).