# Speech Emotion Recognition

A deep learning project that classifies eight emotions from speech audio using a Convolutional Neural Network (CNN). Trained on the RAVDESS dataset, this repository demonstrates end-to-end audio processing, from feature extraction to model inference, and suits applications like virtual assistants, mental health monitoring, and human-computer interaction.
## 📋 Table of Contents

- ✨ Features
- 🏗️ Project Structure
- 🛠️ Installation
- ▶️ Usage
- 📊 Results & Visualizations
- 🔮 Future Improvements
- 🤝 Contributing
- 📄 License
## ✨ Features

- **Dataset:** RAVDESS audio dataset (1,440 samples across 8 emotions)
- **Feature Extraction:**
  - 231-dimensional feature vectors
  - MFCCs, Chroma, Mel Spectrograms, Zero-Crossing Rate, Spectral Contrast
- **Model Architecture:** CNN built in PyTorch with:
  - 2 convolutional layers
  - Batch normalization & dropout
  - Fully connected layers for classification
- **Performance:**
  - Single CNN: ~72.5% accuracy
  - Ensemble of 3 CNNs: ~74.2% accuracy
- **Prediction:** Function to infer emotion from new audio files
## 🏗️ Project Structure

```
├── .gitignore
├── LICENSE
├── README.md
├── img.jpg
├── requirements.txt
└── SpeechEmotionCNN.ipynb
```

- `.gitignore`: Excludes caches and checkpoints
- `LICENSE`: MIT open-source license
- `README.md`: Project documentation (this file)
- `img.jpg`: Sample process diagram
- `requirements.txt`: Project dependencies
- `SpeechEmotionCNN.ipynb`: Notebook covering data loading, feature extraction, training, evaluation, and prediction
## 🛠️ Installation

1. **Clone the repository**

   ```bash
   git clone https://github.com/X-XENDROME-X/speech-emotion-recognition.git
   cd speech-emotion-recognition
   ```

2. **Set up a virtual environment (optional but recommended)**

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Prepare the RAVDESS dataset**
   - The notebook will auto-download from Zenodo if `./data/ravdess/` is empty
   - Or download manually and unzip into `./data/ravdess/`
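RAVDESS encodes each clip's metadata directly in its filename as seven hyphen-separated two-digit fields (modality-channel-emotion-intensity-statement-repetition-actor), with the third field giving the emotion. The notebook presumably derives its labels this way; the helper below is an illustrative sketch, not code from the repository.

```python
# RAVDESS emotion codes, per the dataset's filename convention.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(path):
    """Return the emotion label encoded in a RAVDESS filename."""
    stem = path.rsplit("/", 1)[-1].removesuffix(".wav")
    return EMOTIONS[stem.split("-")[2]]  # third field = emotion code

print(emotion_from_filename("03-01-05-01-02-01-12.wav"))  # angry
```

This makes labeling a pure filename operation — no separate annotation file is needed.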
## ▶️ Usage

1. **Launch Jupyter Notebook**

   ```bash
   jupyter notebook SpeechEmotionCNN.ipynb
   ```

2. **Run through the cells:**
   - Load and preprocess audio files
   - Extract features with Librosa
   - Train CNN model(s)
   - Evaluate and visualize results
   - Save model weights and label encoder

3. **Predict on new audio**

   ```python
   from predict import predict_emotion_pytorch

   model_path = 'speech_emotion_model_pytorch.pth'
   encoder_path = 'label_encoder.joblib'
   audio_file = 'path/to/audio.wav'

   predicted_emotion = predict_emotion_pytorch(
       audio_file, model_path, encoder_path, device='cpu'
   )
   print(f"Predicted Emotion: {predicted_emotion}")
   ```
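The Features section describes the model as a PyTorch CNN with two convolutional layers, batch normalization, dropout, and a fully connected classification head over the 231-dimensional feature vectors. A minimal sketch of such an architecture follows; all channel counts, kernel sizes, and dropout rates are assumptions, not the notebook's actual values.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Two conv blocks plus an FC head, mirroring the README's description.
    Layer sizes are illustrative assumptions."""
    def __init__(self, n_features=231, n_classes=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, padding=2),    # conv block 1
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),                               # 231 -> 115
            nn.Conv1d(64, 128, kernel_size=5, padding=2),  # conv block 2
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),                               # 115 -> 57
            nn.Dropout(0.3),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * (n_features // 4), 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, n_classes),                     # 8 emotion logits
        )

    def forward(self, x):  # x: (batch, 1, n_features)
        return self.fc(self.conv(x))

model = EmotionCNN()
logits = model(torch.randn(4, 1, 231))
print(logits.shape)  # torch.Size([4, 8])
```

Treating the 231-dimensional feature vector as a 1-channel "signal" and convolving over it is one common choice; the notebook may instead reshape features into a 2-D map for `Conv2d`.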
## 📊 Results & Visualizations

- **Accuracy:** 72.5% (single model) → 74.2% (ensemble)
- **Insights:**
  - High accuracy on distinct emotions (e.g., angry, happy)
  - Lower performance for neutral/calm due to class imbalance (addressed with weighted loss)
- **Visuals:**
  - Waveform plots
  - Confusion matrices
  - Training & validation accuracy/loss curves
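The ensemble gain (72.5% → 74.2%) can come from averaging the per-class probabilities of the three CNNs before taking the argmax. Whether the notebook averages probabilities, logits, or uses majority voting is not stated, so the probability-averaging sketch below is one plausible reading.

```python
import numpy as np

def ensemble_predict(prob_sets):
    """Average per-class probabilities across models, then argmax.

    prob_sets: list of (n_samples, n_classes) arrays, one per model.
    """
    avg = np.mean(prob_sets, axis=0)
    return avg.argmax(axis=1)

# Toy example: three models scoring one sample over three classes.
# Two models lean toward class 1; averaging settles the disagreement.
p1 = np.array([[0.5, 0.4, 0.1]])
p2 = np.array([[0.1, 0.6, 0.3]])
p3 = np.array([[0.2, 0.5, 0.3]])
print(ensemble_predict([p1, p2, p3]))  # [1]
```

For the class-imbalance fix, PyTorch's `nn.CrossEntropyLoss(weight=...)` accepts per-class weights, which is the standard way to implement the "weighted loss" mentioned above.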
## 🔮 Future Improvements
- Experiment with Transformer-based architectures (e.g., Audio Spectrogram Transformers)
- Integrate pre-trained audio networks (e.g., Wav2Vec)
- Add extra features: pitch, prosody, formants
- Test & fine-tune on other datasets (e.g., TESS, CREMA-D)
- Deploy as a REST API or in a real-time application
## 🤝 Contributing

Contributions are welcome! To contribute:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature-name`)
3. Commit your changes (`git commit -m 'Add new feature'`)
4. Push to the branch (`git push origin feature-name`)
5. Open a Pull Request
Please adhere to the existing code style and include tests where applicable.
## 📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
