# Speech Emotion Recognition

A deep learning project that classifies eight emotions from speech audio using a Convolutional Neural Network (CNN). Trained on the RAVDESS dataset, this repository demonstrates end-to-end audio processing, from feature extraction to model inference, and suits applications like virtual assistants, mental health monitoring, and human-computer interaction.
## 📋 Table of Contents

- ✨ Features
- 🏗️ Project Structure
- 🛠️ Installation
- ▶️ Usage
- 📊 Results & Visualizations
- 🔮 Future Improvements
- 🤝 Contributing
- 📄 License
## ✨ Features

- **Dataset:** RAVDESS audio dataset (1,440 samples across 8 emotions)
- **Feature Extraction:**
  - 231-dimensional feature vectors
  - MFCCs, Chroma, Mel Spectrograms, Zero-Crossing Rate, Spectral Contrast
- **Model Architecture:** CNN built in PyTorch with:
  - 2 convolutional layers
  - Batch normalization & dropout
  - Fully connected layers for classification
- **Performance:**
  - Single CNN: ~72.5% accuracy
  - Ensemble of 3 CNNs: ~74.2% accuracy
- **Prediction:** Function to infer emotion from new audio files
## 🏗️ Project Structure

```
├── .gitignore
├── LICENSE
├── README.md
├── img.jpg
├── requirements.txt
└── SpeechEmotionCNN.ipynb
```

- `.gitignore`: Excludes caches and checkpoints
- `LICENSE`: MIT open-source license
- `README.md`: Project documentation (this file)
- `img.jpg`: Sample process diagram
- `requirements.txt`: Project dependencies
- `SpeechEmotionCNN.ipynb`: Notebook covering data loading, feature extraction, training, evaluation, and prediction
## 🛠️ Installation

1. **Clone the repository**

   ```bash
   git clone https://github.com/X-XENDROME-X/speech-emotion-recognition.git
   cd speech-emotion-recognition
   ```

2. **Set up a virtual environment (optional but recommended)**

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Prepare the RAVDESS dataset**
   - The notebook will auto-download from Zenodo if `./data/ravdess/` is empty
   - Or download manually and unzip into `./data/ravdess/`
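RAVDESS encodes each clip's metadata directly in its filename as seven hyphen-separated two-digit fields (modality-channel-emotion-intensity-statement-repetition-actor), with the third field giving the emotion. The notebook presumably derives its labels this way; the helper below is an illustrative sketch, not code from the repository.

```python
# RAVDESS emotion codes, per the dataset's filename convention.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_filename(path):
    """Return the emotion label encoded in a RAVDESS filename."""
    stem = path.rsplit("/", 1)[-1].removesuffix(".wav")
    return EMOTIONS[stem.split("-")[2]]  # third field = emotion code

print(emotion_from_filename("03-01-05-01-02-01-12.wav"))  # angry
```

This makes labeling a pure filename operation — no separate annotation file is needed.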
## ▶️ Usage

1. **Launch Jupyter Notebook**

   ```bash
   jupyter notebook SpeechEmotionCNN.ipynb
   ```

2. **Run through the cells:**
   - Load and preprocess audio files
   - Extract features with Librosa
   - Train CNN model(s)
   - Evaluate and visualize results
   - Save model weights and label encoder

3. **Predict on new audio**

   ```python
   from predict import predict_emotion_pytorch

   model_path = 'speech_emotion_model_pytorch.pth'
   encoder_path = 'label_encoder.joblib'
   audio_file = 'path/to/audio.wav'

   predicted_emotion = predict_emotion_pytorch(
       audio_file, model_path, encoder_path, device='cpu'
   )
   print(f"Predicted Emotion: {predicted_emotion}")
   ```
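The Features section describes the model as a PyTorch CNN with two convolutional layers, batch normalization, dropout, and a fully connected classification head over the 231-dimensional feature vectors. A minimal sketch of such an architecture follows; all channel counts, kernel sizes, and dropout rates are assumptions, not the notebook's actual values.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Two conv blocks plus an FC head, mirroring the README's description.
    Layer sizes are illustrative assumptions."""
    def __init__(self, n_features=231, n_classes=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=5, padding=2),    # conv block 1
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),                               # 231 -> 115
            nn.Conv1d(64, 128, kernel_size=5, padding=2),  # conv block 2
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),                               # 115 -> 57
            nn.Dropout(0.3),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * (n_features // 4), 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, n_classes),                     # 8 emotion logits
        )

    def forward(self, x):  # x: (batch, 1, n_features)
        return self.fc(self.conv(x))

model = EmotionCNN()
logits = model(torch.randn(4, 1, 231))
print(logits.shape)  # torch.Size([4, 8])
```

Treating the 231-dimensional feature vector as a 1-channel "signal" and convolving over it is one common choice; the notebook may instead reshape features into a 2-D map for `Conv2d`.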
## 📊 Results & Visualizations

- **Accuracy:** 72.5% (single model) → 74.2% (ensemble)
- **Insights:**
  - High accuracy on distinct emotions (e.g., angry, happy)
  - Lower performance for neutral/calm due to class imbalance (addressed with weighted loss)
- **Visuals:**
  - Waveform plots
  - Confusion matrices
  - Training & validation accuracy/loss curves
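The ensemble gain (72.5% → 74.2%) can come from averaging the per-class probabilities of the three CNNs before taking the argmax. Whether the notebook averages probabilities, logits, or uses majority voting is not stated, so the probability-averaging sketch below is one plausible reading.

```python
import numpy as np

def ensemble_predict(prob_sets):
    """Average per-class probabilities across models, then argmax.

    prob_sets: list of (n_samples, n_classes) arrays, one per model.
    """
    avg = np.mean(prob_sets, axis=0)
    return avg.argmax(axis=1)

# Toy example: three models scoring one sample over three classes.
# Two models lean toward class 1; averaging settles the disagreement.
p1 = np.array([[0.5, 0.4, 0.1]])
p2 = np.array([[0.1, 0.6, 0.3]])
p3 = np.array([[0.2, 0.5, 0.3]])
print(ensemble_predict([p1, p2, p3]))  # [1]
```

For the class-imbalance fix, PyTorch's `nn.CrossEntropyLoss(weight=...)` accepts per-class weights, which is the standard way to implement the "weighted loss" mentioned above.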
## 🔮 Future Improvements
- Experiment with Transformer-based architectures (e.g., Audio Spectrogram Transformers)
- Integrate pre-trained audio networks (e.g., Wav2Vec)
- Add extra features: pitch, prosody, formants
- Test & fine-tune on other datasets (e.g., TESS, CREMA-D)
- Deploy as a REST API or in a real-time application
## 🤝 Contributing

Contributions are welcome! To contribute:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature-name`)
3. Commit your changes (`git commit -m 'Add new feature'`)
4. Push to the branch (`git push origin feature-name`)
5. Open a Pull Request
Please adhere to the existing code style and include tests where applicable.
## 📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
