Speech To Text To Speech Translation

This project handles local real-time audio to audio translation over sockets with OpenAI's Whisper and Microsoft SpeechT5 TTS. The project includes a client and server Python program, meaning that users can choose host the service on a high-performance GPU, then be able to use it on any consumer-level device.

Audio is sent by the client through WebSockets, which is transcribed into English with Whisper. SpeechT5 TTS then generates audio which reads off the transcription. The audio is sent back to the client and piped into any virtual audio device on their system. The end-to-end processing time is 1.5 seconds on an A100.

Demo Video

Data flow

Server-side Flow

Within the client, the user can pipe in the audio output to any virtual microphone or audio device they would like. One application is for video calls, the user can pipe the output to a virtual microphone, then use that audio device in a meeting so that everything they say is translated.

Server Installation Instructions:

These are all important steps!

Ensure your ports specified in server.py is open! The default port we chose was 4444.

Make sure your XCode CLI or C++ compiler tools are fully updated!
https://developer.apple.com/forums/thread/677124

Install FFmpeg:
sudo apt install ffmpeg

Install Anaconda for your device:
https://www.anaconda.com/download

Run the initialization command:
conda init

Make sure Conda is updated to the latest version:
conda upgrade conda

Create a virtual environment
conda create -n "speech-to-speech" python==3.11

conda activate speech-to-speech

Install Pytorch here:
https://pytorch.org/get-started/locally/

Install Librosa
conda install -c conda-forge librosa

Install Transformers:
conda install -c huggingface transformers

Install pyaudio:
pip install pyaudio

Install Project Packages:
cd server
pip install -r requirements.txt

Finally run the server:
python server.py

Client Installation

If you'd like to use the translation in a video call or such, you can install software to create a virtual microphone. On Mac you can use Blackhole.

Install FFmpeg - you can do so with brew, or here: https://ffmpeg.org/download.html.
brew install ffmpeg

Install requirements.txt in the clients folder
pip install -r requirements.txt

If you're running server.py on a remote server, change "localhost" to your remote server ip in client .py in this line:

client.start(("localhost", 4444))

Finally run the client:
python client.py

Within the client, you can select the appropriate input and output device that audio will be piped through.

Notebooks for Testing

speech.ipynb

Audiofile -> Translates to english text with whisper -> Create a synthesized voice with MS T5 TTS

transcribe.ipynb

Microphone -> Transcribe to english text

speech-to-transcribe

Microphone -> Translate and transcribe to english text

Errors

clang: error: no such file or directory: '/Users/kensonhui/anaconda3/envs/speech-to-speech/lib/python3.11/config-3.11-darwin/libpython3.11.a'

PY_SSIZE_T_CLEAN macro must be defined for '#' formats

You'll have to update conda, update XCode, update brew, update your XCode CLI tools. Destroy your env, and rebuild your environment :D.

kensonhui/Realtime-Speech-to-Speech-Translation