This was a weekend project over President's Day weekend '24, inspired by another one of my projects. I started with a very different approach, and along the way I tried multiple Voice APIs and text-to-speech providers.
First, ensure that all of the required dependencies (FastAPI, ngrok, etc.) are installed:
pip install -r requirements.txt
You can run the API by calling api.py:
python3 api.py
I have used ngrok in this project, as it let me quickly establish a tunnel from my development environment to the public internet (the telephony provider's webhooks need a reachable URL). It is not required, but removing it will take some editing.
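For reference, here is a minimal sketch of opening a tunnel programmatically, assuming the pyngrok package; the port number and exact wiring are assumptions, not necessarily how this project does it.

import os
from pyngrok import ngrok

# Authenticate with the token from the .env file
ngrok.set_auth_token(os.environ["NGROK_AUTHTOKEN"])

# Open an HTTP tunnel to the local FastAPI server (port 8000 is assumed here)
tunnel = ngrok.connect(8000, "http")
print("Public URL for telephony webhooks:", tunnel.public_url)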
I have provided an empty environment file template below. Copy and paste this into .env, and begin pasting in your API keys.
NGROK_AUTHTOKEN=""
# telephony
VONAGE_API_KEY=""
VONAGE_API_SECRET=""
VONAGE_APPLICATION_ID=""
VONAGE_APPLICATION_NAME=""
VONAGE_SIGNATURE_SECRET=""
VONAGE_JWT=""
# speech to text
DEEPGRAM_API_KEY=""
# gpt to generate a response
OPENAI_API_KEY=""
# text to speech
PLAYHT_USER_ID=""
PLAYHT_API_KEY=""
ELEVENLABS_VOICE_ID=""
ELEVENLABS_API_KEY=""
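These values can then be loaded at application startup. A minimal sketch, assuming the python-dotenv package (the project may load its settings differently):

import os
from dotenv import load_dotenv

# Read key=value pairs from .env into the process environment
load_dotenv()

# Individual keys are then available to the modules that need them
deepgram_key = os.environ["DEEPGRAM_API_KEY"]
openai_key = os.environ["OPENAI_API_KEY"]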
This project has been set up with three main components: api.py, where our FastAPI instance lives and which is also the entrypoint to this program; the project settings, where all project-related settings go; and the processing code, where all files related to processing go. The processing code contains four main modules:
generate_response
- This is where we generate the response to send to the client.
speechtotext
- This is where we convert speech to text.
texttospeech
- This is where we convert text to speech.
telephony
- This provides a class that allows us to change telephony providers.
All of the processing modules are object oriented. In each module, I have added an abstract.py where I define the base class. Each file inside a module implements the abstract class, and the implementation is finally imported in api.py. To swap modules, look at APISettings in api.py.
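As a rough illustration of the pattern (the names AbstractTextToSpeech, synthesize, and ElevenLabsTextToSpeech below are hypothetical, not the project's actual identifiers):

from abc import ABC, abstractmethod

# abstract.py: the base class that every provider in the module implements
class AbstractTextToSpeech(ABC):
    @abstractmethod
    async def synthesize(self, text: str) -> bytes:
        """Convert text into raw audio bytes."""

# eleven_labs.py: one concrete provider behind the same interface
class ElevenLabsTextToSpeech(AbstractTextToSpeech):
    async def synthesize(self, text: str) -> bytes:
        # Call the ElevenLabs API here; stubbed for illustration
        return b""

# api.py imports whichever implementation APISettings points at,
# so swapping providers only touches that one setting.

Because every provider satisfies the same abstract interface, the rest of the pipeline never has to know which vendor sits behind it.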
This was a fun project; however, I was not thrilled with some of the things I ran into. First, I started with the Telnyx Voice API. The Telnyx documentation was great, but I could not get high-quality audio without establishing an RTC connection. I also tried Twilio and found the same issue. Vonage, on the other hand, provided noticeably better audio over WebSockets, so that was nice to see.
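For context, Vonage streams call audio over a websocket when the call's answer webhook returns an NCCO with a connect action. A rough sketch with FastAPI (the route and tunnel URL here are assumptions, not this project's exact wiring):

from fastapi import FastAPI

app = FastAPI()

@app.get("/webhooks/answer")
async def answer():
    # NCCO telling Vonage to stream the call's audio to our websocket
    return [
        {
            "action": "connect",
            "endpoint": [
                {
                    "type": "websocket",
                    "uri": "wss://example.ngrok.io/socket",  # the ngrok tunnel URL
                    "content-type": "audio/l16;rate=16000",
                }
            ],
        }
    ]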
If I had more time, I would implement more of the pipeline on a local machine with a GPU, which would cut down on latency. For example, if I ran openai/whisper locally instead of calling Deepgram, I have a feeling I could cut down on the speech-processing time.
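A minimal sketch of that local path, assuming the openai-whisper package (the model size and file name are placeholders):

import whisper

# Load a model; on a machine with a GPU this runs on CUDA automatically
model = whisper.load_model("base")

# Transcribe a recorded utterance; result["text"] holds the transcript
result = model.transcribe("caller_audio.wav")
print(result["text"])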
Finally, context seems to be important to streaming text-to-speech, so right now we have to wait for the chat completion to finish before generating audio. I think I would change this to feed the text-to-speech service a sliding window of context, but I am not sure how I would do that yet.
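Purely as a sketch of the idea (nothing like this is implemented here): one option is to buffer the streamed completion and flush it to the text-to-speech service one sentence at a time, so synthesis can begin before the full completion arrives.

import re

def sentence_chunks(token_stream):
    """Yield complete sentences from a stream of completion tokens."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Split on sentence-ending punctuation followed by whitespace
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        # Everything before the last piece is a finished sentence
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer  # flush whatever remains at the end

# Hypothetical usage: feed each sentence to the TTS provider as it completes
# for sentence in sentence_chunks(stream_of_tokens):
#     audio = tts.synthesize(sentence)

This gives each TTS request a little local context without waiting on the whole completion, though it is a cruder cut than a true sliding window.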
Also, as seen in the texttospeech/open_ai.py file, I was not able to get the Opus bindings to work for Python, so I did not finish implementing OpenAI text-to-speech (I could probably get it working, but it's Monday and time for schoolwork).