Over the past year, I've been experimenting with neural text-to-speech in various forms. I've spent hours on experimentation and research, training models and getting varying results along the way. Some of you may have heard of Piper, an open-source synthesizer and add-on for NVDA that can be trained by anyone. It is currently in active development, and I have been there from the beginning, testing and evaluating the various versions. For years, I have had a goal of creating a high-quality voice that is truly usable by a screen reader user, and yesterday I managed to achieve it. I'm really excited to share Alba, a female Scottish English voice. I'm considering this a beta phase, and I'm looking for feedback to make improvements as needed. Please note that you will most likely get an error upon installation; however, the voice should still show up in NVDA, and I'm working on a fix as soon as possible.
Link to Piper: https://github.com/rhasspy/piper/tree/v0.1.0
Link to add-on: https://github.com/mush42/piper-nvda
Link to Alba: https://drive.google.com/file/d/1wZHuIll6aEEFd4OdLBCVcxF7bd3PbQTB/view?usp=share_link #TTS #AI #ScreenReader #Piper
@ppatel Thank you so much, that really means a lot. I completely agree with you, responsiveness is not there yet, but it is currently being worked on. I'm hoping in time we will get to a point where it's quite usable, but even now I'm surprised that it works at all, considering that ML speech synthesis has been restricted to the cloud up until fairly recently. I also want to stress that this was very easy to train, and I plan on creating a guide to help people make their own models in almost any language they would like.
@ZBennoui Oh that guide would be incredibly helpful. Thank you for doing this.
@ppatel Absolutely, I'm excited to do it! Just to give some perspective: I took about 1,200 audio files from a dataset by CSTR, downsampled them to 22,050 Hz WAV files, transcribed them using OpenAI Whisper, put them into the correct format, and then trained for about three hours. I had no input during the training process; this is the raw result from Piper.
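For anyone curious what that preprocessing can look like in practice, here is a rough Python sketch, not my exact pipeline: the dataset paths and the Whisper model size are placeholders, and it assumes an LJSpeech-style metadata.csv (file ID, pipe, transcript) of the kind Piper's training tooling consumes.

```python
# Rough sketch of the preprocessing described above. Paths and the
# Whisper model size are placeholders; adjust them for your own data.
from pathlib import Path

import librosa        # pip install librosa soundfile
import soundfile as sf
import whisper        # pip install openai-whisper

SRC = Path("cstr_dataset")       # original recordings (hypothetical path)
DST = Path("alba_dataset/wavs")  # 22,050 Hz mono WAVs for training
DST.mkdir(parents=True, exist_ok=True)

model = whisper.load_model("base")  # a small model is fine for short clips

lines = []
for clip in sorted(SRC.glob("*.wav")):
    # Resample to 22,050 Hz mono, the rate the voice is trained at.
    audio, _ = librosa.load(clip, sr=22050, mono=True)
    out = DST / clip.name
    sf.write(out, audio, 22050)

    # Transcribe the clip; Whisper returns a dict with a "text" field.
    text = model.transcribe(str(out))["text"].strip()

    # LJSpeech-style line: file ID (no extension), pipe, transcript.
    lines.append(f"{clip.stem}|{text}")

(DST.parent / "metadata.csv").write_text("\n".join(lines), encoding="utf-8")
```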
@ZBennoui @ppatel Out of curiosity, how much material did you need for this (as in, how many hours of training data), and what computer did you use for training? I've got a friend who is interested in trying to train a Polish voice as a test, perhaps using the raw data we have from training Polish RHVoices.
@pitermach Hi Piotr, I only used around 59 minutes of training audio. The entire CSTR dataset is around four hours, but I wanted to start with less just to see how it would perform. The great thing about machine learning is that, depending on the model architecture you choose, you can get away with very little data and still get good results, which makes it great for low-resource languages. Unfortunately, I do not have an NVIDIA GPU to take advantage of, since I'm on macOS, so I used Google Colab Pro to train the model before exporting it to the ONNX runtime to work with Piper.
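If anyone wants to verify an exported model before pointing Piper at it, a quick ONNX Runtime sanity check looks something like the sketch below; the file name is a placeholder, and the exact input and output names depend on how the checkpoint was exported.

```python
# Minimal sanity check for an exported Piper/VITS model.
# "alba.onnx" is a placeholder file name; input and output names
# vary with how the checkpoint was exported.
import onnxruntime as ort

session = ort.InferenceSession("alba.onnx", providers=["CPUExecutionProvider"])

# Inspect the graph so you know what the runtime expects to be fed.
for inp in session.get_inputs():
    print("input:", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```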
@ZBennoui All right, thanks. Sounds like training it on RHVoice data might be very practical then, because the scripts they record for that come to about 2 hours of audio. I'll definitely look forward to your guide, because the voice you generated sounds great.
@pitermach Great to hear! The guide is coming soon.