Attempted Festival TTS Integration with NVDA
For the last six months, I’ve spent a lot of Saturdays trying to integrate Festival TTS with the NVDA screen reader. This report summarizes the effort. The TL;DR version is that it’s not a great idea, and I’ll be trying a flite addon next.
The Hindi voice and lexicon analyzer
A particular Hindi voice needed to be added to NVDA. The nice folks developed the voice files and the lexicon generator, and had it running on Linux. Since Orca has very nice Festival integration, it was also able to use the Hindi voice files.
On the NVDA side
There is a nice Festival addon for NVDA, with Russian language support, that works with the 2010 version of NVDA. That plugin has a lot of code in C++ and Scheme, and drives the sound output itself. On comparison with the espeak integration in NVDA, I concluded that there was too much code in the existing plugin and that it should be possible to make things easier. I decided to model the addon on the way the espeak integration has been designed, and to use nvwave.py (part of NVDA) to play out the wav files.
I believed that pause and cancel could be achieved only in this way, but I could be completely wrong here.
Using nvwave.py
nvwave.py uses Python ctypes to call the Windows multimedia functions that play the wave data. With espeak, the wave data is delivered through a callback and then played out; the source file _espeak.py implements this. We’ll call this callback-based wave-data delivery audio streaming. One interesting thing is that nvwave.py calls the Windows audio playback functions in a blocking manner, i.e. the functions return only when the complete data has been played back. Now, if espeak’s synthesis (text to wave data) and this blocking playback happen in the same thread, some pauses in playback can be expected. My theory is that either espeak is just too fast, or it’s actually running synthesis in a separate thread. Another theory is that I’m just missing something completely here.
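To make the "separate thread" theory concrete, here is a minimal sketch of synthesis and blocking playback running in two threads connected by a queue. It is only an illustration of the idea: fake_synthesize and blocking_play are hypothetical stand-ins invented for this sketch, not functions from espeak, _espeak.py or nvwave.py, and the chunk sizes and delays are arbitrary.

```python
import queue
import threading
import time


def fake_synthesize(text, chunk_count=5):
    """Hypothetical stand-in for a synthesizer: yields wave-data chunks."""
    for i in range(chunk_count):
        time.sleep(0.01)          # pretend synthesis takes some time
        yield bytes([i]) * 320    # dummy bytes, not real PCM audio


def blocking_play(chunk, played):
    """Hypothetical stand-in for a blocking playback call."""
    time.sleep(0.01)              # returns only after the chunk has "played"
    played.append(chunk)


def stream(text):
    """Synthesize in one thread while another thread plays finished chunks."""
    chunks = queue.Queue()
    played = []

    def producer():
        for chunk in fake_synthesize(text):
            chunks.put(chunk)
        chunks.put(None)          # sentinel: synthesis is finished

    def consumer():
        while True:
            chunk = chunks.get()
            if chunk is None:
                break
            blocking_play(chunk, played)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return played
```

With this layout the next chunk is being synthesized while the previous one is still playing, so playback pauses only if synthesis is slower than real time. In a single thread the two delays would simply add up for every chunk.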
Festival on Windows
One important point to know is that Festival is not meant to be used as a production TTS; it is a vehicle for implementing and experimenting with various aspects of TTS, from lexicon generation to synthesis. The main use case of the code base is to build a binary on Linux, and trying to make a DLL on Windows is like trying to fit a round peg into a square hole. Fortunately, the CMU guys have provided a good way to create the makefiles within Cygwin and compile using Visual C, which works great for building the speech tools and Festival binaries.
Integrating with NVDA
Once basic Festival for Windows is there, there are two options: one is to use the client-server mode of Festival, and the other is to try to make a DLL out of Festival. I decided to try the DLL model first so that I could leverage the current espeak codebase. It’ll be worth attempting the client-server model too, although it’ll require putting some clever hooks in NVDA to manage the Festival server lifetime.
All the intelligence in Festival is implemented in the Scheme language, so I wrote a very simple C layer over the Festival API to evaluate Scheme expressions and return the result. There was another function to synthesize text to wave data. The problem was that when NVDA handed over a large piece of text, synthesis took its own time and playback started only after a delay, which was unacceptable.
Audio streaming
Festival doesn’t provide an audio streaming API, so I hooked into the HTS engine (the Hindi voice file was an HTS file) to collect the wave data as it’s generated and deliver it to NVDA, just like the espeak implementation. The problem was that Festival does some post-synthesis resampling, which was not happening with this callback approach. Also, because Festival synthesis (which is a little slow due to all the file-based data exchange and Scheme overhead) and the blocking callback were in the same thread, the voice was stretched out (like running audio at a very slow speed). The obvious thing to attempt is to run Festival in a different thread. With this approach, random memory corruptions were happening, even with the Festival DLL compiled in ‘multi threaded DLL’ mode.
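Because the callback path bypasses Festival’s own resampling step, the chunks would have to be resampled somewhere before playback. The following is a rough sketch of what that could look like for a single chunk, using plain linear interpolation. It assumes mono 16-bit signed PCM in native byte order, and the function name and sample rates are made up for illustration. It is not the resampling code Festival uses, and it ignores continuity across chunk boundaries.

```python
from array import array


def resample_chunk(pcm_bytes, src_rate, dst_rate):
    """Resample one chunk of mono 16-bit PCM by linear interpolation."""
    src = array("h")
    src.frombytes(pcm_bytes)
    if not src or src_rate == dst_rate:
        return pcm_bytes

    out_len = max(1, round(len(src) * dst_rate / src_rate))
    step = src_rate / dst_rate        # source samples per output sample
    last = len(src) - 1
    out = array("h")
    for i in range(out_len):
        pos = i * step                # fractional position in the source
        left = min(int(pos), last)
        right = min(left + 1, last)   # clamp at the end of the chunk
        frac = pos - left
        value = src[left] * (1 - frac) + src[right] * frac
        out.append(int(round(value)))
    return out.tobytes()
```

A call such as resample_chunk(chunk, 16000, 32000) would double the number of samples in the chunk. A real implementation would carry the last sample over between chunks and probably use a proper filter rather than linear interpolation.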
Parting comments
In my opinion, the performance of Festival will remain a challenge. This is because of the many file-based exchanges of data even within the Festival codebase, the use of the Scheme language, and calling out to the lexicon analyzer as a separate binary for each word. On most Windows machines, various anti-malware and anti-virus programs scan all file IO in real time, which makes these file exchanges even slower.
Overall, I’m not convinced that a robust and usable addon for the average NVDA user can be created with this, so I don’t plan to work on it any more.
I’ll move on to developing a flite addon with all the knowledge I’ve gained from this project. Flite is from the Festival authors, and has been designed to be embedded and used in production. It’s also gaining Indian languages slowly and steadily.
For the brave soul
If anyone wants to pursue Festival+NVDA further, they should do so. Here are some suggestions:
1. Revisit the code base of the 2010 Festival NVDA addon. Maybe it can be used.
2. Revisit using the Festival binary in server mode. It still won’t give audio streaming.
3. Update the C code of festival-dll-wrapper to do the work in a separate thread, but call back in the main thread. I tried doing this in Python; this would mean attempting it in C. It would also require debugging the multi-threaded-DLL behaviour of the Festival code.
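The shape of suggestion 3 can be sketched in a few lines of Python, even though the suggestion itself is about doing it in C. In this sketch the worker thread only produces chunks and puts them on a queue, and the calling thread takes them off the queue and invokes the callback. The names run_synthesis, synthesize and on_audio are hypothetical and exist only for this illustration. The sketch does not address the memory corruption seen inside Festival itself.

```python
import queue
import threading


def run_synthesis(text, on_audio, synthesize):
    """Run synthesize() in a worker thread; call on_audio() in the caller's thread."""
    results = queue.Queue()
    done = object()                   # unique end-of-stream marker

    def worker():
        try:
            for chunk in synthesize(text):
                results.put(chunk)
        finally:
            results.put(done)         # always unblock the caller

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    while True:
        item = results.get()          # blocks until the worker has something
        if item is done:
            break
        on_audio(item)                # callback runs in the calling thread
    thread.join()
```

Here the caller blocks until synthesis finishes, which keeps the example short. A real addon would more likely poll the queue from its existing event loop so that pause and cancel stay responsive.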
If robust audio streaming is achieved, there will be some other things left to do, like getting rid of the printf/cout calls in the Festival code and implementing pitch and rate control, but those are the easier part.