May 032020
 

This afternoon, I was pondering about how I might do text-to-speech, but still have the result sound somewhat natural. For what use case? Well, two that come to mind…

The first being for doing “strapper call” announcements at horse endurance rides. A horse endurance ride is where competitors and their horses traverse a long (sometimes as long as 320km) trail through a wilderness area. Usually these rides (particularly the long ones) are broken up into separate stages or “legs”.

Upon arrival back at base, the competitor has a limited amount of time to get the horse’s vital signs into acceptable ranges before they must present to the vet. If the horse has a too-high temperature, or their horse’s heart rate is too high, they are “vetted out”.

When the competitor reaches the final check-point, ideally you want to let that competitor’s support team know they’re on their way back to base so they can be there to meet the competitor and begin their work with the horse.

Historically, this was done over a PA system, however this isn’t always possible for the people at base to achieve. So having an automated mechanism to do this would be great. In recent times, Brisbane WICEN has been developing a public display that people can see real-time results on, and this also doubles as a strapper-call display.

Getting the information to that display is something of a work-in-progress, but it’s recognised that if you miss the message popping up on the display, there’s no repeat. A better solution would be to “read out” the message. Then you don’t have to be watching the screen, you can go about your business. This could be done over a PA system, or at one location there’s an extensive WiFi network there, so streaming via Icecast is possible.

But how do you get the text into speech?

Enter flite

flite is a minimalist speech synthesizer from the Festival project. Out of the box it includes 3 voices, mostly male American voices. (I think the rms one might be Richard M. Stallman, but I could be wrong on that!) There’s a couple of demos there that can be run direct from the command line.

So, for the sake of argument, let’s try something simple, I’ll use the slt voice (a US female voice) and just get the program to read out what might otherwise be read out during a horse ride event:

$ flite_cmu_us_slt -t 'strapper call for the 160 kilometer event competitor numbers 123 and 234' slt-strapper-nopunctuation-digits.wav
slt-strapper-nopunctuation-digits.ogg

Not bad, but not that great either. Specifically, the speech is probably a little quick. The question is, how do you control this? Turns out there’s a bit of hidden functionality.

There is an option marked -ssml which tells flite to interpret the text as SSML. However, if you try it, you may find it does little to improve matters, I don’t think flite actually implements much of it.

Things are improved if we spell everything out. So if you instead replace the digits with words, you do get a better result:

$ flite_cmu_us_slt -t 'strapper call for the one hundred and sixty kilometer event competitor number one two three and two three four' slt-strapper-nopunctuation-words.wav
slt-strapper-nopunctuation-words.ogg

Definitely better. It could use some pauses. Now, we don’t have very fine-grained control over those pauses, but we can introduce some punctuation to have some control nonetheless.

$ flite_cmu_us_slt -t 'strapper call.  for the one hundred and sixty kilometer event.  competitor number one two three and two three four' slt-strapper-punctuation.wav
slt-strapper-punctuation.ogg

Much better. Of course it still sounds somewhat robotic though. I’m not sure how to adjust the cadence on the whole, but presumably we can just feed the text in piece-wise, render those to individual .wav files, then stitch them together with the pauses we want.

How about other changes though? If you look at flite --help, there is feature options which can control the synthesis. There’s no real documentation on what these do, what I’ve found so far was found by grep-ing through the flite source code. Tip: do a grep for feat_set_, and you’ll see a whole heap.

Controlling pitch

There’s two parameters for the pitch… int_f0_target_mean controls the “centre” frequency of the speech in Hertz, and int_f0_target_stddev controls the deviation. For the slt voice, …mean seems to sit around 160Hz and the deviation is about 20Hz.

So we can say, set the frequency to 90Hz and get a lower tone:

$ flite_cmu_us_slt --setf int_f0_target_mean=90 -t 'strapper call' slt-strapper-mean-90.wav
slt-strapper-mean-90.ogg

… or 200Hz for a higher one:

$ flite_cmu_us_slt --setf int_f0_target_mean=200 -t 'strapper call' slt-strapper-mean-200.wav
slt-strapper-mean-200.ogg

… or we can change the variance:

$ flite_cmu_us_slt --setf int_f0_target_stddev=0.0 -t 'strapper call' slt-strapper-stddev-0.wav
$ flite_cmu_us_slt --setf int_f0_target_stddev=70.0 -t 'strapper call' slt-strapper-stddev-70.wav
slt-strapper-stddev-0.ogg
slt-strapper-stddev-70.ogg

We can’t change these values during a block of speech, but presumably we can cut up the text we want to render, render each piece at the frequency/variance we want, then stitch those together.

Controlling rate

So I mentioned we can control the rate, somewhat coarsely using usual punctuation devices. We can also change the rate overall by setting duration_stretch. This basically is a control of how “long” we want to stretch out the pronunciation of words.

$ flite_cmu_us_slt --setf duration_stretch=0.5 -t 'strapper call' slt-strapper-stretch-05.wav
$ flite_cmu_us_slt --setf duration_stretch=0.7 -t 'strapper call' slt-strapper-stretch-07.wav
$ flite_cmu_us_slt --setf duration_stretch=1.0 -t 'strapper call' slt-strapper-stretch-10.wav
$ flite_cmu_us_slt --setf duration_stretch=1.3 -t 'strapper call' slt-strapper-stretch-13.wav
$ flite_cmu_us_slt --setf duration_stretch=2.0 -t 'strapper call' slt-strapper-stretch-20.wav
slt-strapper-stretch-05.ogg
slt-strapper-stretch-07.ogg
slt-strapper-stretch-10.ogg
slt-strapper-stretch-13.ogg
slt-strapper-stretch-20.ogg

Putting it together

So it looks as if all the pieces are there, we just need to stitch them together.

RC=0 stuartl@rikishi /tmp $ flite_cmu_us_slt --setf duration_stretch=1.2 --setf int_f0_target_stddev=50.0 --setf int_f0_target_mean=180.0 -t 'strapper call' slt-strapper-call.wav
RC=0 stuartl@rikishi /tmp $ flite_cmu_us_slt --setf duration_stretch=1.1 --setf int_f0_target_stddev=30.0 --setf int_f0_target_mean=180.0 -t 'for the, one hundred, and sixty kilometer event' slt-160km-event.wav
RC=0 stuartl@rikishi /tmp $ flite_cmu_us_slt --setf duration_stretch=1.4 --setf int_f0_target_stddev=40.0 --setf int_f0_target_mean=180.0 -t 'competitors, one two three, and, two three four' slt-competitors.wav
Above files stitched together in Audacity

Here, I manually imported all three files into Audacity, arranged them, then exported the result, but there’s no reason why the same could not be achieved by a program, I’m just inserting pauses after all.

There are tools for manipulating RIFF waveform files in most languages, and generating silence is not rocket science. The voice itself could be fine-tuned, but that’s simply a matter of tweaking settings. Generating the text is basically a look-up table feeding into snprintf (or its equivalent in your programming language of choice).

It’d be nice to implement a wrapper around flite that took the full SSML or JSML text and rendered it out as speech, but this gets pretty close without writing much code at all. Definitely worth continuing with.