
Collections as Models

By Andrei Maberley | 17 November 2023

Collections as Models: Text to Speech with Queensland voices.

With our dataset format defined and a creation pipeline in place, the next major decision is which TTS model to use. To pick a model we need to set achievable goals for quality and speed of generation (inference), within a hardware budget and a time-frame for results.

As a starting point for a Queensland voice we are looking to capture some of the tonality and regional variation that makes our voices distinct. High-fidelity, fully naturalistic speech is most likely beyond our reach, as our time-frame for training will be limited to a few sessions per week for a month, on consumer-grade hardware (a single RTX 4090 GPU). Training models is the most costly stage of the process in both time and hardware. While not as intensive as training a large language model (LLM), training times for state-of-the-art (SOTA) TTS models range from days to weeks of continuous training on large cloud computing platforms.

Fortunately we can piggy-back on existing models through a process called 'fine-tuning'. Fine-tuning means taking an already trained model (a checkpoint) and resuming training with our own dataset. Not only does fine-tuning let us use smaller datasets, it also means we can run training for shorter periods.

When the time comes to test our models we will use inference, which fortunately is a much less hardware-intensive process. Ideally we will be able to run inference with minimal processing delay on modest hardware, so we can present an interactive model for web deployment using free or low-cost cloud compute, and in future make our models available for public use on average computer hardware or IoT devices.

Choosing a TTS model

Thankfully, there has been an explosion in TTS systems in the last five years, with incredible advances in realism, efficiency and ease of use. Commercial businesses such as ElevenLabs and OpenAI offer impressive TTS for a price; however, a number of open source projects are implementing cutting-edge models. Perhaps the most fully featured speech synthesis library, and certainly the most starred on GitHub, is coqui-tts. Started by ex-Mozilla employees who worked on the Common Voice project, coqui-tts contains implementations of just about every open source TTS model, supports multiple platforms and hardware, and can train models from a wide variety of datasets. It's a great way to test out models, training datasets and inference outputs, with a whole folder of 'recipes' dedicated to training models on LJ Speech-style datasets.
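To give a flavour of how approachable coqui-tts is, the snippet below is a minimal sketch (not code from our pipeline) that loads one of its published LJ Speech VITS checkpoints and synthesises a sentence to a WAV file. The exact model name is an assumption on my part; any entry from the library's model list would work.

```python
# Minimal sketch of synthesising speech with coqui-tts (pip install TTS).
# The model name is an assumption -- anything returned by TTS().list_models() will do.
from TTS.api import TTS

# Load a pre-trained VITS model trained on the LJ Speech dataset.
tts = TTS(model_name="tts_models/en/ljspeech/vits")

# Synthesise a sentence straight to a WAV file.
tts.tts_to_file(
    text="Welcome to the John Oxley Library.",
    file_path="sample.wav",
)
```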

One model in particular, tortoise-tts, built by James Betker on their own hardware, produces excellent results. However, with an original dataset of over 50,000 hours and the training rig linked to above, it's a little out of reach.

Another intriguing system is bark from suno.ai, which is architecturally similar to Google's AudioLM. coqui-tts offers a recipe for voice cloning with bark, which can produce naturalistic results, at the cost of slow inference and random, albeit intriguing, artefacts.

As with most things there is a trade-off between quality and speed: despite the quality of speech from both tortoise and bark, achieving acceptable results, particularly at inference, would rely on hardware beyond the scope of this project.

A third candidate is Variational Inference for Text-to-Speech (VITS), which when released in 2021 represented a significant step forward in the TTS domain, particularly in terms of quality, naturalness and efficiency. VITS scores well on MOS (mean opinion score) when compared with both other models and recordings of real people. Most importantly for this project, it is less intensive in both training and inference. In fact, after some research I stumbled across an implementation of VITS for the Raspberry Pi 4, a credit-card-sized computer that we've used intensively for various public programs at The Edge.

Piper-TTS

Piper is a project that encapsulates the best aspects of open source development. Building on an excellent TTS system (VITS), piper packages the training, inference and distribution of models in a streamlined fashion. By converting to the ONNX format, piper can produce real-time results on its target hardware with models a quarter of the size (or less) of the original trained models. Already used in a wide variety of projects and papers, piper is a natural fit for our project.
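For a sense of how lightweight inference is, here is a rough sketch of driving the piper command-line binary from Python. The voice file name is hypothetical and stands in for one of our exported ONNX models.

```python
# Rough sketch: calling the piper binary from Python to synthesise speech.
# Assumes piper is installed and an exported ONNX voice (here a hypothetical
# "qld-voice-medium.onnx") sits alongside its .json config file.
import subprocess

text = "Welcome to the State Library of Queensland."

# piper reads text from stdin and writes the synthesised audio to --output_file.
subprocess.run(
    ["piper", "--model", "qld-voice-medium.onnx", "--output_file", "welcome.wav"],
    input=text.encode("utf-8"),
    check=True,
)
```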

Training

Piper expects a dataset in a derivative of the LJ Speech format, requiring minimal tweaking of our pipeline. With clear instructions and a couple of comprehensive contributed training guides available, I was able to achieve acceptable results on the first attempt at fine-tuning from a single-speaker dataset.
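For readers unfamiliar with the LJ Speech layout, it is essentially a folder of WAV clips plus a pipe-delimited metadata.csv mapping each clip to its transcript. The snippet below is an illustrative sketch of producing that file, not our actual pipeline code; the paths and transcripts are made up.

```python
# Illustrative sketch (not the project's pipeline code): writing an
# LJ Speech-style metadata.csv for a single-speaker dataset.
# Each row pairs a clip id (the WAV filename without extension) with its transcript.
import csv
from pathlib import Path

# Hypothetical inputs: a folder of clips and a mapping of clip id -> transcript.
wav_dir = Path("dataset/wavs")
transcripts = {
    "clip_0001": "Welcome to the John Oxley Library.",
    "clip_0002": "This recording is part of the oral history collection.",
}

with open("dataset/metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for clip_id, text in sorted(transcripts.items()):
        # Only keep rows whose audio actually exists on disk.
        if (wav_dir / f"{clip_id}.wav").exists():
            writer.writerow([clip_id, text])
```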

Interactive Demonstration

Making a public demonstration of this project was straightforward: given that piper was designed to run on a low-resource Raspberry Pi 4, we can achieve acceptable performance on a low-tier cloud GPU. Hugging Face Spaces are ideally suited for this kind of demonstration. To enable our demo, all that is required is a few modifications to the piper code and uploading our models to Hugging Face.
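As an indication of how little glue is needed, the sketch below shows the sort of Gradio app a Hugging Face Space runs. It is an assumption about the demo's shape rather than our actual Space code, and again the voice file name is hypothetical.

```python
# Minimal Gradio sketch of a Space wrapping piper behind a text box.
# This is an illustrative assumption, not the project's actual demo code.
import subprocess
import gradio as gr

MODEL = "qld-voice-medium.onnx"  # hypothetical exported voice

def speak(text: str) -> str:
    """Synthesise `text` with the piper binary and return the WAV file path."""
    out_path = "output.wav"
    subprocess.run(
        ["piper", "--model", MODEL, "--output_file", out_path],
        input=text.encode("utf-8"),
        check=True,
    )
    return out_path

demo = gr.Interface(
    fn=speak,
    inputs=gr.Textbox(label="Text"),
    outputs=gr.Audio(label="Speech"),
)
demo.launch()
```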

Please check out the demonstration here – and feel free to contact the Digital Library Initiatives team or myself directly if you have any questions or suggestions for the project.

Acknowledgements

I’d like to acknowledge the assistance of my colleagues at State Library, in particular the Digital Library Initiatives and Applied Creativity teams, as well as Phil Gullberg. Thanks also to the authors and contributors of all the open source projects I’ve mentioned, and the creators and contributors of State Library's collections, who have made this project possible and so exciting to undertake. And thanks GPT-4 :) when you hit AGI – remember my dumb questions fondly, eh?
