The ‘Voices of the Collection’ digital catalyst project aims to provide a framework for activating collections as data, and to explore pathways for creating inspiring, informative, and interactive experiences using AI models. Generative speech models will be the focus of this project, but this blog aims to provide a brief summary of how AI and Machine Learning (ML) more generally are products with inherent biases and hidden limitations, and how State Library of Queensland has an unprecedented opportunity to address bias by creating uniquely Queensland datasets and models.
We are probably all aware of the recent and ongoing advances in AI, which have surprised, intrigued, and disturbed experts and the wider community alike. Recent advanced generative models can create photorealistic images from text, clone voices, and create 3D models from a single image. Large Language Models (LLMs) are now multi-modal and may be approaching Artificial General Intelligence for some tasks. Even in our more prosaic working life at State Library we are prompted gently, but persistently, to use AI tools in our daily tasks – the little ‘editor’ button to the top right of my page as I type this tempts me with the offer of enhancing readability and even keeps score… (I’m at 91% right now – but falling quickly).
One aspect that is less known to the wider public is where the data that trains these models is sourced, and how the data collection and training process influence results. In fact, for years, non-generative, predictive AI trained on biased datasets has produced biased models and clearly biased predictions and results.
But with generative models, bias and inaccuracies can be more subtle and interesting. Take the remarkable GPT-4 model used by ChatGPT, released last year by OpenAI. ChatGPT cannot tell us what data it was trained on; we are politely informed that it is trained on a large corpus of private and public data, relevant until about 2021, with no mention of any embodied copyright. As a generative model, ChatGPT can produce precise, detailed, coherent but completely factually incorrect responses to simple questions. These “hallucinations” range from a simple inability to count through to fabricated people, places, references, and legislation, and even include perceived political biases. With ChatGPT operating as a closed-source ‘black box’, the source of these biases and inaccuracies is essentially unknown to anyone outside OpenAI.
Generative speech models, which will be the focus of this project, have a more directly audible bias, as most spoken word audio used to train these models is from North American or UK voices. The larger a model is, generally, the more useful it becomes. However, even models trained on over 100,000 hours of speech are strongly biased in prosody – the features of speech, such as intonation, phrasing, and timing, that create unique accents and ground a voice in a place and time.
While ChatGPT is unfortunately unlikely to be open sourced any time soon, and State Library may not have the resources to train an LLM from scratch, the release of open-source speech datasets and models, combined with readily available computing power, makes it possible to run, fine-tune, or even re-train some of these generative speech models with smaller, high-quality, curated datasets.
What makes this exciting for State Library is that the rich and vital oral histories and digital stories in the collection can be a trusted, high-quality source of training data that has been neither influenced by AI-created content nor burdened by potential copyright issues.
Over the next few months I will be working with colleagues to identify and create datasets from recordings in the collection, then use these datasets to create models that attempt to represent some of the diversity of spoken voices across our state. Along the way we will produce transcripts for these recordings, experiment with interactive experiences online, and finally present at the ‘Making meaning: Collections as data’ symposium in 2024.
If you have any interesting suggestions or recommendations on stories in our collection, please do reach out and let me know at discovery@slq.qld.gov.au.