
From collections to datasets

By Andrei Maberley | 6 October 2023

As part of my staff digital catalyst project, I’ve taken some time to research and implement a workflow for turning collections into datasets – the raw material used to create machine learning models. I’ve split this workflow into three processes: Collection Discovery -> Content Evaluation -> Data Processing. In this post I’ll cover the first two of these processes in a little more detail. It’s worth bearing in mind that this workflow doesn’t produce new collection material per se; rather, it reframes existing content and metadata in a format suitable for model training. Which particular format I’ll be using, and how much content we need, I’ll discuss in the next blog post.

Collection Discovery

With collection activity spanning many decades, a broad range of source materials and diverse subject matter, audio recordings in State Library of Queensland’s collection run the gamut from documentation of the mundane in everyday environments, to extraordinary people and events at far-flung locations. These collections are found in the John Oxley Library, which holds collections unique to Queensland. To narrow down our options, we are looking for original collection content with a duration of at least 30 minutes, featuring entirely spoken voices, released under a Creative Commons licence and (ideally) with a transcription available.
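To make those criteria concrete, here is a minimal sketch (in Python) of how a shortlisting filter might look. The record fields are hypothetical stand-ins for catalogue metadata, not the actual One Search schema.

```python
from dataclasses import dataclass

@dataclass
class CatalogueRecord:
    # Hypothetical fields standing in for catalogue metadata.
    title: str
    duration_minutes: float
    licence: str            # e.g. "CC BY-NC 4.0"
    spoken_word_only: bool   # no music-only or ambient-only recordings
    has_transcript: bool

def is_candidate(rec: CatalogueRecord) -> bool:
    """Apply the shortlisting criteria described above."""
    return (
        rec.duration_minutes >= 30
        and rec.spoken_word_only
        and rec.licence.upper().startswith("CC")
    )

def rank_key(rec: CatalogueRecord) -> tuple:
    """Prefer items that already have a transcript, then longer items."""
    return (rec.has_transcript, rec.duration_minutes)

# Usage sketch:
# shortlist = sorted(filter(is_candidate, records), key=rank_key, reverse=True)
```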

The key tool for this is, unsurprisingly, the State Library’s catalogue – One Search. Using filters and wildcards, we can create a search URL that narrows results to online, digital audio resources in the John Oxley Library. We then need to go through each collection item’s details to determine the licence, format and duration, and to get a sense of the stories being told. Once an item is shortlisted, we can download the audio and transcript files and move on to evaluation.
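As a rough illustration, a filtered search URL can be assembled programmatically before being opened in the browser. The base URL and parameter names below are placeholders for whatever One Search actually uses, not its real query syntax.

```python
from urllib.parse import urlencode

# Placeholder values only – not the real One Search endpoint or parameters.
BASE_URL = "https://onesearch.example.org/discovery/search"

params = {
    "query": "any,contains,oral history *",   # wildcard keyword search
    "facet_location": "John Oxley Library",
    "facet_format": "Audio",
    "facet_availability": "Online",
}

search_url = f"{BASE_URL}?{urlencode(params)}"
print(search_url)
```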

Content Evaluation

The creators of these items have a wide range of skills, capabilities and interests, and the technical quality of the recordings they have donated (or State Library has commissioned) is equally diverse. Empirical analysis and judgement are vital at this stage, as audio quality directly impacts the usefulness of any dataset. To help make these decisions I've developed a few heuristics to be used alongside critical listening and analysis. The recording studio at The Edge is an ideal place to undertake this kind of listening, particularly with the recent upgrades to hardware and software and, perhaps most importantly, an on/off switch for the room air-conditioning!

Heuristics for audio quality

The source materials range from various analogue audio formats through to ‘born digital’ audio and video stories. All the collections we are dealing with have been digitised, so the basic attributes of sample rate, bit depth/rate and file format can be ascertained easily from the digital files. To ensure that the stated attributes are truly representative of the media in question, we need to go a little further. For example, a WAV format audio file may have a sample rate of 48 kHz at 24 bits. This indicates a possible frequency range of 20 Hz to 24 kHz (set by the sampling rate) and a theoretical dynamic range of 144 dB (set by the bit depth). However, when analysed statistically and spectrally, it may become obvious that the audio file in question was originally recorded at a lower sampling rate and/or dynamic range. iZotope RX 7 (available in The Edge recording studio) is perfect for this kind of analysis, and provides a range of processing tools we will use in Data Processing. Other heuristics I've used are listed below (a rough scripted sketch of some of these checks follows the list):

  • Number of channels: Stereo or mono? (model training doesn’t usually require stereo files)
  • Number of speakers (primary? secondary? interviewer/interviewee?): How many voices are present on the recording?
  • Distinctiveness: Are voices easily distinguished and separable to the ear? Do speakers overlap often?
  • Clarity (vocal presence): Are the voices equally present on the recording, or are one or more voices more distant from the microphone?
  • Reverberation (room tone): Is the acoustic environment obvious? What does it sound like? Think of this as a sliding scale from a small hard-surfaced room – like a kitchen – to a large outdoor space.
  • Distortion (harmonic): Is the recording obviously distorted throughout?
  • Distortion (a-periodic): Are loud passages distorted – e.g. laughing, clapping, shouting?
  • Clipping: Does the recording hit digital zero? If so, how many samples are clipped? Is the clipping audible?
  • Noise (floor): Is there noise inherent in the recording media or devices?
  • Noise (background): Is there audible signal other than spoken voice – is it musical? Periodic? Random?
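As a rough, scripted counterpart to some of the checks above (and to the sample-rate and bit-depth question earlier), here is a sketch using the Python soundfile and numpy libraries. The thresholds and frame sizes are arbitrary starting points rather than calibrated values, and none of this replaces critical listening in the studio.

```python
import numpy as np
import soundfile as sf

def audio_quality_report(path: str) -> dict:
    """Scriptable approximations of a few of the heuristics above."""
    data, sr = sf.read(path, always_2d=True)   # floats in [-1.0, 1.0]
    info = sf.info(path)                       # container-level metadata
    mono = data.mean(axis=1)

    # Spectral check: a genuine 48 kHz recording should carry some energy
    # approaching the Nyquist frequency (sr / 2); audio upsampled from a
    # lower rate usually rolls off well below it. Analyse the first minute
    # to keep the FFT manageable.
    segment = mono[: sr * 60]
    spectrum = np.abs(np.fft.rfft(segment)) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)
    cumulative = np.cumsum(spectrum)
    rolloff_hz = freqs[np.searchsorted(cumulative, 0.99 * cumulative[-1])]

    # Clipping check: count samples at (or effectively at) digital full scale.
    clipped_samples = int(np.sum(np.abs(data) >= 0.999))

    # Noise floor estimate: RMS level of the quietest 10% of 100 ms frames.
    frame = sr // 10
    n_frames = len(mono) // frame
    frames = mono[: n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    noise_floor_dbfs = 20 * np.log10(np.percentile(rms, 10) + 1e-12)

    return {
        "channels": info.channels,
        "sample_rate": sr,
        "subtype": info.subtype,                 # e.g. "PCM_24"
        "spectral_rolloff_hz": float(rolloff_hz),
        "clipped_samples": clipped_samples,
        "noise_floor_dbfs": float(noise_floor_dbfs),
    }
```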

Some of these qualities are essential – a recording that is regularly distorted, or has considerable background noise, can still tell an interesting Queensland story, but may be unusable as a dataset. Once the evaluation is complete we have a clear understanding of how much work will be required in the Data Processing stage to wrangle the audio content.

Transcript Quality

Transcripts are available in a range of formats, including PDF, DOCX and simple TXT files. Some of the transcripts have been checked by interviewees, and they contain varying levels of detail. However, for straightforward conversion to a dataset they are all missing a key component: timecodes for each line of dialogue. Given the quality and speed of recent advances in speech-to-text (STT), it is now feasible – and easier – to re-transcribe the audio to produce a subtitle file, and then use the original transcript as a source for quality checking.
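No particular tool is named here, but as one example of the approach, the open-source Whisper speech-to-text model (installed via pip as openai-whisper) returns segment-level timestamps that map directly onto an SRT subtitle file. The file names below are hypothetical.

```python
import whisper  # open-source speech-to-text model (pip install openai-whisper)

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

model = whisper.load_model("medium")
result = model.transcribe("interview_01.wav")  # hypothetical collection item

with open("interview_01.srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n")
        srt.write(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n")
        srt.write(f"{seg['text'].strip()}\n\n")
```

The original, human-checked transcript can then be compared against the generated text as a quality check on the new, timecoded version.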

With our collection items identified and evaluated, it’s time to move on to the most time-consuming and complex process – Data Processing.

 
