Do you even speak 'Stralian? Machine Learning in the Minority

We interact directly with a Machine Learning (ML) product daily – we talk to Siri, Google Assistant, Alexa or Cortana. We spend money on a smart speaker, phone or other device, sign away our intimate moments to be analysed or listened to by third-party contractors, and in return we get convenient verbal control over our digital technology.

This interaction seems straightforward on the surface – we say something, the assistant responds with information or takes an action on our behalf and gives us some feedback. Behind this simple interaction there is a complicated pipeline: natural language processing (NLP) for automatic speech recognition (ASR) and intent parsing, some data retrieval to find a fact, maybe a service integration to play some music, then more NLP to produce audible speech or written text.
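The flow of data through those stages can be sketched as a toy pipeline. Every function body below is an illustrative hard-coded stub of my own – not any real Google service – just to show the shape of what passes between the stages:

```python
# Toy sketch of a voice-assistant pipeline: ASR -> intent parsing ->
# action -> TTS. All stages are hard-coded stubs for illustration only;
# real systems replace each one with an ML model or a service call.

def asr(audio: bytes) -> str:
    """Automatic speech recognition: audio in, text out (stubbed)."""
    return "play some music"

def parse_intent(text: str) -> dict:
    """Intent parsing: map the utterance to an action and its argument."""
    action, _, argument = text.partition(" some ")
    return {"action": action, "object": argument}

def act(intent: dict) -> str:
    """Service integration: carry out the intent, return a confirmation."""
    return f"Now playing {intent['object']}."

def tts(text: str) -> bytes:
    """Text-to-speech: text in, audio out (stubbed as encoded bytes)."""
    return text.encode("utf-8")

def assistant(audio: bytes) -> bytes:
    """Chain the four stages together, audio in to audio out."""
    return tts(act(parse_intent(asr(audio))))

print(assistant(b"\x00\x01"))
```

The point of the sketch is that errors compound: if the first stage (ASR) gets the words wrong, every stage downstream works on bad input.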

These are fascinating topics, some of which we cover in our ML workshops, but for this blog post we'll look at the first stage – ASR. Instead of just assuming – like most Australians do – that much of what we say will be misunderstood by Google because of our accent, I've set out to put this to the test using freely available web services and Application Programming Interfaces (APIs).

I’ve chosen around 30 seconds of what I like to call the “Gold Standard in Aussie Audio” to test with Google’s automatic YouTube transcription, speech-to-text (STT) and text-to-speech (TTS) services.

To begin with we have a nice clear (human) transcription of 32 seconds of this gem, which will give us a target for the machines to reach.

“And he just decided he'd scoot up the road,  and I just said, "nah, it's not going on like that mate." So I jumped in my car, and I started chasing him up the road. And then he went down a side street, and then the police were coming, and I flashed them and sent them off in the direction of him. But mate all I had was my jocks on. I was chasing him up the street, and I'm just, like,”

What does the auto-transcription service come up with?

“and he just decided he'd screwed up the road not just hit nice snow going on what that may.  so I jumped in my car and I started chasing him up the road and then he went down and I saw it straight and then the police were coming and I flashed him and sent them off in the direction him but mate all I had was me jock so no I was chasing him up the street and I'm just like”

Hmm. Perhaps Google’s cloud STT? This is a step up in terms of quality and price, so let’s give that a go:

“I need to sort if it screwed up the road not just setting are still going on with that mate. So I jumped in the car, and I started chase them up the road then he went down the side street in the police were coming and I flash hemant set them off in the Direction him, but mate all that had with me jocks are not all I was chasing him up the street and I’m just like”

Getting better? To test our accent theory, what happens if we use Google’s own TTS service to turn our human transcription back into speech with a generic American English accent?

Let’s run that back through Google’s cloud STT again:

“and he just decided he’s good at the road, and I just said no, it’s not going on like that, mate.  So I jumped in my car, and I started chasing him up the road and then he went down the side street, and the police were coming and I flash them in sent them off in the direction him but meet all I had was my jocks on. I was chasing him up the street and I’m just like”

Apart from “he'd scoot” being replaced with “he’s good” and the second “mate” being replaced with “meet”, the transcript is very close. Given the accuracy of the generic American English ASR result, I think we can assume the Australian accent is the main issue.
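To put a number on “very close”, the standard metric is word error rate (WER): the word-level edit distance between the reference and the hypothesis, divided by the length of the reference. Here is a minimal implementation of my own for illustration – this is the textbook definition, not how Google scores its own services:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# The two substitutions noted above, plus "up" -> "at": 3 errors in 5 words.
print(wer("he'd scoot up the road", "he's good at the road"))
```

Running the full human transcript against each machine transcript with a function like this would rank the three services on the same footing, rather than eyeballing the damage.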

This test is unfair in the sense that, more often than not, we will be giving simple commands or queries to our digital assistants, not expecting a complete transcript of a conversation. Also – as a colleague gently pointed out – most of the world won’t understand that snippet of Aussie gold, so why should we expect AI in its infancy to do any better?

This is one of the dirty secrets of speech recognition technology. The world’s biggest companies are competing hard to put a box, speaker or app in our homes and devices, to listen, understand and take action on our behalf, but they make no promises regarding accuracy or suitability. If you are a minority in global or technical terms, then you can probably assume that ASR is biased against you. And if you have a physical difficulty that results in a speech impediment – a situation where a digital assistant using ASR would be especially useful to you – then it is probably even less likely to be usable by you. Is it then reasonable to assume that the entire NLP pipeline is also biased?

Over the coming months we'll examine what we can do to overcome these issues, both on our own and collectively. We'll look at a range of major cloud providers' ASR on simple, direct instructions to get a sense of the state of the art, and we'll explore the DIY solutions that are moving towards a truly personal digital assistant.






