A post prompted by the advertisement of reference 1 in the EAORC bulletin of reference 2.
In the beginning, automatic speaker verification (ASV) systems were all about checking that a person on the telephone to a bank or a person talking to Alexa (to take just two examples) was who he said he was, or who he was supposed to be. Then, as computers got cleverer and started to get rather good at spoofing voices, countermeasures (CM) were needed which could distinguish real human voices from fake human voices. Resulting in the system structure sketched above, with the sketch being lifted from the beginning of reference 6.
In what follows, we are concerned with spoofing the spoken word. There is also plenty of interest both in spoofing face and voice at the same time and in spoofing the appearance of a VIP or celebrity of one sort or another in pornographic films, but neither of these latter interests is addressed in what follows, although it may well be that similar considerations are applicable.
In the sketch above, 'target' is the voice of the right person, 'non-target' is the voice of someone else, and 'spoof' is a computer-generated voice, in this context usually one imitating a particular person.
One might recognise just two outputs, accept or reject. Or one might prefer to work with three: CM reject; CM accept but ASV reject; and CM accept and ASV accept.
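To make that concrete, here is a minimal sketch in Python of such a two-stage decision. The score names and thresholds are my own inventions for illustration, not anything taken from reference 6; real systems calibrate such thresholds with some care.

```python
def tandem_decision(cm_score, asv_score, cm_threshold=0.5, asv_threshold=0.5):
    """Three-output view of the tandem system: the countermeasure (CM)
    screens for spoofs first, then the verifier (ASV) checks that the
    voice belongs to the target person. Higher scores mean 'more genuine'
    and 'more like the target' respectively; thresholds are illustrative.
    Collapse the first two outcomes into 'reject' for the two-output view."""
    if cm_score < cm_threshold:
        return "CM reject"               # sounds machine-made
    if asv_score < asv_threshold:
        return "CM accept, ASV reject"   # human, but the wrong person
    return "CM accept, ASV accept"

print(tandem_decision(cm_score=0.9, asv_score=0.2))  # -> CM accept, ASV reject
```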
Either way, such a system can err in one of two ways: by rejecting a voice signal that it should accept (a false rejection) or by accepting a voice signal that it should reject (a false acceptance). These are the type I and type II errors of statisticians, often linked to the testing of the null hypothesis, for which see reference 7. With the idea being to come up with a score for the system's performance in terms of those errors – and then to tune the system to optimise that score. And there are lots of people, lots of teams out there doing just that. Reference 6 introduces this scoring.
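By way of illustration, the snippet below scores a system in about the simplest way there is: compute the two error rates from a batch of scores and sweep the threshold to find the point where they are equal, the so-called equal error rate (EER). The Gaussian toy scores are made up, and the actual metric introduced in reference 6 is rather more elaborate, but this gives the flavour.

```python
import numpy as np

def error_rates(genuine, impostor, threshold):
    """FRR: genuine voices wrongly rejected (the type I error here).
    FAR: impostor or spoof voices wrongly accepted (the type II error).
    Higher scores mean 'accept'."""
    frr = float(np.mean(genuine < threshold))
    far = float(np.mean(impostor >= threshold))
    return frr, far

def equal_error_rate(genuine, impostor):
    """Sweep every observed score as a candidate threshold and return
    the one where FRR and FAR are closest, i.e. the EER operating point."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    rates = [error_rates(genuine, impostor, t) for t in thresholds]
    best = int(np.argmin([abs(frr - far) for frr, far in rates]))
    return thresholds[best], rates[best]

# Toy scores: genuine voices tend to score high, impostors low
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)
impostor = rng.normal(-2.0, 1.0, 1000)
t, (frr, far) = equal_error_rate(genuine, impostor)
print(f"threshold {t:.2f}: FRR {frr:.3f}, FAR {far:.3f}")
```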
ASV systems might work text-independently, using any text, provided it is sufficiently long to generate enough person identification features. Or they might work text-dependently, using a particular word, phrase or sentence.
There seem to be four varieties of attack, four approaches to spoofing:
Impersonation. Use an actor or other such person to impersonate, to spoof the target person.
Replay. Harvest recorded speech from the target person for the bits needed to assemble the required text.
Voice conversion (VC). Manipulate a recording of another voice saying the required text so that it sounds like the target person saying it.
Text to speech synthesis (TTS). Generate the signal of the target person saying the required text, using low-level features of the target person's voice to condition that generation.
An impersonator might be able to spoof the high-level features of the target's speech, but low-level features are more difficult to mimic, if not impossible. Although it presumably helps if the impersonator has a similar voice to the target, with similar low-level features.
Replay involves splicing together bits of recorded speech. I don't yet know whether this can work at the level of syllables as well as whole words. With both the recording and the splicing leaving detectable traces.
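For flavour, here is a toy Python sketch of one way a crude splice might leave a trace: a spike in frame-to-frame spectral change (spectral flux) at a join made with no crossfade. A real countermeasure would be far more sophisticated; this is just my illustration of the idea, with sine tones standing in for speech.

```python
import numpy as np

def spectral_flux(signal, frame_len=512, hop=256):
    """Frame-to-frame change in the magnitude spectrum. A steady voice
    changes gradually; a blunt splice produces a sudden spike."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    spectra = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return np.sum(np.diff(spectra, axis=0) ** 2, axis=1)

# Two toy 'clips' butted together with no crossfade
sr = 16000
a = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
b = np.sin(2 * np.pi * 330 * np.arange(sr) / sr)
spliced = np.concatenate([a, b])

flux = spectral_flux(spliced)
print("largest spectral jump at frame", int(np.argmax(flux)),
      "of", len(flux), "- i.e. at the join")
```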
It has all got quite competitive, with the bad guys building spoofing machines and the good guys building spoof detection machines. A competition which will go on for a while? Some of this competition, some of what might be called the ASV industry, is reflected at reference 3, snapped above, host to competitions between some of the teams building spoof detectors. A competition presently divided into three zones:
Logical access. A mixture of real and spoofed speech data (signal) – spoofed using either TTS or VC algorithms – which undergoes coding, compression and transmission across a variety of telephony channels. The challenge is to design spoofing countermeasures which generalise well to transmission channel variation.
Physical access. A mixture of real and spoofed speech data (sound) – spoofed using replay algorithms – which undergoes acoustic propagation from a variety of real physical spaces. The challenge is to design spoofing countermeasures which generalise well to physical space variation.
Speech deepfake. Testing detection solutions against spoofed, compressed speech data posted online. For example, faked social media posts from a VIP or celebrity of one sort or another.
While the rather more accessible reference 1 goes back to the very beginning: can humans detect fakes?
Back to basics
The research questions addressed by this paper were as follows:
How well can humans detect speech deepfakes?
Are there differences in detection capabilities depending on the language?
Does detection performance improve with a modest amount of training?
There were 529 subjects, mean age just under 30, half men, slightly more than half fluent English speakers, slightly under half fluent Mandarin speakers. This sample was divided into two sub-samples, unary and binary. In the first sub-sample, the idea was to say whether the single sound clip presented was faked or not, with about half being fake. In the second, the idea was to say which of the two sound clips presented was the fake. No feedback was given.
There were 100 sound clips in all, that is to say 50 pairs of fake/non-fake, with the fake being generated from the text of the non-fake. The raw material was drawn from an English dataset (LJSpeech) and a rather more elaborate-looking Chinese dataset (CSMSC).
Each subject was presented with 20 tests, randomly drawn from the pool of 50 pairs. The pool of 50 pairs appears to cover both English and Chinese, so this drawing must have taken language into account. The English version of the computer screens used is snapped above: unary left, binary right.
The answer to the first question was that people get it right about 70% of the time. To the second, that the performance of English speakers was much the same as that of Chinese speakers. To the third, that training did improve performance, but not by much.
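To get a feel for what 70% means against a 50% chance level, here is a back-of-envelope check in Python. The 529 subjects and 20 tests each are the figures summarised above; note that pooling all trials ignores correlation between answers from the same subject, so this is a sanity check, not a reproduction of the paper's own statistics.

```python
from scipy.stats import binomtest

# One subject: is 14 correct out of 20 (70%) distinguishable from guessing?
single = binomtest(14, n=20, p=0.5, alternative="greater")
print(f"one subject at 14/20: p = {single.pvalue:.3f}")   # ~0.058, not really

# All trials pooled: 529 subjects x 20 tests, ~70% correct overall
k = round(0.7 * 529 * 20)   # about 7,406 of 10,580 trials
pooled = binomtest(k, n=529 * 20, p=0.5, alternative="greater")
print(f"pooled: p = {pooled.pvalue:.3g}")                  # vanishingly small
```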
The discussion suggests that while the quality of faking will no doubt go up, so will the quality of fake detection. So who knows how it will all turn out? But it is clearly a matter of some concern that humans are not very good at fake detection. They might, for example, be tricked into doing something unfortunate by someone on the telephone who sounded like their boss. How many of us bother – like nurses and soldiers – to ask for written copies of orders?
It also points out that these results are likely to have been disturbed by the subjects knowing that the tests were about faking and by the high prevalence – around 50% – of fakes.
Other matters
In the course of all this, the word 'codec' cropped up from time to time. Eventually I learned that voice over internet protocol (VoIP) communication is facilitated by a technology called codecs, short for coder-decoder, which first compresses a speaker's audio into data for transmission in packets across the internet and then unpacks it again at the other end. A substantial, well-documented technology in its own right.
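By way of a small illustration of the compress-then-unpack idea, here is the μ-law companding step used in the classic G.711 telephony codec, sketched in Python. Real VoIP codecs do a great deal more than this, of course.

```python
import numpy as np

MU = 255.0  # the mu parameter used by the G.711 telephony codec

def mu_law_encode(x):
    """Compand samples in [-1, 1]: quiet sounds get proportionally more
    of the available 8 bits than loud ones, which suits speech."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_decode(y):
    """Invert the companding to recover (approximately) the original."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

# Round-trip a toy signal through 8-bit quantised mu-law
x = np.sin(2 * np.pi * np.linspace(0.0, 4.0, 1000))
q = np.round(mu_law_encode(x) * 127) / 127   # crude 8-bit quantisation
x_hat = mu_law_decode(q)
print("max round-trip error:", float(np.max(np.abs(x - x_hat))))
```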
From type I and type II errors, I associate to the well-publicised problems of the Post Office Horizon system, the product of a massive project to computerise the accounts of the sub-post-office network, cut down from an even more massive project linking in the benefits system. The problems being the false identification by the system of instances of fraud and the aggressive prosecution of the supposed perpetrators. With the main contractor being the same ICL with which my career in government IT started, and with ICL morphing into the Fujitsu with which it ended. I have turned up the massive, long-running inquiry we have come to expect when this sort of thing happens at reference 8, but I have not been able to turn up anything about the computer system itself – beyond its accounting for the first 50 or so of the 218 issues to be looked at by the inquiry. In all this, I assume that one of the advertised benefits of the Horizon system was bearing down on fraud in said sub-post-office network.
Then only this morning, I was reminded of another fraud detection system, the national identity card, various schemes for which have been promoted over the years; schemes which have so far always been rejected, despite the employment of management consultants with a financial stake in their acceptance. There is a paradox whereby we don't mind about companies like Google and Amazon knowing all about us (and our more or less unique email addresses), companies which exist to extract money out of us, while we do mind about government knowing all about us, a government which exists to serve us, to look after us. Maybe the Tory answer would be that extracting money from people is relatively benign compared with the lust for power for its own sake.
And I remember that I could once play the gambling game called spoof, having been taught to play one Sunday lunchtime in a public house, now demolished in favour of a block of flats, in East Street, here in Epsom. A simple but entertaining game, but one which can result in the rapid loss of money.
Conclusions
The problem has been neatly structured in the opening sketch. But I worry whether this is the only way to look at the problem.
That apart, the story seems to be that, collectively, we are not very good at detecting fake voices. A problem compounded by overconfidence.
To my mind, all this is one more line of evidence that the capabilities of computers are outstripping our capability to keep them under control. People are right to worry.
PS: the brain’s ability to detect fakes is clearly limited. It is always possible to fool the brain with a cleverly done fake. Which leads to the thought, given that we are, to some large extent, our memories, that we are fakes to the extent that our memories are fakes. Our innermost, most private thoughts, are just the echoes of some fake from the past. Which more or less empties the word of meaning.
References
Reference 1: Warning: Humans cannot reliably detect speech deepfakes – Kimberly T. Mai, Sergi Bray, Toby Davies, Lewis D. Griffin – 2023. PLOS ONE: journal.pone.0285333.
Reference 2: http://martinedwardes.me.uk/eaorc/.
Reference 3: https://www.asvspoof.org/.
Reference 4: ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild – Xuechen Liu, Xin Wang, Md Sahidullah, Jose Patino, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas Evans, Andreas Nautsch, Kong Aik Lee – 2023.
Reference 5: Voice conversion versus speaker verification: an overview – Zhizheng Wu, Haizhou Li – 2014.
Reference 6: Tandem assessment of spoofing countermeasures and automatic speaker verification: Fundamentals – Tomi Kinnunen, Héctor Delgado, Nicholas Evans, and others – 2020.
Reference 7: https://en.wikipedia.org/wiki/Type_I_and_type_II_errors.
Reference 8: https://www.postofficehorizoninquiry.org.uk/.