Beam me up, Scotty: The growth of voice biometric authentication

Back in the day, before “apps” and “websites,” you might log into one or two computers using passwords, and it was OK. Safe, even. But today, who remembers the passwords to the dozens of sites many people use regularly? People cope by reusing passwords, or maybe by using a password manager, but what about mobile? Organizations in turn are charged with protecting sensitive data using tools from the stone age of computing. Everyone, both users and technology professionals, agrees that passwords are a poor fit for today’s needs, but nonetheless, they live on.

This is slowly changing with the rise of multifactor / multimodal authentication. It’s a bit of the Wild West, though, with one-time PINs, smartphones, fingerprint scanners, and so on. This article is about voice biometric authentication, its perils and promises.

What is voice biometric authentication?

Modern signal processing techniques can slice up human speech (thus the “bio”) into thousands of readings per second (“metric”). Dozens of parameters, including the tone, pitch, size of a person’s larynx, etc., can be derived and the result stored as a mathematical representation colloquially known as a “voiceprint.” Decades of advances in both hardware and algorithms can even account for voices afflicted by a cold.

For people familiar with password technology, think of a voiceprint as a “voice hash.” Unlike password hashes, though, voiceprints are based on probability, not mathematical certainty. It is important to remember this when evaluating vendor claims or architecting a possible voice solution.
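To see what “mathematical certainty” means on the password side, here is a small Python sketch using only the standard library. The passphrase, salt size, and iteration count are illustrative assumptions, not recommendations. A password check is all-or-nothing: one missing character produces a completely different hash, which is exactly the property a voiceprint comparison does not have.

```python
import hashlib
import hmac
import os


def hash_password(password: str, salt: bytes) -> bytes:
    # PBKDF2-HMAC-SHA256 from the standard library.
    # The iteration count here is illustrative only.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


salt = os.urandom(16)
stored = hash_password("correct horse battery staple", salt)

# Same password, same salt: the hashes match exactly.
exact = hmac.compare_digest(stored, hash_password("correct horse battery staple", salt))

# One character missing: the hashes do not match at all. There is no "almost".
near_miss = hmac.compare_digest(stored, hash_password("correct horse battery stapl", salt))

print(exact, near_miss)
```

A voice system has no equivalent of this yes-or-no comparison, because two recordings of the same person never produce identical features.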

Biometric authentication is usually added into an existing system as an additional factor. Two-factor (“2FA”) is becoming increasingly common: something you know (password) and something you have (one-time PIN from an app or SMS on your smartphone). 3FA adds a third factor, something you are, e.g. fingerprint, iris print, or voice. There are even multimodal systems in which two or more biometrics, such as fingerprint and voice, are used.

To participate in a voice authentication system, a user must “enroll” their voice. There are two kinds of speaker verification systems: text-dependent and text-independent. The former requires a person to repeat the same phrase they enrolled with, while the latter is more flexible. In both cases the voice signal is mathematically analyzed in a process known as “feature extraction.” The same feature extraction is applied to later voice samples, and the results are compared to the enrollment data. Since voice samples never produce a 100% mathematical match, the feature comparison produces a probability estimate. The system’s managers can then tune the acceptance threshold to authenticate the maximum number of legitimate users, while at the same time keeping out imposters.
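To make the comparison step concrete, here is a minimal Python sketch. It is not how any particular vendor does it: the four-number “voiceprints” are made up, cosine similarity stands in for a real statistical model (it yields a similarity score, not a true probability), and the 0.85 threshold is an arbitrary assumption.

```python
import math


def cosine_similarity(a, b):
    # Similarity of two feature vectors: 1.0 means same direction,
    # values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(enrolled, sample, threshold=0.85):
    # Accept when the score clears the (tunable, hypothetical) threshold.
    score = cosine_similarity(enrolled, sample)
    return score >= threshold, score


# Made-up feature vectors, for illustration only.
enrolled = [0.61, 0.12, 0.87, 0.33]
same_speaker = [0.58, 0.15, 0.90, 0.30]
other_speaker = [0.10, 0.95, 0.20, 0.70]

print(verify(enrolled, same_speaker))
print(verify(enrolled, other_speaker))
```

Raising the threshold keeps out more imposters but turns away more legitimate users; lowering it does the reverse.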

A typical text-independent system might be an organization’s customer support hotline: in the first 30 seconds or so, the organization’s voice authentication system can transparently sample a person’s natural speech. The organization might also look at what phone number a person is calling from. While neither test on its own mathematically guarantees anything, as part of a whole system, which may include other proprietary fraud detection methods, the organization can provide its customers with a natural way of securely communicating with them over the phone without having to remember a password.

Later in this article, we will walk through a complete example of enrolling and using a text-dependent version of Microsoft’s emerging cloud-based voice authentication APIs.

Why now?

Voice biometrics has a checkered history, with several commercial flops, including the dramatic collapse of Lernout & Hauspie, once a global leader in voice authentication. Why should you believe that now is different?

There are several factors that account for the renewed interest. Some are technological – improvements in software algorithms and advances in noise-cancelling smartphone microphones among others. Others are social – millions of people are familiar with Apple’s Siri and Amazon’s Alexa and some 2 billion people own a smartphone. Yet others are commercial – as hackers have upped their game, companies are more willing to experiment with biometrics as a factor in 2FA or 3FA. And still others are motivated by the government – for example, federal government guidelines for secure internet banking, which we will discuss later in this article.

For end users, who are not always in the driver’s seat when it comes to technology innovation, there are practical benefits to voice logins – it’s hard to both remember and type complex passwords when you’re using a smartphone. As IBM notes,

The alphanumeric password, originally conceived for a world with desktop computers that are equipped with full-size keyboards, does not adapt well to the new technology paradigm, where keyboards are rapidly fading away. With the introduction of voice-activated personal assistants, and voice-activated mobile apps, voice is increasingly becoming prevalent as an interaction method with technology.

Finally, with advances in artificial intelligence and neural networks, Microsoft, Google, Facebook, Amazon, and Apple are moving into voice recognition, providing additional mainstream attention and respectability. Voice authentication is clearly a growth space.

How accurate are voice biometrics?

There are a number of concerns with voice authentication. Most fundamentally, unlike password hashing algorithms, it’s based on probabilities. Vendors often say “every person’s voice is unique” but today’s technology can only give the probability of a match. It is up to real-world implementations to decide where the sweet spot lies between false accepts (granting access to an imposter) and false rejects (rejecting a valid user). Systems tuned to be more stringent will inevitably inconvenience some users with false rejects. Claims by vendors touting a so-called “equal error rate” (a technical term used for comparing voice systems that is often cited in marketing literature) of under 1% should be taken with a grain of salt. Can they point to an actual independent study or is it just advertising?
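If you have your own test recordings, you can estimate these numbers yourself rather than rely on a brochure. The sketch below, in Python, assumes you already have match scores for genuine attempts and imposter attempts; the scores shown are invented, and the function names are my own. The equal error rate is simply the point where the false accept rate and the false reject rate meet.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    # False accept: an imposter scores at or above the threshold.
    # False reject: a genuine user scores below it.
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    far = false_accepts / len(impostor_scores)
    frr = false_rejects / len(genuine_scores)
    return far, frr


def approximate_eer(genuine_scores, impostor_scores):
    # Try every observed score as a threshold and keep the one where
    # the two error rates are closest to each other.
    candidates = sorted(set(genuine_scores) | set(impostor_scores))

    def gap(t):
        far, frr = error_rates(genuine_scores, impostor_scores, t)
        return abs(far - frr)

    best = min(candidates, key=gap)
    far, frr = error_rates(genuine_scores, impostor_scores, best)
    return best, (far + frr) / 2


# Invented scores, for illustration only.
genuine = [0.91, 0.88, 0.79, 0.95, 0.62, 0.85]
impostor = [0.30, 0.45, 0.66, 0.52, 0.81, 0.20]

threshold, eer = approximate_eer(genuine, impostor)
print(threshold, round(eer, 3))
print(error_rates(genuine, impostor, 0.90))  # a much stricter setting
```

With a dozen samples the estimate is very rough, which is the point: an equal error rate only means something alongside the size and makeup of the test set behind it.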

That said, real world commercial implementations usually will not rely solely on voice biometrics – they’re just one factor in an overall system. Rather than outright reject an iffy voice authentication attempt, a company might decide that it’s coming from a known phone number and weigh its decision slightly differently. Nuance, a major voice recognition vendor, discusses what it calls “Overall Security Rate” in Measuring Performance in a Biometrics Based Multi-Factor Authentication Dialog. As a business and security matter, you have to evaluate if the tradeoff between the expense and complexity of an implementation vs. convenience to users of the system is worth it – there is no simple answer to this question.
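As a rough illustration of weighing voice alongside another signal, here is a hypothetical decision rule in Python. The thresholds, the size of the “known phone number” bonus, and the three outcomes are all assumptions of mine; this is not Nuance’s “Overall Security Rate” or any vendor’s actual logic.

```python
def decide(voice_score, known_phone_number,
           accept_at=0.85, review_at=0.70, known_number_bonus=0.05):
    # Nudge the score upward when the call comes from a number already
    # on file. All numbers here are hypothetical.
    adjusted = voice_score + (known_number_bonus if known_phone_number else 0.0)
    if adjusted >= accept_at:
        return "accept"
    if adjusted >= review_at:
        # An iffy match: ask for another factor rather than reject outright.
        return "step-up"
    return "reject"


print(decide(0.82, known_phone_number=True))
print(decide(0.82, known_phone_number=False))
print(decide(0.50, known_phone_number=True))
```

The middle outcome is where the business tradeoff lives: each step-up costs the user some convenience, and each skipped one costs some security.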

While not strictly related to commercial access control, which generally involves passphrases used in a controlled setting, it should also be recalled that the use of voice forensics (which almost always involves free, uncontrolled speech) in legal proceedings is generally disallowed in US jurisdictions – according to the Wall Street Journal, federal courts have never ruled on the admissibility of voice biometrics as evidence. And according to the FBI’s Technology Assessment for the State of the Art Biometrics Excellence Roadmap, agents who use voice biometrics as an investigative tool are not permitted to offer expert testimony – voice analysis isn’t considered rigorous enough to qualify under the so-called Daubert standard for scientific expert testimony in federal court. This doesn’t mean that commercial uses of speech recognition technology are invalid – granting someone access to a computing system isn’t the same as putting them in jail – but it is worth understanding the broader social context of the technology.

Who are the players?

While this article isn’t a product review, I’ll mention certain companies (there are many others and this should not be taken as an endorsement) whose technology is representative of the broader market. Nuance, also the company behind Dragon NaturallySpeaking, provides its VocalPassword as an on-premises system and claims to be “the world’s most widely deployed voice biometric solution”. Nuance’s software is also the technology behind Apple’s Siri. VoiceVault’s ViGo is a cloud-based offering that can readily be integrated into mobile apps. A startup called SayPay, for example, is using VoiceVault to build a token-based mobile payment solution. Tens of millions of people’s voices have been enrolled in these systems to date.

Besides these specialist companies, Amazon is offering Echo as a full-blown home speech recognition appliance while industry giants like Facebook, Google and Microsoft are experimenting with cloud-based speech recognition using artificial intelligence algorithms. An open question is whether the coming wave of cloud speech APIs will include authentication (Microsoft is offering an experimental voice authentication API, as demoed later in this article).

Who is using voice authentication?

Financial and governmental institutions, perhaps because they need to confront fraud, have been key early adopters of voice as part of a multi-factor authentication solution. The Federal Financial Institutions Examination Council (FFIEC), a US federal interagency body, has for years called for layered security in Internet banking, and there has been a wave of uptake recently in the banking and financial sector.

Vanguard, the mutual fund giant, has offered voice authentication since 2012 to many of its phone users. HSBC Bank will be offering voice authentication for phone and mobile banking to some 15 million users by the summer of 2016, making it one of the largest banking implementations to date. ING Netherlands offers online payments authorized by voice to over 100,000 of its Dutch customers. Some 250,000 Citi credit card users have enrolled in the bank’s voice authentication system for phone transactions. And in the Asia Pacific region, a bellwether for all things online digital, Citi is aggressively pushing voice authentication and several banks in Singapore have announced plans for 2016.

A growing number of government programs use voice authentication. It has been used for years in the United States for tracking parolees. New Zealand’s equivalent of the IRS has enrolled over 1 million users in its voice system and is considering federating its use across multiple government agencies. South Africa’s Social Security Agency has enrolled millions of citizens to receive benefits using voice authentication to counter fraud.

And in one of the largest voice biometric implementations to date, the largest mobile phone operator in Turkey, Turkcell, has over 10 million enrollees in its voice authentication system.

Publicly available implementation details are scarce at this stage. If voice is replacing an existing factor, say passwords, how does a company ensure security given that voiceprints can only be matched to a crude probability limit? When reading about this or that new deployment, you should always keep a critical sense of what you’re being told and what not.

Futuristic uses of voice biometrics include voice e-signatures (though a voice match failure of even 1% raises questions about the validity of such “signatures”), the verification of student identities in distance learning, unlocking cars, and the addition of voice authentication to social networking sites to prevent impersonation. The needs of non-financial, non-governmental organizations to assure “identity-at-a-distance” are becoming paramount in our increasingly networked world, inevitably raising the same privacy and security questions that existing implementers have had to deal with–and that we discuss later.

Risks and benefits

Spoofing is an obvious concern – what if someone records your voice and plays it back? It has been said that, unlike a password, you can’t change your voice if someone steals it. True enough, but vendors have come up with several methods for countering this trick. Simple replay attacks can be defeated because there should never be a 100% mathematical match between two voice samples (there will always be slight variation among the dozens of parameters measured even for speech samples using the same words). “Liveness detection” is another approach to defeating recording attacks in which the authentication system prompts the user to repeat an ad hoc phrase. Of course, this adds time and complexity (and possibly extra licensing fees for the company) to the voice authentication process and some organizations forgo this extra step.
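Both countermeasures are easy to picture in code. The Python sketch below is only an illustration under my own assumptions: the word list, the feature vectors, and the tolerance are invented, and a real system would compare far richer features. The first function builds an unpredictable phrase for a liveness prompt; the second flags a sample that is suspiciously identical to one seen before.

```python
import secrets

# Hypothetical word list for building ad hoc challenge phrases.
WORDS = ["amber", "river", "falcon", "seven", "marble", "tulip", "copper", "nine"]


def make_challenge(n_words=4):
    # secrets.choice draws from a cryptographically strong source,
    # so an attacker cannot predict (and pre-record) the phrase.
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))


def looks_like_replay(sample_features, previous_samples, tolerance=1e-6):
    # Two live utterances always differ slightly. A sample that matches an
    # earlier one almost exactly is more likely a recording played back.
    for previous in previous_samples:
        if len(previous) == len(sample_features) and all(
            abs(a - b) <= tolerance for a, b in zip(previous, sample_features)
        ):
            return True
    return False


print(make_challenge())
history = [[0.61, 0.12, 0.87, 0.33]]
print(looks_like_replay([0.61, 0.12, 0.87, 0.33], history))  # identical sample
print(looks_like_replay([0.58, 0.15, 0.90, 0.30], history))  # natural variation
```

Note that the replay check needs a history of earlier samples to compare against, which has its own storage and privacy implications.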

Nevertheless, spoofing is an ongoing threat that vendors (and purchasers of their systems) will need to be on the watch for. The SPIES computing group at the University of Alabama reported in 2015 that researchers were able to use an off-the-shelf voice-morphing tool to effectively clone a victim’s voice and fool state-of-the-art voice verification systems. A skeptical take on this finding comes from Opus Research, but it is safe to assume that if voice authentication spreads, hackers will inevitably up their game.

The human voice can change due to colds or aging. Contemporary systems supposedly can handle the former, while the latter can be handled through having people periodically re-record their voices.

An interesting anti-fraud mechanism, unique to voice authentication systems, is storing the voice records of known criminals in a fraudster detection blacklist. If someone’s voice triggers a match against the fraud list, it doesn’t necessarily mean that they will automatically be blocked but as a potential red flag it can add another level to an organization’s layered security.

Fraudster detection also raises questions of consent – preventing fraud is a good thing, but if someone’s voice is recorded without their knowledge, even for a seemingly legitimate purpose, it may run afoul of laws in different jurisdictions. Notifications such as “this call will be recorded for quality and training purposes” do not explicitly touch on security, so look for this commonly-heard phrase to quietly expand in the future to cover this case.

Customer privacy and security

Key questions that security specialists will ask are “where and how are voice authentication records stored?” Password leaks occur all the time – are there risks if voice identifiers are leaked?

One issue is where the data is stored – in an on-premises system owned and managed by your organization or in a cloud provider’s system? Another issue is the way in which the data is encoded for storage in the voice database – raw audio, mathematically-transformed, plain text, or encrypted?

Modern systems never store unencrypted passwords, only a hash, and a good voice authentication system will do something similar. For example, rather than store a raw WAV file, systems will store a mathematical representation.

A potential concern with such a system, though, is that unlike password hashes, which have no use beyond authentication, voice “hashes” can in theory be used for de-anonymization.  Imagine a LinkedIn-level leak in which more than 100 million voiceprints are for sale on hacker forums.

This may seem like a far-fetched threat but is clearly a risk inherent in any biometric system where user IDs and an immutable biological characteristic are stored together. Voice biometric data would seem to be Personally Identifiable Information (PII) of the kind federal government agencies and contractors are obligated to protect under the Privacy Act. Storing voice authentication PII in an encrypted form would mitigate the risk of data leaks, though it is not clear which vendors, if any, do so.

While I am not aware of any current voice ID system that only stores authentication data on a user’s local device, it is conceivable that such a system could be built (see How US Government Can Go Mobile With FIDO for a discussion of how local authorization and public key crypto might work in tandem). This would allow voiceprint authentication to anticipate the kinds of concerns raised by Senator Al Franken at a 2013 Senate hearing on the privacy of Apple’s first fingerprint reading smartphone, the iPhone 5s. Apple stated that Touch ID is only used locally, fingerprints are stored in a secure chip on the phone, and that no data is stored on Apple’s servers. When Apple subsequently opened up the Touch ID API to 3rd party developers with the release of iOS 8, apps began appearing that took advantage of local fingerprint authentication. The FIDO (Fast Identity Online) Alliance is promoting just such interoperability but it remains to be seen where voice fits in their roadmap.

Hacked voice databases are not the only concern. Like much data today, biometric PII such as voice is only patchily regulated at present in the United States, and as a result, there are few controls over how the data can be used or shared. However, given increased public awareness of privacy issues, your organization may need to take into account future changes in the regulatory environment. Texas and Illinois (“State Forays Into the Regulation of Biometric Data”), for example, already have laws on the books governing the gathering and use of biometric data. And don’t forget that the European Union has different privacy rules than the United States.

Bear in mind that biometric databases may be susceptible to demands by the US government based on the 3rd party doctrine (a Supreme Court precedent that asserts data shared with 3rd parties does not require the government to get a search warrant to access it).  But in a key case involving GPS tracking from 2012, Justice Sotomayor wrote:

“… it may be necessary to reconsider the premise that an individual has no reasonable expectation of privacy in information voluntarily disclosed to third parties. This approach is ill suited to the digital age, in which people reveal a great deal of information about themselves to third parties in the course of carrying out mundane tasks.”

The take home lesson? If your organization deploys voice biometric authentication, you need to be aware that it potentially has privacy implications and you should responsibly handle the trust your users place in you.

Azure Cognitive Service: speaker recognition

In addition to full-blown on-premises solutions from leading specialist firms such as Nuance, other industry heavyweights are trying to get a piece of the action. As an example of the coming wave of cloud-based voice biometric APIs, here’s a walkthrough of how you can begin exploring Microsoft’s Azure Cognitive Service speaker recognition service, which is currently in preview. In the example here, I enrolled my voice in the service and then tested its ability to validate later samples of my speech. This is a text-dependent system, meaning you enroll your voice using a canned phrase.

To keep things simple, I used two free tools: Python (a popular open source programming language, used here to run the scripts that interact with the Cognitive Service APIs) and Audacity (for converting the voice memo files I created with my iPhone). In the real world, your application would need to handle these details programmatically.
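As one example of handling a detail programmatically, the Python sketch below checks whether a file is in the 16-bit, mono, 16 kHz WAV format the Verification service expects (see step 3 below) before you bother uploading it. It uses only the standard library’s wave module; the function names are my own, and the two silent test files exist purely to exercise the check.

```python
import os
import tempfile
import wave


def check_wav_format(path):
    # Returns a list of problems; an empty list means the file looks usable.
    # wave.open raises wave.Error for files it cannot parse as WAV.
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("expected mono, found %d channels" % wav.getnchannels())
        if wav.getsampwidth() != 2:
            problems.append("expected 16-bit samples, found %d-bit" % (wav.getsampwidth() * 8))
        if wav.getframerate() != 16000:
            problems.append("expected 16000 Hz, found %d Hz" % wav.getframerate())
    return problems


def write_silence(path, channels, rate, seconds=1):
    # Helper for the demo only: writes a silent 16-bit WAV file.
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * rate * seconds)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, channels=1, rate=16000)
    write_silence(bad, channels=2, rate=44100)
    good_problems = check_wav_format(good)
    bad_problems = check_wav_format(bad)

print(good_problems)
print(bad_problems)
```

This only validates the container settings. Converting an m4a voice memo would still need a separate tool, which is the job Audacity did for me.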

  1. Sign up for a free Microsoft developer account. At the end of the enrollment process, you’ll be assigned a developer subscription key.
  2. If you don’t already have Python, download it from python.org and make sure to add python.exe to your PATH. Then grab Microsoft’s open source speaker recognition Python scripts.
  3. We’re going to test the Verification service, which checks if an unknown speech sample matches a previously-enrolled voice. This service requires that the enrollee provide three (3) speech samples using one of several predefined phrases. Each sample needs to be in the form of a 16-bit, mono, 16 kHz WAV file with PCM encoding (since the iPhone voice memo app, which I used in my tests, produces m4a formatted files, I first loaded them into Audacity and exported them to WAV format using Microsoft’s required settings).
  4. From a Windows command prompt, go to the folder where you downloaded the scripts in step 2.
  5. Perform the following steps. First create a speaker verification profile by supplying the 32-character subscription key from step 1:

python CreateProfile.py <subscription_key>

This command should return a new Profile ID (a 32-digit hexadecimal identifier). Now you need to supply 3 training samples using a predefined phrase (I chose “I am going to make him an offer he cannot refuse”). After each successful submission you’re told how far along in the process you are:


python EnrollProfile.py <subscription_key> <profile_id> training-1.wav
Enrollments Completed = 1
Remaining Enrollments = 2
Enrollment Status = Enrolling
Enrollment Phrase = i am going to make him an offer he cannot refuse

python EnrollProfile.py <subscription_key> <profile_id> training-2.wav
Enrollments Completed = 2
Remaining Enrollments = 1
Enrollment Status = Enrolling
Enrollment Phrase = i am going to make him an offer he cannot refuse

python EnrollProfile.py <subscription_key> <profile_id> training-3.wav
Enrollments Completed = 3
Remaining Enrollments = 0
Enrollment Status = Enrolled
Enrollment Phrase = i am going to make him an offer he cannot refuse

  6. And now the acid test – I submitted several speech samples (one good, one where my voice was a bit scratchy and another where I didn’t use the phrase I enrolled with) to see what happens. Note the differing results and confidence levels.


python VerifyFile.py <subscription_key> good.wav <profile_id>
Verification Result = Accept
Confidence = High

python VerifyFile.py <subscription_key> scratchy.wav <profile_id>
Verification Result = Accept
Confidence = Normal

python VerifyFile.py <subscription_key> wrongphrase.wav <profile_id>
Verification Result = Reject
Confidence = High

As more cloud voice authentication APIs become available, you can use your training data to objectively compare the systems and not simply take their claims on faith.
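One way to make such a comparison less anecdotal is to label each test file with the result you expect and tally how often a service agrees. The Python sketch below assumes output in the “Key = Value” form shown above; the function names are my own, and another vendor’s output would need its own parser.

```python
def parse_output(text):
    # Turn lines like "Verification Result = Accept" into a dictionary.
    result = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            result[key.strip()] = value.strip()
    return result


def accuracy(labelled_runs):
    # labelled_runs: list of (expected_result, raw_output_text) pairs.
    correct = sum(
        1 for expected, text in labelled_runs
        if parse_output(text).get("Verification Result") == expected
    )
    return correct / len(labelled_runs)


# The three runs from this walkthrough, with the result I hoped for.
runs = [
    ("Accept", "Verification Result = Accept\nConfidence = High"),
    ("Accept", "Verification Result = Accept\nConfidence = Normal"),
    ("Reject", "Verification Result = Reject\nConfidence = High"),
]

print(parse_output(runs[1][1]))
print(accuracy(runs))
```

Three samples from one speaker prove very little, of course. A fair comparison needs many speakers, including deliberate imposter attempts.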

Moving forward, voice biometric authentication is full of both promise and peril. People are wary of change, often for good reason, but passwords have their own long-standing problems and it is clear that voice biometric authentication is cropping up in more and more places. Before asking Scotty to beam us up, both organizations and end users will need time to adapt and see where the technology makes sense.

Photo credits: Jonathan Gross, William Chew, and Toshiyuki IMAI.
