Ajaya Neupane (University of California Riverside), Nitesh Saxena (University of Alabama at Birmingham), Leanne Hirshfield (Syracuse University), Sarah Elaine Bratt (Syracuse University)
A new generation of scams has emerged that uses voice impersonation
to obtain sensitive information, eavesdrop over voice calls and extort
money from unsuspecting human users. Research demonstrates that users are susceptible to voice
impersonation attacks that exploit recent advances in
speech synthesis. In this paper, we set out to gain a deeper
understanding of such human-centered “voice hacking” based
on a neuro-scientific methodology (thereby corroborating and
expanding the traditional behavioral-only approach in significant
ways). Specifically, we investigate the *neural underpinnings*
of voice security through *functional near-infrared spectroscopy*
(fNIRS), a cutting-edge neuroimaging technique that captures
neural signals in both temporal and spatial domains. We design
and conduct an fNIRS study to pursue a thorough investigation
of users’ mental processing related to *speaker legitimacy detection*
– whether a voice sample is rendered by the target speaker, a
different human speaker, or a synthesizer mimicking the target
speaker. We analyze the neural activity associated
with this task as well as the brain areas that may control such
activity.
Our key insight is that there may be no statistically significant
differences in the way the human brain processes *legitimate
speakers vs. synthesized speakers*, whereas clear differences are
visible when encountering *legitimate vs. different human
speakers*. This finding may help to explain users’ susceptibility
to voice synthesis attacks, as seen in the behavioral self-reported
analysis. That is, the impersonated synthesized voices may seem
*indistinguishable* from the real voices from both behavioral
and neural perspectives. In sharp contrast, prior studies showed
*subconscious* neural differences in other real vs. fake artifacts (e.g.,
paintings and websites), despite users failing to note these differences
behaviorally. Overall, our work dissects the fundamental
neural patterns underlying voice-based insecurity and reveals
users’ susceptibility to voice synthesis attacks at a biological level.
We believe this could be a significant insight for the security
community, suggesting that human detection of voice synthesis
attacks may not improve over time, especially given that voice
synthesis techniques will likely continue to improve. This calls for
the design of careful machine-assisted techniques to help humans
counter these attacks.