Short-duration voiceprint authentication based on deep learning

Fujitsu Research and Development Center Ltd. (FRDC) has developed a high-precision voiceprint authentication technology that uses deep learning to identify a speaker from a short speech segment. The technology combines two deep learning engines: one extracts features related to the spoken content, and the other extracts speaker-related features. Together they provide a "voice password" function: a speaker's identity is accepted only when that speaker correctly says the preset phrase. With this technology, the authentication error rate is about 2.2% on speech segments of no more than 3 seconds.

The technology can be widely used in applications such as call centers and IoT device interaction, where authenticating users quickly and securely improves both the security and the convenience of operations.

【Development Background】

Voiceprint identification is an important branch of biometric authentication. Because it can be performed remotely, voiceprint-based identity authentication has gradually gained acceptance in telephone banking, smart homes, criminal investigation, and security, and has become an important means of fraud prevention. In call centers, customers often need to enter a passcode or verify their identity by answering a series of questions. This question-based authentication takes 60 seconds or more on average, which lowers the efficiency of customer service staff and tries customers' patience. A secure and effective remote authentication method would therefore greatly improve the operational efficiency of call centers and reduce their operating costs.

【Challenges】

Traditional voiceprint recognition relies on statistical and signal processing techniques to extract speaker-related features from speech and use them for identity authentication. However, it often needs a long speech sample to identify the speaker, for example 30 seconds. In applications such as financial contact centers and IoT device interaction, users' identities must be verified quickly, and traditional voiceprint identification cannot meet this requirement. In addition, traditional authentication methods cannot prevent an impostor from using a recording of someone else's voice to impersonate that person.

【Development Method】

(1) Using deep learning to reduce the required speech duration

Traditional voiceprint recognition usually divides speech into small fragments (typically about 20 ms each, called frames) and then uses thousands of Gaussian models to identify speaker-related features in each frame. Because the Gaussian models are numerous and high-dimensional, this statistical method yields valid speaker characteristics only when enough speech data is available. As shown in Figure 1, deep learning can process multiple frames of speech at the same time and learn speaker-specific features from them. Because a longer stretch of speech is processed at once, it contains more features related to the manner of speaking, such as intonation changes, pauses, and other audio characteristics. This use of context can greatly reduce the speech length required for authentication.
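The difference between frame-by-frame processing and multi-frame processing can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FRDC's implementation: the function names `split_into_frames` and `stack_context`, the 16 kHz sample rate, the non-overlapping frames, and the context width of two frames on each side are all hypothetical choices. A real system would typically also compute acoustic features per frame rather than use raw samples.

```python
from typing import List


def split_into_frames(samples: List[float], sample_rate: int,
                      frame_ms: int = 20) -> List[List[float]]:
    """Cut a waveform into non-overlapping frames of frame_ms milliseconds.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def stack_context(frames: List[List[float]], left: int = 2,
                  right: int = 2) -> List[List[float]]:
    """Join each frame with its neighbours so a model sees several frames at once.

    At the edges, the first or last frame is repeated as padding.
    """
    stacked = []
    last = len(frames) - 1
    for t in range(len(frames)):
        window: List[float] = []
        for k in range(t - left, t + right + 1):
            window.extend(frames[min(max(k, 0), last)])
        stacked.append(window)
    return stacked


# Illustration: 3 seconds of silence sampled at 16 kHz (assumed values).
samples = [0.0] * (16000 * 3)
frames = split_into_frames(samples, 16000)
windows = stack_context(frames)
print(len(frames), len(frames[0]), len(windows[0]))  # 150 320 1600
```

Each entry of `windows` covers 100 ms of speech instead of 20 ms, which is the kind of wider view in which intonation changes and pauses become visible to a model.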


(2) Fusing speaker characteristics and voice content

In this technology, we use two deep learning models to separately extract features related to the speaker and features related to the spoken content, and use both for identity verification. This realizes the "voice password" function: a speaker's identity is accepted only when that speaker correctly says the preset phrase, as shown in Figure 2. Using a fixed voice password prevents an impostor from using recordings of someone else's voice to impersonate that person, and it also helps extract more effective speaker characteristics. For example, if a person's voice password contains the syllable [a], and that person pronounces [a] differently from other people, the speaker model learns this pattern as an important feature that distinguishes the person. Even if other people know the voice password, their identities will not be accepted, because they pronounce the syllable [a] differently.
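The acceptance rule described above can be sketched as follows. The article does not say how the two models' outputs are fused, so this example assumes one simple option: each model produces an embedding vector, each is compared with its enrolled counterpart by cosine similarity, and both similarities must pass a threshold. The names `cosine_similarity` and `accept`, the two-dimensional toy vectors, and the 0.7 thresholds are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def accept(speaker_emb: Sequence[float], enrolled_speaker_emb: Sequence[float],
           content_emb: Sequence[float], enrolled_content_emb: Sequence[float],
           speaker_threshold: float = 0.7,
           content_threshold: float = 0.7) -> bool:
    """Accept only if BOTH the voice and the spoken phrase match the enrollment."""
    speaker_ok = cosine_similarity(speaker_emb, enrolled_speaker_emb) >= speaker_threshold
    content_ok = cosine_similarity(content_emb, enrolled_content_emb) >= content_threshold
    return speaker_ok and content_ok


# Toy enrollment data (assumed): one speaker vector and one content vector.
enrolled_speaker = [1.0, 0.0]
enrolled_content = [0.0, 1.0]

# The enrolled user saying the correct phrase is accepted.
genuine = accept([0.9, 0.1], enrolled_speaker, [0.1, 0.9], enrolled_content)
# Someone else who knows the phrase is rejected on the speaker check.
impostor = accept([0.1, 0.9], enrolled_speaker, [0.1, 0.9], enrolled_content)
# The enrolled user saying the wrong phrase is rejected on the content check.
wrong_phrase = accept([0.9, 0.1], enrolled_speaker, [0.9, 0.1], enrolled_content)
print(genuine, impostor, wrong_phrase)  # True False False
```

Requiring both checks to pass is the strictest way to combine the two scores; a weighted sum of the scores would be another option, at the cost of letting a strong score on one check compensate for a weak score on the other.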


【Effect】

Because it makes full use of contextual information, our technology can identify a speaker from only 2 to 3 seconds of speech. Even with such short speech, it still achieves high recognition accuracy: on a data set of 200 people, the authentication error rate is as low as about 2.2%.

【Future】

FRDC plans to apply this technology to call centers in the financial and insurance industries and to provide customers with efficient and secure authentication solutions. In addition, FRDC will continue to promote and expand the use of voiceprint authentication in the management of telephone calls between prison inmates and their families.
