Speaker diarisation

Partitioning a stream of human speech by identity of speaker

Speaker diarisation (or diarization) is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity. It answers the question "who spoke when?" Speaker diarisation combines speaker segmentation and speaker clustering: segmentation finds the speaker change points in an audio stream, and clustering groups speech segments together on the basis of speaker characteristics.
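As an illustration of what a "who spoke when" answer can look like, the following minimal sketch collapses per-frame speaker labels into speaker turns. The function name `frames_to_turns`, the `(start, end, speaker)` tuple layout, the use of `None` for non-speech frames, and the frame duration are assumptions made for this example, not part of any particular diarisation toolkit.

```python
from itertools import groupby


def frames_to_turns(frame_labels, frame_shift=0.01):
    """Collapse per-frame speaker labels into (start, end, speaker) turns.

    frame_labels: one label per frame; None marks a non-speech frame.
    frame_shift:  duration of one frame in seconds (assumed value).
    """
    turns = []
    index = 0  # index of the first frame of the current run
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label is not None:
            start = index * frame_shift
            end = (index + length) * frame_shift
            turns.append((round(start, 3), round(end, 3), label))
        index += length
    return turns


# Hypothetical frame labels: speaker A, a pause, speaker B, then A again.
labels = ["A"] * 3 + [None] * 2 + ["B"] * 4 + ["A"] * 1
turns = frames_to_turns(labels, frame_shift=0.5)
for start, end, speaker in turns:
    print(f"{start:5.1f} - {end:5.1f}  speaker {speaker}")
```

Each boundary between consecutive turns with different labels corresponds to a speaker change point, and turns sharing a label belong to the same speaker cluster.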

With the increasing number of broadcasts, meeting recordings and voice mail collected every year, speaker diarisation has received much attention from the speech community, as shown by the dedicated evaluations held under the auspices of the National Institute of Standards and Technology for telephone speech, broadcast news and meetings. A curated list tracking speaker diarisation research is maintained in Quan Wang's GitHub repository.

Main types of diarisation systems

In speaker diarisation, one of the most popular methods is to model each speaker with a Gaussian mixture model and to assign frames to speakers with the help of a hidden Markov model. There are two main kinds of clustering strategy. Bottom-up algorithms, the more popular kind, start by splitting the full audio content into a succession of clusters and progressively merge redundant clusters until each cluster corresponds to a real speaker. Top-down algorithms start with a single cluster for all the audio data and split it iteratively until the number of clusters equals the number of speakers.
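The bottom-up strategy can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Gaussian mixture model and hidden Markov model pipeline itself: each initial segment is represented by a small feature vector, Euclidean distance between cluster centroids stands in for a model-based merging criterion, and the hypothetical `stop_distance` threshold stands in for a proper stopping rule.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def bottom_up_cluster(segments, stop_distance):
    """Merge the closest pair of clusters until none is close enough.

    segments:      one feature vector per initial segment.
    stop_distance: assumed threshold; merging stops once the closest
                   pair of centroids is farther apart than this.
    Returns a list of clusters, each a sorted list of segment indices.
    """
    # Start with one cluster per segment (over-segmentation).
    clusters = [[i] for i in range(len(segments))]
    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([segments[i] for i in clusters[a]])
                cb = centroid([segments[i] for i in clusters[b]])
                d = math.dist(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > stop_distance:
            break  # remaining clusters are treated as distinct speakers
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical 2-D features for five segments from two speakers.
segments = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2]]
clusters = bottom_up_cluster(segments, stop_distance=1.0)
print(clusters)
```

A top-down variant would run in the opposite direction, starting from one cluster that holds every segment and splitting it until a stopping rule is met.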

More recently, speaker diarisation has been performed with neural networks, leveraging large-scale GPU computing and methodological developments in deep learning.

References

(2019-11-06). "The Speed Submission to DIHARD II: Contributions & Lessons Learned".
"Improved speaker diarization using speaker identification".
"Speaker Segmentation and Clustering".
"Rich Transcription Evaluation Project". National Institute of Standards and Technology.
"Awesome Speaker Diarization".
(2021-11-26). "A Review of Speaker Diarization: Recent Advances with Deep Learning".

Wikipedia Source

This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page.
