ℹ️ Based on https://github.com/pyannote/pyannote-audio.
$ python cluster.py $HOME/Downloads/mclip /tmp/outdir
This script extracts speaker embeddings (features) from a large set of audio files and decides which recordings belong to the same speaker. This task is known as clustering, and it is often applied as the last step of speaker diarization.
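To make the clustering idea concrete, here is a minimal pure-Python sketch. It is not what cluster.py or pyannote-audio actually does: it assumes embeddings are plain lists of floats (non-zero vectors), uses cosine distance, and applies a naive single-linkage merge under a hypothetical threshold of 0.5. All names are illustrative.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, threshold=0.5):
    """Naive single-linkage clustering: any two embeddings closer than
    `threshold` end up in the same cluster. Returns one label per embedding."""
    labels = list(range(len(embeddings)))  # every item starts in its own cluster
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_distance(embeddings[i], embeddings[j]) < threshold:
                old, new = labels[j], labels[i]
                if old != new:
                    # merge the two clusters by relabeling every member
                    labels = [new if lab == old else lab for lab in labels]
    # renumber labels as 0..k-1 in order of first appearance
    remap = {}
    return [remap.setdefault(lab, len(remap)) for lab in labels]


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(cluster_embeddings(toy))
```

Each resulting label would then correspond to one speaker. The threshold is the knob that trades over-merging against over-splitting.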
The script receives two directories as args. The input dir must contain at
least one subdir, for either a male or a female speaker; these subdirs are
expected to come from inaSpeechSegmenter (inaSS). Each subdir must contain
audio-transcription file pairs (namely .wav and .txt extensions; the script
sanity-checks this). The output dir is not wiped at each run, so remember to
remove it before re-running while debugging.
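Since the output dir is kept between runs, a small helper like the one below can save some typing while debugging. This is only a sketch: reset_outdir is a hypothetical name, not part of cluster.py, and it deletes the directory tree without asking.

```python
import shutil
from pathlib import Path


def reset_outdir(outdir):
    """Remove `outdir` (if present) and recreate it empty.
    Destructive: everything under `outdir` is deleted."""
    outdir = Path(outdir)
    if outdir.exists():
        shutil.rmtree(outdir)
    outdir.mkdir(parents=True)
    return outdir
```

Calling reset_outdir("/tmp/outdir") before each debug run has the same effect as removing the directory by hand.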
The expected format of the subdirs within the input directory is as follows:
two IDs separated by an underscore char, the second ID starting with a gender
ID (M or F):
<BROADCASTER_ID>_<GENDER_ID><YMD_DATE_TAG>
e.g.:
andaiafm_M20201105
andaiafm_F20201105
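The naming scheme above can be checked with a regular expression. The sketch below assumes, based only on the two examples, that the broadcaster ID contains no underscore and that the date tag is exactly eight digits. parse_subdir_name is a hypothetical helper, not a function from cluster.py.

```python
import re

# <BROADCASTER_ID>_<GENDER_ID><YMD_DATE_TAG>, e.g. andaiafm_M20201105
# Assumption: broadcaster has no underscore, date tag is 8 digits.
SUBDIR_RE = re.compile(r"^(?P<broadcaster>[^_]+)_(?P<gender>[MF])(?P<date>\d{8})$")


def parse_subdir_name(name):
    """Split a subdir name into (broadcaster, gender, date) or raise ValueError."""
    match = SUBDIR_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected subdir name: {name!r}")
    return match.group("broadcaster"), match.group("gender"), match.group("date")


if __name__ == "__main__":
    print(parse_subdir_name("andaiafm_M20201105"))
```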
$ tree $HOME/Downloads/mclip -C | head
$HOME/Downloads/mclip         $HOME/Downloads/mclip
├── andaiafm_F20201105        ├── andaiafm_M20201105
│   ├── mclip-00000003.txt    │   ├── mclip-00000001.txt
│   ├── mclip-00000003.wav    │   ├── mclip-00000001.wav
│   ├── mclip-00000004.txt    │   ├── mclip-00000002.txt
│   ├── mclip-00000004.wav    │   ├── mclip-00000002.wav
│   ├── mclip-00000005.txt    │   ├── mclip-00000012.txt
│   ├── mclip-00000005.wav    │   ├── mclip-00000012.wav
│   ├── mclip-00000006.txt    │   ├── mclip-00000013.txt
│   ├── mclip-00000006.wav    │   ├── mclip-00000013.wav
...                           ...
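The .wav/.txt pairing shown in the listing is easy to verify up front. The following is a sketch of such a sanity check, not the one implemented in cluster.py: find_unpaired is a hypothetical helper that reports file stems that have only one of the two extensions.

```python
from pathlib import Path


def find_unpaired(subdir):
    """Return the sorted stems that have a .wav without a .txt, or vice versa."""
    subdir = Path(subdir)
    wav_stems = {p.stem for p in subdir.glob("*.wav")}
    txt_stems = {p.stem for p in subdir.glob("*.txt")}
    # symmetric difference: stems present in exactly one of the two sets
    return sorted(wav_stems ^ txt_stems)
```

An empty list means every audio file has its transcription and the other way around.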
Example output:
$ tree /tmp/outdir
/tmp/outdir                                 /tmp/outdir
├── andaiafm20201105                        ├── andaiafm20201105
├── andaiafm20201105-F0001                  ├── andaiafm20201105-M0001
│   ├── andaiafm20201105F0001_000000.txt    │   ├── andaiafm20201105M0001_000000.txt
│   ├── andaiafm20201105F0001_000000.wav    │   ├── andaiafm20201105M0001_000000.wav
│   ├── andaiafm20201105F0001_000001.txt    │   ├── andaiafm20201105M0001_000001.txt
│   ├── andaiafm20201105F0001_000001.wav    │   ├── andaiafm20201105M0001_000001.wav
│   ├── andaiafm20201105F0001_000002.txt    │   ├── andaiafm20201105M0001_000002.txt
│   ├── andaiafm20201105F0001_000002.wav    │   ├── andaiafm20201105M0001_000002.wav
│   ├── andaiafm20201105F0001_000003.txt    │   ├── andaiafm20201105M0001_000003.txt
│   ├── andaiafm20201105F0001_000003.wav    │   ├── andaiafm20201105M0001_000003.wav
│   ├── andaiafm20201105F0001_000004.txt    │   ├── andaiafm20201105M0001_000004.txt
│   ├── andaiafm20201105F0001_000004.wav    │   ├── andaiafm20201105M0001_000004.wav
│   ├── andaiafm20201105F0001_000005.txt    │   ├── andaiafm20201105M0001_000005.txt
│   ├── andaiafm20201105F0001_000005.wav    │   ├── andaiafm20201105M0001_000005.wav
│   ├── andaiafm20201105F0001_000006.txt    │   ├── andaiafm20201105M0001_000006.txt
│   ├── andaiafm20201105F0001_000006.wav    │   ├── andaiafm20201105M0001_000006.wav
│   ├── andaiafm20201105F0001_000007.txt    │   ├── andaiafm20201105M0001_000007.txt
│   ├── andaiafm20201105F0001_000007.wav    │   ├── andaiafm20201105M0001_000007.wav
│   ├── andaiafm20201105F0001_000008.txt    │   ├── andaiafm20201105M0001_000008.txt
│   ├── andaiafm20201105F0001_000008.wav    │   ├── andaiafm20201105M0001_000008.wav
│   ├── andaiafm20201105F0001_000009.txt    │   ├── andaiafm20201105M0001_000009.txt
│   ├── andaiafm20201105F0001_000009.wav    │   ├── andaiafm20201105M0001_000009.wav
│   ├── andaiafm20201105F0001_000010.txt    │   ├── andaiafm20201105M0001_000010.txt
│   ├── andaiafm20201105F0001_000010.wav    │   ├── andaiafm20201105M0001_000010.wav
│   ├── andaiafm20201105F0001_000011.txt    │   ├── andaiafm20201105M0001_000011.txt
│   ├── andaiafm20201105F0001_000011.wav    │   ├── andaiafm20201105M0001_000011.wav
│   ├── andaiafm20201105F0001_000012.txt    │   ├── andaiafm20201105M0001_000012.txt
│   ├── andaiafm20201105F0001_000012.wav    │   ├── andaiafm20201105M0001_000012.wav
│   ├── andaiafm20201105F0001_000013.txt    │   ├── andaiafm20201105M0001_000013.txt
│   ├── andaiafm20201105F0001_000013.wav    │   ├── andaiafm20201105M0001_000013.wav
│   ├── andaiafm20201105F0001_000014.txt    │   ├── andaiafm20201105M0001_000014.txt
│   ├── andaiafm20201105F0001_000014.wav    │   ├── andaiafm20201105M0001_000014.wav
│   ├── andaiafm20201105F0001_000015.txt    │   ├── andaiafm20201105M0001_000015.txt
│   └── andaiafm20201105F0001_000015.wav    ...
├── andaiafm20201105-F0002                  ├── andaiafm20201105-M0002
│   ├── andaiafm20201105F0002_000016.txt    │   ├── andaiafm20201105M0002_000113.txt
│   ├── andaiafm20201105F0002_000016.wav    │   └── andaiafm20201105M0002_000113.wav
│   ├── andaiafm20201105F0002_000017.txt    ├── andaiafm20201105-M0003
│   ├── andaiafm20201105F0002_000017.wav    │   ├── andaiafm20201105M0003_000114.txt
│   ├── andaiafm20201105F0002_000018.txt    │   └── andaiafm20201105M0003_000114.wav
│   ├── andaiafm20201105F0002_000018.wav    ├── andaiafm20201105-M0004
│   ├── andaiafm20201105F0002_000019.txt    │   ├── andaiafm20201105M0004_000115.txt
│   ├── andaiafm20201105F0002_000019.wav    │   └── andaiafm20201105M0004_000115.wav
│   ├── andaiafm20201105F0002_000020.txt    ├── andaiafm20201105-M0005
│   ├── andaiafm20201105F0002_000020.wav    │   ├── andaiafm20201105M0005_000116.txt
│   ├── andaiafm20201105F0002_000021.txt    │   └── andaiafm20201105M0005_000116.wav
│   ├── andaiafm20201105F0002_000021.wav    ├── andaiafm20201105-M0006
│   ├── andaiafm20201105F0002_000022.txt    │   ├── andaiafm20201105M0006_000117.txt
│   └── andaiafm20201105F0002_000022.wav    │   └── andaiafm20201105M0006_000117.wav
├── andaiafm20201105-F0003                  ├── andaiafm20201105-M0007
│   ├── andaiafm20201105F0003_000023.txt    │   ├── andaiafm20201105M0007_000118.txt
│   └── andaiafm20201105F0003_000023.wav    │   └── andaiafm20201105M0007_000118.wav
├── andaiafm20201105-F0004                  ├── andaiafm20201105-M0008
│   ├── andaiafm20201105F0004_000024.txt    │   ├── andaiafm20201105M0008_000119.txt
│   └── andaiafm20201105F0004_000024.wav    │   └── andaiafm20201105M0008_000119.wav
├── andaiafm20201105-F0005                  ├── andaiafm20201105-M0009
│   ├── andaiafm20201105F0005_000025.txt    │   ├── andaiafm20201105M0009_000120.txt
│   ├── andaiafm20201105F0005_000025.wav    │   ├── andaiafm20201105M0009_000120.wav
│   ├── andaiafm20201105F0005_000026.txt    │   ├── andaiafm20201105M0009_000121.txt
│   ├── andaiafm20201105F0005_000026.wav    │   ├── andaiafm20201105M0009_000121.wav
│   ├── andaiafm20201105F0005_000027.txt    │   ├── andaiafm20201105M0009_000122.txt
│   ├── andaiafm20201105F0005_000027.wav    │   ├── andaiafm20201105M0009_000122.wav
│   ├── andaiafm20201105F0005_000028.txt    │   ├── andaiafm20201105M0009_000123.txt
│   ├── andaiafm20201105F0005_000028.wav    │   ├── andaiafm20201105M0009_000123.wav
│   ├── andaiafm20201105F0005_000029.txt    │   ├── andaiafm20201105M0009_000124.txt
│   ├── andaiafm20201105F0005_000029.wav    │   ├── andaiafm20201105M0009_000124.wav
│   ├── andaiafm20201105F0005_000030.txt    │   ├── andaiafm20201105M0009_000125.txt
│   ├── andaiafm20201105F0005_000030.wav    │   ├── andaiafm20201105M0009_000125.wav
│   ├── andaiafm20201105F0005_000031.txt    │   ├── andaiafm20201105M0009_000126.txt
│   ├── andaiafm20201105F0005_000031.wav    │   ├── andaiafm20201105M0009_000126.wav
│   ├── andaiafm20201105F0005_000032.txt    │   ├── andaiafm20201105M0009_000127.txt
│   ├── andaiafm20201105F0005_000032.wav    │   ├── andaiafm20201105M0009_000127.wav
│   ├── andaiafm20201105F0005_000033.txt    │   ├── andaiafm20201105M0009_000128.txt
│   ├── andaiafm20201105F0005_000033.wav    │   ├── andaiafm20201105M0009_000128.wav
│   ├── andaiafm20201105F0005_000034.txt    │   ├── andaiafm20201105M0009_000129.txt
│   ├── andaiafm20201105F0005_000034.wav    │   ├── andaiafm20201105M0009_000129.wav
│   ├── andaiafm20201105F0005_000035.txt    │   ├── andaiafm20201105M0009_000130.txt
│   └── andaiafm20201105F0005_000035.wav    │   ├── andaiafm20201105M0009_000130.wav
├── andaiafm20201105-F0006                  │   ├── andaiafm20201105M0009_000131.txt
│   ├── andaiafm20201105F0006_000036.txt    │   ├── andaiafm20201105M0009_000131.wav
│   └── andaiafm20201105F0006_000036.wav    │   ├── andaiafm20201105M0009_000132.txt
├── andaiafm20201105-F0007                  │   ├── andaiafm20201105M0009_000132.wav
│   ├── andaiafm20201105F0007_000037.txt    │   ├── andaiafm20201105M0009_000133.txt
│   └── andaiafm20201105F0007_000037.wav    │   ├── andaiafm20201105M0009_000133.wav
├── andaiafm20201105-F0008                  │   ├── andaiafm20201105M0009_000134.txt
│   ├── andaiafm20201105F0008_000038.txt    │   ├── andaiafm20201105M0009_000134.wav
│   ├── andaiafm20201105F0008_000038.wav    │   ├── andaiafm20201105M0009_000135.txt
│   ├── andaiafm20201105F0008_000039.txt    │   ├── andaiafm20201105M0009_000135.wav
│   ├── andaiafm20201105F0008_000039.wav    │   ├── andaiafm20201105M0009_000136.txt
│   ├── andaiafm20201105F0008_000040.txt    │   ├── andaiafm20201105M0009_000136.wav
│   ├── andaiafm20201105F0008_000040.wav    │   ├── andaiafm20201105M0009_000137.txt
│   ├── andaiafm20201105F0008_000041.txt    │   ├── andaiafm20201105M0009_000137.wav
│   ├── andaiafm20201105F0008_000041.wav    │   ├── andaiafm20201105M0009_000138.txt
│   ├── andaiafm20201105F0008_000042.txt    │   ├── andaiafm20201105M0009_000138.wav
│   ├── andaiafm20201105F0008_000042.wav    │   ├── andaiafm20201105M0009_000139.txt
│   ├── andaiafm20201105F0008_000043.txt    │   ├── andaiafm20201105M0009_000139.wav
│   ├── andaiafm20201105F0008_000043.wav    ...
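The output names in the listing appear to follow a fixed pattern: a speaker dir named <broadcaster><date>-<gender><4-digit speaker index>, holding files named <broadcaster><date><gender><speaker index>_<6-digit counter>.<ext>. The sketch below rebuilds those names. The pattern is inferred from the example output only, and output_names is a hypothetical helper rather than the function used by cluster.py.

```python
def output_names(broadcaster, date, gender, speaker, counter, ext):
    """Build (speaker_dir_name, file_name) following the pattern seen in the
    example output. `speaker` and `counter` are integers."""
    tag = f"{broadcaster}{date}"            # e.g. andaiafm20201105
    speaker_id = f"{gender}{speaker:04d}"   # e.g. F0002
    dirname = f"{tag}-{speaker_id}"
    filename = f"{tag}{speaker_id}_{counter:06d}.{ext}"
    return dirname, filename


if __name__ == "__main__":
    print(output_names("andaiafm", "20201105", "F", 2, 16, "wav"))
```

Note that in the listing the file counter does not restart per speaker: F0002 starts at 000016, right after the last file of F0001.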
In theory, pyannote-audio solves the entire diarization problem, but it does
not filter out music or noise, so inaSS still has a role there. A second issue
is that inaSS uses TensorFlow as its ML backend while pyannote uses PyTorch; it
would be nice to unify them. Third and last, both libs are far from perfect:
misalignments, missegmentations and misclusterings occur frequently.
$ pip install -r requirements.txt
- pyannote.audio (includes scipy, numpy, etc.)
- torch