- Generic spaCy does not help with NER extraction. stanza has a larger vocabulary of entities and classifies them better than spaCy. GLiNER-spaCy uses the spaCy pipeline but GLiNER models, and unlike the other two it also lets you specify additional entity labels:

  ```python
  custom_spacy_config = {
      "gliner_model": "urchade/gliner_multi",
      "chunk_size": 250,
      "labels": [
          "person", "company", "location", "organization", "city", "date",
          "time", "product", "vehicle", "percentage", "book", "facility",
          "quantity", "ordinal", "cardinal", "money", "event", "nationality",
          "religion", "political group", "crypto",
      ],
      "style": "ent",
  }
  ```
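  As a minimal sketch (assuming the gliner-spacy package is installed, which registers the "gliner_spacy" pipeline factory), the config above plugs into a blank spaCy pipeline like this:

  ```python
  import spacy

  nlp = spacy.blank("en")
  nlp.add_pipe("gliner_spacy", config=custom_spacy_config)  # config dict from above

  doc = nlp("Tim Cook announced the new Apple product in Cupertino on Friday.")
  for ent in doc.ents:
      print(ent.text, "->", ent.label_)
  ```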
- Later on, we adopted a fall-back approach using stanza, because GLiNER was found to identify unnecessary words and to assign multiple classifications to the same entities (see the stanza sketch after this list).
- Pronouns were also extracted as PERSON; to add context to them, we are looking into co-reference resolution.
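For reference, a minimal sketch of the stanza fallback (assuming `pip install stanza` and a one-time `stanza.download("en")`; the sample sentence is a placeholder):

```python
import stanza

# Build an English pipeline with tokenization + NER only.
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

doc = nlp("He later met Barack Obama in Washington.")
for ent in doc.ents:
    print(ent.text, "->", ent.type)  # "He" stays unresolved without co-reference
```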
The best way to process transcripts for YouTube videos is to extract the audio (.wav) from the video and then use a transcription model (to get the transcript) and a diarization model (to identify the speakers).
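As an illustration only (yt-dlp and ffmpeg are assumptions here, not part of the findings above; the URL is a placeholder), the audio can be pulled like this:

```python
import subprocess

subprocess.run(
    [
        "yt-dlp",
        "-x",                      # extract audio only
        "--audio-format", "wav",   # convert to .wav via ffmpeg
        "-o", "audio.%(ext)s",     # output filename template
        "https://www.youtube.com/watch?v=VIDEO_ID",
    ],
    check=True,
)
```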
In my findings, the best choice for extracting transcripts from audio is OpenAI's open-source model:
-> Whisper by OpenAI
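A minimal sketch with the openai-whisper package (the model name "base" and the file "audio.wav" are placeholders):

```python
import whisper

model = whisper.load_model("base")      # sizes: tiny / base / small / medium / large
result = model.transcribe("audio.wav")  # runs the full transcription pipeline
print(result["text"])
```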
But better and more optimized models have come out, like faster-whisper. It is a faster and more memory-efficient version of OpenAI's Whisper model, designed to work well even on machines without powerful GPUs. It achieves this by using a special backend called CTranslate2, which makes the model run much faster by optimizing how it processes data:
It is a dedicated inference engine (more precisely, a runtime framework) for Transformer-based models. It takes models like Whisper and makes them run much more efficiently, both on CPU and GPU.
Here’s what makes it special:
- Supports 8-bit quantization, which compresses the model without losing much accuracy, saving memory (sketched after this list).
- Batch-processes audio chunks for better speed.
- Lightweight and great for deployment on servers and even edge devices.
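To make the quantization point concrete, here is a hedged sketch using the ct2-transformers-converter CLI that ships with the ctranslate2 package (the model ID and output directory are placeholders):

```python
import subprocess

# Convert a Hugging Face Whisper checkpoint into CTranslate2 format,
# quantizing the weights to 8-bit integers along the way.
subprocess.run(
    [
        "ct2-transformers-converter",
        "--model", "openai/whisper-base",    # source checkpoint (placeholder)
        "--output_dir", "whisper-base-ct2",  # converted model directory
        "--quantization", "int8",            # 8-bit weights: less memory, similar accuracy
    ],
    check=True,
)
```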
faster-whisper, built on top of CTranslate2, reimplements Whisper's inference logic in a much more optimized way. This makes it significantly faster and more memory-efficient than OpenAI's original PyTorch-based Whisper implementation, especially on CPUs or when deploying on limited hardware.
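A minimal faster-whisper sketch (assuming `pip install faster-whisper`; model size, device, and file name are placeholders):

```python
from faster_whisper import WhisperModel

# int8 compute keeps memory low, which suits CPU-only machines.
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.wav", beam_size=5)
print("Detected language:", info.language)
for segment in segments:  # segments is a generator; transcription happens lazily here
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```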