- Generic spaCy does not help with NER extraction. stanza has a larger vocabulary of entities and classifies them better than spaCy. GLiNER-spaCy uses the spaCy pipeline but GLiNER models, and unlike the other two it also lets you specify additional entity labels:

  ```python
  custom_spacy_config = {
      "gliner_model": "urchade/gliner_multi",
      "chunk_size": 250,
      "labels": [
          "person", "company", "location", "organization", "city", "date",
          "time", "product", "vehicle", "percentage", "book", "facility",
          "quantity", "ordinal", "cardinal", "money", "event", "nationality",
          "religion", "political group", "crypto",
      ],
      "style": "ent",
  }
  ```
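  As a minimal sketch (assuming the gliner-spacy package is installed, which registers the "gliner_spacy" pipeline factory), the config above plugs into a blank spaCy pipeline like this:

  ```python
  import spacy

  nlp = spacy.blank("en")
  nlp.add_pipe("gliner_spacy", config=custom_spacy_config)  # config dict from above

  doc = nlp("Tim Cook announced the new Apple product in Cupertino on Friday.")
  for ent in doc.ents:
      print(ent.text, "->", ent.label_)
  ```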
- Later on, we adopted a fall-back approach using stanza, because GLiNER was found to identify unnecessary words and to assign multiple classifications to the same entities (see the stanza sketch after this list).
- Pronouns were also extracted as PERSON; to add context to them, we are looking into co-reference resolution.
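For reference, a minimal sketch of the stanza fallback (assuming `pip install stanza` and a one-time `stanza.download("en")`; the sample sentence is a placeholder):

```python
import stanza

# Build an English pipeline with tokenization + NER only.
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

doc = nlp("He later met Barack Obama in Washington.")
for ent in doc.ents:
    print(ent.text, "->", ent.type)  # "He" stays unresolved without co-reference
```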
The best way to process transcripts for YouTube videos is to extract the audio (.wav) from the video and then use a transcription model (to get the transcript) and a diarization model (to identify the speakers).
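As an illustration only (yt-dlp and ffmpeg are assumptions here, not part of the findings above; the URL is a placeholder), the audio can be pulled like this:

```python
import subprocess

subprocess.run(
    [
        "yt-dlp",
        "-x",                      # extract audio only
        "--audio-format", "wav",   # convert to .wav via ffmpeg
        "-o", "audio.%(ext)s",     # output filename template
        "https://www.youtube.com/watch?v=VIDEO_ID",
    ],
    check=True,
)
```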
In my findings, the best choice for extracting transcripts from audio is OpenAI's open-source model:
-> Whisper by OpenAI
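A minimal sketch with the openai-whisper package (the model name "base" and the file "audio.wav" are placeholders):

```python
import whisper

model = whisper.load_model("base")      # sizes: tiny / base / small / medium / large
result = model.transcribe("audio.wav")  # runs the full transcription pipeline
print(result["text"])
```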
But better and more optimized models have come out, like faster-whisper. It is a faster and more memory-efficient version of OpenAI's Whisper model, designed to work well even on machines without powerful GPUs. It achieves this by using a special backend called CTranslate2, which makes the model run much faster by optimizing how it processes data:
It is a dedicated inference engine (more precisely, a runtime framework) for Transformer-based models. It takes models like Whisper and makes them run much more efficiently, both on CPU and GPU.
Here’s what makes it special:
- Supports 8-bit quantization, which compresses the model without losing much accuracy, saving memory (sketched after this list).
- Batch-processes audio chunks for better speed.
- Lightweight and great for deployment on servers and even edge devices.
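To make the quantization point concrete, here is a hedged sketch using the ct2-transformers-converter CLI that ships with the ctranslate2 package (the model ID and output directory are placeholders):

```python
import subprocess

# Convert a Hugging Face Whisper checkpoint into CTranslate2 format,
# quantizing the weights to 8-bit integers along the way.
subprocess.run(
    [
        "ct2-transformers-converter",
        "--model", "openai/whisper-base",    # source checkpoint (placeholder)
        "--output_dir", "whisper-base-ct2",  # converted model directory
        "--quantization", "int8",            # 8-bit weights: less memory, similar accuracy
    ],
    check=True,
)
```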
faster-whisper, built on top of CTranslate2, reimplements Whisper's inference logic in a much more optimized way. This makes it significantly faster and more memory-efficient than OpenAI's original PyTorch-based Whisper implementation, especially on CPUs or when deploying on limited hardware.
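A minimal faster-whisper sketch (assuming `pip install faster-whisper`; model size, device, and file name are placeholders):

```python
from faster_whisper import WhisperModel

# int8 compute keeps memory low, which suits CPU-only machines.
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.wav", beam_size=5)
print("Detected language:", info.language)
for segment in segments:  # segments is a generator; transcription happens lazily here
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```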