Analysis for NER

  • Generic, out-of-the-box spaCy models did not perform well for NER extraction.
  • Stanza has a larger vocabulary of entity types and classified entities better than spaCy.
  • GLiNER-spaCy runs inside the spaCy pipeline but uses GLiNER models underneath, and unlike the two above it lets you specify additional entity labels (see the usage sketch after this list):
    custom_spacy_config = {
      "gliner_model": "urchade/gliner_multi",
      "chunk_size": 250,
      "labels": [
          "person", "company", "location", "organization", "city", "date", "time", "product",
          "vehicle", "percentage", "book", "facility", "quantity", "ordinal", "cardinal",
          "money", "event", "nationality", "religion", "political group", "crypto"
      ],
      "style": "ent"
    }
  • We later adopted a fall-back approach using Stanza, because in testing GLiNER picked up unnecessary words and assigned multiple labels to the same entity.
  • Pronouns were also extracted as PERSON; to attach them to the entities they refer to, we are looking into coreference resolution.
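
A minimal sketch of the approach described above: GLiNER-spaCy first, with Stanza as the fall-back when GLiNER finds nothing. This assumes the gliner-spacy and stanza packages are installed; it reuses the config from the list above (labels shortened here), and the example sentence is just an illustration.

  import spacy
  import stanza

  # GLiNER-spaCy: a blank spaCy pipeline with the GLiNER component added.
  custom_spacy_config = {
      "gliner_model": "urchade/gliner_multi",
      "chunk_size": 250,
      "labels": ["person", "company", "location", "organization", "date"],
      "style": "ent",
  }
  gliner_nlp = spacy.blank("en")
  gliner_nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

  # Stanza fall-back: English pipeline with tokenization and NER only.
  # (Requires a one-time stanza.download("en") to fetch the models.)
  stanza_nlp = stanza.Pipeline("en", processors="tokenize,ner")

  def extract_entities(text):
      """Try GLiNER first; fall back to Stanza if GLiNER returns no entities."""
      doc = gliner_nlp(text)
      if doc.ents:
          return [(ent.text, ent.label_) for ent in doc.ents]
      stanza_doc = stanza_nlp(text)
      return [(ent.text, ent.type) for ent in stanza_doc.ents]

  print(extract_entities("Tim Cook announced Apple's results in Cupertino on Thursday."))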

Analysis for YouTube Extraction with CPU Support

The best way to get a transcript for a YouTube video is to extract the video's audio (.wav) and then run a transcription model (to get the transcript) and a diarization model (to identify the speakers), as sketched below.
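
A minimal sketch of the audio-extraction step. It assumes the yt-dlp package (not named above) as the downloader and ffmpeg available on the system; the URL is a placeholder.

  from yt_dlp import YoutubeDL

  # Download the best audio stream and convert it to WAV via ffmpeg.
  ydl_opts = {
      "format": "bestaudio/best",
      "outtmpl": "audio.%(ext)s",
      "postprocessors": [{
          "key": "FFmpegExtractAudio",
          "preferredcodec": "wav",
      }],
  }

  with YoutubeDL(ydl_opts) as ydl:
      ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder URL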

In my findings, the best choice for extracting transcripts from audio is OpenAI’s open-sourced model:
-> Whisper by OpenAI

But better and more optimized implementations have since come out, such as faster-whisper.

What is Faster-Whisper?

It is a faster and more memory-efficient reimplementation of OpenAI's Whisper model. It is designed to work well even on machines without powerful GPUs, which it achieves by using a dedicated backend called CTranslate2 that optimizes how the model processes data:

-> FasterWhisper
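
A minimal transcription sketch using the faster-whisper Python API. The model size ("base"), device, and compute type are illustrative choices; compute_type="int8" selects the quantized weights that make CPU inference practical.

  from faster_whisper import WhisperModel

  # Load a quantized model on the CPU; int8 keeps memory usage low.
  model = WhisperModel("base", device="cpu", compute_type="int8")

  # transcribe() returns a lazy generator of segments plus stream info.
  segments, info = model.transcribe("audio.wav")

  print(f"Detected language: {info.language}")
  for segment in segments:
      print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")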

What is CTranslate2?

It is a dedicated inference engine (more precisely, a runtime framework) for Transformer-based models. It takes models like Whisper and runs them much more efficiently, on both CPU and GPU.

Here’s what makes it special:

  • It supports 8-bit quantization, which compresses the model with little loss in accuracy, saving memory (see the conversion sketch after this list).
  • It batch-processes audio chunks for better speed.
  • It is lightweight and well suited to deployment on servers and even edge devices.
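
As a concrete example of the quantization point, a Hugging Face Whisper checkpoint can be converted to CTranslate2's format with 8-bit weights. A minimal sketch, assuming the ctranslate2 and transformers packages are installed; the model name and output directory are illustrative.

  from ctranslate2.converters import TransformersConverter

  # Convert the Hugging Face checkpoint to CTranslate2's optimized format,
  # quantizing the weights to int8 to shrink the model and save memory.
  converter = TransformersConverter("openai/whisper-base")
  converter.convert("whisper-base-ct2", quantization="int8")

The resulting directory can then be loaded by faster-whisper by passing its path to WhisperModel in place of a model name.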

Why Faster-Whisper Uses CTranslate2

faster-whisper is built on top of CTranslate2 in order to reimplement Whisper's inference logic in a much more optimized way. This makes it significantly faster and more memory-efficient than OpenAI's original PyTorch-based Whisper implementation, especially on CPUs or when deploying on limited hardware.