Multimedia Tooling #146

BradKML · 2024-12-21T04:36:05Z


When I first saw this project, I was inspired by how many kinds of media it supports: https://github.com/rmusser01/tldw
That covers not just document reading and web crawling, but multimedia handling as well (see #138).
Podcasts, live streams, presentations, video essays, theater transcriptions, documentaries... Most of these rely on something like OpenAI's Whisper model for speech and perhaps video VLMs for visuals, but there are other options too.

Major hurdles:

Multiple speakers, OR a speaker vs. inserted video clips => https://en.wikipedia.org/wiki/Speaker_diarisation
Summarization that relies on techniques like "Chain of Density" => https://www.prompthub.us/blog/better-summarization-with-chain-of-density-prompting
Chunking information on multiple layers (document-based chunking first, then recursive and semantic chunking) and feeding the chunks into RAG => https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d
Pure audio transcription vs. also capturing what is "in frame" for video => see @byjlw for example
Multilingual support for media from other countries => BLOOM and other multilingual LLMs; requires advances in multilingual Whisper
Removing music to prevent accidental mis-encoding => vocal isolation or instrumental isolation (commonly used in music remixing and sampling)
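
To make the multi-layer chunking hurdle a bit more concrete, here is a minimal sketch of the first two layers only (document-level grouping, then recursive splitting). The function names, the separator order, and the `max_chars` default are all assumptions for illustration, not taken from any of the linked tools. The semantic layer is left out because it would need an embedding model.

```python
from typing import Dict, List


def recursive_split(text: str, max_chars: int, separators: List[str]) -> List[str]:
    """Split on the coarsest separator first; recurse to finer separators
    only for pieces that are still longer than max_chars."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in text.split(head):
        # Note: str.split drops the separator itself from each piece.
        chunks.extend(recursive_split(piece, max_chars, rest))
    return chunks


def chunk_documents(docs: Dict[str, str], max_chars: int = 200) -> List[dict]:
    """Layer 1: keep chunks grouped per document.
    Layer 2: recursive split inside each document (paragraph, sentence, word)."""
    out = []
    for doc_id, text in docs.items():
        pieces = recursive_split(text, max_chars, ["\n\n", ". ", " "])
        for i, chunk in enumerate(pieces):
            out.append({"doc": doc_id, "index": i, "text": chunk})
    return out
```

This sketch never merges small neighbouring pieces back together, so it can produce very short chunks; a real splitter would probably want a merge step and some overlap.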

Examples:

SmartManoj · 2024-12-25T12:58:36Z

Maintainer

Would you happen to have any video with the expected output to test?


BradKML Dec 27, 2024
Author

With all these tools, I can imagine this pipeline:

Provide a list of video/audio files, a channel, or even a list of channels to download
For each video/audio file, first create a general transcript with high accuracy
Separate main speaker(s), guest speaker(s), music (e.g. background, show intro), and media clips (e.g. news, speech references)
Focus on transcribing and tagging the main and guest speaker(s), with media clips treated as quotations
Index them for RAG both as 4-8 minute segments AND at the episode or video level (with awareness of monologue/response vs. dialogue)
Create CoD summaries for episode-level or video-level chunks, with context on who says/reports what
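
As a rough illustration of the segmenting and tagging steps, the sketch below groups an already diarized transcript into windows of at most 8 minutes, drops music, and renders clips as quotations. The `Segment` fields and the role labels ("main", "guest", "clip", "music") are hypothetical; a real transcriber or diarizer will have its own output format. Only the 8-minute upper bound is enforced, not the 4-minute lower bound.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    start: float   # seconds from the start of the episode
    end: float
    speaker: str   # assumed role label: "main", "guest", "clip" or "music"
    text: str


def window_segments(
    segments: Sequence[Segment],
    max_seconds: float = 480.0,          # 8 minutes, the upper end of "4-8 minute segments"
    skip_roles: Sequence[str] = ("music",),
) -> List[List[Segment]]:
    """Group consecutive segments into windows no longer than max_seconds,
    skipping roles (such as music) that should not reach the index."""
    windows: List[List[Segment]] = []
    current: List[Segment] = []
    for seg in segments:
        if seg.speaker in skip_roles:
            continue
        # Close the window if adding this segment would exceed the limit.
        if current and seg.end - current[0].start > max_seconds:
            windows.append(current)
            current = []
        current.append(seg)
    if current:
        windows.append(current)
    return windows


def render(window: Sequence[Segment]) -> str:
    """Turn one window into text for indexing; clips are shown as quotations."""
    lines = []
    for seg in window:
        if seg.speaker == "clip":
            lines.append(f'[clip] "{seg.text}"')
        else:
            lines.append(f"{seg.speaker}: {seg.text}")
    return "\n".join(lines)
```

The episode-level view would then just be all windows of one file joined together, which is where a CoD-style summary could be produced.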

Bonus: the pipeline could also recognize what is on screen as things happen (e.g. live news reporting) and extract accurate thumbnails, with both stored at the segment level.
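
For the segment-level thumbnails, one simple option is to grab a single frame per segment with ffmpeg. The sketch below only builds the command lines and does not run them. Picking the midpoint of each segment and the output file naming are assumptions; choosing a frame that is actually representative would need something smarter.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def thumbnail_commands(
    video_path: str,
    segments: Sequence[Tuple[float, float]],   # (start, end) in seconds
) -> List[List[str]]:
    """Build one ffmpeg command per segment that saves a single frame
    taken at the segment midpoint (assumed to be a reasonable thumbnail)."""
    stem = Path(video_path).stem
    commands = []
    for i, (start, end) in enumerate(segments):
        midpoint = (start + end) / 2
        out_path = f"{stem}_seg{i:03d}.jpg"    # hypothetical naming scheme
        commands.append([
            "ffmpeg",
            "-ss", f"{midpoint:.2f}",  # seek to the midpoint before decoding
            "-i", video_path,
            "-frames:v", "1",          # write exactly one video frame
            out_path,
        ])
    return commands
```

Each command could then be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed.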
