Overview

This project is designed to parse text from various media types: audio (.wav), video (.mp4), and text documents (.pdf). The implementation utilizes Python and its libraries, relying exclusively on free APIs and libraries for unlimited usage.

Features

Parse Text from Video
- Implementation: Extracts audio from the video in .wav format using the Moviepy library. Processes the audio through text extraction, reusing the logic implemented for audio parsing.
- Issues Faced: Alternative libraries, such as SpeechRecognition, proved less efficient than Moviepy.
Parse Text from Audio
- Implementation: Utilized the SpeechRecognition library with the Google Web Speech-to-Text service for audio transcription. Dealt with limitations of 120-second processing by splitting audio into 115-second chunks using pydub.silence. Errors such as UnknownValueError and RequestError were managed with appropriate fallbacks.
- Challenges: Processing long audio files is time-intensive (e.g., 15-minute audio may take 30-40 minutes). Paid APIs with better performance were avoided due to limited free usage.
Parse Text from PDF
- Implementation: Used the PyMuPDF library to extract clean and structured text.
- Challenges: Preservation of tabular data was not supported natively. However, the row structure of tabular data was maintained by outputting each column in a new line. Alternatives like PyPDF2, Textract, and Tika were tested but found less efficient.

Libraries and Tools

Text Extraction: SpeechRecognition, PyMuPDF
Audio Processing: Moviepy, pydub
Error Handling: Custom error management for transcription inconsistencies

Known Limitations

Limited accuracy and speed of the Google Web Speech-to-Text service. Inefficiency in preserving tabular data in PDF files. For further improvements, exploring better transcription APIs and tabular-specific libraries like Tabulapy is recommended.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
README.md		README.md
extract_text_from_audio.py		extract_text_from_audio.py
extract_text_from_pdf.py		extract_text_from_pdf.py
extract_text_from_video.py		extract_text_from_video.py
main.py		main.py
mp3_to_wav.py		mp3_to_wav.py
report.pdf		report.pdf
write_to_file.py		write_to_file.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

Features

Libraries and Tools

Known Limitations

About

Releases

Packages

Languages

jahnabiroy/Text-Extractor

Folders and files

Latest commit

History

Repository files navigation

Overview

Features

Libraries and Tools

Known Limitations

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages