Webscraping with Python Example

This Python script gathers text from websites and YouTube videos and saves it in a directory named 'data'. URLs can be supplied through an Excel file or as a list of URLs you create yourself.

Overview

Installation of Required Libraries

Install the required libraries into your virtual environment before running the script. The requirements.txt file lists the necessary libraries.

Imported Packages

from pathlib import Path
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urlunparse
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter
import json
import openpyxl
import requests
import re
import os
import shutil

Webscraping Process

Retrieve URLs from Excel File

The script reads URLs from an Excel file containing website and YouTube video links, storing them in separate lists for further processing.
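The split into separate lists can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name split_urls, the YOUTUBE_HOSTS set, and the rule for recognizing a YouTube link are all assumptions. It takes plain cell values so that it runs without an Excel file.

```python
from urllib.parse import urlparse

# Hypothetical set of host names treated as YouTube links (an assumption).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def split_urls(cells):
    """Split raw cell values into (website_urls, youtube_urls)."""
    website_urls, youtube_urls = [], []
    for cell in cells:
        if not isinstance(cell, str):
            continue  # skip empty cells and non-text values
        url = cell.strip()
        parsed = urlparse(url)
        host = parsed.hostname or ""
        if parsed.scheme not in ("http", "https") or not host:
            continue  # skip headers, notes, and anything that is not a web URL
        if host in YOUTUBE_HOSTS:
            youtube_urls.append(url)
        else:
            website_urls.append(url)
    return website_urls, youtube_urls
```

With openpyxl, the cell values could come from something like `openpyxl.load_workbook(path).active.iter_rows(values_only=True)`, flattened into a single list. Which sheet and column hold the URLs depends on your file.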

Scrape Websites

The script iterates over the list of website URLs, retrieves each page's content, and removes unnecessary elements (headers, footers). It collects the extracted text and the metadata in separate lists, then writes the text to individual .txt files and the metadata to a metadata file.
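The cleanup step can be illustrated with a small, self-contained sketch. Note the swap: the script imports BeautifulSoup, but this example uses the standard library's html.parser so that it runs without extra packages. The class name MainTextExtractor, the function extract_text, the list of skipped tags, and the metadata fields (url, title) are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose content is dropped; the exact set is an assumption.
SKIPPED_TAGS = {"header", "footer", "nav", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect visible text while ignoring headers, footers, and similar blocks."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html, url):
    """Return (text, metadata) for one page of HTML."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    text = "\n".join(parser.chunks)
    metadata = {"url": url, "title": parser.title}
    return text, metadata
```

In practice the HTML would come from a call such as `requests.get(url, timeout=10).text`. With BeautifulSoup, the same idea is usually expressed by calling `decompose()` on the unwanted tags and then `get_text()` on the remaining tree.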

Scrape YouTube Transcripts

For YouTube video URLs, the script retrieves their transcripts, cleans the text, and saves it into .txt files. Similarly, metadata regarding the video is stored in the metadata file.
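Two supporting steps for this stage can be sketched with the standard library alone: pulling the video ID out of a URL and tidying the transcript text. The helper names extract_video_id and clean_transcript, the URL shapes handled, and the cleaning rules are assumptions, not the notebook's actual code. Fetching the transcript itself goes through youtube_transcript_api, whose exact call depends on the installed version, so it is left out here.

```python
import re
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the video ID from a common YouTube URL shape, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parsed.path.lstrip("/").split("/")[0] or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        match = re.match(r"^/(?:embed|shorts)/([^/?#]+)", parsed.path)
        if match:
            return match.group(1)
    return None


def clean_transcript(text):
    """Drop bracketed cues such as [Music] and collapse whitespace."""
    text = re.sub(r"\[[^\]]*\]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

The ID returned by extract_video_id is what the transcript library expects as input, and the cleaned string is what would be written to the .txt file.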

Save All Webpage Content

The script saves all collected webpage content, including websites, transcripts, and additional content, into .txt files, while also storing metadata about each piece of content.
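A minimal sketch of the save step, assuming each piece of content arrives as a (text, metadata) pair. The function name save_documents, the file naming scheme, and the choice of a single metadata.json file are assumptions; the source only states that text goes to .txt files and metadata to a metadata file.

```python
import json
from pathlib import Path


def save_documents(documents, data_dir="data"):
    """Write each text to its own .txt file and all metadata to one JSON file."""
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    metadata = []
    for index, (text, meta) in enumerate(documents, start=1):
        file_name = f"document_{index:03d}.txt"  # hypothetical naming scheme
        (out_dir / file_name).write_text(text, encoding="utf-8")
        # Record which file holds the text alongside the source metadata.
        metadata.append({**meta, "file": file_name})

    metadata_path = out_dir / "metadata.json"
    metadata_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return metadata_path
```

Storing the file name inside each metadata entry keeps the link between a text file and its source URL, which is useful when the files are processed later.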

Instructions for Use

1. Ensure the required libraries are installed.
2. Prepare an Excel file with URLs or create a list of URLs.
3. In the script, replace the placeholder your_file_name_here with the path to your Excel file.
4. Run the script. It scrapes text content from the provided URLs and saves the data and metadata in the 'data' directory.

Note: Set up the environment and install the necessary packages before running the script. The code may need adjusting for specific website structures, for changes in website layouts, or for changes in the youtube_transcript_api library.

Feel free to modify and adapt the script for your specific requirements.
