A Python library for scraping complete webpages: it downloads the HTML, JavaScript, CSS, and favicons of a page. Just plug and play. It is useful for web scraping, archiving web pages, or analyzing web content locally.
To use this code, install the following libraries on your system:
- pandas
- BeautifulSoup
- Selenium
- webdriver-manager
- requests
- pillow
pip install pandas
pip install beautifulsoup4
pip install selenium
pip install webdriver-manager
pip install requests
pip install pillow
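Some of the pip package names above differ from the module names used in `import` statements (for example, `beautifulsoup4` is imported as `bs4` and `pillow` as `PIL`). The following is a minimal sketch for checking that everything is importable before running the scraper. The helper name `missing_packages` and the `REQUIRED` mapping are illustrative and not part of this library.

```python
from importlib.util import find_spec

# Hypothetical mapping: pip package name -> importable top-level module name.
REQUIRED = {
    "pandas": "pandas",
    "beautifulsoup4": "bs4",
    "selenium": "selenium",
    "webdriver-manager": "webdriver_manager",
    "requests": "requests",
    "pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```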
Set LogFile to the file name you want for the log (the extension must be .xlsx). Set mainPage_URL to the PhishTank search page that lists the kind of URLs you want to scrape: the page of legitimate URLs or the page of phishing URLs.
# mainPage_URL for the webpage containing list of legitimate URLs
mainPage_URL = f"https://phishtank.org/phish_search.php?page={pageNo}&valid=n&Search=Search"
# mainPage_URL for the webpage containing list of phishing URLs
mainPage_URL = f"https://phishtank.org/phish_search.php?page={pageNo}&active=y&valid=y&Search=Search"
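Rather than commenting one assignment in and out, the choice between the two URLs can be wrapped in a small helper. This is only a sketch: the function name `main_page_url` and its `phishing` flag are hypothetical, and the query strings are copied from the two lines above.

```python
def main_page_url(page_no: int, phishing: bool) -> str:
    """Build the PhishTank search URL for one results page.

    phishing=True  -> page listing phishing URLs (active=y, valid=y)
    phishing=False -> page listing legitimate URLs (valid=n)
    """
    base = "https://phishtank.org/phish_search.php"
    if phishing:
        return f"{base}?page={page_no}&active=y&valid=y&Search=Search"
    return f"{base}?page={page_no}&valid=n&Search=Search"


if __name__ == "__main__":
    # Print the URLs of a few phishing result pages (the page range is arbitrary).
    for page_no in range(3):
        print(main_page_url(page_no, phishing=True))
```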
The code scrapes a list of URLs retrieved from the PhishTank database. For each URL in the list, it captures the following resources:
- The HTML code of the landing page.
- JavaScript content (both inline and external).
- CSS content (both inline and external).
- Images found on the landing page.
- The website's favicon.
- A screenshot of the landing page.
This process allows for the extraction and analysis of multiple types of data from each URL, which can be useful for various purposes such as security analysis, content archiving, and data extraction.
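To illustrate how the resources listed above can be located in a landing page's HTML, here is a minimal sketch. It is not this library's implementation: it uses the standard-library `html.parser` module in place of BeautifulSoup so that it runs without extra dependencies, and the names `ResourceCollector` and `collect_resources` are hypothetical. It only finds the resources; downloading them and taking the screenshot are out of scope here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect inline and external JS/CSS, image URLs, and favicon URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.external_js, self.inline_js = [], []
        self.external_css, self.inline_css = [], []
        self.images, self.favicons = [], []
        self._inline_target = None  # list receiving the current <script>/<style> text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            if attrs.get("src"):
                self.external_js.append(urljoin(self.base_url, attrs["src"]))
            else:
                self._inline_target = self.inline_js
        elif tag == "style":
            self._inline_target = self.inline_css
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            url = urljoin(self.base_url, attrs["href"])
            if "stylesheet" in rel:
                self.external_css.append(url)
            elif "icon" in rel:  # also matches rel="shortcut icon"
                self.favicons.append(url)
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_data(self, data):
        # Text is kept only while inside an inline <script> or <style> element.
        if self._inline_target is not None and data.strip():
            self._inline_target.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._inline_target = None


def collect_resources(html, base_url):
    """Parse html and return a populated ResourceCollector."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector
```

Relative links are resolved against the page URL with `urljoin`, so each collected URL can be fetched directly afterwards, for example with `requests.get`.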
Contributors:
- Patel Shahil Manishbhai (Indian Institute of Technology, Dharwad, India)
- Shivam Pradip Tirmare (Indian Institute of Technology, Dharwad, India)
- Aditya Kulkarni (Indian Institute of Technology, Dharwad, India)
- Vivek Balachandran (Singapore Institute of Technology, Singapore)
- Tamal Das (Indian Institute of Technology, Dharwad, India)