WebText Analyzer: Uncover Insights from Web Pages

This project provides a text analysis tool that performs linguistic analysis on a collection of web pages. It includes sentiment analysis, readability metrics, and other derived variables. The tool reads web page URLs from an Input.xlsx file, fetches the content of each URL, and saves the title and description of each page in separate text files. It then performs text analysis on these text files and saves the results in the output.csv file.
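The fetch-and-extract step described above can be sketched with beautifulsoup4. The sample HTML and the specific tags targeted here (the `<title>` element and `<p>` paragraphs) are illustrative assumptions, not necessarily what analysis.py does; in the real script the HTML would come from `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Sample HTML standing in for a fetched page.
html = """
<html>
  <head><title>Example Article</title></head>
  <body>
    <h1>Example Article</h1>
    <p>First paragraph of the article.</p>
    <p>Second paragraph of the article.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.title.get_text(strip=True)
# Join paragraph text as a stand-in for the page description/body.
description = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

print(title)        # Example Article
print(description)
```

In the full pipeline, `title` and `description` would be written to the per-URL text files before analysis.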

Prerequisites

Before running the code, please make sure the following libraries are installed:

  • pandas: For handling data in tabular format.
  • requests: For making HTTP requests to fetch web page content.
  • beautifulsoup4: For parsing HTML content.
  • nltk: The Natural Language Toolkit library for natural language processing.
  • pyphen: For counting syllables in words.

You can install all of these libraries with pip:

pip install pandas requests beautifulsoup4 nltk pyphen
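pyphen counts syllables by hyphenating words against a language dictionary. As a rough standard-library stand-in (a heuristic assumption for illustration, not the project's actual method), syllables can be approximated by counting groups of consecutive vowels:

```python
import re

def approx_syllables(word):
    # Rough heuristic: count groups of consecutive vowels.
    # pyphen's hyphenation-based count is more accurate; this
    # stdlib stand-in only illustrates the idea.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

print(approx_syllables("analysis"))  # 4 (a-nal-y-sis)
```

Complex-word counts for readability metrics are typically defined as words with more than two syllables, which is where this count feeds in.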

Data Files

Make sure the following files are present in the same directory as the code:

analysis.py: The main code file.

Input.xlsx: This file contains the URL IDs and URLs.

Output Data Structure.xlsx: This file specifies the output file format.

MasterDictionary/positive-words.txt: A text file containing a list of positive words, one word per line.

MasterDictionary/negative-words.txt: A text file containing a list of negative words, one word per line.

StopWords: A directory containing text files with stop words. Each filename should start with "StopWords" and end with .txt.
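A minimal sketch of how the MasterDictionary word lists typically drive sentiment scoring. The inlined word sets and the polarity/subjectivity formulas are illustrative assumptions; the real script reads the words from the .txt files (one word per line) and removes stop words first.

```python
# Stand-ins for MasterDictionary/positive-words.txt and negative-words.txt.
positive_words = {"good", "great", "insightful"}
negative_words = {"bad", "poor", "confusing"}

def sentiment_scores(tokens):
    positive = sum(1 for t in tokens if t in positive_words)
    negative = sum(1 for t in tokens if t in negative_words)
    # Common definitions for dictionary-based sentiment; the small
    # constant guards against division by zero.
    polarity = (positive - negative) / (positive + negative + 1e-6)
    subjectivity = (positive + negative) / (len(tokens) + 1e-6)
    return positive, negative, polarity, subjectivity

tokens = "this article is great but the charts are poor".split()
pos, neg, pol, subj = sentiment_scores(tokens)
print(pos, neg, round(pol, 3))  # 1 1 0.0
```

Reading the real files would replace the inlined sets with, e.g., a set of the stripped lines of each .txt file.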

Code Execution

  • Place the code file analysis.py in a directory along with the required data files.

  • Create an Input.xlsx file with two columns:

    • URL_ID: An integer identifier for each URL.

    • URL: The web page URL to analyze.

  • Create an Output Data Structure.xlsx file with the following columns:

    • URL_ID: The same integer identifier for each URL as in the Input.xlsx file.

    • Additional columns for storing the computed text analysis variables.

  • Execute the code using a Python interpreter or IDE.

  • After execution, the computed text analysis results will be saved in the output.csv file.
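Among the derived variables, readability metrics such as the Gunning Fog Index are computed from simple counts. A sketch assuming the classic formula (whether the script uses a percentage or a fraction of complex words is an assumption):

```python
def fog_index(word_count, sentence_count, complex_word_count):
    # Classic Gunning Fog formula:
    # 0.4 * (average sentence length + percentage of complex words)
    avg_sentence_length = word_count / sentence_count
    percent_complex = 100.0 * complex_word_count / word_count
    return 0.4 * (avg_sentence_length + percent_complex)

# 200 words in 10 sentences with 20 complex words:
# 0.4 * (20 + 10) = 12.0
print(round(fog_index(200, 10, 20), 2))  # 12.0
```

The other derived variables (average word length, syllables per word, and so on) follow the same pattern of ratios over token counts.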

Usage

  • Ensure that the code file and data files are set up as described above.
  • Run the Python script from the command line or an IDE.
  • The script will fetch the web page content, perform text analysis, and save the results in the output.csv file.
  • You can customize the code and parameters as per your requirements.
  • Refer to the code comments for detailed explanations of each step.

License

This project is licensed under the MIT License.

Thank you for viewing this repo! Feel free to reach out with any questions or feedback.

✨ --- Designed & made with Love by Shib Kumar Saraf ✨
