Web Scraper for News Articles

Overview

This project is a web scraper that extracts news articles from news websites according to a scope you define. Supported outlets include:

- Handelsblatt
- Wirtschaftswoche
- Der Spiegel

The extracted articles are translated into English with the DeepL API and then saved as Word documents, one in German and one in English. A ChatGPT wrapper selects the important articles. To keep API costs low, articles are first filtered by date so that only recent ones are processed. The articles are also embedded, which yields a similarity score for every pair of articles. When several news sources cover the same topic, only the article from the more prestigious platform is kept.
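
The date filter and the similarity-based deduplication described above can be sketched as follows. This is a minimal illustration and not the project's actual implementation: the `Article` class, the `SOURCE_RANK` ordering, the one-day age limit, and the 0.9 similarity threshold are all assumptions made for the example, and the two-dimensional embeddings stand in for real embedding vectors.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Article:
    title: str
    source: str
    published: datetime
    embedding: list


# Hypothetical prestige ranking (lower = preferred); the real order is not documented.
SOURCE_RANK = {"Handelsblatt": 0, "Der Spiegel": 1, "Wirtschaftswoche": 2}


def cosine_similarity(a, b):
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_recent(articles, now, max_age=timedelta(days=1)):
    """Keep only articles published within max_age of now."""
    return [a for a in articles if now - a.published <= max_age]


def drop_duplicates(articles, threshold=0.9):
    """Visit articles from most to least preferred source and keep an article
    only if it is not too similar to one that was already kept."""
    by_rank = sorted(articles, key=lambda a: SOURCE_RANK.get(a.source, len(SOURCE_RANK)))
    kept = []
    for article in by_rank:
        if all(cosine_similarity(article.embedding, k.embedding) < threshold for k in kept):
            kept.append(article)
    return kept


now = datetime(2024, 8, 20, 12, 0)
articles = [
    Article("Rate cut A", "Wirtschaftswoche", now - timedelta(hours=2), [1.0, 0.0]),
    Article("Rate cut B", "Handelsblatt", now - timedelta(hours=3), [0.99, 0.05]),
    Article("Old story", "Der Spiegel", now - timedelta(days=5), [0.0, 1.0]),
]
selected = drop_duplicates(filter_recent(articles, now))
print([a.title for a in selected])
```

Filtering by date before anything else means that old articles never reach the embedding, selection, or translation steps, which is where the API costs arise.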

Example output

Features

- Article Scraping: Scrapes articles from top German news websites.
- Intelligent Filtering: A ChatGPT wrapper selects only the important articles based on their context. To change the selection scope, edit the prompt in Messages/introduction.txt.
- Multi-language Support: Articles are automatically translated into English using the DeepL API.
- Word Document Output: The selected and translated articles are formatted and saved as a German and an English Word document for easy reading and distribution.

Requirements

- Python 3.x
- Required Python packages (listed in requirements.txt)
- Access to the DeepL API
- A ChatGPT API key for the article selection process

Installation

Clone this repository and install the dependencies:

# Clone the repository into a directory named MediaMonitoring
git clone https://github.com/anton325/speedreader.git MediaMonitoring

# Navigate to the project directory
cd MediaMonitoring

# Install the required Python packages
pip install -r requirements.txt

Set up your API keys:
- DeepL API: A DeepL API key is required for the translation service. Insert yours in translate_articles.py.
- ChatGPT API: A ChatGPT API key is required for the intelligent article selection process. Enter yours in api.py.

Add these keys to your environment variables or configure them in the script settings.
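
If you go the environment-variable route, a small helper along these lines can read both keys and fail early when one is missing. It is only a sketch: the variable names `DEEPL_API_KEY` and `OPENAI_API_KEY` and the function `load_api_keys` are assumptions made for this example, not names the project's scripts are known to use, so adapt them to what translate_articles.py and api.py actually read.

```python
import os


def load_api_keys(env=os.environ):
    """Return the DeepL and ChatGPT keys, raising if either one is not set."""
    # Hypothetical variable names; the project's scripts may expect others.
    names = {"deepl": "DEEPL_API_KEY", "chatgpt": "OPENAI_API_KEY"}
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {service: env[var] for service, var in names.items()}
```

Keeping the keys out of the source files also avoids committing them to the repository by accident.
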
Run the main script:
```
python main.py
```

Once the script has finished, the generated Word documents can be found in Briefings/.

Contributing

Contributions are welcome! Please feel free to fork the repository, make changes, and submit a pull request. Because the web scraping code is modular, you can easily extend this project with a scraper for a newspaper that is not yet included. To do so, inspect the website and supply the unique CSS identifiers for the components that the scraping template requires.
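
To give a rough idea of what such a per-outlet definition might look like, here is a hedged sketch. The `SiteSelectors` class, its field names, the example URL, and every selector value are hypothetical; check the existing modules in webscraping/ for the components the real template expects.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SiteSelectors:
    """CSS selectors for one outlet (all names and values are hypothetical)."""
    start_url: str
    article_link: str
    headline: str
    body: str
    published: str


def missing_selectors(site):
    """Return the names of fields that were left empty."""
    return [name for name, value in asdict(site).items() if not value.strip()]


# Placeholder values found by inspecting the outlet's pages in the browser.
EXAMPLE_OUTLET = SiteSelectors(
    start_url="https://news.example.com/",
    article_link="a.teaser-link",
    headline="h1.article-title",
    body="div.article-body p",
    published="time.article-date",
)

print(missing_selectors(EXAMPLE_OUTLET))
```

Collecting the selectors in one place keeps outlet-specific details out of the shared scraping logic, so adding a newspaper mostly means filling in a new set of values.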

License

This project is licensed under the MIT License - see the LICENSE file for details.
