methodshub.qmd

---
title: paperwizard - Scrape News Sites using 'readability.js'
format:
  html:
    embed-resources: true
  gfm: default
---

## Description

<!-- - Provide a brief and clear description of the method, its purpose, and what it aims to achieve. Add a link to a related paper from social science domain and show how your method can be applied to solve that research question.   -->

uses Mozillas readability.js to scrape text from news websites without dedicated scrapers.

## Keywords

<!-- EDITME -->

* Digital Behavioral Data 
* news scraping
* data gathering

## Science Usecase(s)

Extracting content from news websites is a powerful tool for social science research, enabling large-scale analysis of media narratives, framing, and information dissemination. By systematically gathering articles across outlets, researchers can study media bias, agenda-setting, and the representation of events or social groups. This data can uncover how different regions, political orientations, or journalistic styles shape public discourse and influence opinion. It also facilitates temporal studies of how news coverage evolves during crises, elections, or significant cultural moments. Using techniques like sentiment analysis, topic modeling, and network mapping of sources and citations, researchers can investigate the dynamics of information flow, echo chambers, and the global spread of narratives. Such insights are crucial for understanding the role of media in shaping societal values, attitudes, and behaviors in a rapidly changing digital landscape.

## Repository structure

This repository follows [the standard structure of an R package](https://cran.r-project.org/doc/FAQ/R-exts.html#Package-structure).

## Environment Setup

With R installed:

```r
install.packages("devtools")
devtools::install_github("schochastics/paperwizard")
```

<!-- ## Hardware Requirements (Optional) -->
<!-- - The hardware requirements may be needed in specific cases when a method is known to require more memory/compute power.  -->
<!-- - The method need to be executed on a specific architecture (GPUs, Hadoop cluster etc.) -->


## Input Data 

<!-- - The input data has to be a Digital Behavioral Data (DBD) Dataset -->
<!-- - You can provide link to a public DBD dataset. GESIS DBD datasets (https://www.gesis.org/en/institute/digital-behavioral-data) -->

<!-- This is an example -->

The package does not require imput data besides URLs of news articles.


## How to Use

The package has one major function `pw_deliver()` which takes URLs as input and outputs a data.frame of article meta data and the text of the article.

## Contact Details

Maintainer: David Schoch <david@schochastics.net>

Issue Tracker: [https://github.com/schochastics/paperwizard/issues](https://github.com/schochastics/paperwizard/issues)

<!-- ## Publication -->
<!-- - Include information on publications or articles related to the method, if applicable. -->

<!-- ## Acknowledgements -->
<!-- - Acknowledgements if any -->

<!-- ## Disclaimer -->
<!-- - Add any disclaimers, legal notices, or usage restrictions for the method, if necessary. -->