For this project, the News Category Dataset from Kaggle is used.
This dataset contains around 210k news headlines posted on HuffPost between 2012 and 2022. It is one of the largest news datasets and can serve as a benchmark for a variety of computational linguistic tasks. HuffPost stopped maintaining an extensive archive of news articles sometime after this dataset was first collected in 2018, so it is no longer possible to collect such a dataset today. Due to this change in the website, about 200k of the headlines date from 2012 to May 2018 and only about 10k from May 2018 to 2022.
Download the JSON file with the dataset into this directory. Note that an account at Kaggle is required.
For further data processing, it is more convenient to have the data in CSV format instead of JSON. In addition, some links are invalid and/or duplicated. The following commands therefore convert the JSON file into a CSV file and fix and de-duplicate the links (a sketch of the conversion logic follows the commands):
# Create the output file up front so Docker mounts it as a file, not a directory
touch News_Category_Dataset_v3.csv
# Make the dataset files readable for the user inside the container
chmod -R g+r,o+r .
# Build the conversion image from the Dockerfile in this directory
docker image build -t json-to-csv:1.0.0 .
# Run the conversion; :z relabels the bind mount on SELinux-enabled hosts
docker run --rm -v "$(pwd)/News_Category_Dataset_v3.csv:/usr/app/src/News_Category_Dataset_v3.csv:z" json-to-csv:1.0.0
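The conversion script itself is baked into the json-to-csv image. For orientation, its logic might look roughly like the Python sketch below. The field names follow the dataset's JSON keys; the input/output file names and the link-fixing rule shown here are assumptions for illustration only, since the actual rules live in the image's script.

```python
"""Hypothetical sketch of the conversion performed by the json-to-csv image."""
import csv
import json

INPUT = "News_Category_Dataset_v3.json"   # assumed input name
OUTPUT = "News_Category_Dataset_v3.csv"
FIELDS = ["link", "headline", "category", "short_description", "authors", "date"]


def fix_link(link: str) -> str:
    # Placeholder normalization; the real rules depend on the image's script.
    link = link.strip()
    if link.startswith("http://"):
        link = "https://" + link[len("http://"):]
    return link


def main() -> None:
    seen_links = set()
    with open(INPUT, encoding="utf-8") as src, \
            open(OUTPUT, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for line in src:  # the dataset is one JSON object per line
            record = json.loads(line)
            record["link"] = fix_link(record.get("link", ""))
            if record["link"] in seen_links:  # drop duplicated links
                continue
            seen_links.add(record["link"])
            writer.writerow({field: record.get(field, "") for field in FIELDS})


if __name__ == "__main__":
    main()
```

Running the conversion inside a container rather than on the host keeps the processing reproducible: the only host-side state is the mounted output CSV.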
For this project, the Reddit Usernames dataset from Kaggle is used.
This dataset contains the username of every Reddit account that has left at least one comment, together with each account's comment count. The data was extracted in December 2017 from the Reddit comments dataset hosted on Google BigQuery and should be current up to November 2017.
Download the CSV file with the dataset into this directory. Note that an account at Kaggle is required.
# Make the dataset files readable for other users and processes
chmod -R g+r,o+r .
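As a quick sanity check after downloading, the CSV can be summarized with a few lines of Python. The file name and column names below are assumptions (a username column and a comment-count column); adjust them to the actual header of the downloaded file.

```python
"""Sanity check for the Reddit usernames CSV (file and column names assumed)."""
import csv

with open("users.csv", encoding="utf-8") as src:
    reader = csv.DictReader(src)
    username_col, count_col = reader.fieldnames  # assumed two-column layout
    total_users = 0
    total_comments = 0
    for row in reader:
        total_users += 1
        total_comments += int(row[count_col])

print(f"{total_users} users, {total_comments} comments in total")
```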