This repository contains the R Markdown files and datasets for the course project of Social Data Science at ETH Zurich.
Group members: Tianyuan Wang, Haonan Yang, Yuehan Yang, Yuanwen Yue
Supervisor: Prof. David Garcia
The datasets were collected using rtweet and Twitter's official REST API. They comprise the following ten parts. Each dataset is provided in both csv and Rds formats (due to GitHub's storage limits, repliesAll is provided in Rds format only). You can also load all the data at once from all.RData.
- Environmental celebrities' profiles [users.csv] / [users.Rds]
- Environmental celebrities' field labels [celebrity.csv] / [celebrity.Rds]
- Environmental celebrities subset with field labels [fieldsdf.csv] / [fieldsdf.Rds]
- Environmental keywords [keyword.csv] / [keyword.Rds]
- Environment-related tweets [environmental_tweets.csv] / [environmental_tweets.Rds]
- Environment-related tweets with replies [tweets_with_reply.csv] / [tweets_with_reply.Rds]
- Replies in all conversations [repliesAll.Rds]
- VADER sentiment scores for all celebrities [vader_result.csv] / [vader_result.Rds]
- Sample replies for validation [sampled_replies.csv] / [sampled_replies.Rds]
- Manual scores for sample replies [manual_scores.csv] / [manual_scores.Rds]
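
A minimal sketch of loading one dataset in either format with base R. The toy data frame and temporary paths exist only so the snippet runs on its own; in practice you would point `read.csv` and `readRDS` at the files listed above, assuming they sit in your working directory.

```r
# Round-trip a toy data frame through both formats so the snippet is
# self-contained; replace the temporary paths with e.g. "users.csv" / "users.Rds".
toy <- data.frame(user_id = c("1", "2"),
                  screen_name = c("alice", "bob"),
                  stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".Rds")
write.csv(toy, csv_path, row.names = FALSE)
saveRDS(toy, rds_path)

# Read ids as character: Twitter ids can exceed the range that R stores
# exactly as numeric.
from_csv <- read.csv(csv_path, colClasses = c(user_id = "character"),
                     stringsAsFactors = FALSE)
from_rds <- readRDS(rds_path)

# all.RData restores every saved object at once (names are whatever was saved):
# load("all.RData")
```

The Rds files preserve column types exactly, so they are the safer choice when ids matter.
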
- Source: [users.csv] / [users.Rds]
- Description: the Twitter profiles of 240 selected environmental celebrities. We referred to three public lists of top environmental influencers selected by Climate Week NYC, Onalytica and Corporate Knights. The Twitter list can be found here.
- Column: user_id, name, screen_name, location, description, url, followers_count, friends_count, listed_count, created_at, favourites_count, etc.
- Source: [celebrity.csv] / [celebrity.Rds]
- Description: Based on their Twitter profiles and Wikipedia, celebrities were labeled as scientists, environmentalists, businessmen, politicians, athletes, writers/journalists, actors/singers/hosts, organisations, social activists or NGO officers.
- Column: user_id, name, screen_name, location, field.
- Source: [fields.csv] / [fields.Rds]
- Description: We conducted network analysis on the 240 celebrities and kept the 173 who have at least one connection with another celebrity.
- Column:
  - id: user id of this celebrity
  - name: screen name of this celebrity
  - deg: node degree of this celebrity
  - community: id of the community this celebrity belongs to
  - kcore: k-core index of this celebrity
  - field: field of this celebrity
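
The selection step described above (keep only celebrities with at least one connection) can be sketched in base R. The edge list is made up, and the column names `from` and `to` are assumptions; the source does not state how connections were defined or which tools computed `community` and `kcore`.

```r
# Hypothetical undirected edge list between celebrities (toy data).
edges <- data.frame(from = c("a", "a", "b"),
                    to   = c("b", "c", "c"),
                    stringsAsFactors = FALSE)
all_ids <- c("a", "b", "c", "d")   # "d" has no connection

# Degree = number of edges touching each node; the factor levels make
# sure unconnected nodes show up with a count of zero.
deg <- table(factor(c(edges$from, edges$to), levels = all_ids))

# Keep only celebrities with at least one connection.
connected <- names(deg)[deg >= 1]
```
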
- Source: [keyword.csv] / [keyword.Rds]
- Description: We browsed all the 240 celebrities' recent tweets and selected 35 keywords.
- Column:
  - keyword: keywords related to environmental protection
  - importance: relevance to environmental protection
  - notes: some notes
  - synonyms: synonyms of keywords
- Source: [environmental_tweets.csv] / [environmental_tweets.Rds]
- Description: We manually checked all the 240 celebrities' recent tweets regarding environmental issues and selected 35 keywords. Then, we used R's grep function to filter the last 1,000 tweets of each celebrity and obtained 45,298 related tweets.
- Column:
  - screen_name: screen name of the user who posted this Tweet
  - tweet_url: url of this Tweet
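
A minimal sketch of keyword filtering of this kind, using `grepl` from base R. The keywords and tweets are toy data, and both the `text` column and the case-insensitive matching are assumptions; the exact pattern used for the project is not given here.

```r
# Toy inputs; the real keyword list is in keyword.csv.
keywords <- c("climate", "carbon")
tweets <- data.frame(
  screen_name = c("alice", "bob", "carol"),
  text = c("Climate action now", "Lunch was great", "Cut carbon emissions"),
  stringsAsFactors = FALSE
)

# Join the keywords into one alternation pattern.
# Case-insensitive matching is an assumption of this sketch.
pattern <- paste(keywords, collapse = "|")
environmental <- tweets[grepl(pattern, tweets$text, ignore.case = TRUE), ]
```
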
- Source: [tweets_with_reply.csv] / [tweets_with_reply.Rds]
- Description: Before crawling replies, we filtered the tweets on two conditions:
  - The reply_count must not be 0, i.e. the tweet has at least one reply.
  - The tweet id must be the same as the conversation id, i.e. the tweet is the original Tweet that started the conversation.
- Column:
  - author_id: the id of the author who posted this tweet
  - id: tweet id
  - conversation_id: the id of the conversation to which this tweet belongs
  - public_metrics: includes retweet_count, reply_count, like_count, quote_count
  - created_at: the time this tweet was created
  - text: the text content of this tweet
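
The two filter conditions can be expressed as a single logical index. This sketch uses toy data and assumes `reply_count` has already been pulled out of `public_metrics` into its own column.

```r
# Toy tweets: "11" is a reply inside conversation "10"; "12" has no replies.
tweets <- data.frame(
  id              = c("10", "11", "12"),
  conversation_id = c("10", "10", "12"),
  reply_count     = c(3L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Keep conversation starters that received at least one reply.
keep <- tweets$reply_count > 0 & tweets$id == tweets$conversation_id
tweets_with_reply <- tweets[keep, ]
```
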
- Source: [repliesAll.Rds]
- Description: The replies in all conversations. There are 534,135 records in total.
- Column:
  - conversation_id: conversation id of this reply
  - user_id: id of the user who posted the original Tweet that started the conversation
  - screen_name: screen name of the user who posted the original Tweet that started the conversation
  - author_id: id of the user who posted the reply
  - created_at: the time this reply was created
  - text: the content of this reply
- Source: [vader_result.csv] / [vader_result.Rds]
- Description: VADER sentiment scores for all celebrities.
- Column:
  - screen_names: screen name of the celebrity
  - vader_mean: mean of the VADER scores of all replies to the celebrity's selected tweets
  - vader_sd: standard deviation of the VADER scores of all replies to the celebrity's selected tweets
  - comments_num: number of replies to the celebrity's selected tweets
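
A sketch of how per-celebrity summaries of this shape could be derived from reply-level scores. The input column `compound` is an assumption (the source does not say which VADER output was averaged), and the scores are toy values.

```r
# Toy reply-level scores; "compound" is a hypothetical column name.
replies <- data.frame(
  screen_name = c("alice", "alice", "bob", "bob", "bob"),
  compound    = c(0.5, -0.1, 0.2, 0.4, 0.6),
  stringsAsFactors = FALSE
)

# Group the scores by celebrity, then summarise each group.
by_user <- split(replies$compound, replies$screen_name)
vader_result <- data.frame(
  screen_names = names(by_user),
  vader_mean   = vapply(by_user, mean, numeric(1), USE.NAMES = FALSE),
  vader_sd     = vapply(by_user, sd, numeric(1), USE.NAMES = FALSE),
  comments_num = vapply(by_user, length, integer(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```
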
- Source: [sampled_replies.csv] / [sampled_replies.Rds]
- Description: a random sample of 500 replies used for validation
- Column:
  - x: the content of this reply
- Source: [manual_scores.csv] / [manual_scores.Rds]
- Description: the sentiment scores of the 500 sampled replies, labelled by the group members.
- Column:
  - score1: score given by group member 1
  - score2: score given by group member 2
  - score3: score given by group member 3
  - score4: score given by group member 4
  - mean: average of the 4 group members' scores, used as the true value
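
The `mean` column is a row-wise average of the four score columns, which can be sketched as follows. The scores and their scale are toy values, and comparing the result with VADER via a correlation is only one plausible way to validate; the source does not state the method actually used.

```r
# Toy manual scores for three replies (the real scale is not stated here).
manual_scores <- data.frame(
  score1 = c(1, -1, 0),
  score2 = c(1,  0, 0),
  score3 = c(0, -1, 1),
  score4 = c(1, -1, 0)
)

# Row-wise average over the four annotators.
score_cols <- paste0("score", 1:4)
manual_scores$mean <- rowMeans(manual_scores[, score_cols])

# Hypothetical VADER scores for the same three replies.
vader <- c(0.6, -0.5, 0.1)
agreement <- cor(manual_scores$mean, vader)
```
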
Use of the data must comply with Twitter's developer agreement and policy. If you have any questions about the data, please contact us at yuayue@ethz.ch.