This research artifact accompanies our ICSE 2019 paper "Going Farther Together: The Impact of Social Capital on Sustained Participation in Open Source". If you use the artifact, please consider citing:
@inproceedings{QiuNBSV19,
author = {Qiu, Huilian Sophie and
Nolte, Alexander and
Brown, Anita and
Serebrenik, Alexander and
Vasilescu, Bogdan},
title = {Going Farther Together: The Impact of Social Capital on Sustained Participation in Open Source},
booktitle = {Proceedings of the 41st International Conference on Software Engineering (ICSE) 2019, Montreal, Canada},
note = {to appear},
organization = {IEEE},
year = {2019},
}
The artifact consists of three main parts:
- Data collection scripts, written in Python.
  The code can be used to select open source contributors, collect their GitHub projects, gather data such as contributors’ years of experience on GitHub, and projects’ age and size. The code also calculates social capital measures, including team familiarity, recurring cohesion, and heterogeneity of programming language expertise.
  The final output is a csv file, each row of which is a data point used in our survival analysis. Each row consists of information per person per project per quarter (three-month time window), including all the social capital measures.
  The code was implemented in Python 2 and tested on a Linux machine. Required dependencies: pymysql, sqlalchemy, numpy, scipy, sklearn, and pandas. You also need to have (access to) a MySQL dump of GHTorrent.
- The survey instrument used in the paper.
- Survey analysis scripts, written in R.
We give more details on data collection scripts next.
In addition to the standard GHTorrent tables, we created a table ght_namsor_s, containing inferred gender data kindly provided by NamSor (thanks, Elian Carsenat!).
- Use MySQL_queries/filter_valid_users to find valid users.
- Run sample_user.py to construct a balanced sample of male and female contributors. The result is saved in data/uid.list. In order to obtain a sample with an equal number of men and women, sample_user.py calls our gender classifier to determine users' genders. The code for the gender classifier is stored in the gender/ folder. Details about these files are in the following section.
- Run setup.py, which reads the files dict/alias_map_b.dict, dict/reverse_alias_map_b.dict, and data/uid.list, and generates the files data/pid.list, data/all_contributors.list, data/watchers_monthly_counts_win.csv, dict/contr_projs.dict, data/all_projs.list, and dict/proj_contrs_count.dict.
- Run get_user_info.py, get_proj_info.py, and get_user_proj_info.py. They write to data/results_users.csv, data/results_proj.csv, and data/results_user_proj.csv, respectively.
- Run merge_result.py to combine these tables. The result will be saved in data/proj_user_proj.csv, which will be used for data analysis.
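The order of the steps above, and the files each step is expected to produce, can be summarized in a small helper. This is only a sketch for checking progress through the pipeline: the script and file names are taken from the list above, while `PIPELINE`, `missing_outputs`, and `first_incomplete_step` are hypothetical names that are not part of the artifact. It assumes the scripts are run from the repository root and does not run anything itself.

```python
import os

# Ordered steps as described above: (script, files it is expected to write).
# MySQL_queries/filter_valid_users runs against the database first and is not listed.
PIPELINE = [
    ("sample_user.py", ["data/uid.list"]),
    ("setup.py", [
        "data/pid.list",
        "data/all_contributors.list",
        "data/watchers_monthly_counts_win.csv",
        "dict/contr_projs.dict",
        "data/all_projs.list",
        "dict/proj_contrs_count.dict",
    ]),
    ("get_user_info.py", ["data/results_users.csv"]),
    ("get_proj_info.py", ["data/results_proj.csv"]),
    ("get_user_proj_info.py", ["data/results_user_proj.csv"]),
    ("merge_result.py", ["data/proj_user_proj.csv"]),
]


def missing_outputs(root="."):
    """Return (script, path) pairs whose expected output is absent under root."""
    missing = []
    for script, outputs in PIPELINE:
        for rel_path in outputs:
            if not os.path.exists(os.path.join(root, rel_path)):
                missing.append((script, rel_path))
    return missing


def first_incomplete_step(root="."):
    """Return the first script with a missing output, or None if all files exist."""
    missing = missing_outputs(root)
    return missing[0][0] if missing else None
```

Calling `first_incomplete_step(".")` from the repository root then names the next script that still needs to be run, or returns `None` once data/proj_user_proj.csv and all intermediate files are present.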
Our gender classifier uses n-grams of names, as well as the results of two existing gender classifiers, NamSor and genderComputer, as features.
In the gender/ folder are two Python files that demonstrate how our gender classifier works.
First, get_feature.py reads users' information from the MySQL ght_namsor_s table, which contains users' combined data from GHTorrent and origin and gender information obtained from NamSor.
Then it gets classification results from genderComputer.
To get a better result from genderComputer, we need to know the user's country.
For this, we use the data provided by NamSor on one's origin, computed based on their names.
There are other gender classifiers one can use, e.g., genderize.io. To use them, simply make the result a new feature in the model.
In determine_gender.py, our classifier divides the name into n-grams and uses them as additional features.
The result will be written to data/gender.csv, which will later be used in sample_user.py for balanced sampling as described above.
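To illustrate the idea of using name n-grams alongside the outputs of other classifiers, here is a minimal sketch. It is not the code in determine_gender.py: the function names, the choice of character n-grams of sizes 2 and 3, the lowercasing, and the feature key format are all assumptions made for illustration.

```python
def char_ngrams(name, n=2):
    """Split a name into overlapping character n-grams (lowercased)."""
    name = name.strip().lower()
    # A name shorter than n yields an empty list.
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def ngram_features(name, sizes=(2, 3)):
    """Count each n-gram of the given sizes; keys look like 'ng2_an'."""
    features = {}
    for n in sizes:
        for gram in char_ngrams(name, n):
            key = "ng%d_%s" % (n, gram)
            features[key] = features.get(key, 0) + 1
    return features


def build_features(name, namsor_label, gendercomputer_label):
    """Combine n-gram counts with the labels of the two external classifiers.

    The label arguments are hypothetical placeholders for whatever values
    NamSor and genderComputer return for this user.
    """
    features = ngram_features(name)
    features["namsor"] = namsor_label
    features["gendercomputer"] = gendercomputer_label
    return features
```

Under these assumptions, the result of another classifier such as genderize.io would be added as one more key in the dictionary returned by `build_features`.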
The survey analysis script is survey.R. It contains code to calculate reliability measures and correlations, produce plots, and conduct logistic regression analysis on the collected survey data.
The models as reported in the paper are created by the survival_analysis.R R script.
We have also included an anonymized version of the data we used for this paper here.
Each row in the csv file is one data point in our model. It represents one user's activity in one project during one three-month window.
The csv file consists of 34 columns. Those with the prefix "u_" contain information about users, and those that start with "p_" are about projects:
- u_age is the number of three-month windows since the user's first activity.
- u_commits_to_date is the number of commits made by this user across all projects up to that three-month window.
- u_email is the md5 hash of the user's email address.
- u_follower is the number of followers the user had up to that three-month window.
- u_gender is the user's gender.
- u_id is the id of the user in the GHTorrent dataset users table.
- u_login is the md5 hash of the user's login.
- u_nichewidth is the number of programming languages that the user had used up to that three-month window.
- u_projects_to_date is the number of projects to which the user had submitted commits up to that three-month window.
- u_temp_failure is a binary indicator of whether the user had been inactive for half a year (2 three-month windows).
- u_temp_failure_1_year is a binary indicator of whether the user had been inactive for a year (4 three-month windows).
- u_window_active_to_date is the number of three-month windows during which the user had submitted commits.
- window_num represents the current three-month window. 2008 Jan to 2008 Mar will be window_num = 1.
- owner_company is a binary indicator of whether the owner of the repository displays their company in their profile.
- owner_gender is the repository owner's gender, with -1 representing male, 1 female, and 0 unknown.
- p_id is the id of the project in the GHTorrent dataset projects table.
- u_is_major is a binary indicator of whether the user is a major contributor (more than 5% of commits) to that project.
- u_is_owner is a binary indicator of whether the user is the owner of that project.
- u_pr_merge is a binary indicator of whether the user can merge pull requests in that project.
- p_age is the number of three-month windows since the creation of that project.
- p_div_langdenom is the value of the programming language diversity of that project p_id in that window window_num.
- p_fam_no_decay is the value of the team familiarity of that project in that window.
- p_lang is the major programming language of that project.
- p_num_commits is the number of commits of that project in that window.
- p_num_commits_to_date is the total number of commits of that project since its creation.
- p_num_stars is the project's number of stars in that window.
- p_num_users_to_date is the number of users who had sent commits to the project up to that window.
- p_owner is the project owner's id in the GHTorrent dataset users table.
- p_recurring_co is the value of the recurring cohesion of that project in that window.
- p_sharenewcomers is the percentage of new GitHub users out of the users who had sent commits to that project.
- p_sharenewcomers is the percentage of new users to that project out of the users who had sent commits to that project.
- p_team_size
- p_windows_active_to_date is the number of three-month windows in which the project received commits.
- window is the three-month window in date format.
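The window numbering and the column prefixes described above can be sketched in a few lines. The only fact taken from the description is that 2008 Jan to 2008 Mar is window_num = 1; that windows are calendar quarters numbered consecutively from there is an assumption, and `window_num_for` and `split_columns` are hypothetical helper names rather than functions from the artifact.

```python
import datetime


def window_num_for(date):
    """Map a date to a three-month window number, with 2008 Jan-Mar as 1.

    Assumes windows are consecutive calendar quarters starting in January 2008.
    """
    return (date.year - 2008) * 4 + (date.month - 1) // 3 + 1


def split_columns(columns):
    """Group column names into user ('u_'), project ('p_'), and other columns."""
    user = [c for c in columns if c.startswith("u_")]
    project = [c for c in columns if c.startswith("p_")]
    other = [c for c in columns if not c.startswith(("u_", "p_"))]
    return user, project, other
```

For example, `split_columns` applied to the header row of the csv file would place window_num, window, owner_company, and owner_gender in the third group, since they carry neither prefix.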