A curated repository of data sets and tools that can be used for data-driven empirical software engineering, a method also known as mining software repositories (MSR). For examples of such work, see the Mining Software Repositories conference.
This list is under construction and requires your input. Please contribute additions through a GitHub pull request. (Or send me an email if you find that too cumbersome.) The additions you specify should be genuinely useful to MSR researchers; the objective of this list is utility rather than comprehensiveness.
For more awesome lists, see awesome.
- AndroZoo - a growing collection of Android Applications
- Boa - a domain-specific language and infrastructure that eases mining software repositories
- Code Reviews - Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse
- Enron Spreadsheets and Emails - all the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis'
- Findbugs-maven - a set of FindBugs reports for the Java projects of the Maven repository
- GHTorrent - an effort to create a scalable, queryable, offline mirror of data offered through the GitHub REST API
- Maven metrics - a collection of software complexity & sizing metrics for the Maven Repository
- mzdata - Multi-extract and Multi-level Dataset of Mozilla Issue Tracking History
- Stack Exchange - an anonymized dump of all user-contributed content on the Stack Exchange network
- tera-PROMISE - a research dataset repository specializing in software engineering research datasets
- TravisTorrent - free and easy-to-use Travis CI build analyses
- Unix history - a Git repository with 46 years of Unix history evolution
- Your input goes here
To the extent possible under law, Diomidis Spinellis has waived all copyright and related or neighboring rights to this work.