Put all data under `data`, all code under `src`. Track everything under `data` in git LFS.
Under `data`, the `data/input` folder is to contain all data that comes to the project from outside. If it comes from a git repository, use git submodules to import it. If the data is over about 500MB in size, do not include it, but instead add a `README.md` that points to it, e.g. in IDA.
All data produced by code should go into either `data/work` or `data/output`. If the purpose of the repo is to produce a unified clean dataset, put the files that make up that dataset into `data/output`, and make that directory its own git repo and a submodule. This way, downstream analysis repos can import only the clean data and not the code, input or other cruft.
Files that are clearly just intermediate/temporary steps in the process should be put under `data/work`. Mostly, do not commit this directory or the files in it into git. The exception is if reproducing them takes a long time, or if you think they're otherwise useful (but then maybe they don't belong under `work` in the first place).
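Put together, the resulting layout might look roughly like this (a sketch only; beyond the directory names given above, the individual file names are illustrative):

```
project/
├── README.md
├── Makefile          # or an equivalent pipeline runner, see below
├── src/              # all code
└── data/
    ├── input/        # data coming in from outside (submodules, or a README pointing to e.g. IDA)
    ├── work/         # intermediate files, mostly not committed
    └── output/       # the cleaned dataset, optionally its own git repo used as a submodule
```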
Everything under `data/work` and `data/output` should in principle be automatically regenerable from what is in `data/input`. If there needs to be some manually created data in `data/output`, two choices are available:
- Put this data in `data/input` and have the `Makefile` or equivalent copy it under `data/output`.
- Document clearly which parts of `data/output` are not regenerable.
If hand-curation of data is necessary, the best practice is to employ patch files added to `data/input` that document which entries to amend, what to remove, what to add and so on. This way, these patches can be re-applied to new data coming in, while also maintaining a clear record of what has been manually changed, further improving reproducibility and rerunnability.
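As a minimal sketch of this pattern in Python (the file names `raw_entries.tsv` and `corrections.tsv` and the columns `entry_id`, `action`, `field` and `new_value` are hypothetical, not something these guidelines prescribe):

```python
# Re-apply hand-curated corrections from data/input to freshly imported data.
# A sketch only: the file layout and column names are illustrative.
import pandas as pd

raw = pd.read_csv("data/input/raw_entries.tsv", sep="\t", dtype=str, keep_default_na=False)
patches = pd.read_csv("data/input/corrections.tsv", sep="\t", dtype=str, keep_default_na=False)

for _, patch in patches.iterrows():
    matches = raw["entry_id"] == patch["entry_id"]
    if patch["action"] == "remove":
        raw = raw[~matches]
    elif patch["action"] == "amend":
        raw.loc[matches, patch["field"]] = patch["new_value"]
    elif patch["action"] == "add":
        new_row = {"entry_id": patch["entry_id"], patch["field"]: patch["new_value"]}
        raw = pd.concat([raw, pd.DataFrame([new_row])], ignore_index=True)

raw.to_csv("data/work/patched_entries.tsv", sep="\t", index=False)
```

The point is that the corrections themselves live in `data/input` as data, so the same step can be re-run whenever new raw data arrives.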
If the repository has multiple people experimenting with different things in it that are liable to make the repo a mess (e.g. exploratory analyses), partition both code and data into subdirectories by username, e.g. `src/analysis/jiemakel`, `data/work/jiemakel` and so on, so that each person is free to make a mess only in their subpart of the project, where they'll hopefully remember what's what.
For analysis repositories, create a common basis that loads/prepares access to a unified common source data set, so that everyone is operating with a shared common version of the data. Name this e.g. `src/common_basis.R` and call it from e.g. the analysis notebooks to load the data.
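In a Python project, the equivalent might look roughly like the following (a hypothetical `src/common_basis.py`; the file and column names are illustrative):

```python
# Hypothetical src/common_basis.py: the one place where analyses load the shared data,
# so that every notebook starts from the same version of the source dataset.
import pandas as pd

def load_documents() -> pd.DataFrame:
    """Load the unified document table that all analyses should start from."""
    return pd.read_csv(
        "data/input/documents.tsv", sep="\t", dtype=str, keep_default_na=False
    )
```

Each analysis notebook then imports and calls this instead of reading (and possibly diverging on) its own copy of the data.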
In data file columns as well as code variable and function names, primarily use `_` as a separator, e.g. `document_id`, not `documentId`, and `parse_document_id`, not `parseDocumentId`. For filenames, either `parse-document-ids.py` or `parse_document_ids.py` is permissible, but please be consistent within a single project.
- When using Python, try to follow PEP 8. autopep8 and a matching editor plugin/pre-commit hook or equivalent support will help with this.
- When using R, prefer tidyverse functions to base R. Also try to follow the tidyverse style guide. styler and a matching editor plugin/pre-commit hook or equivalent support will help.
- It is desirable to have a `Makefile` or equivalent in the project, through which we can track what code produces what data, and through which we can also rerun the pipelines.
- Generally, try to make sure there is an easy way to install all the dependencies of a project. To ensure this, prefer isolated project environments.
- Using notebooks is ok, but:
    - Try to ensure dependency management is still sensible.
    - For Python, prefer `#%%`-style cells in plain `.py` files instead of `.ipynb` (see e.g. Scientific Tools for PyCharm, the Python Interactive window in VSCode, or Jupytext for Jupyter Notebook).
    - For R, prefer `.Rmd` to `.ipynb`.
    - If storing cell outputs is desired (e.g. for analysis repositories), for R store an `.nb.html` alongside the `.Rmd` (RStudio does this automatically when the notebook type is set to `html_notebook`), and for Python, store an `.ipynb` alongside (you can use Jupytext to keep these in sync; automated syncing is enabled in Jupyter by 1) installing the Jupytext extension and 2) adding the appropriate metadata to the files).
- Store data as TSV files with the `.tsv` file extension. Use `"` as the quote character, and quote character doubling as the escape method. Quote output only when necessary. Encode missing values as the empty string (not `NA` from R). A sketch of writing such a file from Python follows after this list.
- If the file size of your uncompressed TSV file is over about 150MB, gzip the file into a `.tsv.gz`.
- For naming columns, use English, unless the number of columns is large and comes from an external source, making translating them too costly.
- If you have multiple tables, make sure column names are unique across all of them. So e.g. if both books as well as people have names, use `book_name` in the book table and `person_name` in the person table instead of using just `name` in both.
- Organize data using tidy data principles.
- Unless it is blatantly obvious from the file and column names (it never is), include a README file giving an explanation of what each data field contains and how the different files fit together. Also link each file to what creates it, be that a pipeline or an external source (see also above on reproducibility/Makefiles).
- For parsed date and datetime fields, use ISO 8601 (`YYYY-MM-DD` and `YYYY-MM-DDThh:mm:ss[Z]`). For datetimes, if possible anchor to Coordinated Universal Time by first giving the local datetime and then the UTC offset, so e.g. midnight Helsinki time should be encoded as `00:00:00+02:00` or `00:00:00+03:00` depending on Daylight Saving Time, instead of `22:00:00Z` or `21:00:00Z`. If resolving times to UTC is too hard, just stick to local time and don't give a UTC offset.
- If your data contains one-to-many or many-to-many relations, generally follow relational modeling principles, with one-to-one core attribute tables for the entries, and separate link tables encoding the one-to-many and many-to-many relations. In particular, do not create cross-product tables that duplicate attributes. See further below.
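As a sketch of writing a table that follows the TSV conventions above from Python with pandas (the file and column names here are illustrative):

```python
# Write a TSV following the conventions above: tab-separated, '"' as the quote
# character with doubling as the escape, quoting only when necessary, and
# missing values encoded as the empty string.
import csv
import pandas as pd

documents = pd.DataFrame({
    "document_id": ["doc_1", "doc_2"],
    "document_title": ['He said "hi"', None],
    "publication_date": ["1854-02-01", "1855-11-30"],
})

documents.to_csv(
    "data/output/documents.tsv.gz",  # pandas infers gzip compression from the .gz suffix
    sep="\t",
    quoting=csv.QUOTE_MINIMAL,  # quote only when necessary
    doublequote=True,           # escape " by doubling it
    na_rep="",                  # missing values as the empty string
    index=False,
)
```

If the uncompressed file stays under the ~150MB limit, write a plain `.tsv` instead of a `.tsv.gz`.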
If an attribute may have multiple values, in general prefer separate long-format tables:

| person_id | book_id | role      |
|-----------|---------|-----------|
| person_1  | book_1  | author    |
| person_1  | book_1  | publisher |

In some situations, it is permissible to instead put multiple values in the same field. In these cases, use `|` as the separator.

| person_id | person_first_name | person_last_name | roles             |
|-----------|-------------------|------------------|-------------------|
| person_1  | John              | Doe              | author\|publisher |

The situations in which this is permissible are when there are many core attributes associated with the entity, of which only one or two may contain multiple values, and it would feel stupid to extract only those into a separate table.
Never duplicate values unnecessarily by e.g. creating cross-product tables like the following:

| person_id | person_name | book_id | role      |
|-----------|-------------|---------|-----------|
| person_1  | John Doe    | book_1  | author    |
| person_1  | John Doe    | book_1  | publisher |

Instead, have two tables:

| person_id | person_name |
|-----------|-------------|
| person_1  | John Doe    |

and

| person_id | book_id | role      |
|-----------|---------|-----------|
| person_1  | book_1  | author    |
| person_1  | book_1  | publisher |

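When a particular analysis does need the combined view, it can always be derived on the fly by joining the link table to the core table, e.g. in pandas (a sketch, assuming the two tables above are stored as `persons.tsv` and `person_book_roles.tsv`, both hypothetical names):

```python
# Derive the combined person/book/role view only when an analysis needs it,
# instead of storing the duplicated cross-product table on disk.
import pandas as pd

persons = pd.read_csv("data/output/persons.tsv", sep="\t", dtype=str, keep_default_na=False)
roles = pd.read_csv("data/output/person_book_roles.tsv", sep="\t", dtype=str, keep_default_na=False)

person_roles = roles.merge(persons, on="person_id", how="left")
```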
- Repeat from above: everything that is output from the project should in principle be automatically regenerable from what comes into the project. This enforces repeatability of the research process and makes its validation possible, while also making it possible to re-run the pipeline when the inputs inevitably change.
- If an analysis repo ends up as an article, at the point of submission, clearly separate final code (e.g. code producing the final figures used in the article) from other code. You can do this either by moving these to separate files, or by clearly flagging the relevant parts of a larger notebook file. The important bit is that it must be clear to an outsider which code path leads to reproducing the results in the article.
- At this point, also make sure that it is possible to run the code from a fresh checkout (e.g. make sure dependency management works).
- Make sure that both the code repo and all repositories it depends on for either data or code are versioned, and that those versions are archived (e.g. by linking the repositories to Zenodo and creating versioned releases from them).